public class DownsampleLabelWords extends java.lang.Object

This class implements the method from "Authorless Topic Models" by Thompson and Mimno, COLING 2018. The goal is to reduce the frequency of words that are unusually associated with a particular label. This is useful as a pre-processing step for topic modeling because it reduces the correlation of topics with known class labels. The problem comes up most often in fiction, where topics tend to simply reproduce lists of characters.

The input is a labeled feature sequence of the sort used for topic modeling. Unlike the regular topic modeling system, labels are required, since we need something to correlate words with. The output is another feature sequence with word tokens removed. Note that some words may disappear from the corpus entirely, but they will still be present in the alphabet.

The code takes one parameter, equivalent to a p-value where the null hypothesis is that a word occurs no more frequently in one category than in the collection as a whole.
- David Mimno
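The description above does not spell out the statistical test, so the following is only a minimal sketch of the general idea, not the code of this class and not necessarily the exact procedure from the paper. It assumes an exact one-sided binomial test of each word's count within a label against its corpus-wide rate, and it thins over-represented words at random until their expected count is no longer significant. All names here (`LabelDownsampleSketch`, `upperTail`, `maxAllowedCount`, `downsample`) are hypothetical, and plain lists of strings stand in for MALLET's labeled feature sequences.

```java
import java.util.*;

/** Illustrative sketch only; NOT the DownsampleLabelWords implementation. */
final class LabelDownsampleSketch {

    /** Exact upper tail P(X >= k) for X ~ Binomial(n, p). Suitable for small n only:
     *  (1 - p)^n underflows on large corpora, where log-space arithmetic would be needed. */
    static double upperTail(int n, double p, int k) {
        if (k <= 0) return 1.0;
        if (k > n || p <= 0.0) return 0.0;
        if (p >= 1.0) return 1.0;
        double pmf = Math.pow(1.0 - p, n); // P(X = 0)
        double tail = 0.0;
        for (int i = 0; i <= n; i++) {
            if (i >= k) tail += pmf;
            pmf *= (double) (n - i) / (i + 1) * p / (1.0 - p); // advance to P(X = i + 1)
        }
        return Math.min(1.0, tail);
    }

    /** Largest count c <= observed whose upper-tail probability is still >= alpha. */
    static int maxAllowedCount(int n, double p, int observed, double alpha) {
        int c = observed;
        while (c > 0 && upperTail(n, p, c) < alpha) c--;
        return c;
    }

    /** Returns copies of the documents with over-represented word tokens randomly dropped.
     *  docs.get(d) is the token list of document d; labels.get(d) is its label. */
    static List<List<String>> downsample(List<List<String>> docs, List<String> labels,
                                         double alpha, Random rng) {
        Map<String, Integer> wordTotal = new HashMap<>();
        Map<String, Integer> labelTotal = new HashMap<>();
        Map<String, Map<String, Integer>> wordByLabel = new HashMap<>();
        int total = 0;
        for (int d = 0; d < docs.size(); d++) {
            String label = labels.get(d);
            for (String w : docs.get(d)) {
                wordTotal.merge(w, 1, Integer::sum);
                labelTotal.merge(label, 1, Integer::sum);
                wordByLabel.computeIfAbsent(label, x -> new HashMap<>()).merge(w, 1, Integer::sum);
                total++;
            }
        }
        List<List<String>> out = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            String label = labels.get(d);
            List<String> kept = new ArrayList<>();
            for (String w : docs.get(d)) {
                int observed = wordByLabel.get(label).get(w);
                double rate = (double) wordTotal.get(w) / total; // null-hypothesis rate
                // Recomputed per token for brevity; a real implementation would cache this.
                int allowed = maxAllowedCount(labelTotal.get(label), rate, observed, alpha);
                double keepProb = (double) allowed / observed;   // 1.0 when not significant
                if (rng.nextDouble() < keepProb) kept.add(w);
            }
            out.add(kept);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Collections.nCopies(10, "alice"));
        a.addAll(Collections.nCopies(10, "the"));
        List<String> b = new ArrayList<>(Collections.nCopies(10, "bob"));
        b.addAll(Collections.nCopies(10, "the"));
        List<List<String>> out = downsample(Arrays.asList(a, b), Arrays.asList("A", "B"),
                                            0.05, new Random(1));
        System.out.println(out.get(0).size() + " of " + a.size() + " tokens kept for label A");
    }
}
```

In this toy corpus the character names occur only under their own label, so they are candidates for thinning, while "the" is spread evenly and is left alone. Note that the sketch only removes tokens; it never edits the vocabulary, which mirrors the remark above that words may vanish from the corpus while remaining in the alphabet.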
Constructors Constructor Description