Class DownsampleLabelWords


  • public class DownsampleLabelWords
    extends java.lang.Object
    This class implements the method from "Authorless Topic Models" by Thompson and Mimno, COLING 2018. The goal is to reduce the frequency of words that are unusually strongly associated with a particular label. This is useful as a pre-processing step for topic modeling because it reduces the correlation between topics and known class labels. The problem comes up most often in fiction, where topics tend to simply reproduce lists of characters. The input is a labeled feature sequence of the sort used for topic modeling. Unlike the regular topic modeling system, labels are required here, since the method needs something to correlate words against. The output is another feature sequence with some word tokens removed. Note that some words may disappear from the corpus entirely, but they will still be present in the alphabet. The code takes one parameter, equivalent to a p-value, where the null hypothesis is that a word occurs no more frequently in one category than in the collection as a whole.
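    The description above can be illustrated with a minimal, self-contained sketch. It is not the code in this class and does not use the MALLET API: the class name DownsampleSketch, its method names, and the plain int-array corpus representation are all hypothetical. It also assumes one particular reading of the p-value parameter: for each label and word, an exact one-sided binomial test finds the largest count that would not be significant under the null hypothesis, and tokens above that count are dropped at random.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; not the MALLET implementation.
final class DownsampleSketch {

    /**
     * Largest count k for which P(X >= k) > pValue, X ~ Binomial(n, q).
     * A word seen more often than this within a label is treated as
     * significantly over-represented. Plain double arithmetic is used,
     * so extremely small p-values lose precision.
     */
    static int maxUnsurprisingCount(int n, double q, double pValue) {
        if (pValue <= 0.0 || pValue >= 1.0) {
            throw new IllegalArgumentException("pValue must be in (0, 1)");
        }
        if (q <= 0.0) return 0;
        if (q >= 1.0) return n;
        double logPmf = n * Math.log1p(-q);            // log P(X = 0)
        double logOdds = Math.log(q) - Math.log1p(-q);
        double cdf = 0.0;                              // P(X <= k - 1)
        for (int k = 0; k <= n; k++) {
            if (1.0 - cdf <= pValue) return k - 1;
            cdf += Math.exp(logPmf);
            // Advance from log P(X = k) to log P(X = k + 1).
            logPmf += Math.log((double) (n - k) / (k + 1)) + logOdds;
        }
        return n;
    }

    /**
     * docs[d] holds word IDs, labels[d] the label ID of document d.
     * Returns new documents with some tokens removed; word IDs are unchanged,
     * so a word may vanish from the corpus while keeping its ID.
     */
    static int[][] downsample(int[][] docs, int[] labels, int numWords,
                              int numLabels, double pValue, Random random) {
        long[] wordTotals = new long[numWords];
        int[] labelTotals = new int[numLabels];
        int[][] counts = new int[numLabels][numWords];
        long total = 0;
        for (int d = 0; d < docs.length; d++) {
            for (int w : docs[d]) {
                wordTotals[w]++;
                labelTotals[labels[d]]++;
                counts[labels[d]][w]++;
                total++;
            }
        }
        // Probability of keeping each token of word w under label l.
        double[][] keep = new double[numLabels][numWords];
        for (int l = 0; l < numLabels; l++) {
            for (int w = 0; w < numWords; w++) {
                if (counts[l][w] == 0) continue;
                double q = (double) wordTotals[w] / total;  // corpus-wide rate
                int limit = maxUnsurprisingCount(labelTotals[l], q, pValue);
                keep[l][w] = Math.min(1.0, (double) limit / counts[l][w]);
            }
        }
        int[][] result = new int[docs.length][];
        for (int d = 0; d < docs.length; d++) {
            int[] kept = new int[docs[d].length];
            int size = 0;
            for (int w : docs[d]) {
                if (random.nextDouble() < keep[labels[d]][w]) kept[size++] = w;
            }
            result[d] = Arrays.copyOf(kept, size);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] doc0 = new int[100];        // 50 tokens of word 0, 50 of word 1
        Arrays.fill(doc0, 50, 100, 1);
        int[] doc1 = new int[100];        // 100 tokens of word 1
        Arrays.fill(doc1, 1);
        int[][] out = downsample(new int[][] {doc0, doc1}, new int[] {0, 1},
                                 2, 2, 0.05, new Random(42));
        System.out.println(out[0].length + " " + out[1].length);
    }
}
```

    In the toy corpus in main, word 0 occurs only under label 0 (much like a character name that appears in a single novel), so part of its tokens are dropped there, while tokens of a word that is not over-represented under a label are all kept. Random removal is one possible design choice; it keeps the remaining tokens in their original order.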
    Author:
    David Mimno
    • Method Summary

      Modifier and Type Method Description
      static void main(java.lang.String[] args)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • DownsampleLabelWords

        public DownsampleLabelWords()
    • Method Detail

      • main

        public static void main(java.lang.String[] args)
                         throws java.io.FileNotFoundException,
                                java.io.IOException
        Throws:
        java.io.FileNotFoundException
        java.io.IOException