Class ParallelTopicModel

  • All Implemented Interfaces:
    java.io.Serializable
    Direct Known Subclasses:
    DMRTopicModel, RTopicModel

    public class ParallelTopicModel
    extends java.lang.Object
    A simple parallel, multi-threaded implementation of LDA, following Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models", JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections", KDD (2009).
    Author:
    David Mimno, Andrew McCallum
    See Also:
    Serialized Form
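The class description cites the distributed scheme of Newman et al., in which each worker samples against its own copy of the topic counts and the copies are reconciled afterwards. The sketch below illustrates only that reconciliation step on a single count vector. `CountMerge` and `merge` are hypothetical names, and this is not a claim about how ParallelTopicModel combines counts internally.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class CountMerge {
    /**
     * Reconcile per-worker copies of a count vector with the shared one:
     * merged = global + sum over workers of (local - global).
     */
    static int[] merge(int[] global, List<int[]> workerCopies) {
        int[] merged = global.clone();
        for (int[] local : workerCopies) {
            for (int k = 0; k < global.length; k++) {
                // Apply only the change this worker made to its private copy.
                merged[k] += local[k] - global[k];
            }
        }
        return merged;
    }
}
```

Because every worker only moves tokens between topics, the total token count is the same before and after the merge.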
    • Field Detail

      • logger

        public static java.util.logging.Logger logger
      • numTopics

        public int numTopics
      • topicMask

        public int topicMask
      • topicBits

        public int topicBits
      • numTypes

        public int numTypes
      • totalTokens

        public long totalTokens
      • alpha

        public double[] alpha
      • alphaSum

        public double alphaSum
      • beta

        public double beta
      • betaSum

        public double betaSum
      • usingSymmetricAlpha

        public boolean usingSymmetricAlpha
      • typeTopicCounts

        public int[][] typeTopicCounts
      • tokensPerTopic

        public int[] tokensPerTopic
      • docLengthCounts

        public int[] docLengthCounts
      • topicDocCounts

        public int[][] topicDocCounts
      • numIterations

        public int numIterations
      • burninPeriod

        public int burninPeriod
      • saveSampleInterval

        public int saveSampleInterval
      • optimizeInterval

        public int optimizeInterval
      • temperingInterval

        public int temperingInterval
      • showTopicsInterval

        public int showTopicsInterval
      • wordsPerTopic

        public int wordsPerTopic
      • saveStateInterval

        public int saveStateInterval
      • stateFilename

        public java.lang.String stateFilename
      • saveModelInterval

        public int saveModelInterval
      • modelFilename

        public java.lang.String modelFilename
      • randomSeed

        public int randomSeed
      • formatter

        public java.text.NumberFormat formatter
      • printLogLikelihood

        public boolean printLogLikelihood
    • Constructor Detail

      • ParallelTopicModel

        public ParallelTopicModel​(int numberOfTopics)
      • ParallelTopicModel

        public ParallelTopicModel​(int numberOfTopics,
                                  double alphaSum,
                                  double beta)
      • ParallelTopicModel

        public ParallelTopicModel​(LabelAlphabet topicAlphabet,
                                  double alphaSum,
                                  double beta)
    • Method Detail

      • getAlphabet

        public Alphabet getAlphabet()
      • getNumTopics

        public int getNumTopics()
      • setNumTopics

        public void setNumTopics​(int numTopics)
        Set or reset the number of topics. This method will not change any token-topic assignments, so it should only be used before initializing or restoring a previously saved state.
      • getTypeTopicCounts

        public int[][] getTypeTopicCounts()
      • getTokensPerTopic

        public int[] getTokensPerTopic()
      • setNumIterations

        public void setNumIterations​(int numIterations)
      • setBurninPeriod

        public void setBurninPeriod​(int burninPeriod)
      • setTopicDisplay

        public void setTopicDisplay​(int interval,
                                    int n)
      • setRandomSeed

        public void setRandomSeed​(int seed)
      • setOptimizeInterval

        public void setOptimizeInterval​(int interval)
        Set the interval, in iterations, at which the Dirichlet hyperparameters are optimized.
      • setSymmetricAlpha

        public void setSymmetricAlpha​(boolean b)
      • setTemperingInterval

        public void setTemperingInterval​(int interval)
      • setNumThreads

        public void setNumThreads​(int threads)
      • setSaveState

        public void setSaveState​(int interval,
                                 java.lang.String filename)
        Define how often and where to save a text representation of the current state. Files are GZipped.
        interval - Save a copy of the state every interval iterations.
        filename - Save the state to this file, with the iteration number as a suffix
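As a rough illustration of the behaviour described for setSaveState, the sketch below builds a per-iteration file name and compresses a text state with GZIP. `StateSnapshots` and its methods are hypothetical. The dot separator before the iteration number and the rule that an interval of 0 disables saving are assumptions, not documented here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class StateSnapshots {
    /** Assumed naming scheme: the base name, a dot, then the iteration number. */
    static String snapshotName(String filename, int iteration) {
        return filename + "." + iteration;
    }

    /** True when a snapshot is due (assumption: an interval of 0 disables saving). */
    static boolean isDue(int iteration, int interval) {
        return interval > 0 && iteration > 0 && iteration % interval == 0;
    }

    /** Compress a text representation of the state with GZIP. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Read a GZIP-compressed text state back into a string. */
    static String gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```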
      • setSaveSerializedModel

        public void setSaveSerializedModel​(int interval,
                                           java.lang.String filename)
        Define how often and where to save a serialized model.
        interval - Save a serialized model every interval iterations.
        filename - Save to this file, with the iteration number as a suffix
      • addInstances

        public void addInstances​(InstanceList training)
      • initializeFromState

        public void initializeFromState​( stateFile)
      • buildInitialTypeTopicCounts

        public void buildInitialTypeTopicCounts()
      • optimizeAlpha

        public void optimizeAlpha​(WorkerCallable[] callables)
      • temperAlpha

        public void temperAlpha​(WorkerCallable[] callables)
      • optimizeBeta

        public void optimizeBeta​(WorkerCallable[] callables)
      • estimate

        public void estimate()
      • maximize

        public void maximize​(int iterations)
        This method implements iterated conditional modes (ICM). It follows the same procedure as Gibbs sampling, but instead of sampling a topic from the conditional distribution it takes the most probable topic. It tends to converge within a small number of iterations for models that have already reached a good state through Gibbs sampling.
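The difference between the two update rules can be sketched for a single token. Given unnormalized conditional weights over topics, a Gibbs step draws a topic in proportion to its weight, while an ICM step takes the topic with the largest weight. `TopicChoice` is a hypothetical helper for illustration and does not reproduce the sampler used by this class.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class TopicChoice {
    /** Gibbs step: draw a topic in proportion to its conditional weight. */
    static int sample(double[] weights, Random random) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double u = random.nextDouble() * total;
        for (int k = 0; k < weights.length; k++) {
            u -= weights[k];
            if (u < 0.0) {
                return k;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    /** ICM step: take the single topic with the largest conditional weight. */
    static int argmax(double[] weights) {
        int best = 0;
        for (int k = 1; k < weights.length; k++) {
            if (weights[k] > weights[best]) {
                best = k;
            }
        }
        return best;
    }
}
```

The ICM step is deterministic, which is why repeated passes settle quickly once the assignments are already close to a mode.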
      • getSortedWords

        public java.util.ArrayList<java.util.TreeSet<IDSorter>> getSortedWords()
        Return a list of sorted sets (one set per topic). Each set contains IDSorter objects whose integer IDs index into the alphabet. To get the words themselves as Strings, use getTopWords().
      • getTopWords

        public java.lang.Object[][] getTopWords​(int numWords)
        Return an array (one element for each topic) of arrays of words, which are the most probable words for that topic in descending order. The words are returned as Objects, but in practice they will usually be Strings.
        numWords - The maximum length of each topic's array of words (may be less).
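The ranking described for getTopWords can be sketched for one topic as follows. The input arrays (`vocabulary`, `counts`) and the `TopWords` class are hypothetical stand-ins for the model's alphabet and per-topic word counts. Skipping zero-count words is an assumption made here to show why a topic's array may be shorter than numWords.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class TopWords {
    /** Up to numWords words for one topic, most frequent first. */
    static List<String> topWords(String[] vocabulary, int[] counts, int numWords) {
        List<Integer> ids = new ArrayList<>();
        for (int type = 0; type < counts.length; type++) {
            if (counts[type] > 0) {
                ids.add(type); // keep only words that occur in this topic
            }
        }
        // Sort word IDs by descending count.
        ids.sort(Comparator.comparingInt((Integer type) -> counts[type]).reversed());

        List<String> words = new ArrayList<>();
        for (int i = 0; i < Math.min(numWords, ids.size()); i++) {
            words.add(vocabulary[ids.get(i)]);
        }
        return words;
    }
}
```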
      • printTopWords

        public void printTopWords​( file,
                                  int numWords,
                                  boolean useNewLines)
      • printTopWords

        public void printTopWords​( out,
                                  int numWords,
                                  boolean usingNewLines)
      • displayTopWords

        public java.lang.String displayTopWords​(int numWords,
                                                boolean usingNewLines)
      • topicXMLReport

        public void topicXMLReport​( out,
                                   int numWords)
      • topicPhraseXMLReport

        public void topicPhraseXMLReport​( out,
                                         int numWords)
      • printTypeTopicCounts

        public void printTypeTopicCounts​( file)
        Write the internal representation of type-topic counts (count/topic pairs in descending order by count) to a file.
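One way to store count/topic pairs so that ordering by count comes for free is to pack both into a single int, with the count in the high bits and the topic in the low bits. The sketch below shows such a layout using a mask and a bit width, in the spirit of the topicMask and topicBits fields. `PackedCounts` is hypothetical and the exact layout shown is an assumption, not a specification of typeTopicCounts.

```java
/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class PackedCounts {
    /** Smallest all-ones bit mask that can hold every topic index below numTopics. */
    static int topicMask(int numTopics) {
        if (Integer.bitCount(numTopics) == 1) {
            return numTopics - 1; // numTopics is an exact power of two
        }
        return Integer.highestOneBit(numTopics) * 2 - 1;
    }

    /** Number of low bits reserved for the topic index. */
    static int topicBits(int numTopics) {
        return Integer.bitCount(topicMask(numTopics));
    }

    /** Count goes in the high bits, topic in the low bits. */
    static int pack(int count, int topic, int bits) {
        return (count << bits) + topic;
    }

    static int count(int packed, int bits) {
        return packed >> bits;
    }

    static int topic(int packed, int mask) {
        return packed & mask;
    }
}
```

With this layout, sorting the packed values in descending numeric order also sorts the pairs in descending order by count.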
      • printTopicWordWeights

        public void printTopicWordWeights​( file)
      • printTopicWordWeights

        public void printTopicWordWeights​( out)
        Print an unnormalized weight for every word in every topic. Most of these will be equal to the smoothing parameter beta.
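A minimal sketch of why most weights equal beta: if the unnormalized weight of a word in a topic is its count plus the smoothing parameter (an assumption about the formula), every word that never occurs in the topic gets exactly beta. `TopicWordWeights` is a hypothetical helper.

```java
/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class TopicWordWeights {
    /** Unnormalized weights for one topic: word count plus the smoothing parameter beta. */
    static double[] weights(int[] countsForTopic, double beta) {
        double[] weights = new double[countsForTopic.length];
        for (int type = 0; type < countsForTopic.length; type++) {
            weights[type] = countsForTopic[type] + beta;
        }
        return weights;
    }
}
```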
      • getTopicProbabilities

        public double[] getTopicProbabilities​(int instanceID)
        Get the smoothed distribution over topics for a training instance.
      • getTopicProbabilities

        public double[] getTopicProbabilities​(LabelSequence topics)
        Get the smoothed distribution over topics for a topic sequence, which may be from the training set or from a new instance with topics assigned by an inferencer.
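A smoothed topic distribution of this kind is commonly computed by adding the Dirichlet parameter for each topic to that topic's token count and normalizing by the document length plus the sum of the parameters. The sketch below assumes that formula. `SmoothedTopics` and its inputs are hypothetical, with `alpha` playing the role of the alpha field.

```java
/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class SmoothedTopics {
    /** p[k] = (alpha[k] + count[k]) / (documentLength + alphaSum). */
    static double[] probabilities(int[] topicCounts, double[] alpha) {
        double alphaSum = 0.0;
        int docLength = 0;
        for (int k = 0; k < topicCounts.length; k++) {
            alphaSum += alpha[k];
            docLength += topicCounts[k];
        }
        double[] p = new double[topicCounts.length];
        for (int k = 0; k < topicCounts.length; k++) {
            p[k] = (alpha[k] + topicCounts[k]) / (docLength + alphaSum);
        }
        return p;
    }
}
```

Because of the smoothing, a topic with no tokens in the document still receives a small non-zero probability.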
      • printDocumentTopics

        public void printDocumentTopics​( file)
      • printDenseDocumentTopics

        public void printDenseDocumentTopics​( out)
      • printDocumentTopics

        public void printDocumentTopics​( out)
      • printDocumentTopics

        public void printDocumentTopics(java.io.PrintWriter out,
                                        double threshold,
                                        int max)
        out - A print writer
        threshold - Only print topics with proportion greater than this number
        max - Print no more than this many topics
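The effect of the threshold and max parameters can be sketched as a filter over one document's topic proportions: keep topics whose proportion exceeds the threshold, order them from largest to smallest, and cut the list at max entries. `TopicFilter` is a hypothetical helper, and treating a negative max as "no limit" is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; not part of ParallelTopicModel. */
final class TopicFilter {
    /** Topic indices above the threshold, largest proportion first, at most max of them. */
    static List<Integer> select(double[] proportions, double threshold, int max) {
        List<Integer> topics = new ArrayList<>();
        for (int k = 0; k < proportions.length; k++) {
            if (proportions[k] > threshold) {
                topics.add(k);
            }
        }
        // Sort topic indices by descending proportion.
        topics.sort(Comparator.comparingDouble((Integer k) -> proportions[k]).reversed());

        // Assumption: a negative max means "print all topics above the threshold".
        if (max >= 0 && topics.size() > max) {
            return new ArrayList<>(topics.subList(0, max));
        }
        return topics;
    }
}
```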
      • getSubCorpusTopicWords

        public double[][] getSubCorpusTopicWords​(boolean[] documentMask,
                                                 boolean normalized,
                                                 boolean smoothed)
      • getTopicWords

        public double[][] getTopicWords​(boolean normalized,
                                        boolean smoothed)
      • getDocumentTopics

        public double[][] getDocumentTopics​(boolean normalized,
                                            boolean smoothed)
      • getTopicDocuments

        public java.util.ArrayList<java.util.TreeSet<IDSorter>> getTopicDocuments​(double smoothing)
      • printTopicDocuments

        public void printTopicDocuments​( out)
      • printTopicDocuments

        public void printTopicDocuments(java.io.PrintWriter out,
                                        int max)
        out - A print writer
        max - Print this number of top documents
      • printState

        public void printState​( f)
      • printState

        public void printState​( out)
      • modelLogLikelihood

        public double modelLogLikelihood()
      • getInferencer

        public TopicInferencer getInferencer()
        Return a tool for estimating topic distributions for new documents.
      • getProbEstimator

        public MarginalProbEstimator getProbEstimator()
        Return a tool for evaluating the marginal probability of new documents under this model.
      • write

        public void write​( serializedModelFile)
      • read

        public static ParallelTopicModel read​( f)
                                       throws java.lang.Exception