Class FeatureCountPipe

  • All Implemented Interfaces:

    public class FeatureCountPipe
    extends Pipe
    Pruning low-count features can be a good way to save memory and computation. However, pruning with Vectors2Vectors requires you to write out the unpruned instance list, read it back into memory, collect statistics, create new instances, and then write everything back out.

    This class supports a simpler method that makes two passes over the data: one to collect statistics and create an augmented "stop list", and a second to actually create instances.
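The two-pass idea described above can be illustrated without MALLET. The following is a minimal, self-contained sketch, not the class's actual implementation: the class `TwoPassPruningSketch` and its methods are hypothetical names, documents are modeled as plain lists of strings, and counts are assumed to be total occurrences across the corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, MALLET-free illustration of two-pass pruning.
final class TwoPassPruningSketch {

    // Pass 1: count every occurrence of each feature across the corpus.
    static Map<String, Integer> countFeatures(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String feature : document) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Augment an existing stop list with every feature seen fewer than
    // minimumCount times (assumed cutoff semantics: strictly below is pruned).
    static Set<String> augmentStoplist(Set<String> stoplist,
                                       Map<String, Integer> counts,
                                       int minimumCount) {
        Set<String> augmented = new HashSet<>(stoplist);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minimumCount) {
                augmented.add(entry.getKey());
            }
        }
        return augmented;
    }

    // Pass 2: build the final instances, dropping anything on the stop list.
    static List<List<String>> buildInstances(List<List<String>> documents,
                                             Set<String> stoplist) {
        List<List<String>> instances = new ArrayList<>();
        for (List<String> document : documents) {
            List<String> kept = new ArrayList<>();
            for (String feature : document) {
                if (!stoplist.contains(feature)) {
                    kept.add(feature);
                }
            }
            instances.add(kept);
        }
        return instances;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("the", "cat", "sat"),
            List.of("the", "cat", "ran"),
            List.of("the", "dog"));
        Map<String, Integer> counts = countFeatures(corpus);
        Set<String> stoplist = augmentStoplist(Set.of("the"), counts, 2);
        // prints [[cat], [cat], []]
        System.out.println(buildInstances(corpus, stoplist));
    }
}
```

The unpruned instances are never materialized here: the first pass keeps only counts, which is the source of the memory saving the description refers to.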

    See Also:
    Serialized Form
    • Constructor Detail

      • FeatureCountPipe

        public FeatureCountPipe()
      • FeatureCountPipe

        public FeatureCountPipe​(Alphabet dataAlphabet,
                                Alphabet targetAlphabet)
    • Method Detail

      • pipe

        public Instance pipe​(Instance instance)
        Description copied from class: Pipe
        Really this should be 'protected', but isn't for historical reasons.
        pipe in class Pipe
      • getPrunedAlphabet

        public Alphabet getPrunedAlphabet​(int minimumCount)
        Returns a new alphabet that contains only the features whose counts are at or above the specified minimum count.
      • writePrunedWords

        public void writePrunedWords(java.io.File prunedFile,
                                     int minimumCount)
        Writes the features whose counts fall below the specified minimum count to the pruned file, one per line. This file can then be passed to a stopword filter as "additional stopwords".
      • addPrunedWordsToStoplist

        public void addPrunedWordsToStoplist​(SimpleTokenizer tokenizer,
                                             int minimumCount)
        Adds all pruned words (features whose counts fall below the minimum count) to the internal stoplist of a SimpleTokenizer.
      • writeCommonWords

        public void writeCommonWords(java.io.File commonFile,
                                     int totalWords)
        Writes a list of the most common words to the given file, for addition to a stop file.
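
Selecting the most common words can be sketched as a sort by descending count followed by a cutoff. This is an illustration only, not the method's actual code: `CommonWordsSketch` and `mostCommon` are hypothetical names, and it assumes that `totalWords` is the maximum number of words to list, which the description above does not state explicitly. Ties are broken alphabetically so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the MALLET API.
final class CommonWordsSketch {

    // Returns up to totalWords words, ordered from most to least frequent.
    static List<String> mostCommon(Map<String, Integer> counts, int totalWords) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; equal counts fall back to alphabetical order.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Never read past the end when fewer words exist than requested.
        int limit = Math.min(totalWords, entries.size());
        List<String> words = new ArrayList<>();
        for (int rank = 0; rank < limit; rank++) {
            words.add(entries.get(rank).getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("the", 10, "of", 7, "cat", 2, "sat", 1);
        // prints [the, of]
        System.out.println(mostCommon(counts, 2));
    }
}
```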