FeatureDocFreqPipe (Mallet 2 API)

java.lang.Object
- cc.mallet.pipe.Pipe
- - cc.mallet.pipe.FeatureDocFreqPipe

All Implemented Interfaces:

AlphabetCarrying, java.io.Serializable
```
public class FeatureDocFreqPipe
extends Pipe
```
Pruning low-count features can be a good way to save memory and computation. However, pruning with Vectors2Vectors requires writing out the unpruned instance list, reading it back into memory, collecting statistics, creating new instances, and then writing everything back out.
This class supports a simpler method that makes two passes over the data: the first pass collects statistics and builds an augmented "stop list", and the second pass actually creates the instances.
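
The two-pass idea described above can be illustrated without any Mallet classes. The following is a minimal, JDK-only sketch, not the Mallet implementation: the class name `DocFreqPruningSketch`, its method names, and the sample documents are all hypothetical, and plain string arrays stand in for tokenized instances. It counts in how many documents each word occurs, builds a stop list from words whose document proportion exceeds a cutoff, and then re-reads the documents with those words dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Mallet.
public class DocFreqPruningSketch {

    /** Pass 1: for each word, count the documents that contain it at least once. */
    public static Map<String, Integer> countDocFrequencies(List<String[]> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] doc : docs) {
            // A set ensures each word is counted at most once per document.
            Set<String> seen = new HashSet<>();
            for (String word : doc) {
                seen.add(word);
            }
            for (String word : seen) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        return docFreq;
    }

    /** Stop list: words occurring in strictly more than {@code cutoff} of all documents. */
    public static Set<String> buildStoplist(Map<String, Integer> docFreq,
                                            int numDocs, double cutoff) {
        Set<String> stoplist = new HashSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if ((double) entry.getValue() / numDocs > cutoff) {
                stoplist.add(entry.getKey());
            }
        }
        return stoplist;
    }

    /** Pass 2: re-read the documents, keeping only words that are not stop-listed. */
    public static List<List<String>> prune(List<String[]> docs, Set<String> stoplist) {
        List<List<String>> pruned = new ArrayList<>();
        for (String[] doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String word : doc) {
                if (!stoplist.contains(word)) {
                    kept.add(word);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }

    public static void main(String[] args) {
        // Made-up sample data: "the" appears in every document.
        List<String[]> docs = List.of(
            new String[] {"the", "cat", "sat"},
            new String[] {"the", "dog", "ran"},
            new String[] {"the", "cat", "ran"},
            new String[] {"the", "bird", "sang", "the", "song"});

        Map<String, Integer> docFreq = countDocFrequencies(docs);
        Set<String> stoplist = buildStoplist(docFreq, docs.size(), 0.5);
        System.out.println(prune(docs, stoplist));
    }
}
```

In this sketch the comparison is strict, so a word sitting exactly at the cutoff proportion is kept. Only the stop list has to be held between the passes, which is what lets the second pass build already-pruned instances directly.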

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor Description

FeatureDocFreqPipe()

FeatureDocFreqPipe(Alphabet dataAlphabet, Alphabet targetAlphabet)

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `void` | `addPrunedWordsToStoplist(SimpleTokenizer tokenizer, double docFrequencyCutoff)` | Add all pruned words to the internal stoplist of a SimpleTokenizer. |
| `Instance` | `pipe(Instance instance)` | Really this should be 'protected', but isn't for historical reasons. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - FeatureDocFreqPipe
```
public FeatureDocFreqPipe()
```
  - FeatureDocFreqPipe
```
public FeatureDocFreqPipe(Alphabet dataAlphabet,
                          Alphabet targetAlphabet)
```
- Method Detail
  - pipe
```
public Instance pipe(Instance instance)
```
    Description copied from class: Pipe
    
    Really this should be 'protected', but isn't for historical reasons.
    
    Overrides:
    
    pipe in class Pipe
  - addPrunedWordsToStoplist
```
public void addPrunedWordsToStoplist(SimpleTokenizer tokenizer,
                                     double docFrequencyCutoff)
```
    Add all pruned words to the internal stoplist of a SimpleTokenizer.
    
    Parameters:
    
    docFrequencyCutoff - Words that occur in more than this proportion of documents are removed (added to the stoplist). A cutoff of 0.05 keeps only words with an IDF of about 3 or higher.
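
The IDF figure quoted for a cutoff of 0.05 can be checked with a short calculation. This assumes the natural-log definition of IDF, which the page does not state but which matches the quoted value; here N is the number of documents and df(w) the number of documents containing word w, both symbols introduced only for this illustration.

```latex
\mathrm{IDF}(w) = \ln\frac{N}{\mathrm{df}(w)} = -\ln p(w),
\qquad p(w) = \frac{\mathrm{df}(w)}{N}

p(w) \le 0.05
\;\Longrightarrow\;
\mathrm{IDF}(w) \ge \ln\frac{1}{0.05} = \ln 20 \approx 2.996
```

Under that assumption, words that survive the cutoff have an IDF of at least roughly 3, and words above the cutoff fall below it.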