Package cc.mallet.pipe
Class NGramPreprocessor
- java.lang.Object
  - cc.mallet.pipe.Pipe
    - cc.mallet.pipe.NGramPreprocessor
- All Implemented Interfaces: AlphabetCarrying, java.io.Serializable
public class NGramPreprocessor extends Pipe implements java.io.Serializable
This pipe changes text to lowercase, removes common XML entities (quot, apos, lt, gt), and replaces all punctuation except the hyphen (-) with whitespace. It then splits the text into tokens on whitespace and applies n-gram token replacements and deletions. Replacements are applied in the order they are specified: first in file order, then in order within each file.
- See Also: Serialized Form
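The processing order described above can be illustrated with a small standalone sketch. It is not the MALLET source and does not use the MALLET API: the class `NGramPreprocessorSketch`, the `Rule` type, and the methods `normalize` and `apply` are hypothetical names introduced for illustration only. It also assumes that each removed entity is replaced by a space, and that a deletion is a replacement with an empty result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this is NOT the MALLET implementation.
class NGramPreprocessorSketch {

    /** Hypothetical rule: replace the token sequence `from` with `to` (an empty `to` is a deletion). */
    static final class Rule {
        final String[] from;
        final String[] to;

        Rule(String from, String to) {
            this.from = from.split(" ");
            this.to = to.isEmpty() ? new String[0] : to.split(" ");
        }
    }

    /** Lowercase, drop the four entities, blank out punctuation except '-', split on whitespace. */
    static List<String> normalize(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        // Assumption: each entity becomes a space so neighbouring words are not fused.
        s = s.replaceAll("&(quot|apos|lt|gt);", " ");
        // Every ASCII punctuation character except the hyphen becomes a space.
        s = s.replaceAll("[\\p{Punct}&&[^-]]", " ");
        s = s.trim();
        List<String> tokens = new ArrayList<>();
        if (!s.isEmpty()) {
            tokens.addAll(Arrays.asList(s.split("\\s+")));
        }
        return tokens;
    }

    /** Apply the rules one after another, in the order given. */
    static List<String> apply(List<String> tokens, List<Rule> rules) {
        List<String> current = tokens;
        for (Rule rule : rules) {
            List<String> next = new ArrayList<>();
            int i = 0;
            while (i < current.size()) {
                if (matchesAt(current, i, rule.from)) {
                    next.addAll(Arrays.asList(rule.to)); // adds nothing for a deletion
                    i += rule.from.length;
                } else {
                    next.add(current.get(i));
                    i++;
                }
            }
            current = next;
        }
        return current;
    }

    /** True if the n-gram occurs in `tokens` starting at index `start`. */
    static boolean matchesAt(List<String> tokens, int start, String[] ngram) {
        if (start + ngram.length > tokens.size()) {
            return false;
        }
        for (int k = 0; k < ngram.length; k++) {
            if (!tokens.get(start + k).equals(ngram[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("new york", "new_york"),
                new Rule("new_york city", "nyc"), // only matches because the first rule ran earlier
                new Rule("the", ""));             // deletion
        List<String> tokens = normalize("&quot;The New York City marathon&quot; is well-known.");
        System.out.println(apply(tokens, rules));
    }
}
```

In this sketch the second rule can only match because the first rule has already produced `new_york`, which is why the order of replacements matters. The hyphen in `well-known` survives normalization, so hyphenated words remain single tokens.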
Nested Class Summary
- class NGramPreprocessor.ReplacementSet
Field Summary
- java.util.ArrayList<NGramPreprocessor.ReplacementSet> replacementSets
Constructor Summary
- NGramPreprocessor()
Method Summary
- int loadDeletions(java.lang.String filename)
- int loadReplacements(java.lang.String filename)
- Instance pipe(Instance instance): Really this should be 'protected', but isn't for historical reasons.
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
Field Detail
replacementSets
public java.util.ArrayList<NGramPreprocessor.ReplacementSet> replacementSets