Package cc.mallet.pipe
Class NGramPreprocessor
java.lang.Object
    cc.mallet.pipe.Pipe
        cc.mallet.pipe.NGramPreprocessor
All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable
public class NGramPreprocessor extends Pipe implements java.io.Serializable
This pipe changes text to lowercase, removes common XML entities (quot, apos, lt, gt), and replaces all punctuation except the - character with whitespace. It then breaks up tokens on whitespace and applies n-gram token replacements and deletions. Replacements are applied in the order they are specified, first by file and then within files.
See Also:
Serialized Form
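The processing order described above (lowercase, drop entities, replace punctuation, split on whitespace, then apply replacements and deletions in order) can be illustrated with a small standalone sketch. This is not MALLET's implementation and does not use the MALLET API: the class name NGramPreprocessingSketch, its methods, the regular expressions, and the in-memory rule map are all hypothetical, and the choice to replace entities with a space rather than an empty string is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of cc.mallet.pipe.
class NGramPreprocessingSketch {

    // Lowercase, drop the four named XML entities, then turn every ASCII
    // punctuation character except '-' into a space.
    static String normalize(String text) {
        String s = text.toLowerCase();
        // Entities must go first, because '&' and ';' are punctuation too.
        // Assumption: an entity becomes a space so neighbouring words do not merge.
        s = s.replaceAll("&(quot|apos|lt|gt);", " ");
        s = s.replaceAll("[\\p{Punct}&&[^-]]", " ");
        return s.trim();
    }

    // Replace each occurrence of the token sequence `target` with `replacement`.
    // An empty `replacement` acts as a deletion.
    static List<String> replaceNGram(List<String> tokens, List<String> target,
                                     List<String> replacement) {
        if (target.isEmpty()) {
            return tokens; // nothing to match
        }
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            int end = i + target.size();
            if (end <= tokens.size() && tokens.subList(i, end).equals(target)) {
                out.addAll(replacement);
                i = end;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    // Apply the rules in insertion order; an empty value means "delete this n-gram".
    static List<String> preprocess(String text, Map<String, String> rules) {
        String s = normalize(text);
        List<String> tokens = s.isEmpty()
                ? new ArrayList<String>()
                : new ArrayList<String>(Arrays.asList(s.split("\\s+")));
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            List<String> target = Arrays.asList(rule.getKey().split(" "));
            List<String> replacement = rule.getValue().isEmpty()
                    ? new ArrayList<String>()
                    : Arrays.asList(rule.getValue().split(" "));
            tokens = replaceNGram(tokens, target, replacement);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<String, String>();
        rules.put("new york", "new_york"); // bigram replacement
        rules.put("the", "");              // unigram deletion
        String text = "The State-of-the-art model &quot;works&quot; in New York, USA!";
        // prints [state-of-the-art, model, works, in, new_york, usa]
        System.out.println(preprocess(text, rules));
    }
}
```

Because the hyphen is kept, "state-of-the-art" stays a single token, so the deletion rule for "the" does not touch it. A LinkedHashMap is used so that rules run in the order they were added, mirroring the ordering guarantee stated in the class description.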
Nested Class Summary
Nested Classes
Modifier and Type    Class                               Description
class                NGramPreprocessor.ReplacementSet
Field Summary
Fields
Modifier and Type                                           Field              Description
java.util.ArrayList<NGramPreprocessor.ReplacementSet>       replacementSets
Constructor Summary
Constructors
Constructor            Description
NGramPreprocessor()
Method Summary
All Methods, Instance Methods, Concrete Methods
Modifier and Type    Method                                        Description
int                  loadDeletions(java.lang.String filename)
int                  loadReplacements(java.lang.String filename)
Instance             pipe(Instance instance)                       Really this should be 'protected', but isn't for historical reasons.
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
Field Detail
replacementSets
public java.util.ArrayList<NGramPreprocessor.ReplacementSet> replacementSets