Package cc.mallet.pipe
Class SimpleTokenizer
- java.lang.Object
  - cc.mallet.pipe.Pipe
    - cc.mallet.pipe.SimpleTokenizer
- All Implemented Interfaces: AlphabetCarrying, java.io.Serializable
public class SimpleTokenizer extends Pipe
A simple Unicode tokenizer that accepts sequences of letters as tokens.
- See Also: Serialized Form
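The description above (runs of letters become tokens, with a stoplist filtering some of them out) can be illustrated with a small standalone sketch. This is not MALLET's implementation and does not use the MALLET API: the class LetterTokenizerSketch and its tokenize method are hypothetical names, "letter" is assumed to mean whatever Character.isLetter accepts, and stoplist matching is assumed to be exact and case-sensitive.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of cc.mallet.pipe.
final class LetterTokenizerSketch {

    // Splits text into maximal runs of Unicode letters and drops any run
    // found in the stoplist. Iterates by code point so that letters outside
    // the Basic Multilingual Plane are handled as single characters.
    public static List<String> tokenize(CharSequence text, Set<String> stoplist) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = Character.codePointAt(text, i);
            if (Character.isLetter(codePoint)) {
                current.appendCodePoint(codePoint);
            } else {
                // Any non-letter (space, digit, punctuation) ends the token.
                flush(current, stoplist, tokens);
            }
            i += Character.charCount(codePoint);
        }
        flush(current, stoplist, tokens);
        return tokens;
    }

    // Emits the pending token, unless it is empty or on the stoplist.
    private static void flush(StringBuilder current, Set<String> stoplist, List<String> tokens) {
        if (current.length() > 0) {
            String token = current.toString();
            if (!stoplist.contains(token)) {
                tokens.add(token);
            }
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        Set<String> stoplist = new HashSet<>();
        stoplist.add("the");
        // prints [cat, sat, on, mat, times]
        System.out.println(tokenize("the cat sat on the mat, 2 times", stoplist));
    }
}
```

In this sketch digits and punctuation only act as separators, so "2" never becomes a token. Whether the real class treats marks, digits, or case the same way should be checked against the MALLET source.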
-
Field Summary
Modifier and Type                                 Field
protected java.util.HashSet<java.lang.String>     stoplist
static int                                        USE_DEFAULT_ENGLISH_STOPLIST
static int                                        USE_EMPTY_STOPLIST
-
Constructor Summary
- SimpleTokenizer(int languageFlag)
- SimpleTokenizer(java.io.File stopfile)
- SimpleTokenizer(java.util.HashSet<java.lang.String> stoplist)
-
Method Summary
Modifier and Type                        Method                         Description
SimpleTokenizer                          deepClone()
java.util.HashSet<java.lang.String>      getStoplist()
Instance                                 pipe(Instance instance)        Really this should be 'protected', but isn't for historical reasons.
void                                     stop(java.lang.String word)
-
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
-
Field Detail
-
USE_EMPTY_STOPLIST
public static final int USE_EMPTY_STOPLIST
- See Also: Constant Field Values
-
USE_DEFAULT_ENGLISH_STOPLIST
public static final int USE_DEFAULT_ENGLISH_STOPLIST
- See Also: Constant Field Values
-
stoplist
protected java.util.HashSet<java.lang.String> stoplist
-
Method Detail
-
deepClone
public SimpleTokenizer deepClone()
-
getStoplist
public java.util.HashSet<java.lang.String> getStoplist()
-
stop
public void stop(java.lang.String word)
-