Class TokenSequenceParseFeatureString
- java.lang.Object
  - cc.mallet.pipe.Pipe
    - cc.mallet.pipe.TokenSequenceParseFeatureString
- All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable
public class TokenSequenceParseFeatureString extends Pipe implements java.io.Serializable
Convert the string in each Token.text field to a list of Strings (space delimited). Add each string as a feature to the token. If realValued is true, then treat the position in the list as the feature name and the value as a double. Otherwise, the feature name is the string itself and the value is 1.0.
Modified to allow feature names and values to be specified, e.g. featureName1=featureValue1 featureName2=featureValue2 ... The name/value separator (here '=') can be specified.
If your data consists of feature/value pairs (e.g. height=10.7 width=3.6 length=1.7), use new TokenSequenceParseFeatureString(true, true, "="). This format is typically used for sparse data, in which most features are equal to 0 in any given instance.
If your data consists only of values, and the position determines which feature the value is for (e.g. 10.7 3.6 1.7), use new TokenSequenceParseFeatureString(true). This format is typically used for data that has a small number of features that all have non-zero values most of the time.
If your data is in the form of named binary indicator variables (e.g. yellow quacks has_webbed_feet), use the constructor new TokenSequenceParseFeatureString(false). Each token will be interpreted as the name of a feature, whose value is 1.0.
- Author:
- Aron Culotta culotta@cs.umass.edu
- See Also:
- Serialized Form
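The three input formats described above can be illustrated with a small plain-Java sketch. It does not use MALLET and is not the library's implementation: the class FeatureStringSketch and its parse method are hypothetical names introduced here, and a plain Map stands in for the features that the real pipe attaches to each token. The sketch assumes a literal (non-regex) separator and rejects name/value tokens that lack it. That check is a choice made for this example, not documented behavior of the class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of MALLET.
final class FeatureStringSketch {

    // Parses one whitespace-delimited string into feature name -> value,
    // following the three documented modes.
    static Map<String, Double> parse(String text, boolean realValued,
                                     boolean specifyFeatureNames, String separator) {
        Map<String, Double> features = new LinkedHashMap<>();
        String[] tokens = text.trim().split("\\s+");
        for (int k = 0; k < tokens.length; k++) {
            String token = tokens[k];
            if (token.isEmpty()) {
                continue; // empty input yields no features
            }
            if (specifyFeatureNames) {
                // name/value mode: realValued is not consulted here
                int at = token.indexOf(separator);
                if (at < 0) {
                    // Assumption of this sketch: reject malformed pairs.
                    throw new IllegalArgumentException("no separator in token: " + token);
                }
                String name = token.substring(0, at);
                String value = token.substring(at + separator.length());
                features.put(name, Double.parseDouble(value));
            } else if (realValued) {
                // positional mode: the name is derived from the position K
                features.put("Feature#" + k, Double.parseDouble(token));
            } else {
                // binary indicator mode: the token is the name, the value is 1.0
                features.put(token, 1.0);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(parse("height=10.7 width=3.6 length=1.7", true, true, "="));
        System.out.println(parse("10.7 3.6 1.7", true, false, "="));
        System.out.println(parse("yellow quacks has_webbed_feet", false, false, "="));
    }
}
```

The three calls in main correspond to new TokenSequenceParseFeatureString(true, true, "="), new TokenSequenceParseFeatureString(true), and new TokenSequenceParseFeatureString(false) respectively.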
Constructor Summary
Constructors
- TokenSequenceParseFeatureString(boolean _realValued)
- TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames)
- TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator)
-
Method Summary
Modifier and Type, Method, Description
- Instance pipe(Instance carrier): Really this should be 'protected', but isn't for historical reasons.
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
Constructor Detail
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator)
- Parameters:
_realValued - interpret each data token as a double, and associate it with a feature called "Feature#K" where K is the order of the token, starting with 0. Note that this option is currently ignored if _specifyFeatureNames is true.
_specifyFeatureNames - interpret each data token as a feature name/value pair, separated by some delimiter, which is the equals sign ("=") unless otherwise specified.
_nameValueSeparator - use a string other than = to separate name/value pairs. Colon (":") is a common choice. Note that this string cannot contain any whitespace, as the token stream will already have been split.
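The restriction on whitespace in the separator can be made concrete with a short plain-Java sketch. It is an illustration under stated assumptions, not MALLET code: NameValueSeparatorSketch and splitPair are hypothetical names, the separator is matched literally, and the explicit whitespace check is added here only to show why such a separator could never match once the text has been split on whitespace.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper for illustration only; not part of MALLET.
final class NameValueSeparatorSketch {

    // Splits a single token such as "height:10.7" into a name and a double value.
    static Map.Entry<String, Double> splitPair(String token, String separator) {
        // Assumption of this sketch: fail fast on separators that cannot work.
        if (separator.isEmpty() || separator.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("separator must be non-empty and contain no whitespace");
        }
        int at = token.indexOf(separator);
        if (at < 0) {
            throw new IllegalArgumentException("no separator in token: " + token);
        }
        String name = token.substring(0, at);
        double value = Double.parseDouble(token.substring(at + separator.length()));
        return new SimpleEntry<>(name, value);
    }

    public static void main(String[] args) {
        // A colon-separated pair survives whitespace splitting as one token.
        for (String token : "height:10.7 width:3.6".split("\\s+")) {
            Map.Entry<String, Double> pair = splitPair(token, ":");
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
        // With spaces around the separator, the pair is broken into three
        // tokens before any name/value parsing can happen.
        System.out.println("height = 10.7".split("\\s+").length);
    }
}
```

With the real class, the colon-separated input would be handled by constructing the pipe as new TokenSequenceParseFeatureString(true, true, ":").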
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames)
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued)