Class TokenSequenceParseFeatureString
- java.lang.Object
  - cc.mallet.pipe.Pipe
    - cc.mallet.pipe.TokenSequenceParseFeatureString
- All Implemented Interfaces:
- AlphabetCarrying, java.io.Serializable
public class TokenSequenceParseFeatureString extends Pipe implements java.io.Serializable
Converts the string in each Token.text field to a list of strings (space delimited) and adds each string as a feature to the token. If realValued is true, the position in the list is treated as the feature name and the value is parsed as a double. Otherwise, the feature name is the string itself and the value is 1.0.
Modified to allow feature names and values to be specified, e.g. featureName1=featureValue1 featureName2=featureValue2 ... The name/value separator (here '=') can be specified.
If your data consists of feature/value pairs (e.g. height=10.7 width=3.6 length=1.7), use new TokenSequenceParseFeatureString(true, true, "="). This format is typically used for sparse data, in which most features are equal to 0 in any given instance.
If your data consists only of values, and the position determines which feature each value belongs to (e.g. 10.7 3.6 1.7), use new TokenSequenceParseFeatureString(true). This format is typically used for data with a small number of features that are all non-zero most of the time.
If your data is in the form of named binary indicator variables (e.g. yellow quacks has_webbed_feet), use new TokenSequenceParseFeatureString(false). Each token is interpreted as the name of a feature whose value is 1.0.
- Author:
- Aron Culotta culotta@cs.umass.edu
- See Also:
- Serialized Form
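The three input formats described above can be illustrated without MALLET on the classpath. The following is a minimal stand-alone sketch of the documented behavior, not the library's source: the class name FeatureStringSketch and the method parse are hypothetical, the result is a plain Map rather than features set on a Token, and rejecting a pair that lacks the separator is a choice made for this sketch only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-alone illustration only; this is NOT MALLET's implementation.
class FeatureStringSketch {

    /**
     * Parses one space-delimited feature string into name -> value pairs,
     * mirroring the three documented modes of the pipe.
     */
    static Map<String, Double> parse(String text, boolean realValued,
                                     boolean specifyFeatureNames, String separator) {
        Map<String, Double> features = new LinkedHashMap<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return features;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int k = 0; k < tokens.length; k++) {
            String token = tokens[k];
            if (specifyFeatureNames) {
                // name<separator>value pairs; this branch takes precedence over realValued
                int at = token.indexOf(separator);
                if (at < 0) {
                    // Assumption of this sketch: a pair without a separator is an error.
                    throw new IllegalArgumentException("No '" + separator + "' in: " + token);
                }
                String name = token.substring(0, at);
                String value = token.substring(at + separator.length());
                features.put(name, Double.parseDouble(value));
            } else if (realValued) {
                // Positional values: the feature name is derived from the index, starting at 0
                features.put("Feature#" + k, Double.parseDouble(token));
            } else {
                // Binary indicators: the token is the feature name and the value is 1.0
                features.put(token, 1.0);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(parse("height=10.7 width=3.6 length=1.7", true, true, "="));
        System.out.println(parse("10.7 3.6 1.7", true, false, "="));
        System.out.println(parse("yellow quacks has_webbed_feet", false, false, "="));
    }
}
```

The three calls in main correspond, in order, to the constructor calls (true, true, "="), (true), and (false) recommended in the class description.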
Constructor Summary
Constructors:
- TokenSequenceParseFeatureString(boolean _realValued)
- TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames)
- TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator)
Method Summary
- Instance pipe(Instance carrier)
  Really this should be 'protected', but isn't for historical reasons.
Methods inherited from class cc.mallet.pipe.Pipe
alphabetsMatch, getAlphabet, getAlphabets, getDataAlphabet, getInstanceId, getTargetAlphabet, instanceFrom, instancesFrom, instancesFrom, isDataAlphabetSet, isTargetProcessing, newIteratorFrom, preceedingPipeDataAlphabetNotification, preceedingPipeTargetAlphabetNotification, precondition, readResolve, setDataAlphabet, setOrCheckDataAlphabet, setOrCheckTargetAlphabet, setTargetAlphabet, setTargetProcessing
Constructor Detail
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames, java.lang.String _nameValueSeparator)
- Parameters:
_realValued - interpret each data token as a double, and associate it with a feature called "Feature#K", where K is the position of the token, starting with 0. Note that this option is currently ignored if _specifyFeatureNames is true.
_specifyFeatureNames - interpret each data token as a feature name/value pair, separated by some delimiter, which is the equals sign ("=") unless otherwise specified.
_nameValueSeparator - use a string other than "=" to separate name/value pairs. Colon (":") is a common choice. Note that this string cannot contain any whitespace, as the token stream will already have been split on whitespace.
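The whitespace restriction on _nameValueSeparator follows from the order of operations: the text is split on whitespace first, so no resulting token can contain a separator that itself includes whitespace. A minimal stand-alone sketch of that reasoning is shown below; the class SeparatorSketch and the method splitPair are hypothetical names, and the up-front validation is an assumption of this sketch rather than something the constructor is documented to perform.

```java
import java.util.Arrays;

// Stand-alone illustration only; this is NOT MALLET's implementation.
class SeparatorSketch {

    /**
     * Splits a single token into {name, value} at the first occurrence of
     * the separator, or returns null if the separator does not occur.
     */
    static String[] splitPair(String token, String separator) {
        // Assumption of this sketch: reject unusable separators up front.
        if (separator.isEmpty() || separator.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException(
                "separator must be non-empty and must not contain whitespace");
        }
        int at = token.indexOf(separator);
        if (at < 0) {
            return null;
        }
        return new String[] {
            token.substring(0, at),
            token.substring(at + separator.length())
        };
    }

    public static void main(String[] args) {
        // A colon separator works because each pair survives whitespace splitting intact.
        for (String token : "height:10.7 width:3.6".split("\\s+")) {
            System.out.println(Arrays.toString(splitPair(token, ":")));
        }
        // With " = " as the intended separator, splitting on whitespace
        // breaks each pair apart, so no token contains the separator any more.
        System.out.println(Arrays.toString("height = 10.7".split("\\s+")));
    }
}
```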
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued, boolean _specifyFeatureNames)
-
TokenSequenceParseFeatureString
public TokenSequenceParseFeatureString(boolean _realValued)
-
-