Class Pipe
- java.lang.Object
-
- cc.mallet.pipe.Pipe
-
- All Implemented Interfaces:
AlphabetCarrying
,java.io.Serializable
- Direct Known Subclasses:
AddClassifierTokenPredictions
,Array2FeatureVector
,AugmentableFeatureVectorAddConjunctions
,AugmentableFeatureVectorLogScale
,BranchingPipe
,CharSequence2CharNGrams
,CharSequence2TokenSequence
,CharSequenceArray2TokenSequence
,CharSequenceLowercase
,CharSequenceNoDiacritics
,CharSequenceRemoveHTML
,CharSequenceRemoveUUEncodedBlocks
,CharSequenceReplace
,CharSequenceReplaceHtmlEntities
,CharSubsequence
,Classification2ConfidencePredictingFeatureVector
,Clusterings2Clusterer.ClusteringPipe
,ConllNer2003Sentence2TokenSequence
,ConllNer2003Sentence2TokenSequence
,CountMatches
,CountMatchesAlignedWithOffsets
,CountMatchesMatching
,CountsToFeatureSequencePipe
,Csv2Array
,Csv2FeatureVector
,Directory2FileIterator
,EnronMessage2TokenSequence
,FeatureCountPipe
,FeatureDocFreqPipe
,FeatureSequence2AugmentableFeatureVector
,FeatureSequence2FeatureVector
,FeatureSequenceConvolution
,FeaturesInWindow
,FeaturesOfFirstMention
,FeatureValueString2FeatureVector
,FeatureVectorConjunctions
,FeatureVectorSequence2FeatureVectors
,FeatureWindow
,Filename2CharSequence
,FilterEmptyFeatureVectors
,FixedVocabTokenizer
,Input2CharSequence
,InstanceListTrimFeaturesByCount
,LengthBins
,LexiconMembership
,LineGroupString2TokenSequence
,ListMember
,LongRegexMatches
,MakeAmpersandXMLFriendly
,NGramPreprocessor
,Noop
,OffsetConjunctions
,OffsetFeatureConjunction
,OffsetPropertyConjunctions
,PrintInput
,PrintInputAndTarget
,PrintTokenSequenceFeatures
,RegexMatches
,SaveDataInSource
,SelectiveSGML2TokenSequence
,SequencePrintingPipe
,SerialPipes
,SGML2TokenSequence
,SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence
,SimpleTaggerSentence2TokenSequence
,SimpleTokenizer
,SourceLocation2TokenSequence
,StringAddNewLineDelimiter
,StringList2FeatureSequence
,SvmLight2FeatureVectorAndLabel
,Target2BIOFormat
,Target2Double
,Target2FeatureSequence
,Target2Integer
,Target2Label
,Target2LabelSequence
,TargetRememberLastLabel
,TargetStringToFeatures
,TestCRF.TestCRF2String
,TestCRF.TestCRFTokenSequenceRemoveSpaces
,TestInstancePipe.Array2ArrayIterator
,TestMEMM.TestMEMM2String
,TestMEMM.TestMEMMTokenSequenceRemoveSpaces
,TestSGML2TokenSequence.Array2ArrayIterator
,Token2FeatureVector
,TokenFirstPosition
,TokenSequence2FeatureSequence
,TokenSequence2FeatureSequenceWithBigrams
,TokenSequence2FeatureVectorSequence
,TokenSequence2PorterStems
,TokenSequence2Tokenization
,TokenSequenceDocHeader
,TokenSequenceLowercase
,TokenSequenceMatchDataAndTarget
,TokenSequenceNGrams
,TokenSequenceParseFeatureString
,TokenSequenceRemoveNonAlpha
,TokenSequenceRemoveStopPatterns
,TokenSequenceRemoveStopwords
,TokenText
,TokenTextCharNGrams
,TokenTextCharPrefix
,TokenTextCharSuffix
,TokenTextNGrams
,TrieLexiconMembership
,ValueString2FeatureVector
,WordVectors
public abstract class Pipe extends java.lang.Object implements java.io.Serializable, AlphabetCarrying
The abstract superclass of all Pipes, which transform one data type to another. Pipes are most often used for feature extraction.
Although Pipe does not have any "abstract methods", in order to use a Pipe subclass you must override either the pipe(cc.mallet.types.Instance) method or the newIteratorFrom(java.util.Iterator<cc.mallet.types.Instance>) method. The former is appropriate when the pipe's processing of an Instance is strictly one-to-one: for every Instance coming in, there is exactly one Instance coming out. The latter is appropriate when the pipe's processing may result in more or fewer Instances than arrive through its source iterator.
A pipe operates on an Instance, which is a carrier of data. A pipe reads from and writes to fields in the Instance when it is requested to process the instance. It is up to the pipe which fields in the Instance it reads from and writes to, but usually a pipe will read its input from and write its output to the "data" field of an instance.
A pipe doesn't have any direct notion of input or output; it merely modifies instances that are handed to it. A set of helper classes, which implement the interface Iterator, iterate over commonly encountered input data structures and feed the elements of these data structures to a pipe as instances.
A pipe is frequently used in conjunction with an InstanceList. As instances are added to the list, they are processed by the pipe associated with the instance list, and the processed Instance is kept in the list.
In one common usage, a FileIterator is given a list of directories to operate over. The FileIterator walks through each directory, creating an instance for each file and putting the data from the file in the data field of the instance. The directory of the file is stored in the target field of the instance. The FileIterator feeds instances to an InstanceList, which processes the instances through its associated pipe and keeps the results.
Pipes can be hierarchically composed. In a typical usage, a SerialPipes is created, which holds other pipes in an ordered list. Piping an instance through a SerialPipes means piping the instance through each of the child pipes in sequence.
A pipe holds two separate Alphabets: one for the symbols (feature names) encountered in the data fields of the instances processed through the pipe, and one for the symbols (e.g. class labels) encountered in the target fields.
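The one-to-one processing and serial composition described above can be sketched with simplified stand-ins. SketchInstance, SketchPipe, LowercasePipe, WhitespaceTokenPipe and SketchSerialPipes are hypothetical names introduced only for this illustration; they are not MALLET classes, and the real Pipe and Instance carry more state (alphabets, name and source fields) than is modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an instance: a mutable carrier with data and target fields.
class SketchInstance {
    Object data;
    Object target;

    SketchInstance(Object data, Object target) {
        this.data = data;
        this.target = target;
    }
}

// Hypothetical stand-in for a pipe: abstract, yet without abstract methods.
// The default pipe() hands the instance back unchanged.
abstract class SketchPipe {
    SketchInstance pipe(SketchInstance inst) {
        return inst;
    }
}

// One-to-one pipe: reads the data field, lowercases it, writes it back.
class LowercasePipe extends SketchPipe {
    @Override
    SketchInstance pipe(SketchInstance inst) {
        inst.data = inst.data.toString().toLowerCase(Locale.ROOT);
        return inst;
    }
}

// One-to-one pipe: replaces the data string with a list of whitespace-separated tokens.
class WhitespaceTokenPipe extends SketchPipe {
    @Override
    SketchInstance pipe(SketchInstance inst) {
        String[] tokens = inst.data.toString().trim().split("\\s+");
        inst.data = new ArrayList<String>(Arrays.asList(tokens));
        return inst;
    }
}

// Composite pipe: sends the instance through each child pipe in order.
class SketchSerialPipes extends SketchPipe {
    private final List<SketchPipe> children;

    SketchSerialPipes(SketchPipe... children) {
        this.children = Arrays.asList(children);
    }

    @Override
    SketchInstance pipe(SketchInstance inst) {
        for (SketchPipe child : children) {
            inst = child.pipe(inst);
        }
        return inst;
    }
}
```

Under these assumptions, piping a SketchInstance whose data is "Hello Pipe World" through new SketchSerialPipes(new LowercasePipe(), new WhitespaceTokenPipe()) leaves a token list in the data field and does not touch the target field.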
- Author:
- Andrew McCallum mccallum@cs.umass.edu
- See Also:
- Serialized Form
-
-
Method Summary
Modifier and Type / Method / Description
boolean
alphabetsMatch(AlphabetCarrying object)
Alphabet
getAlphabet()
Alphabet[]
getAlphabets()
Alphabet
getDataAlphabet()
java.util.UUID
getInstanceId()
Alphabet
getTargetAlphabet()
Instance
instanceFrom(Instance inst)
Instance[]
instancesFrom(Instance inst)
Instance[]
instancesFrom(java.util.Iterator<Instance> source)
A convenience method that will pull all instances from source through this pipe, and return the results as an array.
boolean
isDataAlphabetSet()
boolean
isTargetProcessing()
Return true iff this pipe expects and processes information in the target slot.
java.util.Iterator<Instance>
newIteratorFrom(java.util.Iterator<Instance> source)
Given an InstanceIterator, return a new InstanceIterator whose instances have also been processed by this pipe.
Instance
pipe(Instance inst)
Really this should be 'protected', but isn't for historical reasons.
protected void
preceedingPipeDataAlphabetNotification(Alphabet a)
protected void
preceedingPipeTargetAlphabetNotification(Alphabet a)
boolean
precondition(Instance inst)
Each instance processed is tested by this method.
java.lang.Object
readResolve()
This gets called after readObject; it lets the object decide whether to return itself or return a previously read in version.
void
setDataAlphabet(Alphabet dDict)
void
setOrCheckDataAlphabet(Alphabet a)
void
setOrCheckTargetAlphabet(Alphabet a)
void
setTargetAlphabet(Alphabet tDict)
void
setTargetProcessing(boolean lookForAndProcessTarget)
Set whether input is taken from target field of instance during processing.
-
-
-
Constructor Detail
-
Pipe
public Pipe()
Construct a pipe with no data or target dictionaries.
-
Pipe
public Pipe(Alphabet dataDict, Alphabet targetDict)
Construct a pipe with data and target dictionaries. Note that, since the default values of dataDictClass and targetDictClass are null, if you specify null for one of the arguments here, this pipe step will never create the corresponding dictionary.
- Parameters:
dataDict - Alphabet that will be used as the data dictionary.
targetDict - Alphabet that will be used as the target dictionary.
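To make the role of the two dictionaries concrete, here is a minimal sketch of a dictionary as a growable mapping between symbols and integer indices, held separately for data and target. SketchAlphabet and SketchAlphabetHolder are hypothetical stand-ins written for this illustration; they are not the MALLET Alphabet or Pipe classes and leave out serialization, growth control and alphabet-matching checks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dictionary: a bidirectional symbol <-> index mapping.
class SketchAlphabet {
    private final Map<Object, Integer> indexOf = new HashMap<>();
    private final List<Object> symbols = new ArrayList<>();

    // Returns the symbol's index, assigning the next free index the first time it is seen.
    int lookupIndex(Object symbol) {
        Integer existing = indexOf.get(symbol);
        if (existing != null) {
            return existing;
        }
        int index = symbols.size();
        indexOf.put(symbol, index);
        symbols.add(symbol);
        return index;
    }

    // Returns the symbol stored at the given index.
    Object lookupObject(int index) {
        return symbols.get(index);
    }

    int size() {
        return symbols.size();
    }
}

// Hypothetical holder mirroring the two-argument constructor: either dictionary may be null,
// in which case this sketch never creates one for that slot.
class SketchAlphabetHolder {
    final SketchAlphabet dataDict;
    final SketchAlphabet targetDict;

    SketchAlphabetHolder(SketchAlphabet dataDict, SketchAlphabet targetDict) {
        this.dataDict = dataDict;
        this.targetDict = targetDict;
    }
}
```

Keeping the two mappings separate means a feature name and a class label can share the same index without colliding, since each index is only meaningful relative to its own dictionary.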
-
-
Method Detail
-
precondition
public boolean precondition(Instance inst)
Each instance processed is tested by this method. If it returns false, the instance bypasses processing by this Pipe. Common usage is to override this method in an anonymous inner subclass of Pipe:
    SerialPipes sp = new SerialPipes (new Pipe[] {
        new CharSequence2TokenSequence() {
            // Tokenize only instances whose data field holds a CharSequence.
            public boolean precondition (Instance inst) {
                return inst.getData() instanceof CharSequence;
            }
        },
        new TokenSequence2FeatureSequence(),
    });
-
pipe
public Instance pipe(Instance inst)
Really this should be 'protected', but isn't for historical reasons.
-
newIteratorFrom
public java.util.Iterator<Instance> newIteratorFrom(java.util.Iterator<Instance> source)
Given an InstanceIterator, return a new InstanceIterator whose instances have also been processed by this pipe. If you override this method, be sure to check and obey this pipe's precondition(Instance) method.
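As a rough illustration of a stage that yields fewer elements than it receives, the sketch below wraps a source iterator and drops empty entries. Plain strings stand in for Instance objects, and NonEmptyIterator is a hypothetical name rather than a MALLET class; an actual override of newIteratorFrom would return an iterator of this general shape over Instances.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper: forwards only non-empty strings from the source iterator,
// so it may produce fewer elements than it consumes.
class NonEmptyIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String lookahead; // next element to hand out, or null if none is buffered

    NonEmptyIterator(Iterator<String> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Pull from the source until an acceptable element is buffered or the source runs dry.
        while (lookahead == null && source.hasNext()) {
            String candidate = source.next();
            if (!candidate.isEmpty()) {
                lookahead = candidate;
            }
        }
        return lookahead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = lookahead;
        lookahead = null;
        return result;
    }
}
```

The lookahead buffer is the design point: because the wrapper cannot know whether another acceptable element exists without consuming the source, hasNext() does the filtering and next() only hands out what was buffered.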
-
instancesFrom
public Instance[] instancesFrom(java.util.Iterator<Instance> source)
A convenience method that will pull all instances from source through this pipe, and return the results as an array.
-
setTargetProcessing
public void setTargetProcessing(boolean lookForAndProcessTarget)
Set whether input is taken from target field of instance during processing. If argument is false, don't expect to find input material for the target. By default, this is true.
-
isTargetProcessing
public boolean isTargetProcessing()
Return true iff this pipe expects and processes information in the target slot.
-
getDataAlphabet
public Alphabet getDataAlphabet()
-
getTargetAlphabet
public Alphabet getTargetAlphabet()
-
getAlphabet
public Alphabet getAlphabet()
- Specified by:
getAlphabet
in interface AlphabetCarrying
-
getAlphabets
public Alphabet[] getAlphabets()
- Specified by:
getAlphabets
in interface AlphabetCarrying
-
alphabetsMatch
public boolean alphabetsMatch(AlphabetCarrying object)
-
setDataAlphabet
public void setDataAlphabet(Alphabet dDict)
-
isDataAlphabetSet
public boolean isDataAlphabetSet()
-
setOrCheckDataAlphabet
public void setOrCheckDataAlphabet(Alphabet a)
-
setTargetAlphabet
public void setTargetAlphabet(Alphabet tDict)
-
setOrCheckTargetAlphabet
public void setOrCheckTargetAlphabet(Alphabet a)
-
preceedingPipeDataAlphabetNotification
protected void preceedingPipeDataAlphabetNotification(Alphabet a)
-
preceedingPipeTargetAlphabetNotification
protected void preceedingPipeTargetAlphabetNotification(Alphabet a)
-
getInstanceId
public java.util.UUID getInstanceId()
-
readResolve
public java.lang.Object readResolve() throws java.io.ObjectStreamException
This gets called after readObject; it lets the object decide whether to return itself or a previously read-in version. We use a HashMap of instanceIds to determine whether we have already read in this object.
- Returns:
- Throws:
java.io.ObjectStreamException
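The id-based deduplication described for readResolve can be sketched with standard Java serialization. DedupNode, its SEEN registry and the roundTrip helper are hypothetical names for this illustration; the sketch assumes a static, JVM-wide map keyed by a per-object UUID and is not a copy of the Pipe implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical serializable object that resolves repeated reads to a single shared object.
class DedupNode implements Serializable {
    private static final long serialVersionUID = 1L;

    // Objects already read in this JVM, keyed by their instance id (assumption of this sketch).
    private static final Map<UUID, DedupNode> SEEN = new HashMap<>();

    private final UUID instanceId = UUID.randomUUID();
    final String name;

    DedupNode(String name) {
        this.name = name;
    }

    UUID getInstanceId() {
        return instanceId;
    }

    // Invoked by Java serialization after the fields have been read from the stream.
    public Object readResolve() throws ObjectStreamException {
        synchronized (SEEN) {
            DedupNode previous = SEEN.get(instanceId);
            if (previous != null) {
                return previous; // already read once: reuse that object
            }
            SEEN.put(instanceId, this);
            return this;
        }
    }

    // Serializes the node to memory and reads it back, wrapping checked exceptions.
    static DedupNode roundTrip(DedupNode node) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(node);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (DedupNode) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, reading the same serialized node twice returns the same object both times, so two structures that were saved separately but referenced one shared node end up sharing it again after loading.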
-
-