Package cc.mallet.extract
Interface Extractor
-
- All Superinterfaces:
java.io.Serializable
- All Known Implementing Classes:
CRFExtractor
public interface Extractor extends java.io.Serializable
Generic interface for objects that do information extraction. Typically, this will mean extraction of database records (see Record) from Strings, but this interface is not specific to this case.
-
Method Summary
Modifier and Type   Method                                        Description
Extraction          extract(Tokenization toks)                    Performs extraction from an object that has already been tokenized.
Extraction          extract(java.lang.Object o)                   Performs extraction given a raw object.
Extraction          extract(java.util.Iterator<Instance> source)  Performs extraction on a set of raw documents.
Pipe                getFeaturePipe()                              Returns the feature pipe used by this extractor.
Alphabet            getInputAlphabet()                            Returns an alphabet of the features used by the extractor.
LabelAlphabet       getTargetAlphabet()                           Returns an alphabet of the labels used by the extractor.
Pipe                getTokenizationPipe()                         Returns the pipe used by this extractor to tokenize the input.
void                setTokenizationPipe(Pipe pipe)                Sets the pipe used by this extractor for tokenization.
-
Method Detail
-
extract
Extraction extract(java.lang.Object o)
Performs extraction given a raw object. The object will be passed through the Extractor's pipe.
Parameters:
o - The document to extract from (often a String).
Returns:
Extraction: the results of performing extraction
-
extract
Extraction extract(Tokenization toks)
Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.
Parameters:
toks - A tokenized document
Returns:
Extraction: the results of performing extraction
-
extract
Extraction extract(java.util.Iterator<Instance> source)
Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.
Parameters:
source - A source of raw documents
Returns:
Extraction: the results of performing extraction
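The two-stage flow described for extract(java.util.Iterator<Instance> source) can be sketched as follows. This is only an illustration under stated assumptions: it does not use MALLET's Pipe or Instance classes, and the names TwoStagePipelineSketch, TOKENIZE, FEATURIZE and process are hypothetical stand-ins built on java.util.function.Function. It shows each raw document passing first through a tokenization stage and then through a feature stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical stand-in for a tokenization pipe followed by a feature pipe (not MALLET code). */
final class TwoStagePipelineSketch {

    /** Stand-in tokenization stage: splits a raw document on whitespace. */
    static final Function<String, List<String>> TOKENIZE =
            raw -> Arrays.asList(raw.trim().split("\\s+"));

    /** Stand-in feature stage: emits one lowercase word feature per token. */
    static final Function<List<String>, List<String>> FEATURIZE = tokens -> {
        List<String> features = new ArrayList<>();
        for (String token : tokens) {
            features.add("word=" + token.toLowerCase(Locale.ROOT));
        }
        return features;
    };

    /** Runs every raw document from the source through both stages, in order. */
    static List<List<String>> process(Iterator<String> source) {
        Function<String, List<String>> bothStages = TOKENIZE.andThen(FEATURIZE);
        List<List<String>> results = new ArrayList<>();
        while (source.hasNext()) {
            results.add(bothStages.apply(source.next()));
        }
        return results;
    }

    public static void main(String[] args) {
        Iterator<String> docs = Arrays.asList("John Smith", "works at IBM").iterator();
        // prints [[word=john, word=smith], [word=works, word=at, word=ibm]]
        System.out.println(process(docs));
    }
}
```

Composing the stages with andThen keeps them separate, which mirrors why the interface exposes the tokenization pipe and the feature pipe through different accessors.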
-
getFeaturePipe
Pipe getFeaturePipe()
Returns the feature pipe used by this extractor. The pipe takes an Instance and converts it into a form usable by the particular extraction algorithm. This pipe expects the Instance's data field to be a Tokenization. For example, pipes often perform feature extraction. The type of raw object expected by the pipe depends on the particular subclass of extractor.
Returns:
a pipe
-
getTokenizationPipe
Pipe getTokenizationPipe()
Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.
Returns:
a pipe
-
setTokenizationPipe
void setTokenizationPipe(Pipe pipe)
Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization. The pipe cc.mallet.pipe.CharSequence2TokenSequence is an example of a pipe that could be used here.
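To make concrete what a tokenization stage does with a raw object, here is a minimal sketch of a regex-based tokenizer that records character offsets for each token. It is an assumption-laden illustration, not MALLET's implementation: SpanTokenizerSketch, its nested Span class and tokenize are hypothetical names, and only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a tokenization pipe (not MALLET code). */
final class SpanTokenizerSketch {

    /** A token plus its character offsets in the raw document; end is exclusive. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Converts a raw character sequence into spans, one per match of tokenPattern. */
    static List<Span> tokenize(CharSequence raw, Pattern tokenPattern) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(raw);
        while (matcher.find()) {
            spans.add(new Span(matcher.group(), matcher.start(), matcher.end()));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints "Dr 0-2" and then "Smith 4-9".
        for (Span span : tokenize("Dr. Smith", Pattern.compile("\\w+"))) {
            System.out.println(span.text + " " + span.start + "-" + span.end);
        }
    }
}
```

Passing the pattern as a parameter plays the same role as a settable tokenization pipe: the caller decides how raw text is split without changing the rest of the extractor.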
-
getInputAlphabet
Alphabet getInputAlphabet()
Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.
Returns:
the input alphabet
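The idea of an alphabet that maps feature strings to indices can be illustrated with a small stand-in class. This sketch is not MALLET's Alphabet; FeatureIndexSketch is a hypothetical name, its method names are chosen for illustration only, and the feature strings are made up. It assumes indices are assigned densely in first-seen order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a feature alphabet (not MALLET code). */
final class FeatureIndexSketch {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    /** Returns the index of the feature, assigning the next free index on first sight. */
    int lookupIndex(String feature) {
        Integer existing = indexOf.get(feature);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(feature, index);
        entries.add(feature);
        return index;
    }

    /** Returns the feature string stored at the given index. */
    String lookupObject(int index) {
        return entries.get(index);
    }

    /** Returns the number of distinct features seen so far. */
    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        FeatureIndexSketch alphabet = new FeatureIndexSketch();
        System.out.println(alphabet.lookupIndex("word=Smith"));  // prints 0
        System.out.println(alphabet.lookupIndex("CAPITALIZED")); // prints 1
        System.out.println(alphabet.lookupIndex("word=Smith"));  // prints 0 again
        System.out.println(alphabet.lookupObject(1));            // prints CAPITALIZED
    }
}
```

Keeping both a map and a list gives constant-time lookup in either direction, which is what lets feature vectors store compact integer indices instead of strings.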
-
getTargetAlphabet
LabelAlphabet getTargetAlphabet()
Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).
Returns:
the target alphabet