All Superinterfaces:

java.io.Serializable

All Known Implementing Classes:

CRFExtractor
```
public interface Extractor
extends java.io.Serializable
```
Generic interface for objects that do information extraction. Typically, this means extracting database records (see `Record`) from Strings, but the interface is not specific to that case.
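
To make "extraction of database records from Strings" concrete, here is a minimal sketch of one common way a sequence-labeling extractor could turn per-token labels into record fields: adjacent tokens that share a label are merged into one field. This is an assumption for illustration, not a description of how `CRFExtractor` or `Record` work. The class `LabeledSpanSketch`, its `Field` type, and the background label `O` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types belong to the library.
final class LabeledSpanSketch {
    // One extracted field: a label and the text of the tokens it covers.
    static final class Field {
        final String label;
        final String text;

        Field(String label, String text) {
            this.label = label;
            this.text = text;
        }

        @Override
        public String toString() {
            return label + "=" + text;
        }
    }

    // Merges runs of adjacent tokens that share a label into fields.
    // Tokens carrying the assumed background label "O" are skipped.
    static List<Field> toFields(List<String> tokens, List<String> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels must align");
        }
        List<Field> fields = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String label = labels.get(i);
            int start = i;
            // Advance to the end of the run of tokens with this label.
            while (i < tokens.size() && labels.get(i).equals(label)) {
                i++;
            }
            if (!label.equals("O")) {
                fields.add(new Field(label, String.join(" ", tokens.subList(start, i))));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Alice", "Smith", "works", "at", "Acme");
        List<String> labels = Arrays.asList("PERSON", "PERSON", "O", "O", "EMPLOYEE-OF");
        // Prints [PERSON=Alice Smith, EMPLOYEE-OF=Acme]
        System.out.println(toFields(tokens, labels));
    }
}
```

In this sketch the labels are supplied by hand; in a real extractor they would come from the trained model.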

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `Extraction` | `extract(Tokenization toks)` | Performs extraction from an object that has already been tokenized. |
| `Extraction` | `extract(java.lang.Object o)` | Performs extraction given a raw object. |
| `Extraction` | `extract(java.util.Iterator<Instance> source)` | Performs extraction on a set of raw documents. |
| `Pipe` | `getFeaturePipe()` | Returns the pipe used by this extractor for feature extraction. |
| `Alphabet` | `getInputAlphabet()` | Returns an alphabet of the features used by the extractor. |
| `LabelAlphabet` | `getTargetAlphabet()` | Returns an alphabet of the labels used by the extractor. |
| `Pipe` | `getTokenizationPipe()` | Returns the pipe used by this extractor to tokenize the input. |
| `void` | `setTokenizationPipe(Pipe pipe)` | Sets the pipe used by this extractor for tokenization. |

- Method Detail
  - extract
```
Extraction extract(java.lang.Object o)
```
    Performs extraction given a raw object. The object will be passed through the Extractor's pipe.
    
    Parameters:
    
    o - The document to extract from (often a String).
    
    Returns:
    
    an Extraction holding the results of performing extraction
  - extract
```
Extraction extract(Tokenization toks)
```
    Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.
    
    Parameters:
    
    toks - A tokenized document
    
    Returns:
    
    an Extraction holding the results of performing extraction
  - extract
```
Extraction extract(java.util.Iterator<Instance> source)
```
    Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.
    
    Parameters:
    
    source - A source of raw documents
    
    Returns:
    
    an Extraction holding the results of performing extraction
  - getFeaturePipe
```
Pipe getFeaturePipe()
```
    Returns the pipe used by this extractor for feature extraction. The pipe takes an Instance and converts it into a form usable by the particular extraction algorithm. This pipe expects the Instance's data field to be a Tokenization. For example, pipes often perform feature extraction. The type of raw object expected by the pipe depends on the particular subclass of extractor.
    
    Returns:
    
    a pipe
  - getTokenizationPipe
```
Pipe getTokenizationPipe()
```
    Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.
    
    Returns:
    
    a pipe
  - setTokenizationPipe
```
void setTokenizationPipe(Pipe pipe)
```
    Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization.
    The pipe `edu.umass.cs.mallet.base.pipe.CharSequence2TokenSequence` is an example of a pipe that could be used here.
  - getInputAlphabet
```
Alphabet getInputAlphabet()
```
    Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.
    
    Returns:
    
    the input alphabet
  - getTargetAlphabet
```
LabelAlphabet getTargetAlphabet()
```
    Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).
    
    Returns:
    
    the target alphabet
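
The methods above describe two stages: a tokenization pipe that turns a raw object into tokens, and a feature pipe that turns tokens into a form the extraction algorithm can use, with an input alphabet mapping feature strings to indices. The following is a minimal sketch of that flow using plain `java.util.function.Function` objects. It does not use the library's `Pipe`, `Instance`, `Tokenization`, or `Alphabet` classes. The names `ToyAlphabet` and `TwoStagePipeSketch`, and the feature strings `W=...` and `CAPITALIZED`, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for an alphabet: maps feature strings to dense indices.
// This is not the library's Alphabet class.
final class ToyAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Returns the index of the entry, assigning the next free index if it is new.
    int indexFor(String entry) {
        Integer existing = indexOf.get(entry);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(entry, index);
        entries.add(entry);
        return index;
    }

    String entryAt(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }
}

final class TwoStagePipeSketch {
    // Stage 1 (tokenization): raw String -> list of whitespace-separated tokens.
    static Function<String, List<String>> tokenizationStage() {
        return raw -> {
            String trimmed = raw.trim();
            return trimmed.isEmpty()
                    ? new ArrayList<String>()
                    : Arrays.asList(trimmed.split("\\s+"));
        };
    }

    // Stage 2 (features): tokens -> one array of feature indices per token.
    static Function<List<String>, List<int[]>> featureStage(ToyAlphabet alphabet) {
        return tokens -> {
            List<int[]> rows = new ArrayList<>();
            for (String token : tokens) {
                int word = alphabet.indexFor("W=" + token.toLowerCase(Locale.ROOT));
                if (Character.isUpperCase(token.charAt(0))) {
                    rows.add(new int[] {word, alphabet.indexFor("CAPITALIZED")});
                } else {
                    rows.add(new int[] {word});
                }
            }
            return rows;
        };
    }

    public static void main(String[] args) {
        ToyAlphabet alphabet = new ToyAlphabet();
        // Compose the stages: the raw input passes through tokenization, then features.
        Function<String, List<int[]>> pipeline =
                tokenizationStage().andThen(featureStage(alphabet));
        for (int[] row : pipeline.apply("Alice works at Acme")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Running `main` prints `[0, 1]`, `[2]`, `[3]`, and `[4, 1]` on separate lines: the shared `CAPITALIZED` feature keeps index 1 for both capitalized tokens, which is the point of routing every feature string through a single alphabet.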
