Interface Extractor

  • All Superinterfaces:
    java.io.Serializable
    All Known Implementing Classes:
    CRFExtractor

    public interface Extractor
    extends java.io.Serializable
    Generic interface for objects that do information extraction. Typically, this will mean extraction of database records (see Record) from Strings, but this interface is not specific to this case.
    • Method Detail

      • extract

        Extraction extract(java.lang.Object o)
        Performs extraction given a raw object. The object will be passed through the Extractor's pipe.
        Parameters:
        o - The document to extract from (often a String).
        Returns:
        The results of performing extraction, as an Extraction.
      • extract

        Extraction extract(Tokenization toks)
        Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.
        Parameters:
        toks - A tokenized document
        Returns:
        The results of performing extraction, as an Extraction.
      • extract

        Extraction extract(java.util.Iterator<Instance> source)
        Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.
        Parameters:
        source - A source of raw documents
        Returns:
        The results of performing extraction, as an Extraction.
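
The relationship between the three overloads can be sketched with stand-in types. This is a minimal illustration, not MALLET code: `ToyExtractor`, the use of `java.util.function.Function` in place of `Pipe`, and `List<String>` in place of `Tokenization` and `Extraction` are all assumptions made for the example. It shows only the flow described above: raw input passes through a tokenization stage and then a feature stage, while pre-tokenized input skips the first stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an Extractor; none of these types are MALLET's.
class ToyExtractor {
    // Stand-in for the tokenization pipe: raw String -> whitespace-separated tokens.
    private final Function<String, List<String>> tokenizationStage =
            raw -> Arrays.asList(raw.trim().split("\\s+"));

    // Stand-in for the feature pipe: tokens -> lower-cased "features".
    private final Function<List<String>, List<String>> featureStage = toks -> {
        List<String> out = new ArrayList<>();
        for (String t : toks) {
            out.add(t.toLowerCase());
        }
        return out;
    };

    // Analogue of extract(Object): raw input passes through both stages.
    List<String> extractRaw(String raw) {
        return extractTokens(tokenizationStage.apply(raw));
    }

    // Analogue of extract(Tokenization): input is already tokenized,
    // so only the feature stage runs.
    List<String> extractTokens(List<String> toks) {
        return featureStage.apply(toks);
    }

    // Analogue of extract(Iterator<Instance>): every raw document
    // from the source passes through both stages.
    List<List<String>> extractAll(Iterator<String> source) {
        List<List<String>> results = new ArrayList<>();
        while (source.hasNext()) {
            results.add(extractRaw(source.next()));
        }
        return results;
    }
}
```

In this sketch, `extractRaw("John Smith")` and `extractTokens(Arrays.asList("John", "Smith"))` produce the same result, which mirrors the idea that the tokenized overload simply enters the pipeline one stage later.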
      • getFeaturePipe

        Pipe getFeaturePipe()
        Returns the pipe this extractor uses to convert an Instance into a form usable by the particular extraction algorithm; such pipes often perform feature extraction. This pipe expects the Instance's data field to be a Tokenization. The type of raw object expected by the pipe depends on the particular subclass of extractor.
        Returns:
        a pipe
      • getTokenizationPipe

        Pipe getTokenizationPipe()
        Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.
        Returns:
        a pipe
      • setTokenizationPipe

        void setTokenizationPipe(Pipe pipe)
        Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization.

        The pipe edu.umass.cs.mallet.base.pipe.CharSequence2TokenSequence is an example of a pipe that could be used here.
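
As a rough picture of what such a pipe does, the following self-contained sketch splits a character sequence into tokens with a regular expression and records each token's character offsets. It uses only `java.util.regex`; `ToyTokenizer` and `ToySpan` are hypothetical names, and the sketch does not use or reproduce MALLET's `Pipe` or `Tokenization` classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical token record: the matched text plus its [start, end) offsets.
class ToySpan {
    final String text;
    final int start;
    final int end;

    ToySpan(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical tokenizer: every match of the pattern becomes one token.
class ToyTokenizer {
    private final Pattern tokenPattern;

    ToyTokenizer(Pattern tokenPattern) {
        this.tokenPattern = tokenPattern;
    }

    List<ToySpan> tokenize(CharSequence document) {
        List<ToySpan> spans = new ArrayList<>();
        Matcher m = tokenPattern.matcher(document);
        while (m.find()) {
            // Keep the offsets so results can be mapped back to the raw text.
            spans.add(new ToySpan(m.group(), m.start(), m.end()));
        }
        return spans;
    }
}
```

With the pattern `\w+`, the input `"Dr. Ada Lovelace"` yields three spans: `Dr` at [0, 2), `Ada` at [4, 7), and `Lovelace` at [8, 16). Keeping offsets alongside the token text is one way to let extracted fields point back into the original document.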

      • getInputAlphabet

        Alphabet getInputAlphabet()
        Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.
        Returns:
        the input alphabet
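
The string-to-index mapping described above can be pictured with a small stand-in class. `ToyAlphabet` and its method names `indexFor` and `entryAt` are hypothetical and chosen for this example; the sketch is not MALLET's `Alphabet` and only illustrates the idea of assigning each distinct feature string a stable integer index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bidirectional map between feature strings and dense indices.
class ToyAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Returns the index of the entry, assigning the next free index
    // the first time the entry is seen.
    int indexFor(String entry) {
        Integer existing = indexOf.get(entry);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(entry, index);
        entries.add(entry);
        return index;
    }

    // Reverse lookup: the feature string stored at the given index.
    String entryAt(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }
}
```

Looking up the same feature string twice returns the same index, so feature vectors built at different times stay aligned as long as they share one alphabet instance.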
      • getTargetAlphabet

        LabelAlphabet getTargetAlphabet()
        Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).
        Returns:
        the target alphabet