Class CRFExtractor

  • All Implemented Interfaces:
    Extractor, java.io.Serializable

    public class CRFExtractor
    extends java.lang.Object
    implements Extractor
    Created: Oct 12, 2004
    Version:
    $Id: CRFExtractor.java,v 1.1 2007/10/22 21:37:44 mccallum Exp $
    Author:
    See Also:
    Serialized Form
    • Constructor Detail

      • CRFExtractor

        public CRFExtractor(CRF crf)
      • CRFExtractor

        public CRFExtractor(java.io.File crfFile)
                     throws java.io.IOException
        Throws:
        java.io.IOException
      • CRFExtractor

        public CRFExtractor(CRF crf,
                            Pipe tokpipe)
    • Method Detail

      • extract

        public Extraction extract(java.lang.Object o)
        Description copied from interface: Extractor
        Performs extraction given a raw object. The object will be passed through the Extractor's pipe.
        Specified by:
        extract in interface Extractor
        Parameters:
        o - The document to extract from (often a String).
        Returns:
        An Extraction holding the results of performing extraction
      • extract

        public Extraction extract(Tokenization spans)
        Description copied from interface: Extractor
        Performs extraction from an object that has already been tokenized. This method will pass spans through the extractor's pipe.
        Specified by:
        extract in interface Extractor
        Parameters:
        spans - A tokenized document
        Returns:
        An Extraction holding the results of performing extraction
      • extract

        public Extraction extract(InstanceList ilist)
        Assumes Instance.source contains the Tokenization object.
      • extract

        public Extraction extract(java.util.Iterator<Instance> source)
        Description copied from interface: Extractor
        Performs extraction on a set of raw documents. The Instances output from source will be passed through both the tokenization pipe and the feature extraction pipe.
        Specified by:
        extract in interface Extractor
        Parameters:
        source - A source of raw documents
        Returns:
        An Extraction holding the results of performing extraction
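        The two-stage flow described for the extract methods (a tokenization pipe followed by a feature extraction pipe) can be pictured with a small stand-alone sketch. It does not call MALLET. The class TwoStageSketch and its methods tokenize, featurize, and process are hypothetical stand-ins, and the whitespace splitting and the WORD=/CAPITALIZED features are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the two pipes an extractor applies in sequence.
// None of these names come from MALLET.
class TwoStageSketch {

    // Stage 1 (tokenization pipe stand-in): raw document -> tokens.
    // Assumption: tokens are separated by whitespace.
    static List<String> tokenize(String raw) {
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stage 2 (feature pipe stand-in): tokens -> one feature list per token.
    // Assumption: the features are a lowercased word and a capitalization flag.
    static List<List<String>> featurize(List<String> tokens) {
        List<List<String>> features = new ArrayList<>();
        for (String token : tokens) {
            List<String> f = new ArrayList<>();
            f.add("WORD=" + token.toLowerCase());
            if (Character.isUpperCase(token.charAt(0))) {
                f.add("CAPITALIZED");
            }
            features.add(f);
        }
        return features;
    }

    // Raw input passes through both stages in order.
    static List<List<String>> process(String raw) {
        return featurize(tokenize(raw));
    }

    public static void main(String[] args) {
        System.out.println(process("Alice joined Acme"));
    }
}
```

        In this sketch, input that is already tokenized would skip tokenize and go straight to featurize, which mirrors the split between the raw-object and Tokenization overloads.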
      • getBackgroundTag

        public java.lang.String getBackgroundTag()
      • getTokenizationPipe

        public Pipe getTokenizationPipe()
        Description copied from interface: Extractor
        Returns the pipe used by this extractor to tokenize the input. The type of Instance this pipe expects is specific to the individual extractor. This pipe will return an Instance whose data is a Tokenization.
        Specified by:
        getTokenizationPipe in interface Extractor
        Returns:
        a pipe
      • setTokenizationPipe

        public void setTokenizationPipe(Pipe tokenizationPipe)
        Description copied from interface: Extractor
        Sets the pipe used by this extractor for tokenization. The pipe should take a raw object and convert it into a Tokenization.

        The pipe edu.umass.cs.mallet.base.pipe.CharSequence2TokenSequence is an example of a pipe that could be used here.

        Specified by:
        setTokenizationPipe in interface Extractor
      • getFeaturePipe

        public Pipe getFeaturePipe()
        Description copied from interface: Extractor
        Returns the pipe this extractor uses to prepare tokenized input for the extraction algorithm. The pipe takes an Instance and converts it into a form usable by the particular extraction algorithm. This pipe expects the Instance's data field to be a Tokenization. For example, pipes often perform feature extraction. The type of raw object expected by the pipe depends on the particular subclass of extractor.
        Specified by:
        getFeaturePipe in interface Extractor
        Returns:
        a pipe
      • setFeaturePipe

        public void setFeaturePipe(Pipe featurePipe)
      • getInputAlphabet

        public Alphabet getInputAlphabet()
        Description copied from interface: Extractor
        Returns an alphabet of the features used by the extractor. The alphabet maps strings describing the features to indices.
        Specified by:
        getInputAlphabet in interface Extractor
        Returns:
        the input alphabet
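        The idea of an alphabet that maps feature strings to indices can be sketched without MALLET as follows. AlphabetSketch, indexFor, and featureAt are hypothetical names introduced for illustration; this is not the Alphabet class itself, and assigning indices in first-seen order is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a feature-name-to-index mapping.
// This is not MALLET's Alphabet class.
class AlphabetSketch {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Returns the index of the feature, assigning the next free index
    // the first time the feature is seen (assumption: first-seen order).
    int indexFor(String feature) {
        Integer existing = indexOf.get(feature);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(feature, index);
        entries.add(feature);
        return index;
    }

    // Reverse lookup: index -> feature string.
    String featureAt(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }
}
```

        A fixed index per feature string is what lets each token be represented as a sparse vector of feature indices rather than as a bag of strings.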
      • getTargetAlphabet

        public LabelAlphabet getTargetAlphabet()
        Description copied from interface: Extractor
        Returns an alphabet of the labels used by the extractor. Labels include entity types (such as PERSON) and slot names (such as EMPLOYEE-OF).
        Specified by:
        getTargetAlphabet in interface Extractor
        Returns:
        the target alphabet
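        To make the role of labels concrete, the sketch below groups consecutive tokens that share a non-background label into extracted fields. It is a stand-alone illustration, not MALLET code: LabeledSpanSketch and group are hypothetical names, the label ORG and the background tag "O" in the usage are assumptions, and the "LABEL: text" output format is chosen only for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn per-token labels into labeled fields.
// Not part of MALLET.
class LabeledSpanSketch {

    // Collects maximal runs of consecutive tokens that share the same
    // non-background label. Each field is formatted as "LABEL: tokens".
    static List<String> group(List<String> tokens, List<String> labels, String background) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels must have equal length");
        }
        List<String> fields = new ArrayList<>();
        String currentLabel = null;
        StringBuilder currentText = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (!label.equals(currentLabel)) {
                // The label changed: close the open field, if there is one.
                if (currentLabel != null && !currentLabel.equals(background)) {
                    fields.add(currentLabel + ": " + currentText);
                }
                currentLabel = label;
                currentText.setLength(0);
            }
            if (!label.equals(background)) {
                if (currentText.length() > 0) {
                    currentText.append(' ');
                }
                currentText.append(tokens.get(i));
            }
        }
        // Close a field that runs to the end of the input.
        if (currentLabel != null && !currentLabel.equals(background)) {
            fields.add(currentLabel + ": " + currentText);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Alice", "Smith", "joined", "Acme");
        List<String> labels = Arrays.asList("PERSON", "PERSON", "O", "ORG");
        System.out.println(group(tokens, labels, "O"));
    }
}
```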
      • getCrf

        public CRF getCrf()
      • slicePipes

        public void slicePipes(int num)
        Transfers some Pipes from the feature pipe to the tokenization pipe. The feature pipe must be a SerialPipes. This will destructively modify the CRF object of the extractor. This is useful if you have a CRF that has been trained from a single pipe, which you need to split up into feature and tokenization pipes.
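        The effect of slicing can be pictured with plain lists. The sketch below is stand-alone and does not use MALLET: SlicePipesSketch, its fields, and the stage names in the usage are hypothetical, and it assumes the stages are taken from the front of the feature pipeline in their original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of moving pipeline stages between two lists.
// Not MALLET code; stage names are placeholders.
class SlicePipesSketch {
    final List<String> tokenizationStages = new ArrayList<>();
    final List<String> featureStages = new ArrayList<>();

    // Starts like a CRF trained from a single pipe:
    // every stage is initially part of the feature pipeline.
    SlicePipesSketch(List<String> allStages) {
        featureStages.addAll(allStages);
    }

    // Moves the first num stages from the feature pipeline to the end of
    // the tokenization pipeline. Assumption: stages are taken from the front.
    // The feature list is modified in place, mirroring the destructive update.
    void slice(int num) {
        if (num < 0 || num > featureStages.size()) {
            throw new IllegalArgumentException("num out of range: " + num);
        }
        for (int i = 0; i < num; i++) {
            tokenizationStages.add(featureStages.remove(0));
        }
    }

    public static void main(String[] args) {
        SlicePipesSketch s = new SlicePipesSketch(
                Arrays.asList("tokenize", "word-features", "to-vectors"));
        s.slice(1);
        System.out.println("tokenization: " + s.tokenizationStages);
        System.out.println("features:     " + s.featureStages);
    }
}
```

        Because the feature pipeline is modified in place, anything else holding a reference to it sees the shortened pipeline as well, which is the sense in which the operation is destructive.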
      • pipeInput

        public Sequence pipeInput(java.lang.Object input)