Package cc.mallet.fst

Class SimpleTagger


  • public class SimpleTagger
    extends java.lang.Object
    This class's main method trains, tests, or runs a generic CRF-based sequence tagger.

    Training and test files consist of blocks of lines, one block for each instance, separated by blank lines. Each block of lines should have the first form specified for the input of SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence. A variety of command line options control the operation of the main program, as described in the comments for main.
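
    The block structure described above can be sketched with the standard library alone. This is only an illustration: `InstanceBlocks` and `split` are hypothetical names that are not part of MALLET, and the sample line layout (whitespace-separated features followed by a label) is an assumption made for the example data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MALLET: groups lines into instance blocks.
final class InstanceBlocks {
    // Splits text into blocks of non-blank lines; blank lines separate blocks.
    static List<List<String>> split(String text) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current block, if there is one.
                if (!current.isEmpty()) {
                    blocks.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed sample layout: features first, label last on each line.
        String text = "Bill CAPITALIZED noun\nslept LOWERCASE verb\n\nhere LOWERCASE adverb\n";
        // Prints two blocks: one with two lines, one with a single line.
        System.out.println(split(text));
    }
}
```

    Consecutive blank lines are treated as a single separator in this sketch; whether SimpleTagger does the same is not stated here.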

    Version:
    1.0
    Author:
    Fernando Pereira pereira@cis.upenn.edu
    • Method Summary

      Modifier and Type Method Description
      static Sequence[] apply​(Transducer model, Sequence input, int k)
      Apply a transducer to an input sequence to produce the k highest-scoring output sequences.
      static void main​(java.lang.String[] args)
      Command-line wrapper to train, test, or run a generic CRF-based tagger.
      static void test​(TransducerTrainer tt, TransducerEvaluator eval, InstanceList testing)
      Test a transducer on the given test data, evaluating accuracy with the given evaluator.
      static CRF train​(InstanceList training, InstanceList testing, TransducerEvaluator eval, int[] orders, java.lang.String defaultLabel, java.lang.String forbidden, java.lang.String allowed, boolean connected, int iterations, double var, CRF crf)
      Create and train a CRF model from the given training data, optionally testing it on the given test data.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • train

        public static CRF train​(InstanceList training,
                                InstanceList testing,
                                TransducerEvaluator eval,
                                int[] orders,
                                java.lang.String defaultLabel,
                                java.lang.String forbidden,
                                java.lang.String allowed,
                                boolean connected,
                                int iterations,
                                double var,
                                CRF crf)
        Create and train a CRF model from the given training data, optionally testing it on the given test data.
        Parameters:
        training - training data
        testing - test data (possibly null)
        eval - accuracy evaluator (possibly null)
        orders - label Markov orders (main and backoff)
        defaultLabel - default label
        forbidden - regular expression specifying impossible label transitions, matched against the string "current,next" (null indicates no forbidden transitions)
        allowed - regular expression specifying allowed label transitions (null indicates everything is allowed that is not forbidden)
        connected - whether to include even transitions not occurring in the training data.
        iterations - number of training iterations
        var - Gaussian prior variance
        crf - an existing CRF to train further (possibly null, in which case a new model is created)
        Returns:
        the trained model
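
        To make the forbidden and allowed parameters concrete, here is a minimal sketch of how a pair of labels could be checked against both expressions. `TransitionFilter` and `isAllowed` are hypothetical names, not MALLET code, and the choice of a full match (rather than a partial match) against the "current,next" string is an assumption. The labels B and I are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not MALLET code: filters label transitions by regex.
final class TransitionFilter {
    // Returns true if the transition current -> next passes both filters.
    // A null pattern means "no constraint", mirroring the parameter docs.
    static boolean isAllowed(String current, String next,
                             Pattern forbidden, Pattern allowed) {
        String pair = current + "," + next;
        // Assumption: the whole "current,next" string must match.
        if (forbidden != null && forbidden.matcher(pair).matches()) {
            return false;
        }
        return allowed == null || allowed.matcher(pair).matches();
    }

    public static void main(String[] args) {
        // Example constraint: label I may not directly follow label O.
        Pattern forbidden = Pattern.compile("O,I");
        System.out.println(isAllowed("O", "I", forbidden, null)); // prints false
        System.out.println(isAllowed("B", "I", forbidden, null)); // prints true
    }
}
```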
      • test

        public static void test​(TransducerTrainer tt,
                                TransducerEvaluator eval,
                                InstanceList testing)
        Test a transducer on the given test data, evaluating accuracy with the given evaluator.
        Parameters:
        tt - the TransducerTrainer whose transducer is tested
        eval - accuracy evaluator
        testing - test data
      • apply

        public static Sequence[] apply​(Transducer model,
                                       Sequence input,
                                       int k)
        Apply a transducer to an input sequence to produce the k highest-scoring output sequences.
        Parameters:
        model - the Transducer
        input - the input sequence
        k - the number of answers to return
        Returns:
        array of the k highest-scoring output sequences
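
        The idea of returning the k highest-scoring output sequences can be illustrated with a toy decoder that enumerates every label sequence and keeps the best k. This is not the algorithm MALLET uses and it is exponential in the sequence length; `KBestSketch`, the score tables, and the additive scoring are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration, not MALLET's algorithm: exhaustive k-best decoding.
final class KBestSketch {
    // Score of a label sequence: sum of per-position and transition scores.
    static double score(int[] seq, double[][] local, double[][] trans) {
        double s = 0.0;
        for (int t = 0; t < seq.length; t++) {
            s += local[t][seq[t]];
            if (t > 0) {
                s += trans[seq[t - 1]][seq[t]];
            }
        }
        return s;
    }

    // Enumerates every label sequence and returns the k highest-scoring ones.
    static List<int[]> kBest(double[][] local, double[][] trans, int k) {
        int length = local.length;
        int numLabels = trans.length;
        int total = 1;
        for (int t = 0; t < length; t++) {
            total *= numLabels;
        }
        List<int[]> all = new ArrayList<>();
        for (int code = 0; code < total; code++) {
            // Decode the counter into one label index per position.
            int[] seq = new int[length];
            int rest = code;
            for (int t = 0; t < length; t++) {
                seq[t] = rest % numLabels;
                rest /= numLabels;
            }
            all.add(seq);
        }
        // Sort by descending score.
        all.sort((a, b) -> Double.compare(score(b, local, trans), score(a, local, trans)));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        double[][] local = {{1.0, 0.0}, {0.0, 0.5}}; // local[position][label]
        double[][] trans = {{0.2, 0.0}, {0.0, 0.0}}; // trans[previous][next]
        for (int[] seq : kBest(local, trans, 2)) {
            System.out.println(java.util.Arrays.toString(seq));
        }
    }
}
```

        With the tables in `main`, the two best sequences are [0, 1] (score 1.5) and [0, 0] (score 1.2).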
      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.Exception
        Command-line wrapper to train, test, or run a generic CRF-based tagger.
        Parameters:
        args - the command line arguments. Options (shell and Java quoting should be added as needed):
        --help boolean
        Print this command line option usage information. Give true for longer documentation. Default is false.
        --prefix-code Java-code
        Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's. Default is null.
        --gaussian-variance positive-number
        The Gaussian prior variance used for training. Default is 10.0.
        --train boolean
        Whether to train. Default is false.
        --iterations positive-integer
        Number of training iterations. Default is 500.
        --test lab or perclass or seg=start-1.continue-1,...,start-n.continue-n
        Test measuring labeling or segmentation (start-i, continue-i) accuracy. Default is no testing.
        --training-proportion number-between-0-and-1
        Fraction of data to use for training in a random split. Default is 0.5.
        --model-file filename
        The filename for reading (train/run) or saving (train) the model. Default is null.
        --random-seed integer
        The random seed for randomly selecting a proportion of the instance list for training. Default is 0.
        --orders comma-separated-integers
        List of label Markov orders (main and backoff). Default is 1.
        --forbidden regular-expression
        If label-1,label-2 matches the expression, the corresponding transition is forbidden. Default is \\s (nothing forbidden).
        --allowed regular-expression
        If label-1,label-2 does not match the expression, the corresponding transition is forbidden. Default is .* (everything allowed).
        --default-label string
        Label for initial context and uninteresting tokens. Default is O.
        --viterbi-output boolean
        Print the Viterbi output periodically during training. Default is false.
        --fully-connected boolean
        Include all allowed transitions, even those not in training data. Default is true.
        --weights sparse|some-dense|dense
        Create sparse, some dense (using a heuristic), or dense features on transitions. Default is some-dense.
        --n-best positive-integer
        Number of answers to output when applying model. Default is 1.
        --include-input boolean
        Whether to include input features when printing decoding output. Default is false.
        --threads positive-integer
        Number of threads for CRF training. Default is 1.
        Remaining arguments:
        • training-data-file if training
        • training-and-test-data-file if training and testing with random split
        • training-data-file test-data-file if training and testing from separate files
        • test-data-file if testing
        • input-data-file if applying to new data (unlabeled)
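
        The list above can be read as a small decision table over whether --train and --test are given and how many file arguments remain. The sketch below encodes that table; `RemainingArgs` and `describe` are hypothetical names, not MALLET code, and throwing on other combinations is a choice made for this example rather than documented behavior.

```java
// Hypothetical helper, not MALLET code: names the role of the remaining arguments.
final class RemainingArgs {
    // train/test say whether those options are in effect; fileCount is the
    // number of arguments left over after the options.
    static String describe(boolean train, boolean test, int fileCount) {
        if (train && test && fileCount == 2) {
            return "training-data-file test-data-file";
        }
        if (train && test && fileCount == 1) {
            return "training-and-test-data-file (random split)";
        }
        if (train && !test && fileCount == 1) {
            return "training-data-file";
        }
        if (!train && test && fileCount == 1) {
            return "test-data-file";
        }
        if (!train && !test && fileCount == 1) {
            return "input-data-file";
        }
        // Assumption for this sketch only: anything else is rejected.
        throw new IllegalArgumentException("unsupported combination");
    }

    public static void main(String[] args) {
        System.out.println(describe(true, true, 1));   // prints training-and-test-data-file (random split)
        System.out.println(describe(false, false, 1)); // prints input-data-file
    }
}
```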
        Throws:
        java.lang.Exception - if an error occurs
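
        Two of the option values above have a small internal syntax: --orders takes comma-separated integers, and --test accepts seg= followed by comma-separated start.continue tag pairs. The following sketch shows one way such values could be parsed. `OptionValueSketch` and its methods are hypothetical and do not reproduce MALLET's own option parser; the tag names in the example are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not MALLET's option parser.
final class OptionValueSketch {
    // Parses an --orders value such as "0,1" into {0, 1}.
    static int[] parseOrders(String value) {
        String[] parts = value.split(",");
        int[] orders = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            orders[i] = Integer.parseInt(parts[i].trim());
        }
        return orders;
    }

    // Parses "seg=start-1.continue-1,...,start-n.continue-n" into
    // a list of {start, continue} tag pairs.
    static List<String[]> parseSegmentTags(String value) {
        if (!value.startsWith("seg=")) {
            throw new IllegalArgumentException("expected a value starting with seg=");
        }
        List<String[]> pairs = new ArrayList<>();
        for (String item : value.substring("seg=".length()).split(",")) {
            String[] tags = item.split("\\.");
            if (tags.length != 2) {
                throw new IllegalArgumentException("expected start.continue, got: " + item);
            }
            pairs.add(tags);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parseOrders("0,1").length); // prints 2
        // Invented tag names, used only to show the pair structure.
        for (String[] pair : parseSegmentTags("seg=B-PER.I-PER,B-LOC.I-LOC")) {
            System.out.println(pair[0] + " starts, " + pair[1] + " continues");
        }
    }
}
```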