Class SimpleTaggerWithConstraints
- java.lang.Object
-
- cc.mallet.fst.semi_supervised.tui.SimpleTaggerWithConstraints
-
public class SimpleTaggerWithConstraints extends java.lang.Object
Version of SimpleTagger that trains CRFs with expectation constraints rather than labeled data. This class's main method trains, tests, or runs a generic CRF-based sequence tagger.
Training and test files consist of blocks of lines, one block for each instance, separated by blank lines. Each block of lines should have the first form specified for the input of SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence. A variety of command line options control the operation of the main program, as described in the comments for main.
- Version:
- 1.0
- Author:
- Gregory Druck gdruck@cs.umass.edu
-
-
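The block-per-instance file layout described in the class comment can be sketched as follows. This is a minimal illustration that does not use MALLET: the class InstanceBlockReader and its readBlocks method are hypothetical names, and the code only splits text into blank-line-separated blocks. What each line inside a block means is defined by SimpleTagger.SimpleTaggerSentence2FeatureVectorSequence, not by this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MALLET: splits tagger-style input text
// into one block of lines per instance, using blank lines as separators.
class InstanceBlockReader {

    // Returns one list of (trimmed) lines per instance.
    static List<List<String>> readBlocks(String text) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : text.split("\r?\n", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current instance, if there is one.
                if (!current.isEmpty()) {
                    blocks.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line.trim());
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current); // the last block may lack a trailing blank line
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Placeholder tokens only; they do not claim any particular feature syntax.
        String data = "tok1 tok2 X\ntok3 Y\n\ntok4 Z\n";
        List<List<String>> blocks = readBlocks(data);
        System.out.println(blocks.size() + " instances");
    }
}
```

Consecutive blank lines are treated as a single separator here, which is a design choice of the sketch rather than a documented property of the tagger.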
Method Summary
Modifier and Type, Method, Description
static Sequence[] apply(Transducer model, Sequence input, int k)
- Apply a transducer to an input sequence to produce the k highest-scoring output sequences.
static CRF getCRF(InstanceList training, int[] orders, java.lang.String defaultLabel, java.lang.String forbidden, java.lang.String allowed, boolean connected)
static void main(java.lang.String[] args)
- Command-line wrapper to train, test, or run a generic CRF-based tagger.
static void test(TransducerTrainer tt, TransducerEvaluator eval, InstanceList testing)
- Test a transducer on the given test data, evaluating accuracy with the given evaluator.
static CRF trainGE(InstanceList training, InstanceList testing, java.util.ArrayList<GEConstraint> constraints, CRF crf, TransducerEvaluator eval, int iterations, double var, int resets)
- Create and train a CRF model from the given training data, optionally testing it on the given test data.
static CRF trainPR(InstanceList training, InstanceList testing, java.util.ArrayList<PRConstraint> constraints, CRF crf, TransducerEvaluator eval, int iterations, double var)
- Create and train a CRF model from the given training data, optionally testing it on the given test data.
-
-
-
Method Detail
-
trainGE
public static CRF trainGE(InstanceList training, InstanceList testing, java.util.ArrayList<GEConstraint> constraints, CRF crf, TransducerEvaluator eval, int iterations, double var, int resets)
Create and train a CRF model from the given training data, optionally testing it on the given test data.
- Parameters:
training
- training data
testing
- test data (possibly null)
constraints
- constraints
crf
- model
eval
- accuracy evaluator (possibly null)
iterations
- number of training iterations
var
- Gaussian prior variance
resets
- number of resets
- Returns:
- the trained model
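To make the var parameter concrete: a zero-mean Gaussian prior with variance v is conventionally expressed as a penalty of sum(w_i^2) / (2v) on the model weights, so a larger variance means weaker regularization. The sketch below computes that textbook penalty. It assumes, without confirmation from this page, that the trainer uses this conventional form, and GaussianPriorSketch is a hypothetical name rather than a MALLET class.

```java
// Hypothetical illustration, not MALLET code: the conventional penalty
// contributed by a zero-mean Gaussian prior over the model weights.
class GaussianPriorSketch {

    // Returns sum(w_i^2) / (2 * variance).
    static double penalty(double[] weights, double variance) {
        if (variance <= 0) {
            throw new IllegalArgumentException("variance must be positive");
        }
        double sumSquares = 0.0;
        for (double w : weights) {
            sumSquares += w * w;
        }
        return sumSquares / (2.0 * variance);
    }

    public static void main(String[] args) {
        double[] w = {1.0, -2.0, 0.5};
        // The same weights are penalized less under a larger variance.
        System.out.println(penalty(w, 10.0));
        System.out.println(penalty(w, 1.0));
    }
}
```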
-
trainPR
public static CRF trainPR(InstanceList training, InstanceList testing, java.util.ArrayList<PRConstraint> constraints, CRF crf, TransducerEvaluator eval, int iterations, double var)
Create and train a CRF model from the given training data, optionally testing it on the given test data.
- Parameters:
training
- training data
testing
- test data (possibly null)
constraints
- constraints
crf
- model
eval
- accuracy evaluator (possibly null)
iterations
- number of training iterations
var
- Gaussian prior variance
- Returns:
- the trained model
-
getCRF
public static CRF getCRF(InstanceList training, int[] orders, java.lang.String defaultLabel, java.lang.String forbidden, java.lang.String allowed, boolean connected)
-
test
public static void test(TransducerTrainer tt, TransducerEvaluator eval, InstanceList testing)
Test a transducer on the given test data, evaluating accuracy with the given evaluator.
- Parameters:
tt
- the TransducerTrainer whose transducer is tested
eval
- accuracy evaluator
testing
- test data
-
apply
public static Sequence[] apply(Transducer model, Sequence input, int k)
Apply a transducer to an input sequence to produce the k highest-scoring output sequences.
- Parameters:
model
- the Transducer
input
- the input sequence
k
- the number of answers to return
- Returns:
- array of the k highest-scoring output sequences
-
main
public static void main(java.lang.String[] args) throws java.lang.Exception
Command-line wrapper to train, test, or run a generic CRF-based tagger.
- Parameters:
args
- the command line arguments. Options (shell and Java quoting should be added as needed):
--help boolean
- Print this command line option usage information. Give true for longer documentation. Default is false.
--prefix-code Java-code
- Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's. Default is null.
--gaussian-variance positive-number
- The Gaussian prior variance used for training. Default is 10.0.
--train boolean
- Whether to train. Default is false.
--iterations positive-integer
- Number of training iterations. Default is 500.
--test lab or seg=start-1.continue-1,...,start-n.continue-n
- Test measuring labeling or segmentation (start-i, continue-i) accuracy. Default is no testing.
--training-proportion number-between-0-and-1
- Fraction of data to use for training in a random split. Default is 0.5.
--model-file filename
- The filename for reading (train/run) or saving (train) the model. Default is null.
--random-seed integer
- The random seed for randomly selecting a proportion of the instance list for training. Default is 0.
--orders comma-separated-integers
- List of label Markov orders (main and backoff). Default is 1.
--forbidden regular-expression
- If label-1,label-2 matches the expression, the corresponding transition is forbidden. Default is \\s (nothing forbidden).
--allowed regular-expression
- If label-1,label-2 does not match the expression, the corresponding transition is forbidden. Default is .* (everything allowed).
--default-label string
- Label for initial context and uninteresting tokens. Default is O.
--viterbi-output boolean
- Print Viterbi output periodically during training. Default is false.
--fully-connected boolean
- Include all allowed transitions, even those not in training data. Default is true.
--weights sparse|some-dense|dense
- Create sparse, some dense (using a heuristic), or dense features on transitions. Default is some-dense.
--n-best positive-integer
- Number of answers to output when applying model. Default is 1.
--include-input boolean
- Whether to include input features when printing decoding output. Default is false.
--threads positive-integer
- Number of threads for CRF training. Default is 1.
Remaining arguments:
- training-data-file, if training
- training-and-test-data-file, if training and testing with random split
- training-data-file test-data-file, if training and testing from separate files
- test-data-file, if testing
- input-data-file, if applying to new data (unlabeled)
- Throws:
java.lang.Exception
- if an error occurs
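The value format of the --test option listed for main (lab, or seg= followed by comma-separated start.continue tag pairs) can be illustrated with a small parser. This is a sketch under assumptions: TestOptionSketch and parseSegmentTags are hypothetical names, the tag names in the example are invented, and the actual option handling inside main may differ in details such as error reporting.

```java
import java.util.Arrays;

// Hypothetical illustration, not MALLET code: parses a value of the form
// "seg=start-1.continue-1,...,start-n.continue-n" into start and continue tags.
class TestOptionSketch {

    // Returns {startTags, continueTags}, each of length n.
    static String[][] parseSegmentTags(String value) {
        if (!value.startsWith("seg=")) {
            throw new IllegalArgumentException("expected a value starting with seg=");
        }
        String[] pairs = value.substring(4).split(",");
        String[] starts = new String[pairs.length];
        String[] continues = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            // Each pair is "start.continue"; the dot is escaped because split takes a regex.
            String[] pair = pairs[i].split("\\.");
            if (pair.length != 2) {
                throw new IllegalArgumentException("expected start.continue, got " + pairs[i]);
            }
            starts[i] = pair[0];
            continues[i] = pair[1];
        }
        return new String[][] {starts, continues};
    }

    public static void main(String[] args) {
        String[][] tags = parseSegmentTags("seg=B-PER.I-PER,B-LOC.I-LOC");
        System.out.println(Arrays.toString(tags[0])); // [B-PER, B-LOC]
        System.out.println(Arrays.toString(tags[1])); // [I-PER, I-LOC]
    }
}
```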
-
-
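The --forbidden and --allowed options of main can be read as a filter over label pairs. The sketch below assumes that the pair is rendered as the two labels joined by a comma and that the whole string must match the pattern; both points are assumptions about the implementation, and TransitionFilterSketch and isPermitted are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not MALLET code: decides whether a transition
// between two labels survives the forbidden and allowed patterns.
class TransitionFilterSketch {

    static boolean isPermitted(String from, String to, Pattern forbidden, Pattern allowed) {
        String pair = from + "," + to; // assumed "label-1,label-2" rendering
        if (forbidden != null && forbidden.matcher(pair).matches()) {
            return false; // pair matches the forbidden expression
        }
        if (allowed != null && !allowed.matcher(pair).matches()) {
            return false; // pair fails to match the allowed expression
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern forbidden = Pattern.compile("O,I-.*"); // example: no I- tag directly after O
        Pattern allowed = Pattern.compile(".*");
        System.out.println(isPermitted("O", "I-PER", forbidden, allowed));     // false
        System.out.println(isPermitted("B-PER", "I-PER", forbidden, allowed)); // true
    }
}
```

Under this reading, the documented defaults (\\s for --forbidden and .* for --allowed) permit every transition, since no comma-joined label pair consists of a single whitespace character.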