Class FeatureConstraintUtil


  • public class FeatureConstraintUtil
    extends java.lang.Object
    Utility functions for creating feature constraints that can be used with GE training.
    Author:
    Gregory Druck gdruck@cs.umass.edu
    • Method Summary

      Modifier and Type Method Description
      static double[][] getFeatureLabelCounts​(InstanceList list, boolean useValues)  
      static java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labelFeatures​(InstanceList list, java.util.ArrayList<java.lang.Integer> features)  
      static java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labelFeatures​(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean reject)
      Label features using heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
      static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFile​(java.lang.String filename, InstanceList data)
      Reads feature constraints from a file, whether they are stored using Strings or indices.
      static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFileIndex​(java.lang.String filename, InstanceList data)
      Reads feature constraints stored using indices from a file.
      static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFileString​(java.lang.String filename, InstanceList data)
      Reads feature constraints stored using strings from a file.
      static java.util.HashMap<java.lang.Integer,​double[][]> readRangeConstraintsFromFile​(java.lang.String filename, InstanceList data)
      Reads range constraints stored using strings from a file.
      static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain​(InstanceList list, int numFeatures)
      Select features with the highest information gain.
      static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures​(int numSelFeatures, ParallelTopicModel lda, Alphabet alphabet)
      Select top features in LDA topics.
      static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list, java.util.ArrayList<java.lang.Integer> features)  
      static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean normalize)  
      static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list, java.util.ArrayList<java.lang.Integer> features, boolean useValues, boolean normalize)
      Set target distributions using estimates from data.
      static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingFeatureVoting​(java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labeledFeatures, InstanceList trainingData)
      Set target distributions using feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
      static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingHeuristic​(java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labeledFeatures, int numLabels, double majorityProb)
      Set target distributions using "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • FeatureConstraintUtil

        public FeatureConstraintUtil()
    • Method Detail

      • readRangeConstraintsFromFile

        public static java.util.HashMap<java.lang.Integer,​double[][]> readRangeConstraintsFromFile​(java.lang.String filename,
                                                                                                         InstanceList data)
        Reads range constraints stored using strings from a file. Each line can use either of two formats:
        feature_name (label_name:lower_probability,upper_probability)+
        feature_name (label_name:probability)+
        Constraints are only added for feature-label pairs that are present.
        Parameters:
        filename - File with feature constraints.
        data - InstanceList used for alphabets.
        Returns:
        Constraints.
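        The two line formats can be illustrated with a small standalone parser. This is a minimal sketch, not the library's implementation: the class RangeConstraintSketch, the labelIndex map (a stand-in for the label alphabet), the use of NaN for labels without a constraint, and the treatment of a single probability as the range [p, p] are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Map;

final class RangeConstraintSketch {
    // Parses "feature_name (label_name:lower,upper)+" or
    // "feature_name (label_name:probability)+" into one [lower, upper] pair per label.
    // labelIndex is a hypothetical stand-in for the label alphabet.
    static double[][] parseLine(String line, Map<String, Integer> labelIndex) {
        String[] tokens = line.trim().split("\\s+");
        double[][] ranges = new double[labelIndex.size()][2];
        for (double[] range : ranges) {
            Arrays.fill(range, Double.NaN); // assumption: NaN marks "no constraint"
        }
        // tokens[0] is the feature name; the rest are label:value entries.
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected label:value in " + tokens[i]);
            }
            Integer li = labelIndex.get(tokens[i].substring(0, colon));
            if (li == null) {
                throw new IllegalArgumentException("Unknown label in " + tokens[i]);
            }
            String[] bounds = tokens[i].substring(colon + 1).split(",");
            double lower = Double.parseDouble(bounds[0]);
            // assumption: a single probability is a degenerate range [p, p]
            double upper = bounds.length > 1 ? Double.parseDouble(bounds[1]) : lower;
            if (lower > upper) {
                throw new IllegalArgumentException("Lower bound exceeds upper bound in " + tokens[i]);
            }
            ranges[li][0] = lower;
            ranges[li][1] = upper;
        }
        return ranges;
    }
}
```

        With labels indexed a=0, b=1, c=2, the line "feat a:0.2,0.8 b:0.5" gives the pair (0.2, 0.8) for a, (0.5, 0.5) for b, and leaves c unconstrained.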
      • readConstraintsFromFile

        public static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFile​(java.lang.String filename,
                                                                                                  InstanceList data)
        Reads feature constraints from a file, whether they are stored using Strings or indices.
        Parameters:
        filename - File with feature constraints.
        data - InstanceList used for alphabets.
        Returns:
        Constraints.
      • readConstraintsFromFileString

        public static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFileString​(java.lang.String filename,
                                                                                                        InstanceList data)
        Reads feature constraints stored using strings from a file. Each line has the format:
        feature_name (label_name:probability)+
        Labels that do not appear get probability 0.
        Parameters:
        filename - File with feature constraints.
        data - InstanceList used for alphabets.
        Returns:
        Constraints.
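        As an illustration of the string format, the sketch below parses a single line into a target distribution. It is a hedged, standalone example rather than the library code: StringConstraintSketch and the labelIndex map (standing in for the label alphabet) are hypothetical names, and whitespace-separated tokens are assumed.

```java
import java.util.Map;

final class StringConstraintSketch {
    // Returns the feature name, which is the first whitespace-separated token.
    static String featureName(String line) {
        return line.trim().split("\\s+")[0];
    }

    // Parses "feature_name (label_name:probability)+" into a distribution
    // indexed by label. Labels that are not mentioned keep probability 0.
    // labelIndex is a hypothetical stand-in for the label alphabet.
    static double[] parseLine(String line, Map<String, Integer> labelIndex) {
        String[] tokens = line.trim().split("\\s+");
        double[] target = new double[labelIndex.size()]; // initialized to 0
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected label:probability in " + tokens[i]);
            }
            Integer li = labelIndex.get(tokens[i].substring(0, colon));
            if (li == null) {
                throw new IllegalArgumentException("Unknown label in " + tokens[i]);
            }
            target[li] = Double.parseDouble(tokens[i].substring(colon + 1));
        }
        return target;
    }
}
```

        With labels indexed sports=0, politics=1, science=2, the line "goal sports:0.9 politics:0.1" produces a three-element array in which science keeps probability 0.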
      • readConstraintsFromFileIndex

        public static java.util.HashMap<java.lang.Integer,​double[]> readConstraintsFromFileIndex​(java.lang.String filename,
                                                                                                       InstanceList data)
        Reads feature constraints stored using indices from a file. Each line has the format:
        feature_index label_0_prob label_1_prob ... label_n_prob
        Here each label must appear.
        Parameters:
        filename - File with feature constraints.
        data - InstanceList used for alphabets.
        Returns:
        Constraints.
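        The index format can be sketched the same way. The example below is an assumption-laden illustration, not the library code: IndexConstraintSketch is a hypothetical class, numLabels stands in for the size of the label alphabet, and a line with the wrong number of probabilities is rejected because each label must appear.

```java
import java.util.AbstractMap;
import java.util.Map;

final class IndexConstraintSketch {
    // Parses "feature_index label_0_prob label_1_prob ... label_n_prob"
    // into a (feature index, target distribution) pair.
    static Map.Entry<Integer, double[]> parseLine(String line, int numLabels) {
        String[] tokens = line.trim().split("\\s+");
        // One token for the feature index plus one probability per label.
        if (tokens.length != numLabels + 1) {
            throw new IllegalArgumentException(
                "Expected " + numLabels + " probabilities but found " + (tokens.length - 1));
        }
        int featureIndex = Integer.parseInt(tokens[0]);
        double[] target = new double[numLabels];
        for (int li = 0; li < numLabels; li++) {
            target[li] = Double.parseDouble(tokens[li + 1]);
        }
        return new AbstractMap.SimpleEntry<>(featureIndex, target);
    }
}
```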
      • selectFeaturesByInfoGain

        public static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain​(InstanceList list,
                                                                                      int numFeatures)
        Select features with the highest information gain.
        Parameters:
        list - InstanceList for computing information gain.
        numFeatures - Number of features to select.
        Returns:
        List of features with the highest information gains.
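        For readers unfamiliar with the criterion, the sketch below ranks features by information gain computed from per-label presence counts. It is a standalone illustration under stated assumptions, not the library's implementation: InfoGainSketch and its count arrays are hypothetical, features are treated as binary (present or absent), and entropy is measured in bits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InfoGainSketch {
    // Entropy in bits of the distribution obtained by normalizing counts.
    static double entropy(double[] counts) {
        double total = 0;
        for (double c : counts) total += c;
        if (total == 0) return 0;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // presentByLabel[li]: instances with label li that contain the feature.
    // labelCounts[li]: all instances with label li.
    // Returns H(Y) - P(f) H(Y | f present) - P(not f) H(Y | f absent).
    static double infoGain(double[] presentByLabel, double[] labelCounts) {
        double n = 0;
        double nPresent = 0;
        double[] absentByLabel = new double[labelCounts.length];
        for (int li = 0; li < labelCounts.length; li++) {
            n += labelCounts[li];
            nPresent += presentByLabel[li];
            absentByLabel[li] = labelCounts[li] - presentByLabel[li];
        }
        if (n == 0) return 0;
        return entropy(labelCounts)
            - (nPresent / n) * entropy(presentByLabel)
            - ((n - nPresent) / n) * entropy(absentByLabel);
    }

    // Returns the indices of the numFeatures features with the highest gain.
    static List<Integer> selectTop(double[][] presentCounts, double[] labelCounts, int numFeatures) {
        double[] gains = new double[presentCounts.length];
        Integer[] order = new Integer[presentCounts.length];
        for (int fi = 0; fi < presentCounts.length; fi++) {
            gains[fi] = infoGain(presentCounts[fi], labelCounts);
            order[fi] = fi;
        }
        // Sort feature indices by descending gain; ties keep index order.
        Arrays.sort(order, (a, b) -> Double.compare(gains[b], gains[a]));
        int k = Math.min(numFeatures, order.length);
        return new ArrayList<>(Arrays.asList(order).subList(0, k));
    }
}
```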
      • selectTopLDAFeatures

        public static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures​(int numSelFeatures,
                                                                                  ParallelTopicModel lda,
                                                                                  Alphabet alphabet)
        Select top features in LDA topics.
        Parameters:
        numSelFeatures - Number of features to select.
        lda - ParallelTopicModel which provides an interface to an LDA model.
        alphabet - The vector dataset alphabet.
        Returns:
        ArrayList with the int indices of the selected features.
      • setTargetsUsingData

        public static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list,
                                                                                              java.util.ArrayList<java.lang.Integer> features)
      • setTargetsUsingData

        public static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list,
                                                                                              java.util.ArrayList<java.lang.Integer> features,
                                                                                              boolean normalize)
      • setTargetsUsingData

        public static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingData​(InstanceList list,
                                                                                              java.util.ArrayList<java.lang.Integer> features,
                                                                                              boolean useValues,
                                                                                              boolean normalize)
        Set target distributions using estimates from data.
        Parameters:
        list - InstanceList used to estimate targets.
        features - List of features for constraints.
        useValues - Whether to accumulate feature values instead of counting occurrences.
        normalize - Whether to normalize by feature counts.
        Returns:
        Constraints (map of feature index to target), with targets set using estimates from supplied data.
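        The idea of estimating targets from data can be sketched from a feature-by-label count matrix. This is an illustrative example only, not the library code: DataTargetSketch is a hypothetical class, the count matrix is assumed to have the shape returned by a method such as getFeatureLabelCounts, and normalization here simply divides each row by its sum without any smoothing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DataTargetSketch {
    // featureLabelCounts[fi][li]: count of feature fi in instances with label li.
    // Returns, for each requested feature, its row of counts, optionally
    // normalized into a distribution over labels.
    static Map<Integer, double[]> targetsFromCounts(
            double[][] featureLabelCounts, List<Integer> features, boolean normalize) {
        Map<Integer, double[]> targets = new HashMap<>();
        for (int fi : features) {
            double[] row = featureLabelCounts[fi].clone(); // leave the input untouched
            if (normalize) {
                double sum = 0;
                for (double c : row) sum += c;
                // assumption: rows with no counts are left as all zeros
                if (sum > 0) {
                    for (int li = 0; li < row.length; li++) row[li] /= sum;
                }
            }
            targets.put(fi, row);
        }
        return targets;
    }
}
```

        For counts {3, 1} and normalize set to true, the resulting target is {0.75, 0.25}; with normalize set to false the raw counts are returned.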
      • setTargetsUsingHeuristic

        public static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingHeuristic​(java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                                   int numLabels,
                                                                                                   double majorityProb)
        Set target distributions using "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
        Parameters:
        labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
        numLabels - Total number of labels.
        majorityProb - Probability mass divided among majority labels.
        Returns:
        Constraints (map of feature index to target distribution), with target distributions set using heuristic.
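        A minimal sketch of this kind of heuristic follows. It reflects one reading of the parameter descriptions above, not the library's code: majorityProb is split evenly among the labels listed for a feature, and the remaining mass is split evenly among the other labels. The class SchapireHeuristicSketch and the uniform fallback for an empty or complete label list are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SchapireHeuristicSketch {
    // Builds a target distribution over numLabels labels for one feature.
    // majorityLabels is assumed to contain distinct, valid label indices.
    static double[] targetFor(List<Integer> majorityLabels, int numLabels, double majorityProb) {
        double[] target = new double[numLabels];
        int numMajority = majorityLabels.size();
        // assumption: fall back to uniform when no split between
        // majority and other labels is possible
        if (numMajority == 0 || numMajority == numLabels) {
            Arrays.fill(target, 1.0 / numLabels);
            return target;
        }
        // Remaining mass is shared by the labels not listed for the feature.
        Arrays.fill(target, (1.0 - majorityProb) / (numLabels - numMajority));
        // Majority mass is shared by the labels listed for the feature.
        for (int li : majorityLabels) {
            target[li] = majorityProb / numMajority;
        }
        return target;
    }

    // Applies targetFor to every labeled feature.
    static Map<Integer, double[]> targets(
            Map<Integer, ? extends List<Integer>> labeledFeatures, int numLabels, double majorityProb) {
        Map<Integer, double[]> constraints = new HashMap<>();
        for (Map.Entry<Integer, ? extends List<Integer>> entry : labeledFeatures.entrySet()) {
            constraints.put(entry.getKey(), targetFor(entry.getValue(), numLabels, majorityProb));
        }
        return constraints;
    }
}
```

        For example, with four labels, majority labels 0 and 1, and majorityProb 0.8, labels 0 and 1 each receive about 0.4 and labels 2 and 3 each receive about 0.1.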
      • setTargetsUsingFeatureVoting

        public static java.util.HashMap<java.lang.Integer,​double[]> setTargetsUsingFeatureVoting​(java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                                       InstanceList trainingData)
        Set target distributions using feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
        Parameters:
        labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
        trainingData - InstanceList to use for computing expectations with feature voting.
        Returns:
        Constraints (map of feature index to target distribution), with target distributions set using feature voting.
      • labelFeatures

        public static java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labelFeatures​(InstanceList list,
                                                                                                                      java.util.ArrayList<java.lang.Integer> features,
                                                                                                                      boolean reject)
        Label features using heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" Gregory Druck, Gideon Mann, Andrew McCallum.
        Parameters:
        list - InstanceList used to compute statistics for labeling features.
        features - List of features to label.
        reject - Whether to reject labeling features.
        Returns:
        Labeled features, HashMap mapping feature indices to list of labels.
      • labelFeatures

        public static java.util.HashMap<java.lang.Integer,​java.util.ArrayList<java.lang.Integer>> labelFeatures​(InstanceList list,
                                                                                                                      java.util.ArrayList<java.lang.Integer> features)
      • getFeatureLabelCounts

        public static double[][] getFeatureLabelCounts​(InstanceList list,
                                                       boolean useValues)