Class InstanceList

  • All Implemented Interfaces:
    AlphabetCarrying, java.io.Serializable, java.lang.Cloneable, java.lang.Iterable<Instance>, java.util.Collection<Instance>, java.util.List<Instance>, java.util.RandomAccess
    Direct Known Subclasses:
    MultiInstanceList, PagedInstanceList

    public class InstanceList
    extends java.util.ArrayList<Instance>
    implements java.io.Serializable, java.lang.Iterable<Instance>, AlphabetCarrying
    A list of machine learning instances, typically used for training or testing of a machine learning algorithm.

    All of the instances in the list will have been passed through the same Pipe, and thus must also share the same data and target Alphabets. InstanceList keeps a reference to the pipe and the two alphabets.

    The most common way of adding instances to an InstanceList is through the addThruPipe(Iterator<Instance>) method. Iterators over Instances are a way of mapping general data sources into instances suitable for processing through a pipe. As each Instance is pulled from the iterator, the InstanceList copies the instance and runs the copy through its pipe (with resultant destructive modifications) before saving the modified instance on its list. This is the usual way in which instances are transformed by pipes.
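
The copy-then-pipe behaviour described above can be sketched with the Java standard library alone. This is only an illustration: PipedListSketch, copier, and pipe are hypothetical names introduced here, not MALLET classes, and the real InstanceList operates on Instance and Pipe objects.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the behaviour described above; not MALLET code.
final class PipedListSketch<T> {
    private final UnaryOperator<T> copier; // assumed way of copying an item
    private final UnaryOperator<T> pipe;   // stands in for a Pipe; may mutate its argument
    private final List<T> items = new ArrayList<>();

    PipedListSketch(UnaryOperator<T> copier, UnaryOperator<T> pipe) {
        this.copier = copier;
        this.pipe = pipe;
    }

    // Pull each item from the source, copy it, pipe the copy, and store the result.
    void addThruPipe(Iterator<T> source) {
        while (source.hasNext()) {
            T copy = copier.apply(source.next());
            items.add(pipe.apply(copy));
        }
    }

    List<T> items() {
        return items;
    }
}
```

Because the pipe only ever sees the copy, the objects held by the data source are left unchanged even when the pipe modifies its input in place.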

    InstanceList also contains methods for randomly generating lists of feature vectors and for splitting lists into non-overlapping subsets (useful for test/train splits), as well as iterators for cross validation.

    Author:
    Andrew McCallum mccallum@cs.umass.edu
    See Also:
    Instance, Pipe, Serialized Form
    • Field Detail

    • Constructor Detail

      • InstanceList

        public InstanceList​(Pipe pipe,
                            int capacity)
        Constructs an InstanceList with the given capacity and the given default pipe. Typically, Instances added to this InstanceList will have gone through the pipe (for example, using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. All Instances in this InstanceList are required to share these Alphabets.
        Parameters:
        pipe - The default pipe used to process instances added via the addThruPipe methods.
        capacity - The initial capacity of the list; will grow further as necessary.
      • InstanceList

        public InstanceList​(Pipe pipe)
        Constructs an InstanceList with an initial capacity of 10 and the given default pipe. Typically, Instances added to this InstanceList will have gone through the pipe (for example, using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. All Instances in this InstanceList are required to share these Alphabets.
        Parameters:
        pipe - The default pipe used to process instances added via the addThruPipe methods.
      • InstanceList

        public InstanceList​(Alphabet dataAlphabet,
                            Alphabet targetAlphabet)
        Construct an InstanceList with initial capacity of 10, with a Noop default pipe. Used in those infrequent circumstances when Instances typically would not have further processing, and objects containing vocabularies are entered directly into the InstanceList; for example, the creation of a random InstanceList using Dirichlets and Multinomials.

        Parameters:
        dataAlphabet - The vocabulary for added instances' data fields
        targetAlphabet - The vocabulary for added instances' targets
      • InstanceList

        @Deprecated
        public InstanceList()
        Deprecated.
        Creates a list that will have its pipe set later when its first Instance is added.
      • InstanceList

        public InstanceList​(Randoms r,
                            Dirichlet classCentroidDistribution,
                            double classCentroidAverageAlphaMean,
                            double classCentroidAverageAlphaVariance,
                            double featureVectorSizePoissonLambda,
                            double classInstanceCountPoissonLambda,
                            java.lang.String[] classNames)
        Creates a list consisting of randomly-generated FeatureVectors.
      • InstanceList

        public InstanceList​(Randoms r,
                            Alphabet vocab,
                            java.lang.String[] classNames,
                            int meanInstancesPerLabel)
      • InstanceList

        public InstanceList​(Randoms r,
                            int vocabSize,
                            int numClasses)
    • Method Detail

      • clone

        public java.lang.Object clone()
        Overrides:
        clone in class java.util.ArrayList<Instance>
      • subList

        public InstanceList subList​(int start,
                                    int end)
        Specified by:
        subList in interface java.util.List<Instance>
        Overrides:
        subList in class java.util.ArrayList<Instance>
      • subList

        public InstanceList subList​(double proportion)
      • addThruPipe

        public void addThruPipe​(java.util.Iterator<Instance> ii)
        Adds to this list every instance generated by the iterator, passing each one through this InstanceList's pipe.
      • addThruPipe

        public void addThruPipe​(Instance inst)
        Adds the input instance to this list, after passing it through the InstanceList's pipe.

        If several instances are to be added, accumulate them in a List<Instance> and use addThruPipe(Iterator) instead.

      • add

        @Deprecated
        public boolean add​(java.lang.Object data,
                           java.lang.Object target,
                           java.lang.Object name,
                           java.lang.Object source,
                           double instanceWeight)
        Deprecated.
        Use trainingset.addThruPipe (new Instance(data,target,name,source)) instead.
        Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.
        Returns:
        true
      • add

        @Deprecated
        public boolean add​(java.lang.Object data,
                           java.lang.Object target,
                           java.lang.Object name,
                           java.lang.Object source)
        Deprecated.
        Use trainingset.add (new Instance(data,target,name,source)) instead.
        Constructs and appends an instance to this list, passing it through this list's pipe. Default weight is 1.0.
        Returns:
        true
      • add

        public boolean add​(Instance instance)
        Appends the instance to this list without passing the instance through the InstanceList's pipe. The alphabets of this Instance must match the alphabets of this InstanceList.
        Specified by:
        add in interface java.util.Collection<Instance>
        Specified by:
        add in interface java.util.List<Instance>
        Overrides:
        add in class java.util.ArrayList<Instance>
        Returns:
        true
      • add

        public boolean add​(Instance instance,
                           double instanceWeight)
        Appends the instance to this list without passing it through this InstanceList's pipe, assigning it the specified weight.
        Returns:
        true
      • set

        public Instance set​(int index,
                            Instance instance)
        Specified by:
        set in interface java.util.List<Instance>
        Overrides:
        set in class java.util.ArrayList<Instance>
      • add

        public void add​(int index,
                        Instance element)
        Specified by:
        add in interface java.util.List<Instance>
        Overrides:
        add in class java.util.ArrayList<Instance>
      • remove

        public Instance remove​(int index)
        Specified by:
        remove in interface java.util.List<Instance>
        Overrides:
        remove in class java.util.ArrayList<Instance>
      • remove

        public boolean remove​(Instance instance)
      • addAll

        public boolean addAll​(java.util.Collection<? extends Instance> instances)
        Specified by:
        addAll in interface java.util.Collection<Instance>
        Specified by:
        addAll in interface java.util.List<Instance>
        Overrides:
        addAll in class java.util.ArrayList<Instance>
      • addAll

        public boolean addAll​(int index,
                              java.util.Collection<? extends Instance> c)
        Specified by:
        addAll in interface java.util.List<Instance>
        Overrides:
        addAll in class java.util.ArrayList<Instance>
      • clear

        public void clear()
        Specified by:
        clear in interface java.util.Collection<Instance>
        Specified by:
        clear in interface java.util.List<Instance>
        Overrides:
        clear in class java.util.ArrayList<Instance>
      • shuffle

        public void shuffle​(java.util.Random r)
      • split

        public InstanceList[] split​(java.util.Random r,
                                    double[] proportions)
        Shuffles the elements of this list among several smaller lists.
        Parameters:
        proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist. This method (like all the split methods) does not transfer the Instance weights to the resulting InstanceLists.
        r - The source of randomness to use in shuffling.
        Returns:
        one InstanceList for each element of proportions
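
A minimal sketch of the shuffle-then-chop idea, written against plain java.util lists rather than InstanceList. RandomSplitSketch is a hypothetical name, and the rounding rule used for the sublist boundaries is an assumption made for illustration; MALLET's own boundary arithmetic may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a proportional random split; not MALLET code.
final class RandomSplitSketch {
    // Shuffles a copy of items, then chops it into consecutive parts whose
    // sizes follow the normalized proportions. The input list is not modified.
    static <T> List<List<T>> split(Random r, List<T> items, double[] proportions) {
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must have a positive sum");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, r);

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double p : proportions) {
            cumulative += p / total; // normalize so the proportions need not sum to 1
            // Assumption: each boundary is the rounded cumulative fraction of the size.
            int end = (int) Math.round(cumulative * shuffled.size());
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

With proportions such as {8, 2} or {0.8, 0.2}, a ten-element list yields parts of eight and two elements. Passing a seeded Random makes the split reproducible.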
      • split

        public InstanceList[] split​(double[] proportions)
      • splitInOrder

        public InstanceList[] splitInOrder​(double[] proportions)
        Chops this list into several sequential sublists.
        Parameters:
        proportions - A list of numbers corresponding to the proportion of elements in each returned sublist. If not already normalized to sum to 1.0, it will be normalized here.
        Returns:
        one InstanceList for each element of proportions
      • stratifiedSplit

        public InstanceList[] stratifiedSplit​(java.util.Random r,
                                              double[] proportions)
        Shuffles the elements of this list among several smaller lists, each sublist having a number of elements proportional to the amount given in the array. If the target alphabet of this list is a LabelAlphabet, then each sublist has (approximately and to the extent possible) the same distribution of the target classes as the original list. Otherwise, the sublists are randomly generated without committing to the underlying distribution.

        TODO: Sublists must conform to the underlying distribution, even when the target alphabet is not of LabelAlphabet type.

        Parameters:
        proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist. This method (like all the split methods) does not transfer the Instance weights to the resulting InstanceLists.
        r - The source of randomness to use in shuffling.
        Returns:
        one InstanceList for each element of proportions
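
One way to picture stratification is to split each label's group separately and then merge the pieces, so every sublist keeps roughly the original class mix. The sketch below does this for plain lists. StratifiedSplitSketch and the labelOf function are hypothetical, and the per-group rounding is an assumption; this is not MALLET's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical illustration of a stratified split; not MALLET code.
final class StratifiedSplitSketch {
    static <T, L> List<List<T>> stratifiedSplit(
            Random r, List<T> items, Function<T, L> labelOf, double[] proportions) {
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must have a positive sum");
        }
        // Group items by label, keeping first-seen label order for reproducibility.
        Map<L, List<T>> byLabel = new LinkedHashMap<>();
        for (T item : items) {
            byLabel.computeIfAbsent(labelOf.apply(item), k -> new ArrayList<>()).add(item);
        }
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < proportions.length; i++) {
            parts.add(new ArrayList<>());
        }
        // Shuffle each label group and deal it out according to the proportions.
        for (List<T> group : byLabel.values()) {
            Collections.shuffle(group, r);
            double cumulative = 0.0;
            int start = 0;
            for (int i = 0; i < proportions.length; i++) {
                cumulative += proportions[i] / total;
                // Assumption: boundaries are rounded cumulative fractions of the group size.
                int end = (int) Math.round(cumulative * group.size());
                end = Math.max(start, Math.min(end, group.size()));
                parts.get(i).addAll(group.subList(start, end));
                start = end;
            }
        }
        // Shuffle each part so that items are not left grouped by label.
        for (List<T> part : parts) {
            Collections.shuffle(part, r);
        }
        return parts;
    }
}
```

For eight items of one class and four of another, proportions {0.75, 0.25} give a first part with six and three items of the two classes, and a second part with two and one.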
      • stratifiedSplitInOrder

        public InstanceList[] stratifiedSplitInOrder​(double[] proportions)
        Chops this list into several sequential sublists, where each sublist contains an (approximately) equal proportion of each target label.
        Parameters:
        proportions - A list of numbers corresponding to the proportion of elements in each returned sublist. If not already normalized to sum to 1.0, it will be normalized here.
        Returns:
        one InstanceList for each element of proportions
      • splitInOrder

        public InstanceList[] splitInOrder​(int[] counts)
      • splitInTwoByModulo

        public InstanceList[] splitInTwoByModulo​(int m)
        Returns a pair of new lists such that the first list in the pair contains every mth element of this list, starting with the first. The second list contains all remaining elements.
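
The modulo rule can be sketched for an ordinary list as follows. ModuloSplitSketch is a hypothetical name used only for this illustration, and the sketch assumes "every mth element, starting with the first" means zero-based indices 0, m, 2m, and so on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting by index modulo m; not MALLET code.
final class ModuloSplitSketch {
    // Returns two lists: elements at indices 0, m, 2m, ... and all the others.
    static <T> List<List<T>> splitInTwoByModulo(List<T> items, int m) {
        if (m <= 0) {
            throw new IllegalArgumentException("m must be positive");
        }
        List<T> everyMth = new ArrayList<>();
        List<T> rest = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % m == 0) {
                everyMth.add(items.get(i));
            } else {
                rest.add(items.get(i));
            }
        }
        List<List<T>> pair = new ArrayList<>();
        pair.add(everyMth);
        pair.add(rest);
        return pair;
    }
}
```

For the elements 0 through 6 and m = 3, the first list holds 0, 3, 6 and the second holds 1, 2, 4, 5.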
      • sampleWithReplacement

        public InstanceList sampleWithReplacement​(java.util.Random r,
                                                  int numSamples)
      • sampleWithInstanceWeights

        @Deprecated
        public InstanceList sampleWithInstanceWeights​(java.util.Random r)
        Deprecated.
        Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the instance weights. The new instances all have their weights set to one.
      • sampleWithWeights

        public InstanceList sampleWithWeights​(java.util.Random r,
                                              double[] weights)
        Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.
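
Weighted sampling with replacement can be sketched by drawing a uniform number and locating it in the running sum of the weights. The code below is an illustration for plain lists only. WeightedSampleSketch is a hypothetical name, and the linear scan is chosen for clarity and is not a claim about how MALLET performs the draw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of weighted sampling with replacement; not MALLET code.
final class WeightedSampleSketch {
    // Draws items.size() elements, each chosen with probability proportional to its weight.
    static <T> List<T> sampleWithWeights(Random r, List<T> items, double[] weights) {
        if (weights.length != items.size()) {
            throw new IllegalArgumentException("weights must have one entry per item");
        }
        double[] cumulative = new double[weights.length];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += weights[i];
            cumulative[i] = total;
        }
        if (!items.isEmpty() && total <= 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        List<T> sample = new ArrayList<>(items.size());
        for (int n = 0; n < items.size(); n++) {
            double u = r.nextDouble() * total;
            // Find the first index whose running sum exceeds u.
            int index = 0;
            while (index < cumulative.length - 1 && u >= cumulative[index]) {
                index++;
            }
            sample.add(items.get(index));
        }
        return sample;
    }
}
```

The same element may appear several times in the result, and an element with weight zero is effectively never selected.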
      • getDataClass

        public java.lang.Class getDataClass()
        Returns the Java Class of the 'data' field of Instances in this list.
      • getTargetClass

        public java.lang.Class getTargetClass()
        Returns the Java Class of the 'target' field of Instances in this list.
      • setInstance

        public void setInstance​(int index,
                                Instance instance)
        Replaces the Instance at position index with a new one.
      • getInstanceWeight

        public double getInstanceWeight​(Instance instance)
      • getInstanceWeight

        public double getInstanceWeight​(int index)
      • setInstanceWeight

        public void setInstanceWeight​(int index,
                                      double weight)
      • setInstanceWeight

        public void setInstanceWeight​(Instance instance,
                                      double weight)
      • setFeatureSelection

        public void setFeatureSelection​(FeatureSelection selectedFeatures)
      • setPerLabelFeatureSelection

        public void setPerLabelFeatureSelection​(FeatureSelection[] selectedFeatures)
      • getPerLabelFeatureSelection

        public FeatureSelection[] getPerLabelFeatureSelection()
      • removeTargets

        public void removeTargets()
        Sets the "target" field to null in all instances. This makes unlabeled data.
      • removeSources

        public void removeSources()
        Sets the "source" field to null in all instances. This will often save memory when the raw data had been placed in that field.
      • load

        public static InstanceList load​(java.io.File file)
        Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.
      • save

        public void save​(java.io.File file)
        Saves this InstanceList to file. If the string value of file is "-", then serialize to System.out.
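
The "-" convention described for load and save can be sketched with standard Java object serialization. DashFileSketch is a hypothetical helper written for this illustration. It works on any Serializable object and makes no claim about the exact stream handling inside InstanceList.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration of the "-" means stdin/stdout convention; not MALLET code.
final class DashFileSketch {
    // Serializes object to file, or to System.out when the file's string value is "-".
    static void save(Serializable object, File file) throws IOException {
        if (file.toString().equals("-")) {
            ObjectOutputStream out = new ObjectOutputStream(System.out);
            out.writeObject(object);
            out.flush(); // flush only; System.out is left open
        } else {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(object);
            }
        }
    }

    // Deserializes an object from file, or from System.in when the file's string value is "-".
    static Object load(File file) throws IOException, ClassNotFoundException {
        if (file.toString().equals("-")) {
            return new ObjectInputStream(System.in).readObject();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }
}
```

With the real class, the analogous calls would be instanceList.save(file) followed later by InstanceList.load(file), with new File("-") selecting the standard streams.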
      • getPipe

        public Pipe getPipe()
        Returns the pipe through which each added Instance is passed, which may be null.
      • setPipe

        public void setPipe​(Pipe p)
        Changes the default Pipe associated with this InstanceList. This method is very dangerous and should only be used in extreme circumstances.
      • getDataAlphabet

        public Alphabet getDataAlphabet()
        Returns the Alphabet mapping features of the data to integers.
      • getTargetAlphabet

        public Alphabet getTargetAlphabet()
        Returns the Alphabet mapping target output labels to integers.
      • targetLabelDistribution

        public LabelVector targetLabelDistribution()
      • hideSomeLabels

        public void hideSomeLabels​(double proportionToHide,
                                   Randoms r)
      • hideSomeLabels

        public void hideSomeLabels​(java.util.BitSet bs)
      • unhideAllLabels

        public void unhideAllLabels()