Class InstanceList
- java.lang.Object
  - java.util.AbstractCollection<E>
    - java.util.AbstractList<E>
      - java.util.ArrayList<Instance>
        - cc.mallet.types.InstanceList
- All Implemented Interfaces:
AlphabetCarrying, java.io.Serializable, java.lang.Cloneable, java.lang.Iterable<Instance>, java.util.Collection<Instance>, java.util.List<Instance>, java.util.RandomAccess
- Direct Known Subclasses:
MultiInstanceList, PagedInstanceList
public class InstanceList extends java.util.ArrayList<Instance> implements java.io.Serializable, java.lang.Iterable<Instance>, AlphabetCarrying
A list of machine learning instances, typically used for training or testing of a machine learning algorithm.
All of the instances in the list will have been passed through the same Pipe, and thus must also share the same data and target Alphabets. InstanceList keeps a reference to the pipe and the two alphabets.
The most common way of adding instances to an InstanceList is through the addThruPipe(Iterator<Instance>) method. Instance iterators are a way of mapping general data sources into instances suitable for processing through a pipe. As each Instance is pulled from the iterator, the InstanceList copies the instance and runs the copy through its pipe (with resultant destructive modifications) before saving the modified instance on its list. This is the usual way in which instances are transformed by pipes.
InstanceList also contains methods for randomly generating lists of feature vectors, splitting lists into non-overlapping subsets (useful for test/train splits), and creating iterators for cross validation.
- Author:
- Andrew McCallum mccallum@cs.umass.edu
- See Also:
Instance, Pipe, Serialized Form
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
class InstanceList.CrossValidationIterator
    CrossValidationIterator allows iterating over pairs of InstanceList, where each pair is split into training/testing based on nfolds.
class InstanceList.StratifiedCrossValidationIterator
    StratifiedCrossValidationIterator allows iterating over pairs of InstanceList, where each pair is split into training/testing based on nfolds, and each fold maintains the distribution properties of the original InstanceList as much as possible.
-
Field Summary
Fields
Modifier and Type, Field, Description
static java.lang.String TARGET_PROPERTY
-
Constructor Summary
Constructors
Constructor, Description
InstanceList()
    Deprecated.
InstanceList(Pipe pipe)
    Construct an InstanceList with initial capacity of 10, with given default pipe.
InstanceList(Pipe pipe, int capacity)
    Construct an InstanceList having given capacity, with given default pipe.
InstanceList(Alphabet dataAlphabet, Alphabet targetAlphabet)
    Construct an InstanceList with initial capacity of 10, with a Noop default pipe.
InstanceList(Randoms r, int vocabSize, int numClasses)
InstanceList(Randoms r, Alphabet vocab, java.lang.String[] classNames, int meanInstancesPerLabel)
InstanceList(Randoms r, Dirichlet classCentroidDistribution, double classCentroidAverageAlphaMean, double classCentroidAverageAlphaVariance, double featureVectorSizePoissonLambda, double classInstanceCountPoissonLambda, java.lang.String[] classNames)
    Creates a list consisting of randomly-generated FeatureVectors.
-
Method Summary
Modifier and Type, Method, Description
void add(int index, Instance element)
boolean add(Instance instance)
    Appends the instance to this list without passing the instance through the InstanceList's pipe.
boolean add(Instance instance, double instanceWeight)
    Appends the instance to this list without passing it through this InstanceList's pipe, assigning it the specified weight.
boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source)
    Deprecated. Use trainingset.add(new Instance(data, target, name, source)) instead.
boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, double instanceWeight)
    Deprecated. Use trainingset.addThruPipe(new Instance(data, target, name, source)) instead.
boolean addAll(int index, java.util.Collection<? extends Instance> c)
boolean addAll(java.util.Collection<? extends Instance> instances)
void addThruPipe(Instance inst)
    Adds the input instance to this list, after passing it through the InstanceList's pipe.
void addThruPipe(java.util.Iterator<Instance> ii)
    Adds to this list every instance generated by the iterator, passing each one through this InstanceList's pipe.
void clear()
java.lang.Object clone()
InstanceList cloneEmpty()
protected InstanceList cloneEmptyInto(InstanceList ret)
InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)
InstanceList.CrossValidationIterator crossValidationIterator(int nfolds, int seed)
Alphabet getAlphabet()
Alphabet[] getAlphabets()
Alphabet getDataAlphabet()
    Returns the Alphabet mapping features of the data to integers.
java.lang.Class getDataClass()
    Returns the Java Class of the 'data' field of Instances in this list.
FeatureSelection getFeatureSelection()
double getInstanceWeight(int index)
double getInstanceWeight(Instance instance)
FeatureSelection[] getPerLabelFeatureSelection()
Pipe getPipe()
    Returns the pipe through which each added Instance is passed, which may be null.
Alphabet getTargetAlphabet()
    Returns the Alphabet mapping target output labels to integers.
java.lang.Class getTargetClass()
    Returns the Java Class of the 'target' field of Instances in this list.
void hideSomeLabels(double proportionToHide, Randoms r)
void hideSomeLabels(java.util.BitSet bs)
static InstanceList load(java.io.File file)
    Constructs a new InstanceList, deserialized from file.
Instance remove(int index)
boolean remove(Instance instance)
void removeSources()
    Sets the "source" field to null in all instances.
void removeTargets()
    Sets the "target" field to null in all instances.
InstanceList sampleWithInstanceWeights(java.util.Random r)
    Deprecated.
InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
InstanceList sampleWithWeights(java.util.Random r, double[] weights)
    Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights.
void save(java.io.File file)
    Saves this InstanceList to file.
Instance set(int index, Instance instance)
void setFeatureSelection(FeatureSelection selectedFeatures)
void setInstance(int index, Instance instance)
    Replaces the Instance at position index with a new one.
void setInstanceWeight(int index, double weight)
void setInstanceWeight(Instance instance, double weight)
void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)
void setPipe(Pipe p)
    Change the default Pipe associated with InstanceList.
InstanceList shallowClone()
void shuffle(java.util.Random r)
InstanceList[] split(double[] proportions)
InstanceList[] split(java.util.Random r, double[] proportions)
    Shuffles the elements of this list among several smaller lists.
InstanceList[] splitInOrder(double[] proportions)
    Chops this list into several sequential sublists.
InstanceList[] splitInOrder(int[] counts)
InstanceList[] splitInTwoByModulo(int m)
    Returns a pair of new lists such that the first list in the pair contains every m-th element of this list, starting with the first.
InstanceList[] stratifiedSplit(java.util.Random r, double[] proportions)
    Shuffles the elements of this list among several smaller lists, each sublist having a number of elements proportional to the amount given in the array.
InstanceList[] stratifiedSplitInOrder(double[] proportions)
    Chops this list into several sequential sublists, where each sublist contains an (approximately) equal proportion of each target label.
InstanceList subList(double proportion)
InstanceList subList(int start, int end)
LabelVector targetLabelDistribution()
void unhideAllLabels()
-
Methods inherited from class java.util.ArrayList
contains, ensureCapacity, equals, forEach, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, removeAll, removeIf, removeRange, replaceAll, retainAll, size, sort, spliterator, toArray, toArray, trimToSize
-
-
-
-
Field Detail
-
TARGET_PROPERTY
public static final java.lang.String TARGET_PROPERTY
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
InstanceList
public InstanceList(Pipe pipe, int capacity)
Construct an InstanceList having given capacity, with given default pipe. Typically Instances added to this InstanceList will have gone through the pipe (for example using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. It is required that all Instances in this InstanceList share these Alphabets.
- Parameters:
pipe - The default pipe used to process instances added via the addThruPipe methods.
capacity - The initial capacity of the list; will grow further as necessary.
-
InstanceList
public InstanceList(Pipe pipe)
Construct an InstanceList with initial capacity of 10, with given default pipe. Typically Instances added to this InstanceList will have gone through the pipe (for example using instanceList.addThruPipe), but this is not required. This InstanceList will obtain its dataAlphabet and targetAlphabet from the pipe. It is required that all Instances in this InstanceList share these Alphabets.
- Parameters:
pipe - The default pipe used to process instances added via the addThruPipe methods.
-
InstanceList
public InstanceList(Alphabet dataAlphabet, Alphabet targetAlphabet)
Construct an InstanceList with initial capacity of 10, with a Noop default pipe. Used in those infrequent circumstances when Instances typically would not have further processing, and objects containing vocabularies are entered directly into the InstanceList; for example, the creation of a random InstanceList using Dirichlets and Multinomials.
- Parameters:
dataAlphabet - The vocabulary for added instances' data fields
targetAlphabet - The vocabulary for added instances' targets
-
InstanceList
@Deprecated public InstanceList()
Deprecated.
Creates a list that will have its pipe set later when its first Instance is added.
-
InstanceList
public InstanceList(Randoms r, Dirichlet classCentroidDistribution, double classCentroidAverageAlphaMean, double classCentroidAverageAlphaVariance, double featureVectorSizePoissonLambda, double classInstanceCountPoissonLambda, java.lang.String[] classNames)
Creates a list consisting of randomly-generated FeatureVectors.
-
InstanceList
public InstanceList(Randoms r, Alphabet vocab, java.lang.String[] classNames, int meanInstancesPerLabel)
-
InstanceList
public InstanceList(Randoms r, int vocabSize, int numClasses)
-
-
Method Detail
-
shallowClone
public InstanceList shallowClone()
-
clone
public java.lang.Object clone()
- Overrides:
clone in class java.util.ArrayList<Instance>
-
subList
public InstanceList subList(int start, int end)
-
subList
public InstanceList subList(double proportion)
-
addThruPipe
public void addThruPipe(java.util.Iterator<Instance> ii)
Adds to this list every instance generated by the iterator, passing each one through this InstanceList's pipe.
-
addThruPipe
public void addThruPipe(Instance inst)
Adds the input instance to this list, after passing it through the InstanceList's pipe.
If several instances are to be added, accumulate them in a List<Instance> and use addThruPipe(Iterator<Instance>) instead.
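The difference between add and addThruPipe can be illustrated with a small stand-alone sketch. It does not use MALLET: PipedListSketch is a hypothetical class, String stands in for Instance, and a UnaryOperator<String> stands in for the Pipe. It only shows the pattern of a list that transforms elements on the way in while still allowing a direct, untransformed add.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the pattern; String plays the role of Instance
// and the UnaryOperator plays the role of the Pipe.
final class PipedListSketch extends ArrayList<String> {
    private static final long serialVersionUID = 1L;
    private final transient UnaryOperator<String> pipe;

    PipedListSketch(UnaryOperator<String> pipe) {
        this.pipe = pipe;
    }

    // Transforms the element with the pipe, then stores the result.
    void addThruPipe(String raw) {
        super.add(pipe.apply(raw));
    }

    // Pulls every element from the iterator and sends each one through the pipe.
    void addThruPipe(Iterator<String> source) {
        while (source.hasNext()) {
            addThruPipe(source.next());
        }
    }
    // The inherited add(String) stores the element as-is, bypassing the pipe.
}
```

With a lower-casing pipe, addThruPipe("Hello") would store "hello", while add("RAW") would store "RAW" unchanged.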
-
add
@Deprecated public boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source, double instanceWeight)
Deprecated. Use trainingset.addThruPipe(new Instance(data, target, name, source)) instead.
Constructs and appends an instance to this list, passing it through this list's pipe and assigning it the specified weight.
- Returns:
true
-
add
@Deprecated public boolean add(java.lang.Object data, java.lang.Object target, java.lang.Object name, java.lang.Object source)
Deprecated. Use trainingset.add(new Instance(data, target, name, source)) instead.
Constructs and appends an instance to this list, passing it through this list's pipe. Default weight is 1.0.
- Returns:
true
-
add
public boolean add(Instance instance)
Appends the instance to this list without passing the instance through the InstanceList's pipe. The alphabets of this Instance must match the alphabets of this InstanceList.
-
add
public boolean add(Instance instance, double instanceWeight)
Appends the instance to this list without passing it through this InstanceList's pipe, assigning it the specified weight.
- Returns:
true
-
add
public void add(int index, Instance element)
-
remove
public Instance remove(int index)
-
remove
public boolean remove(Instance instance)
-
addAll
public boolean addAll(java.util.Collection<? extends Instance> instances)
-
addAll
public boolean addAll(int index, java.util.Collection<? extends Instance> c)
-
clear
public void clear()
-
cloneEmpty
public InstanceList cloneEmpty()
-
cloneEmptyInto
protected InstanceList cloneEmptyInto(InstanceList ret)
-
shuffle
public void shuffle(java.util.Random r)
-
split
public InstanceList[] split(java.util.Random r, double[] proportions)
Shuffles the elements of this list among several smaller lists.
- Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist. This method (and all the split methods) do not transfer the Instance weights to the resulting InstanceLists.
r - The source of randomness to use in shuffling.
- Returns:
one InstanceList for each element of proportions
-
split
public InstanceList[] split(double[] proportions)
-
splitInOrder
public InstanceList[] splitInOrder(double[] proportions)
Chops this list into several sequential sublists.
- Parameters:
proportions - A list of numbers corresponding to the proportion of elements in each returned sublist. If not already normalized to sum to 1.0, it will be normalized here.
- Returns:
one InstanceList for each element of proportions
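To make the normalization step concrete, here is a minimal stand-alone sketch of an in-order proportional split over a plain java.util.List. It is not MALLET's implementation: the class InOrderSplitSketch is hypothetical, and the rounding of sublist boundaries (cumulative proportion times list size, rounded to the nearest index) is an assumption, so exact sublist sizes may differ from what InstanceList produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of MALLET: chops a list into sequential sublists.
final class InOrderSplitSketch {

    // Returns one sublist per entry of proportions; proportions are normalized first.
    // Assumes non-negative proportions with a positive sum.
    static <T> List<List<T>> splitInOrder(List<T> items, double[] proportions) {
        double total = Arrays.stream(proportions).sum();
        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i] / total;
            // The last boundary is pinned to the list size so no element is dropped.
            int end = (i == proportions.length - 1)
                    ? items.size()
                    : (int) Math.round(cumulative * items.size());
            end = Math.max(start, Math.min(end, items.size()));
            parts.add(new ArrayList<>(items.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

For ten elements and proportions {7, 3}, the normalized proportions are 0.7 and 0.3, so the first sublist holds the first seven elements and the second holds the remaining three.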
-
stratifiedSplit
public InstanceList[] stratifiedSplit(java.util.Random r, double[] proportions)
Shuffles the elements of this list among several smaller lists, each sublist having a number of elements proportional to the amount given in the array. If the target alphabet of this list is a LabelAlphabet, then each sublist has (approximately and to the extent possible) the same distribution of the target classes as the original list. Otherwise, the sublists are randomly generated without committing to the underlying distribution.
TODO: Sublists must conform to the underlying distribution, even when the target alphabet is not of LabelAlphabet type.
- Parameters:
proportions - A list of numbers (not necessarily summing to 1) which, when normalized, correspond to the proportion of elements in each returned sublist. This method (and all the split methods) do not transfer the Instance weights to the resulting InstanceLists.
r - The source of randomness to use in shuffling.
- Returns:
one InstanceList for each element of proportions
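One common way to obtain sublists with roughly the same class distribution is to group elements by label, shuffle each group, and cut each group by the same normalized proportions. The sketch below illustrates that idea on plain Java collections; it is an assumption about the general technique, not a description of how MALLET implements stratifiedSplit. StratifiedSplitSketch is a hypothetical class, labels are plain strings, and the result holds element indices rather than instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of MALLET.
final class StratifiedSplitSketch {

    // labels.get(i) is the class label of item i; returns one index list per proportion.
    static List<List<Integer>> stratifiedSplit(List<String> labels, double[] proportions, Random r) {
        double total = 0.0;
        for (double p : proportions) total += p;

        // Group item indices by label, preserving first-seen label order.
        Map<String, List<Integer>> byLabel = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byLabel.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }

        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p < proportions.length; p++) parts.add(new ArrayList<>());

        // Shuffle each label group, then cut it by the normalized cumulative proportions.
        for (List<Integer> group : byLabel.values()) {
            Collections.shuffle(group, r);
            double cumulative = 0.0;
            int start = 0;
            for (int p = 0; p < proportions.length; p++) {
                cumulative += proportions[p] / total;
                int end = (p == proportions.length - 1)
                        ? group.size()
                        : (int) Math.round(cumulative * group.size());
                end = Math.max(start, Math.min(end, group.size()));
                parts.get(p).addAll(group.subList(start, end));
                start = end;
            }
        }
        return parts;
    }
}
```

With eight items labeled "x", four labeled "y", and proportions {3, 1}, the first sublist receives six "x" and three "y" items and the second receives two "x" and one "y", so both keep the original 2:1 class ratio.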
-
stratifiedSplitInOrder
public InstanceList[] stratifiedSplitInOrder(double[] proportions)
Chops this list into several sequential sublists, where each sublist contains an (approximately) equal proportion of each target label.
- Parameters:
proportions - A list of numbers corresponding to the proportion of elements in each returned sublist. If not already normalized to sum to 1.0, it will be normalized here.
- Returns:
one InstanceList for each element of proportions
-
splitInOrder
public InstanceList[] splitInOrder(int[] counts)
-
splitInTwoByModulo
public InstanceList[] splitInTwoByModulo(int m)
Returns a pair of new lists such that the first list in the pair contains every m-th element of this list, starting with the first. The second list contains all remaining elements.
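The behavior described above can be sketched on a plain list as follows. ModuloSplitSketch is a hypothetical class used only for illustration, and the sketch assumes that "every m-th element, starting with the first" means the elements at indices 0, m, 2m, and so on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MALLET.
final class ModuloSplitSketch {

    // Index 0 of the result holds the elements at positions 0, m, 2m, ...;
    // index 1 holds every other element, in original order.
    static <T> List<List<T>> splitInTwoByModulo(List<T> items, int m) {
        List<T> everyMth = new ArrayList<>();
        List<T> rest = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            if (i % m == 0) {
                everyMth.add(items.get(i));
            } else {
                rest.add(items.get(i));
            }
        }
        List<List<T>> pair = new ArrayList<>();
        pair.add(everyMth);
        pair.add(rest);
        return pair;
    }
}
```

For the seven elements a through g and m = 3, the first list is [a, d, g] and the second is [b, c, e, f].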
-
sampleWithReplacement
public InstanceList sampleWithReplacement(java.util.Random r, int numSamples)
-
sampleWithInstanceWeights
@Deprecated public InstanceList sampleWithInstanceWeights(java.util.Random r)
Deprecated.
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the instance weights. The new instances all have their weights set to one.
-
sampleWithWeights
public InstanceList sampleWithWeights(java.util.Random r, double[] weights)
Returns an InstanceList of the same size, where the instances come from the random sampling (with replacement) of this list using the given weights. The length of the weight array must be the same as the length of this list. The new instances all have their weights set to one.
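Weighted sampling with replacement is often implemented by drawing a uniform point in [0, sum of weights) and walking the cumulative weights until they pass that point. The sketch below shows that approach on a plain list; it is an illustration under that assumption, not MALLET's code, and WeightedSamplingSketch is a hypothetical class. It assumes non-negative weights with a positive sum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of MALLET.
final class WeightedSamplingSketch {

    // Draws items.size() elements with replacement; element i is chosen with
    // probability weights[i] / sum(weights).
    static <T> List<T> sampleWithWeights(List<T> items, double[] weights, Random r) {
        if (weights.length != items.size()) {
            throw new IllegalArgumentException("weights must have one entry per item");
        }
        double total = 0.0;
        for (double w : weights) total += w;

        List<T> sample = new ArrayList<>(items.size());
        for (int n = 0; n < items.size(); n++) {
            double u = r.nextDouble() * total; // uniform point in [0, total)
            int i = 0;
            double cumulative = weights[0];
            // Walk the cumulative weights until they pass the sampled point.
            while (u >= cumulative && i < weights.length - 1) {
                i++;
                cumulative += weights[i];
            }
            sample.add(items.get(i));
        }
        return sample;
    }
}
```

An element with weight zero is never selected, and an element holding all of the weight is selected on every draw.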
-
getDataClass
public java.lang.Class getDataClass()
Returns the Java Class of the 'data' field of Instances in this list.
-
getTargetClass
public java.lang.Class getTargetClass()
Returns the Java Class of the 'target' field of Instances in this list.
-
setInstance
public void setInstance(int index, Instance instance)
Replaces the Instance at position index with a new one.
-
getInstanceWeight
public double getInstanceWeight(Instance instance)
-
getInstanceWeight
public double getInstanceWeight(int index)
-
setInstanceWeight
public void setInstanceWeight(int index, double weight)
-
setInstanceWeight
public void setInstanceWeight(Instance instance, double weight)
-
setFeatureSelection
public void setFeatureSelection(FeatureSelection selectedFeatures)
-
getFeatureSelection
public FeatureSelection getFeatureSelection()
-
setPerLabelFeatureSelection
public void setPerLabelFeatureSelection(FeatureSelection[] selectedFeatures)
-
getPerLabelFeatureSelection
public FeatureSelection[] getPerLabelFeatureSelection()
-
removeTargets
public void removeTargets()
Sets the "target" field to null in all instances. This makes unlabeled data.
-
removeSources
public void removeSources()
Sets the "source" field to null in all instances. This will often save memory when the raw data had been placed in that field.
-
load
public static InstanceList load(java.io.File file)
Constructs a new InstanceList, deserialized from file. If the string value of file is "-", then deserialize from System.in.
-
save
public void save(java.io.File file)
Saves this InstanceList to file. If the string value of file is "-", then serialize to System.out.
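Since InstanceList implements java.io.Serializable, save and load presumably rely on standard Java object serialization. The stand-alone sketch below shows that round trip, including the "-" convention for standard streams, using a plain ArrayList<String> in place of an InstanceList. SerializedListSketch is a hypothetical class, and wrapping checked exceptions in unchecked ones is a choice made for the example rather than a statement about MALLET's error handling.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;

// Hypothetical helper, not part of MALLET.
final class SerializedListSketch {

    // Writes the list to file, or to System.out when the file name is "-".
    static void save(ArrayList<String> list, File file) {
        try {
            if (file.toString().equals("-")) {
                ObjectOutputStream out = new ObjectOutputStream(System.out);
                out.writeObject(list);
                out.flush(); // leave System.out open for the rest of the program
            } else {
                try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                    out.writeObject(list);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a list back from file, or from System.in when the file name is "-".
    @SuppressWarnings("unchecked")
    static ArrayList<String> load(File file) {
        try {
            if (file.toString().equals("-")) {
                return (ArrayList<String>) new ObjectInputStream(System.in).readObject();
            }
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (ArrayList<String>) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Saving a list to a temporary file and loading it back should yield a list equal to the original.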
-
getPipe
public Pipe getPipe()
Returns the pipe through which each added Instance is passed, which may be null.
-
setPipe
public void setPipe(Pipe p)
Change the default Pipe associated with InstanceList. This method is very dangerous and should only be used in extreme circumstances!!
-
getDataAlphabet
public Alphabet getDataAlphabet()
Returns the Alphabet mapping features of the data to integers.
-
getTargetAlphabet
public Alphabet getTargetAlphabet()
Returns the Alphabet mapping target output labels to integers.
-
getAlphabet
public Alphabet getAlphabet()
- Specified by:
getAlphabet in interface AlphabetCarrying
-
getAlphabets
public Alphabet[] getAlphabets()
- Specified by:
getAlphabets in interface AlphabetCarrying
-
targetLabelDistribution
public LabelVector targetLabelDistribution()
-
crossValidationIterator
public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds, int seed)
-
crossValidationIterator
public InstanceList.CrossValidationIterator crossValidationIterator(int nfolds)
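The nested class summary describes CrossValidationIterator as iterating over pairs of InstanceList, each split into training/testing based on nfolds. The sketch below illustrates that idea on a plain list. It is a simplification and not MALLET's implementation: CrossValidationSketch is a hypothetical class, elements are assigned to folds by index modulo nfolds, and no shuffling is done, whereas the seed parameter of the two-argument overload suggests the real iterator randomizes the order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MALLET.
final class CrossValidationSketch {

    // Returns nfolds pairs; in each pair, index 0 is the training list
    // and index 1 is the testing list.
    static <T> List<List<List<T>>> folds(List<T> items, int nfolds) {
        List<List<List<T>>> pairs = new ArrayList<>();
        for (int fold = 0; fold < nfolds; fold++) {
            List<T> train = new ArrayList<>();
            List<T> test = new ArrayList<>();
            for (int i = 0; i < items.size(); i++) {
                // Item i belongs to fold (i % nfolds); that fold is held out for testing.
                if (i % nfolds == fold) {
                    test.add(items.get(i));
                } else {
                    train.add(items.get(i));
                }
            }
            List<List<T>> pair = new ArrayList<>();
            pair.add(train);
            pair.add(test);
            pairs.add(pair);
        }
        return pairs;
    }
}
```

Each element appears in exactly one testing list and in the training lists of all other folds.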
-
hideSomeLabels
public void hideSomeLabels(double proportionToHide, Randoms r)
-
hideSomeLabels
public void hideSomeLabels(java.util.BitSet bs)
-
unhideAllLabels
public void unhideAllLabels()
-
-