Non-parametric Bayes - David Mimno

Modern datasets are often large, complicated, and disorganized. Clustering algorithms create data-driven organizations. These algorithms include a wide range of methods, from k-means to mixture models and mixed membership models (e.g. topic models and admixture models). Most of these algorithms assume that the number of clusters is a fixed, user-supplied parameter. Bayesian non-parametric models are an attractive alternative. I built a widget for my grad class that implements a simple example of sampling from a Dirichlet process, a Pitman-Yor process, and a hierarchical Dirichlet process. I find this approach pretty intuitive and thought I’d share it.

One way to think of Bayesian non-parametric distributions is that they convert distributions over continuous values, which will never give you the same value twice, into discrete distributions, which could return multiple copies of the exact same value. They do this by keeping a cache of previously seen values.

Let’s start by sampling x and y points from a uniform distribution. I’m going to do this by creating a javascript object that has one property: a function called sample() that returns a random number from zero to one.

var uniform = { sample: function() { return Math.random(); } };

Click the box to sample a point.

Uniform samples

If you sample enough points, you might see a few overlapping circles (they’re slightly transparent so you can see duplicates), but there shouldn’t be many. Each new sample is independent of the previous ones.

Now I’m going to define a new kind of distribution that caches values from another distribution. It has two parameters. The first is that other distribution, which for this purpose means an object that has a sample() function. Sometimes we’ll sample a new value from this distribution, and other times we’re just going to return a previous value. The second is a positive number that I’m calling the novelty weight. This value controls how often we sample a new value, and how often we just return a previous, cached value. Internally, we keep an array of cached values, an array containing the number of times we’ve returned each of the cached values, and the total number of previous samples returned. The object has a function sample() that returns a cached value with probability proportional to the number of times we have returned that value in the past, and a new value with probability proportional to the novelty weight.

var cachedDistribution = function(d, a) {
    return {
        baseDistribution: d,
        noveltyWeight : a,
        cache : [],
        counts : [],
        previousSamples : 0,
        sample : function() {
            var s = Math.random() * (this.noveltyWeight + this.previousSamples);
            if (s > this.previousSamples) {
                var newValue = this.baseDistribution.sample();
                this.cache.push(newValue);
                this.counts.push(1);
                this.previousSamples++;
                return newValue;
            }
            else {
                var valueIndex = 0;
                while (s > this.counts[ valueIndex ]) {
                    s -= this.counts[ valueIndex ];
                    valueIndex++;
                }
                this.counts[valueIndex]++;
                this.previousSamples++;
                return this.cache[valueIndex];
            }
        }
    };
}

In the next box, the x and y values are both sampled from (separate) caches of my uniform distribution, with novelty weight 1.0.

var dpXDistribution = cachedDistribution(uniform, 1.0);
var dpYDistribution = cachedDistribution(uniform, 1.0);

The current state of both caches is shown below the box.

Dirichlet process (1.0, 1.0)

After about 20 points, you will probably see an irregular grid with distinct rows and columns. Some points will be darker, indicating that a particular x,y pair is occurring frequently. The two caches are likely to have a few frequent entries at the beginning, and several rare entries later.

What happens if we change the novelty weight? In the next box we have the exact same experiment, but the x distribution has novelty weight 10.0. When the novelty weight is 1.0, the probability of choosing a cached value is equal to the probability of choosing a new value after the first sample. When the novelty weight is 10.0, this probability isn’t equal until we have sampled ten values.

Dirichlet process (10.0, 1.0)

What you’re likely to see is horizontal stripes, with lots of distinct x values but only a few y values. The y distribution is much happier to return a previously cached value than the x distribution.

What if we want to share values between x and y? We could just use the same distribution for both, but then we couldn’t vary the frequency of values on the two axes. Remember that the only requirement I put on the “other” distribution for a cached distribution is that it have a sample() function. The cached distribution itself satisfies this, so why not cache a cached distribution? When multiple distributions are caching from the same distribution that is itself a cached distribution, we get a hierarchical Dirichlet process. Here’s an example:

var topLevel = cachedDistribution(uniform, 3.0);
var hdpXDistribution = cachedDistribution(topLevel, 2.0);
var hdpYDistribution = cachedDistribution(topLevel, 2.0);

Clicking on the box below will sample from these distributions.

Hierarchical Dirichlet process (3.0, 2.0, 2.0)

Look at the cache as you sample points. You may notice the same value appearing in both the x and y cache. Both the “local” distributions requested values from the “global” distribution, and one of them received a cached copy of a value previously returned to the other cache. More surprisingly, you may see multiple copies of the same value in a cache. Since the top-level distribution is free to return a previously seen value, it may give the x and y distributions another copy of the same value. So a point could appear at, say, x=0.775 because it is copying from a previous point with x=0.775, or it could have sampled a (supposedly) new x value from the top-level distribution that happens to be the same value. (Since I’m only showing three decimal places, it’s possible the uniform could have returned the same approximate value twice, but you’d see that in the top-level cache.)

Repeated values in the local caches is a problem for inference, where we’re trying to reconstruct the generative process for observed data. For example, when Gibbs sampling, you need to decide whether a given data point that uses a particular cluster-specific value got that value by using an existing cached instance of that value or by sampling a new local copy of that value. (For those who like convoluted restaurant analogies: values are “dishes”, cached copies are “tables”.)

Finally, no matter how large the novelty weight is, for a Dirichlet process the probability of sampling a new value from the base distribution will approach zero more quickly than we observe in a lot of real data. One of the best examples is word frequencies in natural language. As a whole, the class of “rare” words is considerably more frequent than we would expect under a Dirichlet process. An alternative is the Pitman-Yor process, which makes one small change: it borrows a certain amount of probability weight from each cached value. We represent this weight with a discount parameter that has to be between zero and one. The probability of asking the base distribution for a new value is proportional to the novelty weight plus the product of the discount and the number of cached values. The probability of returning a previously cached value is proportional to the number of times we’ve returned that value minus the discount. The normalization constant for this distribution is still the same: novelty weight plus previous samples. But we’re reallocating probability so that each time we cache a new value, the probability of sampling another new value in the future goes down less than it would under the Dirichlet process.

Here’s the code, which is almost identical to the simple cached distribution.

var discountedCachedDistribution = function(d, a, b) {
    return {
        baseDistribution: d,
        noveltyWeight : a,
        discount : b,
        ...
        sample : function() {
            var s = Math.random() * (this.noveltyWeight + this.previousSamples);
            if (s > this.previousSamples - this.discount * this.cache.length) {
                ...
            }
            else {
                var valueIndex = 0;
                while (s > this.counts[ valueIndex ] - this.discount) {
                    s -= this.counts[ valueIndex ] - this.discount;
                    valueIndex++;
                }
                ...
            }
        }
    };
}

Click on the box below to sample x and y pairs from a two independent Pitman-Yor processes with novelty weight 1.0 and discount 0.05.

Pitman-Yor process (1.0/0.05, 1.0/0.05)

Try clicking exactly 20 times, and compare the size of the cache under the PY model to the size of the cache in the Dirichlet process (1.0, 1.0) box above, after exactly the same number of clicks.

It’s important to note that there are some theoretical and practical issues with Bayesian non-parametric models. Miller and Harrison show that Dirichlet processes will consistently over-estimate the number of mixture components. Buntine and Mishra recently found that learning asymmetric hyperparameters in a finite Dirichlet model is a remarkably good approximation to a Dirichlet process, which is important because finite models are usually a lot easier to work with.

Finally, although I think it’s right to be nervous about setting a fixed number of clusters or topics or admixture populations, don’t assume that NP Bayes models will find the “right” number of clusters. Many models are such pathetic caricatures of real processes (hint: I am not writing this by sampling from distributions over words) that the whole concept of a correct number is questionable. Perhaps more troublingly, most NP models have some other parameter, usually not the novelty weight, that has a strong effect on the number of clusters used. If I have to choose between a clear, overt modeling choice and a hidden, unpredictable effect, I’ll pick transparency. But even if they’re not always the answer, I think it’s still really important to understand how a Dirichlet process works, and when it might be useful.