18. November 2014

Word counting, squared

The Google n-gram viewer has become a common starting point for historical analysis of word use. But it only tells us about individual words, with no indication of their context or meaning. Several months ago the Hathi Trust Research Center released a dataset of page-level word counts extracted from 250,000 out-of-copyright books. I’ve used it to build a word similarity tool that tracks word co-occurrence patterns from 1800 to 1923. In the default example we see that “lincoln” is a town in England until around 1859, when it becomes a politician. In this article I’ll describe how I made this tool, and what’s wrong with it.
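The post doesn't spell out how the similarity tool works, but the core idea of measuring word similarity through co-occurrence can be sketched in a few lines. The toy "pages" below stand in for the HTRC page-level counts; the function names and the cosine measure are my illustrative choices, not necessarily what the tool uses.

```python
from collections import Counter
from math import sqrt

# Toy page-level word counts, standing in for the HTRC dataset.
pages = [
    Counter({"lincoln": 2, "england": 1, "cathedral": 1}),
    Counter({"lincoln": 1, "president": 2, "war": 1}),
    Counter({"president": 1, "war": 2, "union": 1}),
]

def cooccurrence_vector(target, pages):
    """Sum the counts of all other words on pages where `target` appears."""
    vec = Counter()
    for page in pages:
        if target in page:
            vec.update({w: c for w, c in page.items() if w != target})
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

lincoln = cooccurrence_vector("lincoln", pages)
president = cooccurrence_vector("president", pages)
print(cosine(lincoln, president))
```

Computing such vectors per decade is one way the "lincoln" shift from English town to American politician could show up: the words sharing pages with "lincoln" change around 1859.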


16. October 2014

Non-parametric Bayes

Modern datasets are often large, complicated, and disorganized. Clustering algorithms create data-driven organizations. These algorithms include a wide range of methods, from k-means to mixture models and mixed membership models (e.g. topic models and admixture models). Most of these algorithms assume that the number of clusters is a fixed, user-supplied parameter. Bayesian non-parametric models are an attractive alternative. I built a widget for my grad class that implements a simple example of sampling from a Dirichlet process, a Pitman-Yor process, and a hierarchical Dirichlet process. I find this approach pretty intuitive and thought I’d share it.
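The simplest of the three samplers, the Dirichlet process, can be drawn from by marginalizing out the cluster proportions, which gives the Chinese restaurant process: customer i joins an existing table in proportion to its occupancy, or starts a new table in proportion to the concentration parameter alpha. A minimal sketch (the widget itself may be organized differently):

```python
import random
from collections import Counter

def crp_sample(n, alpha, seed=0):
    """Sample table assignments for n customers from a Chinese
    restaurant process with concentration parameter alpha."""
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []
    for i in range(n):
        # Sit at table k with probability tables[k] / (i + alpha),
        # or at a new table with probability alpha / (i + alpha).
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)
        tables[k] += 1
        assignments.append(k)
    return assignments

print(Counter(crp_sample(100, alpha=1.0)))
```

The key property is visible immediately: the number of occupied tables (clusters) is not fixed in advance but grows, slowly, with n and alpha. A Pitman-Yor process adds a discount parameter to these same weights.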


18. September 2014

Labels and Patterns

I’ve been using this blog as a more philosophical platform, but this post is about some new features in the machine learning package that I work on, Mallet. One of these, LabeledLDA, is some code that I’ve had lying around for a few years. The other, stop patterns, is a simple addition that may be useful in vocabulary curation for text mining. You’ll need to grab the latest development version from GitHub to run these.
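The idea behind stop patterns, as opposed to a fixed stoplist, is to drop tokens by regular expression. Mallet itself is Java and configured through its own interfaces; the Python sketch below is only an illustration of the concept, with pattern choices that are my own examples:

```python
import re

# Example stop patterns for vocabulary curation; these particular
# patterns are illustrative, not Mallet defaults.
stop_patterns = [
    re.compile(r"^\d+$"),         # bare numbers
    re.compile(r"^[ivxlcdm]+$"),  # roman numerals (rough match)
    re.compile(r"^.$"),           # single characters
]

def keep(token):
    """Keep a token only if no stop pattern matches it entirely."""
    return not any(p.fullmatch(token) for p in stop_patterns)

tokens = ["chapter", "xi", "1842", "the", "aleatory", "q"]
print([t for t in tokens if keep(t)])
```

A fixed stoplist can never anticipate every page number or roman-numeral chapter heading in a large book corpus, which is exactly where pattern-based filtering earns its keep.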


19. August 2014

Data carpentry

The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.


14. August 2014

A useful word

I was reading a paper the other day and came across the word aleatory. This turns out to be an excellent word. It comes from the Latin alea for “dice”, as in alea jacta est, which is what you say when you’re Julius Caesar and you cross the Rubicon. It means random, or subject to chance. It seems to come up mainly in legal contexts: an aleatory contract is one whose terms depend on future events, like an insurance policy. This got me thinking about other words for the property of randomness.