17. February 2015

Using phrases in Mallet topic models

Bag-of-words models are surprisingly powerful, but there are often cases where several words are really a single semantic unit. How we handle these terms can have a major impact on how well we can model a text corpus. Several years ago, while working on a project involving NIH grants and associated papers, I implemented some tools for combining multiple tokens into single tokens as a preprocessing step. In this post I’ll demonstrate how I identify and use multi-word terms in Mallet.
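To make the preprocessing idea concrete, here is a minimal Python sketch of merging known multi-word terms into single tokens before a corpus is handed to a topic model. It is not the Mallet tooling described in the post: the function name `merge_phrases`, the underscore joiner, and the greedy longest-match rule are all assumptions made for illustration, and the phrase list is assumed to have been identified already.

```python
def merge_phrases(tokens, phrases, joiner="_"):
    """Greedily replace known multi-word phrases with single joined tokens.

    `phrases` is an iterable of space-separated strings (hypothetical input
    format); identifying the phrases in the first place is out of scope here.
    """
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so long phrases win over their prefixes.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrase_set:
                merged.append(joiner.join(candidate))
                i += n
                break
        else:
            # No phrase starts at position i, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "the national institutes of health funds stem cell research".split()
print(merge_phrases(tokens, ["national institutes of health", "stem cell"]))
```

Because the merged terms are ordinary tokens, a bag-of-words model treats each one as a single vocabulary item without any change to the model itself.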

18. November 2014

Word counting, squared

The Google n-gram viewer has become a common starting point for historical analysis of word use. But it only tells us about individual words, with no indication of their context or meaning. Several months ago the Hathi Trust Research Center released a dataset of page-level word counts extracted from 250,000 out-of-copyright books. I’ve used it to build a word similarity tool that tracks word co-occurrence patterns from 1800 to 1923. In the default example we see that “lincoln” is a town in England until around 1859, when it becomes a politician. In this article I’ll describe how I made this tool, and what’s wrong with it.
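As a rough illustration of what co-occurrence patterns can look like when only page-level word counts are available, the following Python sketch counts how often two words appear on the same page and compares words by the cosine similarity of their co-occurrence profiles. This is an assumption about one plausible approach, not a description of how the tool was built; the function names and the toy pages are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence(pages):
    """For each word, count the pages it shares with every other word.

    `pages` is a list of dicts mapping word -> count on that page
    (a hypothetical stand-in for page-level count data).
    """
    counts = defaultdict(Counter)
    for page in pages:
        for a, b in combinations(sorted(page), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy pages, invented for illustration only.
pages = [
    {"lincoln": 2, "cathedral": 1, "england": 1},
    {"lincoln": 1, "president": 3, "election": 1},
    {"douglas": 1, "president": 1, "election": 2},
]
counts = cooccurrence(pages)
print(cosine(counts["lincoln"], counts["douglas"]))
print(cosine(counts["douglas"], counts["cathedral"]))
```

Computing these profiles separately for each time slice would be one way to watch a word's neighbors shift over the years.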

16. October 2014

Non-parametric Bayes

Modern datasets are often large, complicated, and disorganized. Clustering algorithms create data-driven organizations. These algorithms include a wide range of methods, from k-means to mixture models and mixed membership models (e.g. topic models and admixture models). Most of these algorithms assume that the number of clusters is a fixed, user-supplied parameter. Bayesian non-parametric models are an attractive alternative. I built a widget for my grad class that implements a simple example of sampling from a Dirichlet process, a Pitman-Yor process, and a hierarchical Dirichlet process. I find this approach pretty intuitive and thought I’d share it.
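For readers who want to see the idea in code, here is a minimal Python sketch of drawing cluster assignments from a Dirichlet process using the Chinese restaurant process representation. It is not the widget mentioned above; the function name and the concentration parameter name `alpha` are assumptions for illustration. Each new item joins an existing cluster with probability proportional to that cluster's size, or starts a new cluster with probability proportional to `alpha`, so the number of clusters is not fixed in advance.

```python
import random


def chinese_restaurant_process(n_customers, alpha, rng=None):
    """Sample cluster assignments for n_customers items from a CRP with concentration alpha."""
    rng = rng if rng is not None else random.Random()
    table_sizes = []   # number of customers at each table (cluster)
    assignments = []   # table index chosen by each customer
    for n in range(n_customers):
        # n customers are already seated: an existing table k has weight
        # table_sizes[k], and a new table has weight alpha.
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[k] += 1
                assignments.append(k)
                break
        else:
            # r fell in the final alpha-sized slice: open a new table.
            table_sizes.append(1)
            assignments.append(len(table_sizes) - 1)
    return assignments, table_sizes


assignments, sizes = chinese_restaurant_process(100, alpha=1.0, rng=random.Random(0))
print(len(sizes), "clusters with sizes", sizes)
```

Larger values of `alpha` tend to produce more clusters. A Pitman-Yor process would add a discount parameter to these weights, which this sketch leaves out.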

18. September 2014

Labels and Patterns

I’ve been using this blog as a more philosophical platform, but this post is about some new features in Mallet, the machine learning package that I work on. One of these, LabeledLDA, is some code that I’ve had lying around for a few years. The other, stop patterns, is a simple addition that may be useful in vocabulary curation for text mining. You’ll need to grab the latest development version from GitHub to run these.
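To give a sense of what pattern-based vocabulary curation can look like, here is a small Python sketch that drops every token matching one of a list of regular expressions. It illustrates the general idea only and says nothing about how Mallet implements or configures stop patterns; the function name and the choice to require a full-token match are assumptions.

```python
import re


def remove_stop_patterns(tokens, patterns):
    """Drop tokens that fully match any of the given regular expressions."""
    compiled = [re.compile(p) for p in patterns]
    return [t for t in tokens if not any(rx.fullmatch(t) for rx in compiled)]


# Remove purely numeric tokens while keeping mixed ones such as "p53".
print(remove_stop_patterns(["gene", "1999", "p53", "42"], [r"\d+"]))
```

A single pattern can stand in for an open-ended family of tokens that a fixed stopword list could never enumerate.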

19. August 2014

Data carpentry

The New York Times has an article titled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.
