Mallet past present and future

There was a conversation on Twitter about the current state of Mallet. My goal for Mallet is that it should do a few things very well. Future development will focus on making the process of using machine learning easier and more informative. Also, be sure to use the current GitHub version.

Before discussing where Mallet is going, I should describe where it came from. Mallet was created by Andrew McCallum at UMass in 2002. I started working on the project when I joined Andrew’s lab in 2005. Most of the other developers have moved on, some to Andrew’s current project factorie.

Mallet is a mature project. There will be bug fixes as they arise and targeted functional updates. But it mostly does what I want it to do. I describe myself as a “maintainer”.

I recently moved the Mallet repository to GitHub. The current packaged release of Mallet (2.0.7) on the Mallet UMass site is out-of-date. Use the current GitHub version instead.

I am open to suggestions about minor and major functionality. I would consider nominations for Mallet committers. I am very happy to accept pull requests.

Mallet’s original focus was on text classification and sequence tagging. It also includes tools for text input such as tokenization and stopword removal. I worked mainly on the topic modeling toolkit, and that seems to be the center of current use. Mallet’s topic modeling toolkit brings together several clever tools, a few of which I created, that make learning standard LDA topic models really fast and (as topic models go) reliable. As inference on the simplest model got better, the gap in speed and reliability between plain LDA and “fancy models” got bigger. Today I’m mostly interested in how people use machine learning to model data, and then designing methodologies that make that process easier and more effective.

My goal for Mallet is that it should do a few things very well. This is both because I’m only one person, and also because I think it’s the niche I want to fill. I expect further Mallet development to focus on corpus pre- and post-processing, for example with posterior predictive checks, along with some specific algorithmic improvements like spectral inference. Tools like Factorie fill a different niche, for people who want a general purpose library for defining graphical models and deriving inference algorithms.

It’s worth mentioning languages. Mallet is written in Java, which predates a lot of useful features present in popular contemporary languages like Go, Scala, and Python. I’d really like to have type inference, but I’m not ready to bet against Java yet. The best thing about Scala for me is that it’s byte compatible with Java and can build off existing libraries. Even without direct compatibility, I’ve had a lot of success recently in embedding Mallet in R and Node.js (but that’s a story for another post).