17. February 2015
Using phrases in Mallet topic models
Bag-of-words models are surprisingly powerful, but there are often cases where several words are really a single semantic unit. How we handle these terms can have a major impact on how well we can model a text corpus. Several years ago, while working on a project involving NIH grants and associated papers, I implemented some tools for combining multiple tokens into single tokens as a preprocessing step. In this post I’ll demonstrate how I identify and use multi-word terms in Mallet.