Class NGramPreprocessor

  • All Implemented Interfaces:
    AlphabetCarrying, java.io.Serializable

    public class NGramPreprocessor
    extends Pipe
    implements java.io.Serializable
    This pipe lowercases text, removes common XML entities (quot, apos, lt, gt), and replaces all punctuation except the hyphen (-) with whitespace. It then splits the text into tokens on whitespace and applies n-gram token replacements and deletions. Replacements are applied in the order they are specified: first in file order, then in the order they appear within each file.
    See Also:
    Serialized Form
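The processing steps described above can be sketched in plain Java. This is only an illustration of the documented behavior, not MALLET's implementation: the class `NGramSketch`, its methods, and the rule representation (a pair of space-separated token strings, with an empty second element meaning deletion) are all hypothetical. It also assumes that entities are removed outright rather than replaced with a space, and that "punctuation" means ASCII punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: a plain-Java sketch of the documented steps,
// not MALLET's implementation. All names here are hypothetical.
class NGramSketch {

    // Lowercase, drop the four named XML entities, then replace every
    // ASCII punctuation character except '-' with a space.
    static String normalize(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        s = s.replaceAll("&(quot|apos|lt|gt);", ""); // assumption: removed outright
        return s.replaceAll("[\\p{Punct}&&[^-]]", " ");
    }

    // Split on runs of whitespace; blank input yields an empty list.
    static List<String> tokenize(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Replace each non-overlapping occurrence of the token sequence 'from'
    // with 'to', scanning left to right. An empty 'to' deletes the n-gram.
    static List<String> applyRule(List<String> tokens, List<String> from, List<String> to) {
        if (from.isEmpty()) {
            return tokens; // nothing to match
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int end = i + from.size();
            if (end <= tokens.size() && tokens.subList(i, end).equals(from)) {
                out.addAll(to);
                i = end;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    // Normalize, tokenize, then apply the rules in the order given.
    static String process(String text, List<String[]> rules) {
        List<String> tokens = tokenize(normalize(text));
        for (String[] rule : rules) {
            tokens = applyRule(tokens, tokenize(rule[0]), tokenize(rule[1]));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        List<String[]> rules = Arrays.asList(
            new String[] {"new york", "new_york"}, // bigram replacement
            new String[] {"the", ""});             // deletion
        System.out.println(process("The &quot;New York&quot; Times, state-of-the-art!", rules));
        // prints "new_york times state-of-the-art"
    }
}
```

Because rules match whole tokens, deleting "the" leaves "state-of-the-art" untouched. Rule order matters in this sketch: a rule can only match tokens that earlier rules have left in place.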
    • Constructor Detail

      • NGramPreprocessor

        public NGramPreprocessor()
    • Method Detail

      • loadReplacements

        public int loadReplacements(java.lang.String filename)
                             throws java.io.IOException
        Throws:
        java.io.IOException
      • loadDeletions

        public int loadDeletions(java.lang.String filename)
                          throws java.io.IOException
        Throws:
        java.io.IOException
      • pipe

        public Instance pipe(Instance instance)
        Description copied from class: Pipe
        This method should really be 'protected', but it is public for historical reasons.
        Overrides:
        pipe in class Pipe