Class SimpleTokenizer

  • All Implemented Interfaces:
    AlphabetCarrying, java.io.Serializable

    public class SimpleTokenizer
    extends Pipe
    A simple Unicode tokenizer that accepts sequences of letters as tokens.
    See Also:
    Serialized Form
    • Field Detail

      • USE_DEFAULT_ENGLISH_STOPLIST

        public static final int USE_DEFAULT_ENGLISH_STOPLIST
        See Also:
        Constant Field Values
      • stoplist

        protected java.util.HashSet<java.lang.String> stoplist
    • Constructor Detail

      • SimpleTokenizer

        public SimpleTokenizer(int languageFlag)
      • SimpleTokenizer

        public SimpleTokenizer(java.io.File stopfile)
      • SimpleTokenizer

        public SimpleTokenizer(java.util.HashSet<java.lang.String> stoplist)
    • Method Detail

      • getStoplist

        public java.util.HashSet<java.lang.String> getStoplist()
      • stop

        public void stop(java.lang.String word)
      • pipe

        public Instance pipe(Instance instance)
        Description copied from class: Pipe
        This method should really be 'protected', but it is public for historical reasons.
        Overrides:
        pipe in class Pipe
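
The class description says that tokens are sequences of letters and that a stoplist is kept in a HashSet. The following is a minimal standalone sketch of that idea in plain Java. It does not reproduce this class's source: the class name LetterTokenizerSketch, the method names tokenize and flush, and the choice to lowercase tokens are all assumptions made for illustration. The page above does not specify how case, digits, or combining marks are treated.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the SimpleTokenizer implementation.
class LetterTokenizerSketch {

    // Splits text into maximal runs of Unicode letters and drops any
    // token found in the stoplist. Lowercasing is an assumption.
    static List<String> tokenize(CharSequence text, Set<String> stoplist) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a full code point so supplementary characters are handled.
            int cp = Character.codePointAt(text, i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Any non-letter ends the current token.
                flush(current, stoplist, tokens);
            }
            i += Character.charCount(cp);
        }
        flush(current, stoplist, tokens);
        return tokens;
    }

    // Emits the pending token unless it is empty or stoplisted, then resets the buffer.
    private static void flush(StringBuilder current, Set<String> stoplist, List<String> tokens) {
        if (current.length() > 0) {
            String token = current.toString();
            if (!stoplist.contains(token)) {
                tokens.add(token);
            }
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        Set<String> stoplist = new HashSet<>();
        stoplist.add("the"); // plays the role that stop(word) plays on the real class
        System.out.println(tokenize("The quick fox, v2: 3 cats!", stoplist));
    }
}
```

In this sketch the digit in "v2" ends the token, so only "v" survives from it, and "the" is removed by the stoplist. Whether the real class behaves the same way on such input should be checked against its source.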