Package cc.mallet.pipe
Class StringIterator
- java.lang.Object
-
- cc.mallet.pipe.StringIterator
-
- All Implemented Interfaces:
java.util.Iterator<java.lang.Character>
public final class StringIterator extends java.lang.Object implements java.util.Iterator<java.lang.Character>
Java implementation of Jonathan Wood's "Text Parsing Helper Class".
- See Also:
- Text Parsing Helper Class
-
-
Constructor Summary
Constructor | Description
StringIterator(java.lang.String text) |
-
Method Summary
Modifier and Type | Method | Description
java.lang.String | extract(int start) | Extracts a substring from the specified range of the current text.
java.lang.String | extract(int start, int end) | Extracts a substring from the specified range of the current text.
boolean | hasNext() |
static boolean | isApostrophe(char c) | Check if a character is an apostrophe.
static boolean | isArrow(char c) | Check if a character is an arrow symbol.
static boolean | isBlank(java.lang.String s) | Check if a string is blank.
static boolean | isBracket(char c) | Check if a character is a bracket.
static boolean | isCapitalized(java.lang.String s) | Check if a string is capitalized.
static boolean | isCjkSymbol(char c) | Check if a character is a CJK symbol.
static boolean | isCurrency(char c) | Check if a character is a currency symbol.
static boolean | isDoubleQuotationMark(char c) | Check if a character is a double quotation mark.
boolean | isEndOfText() | Indicates if the current position is at the end of the current document.
static boolean | isGeneralPunctuation(char c) | Check if a character is a Unicode punctuation character.
static boolean | isHyphen(char c) | Check if a character is a hyphen.
static boolean | isLeftBracket(char c) | Check if a character is a left bracket.
static boolean | isListMark(char c) | Check if a character is a list mark.
static boolean | isLowerCase(java.lang.String s) | Check if a string is lower case.
static boolean | isPunctuation(char c) | Check if a character is an ASCII punctuation character.
static boolean | isQuotationMark(char c) | Check if a character is a quotation mark.
static boolean | isRightBracket(char c) | Check if a character is a right bracket.
static boolean | isSeparatorMark(char c) | Check if a character is a separator.
static boolean | isSingleQuotationMark(char c) | Check if a character is a single quotation mark.
static boolean | isTerminalMark(char c) | Check if a character is a final mark.
static boolean | isUpperCase(java.lang.String s) | Check if a string is upper case.
static boolean | isWhitespace(int c) | Check if a character is whitespace.
static java.lang.String | join(java.util.List<java.lang.String> strings, char separator) | Join a list of strings.
void | moveAhead() | Moves the current position ahead by one character.
void | moveAhead(int ahead) | Moves the current position ahead by the specified number of characters.
void | movePast(char[] chars) | Moves to the next occurrence of any character that is not one of the specified characters.
void | movePastWhitespace() | Moves the current position to the next character that is not whitespace.
void | moveTo(char c) | Moves to the next occurrence of the specified character.
void | moveTo(char[] chars) | Moves to the next occurrence of any one of the specified characters.
void | moveTo(java.lang.String s) | Moves to the next occurrence of the specified string.
void | moveToEndOfLine() | Moves the current position to the first character that is part of a newline.
void | moveToWhitespace() | Moves the current position to the next whitespace character.
java.lang.Character | next() |
static java.lang.String | normalize(java.lang.String text) | Normalize quotation marks and apostrophes.
char | peek() | Returns the character at the current position, or a null character if the current position is at the end of the document.
char | peek(int ahead) | Returns the character at the specified number of characters beyond the current position, or a null character if the specified position is at the end of the document.
int | position() |
int | remaining() |
static java.lang.String | removeDiacriticalMarks(java.lang.String s) | A string normalizer which performs the following steps: Unicode canonical decomposition (Normalizer.Form.NFD), removal of diacritical marks, Unicode canonical composition (Normalizer.Form.NFC).
void | reset(java.lang.String text) | Sets the current document and resets the current position to the start of it.
java.lang.String | string() |
static java.lang.String | trimLeft(java.lang.String s) | Remove the whitespace prefix from a string.
static java.lang.String | trimRight(java.lang.String s) | Remove the whitespace suffix from a string.
-
-
-
Field Detail
-
CR
public static final char CR
- See Also:
- Constant Field Values
-
LF
public static final char LF
- See Also:
- Constant Field Values
-
SPACE
public static final char SPACE
- See Also:
- Constant Field Values
-
-
Method Detail
-
normalize
public static java.lang.String normalize(java.lang.String text)
Normalize quotation marks and apostrophes.
- Parameters:
- text - document.
- Returns:
- The normalized text.
-
trimLeft
public static java.lang.String trimLeft(java.lang.String s)
Remove the whitespace prefix from a string.
- Parameters:
- s - string.
- Returns:
- The string without leading whitespace.
-
trimRight
public static java.lang.String trimRight(java.lang.String s)
Remove the whitespace suffix from a string.
- Parameters:
- s - string.
- Returns:
- The string without trailing whitespace.
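As an illustration of the two trimming contracts above, here is a minimal sketch built only on the JDK. The class name `TrimSketch` is hypothetical, and using `Character.isWhitespace` as the whitespace test is an assumption; the library may rely on its own `isWhitespace(int)` instead.

```java
// Hypothetical stand-alone sketch; not the library's implementation.
final class TrimSketch {

    // Drops leading whitespace and keeps the rest of the string untouched.
    static String trimLeft(String s) {
        int start = 0;
        while (start < s.length() && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    // Drops trailing whitespace and keeps the rest of the string untouched.
    static String trimRight(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimLeft("  text  ") + "]");  // prints "[text  ]"
        System.out.println("[" + trimRight("  text  ") + "]"); // prints "[  text]"
    }
}
```

Unlike `String.trim()`, each helper touches only one side of the string.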
-
isBlank
public static boolean isBlank(java.lang.String s)
Check if a string is blank.
- Parameters:
- s - string.
- Returns:
- true if and only if s consists only of whitespace characters.
-
isCapitalized
public static boolean isCapitalized(java.lang.String s)
Check if a string is capitalized.
- Parameters:
- s - string.
- Returns:
- true if and only if s starts with an upper case character and all other characters are lower case.
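The capitalization rule described above can be sketched as follows. `CaseSketch` is a hypothetical name, and the handling of null, empty strings, and non-letter characters is an assumption, since the documentation does not specify it.

```java
// Hypothetical stand-alone sketch; not the library's implementation.
final class CaseSketch {

    // True when the first character is upper case and every other
    // character is lower case. Assumption: null and empty strings,
    // and strings containing digits or punctuation, return false.
    static boolean isCapitalized(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        if (!Character.isUpperCase(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isLowerCase(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCapitalized("Hello")); // prints "true"
        System.out.println(isCapitalized("HELLO")); // prints "false"
        System.out.println(isCapitalized("hello")); // prints "false"
    }
}
```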
-
isUpperCase
public static boolean isUpperCase(java.lang.String s)
Check if a string is upper case.
- Parameters:
- s - string.
- Returns:
- true if and only if s consists only of upper case characters.
-
isLowerCase
public static boolean isLowerCase(java.lang.String s)
Check if a string is lower case.
- Parameters:
- s - string.
- Returns:
- true if and only if s consists only of lower case characters.
-
removeDiacriticalMarks
public static java.lang.String removeDiacriticalMarks(java.lang.String s)
A string normalizer which performs the following steps:
- Unicode canonical decomposition (Normalizer.Form.NFD)
- Removal of diacritical marks
- Unicode canonical composition (Normalizer.Form.NFC)
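The three steps listed above can be sketched with `java.text.Normalizer` from the JDK. The class name `DiacriticsSketch` is hypothetical, and matching marks with the regex category `\p{M}` is an assumption; the library may use a narrower pattern.

```java
import java.text.Normalizer;

// Hypothetical stand-alone sketch; not the library's implementation.
final class DiacriticsSketch {

    static String removeDiacriticalMarks(String s) {
        // Step 1: canonical decomposition splits "é" into "e" + U+0301.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Step 2: drop the combining marks (assumption: Unicode category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Step 3: canonical composition restores any remaining composed forms.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café" written with a Unicode escape.
        System.out.println(removeDiacriticalMarks("caf\u00e9")); // prints "cafe"
    }
}
```

Decomposing first is what makes the removal step possible: in the precomposed form the accent is not a separate character that a pattern could match.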
-
isWhitespace
public static boolean isWhitespace(int c)
Check if a character is whitespace. This method takes into account Unicode space characters.
- Parameters:
- c - character as a Unicode code point.
- Returns:
- true if c is a space character.
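One plausible way to build a Unicode-aware whitespace test on a code point is to combine the two JDK predicates, as sketched below. This is an assumption about the approach, not the library's actual rule, and `WhitespaceSketch` is a hypothetical name.

```java
// Hypothetical stand-alone sketch; not the library's implementation.
final class WhitespaceSketch {

    // Character.isWhitespace covers tabs and line breaks but excludes
    // non-breaking spaces; Character.isSpaceChar covers the Unicode
    // space separators, including U+00A0. Combining them accepts both.
    static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isWhitespace('\t'));   // prints "true"
        System.out.println(isWhitespace(0x00A0)); // prints "true" (no-break space)
        System.out.println(isWhitespace('a'));    // prints "false"
    }
}
```

Taking an `int` rather than a `char` lets the same test handle supplementary code points above U+FFFF.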
-
isPunctuation
public static boolean isPunctuation(char c)
Check if a character is an ASCII punctuation character.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a punctuation character.
-
isGeneralPunctuation
public static boolean isGeneralPunctuation(char c)
Check if a character is a Unicode punctuation character.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a punctuation character.
-
isCjkSymbol
public static boolean isCjkSymbol(char c)
Check if a character is a CJK symbol.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a CJK symbol.
-
isCurrency
public static boolean isCurrency(char c)
Check if a character is a currency symbol.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a currency symbol.
-
isArrow
public static boolean isArrow(char c)
Check if a character is an arrow symbol.
- Parameters:
- c - character.
- Returns:
- true if and only if c is an arrow symbol.
-
isHyphen
public static boolean isHyphen(char c)
Check if a character is a hyphen.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a hyphen.
-
isApostrophe
public static boolean isApostrophe(char c)
Check if a character is an apostrophe.
- Parameters:
- c - character.
- Returns:
- true if and only if c is an apostrophe.
-
isListMark
public static boolean isListMark(char c)
Check if a character is a list mark.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a list mark.
-
isTerminalMark
public static boolean isTerminalMark(char c)
Check if a character is a final mark.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a final mark.
-
isSeparatorMark
public static boolean isSeparatorMark(char c)
Check if a character is a separator.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a separator.
-
isQuotationMark
public static boolean isQuotationMark(char c)
Check if a character is a quotation mark.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a quotation mark.
-
isSingleQuotationMark
public static boolean isSingleQuotationMark(char c)
Check if a character is a single quotation mark.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a single quotation mark.
-
isDoubleQuotationMark
public static boolean isDoubleQuotationMark(char c)
Check if a character is a double quotation mark.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a double quotation mark.
-
isBracket
public static boolean isBracket(char c)
Check if a character is a bracket.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a bracket.
-
isLeftBracket
public static boolean isLeftBracket(char c)
Check if a character is a left bracket.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a left bracket.
-
isRightBracket
public static boolean isRightBracket(char c)
Check if a character is a right bracket.
- Parameters:
- c - character.
- Returns:
- true if and only if c is a right bracket.
-
join
public static java.lang.String join(java.util.List<java.lang.String> strings, char separator)
Join a list of strings. Similar to Guava's Joiner.on(separator).join(strings).
- Returns:
- The joined string.
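For readers without Guava on the classpath, the same joining behavior can be approximated with the JDK's `String.join`. `JoinSketch` is a hypothetical name, and this is only a sketch of the idea, not the library's code. One difference worth noting: `String.join` renders a null element as the text "null", whereas Guava's `Joiner` rejects nulls by default.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the library's implementation.
final class JoinSketch {

    // Concatenates the strings, placing the separator between neighbors.
    static String join(List<String> strings, char separator) {
        return String.join(String.valueOf(separator), strings);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("a", "b", "c"), ',')); // prints "a,b,c"
    }
}
```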
-
reset
public void reset(java.lang.String text)
Sets the current document and resets the current position to the start of it.
-
hasNext
public boolean hasNext()
- Specified by:
hasNext in interface java.util.Iterator<java.lang.Character>
-
next
public java.lang.Character next()
- Specified by:
next in interface java.util.Iterator<java.lang.Character>
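Since hasNext and next carry no description here, the following sketch shows how an `Iterator<Character>` over a string is commonly written. The class name `CharIteratorSketch` is hypothetical, and throwing `NoSuchElementException` at the end of the text is an assumption based on the general `Iterator` contract; the class documented here may behave differently.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-alone sketch; not the library's implementation.
final class CharIteratorSketch implements Iterator<Character> {

    private final String text;
    private int position = 0;

    CharIteratorSketch(String text) {
        this.text = text;
    }

    @Override
    public boolean hasNext() {
        return position < text.length();
    }

    // Returns the current character, then advances by one.
    @Override
    public Character next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return text.charAt(position++);
    }

    public static void main(String[] args) {
        Iterator<Character> it = new CharIteratorSketch("abc");
        while (it.hasNext()) {
            System.out.print(it.next()); // prints "abc" in total
        }
        System.out.println();
    }
}
```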
-
isEndOfText
public boolean isEndOfText()
Indicates if the current position is at the end of the current document.
- Returns:
- true if and only if the end of the document has been reached, false otherwise.
-
peek
public char peek(int ahead)
Returns the character at the specified number of characters beyond the current position, or a null character if the specified position is at the end of the document.
- Parameters:
- ahead - The number of characters beyond the current position.
- Returns:
- The character at the specified offset from the current position.
-
peek
public char peek()
Returns the character at the current position, or a null character if the current position is at the end of the document.
- Returns:
- The character at the current position.
-
moveAhead
public void moveAhead()
Moves the current position ahead by one character.
-
moveAhead
public void moveAhead(int ahead)
Moves the current position ahead by the specified number of characters.
- Parameters:
- ahead - The number of characters to move ahead.
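The cursor operations described above (isEndOfText, peek, moveAhead) fit together roughly as in the sketch below. `CursorSketch` is a hypothetical name. Clamping the position to the text length in `moveAhead` is an assumption, as is returning the null character for any out-of-range lookahead.

```java
// Hypothetical stand-alone sketch; not the library's implementation.
final class CursorSketch {

    private final String text;
    private int position = 0;

    CursorSketch(String text) {
        this.text = (text == null) ? "" : text;
    }

    boolean isEndOfText() {
        return position >= text.length();
    }

    // Looks at the current character without consuming it.
    char peek() {
        return peek(0);
    }

    // Looks `ahead` characters past the current position; returns the
    // null character when that index falls outside the text.
    char peek(int ahead) {
        int index = position + ahead;
        return (index >= 0 && index < text.length()) ? text.charAt(index) : '\0';
    }

    void moveAhead() {
        moveAhead(1);
    }

    // Assumption: the position never moves past the end of the text.
    void moveAhead(int ahead) {
        position = Math.min(position + ahead, text.length());
    }

    int position() {
        return position;
    }

    public static void main(String[] args) {
        CursorSketch cursor = new CursorSketch("ab");
        System.out.println(cursor.peek());        // prints "a"
        cursor.moveAhead();
        System.out.println(cursor.peek());        // prints "b"
        cursor.moveAhead(5);
        System.out.println(cursor.isEndOfText()); // prints "true"
    }
}
```

Returning a null character instead of throwing lets a parsing loop look ahead freely without bounds checks at every call site.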
-
string
public java.lang.String string()
-
position
public int position()
-
remaining
public int remaining()
-
extract
public java.lang.String extract(int start)
Extracts a substring from the specified range of the current text.
-
extract
public java.lang.String extract(int start, int end)
Extracts a substring from the specified range of the current text.
-
moveTo
public void moveTo(java.lang.String s)
Moves to the next occurrence of the specified string.
- Parameters:
- s - String to find.
-
moveTo
public void moveTo(char c)
Moves to the next occurrence of the specified character.
- Parameters:
- c - Character to find.
-
moveTo
public void moveTo(char[] chars)
Moves to the next occurrence of any one of the specified characters.
- Parameters:
- chars - Array of characters to find.
-
movePast
public void movePast(char[] chars)
Moves to the next occurrence of any character that is not one of the specified characters.
- Parameters:
- chars - Array of characters to move past.
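To show how moveTo, movePast, and extract are typically combined when parsing, here is a minimal sketch that splits a `key=  value;` string. `ScanSketch` is a hypothetical name. Moving to the end of the text when the target character is absent is an assumption, as is treating the end index of `extract` as exclusive.

```java
// Hypothetical stand-alone sketch; not the library's implementation.
final class ScanSketch {

    private final String text;
    private int position = 0;

    ScanSketch(String text) {
        this.text = text;
    }

    int position() {
        return position;
    }

    // Moves to the next occurrence of c, or to the end of the text
    // when c does not occur again (assumed behavior).
    void moveTo(char c) {
        int index = text.indexOf(c, position);
        position = (index < 0) ? text.length() : index;
    }

    // Skips forward while the current character is one of `chars`.
    void movePast(char[] chars) {
        while (position < text.length() && contains(chars, text.charAt(position))) {
            position++;
        }
    }

    // Assumption: start is inclusive and end is exclusive.
    String extract(int start, int end) {
        return text.substring(start, end);
    }

    private static boolean contains(char[] chars, char c) {
        for (char candidate : chars) {
            if (candidate == c) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ScanSketch scan = new ScanSketch("key=  value;");
        int start = scan.position();
        scan.moveTo('=');
        String key = scan.extract(start, scan.position());
        scan.movePast(new char[] {'=', ' '});
        start = scan.position();
        scan.moveTo(';');
        String value = scan.extract(start, scan.position());
        System.out.println(key + " -> " + value); // prints "key -> value"
    }
}
```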
-
moveToEndOfLine
public void moveToEndOfLine()
Moves the current position to the first character that is part of a newline.
-
moveToWhitespace
public void moveToWhitespace()
Moves the current position to the next whitespace character.
-
movePastWhitespace
public void movePastWhitespace()
Moves the current position to the next character that is not whitespace.
-
-