Package cc.mallet.pipe
Class StringIterator
- java.lang.Object
-
- cc.mallet.pipe.StringIterator
-
- All Implemented Interfaces:
java.util.Iterator<java.lang.Character>
public final class StringIterator extends java.lang.Object implements java.util.Iterator<java.lang.Character>
Java implementation of Jonathan Wood's "Text Parsing Helper Class".
- See Also:
- Text Parsing Helper Class
-
-
Constructor Summary
StringIterator(java.lang.String text)
-
Method Summary
java.lang.String extract(int start)
    Extracts a substring from the specified range of the current text.
java.lang.String extract(int start, int end)
    Extracts a substring from the specified range of the current text.
boolean hasNext()
static boolean isApostrophe(char c)
    Check if a character is an apostrophe.
static boolean isArrow(char c)
    Check if a character is an arrow symbol.
static boolean isBlank(java.lang.String s)
    Check if a string is blank.
static boolean isBracket(char c)
    Check if a character is a bracket.
static boolean isCapitalized(java.lang.String s)
    Check if a string is capitalized.
static boolean isCjkSymbol(char c)
    Check if a character is a CJK symbol.
static boolean isCurrency(char c)
    Check if a character is a currency symbol.
static boolean isDoubleQuotationMark(char c)
    Check if a character is a double quotation mark.
boolean isEndOfText()
    Indicates if the current position is at the end of the current document.
static boolean isGeneralPunctuation(char c)
    Check if a character is a Unicode punctuation character.
static boolean isHyphen(char c)
    Check if a character is a hyphen.
static boolean isLeftBracket(char c)
    Check if a character is a left bracket.
static boolean isListMark(char c)
    Check if a character is a list mark.
static boolean isLowerCase(java.lang.String s)
    Check if a string is lower case.
static boolean isPunctuation(char c)
    Check if a character is a punctuation character in standard ASCII.
static boolean isQuotationMark(char c)
    Check if a character is a quotation mark.
static boolean isRightBracket(char c)
    Check if a character is a right bracket.
static boolean isSeparatorMark(char c)
    Check if a character is a separator.
static boolean isSingleQuotationMark(char c)
    Check if a character is a single quotation mark.
static boolean isTerminalMark(char c)
    Check if a character is a final mark.
static boolean isUpperCase(java.lang.String s)
    Check if a string is upper case.
static boolean isWhitespace(int c)
    Check if a character is whitespace.
static java.lang.String join(java.util.List<java.lang.String> strings, char separator)
    Join a list of strings.
void moveAhead()
    Moves the current position ahead by one character.
void moveAhead(int ahead)
    Moves the current position ahead by the specified number of characters.
void movePast(char[] chars)
    Moves to the next occurrence of any character that is not one of the specified characters.
void movePastWhitespace()
    Moves the current position to the next character that is not whitespace.
void moveTo(char c)
    Moves to the next occurrence of the specified character.
void moveTo(char[] chars)
    Moves to the next occurrence of any one of the specified characters.
void moveTo(java.lang.String s)
    Moves to the next occurrence of the specified string.
void moveToEndOfLine()
    Moves the current position to the first character that is part of a newline.
void moveToWhitespace()
    Moves the current position to the next whitespace character.
java.lang.Character next()
static java.lang.String normalize(java.lang.String text)
    Normalize quotation marks and apostrophes.
char peek()
    Returns the character beyond the current position, or a null character if the specified position is at the end of the document.
char peek(int ahead)
    Returns the character at the specified number of characters beyond the current position, or a null character if the specified position is at the end of the document.
int position()
int remaining()
static java.lang.String removeDiacriticalMarks(java.lang.String s)
    A string normalizer which performs the following steps: Unicode canonical decomposition (Normalizer.Form.NFD), removal of diacritical marks, Unicode canonical composition (Normalizer.Form.NFC).
void reset(java.lang.String text)
    Sets the current document and resets the current position to the start of it.
java.lang.String string()
static java.lang.String trimLeft(java.lang.String s)
    Remove whitespace prefix from string.
static java.lang.String trimRight(java.lang.String s)
    Remove whitespace suffix from string.
-
-
-
Field Detail
-
CR
public static final char CR
- See Also:
- Constant Field Values
-
LF
public static final char LF
- See Also:
- Constant Field Values
-
SPACE
public static final char SPACE
- See Also:
- Constant Field Values
-
-
Method Detail
-
normalize
public static java.lang.String normalize(java.lang.String text)
Normalize quotation marks and apostrophes.
- Parameters:
text - document.
- Returns:
- A normalized text.
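As a rough illustration of what normalizing quotation marks and apostrophes can look like, the sketch below maps a few typographic quotes to their ASCII counterparts. The class name `QuoteNormalizerSketch` is hypothetical, and the set of mapped characters is an assumption: this page does not list which characters `normalize` actually handles.

```java
final class QuoteNormalizerSketch {
    // Hypothetical helper: maps four common typographic quotes to ASCII.
    // The real method may cover a different set of characters.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\u2018': // left single quotation mark
                case '\u2019': // right single quotation mark, also used as apostrophe
                    out.append('\'');
                    break;
                case '\u201C': // left double quotation mark
                case '\u201D': // right double quotation mark
                    out.append('"');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u201CIt\u2019s fine\u201D"));
    }
}
```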
-
trimLeft
public static java.lang.String trimLeft(java.lang.String s)
Remove whitespace prefix from string.
- Parameters:
s - string.
- Returns:
- The string without leading whitespace.
-
trimRight
public static java.lang.String trimRight(java.lang.String s)
Remove whitespace suffix from string.
- Parameters:
s - string.
- Returns:
- The string without trailing whitespace.
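The two trimming helpers can be pictured with the following minimal sketch. `TrimSketch` is a hypothetical stand-in, and it assumes that "whitespace" means `Character.isWhitespace`; the real methods may use a broader, Unicode-aware definition.

```java
final class TrimSketch {
    // Drops leading characters for which Character.isWhitespace is true.
    static String trimLeft(String s) {
        int start = 0;
        while (start < s.length() && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    // Drops trailing characters for which Character.isWhitespace is true.
    static String trimRight(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimLeft("  ab ") + "]");
        System.out.println("[" + trimRight("  ab ") + "]");
    }
}
```

Unlike `String.trim()`, each helper touches only one end of the string.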
-
isBlank
public static boolean isBlank(java.lang.String s)
Check if a string is blank.
- Parameters:
s - string.
- Returns:
- true if and only if s consists solely of whitespace characters.
-
isCapitalized
public static boolean isCapitalized(java.lang.String s)
Check if a string is capitalized.
- Parameters:
s - string.
- Returns:
- true if and only if s starts with an upper case character and all other characters are lower case.
-
isUpperCase
public static boolean isUpperCase(java.lang.String s)
Check if a string is upper case.
- Parameters:
s - string.
- Returns:
- true if and only if s consists solely of upper case characters.
-
isLowerCase
public static boolean isLowerCase(java.lang.String s)
Check if a string is lower case.
- Parameters:
s - string.
- Returns:
- true if and only if s consists solely of lower case characters.
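The three case predicates can be read as in the sketch below. `CasePredicatesSketch` is hypothetical; how the real methods treat empty strings, digits, or punctuation is not stated on this page, so the choices here (empty strings never qualify, every character must satisfy `Character.isUpperCase` or `Character.isLowerCase`) are assumptions.

```java
final class CasePredicatesSketch {
    static boolean isUpperCase(String s) {
        if (s.isEmpty()) return false; // assumption: empty strings do not qualify
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isUpperCase(s.charAt(i))) return false;
        }
        return true;
    }

    static boolean isLowerCase(String s) {
        if (s.isEmpty()) return false; // assumption: empty strings do not qualify
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLowerCase(s.charAt(i))) return false;
        }
        return true;
    }

    // First character upper case, every remaining character lower case.
    static boolean isCapitalized(String s) {
        return !s.isEmpty()
                && Character.isUpperCase(s.charAt(0))
                && (s.length() == 1 || isLowerCase(s.substring(1)));
    }

    public static void main(String[] args) {
        System.out.println(isCapitalized("Hello"));
        System.out.println(isUpperCase("NASA"));
        System.out.println(isLowerCase("word"));
    }
}
```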
-
removeDiacriticalMarks
public static java.lang.String removeDiacriticalMarks(java.lang.String s)
A string normalizer which performs the following steps:
- Unicode canonical decomposition (Normalizer.Form.NFD)
- Removal of diacritical marks
- Unicode canonical composition (Normalizer.Form.NFC)
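The three steps above can be sketched with `java.text.Normalizer`. This is one plausible reading, not the library's source: the class name `DiacriticsSketch` is hypothetical, and using the `InCombiningDiacriticalMarks` Unicode block to identify the marks is an assumption.

```java
import java.text.Normalizer;

final class DiacriticsSketch {
    static String removeDiacriticalMarks(String s) {
        // Step 1: canonical decomposition splits "é" into "e" + U+0301.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Step 2: drop combining marks (assumed: the U+0300..U+036F block).
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 3: canonical composition recombines anything that remains.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(removeDiacriticalMarks("caf\u00E9 na\u00EFve"));
    }
}
```

Decomposing first matters because a precomposed character such as U+00E9 contains no separate combining mark to remove.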
-
isWhitespace
public static boolean isWhitespace(int c)
Check if a character is whitespace. This method takes into account Unicode space characters.
- Parameters:
c - character as a Unicode code point.
- Returns:
- true if c is a space character.
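A Unicode-aware whitespace test on code points might look like the sketch below. `WhitespaceSketch` is hypothetical, and combining `Character.isWhitespace` with `Character.isSpaceChar` is an assumption about what "takes into account Unicode space characters" means here.

```java
final class WhitespaceSketch {
    static boolean isWhitespace(int codePoint) {
        // isWhitespace covers tab, newline and most separators but excludes
        // non-breaking spaces; isSpaceChar covers all Unicode space separators.
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isWhitespace('\t'));
        System.out.println(isWhitespace(0x00A0)); // no-break space
        System.out.println(isWhitespace('a'));
    }
}
```

Taking an `int` rather than a `char` lets the check accept supplementary code points as well.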
-
isPunctuation
public static boolean isPunctuation(char c)
Check if a character is a punctuation character in standard ASCII.
- Parameters:
c - character.
- Returns:
- true if and only if c is a punctuation character.
-
isGeneralPunctuation
public static boolean isGeneralPunctuation(char c)
Check if a character is a Unicode punctuation character.
- Parameters:
c - character.
- Returns:
- true if and only if c is a punctuation character.
-
isCjkSymbol
public static boolean isCjkSymbol(char c)
Check if a character is a CJK symbol.
- Parameters:
c - character.
- Returns:
- true if and only if c is a CJK symbol.
-
isCurrency
public static boolean isCurrency(char c)
Check if a character is a currency symbol.
- Parameters:
c - character.
- Returns:
- true if and only if c is a currency symbol.
-
isArrow
public static boolean isArrow(char c)
Check if a character is an arrow symbol.
- Parameters:
c - character.
- Returns:
- true if and only if c is an arrow symbol.
-
isHyphen
public static boolean isHyphen(char c)
Check if a character is a hyphen.
- Parameters:
c - character.
- Returns:
- true if and only if c is a hyphen.
-
isApostrophe
public static boolean isApostrophe(char c)
Check if a character is an apostrophe.
- Parameters:
c - character.
- Returns:
- true if and only if c is an apostrophe.
-
isListMark
public static boolean isListMark(char c)
Check if a character is a list mark.
- Parameters:
c - character.
- Returns:
- true if and only if c is a list mark.
-
isTerminalMark
public static boolean isTerminalMark(char c)
Check if a character is a final mark.
- Parameters:
c - character.
- Returns:
- true if and only if c is a final mark.
-
isSeparatorMark
public static boolean isSeparatorMark(char c)
Check if a character is a separator.
- Parameters:
c - character.
- Returns:
- true if and only if c is a separator.
-
isQuotationMark
public static boolean isQuotationMark(char c)
Check if a character is a quotation mark.
- Parameters:
c - character.
- Returns:
- true if and only if c is a quotation mark.
-
isSingleQuotationMark
public static boolean isSingleQuotationMark(char c)
Check if a character is a single quotation mark.
- Parameters:
c - character.
- Returns:
- true if and only if c is a single quotation mark.
-
isDoubleQuotationMark
public static boolean isDoubleQuotationMark(char c)
Check if a character is a double quotation mark.
- Parameters:
c - character.
- Returns:
- true if and only if c is a double quotation mark.
-
isBracket
public static boolean isBracket(char c)
Check if a character is a bracket.
- Parameters:
c - character.
- Returns:
- true if and only if c is a bracket.
-
isLeftBracket
public static boolean isLeftBracket(char c)
Check if a character is a left bracket.
- Parameters:
c - character.
- Returns:
- true if and only if c is a left bracket.
-
isRightBracket
public static boolean isRightBracket(char c)
Check if a character is a right bracket.
- Parameters:
c - character.
- Returns:
- true if and only if c is a right bracket.
-
join
public static java.lang.String join(java.util.List<java.lang.String> strings, char separator)
Join a list of strings. Similar to Guava's Joiner.on(separator).join(strings).
- Returns:
- The joined string.
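For readers without Guava at hand, the same joining behavior can be approximated with the JDK alone. The sketch below is an assumption about the semantics, not the library's source: `JoinSketch` is hypothetical, and how the real `join` treats null elements is not documented here.

```java
import java.util.Arrays;
import java.util.List;

final class JoinSketch {
    // Concatenates the strings, placing the separator between neighbors.
    static String join(List<String> strings, char separator) {
        return String.join(String.valueOf(separator), strings);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("a", "b", "c"), ','));
    }
}
```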
-
reset
public void reset(java.lang.String text)
Sets the current document and resets the current position to the start of it.
-
hasNext
public boolean hasNext()
- Specified by:
hasNext in interface java.util.Iterator<java.lang.Character>
-
next
public java.lang.Character next()
- Specified by:
next in interface java.util.Iterator<java.lang.Character>
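Because the class implements `java.util.Iterator<java.lang.Character>`, it can be consumed with the usual `hasNext()`/`next()` loop. The sketch below shows that contract on a hypothetical stand-in, `CharIteratorSketch`; throwing `NoSuchElementException` at the end follows the general `Iterator` contract and is an assumption about this class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class CharIteratorSketch implements Iterator<Character> {
    private final String text;
    private int pos = 0;

    CharIteratorSketch(String text) {
        this.text = text;
    }

    @Override
    public boolean hasNext() {
        return pos < text.length();
    }

    @Override
    public Character next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return text.charAt(pos++); // return the current character, then advance
    }

    // Example consumer: counts the letters seen while iterating.
    static int countLetters(String text) {
        int letters = 0;
        Iterator<Character> it = new CharIteratorSketch(text);
        while (it.hasNext()) {
            if (Character.isLetter(it.next())) {
                letters++;
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        System.out.println(countLetters("a1b2"));
    }
}
```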
-
isEndOfText
public boolean isEndOfText()
Indicates if the current position is at the end of the current document.
- Returns:
- true if and only if the end of the document has been reached.
-
peek
public char peek(int ahead)
Returns the character at the specified number of characters beyond the current position, or a null character if the specified position is at the end of the document.
- Parameters:
ahead - The number of characters beyond the current position.
- Returns:
- The character at the specified position.
-
peek
public char peek()
Returns the character beyond the current position, or a null character if the specified position is at the end of the document.
- Returns:
- The character at the current position.
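The look-ahead behavior, including the null-character sentinel, can be sketched as follows. `PeekSketch` is a hypothetical stand-in. Treating `peek()` as `peek(0)` is an assumption drawn from the "character at the current position" return description; the page does not confirm it.

```java
final class PeekSketch {
    private final String text;
    private int pos = 0;

    PeekSketch(String text) {
        this.text = text;
    }

    // Returns the character `ahead` positions past the cursor, or '\0'
    // when that position falls outside the text.
    char peek(int ahead) {
        int index = pos + ahead;
        return (index >= 0 && index < text.length()) ? text.charAt(index) : '\0';
    }

    // Assumption: the no-argument form looks at the current position.
    char peek() {
        return peek(0);
    }

    void moveAhead() {
        pos = Math.min(pos + 1, text.length());
    }

    public static void main(String[] args) {
        PeekSketch p = new PeekSketch("ab");
        System.out.println(p.peek());
        System.out.println(p.peek(1));
        System.out.println(p.peek(2) == '\0');
    }
}
```

Returning a sentinel instead of throwing keeps parsing loops free of bounds checks, at the cost of not distinguishing a real NUL character in the text from the end of the document.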
-
moveAhead
public void moveAhead()
Moves the current position ahead by one character.
-
moveAhead
public void moveAhead(int ahead)
Moves the current position ahead by the specified number of characters.
- Parameters:
ahead - The number of characters to move ahead.
-
string
public java.lang.String string()
-
position
public int position()
-
remaining
public int remaining()
-
extract
public java.lang.String extract(int start)
Extracts a substring from the specified range of the current text.
-
extract
public java.lang.String extract(int start, int end)
Extracts a substring from the specified range of the current text.
-
moveTo
public void moveTo(java.lang.String s)
Moves to the next occurrence of the specified string.
- Parameters:
s - String to find.
-
moveTo
public void moveTo(char c)
Moves to the next occurrence of the specified character.
- Parameters:
c - Character to find.
-
moveTo
public void moveTo(char[] chars)
Moves to the next occurrence of any one of the specified characters.
- Parameters:
chars - Array of characters to find.
-
movePast
public void movePast(char[] chars)
Moves to the next occurrence of any character that is not one of the specified characters.
- Parameters:
chars - Array of characters to move past.
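The difference between moving to a character and moving past a set of characters is easiest to see on a small cursor. `MoveSketch` below is hypothetical; in particular, stopping at the end of the text when the target is absent is an assumption, since this page does not say what happens in that case.

```java
final class MoveSketch {
    private final String text;
    private int pos = 0;

    MoveSketch(String text) {
        this.text = text;
    }

    int position() {
        return pos;
    }

    // Advances to the next occurrence of c at or after the cursor.
    // Assumption: if c is absent, the cursor stops at the end of the text.
    void moveTo(char c) {
        int found = text.indexOf(c, pos);
        pos = (found < 0) ? text.length() : found;
    }

    // Skips characters while they belong to the given set.
    void movePast(char[] chars) {
        while (pos < text.length() && contains(chars, text.charAt(pos))) {
            pos++;
        }
    }

    private static boolean contains(char[] chars, char c) {
        for (char candidate : chars) {
            if (candidate == c) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MoveSketch m = new MoveSketch("key = value");
        m.moveTo('=');
        m.movePast(new char[] {'=', ' '});
        System.out.println(m.position());
    }
}
```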
-
moveToEndOfLine
public void moveToEndOfLine()
Moves the current position to the first character that is part of a newline.
-
moveToWhitespace
public void moveToWhitespace()
Moves the current position to the next whitespace character.
-
movePastWhitespace
public void movePastWhitespace()
Moves the current position to the next character that is not whitespace.
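Alternating the two whitespace moves is a common way to split text into tokens: skip whitespace, remember the start, move to the next whitespace, then extract the range. The sketch below inlines that loop over a plain `String`. `WhitespaceTokenizerSketch` is hypothetical, and `Character.isWhitespace` stands in for whatever whitespace definition the class uses.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenizerSketch {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Equivalent of movePastWhitespace(): skip the blank run.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                break; // end of text reached
            }
            int start = pos;
            // Equivalent of moveToWhitespace(): advance to the token's end.
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            // Equivalent of extract(start, position()).
            out.add(text.substring(start, pos));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("  a bc\n d "));
    }
}
```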
-
-