Tuesday, May 1, 2018

New Methods on Java String with JDK 11

It appears likely that Java's String class will be gaining some new methods with JDK 11, expected to be released in September 2018.

BUG #BUG TITLENEW String METHODDESCRIPTION
JDK-8200425 String::lines lines() "String instance method that uses a specialized Spliterator to lazily provide lines from the source string."
JDK-8200378 String::strip, String::stripLeading, String::stripTrailing strip() "Unicode-aware" evolution of trim()
stripLeading() "removal of Unicode white space from the beginning"
stripTrailing() "removal of Unicode white space from the ... end"
JDK-8200437 String::isBlank isBlank() "instance method that returns true if the string is empty or contains only white space"

Evidence of the progress that has been made related to these methods can be found in messages requesting "compatibility and specification reviews" (CSR) on the core-libs-dev mailing list:

A common characteristic of four of these five new methods is that they use a different (newer) definition of "whitespace" than did old methods such as String.trim(). Bug JDK-8200373 ["String::trim JavaDoc should clarify meaning of space"] even addresses this for the String.trim() method (mailing list review request):

The current JavaDoc for String::trim does not make it clear which definition of "space" is being used in the code. With additional trimming methods coming in the near future that use a different definition of space, clarification is imperative. String::trim uses the definition of space as any codepoint that is less than or equal to the space character codepoint (\u0040.) Newer trimming methods will use the definition of (white) space as any codepoint that returns true when passed to the Character::isWhitespace predicate.

The method isWhitespace(char) was added to Character with JDK 1.1, but the method isWhitespace(int) was not introduced to the Character class until JDK 1.5. The latter method (the one accepting a parameter of type int) was added to support supplementary characters. The Javadoc comments for the Character class define supplementary characters (typically modeled with int-based "code point") versus BMP characters (typically modeled with single character):

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values ... A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. ... The methods that only accept a char value cannot support supplementary characters. ... The methods that accept an int value support all Unicode characters, including supplementary characters.

I added the bold emphasis in the above quote to emphasize the significance of a "code point," which is defined for the Java context as "a value that can be used in a coded character set". Four of the five proposed new methods for String in JDK 11 rely heavily on the concept embodied in Character.isWhitespace(int) to determine how to "trim" a given string or when determining if a given string is "blank."

Speaking of Unicode, JEP 327 ["Unicode 10"] has been proposed to be added to JDK 11 as well. As that JEP states, its intent is to "upgrade existing platform APIs to support version 10.0 of the Unicode Standard."

Conclusion

The new methods on String currently proposed for JDK 11 provide a more consistent approach to handling white space in strings that can better handle internationalization, provide methods for trimming whitespace only at the beginning of the string or at the end of the string, and provide a method especially intended for coming raw string literals.

Additional References