Rational Developer for Power Systems Software
V7.6

com.ibm.etools.iseries.util
Class NlsUtil

java.lang.Object
  extended by com.ibm.etools.iseries.util.NlsUtil

public class NlsUtil
extends Object

This class provides national language support (NLS) functions. These are mostly DBCS/MBCS-related functions, and may not work correctly for other character encodings. Some bidirectional support will also be added gradually, mainly for the editor's internal use.

These NLS functions attempt to make it easier for users to handle files that originate on and/or are targeted for a remote system, in a manner similar to how such files are handled in that particular environment (e.g., iSeries members being edited with an iSeries editor): emulation of SO/SI control characters, awareness of the text's actual positions, columns, length, and sequence numbers (all byte-determined), etc.

The current implementation, while attempting to be generic, focuses on Windows as the workstation, and zSeries (S/390) and iSeries (AS/400) as the remote system. These remote systems use EBCDIC character encodings: a DBCS character uses two bytes, a DBCS string is delimited by SO/SI controls, and certain character combinations (Arabic lam + alef) may translate into a single-byte character (visual lamalef code point). Also, for these systems each byte in the character encoding takes up one display column on the screen (1-byte SBCS character = 1 display column, 2-byte DBCS character = 2-column display width, the SO and SI control characters = 1 display column each).
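
The column arithmetic described above can be sketched as follows. This is a minimal model, not NlsUtil's implementation: the names SosiColumnModel, isDoubleByte, and displayColumns are hypothetical, and the Unicode ranges treated as double-byte are a rough assumption standing in for a real EBCDIC DBCS conversion table.

```java
/** Sketch of the EBCDIC DBCS column model (hypothetical, not NlsUtil's code). */
final class SosiColumnModel {

    /** Assumption: CJK and full-width ranges stand in for "converts to a DBCS character". */
    static boolean isDoubleByte(char c) {
        return (c >= '\u3000' && c <= '\u9FFF')
            || (c >= '\uFF01' && c <= '\uFF60');
    }

    /** Columns (one per byte) a line occupies, with one column per emulated SO and SI. */
    static int displayColumns(String text) {
        int columns = 0;
        boolean inDbcs = false;
        for (int i = 0; i < text.length(); i++) {
            boolean dbcs = isDoubleByte(text.charAt(i));
            if (dbcs != inDbcs) {
                columns++;          // SO when entering a DBCS run, SI when leaving it
                inDbcs = dbcs;
            }
            columns += dbcs ? 2 : 1;
        }
        if (inDbcs) {
            columns++;              // closing SI for a DBCS run that ends the line
        }
        return columns;
    }

    public static void main(String[] args) {
        // Two SBCS characters, SO, two DBCS characters, SI, one SBCS character.
        System.out.println(displayColumns("AB\u6F22\u5B57C"));
    }
}
```

Under this model a line of two SBCS characters, two DBCS characters, and one more SBCS character occupies 2 + 1 + 4 + 1 + 1 = 9 columns.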

For EUC character encodings (UNIX, AIX, Linux), NlsUtil may not provide adequate emulation of their source edit environments. The byte-length of characters differs from their display-column width, so you may have to use native code (JNI), e.g., the *mb* C library functions, for display-width calculations (as only byte-length information is currently obtainable in Java); these calculations affect the (column-based) tabs expansion in LPEX.

Terminology used here:

Unicode - any Java program uses Unicode for its internal representation of characters. More specifically, this is the UTF-16 encoding, which encodes the basic multilingual plane of Unicode version 1 directly, and uses surrogate pairs as the escape mechanism to encode the next 16 planes of Unicode version 3.
encoding - a Java-supported character encoding, e.g., "Cp1252" (Windows Latin-1).
native encoding - the default character encoding of the platform (host operating system) that LPEX runs on, according to the default locale. This is usually an ASCII character encoding on a workstation (Windows, Linux, OS/2, etc.), and an EBCDIC character encoding on a mainframe/midi (S/390, AS/400, etc.). This encoding is normally determined from the "file.encoding" Java system property.
file encoding - the character encoding of the underlying file. The file encoding is normally the native encoding, as files are usually stored in an encoding that is the same as the default encoding of the host operating system (for example, on Japanese Linux, files are typically stored in EUC-JP).

In a heterogeneous platform environment, the encoding of the host operating system may be different from the encoding of the file we want to load into the editor. In such a case, one must explicitly specify the encoding of the file, or let the editor attempt to detect it; the editor will then perform the character code conversion when loading the file, and similarly when saving the document.

source encoding - the source file's character encoding: the file being edited may originate from and/or be targeted for a remote system (i.e., different from the platform that LPEX runs on). Setting the source encoding information in the editor allows LPEX to emulate features of the file's original editing environment (for example, display emulated SO/SI controls), correctly establish the sequence numbers in effect, calculate the length limit of text lines for save operations, etc.
DBCS - Asian character set/encoding that contains double-byte characters.
MBCS - Asian character set/encoding that contains multi-byte characters.
SO, SI - Shift-out and Shift-in control characters. Only EBCDIC DBCS encodings use SO/SI escape characters. Balanced SO/SI characters enclose sequences of DBCS character bytes. LPEX can display emulated SO/SI characters in order to present the user with an image of the file similar to the one seen in its native source environment (e.g., an iSeries member being edited with an iSeries editor).

Notes:


Method Summary
static int countLamAlefs(String buffer, int ccsid)
          Counts the number of Lam-Alefs in a string.
static int encodingCharIndex(String s, int index, String encoding)
          Return the character index into the encoded string (i.e., as converted from Java Unicode s using the character encoding), which corresponds to the index into text String s.
static int encodingLength(char c, String encoding)
          Get the byte-length for a string consisting of one Java Unicode character c converted to the specified character encoding.
static int encodingLength(String s, String encoding)
          Get the byte-length of a Java Unicode String s in the specified character encoding.
static int getEBCDICLengthOfLogicalBuffer(String buffer)
          Return the length of the Unicode buffer in EBCDIC bytes, taking lam-alef characters and LRM and RLM markers into account.
static byte[] getLamAlefBytes(int ccsid)
          Retrieves the list of iSeries Lam-Alef bytes
static int getLamAlefsCountInBufferRange(String buffer, int len)
          Count the number of lam-alef characters in the given text buffer, beginning at the first character, for the given length. If the character at index len is an alef and the following character is a lam, then the count of lam-alefs found is still incremented.
static String getNativeEncoding()
          Retrieve the native (platform's default) character encoding.
static int indexFromEncodingIndex(String s, int index, String encoding)
          Return the index into the Java Unicode text String s which corresponds to the index into its encoding string (i.e., as converted using the specified character encoding).
static boolean isALEF(char c)
          Indicates whether or not the character is an Arabic Alef
static boolean isBidiEncoding(String encoding)
          Determine whether a character encoding is bidirectional.
static boolean isEucEncoding(String encoding)
          Determine whether a character encoding is EUC (AIX MBCS).
static boolean isIgnoringBidiMarks(String strEncoding, int CCSID)
          Returns whether the document might contain ignorable bidi marks.
static boolean isLAM(char c)
          Indicates whether or not the character is an Arabic Lam
static boolean isLamAlefByte(byte bufferByte, byte[] lamAlefArray)
          Indicates whether or not a byte is a lam-alef.
static boolean isLamAlefChar(char c)
          Indicates whether or not a character is a joined lam-alef in UNICODE
static boolean isMbcsEncoding(String encoding)
          Determine whether a character encoding is DBCS/MBCS.
static boolean isShapedBidiCcsid(int ccsid)
          Determines whether this iSeries CCSID contains a Lam-Alef ligature.
static boolean isSosiEncoding(String encoding)
          Determine whether a character encoding uses SO/SI control characters - i.e., whether it is an EBCDIC DBCS character encoding.
static boolean isValidEncoding(String encoding)
          Validate a character encoding.
static byte[] massageLamAlefs(byte[] buffer, int ccsid)
          Adds a blank for each lam-alef, so that they get transformed from one iSeries byte, to two UNICODE bytes
static String massageLamAlefs(String buffer, int ccsid)
          Adds a blank for each lam-alef, so that they get transformed from one iSeries byte, to two UNICODE bytes
static char toUpperCase(char c)
          Uppercases the character taking into account variant characters (which do not get uppercased)
static String toUpperCase(String text)
          Uppercases the text, taking into account variant characters (which do not get uppercased). Also does not uppercase any substring enclosed in double quotes.
static String truncateString(String visualString, int visualLength)
          Truncate the given visual string, to the visual length given in EBCDIC bytes
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

isValidEncoding

public static boolean isValidEncoding(String encoding)
Validate a character encoding.

Parameters:
encoding - character encoding to validate
Returns:
true = the character encoding is valid and supported by the active Java run environment, or
false = the encoding is null or not supported
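
A comparable check can be written against the standard java.nio.charset API. The sketch below is an assumption about what such a validation involves, not the NlsUtil implementation; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

/** Hypothetical stand-in for an encoding validity check. */
final class EncodingCheck {

    /** Returns true only if the name is non-null, legal, and supported by this Java runtime. */
    static boolean isSupportedEncoding(String encoding) {
        if (encoding == null) {
            return false;                       // Charset.isSupported(null) would throw
        }
        try {
            return Charset.isSupported(encoding);
        } catch (IllegalCharsetNameException e) {
            return false;                       // syntactically illegal name, e.g. an empty string
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupportedEncoding("UTF-8"));
        System.out.println(isSupportedEncoding("no-such-encoding"));
    }
}
```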

isMbcsEncoding

public static boolean isMbcsEncoding(String encoding)
Determine whether a character encoding is DBCS/MBCS. The given encoding is checked against internal tables of DBCS/MBCS character encodings. You must also use isValidEncoding(java.lang.String) to ensure that this character encoding is supported by the active Java run environment.

Parameters:
encoding - canonical name of a character encoding

isEucEncoding

public static boolean isEucEncoding(String encoding)
Determine whether a character encoding is EUC (AIX MBCS). The given encoding is checked against an internal table of EUC character encodings. You must also use isValidEncoding(java.lang.String) to ensure that this character encoding is supported by the active Java run environment.

Parameters:
encoding - canonical name of a character encoding

isSosiEncoding

public static boolean isSosiEncoding(String encoding)
Determine whether a character encoding uses SO/SI control characters - i.e., whether it is an EBCDIC DBCS character encoding. The given encoding is checked against an internal table of EBCDIC DBCS character encodings. You must also use isValidEncoding(java.lang.String) to ensure that this character encoding is supported by the active Java run environment.

Parameters:
encoding - canonical name of a character encoding

isBidiEncoding

public static boolean isBidiEncoding(String encoding)
Determine whether a character encoding is bidirectional. The given encoding is checked against an internal table of bidi (Arabic and Hebrew) character encodings. You must also use isValidEncoding(java.lang.String) to ensure that this character encoding is supported by the active Java run environment.

Parameters:
encoding - canonical name of a character encoding

getNativeEncoding

public static String getNativeEncoding()
Retrieve the native (platform's default) character encoding. The native encoding is normally determined from the "file.encoding" Java system property.

Returns:
the canonical name of the native encoding
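
For illustration, the lookup described above might resemble the following sketch. It assumes the "file.encoding" property is consulted first, with the JVM default charset as a fallback; the class and method names are hypothetical and the actual NlsUtil logic may differ.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of resolving the platform's default encoding name. */
final class NativeEncoding {

    static String nativeEncoding() {
        String name = System.getProperty("file.encoding");
        if (name != null) {
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();   // canonical name
                }
            } catch (IllegalArgumentException e) {
                // illegal charset name in the property: fall through to the default
            }
        }
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(nativeEncoding());
    }
}
```

On recent Java releases the "file.encoding" property may default to UTF-8 regardless of the operating system's locale, so the value returned by a lookup like this does not necessarily reflect the host platform's encoding.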

encodingLength

public static int encodingLength(String s,
                                 String encoding)
Get the byte-length of a Java Unicode String s in the specified character encoding. For certain character encodings the length returned includes control bytes. For example, for EBCDIC DBCS encodings the length includes the SO/SI control characters; for UTF-16, it includes the byte-order mark.
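
The effect of control bytes on the reported length can be seen with the standard String.getBytes conversion. This is a minimal sketch using only JDK calls; ByteLength and byteLength are hypothetical names, and NlsUtil may compute the length differently.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch: byte-length of a String in a given encoding. */
final class ByteLength {

    static int byteLength(String s, String encoding) {
        return s.getBytes(Charset.forName(encoding)).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("A", "UTF-16BE")); // two bytes for the character
        System.out.println(byteLength("A", "UTF-16"));   // byte-order mark plus the character
    }
}
```

With Java's "UTF-16" charset the encoder writes a two-byte byte-order mark ahead of the text, so a one-character string yields 4 bytes, whereas "UTF-16BE" yields 2.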


encodingLength

public static int encodingLength(char c,
                                 String encoding)
Get the byte-length for a string consisting of one Java Unicode character c converted to the specified character encoding. For an EBCDIC DBCS character, this method returns 2 (i.e., the length of the two-byte character itself, without the SO/SI controls). For other character encodings, the length returned may include control bytes.

Returns:
1 if c converts to a single-byte character; 2 if the encoding character is double-byte; n if the encoding character is multi-byte.

encodingCharIndex

public static int encodingCharIndex(String s,
                                    int index,
                                    String encoding)
Return the character index into the encoded string (i.e., as converted from Java Unicode s using the character encoding), which corresponds to the index into text String s.

If the encoding is EBCDIC DBCS, the index returned is positioned away from a SO/SI control character.

Parameters:
s - Java Unicode String
index - ZERO-based index into s
encoding - character encoding
Returns:
ZERO-based index into encoded string
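
The mapping from a String index to an index in the encoded form can be sketched for simple encodings as below. The sketch assumes a stateless encoding without a byte-order mark or SO/SI controls (UTF-8 is used here) and an index that does not split a surrogate pair; EncodedIndex and toEncodedIndex are hypothetical names, and the SO/SI positioning rule mentioned above is not modeled.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch: String index to byte offset in the encoded text. */
final class EncodedIndex {

    /** Byte offset at which the character s.charAt(index) starts in the encoded form of s. */
    static int toEncodedIndex(String s, int index, String encoding) {
        return s.substring(0, index).getBytes(Charset.forName(encoding)).length;
    }

    public static void main(String[] args) {
        // 'h' takes 1 byte and e-acute takes 2 bytes in UTF-8, so index 2 maps to byte offset 3.
        System.out.println(toEncodedIndex("h\u00E9llo", 2, "UTF-8"));
    }
}
```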

indexFromEncodingIndex

public static int indexFromEncodingIndex(String s,
                                         int index,
                                         String encoding)
Return the index into the Java Unicode text String s which corresponds to the index into its encoding string (i.e., as converted using the specified character encoding).

Parameters:
s - Java Unicode String
index - ZERO-based index into the encoding string of s
Returns:
ZERO-based index into s
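
The reverse mapping, from a byte offset in the encoded text back to a String index, can be sketched by accumulating per-character byte widths. As before, this assumes a stateless encoding with no byte-order mark or SO/SI controls; DecodedIndex and fromEncodedIndex are hypothetical names rather than NlsUtil internals.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch: byte offset in the encoded text back to a String index. */
final class DecodedIndex {

    /** Index in s of the character whose encoded bytes contain the given byte offset. */
    static int fromEncodedIndex(String s, int byteIndex, String encoding) {
        Charset cs = Charset.forName(encoding);
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int width = new String(Character.toChars(cp)).getBytes(cs).length;
            if (byteIndex < bytes + width) {
                return i;                       // offset falls inside this character
            }
            bytes += width;
            i += Character.charCount(cp);       // step over a surrogate pair as one unit
        }
        return s.length();                      // offset is at or past the end of the text
    }

    public static void main(String[] args) {
        // Byte offset 2 lies in the second byte of e-acute, which is at String index 1.
        System.out.println(fromEncodedIndex("h\u00E9llo", 2, "UTF-8"));
    }
}
```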

isIgnoringBidiMarks

public static boolean isIgnoringBidiMarks(String strEncoding,
                                          int CCSID)
Returns whether the document might contain ignorable bidi marks.

Bidirectional marks LRM and RLM may be found in files brought over from an iSeries remote, as a result of the conversion from the visual-order EBCDIC iSeries file to a logical-order UTF-8 / Unicode workstation file.

LPEX should ignore these marks for most intents and purposes in this scenario (iSeries Arabic and Hebrew CCSIDs), as they are removed when the file is converted back to the remote.

This method returns true when the source encoding is bidirectional, and the source CCSID defines a visual encoding.

Since:
6.0.1 59827 copied from LPEX 3.0.3
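
To illustrate what ignoring these marks can mean in practice, the sketch below drops LRM (U+200E) and RLM (U+200F) from a string before measuring it. The class and method names are hypothetical, and this is only an assumption about how a caller might treat the marks when this method returns true.

```java
/** Hypothetical sketch: ignore LRM and RLM marks when measuring text. */
final class BidiMarks {

    static final char LRM = '\u200E';
    static final char RLM = '\u200F';

    /** Returns the text with all LRM and RLM marks removed. */
    static String stripMarks(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != LRM && c != RLM) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Length of the text as it would be without the ignorable marks. */
    static int lengthIgnoringMarks(String text) {
        return stripMarks(text).length();
    }

    public static void main(String[] args) {
        System.out.println(lengthIgnoringMarks("a\u200Eb\u200Fc"));
    }
}
```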

isLAM

public static boolean isLAM(char c)
Indicates whether or not the character is an Arabic Lam

Returns:
true if the character is Lam, false otherwise
Since:
6.0.1 59962

isALEF

public static boolean isALEF(char c)
Indicates whether or not the character is an Arabic Alef

Returns:
true if the character is Alef, false otherwise
Since:
6.0.1 59962

isShapedBidiCcsid

public static boolean isShapedBidiCcsid(int ccsid)
Determines whether this iSeries CCSID contains a Lam-Alef ligature.

Parameters:
ccsid - the iSeries CCSID to check

getLamAlefBytes

public static byte[] getLamAlefBytes(int ccsid)
Retrieves the list of iSeries Lam-Alef bytes

Parameters:
ccsid - The CCSID for which to get the lam-alef bytes.
Returns:
an array of bytes which are lam-alefs or null, if there are no lam-alef bytes in the specified CCSID
Since:
6.0.1 59853/58349

isLamAlefByte

public static boolean isLamAlefByte(byte bufferByte,
                                    byte[] lamAlefArray)
Indicates whether or not a byte is a lam-alef

Parameters:
bufferByte - The byte to check
lamAlefArray - The list of lam-alefs to check against
Returns:
true if the byte is in the lam-alef list.
Since:
6.0.1 59853/58349

massageLamAlefs

public static byte[] massageLamAlefs(byte[] buffer,
                                     int ccsid)
Adds a blank for each lam-alef, so that they get transformed from one iSeries byte, to two UNICODE bytes

Parameters:
buffer - The original buffer to massage
ccsid - The ccsid of the buffer
Returns:
A blank padded array which contains parameter buffer
Since:
6.0.1 59853/58349

countLamAlefs

public static int countLamAlefs(String buffer,
                                int ccsid)
Counts the number of Lam-Alefs in a string. Used when going from UNICODE to iSeries bytes, as the transform adds 1 blank per lam-alef to the start of the string. This API doesn't count lam-alefs (i.e., it returns 0) when the CCSID specified doesn't represent lam-alefs as 1 byte.

Parameters:
buffer - The string to check
ccsid - The destination CCSID
Returns:
The number of lam-alefs in the string. If the CCSID stores lam-alefs as 2 bytes, then 0 will be returned.
Since:
6.0.1 59853/58349
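
As a rough illustration of counting lam-alef sequences, the sketch below scans a logical-order string for a Lam (U+0644) immediately followed by one of the Alef forms that combine with it (U+0622, U+0623, U+0625, U+0627). The names are hypothetical, the CCSID check described above is omitted, and the set of Alef forms NlsUtil actually recognizes is an assumption.

```java
/** Hypothetical sketch: count lam + alef pairs in a logical-order string. */
final class LamAlefCounter {

    static boolean isLam(char c) {
        return c == '\u0644';
    }

    /** Assumption: the four Alef forms that have lam-alef ligatures in Unicode. */
    static boolean isAlef(char c) {
        return c == '\u0622' || c == '\u0623' || c == '\u0625' || c == '\u0627';
    }

    static int countPairs(String logical) {
        int count = 0;
        for (int i = 0; i + 1 < logical.length(); i++) {
            if (isLam(logical.charAt(i)) && isAlef(logical.charAt(i + 1))) {
                count++;
                i++;        // the alef belongs to this pair, so skip it
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // seen, lam, alef, meem: contains one lam-alef pair
        System.out.println(countPairs("\u0633\u0644\u0627\u0645"));
    }
}
```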

massageLamAlefs

public static String massageLamAlefs(String buffer,
                                     int ccsid)
Adds a blank for each lam-alef, so that they get transformed from one iSeries byte, to two UNICODE bytes

Parameters:
buffer - The original buffer to massage
ccsid - The ccsid of the buffer
Returns:
A blank padded string which contains parameter buffer
Since:
6.0.1 59853/58349

isLamAlefChar

public static boolean isLamAlefChar(char c)
Indicates whether or not a character is a joined lam-alef in UNICODE

Parameters:
c - The character to check
Returns:
true if the character is a joined lam-alef
Since:
6.0.1 59853/58349
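
In Unicode, the joined lam-alef ligatures are encoded in the Arabic Presentation Forms-B block at U+FEF5 through U+FEFC. A range test like the one below is one plausible way to recognize them; whether NlsUtil uses exactly this range is an assumption, and the names are hypothetical.

```java
/** Hypothetical sketch: recognize a joined lam-alef presentation form. */
final class LamAlefLigature {

    /** True for the lam-alef ligatures U+FEF5..U+FEFC (isolated and final forms). */
    static boolean isLigature(char c) {
        return c >= '\uFEF5' && c <= '\uFEFC';
    }

    public static void main(String[] args) {
        System.out.println(isLigature('\uFEFB'));   // lam with alef, isolated form
        System.out.println(isLigature('\u0644'));   // a plain lam is not a ligature
    }
}
```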

toUpperCase

public static String toUpperCase(String text)
Uppercases the text, taking into account variant characters (which do not get uppercased). Also does not uppercase any substring enclosed in double quotes.

Parameters:
text - The text to uppercase
Returns:
The uppercased text
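
The quote-handling rule can be sketched as a single pass that toggles an "inside quotes" flag. The handling of variant characters is not modeled here because the set of variant characters is not listed in this documentation; QuoteAwareUpper and upperOutsideQuotes are hypothetical names.

```java
/** Hypothetical sketch: uppercase text except for double-quoted substrings. */
final class QuoteAwareUpper {

    static String upperOutsideQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // entering or leaving a quoted substring
                out.append(c);
            } else {
                out.append(inQuotes ? c : Character.toUpperCase(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperOutsideQuotes("say \"hello\" now"));
    }
}
```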

toUpperCase

public static char toUpperCase(char c)
Uppercases the character taking into account variant characters (which do not get uppercased)

Parameters:
c - The character to uppercase
Returns:
The uppercased character
Since:
7.0

getLamAlefsCountInBufferRange

public static int getLamAlefsCountInBufferRange(String buffer,
                                                int len)
Count the number of lam-alef characters in the given text buffer, beginning at the first character, for the given length. If the character at index len is an alef and the following character is a lam, then the count of lam-alefs found is still incremented.

Parameters:
buffer - text buffer in which to count lam-alefs
len - number of characters from the beginning of the buffer in which to count lam-alefs
Returns:
the number of lam-alefs found in the given range

getEBCDICLengthOfLogicalBuffer

public static int getEBCDICLengthOfLogicalBuffer(String buffer)
Return the length of the Unicode buffer in EBCDIC bytes, taking lam-alef characters and LRM and RLM markers into account.

Parameters:
buffer - the Unicode buffer of interest
Returns:
the length of the buffer in EBCDIC bytes

truncateString

public static String truncateString(String visualString,
                                    int visualLength)
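Truncate the given visual string to the visual length given in EBCDIC bytes.

Parameters:
visualString - the visual string to truncate
visualLength - the visual length, in EBCDIC bytes, to truncate to
Returns:
the truncated string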


Copyright © 2011 IBM Corp. All Rights Reserved.

Note: This documentation is for part of an interim API that is still under development and expected to change significantly before reaching stability. It is being made available at this early stage to solicit feedback from pioneering adopters on the understanding that any code that uses this API will almost certainly be broken (repeatedly) as the API evolves.