Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]
Posted by Jakub Holý on November 2, 2007
Often you need to insert a String from Java into a database column with a fixed length specified in bytes.
isn’t enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.
Approach 1: Replace the invalid trailing byte(s) with a ‘rectangle’
This is as simple as:
int maxLen = DB_FIELD_LENGTH-2; string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");
The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 – therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn’t valid and this 1 byte was replaced by \uFFFD (3B).
Approach 2: Skip the invalid trailing byte(s) altogether
If you don’t want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:
import java.nio.*; import java.nio.charset.*; Charset utf8Charset = Charset.forName("UTF-8"); CharsetDecoder cd = utf8Charset.newDecoder(); byte sba = string.getBytes("UTF-8"); // Ensure truncating by having byte buffer = DB_FIELD_LENGTH ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B] CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B] // Ignore an incomplete character cd.onMalformedInput(CodingErrorAction.IGNORE) cd.decode(bb, cb, true); cd.flush(cb); string = new String(cb.array(), 0, cb.position());
The string will end with the last valid character in the given range.
Approach 3: Manually remove the invalid trailing bytes
As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I’d be glad for the code
Sorry, the comment form is closed at this time.