The Holy Java

Building the right thing, building it right, fast

Posts Tagged ‘database’

Making Sense Out of Datomic, The Revolutionary Non-NoSQL Database

Posted by Jakub Holý on June 16, 2013

I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!

Why? Why?!?

As we shall see shortly, Datomic is very different from the traditional RDBMS databases as well as the various NoSQL databases. It even isn’t a database – it is a database on top of a database. I couldn’t wrap my head around that until now. The key to the understanding of Datomic and its unique design and advantages is actually simple.

The mainstream databases (and languages) have been designed around the following constraints of 1970s:

  • memory is expensive
  • storage is expensive
  • it is necessary to use dedicated, expensive machines

Datomic is essentially an exploration of what database we would have designed if we hadn’t these constraints. What design would we choose having gigabytes of RAM, networks with bandwidth and speed matching and exceeding harddisk access, the ability to spin and kill servers at a whim.

Read the rest of this entry »

Posted in Databases | Tagged: , , , , | Comments Off

Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]

Posted by Jakub Holý on November 2, 2007

Often you need to insert a String from Java into a database column with a fixed length specified in bytes.
Using

string.substring(0, DB_FIELD_LENGTH);

isn’t enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.

Approach 1: Replace the invalid trailing byte(s) with a ‘rectangle’

This is as simple as:

int maxLen = DB_FIELD_LENGTH-2;
string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");

The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 – therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn’t valid and this 1 byte was replaced by \uFFFD (3B).

Approach 2: Skip the invalid trailing byte(s) altogether

If you don’t want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:

import java.nio.*; import java.nio.charset.*;
Charset utf8Charset = Charset.forName("UTF-8");
CharsetDecoder cd = utf8Charset.newDecoder();
byte[] sba = string.getBytes("UTF-8");
// Ensure truncating by having byte buffer = DB_FIELD_LENGTH
ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B]
CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
// Ignore an incomplete character
cd.onMalformedInput(CodingErrorAction.IGNORE)
cd.decode(bb, cb, true);
cd.flush(cb);
string = new String(cb.array(), 0, cb.position());

The string will end with the last valid character in the given range.

Approach 3: Manually remove the invalid trailing bytes

As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I’d be glad for the code :-)

Posted in Databases, Languages | Tagged: , , , , | Comments Off