Posted by Jakub Holý on June 16, 2013
I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!
As we shall see shortly, Datomic is very different from the traditional RDBMS databases as well as the various NoSQL databases. It even isn’t a database – it is a database on top of a database. I couldn’t wrap my head around that until now. The key to the understanding of Datomic and its unique design and advantages is actually simple.
The mainstream databases (and languages) have been designed around the following constraints of 1970s:
- memory is expensive
- storage is expensive
- it is necessary to use dedicated, expensive machines
Datomic is essentially an exploration of what database we would have designed if we hadn’t these constraints. What design would we choose having gigabytes of RAM, networks with bandwidth and speed matching and exceeding harddisk access, the ability to spin and kill servers at a whim.
Read the rest of this entry »
Posted in Databases | Tagged: clojure, database, datomic, fp, performance | Comments Off on Making Sense Out of Datomic, The Revolutionary Non-NoSQL Database
Posted by Jakub Holý on May 31, 2012
This was a rich month, bringing some hope for ORM, providing a peep-hole into the bright and awesome future with in-browser video and other cool web-stuff presented at WebRebels 2012 and IDEs providing immediate feedback and visualisation. There were valuable articles about simplicity and quality in software and good talks about the lean startup (i.e. enabling innovation) and other topics.
- M.Fowler: ORM Hate – Why ORM is actually a good solution – a very valuable article where Fowler opposes the popular trend of criticising Object-Relational Mappers such as Hibernate. Yes, using an ORM is difficult and a leaky abstraction – but that’s because the problem of mapping from a rich in-memory object model to a relational store is inherently difficult (and you need to do it with or w/o an ORM tool) and because those last 10-20% of DB access require human intelligence. If you can avoid the need for ORM by using the relational model also in memory or by using a NoSQL database with a data model that fits your in-memory model then it’s great to do so but often you can’t and then using ORM is the best solution. You certainly don’t want to code your own “lightweight” ORM.
- Communicating Sequential Processes: Theory for reasoning about concurrent, interacting processes – an inspirational reading about a much better way to do concurrency than Java threads; “… [CSP] is a language for describing patterns of interaction between concurrent objects. It is supported by an elegant, mathematical theory, a set of proof tools, and an extensive literature.” The beauty is that thanks to the theory behind, you can actually reason about the interactions and verify their correctness, contrary to the feared mess of Java threads. CSP is broadly similar to the popular Actors model and is implemented in Occam while it also influenced Erlang’s concurrency model and Go. The library JSCP brings it to Java. I guess we’re better of using Actors due to their popularity and maturity though the mathematical backing of CSP with the potential of formal proofs of correctness is indeed attractive. Any of the two is better than using threads directly because:
The monitor-threads model provided by Java, whilst easy to understand, proves very difficult to apply safely in any system above a modest level of complexity. One problem is that monitor methods are tightly interdependent, so that their semantics compose in non-trivial ways […]
- Rich Hickey introduces the Reducers library: simplicity in practice – a beautiful example of simplifying something by taking appart all the unrelated but mingled concerns and focus only on those really needed. Whether you’re interested in Clojure or not, you should read the beginning of the post where Hickey explains how the current collection functions based on first (returns 1st element) and rest (returns the remaining ones) mix too many things (ordering, output representation, etc.) and how this “new super-generalized and minimal abstraction for collections” avoids that and thus provides e.g. for doing things in parallel and composing transformation without producing intermediate collections. Beautiful! (PS: I’ve blogged about more examples of pursuing simplicity & gaining power.)
- M. Fowler: Cannot Measure Productivity – a thoughful discussion of why the productivity of programmers is hard/impossible to measure (i.e. you should concentrate on measuring other, more useful metrics) “[..] false measures only make things worse.”
- Gojko Adzic: Redefining software quality – an obligatory read that introduces a holistic view of SW quality and the quality pyramid. The key idea is that there are multiple, vertically organized facets of quality and once a more basic facet is saturated, you should move and and concentrate on the next facet and level of quality. The quality pyramid: Deployable & functionally OK > Performant & secure > Usable > Useful > Successful. Once a particular level is satisfied, it is wasteful to put more effort into it and you’ll bring much more value to the customer by focusing on the next higher level. Gojko: “Yet from what I see most software teams invest, build and test only at the lowest two levels, gold-plating things without a way to explain why that is bad.”
- Is Pair Programming for Me? – the author, who claims to have taught pair programming to 200+500 people, points out that pair programming is a skill that must be (consciously) learned, or actually a number of inter-personal skills. He also describes the cycle people go through when learning it, including a temporary downswing in productivity and negative view of pairing (therefore people should do it at least for 3-4 weeks to overcome the problems and gain the benefits).
- Bret Victor: Inventing on Principle (55 min, see at least the first 5 min) – very inspiring! Victor firmly believes that “creators need an immediate connection to what they create” and demonstrates how this can be achieve when coding image rendering, a game, an algorithm, when designing a circuit. After watching it for few minutes you will think: How could we have been working with such crappy tools without realizing how limited they are? Fortunately people started to apply the idea of an immediate connection between code and the result, f.ex. in LightTable and Bikeshed‘s IDE. On the other hand, there is an evidence that this may be too hard with the current programming languages.
- Eric Ries: Evangelizing for the Lean Startup – entertaining and enriching introduction into an approach for bringing innovation to life – f.ex. in startups – withou failing unnecessary, demonstrated on the example of the author’s failed and successful startup. Many innovators fail because they don’t realize that their key challenge is that they don’t now neither the problem (who are our customers and what they need) nor the solution (the product to satisfy the need) and thus what they need to do is to experiment and learn in the shortest cycles possible. If you wander what the buzz about lean startup is or how to build innovations, this is the ultimite source you should watch. The video has 1h but the first 20-30 min will give you a sufficient overview. The key points summarized by the Iterate lean guru Anders Haugeto:
– There is only one way to measure progress: Progress == The amount of things you have learned from your real customers
– Hence, you need to work a continuous loop to build, measure and learn as fast as possible. Typical iterations, like sprints, are too long, hence inefficient
– Until you have an established product, even recognized engineering practices like TDD, sprints, refactoring, and all the XP-stuff are less important than this feedback cycle
– Even the perfect agile method is nothing, if you’re using it to build the wrong product: How can you know you are heading in the right direction?
Links to Keep
- E. King: Maximizing the Value of Your Stand-up – interesting techniques to try out at your stand-ups – Speed Scrum, Pass-the-Conch Scrum (passing a token randomly to define the order), Time-Box Scrum, Challenge Scrum (the team may ask 1 question each presenter), Impediments-Only Scrum, Award Scrum (reward for best articulation of his/her information), Business Value-Focused Scrum, No-Board Scrum, Whiteboard Scrum, Buddy Scrum (report for sb. else)
- puppet-lint – check code style of your Puppet files
- Guard – cross-platform tool that can watch for file changes and execute actions (“guards”) when a file is changed, useful e.g. to execute tests/style checks only on the files being modified. Includes support for many testing/checking tools and multiple notification means such as Growl.
- ThreadLogic – Thread dump analyzer that understands common patterns found in application servers and enabling the definition of custom patterns. Supports Sun, IBM, and JRockit.
- Dumbster – mock SMTP server for unit testing (start in @Before, get sent messages in the test, stop afterwards)
Kai Thomas Gilb, in a talk proposal for JavaZone 2012:
Accurate estimation is impossible for complex technical projects, but keeping to agreed budgets, and deadlines is achievable by using feedback and change.
- StackOverflow: Clojure Performance Benchmarks – links to various discussions and benchmarks (beware that older results and facts are likely to be outdated). And of course you must keep in mind that 1) benchmark only measure what they measure, e.g. the outcomes cannot be generalized and that 2) benchmark results aren’t relevant for your problem unless you’re doing exactly the kind of operations being benchmarked (e.g. who cares that X is 100 ms slower if your code spends 1s waiting for a XML file download?) (Craig Andera had a pretty good experience report from webapp performance testing including what (not) to do)
- A good wrapup of the EuroClojure conference by Deon Garrett. Such a pity I missed it!
- Goldberg (at GitHub) – Johann Sebastian Bach’s Goldberg Variations in Overtone by @ctford; using Overtone and Clojure to build up mathematical and functional definitions of canons. Deon Garrett: “Go right now and download the code from Chris’ talk. If you don’t know Clojure, use this as an excuse to learn it – it’s that good.”
Rich Hickey interviewed by M. Fogus:
Reducing incidental complexity is a primary focus of Clojure, and you could dig into how it does that in every area.
In particular, the use of objects to represent simple informational data is almost criminal in its generation of per-piece-of-information micro-languages, i.e. the class methods, versus far more powerful, declarative, and generic methods like relational algebra. Inventing a class with its own interface to hold a piece of information is like inventing a new language to write every short story. This is anti-reuse, and, I think, results in an explosion of code in typical OO applications.
Posted in Databases, General, Testing, Tools, Top links of month | Tagged: agile, clojure, concurrency, design, html5, leanstartup, ORM, scrum, smtp, trends | Comments Off on Most interesting links of May ’12
Posted by Jakub Holý on April 30, 2012
- V. Duarte: Story Points Considered Harmful – Or why the future of estimation is really in our past… (also as 1h video) – thoughtful and data-backed claim that there is a much cheaper way for estimating work throughput than estimating each story in story points (SP) and that is simply counting the stories. Even though their sizes differ, over (not that much) longer periods, where it really matters, these differences will even out. The author argues that estimating in number of stories provides the same reliability and benefits as SP and is much easier. (Keep in mind that estimation is just an attempt at predicting the future and humans are proved to be terrible at doing that; why to pretend that we can do it?) I’d recommand this to anybody doing Scrum and similar.
- M. Fowler: Test Coverage – it’s obvious that increasing test coverage for the sake of test coverage it’s a nonsense but some people still need to be reminded of it :-). Fowler explains what the real benefit of test coverage measurements is and how to use it for good instead of for evil.
- Brian Marick: How to Misuse Code Coverage (pdf) – cited a lot by Fowler in his article, this is really a good paper. Marick has participated in the development of several code coverage tools and understands well their limitations. One of the key points is that code coverage tools can discover only one class of test weakness (not testing some paths through your code) but cannot discover that you are missing some code you should have (e.g. when you check only for two of three possible return values). Thus the code coverage metric tells you “this code isn’t well tested, are you sure you don’t to look more into it”? It’s crucial not to write tests so as to increase the code coverage; look at the code and improve the test without any regard for coverage. You may thus decrease the likeliness of both the class of problems.
- A Year with MongoDB – Kiip has found out that Mongo isn’t the best choice for them (having 240GB, 500+ operations/s, 85M docs and their specific usage of the store) and migrated to the combination of Riak (key-value store) and PostgreSQL. Some of the issues they hit are slow counts and limit/offset queries due to using non-counting B-trees for indexing, memory management that could be more intelligent and tuned for the use to make sure the data needed is indeed in RAM, no built-in support for compressing key names (their size adds up as they’re repeated in each document; you’ve to compress them [user -> u etc.] in the client if you want to), limited concurrency due to process-wide write lock (which becomes a problem if the write’s aren’t short enough w.r.t. number of ops/s, e.g. because data isn’t in RAM and/or the query is complicated), safe settings (waiting for a write to finish, …) off by default, offline-only table compaction (w/o it the disk usage grows unbounded). The lessons learnt for me: Know your storage, its weaknesses and intended way of usage, and make sure it matches your needs.
- Rudolf Winestock: The Lisp Curse – Lisp’s expressive power is actually a cause of its lack of momentum because it’s so easy to implement anything that people have no need to join forces and thus there are many half-baked (“works-for-me”) solutions for anything – but no complete, generally accepted one. An interesting essay. “Lisp is so powerful that problems which are technical issues in other programming languages are social issues in Lisp.”
- Understanding JDBC Internals & Timeout Configuration – the article itself could have been written better but it conveys the important information that configuring timeouts for JDBC isn’t trivial because they need to be set correctly at different levels and without a socket timeout set in a driver-specific way it can hang forever if the DB cannot be reached due to network/system failure
- Circos: An Amazing Tool for Visualizing Big Data – this article is interesting primarily for its combination of Google Analytics API, Neo4J and an unusual data visualization with circular graphs
- CRaSH: Extensible shell for the JVM (docs) – a shell that you can embedd into a web server as a WAR, run standalone or attach to a running JVM, connect to it via SSH or Telnet, and use it to execute commands against the JVM. Some commands: configure loggers, control threads, monitor the system (mem, threads, ..), connect/issue queries via JDBC. More commands can be written in Groovy. There is a whole set of commands for working with JCR. Pluggable authentication.
Posted in Databases, General, Testing, Tools, Top links of month | Tagged: agile, bigdata, nosql, Testing | Comments Off on Most interesting links of April ’12
Posted by Jakub Holý on November 30, 2011
- Recommended Reading by Poppendiecks – an excellent selection, starting with Lean from Trenches, Management 3.0, Specification by Example, The Lean Startup etc.
- Eric Allman says that Programming Isn’t Fun Any More because problem solving has been replaced with learning, configuring, and integrating tons of libraries, frameworks, and tools and many people agree with that (as discussion on reddit proves). In other words we tend to go for any benefit we can have without considering the costs and for “easy” solutions without considering the true enemy: complexity. Perhaps we should always listen to the Rich Hickey’s Simple Made Easy talk before we add a lib/tool/framework?
- Dean Wampler claims that functional programming can bring the joy back – “[..] a functional language, Scala, Clojure, Haskell, etc. will greatly reduce the amount of code you create. That won’t solve the problem of trying to integrate with too many libraries, but you’ll be less tempted. I also believe those libraries will be less bulky, etc.“
- Few quotes from a related article by M. Taylor: To put it another way, libraries make excellent servants, but terrible masters. | [..] frameworks [..] do keep their promise of making things very quick and easy … so long as you do things in exactly the way the framework author intended | On libraries: [..] we all assume (I know I do) that “plug in solutions X1 and X3″ is going to be trivial. But it never is — it’s a tedious exercise in impedance-matching, requiring lots of time spent grubbing around in poorly-written manuals [..] | On the effect of language choice: [..] different languages, with their different expressive power and especially their different culture, yield very different experiences.
- To sum it up: Choose your tools and libraries wisely and always mind the global complexity. More usually means worse.
- Java Magazine – Adam Bien: Stress Testing Java EE 6 Applications (page 41+) – do developer stress testing! Using: JMeter, VisualVM to find out resource consumption and behavior in the application, VisualVM’s Sampler profiling tool [cca 20% overhead], a webapp to extract metrics from GF (STM)
- Java Magazine – Polyglot Programming on the JVM (page 50; excerpt from The Well-Grounded Java Developer) – why you should consider polyglot programming and how to decide whether to use it and what languages to pick, f.ex.: “These [Java’s] qualities make the language a great choice for implementing functionality in the stable layer [of the polyglot programming pyramid]. However, these same attributes become a burden in the middle and upper [lower, DSL, on the linked image] tiers of the pyramid; for example: Recompilation is laborious; Static typing can be inflexible and lead to long refactoring times; Deployment is a heavyweight process; Java’s syntax is not a natural fit for producing DSLs.” “There is a wide range of natural use cases for *alternative languages*. [after identifying such a UC] You *now need to evaluate* whether using an alternative language is appropriate.”
- Intrusion Detection for Web Apps – Detection Points – If security is a concern of your web application then you should build intrusion detection into the application f.ex. leveraging the OWASP AppSensor project. The key is to detect malicious/unexpected behavior and proactively do something such as locking the user out or alerting the admins. The page linked above lists some common suspicious behaviors such as the use of multiple usernames, unexpected HTTP command/method, additional/duplicated data in request. Worth checking out!
- Yammer Moving From Scala to Java– Scala is a cool language but sometimes its cost is higher than the benefits. Snippets from the post: “…the friction and complexity that
comes with using Scala instead of Java isn’t offset by enough productivity benefit or reduction of maintenance burden …”. “Scala, as a language, has some profoundly interesting ideas in it. […] But it’s also a very complex language. The number of concepts I had to explain to new members of our team for even the simplest usage of a collection was surprising: implicit parameters, builder typeclasses, ‘operator overloading’, return type inference, etc. etc.” (It’s claimed that only library authors need to know some of that but if it’s a part of library APIs, the users need to understand it too.) Notice that the author isn’t saying “Scala is bad” but only that Scala isn’t the best balance of their needs at this time, as Alex Miller put it*
: The text wasn’t intended for publication and it is a private opinion of a Yammer developer, not the company itself. You should read the official Yammer’s position
where Coda puts it into the right context.
- Opportunistic Refactoring by Martin Fowler – refactor on the go – how & why
- Michael Feathers: Getting Empirical about Refactoring – gather information that helps us understand the impact of our refactoring decisions using data from a SCM, namely File Churn (frequency of changes, i.e. commits) vs. Complexity – files with both high really need refactoring. Summary: “If we refactor as we make changes to our code, we end up working in progressively better code. Sometimes, however, it’s nice to take a high-level view of a code base so that we can discover where the dragons are. I’ve been finding that this churn-vs.-complexity view helps me find good refactoring candidates and also gives me a good snapshot view of the design, commit, and refactoring styles of a team.“
UIs and Web Frameworks
- Devoxx 2011 – WWW: World Wide Wait? A Performance Comparison of Java Web Frameworks (slides) – the authors did extensive performance testing of some of the most popular web frameworks. Of course it’s always hard to guess how general their results are, if/how they apply to one’s particular situation, and if they aren’t distorted in some way but it’s worth for their approach alone (AWS with its CloudWatch monitoring, WebDriver, additional measurement of page load with HAR and a browser plugin). In their particular tests GWT scored best, followed by Spring MVC, with JSF and Wicket lagging far behind (especially the MyFaces implementation). Conclusion: A web framework may have strong impact on performance and scalability, if they are important for you then do test the performance early with as realistic code and load as possible.
- JSF2 – Benchmark datatable by N. Labrot, 2/2011 – performance comparison of PrimeFaces 2.2.1, IceFaces 2.0, Richfaces 4.0.0M4 on a simple page with Ajax. I do not trust any benchmark that I don’t fake myself 🙂 (for there are always too many factors that influence the conclusions to be drawn) but it’s interesting anyway – and perhaps a good thing to do before you decide for a JSF component library.
- Alex MacCaw: Asynchronous UIs – the future of web user interfaces and the Spine framework – users in 2011 shouldn’t anymore wait for pages to load and operations to complete, we should build asynchronous UIs where changes to the UI are performed immediately while a request to the server is sent in the background, similarly to sending e-mail in GMail, which returns at once displaying a non-intrusive “Sending…” notification. As a user I very much agree with Alex.
- Matt Raible’s 20 criteria for evaluating web frameworks, 2010 (detailed description, here’s a brief list) – Matt’s results are disputable and as he himself says you should always do your own evaluation and spikes but the criteria are pretty useful: Developer Productivity, Developer Perception, Learning Curve, Project Health, Developer Availability, Job Trends, Templating, Components, Ajax, Plugins or Add-Ons, Scalability, Testing, i18n and l10n, Validation, Multi-language Support (Groovy / Scala), Quality of Documentation/Tutorials, Books Published, REST Support (client and server), Mobile / iPhone Support, Degree of Risk.
Don’t use MongoDB via @nicolaiarocci – a (fake?!) bad experience with MongoDB – the text is not credible (the author is anonymous, s/he doesn’t explicitely state which version of MongoDB they used, the 10gen CTO can’t find a matching client and any evidence for some of the issues mentioned) but it gives context for the read-worthy response from the 10gen CTO, and a post that nicely explains how to correctly design for MongoDB. A comment about MongoDB experience at Forsquare: “Currently we have dozens of MongoDB instances across several different data clusters storing over a TB of data and handling 10s of thousands of requests per second (mostly reads but the write load is reasonably high as well).Have we run into problems with MongoDB along the way? Yes, of course we have. It is a new technology and problems happen.Have they been problematic enough to seriously threaten our data? No they have not.“
- Martin Fowler on Polyglot Persistence – the are when will be choosing persistence solution with respect to our needs instead of mindlessly picking RDBMS is coming. Applications will combine multiple, specific solutions, f.ex. we could pick Redis (key-value) for caching, MongoDB (document DB) for product catalog, Neo4J (graph DB) for recommendations, RDBMS for financial data and reporting… (of course not all in one project!). Polyglot persistence will come at a cost (complexity, learning) – but it will come because the benefits are worth it – performance, data storage model and behavior more aligned with the business logic (NoSql databases ofer various models and tradeoffs and thus we can find a much better fit than with general-purpose RDBMs).
Talks & Video
- Adam Bien’s JavaOne talk Java EE 6: The Cool Parts (1h) – absolutely worth the time – a very practical fly through the cool features of Java EE (eventing, ..), most of the time is spent actually coding. Don’t forget to check also the interesting discussion below the video (JEE and other frameworks, Java FX and JSF 2, …).
- Jurgen Appelo’s keynote How to Change the World at Smidig 2011 is well done and highly useful. We all strive to change the world around us – as consultants we want to make our clients more agile, as team members we want to make our Scrum teams more self-organizing, as employees we want to help building knowledge-sharing and open culture, … . However it isn’t easy to influence or change people and culture and if we aren’t aware of all the dimensions of a change (system, individuals, interactions, environment) and how to work along each of them, we are much less likely to succeed. The knowledge and experience that Jurgen shares with us can help us a lot in having an impact. You can also download the slides and change management questions.
- Project X: What is being a programmer like? (5min) If ever again a non-geek asks you what you as a developer are doing, just show him this short and extremely funny video (created by my ex-employer – perhaps they estimated how much time and energy developers loose trying to explain it to normal people and decided to prevent this great waste :-))
- RSA Animate – Drive: The surprising truth about what motivates us (10 min) – entertaining and enlightening; once we’ve enough money to cover our needs, it’s autonomy (self-direction), mastery, and purpose what motivates us (money actually decrease our performance). Now this is a great evidence for lean/agile – for they’re based on making people self-directing and encourage mastery (as in continous integration and top quality to enable steady pace). Autonomy enables engagement as does a higher purpose (“make the world a better place”) – Steve Jobs with his visions was able to provide such a purpose. Atlassian’s FedEx Days are a good example of what engagement and benefits autonomy brings.
- Simon Sinek: How great leaders inspire action (18 min, subtitles in 37 languages) – do you want to succeed, to change the world around you for the better, to start a new company? Then you must start by communicating “why” you do what you do, not “what” – like M. L. King, bro Wrights, and Apple. Very inspiring! (More in his Why book.)
Links to Keep
Refactoring is like advertising: it doesn’t cost, it pays.
– Mary & Tom Poppendiecks, Implementing Lean Software Development, p.166
Posted by Jakub Holý on August 5, 2011
I’ve recently introduced here DbUnit Express, a wrapper around DbUnit intended to get you started with testing DB-related code in no time, and now I’d like to share two productivity tips: simplifying db tester setup with a parent test and implementing your own convention for data set files, for example one data set per test class.
Read the rest of this entry »
Posted in Databases, Languages, Testing | Tagged: dbunit, DbUnitExpress, java | 1 Comment »
Posted by Jakub Holý on July 27, 2011
Update: The project has been renamed to dbunit-express.
DbUnit Express is my wrapper around DbUnit that intends to make it extremely easy to set up a test of a code that interacts with a database. It is preconfigured to use an embedded Derby database (a.k.a. JavaDB, part of SDK) and uses convention over configuration to find the test data. It can also with one call create the test DB from a .ddl file. Aside of simplifying the setup it contains few utilities to make testing easier such as getDataSource() (essential for testing Spring JDBC) and RowComparator.
Originally I was using DbUnit directly but I found out that for every project I was copying lot of code and thus I decided to extract it into a reusable project. Since that it has further grown to be more flexible and make testing even easier.
Here are the seven easy steps to have a running database test:
Read the rest of this entry »
Posted in Databases, Languages, Testing | Tagged: dbunit, java, open_source, Testing | 1 Comment »
Posted by Jakub Holý on September 20, 2010
We all know that one coarse-grained operation is more efficient than a number of fine-grained ones when communicating over the network boundary but until recently I haven’t realized how big that difference may be. While performing a simple query individually for each input record proceeded with the speed of 11k records per hour, when I grouped each 100 queries together (with “… WHERE id IN (value1, .., value100)), all 200k records were processed in 13 minutes. In other words, using a batch of the size 100 led to the speed-up of nearly two orders of magnitude!
The moral: It really pays of to spend a little more time on writing the more complex batch-enabled JDBC code whenever dealing with larger amounts of data. (And it wasn’t that much more effort thanks to Groovy SQL.)
Posted in Databases, Languages | Tagged: java, jdbc, performance, sql | Comments Off on The power of batching or speeding JDBC by 100
Posted by Jakub Holý on November 2, 2007
Often you need to insert a String from Java into a database column with a fixed length specified in bytes.
isn’t enough because it only cuts down the number of characters but in UTF-8 a single character may be represented by 1-4 bytes. But you cannot just turn the string into an array of bytes and use its first DB_FIELD_LENGTH elements because you could end up with an invalid UTF-8 character at the end (one that is represented by 2+ bytes while only its 1st byte fits into the field). There are two solutions for truncation the string in such a way, that it has at most DB_FIELD_LENGTH bytes and is a valid UTF-8 string.
Approach 1: Replace the invalid trailing byte(s) with a ‘rectangle’
This is as simple as:
int maxLen = DB_FIELD_LENGTH-2;
string = new String( string.getBytes("UTF-8") , 0, maxLen, "UTF-8");
The new String constructor will automatically replace any invalid character (i.e. incomplete utf-8 char; we may only have one at the end) with the character \uFFFD, which looks like an empty rectangle. This character requires 3 bytes in utf-8 – therefore we decrease DB_FIELD_LENGTH by 2; the resulting string will have either exactly maxLen bytes if its last byte(s) is a valid utf-8 character or maxLen+2 bytes if it isn’t valid and this 1 byte was replaced by \uFFFD (3B).
Approach 2: Skip the invalid trailing byte(s) altogether
If you don’t want to have the rectangle character in the place of a split multibyte character, you must do yourself what the String constructor does internally, in a bit different way:
import java.nio.*; import java.nio.charset.*;
Charset utf8Charset = Charset.forName("UTF-8");
CharsetDecoder cd = utf8Charset.newDecoder();
byte sba = string.getBytes("UTF-8");
// Ensure truncating by having byte buffer = DB_FIELD_LENGTH
ByteBuffer bb = ByteBuffer.wrap(sba, 0, DB_FIELD_LENGTH); // len in [B]
CharBuffer cb = CharBuffer.allocate(DB_FIELD_LENGTH); // len in [char] <= # [B]
// Ignore an incomplete character
cd.decode(bb, cb, true);
string = new String(cb.array(), 0, cb.position());
The string will end with the last valid character in the given range.
Approach 3: Manually remove the invalid trailing bytes
As you can see, the approach 2 requires quite lot of coding and method calls. If you know the details of UTF-8, namely how to distinguish an invalid byte or byte sequence then you can simply truncate the byte array and then remove/replace the invalid bytes. I’d be glad for the code 🙂
Posted in Databases, Languages | Tagged: database, java, jdbc, text, utf8 | Comments Off on Truncating UTF String to the given number of bytes while preserving its validity [for DB insert]
Posted by Jakub Holý on November 2, 2006
I tried to create a database but couldn’t because of “SQL1005N The database alias “W3IBMDB” already exists in either the local database directory or system database directory.“. I knew I had once such a database, but I thought it existed no more. And indeed “list database directory” didn’t list it. So it was there and it wasn’t there.
It seems that DB2 remembers the old databases if you don’t remove them in the correct way (whatever that is). The steps to fix this were:
- catalog db w3ibmdb
- check “list database directory” to verify that it is listed now
- drop database W3IBMDB
- create database W3IBMDB — Voila – no more error messages
Posted in Databases | Tagged: db2 | Comments Off on Kill a zombie database (not in the directory but can’t create it)
Posted by Jakub Holý on October 19, 2006
To find out current locks in a DB and other information about its current state, you need to take a “snapshot”:
db2 "attach to user using "
db2 "get snapshot for applications on " > mydatabase_snapshot.txt
Look into the resulting mydatabase_snapshot.txt for “Application ID holding” (if there is a lock, this will read the ID of the aplication holding the lock). To find long queries [that are currently running], look for “Executing”.
Posted in Databases | Tagged: db2, performance, troubleshooting | Comments Off on DB2: Find out current locks, long transactions etc. [snapshot]