Challenging Myself With Coplien’s Why Most Unit Testing is Waste

James O. Coplien has written in 2014 the thought-provoking essay Why Most Unit Testing is Waste and further elaborates the topic in his Segue. I love testing but I also value challenging my views to expand my understanding so it was a valuable read. When encountering something so controversial, it’s crucial to set aside one’s emotions and opinions and ask: “Provided that it is true, what in my world view might need questioning and updating?” Judge for yourself how well have I have managed it. (Note: This post is not intended as a full and impartial summary of his writing but rather a overveiw of what I may learn from it.)

Perhaps the most important lesson is this: Don’t blindly accept fads, myths, authorities and “established truths.” Question everything, collect experience, judge for yourself. As J. Coplien himself writes:

Be skeptical of yourself: measure, prove, retry. Be skeptical of me for heaven’s sake.

I am currently fond of unit testing so my mission is now to critically confront Coplien’s ideas and my own preconceptions with practical experience on my next projects.

I would suggest that the main thing you take away isn’t “minimize unit testing” but rather “value thinking, focus on system testing, employ code reviews and other QA measures.”

I’ll list my main take-aways first and go into detail later on:

Think! Communicate! Tools and processes (TDD) cannot introduce design and quality for you
The risk mitigation provided by unit tests is highly overrated; it’s better to spend resources where return on investment in terms of increased quality is higher (system tests, code reviews etc.)
Unit testing is still meaningful in some cases
Actually, testing – though crucial – is overrated. Analyse, design, and use a broad repertoire of quality assurance techniques
Regarding automated testing, focus on system tests. Integrate code and run them continually
Human insight beats machines; employ experience-based testing including exploratory testing
Turn asserts from unit test into runtime pre-/post-condition checks in your production code
We cannot test everything and not every bug impacts users and is thus worth discovering and fixing (It’s wasteful to test things that never occur in practice)

I find two points especially valuable and thought-provoking, namely the (presumed) fact that the efficiency/cost ratio of unit testing is so low and that the overlap between what we test and what is used in production is so small. Additionally, the suggestion to use heavily pre- and post-condition checks in production code seems pretty good to me. And I agree that we need more automated acceptance/system tests and alternative QA techniques.

Think!

Unit testing is sometimes promoted as a way towards a better design (more modular and following principles such as SRP). However, according to Kent Beck, tests don’t help you to create a better design – they only help to surface the pain of a bad design. (TODO: Reference)

If we set aside time to think about design for the sake of a good design, we will surely get a better result than when we employ tools and processes (JUnit, TDD) to force us to think about it (in the narrow context of testability). (On a related note, the “hammock-driven development” is also based on the appreciation of the value of thinking.)

Together with thinking, it is also efficient to discuss and explore design with others.

[..] it’s much better to use domain modeling techniques than testing to shape an artefact.

I heartily agree with this.

(Side note: Though thinking is good, overthinking and premature abstraction can lead us to waste resources on overengineered systems. So be careful.)

How to achieve high quality according to Coplien

Communicate and analyse better to minimize faults due to misunderstanding of requirements and miscommunication
Don’t neglect design as faults in design are expensive to fix and hard to detect with testing
Use a broad repertoire of quality assurance techniques such as code reviews, static code analysis, pair programming, inspections of requirements and design
Focus automated testing efforts on system (and, if necessary, integration) tests while integrating code and running these tests continually
Leverage experience-based testing such as exploratory testing
Employ defensive and design-by-contract programming. Turn asserts in unit test into runtime pre-/post-condition checks in your production code and rely on your stress tests, integration tests, and system tests to exercise the code
Apply unit testing where meaningful
Test where risk of defects is high

Tests cannot catch faults due to miscommunication/bad assumptions in requirements and design, which reportedly account for 45% of faults. Therefore communication and analysis are key.

Unit testing in particular and testing in general is just one way of ensuring quality, and reportedly not the most efficient one. It is therefore best to combine a number of approaches for defect prevention and detection, such as those mentioned above. (I believe that in particular formal inspection of requirements, design, and code is one of the most efficient ones. Static code analysis can also help a lot with certain classes of defects.) (Also there is a history of high-quality software with little or no unit testing, I believe.)

As argued below, unit testing is wasteful and provides a low value according to Coplien. Thus you should focus on tests that check the whole system or, if not practical, at least parts of it – system and integration tests. (I assume that the automated acceptance testing as promoted by Specification by Example belongs here as well.) Provide “hooks” to be able to observe the state and behavior of the system under test as necessary. Given modern hardware and possibilities, these tests should run continually. And developers should integrate their code and submit it for testing frequently (even the delay of 1 hour is too much). Use debugging and (perhaps temporary?) unit tests to pinpoint sources of failures.

My thoughts: I agree that whole-system tests provide more value than unit tests. But I believe that they are typically much more costly to create, maintain and run since they, by definition, depend on the whole system and thus have to relate to the whole (or a non-negligible portion) of it and can break because of changes at number of places. It is also difficult to locate the cause of a failure since they invoke a lot of code. And if we should test a small feature thoroughly, there would typically be a lot of duplication in its system tests (f.ex. to get the system to the state where we can start exercising the feature). Therefore people write more lower-level, more focused and cheaper tests as visualized in the famous Test Pyramid. I would love to learn how Mr. Coplien addresses these concerns. In any case we should really try to get at least a modest number of automated acceptance tests; I have rarely seen this.

Experienced testers are great at guessing where defects are likely, simulating unexpected user behavior and generally breaking systems. Experience-based – f.ex. exploratory – testing is thus irreplaceable complement to (automated) requirements-based testing.

Defensive and design-by-contract programming: Coplien argues that most asserts in unit tests are actually pre- and post-condition and invariant checks and that it is therefore better to have them directly in the production code as the design-by-contract programming proposes. You would of course need to generalize them to property-based rather than value-based assertions, f.ex. checking that the length of concatenated lists is the sum of the length of the individual lists, instead of concrete values. He writes:

Assertions are powerful unit-level guards that beat most unit tests in two ways. First, each one can do the job of a large number (conceivably, infinite) of scenario-based unit tests that compare computational results to an oracle [me: because they check properties, not values]. Second, they extend the run time of the test over a much wider range of contexts and detailed scenario variants by extending the test into the lifetime of the product.

If something is wrong, it is certainly better to catch it and fail soon (and perhaps recover) than silently going on and failing later; failing fast also makes it much easier to locate the bug later on. You could argue that these runtime checks are costly but compared to the latency of network calls etc. they are negligible.

Some unit tests have a value – f.ex. regression tests and testing algorithmic code where there is a “formal oracle” for assessing their correctness. Also, as mentioned elsewhere, unit tests can be a useful debugging (known defect localization) tool. (Regression tests should still be preferably written as system tests. They are valuable because, in contrary to most unit tests, they have clear business value and address a real risk. However Mr. Coplien also proposes to delete them if they haven’t failed within a year.)

Doubts about the value of unit testing

In summary, it seems that agile teams are putting most of their effort into the quality improvement area with the least payoff.
…
unless [..] you are prevented from doing broader testing, unit testing isn’t the best you can do

“Most software failures come from the interactions between objects rather than being a property of an object or method in isolation.” Thus testing objects in isolation doesn’t help
The cost to create, maintain, and run unit tests is often forgotten or underestimated. There is also cost incurred by the fact that unit tests slow down code evolution because to change a function we must also change all its tests.
In practice we often encounter unit tests that have little value (or actually a negative one when we consider the associated costs) (F.ex. a test that essentially only verifies that we have set up mocks correctly)
When we test a unit in isolation, we will likely test many cases that in practice do not occur in the system and we thus waste our limited resources without adding any real value; the smaller a unit the more difficult to link it to the business value of the system and thus the probability of testing something without a business impact is high
The significance of code coverage and the risk mitigation provided by unit tests is highly overrated; in reality there are so many possibly relevant states of the system under test, code paths and interleavings that we can hardly approach the coverage of 1 in 10¹².
It’s being claimed that unit tests serve as a safety net for refactoring but that is based upon an unverified and doubtful assumption (see below)
According to some studies, the efficiency of unit testing is much lower than that of other QA approaches (Casper Jones 2013)
Thinking about design for the sake of design yields better results than design thinking forced by a test-first approach (as discussed previously)

Limited resources, risk assessment, and testing of unused cases

One of the main points of the essays is that our resources are limited and it is thus crucial to focus our quality assurance efforts where it pays off most. Coplien argues that unit testing is quite wasteful because we are very likely to test cases that never occur in practice. One of the reasons is that we want to test the unit – f.ex. a class – in its entirety while in the system it is used only in a particular way. It is typically impossible to link such a small unit to the business value of the application and thus to guide us to test cases that actually have business value. Due to polymorphism etc. there is no way, he argues, a static analysis of an object-oriented code could tell you whether a particular method is invoked and in what context/sequence. Discovering and fixing bugs that are in reality never triggered is waste of our limited resources.

For example a Map (Dictionary) provides a lot of functionality but any application uses typically only a small portion of it. Thus:

One can make ideological arguments for testing the unit, but the fact is that the map is much larger as a unit, tested as a unit, than it is as an element of the system. You can usually reduce your test mass with no loss of quality by testing the system at the use case level instead of testing the unit at the programming interface level.

Unit tests can also lead to “test-induced design damage” – degradation of code and quality in the interest of making testing more convenient (introduction of interfaces for mocking, layers of indirection etc.).

Moreover, the fact that unit tests are passing does not imply that the code does what the business needs, contrary to acceptance tests.

Testing – at any level – should be driven by risk assessment and cost-benefit analysis. Coplien stresses that risk assessment/management requires at least rudimentary knowledge of statistics and information theory.

The myth of code coverage

A high code coverage makes us confident in the quality of our code. But the assumption that testing reproduces the majority of states and paths that the code will experience in deployment is rather dodgy. Even 100% line coverage is a big lie. A single line can be invoked in many different cases and testing just one of them provides little information. Moreover, in concurrent systems the result might depend on the interleaving of concurrently executing threads, and there are many. It may also – directly or indirectly – depend on the state of the whole system (an extreme example – a deadlock that is triggered only if the system is short of memory).

I define 100% coverage as having examined all possible combinations of all possible paths through all methods of a class, having reproduced every possible configuration of data bits accessible to those methods, at every machine language instruction along the paths of execution. Anything else is a heuristic about which absolutely no formal claim of correctness can be made.

My thoughts: Focusing on the worst-case coverage might be too extreme, even though nothing else has a “formal claim of correctness.” Even though humans are terrible at foreseeing the future, we perhaps still have a chance to guess the most likely and frequent paths through the code and associated states since we know its business purpose, even though we will miss all the rare, hard-to-track-down defects. So I do not see this as an absolute argument against unit testing but it really makes the point that we must be humble and skeptical about what our tests can achieve.

The myth of the refactoring and regression safety net

Unit tests are often praised as a safety net against inadvertent changes introduced when refactoring or when introducing new functionality. (I certainly feel much safer when changing a codebase with a good test coverage (ehm, ehm); but maybe it is just a self-imposed illusion.) Coplien has a couple of counterarguments (in addition to the low-coverage one):

Given that the whole is greater than a sum of its parts, testing a part in isolation cannot really tell you that a change to it hasn’t broken the system (a theoretical argument based on Weinberg’s Law of Composition)
Tests as a safety net to catch inadvertent changes due to refactoring – “[..] it pretends that we can have a high, but not perfect, degree of confidence that we can write tests that examine only that portion of a function’s behavior that that will remain constant across changes to the code.”
“Changing code means re-writing the tests for it. If you change its logic, you need to reevaluate its function. […] If a change to the code doesn’t require a change to the tests, your tests are too weak or incomplete.”

Tips for reducing the mass of unit tests

If the cost of maintaining and running your unit tests is too high, you can follow Coplien’s guidelines for eliminating the least valuable ones:

Remove tests that haven’t failed in a year (informative value < maintenance and running costs)
Remove tests that do not test functionality (i.e. don’t break when the code is modified)
Remove tautological tests (f.ex. .setY(5); assert x.getY() == 5)
Remove tests that cannot be tracked to business requirements and value
Get rid of unit tests that duplicate what system tests do; if they’re too expensive to run, create subunit integration tests.

(Notice that theoretically, and perhaps also practically speaking, a test that never fails has zero information value.)

Open questions

What is Coplien’s definition of a “unit test”? Is it one that checks an individual method? Or, at the other end of the scale, a one that checks an object or a group of related objects but runs quickly, is independent of other tests, manages its own setup and test data, and generally does not access the file system, network, or a database?

He praises testing based on Monte Carlo techniques. What is it?

Does the system testing he speaks about differ from BDD/Specification by Example way of automated acceptance testing?

Coplien’s own summary

Keep regression tests around for up to a year — but most of those will be system-level tests rather than unit tests.

Keep unit tests that test key algorithms for which there is a broad, formal, independent oracle of correctness, and for which there is ascribable business value.

Except for the preceding case, if X has business value and you can test X with either a system test or a unit test, use a system test — context is everything.

Design a test with more care than you design the code.

Turn most unit tests into assertions.

Throw away tests that haven’t failed in a year.

Testing can’t replace good development: a high test failure rate suggests you should shorten development intervals, perhaps radically, and make sure your architecture and design regimens have teeth

If you find that individual functions being tested are trivial, double-check the way you incentivize developers’ performance. Rewarding coverage or other meaningless metrics can lead to rapid architecture decay.

Be humble about what tests can achieve. Tests don’t improve quality: developers do.

[..] most bugs don’t happen inside the objects, but between the objects.

Conclusion

What am I going to change in my approach to software development? I want to embrace the design-by-contract style runtime assertions. I want to promote and apply automated system/acceptance testing and explore ways to make these tests cheaper to write and maintain. I won’t hesitate to step off the computer with a piece of paper to do some preliminary analysis and design. I want to experiment more with some of the alternative, more efficient QA approaches. And, based on real-world experiences, I will critically scrutinize the perceived value of unit testing as a development aid and regression/refactoring safety net and the actual cost of the tests themselves and as an obstacle to changing code. I will also scrutinize Coplien’s claims such as the discrepancy between code tested and code used.

What about you, my dear reader? Find out for yourself what works for you in practice. Our resources are limited – use them on the most efficient QA approaches. Inverse the test pyramid – focus on system (and integration) tests. Prune unit tests regularly to keep their mass manageable. Think.

Follow-Up

There is a recording of Jim Coplien and Bob Martin debating TDD. Mr. Coplien says he doesn’t have an issue with TDD as practiced by Uncle Bob, his main problem is with people that forego architecture and expect it to somehow magically emerge from tests. He stresses that it is important to leverage domain knowledge when proposing an architecture. But he also agrees that architecture has to evolve and that one should not spend “too much” time on architecting. He also states that he prefers design by contract because it provides many of the same benefits as TDD (making one think about the code, outside view of the code) while it also has a much better coverage (since it applies to all input/output values, not just a few randomly selected ones) and it at least “has a hope” being traced to business requirements. Aside of that, his criticism of TDD and unit testing is based a lot on experiences with how people (mis)use it in practice. Side note: He also mentions that behavior-driven development is “really cool.”

Slides from Coplien’s keynote Beyond Agile Testing to Lean Development

The “Is TDD Dead?” dialogue – A series of conversations between Kent Beck, David Heinemeier Hansson, and Martin Fowler on the topic of Test-Driven Development (TDD) and its impact upon software design. Kent Beck’s reasons for participating are quite enlightening:

I’m puzzled by the limits of TDD–it works so well for algorithm-y, data-structure-y code. I love the feeling of confidence I get when I use TDD. I love the sense that I have a series of achievable steps in front of me–can’t imagine the implementation? no problem, you can always write a test. I recognize that TDD loses value as tests take longer to run, as the number of possible faults per test failure increases, as tests become coupled to the implementation, and as tests lose fidelity with the production environment. How far out can TDD be pushed? Are there special cases where TDD works surprisingly well? Poorly? At what point is the cure worse than the disease? How can answers to any and all of these questions be communicated effectively?

(If you are short of time, you might get an impression of the conversations in these summaries)

Kent Beck’s RIP TDD lists some good reasons for doing TDD (and thus unit testing) (Aside of “Documentation,” none of them prevents you from using tests just as a development aid and deleting them afterwards, thus avoiding accumulating costs. It should also be noted that people evidently are different, to some TDD might indeed be a great focus and design aid.)

Martin Fowler’s Goto Fail, Heartbleed, and Unit Testing Culture (via @dfkaye)

Brian Marick has written a great analysis of the value (and cost) of automated (high-level) testing in his When Should a Test Be Automated? Highly recommended! (See my highlights.)