Continuous Delivery Digest: Ch.9 Testing Non-Functional Requirements

(Cross-posted from

Digest of chapter 9 of the Continuous Delivery bible by Humble and Farley. See also the digest of ch 8: Automated Acceptance Testing.

Non-functional requirements (NFRs; “cross-functional” might be a better name, as they too are crucial for functionality):

  • f.ex. security, usability, maintainability, auditability, configurability but especially capacity, throughput, performance
  • performance = time to process 1 transaction (tx); throughput = # tx per period of time; capacity = max throughput we can handle under a given load while maintaining acceptable response times.
  • NFRs determine the architecture so we must define them early; all (ops, devs, testers, customer) should meet to estimate their impact (it costs to increase any of them and often they go contrary to each other, e.g. security and performance)
  • as appropriate, you can either create specific stories for NFRs (e.g. capacity; => easier to prioritize explicitly) and/or add them as requirements to feature stories
  • if poorly analyzed then they constrain thinking, lead to overdesign and inappropriate optimization
  • only ever optimize based on facts, i.e. realistic measurements (if you don’t know it: developers are really terrible at guessing the source of performance problems)
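
The three definitions above can be made concrete with a small sketch. The `Sample` type, the function names, and the choice of the mean as the response-time measure are assumptions made for illustration; the book does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One completed transaction: start time and duration, both in seconds."""
    start: float
    duration: float


def performance(samples):
    """Mean time to process one transaction, in seconds."""
    return sum(s.duration for s in samples) / len(samples)


def throughput(samples):
    """Transactions completed per second over the observed window."""
    window_end = max(s.start + s.duration for s in samples)
    window_start = min(s.start for s in samples)
    return len(samples) / (window_end - window_start)


def capacity(runs, max_response_time):
    """Highest throughput among runs whose mean response time is acceptable.

    `runs` maps a load label to the samples measured at that load.
    Returns None when no run met the response-time limit.
    """
    acceptable = [
        throughput(samples)
        for samples in runs.values()
        if performance(samples) <= max_response_time
    ]
    return max(acceptable, default=None)
```

Note that capacity is only meaningful together with the response-time limit: a run with higher throughput does not count if individual requests became too slow.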

A strategy to address capacity problems:

  1. Decide upon an architecture; beware process/network boundaries and I/O in general
  2. Apply stability/capacity patterns and avoid antipatterns – see Release It!
  3. Other than that, avoid premature optimization; prefer clear, simple code; never optimize without proof that it is necessary
  4. Make sure your algorithms and data structures are suitable for your app (O(n) etc.)
  5. Be extremely careful about threading (-> “blocked threads anti-pattern”)
  6. Create automated tests asserting the desired capacity; they also will guide you when fixing failures
  7. Only profile to fix issues identified by tests
  8. Use real-world capacity measures whenever possible – measure in your prod system (# users, patterns of behavior, data volumes, …)

Measuring Capacity

There are different possible tests, f.ex.:

  • Scalability testing – how do the response time of an individual request and the number of possible concurrent users change as we add more servers, services, or threads?
  • Longevity t. – see performance changes when running for a long time – detect memory leaks, stability problems
  • Throughput t. – #tx/messages/page hits per second
  • Load t. – capacity as a function of load, up to and beyond the prod-like volumes; this is the most common
  • it’s vital to use realistic scenarios; technical benchmark-style measurements (# reads/s from DB, …), on the other hand, can sometimes be useful to guard against specific problems, to optimize specific areas, or to choose a technology
  • systems do many things so it’s important to run different capacity tests in parallel; it’s impossible to replicate prod traffic => use traffic analysis, experience, intuition to achieve as close a simulation as possible
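
Running different capacity tests in parallel could look roughly like the sketch below. It assumes each scenario is a plain callable and uses threads from the standard library; `run_scenario`, `run_in_parallel`, and the scenario names in the usage note are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_scenario(name, action, iterations):
    """Run one scenario repeatedly and record the duration of each call."""
    durations = []
    for _ in range(iterations):
        started = time.perf_counter()
        action()
        durations.append(time.perf_counter() - started)
    return name, durations


def run_in_parallel(scenarios, iterations=50):
    """Run every scenario concurrently, one worker thread per scenario.

    `scenarios` maps a scenario name to a zero-argument callable.
    Returns {name: [durations in seconds]}.
    """
    with ThreadPoolExecutor(max_workers=len(scenarios)) as pool:
        futures = [
            pool.submit(run_scenario, name, action, iterations)
            for name, action in scenarios.items()
        ]
        return dict(future.result() for future in futures)
```

A call such as `run_in_parallel({"browse": browse, "checkout": checkout})` would then exercise both scenarios at once, so that their interactions (shared locks, caches, connection pools) show up in the measurements.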

How to Define Success or Failure

  • tip: collect measurements (absolute values, trends) during the testing and present them in a graphical form to gain insight into what happened
  • too strict limits will lead to intermittent failures (f.ex. when the network is overloaded by another operation), whereas too relaxed limits won’t discover a partial drop in capacity =>
    1. Aim for stable, reproducible results – isolate the test env as much as possible
    2. Tune the pass threshold up once the test passes at the minimum acceptable level; back it down if the test starts failing after a commit for a well-understood and acceptable reason
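
One way to picture the threshold tuning is a small helper that proposes a new pass threshold from recent results. This is a sketch under assumptions: throughput in tx/s is the measured value, and the function names and the 10 % noise margin are made up for illustration. Backing the threshold down remains a human decision, so it is not automated here.

```python
def passes(measured_tps, threshold_tps):
    """A run passes when the measured throughput meets the threshold."""
    return measured_tps >= threshold_tps


def proposed_threshold(minimum_tps, recent_tps, margin=0.9):
    """Suggest a pass threshold based on recent passing runs.

    Takes the worst recent result, leaves some headroom for noise
    (`margin`), and never goes below the minimum acceptable level.
    """
    if not recent_tps:
        return minimum_tps
    return max(minimum_tps, min(recent_tps) * margin)
```

Basing the proposal on the worst recent run rather than the average keeps the threshold from being so tight that normal variation causes intermittent failures.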

Capacity-Testing Environment

  • replicates Prod as much as possible; extrapolation from a different environment is highly speculative, unless based on good measurements. “Configuration changes tend to have nonlinear effect on capacity characteristics.” p234
  • an exact replica of Prod is sometimes impossible or not sensible (small project, capacity of little importance, or when prod has 100s of servers) => capacity testing can be done on a subset of prod servers as a part of Canary Releasing, see p263
  • scaling is rarely linear, even if the app is designed for it; if the test env is a scaled-down prod, do a few scaling runs to measure the size effect
  • saving money on a downscaled test env is a false economy if capacity is critical; no matter what, it won’t be able to find all issues and it will be expensive to fix them later – see the story on p236
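
The size effect from a few scaling runs can be summarized as a scaling efficiency, as in this sketch. The function name and the input shape are assumptions; the inputs are whatever throughput you measured at each environment size.

```python
def scaling_efficiency(runs):
    """Compare measured throughput with ideal linear scaling.

    `runs` maps a server count to the throughput measured with that
    many servers. The smallest configuration is the baseline, so its
    efficiency is 1.0; values below 1.0 mean sub-linear scaling.
    """
    base_servers = min(runs)
    per_server = runs[base_servers] / base_servers
    return {
        servers: measured / (servers * per_server)
        for servers, measured in sorted(runs.items())
    }
```

If the efficiency keeps dropping as servers are added, extrapolating linearly from the scaled-down environment to Prod will overestimate capacity.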

Automating Capacity Testing

  • it’s expensive but if important, it must be a part of the deployment pipeline
  • these tests are complex, fragile, easily broken with minor changes
  • Ideal tests: use real-world scenarios; predefine success threshold; relatively short duration to finish in a reasonable time; robust wrt. change to improve maintainability; composable into larger-scale scenarios so that we can simulate real-world patterns of use; repeatable and runnable sequentially or in parallel => suitable both for load and longevity testing
  • start with some existing (robust and realistic) acceptance tests, adapt them for capacity testing – add a success threshold and the ability to scale up


  1. Create realistic, prod-like load (in form and volume)
  2. Test realistic but pathological real-life loading scenarios, i.e. not just the happy path; tip: identify the most expensive transactions and double/triple their ratio
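
The tip about raising the ratio of expensive transactions can be sketched as a re-weighting of the observed traffic mix. The function name, the input shape, and the transaction names in the usage note are hypothetical.

```python
def stress_mix(observed, expensive, factor=2.0):
    """Build a pathological transaction mix from the observed one.

    `observed` maps a transaction type to its share of real traffic.
    Shares of the types listed in `expensive` are multiplied by
    `factor`, then everything is normalized to sum to 1 again.
    """
    weighted = {
        name: share * (factor if name in expensive else 1.0)
        for name, share in observed.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}
```

For example, with an observed mix of 80 % `browse` and 20 % `checkout`, doubling `checkout` yields roughly two thirds `browse` and one third `checkout`.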

To scale up, you can record the communication generated by acceptance tests, postprocess it to scale up (multiply, insert unique data where necessary), and replay it at high volume

  • Question: Where to record and play back:
    1. UI – realistic but impractical for 10,000s of users (and expensive)
    2. Service/public API (e.g. HTTP req.)
    3. Lower-level API (such as a direct call to the service layer or DB)

Testing via UI

  • Not suitable for high-volume systems, when too many clients are necessary to generate a high load (partially due to UI client [browser] overhead); also expensive to run many machines
  • UI condenses a number of actions (clicks, selections) into few interactions with the back-end (e.g. 1 form submission), which has a more stable API. The question to answer: are we interested in the performance of the clients or of the back-end?
  • “[..] we generally prefer to avoid capacity testing through the UI.” – unless the UI itself or the client-server interaction is of concern

Recording Interactions against a Service or Public API

  • run acceptance tests, record in/outputs (e.g. SOAP XML, HTTP), replace what must vary with placeholders (e.g. ${ORDER_ID}), create test data, merge the two
  • Recommended compromise: Aim to change as little as possible between instances of a test – less coupling between the test and test data, more flexible, less fragile. Ex.: unique orderId, customerId but same product, quantity.
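
Merging a recorded interaction with test data can be sketched with Python's `string.Template`, which happens to use the same `${NAME}` placeholder syntax. The recorded XML, the id formats, and the `generate_requests` helper are invented for illustration; only the order and customer ids vary, while product and quantity stay fixed, as in the recommended compromise.

```python
from string import Template

# A recorded request with the varying parts replaced by placeholders.
# The message format is hypothetical.
RECORDED_ORDER = Template(
    '<order id="${ORDER_ID}" customer="${CUSTOMER_ID}">'
    '<item product="widget" quantity="2"/>'
    "</order>"
)


def generate_requests(template, count, first_id=1):
    """Yield `count` request bodies, each with unique ids filled in."""
    for n in range(first_id, first_id + count):
        yield template.substitute(
            ORDER_ID=f"ORD-{n}",
            CUSTOMER_ID=f"CUST-{n}",
        )
```

`substitute` raises `KeyError` when a placeholder has no value, which is useful here: a recording with a forgotten placeholder fails loudly instead of being replayed with a literal `${...}` in it.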

Using Capacity Test Stubs To Develop Tests

In high-performance systems, testing may fail because the tests themselves do not run fast enough. To discover this case, run them initially against a no-op stub of the application.
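
A minimal sketch of that check: drive the no-op stub as fast as the test harness can and compare the achieved rate with the load you intend to generate. All names and the 2x headroom factor are assumptions.

```python
import time


def noop_stub(request):
    """Stand-in for the application: accepts a request and does nothing."""
    return None


def harness_ceiling(send, requests):
    """Requests per second the test driver achieves when calling `send`."""
    started = time.perf_counter()
    count = 0
    for request in requests:
        send(request)
        count += 1
    elapsed = time.perf_counter() - started
    return count / elapsed if elapsed > 0 else float("inf")


def harness_is_fast_enough(ceiling_rps, target_rps, headroom=2.0):
    """True when the harness alone can exceed the target load with headroom."""
    return ceiling_rps >= target_rps * headroom
```

If `harness_ceiling(noop_stub, requests)` comes out close to the target load, a later failure against the real application may be caused by the test driver rather than by the system under test.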

Adding Capacity Tests to the Deployment Pipeline

  • beware that warm-up time may be necessary (JIT, …)
  • for known hot spots, you can add simple “guard tests” already to the commit stage
  • typically we run them separately from acceptance tests – they’ve different environment needs, perhaps are long-running, we want to avoid undesirable interactions between acceptance and capacity tests; acceptance test stage may include a few performance smoke tests
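
The warm-up point can be illustrated by a measurement loop that runs the operation a number of times before it starts recording. This is a sketch only; the function name and the default iteration counts are assumptions, and suitable values depend on the runtime (the effect matters most on JIT-compiled platforms such as the JVM).

```python
import time


def measure(operation, warmup=100, measured=1000):
    """Time `operation`, ignoring the first `warmup` calls.

    Warm-up calls are executed but not timed, so one-off costs
    (JIT compilation, cache filling, connection setup) do not
    distort the recorded durations. Returns durations in seconds.
    """
    for _ in range(warmup):
        operation()
    durations = []
    for _ in range(measured):
        started = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - started)
    return durations
```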

Other Benefits of Capacity Tests

Composable, scenario-based tests enable us to simulate complex interactions; together with a prod-like env we can:

  • reproduce complex prod defects
  • detect/debug memory leaks
  • evaluate impact of garbage collection (GC); tune GC
  • tune app config and 3rd party app (OS, AS, DB, …) config
  • simulate worst-day scenarios
  • evaluate different solutions to a complex problem
  • simulate integration failures
  • measure scalability with different hardware configs
  • load-test communication with external systems even though the tests were originally designed for stubbed interfaces
  • rehearse rollback
  • and many more …

Published by Jakub Holý

I’m a JVM-based developer since 2005, consultant, and occasionally a project manager, working currently with Iterate AS in Norway.