Wednesday, June 29, 2016

How to detect fake tests - Introduction to Mutation Testing

In the last posts (1, 2, 3) I showed various ways for producing fake tests. Of course, good developers won't fake their tests, and the chances to encounter a test suite purely made of fake tests in real life is rather low. Nevertheless, in certain environments it may occasionally happen that metrics are polished for various reasons. But it's more likely, that the quality of a test suites deteriorates over time because of various reasons, i.e. project pressure, sloppy moments during coding, wrong assumptions, etc. And typically we rely on metrics to determine whether our project is in good shape.

My intention for the last three posts was to show, how easy the common metrics - test count, line and condition coverage - can be tricked and are of very low value without the proper context. They are as good for determining the health of a software project as lines of codes are. They might be an weak indicator but nothing more.

The main question is, how could we determine the actual value of our tests and test suites? How would others do it? Firebrigades test their procedures and techniques on a real fire. Military is holding maneuvers, martial arts fighters test their skills in championships, NetFlix is letting the Chaos Monkey terminate instances to detect holes in the recovery procedures.

What is the main reason to have automated tests? To detects bugs that slipped into existing code unintentionally. It doesn't matter if you wrote the tests beforehand by practicing Uncle Bob style TDD or afterwards to create a safepoint for your code. The base assumption is, once you've written your code and your tests, it's free of errors. But it's called Software for a reason: it may change over time. The once written, error-free code will eventually be changed. To ensure, it is still functional, the test suites are run and if it's all green, nothing was broken. But how can you be sure of that?

The only thing to verify your test suite is capable of detecting bugs is to induce bugs in your code.

The technique of altering your code and re-run your test suite to verify the test suite detects the code change is called Mutation Testing. The concept is known for quite a while and was mostly subject to academic research with the tools being somewhat theoretical and less practical to use. But since the arrival of a practical, stable and well integrated tool has been around that should be in every developer's toolbox.

Pitest mutates bytecode and runs highly parallel making it the fastest mutation testing tool for the JVM. Pitest offers a set of Mutation Operators that modify bytecode according to a defined ruleset and thus creates a modified version of the code, a Mutation. The test suite is run again and if at least one test fails, the Mutation is killed. In the end, the Mutation Score is calculated from the number of killed mutations vs the total number of mutations.

Different to line or branch coverage, which can be determined with a single test suite execution, Pitest requires one test suite execution per mutation. With larger code-bases the execution time increases exponentially due to the sheer number of combinations of mutations. Although Pitest offers a variety of settings and options to limit execution time - i.e. delta execution, selection of mutation operators, exclusion of classes, to name a few - it requires some thorough planning how this technique should be incorporated into the CI/CD pipeline. The value it delivers, comes with a price.

In the next post of this series, I will provide examples of how to setup and run pitest with practical examples, so stay tuned.

No comments: