
Wednesday, June 29, 2016

How to detect fake tests - Introduction to Mutation Testing

In the last posts (1, 2, 3) I showed various ways of producing fake tests. Of course, good developers won't fake their tests, and the chance of encountering a test suite made purely of fake tests in real life is rather low. Nevertheless, in certain environments it may occasionally happen that metrics are polished for various reasons. More likely, though, the quality of a test suite deteriorates over time, e.g. due to project pressure, sloppy moments during coding, or wrong assumptions. And typically we rely on metrics to determine whether our project is in good shape.

My intention for the last three posts was to show how easily the common metrics - test count, line and condition coverage - can be tricked and how little value they have without the proper context. They are about as good for determining the health of a software project as lines of code are. They might be a weak indicator, but nothing more.

The main question is: how can we determine the actual value of our tests and test suites? How do others do it? Fire brigades test their procedures and techniques on real fires. The military holds maneuvers, martial arts fighters test their skills in championships, and Netflix lets the Chaos Monkey terminate instances to detect holes in its recovery procedures.

What is the main reason to have automated tests? To detect bugs that slip into existing code unintentionally. It doesn't matter whether you wrote the tests beforehand, practicing Uncle Bob style TDD, or afterwards to create a safety net for your code. The base assumption is that once you've written your code and your tests, the code is free of errors. But it's called software for a reason: it changes over time. The once error-free code will eventually be modified. To ensure it is still functional, the test suite is run, and if everything is green, nothing was broken. But how can you be sure of that?

The only way to verify that your test suite is capable of detecting bugs is to introduce bugs into your code.

The technique of altering your code and re-running your test suite to verify that the test suite detects the change is called Mutation Testing. The concept has been known for quite a while, but it was mostly a subject of academic research, with tools that were somewhat theoretical and less practical to use. Since the arrival of Pitest.org, however, a practical, stable and well-integrated tool has been available that should be in every developer's toolbox.

Pitest mutates bytecode and runs highly parallelized, making it the fastest mutation testing tool for the JVM. Pitest offers a set of Mutation Operators that modify bytecode according to a defined rule set and thus create a modified version of the code, a Mutation. The test suite is run again, and if at least one test fails, the Mutation is killed. In the end, the Mutation Score is calculated from the number of killed Mutations versus the total number of Mutations.
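
To make this concrete, here is a minimal sketch (class, method and test names are invented for illustration). One of Pitest's standard mutation operators negates conditionals, so a Mutation of the method below inverts the if condition. The test pins the expected return value and therefore fails on the mutant, killing it; an assertion-free test would let the mutant survive.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical production code (would live in its own file).
class Discounts {
  int discount(int amount) {
    if (amount > 100) {  // the "negate conditionals" mutator flips this to: if (amount <= 100)
      return 10;
    }
    return 0;
  }
}

public class DiscountsTest {
  @Test
  public void discountAboveThreshold() {
    // Kills the mutant: on the mutated code discount(200) returns 0 and this assertion fails.
    assertEquals(10, new Discounts().discount(200));
  }
}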

Unlike line or branch coverage, which can be determined with a single test suite execution, Pitest requires one test suite execution per mutation. With larger code bases the execution time therefore grows rapidly with the sheer number of mutations. Although Pitest offers a variety of settings and options to limit execution time - e.g. delta execution, selection of mutation operators, or exclusion of classes, to name a few - it requires some thorough planning to incorporate this technique into a CI/CD pipeline. The value it delivers comes with a price.

In the next post of this series I will show how to set up and run Pitest, with practical examples, so stay tuned.

Wednesday, June 22, 2016

How to fake tests (Part 3)

In this third part of the series I want to show how assertions can be faked, so that not only lines and branches get covered but the tests themselves also contain some assertions.
Faking assertions only makes sense if a metric such as "assertions per test" is computed at all. Otherwise you may skip this part, because any proper code review would reveal your tests as fakes.
Test libraries such as JUnit or TestNG contain various means for expressing assertions. In addition, some frameworks exist for that sole purpose, e.g. Hamcrest or Truth, to name a few. The basic approach for all of them is to invoke the system under test (generating coverage information) and to verify the outcomes against assertions.
But the outcome doesn't have to be related to what is declared as expected for the test to succeed. So any of the following assertions might do the trick (a combined sketch follows the list):
  • assertTrue(true);
  • assertNotNull(new Object()); (a real life example I’ve encountered during a code review)
  • assertEquals("2","2");
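
Put together, a faked test might then look like this (a sketch; subject and invokeSomeMethod are placeholders for the system under test):

@Test
public void test() {
  subject.invokeSomeMethod();  // generates coverage
  assertTrue(true);            // always passes
  assertNotNull(new Object()); // always passes
  assertEquals("2", "2");      // always passes
}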

After applying fake assertions and fake coverage, our test suite satisfies the following criteria:
  • Big, lots of tests for all the methods
  • 100% Line Coverage
  • 100% Condition Coverage
  • Tests contain assertions (maybe 1 assertion/method as a metric)
This would make every project manager happy, because the quality of the product is obviously so good and there is proof of it...

Not!

You've probably produced the most sophisticated test suite ever: top quality ratings, minimum effort to create, and no value at all (Achievement unlocked).

In the next post I'll show how all the fakes described in this and the earlier posts can be revealed as such - and, more importantly, how the effectiveness of a test suite can be determined and how gaps in a sensible test suite can be found. So stay tuned.

Thursday, June 9, 2016

How to fake tests (Part 2)

In the last post I described how to write fake tests to satisfy the number-of-tests KPI. Obviously this is not a good practice for software craftsmen. Unfortunately, some organisations value KPIs more than good craftsmanship and may simply be tricked by fake tests. So in today's post I'd like to show you how to fake the line and condition coverage of tests. This is a call to action for decision makers who base their decisions on such numbers: don't trust them. And for developers: if you encounter things like the following (or like those in the last post), fix them. So let's start with line coverage.

Faking Line Coverage

Line Coverage is a metric that measures how many and which lines have been executed during the tests. There are various tools to measure coverage:
  • Jacoco – measures at the bytecode level, which has the advantage that you test your actual artifacts, although bytecode can differ from its source at times.
  • ECobertura, Clover – measure at the source-code level, which is more precise than bytecode measurement but injects additional code before compilation, so you end up with different artifacts than the ones you want to deliver.
When running your tests with line coverage enabled, all lines touched are recorded to produce the metric. If you have 0% line coverage, you didn’t run any code. So let’s extend our test to get some coverage:

@Test
public void test() {
  subject.invokeSomeMethod();
}

Obviously this test is broken because it cannot fail – unless the code itself throws an exception. But with tests like these you can quite easily achieve high line coverage and a stable test suite.
But typical programs are rarely linear and contain some sort of loop or branch constructs, so it's unlikely that you reach 100% line coverage this way. We have to fake branch coverage, too.

 

Faking Condition Coverage

Let's assume our simple program consists of the following code:

String compute(String input) {
  if ("left".equals(input)) {
    return "right";
  }
  return "left";
}

It has one condition with two branches. With a single test, you may get 66% Line Coverage but only 50% Condition Coverage. I have experienced several times that branch coverage is perceived as "better" or of "more value" because it is harder to achieve. If "harder" means "more code", that is certainly true, but branch coverage suffers from the same basic problem as line coverage: it is just a measure of which code is executed, not of how good your tests are. Which one is harder to achieve also depends on the code base. If the happy flow you test covers only a minor part of the code, you may have 50% branch coverage but only 10% line coverage. Given the above example, assume the "left"-branch contains 10 lines of code, but you only test for the "right"-branch.
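
As a sketch of that situation (the method body is invented for illustration), a single test of the short branch yields 50% condition coverage while most of the lines stay unexecuted:

String compute(String input) {
  if ("left".equals(input)) {
    return "right";                    // short branch: the only code the test touches
  }
  // long branch: imagine many more lines of work here that the test below never executes
  String trimmed = input.trim();
  String lowered = trimmed.toLowerCase();
  String prefixed = "out-" + lowered;
  String suffixed = prefixed + "-done";
  return suffixed;
}

@Test
public void testShortBranchOnly() {
  compute("left");                     // 50% condition coverage, far less line coverage
}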

But as we are developers who want to make managers happy, let's fake branch coverage!
Given that we only exercise a single flow per test, we need two tests:

@Test
public void testLeft() {
  String output = compute("left");
}

@Test
public void testRight() {
  String output = compute("right");
}

These tests will produce 100% branch and line coverage and are very unlikely to ever fail.
But again: they are worthless, because we don't check any output of the operation. The operation may return anything without failing the tests. But still, in terms of KPI metrics we achieved:
  • 2 tests for 1 method (great ratio!)
  • 100% line coverage
  • 100% condition coverage
What we are missing is assertions. Assertions postulate the expected outcome of an operation; if the actual outcome differs from the expected one, the test fails. Theoretically it would be possible to count assertions per test with static code analysis, but I've never seen such a metric, although its value would be similar to that of line or condition coverage. Nevertheless: we can fake it!
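
For contrast, honest versions of the two tests above (a sketch against the compute method) would pin the expected outputs and actually fail when the behaviour changes:

@Test
public void testLeft() {
  assertEquals("right", compute("left"));
}

@Test
public void testRight() {
  assertEquals("left", compute("right"));
}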

So in the next post, I'll show you how to fake assertions.

Tuesday, May 31, 2016

How to fake tests (Part 1)

In most projects, metrics play an important role in determining the status, health, quality, etc. of the project. More often than not, the common metrics for quality are
  • Number of Unit Tests (Total, Failed, Successful)
  • Line Coverage
  • Branch Coverage
Usually those "KPIs" (Key Performance Indicators) are used by "managers" to steer the project to success. The problem with these metrics is: they are totally useless when taken out of context - and the context is usually not that well defined in terms of metrics, but often requires knowledge of and insight into the system that is being measured.

This post is about showing how to game the system and life-hack those KPIs to fake good quality. It is NOT a best practice, but a heads-up for those who make decisions based on these metrics: look behind the values.

Faking Number of Unit Tests

Most (if not all?) frameworks count the number of tests executed, how many failed and how many succeeded. A high number of tests is usually perceived as a good indicator of quality, and the number of tests should grow along with the lines of code (another false-friend KPI). But what is counted as a test?

Let's look at JUnit, which is the de facto standard for developing and executing Java-based unit tests; other frameworks such as TestNG follow similar concepts.

In JUnit 3, every parameterless public void method starting with "test" in a class extending TestCase was a test. Since JUnit 4, every method annotated with @Test counts as a test.
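
For illustration, the JUnit 3 variant would look like this (class and method names are invented):

import junit.framework.TestCase;

public class SomethingTest extends TestCase {
  public void testNothing() {
    // counted as a test purely because of the naming convention
  }
}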

That's it. Just a naming convention or an annotation and you have your test, so let's fake it!

@Test
public void test() {
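  // intentionally empty: no code executed, nothing asserted, yet it counts as a passing test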

}

This is pure gold: a stable and ever-succeeding Unit Test!

Copy and paste or even generate those and you produce a test suite satisfying the criteria:
  • Big, tons of tests, probably even more than you have lines of code
  • Stable, none of these tests is failing. Ever.
  • Fast, you have feedback about the success within seconds.
The downside: it's worthless (surprise, surprise!). There are basically two reasons why it's worthless:
  • It doesn’t run any code
  • It doesn’t pose any assertion about the outcome
Good indicators for checking the first one are line and condition coverage analysis. The latter is more difficult to check.

In the upcoming posts we'll have a look into both.