Mutation Testing Example: How To Leverage Failure by Experimenting

An in-depth guide to leaner code through failing fast

Sep 28, 2022

Photo by Alex Bunardzic

In my article, Hands-on explanation of how TDD works, I showed how iteration can guarantee a solution when a measurable test is available. In that article, an iterative approach helped to determine how to implement code that calculates the square root of a given number.

I also demonstrated that the most effective method is to find a measurable goal or test, then start iterating with best guesses. The first guess at the correct answer will most likely fail, as expected, so the failed guess needs to be refined. The refined guess must be validated against the measurable goal or test. Based on the result, the guess is either validated or must be further refined.

In this model, the only way to learn how to reach the solution is to fail repeatedly. It sounds counterintuitive, but amazingly, it works.

Following in the footsteps of that analysis, this article examines the best way to use a DevOps approach when building a solution containing some dependencies. The first step is to write a test that can be expected to fail.

The Problem With Dependencies Is That You Can’t Depend on Them

The problem with dependencies, as Michael Nygard wittily expresses in Architecture without an end state, is a huge topic better left for another article. Here, you’ll look into potential pitfalls that dependencies tend to bring to a project and how to leverage test-driven development (TDD) to avoid those pitfalls.

First, pose a real-life challenge, then see how it can be solved using TDD.

Who Let the Cat Out?

In Agile development environments, it’s helpful to start building the solution by defining the desired outcomes. Typically, the desired outcomes are described in a user story:

Using my home automation system (HAS)

I want to control when the cat can go outside

Because I want to keep the cat safe overnight

Now that you have a user story, you need to elaborate on it by providing some functional requirements (that is, by specifying the acceptance criteria). Start with the simplest of scenarios described in pseudocode:

Scenario #1: Disable cat trap door during nighttime

Given that the clock detects that it is nighttime
When the clock notifies the HAS
Then the HAS disables the Internet of Things (IoT)-capable cat trap door

Decompose the System

The system you are building (the HAS) needs to be decomposed — broken down to its dependencies — before you can start working on it. The first thing you must do is identify any dependencies (if you’re lucky, your system has no dependencies, which would make it easy to build, but then it arguably wouldn’t be a very useful system).

From the simple scenario above, you can see that the desired business outcome (automatically controlling a cat door) depends on detecting nighttime. This dependency hinges upon the clock. But the clock is not capable of determining whether it is daylight or nighttime. It’s up to you to supply that logic.

Another dependency in the system you’re building is the ability to automatically access the cat door and enable or disable it. That dependency most likely hinges upon an API provided by the IoT-capable cat door.

Fail Fast Toward Dependency Management

To satisfy one dependency, we will build the logic that determines whether the current time is daylight or nighttime. In the spirit of TDD, we will start with a small failure.

Refer to my previous article for detailed instructions on how to set up the development environment and scaffolding required for this exercise. We will be reusing the same .NET environment and relying on the xUnit.net framework.

Next, create a new project called HAS (for “home automation system”) and create a file called UnitTest1.cs. In this file, write the first failing test. In this test, describe your expectations. For example, when the system runs, if the time is 7 p.m., then the component responsible for deciding whether it’s daylight or nighttime returns the value “Nighttime.”

Here is the test that describes that expectation:
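
A minimal sketch of such a test, reconstructed from the description that follows rather than taken from a definitive listing. It assumes an xUnit test project whose namespace is unittest, a production namespace called app, and a GetDayOrNight method that takes no arguments; the field name dayOrNightUtility follows the text. Because DayOrNightUtility has not been written yet, this will not compile until the next step, which is exactly the point of describing the expectation first:

```csharp
using Xunit;
using app; // assumed namespace of the code under test

namespace unittest {
    public class UnitTest1 {
        // The component being described; it does not exist yet at this stage.
        DayOrNightUtility dayOrNightUtility = new DayOrNightUtility();

        [Fact]
        public void Given7pmReturnNighttime() {
            var expected = "Nighttime";
            var actual = dayOrNightUtility.GetDayOrNight();
            Assert.Equal(expected, actual);
        }
    }
}
```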

By this point, you may be familiar with the shape and form of a test. A quick refresher: describe the expectation by giving the test a descriptive name, Given7pmReturnNighttime, in this example. Then in the body of the test, a variable named expected is created, and it is assigned the expected value (in this case, the value “Nighttime”). Following that, a variable named actual is assigned the actual value (available after the component or service processes the time of day).

Finally, the test checks whether the expectation has been met by asserting that the expected and actual values are equal: Assert.Equal(expected, actual).

You can also see in the above listing a component or service called dayOrNightUtility. This module is capable of receiving the message GetDayOrNight and is supposed to return a value of type string.

Again, in the spirit of TDD, the component or service being described hasn’t been built yet (it is merely being described with the intention to prescribe it later). Building it is driven by the described expectations.

Create a new file in the app folder and give it the name DayOrNightUtility.cs. Add the following C# code to that file and save it:
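
One way that first version could look, as a sketch: a placeholder that compiles but returns a value the test does not expect. The namespace app is an assumption, chosen to match the interface shown later in this article:

```csharp
using System;

namespace app {
    public class DayOrNightUtility {
        public string GetDayOrNight() {
            // Placeholder value: enough for the test to compile, run, and fail.
            return "Undetermined";
        }
    }
}
```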

Now, go to the command line, change the directory to the unittest folder, and run the test (typically with the dotnet test command):

[Xunit.net 00:00:02.33] unittest.UnitTest1.Given7pmReturnNighttime [FAIL]
 Failed unittest.UnitTest1.Given7pmReturnNighttime
 [...]

Congratulations, you have written the first failing test. The test expected DayOrNightUtility to return the string value “Nighttime” but instead received the string value “Undetermined.”

Fix the Failing Test

A quick and dirty way to fix the failing test is to replace the value “Undetermined” with the value “Nighttime” and save the change:
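
Sketched with an assumed namespace app and a method that takes no arguments, the quick fix amounts to a single hard-coded literal:

```csharp
using System;

namespace app {
    public class DayOrNightUtility {
        public string GetDayOrNight() {
            // Hard-coded to satisfy the one existing test; no real logic yet.
            return "Nighttime";
        }
    }
}
```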

Now when we run the test, it passes:

Starting test execution, please wait...

Total tests: 1. Passed: 1. Failed: 0. Skipped: 0.
 Test Run Successful.
 Test execution time: 2.6470 Seconds

However, hard-coding the value is basically cheating, so it’s better to endow DayOrNightUtility with some intelligence. Modify the GetDayOrNight method to include some time-calculation logic:
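
A sketch of what that logic could look like. The DateTime line is the one this article later asks you to delete; the namespace app and the overall layout are assumptions:

```csharp
using System;

namespace app {
    public class DayOrNightUtility {
        public string GetDayOrNight() {
            string dayOrNight = "Daylight";
            DateTime time = new DateTime(); // the method picks its own time source
            if(time.Hour < 7) {
                dayOrNight = "Nighttime";
            }
            return dayOrNight;
        }
    }
}
```

One detail worth knowing: in .NET, the parameterless new DateTime() yields the default value (midnight on January 1 of year 1), so Hour is 0 here and the method returns “Nighttime” every time. DateTime.Now is the property that actually reads the system clock. Either way, the time source is buried inside the method, which is the problem the following sections examine.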

The method now obtains a time value internally and checks whether its Hour value is earlier than 7 a.m. If it is, the logic changes the dayOrNight string value from “Daylight” to “Nighttime.” The test now passes.

What About the Daylight Hours?

Next, you need to describe the expectations of what happens when the current time is 7 a.m. or later. Here is the new test, called Given7amReturnDaylight:
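
A sketch of the test class at this stage, assuming the same layout as the first test (namespace unittest, production namespace app, GetDayOrNight taking no arguments). Note that nothing in the new test can tell the method that it is supposed to be 7 a.m.:

```csharp
using Xunit;
using app; // assumed namespace of the code under test

namespace unittest {
    public class UnitTest1 {
        DayOrNightUtility dayOrNightUtility = new DayOrNightUtility();

        [Fact]
        public void Given7pmReturnNighttime() {
            var expected = "Nighttime";
            var actual = dayOrNightUtility.GetDayOrNight();
            Assert.Equal(expected, actual);
        }

        [Fact]
        public void Given7amReturnDaylight() {
            var expected = "Daylight";
            // No way to pass "7 a.m." in: the method chooses its own time.
            var actual = dayOrNightUtility.GetDayOrNight();
            Assert.Equal(expected, actual);
        }
    }
}
```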

The new test now fails (it bears repeating — it is very desirable to fail as early as possible!):

Starting test execution, please wait...
 [Xunit.net 00:00:01.23] unittest.UnitTest1.Given7amReturnDaylight [FAIL]
 Failed unittest.UnitTest1.Given7amReturnDaylight
 [...]

It was expecting to receive the string value “Daylight” but instead received the string value “Nighttime”.

Analyze the Failed Test Case

Upon closer inspection, it seems that our code has trapped itself in a corner. It turns out that the implementation of the GetDayOrNight method is not testable!

Take a look at the core challenges we have:

1. GetDayOrNight relies on hidden input.
The value of dayOrNight is dependent upon the hidden input (it obtains the value for the time of day from the built-in system clock).

2. GetDayOrNight contains non-deterministic behavior.
The value of the time of day obtained from the system clock is non-deterministic. It depends on the point in time when you run the code, which we must consider unpredictable.

3. Low quality of the GetDayOrNight API.
This API is tightly coupled to the concrete data source (system DateTime).

4. GetDayOrNight violates the single-responsibility principle.
You have implemented a method that consumes information and processes information at the same time. It is a good practice that a method should be responsible for only performing a single duty.

5. GetDayOrNight has more than one reason to change.
It is possible to imagine a scenario where the internal source of time may change. Also, it is quite easy to imagine that the processing logic will change. These disparate reasons for changing must be isolated from each other.

6. The API signature of GetDayOrNight is not sufficient when it comes to trying to understand its behavior.
It is very desirable to be able to understand what type of behavior to expect from an API by simply looking at its signature.

7. GetDayOrNight depends on global shared mutable state.
Shared mutable state is to be avoided at all costs!

8. The behavior of the GetDayOrNight method cannot be predicted even after reading the source code.
That is a scary proposition. It should always be very clear from reading the source code what kind of behavior can be predicted once the system is operational.

The Principles Behind What Failed

Whenever you’re faced with an engineering problem, it is advisable to use the time-tested strategy of divide and conquer. In this case, following the principle of separation of concerns is the way to go.

Separation of concerns (SoC) is a design principle for separating a computer program into distinct sections, so that each section addresses a separate concern. A concern is a set of information that affects the code of a computer program. A concern can be as general as the details of the hardware the code is being optimized for, or as specific as the name of a class to instantiate. A program that embodies SoC well is called a modular program.

The GetDayOrNight method should be concerned only with deciding whether the date and time value means daylight or nighttime. It should not be concerned with finding the source of that value. That concern should be left to the calling client.

You must leave it to the calling client to take care of obtaining the current time. This approach aligns with another valuable engineering principle: inversion of control, a concept Martin Fowler explores in detail.

One important characteristic of a framework is that the methods defined by the user to tailor the framework will often be called from within the framework itself, rather than from the user’s application code. The framework often plays the role of the main program in coordinating and sequencing application activity. This inversion of control gives frameworks the power to serve as extensible skeletons. The methods supplied by the user tailor the generic algorithms defined in the framework for a particular application. — Ralph Johnson and Brian Foote

Refactoring the Test Case

Obviously, we need to refactor the code. Get rid of the dependency on the internal clock (the DateTime system utility):

DateTime time = new DateTime();

Delete the above line (which should be line 7 in your file). Refactor your code further by adding an input parameter DateTime time to the GetDayOrNight method.

Here’s the refactored class DayOrNightUtility.cs:
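
A sketch of how the refactored class might look once the caller supplies the time. The 7 p.m. upper boundary (hour 19) is an assumption inferred from the test name Given7pmReturnNighttime, and the namespace app is assumed as well:

```csharp
using System;

namespace app {
    public class DayOrNightUtility {
        // The time is now an explicit input; the method only interprets it.
        public string GetDayOrNight(DateTime time) {
            string dayOrNight = "Nighttime";
            if(time.Hour >= 7 && time.Hour < 19) {
                dayOrNight = "Daylight";
            }
            return dayOrNight;
        }
    }
}
```

With the time passed in, the same input always produces the same output, so the method's behavior can be read directly from its signature and body.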

Refactoring the code requires the tests to change. You need to prepare values for the nightHour and the dayHour and pass those values into the GetDayOrNight method. Here are the refactored tests:
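
A sketch of the refactored tests. The field names nightHour and dayHour come from the text; the calendar date used to build them is arbitrary and purely illustrative, since only the hour matters to the logic. The namespaces unittest and app are assumptions:

```csharp
using System;
using Xunit;
using app; // assumed namespace of the code under test

namespace unittest {
    public class UnitTest1 {
        DayOrNightUtility dayOrNightUtility = new DayOrNightUtility();
        // Fixed, known inputs replace the hidden clock (the date is arbitrary).
        DateTime nightHour = new DateTime(2022, 9, 28, 19, 0, 0);
        DateTime dayHour = new DateTime(2022, 9, 28, 7, 0, 0);

        [Fact]
        public void Given7pmReturnNighttime() {
            var expected = "Nighttime";
            var actual = dayOrNightUtility.GetDayOrNight(nightHour);
            Assert.Equal(expected, actual);
        }

        [Fact]
        public void Given7amReturnDaylight() {
            var expected = "Daylight";
            var actual = dayOrNightUtility.GetDayOrNight(dayHour);
            Assert.Equal(expected, actual);
        }
    }
}
```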

Lessons Learned

Before moving forward with this simple scenario, take a look back and review the lessons in this exercise.

It is easy to inadvertently create a trap by implementing code that is untestable. On the surface, such code may appear to be functioning correctly. However, if we follow Test-Driven Development (TDD) practice — describing the expectations first and only then prescribing the implementation — it immediately reveals serious problems in the code.

This shows that TDD is the ideal methodology for ensuring code does not get too messy. TDD points out problem areas, such as the absence of single responsibility and the presence of hidden inputs. Also, TDD assists in removing non-deterministic code and replacing it with fully testable code that behaves deterministically.

Finally, TDD helps to deliver code that is easy to read because the implemented logic is easy to follow.

Let’s now look into how to use the logic created during this exercise to implement functioning code and how further testing can make it even better.

Disable the Cat Trap Door During Nighttime

Assume the cat door is a sophisticated Internet of Things (IoT) product that has an IP address and can be accessed by sending a request to its API. For the sake of brevity, this series doesn’t go into how to program an IoT device; rather, it simulates the service to keep the focus on test-driven development (TDD) and mutation testing.

Start by writing a failing test:

[Fact]
public void GivenNighttimeDisableTrapDoor() {
   var expected = "Cat trap door disabled";
   var timeOfDay = dayOrNightUtility.GetDayOrNight(nightHour);
   var actual = catTrapDoor.Control(timeOfDay);
   Assert.Equal(expected, actual);
}

This describes a brand new component or service (catTrapDoor). That component (or service) has the capability to control the trap door given the current time. Now it’s time to implement catTrapDoor.

To simulate this service, you must first describe its capabilities by using an interface. Create a new file in the app folder and name it ICatTrapDoor.cs (by convention, an interface name starts with an uppercase letter I). Add the following code to that file:

namespace app{
   public interface ICatTrapDoor {
       string Control(string dayOrNight);
   }
}

This interface is not capable of functioning. It merely describes your intention when building the CatTrapDoor service. Interfaces are a nice way to create abstractions of the services you are working with. In a way, you could regard this interface as an API of the CatTrapDoor service.

To implement the API, create a new file in the app folder and name it FakeCatTrapDoor.cs. Enter the following code into the class file:
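
A sketch of the fake, written to match the description that follows. The interface is repeated at the bottom only so the snippet compiles on its own; in the project it already lives in ICatTrapDoor.cs and should not be duplicated:

```csharp
namespace app {
    public class FakeCatTrapDoor : ICatTrapDoor {
        public string Control(string dayOrNight) {
            string trapDoorStatus = "Undetermined";
            if(dayOrNight == "Nighttime") {
                trapDoorStatus = "Cat trap door disabled";
            }
            return trapDoorStatus;
        }
    }

    // Repeated from ICatTrapDoor.cs so this sketch is self-contained.
    public interface ICatTrapDoor {
        string Control(string dayOrNight);
    }
}
```

In the test class, the catTrapDoor field could then be wired up with something like ICatTrapDoor catTrapDoor = new FakeCatTrapDoor(); (an assumed declaration; the article does not show it).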

This new FakeCatTrapDoor class implements the interface ICatTrapDoor. Its method Control accepts string value dayOrNight and checks whether the value passed in is “Nighttime.” If it is, it modifies trapDoorStatus from “Undetermined” to “Cat trap door disabled” and returns that value to the calling client.

Why is it called FakeCatTrapDoor? Because it’s not a representation of the real cat trap door. The fake just helps you work out the processing logic. Once your logic is airtight, the fake service is replaced with the real service (this topic is reserved for the discipline of integration testing).

With everything implemented, all the tests pass when they run:

Starting test execution, please wait...

Total tests: 3. Passed: 3. Failed: 0. Skipped: 0.
 Test Run Successful.
 Test execution time: 1.3913 Seconds

Enable the Cat Trap Door During Daytime

It’s time to look at the next scenario in our user story:

Scenario #2: Enable cat trap door during daylight

Given that the clock detects the daylight
When the clock notifies the HAS
Then the HAS enables the cat trap door

This should be easy, just the flip side of the first scenario. First, write the failing test. Add the following test to your UnitTest1.cs file in the unittest folder:

[Fact]
public void GivenDaylightEnableTrapDoor() {
   var expected = "Cat trap door enabled";
   var timeOfDay = dayOrNightUtility.GetDayOrNight(dayHour);
   var actual = catTrapDoor.Control(timeOfDay);
   Assert.Equal(expected, actual);
}

You can expect to receive a “Cat trap door enabled” notification when sending the “Daylight” status to the catTrapDoor service. When you run the tests, the new test fails, as expected:

Starting test execution, please wait...
 [Xunit unittest.UnitTest1.UnitTest1.GivenDaylightEnableTrapDoor [FAIL]
 Failed unittest.UnitTest1.UnitTest1.GivenDaylightEnableTrapDoor
 [...]

The test expected to receive a “Cat trap door enabled” notification but instead was notified that the cat trap door status is “Undetermined.” Cool; now’s the time to fix this minor failure.

Adding three lines of code to the FakeCatTrapDoor does the trick:

if(dayOrNight == "Daylight") {
   trapDoorStatus = "Cat trap door enabled";
}

Run the tests again, and all tests pass:

Starting test execution, please wait...

Total tests: 4. Passed: 4. Failed: 0. Skipped: 0.
 Test Run Successful.
 Test execution time: 2.4888 Seconds

Awesome! Everything looks good. All the tests are in green; you have a rock-solid solution. Thank you, TDD!

Not So Fast!

Experienced engineers would not be convinced that the solution is rock-solid. Why? Because the solution hasn’t been mutated yet.

While it seemed that the journey was over with a successful sample Internet of Things (IoT) application to control a cat door, experienced programmers know that solutions need mutation testing.

What’s Mutation Testing?

Mutation testing is the process of iterating through each line of implemented code, mutating that line, then running tests and checking if the mutation broke the expectations. If it hasn’t, you have created a surviving mutant.

Surviving mutants are always an alarming issue that points to potentially risky areas in a code base. As soon as you catch a surviving mutant, you must kill it. And the only way to kill a surviving mutant is to create additional descriptions — new tests that describe your expectations regarding the output of your function or module. In the end, you deliver a lean, mean solution that is airtight and guarantees no pesky bugs or defects are lurking in your code base.

If you leave surviving mutants to kick around and proliferate, live long, and prosper, then you are creating the much dreaded technical debt. On the other hand, if any test complains that the temporarily mutated line of code produces output that’s different from the expected output, the mutant has been killed.

Installing Stryker

The quickest way to try mutation testing is to leverage a dedicated framework. This example uses Stryker.

To install Stryker, go to the command line and run:

$ dotnet tool install -g dotnet-stryker

To run Stryker, navigate to the unittest folder and type:

$ dotnet-stryker

Here is Stryker’s report on the quality of our solution:

14 mutants have been created. Each mutant will now be tested, this could take a while.

Tests progress | 14/14 | 100% | ~0m 00s | 
 Killed : 13
 Survived : 1
 Timeout : 0

All mutants have been tested, and your mutation score has been calculated 
 - \app [13/14 (92.86%)]
 [...]

The report says:

Stryker created 14 mutants
Stryker saw 13 mutants were killed by the tests
Stryker saw one mutant survive the onslaught of the tests
Stryker calculated a mutation score of 92.86% (13 of 14 mutants killed), which it treats as the share of the code that serves the expectations
Stryker calculated that the remaining 7.14% (1 of 14 mutants) points to code that does not serve the expectations

Overall, Stryker claims that the application we’ve built so far failed to produce a reliable solution.

How To Kill a Mutant

When software developers encounter surviving mutants, they typically reach for the implemented code and look for ways to modify it. For example, in the case of the application for cat door automation, change the line:

string trapDoorStatus = "Undetermined";

to:

string trapDoorStatus = "";

and run Stryker again. A mutant has survived:

All mutants have been tested, and your mutation score has been calculated
 - \app [13/14 (92.86%)]
 [...]
 [Survived] String mutation on line 4: '""' ==> '"Stryker was here!"'
 [...]

This time, you can see that Stryker mutated the line:

string trapDoorStatus = "";

into:

string trapDoorStatus = "Stryker was here!";

This is a great example of how Stryker works: it mutates every line of shipping code, in a smart way, in order to see if there are further test cases we have yet to think about. It’s forcing us to consider our expectations in greater depth.

Defeated by Stryker, you can attempt to improve the implemented code by adding more logic to it:
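
One plausible shape for that attempt, sketched so that its line numbers agree with the Stryker report that follows (string literals on lines 4 and 10). Treat it as an illustration of the idea, not as the definitive listing. The interface is repeated only to keep the snippet self-contained:

```csharp
namespace app {
    public class FakeCatTrapDoor : ICatTrapDoor {
        public string Control(string dayOrNight) {
            string trapDoorStatus = "Undetermined";
            if(dayOrNight == "Nighttime") {
                trapDoorStatus = "Cat trap door disabled";
            } else if(dayOrNight == "Daylight") {
                trapDoorStatus = "Cat trap door enabled";
            } else {
                trapDoorStatus = "Undetermined";
            }
            return trapDoorStatus;
        }
    }

    // Repeated from ICatTrapDoor.cs so this sketch is self-contained.
    public interface ICatTrapDoor {
        string Control(string dayOrNight);
    }
}
```

The extra else branch adds one more string literal that no test observes, so it gives the mutation framework one more place to make a change that goes unnoticed.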

But after running Stryker again, you see this attempt created a new mutant:

All mutants have been tested, and your mutation score has been calculated
 - \app [13/15 (86.67%)]
 [...]
 [Survived] String mutation on line 4: '"Undetermined"' ==> '""'
 [...]
 [Survived] String mutation on line 10: '"Undetermined"' ==> '""'
 [...]

You cannot wiggle out of this tight spot by modifying the implemented code. It turns out the only way to kill surviving mutants is to describe additional expectations. And how do you describe expectations? By writing tests.

It’s time to add a new test. Since the surviving mutant is located on line 4, you realize you have not specified expectations for the output with the value “Undetermined.”

Let’s add a new test:

[Fact]
public void GivenIncorrectTimeOfDayReturnUndetermined() {
   var expected = "Undetermined";
   var actual = catTrapDoor.Control("Incorrect input");
   Assert.Equal(expected, actual);
}

The fix worked! Now all mutants are killed:

All mutants have been tested, and your mutation score has been calculated
 - \app [14/14 (100%)]
 [Killed] [...]

You finally have a complete solution, including a description of what is expected as output if the system receives incorrect input values.
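
To see by hand why that one extra test makes the difference, here is a small sketch that imitates what a mutation framework does. It is not how Stryker is implemented; the names MutationDemo, Original, Mutant, and IsKilled are made up for this illustration. A "test" is reduced to an input paired with its expected output, and a mutant counts as killed as soon as any test fails against it:

```csharp
using System;
using System.Collections.Generic;

public static class MutationDemo {
    // The logic under test, mirroring the fake cat door.
    public static string Original(string dayOrNight) {
        string trapDoorStatus = "Undetermined";
        if (dayOrNight == "Nighttime") { trapDoorStatus = "Cat trap door disabled"; }
        if (dayOrNight == "Daylight") { trapDoorStatus = "Cat trap door enabled"; }
        return trapDoorStatus;
    }

    // A hand-made mutant: only the default string literal has been changed.
    public static string Mutant(string dayOrNight) {
        string trapDoorStatus = "";
        if (dayOrNight == "Nighttime") { trapDoorStatus = "Cat trap door disabled"; }
        if (dayOrNight == "Daylight") { trapDoorStatus = "Cat trap door enabled"; }
        return trapDoorStatus;
    }

    // Returns true when at least one test fails against the candidate.
    public static bool IsKilled(Func<string, string> candidate,
                                List<(string input, string expected)> tests) {
        foreach (var (input, expected) in tests) {
            if (candidate(input) != expected) {
                return true;
            }
        }
        return false; // every test passed, so the candidate survived
    }

    public static void Main() {
        var tests = new List<(string input, string expected)> {
            ("Nighttime", "Cat trap door disabled"),
            ("Daylight", "Cat trap door enabled"),
        };
        Console.WriteLine(IsKilled(Mutant, tests));   // False: the mutant survives

        tests.Add(("Incorrect input", "Undetermined"));
        Console.WriteLine(IsKilled(Mutant, tests));   // True: the mutant is killed
        Console.WriteLine(IsKilled(Original, tests)); // False: the real code still passes
    }
}
```

With only the two original expectations, nothing ever observes the default value, so changing it goes unnoticed. The third expectation pins that value down, and the changed literal is caught immediately.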

Mutation Testing to the Rescue

Suppose you decide to over-engineer a solution and add this method to the FakeCatTrapDoor:

private string getTrapDoorStatus(string dayOrNight) {
   string status = "Everything okay";
   if(dayOrNight != "Nighttime" || dayOrNight != "Daylight") {
       status = "Undetermined";
   }
   return status;
}

Then replace the line 4 statement:

string trapDoorStatus = "Undetermined";

with:

string trapDoorStatus = getTrapDoorStatus(dayOrNight);

When you run tests, everything passes:

Starting test execution, please wait...

Total tests: 5. Passed: 5. Failed: 0. Skipped: 0.
 Test Run Successful.
 Test execution time: 2.7191 Seconds

The test has passed without an issue. TDD has worked. But bring Stryker to the scene, and suddenly the picture looks a bit grim:

All mutants have been tested, and your mutation score has been calculated
 - \app [14/20 (70%)]
 [...]

Stryker created 20 mutants; 14 mutants were killed, while six mutants survived. This lowers the success score to 70%. This means only 70% of our code is there to fulfill the described expectations. The other 30% of the code is there for no clear reason, which puts us at risk of misuse of that code.

In this case, Stryker helps fight the bloat. It discourages unnecessary and convoluted logic, because the crevices of such unnecessarily complex logic are where bugs and defects breed.
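
Part of the trouble with this helper can be checked by hand, independently of Stryker: the condition in getTrapDoorStatus is true for every possible input, because no string can equal both “Nighttime” and “Daylight” at once. The sketch below copies the helper into a small console program (the class name TautologyDemo and the Main method are illustrative only) and prints its result for three inputs:

```csharp
using System;

public static class TautologyDemo {
    // Same body as the article's helper, made public and static for the demo.
    public static string GetTrapDoorStatus(string dayOrNight) {
        string status = "Everything okay";
        // Always true: any value differs from at least one of the two literals.
        if(dayOrNight != "Nighttime" || dayOrNight != "Daylight") {
            status = "Undetermined";
        }
        return status;
    }

    public static void Main() {
        foreach (var input in new[] { "Nighttime", "Daylight", "Incorrect input" }) {
            // Prints "Undetermined" for all three inputs.
            Console.WriteLine($"{input} -> {GetTrapDoorStatus(input)}");
        }
    }
}
```

Since “Everything okay” can never be returned, no test can observe it, and changes to that part of the helper have no visible effect. That is presumably one reason several of the new mutants survive, although the report excerpt above does not say which ones.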

Conclusion

As you’ve seen, mutation testing ensures that no uncertain fact goes unchecked.

You could compare Stryker to a chess master who is thinking of all possible moves to win a match. When Stryker is uncertain, it’s telling you that winning is not yet a guarantee. The more tests we record as facts, the further we are in our match, and the more likely Stryker can predict a win. In any case, Stryker helps detect losing scenarios even when everything looks good on the surface.

It is always a good idea to engineer code properly. You’ve seen how TDD helps in that regard. TDD is especially useful when it comes to keeping your code extremely modular. However, TDD on its own is not enough for delivering lean code that works exactly to expectations.

Developers can add code to an already implemented code base without first describing the expectations. That puts the entire code base at risk. Mutation testing is especially useful in catching breaches in the regular test-driven development (TDD) cadence. You need to mutate every line of implemented code to be certain no line of code is there without a specific reason.
