Why Writing Quality Tests Matters More Than Ever
Code design and maintainability are not an issue anymore
I get surprised whenever I meet software developers who insist that they don’t like to write tests. Some even admit that they don’t know how to write tests. Why do I find that surprising?
Well, for starters, anyone who’s doing software development has most likely written quite a lot of code in a third-generation (3GL) programming language, such as C, Java, JavaScript, Python, or Ruby. And everyone knows that code written in those languages is not something computers can understand and execute directly. Before a machine can run a program written in any of those languages, the code has to be translated into machine code (binary code), either by compiling it or by interpreting it.
Once the code written in a 3GL language gets compiled/interpreted, machines can understand it and execute it (machines can run the code). Before that transformation can happen, however, the compiler/interpreter has to verify that the code written by the programmer is correct with regard to the language syntax. If the code does not meet the expectations enforced by the compiler/interpreter, the attempt to compile or interpret it will fail.
If, on the other hand, the code does meet those expectations, the attempt to compile or interpret it will succeed. Basically, the code written by the programmer either passes and gets compiled/interpreted, or it fails. If it fails, the compiler/interpreter will issue an error message explaining what went wrong and, depending on the programming language, may also suggest how to fix the failure.
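The pass/fail nature of that check can be sketched in a few lines of Python. This is an illustrative sketch only; `meets_syntax_expectations` and the two source snippets are made-up examples, using Python's built-in `compile()` to stand in for the compiler/interpreter.

```python
# A compiler/interpreter check behaves like a test: the source either
# meets the language's expectations (pass) or it does not (fail).

valid_source = "total = 1 + 2"
broken_source = "total = 1 +"  # incomplete expression: a syntax error

def meets_syntax_expectations(source: str) -> bool:
    """Return True if the interpreter accepts the source, False otherwise."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(meets_syntax_expectations(valid_source))   # True
print(meets_syntax_expectations(broken_source))  # False
```

Either outcome is binary, exactly like a test assertion: the expectation either holds or it doesn't, and a failure comes with a diagnostic message attached.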
If we now take a step back and examine that process, we will notice that it is identical to the process we follow when writing automated tests. When writing a test, we express our expectations, and then, upon running the test, we see whether our expectation passes or fails. The same is true of program code—we write the code expecting it to compile (or be successfully interpreted), and then, upon running the compiler/interpreter, we see whether our expectation passes or fails.
From this, we see that every computer programmer/coder must, by necessity, be very well versed in the process of writing tests. The mindset when writing computer program code is no different than the mindset when writing a test that will provide an expectation regarding the way the implemented code behaves.
Test Driven Development And Maintainability
One of the main reasons people advocate Test Driven Development (TDD) is the fact that when taking that approach to developing software, we are given an opportunity to aggressively redesign and refactor the code. By having the luxury of a comprehensive regression test suite, we can comfortably move into redesigning/rearranging the structure of the code with minimal or even no risk of jeopardizing the established and accepted functionality.
Why is the ability to refactor aggressively so desirable? Whenever we make changes to the implemented code, the most important factor to keep in mind is maintainability. Just because today we have reached the point where our implemented code fully delivers desired functionality does not mean that we have reached the end of the road. The seemingly incessant demand for ever new functionality of software never lessens. Which means we must keep our code in such good shape that it will be easy and risk-free to modify it, improve it, and enhance it. And there is no better way to get to that point than to fully adopt TDD.
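A tiny example shows why a regression test makes aggressive refactoring safe. The function `describe_total` and its test are invented here for illustration: the test pins down the accepted behaviour, so the implementation underneath can be rearranged without risk.

```python
# A regression test pins down behaviour, so the implementation beneath it
# can be freely restructured. `describe_total` is a hypothetical example.

def describe_total(prices):
    # Refactored version: a clear one-liner that replaced an earlier
    # loop-and-accumulate implementation. The test below did not change.
    return f"Total: {sum(prices):.2f}"

def test_describe_total():
    assert describe_total([1.50, 2.25]) == "Total: 3.75"
    assert describe_total([]) == "Total: 0.00"

test_describe_total()  # passes before and after the refactoring
```

As long as the test keeps passing, the refactoring cannot have jeopardized the established functionality—that is the luxury the regression suite buys us.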
What If The Code Is Not Written By Human Programmers?
I’ve spent a lot of time recently experimenting with writing software code by utilizing the capabilities of various LLM models. Those LLM models appear to possess a lot of detailed and intricate knowledge related to writing software code in various programming languages. I wanted to see how far I could push that capability and how reliable my attempts to leverage LLMs would be.
One problem with LLMs is, as everyone knows, that those systems are not deterministic. Given identical starting conditions (i.e., identical input values), LLMs may provide different answers each time we repeat the experiment. And since computer programming is an exact discipline where non-deterministic behaviour is definitely not desirable, it seems that LLMs are not a good fit for helping us write computer programs.
While I can definitely confirm that the above conclusion holds, I must also report that I have come up with some interesting findings as I was experimenting with creating software using LLMs. As you have probably figured out by now, I am a big proponent of TDD. That is to say, I prefer to always write a test—an expectation—before I even begin writing code. In short, I think about the expectation—what am I expecting this program to do—and then I write a formalized expectation (a test) that can be executed. I then run that test, see it fail, and then move in to write the code that will make the failing test pass.
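That red-green loop can be captured in miniature. The `slugify` function below is a hypothetical example invented for this sketch; the point is the order of events: the expectation exists first, fails, and only then does the implementation appear.

```python
# The TDD loop in miniature: the expectation (test) is written before
# the implementation. `slugify` is a made-up example function.

# Step 1: write the expectation first. Running it at this point fails
# with a NameError, because `slugify` does not exist yet -- "red".
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Step 2: write just enough code to make the failing test pass -- "green".
import re

def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

test_slugify()  # now passes
```

Once the test is green, the cycle repeats: a new expectation, a new failure, just enough code to make it pass.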
So far, so good. And of course, I carried that line of reasoning over to my little experiment with leveraging LLMs to write software. I would therefore write my expectation—my test—run it, see it fail, and then ask the LLM to make the failing test pass. To be more specific, I was using the “chain of thought” approach as implemented, for example, in the DeepSeek R1 model.
As the LLM was working toward making the failing test pass, I was examining the resulting code. I could see that the code was not being generated deterministically, as there were rather wide variances among the different approaches. And as my experiment progressed, I started noticing that I was spending a lot of time reviewing the code produced by the LLM. I was growing concerned that the resulting code was quite messy, often not easy to read and reason about. Basically, it felt like the code the LLM was producing was not very maintainable. And no matter how hard I tried, I was not able to get it to the point where the resulting code was predictably and reliably elegant.
That situation started raising a lot of red flags in my head. Since I am big on clean code practices, to me, maintainability is paramount. Code that is bloated, duplicated all over the place, difficult to read, difficult to reason about, and difficult to modify without breaking something—that’s simply not acceptable. Given the situation, I was this close to throwing in the towel and declaring that it is not advisable to use LLMs as a way to develop software code.
But then, as I kept at it, I came to a sudden realization—as I was adding more tests and getting the LLM to make those tests pass, I noticed that I was less and less curious to examine the code that the LLM produced to make the tests pass. I was getting into the swing of things, coming up with new expectations/tests, and using the LLM to make those tests pass. My app was slowly but surely growing. It looked like I could actually produce a nice, workable system that could be deployed and used by people.
But then I pushed back: “Yeah, sure, this seems to be working. But imagine the poor people who would be saddled with maintaining this bowl of spaghetti code! I shudder to even think about that!”
No, I cannot bring myself to release that messy code, that rat’s nest. It would definitely not be fair to any human being to have to deal with that mess.
But wait a minute. That code was not written by human programmers. If that’s the case, why am I assuming it will be maintained by human programmers? Just as machines wrote that code and made it work (somehow), those same machines could be deployed to maintain it.
Suddenly, my precious obsession with code maintainability flew out the window. Who cares if the resulting code is unmaintainable by humans? So long as machines continue making improvements to it and making all tests pass, that is all we need.
And indeed, looking back at the situation I described at the beginning of this article—software code written by humans being translated to machine code by the machines—I realized that no one ever bothers to go and examine the resulting binary code. Like, why would we do that? OK, maybe the resulting binary code is incredibly messy, but hey, it works, and no one is complaining. Why? Because no human will ever be expected to maintain that binary code.
The same is true of machines writing the 3GL code. They wrote it, and there is no reason for us to ever examine it, because that code will continue to be maintained by the machines.
So, hoorah! No more problem. We have solved the issue of hand-cranking the 3GL code. But now the real problem begins—how many software developers know how to write good tests?
But that’s a topic for another article.
I’d like to thank my friend Albert Meyburgh for reviewing my chain of thoughts regarding this problem domain and for validating my findings, as I expressed them here.