TDD...it works?

For the last six months or so, I've spent a lot of time reading about the SOLID principles and listening to lectures about agile development practices by the likes of Bob Martin, Dave Thomas, and Martin Fowler.  The result has been a new and different understanding of what "agile" really means.

One of the agile practices I've really warmed up to is test-driven development. Uncle Bob advocates very strongly for it, and it was one of his talks that got me to finally try it again.  I'm too lazy to find the exact quote, but it was something like:

What if you had a button on your keyboard that you could press and it would tell you if your code was working?  How would that change your life?  Test driven development gives you that button.

Of course, that's a bit hyperbolic, but the point still stands.  The idea that you could have a tool that could tell you, "yup, this code works as intended" is really powerful.  Even if building such a tool is a lot of work, it seems like it would be worth it.

But the thing is, I've tried TDD before.  In fact, I've tried it for about a week or so every year for about the last five years.  And you know what?  It never really worked for me.  It was awkward, the tests were brittle and hard to read, and it just didn't seem to help me much at all.  I came away thinking that it was a stupid fad that didn't help and didn't really make sense. 

But things were different this time.  I'd spent months immersing myself in the SOLID principles, object-oriented design, and techniques for writing good tests.  This time I was ready.  It turns out that knowing how to structure your code and your tests makes the entire process infinitely smoother and simpler.  Heck, it's even kind of fun.

TDD Results

So how did it go this time?  Well, I'd say it was fairly successful.  I've been able to actually use TDD and stick with it without feeling like I was wasting my time.  That alone is a big improvement from my previous attempts.  But I can go deeper than that.  Thanks to the data I've collected using the PSP, I can actually do some analysis of just how well TDD is working for me.

About the Process

Before getting into the details, let's talk about what my process looked like before and what it looks like now.  That will help give some context to the numbers and make it more obvious why they look the way they do.

My process is based on the (now deprecated) PSP 3.  For each task, I start with a quick high-level design sketch, followed by one or more development cycles, each of which consists of design, design review, code, code review, and test.  The "design" phase consisted of conventional design plus some skeleton coding (defining interfaces, defining UI components, outlining algorithms, etc.), the "code" phase consisted of fleshing out those outlines and implementing the rest of the design, and the "test" phase was writing unit tests followed by manual system-level testing.

The new TDD-based process changes that a bit.  The "design" phase is very similar, but I've been going into less detail on the algorithms involved.  The idea was to let the detailed design "emerge" in the process of writing tests, though I've started moving away from that a little.  The "code" phase is now a series of TDD cycles, so it encompasses both actual coding and unit testing.  The "test" phase has become just manual testing.  I did this because, realistically, trying to track test time versus code time when I'm switching back and forth every few minutes was going to be tedious and error-prone, and I didn't want to do it.

Unfortunately, this makes the analysis much harder because we're not comparing apples to apples.  Both processes did try to achieve comprehensive unit test coverage, so they're not completely incomparable, but the change in definition of the "testing" and "coding" phases does make certain comparisons moot.

About the Data

I've been collecting PSP data on the coding tasks I've had over the last six months at my current job.  Currently I have 25 data points - 15 using the old process and 10 using TDD.  The data is collected on a per-task basis, and the task sizes vary wildly, from about an hour and a couple dozen lines of code, to two weeks and a couple thousand lines of code.  The tasks are all webdev work and include a combination of back-end PHP, front-end functionality in JavaScript, and HTML/CSS display stuff in varying proportions.

Of course, the variety in the individual tasks undermines the data a bit, but that can't really be helped.  I'm just working with the tasks I'm given - I don't have the time to do a genuinely scientific comparison.  I mean, I do actually need to get some work done at some point - I'm not getting paid to analyze my own performance.

The Results - Summary

Overall, the TDD-based process seems to be working as well as, if not better than, what I'll call the "conventional" process.  Below is a summary table of overall metrics:

Summary                    Plan (TDD)   Actual (TDD)   Plan (Conv.)   Actual (Conv.)
LOC/Hour                   53.2         58.4           38.4           48.9
Total Time                 73:18:00     107:41:00      84:32:00       102:30:00
Total New & Changed LOC    3903         6290           3249           5017
Test Defects/KLOC          21.3         15.3           14.4           26.7
Total Defects/KLOC         53.1         39.4           65.4           58.6
Yield %                    55.90%       61.00%         59.00%         54.20%
% Appraisal COQ            16.00%       15.30%         16.40%         16.20%
% Failure COQ              32.30%       15.30%         34.00%         38.00%
PSP summary data for TDD vs. conventional process.
Definitions: Yield % = percentage of defects removed before testing, % Appraisal COQ = percentage of time spent on review activities, % Failure COQ = percentage of time spent in testing.
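
To make those definitions a little more concrete, here's a rough sketch of how the three quality metrics fall out of the time and defect logs.  This is just an illustration in TypeScript - the record shape and field names are invented for the example, not taken from my actual PSP tooling.

    // Hypothetical per-task log record, invented for this example.
    interface TaskLog {
      reviewMinutes: number;     // time spent in review phases (appraisal)
      testMinutes: number;       // time spent in the test phase (failure)
      totalMinutes: number;      // total time across all phases
      defectsBeforeTest: number; // defects removed before reaching test
      totalDefects: number;      // all defects recorded for the task
    }

    function qualityMetrics(log: TaskLog) {
      return {
        // Yield %: share of defects removed before testing.
        yieldPct: 100 * log.defectsBeforeTest / log.totalDefects,
        // % Appraisal COQ: share of total time spent on reviews.
        appraisalCoqPct: 100 * log.reviewMinutes / log.totalMinutes,
        // % Failure COQ: share of total time spent in testing.
        failureCoqPct: 100 * log.testMinutes / log.totalMinutes,
      };
    }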

For the two data sets, the total time and code volume were both substantial and in the same neighborhood, so they're at least comparable.  The TDD data shows higher productivity (in LOC/hour), fewer defects, better review yield, and lower cost of quality.

So this means that TDD is definitively better, right?

Well...no.  The change in definition of the "code" phase means that I'm logging fewer defects in general.  Part of my defect standard is that I don't record a defect if it's introduced in the same phase where it's fixed.  So with TDD, little things like typos and obvious logic errors are caught by unit tests in the code phase, so they don't show up at all in these numbers.  In the conventional process, they wouldn't have been picked up until code review or testing.  The same reasoning applies to the failure cost-of-quality - since writing unit tests is now part of the code phase rather than the test phase, less time would naturally be spent in testing because there's simply less work that needs to be done in that phase now.

However, the productivity difference is still interesting, as is the improvement in percent yield. Those do suggest that there's some improvement from using TDD.

Drilling Down - Time

So since the change in phase definition makes defect numbers incomparable, let's look at time data instead.  This table shows the percentage of time spent in each phase.

Phase                        Actual % (TDD)   Actual % (Conv.)
Planning                     3.08%            4.60%
High-level Design            1.42%            1.59%
High-level Design Review     0.87%            1.04%
Detailed Design              8.05%            12.30%
Detailed Design Review       4.27%            4.59%
Code                         51.90%           22.10%
Code Review                  10.20%           10.50%
Test                         15.30%           38.00%
Postmortem                   4.91%            5.35%
Percentage of time by phase.

 The results here are largely unsurprising.  The percentages are roughly equivalent for most of the phases, with the notable exceptions of code and test. 

However, it is interesting to note that there's still a discrepancy between the total code+test time for the two processes.  Theoretically, based on the phase definitions, that combination should encompass the same work for both processes.  Yet the TDD process spent 67.2% of its time on coding and testing (51.9% + 15.3%), versus 60.1% (22.1% + 38.0%) for the conventional process.  It appears that the single largest chunk of that disparity comes out of design time, which makes sense given that there was less detailed design in the TDD process.

Note that the review times were roughly equivalent for each process - in fact, they were slightly lower with TDD.  Yet the percent yield for reviews was higher with TDD.  This gives us some evidence that the difference in percent yield is attributable to the use of TDD.

It's not entirely clear why this would be the case.  The most obvious expectation would actually be a lower yield, since with TDD fewer bugs survive the code phase for reviews to find.  My working hypothesis, however, is that using TDD lowers the cognitive load of code review by removing most of the noise.  With TDD, you already know the code works.  You don't have to review for things like typos, syntax errors, or misused functions because you've already tested for those things.  That leaves you more time and energy to dig into more substantive issues of design, requirements coverage, and usability.

Back to Defects

Hmm....  Maybe we should take another look at the defect data after all.  We've already established that number of defects isn't going to be useful due to the change in phase definition.  But what about average fix times?  If TDD is weeding out the simpler bugs so that they don't escape the coding phase and don't get recorded, we'd expect the average fix time to be higher.

                      Found in     Found in other   Total    Found in      Found in other   Total
                      test (TDD)   phases (TDD)     (TDD)    test (Conv.)  phases (Conv.)   (Conv.)
Injected in HLD
  Tot. fix time       -            28.4             28.4     -             18.2             18.2
  Tot. defects        -            9                9        -             13               13
  Avg. fix time       -            3.16             3.16     -             1.4              1.4
Injected in Design
  Tot. fix time       448.4        321              769.4    271.1         241.6            512.7
  Tot. defects        48           57               105      51            63               114
  Avg. fix time       9.34         5.63             7.33     5.32          3.83             4.5
Injected in Code
  Tot. fix time       223.2        293.3            516.5    341.6         230.4            572
  Tot. defects        39           91               130      78            84               162
  Avg. fix time       5.72         3.22             3.97     4.38          2.74             3.53
Total Injected
  Tot. fix time       724.9        647.1            1372     629.3         500.1            1129.4
  Tot. defects        96           160              256      134           165              299
  Avg. fix time       7.55         4.04             5.36     4.7           3.03             3.78
Defect count and fix time (in minutes) by injection phase.

The above table shows the breakdown of defect count and fix time by injection phase.  So what does this tell us?  Well, my hypothesis that TDD would produce higher average fix times seems to be correct.  If we look at defects injected in code and found in test, the average fix time is about 30% higher with TDD (5.72 minutes vs. 4.38).

However, it's worth noting that the TDD data has higher average fix times across the board.  It's not entirely clear why this should be.  One possible explanation is that the use of TDD means that unit tests are introduced earlier in the process, so defects fixed in code and code review would require changes to the test suite, whereas in the conventional process that would only happen for defects found in test.  That's something I'll have to watch in the future.

Conclusion

So far, my new TDD-based process seems to be working out pretty well.  The results are somewhat ambiguous, but the numbers suggest that TDD is resulting in higher LOC/hour productivity and more efficient defect removal. 

From a more human perspective, it has the benefit of making unit testing much less tedious and painful.  It gets you earlier feedback on whether your code is working and gives a nice sense of satisfaction as you watch that list of test cases grow.

But more importantly, TDD is a great way to test your test cases.  When you're writing your unit tests after the fact, it's very easy to get stuck in a situation where a test is unexpectedly failing (or, worse, passing) and you can't tell whether the problem is with your test case or with the code under test.  With TDD, you're changing things in very small increments, so it's much easier to pinpoint the source of problems like that.
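
To make that concrete, here's a minimal sketch of a single red-green increment, using TypeScript and Node's built-in test runner purely for illustration - the slugify() function is an invented example, not code from any of the tasks above.

    // The test at the bottom was written first and watched fail, which is
    // what proves the test can actually detect a broken implementation.
    import { test } from 'node:test';
    import assert from 'node:assert/strict';

    // Step 2 (green): the minimal implementation, written only after the
    // failing test existed.
    function slugify(title: string): string {
      return title
        .trim()
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, '-')  // collapse punctuation/spaces into hyphens
        .replace(/^-|-$/g, '');       // strip leading/trailing hyphens
    }

    // Step 1 (red): written before slugify() existed.
    test('slugify collapses punctuation and spaces into hyphens', () => {
      assert.equal(slugify('  Hello, World!  '), 'hello-world');
    });

When the test goes from red to green after a change that small, there's not much question about which side of the assertion was at fault.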

But the big lesson here is that TDD isn't something you can just jump into.  You have to understand the context and the design techniques first.  If you do, then it's pretty great.  If not, then you're going to have a hard time and end up wondering who came up with this harebrained idea.

The anti-agile

I think I've put my finger on the reason that, despite being a rookie in its practice, I'm so enamoured of Test-Driven Development. It's because TDD, despite being an "agile" programming practice, is the very inverse of agile methods. It's good, old-fashioned design. The only difference is that it's dressed up to look like programming.

At least that's what I get from reading Dave Astels' thoughts on TDD. He's an advocate of Behavior-Driven Development, an evolution of Test-Driven Development in which the emphasis shifts from testing units of code to describing the desired behavior of the system. The idea is that rather than concentrating on writing a test for each class method, you figure out what the system should do and design your test cases around that, rather than around the program structure. (Apologies if I mangled that explanation - I only learned about BDD today.)

This is somewhat similar to the other expansion of the initialism "TDD" that I discovered early in my research - Test-Driven Design. Note that the emphasis here is on using the unit tests as a design tool, i.e. specifying the behavior of the system. In other words, the unit tests are not written to do testing per se, but rather as a vehicle for expressing the low-level system specification.
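
As a rough illustration of that shift in emphasis (again in TypeScript, with an invented shopping-cart example - not anything from Astels), compare a test named after a method with one named after a behavior:

    import { test } from 'node:test';
    import assert from 'node:assert/strict';

    class Cart {
      private items: { price: number; qty: number }[] = [];
      add(price: number, qty: number) { this.items.push({ price, qty }); }
      total(): number {
        return this.items.reduce((sum, i) => sum + i.price * i.qty, 0);
      }
    }

    // Structure-oriented: one test per method, mirroring the class's API.
    test('Cart.total()', () => {
      const cart = new Cart();
      cart.add(5, 2);
      assert.equal(cart.total(), 10);
    });

    // Behavior-oriented: the test reads as a specification of what the
    // system should do, independent of which methods implement it.
    test('an empty cart costs the customer nothing', () => {
      assert.equal(new Cart().total(), 0);
    });

Both styles exercise the same code; the difference is that the second one doubles as a readable statement of the requirement.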

Having studied formal development methods, it seems to me that, at a fundamental level, practitioners of TDD and formal methods are really doing pretty much the same thing. They're both trying to describe how a program is supposed to function. The TDD people do it by creating a test suite that asserts what the program should and shouldn't do. The formal methods people do it by writing Z specifications and SPARK verification conditions. Both are creating artifacts that describe how the system is supposed to work - the formal methods people with verifiable mathematical models, the TDD people with executable tests. They're just two different ways of trying to get to the same place.

In a way, this seems to be the opposite of what agile methods promote. Most of the agile propaganda I've read advocates doing away with specifications and documentation to the maximum extent possible. In fact, I've even read some statements that the amount of non-code output in a project should approach zero. And yet TDD can be viewed primarily as a design activity of the type that, using a different notation, would generate reams of paper. The only thing that saves it is the technicality that the design language is also a programming language. It just seems a little ironic to me.

I suppose that's really the genius of TDD. It's the type of specification activity programmers are always told they should do, but repackaged into a code-based format that's palatable to them. It's sort of a compromise between traditional "heavy-weight" design methods and the more common "make it up as we go" design. I don't want to say it's the best of both worlds, but it's certainly better than the undisciplined, half-assed approach to design that's so common in our industry.