Account for cache

Here's another interesting debugging tidbit for you. Assuming, that is, that anyone but me cares about such things.

This past Friday, we rolled out a new feature of the site. Without giving away any identifying details, it involved a change to the way data related to user activity is tracked in one of my modules. In the past, we tracked this "user activity data" (UAD) via a batch calculation. Every night we ran a cron job that analyzed various database tables and calculated some results that we then stored in the database. It was a fairly intensive calculation that touched a lot of tables, so it didn't make sense to do it in real time.

For this new site feature, I had to switch the module over to an event-driven approach. That is, rather than the batch calculation, we're now keeping a "running tally" of the UAD based on user actions. This means that the users can now see this information change in real time, which is good, but it also means it's now hard/impossible to determine exactly how a particular user arrived at their current UAD.

So here's the catch: for reasons I won't get into, I need to track the per-day UAD change. I did this by simply adding a "daily change" column in the database that I update along with the UAD score. Then, every night, we run a cron job to zero out that column in the database. The SQL query it runs is simply:
update uad_table set daily_change = 0;
For non-coders, that just sets the daily_change to 0 for every row, i.e. every user, in the database. Nice and simple. For a query like that, there isn't much testing you can do - it either works or it doesn't.

With that in mind, I was quite surprised this morning to learn that this simple query was apparently not working for all users. While the daily change column was indeed cleared for some users, it was not cleared for others. But, for the SQL above, that is absolutely impossible. The only way that could happen is if MySQL were badly, badly broken.

But, as with all things in software, the devil is in the details. I failed to account for two things in this equation: my database update code and caching.

First, my update code. The relevant portion of my PHP update code looked something like this:
$this->uad += $adtl_uad;
$this->daily_change += $adtl_uad;
$this->save();

In this, the save() method writes the object properties to the database. This is pretty standard stuff. The problem is: what happens if the database row changes between the time the object fetches it and the time it writes the updates? The answer: the change will be clobbered and the object will write bad data. So if the user is doing something that changes his UAD and if things are timed just right, we could end up in a situation where this code reads a "daily change" value of 500, our cron job updates the database to make it zero, and then the update code writes back a "daily change" value of 500 plus something, thus clobbering our update. However, this alone could not be the problem because the time between reading the data and writing it is so small - way less than a second. The timing would have to be just right, and the odds of that happening often enough to account for the number of un-updated users are just too small.
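The window looks something like this. This is a hypothetical timeline with made-up numbers, annotating the same update code from above:

```php
<?php
// Hypothetical timeline of the lost-update race (numbers invented):
//
//   t0  web request loads the row       (daily_change = 500)
//   t1  cron job zeroes the column      (daily_change = 0)
//   t2  web request saves 500 + 10      (daily_change = 510)
//
// The cron job's reset at t1 is silently clobbered at t2, because
// save() writes back absolute values computed from stale state:

$this->uad += $adtl_uad;            // read-modify-write on in-memory copies...
$this->daily_change += $adtl_uad;
$this->save();                      // ...then an absolute UPDATE overwrites the row
```

On its own, the t0-to-t2 window is tiny, which is why this race alone couldn't explain the failures.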

No, the real problem was when we combined this update code with caching. You see, we use memcached to cache the results of database queries in order to take some of the load off of our database servers and increase performance. I was using memcached for the UAD retrieval, so whenever I retrieved a user's UAD from the database, I would first check if we had a memcached copy of it. If so, I'd use that and bypass the database altogether. If not, then I'd query the database and then save the results in memcached. When I wrote the updated UAD back to the database, I would clear the memcached data for that record so that it would be refreshed on the next access.
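In sketch form, that cache-aside pattern looks something like this. The function names, cache key format, and TTL are my own stand-ins, not the site's actual code:

```php
<?php
// Cache-aside read/write sketch. $memcached is assumed to be a
// memcached client object with get()/set()/delete() methods;
// everything else here is illustrative.

function getUad($memcached, $pdo, $uid)
{
    $key = "uad:$uid";                      // hypothetical cache key
    $cached = $memcached->get($key);
    if ($cached !== false) {
        return $cached;                     // cache hit: bypass the database
    }

    $stmt = $pdo->prepare('SELECT uad, daily_change FROM uad_table WHERE uid = ?');
    $stmt->execute([$uid]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    $memcached->set($key, $row, 300);       // populate the cache for later reads
    return $row;
}

function saveUad($memcached, $pdo, $uid, $uad, $daily_change)
{
    $stmt = $pdo->prepare('UPDATE uad_table SET uad = ?, daily_change = ? WHERE uid = ?');
    $stmt->execute([$uad, $daily_change, $uid]);
    $memcached->delete("uad:$uid");         // invalidate so the next read refreshes
}
```

The crucial detail is that only the application's write path does the invalidation; anything that writes to the table outside this code path leaves stale data in the cache.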

If you're sharp, you may have picked up on the problem now. Where does the cron job clear memcached? It doesn't. So what happens if I run the cron job and then someone accesses a user UAD record for which there is still memcached data? Well, the PHP script is going to get the cached data that still has a non-zero "daily change" value, update that value, and write it back to the database, effectively canceling out the column zero-out for that record. Doh!

My solution was simply to change the PHP method that updates the UAD. Rather than calculating the UAD in code and writing the result back to the database, I now just let the database do the calculation. The parameterized SQL looks something like this:
update uad_table set uad = uad + :adtl_uad, daily_change = daily_change + :adtl_uad where uid = :uid
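Issued through PDO, that relative update might look roughly like this. One wrinkle worth noting: PDO doesn't allow the same named placeholder to appear more than once in a statement unless prepare emulation is on, so I bind the same value under two distinct names. The variable names are placeholders:

```php
<?php
// Sketch of the atomic, relative update. The database does the
// arithmetic, so a concurrent reset can't be clobbered by stale
// values read into PHP earlier.

$stmt = $pdo->prepare(
    'UPDATE uad_table
        SET uad = uad + :adtl_uad,
            daily_change = daily_change + :adtl_daily
      WHERE uid = :uid'
);
$stmt->execute([
    ':adtl_uad'   => $adtl_uad,
    ':adtl_daily' => $adtl_uad,   // same value, distinct placeholder name
    ':uid'        => $uid,
]);
```

Since the statement only ever adds a delta, it composes correctly with the cron job's zero-out no matter how the two interleave.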

The moral of the story is two-fold. First, scale changes everything. On a smaller site that didn't need memcached, the race condition in the update code probably never would have been an issue.

Second, keep your database access consistent. The reason this problem popped up is because I was handling data through cache-aware domain objects in the web application, but going straight to the database in the cache-ignorant cron job. Ideally, I would have used the same update method in both the cron job and the web application. The only problem with that is that running one bulk update query in the cron job is a lot faster than iterating over 400,000 rows of user records to run a PHP method that has the same effect. Which is yet another example of how little in software is ever as simple as it seems.

The truth about government

Leon Bambrick had a really great blog post today. You know the "one laptop per child" program? Well, he thinks it's great that the IT industry is doing this and that other industries should have similar programs, such as "one handgun per child." My favorite, though, was the program for government:

Governments could offer "One soul-destroying job in a miasma of bureaucratic inconsequentiality Per Child"

If you're a regular reader of this blog, or a member of my family, then you probably know that my last job was as a Systems Analyst in the IT department of a local government, so that quote really spoke to me. I know you hear people say things like that about public sector jobs all the time, but based on 6 years of experience, I can say that it's all true. There's a general atmosphere of hopelessness, where most things don't make any sense, lots of them don't really work, and everything is more complicated than it needs to be, yet nobody seems to think this is a problem. The only way to keep your sanity in such an environment is to simply stop caring. After all, nobody else has.

Of course, the private sector is no panacea. But at least there's always the possibility that you can improve your organization or your lot within it. Or at least change something.

The one thing I will say for government work is that it gives you a different angle on conspiracy theories. After a job in public service, it's impossible to waste even a second seriously considering any of those grand, sweeping government conspiracy theories you see floating around the internet. I mean, most government offices can just barely manage the job they're supposed to be doing. How are they going to manage some sort of vast cover-up? That would require motivation and initiative, and those aren't allowed in public service.

Remember the magic properties

You know what's a good way to feel really beaten-down at the end of the day? Spend 2 hours debugging a 1-line problem.

That's just what happened to me the other day. And, while I place some of the blame on PHP in order to make myself feel better, it was actually due to my own forgetfulness and tunnel-vision. If I had just taken a minute to step back and consider the possibilities, I probably would have had it figured out much faster.

Since last week, I have been working on the data access code for our site. Basically, we're going from using ad hoc SQL queries scattered throughout the codebase to a model-based design, where we have an abstract data model class which handles database access and then classes that inherit from that. You know, sort of an ActiveRecord pattern, a la Ruby on Rails or CakePHP.

Anyway, things were going along quite nicely. That kind of code conversion is always a little tedious, but I hadn't hit any major road-blocks and all my unit tests were passing.

Well, Tuesday afternoon, I hit a snag. I'm not sure at what point it happened, but I ran my unit tests, and suddenly I was seeing red. Lots of red. Everything was failing!

Well, I immediately went into homing mode. I started taking out tests and trying to isolate a single failure. It turns out that the first failure was in the test case for my class constructor. A little more debugging revealed the problem - none of the class members were being populated from the database. And yet, there was nothing in the error logs and the connection to the database wasn't failing.

Now, the way our (custom) abstract model class works is that it reads the table layout from the database, stores that in the object, uses it to determine the primary key of the table, and then retrieves a row based on the primary key value you pass. What was happening was that the constructor was running, getting the table data from the database (which is how I know the connection was functioning), and running the code to retrieve the row, but the values of the class members were never set.

Well, needless to say, I was baffled. Before too long, I thought I had narrowed it down to the method that actually retrieves the database row. We're using PDO (PHP Data Objects) for database access in the model class and I was using the feature of PDO whereby it will return a row from the database as, or into, an object. You can pass it an object and it will set each column as a field in the object. Alternatively, you can pass PDO a class name and it will return a new object of that class with the appropriate fields set from the database. You can also have it return an stdClass, which is PHP's generic object. I thought I was onto something when I discovered that fetching into the current object or fetching into a new object of the same class didn't work, but fetching a generic stdClass did. So, of course, I eventually decided to work around the problem by re-coding the method to use the stdClass and just populate my object from that. Imagine my surprise when, even after that, and verifying that I was getting the row from the database, my unit tests still failed.
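For reference, the three PDO fetch styles in question look something like this. Each is shown independently; a one-row result set can only be fetched once per execute, and the class and variable names are hypothetical:

```php
<?php
// Sketch of the three PDO fetch styles mentioned above, against a
// hypothetical uad_table. $model is an existing object; UadModel is
// an invented class name.

$stmt = $pdo->prepare('SELECT uid, uad, daily_change FROM uad_table WHERE uid = ?');
$stmt->execute([$uid]);

// 1. Fetch into an existing object: each column is assigned to a
//    property of $model (triggering __set for undeclared properties).
$stmt->setFetchMode(PDO::FETCH_INTO, $model);
$row = $stmt->fetch();

// 2. Fetch a brand-new instance of a named class.
$stmt->setFetchMode(PDO::FETCH_CLASS, 'UadModel');
$obj = $stmt->fetch();

// 3. Fetch a generic stdClass (this was the one that "worked",
//    because stdClass has no __set to swallow the assignments).
$obj = $stmt->fetch(PDO::FETCH_OBJ);
```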

Well, to cut a long and boring story short, the problem was the magic __set() method. This is one of the "magic" functions PHP lets you define on a class. This particular one takes a key and a value and is executed every time you try to assign a value to a class property that doesn't exist. Well, we had defined that in the child class, not the model class, to do an update on the database in certain conditions, and when I was re-writing it, I forgot to account for the default case - when we don't want to update the DB, just assign the value to the key. It was one line, $this->$key = $value, which I had put inside an "if" block instead of outside it. So every time my code tried to do something like $this->foo = 'bar', the statement ended up being a no-op. Same thing when PDO tried to populate the object. But populating the field with the table layout worked fine, since that field was set in the class definition rather than being created at run-time.
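A minimal reconstruction of the bug, with invented condition and helper names (the actual class and its DB-update logic were more involved):

```php
<?php
// Sketch of the fixed __set(). In the broken version, the last
// assignment lived inside the "if", so ordinary assignments to
// undeclared properties silently did nothing.

class Model
{
    public function __set($key, $value)
    {
        if ($this->needsDbUpdate($key)) {       // hypothetical condition
            $this->updateDatabase($key, $value); // the special case
        }
        // The forgotten default case: always record the value.
        // (PHP does not recursively re-invoke __set here; the dynamic
        // property is created directly.)
        $this->$key = $value;
    }

    private function needsDbUpdate($key)          { return false; } // stub
    private function updateDatabase($key, $value) { /* UPDATE ... */ }
}

$m = new Model();
$m->foo = 'bar';   // triggers __set; with the bug, foo was never set
echo $m->foo;      // prints "bar" now that the default case runs
```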

So what's the moral of this story? Well, I guess there are several. First, tunnel-vision in debugging is a bad thing. I got hung up on the PDO object fetching and wasted lots of time trying to figure out why it wasn't working when the problem was actually someplace completely different.

The second lesson is that you need to stick to a process - in this case, I should have been doing real test-driven development. The reason this took so long is that I let far too much time elapse between the change to the __set() function and running my unit tests. If I had been doing real TDD and had run my tests after every change, I would have instantly been able to pinpoint the problem.

The last lesson is simply that software development is about managing complexity, and managing complexity is hard. Hacking away, designing as you go, relying on raw brain-power is all well and good, but it's not sustainable in the long run. When you run out of gas, or you're having a bad day, you need something to fall back on. This episode reminded me of that the hard way.

This seems to be a recurring theme for me. The longer I develop software, the more obvious it becomes that I need to improve my process - even when I'm working by myself. It's kind of funny, really. The more I learn about software development, the less I feel like I know what I'm doing. There's just so much I have left to learn, it seems a little overwhelming. I know that doesn't make me inadequate - in fact, they say wisdom is knowing what you don't know. But it can still be a bit...disquieting at times.