Account for cache

Here's another interesting debugging tid-bit for you. Assuming, that is, that anyone but me cares about such things.

This past Friday, we rolled out a new feature of the site. Without giving away any identifying details, it involved a change to the way data related to user activity is tracked in one of my modules. In the past, we tracked this "user activity data" (UAD) via a batch calculation. Every night we ran a cron job that analyzed various database tables and calculated some results that we then stored in the database. It was a fairly intenxive calculation that touched a lot of tables, so it didn't make sense to do it in real time.

For this new site feature, I had to switch the module over to an event-driven approach. That is, rather than the batch calculation, we're now keeping a "running tally" of the UAD based on user actions. This means that the users can now see this information change in real time, which is good, but it also means it's now hard/impossible to determine exactly how a particular user arrived at their current UAD.

So here's the catch: for reasons I won't get into, I need to track the per-day UAD change. I did this by simply adding a "daily change" column in the database that I update along with the UAD score. Then, every night, we run a cron job to zero-out that column in the database. The SQL query it runs is simply:
update uad_table set daily_change = 0;
For non-coders, that just sets the daily_change to 0 for every row, i.e. every user, in the database. Nice and simple. For a query like that, there isn't much testing you can do - it either works or it doesn't.

With that in mind, I was quite surprised this morning to learn that this simple query was apparently not working for all users. While the daily change column was indeed cleared for some users, it was not cleared for some others. But, for SQL above, that is absolutely impossible. The only way that could happen is if MySQL was badly, badly broken.

But, as with all things in software, the devil is in the details. I failed to account for two things in this equation: my database update code and caching.

First, my update code. The relevant portion of my PHP update code looked something like this:
$this->uad += $adtl_uad;
$this->daily_change += $adtl_uad;

In this, the save() method writes the object properties to the database. This is pretty standard stuff. The problem is: what happens if the database row changes between the time the object fetches it and the time it writes the updates? The answer: the change will be clobbered and the object will write bad data. So if the user is doing something that changes his UAD and if things are timed just right, we could end up in a situation where this code reads a "daily change" value of 500, our cron job updates the database to make it zero, and then the update code writes back a "daily change" value of 500 plus something, thus clobbering our update. However, this alone could not be the problem because the time between reading the data and writing it is so small - way less than a second. The timing would have to be just right, and the odds of that happening often enough to account for the number of un-updated users are just too small.

No, the real problem was when we combined this update code with cachine. You see, we use memcached to cache the results of database queries in order to take some of the load off of our database servers and increase performance. I was using memcached for the UAD retreival, so whenever I retreived a user's UAD from the database, I would first check if we had a memcached copy of it. If so, I'd use that and bypass the database altogether. If not, then I'd query the database and then save the results in memcached. When I wrote the updated UAD back to the database, I would clear the memcached data for that record so that it would be refreshed on the next access.

If you're sharp, you may have picked up on the problem now. Where does the cron job clear memcached? It doesn't. So what happens if I run the cron job and then someone accesses a user UAD record for which there is still memcached data? Well, the PHP script is going to get the cached data that still has a non-zero "daily change" value, update that value, and write it back to the database, effectively canceling out the column zero-out for that record. Doh!

My solution was simply to change the PHP method that updates the UAD. Rather than calculating the UAD in code and writing the result back to the database, I now just let the database do the calculation. The parameterized SQL looks something like this:
update uad_table set uad = uad + :adtl_uad, daily_change = daily_change + :adtl_uad where uid = :uid

The moral of the story is two-fold. First, scale changes everything. On a smaller site that didn't need memcached, the race condition in the update code probably never would have been an issue.

Second, keep your database access consistent. The reason this problem popped up is because I was handling data through cache-aware domain objects in the web application, but going straight to the database in the cache-ignorant cron job. Ideally, I would have used the same update method in both the cron job and the web application. The only problem with that is that running one bulk update query in the cron job is a lot faster than iterating over 400,000 rows of user records to run a PHP method that has the same effect. Which is yet another example of how little in software is ever as simple as it seems.

You can reply to this entry by leaving a comment below. You can send TrackBack pings to this URL. This entry accepts Pingbacks from other blogs. You can follow comments on this entry by subscribing to the RSS feed.

Add your comments #

A comment body is required. No HTML code allowed. URLs starting with http:// or ftp:// will be automatically converted to hyperlinks.