Chrome debugging victory

This post is about a Chrome bug that I fixed a few months ago in a new product I was working on. Normally, this would not be noteworthy, but this one was a real mind-bender.

The product I was working on at the time is an add-on to one of my company's services. To give a little background, our bread and butter is high-resolution aerial imagery - think satellite images, but with better resolution and with oblique shots taken from multiple directions in addition to the overheads. Our biggest customers for these fly-overs are county governments, who use the imagery for a variety of purposes. For example, when I was working for a county government, we used Pictometry's images for 911 dispatching, so that dispatchers could give detailed information on a property to police and other first responders.

Anyway, counties will typically order a new fly-over every year or every few years so that their images remain up to date. One of the add-on services we offer with this is a "change finder" service, where we'll analyze the images from two different years and provide the customer with a data set highlighting the structures that have changed. The product I was working on is actually an add-on to that service. It's a web-based application targeted at local assessors that gives them a simple, no-setup-needed way to access their change data and provides a basic workflow for analyzing it and making determinations on property values. So rather than spending an hour driving out to inspect a location, the assessor can use a rich web UI to see that the property now has a garage that wasn't there last year and estimate how big it is. Saves them time and keeps the tax rolls up-to-date.

As for the application itself, it's based on our next-generation internal development framework. The back-end is built on PHP and PostgreSQL/Oracle, with the front-end built on Ext JS. It features a dual-pane "before and after" map viewer that's built on OpenLayers and integrates with the customer's GIS data. There is a set of search tools that lets the customer narrow down and navigate their data set. In addition, the images are overlaid with outlines to highlight which properties have changed.

But enough exposition - on to the bug! This one was a real head-scratcher. Basically, sometimes the UI would just freak out. All the controls outside the map area would just turn into one solid gray field, effectively rendering them invisible. They still worked, if you knew where to click, but you couldn't see them. They'd come back for a second when you panned the map image, but then disappear when you stopped panning. And it wasn't a transient JavaScript thing - reloading the page didn't help. And the real kicker was that this bug only happened in Chrome (we didn't test other Webkit-based browsers, as Chrome is the only one we officially support).

This bug had been languishing in our queue for weeks. It wasn't consistently reproducible. We found that it happened less if you turned off Chrome's hardware acceleration, but it didn't go away entirely. Sometimes it would happen consistently, whereas other times you could go a full week and never see it. We eventually determined that it was related to the number of images overlaid in the map area, because hiding enough of them made the issue go away. We also found that it happened more at larger window sizes. In retrospect, that was probably because a larger window allows room for more images.

Since the non-map portion of the UI was what was getting messed up, I hypothesized that it might be due to some weird interaction with the styles used by Ext JS. However, that didn't pan out. It turned out that there was nothing remarkable about Ext's styles, and every "discovery" I made ended up reducing to "more images in the map". The one insight I was able to glean was that the issue was tied to the number of images visible in the viewport, not just the number present in the map area DOM.

I eventually found the source of this bug through sheer brute force, i.e. process of elimination. I opened up the Chrome developer tools and started hiding and removing DOM nodes. I started outside the map area and gradually pruned the DOM down to the smallest number of elements that still caused the bug.

It was at that point that I finally found a suspicious CSS rule. It was a CSS3 transform coming from our client-side image viewing component. Apparently the rule was originally put there to trigger hardware acceleration in some browsers. I have no idea why it caused this bug, though - I don't think we ever did figure that out. But it was an easy fix and I was rather proud of myself for having figured it out. I suppose that just goes to show that sometimes persistence is more important than brilliant insight when it comes to bug fixing.
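
For the curious, the offending rule was along these lines (I'm reconstructing from memory, so treat the exact declaration as an assumption - it was one of the standard "null transform" hacks of the era):
-webkit-transform: translateZ(0); /* nudges the element onto its own compositing layer */

My best guess, which we never confirmed, is that this put every overlay image on its own GPU layer, which would at least fit the symptoms: the more images in the viewport, the more layers Chrome had to composite.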

Account for cache

Here's another interesting debugging tidbit for you. Assuming, that is, that anyone but me cares about such things.

This past Friday, we rolled out a new feature of the site. Without giving away any identifying details, it involved a change to the way data related to user activity is tracked in one of my modules. In the past, we tracked this "user activity data" (UAD) via a batch calculation. Every night we ran a cron job that analyzed various database tables and calculated some results that we then stored in the database. It was a fairly intensive calculation that touched a lot of tables, so it didn't make sense to do it in real time.

For this new site feature, I had to switch the module over to an event-driven approach. That is, rather than the batch calculation, we're now keeping a "running tally" of the UAD based on user actions. This means that the users can now see this information change in real time, which is good, but it also means it's now hard/impossible to determine exactly how a particular user arrived at their current UAD.

So here's the catch: for reasons I won't get into, I need to track the per-day UAD change. I did this by simply adding a "daily change" column in the database that I update along with the UAD score. Then, every night, we run a cron job to zero-out that column in the database. The SQL query it runs is simply:
update uad_table set daily_change = 0;

For non-coders, that just sets the daily_change to 0 for every row, i.e. every user, in the database. Nice and simple. For a query like that, there isn't much testing you can do - it either works or it doesn't.

With that in mind, I was quite surprised this morning to learn that this simple query had apparently not worked for all users. While the daily change column was indeed cleared for some users, it was not cleared for others. But, given the SQL above, that is absolutely impossible. The only way that could happen is if MySQL was badly, badly broken.

But, as with all things in software, the devil is in the details. I failed to account for two things in this equation: my database update code and caching.

First, my update code. The relevant portion of my PHP update code looked something like this:
$this->uad += $adtl_uad;          // add the new activity to the running UAD total
$this->daily_change += $adtl_uad; // and to the per-day change
$this->save();                    // write the object's properties back to the database

In this, the save() method writes the object properties to the database. This is pretty standard stuff. The problem is: what happens if the database row changes between the time the object fetches it and the time it writes the updates? The answer: the change will be clobbered and the object will write bad data. So if the user is doing something that changes his UAD and if things are timed just right, we could end up in a situation where this code reads a "daily change" value of 500, our cron job updates the database to make it zero, and then the update code writes back a "daily change" value of 500 plus something, thus clobbering our update. However, this alone could not be the problem because the time between reading the data and writing it is so small - way less than a second. The timing would have to be just right, and the odds of that happening often enough to account for the number of un-updated users are just too small.
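
To make the race concrete, here's the interleaving I had in mind (the values and timings are made up):
// t+0.0s - web request reads the row: daily_change = 500
// t+0.1s - cron job runs: update uad_table set daily_change = 0;
// t+0.2s - web request calls save(): writes daily_change = 500 + $adtl_uad, and the zero-out is lost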

No, the real problem was when we combined this update code with caching. You see, we use memcached to cache the results of database queries in order to take some of the load off of our database servers and increase performance. I was using memcached for the UAD retrieval, so whenever I retrieved a user's UAD from the database, I would first check if we had a memcached copy of it. If so, I'd use that and bypass the database altogether. If not, then I'd query the database and then save the results in memcached. When I wrote the updated UAD back to the database, I would clear the memcached data for that record so that it would be refreshed on the next access.
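
In sketch form, the read and write paths looked something like this (the method and key names are made up for illustration - this is not our actual code):
function getUserUad($uid) {
    $key = 'uad_' . $uid;                    // hypothetical cache key scheme
    $cached = $this->memcache->get($key);
    if ($cached !== false) {
        return $cached;                      // cache hit: bypass the database entirely
    }
    $row = $this->fetchUadFromDb($uid);      // cache miss: query the database...
    $this->memcache->set($key, $row);        // ...and save the result in memcached
    return $row;
}

function saveUserUad($uid, $row) {
    $this->writeUadToDb($uid, $row);         // write the updated UAD back
    $this->memcache->delete('uad_' . $uid);  // clear the cached copy so the next access refreshes it
}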

If you're sharp, you may have picked up on the problem now. Where does the cron job clear memcached? It doesn't. So what happens if I run the cron job and then someone accesses a user UAD record for which there is still memcached data? Well, the PHP script is going to get the cached data that still has a non-zero "daily change" value, update that value, and write it back to the database, effectively canceling out the column zero-out for that record. Doh!

My solution was simply to change the PHP method that updates the UAD. Rather than calculating the UAD in code and writing the result back to the database, I now just let the database do the calculation. The parameterized SQL looks something like this:
update uad_table set uad = uad + :adtl_uad, daily_change = daily_change + :adtl_uad where uid = :uid
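
For completeness, running that through PDO looks roughly like this (a sketch; note that reusing the :adtl_uad placeholder twice relies on PDO's emulated prepares, which is the default for MySQL):
$stmt = $db->prepare(
    'update uad_table set uad = uad + :adtl_uad, '
    . 'daily_change = daily_change + :adtl_uad where uid = :uid'
);
$stmt->execute(array(
    ':adtl_uad' => $adtl_uad,
    ':uid'      => $uid,
));

Since the database applies the increments to its own current values in a single statement, a stale cached copy of the row can no longer clobber the cron job's zero-out.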

The moral of the story is two-fold. First, scale changes everything. On a smaller site that didn't need memcached, the race condition in the update code probably never would have been an issue.

Second, keep your database access consistent. The reason this problem popped up is because I was handling data through cache-aware domain objects in the web application, but going straight to the database in the cache-ignorant cron job. Ideally, I would have used the same update method in both the cron job and the web application. The only problem with that is that running one bulk update query in the cron job is a lot faster than iterating over 400,000 rows of user records to run a PHP method that has the same effect. Which is yet another example of how little in software is ever as simple as it seems.

Remember the magic properties

You know what's a good way to feel really beaten-down at the end of the day? Spend 2 hours debugging a 1-line problem.

That's just what happened to me the other day. And, while I place some of the blame on PHP in order to make myself feel better, it was actually due to my own forgetfulness and tunnel-vision. If I had just taken a minute to step back and consider the possibilities, I probably would have had it figured out much faster.

Since last week, I have been working on the data access code for our site. Basically, we're going from using ad hoc SQL queries scattered throughout the codebase to a model-based design, where we have an abstract data model class which handles database access and then classes that inherit from that. You know, sort of an ActiveRecord pattern, a la Ruby on Rails or CakePHP.

Anyway, things were going along quite nicely. That kind of code conversion is always a little tedious, but I hadn't hit any major roadblocks and all my unit tests were passing.

Well, Tuesday afternoon, I hit a snag. I'm not sure at what point it happened, but I ran my unit tests, and suddenly I was seeing red. Lots of red. Everything was failing!

Well, I immediately went into homing mode. I started taking out tests and trying to isolate a single failure. It turned out that the first failure was in the test case for my class constructor. A little more debugging revealed the problem - none of the class members were being populated from the database. And yet, there was nothing in the error logs and the connection to the database wasn't failing.

Now, the way our (custom) abstract model class works is that it reads the table layout from the database, stores that in the object, uses it to determine the primary key of the table, and then retrieves a row based on the primary key value you pass. What was happening was that the constructor was running, getting the table data from the database (which is how I know the connection was functioning), and running the code to retrieve the row, but the values of the class members were never set.
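
In outline, the constructor does something like this (a simplified sketch of the idea - the method names are invented, not our actual API):
class Model {
    protected $tableLayout = array();  // declared ahead of time in the class definition
    protected $primaryKey;

    public function __construct($id = null) {
        $this->tableLayout = $this->fetchTableLayout();            // e.g. a DESCRIBE query
        $this->primaryKey = $this->findPrimaryKey($this->tableLayout);
        if ($id !== null) {
            $this->fetchRow($id);  // sets one property on $this per table column
        }
    }
}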

Well, needless to say, I was baffled. Before too long, I thought I had narrowed it down to the method that actually retrieves the database row. We're using PDO (PHP Data Objects) for database access in the model class, and I was using the PDO feature whereby it returns a row from the database as, or into, an object. You can pass it an existing object and it will set each column as a field on that object. Alternatively, you can pass PDO a class name and it will return a new object of that class with the appropriate fields set from the database. You can also have it return a stdClass, which is PHP's generic object. I thought I was onto something when I discovered that fetching into the current object or fetching a new object of the same class didn't work, but fetching a generic stdClass did. So, of course, I eventually decided to work around the problem by re-coding the method to use the stdClass and just populate my object from that. Imagine my surprise when, even after that, and after verifying that I was getting the row from the database, my unit tests still failed.
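
For reference, the three PDO fetch styles in question look like this (condensed for illustration; $stmt is an executed PDOStatement, and the class name is a placeholder):
// 1. Fetch into an existing object - each column is assigned as a property of $this
$stmt->setFetchMode(PDO::FETCH_INTO, $this);
$row = $stmt->fetch();

// 2. Fetch a new instance of a named class, with properties set from the columns
$stmt->setFetchMode(PDO::FETCH_CLASS, 'MyModel');
$row = $stmt->fetch();

// 3. Fetch a generic stdClass - the only style that worked
$row = $stmt->fetch(PDO::FETCH_OBJ);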

Well, to cut a long and boring story short, the problem was the magic __set() method. This is one of the "magic" functions PHP lets you define on a class. This particular one takes a key and a value and is executed every time you try to assign a value to a class property that doesn't exist. Well, we had defined that in the child class, not the model class, to do an update on the database in certain conditions, and when I was re-writing it, I forgot to account for the default case - when we don't want to update the DB, just assign the value to the key. It was one line, $this->$key = $value, which I had put inside an "if" block instead of outside it. So every time my code tried to do something like $this->foo = 'bar', the statement ended up being a no-op. Same thing when PDO tried to populate the object. But populating the field with the table layout worked fine, since that field was set in the class definition rather than being created at run-time.
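
Reconstructed from memory (the condition and helper are stand-ins for our real update logic), the broken method looked more or less like this:
public function __set($key, $value) {
    if ($this->shouldWriteThrough($key)) {  // stand-in for the real condition
        $this->writeFieldToDb($key, $value);
        $this->$key = $value;               // oops - the assignment only happens inside the "if"
    }
}

And the fix was simply to move the assignment out of the conditional so it always runs:
public function __set($key, $value) {
    if ($this->shouldWriteThrough($key)) {
        $this->writeFieldToDb($key, $value);
    }
    $this->$key = $value;  // default case: always set the property
}

This also explains the stdClass behavior: PDO's object fetch modes go through __set() for properties that aren't declared on the class, whereas stdClass has no __set() at all, so its fields get assigned directly.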

So what's the moral of this story? Well, I guess there are several. First, tunnel-vision in debugging is a bad thing. I got hung up on the PDO object fetching and wasted lots of time trying to figure out why it wasn't working when the problem was actually someplace completely different.

The second lesson is that you need to stick to a process - in this case, I should have been doing real test-driven development. The reason this took so long is that I let far too much time elapse between the change to the __set() function and running my unit tests. If I had been doing real TDD and had run my tests after every change, I would have instantly been able to pinpoint the problem.

The last lesson is simply that software development is about managing complexity, and managing complexity is hard. Hacking away, designing as you go, and relying on raw brain-power are all well and good, but they're not sustainable in the long run. When you run out of gas, or you're having a bad day, you need something to fall back on. This episode reminded me of that the hard way.

This seems to be a recurring theme for me. The longer I develop software, the more obvious it becomes that I need to improve my process - even when I'm working by myself. It's kind of funny, really. The more I learn about software development, the less I feel like I know what I'm doing. There's just so much I have left to learn, it seems a little overwhelming. I know that doesn't make me inadequate - in fact, they say wisdom is knowing what you don't know. But it can still be a bit...disquieting at times.