Good intentions, bad idea

Today I'm going to discuss a comedy of errors. This starts out with a nasty bug that surfaced in my company's product a couple of months ago, and finally became clear when I was doing some prep work for implementing a CDN. It's a tale of good intentions, bad ideas, and code I didn't write, but have to fix.

First, the bug.

To explain the bug, I have to tell you a little about how my company's new product works. Basically, it's a drag-and-drop UI builder for Flash ads. The user designs their ad on a design surface in a Flex app, saves it, and can then publish the result. However, rather than actually compiling a SWF for the ad, we're currently assembling all the assets at run-time on the client side. Our ad tags serve up a shell SWF file and inject a few parameters into it, including a URL to a "recipe" file that we use to cook up the ad. This is just an XML file that contains the information for the ad's constituent elements. The SWF parses it and pulls or creates all the objects needed to build the ad. There were various reasons for doing it this way, but I won't get into those.
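I can't post the real format, but a recipe along these lines (element names invented for illustration) gives the general idea:

```xml
<!-- Hypothetical recipe file: one entry per constituent element of the ad -->
<ad width="300" height="250">
  <element type="image" src="background.jpg" x="0" y="0"/>
  <element type="text" x="10" y="200">Click here!</element>
</ad>
```

The shell SWF just walks this list and instantiates each element on its stage.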

Now, this bug was a real head-scratcher. The actual problem was simple - the shell SWF just wasn't rendering the ad. No error, no message - just didn't work. However, it only happened in IE - Firefox, Chrome, Opera, and Safari worked just fine. It also only happened in our production and test environments - our dev servers and demo server worked fine in IE. The SWF files were identical in every environment - I know because I triple checked. What's more, I could see the XML file being requested in the server logs, so the SWF wasn't totally crapping out. And, again, it worked in other browsers, so it didn't seem like there could be an issue with the SWF.

Well, after researching and messing around with this for the better part of a day, our QA person found a link that put us on the right track. In fact, there are a bunch of such links. It turned out to be an issue with the HTTP headers on the XML file. The file was being served over SSL with the "no-cache" header set, and it turns out IE doesn't like this. When used over SSL, "no-cache" keeps the file completely out of the cache - not even long enough for the browser to pass it off to the Flash plugin. We would have seen an error for this if we'd written the SWF in ActionScript 3, but we used ActionScript 2 (apparently most of our customers require ads to be in AS 2 - don't ask me why), which has a penchant for failing silently. And the reason it didn't happen in all four environments is that the Apache configurations were actually not all the same. Go figure.
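The eventual fix was an Apache configuration change. I'm paraphrasing from memory, but it was something along these lines - relaxing the caching headers on the recipe files just enough that IE will hold the response long enough to hand it to the plugin:

```apache
# Hypothetical sketch of the fix: over SSL, IE won't pass a "no-cache"
# response to plugins, so allow brief private caching for recipe files.
<FilesMatch "\.xml$">
    Header unset Pragma
    Header set Cache-Control "private, max-age=60"
</FilesMatch>
```

The exact directives will depend on your setup; the point is that "no-cache" (and "no-store") over SSL is what breaks IE here.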

Fast forward two months to the discovery of the cause.

I'm looking to implement a CDN. We've got one that does origin pull, so it shouldn't be a big deal, right? Well, yeah, but we still have to make some changes, because those "recipe" files are on the same path as the rest of our media, and it won't do to have the CDN caching them while users are editing them and trying to preview the output. So I need to fix things so that, at least for previews, we can serve the recipes from our own server instead of the CDN.

A few months ago, when we implemented user uploads, we (and by "we" I mean "another guy on my team") added the concept of a "media URL" to our system. The idea is that we could just change this media URL to switch to a CDN without having to change any URLs in code. This was implemented as a method on one of our back-end classes. It would return the base domain and path, if applicable, from which media is served. So building a URL would look like this:
$image_url = ObfuscatedAdClassModel::getMediaURL();
$image_url .= '/path/to/naughty_pic.jpg';

The getMediaURL() method just checks the database/Memcache for a saved URL and returns the saved value or a static default if nothing is found. Easy-peasy.

Or not. You see, I exaggerated in that last sentence - what I meant was, that's what getMediaURL() should do. In actuality, it does quite a bit more. In fact, as an object lesson in overcomplication, I'm posting the redacted code below.

static public function getMediaURL(){
   global $cache_enabled;

   $db = self::getDB();
   if (!is_object($db)) {
      throw new Exception('Error getting database connection');
   }

   $settings_key='media_url';
   
   if( $cache_enabled ) {
      $settings_value = $db->getCache($settings_key);
   }

   if (!$settings_value){
      $settings_value=DEFAULT_MEDIA_URL;
      $query = "SELECT value FROM settings WHERE key='$settings_key'";
      $ret = $db->query($query);
      if (is_array($ret[0])) {
         $settings_value = $ret[0]["value"];
      } elseif ($ret === true){
         try{
            $db->startTrans();
            $have_lock = $db->query( "LOCK TABLE settings IN ACCESS EXCLUSIVE MODE;" );
            if ($have_lock){
               $query = "SELECT value FROM settings WHERE key='$settings_key'";
               $ret = $db->query($query);
               if (!is_array($ret[0])) {
                  $query = "INSERT INTO settings (key, value) VALUES ('$settings_key','$settings_value')";
                  $ret = $db->query($query);
                  if($ret !== true){
                     $err_msg = "Unable to insert setting: $settings_key";
                     Log::error($err_msg);
                     throw new Exception($err_msg);
                  }
               }
            } else {
               $err_msg = "Could not acquire table lock or table does not exist: settings ".$db->getLastError();
               Log::error($err_msg);
               throw new Exception($err_msg);
            }
            $db->commitTrans();
         } catch (Exception $e){
            $db->rollbackTrans();
            $err_msg = 'DB error while attempting to update settings.';
            Log::error($err_msg);
            throw new Exception($err_msg);
         }
      } else {
         $err_msg = 'DB error while attempting to load settings.';
         Log::error($err_msg);
         throw new Exception($err_msg);
      }
      
      if( $cache_enabled ) {
         $db->setCache($settings_key, $settings_value, 3600);
      }
   }
   if ( $_SERVER["SERVER_PORT"] == 443 ){
      return "https://" . $settings_value . "/";
   } else {
      return "http://" . $settings_value . "/";
   }
}

This should be just a simple look-up in a generic settings table. However, when the desired setting is not found, this method tries to insert it into the database. This is completely unnecessary and leads us to five levels of nested "if" blocks and a table lock. We already have the default value, so why not just return that? And the table lock is just paranoid. This is not a setting that's likely to change more than once every couple of months. And if we do get a minor inconsistency, so what? You think users are going to complain that an ad didn't render properly?

Note also that the returned URL is set to straight HTTP or SSL based on the current server port. Hint: this will be important later.
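For contrast, here's roughly what I think the method should have been - a sketch against the same (assumed) $db API, with the insert, transaction, and table lock all gone:

```php
static public function getMediaURL(){
   global $cache_enabled;

   $settings_key = 'media_url';
   $settings_value = null;
   $db = self::getDB();

   if ($cache_enabled) {
      $settings_value = $db->getCache($settings_key);
   }

   if (!$settings_value) {
      // If the row isn't there, just use the default - no need to insert it.
      $settings_value = DEFAULT_MEDIA_URL;
      $ret = $db->query("SELECT value FROM settings WHERE key='$settings_key'");
      if (is_array($ret[0])) {
         $settings_value = $ret[0]["value"];
      }
      if ($cache_enabled) {
         $db->setCache($settings_key, $settings_value, 3600);
      }
   }

   $protocol = ($_SERVER["SERVER_PORT"] == 443) ? "https" : "http";
   return $protocol . "://" . $settings_value . "/";
}
```

One level of nesting, no locking, and the same observable behavior in the normal case.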

Now, the reason I'm looking at this is that I need to adjust how the URLs to the recipe files are handled in our system. Given our media URL scheme, if we set our static content to serve from the CDN, the recipes will go with it. For previewing in-progress ads, this won't work. So I need to change the recipe URLs to use something other than the media URL. So I find where the recipe URL is injected into the client-side code and start tracing it backward.

To my surprise, I find that the recipe URL is actually not set dynamically using getMediaURL(). It turns out it's coming from our back-end ad object, via a getMetadata() getter method, as a full, absolute URL. And what is this metadata? Well, it's an associative array of seemingly random data that we serialize and cram into the database. And by "we", I mean the same guy who wrote getMediaURL(). For the record, I told him it was a bad idea.

So if the recipe URL is coming out of this metadata, that means it's stored in the database. So I start grepping for the metadata key for the recipe URL. And I find it in the web service method in our management dashboard that saves the XML files.

Let's pause here. Now, if you're very sharp, you may have asked yourself earlier why we were serving this XML file over SSL. We're serving ads, right? They don't need to go over SSL. And this Flash problem was specific to SSL, so if we had just served them over straight HTTP, we would have been fine. So why didn't we do that?

That's a good question. It had occurred to me after I fixed that bug (by changing the Apache configuration), but when I looked, I couldn't find where the URL was set to HTTPS. Plus I didn't know why we were serving them over HTTPS, so I assumed there must be some reason for it. And besides, the bug was "fixed" and I had lots of other work to do at the time, so it wasn't a priority.

Again, if you're sharp, you can probably see that there was no good reason. The recipes were being served over SSL by accident! You see, our management dashboard, which is where the recipe file is saved, runs over HTTPS, and getMediaURL() selects the protocol based on the current one. So when we called getMediaURL() to build the recipe URL to save in the database, it came back as an HTTPS URL.
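The underlying mistake was persisting the output of getMediaURL() at save time. A sketch of the obvious alternative (variable names hypothetical): store only the path, and build the full URL when the ad is actually served, with the scheme chosen by the serving context rather than inherited from the dashboard.

```php
// At save time: persist only the path, never a full URL.
$metadata['recipe_path'] = '/recipes/' . $ad_id . '.xml';

// At serve time: build the URL with an explicit scheme. Ads are served
// over plain HTTP, so there's no reason to inherit the dashboard's HTTPS.
$recipe_url = 'http://' . $media_host . $metadata['recipe_path'];
```

With the path stored instead of the URL, switching hosts (say, to a CDN) or schemes becomes a serve-time decision instead of a database migration.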

So there you have it. A crazy, hard to diagnose bug, caused by a method that's too smart for its own good, and hidden by an ill-conceived half-abstraction. I hate to speak negatively of a friend and former colleague, but this was really a lesson in poor system design. He needed to separate things out a bit more. He should have done less in getMediaURL() and factored the generic "metadata" out into separate properties rather than lumping them all together.

Situations like this are why we have guidelines in programming. Things like "don't put serialized strings in the database", "a method should do one thing", and "don't use global variables" can seem arbitrary to the inexperienced. After all, it's so much quicker and easier to do all those things. But those of us who've been around the block a few times get that queasy feeling. We know there's a reason for those guidelines - those things can easily come back and bite you hard later on. Sure, they might not become a problem, but if they do, it's going to be much harder to fix them later than to "do it right" the first time. The hard part of software development is not "making things work" - a trained chimp can do that. The real art is in keeping things working over the long haul. That's what separates the real pros from the ones who are just faking it.
