Stupid PHP serialization

Author's note: Here's another old article that I mostly finished, but never published - I'm not sure why.  This one is from way back on August 23, 2013.

I was working for deviantART.com at the time.  That was a very interesting experience.  For one, it was y first time working 100% remote, and with a 100% distributed team, no less.  We had people in eastern and western Europe, and from the east coast to the west coast of the US.  It was also a big, high-traffic site.  I mean, not Google or Facebook, but I believe it was in the top 100 sites on the web at the time, according to Alexa or Quantcast (depending on how you count "top 100").

It also had a lot of custom tech.  My experience up to that point had mostly been with fairly vanilla stuff, stitching together a bunch of off-the-shelf components.  But deviantART was old enough and big enough that a lot of the off-the-shelf tools we would use today weren't wide-spread yet, so they had to roll their own.  For instance, they had a system to do traits in PHP before that was actually a feature of PHP (it involved code generation, in case you were wondering).

Today's post from the archives is about my first dealings with one such custom tool.  It nicely illustrates one of the pitfalls of custom tooling - it's usually not well documented, so spotting and resolving issues with it isn't always straight-forward.  This was a case of finding that out the hard way.  Enjoy!


Lesson learned this week: object serialization in PHP uses more space than you think.

I had a fun problem recently.  And by "fun", I mean "WTF?!?"  

I got assigned a data migration task at work last week.  It wasn't a particularly big deal - we had two locations where user's names were being stored and my task was to consolidate them.  I'd already updated the UI code to read and save the values, so it was just a matter of running a data migration job.

Now, at dA we have this handy tool that we call Distributor.  We use it mainly for data migration and database cleanup tasks.  Basically, it just crawls all the rows of a database table in chunks and passes the rows through a PHP function.  We have many tables that contain tens or hundreds of millions of rows, so it's important that data migrations can be done gradually - trying to do it all at once would hammer the database too hard and cause problems.  Distributor allows us to set how big each chunk is and configure the frequency and concurrency level of chunk processing.

There's three parts to Distributor that come into play here: the distributor runner (i.e. the tool itself), the "recipe" which determines what table to crawl, and the "job" function which performs the actual migration/cleanup/whatever.  We seldom have to think about the runner, since it's pretty stable, so I was concentrating on the recipe and the job.

Well, things were going well.  I wrote my recipe, which scanned the old_name table and returned rows containing the userid and the name itself.  And then I wrote my migration job, which updated the new_name table.  (I'm fabricating those table names, in case you didn't already figure that out.)  Distributor includes a "counter" feature that allows us to trigger logging messages in jobs and totals up the number of times they're triggered.  We typically make liberal use of these, logging all possible code paths, as it makes debugging easier.It seemed pretty straight-forward and it passed through code review without any complaints.

So I ran my distributor job.  The old_name table had about 22 million rows in it, so at a moderate chunk size of 200 rows, I figured it would take a couple of days.  When I checked back a day or two later, the distributor runner was reporting that the job was only 4% complete.  But when I looked at my logging counters, they reported that the job had processed 28 million rows.  WTF?!?  

Needless to say, head-scratching, testing, and debugging followed.  The short version is that the runner was restarting the job at random.  Of course, that doesn't reset the counters, so I'd actually processed 28 million rows, but most of them were repeats.  

So why was the runner resetting itself?  Well, I traced that back to a database column that's used by the runner.  It turns out that the current status of a job, including the last chunk of data returned by the crawler recipe, is stored in the database as a serialized string.  The reason the crawler was restarting was because PHP's unserialize() function was erroring out when trying to deserialize that string.  And it seems that the reason it was failing was that the string was being truncated - it was overflowing the database column!

The source of the problem appeared to be the fact that my crawler recipe was returning both a userid and the actual name.  You see, we typically write crawlers to just return a list of numeric IDs.  We can look up the other stuff st run-time.  Well, that extra data was just enough to overflow the column on certain records.  That's what I get for trying to save a database lookup!

You can reply to this entry by leaving a comment below. You can send TrackBack pings to this URL. This entry accepts Pingbacks from other blogs. You can follow comments on this entry by subscribing to the RSS feed.

Add your comments #

A comment body is required. No HTML code allowed. URLs starting with http:// or ftp:// will be automatically converted to hyperlinks.