Actually, maybe that disappearing "knowledge" is OK

A couple of weeks ago I posted an entry about the disappearance of online academic journals and how that's a bad thing.  Well, this article made me rethink that a little bit.

The author, Alvaro de Menard (who seems knowledgeable, but on whom I could find no background information, so caveat emptor), apparently participated in Replication Markets, a prediction market focused on the replicability of scientific research.  This is not something I was familiar with, but the idea of a prediction market is basically to use the model of economic markets to predict other things.  The participants "bet" on specific outcomes, and the incentives are aligned in such a way that they gain if they get it right, lose if they get it wrong, and maintain the status quo if they don't bet.

In this case, the market was about predicting whether or not the findings of social science studies could be replicated.  As you probably know, half the point of formalized scientific studies is that other researchers should be able to repeat the study and replicate the results.  If the result can be replicated consistently, that's good evidence that the effect you're observing is real.  You probably also know that science in general, and social science in particular, has been in the midst of a replication crisis for some time, meaning that for a disturbingly large number of studies, the results cannot be replicated.  The exact percentage varies, depending on what research area you're looking at, but it looks like the overall rate is around 50%.

It's a long article, but I highly recommend reading de Menard's account.  The volume of papers he looked at gives him a very interesting perspective on the replication crisis.  He skimmed over 2500 social science papers and assessed whether they were likely to replicate.  He says that he only spent about 2.5 minutes on each paper, but that his results were in line with the consensus of the other assessors in the project and with the results of actual replication attempts.

The picture painted by this essay is actually pretty bleak.  Some areas are not as bad as you might think, but others are much worse.  To me, the worst part is that the problem is systemic.  It's not just the pressure to "publish or perish".  Even among the studies that do replicate, many are not what you'd call "good" - they might be poorly designed (which is not the same thing as not replicable) or just reach conclusions that were pretty obvious in the first place.  As de Menard argues, everyone's incentives are set up in a perverse way that fails to promote quality research.  The focus is on things like statistical significance (which is routinely gamed via p-hacking), citation counts (which don't seem to correlate with replicability), and journal rankings.  It's all about producing and publishing "impactful" studies that tick all the right boxes.  If the results turn out to be true, so much the better, but that's not really the main point.  But the saddest part is that it seems like everybody knows this, but nobody is really in a position to change it.

So...yeah.  Maybe it's actually not such a tragedy that all those journals went dark after all.  I mean, on one level I think it kinda still is a bad thing when potentially useful information disappears.  But on the other hand, there probably wasn't an awful lot of knowledge in the majority of those studies.  In fact, most of them were probably either useless or misleading.  And is it really a loss when useless or misleading information disappears?  I don't know.  Maybe it has some usefulness in terms of historical context.  Or maybe it's just occupying space in libraries, servers, and our brains for no good reason.

Stupid PHP serialization

Author's note: Here's another old article that I mostly finished, but never published - I'm not sure why.  This one is from way back on August 23, 2013.

I was working for deviantART.com at the time.  That was a very interesting experience.  For one, it was my first time working 100% remote, and with a 100% distributed team, no less.  We had people in eastern and western Europe, and from the east coast to the west coast of the US.  It was also a big, high-traffic site.  I mean, not Google or Facebook, but I believe it was in the top 100 sites on the web at the time, according to Alexa or Quantcast (depending on how you count "top 100").

It also had a lot of custom tech.  My experience up to that point had mostly been with fairly vanilla stuff, stitching together a bunch of off-the-shelf components.  But deviantART was old enough and big enough that a lot of the off-the-shelf tools we would use today weren't widespread yet, so they had to roll their own.  For instance, they had a system to do traits in PHP before that was actually a feature of PHP (it involved code generation, in case you were wondering).

Today's post from the archives is about my first dealings with one such custom tool.  It nicely illustrates one of the pitfalls of custom tooling - it's usually not well documented, so spotting and resolving issues with it isn't always straightforward.  This was a case of finding that out the hard way.  Enjoy!


Lesson learned this week: object serialization in PHP uses more space than you think.

I had a fun problem recently.  And by "fun", I mean "WTF?!?"  

I got assigned a data migration task at work last week.  It wasn't a particularly big deal - we had two locations where users' names were being stored and my task was to consolidate them.  I'd already updated the UI code to read and save the values, so it was just a matter of running a data migration job.

Now, at dA we have this handy tool that we call Distributor.  We use it mainly for data migration and database cleanup tasks.  Basically, it just crawls all the rows of a database table in chunks and passes the rows through a PHP function.  We have many tables that contain tens or hundreds of millions of rows, so it's important that data migrations can be done gradually - trying to do it all at once would hammer the database too hard and cause problems.  Distributor allows us to set how big each chunk is and configure the frequency and concurrency level of chunk processing.
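To make that concrete, here's a minimal sketch of the general pattern Distributor implements.  This is not the actual internal code (the function name and details here are invented), just the "walk a big table in fixed-size chunks and hand each chunk to a callback" idea, assuming the table has a numeric id column to paginate on:

```php
<?php
// Sketch only: not the real Distributor, just the chunked-crawl pattern.
// Assumes $table is a trusted, internal table name with a numeric `id` column.
function crawlInChunks(PDO $db, string $table, int $chunkSize, callable $job): void
{
    $lastId = 0;
    while (true) {
        // Grab the next chunk of rows after the last one we processed.
        $stmt = $db->prepare(
            "SELECT * FROM {$table} WHERE id > ? ORDER BY id LIMIT " . (int)$chunkSize
        );
        $stmt->execute([$lastId]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        if (!$rows) {
            break;                      // no rows left - crawl is done
        }

        $job($rows);                    // hand the chunk to the job function
        $lastId = end($rows)['id'];     // remember where to resume

        sleep(1);                       // throttle so we don't hammer the DB
    }
}
```

The real tool is fancier (configurable frequency, concurrency, persistent progress tracking), but that's the basic shape.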

There are three parts to Distributor that come into play here: the distributor runner (i.e. the tool itself), the "recipe" which determines what table to crawl, and the "job" function which performs the actual migration/cleanup/whatever.  We seldom have to think about the runner, since it's pretty stable, so I was concentrating on the recipe and the job.

Well, things were going well.  I wrote my recipe, which scanned the old_name table and returned rows containing the userid and the name itself.  And then I wrote my migration job, which updated the new_name table.  (I'm fabricating those table names, in case you didn't already figure that out.)  Distributor includes a "counter" feature that allows us to trigger logging messages in jobs and totals up the number of times they're triggered.  We typically make liberal use of these, logging all possible code paths, as it makes debugging easier.  It all seemed pretty straightforward, and it passed through code review without any complaints.
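For context, the recipe and the job are just small PHP functions.  The real Distributor interfaces were internal, so the names and signatures below are invented, but the shape was roughly this - a recipe that pulls a chunk of (userid, name) rows from old_name, and a job that writes each row to new_name and bumps counters along the way:

```php
<?php
// Hypothetical shapes for the recipe and job - names and signatures invented.

// Recipe: fetch the next chunk of (userid, name) rows from old_name.
function old_name_recipe(PDO $db, int $lastUserId, int $chunkSize): array
{
    $stmt = $db->prepare(
        "SELECT userid, name FROM old_name WHERE userid > ? ORDER BY userid LIMIT " . (int)$chunkSize
    );
    $stmt->execute([$lastUserId]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Job: copy one row into new_name, counting every code path we hit.
function migrate_name_job(PDO $db, array $row, callable $counter): void
{
    if ($row['name'] === '') {
        $counter('skipped_empty_name');
        return;
    }

    $stmt = $db->prepare("UPDATE new_name SET name = ? WHERE userid = ?");
    $stmt->execute([$row['name'], $row['userid']]);
    $counter('updated_name');
}
```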

So I ran my distributor job.  The old_name table had about 22 million rows in it, so at a moderate chunk size of 200 rows, I figured it would take a couple of days.  When I checked back a day or two later, the distributor runner was reporting that the job was only 4% complete.  But when I looked at my logging counters, they reported that the job had processed 28 million rows.  WTF?!?  

Needless to say, head-scratching, testing, and debugging followed.  The short version is that the runner was restarting the job at random.  Of course, that doesn't reset the counters, so I'd actually processed 28 million rows, but most of them were repeats.  

So why was the runner resetting itself?  Well, I traced that back to a database column that's used by the runner.  It turns out that the current status of a job, including the last chunk of data returned by the crawler recipe, is stored in the database as a serialized string.  The reason the crawler was restarting was that PHP's unserialize() function was erroring out when trying to deserialize that string.  And it seems that the reason it was failing was that the string was being truncated - it was overflowing the database column!
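If you've never seen it, the failure looks roughly like this.  The array layout below is made up (it's not the actual Distributor state format), but the behavior of unserialize() on a cut-off string is the interesting part: it can't parse the blob, complains about an error at some offset, and just returns false.

```php
<?php
// Made-up stand-in for the job-status blob; the layout is invented,
// but the truncation behavior is the point.
$status = [
    'last_chunk' => [
        ['userid' => 12345678, 'name' => 'some_deviant'],
        ['userid' => 12345679, 'name' => 'another_one'],
    ],
];

$blob = serialize($status);
echo strlen($blob) . "\n";            // full length of the serialized state

// Pretend the database column was too small and chopped off the end.
$truncated = substr($blob, 0, 60);

// unserialize() can't parse a truncated blob: it raises an
// "Error at offset ..." notice (or warning, depending on PHP version)
// and returns false.
$restored = @unserialize($truncated);
var_dump($restored);                  // bool(false)
```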

The source of the problem appeared to be the fact that my crawler recipe was returning both a userid and the actual name.  You see, we typically write crawlers to just return a list of numeric IDs.  We can look up the other stuff at run-time.  Well, that extra data was just enough to overflow the column on certain records.  That's what I get for trying to save a database lookup!
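Just to illustrate how much difference that extra field makes (the values below are made up, only the chunk size matches): serializing a chunk of 200 bare IDs comes out to a few kilobytes, while the same chunk with a name tacked onto every row is several times larger, because PHP's serialization format wraps type and length markers around every single field.

```php
<?php
// Made-up data, but the same shape as the two kinds of chunks:
// 200 bare IDs versus 200 (userid, name) rows.
$idsOnly     = [];
$idsAndNames = [];
for ($i = 0; $i < 200; $i++) {
    $idsOnly[]     = 1000000 + $i;
    $idsAndNames[] = ['userid' => 1000000 + $i, 'name' => 'example_user_' . $i];
}

echo strlen(serialize($idsOnly)) . "\n";      // roughly 3 KB
echo strlen(serialize($idsAndNames)) . "\n";  // several times larger
```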

Disappearing knowledge

I saw an interesting article on Slashdot recently about the vanishing of online scientific journals.  The short version is that some people looked at online open-access academic journals and found that, over the last decade or so, a whole bunch of them have essentially disappeared.  Presumably the organizations running them either went out of business or just decided to discontinue them.  And nobody backed them up.

In case it's not already obvious, this is a bad thing.  Academic journals are supposed to be where we publish new advances in human knowledge and understanding.  Of course, not every journal article is a leap forward for humankind.  In fact, the majority of them are either tedious crap that nobody cares about, of questionable research quality, or otherwise not really that great.  And since we're talking about open-access journals, rather than top-tier ones like Nature, lower-quality work is probably over-represented in those journals.  So in reality, this is probably not a tragedy for the accumulated wisdom of mankind.  But still, there might have been some good stuff in there that was lost, so it's not good.

To me, this underscores just how transient our digital world is.  We talk about how nothing is ever really deleted from the internet, but that's not even remotely true.  Sure, things that go viral and are copied everywhere will live for a very long time, but an awful lot of content is really just published in one place.  If you're lucky, it might get backed up by the Internet Archive or Google's cache, but for the most part, if that publisher goes away, the content is just gone.

For some content, this is a real tragedy.  Fundamentally, content on the Internet isn't that different from offline content.  Whether it's published on a blog or in a printed magazine, a good article is still a good article.  A touching personal story is no more touching for being recorded on vinyl as opposed to existing as an MP3 file.  I know there's a lot of garbage on the web, but there's also a lot of stuff that has genuine value and meaning to people, and a lot of it is not the super-popular things that get copied everywhere.  It seems a shame for it to just vanish without a trace after a few short years.

I sometimes wonder what anthropologists 5000 years from now will find of our civilization.  We already know that good quality paper can last for centuries.  How long will our digital records last?  And if the media lasts 5000 years, what about the data it contains?  Will anthropologists actually be able to access it?  Or are they going to have to reverse-engineer our current filesystem, document, and media formats?  Maybe in 5000 years figuring out the MPEG-4 format from a binary blob on an optical disk will be child's play to the average social science major, who knows?  Or maybe the only thing they'll end up with is the archival-quality print media from our libraries.  But then again, given what the social media landscape looks like, maybe that's just as well....

Adding bookmarklets to mobile chrome

Author's note: I started the draft of this article way back on July 1, 2013. Sadly, it's still pretty relevant.

So I built myself a nice little web-based bookmarking app. I wanted something that would both give me some insight into how I use my bookmarks and also save me from worrying about syncing bookmarks between multiple browsers on multiple devices. And since I've regained my distrust of "the Cloud" with the demise of Google Reader, I decided to write my own. (Note from the future: Yes, I know that was seven years ago, but I still don't really trust cloud services.) If you're interested, check out the GitHub page. Maybe one day I'll make a real, official release of it. I call it Lnto, with "ln" being the UNIX "link" command and a tie-in to "LnBlog", the software that runs this site. (For LnBlog, the tie-in was to "ln" for the natural logarithm, i.e. "natural bLOG". Get it? I swear I thought it was funny at the time.)

One must-have feature for such an app is a "bookmark this page" feature. With native browser bookmarks, this is built in. With a web app...not so much. So the solution is to either write an extension for every browser I want to support (which is a lot of work), or just write a bookmarklet - a little piece of JavaScript that you can bookmark and run with a single click. Since this is a personal project that I'm doing in my limited free time, the latter seemed like the obvious choice.

There's just one problem here - mobile browsers. In addition to my laptop and my desktop, I have an Android phone and a Kindle Fire that I want to support. And while the actual bookmarklet code works just fine on all of those devices, actually bookmarking it isn't quite so easy. Because they're mobile browsers, you can't just drag the link to the toolbar as you would on the desktop.

Until recently, Firefox Mobile handled this well. (Author's note: We're back to the current time now, not 2013.) It would allow you to bookmark a bookmarklet like a normal bookmark. You just had to press the link and select a "bookmark this link" item from the menu. Then you could just bring up the bookmark screen when you were on a page and it would run the bookmarklet. However, with the updates for Firefox Mobile 81, that doesn't work anymore - the javascript: URL scheme doesn't seem to get executed when you invoke the bookmark. And other browsers don't seem to support bookmarking the bookmarklet in the first place. This link suggests that it's possible using bookmark syncing, but I'm not sure if that still works and I don't really want to turn on syncing anyway.

What I eventually did was just create a page that I can paste a URL into and it will do the same thing as the bookmarklet. It's not great, but it's serviceable. At some point, maybe I'll get around to creating an Android app. Then I'll have some native integration options to work with.