Wednesday 1 January 2014

PostgreSQL Exercises launch reflections and open-sourcing announcement


A couple of weeks ago I decided to launch my on-and-off side project of the last 15 months, PostgreSQL Exercises.  It's an SQL tutorial and exercises site designed to allow you to play around with a live relational database system online.  As a shameless (and consequently much-mocked :-) ) lover of all things data storage, I felt it was something useful I could give back to the community.  Generally speaking, the launch was a success - it was well-received, staying on the front page of Proggit for 24 hours, and HN for 12-ish.  Just thought I’d share some thoughts on the launch:

What went right:

  • The site stood up well under the load.  Despite my ongoing love affair with databases, I really think static generation was the right call.  I’m not putting ads on the site, so I want it to be super-cheap to run.  I also want to feel comfortable leaving the site up and running mostly-unmaintained if I get bored/busy.  Static generation is ideal for these purposes: no framework updates and associated code fixing required, and low attack surface by default.
  • People generally liked the site.  I’m appropriately filled with warm fuzzies :-).

What went wrong:

  • I failed to fully test every page on the VPS: I did a full run-through on my local machine, but only a (substantial) partial test on the VPS.  This was a head-desk-dumb mistake: it turns out that the sort order of strings under the en_US UTF-8 locale differs between OSX and Linux (there's a small illustration of this after the list).  Since the pages (including expected results) were generated locally and uploaded to the VPS, queries failed to validate correctly on a couple of pages.  Fortunately, a sharp-eyed proggit user spotted it quickly, and I was able to manually fix the affected pages, then switch both my machines to the C locale over the weekend.  I'll be writing a script to auto-test the pages shortly - something I should have done in the first place.
  • I launched with a couple of known design flaws (particularly the fact that it wasn’t obvious where to find a schema diagram).  These were picked up on rather rapidly!
  • The site basically didn’t appear on Google until recently.  A friendly proggit-er pointed out that I might be hurting myself because my pages linked to pgexercises.com/index.html rather than just pgexercises.com.  Apparently it’s possible that the two URLs get treated as duplicate content.
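
For the curious, here's what actually bit me with the locale issue.  As I understand it, PostgreSQL leans on the C library for non-C collations, and glibc's en_US.UTF-8 locale applies full Unicode-style collation while OSX's UTF-8 locales seemingly fall back to byte order.  Here's a minimal C sketch of the difference - the names are just made-up sample data, and this isn't part of the site's code:

    /* Sort a few strings with strcoll() under en_US.UTF-8.  On Linux (glibc)
     * the collation rules typically give "apple" < "Banana" < "cherry"; OSX's
     * UTF-8 locales appear to fall back to byte order, so "Banana" sorts
     * first.  This is the kind of discrepancy that broke my expected results. */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int coll_cmp(const void *a, const void *b)
    {
        return strcoll(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *names[] = { "Banana", "apple", "cherry" };
        size_t n = sizeof(names) / sizeof(names[0]);

        if (!setlocale(LC_COLLATE, "en_US.UTF-8")) {
            fprintf(stderr, "locale not available\n");
            return 1;
        }
        qsort(names, n, sizeof(names[0]), coll_cmp);

        for (size_t i = 0; i < n; i++)
            printf("%s\n", names[i]);
        return 0;
    }

Compile and run that on both systems and you'll most likely see the orders disagree - which is exactly why pages generated on one machine didn't validate on the other.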

Lessons learned:

Test every page in full on the production system before the initial release (…duh).  If you can spot a design flaw yourself, despite knowing the product inside out, it's probably a much more obvious problem for new users.  Fix it!

I was hilariously over-paranoid about performance.  I launched the site with wild hopes of getting to the front pages of Proggit and Hacker News, and was uncomfortably aware that I had absolutely no idea what dealing with that sort of traffic was like.  As a result I made double-sure that all my expiry headers were good, installed Nginx’s static gzip module, configured timeouts for queries, prepared the site to work under round-robin DNS, and watched the load on the server just in case I needed to spin up another VPS.  Here’s the CPU load on my 2GB Linode server during the period PGExercises was on the front pages of HN and Proggit:




So, yeah, maybe I need to chill out :-).  Many of the steps that I’ve taken to reduce server load have also made the site pretty quick from the user perspective, which is nice, but I’ll try to remember in future that when it comes to performance there’s room to launch a bit earlier and iterate.  It does give you some perspective on the insanely low cost of running mostly-static sites, though.

On feedback:

I sometimes see debates on HN about feedback - particularly over whether the environment is overly critical.  I’ve certainly seen a few situations where ‘Show HNs’ got pretty negative, and it’s fair to say that I was nervous about putting my own work into the limelight.  I’m happy to say that the feedback I got was very much positive and/or constructive - and you can’t ask for much more than that. 

All this meant rather a lot to me: generally speaking, I do side projects for myself, to learn a language, library, or to scratch an itch.  If people don’t like those projects, well, that’s okay.  I’m doing it for me.  PGExercises was very different in that respect:  I mostly stopped learning new things about 20% of the way in.  The rest of the project was fuelled by a desire to create something that people would like and find useful.  Seeing positive comments and email really makes the effort feel worthwhile, and motivates me to do more in the future.

Github:

I’ve finally got around to putting the site code on GitHub.  Be aware that the code quality is currently not that great - I’m very much a journeyman when it comes to web development, and the code has grown rather organically.  There are notes in the TODO file about what I plan to clean up.

The code is BSD licensed, while the content of the exercises (under the questions folder) is licensed under CC-BY-SA 3.0.

Other stuff:

I currently have a slightly Frankensteinian mess when it comes to CSS: most of my CSS is in the site.css file, but I also have some inline in the template HTML.  Initially I was planning to go through the template file, remove all the inline CSS, and put it into site.css.  On the other hand, if the purpose of non-inline CSS is to separate content from styling, I already have that in the form of the static templating system.

Part of me is tempted to move a bit more of the styling into the template, so that if I change styles I won't have weird problems with file expiry where the (cached) site.css is out of sync with the HTML files.  Does anyone more experienced with web dev have any thoughts on the pros/cons?

Monday 28 September 2009

RDF storage and modern computer architectures

Modern computers are complicated beasts. Most of us probably still work with a simple model in mind: slow secondary storage for bulk data and persistence (usually a hard disk), and fast random access memory for working data and program instructions, which are sequentially executed in the CPU. For day to day programming this is perfectly adequate for getting things done.

For those who need to extract the highest possible performance, however, things are not so simple: progress has interfered quite dramatically with this tidy model. The hard disk is perhaps the most obvious example: anyone who has spent any time on performance-sensitive disk-based programs will be aware of the importance of the page cache. This cache holds disk-backed data in spare memory, and helps cover the grinding inadequacy of the seek times offered by modern disks.

CPUs based on the x86 architecture have changed dramatically. For some time now they have featured pipelined, super-scalar architectures that don't even execute your instructions in the order that you might expect. Poor or unlucky program design can inhibit the use of these features (a whole other blog post waiting to happen...). On top of this, modern compiler optimisations (particularly those offered by JIT compilers) prompt behaviour that's hard to anticipate.

All of this makes it very difficult for a programmer to know for sure how to extract the highest possible performance out of their code. Unfortunately, it doesn't end there. As time has gone by, RAM technologies have failed to keep up with the staggering performance that modern processors can provide. While they continue to offer very creditable bandwidth (DDR3 modules being theoretically capable of transferring data at up to 13GB/s), they exhibit ever increasing latency relative to the performance of the CPU: modern CPUs can execute a couple of hundred instructions in the time it takes to fetch a single piece of data from RAM. These latencies are explained in greater detail in Gustavo Duarte's excellent What Your Computer Does While You Wait.

Clearly, we can't hang around for 200 cycles every time we want something from memory. As a result we've constructed layers of CPU caches to buffer data and instructions, progressively slower and more capacious as they sit further away from the CPU. A typical system will feature 2-3 levels of cache, with level 1 being the fastest. The level 1 cache can retrieve data in 2-3 cycles, while level 2 might take around 14. Level 1 caches are generally very small: around 32KB per core, while a level 2 cache might be in the low megabytes. Where three levels are present, the L2 per core will be much smaller, with a multi-megabyte shared L3 cache.

When we encounter a request that cannot be satisfied from a cache, data is fetched from RAM into the caches in units of cache lines, typically 32 or 64 bytes long. The CPU is also able to fetch data in advance if it detects a predictable access pattern - most commonly when iterating through an array. The practical upshot of this is that RAM isn't really random access at all any more: if anything, it's a bit like a faster disk. We're rewarded for colocating related data, and for performing our accesses in a predictable manner. It also follows that we're rewarded for keeping our data small: the smaller our pieces of information are, the more likely we are to be able to colocate them within the same cache line*.

As an illustration of this, I coded up a quick demonstration: a simple program that performs a series of binary searches over a large array, the size of which was expanded by a factor of 10 for each test. Since a binary search over N elements requires at most around log2(N) comparisons, in an uncached system we would expect the time per test to increase logarithmically, and the time per comparison to stay the same:

Size (KB)    Max. comparisons per search    Time (ms)    Time/comparison (ns)
600          18                             2514         14.0
6,000        21                             9293         44.3
60,000       24                             17963        74.8
600,000      28                             44106        157.5

As the table shows, the reality is very different. Note the particularly large increase in time required per comparison between the first two results: here, the dataset has outgrown the L2 cache size. Since the result of a binary search cannot be predicted by the CPU, we're starting to have to do work outside the cache, and this has a substantial impact on performance.
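
For reference, the benchmark was along these lines. This is a sketch rather than the exact code I ran, but the figures above are consistent with arrays of 4-byte integers (600KB is roughly 150,000 ints, and ceil(log2(150,000)) is 18) and around ten million searches per test, so that's what the sketch assumes:

    /* Time a large number of binary searches over a sorted int array.
     * The array size and search count below are illustrative assumptions;
     * scale n up by a factor of 10 for each successive test. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int binary_search(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key)
                lo = mid + 1;
            else if (a[mid] > key)
                hi = mid;
            else
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        size_t n = 150000;            /* ~600KB of 4-byte ints */
        size_t searches = 10000000;   /* searches per test */
        int *a = malloc(n * sizeof *a);
        if (!a)
            return 1;

        for (size_t i = 0; i < n; i++)
            a[i] = (int)(i * 2);      /* sorted, with gaps so some searches miss */

        srand(42);
        clock_t start = clock();
        volatile int found = 0;       /* stops the compiler optimising the loop away */
        for (size_t i = 0; i < searches; i++)
            found += binary_search(a, n, rand() % (int)(n * 2));
        clock_t end = clock();

        printf("found %d, elapsed %.0f ms\n", found,
               1000.0 * (double)(end - start) / CLOCKS_PER_SEC);
        free(a);
        return 0;
    }

Because the keys are random, the access pattern is unpredictable and the prefetcher can't help: once the array outgrows the caches, most comparisons pay a trip to RAM.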

This sort of effect is noticeable in other situations: the binary search tree (BST), traditionally very popular for in-memory work, is now slower for most purposes than a low-fanout B-Tree: the B-Tree groups related data within its larger nodes, while the BST offers terrible cache behaviour in comparison.

All of this informs my current work on in-memory RDF storage. I'm interested in examining different structures for indexing RDF data, and cache performance has a large part to play. RDF is actually particularly interesting from a cache performance point of view: since we usually map RDF URIs and literals to small integer IDs, they offer excellent potential for colocating related data within a cache line.
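
To make that concrete: if each URI or literal is mapped to (say) a 32-bit ID, a whole triple is only 12 bytes, so several triples - or several entries of an index node - fit in a single 64-byte cache line. The 32-bit IDs here are an assumption for illustration, not a settled design decision:

    /* Illustrative only: a triple of three 32-bit IDs is 12 bytes,
     * so a 64-byte cache line holds five of them. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t subject;
        uint32_t predicate;
        uint32_t object;
    } triple;

    int main(void)
    {
        printf("sizeof(triple) = %zu bytes; %zu triples per 64-byte cache line\n",
               sizeof(triple), (size_t)64 / sizeof(triple));
        return 0;
    }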

Over the coming posts I'll talk about a few more factors that interest me with respect to RDF storage, and my thoughts for dealing with it. If you'd like to read about this all in one place, with decent referencing and the aid of competent proof readers, you might want to take a look at my Mini-Thesis.

* I've done a fair bit of glossing over and hand-waving in this post for the sake of readability. If you want to know about caching in greater detail, I'd suggest Ulrich Drepper's rather complete What Every Programmer Should Know About Memory, a well-written if probably misnamed paper on the subject.

Thursday 24 September 2009

Reverse execution in GDB

So, not exactly the promised post about my work, but how incredibly useful is this? New GDB allows reverse execution.

Introduction

Welcome to my new blog. It's dedicated largely to my work on my PhD at the University of Southampton, with perhaps a smattering of general interest and personal info as well.

So, my PhD is focused on the storage and query of RDF data. My major focus is now on high-performance in-memory algorithms for indexing RDF. As part of this I'm interested in the effect of CPU data caches on algorithm performance, as well as the effect of object overheads such as those seen in Java. I'm currently working on evaluations of a bunch of indexing techniques with these factors in mind, with a view to implementing an adaptive algorithm of my own for RDF storage.

My previous projects are summarised on my web page here, and include work on distributed RDF stores and RDF dataset characterisation. Over the next few weeks I'll put up some posts explaining what I've been up to, both in the past and more recently.