Alisdair's Data-Blog: 2009

Modern computers are complicated beasts. Most of us probably still work with a simple model in mind: slow secondary storage for bulk data and persistence (usually a hard disk), and fast random access memory for working data and program instructions, which are sequentially executed in the CPU. For day to day programming this is perfectly adequate for getting things done.

For those who need to extract the highest possible performance, however, things are not so simple: progress has interfered quite dramatically with this simple model. The hard disk is perhaps the most obvious example: Anyone who has spent any time on performance-sensitive disk-based programs will be aware of the importance of the page cache. This cache holds disk-backed data in spare memory, and helps cover the grinding inadequacy of the seek times offered by modern disks.

CPUs based on the x86 architecture have changed dramatically. For some time now they have featured pipelined, super-scalar architectures that don't even execute your instructions in the order that you might expect. Poor or unlucky program design can inhibit the use of these features (a whole other blog post waiting to happen...). On top of this, modern compiler optimisations (particularly those offered by JIT compilers) prompt behaviour that's hard to anticipate.

All of this makes it very difficult for a programmer to know for sure how to extract the highest possible performance out of their code. Unfortunately, it doesn't end there. As time has gone by, RAM technologies have failed to keep up with the staggering performance that modern processors can provide. While they continue to offer very creditable bandwidth (DDR3 modules being theoretically capable of transferring data at up to 13GB/s), they exhibit ever increasing latency relative to the performance of the CPU: modern CPUs can execute a couple of hundred instructions in the time it takes to fetch a single piece of data from RAM. These latencies are explained in greater detail in Gustavo Duarte's excellent What Your Computer Does While You Wait.

Clearly, we can't hang around for 200 cycles every time we want something from memory. As a result we've constructed layers of CPU caches to buffer data and instructions, progressively slower and more capacious as they sit further away from the CPU. A typical system will feature 2-3 levels of cache, with level 1 being the fastest. The level 1 cache can retrieve data in 2-3 cycles, while level 2 might take around 14. Level 1 caches are generally very small: around 32KB per core, while a level 2 cache might be in the low megabytes. Where three levels are present, the L2 per core will be much smaller, with a multi-megabyte shared L3 cache.

When we encounter a request that cannot be satisfied from a cache, data is fetched from RAM to the caches in units of cache lines. These are typically 32 or 64 bytes long. The CPU is able to fetch data in advance if it detects a predictable access pattern: most usually iterating through an array, for example. The practical upshot of this is that RAM isn't really random access at all any more: if anything, it's a bit like a faster disk. We're rewarded for colocating related data, and for performing our accesses in a predictable manner. It also follows that we're rewarded for keeping data smaller, too: the smaller our pieces of information are, the more likely we are to be able to colocate them within the same cache line*.

As an illustration of this, I coded up a quick demonstration: a simple program that performs a series of binary searches over a large array, the size of which was expanded by a factor of 10 for each test. Since the binary search requires log₂(N) comparisons per search, in an uncached system we would expect the time per test to increase in logarithmic fashion, and the time per comparison to stay the same:

Size (KB)	Max. comparisons per search	Time (ms)	Time/comparison(ns)
600	18	2514	14.0
6,000	21	9293	44.3
60,000	24	17963	74.8
600,000	28	44106	157.5

As the table shows, the reality is very different. Note the particularly large increase in time required per comparison between the first two results: here, the dataset has outgrown the L2 cache size. Since the result of a binary search cannot be predicted by the CPU, we're starting to have to do work outside the cache, and this has a substantial impact on performance.

This sort of effect is noticeable in other situations: the BST, traditionally very popular for in memory work, is now slower for most purposes than a low-fanout B-Tree: the B-Tree groups related data within its large nodes, while the BST in comparison offers terrible cache behaviour.

All of this is information that informs my current work on in-memory RDF storage. I'm interested in examining different structures for indexing RDF data, and cache performance has a large part to play. RDF is actually particularly interesting from a cache performance point of view: since we usually give RDF URIs and literals small integer IDs, they offer excellent potential for related data colocation within a cache line.

Over the coming posts I'll talk about a few more factors that interest me with respect to RDF storage, and my thoughts for dealing with it. If you'd like to read about this all in one place, with decent referencing and the aid of competent proof readers, you might want to take a look at my Mini-Thesis.

* I've done a fair bit of glossing over and hand waving in this post for the sake of readability. If you want to know about caching in greater detail, I'd suggest Ulrich Drepper's rather complete What Every Programmer Should Know About Main Memory, a well written if probably misnamed paper on the subject.

Alisdair's Data-Blog

Monday, 28 September 2009

RDF storage and modern computer architectures

Thursday, 24 September 2009

Reverse execution in GDB

Introduction

About Me

Blog Archive