Making Big Data Small
29 August 2012

Most people appreciate the value of what's come to be known as "Big Data", but it's rather harder to appreciate the tools, and particularly Hadoop, that are associated with the movement. Making fun of Hadoop is cheap and easy, so I'll just note that when your product is so complex that people download not the software but a distribution, you've got some serious bloat going on. What we need is not Big Data but lean data: simple tools that let small teams move fast when analysing large amounts of data. This post is about one such family of tools, known as streaming algorithms.

Streaming algorithms have a long history, arising from the need to analyse data in volumes so large that storing it all on disk is impossible. Think of the Large Hadron Collider, the phone network, and Internet giants like Google. The main restriction in the streaming world is that you look at each data point only once. You can do a surprising number of analyses within this constraint, and the algorithms are simple, blazing fast, and real-time by default. These properties make them a great alternative to map-reduce style processing even when you already have the data on disk.
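To make the one-pass idea concrete, here's a minimal Scala sketch (not from any particular library) that computes a count and a running mean over a stream while touching each value exactly once. The data source and names are purely illustrative.

```scala
// A minimal sketch of the one-pass constraint: compute the count and mean
// of a stream while looking at each value exactly once.
// `readings` stands in for any source you can only traverse once.
object OnePassMean extends App {
  val readings: Iterator[Double] = Iterator(4.2, 3.7, 5.1, 4.9, 4.4)

  // Carry (count, mean) through the stream, updating both per element.
  val (count, mean) = readings.foldLeft((0L, 0.0)) {
    case ((n, m), x) => (n + 1, m + (x - m) / (n + 1))
  }

  println(s"count = $count, mean = $mean")
}
```

Nothing is buffered and nothing is revisited, which is exactly what makes this style of computation cheap and real-time by default.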

That's the basic introduction. In future installments I want to go over some of the key algorithms, including:

  • The Bloom filter and the related Count-Min sketch, simple but versatile data structures for summarising a stream of data (there's a small taste sketched after this list).
  • Finding the most frequent items in a stream of data, the so-called heavy-hitters problem.
  • Estimating the number of distinct elements in a large set using much, much less memory than my first computer had (4K, it was a Vic-20).
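As a taste of the first item above, here's a minimal Bloom filter sketch in Scala. It's not a production implementation (the hashing is deliberately simplistic and the sizes are placeholders), but it shows the core idea: a bit set plus several hash functions, giving approximate set membership with false positives but no false negatives.

```scala
import scala.collection.mutable

// A minimal Bloom filter: a bit set plus numHashes hash functions.
// Items are Strings here; the seeded tuple hash is just for illustration.
class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new mutable.BitSet(numBits)

  // Derive several bit positions from one item by mixing in a seed.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = (item, seed).hashCode % numBits
      if (h < 0) h + numBits else h
    }

  def add(item: String): Unit = positions(item).foreach(bits += _)

  // May return true for items never added (false positives),
  // but never false for an item that was added.
  def mightContain(item: String): Boolean = positions(item).forall(bits)
}

object BloomDemo extends App {
  val filter = new BloomFilter(numBits = 1 << 16, numHashes = 4)
  List("apple", "banana", "cherry").foreach(filter.add)
  println(filter.mightContain("banana")) // true
  println(filter.mightContain("durian")) // almost certainly false
}
```

The whole summary lives in a fixed-size bit set, which is the recurring trick behind these data structures: trade a little accuracy for a huge saving in memory.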

No doubt there will be detours and additions along the way.

If you just can't get enough of this, I'll be talking about streaming algorithms at O'Reilly's Strata Conference (1 Oct 2012) and the London Scala Users Group (12 Sept 2012).