
Little Data: Compressing Vicky Pollard


Britain, Britain, Britain, the land that discovered nylon socks a week ago and invented the electrical thingamabob to do stuff. One of the characters wot lives here in this land is teenager Vicky Pollard. Her words, like Shakespeare's, will ring through the ages.

'Yeh but no but yeh but no but yeh but no but yeh... but no because I never done nuthin' nor nuthin' and anyone says I did is well gonna get beatens'.

The world of Little Britain has become the biggest TV phenomenon in years, with many unforgettable comedy characters, one of the favourites being schoolgirl Vicky Pollard from Darkley Noone. If we look at Vicky's famous catchphrase above we notice several things, not least the fact that she does tend to repeat herself. But it's not just Vicky who repeats. Most writing repeats words, with common words like 'the' and 'and' cropping up all over the written page. Now suppose you wanted to send Vicky's message using the least amount of data. How could you make the amount of information smaller without losing any of it? This is what data compression does, and we can use Vicky's catchphrase as a good example.
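To see just how much Vicky repeats herself, here is a minimal Python sketch that counts the words in her catchphrase. The names PHRASE, words and counts are our own, and we have assumed that capitals don't matter, so 'Yeh' and 'yeh' count as the same word.

```python
import re
from collections import Counter

PHRASE = ("Yeh but no but yeh but no but yeh but no but yeh... but no because "
          "I never done nuthin' nor nuthin' and anyone says I did is well gonna get beatens")

# Lower-case everything, then pull out runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", PHRASE.lower())
counts = Counter(words)

print(len(words))             # → 32
print(counts.most_common(3))  # → [('but', 7), ('yeh', 4), ('no', 4)]
```

Three words account for 15 of the 32, which is exactly the kind of repetition a compressor can exploit.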


The phrase starts off as 32 words and 116 characters (147 if you count spaces too). The first step is to look at the words. 'Yeh', 'but' and 'no' turn up a lot. So does 'nuthin'. We can use this to create a code book, representing each of the common words or patterns of letters by a number: for example 'Yeh'=1, 'but'=2, 'no'=3 and "nuthin'"=4 (apostrophe included). We can also use 5 to stand for the three full stops in '...', and 'on'=6. So the phrase now becomes:

'1 2 3 2 1 2 3 2 1 2 3 2 1 5 2 3 because I never d6e 4 3r 4 and any6e says I did is well g6na get beatens'
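The code book idea can be tried out in a few lines of Python. This is only a sketch under our own assumptions: the function names encode, decode and count_chars are made up for illustration, matching ignores capitals (so the decoded phrase starts with 'yeh' rather than 'Yeh'), and the message itself must not contain the digits 1 to 6. Our encoder also writes 'yeh...' as '15' rather than '1 5', which makes no difference to the count without spaces.

```python
import re

PHRASE = ("Yeh but no but yeh but no but yeh but no but yeh... but no because "
          "I never done nuthin' nor nuthin' and anyone says I did is well gonna get beatens")

# The article's code book: pattern -> code.
CODE_BOOK = {"yeh": "1", "but": "2", "no": "3", "nuthin'": "4", "...": "5", "on": "6"}
DECODE_BOOK = {code: pattern for pattern, code in CODE_BOOK.items()}

# Longest patterns first, so a long pattern is tried before a short one.
_PATTERN = re.compile(
    "|".join(re.escape(p) for p in sorted(CODE_BOOK, key=len, reverse=True)),
    re.IGNORECASE,
)

def encode(text: str) -> str:
    """Replace every code book pattern with its number."""
    return _PATTERN.sub(lambda m: CODE_BOOK[m.group(0).lower()], text)

def decode(coded: str) -> str:
    """Replace every number 1-6 with its pattern (always in lower case)."""
    return re.sub(r"[1-6]", lambda m: DECODE_BOOK[m.group(0)], coded)

def count_chars(text: str) -> int:
    """Count characters, ignoring spaces."""
    return len(text.replace(" ", ""))

coded = encode(PHRASE)
print(coded)
print(count_chars(PHRASE), "->", count_chars(coded))  # → 116 -> 72
```

Decoding the coded message gives back the original words, apart from the lost capital 'Y', because a bare number carries no record of which letters were capitals.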

The message is now only 72 characters long (not counting spaces), but remember that you need to send the code book with the message to allow it to be reconstructed at the other end. This is the way data compression software like ZIP compresses large amounts of data, including pictures or music, which are stored as binary 1s and 0s. There are always patterns within the data. These are used to create a code book and reduce the data transmitted. Some compression systems are adaptive. That is, they look through the data, select a possible set of code words, then overwrite these as better code words or patterns are found, often produced from compressing the first set of code words.
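One well-known adaptive scheme is LZW, sketched below in Python. We have swapped it in purely as an illustration: it is not the exact method ZIP uses, and rather than overwriting code words it keeps adding longer ones, each built from a pattern it has already seen plus one more character. The function names are our own, and the sketch assumes every character in the text has a code below 256 (plain English text is fine). Notice that the code book never needs to be sent, because the receiver rebuilds it while decoding.

```python
def lzw_compress(text: str) -> list[int]:
    """Turn text into a list of codes, growing the code book as patterns repeat."""
    code_book = {chr(i): i for i in range(256)}  # start with every single character
    next_code = 256
    current = ""
    output = []
    for ch in text:
        candidate = current + ch
        if candidate in code_book:
            current = candidate                  # keep extending a known pattern
        else:
            output.append(code_book[current])    # emit the longest known pattern
            code_book[candidate] = next_code     # learn the new, longer pattern
            next_code += 1
            current = ch
    if current:
        output.append(code_book[current])
    return output

def lzw_decompress(codes: list[int]) -> str:
    """Rebuild the text, reconstructing the same code book along the way."""
    if not codes:
        return ""
    code_book = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = code_book[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in code_book:
            entry = code_book[code]
        elif code == next_code:
            entry = previous + previous[0]       # pattern defined by this very step
        else:
            raise ValueError(f"unexpected code {code}")
        pieces.append(entry)
        code_book[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)

PHRASE = ("Yeh but no but yeh but no but yeh but no but yeh... but no because "
          "I never done nuthin' nor nuthin' and anyone says I did is well gonna get beatens")
codes = lzw_compress(PHRASE)
print(len(PHRASE), "characters ->", len(codes), "codes")
print(lzw_decompress(codes) == PHRASE)  # → True
```

Each code here stands for one or more characters, so the more Vicky repeats herself, the fewer codes are needed. A real system would also have to decide how many bits to spend on each code, which this sketch ignores.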

However it does it, data compression helps get information small enough to be transmitted quickly, so we can download texts, movies and music. It's a useful technology - no yeh or buts about it.