21 January 2010

A data remember

The digital world went a little moist with anticipation this week, as Sir Tim Berners-Lee launched the new government website data.gov.uk, opening up reams of HMG's information to the wider world. Everyone seems terribly enthusiastic about the event and, in particular, how user-friendly it is. The vision is for a nation of mash-ups - developers beavering away to build applications that draw on this data to tell us, well, lots of important things that we can't yet anticipate.

The idea seems to have come about with frightening ease, according to Sir Tim in The Guardian:

"Gordon Brown said to me, 'How should the UK make the best use of the Internet?' and I replied that the government should just put all of its data on it," Berners-Lee recalled. "And he said 'OK, let's do it'."

Job done. Imagine how tempting it would have been to say, with the Prime Minister just waiting to carry out whatever you say, "Free porn for the over 75s". Or "A webcam in every home".

While I welcome the principle of open government and free access to public information, I can't help thinking this is symptomatic of the main problem with the Internet itself - providing a waterfall when all you wanted was a cupful. A quick search under "schools" will yield more data sets than you could possibly have thought existed: Post-16 participation in training in Wales; "Core Accessibility Indicators", sorted by school type; Cross Local Authority border movement of school pupils resident in England. In fact, my overriding thought is not "hooray for openness" but "blimey, doesn't the government collect a lot of information".

Actually, what set alarm bells ringing in my mind was the prospect of unlimited data sets for limited minds. In particular, the offhand remark by Sir Tim that it offered the chance "potentially to discover hidden patterns that may not be obvious from the raw information." Generally speaking, that is the sort of thing best left to statisticians, very clever people who understand things like data points, randomness, regression to the mean, biases, clustering, statistical significance and the Bonferroni correction. I fear we are opening up the candy store to the idiot kleptomaniac children who will have the capacity to waste an inordinate amount of everyone's time because they don't know anything about how stats work.

I may not know much about how statistics works myself, but I do know that data dredging will allow all the conspiracy theorist nuts, quacks and fanatics to convince themselves they have the evidence that backs up their claims about UFOs, telephone masts and fluoride in the water. I also know that, as a general principle, data collected for one purpose and then used to demonstrate another is often flawed - it's the oldest trick in the book if you want to fake evidence. Say you collect data to show a correlation between school attendance and smoking and find nothing, but you notice an unusually high number of blond-haired children showing up in the results. It's a short step from there to "proving" that blond children are more likely to take up smoking.
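For anyone who wants to watch the dredging happen, here is a rough Python sketch. Everything in it is made up for illustration: the function names, the 500 imaginary children, the 200 coin-flip "traits" and the 20% smoking rate are my assumptions, and none of it touches real data.gov.uk data. Each trait is unrelated to smoking by construction, yet at the usual 5% threshold roughly one test in twenty should still come up "significant" by chance. The Bonferroni correction simply divides that threshold by the number of tests.

```python
import math
import random


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (hits_a + hits_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 1.0
    z = (hits_a / n_a - hits_b / n_b) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def dredge(n_children=500, n_traits=200, alpha=0.05, seed=1):
    """Test many meaningless traits against smoking; count the 'discoveries'.

    All data is simulated: every trait is an independent coin flip,
    so any 'significant' link to smoking is a fluke.
    """
    rng = random.Random(seed)
    smokes = [rng.random() < 0.2 for _ in range(n_children)]
    total_smokers = sum(smokes)

    p_values = []
    for _ in range(n_traits):
        has_trait = [rng.random() < 0.5 for _ in range(n_children)]
        n_with = sum(has_trait)
        n_without = n_children - n_with
        smokers_with = sum(s and t for s, t in zip(smokes, has_trait))
        smokers_without = total_smokers - smokers_with
        p_values.append(
            two_proportion_p_value(smokers_with, n_with, smokers_without, n_without)
        )

    # Naive dredging: call anything under alpha a finding.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests run.
    corrected = sum(p < alpha / n_traits for p in p_values)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = dredge()
    print(f"'Significant' traits without correction: {naive}")
    print(f"'Significant' traits with Bonferroni:    {corrected}")
```

Run it with a few different seeds and the uncorrected count bounces around, while the corrected count can never exceed it. Any one of those flukes is a "blond children smoke more" headline waiting to happen.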

I don't predict this will give us any labour-saving apps anytime soon, since its use in creating hacked applications will pass by 99.5% of the population. But I do predict that it will be used by chancers wanting to get misleading, dangerous or malicious stories into the news with the weight of "evidence" behind them. The Daily Mail's health agenda is about to go nuclear.
