It’s been a while since I last posted (I have a good excuse: I was organising my wedding and then off on honeymoon!). But something’s been bugging me – a strange claim I’ve seen in a few places. Legend has it that 90% of all data was created in the last two years (ta @james_randerson for the link).
I find it difficult to get my head round this. I think the statement is supposed to convey the humungous amount of new data being created every day – from genetic sequencing to personal data recorded by smartphones. But it seems to be one of those statements where the story has been simplified so much as to be almost completely meaningless – what kind of ‘data’ are we even talking about?
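For what it’s worth, the claim is at least arithmetically self-consistent if you assume data has been growing exponentially at a steady rate – an assumption of mine, not anything the original claim spells out. Here’s a quick back-of-the-envelope sketch of what ‘90% created in the last two years’ would actually imply:

```python
import math

# Back-of-the-envelope sketch (my own illustration, assuming steady
# exponential growth, which the original claim doesn't specify).
#
# If the total stock of data is D(t) = D0 * g**t (t in years), then
# "90% of all data was created in the last 2 years" means:
#   D(t) - D(t - 2) = 0.9 * D(t)
#   => 1 - g**(-2) = 0.9
#   => g**2 = 1 / (1 - 0.9) = 10
growth_per_two_years = 1 / (1 - 0.9)             # 10x every two years
annual_growth = math.sqrt(growth_per_two_years)  # ~3.16x per year
doubling_time = 2 * math.log(2) / math.log(growth_per_two_years)

print(round(growth_per_two_years, 1))  # 10.0
print(round(annual_growth, 2))         # 3.16
print(round(doubling_time, 2))         # ~0.6 years to double
```

In other words, the claim amounts to saying the world’s data multiplies tenfold every two years – doubling roughly every seven months. Striking, but only if you accept a very narrow, digital definition of ‘data’ in the first place.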
‘Big data’ – that’s what kind. Big data is apparently ‘where data about individuals, groups and periods of time are combined into bigger groups or longer periods of time’. Sounds very meta. And it’s certainly mind-blowing to think just how much information is streaming back and forth between computers all around the world.
But data isn’t just bits encoded on a flash drive somewhere. Surely data existed before computers, before numbers, before writing? If we’re being philosophical about it, data is just ‘things known or assumed as facts’. And if data is simple facts, things that are, then it must have existed since the Big Bang, right?
So it seems unbelievably arrogant to state that 90% of data has been created in the last two years. What about the BILLIONS of years before that, when lots of stuff happened and existed? What about all the human beings who pondered the world before the creation of computers?
Maybe I’m taking this throwaway line too seriously. But if science is about what we can measure (our data), then we need to be hyper-aware of what we can’t measure (yet or ever). The data is out there, as Mulder and Scully might say, but that doesn’t mean we can capture it and store it on a computer. And just because we have captured some data – even a lot of data – that doesn’t mean it’s useful.
To be fair, ‘big data’ does highlight a genuine problem – in some areas of science there is simply too much data to handle. Genetic sequencing is now so fast that it’s becoming difficult to keep up with the sheer amount of genetic information out there.
But in a lot of ways, science has always had this problem. So many things to explore, and so little time! Every bit of data collected means that a decision has been made NOT to look at something else, so in one sense there is literally an infinite amount of possible data – impossible to collect or analyse.
So I don’t think we should get too hung up on big data or how much data we have ‘created’. The key questions are WHAT data should we collect and HOW should we analyse what we have. Big data is the least of our problems.