Does (Big) Data Suck?

In a world where social media has given everyone a voice and a platform from which to shout, there is no doubt that ‘fake news’ is a thing.

It has become really difficult to identify the truth or have absolute confidence in what we read online. I don’t know about you, but I often seek solace in data.

Data is, after all, objective and based on numbers and is not polluted by personal opinion. It can’t be wrong, can it?

Data hype

It is a cliché, but data is the new oil. There is massive value in data as it serves as the foundation for a lot of modern technology. Without data, artificial intelligence cannot really exist and it feels as though pretty much everything is AI-driven these days…

We are guilty of riding on the data-hype train at Browser Media as we describe ourselves as a data-driven, creative digital marketing agency. Data is a fundamental part of pretty much everything we do, as we strive to make decisions based on quantitative / objective numbers rather than a whim.

In recent weeks, however, I have found myself questioning whether our blind faith in data has, in fact, rendered us blind to the truth. Have we all become so reliant on data that we are losing the ability to think for ourselves? There have been two instances of what I consider to be fairly colossal data fails:

1) US election pollsters wrong (again)

I have no idea how much is spent on pre-election polling in the US, but you can rest assured that it will be an eye-watering amount. The industry was still reeling from its embarrassing failure to forecast a Trump victory in the 2016 election, so it made strong claims that it had learned its lessons, updated its algorithms, etc.

Throughout the build-up to the 2020 election, Biden enjoyed a significant lead across most polls. This lead should not be downplayed – Biden was apparently so far out in the lead that he would still win by a comfortable margin even if the polls suffered from the same margin of error as 2016.

Whilst the final vote count remains to be seen (show some humility, Mr Trump!), there is no question that the election was much closer than the pollsters had suggested. Whatever your political leanings, there can be no denying that it was a close-run thing and certainly not the landslide victory that was forecast.

So much for their updated algorithms and claims that they had learned their lessons from 2016. Their analysis of the data got it wrong.

2) UK Gov coronavirus media briefing(s)

I am not really sure where to start with an analysis of the media briefings that we have been subjected to in recent months.

This is not really the place to be political but I am sure most will agree, whatever your views on how to tackle the challenge of the virus, that the communication from the Government has been woeful. Leadership requires clarity of direction and certainly not the series of u-turns that we have seen over the course of this year. The one persistent theme in what we have been told has been that we need to ‘follow the science’, which demonstrates the implicit trust in data. The ‘scientific’ data has been used to justify the decisions made in Boris’s ivory tower.

Lockdown 2.0 was announced with a series of slides that ~~represented new heights of ineptitude~~ justified the decision. Aside from the fact that it was a visual catastrophe (the information overload didn’t even fit on the screen!), the lockdown was justified primarily on the prediction that, without drastic action, daily covid deaths in the UK would hit 4,000.

I was not the only one to raise my eyebrows at this particular number and it has subsequently been ridiculed. So much so, in fact, that the Office for Statistics Regulation (OSR) gave Whitty and Vallance an extremely polite rap over the knuckles. The doomsday forecast was based on outdated data / flawed modelling and it has now been quietly downgraded to 1,000.

In my humble opinion, that sort of shift is unacceptable and undermines ‘the science’ that is leading policymakers. It was a gross exaggeration and I find it dumbfounding that more has not been made of it, given the fact that the OSR had to step in. I had been feeling more and more sceptical of the data being presented to us in these briefings, but this felt like a final nail in the coffin.

What went wrong?

I think that there are different reasons for these data facepalms, but they both demonstrate that you should not blindly believe data that is presented to you.

In the case of the US pollsters, I suspect that many Trump voters were not prepared to admit that they were going to back him in the election. Thanks to cancel culture, I believe that many of those asked about their political leanings would have said Biden to avoid a possible public lambasting. It is the safe response, even if it is not actually what you believe. But it is not the answer that they will give when in the privacy of the ballot box.

Strictly speaking, the polling data was actually correct. The numbers were not lying, but you need to apply a human / emotive / qualitative filter to question whether the numbers are likely to be accurate. The algorithms failed to allow for this and the result was a big serving of humble pie.
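To put some rough numbers on that idea, here is a minimal sketch of how a 'shy voter' effect could distort a poll in a two-candidate race. Everything in it is an assumption for illustration: the function name `reported_shares`, the 50/50 true split and the 6% shy fraction are invented and are not taken from any real poll.

```python
def reported_shares(true_share_a: float, shy_fraction: float) -> tuple[float, float]:
    """Return the (A, B) shares a poll would record in a two-candidate race
    when `shy_fraction` of A's supporters tell the pollster they back B."""
    hidden = true_share_a * shy_fraction  # A supporters who answer "B"
    return true_share_a - hidden, (1 - true_share_a) + hidden


# Hypothetical inputs: a dead heat in reality, with 6% of A's voters staying quiet.
true_a = 0.50
poll_a, poll_b = reported_shares(true_a, shy_fraction=0.06)

print(f"True split:    A {true_a:.0%} / B {1 - true_a:.0%}")
print(f"Poll split:    A {poll_a:.0%} / B {poll_b:.0%}")
print(f"Apparent lead: {100 * (poll_b - poll_a):.1f} points for B")
```

The poll records every answer faithfully, yet a tied race shows up as a comfortable lead. The numbers are 'correct', but they measure what people say rather than what they do.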

In the case of the UK media briefings, I believe that the primary issue is that the data was being used to underpin a political agenda. Throughout the course of 2020, data has been used to scare the UK population. The gov’s own documentation states that it is “highly acceptable to use media to increase the sense of personal threat”, so only the most negative / terrifying data has been selected and its presentation (e.g. inconsistent scales) has been designed to achieve the maximum impact.

A secondary issue reflects the makeup of the SAGE number crunchers. They are awash with mathematical modellers, but the lack of molecular virologists and immunologists poses the real risk that modelling will not include some fundamental factors that will be familiar to the immunologists. A simple example of this is the fact that the SAGE models assume that 100% of the population will be susceptible to Covid-19. This does not account for any pre-existing immunity (primarily from T-cells), so the forecasts of cases, hospital admissions and deaths will immediately be higher than those that people with more experience in viral spread would produce.
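The sensitivity to that one assumption can be illustrated with a toy model. The sketch below is a textbook SIR (susceptible / infected / recovered) model stepped one day at a time. It is emphatically not the SAGE model, and every parameter (`r0`, `infectious_days`, `seed`, the 80% figure) is a made-up value chosen only to show the direction of the effect.

```python
def sir_peak(susceptible_fraction: float, r0: float = 3.0,
             infectious_days: float = 7.0, days: int = 365,
             seed: float = 1e-4) -> float:
    """Peak share of the population infected at once in a toy SIR model.

    `susceptible_fraction` is the share of people who can catch the virus
    at the start; the rest are treated as already immune.
    """
    gamma = 1.0 / infectious_days   # daily recovery rate
    beta = r0 * gamma               # daily transmission rate
    s = susceptible_fraction - seed
    i = seed
    peak = i
    for _ in range(days):
        new_infections = beta * s * i
        recoveries = gamma * i
        s -= new_infections
        i += new_infections - recoveries
        peak = max(peak, i)
    return peak


# Hypothetical comparison: everyone susceptible vs. 20% with prior immunity.
for fraction in (1.0, 0.8):
    print(f"{fraction:.0%} susceptible -> peak infected {sir_peak(fraction):.1%}")
```

Same virus, same raw inputs, but changing one modelling assumption moves the headline number considerably.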

Again, it is the algorithm that needs scrutiny – the raw data (population numbers, contagiousness, etc.) should be fairly black and white, but how you interpret / model such data can introduce multiple shades of grey very quickly.

The moral of the story

So, is data dead? Can I no longer rely on this objective lifeline to lead me to ‘the truth’?

Of course not, but I have definitely learned to apply a common-sense filter to everything that I see these days. Trump is correct in stating that there is a lot of fake news out there. This is certainly not a confession that I would vote for Trump (but tell the pollsters that I will vote for my namesake), but I think that it is very important to recognise that data can be manipulated and that the algorithms used to make sense of data overload can lead to the wrong conclusions.

Data is a tool. A very powerful tool. As with all tools, its efficacy is really down to who is using the tool and for what purpose. It is essential to engage the most powerful tool at your disposal (your brain) when looking at any data and make sure that you have understood the source of the data and the context in which it is being presented to you. Do not blindly ‘follow the science’ just because you are told that it is fact.

Data most certainly is not (always) the gospel truth and artificial intelligence, whilst amazing at some things, can be unintelligent when compared to the sense of intuition that the human brain offers.

And finally…

As a footnote, I wanted to share an amazing piece of data visualisation that is relevant to this post. If you were glued to the news channels trying to understand what was happening in the US election, you may well have shared my impression that it was a done deal for Trump as the whole of the US appeared to be red.

Thank you very much to Karim from Jetpack.ai for sharing this amazing visualisation that answers my question and proves that “Acres don’t vote, people do”:

I am not the first one and I surely won't be the last one to point out that traditional electoral maps are a misleading way to represent the outcome of an election. This is something you can't stress enough: https://t.co/euqDAvTvmU pic.twitter.com/KCEGigmaZu

— Karim Douïeb (@karim_douieb) February 22, 2020

Click on the link above to explore different visual representations of the same data. It is brilliantly executed.
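For anyone who wants to see the 'acres vs people' effect in numbers, here is a small sketch. The three regions, their areas and their voter counts are entirely invented, and each region is simplistically treated as all one colour, so treat it as an illustration of the principle rather than anything resembling real election data.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    area_sq_miles: float
    voters: int
    winner: str


# Invented figures: two large, sparsely populated regions and one dense city.
regions = [
    Region("Rural North", 9_000, 1_000_000, "red"),
    Region("Rural South", 7_000, 800_000, "red"),
    Region("Metro East", 500, 2_500_000, "blue"),
]


def share(colour: str, weight: str) -> float:
    """Share of the map held by `colour`, weighted by the given Region attribute."""
    total = sum(getattr(r, weight) for r in regions)
    held = sum(getattr(r, weight) for r in regions if r.winner == colour)
    return held / total


print(f"Red share of the map by area:   {share('red', 'area_sq_miles'):.0%}")  # 97%
print(f"Red share of the map by voters: {share('red', 'voters'):.0%}")         # 42%
```

A map coloured by land area looks like a red landslide, while the same data weighted by voters tells the opposite story.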

Like I said, remember that data can be manipulated to tell a different story 🙂