This is episode 5 of the Dataverax YouTube channel and speaks to the data quality dimension of accuracy.
Back when I was in business school, they used to say there were three types of lies: lies, damned lies, and statistics. In the age of Big Data and Open Data we ought to be past all that. But we also live in the age of Fake News, and depending on who is throwing that term around, we can usually find confirmation bias at work.
Confirmation bias is the tendency to seek out evidence that confirms what we already believe to be true, and to ignore all other evidence. This is the opposite of what we should do in a data-driven world. Data should be the leading indicator. We should challenge our own underlying assumptions and look at situations with fresh eyes.
The danger comes when we try to fit the data to a story we want to tell. The data should tell the story, not us. But this is difficult. Malevolent actors, such as the clever yet criminally-minded accountants who perpetrated the greatest stock frauds of the 2000s, found expert ways to hide liabilities and costs off-balance sheet, rendering the available data not merely useless but actively false.
Sometimes it is not even on purpose. As we try to make sense of a world of COVID data, each region and jurisdiction handles the situation somewhat differently. How do we count COVID cases? How do we count COVID deaths? Who is being tested? How do we compare apples to apples?
The important thing is that we do not script this story in advance. Do we want to tell a story of government incompetence, or of central authority working as well as it can in bad circumstances? Do we believe we already know the answer and just want the data to confirm it? Or are we willing to explore insights we did not expect, and let the data do the talking?
And we also need to address deficiencies in data collection and processing, because at the end of the day, our data should reflect the world as it is, not as we believe it should be.
I am writing the following companion piece to go along with my Tableau Public visualization of COVID Cases per Million (North America Region). I created this visualization because I was curious about the state of the outbreak and how it varies from jurisdiction to jurisdiction. At first blush we can see that New York state has been hardest hit by the pandemic, with neighbouring states also affected. It is interesting to see how little some states and regions are affected, at least as far as the pure number of cases goes.
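The per-million figure in the visualization is a simple population normalization, which is what lets jurisdictions of very different sizes be compared at all. A minimal sketch of that calculation, using hypothetical jurisdictions and made-up figures purely for illustration:

```python
# Sketch of the population normalization behind a "cases per million" view.
# Jurisdiction names and all figures below are hypothetical.

def cases_per_million(cases: int, population: int) -> float:
    """Normalize a raw case count by population so jurisdictions of
    different sizes can be compared on the same scale."""
    return cases / population * 1_000_000

# Two hypothetical jurisdictions: the smaller one reports fewer raw
# cases but turns out to have a higher per-million rate.
jurisdictions = {
    "Jurisdiction A": {"cases": 50_000, "population": 20_000_000},
    "Jurisdiction B": {"cases": 9_000, "population": 1_500_000},
}

for name, d in jurisdictions.items():
    rate = cases_per_million(d["cases"], d["population"])
    print(f"{name}: {rate:,.0f} cases per million")
```

Note that normalization only fixes the size problem; it does nothing about the comparability problems discussed below, since the raw counts themselves depend on how each jurisdiction tests and records cases.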
But herein lies a problem. There are challenges in using this data, which I have drawn from public sources. Each jurisdiction has handled this crisis a little differently, so we may not be comparing apples to apples. This may call into question the integrity and quality of the data, but sometimes you can only make do with what you have. Especially in a crisis.
I trust we will learn many lessons from this crisis in terms of public health, government effectiveness, and also in terms of data quality. This situation continues to unfold, and I shall continue to follow it closely.