Dirty Data

Martin Zeman
Published in Data Driven Sales
Oct 7, 2017


One of the most common reasons data projects fail is that the data is not clean enough.

Most data projects start with a healthy dose of optimism. An analyst identifies the data that is needed, builds a report or a piece of analysis and then victoriously presents it to the end users. The end users look at the report and within a few seconds they point at it and say: “That’s not right!”

So the analyst goes back to the drawing board to investigate how the issue happened. They discover that it’s actually the raw data that contains the incorrect information, so they clean it up, update the report and bring it back…

… and the scenario repeats itself, only with another piece of information.

Repeat this three times and you’ve completely lost the confidence and trust of your audience.

Data is dirty; that’s its natural state.

Unless data is being used day to day, there will be errors in it. Some data project managers try to fight this through extensive user testing and data validation exercises, but I’m not a big fan of this approach.

Firstly, it significantly extends the time it takes to get information to the users. Secondly, some of these errors won’t be discovered until the report is in the hands of the end users and they start filtering and drilling into the data.

I take a different approach: instead of seeing dirty data as the problem, I recognise that the real problem is the potential loss of the users’ trust.

Now, here’s the magic trick: the loss of confidence in data doesn’t happen because the data is inaccurate; it happens because the users expected it to be accurate.

See where I’m heading?

Instead of trying to deliver a perfect solution, I adjust the expectations of the users of my reports. When I launch a new report, piece of analysis or tool, I explicitly state that there are likely to be errors in the data (both in the raw data and potentially in the way I transformed it).

I make it the users’ responsibility to discover these inaccuracies and report them to me, so we can investigate them together and, if necessary, they can fix them in their own systems.

This simple change turns the biggest critics, people with exceptional attention to detail, into my greatest helpers and supporters. This way I leverage their strengths for the benefit of the project. And as a by-product, they become even more invested in the report because they feel a stronger sense of ownership. After all, they helped make the report better.

Data is dirty, and it needs to be cleaned up in order to be useful and to be used. Utilise your end users to do the cleaning. You will get them more engaged and you will eliminate the biggest risk of failure in your data projects.
