Dirty Data – Hygiene Etiquette
If you’ve ever analyzed data, you know the pain of digging in only to find that the data is poorly structured, full of inaccuracies, or just plain incomplete. But “dirty data” isn’t just a pain point for analysts; it can ultimately lead to missed opportunities and lost revenue for an organisation. Gartner research shows that the “average financial impact of poor data quality on organizations is $9.7 million per year.”
The amount of time and energy it takes to go from disjointed data to actionable insights leads to inefficient ad-hoc analyses and declining trust in organizational data.
A recent Harvard Business Review study reports that people spend 80% of their time prepping data, and only 20% of their time analyzing it. And this statistic isn’t restricted to data stewards. Data prep tasks have bled into the work of analysts and even non-technical business users.
Enterprises are taking steps to overcome dirty data by establishing data hygiene etiquette:
- Understand your data location, structure, and composition, along with granular details like field definitions.
Some people refer to this process as “data discovery,” and it is a fundamental element of data preparation. Confusion around data definitions can hinder analysis or, worse, lead to inaccurate analyses across the company. For example, someone analyzing customer data may find that the marketing team defines the term “customer” differently than the finance team does.
- Standardize data definitions across your company by creating a data dictionary.
This will help analysts understand how terms are used within each business application, showing which fields are relevant for analysis and which are strictly system-based. Developing a data dictionary is no small task. Data stewards and subject matter experts need to commit to ongoing iteration, checking in as requirements change. An out-of-date dictionary can actually harm your organization’s data strategy. Communication and ownership should be built into the process from the beginning to determine where the dictionary should live and how often it should be updated and refined.
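As a rough illustration of the idea, a data dictionary can start life as a small structured record per field, which also makes it easy to check a data set against it. The sketch below is only one possible shape: the `FieldDefinition` class, the example entries, and both helper functions are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """One entry in a (hypothetical) data dictionary."""
    name: str
    definition: str
    owner: str          # team responsible for keeping the entry current
    for_analysis: bool  # False for fields that are strictly system-based


# Illustrative entries only; real definitions come from your data stewards.
DATA_DICTIONARY = {
    "customer_id": FieldDefinition(
        "customer_id", "Account with at least one paid order", "finance", True
    ),
    "signup_date": FieldDefinition(
        "signup_date", "Date the account was created (UTC)", "marketing", True
    ),
    "etl_batch_id": FieldDefinition(
        "etl_batch_id", "Internal load identifier", "data engineering", False
    ),
}


def undocumented_fields(columns, dictionary):
    """Return the columns that have no entry in the dictionary."""
    return [c for c in columns if c not in dictionary]


def analysis_fields(columns, dictionary):
    """Return the documented columns that are marked as relevant for analysis."""
    return [c for c in columns if c in dictionary and dictionary[c].for_analysis]


columns = ["customer_id", "etl_batch_id", "region"]
print(undocumented_fields(columns, DATA_DICTIONARY))
print(analysis_fields(columns, DATA_DICTIONARY))
```

Recording an owner on each entry is one way to build ownership into the process, since it makes clear who should be asked when a definition looks stale.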
- Cleanse data prior to import
You need to prepare your data before even thinking of importing it into your system. Every organization has specific needs, and there is no one-size-fits-all approach to data preparation. A self-service data preparation tool lets people see the full end-to-end process and spot potential problems earlier, such as misspellings in the data, extra spaces, or incorrect join clauses. It also increases confidence in the final analysis.
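For teams without a dedicated tool, the same kinds of checks can be sketched in a few lines of Python before an import. The example below is a minimal sketch using only the standard library; the correction table, the function names, and the assumption that rows arrive as dictionaries are all illustrative.

```python
import re

# Hypothetical correction table; in practice this would be maintained
# alongside the data dictionary.
KNOWN_MISSPELLINGS = {
    "untied states": "United States",
    "gemany": "Germany",
}


def clean_text(value: str) -> str:
    """Collapse extra whitespace, then fix known misspellings."""
    collapsed = re.sub(r"\s+", " ", value).strip()
    return KNOWN_MISSPELLINGS.get(collapsed.lower(), collapsed)


def clean_rows(rows, text_fields):
    """Return copies of the rows with the named text fields cleaned."""
    return [
        {
            key: clean_text(val) if key in text_fields and isinstance(val, str) else val
            for key, val in row.items()
        }
        for row in rows
    ]


def unmatched_keys(left_rows, right_rows, key):
    """Return key values in left_rows with no match in right_rows.

    A non-empty result is a hint that a join on this key would drop
    rows or that the join clause is wrong.
    """
    right_keys = {row[key] for row in right_rows}
    return sorted({row[key] for row in left_rows if row[key] not in right_keys})


orders = [{"customer_id": 1, "country": "  Untied   States "},
          {"customer_id": 3, "country": "Gemany"}]
customers = [{"customer_id": 1}]

print(clean_rows(orders, {"country"}))
print(unmatched_keys(orders, customers, "customer_id"))
```

Running the join-key check before the import, rather than after, surfaces orphaned records while they are still cheap to fix.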
- Hands off!!
Keeping your hands out of data that is in regular use increases the chances of it staying clean. Introducing even a little dirty data into a system can compromise an entire data set, and your little bit of dirty data has suddenly become a lot of dirty data. Cleaning up the mess is a far bigger job than making sure the data is clean before importing it.
- Invest in a self-service business intelligence tool
Adopting self-service data prep across an organization requires users to learn the ins and outs of the data. Since this knowledge was historically reserved for IT and data engineering roles, it is crucial that analysts take time to learn about nuances within the data, including its granularity and any transformations that have been applied to the data set. Scheduling regular check-ins or setting up a standardized workflow for questions allows engineers to share the most up-to-date way to query and work with valid data, while empowering analysts to prepare data faster and with greater confidence.
Data hygiene should be a top concern in organisations. Devoting some resources to ensuring that the data you’re basing decisions on is complete and accurate is a smart investment, because dirty data is costly in so many ways. To get the most and best use out of your data, you need to take the time to ensure its quality is sufficient and that data used by different departments is integrated. This gives you the most complete and precise customer view, so you can make smarter decisions and maximize your return on investment.