“Let’s create a Hadoop™ Data Lake”.
I’m sure you have heard this, or something like it, proposed as the newest solution for fixing issues with the data in your enterprise. Such a suggestion is really an updated version of the “single Data Warehouse”. While both are great solutions to potential data processing and reporting requirements, implementing them will not automatically fix your data issues.
If we can’t get quality data in the multiple existing systems we already have, why should we believe that adding a layer of complication to the process by aggregating that data into a single warehouse will solve our problems?
Problems with data, be they quality, completeness, accuracy, fitness for purpose, or any similar issue, are not problems with the technology you currently have in place. They are simply issues with your business data that need to be fixed by the business. This means spending the time and money to understand the root causes of your data issues and correcting them at the source.
How do we accomplish this?
First and foremost, the end users of the data, often the teams that do the reporting, need to define what they mean by quality data. The following steps are a basic guide:
- Look at tests they run today that have identified issues with the data.
- Systematically identify all of the ways to check the data to ensure it is complete and accurate.
- Turn these into requirements that can be fed back down the data feeds (lineage) to have the data quality checks implemented close to the source of the data.
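As a rough illustration of the last step, a requirement can be written as a named, testable rule that any system along the lineage could run. This is only a minimal sketch: the `QualityRule` class, the `run_rules` helper, and the trade feed fields (`trade_id`, `amount`) are all hypothetical names chosen for the example, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # one row of business data, keyed by field name


@dataclass(frozen=True)
class QualityRule:
    """A single business requirement expressed as a testable predicate."""
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


def run_rules(records, rules):
    """Return (record_index, rule_name) for every rule a record fails."""
    failures = []
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.check(record):
                failures.append((index, rule.name))
    return failures


# Hypothetical rules for a hypothetical trade feed.
RULES = [
    QualityRule("trade_id is present", lambda r: bool(r.get("trade_id"))),
    QualityRule(
        "amount is positive",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
    ),
]

sample = [
    {"trade_id": "T1", "amount": 100.0},  # passes both rules
    {"trade_id": "", "amount": -5},       # fails both rules
]

for index, rule_name in run_rules(sample, RULES):
    print(f"record {index} failed: {rule_name}")
```

Keeping each rule as a small named object means the same list can be handed to every team along the data feed, so everyone is testing against the same definition of quality.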
Initially you will probably have to implement tests at multiple nodes along the lineage, repeating tests that may already be carried out by upstream systems. However, until there is a model providing trust along the lineage, every node needs to ensure it is meeting the quality requirements.
Take the business rules identified and evolve them into data controls that ensure data issues are identified as close to their source as possible. These controls should include not just the business rules that need to be tested, but also the reporting and remediation processes required to respond to any issues that are identified.
These controls should include processes to review the cause of the issue and identify any changes that need to be made to ensure the issue does not happen again. You can then create certified data sources and a trust model, which means that data from a certified source does not need to be re-tested.
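One possible way to picture a control and a simple trust model is sketched below. Everything here is an assumption made for illustration: the `DataControl` class, the `is_certified` function, the "ledger" source name, and especially the threshold of three consecutive clean runs, which your business would need to set for itself.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataControl:
    """Hypothetical control: a business rule plus its issue log."""
    source: str
    rule_name: str
    check: Callable[[dict], bool]
    issues: list = field(default_factory=list)
    clean_runs: int = 0  # consecutive runs with no failures

    def run(self, records):
        """Test a batch; log an issue for remediation if anything fails."""
        failed = [r for r in records if not self.check(r)]
        if failed:
            self.issues.append({
                "source": self.source,
                "rule": self.rule_name,
                "failed_records": failed,
                "root_cause": None,  # filled in during the review process
            })
            self.clean_runs = 0
        else:
            self.clean_runs += 1
        return not failed


def is_certified(controls, required_clean_runs=3):
    """Assumed trust model: a source is certified only when every control
    has a run of clean batches and every logged issue has a root cause."""
    return bool(controls) and all(
        c.clean_runs >= required_clean_runs
        and all(issue["root_cause"] is not None for issue in c.issues)
        for c in controls
    )


control = DataControl("ledger", "amount is positive", lambda r: r["amount"] > 0)
control.run([{"amount": -1}])          # fails, so an issue is logged
for _ in range(3):
    control.run([{"amount": 5}])       # three clean batches follow
print(is_certified([control]))         # still uncertified: no root cause yet
control.issues[0]["root_cause"] = "sign flipped in upstream export"
print(is_certified([control]))
```

The point of the design is that passing tests alone is not enough to earn trust. The source is only treated as certified once the cause of each past issue has been reviewed and recorded.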
Although the steps described above only take up a couple of paragraphs, you will find they are very hard to do in practice. But without undertaking them, there is no point in spending money to implement a single source Data Warehouse solution. The end result would only be that the same data that is currently available in multiple systems with issues will now be available in a single location with issues, and possibly more of them, because aggregating the data adds more opportunities for errors to creep into the process.
If, however, you spend the time to get your data clean, accurate, and timely, then you may be able to truly realize the benefits of Hadoop or a Data Warehouse. Your implementation should also be faster and cheaper because you are not spending time trying to understand whether the errors you are seeing are caused by the new system or by bad data.