A curdling scream and gasps can be heard from within the halls of the workspace. Hearing this, you jump into the nearest broom closet only to emerge dressed in a suit made of spandex, a bright red cape, and the letters DQ, short for “Data Quality,” emblazoned on your chest. Running down the hall, you find the commotion, quickly administer prompt justice, and vanquish the evil dirty data that is causing people to collapse and curl up into the fetal position. Taking a few glances at the data, you run it through the Accenture Data Quality Rules Accelerator and POOF! Once again, you are the hero!
Okay, so maybe that’s a bit dramatic. In all likelihood, you wouldn’t be wearing a cape; too often those things get caught in pesky doors or cause a face plant into the ground. I do encourage you to wear a face mask though!
Data quality initiatives aren’t quite as glamorous. In reality, you won’t be leaping tall buildings in a single bound, but rather copious amounts of data and information. You won’t be emerging from a phone booth in a gleaming suit, though you may well run to the broom closet in fear. What I can tell you is that data is often dirty, and the amount of information organizations hold can be so overwhelming that tackling data quality issues is difficult, costly, and headache inducing, and may leave you wanting to jump to other projects.
A recent survey of organizations found that most have yet to calculate the ramifications of poor data quality. So what does this mean? It means that most organizations don’t know what to do about the data they have. Those who have worked on a data quality engagement know how tedious, time consuming, and frustrating it can be to sort through the mounds of information. This only compounds the problem: the headaches lead some to avoid the issue altogether for fear of spending even more money on what they see as a money pit.
Furthermore, data quality problems are no longer limited to names and addresses. Dirty data is encountered frequently on engagements; it is pervasive, and data can be of such poor quality that 30% to 80% of the effort in a data integration initiative goes to data cleanup and understanding. That process involves interviewing subject matter experts (if they are still around), hunting down documentation, reading all available materials, and manually discovering and creating the data quality rules that guide the scoring of how clean the data is.
For this reason, I want to introduce the Accenture Data Quality Rules Accelerator (ADQRA). The ADQRA tool, currently in beta, is part of a larger R&D initiative surrounding data quality and was created in our Technology Labs with the support of AIMS. The accelerator condenses what would normally consume a significant amount of time and seeds data quality efforts. Given a data set, it automatically returns a set of data quality rules that can be used to pinpoint which data is dirty and which isn’t, how dirty the data is, and how much cleanup effort is needed. The current version of the tool essentially detects inconsistencies in a given dataset, one of the six dimensions of data quality. The discovered data quality rules can then be used to enforce proper data entry or even to surface interesting patterns. The ADQRA can do this in a short amount of time because it continually checks the stability of a rule as it encounters data, which means it doesn’t need to scan the entire dataset to settle on a data quality rule. Results are returned relatively quickly, and the process is tolerant of dirty data.
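To make the stability idea concrete, here is a minimal sketch in Python. It is not the ADQRA’s actual algorithm; the candidate rule, window logic, and thresholds are all invented for illustration. The point it demonstrates is the one described above: by checking a rule’s violation rate window by window, discovery can declare the rule stable and stop early, tolerating a few dirty records along the way, without reading the whole dataset.

```python
# Illustrative sketch only (not the ADQRA's real algorithm): check a
# candidate "if-then" rule for stability over windows of a record stream,
# stopping early once the rule looks stable.

def rule_holds(record):
    # Hypothetical candidate rule: if state is "CA" then country must be "USA".
    if record["state"] == "CA":
        return record["country"] == "USA"
    return True  # condition doesn't apply, so the record is not a violation

def discover(stream, window_size=4, error_rate=0.25, stable_windows=2):
    """Accept the rule once its violation rate stays at or below
    `error_rate` for `stable_windows` consecutive windows."""
    window, streak, seen = [], 0, 0
    for record in stream:
        window.append(record)
        seen += 1
        if len(window) == window_size:
            violations = sum(not rule_holds(r) for r in window)
            streak = streak + 1 if violations / window_size <= error_rate else 0
            window = []
            if streak >= stable_windows:
                return True, seen  # rule judged stable; stop early
    return False, seen

data = [
    {"state": "CA", "country": "USA"},
    {"state": "NY", "country": "USA"},
    {"state": "CA", "country": "USA"},
    {"state": "CA", "country": "Canada"},  # one dirty record is tolerated
    {"state": "CA", "country": "USA"},
    {"state": "TX", "country": "USA"},
    {"state": "CA", "country": "USA"},
    {"state": "WA", "country": "USA"},
    {"state": "CA", "country": "USA"},
    {"state": "OR", "country": "USA"},
]

stable, records_read = discover(data)
print(stable, records_read)  # the rule stabilizes after 8 of the 10 records
```

Here the single dirty record falls within the allowed error rate, so the rule is accepted after two clean-enough windows, before the stream is exhausted.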
Using the Data Accelerator is a three-step process: load your data, discover data quality rules, and then browse the rules.
In the current version of the Data Accelerator, data is selected and uploaded.
Once the data is uploaded, the user can tweak the results by manipulating several parameters:
Maximum number of rules – the ADQRA stops and returns its results once it has discovered this many rules
Maximum number of conditions – the maximum number of conditions in a data quality rule, where a rule’s conditions form the left-hand (“if”) side of an “if-then” rule
Maximum number of seeds – the number of condition combinations used to seed rule discovery
Coverage – the minimum share of the data a rule must cover for it to be considered interesting
Error rate – the error rate expected in the data
Frequency – how frequently the ADQRA checks for rule stability
Window size – the number of tuples to consider at a given time
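As a rough illustration of how these parameters and rules fit together, the Python snippet below shows an invented configuration and one invented “if-then” rule, along with how a rule’s coverage could be measured. The names, values, and rule format are all hypothetical; they are not the ADQRA’s actual configuration or export format.

```python
# Hypothetical illustration of the parameters described above and of what a
# discovered "if-then" rule might look like. All names and values are
# invented for this sketch; they are not the ADQRA's real settings.

params = {
    "max_rules": 50,       # stop after this many rules are discovered
    "max_conditions": 3,   # at most 3 conditions on the "if" side of a rule
    "max_seeds": 100,      # condition combinations used to seed discovery
    "coverage": 0.05,      # a rule must apply to at least 5% of the data
    "error_rate": 0.02,    # fraction of records expected to be dirty
    "frequency": 1000,     # check rule stability every 1,000 records
    "window_size": 5000,   # number of tuples considered at a time
}

# Example discovered rule:
#   IF city = "Chicago" AND state = "IL" THEN zip_prefix = "606"
rule = {"if": {"city": "Chicago", "state": "IL"},
        "then": {"zip_prefix": "606"}}

def coverage(rule, records):
    """Fraction of records that the rule's 'if' side applies to."""
    applies = [r for r in records
               if all(r.get(k) == v for k, v in rule["if"].items())]
    return len(applies) / len(records)

records = [
    {"city": "Chicago", "state": "IL", "zip_prefix": "606"},
    {"city": "Chicago", "state": "IL", "zip_prefix": "606"},
    {"city": "Austin",  "state": "TX", "zip_prefix": "787"},
    {"city": "Boston",  "state": "MA", "zip_prefix": "021"},
]

cov = coverage(rule, records)
print(cov)  # 2 of the 4 records match the rule's "if" side
```

A rule whose coverage falls below the configured minimum would simply not be reported, which is how the coverage parameter filters out rules that apply to too little of the data to be interesting.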
Once the rules are discovered, they can be browsed, edited, deleted, and approved, then exported to a format suitable for use with Informatica.
If any of this is remotely interesting to you, then I encourage you to contact Accenture and Accenture Technology Labs about the Accenture Data Quality Rules Accelerator. Take it for a test drive and see how it can help you. With the ADQRA you too might become a data quality superhero (but please, leave the spandex at home).