Data Quality – A Narrative
The first move in Data Quality
Over the years, organizations have accumulated massive amounts of data, and this data has become an asset. These data sets generate many insights and meaningful information for decision making, and the value of data has grown multifold compared to a few years back. However, the way data was entered into systems led to many inconsistencies, such as duplicate, incomplete, and inconsistent records, which drove users away from the information delivered to them. Hence, there is a need to address these issues and improve data consistency, accuracy, and completeness across the organization by implementing data quality initiatives. Data auditing/profiling is the first step toward resolving data-related issues. It is the only method to find out what is happening with data sets across various applications. Auditing provides the ability to analyze data (including Big Data) in a systematic and continuous process: a well-thought-out, methodical, repeatable, consistent, and metrics-based means of evaluating the data. Data auditing has three primary methods: column-based, rule/logic-based, and dependency/relationship-based. Column auditing consists of finding date- and number-related inconsistencies. This article's prime focus is on the latter two.

Basics before the deep dive
Standardization is one of the important auditing activities, ensuring that data across the organization is standardized based on rules. In the figure above, the values show different renderings of what is the same address. The standardization process corrects and streamlines these values so that the organization has a unified version. It is essential to look at dimensions like the Customer and Product data sets for possible standardization. Many organizations face serious problems with their customer and product dimensions because many reports depend directly on the accuracy of this data. Any data inconsistency can lead to serious issues and a trust deficit in the reporting systems. Data is collected through multiple channels with no consistency or standards in the attributes, leading to duplicates across customer, product, and many other dimensions. The problem becomes severe when the same data is pulled into analytics, producing unreliable results. The biggest hit is on customer- and product-related data, which impacts customer communication and results in missed sales opportunities. In the figure above, there are many versions of the same customer due to inconsistent keying of values.

Because different applications have different requirements, the same data gets keyed into multiple applications. These records, spread across multiple applications, should be consolidated or merged into a single record to achieve data completeness; in the figure above, the three different entities are merged into one record. Enhancement is the process of adding value to existing data sets by collecting additional, related, and reference information to complete the base data set of entities, and integrating all of this information to ensure completeness.
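As a rough illustration, the standardization and merge steps described above can be sketched in a few lines of Python. The abbreviation map, the field names, and the "first non-empty value wins" survivorship rule are illustrative assumptions, not the article's specific method:

```python
# Sketch of rule-based address standardization and record merging.
# The abbreviation map and the "first non-empty value wins" survivorship
# rule are illustrative assumptions.

ABBREVIATIONS = {"st": "street", "st.": "street", "rd": "road",
                 "rd.": "road", "ave": "avenue", "ave.": "avenue"}

def standardize_address(address: str) -> str:
    """Lower-case, strip commas, and expand common abbreviations."""
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def merge_records(records: list) -> dict:
    """Consolidate duplicate records: keep the first non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            if not merged.get(field):
                merged[field] = value
    return merged

# Two inconsistent entries for the same address collapse to one version.
variants = ["12 Main St., Springfield", "12 main street springfield"]
assert len({standardize_address(v) for v in variants}) == 1
```

In practice, standardization would draw on reference data (postal directories, master product catalogs) and a governed rule repository rather than a hard-coded map.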
Matching is the process of linking records from various applications so that related records are connected through a unique key. Matching is done using various attributes in order to identify duplicates in the respective data sets. In the de-duplication and merging step, two or more identical or duplicate records are merged into one. As part of the data governance process, an exhaustive rule engine is established to match the data based on various rules. Another important aspect of data auditing is the identification of noisy data. In data science, noise is defined as any unwanted or meaningless data that cannot be interpreted to derive meaningful insights. Noisy data often misleads algorithms into generating error-prone patterns; outliers are one example. Noise removal can be done in three ways: distance-based, density-based, and clustering-based outlier detection.

Rule Validation
Rule validation is probably the most difficult exercise compared to the techniques mentioned above. It requires the participation of someone with a broad understanding of the business, so it is necessary to include an analyst who understands the domain as part of the data quality initiative. Before the auditing activities commence, all the business use cases with their validations need to be prepared. For example, in the insurance domain:
- Missing policy details in the claims data
- Premium data generated by the policy administration system should match the general ledger
- Accuracy and a single version of policy data, such as premium, benefits, and contract terms
- No sudden jump or slump in premium collection or claim counts compared to average values
- The claim date cannot be earlier than the policy inception date
- The claim date should be 30 days or more after the inception date
- The policy creation date cannot be a placeholder value such as 99/99/99
- No mismatch in premium data between the policy management and general ledger applications
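A few of the date-related rules above can be sketched as executable checks. The field names (policy_id, inception_date, claim_date) and the violation messages are invented for illustration; a real implementation would pull these rules from the governed rule repository:

```python
# Sketch of insurance claim validation against policy data.
# Field names and violation messages are illustrative assumptions.
from datetime import date

def validate_claim(claim: dict, policies: dict) -> list:
    """Return the list of rule violations for one claim record."""
    policy = policies.get(claim.get("policy_id"))
    if policy is None:
        return ["missing policy details"]
    violations = []
    if claim["claim_date"] < policy["inception_date"]:
        violations.append("claim date earlier than policy inception")
    elif (claim["claim_date"] - policy["inception_date"]).days < 30:
        violations.append("claim within 30 days of inception")
    if str(policy.get("creation_date", "")) == "99/99/99":
        violations.append("placeholder policy creation date")
    return violations

policies = {"P1": {"inception_date": date(2020, 1, 1)}}
claim = {"policy_id": "P1", "claim_date": date(2020, 1, 10)}
print(validate_claim(claim, policies))
```

Cross-system reconciliation checks (premium vs. general ledger) would follow the same pattern, comparing aggregates from both applications instead of individual fields.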
Where to address
Once profiling is done and a detailed analysis of all the data-related issues is ready, the best place to fix these issues is at the source application. Fixing issues at the application is always beneficial, as it ensures the issues are fixed permanently, with less scope for recurrence. However, in many cases we may not have the liberty or scope to change the applications. In that case, the best place to address them is the staging area of the architecture.

Rule Engine
Fixing data quality issues is an ongoing, multi-stage exercise over constantly changing data sets. Addressing these issues on an ongoing basis requires a framework that not only fixes them but also monitors them continuously. The rule engine is one such important component of the data quality framework: a repository where all the business rules and algorithms are stored. These rules are defined by business users or derived during the auditing activity. Rules are dynamic, since the data changes over time, and they can be triggered at whatever latency is required. Along with simple rules, it is vital to define complex domain-specific rules, which set the stage for resolving some of the key, complex, and important data issues. The effectiveness of the rule engine depends on how comprehensive the rules are. As more rules are added covering patterns, trends, scenarios, complex logic, and basic mathematical functions, the share of issues corrected automatically increases. Though it is not practically possible to address every issue automatically, a robust rule engine ensures that the maximum percentage of issues is resolved. Data that fails the data quality stage is directed to manual handling, where a data specialist validates it at various stages before deciding on the final outcome. The error log will contain two sets of data:
- Attempted data, where the rule engine could not fix the issue due to a low confidence score
- Records for which no rule exists in the engine to address the issue
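The routing logic behind these two error-log categories can be sketched as follows. The rule tuple shape, the 0.8 confidence cutoff, and the sample country-code rule are assumptions made for illustration:

```python
# Sketch of a rule engine that applies the first matching rule and routes
# everything else to the error log. The 0.8 confidence cutoff and the
# sample rules are illustrative assumptions.

CONFIDENCE_CUTOFF = 0.8

def apply_rules(record, rules):
    """Return (fixed_record, error_reason); exactly one of the two is set."""
    for matches, fix, confidence in rules:
        if matches(record):
            if confidence >= CONFIDENCE_CUTOFF:
                return fix(record), None
            # Attempted, but confidence too low: route to manual review.
            return None, "low confidence"
    # No rule in the engine addresses this record.
    return None, "no matching rule"

rules = [
    # (predicate, fix, confidence score)
    (lambda r: r.get("country") == "U.S.A",
     lambda r: {**r, "country": "US"}, 0.95),
    (lambda r: r.get("country") == "US?",
     lambda r: {**r, "country": "US"}, 0.5),
]

assert apply_rules({"country": "U.S.A"}, rules) == ({"country": "US"}, None)
```

Records returned with an error reason would be written to the error log for the data specialist, matching the two categories listed above.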
Data Score
The current quality level of the data can be shown using a scoring engine, which gives users a clear indication of the confidence level of the data sets. The data score can be generated using many algorithms; it is vital that the algorithm uses the right set of parameters, such as those below:
- Audit records
- Total records
- Confidence
- Error
- Risk level
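One possible way to combine these parameters into a single score is sketched below. The multiplicative formula and the 0-to-1 risk discount are assumptions for illustration, not a formula from the article:

```python
# Sketch of a data-score calculation over the parameters listed above.
# The multiplicative weighting and risk discount are illustrative assumptions.

def data_score(total_records, audited_records, error_records,
               avg_confidence, risk_level):
    """Score in [0, 100]: audit coverage times the clean-record share,
    times average rule confidence, discounted by risk (0 = none, 1 = max)."""
    if total_records == 0:
        return 0.0
    coverage = audited_records / total_records
    clean_share = 1 - (error_records / total_records)
    return round(100 * coverage * clean_share * avg_confidence * (1 - risk_level), 1)

score = data_score(total_records=1000, audited_records=800, error_records=0,
                   avg_confidence=0.9, risk_level=0.0)
print(score)
```

A real scoring engine would tune the weighting of each parameter with business input, so that the published score reflects the confidence users can actually place in the data.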