Data Quality: A Brief Overview
On a business level, data quality is key to great AI products and directly affects competitive advantage: data is now a fundamental asset of organizations, fueling strategic planning and decision making on a daily basis.
Among the many definitions available, one point stands out: high-quality data must represent the real-world phenomena it refers to in a faithful and trustworthy manner.
To achieve this general goal, it helps to picture data quality as part of an overall data pipeline. The quality dimension of that pipeline can be seen as follows:
it starts with proper data modeling, so that the phenomena of interest are represented as well as possible, and
it relies on efficient mechanisms of checks, validation and normalization to ensure that the data produced conform to the model and to the desired outputs.
In practice, this means that data quality procedures and frameworks exist to ensure that data fits the needs of planning and decision making, at both the design and the technical level.
So, what are the dimensions of data quality?
Some critical dimensions which directly impact the quality of information produced from data are: believability, accuracy, reputation, relevancy, understandability and consistency.
From a technical point of view, ensuring all these dimensions means that datasets must be fit for the information produced from them. Hence, data quality at this level is at least a bipartite task:
a) to assess and b) to correct data problems in information systems.
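As a minimal sketch of these two tasks in plain pandas (the column names and validity rules below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical batch of customer records, used only for illustration
df = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "age": [34, -1, 27, 230],
})

# a) Assess: count records that violate simple quality rules
report = {
    "missing_customer_id": int(df["customer_id"].isna().sum()),
    "age_out_of_range": int((~df["age"].between(0, 120)).sum()),
}
print(report)  # {'missing_customer_id': 1, 'age_out_of_range': 2}

# b) Correct: drop unidentifiable records and keep only plausible ages,
#    replacing the rest with NaN for later imputation or review
clean = df.dropna(subset=["customer_id"])
clean["age"] = clean["age"].where(clean["age"].between(0, 120))
```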
Both tasks are preferably carried out by automated routines, and a plethora of tools is available for the job. Let's highlight two of them:
Great Expectations: an open-source Python package designed to validate, profile and document data. With this tool, one writes assertions about the data (expectations) as Python methods, which are then used to validate the data: in other words, to verify whether a given batch of data meets the requirements of a previously defined suite of expectations.
The data profile (a set of expectations) can also be created automatically by the library, based on basic statistics observed in the data. These configurations are stored in YAML files. Finally, the documentation is rendered as HTML, where one can browse both the expectation suites and the data validation results in a continuous flow. Great Expectations works with backends such as pandas, Spark and SQL, and with data sources/storage such as S3, data warehouses, local filesystems, Databricks, EMR, Athena and others.
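As a minimal sketch of that workflow using the classic pandas-backed API (newer releases of the library expose a different, context-based API); the orders dataset and the expectations below are hypothetical:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical batch of orders, used only for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["paid", "shipped", "paid"],
    "amount": [9.99, 4.50, 120.00],
})

# Wrap the pandas DataFrame so the expectation methods become available
batch = ge.from_pandas(orders)

# Expectations: assertions about what a valid batch must look like
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_unique("order_id")
batch.expect_column_values_to_be_in_set("status", ["paid", "shipped", "cancelled"])
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the batch against the suite of expectations declared above
result = batch.validate()
print(result.success)
```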
Deequ: Deequ is a Scala-based open-source tool developed by AWS, with a Python API (PyDeequ) available. Natively integrated with Apache Spark, Deequ performs data quality checks through custom or automatically suggested constraints that are verified against data arranged as a Spark DataFrame.
Its core components are metrics computation, constraint suggestion, constraint verification and the metrics repository. The latter supports persisting and tracking Deequ runs over time, building a history of its measurements.
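As a sketch of the constraint-verification flow through the Python API, assuming PySpark and PyDeequ are installed; the orders DataFrame and the constraints below are hypothetical:

```python
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# SparkSession with the Deequ jar on the classpath (PyDeequ exposes the Maven coordinates)
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Hypothetical batch of orders, used only for illustration
df = spark.createDataFrame(
    [(1, "widget", 9.99), (2, "gadget", 4.50), (3, "widget", None)],
    ["order_id", "product", "price"],
)

# Constraints to be verified against the DataFrame
check = (Check(spark, CheckLevel.Warning, "orders batch check")
         .isComplete("order_id")   # no missing order ids
         .isUnique("order_id")     # order ids must not repeat
         .isNonNegative("price"))  # prices cannot be negative

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# One row per constraint, with its status and a message when it fails
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```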
Its stateful computation of metrics is handy when dealing with datasets that change throughout the day: Deequ only needs to measure the newer data, while still allowing metrics storage and anomaly detection for each batch.
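And a sketch of the metrics repository, persisting a few metrics of the current batch to a JSON file so that later runs can be compared against this history; it reuses the spark session and df from the previous snippet, and the file path and tags are hypothetical (the repository API may vary slightly across PyDeequ versions):

```python
from pydeequ.analyzers import AnalysisRunner, Size, Completeness, Mean
from pydeequ.repository import FileSystemMetricsRepository, ResultKey

# File-backed repository where the metrics of each run are appended
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, "metrics.json")
repository = FileSystemMetricsRepository(spark, metrics_file)

# A key identifying this run: a timestamp plus free-form tags
result_key = ResultKey(spark, ResultKey.current_milli_time(), {"dataset": "orders"})

# Compute a few metrics on the current batch and persist them to the repository
(AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("price"))
    .addAnalyzer(Mean("price"))
    .useRepository(repository)
    .saveOrAppendResult(result_key)
    .run())

# Load the accumulated history of metrics for inspection or anomaly detection
(repository.load()
    .before(ResultKey.current_milli_time())
    .getSuccessMetricsAsDataFrame()
    .show(truncate=False))
```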
In comparison, Great Expectations may be more flexible and Deequ more powerful. The decision about which one to deploy, however, comes down to the architecture design and the needs of the team. Regardless of the technology chosen, data quality is a must for achieving excellence in any data project and for leveraging the strategic advantages that data and information can bring in today's economy.
*Special thanks to Mariana F. Medeiros for the comments on data quality frameworks.