Data quality control framework for enterprise data lakes
May 05, 2022 • 8 min read
Data quality control is a mission-critical capability for virtually any modern enterprise. Why? Because data quality issues can disrupt business processes and services, invalidate any type of analytics performed in the company, and damage a company’s reputation. However, despite the importance of data quality control, it is still woefully undervalued, and significant room for data quality control process improvement still exists in most enterprises.
In this blog post, we highlight the value of data quality control and describe the data quality control framework we have implemented as part of our Analytics Platform to resolve the quality control challenges that most enterprises face. You will learn about the framework architecture, its business benefits, and you’ll see a few examples of how data quality checks can be implemented.
Data quality control methods
Every modern business, starting from supply chain management, manufacturing, and IoT, up to healthcare and financial services, operate with data. Proper data management for various business use cases is key to reducing operational costs and accelerating the process of making critical business decisions. Technology has dramatically evolved over the years to address these challenges, and a number of data quality control methods are currently widely adopted:
- Data Monitoring is a practice to ensure data validity or highlight data issues, usually with a combination of rules. For example, email shouldn’t be null, customer metrics like average customer age is between 34 and 38, and so on. Data monitoring is essential for overall data consistency and integrity.
- Data Inspection is another group of methods to work with data or row-level; for example, filtering duplicate, broken or invalid data for further analysis or re-processing. Clean data are expected to comply with all company standards and can be used by business-critical downstream applications. This approach is intensively used where data quality is a mission-critical part of business operations, such as financial reporting that can’t tolerate invalid or duplicated data.
- Anomaly Detection is a group of techniques to work with structured and unstructured data using a machine learning approach. ML models identify outliers in the data automatically and unsupervised. For instance, age 107 would be identified as an outlier when working with a mean value of around 47. Anomaly detection helps to ensure data quality where it’s hard to identify and implement business rules: machine data processing, IoT or log analysis. While this technique can produce more false positive alerts than a data inspection approach with string rules, it takes far less effort to implement.
In a recent blog post, Data quality monitoring made easy, we discussed data monitoring and data inspection capabilities. In this blog, we continue the thread, but focus instead on the infrastructure that enables those capabilities:
Data quality control framework
Consolidating our experience building data quality solutions for clients, Grid Dynamics has developed a scalable and easy-to-extend data quality framework. The framework is based on open-source Soda SQL, which helps to:
- Integrate with a wide range of data sources;
- Define data quality checks for broken data;
- Scan data;
- Visualize issues.
The Soda SQL framework supports a wide range of data sources:
Another major integration that Soda SQL supports is Alation. For this integration, Soda SQL can push data quality check results, integrate these data into data lineage and data catalog information, and help to deliver an overview.
The implementation of the framework uses Soda SQL to collect data quality metrics, Kibana is used for data quality results visualization, and the framework is integrated with Apache Airflow, which makes it easy to use and extend.
Data quality control architecture
The diagram below illustrates how Soda SQL can be integrated into data pipelines, which include transformations and aggregations, using Apache Spark, and how data are saved into the popular DW Redshift, with further representation test results using ELK stack, as well as lineage of data flow using Apache Atlas. We use the reference architecture for AWS for the sake of specificity, but the framework is generally compatible with GCP and Azure as well.
The architecture includes the following major parts:
- Support of various data sources: file, message queues, databases, APIs.
- Data pipelines are orchestrated, data quality, and metadata management is part of the solution.
- Data monitoring and data inspection are implemented on top of Soda SQL.
- Results of data quality checks are stored in ElasticSearch and visualized in Kibana.
- Airflow pushes pipeline metadata information to Apache Atlas, where Atlas manages metadata and pipeline lineage information.
Data quality control process flow
The data quality framework is integrated with Apache Airflow for orchestration and can be used in Apache Spark applications without any additional configuration. All data quality checks are stored in ElasticSearch, and Kibana gives full flexibility to configure checks or dashboards. Another feature that comes with the framework is common metadata management through Apache Atlas, whereby Apache Airflow managed data pipelines send metainformation about pipeline data sources and targets. This information is used to visualize lineage information and build dependencies between source and target datasets.
When it comes to pipeline orchestration, the expectation is to provide enough flexibility to schedule, manage, extend and support ETL with the ability to integrate data quality as a part of the data pipelines. Taking this into account, we decided to go with Apache Airflow as workflow orchestrator, and integrated Soda SQL into Airflow to provide the extra metadata needed.
High level data quality control framework overview
Initially, Soda SQL was designed to deliver metrics over data collection, run data tests and perform data monitoring. The framework allows you to connect to a data source, run tests defined by a developer, and report issues. The framework is configured through a YAML configuration, with simple tests that may look as follows:
table_name: demo_data
metrics:
- row_count
- missing_count
- missing_percentage
- ....
tests:
- row_count > 0
columns:
id:
valid_format: uuid
tests:
- invalid_percentage == 0
Where row_count, missing_count and missing_percentage are built-in metrics, a high-level framework diagram looks as follows:
Data quality control initial setup procedure
Soda SQL installation requires a corresponding python package, which is detailed in the official Soda SQL installation documentation. After installing the python package, the next step is to configure the data source to be tested. Configuration requires the creation of a warehouse_type command to set up the data warehouse and env_vars YAML files, followed by adding login credentials to the env_vars YAML. Once configuration for the data source is created, the next step is to analyze the data source using the analyze command. Soda SQL will analyze the contents of the data source and prepare a YAML file for each dataset for a given source.
The analyze command, illustrated below, shows how Soda SQL runs tests and reports issues, if any, for a given data warehouse:
Soda SQL checks are easily integrated with existing orchestration tools like Apache Airflow or Dagster, and can be run by external scheduling service tests or data monitoring checks. An example of a python function for Apache Airflow operator looks like this:
def run_soda_scan(warehouse_yml_file, scan_yml_file):
from sodasql.scan.scan_builder import ScanBuilder
scan_builder = ScanBuilder()
# Optionally you can directly build the warehouse dict from Airflow secrets/variables
# and set scan_builder.warehouse_dict with values
scan_builder.warehouse_yml_file = warehouse_yml_file
scan_builder.scan_yml_file = scan_yml_file
scan = scan_builder.build()
scan_result = scan.execute()
if scan_result.has_test_failures():
failures = scan_result.get_test_failures_count()
raise AirflowFailException(f"Soda Scan found {failures} errors in your data!")
What is a Soda SQL test?
A test is a check that Soda SQL performs when it scans a dataset in a data source. Technically, it is a Python expression that checks metrics to see if they match the parameters defined for a measurement. A single Soda SQL scan runs against a single dataset in the data source, but each scan can run multiple tests against multiple columns.
Soda SQL tests are defined in a scan YAML file which is associated with a specific dataset in the data source. Writing tests is possible using a built-in set of metrics that Soda SQL applies to an entire dataset, built-in column metrics that Soda SQL applies to individual columns, or using custom metrics (also known as SQL metrics) that apply to an entire dataset or to individual columns.
Regardless of where it applies, each test is generally comprised of three parts:
- metric – property of the data in your data source;
- comparison operator;
- value.
However, sometimes tests can have a fourth element to check whether data is valid. Validity is defined by the column configuration key and expected format. For example, the user defined the valid_format as date_eu or dd/mm/yyyy format. The metric invalid_percentage refers to the valid_format configuration key to determine if the data in the column is valid.
A fully-structured YAML file with examples of usage of all mentioned features is shown below:
table_name: table_name
metrics:
- row_count
- missing_count
- missing_percentage
# Validates that a table has rows
tests:
- row_count > 0
# Tests that numbers in the column are entered in a valid format as whole numbers
columns:
incident_number:
valid_format: number_whole
tests:
- invalid_percentage == 0
# Tests that no values in the column are missing
school_year:
tests:
- missing_count == 0
# Tests for duplicates in a column
bus_no:
tests:
- duplicate_count == 0
# Compares row count between datasets
sql_metric:
sql: |
SELECT COUNT(*) as other_row_count
FROM other_table
tests:
- row_count == other_row_count
For more information about possible metrics, please refer to the official page: Example tests by metric – Soda SQL
How are Soda SQL test results represented?
Data checks and data monitoring visualization are a vital part of data monitoring. There are two ways checks can be visualized:
- Soda Cloud visualization
- Custom visualization on top of ElasticSearch and Kibana
Soda comes with basic visualization capabilities available in their paid subscription in the Soda Cloud interface. Cloud monitoring was recently released and provides basic visualization capabilities of data quality checks that may not be sufficient for all end users. Another highly flexible option is to visualize checks using ElasticSearch as data storage and Kibana. This approach provides flexibility in the way data can be stored, aggregated, queried for visualization and, finally, visualized in Kibana. It also simplifies work with raw data, allowing direct work with data quality reports. An example of such a visualization is depicted below:
Conclusion
Data quality control and data monitoring are essential for delivering advanced data products and reports for business users. However, the implementation of data quality checks can be relatively straightforward, provided a proper infrastructure is in place to run tests, record and visualize the results, and onboard new types of checks. The data quality control framework we have presented in this blog post addresses the complexity of building such a unified scalable infrastructure, enabling engineers to focus more on business problems, rather than infrastructure development.
If you’re interested to learn more about our data quality control framework, get in touch with us to start a conversation.