5 steps to implementing a successful DataOps practice
Oct 15, 2020
After the initial excitement that data lakes would help companies maximize the utility of their data, many companies became disillusioned by rapidly diminishing returns from their big data efforts. While it was easy to put large volumes of data in the lakes, turning that data into insights and realizing value from it turned out to be a much more difficult task.
Many of these problems were related to poor quality of data, lack of alignment between business and technology, lack of collaboration, and absence of proper tooling. When software development faced similar challenges, Agile and DevOps techniques helped solve the problem.
In the world of data, the industry invented the term DataOps, which takes Agile and DevOps principles and applies them to data engineering, science, and analytics. We are not going to focus on DataOps techniques in this post, as there are many good articles on the topic. Instead, we will focus on the technology enablers that facilitate DataOps implementation.
In a previous article we wrote about how data lakes are insufficient to satisfy enterprise data analytics needs and that companies need to build fully featured analytics platforms. In this article we will explore the capabilities of an analytics platform that facilitates a DataOps methodology and helps companies derive value from the data more quickly and efficiently.
Data orchestration
Modern analytics platforms contain thousands of data transformation jobs and move hundreds of terabytes of data in batch and real-time streaming. Managing complex data pipelines manually is extremely time-consuming and error-prone, and it leads to stale data and lost productivity.
The goal of automated data orchestration is to take the effort of scheduling execution of data engineering jobs off the shoulders of data engineering and support teams and automate it with tools. A good example of an open source data orchestration tool is Apache Airflow, which has a number of benefits:
- Ability to orchestrate complex interdependent data pipelines.
- Scalability to manage hundreds of flows.
- Robust security and controlled access with LDAP/AD integration, Kerberos authentication support, role-based access, and multi-tenancy.
- Support for a variety of pipeline triggers, including time-based scheduling, data dependency sensors (such as creation or update of files on the file system, or changes in database tables), and inter-pipeline dependencies such as completion or failure of upstream jobs.
- Flexible retry policies with configurable recovery options and SLA enforcement.
- Convenient graphical interface for visualization of data pipeline dependency graphs.
- Extensible Python-based DSL as a primary configuration language.
- Rich reusable components library.
- Embedded secrets management subsystem.
- Orchestration and job configuration stored as code in version control systems.
- Support for local testing of pipelines on developers’ workstations.
- Flexible configuration of jobs for high availability and disaster recovery.
- Cloud-friendly with support to provision task executors on-demand using Kubernetes.
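To make two of the ideas above concrete (interdependent jobs and retry policies), here is a minimal sketch in plain Python. It is not Airflow's API: `run_pipeline`, the job names, and the `max_retries` parameter are all hypothetical, and a real orchestrator adds scheduling, sensors, persistence, and a UI on top of this core loop.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(jobs, dependencies, max_retries=2):
    """Run jobs in dependency order, retrying each failed job.

    jobs: dict mapping a job name to a zero-argument callable.
    dependencies: dict mapping a job name to the set of jobs it depends on.
    Returns a dict mapping each job name to "success", "failed", or "skipped".
    """
    status = {}
    # static_order() yields every job after all of its upstream jobs.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"  # never run on top of a failed upstream
            continue
        for _attempt in range(max_retries + 1):
            try:
                jobs[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status


# Illustrative three-step pipeline; "clean" fails once, then succeeds on retry.
attempts = {"clean": 0}


def ingest():
    pass


def flaky_clean():
    attempts["clean"] += 1
    if attempts["clean"] < 2:
        raise RuntimeError("transient failure")


def report():
    pass


status = run_pipeline(
    {"ingest": ingest, "clean": flaky_clean, "report": report},
    {"ingest": set(), "clean": {"ingest"}, "report": {"clean"}},
)
```

Because the pipeline is ordinary code, it can be versioned, reviewed, and tested locally, which is the same property the configuration-as-code approach relies on.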
Data monitoring
Some DataOps articles refer to statistical process controls, which we call data monitoring. Data monitoring is the first step and precursor to data quality. The key idea behind data monitoring is observing data profiles over time and catching potential anomalies.
In its simplest form, it can be implemented by collecting various metrics for the datasets and individual data columns, such as:
- Number of processed records over time.
- Ranges of values for numeric or date columns.
- Size of data in text or binary data columns.
- Number of null or empty values.
Then for each metric, the system would calculate a number of usual statistics, such as:
- Mean value.
- Median value.
- Percentiles.
- Standard deviation.
With this information, we can observe whether the new data item or dataset is substantially different from what the system has observed in the past. The data analytics and data science teams can also use collected data profiles to learn more about data to quickly validate some hypotheses.
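As a rough illustration of this simplest form, the sketch below profiles the history of one metric with Python's standard `statistics` module and flags a new observation that falls too far from the historical mean. The function names, the three-sigma threshold, and the sample row counts are assumptions made for the example, not part of any specific product.

```python
import statistics


def profile(values):
    """Summarize one metric's history with the statistics listed above."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        "p95": statistics.quantiles(values, n=100)[94],
        "stdev": statistics.stdev(values),
    }


def is_anomalous(history, new_value, max_sigmas=3.0):
    """Flag new_value if it is more than max_sigmas standard deviations from the mean."""
    stats = profile(history)
    if stats["stdev"] == 0:
        return new_value != stats["mean"]
    return abs(new_value - stats["mean"]) / stats["stdev"] > max_sigmas


# Hypothetical daily counts of processed records.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

typical_day = is_anomalous(daily_row_counts, 1015)
broken_feed = is_anomalous(daily_row_counts, 400)
```

Here a day with 1015 records is within the normal range, while a day with 400 records is flagged. A fixed sigma threshold ignores seasonality, which is one reason the simple approach is often augmented as described next.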
The simple methods of data monitoring can be augmented by AI-driven anomaly detection. Modern anomaly detection algorithms can learn periodic data patterns, use correlations between various metrics, and minimize the number of false positive alerts. To learn more about this technique, read our recent article on various approaches to real-time anomaly detection. To simplify adding anomaly detection to an existing analytics platform, we have implemented a Starter Kit. You can learn more about it in this article, or reach out to us to try it out.
Data quality
While data monitoring helps data engineers, analysts, and scientists learn additional details about data and get alerted in case of anomalies, data quality capabilities take the idea of improving data trustworthiness, or veracity, to another level. The primary goal of data quality is to automatically detect data corruption in the pipeline and prevent it from spreading.
Data quality uses three main techniques to accomplish that goal:
- Business rules – Business rules can be thought of as tests that continuously run in the production data pipeline to check if data complies with pre-defined requirements. It is a fully supervised way to ensure data integrity and quality. It requires the most effort but is also the most precise.
- Anomaly detection – The anomaly detection implemented for data monitoring can be reused for data quality enforcement and requires setting certain thresholds to balance between precision and recall.
- Comparison with data sources – Comparison of data in the lake with data sources typically works for ingested data and is best used for occasional validation of data freshness for streaming ingress in the data lake. This method has the most overhead in production and requires either direct access to systems-of-record databases or APIs.
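The business rules technique above can be sketched as a set of named predicates that every record must satisfy, with failing records quarantined rather than passed downstream. This is a minimal illustration only: `apply_rules`, the rule names, and the "orders" records are hypothetical, and production systems typically run such checks at dataset scale inside the processing engine.

```python
def apply_rules(records, rules):
    """Split records into those passing every rule and those failing at least one.

    rules: dict mapping a rule name to a predicate that takes a record.
    Returns (valid, quarantined); quarantined pairs each bad record with
    the names of the rules it violated.
    """
    valid, quarantined = [], []
    for record in records:
        violations = [name for name, check in rules.items() if not check(record)]
        if violations:
            quarantined.append((record, violations))
        else:
            valid.append(record)
    return valid, quarantined


# Hypothetical rules for an illustrative "orders" dataset.
order_rules = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}

orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": None, "amount": 10.0},
    {"order_id": 3, "amount": -5.0},
]

valid, quarantined = apply_rules(orders, order_rules)
```

Keeping the violated rule names next to each quarantined record makes it easier to trace corruption back to its cause instead of silently dropping data.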
If a team already uses automated data orchestration tools that support configuration-as-code such as Apache Airflow, data quality jobs can be automatically embedded in the required steps between, or in parallel to, data processing jobs. This further saves time and effort to keep the data pipeline monitored. To learn more about data quality, please refer to our recent article. To speed up implementation of data quality in the existing data analytics platforms, we have implemented a Starter Kit based on the open source technology stack. The Starter Kit is built with cloud-native architecture and works with most types of data sources.
Data governance
Data governance is a broad term that also encompasses people and process techniques. Here, however, we will focus on its technology and tooling aspects. The two aspects of data governance tooling that have become absolute must-haves for any modern analytics platform are the data catalog and data lineage.
Data catalog and lineage enable data scientists, analysts, and engineers to quickly find required datasets and learn how they were created. Tools like Apache Atlas, Collibra, Alation, AWS Glue Data Catalog, or the data catalogs from Google Cloud and Azure can be good starting points for implementing this capability.
Adding data catalog, data glossary, and data lineage capabilities increases productivity of the analytics team and improves speed to insights.
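At its core, lineage is a graph from each dataset to the datasets it was derived from. The sketch below shows the kind of question such a graph answers, namely "what does this dataset ultimately depend on?". The dataset names and the `upstream_datasets` helper are hypothetical and do not reflect the data model of any of the tools named above.

```python
def upstream_datasets(lineage, dataset):
    """Return every dataset that `dataset` is derived from, directly or transitively."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return seen


# Hypothetical lineage: each dataset maps to its direct inputs.
lineage = {
    "sales_report": ["clean_orders", "customers"],
    "clean_orders": ["raw_orders"],
}

sources = upstream_datasets(lineage, "sales_report")
```

Walking the same graph in the opposite direction answers the impact question: which downstream datasets are affected when a source changes or is found to be corrupted.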
Continuous delivery
The concept of DevOps is one of the cornerstones and inspirations behind the DataOps methodology. While DevOps relies on culture, skills, and collaboration, modern tooling and a lightweight but secure continuous integration and continuous delivery process help reduce time-to-market when implementing new data pipelines or data analytics use cases.
As is the case with regular application development, the continuous delivery process for data needs to follow microservices best practices. Such best practices allow the organization to scale, decrease time to implement and deploy new data or ML pipelines, and improve overall quality and stability of the system.
While having many similarities with application development, continuous delivery processes for data have their own specifics:
- Due to large volumes of data, attention should be placed on unit testing and functional testing with generated data.
- Due to the large scale of the production environment, it is often impractical to create on-demand environments for every execution of the CI/CD pipeline.
- Data orchestration tooling needs to be used for safe releases and A/B testing in production.
- A larger focus needs to be placed on data quality and monitoring in production and testing for data outputs.
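The first point above, unit testing with generated data, can be sketched as follows. The transformation `deduplicate_latest`, the event fields, and the generator are hypothetical stand-ins for a real pipeline step. The idea is that a small, seeded synthetic dataset exercises the logic without touching production-scale data.

```python
import random
import unittest


def deduplicate_latest(events):
    """Hypothetical transformation: keep the most recent event per user_id."""
    latest = {}
    for event in events:
        current = latest.get(event["user_id"])
        if current is None or event["ts"] > current["ts"]:
            latest[event["user_id"]] = event
    return list(latest.values())


def generate_events(n_users, events_per_user, seed=0):
    """Generate synthetic events with a fixed seed so test runs are reproducible."""
    rng = random.Random(seed)
    events = [
        {"user_id": user, "ts": rng.randint(0, 10**6)}
        for user in range(n_users)
        for _ in range(events_per_user)
    ]
    rng.shuffle(events)
    return events


class DeduplicateLatestTest(unittest.TestCase):
    def test_one_row_per_user_with_max_timestamp(self):
        events = generate_events(n_users=50, events_per_user=5)
        result = deduplicate_latest(events)
        self.assertEqual(len(result), 50)
        for row in result:
            expected = max(
                e["ts"] for e in events if e["user_id"] == row["user_id"]
            )
            self.assertEqual(row["ts"], expected)
```

Saved in a test module, this runs with `python -m unittest` in a CI job on every commit, with no on-demand data environment required.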
Traditional tooling, such as GitHub or other Git-based version control systems, unit testing and static code validation tools, Jenkins for CI/CD, and Harness.io for continuous deployment, is just as applicable in the data engineering world. Using data pipeline orchestration tools that support configuration-as-code, such as Apache Airflow, streamlines the continuous delivery process even further.
Conclusion
DataOps has become an important methodology for a modern data analytics organization. As is the case with Agile and DevOps in traditional software development, DataOps helps realize value sooner and achieve business goals more reliably. To be successful with DataOps, companies need to learn new skills, adjust their culture, collaboration, and processes, and extend their data lakes with a set of new technical capabilities and tools.
At Grid Dynamics, we’ve helped Fortune-1000 companies adopt a DataOps culture, onboard required processes, acquire the necessary skills, and implement the needed technical capabilities. To help our clients get to insights faster, we’ve created starter kits for all necessary capabilities including data orchestration, data monitoring, data quality, anomaly detection, and continuous delivery. To learn more about the case studies on implementing these capabilities at the enterprise scale, read our whitepaper. To try our starter kits, see the demos, and discuss how to onboard them, please feel free to reach out to us.