Modern serverless data ingestion solution on AWS
Aug 29, 2022
Introduction
A meal kit company that specializes in delivering healthy, organic, ready-made meal kits was about to start its journey to the cloud. The company had recently been acquired by a leader in the food and beverage industry that intended to merge the business under one umbrella, with the same brand, customer management, operations and technologies, all managed by the parent company. The client engaged Grid Dynamics to integrate data into their ecosystem by developing an effective data ingestion solution that provides data model reconciliation and data backfill.
The challenge
During the pandemic, the client grew substantially, expanded into new markets, and made several acquisitions, which called for a new approach to managing the business, running operations and maintaining technical solutions. With this rapid growth, continuous operational improvements were required to stay competitive. Multiple IT operations, platform solutions, technical departments and integrations across the acquisitions made the technical landscape hard to manage.
Grid Dynamics focused specifically on integrating the acquired business into the parent company's technology ecosystem. The biggest challenge of any acquisition is merging businesses that have more differences than commonalities. Unification of business processes for this client involved:
- Unification of customer audience;
- Unification of marketing strategies: building a marketing strategy for each brand complementary to other brands; and
- Technical architecture and solutions unification.
Further considerations for integrating acquisitions into the parent architecture included:
- An integration strategy for different technical stacks;
- Recommendations on how the parent architecture should be adapted in order to expose the integration API; and
- A data management strategy.
The rest of the derived use cases, like unified customer 360, marketing campaigns, customer acquisition and retention policies, were out of scope for the engagement.
For this case study, we’ll focus on the unification of the technical architecture, including the approaches we used, and the solutions we built on top of AWS. We also tackle the other major goal of the integration, which was to create a defined technical roadmap for future acquisition integrations.
With these defined requirements, Grid Dynamics developed a lightweight solution hosted on AWS. Below we explain why certain AWS services were beneficial for this particular integration use case.
Solution expectations
At the beginning of our engagement, the client was running an on-premises platform, with some infrastructure components already migrated to AWS. Coming from an on-premises world, where supporting hardware, infrastructure, services and applications is a prerequisite, the client wanted to build a serverless platform that required close to zero infrastructure support.
Serverless considerations
Integration between the two businesses required transforming the data and exposing it to the parent company. As we weighed serverless options, AWS Glue, a serverless data integration service, stood out for features that make it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Furthermore, AWS Glue provides the capabilities needed for data integration out of the box, enabling greater speed to market.
There are three AWS Glue components:
- AWS Glue Data Catalog – A central repository for your metadata, built to hold information in metadata tables, with each table pointing to a single data store. In other words, it acts as an index to your data schema, location, and runtime metrics, which are then used to identify the sources and targets of your ETL (Extract, Transform, Load) jobs.
- Job Scheduling System – A flexible scheduler that helps you automate and chain your ETL pipelines. It supports both event-based triggers and job execution schedules.
- ETL Engine – The component that handles ETL code generation. It generates code in Python or Scala automatically and lets you customize it.
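To make the Data Catalog idea more concrete, here is a minimal sketch of the kind of table metadata it holds. The dictionary follows the `TableInput` shape accepted by boto3's Glue `create_table` call, as we understand it. The table name, columns and S3 location are hypothetical and are not taken from the project.

```python
from typing import Dict, List, Tuple


def build_table_input(name: str, location: str,
                      columns: List[Tuple[str, str]]) -> Dict:
    """Describe a Parquet dataset in S3 as one catalog table (one table, one data store)."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            # Where the data lives.
            "Location": location,
            # The schema: column names and Hive-style types.
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns],
            # How readers should interpret the files (Parquet here).
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical example values, for illustration only.
orders_table = build_table_input(
    name="orders",
    location="s3://example-consumption-zone/orders/",
    columns=[("order_id", "string"), ("customer_id", "string"), ("total", "double")],
)
```

With AWS credentials configured, a dictionary like this could be passed to `boto3.client("glue").create_table(DatabaseName="analytics", TableInput=orders_table)`, where `analytics` is again a made-up database name. In practice, a Glue crawler often populates these entries automatically.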
Architectural solution
Grid Dynamics opted for a solution based on AWS serverless services to help the client reach their data integration goal quickly. Using scalable services like AWS Glue and Amazon Redshift enabled us to optimize operating costs and development expenses.
The Analytics Platform that we built on AWS Glue ingests data from MongoDB into the target data stores through an intermediate data lake in Amazon S3. Ingestion and transformation are handled by Glue ETL jobs, which run on Apache Spark. Following common practice, the intermediate data lake was split into several logical layers:
- S3 Landing Zone – a place that contains the source data as is, with no transformations
- S3 Consumption Zone – a place that contains landing data transformed into the corresponding data models. It holds ready-to-use data for analytics.
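The two zones can be pictured as parallel prefixes in S3. The sketch below shows one possible layout; the bucket names, the dataset name and the date-based partitioning scheme are our assumptions for illustration, not the layout used in the project.

```python
from datetime import date

# Hypothetical bucket names; the actual buckets are not named in this case study.
ZONE_BUCKETS = {
    "landing": "example-landing-zone",          # source data as is
    "consumption": "example-consumption-zone",  # transformed, analytics-ready data
}


def zone_path(zone: str, dataset: str, load_date: date) -> str:
    """Return the S3 prefix for one dataset and one daily batch in the given zone."""
    if zone not in ZONE_BUCKETS:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"s3://{ZONE_BUCKETS[zone]}/{dataset}/"
        f"year={load_date.year}/month={load_date.month:02d}/day={load_date.day:02d}/"
    )


raw = zone_path("landing", "orders", date(2022, 8, 29))
curated = zone_path("consumption", "orders", date(2022, 8, 29))
```

Keeping the same dataset and partition structure in both zones makes it easy to trace a transformed batch back to the raw data it came from.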
The data ingestion process can be summarized as follows:
- All data in the intermediate data lake is cataloged in the AWS Glue Data Catalog. If needed, the data can be queried with Amazon Athena.
- The data ingestion pipeline writes the final data to Amazon S3, Amazon RDS for PostgreSQL and Amazon Redshift.
- The data ingestion pipeline is orchestrated by AWS Glue workflows, which let us create and visualize complex ETL activities involving multiple crawlers, jobs and triggers.
- Finally, the credentials required for communication between services are stored in AWS Secrets Manager.
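The orchestration step can be sketched as a chain of Glue triggers inside one workflow. The code below only builds the request dictionaries; their keys follow the keyword arguments of boto3's Glue `create_trigger` call as we understand them. The workflow name, job names and schedule are hypothetical, and the real pipeline may be wired differently.

```python
from typing import Dict, List

WORKFLOW = "example-ingestion-workflow"  # hypothetical workflow name


def scheduled_start(job_name: str, cron: str) -> Dict:
    """Trigger that starts the first job of the workflow on a schedule."""
    return {
        "Name": f"{WORKFLOW}-start",
        "WorkflowName": WORKFLOW,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def run_after(upstream_job: str, downstream_job: str) -> Dict:
    """Trigger that starts downstream_job once upstream_job has succeeded."""
    return {
        "Name": f"{WORKFLOW}-{upstream_job}-to-{downstream_job}",
        "WorkflowName": WORKFLOW,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def build_triggers(jobs: List[str], cron: str) -> List[Dict]:
    """Chain jobs linearly: schedule the first, then run each one after the previous."""
    triggers = [scheduled_start(jobs[0], cron)]
    for upstream, downstream in zip(jobs, jobs[1:]):
        triggers.append(run_after(upstream, downstream))
    return triggers


# Hypothetical job names: MongoDB -> landing, landing -> consumption, consumption -> Redshift.
triggers = build_triggers(
    ["mongo_to_landing", "landing_to_consumption", "consumption_to_redshift"],
    cron="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)
```

After creating the workflow with `create_workflow(Name=WORKFLOW)`, each dictionary could be passed as `glue.create_trigger(**trigger)`. Because every downstream job is gated on the upstream job succeeding, a failed ingestion step stops the chain instead of propagating incomplete data.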
The results
The project timeline was aggressive: the integration needed to go live within three months, including production infrastructure, pipelines, data quality, monitoring and support runbooks. Grid Dynamics completed the project within the timeline, providing the client with:
- Fully automated infrastructure provisioning;
- CI/CD and version control;
- Data ingestion and transformation pipelines;
- Data quality checks and schema enforcement;
- Data catalog and self-service access to the data.
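As an illustration of what a schema enforcement check might look like, here is a minimal sketch in plain Python. The schema, field names and sample records are hypothetical; the checks in the delivered pipelines ran inside Glue jobs and may have looked quite different.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical target schema: field name -> (expected type, required?)
ORDER_SCHEMA: Dict[str, Tuple[type, bool]] = {
    "order_id": (str, True),
    "customer_id": (str, True),
    "total": (float, True),
    "coupon_code": (str, False),
}


def validate(record: Dict[str, Any], schema: Dict[str, Tuple[type, bool]]) -> List[str]:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for field, (expected, required) in schema.items():
        value = record.get(field)
        if value is None:
            # Absent or null values are only a problem for required fields.
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Fields the target model does not know about are reported as well.
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


good = {"order_id": "o-1", "customer_id": "c-9", "total": 42.5}
bad = {"order_id": "o-2", "total": "42.5", "legacy_flag": True}
```

A pipeline could route records with a non-empty violation list to a quarantine location rather than loading them into the consumption zone.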
The solution was built on top of serverless components of the AWS cloud, and since all data pipelines are batch in nature, there is no need to run infrastructure constantly: all services are provisioned on demand and released after pipeline completion. This approach resulted in drastic infrastructure cost reductions, removed the need for dedicated infrastructure support engineers, and lets the platform scale as the client grows.