Accelerating enterprise data migrations: A GenAI recipe
Sep 02, 2024 • 9 min read
An average company adopting cloud today could achieve 180% ROI in business benefits, although few are getting close to these returns.
Migrating or replatforming to a cloud data platform is a complex process. Even a basic “lift-and-shift” migration requires careful planning—designing the target architecture, determining the migration order, setting up a roadmap, and ensuring the right skill sets and validation scenarios are in place. The complexity further increases when the source solution needs to be completely re-architected, requiring extra efforts to create a proper cloud-native or cloud-agnostic design, define implementation guidelines, and establish an execution roadmap.
A few years ago, such migrations were daunting, often leading to delays and extra costs for businesses, But with the latest generative AI solutions, migrations can be simplified and accelerated. Traditional, routine, and boilerplate code creation can be handed over to generative AI, freeing up your engineering team to focus on the critical parts of the migration.
Research shows that generative AI can boost cloud program ROI by 75 to 110% through three key benefits: unlocking new business use cases, cutting application remediation and migration time and costs by up to 40%, and enhancing the productivity of development and infrastructure teams on cloud.
In this article, we’ll cover the key steps in migrating from on-premises to the cloud, focusing on how generative AI can help speed up the process:
- Schema migration: Simplifying the transition of data structures to the cloud
- Legacy SQL script modernization: Upgrading old SQL scripts for better cloud performance
- Legacy orchestration script migration: Adapting and updating existing orchestration scripts
- Modern workflow code generation: Automatically generating code for new workflows based on descriptions
- Validation scenario creation: Developing test scenarios to ensure the migration is successful
Starting with a clear migration strategy, we’ll explore how generative AI can partially automate the process and reduce routine operations during the migration. The above-mentioned high-level migration plan is consistent across large-scale migrations, whether you’re moving from Teradata to a cloud data warehouse or upgrading from Hadoop-based data lakes to modern technologies. The process includes data migration, transitioning workloads, optimizing operations, validating results, and aligning business processes to ensure everything runs smoothly in the new environment.
Navigating migration challenges
Cloud migration isn’t just about technology—it’s a major shift in how your organization operates, plans, and manages its release cycles. For on-premises, proper capacity planning is crucial to prevent hardware shortages, while in the cloud, where elastic scalability is a core feature, cost control becomes essential for efficient operations.
There are two main approaches to cloud migration—lift-and-shift, and replatforming.
Lift-and-shift migration involves moving a major amount of workloads as-is, without significant changes to the codebase or related development processes. This is the most straightforward migration approach, but it does come with the need to redesign some applications and adapt certain processes, such as disaster recovery and monitoring. Although this approach might seem appealing due to its simplicity, it presents some challenges:
- Higher operational costs: Lift-and-shift migrations often result in higher operational costs compared to on-premises environments.
- Need for replatforming: To optimize costs and take full advantage of cloud features, you’ll eventually need to replatform, redesigning applications to fit the cloud environment.
- Architectural adaptation: Disaster recovery, scalability, and extensibility require architecture specifically designed for the cloud. Using an on-premises architecture in a lift-and-shift migration can lead to additional costs.
Let’s elaborate on the high operational costs for lift-and-shift migrations. On-premises applications are usually built in a stateful manner to utilize hardware as much as possible. In the cloud, however, the tendency is to build stateless applications that are easy to deploy, scale, and maintain. There is no extra cost incurred by keeping the state of the application or trying to replicate the state across the cluster. Cloud environments also recommend avoiding hardware starvation and emphasize freeing up hardware once it’s no longer required. Therefore, the key principles of a successful cloud application are:
- Statelessness: Build stateless applications wherever possible
- Elastic scaling: Scale up as needed and scale down after heavy processing
- Resource efficiency: Free up resources when they are not in use
Following these principles, businesses can ensure that their application is disaster-resilient, scalable, and extensible while maintaining consistent costs.
Migration strategy overview
Migration programs are always a sophisticated shift that requires replatforming of existing solutions, deployments, monitoring, and other aspects of production support. Although these migrations are labeled as agile, they often follow a more structured approach, resembling a waterfall model. The target architecture is set, and the migration roadmap breaks down this architecture into manageable parts, making the process of moving legacy code/data/schema to modern platforms more manual and methodical. Generally, it’s an iterative process with the following deliverables:
Schema migration requires a thoughtful understanding of the technical limitations of the target data warehouse, data mart requirements, consumption patterns, and self-service capabilities addressing business demands. All modern cloud data warehouses greatly support semi-normalized or denormalized schema to avoid unnecessary joins where possible. Migration from on-premises to cloud requires an understanding of the target schema to create a schema evolution strategy and apply this strategy to the migrated schemas. Generative AI can help automate the route operation of schema migration once target samples and guidelines are defined, allowing engineers to focus on actual validations.
Extract, transform, and load (ETL/ELT) migration is the most sophisticated part that requires handling of pipeline orchestration logic as well as the transformations themselves. Pipeline orchestration logic might be unified, or in many cases, organizations use multiple orchestration tools like Control M, Oozie, and Fivetran, which need to be migrated to a single orchestration framework. This is where automation can really make a difference.
Generative AI significantly helps reduce routine operations by automating translation from legacy codebase to modern approaches. In some cases, it’s even more productive to translate orchestration code to business flow diagrams and create orchestration logic based on it.
While ETL/ELT orchestration code is generally straightforward and can be easily generated and maintained by LLMs, business transformation logic requires more careful handling. Business transformations can be implemented in plain SQL, using SQL extensions, or Python/Java/Scala/other language-based logic. Usually, migration is not as simple as one-to-one translations and requires replatforming, or paradigm change. For instance, business transformations implemented using tools like Informatica often require a complete redesign during migration. SQL code migration to a modern data warehouse is much easier than migration off Cobol.
Consequently, our experts at Grid Dynamics have created a solution for performing lift-and-shift code migrations. Advanced modernizations are usually handled in a slightly different way through an intermediate step of creating business flow diagrams. More details can be found in the application modernization demo.
GenAI Data Migration Starter Kit
Migrating to new platforms is often seen as a routine task, but it’s one where automation can significantly reduce bugs and speed up the time to market. At Grid Dynamics, we’ve seen firsthand how leveraging generative AI can boost productivity, allowing developers to focus more on ensuring business logic accuracy.
Below are some key areas from our GenAI Data Migration Starter Kit where automation, powered by generative AI, proves to be invaluable in cloud migration.
Strategy to validate migrated data and business logic
Once a migration strategy is defined, the next step is to create a validation plan and establish a data quality/observability framework to validate both legacy and new implementations. In our previous migrations, validations were mostly manual involving tasks such as row counts, data accuracy, consistency checks, etc. Today, modern tooling can automate the creation and maintenance of validation checks. For example, Github Copilot or modern large language models (LLMs) can generate test suites based on verbal descriptions.
Usually, we start by generating a test suite for legacy code using generative AI. This test automation can be reused for the replatformed solution. Typically, this covers about 30-40% of the business logic and complements any existing tests.
Business logic validations typically require manual input like scenarios, data distribution to check, and rule-based checks such as allowed values for a particular field, and so on.
A recommended validation scenario includes:
- Defining the target schema and determining the allowed value types for table columns
- Leveraging generative AI to generate a smoke test suite that will validate:
- Row and volume counts
- Data accuracy for both legacy and replatformed datasets over different time periods (hour, day, week, month)
- Identify null or empty values
- Highlight seasonality in seasonal data
- Validate email formats and ensure PII data is masked where required
- Check for data duplication to ensure there are no duplicates
- Other specific checks like non-null transaction amounts
- Ensuring business scenarios following implementation— such as after actions X and Y a user gets coupon/loyalty status—are updated
- Cross-checking implementations; for example, ensuring the transaction amount for a given day exactly matches the replatformed solution
This scenario can be enriched with domain and data-specific validations, and rule-based tests that can run independently for legacy and new solutions. Having such a phased approach helps reduce the number of possible issues and quickly run smoke testing before release.
Schema migration
Over the last few years, Grid Dynamics has successfully migrated dozens of modern enterprises from on-premises data warehouses like Teradata, Netezza, and SQL Server to cloud platforms such as Snowflake, BigQuery, Redshift, and Synapse Analytics. Below is a practical guide to help you navigate the process of schema migration from these traditional systems to a modern cloud environment.
On-premises data warehouses operate with a normalized schema. However, when migrating to the cloud, it’s often beneficial to shift toward a denormalized schema. Using the schema below as a reference, we’ll demonstrate how the actual migration can be executed:
Click to expand the code sample (36 lines)
click to expand
T = 24 * 1 # time step is one hour, flash offering for 1 day
m = 4 # not more than 4 price updates
def logx(x, n): # iterative logarithm function
for i in range(0, n): x = math.log(x) if x > 0 else 0
return x
def intervals(m, T, scale): # generate a price schedule vector
mask = []
for i in range(1, m):
mask.extend( np.full(scale * math.ceil(logx(T, m - i)), i - 1) )
return np.append(mask, np.full(T - len(mask), m - 1))
tau = 0 # start time of the current interval
p = h_vec[0]['p_opt'] # initial price
t_mask = intervals(m, T, 2)
h_vec = demand_hypotheses(...)
hist_d = []
for t in range(0, T - 1): # simulation loop
realized_d = sample_actual_demand(p)
hist_d.append(realized_d)
if( t_mask[t] != t_mask[t + 1] ): # end of the interval
interval_mean_d = np.mean( hist_d[tau : t + 1] )
min_dist = float("inf")
for h in h_vec: # search for the best hypothesis
dist = abs(interval_mean_d - h['d'](p))
if(dist < min_dist):
min_error = error
h_opt = h
p = h_opt['p_opt'] # set price for the next interval
tau = t + 1 # switch to the next interval
click to collapse
Schema migration is usually a routine operation: Begin by defining the target schema structure, move all tables to it, including data definition statements (create/update/delete tables), and apply to all the migrated entities. To simplify the migration, the schema can be passed to a generative AI-based tool that can automatically generate the required—denormalized or semi-normalized—schema for the target data warehouse.
Modern data warehouses support denormalized schemas with embedded structures to avoid multiple joins and provide a wide range of out-of-the-box reporting capabilities. For example, in user transaction tables, businesses can access transaction data alongside user profiles, loyalty attributes, and other relevant tags, all in one place.
Our starter kit migrates the Teradata schema to the BigQuery structure. This kit creates a denormalized schema with embedded structures tailored for reporting, and it allows for adjustments to fine-tune the migration process as needed.
Workflow orchestration migration
In our experience, legacy pipeline orchestration can’t be migrated as-is and requires significant restructuring. Workflow orchestration mechanisms vary widely from one application to another—migrating Control-M is not the same as migrating from Oozie. Typically, migration is carried out in two steps:
- Creating comprehensive business flow diagrams: This includes mapping out upstream and downstream dependencies.
- Generating orchestration code: The code is generated based on the flow diagrams.
For simplicity, let’s consider a Talend workflow represented as an XML file. In the case of Talend workflow migration, we bypass the flow diagram generation and directly generate Apache Airflow orchestration code from the XML as follows:
- Tune the LLM model to generate Python code from the XML description
- Generate the required class structure
- Generate transformations from Talend XMLs
- Validate generated code and fine-tune the LLM model if some enhancements are required
Below is an example of the generated orchestration code:
The created Airflow DAG represents upstream and downstream dependencies of the pipeline but doesn’t contain the transformation logic which is migrated separately and will be a point of discussion in the next section.
SQL scripts migration
Transformation code migration is the most complicated part of replatforming and requires proper business context. In this section, we’ll focus on migrating SQL code, although the same principles can apply to other types of transformations.
The scope typically consists of plain SQL scripts, stored procedures, and trigger logic that should be migrated. We generally recommend avoiding one-to-one migration. Even though modern cloud data warehouses support stored procedures, they can become performance bottlenecks. Therefore, stored procedures are often migrated as separate runnables on top of Apache Spark or Lambda functions.
There are two primary approaches to migrating SQL code:
- Direct migration: Translating from one SQL dialect to another
- Business logic extraction: Extracting business logic and then generating new SQL scripts based on the extracted information
We prefer the second approach because:
- Business logic is explicitly validated
- The code generated from extracted business logic is more transparent and easier to debug
- Business logic can be documented
In our practice, code-to-code migration can introduce subtle side effects that require extra validation effort to validate generated code.
SQL-to-business logic translation and SQL legacy code transformation into business flow diagrams and descriptions can be achieved using any existing open-source LLM models, whether hosted privately or accessed via a public API. For SQL code generation, we commonly use Vanna.ai—a RAG model trained on SQL codebases that can generate SQL scripts based on the provided description.
The model can further be trained using existing codebase or documentation:
vn.train(documentation="Consider sales as transaction table")
This approach has the following benefits:
- Data security: Data remains within a secure perimeter
- High accuracy: The generated SQL code is highly accurate
- Flexibility: The LLM can be easily updated or replaced
This approach significantly reduces the time-to-market for migrating large legacy codebases to modern technological stacks, regardless of the target platform or technologies used.
Conclusion
Modern enterprises have a defined cycle of technical debt elimination: every seven to ten years, companies run modern replatforming programs which require the whole system redesign and reconsidering the implementation according to evolved industry standards. However, only 10% of companies have fully captured cloud’s potential value, while another 50% are starting to capture it, and the remaining 40% have seen no material value. Thankfully, generative AI now helps reduce costs and time spent on boilerplate code creation, increasing focus on business logic, and simplifying business rules extraction and re-implementation.
If your business needs replatforming so you can continue delivering the best products and services to your customers, you can save both time and money in building a robust cloud foundation using Grid Dynamic’s starter kit. At the same time, you’ll have access to our cloud migration experts who will help you prioritize high-value business cases and adopt a product-oriented operating model that yields maximum ROI. Contact us today to get started on your seamless cloud migration journey.