CI/CD challenges of 2020
Oct 13, 2020 • 9 min read
The landscape
There’s no way around it – preparing an annual CI/CD roadmap is boring. For the last half decade it has simply meant listing the same “automate X”, “extend pipeline Y”, or “implement dashboard Z” actions, always referring to similar tools and barely ever featuring any novel ideas. The cause, and also the consequence, is that software delivery is seen as tooling for the sake of tooling and has been relegated to a second-class citizen. It has become a cost center rather than a vehicle for innovation. Even worse, in some “well established” enterprise environments, so-called “CI/CD automation” effectively handcuffs change velocity and experimentation.
At the same time, application development is continuing to evolve day and night:
- Application systems are getting bigger and are assembled from an increasing number of smaller parts at run time. This can be seen as a generalization of SOA or microservices. An application system is a changing compound whose parts co-evolve and rely on coordinated changes, not just on released GA functionality.
- New deployment targets take their share. Not just VMs and containers (traditional compute primitives), but serverless in its many forms – cloud runtimes and the web, including browser applications distributed by CDNs. There is a steadily growing reliance on cloud services beyond computing.
This is being fueled by the public clouds. They also set the expectations – instant unlimited elasticity and cost efficiency.
Cloud, microservices, and serverless shift change management right – towards production. This seems leaner and therefore more favorable. However, the CI/CD approach moves feedback loops left – closer to the developers of the business logic. The left part of the pipeline is well automated, while the more interesting things happen on the right, in production. With feature flags, experiments, and canary deployments, production is not the final point of a delivery pipeline; it is a continuum. This part of the pipeline creates the business value but is not covered by processes or tools.
The move to clouds has now reached the plateau of productivity on its hype curve. Another generation of mainstream computing infrastructure, such as edge computing, may be just over the horizon. Data has become the carrier of business value, displacing software. The technology has now matured enough for efficient processes and proper organization to exploit it and create maximum value. That means it’s time for an advance in CI/CD, unless we are able to invent something better.
In the current landscape, the factors creating the most value are:
- Cost efficiency of the CI/CD infrastructure and operations
- New runtime infrastructure targets
- Production change management
- User risk control (application reliability and security)
- Recognizing that it’s no longer only about software
Challenge: Cost efficiency
“Efficiency” is broader than just “cost efficiency”, but cost is the most visible metric. So to succeed, that’s where we need to begin. Cost should be seen as total cost of ownership (TCO), aggregating the human and infrastructure costs of acquiring and maintaining the service. So far, most of the cost is associated with the left part of the delivery process: build-test automation.
Clouds set the expectations:
- Something new doesn’t take much to implement
- Changes are quick and easy
- Repetition and scaling are instant and almost free
- No use = no pay, and this applies to both human and infrastructure costs
These expectations immediately rule out the traditional operations approach to CI/CD automation. Setting up a Jenkins build server for the first time looked cool and made a difference. But adding more and more projects and jobs isn’t cool and doesn’t scale. Operations costs grow rapidly while the service stays at the level of decade-old expectations. The cloud has taught us that once a feature is implemented, scaling it should be almost free.
Build-test automation may burn a whole lot of compute, especially for test automation. That’s fine if it does the work but not fine if it stays idle overnight. And during the day no queueing is tolerated, because it is expensive to have people waiting for results. Therefore build-test automation infrastructure should be truly elastic from zero.
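As a rough illustration of what “elastic from zero” means in practice, the sizing rule for a pool of build workers can be sketched as below. This is a minimal sketch, not a real autoscaler: the function name `desired_workers`, the `jobs_per_worker` parameter, and the `max_workers` budget cap are all assumptions made for the example.

```python
import math


def desired_workers(queued_jobs: int, running_jobs: int,
                    jobs_per_worker: int = 1, max_workers: int = 50) -> int:
    """Return how many build workers should exist right now.

    Hypothetical sizing rule: the pool follows demand exactly, so it
    shrinks to zero when nothing is queued or running.
    """
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    demand = queued_jobs + running_jobs
    # No demand means no workers: an idle pool overnight costs nothing.
    needed = math.ceil(demand / jobs_per_worker)
    # The cap only protects the budget; below it, no job waits in a queue.
    return min(needed, max_workers)


if __name__ == "__main__":
    print(desired_workers(queued_jobs=0, running_jobs=0))
    print(desired_workers(queued_jobs=7, running_jobs=3, jobs_per_worker=4))
```

The point of the rule is that there is no minimum pool size: the only fixed parameter is the upper bound, which exists for cost control rather than for capacity planning.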
Naturally, it should also scale to higher test coverage and to many projects. Just as Travis CI and CircleCI are almost ubiquitous among GitHub projects with little effort from either company, build-test automation for as many projects as desired should cost an “automation team” nothing. The “automation team” shouldn’t even be aware of the application project’s existence. Application teams should do build-test automation for themselves easily and naturally. Easily means it’s really simple.
Less obviously, the same expectations also rule out comprehensive, opinionated “CI/CD automation frameworks”. One size doesn’t fit all. Current application systems vary greatly across teams and technologies. For example, even a simple 3-tier web application employs different technologies for its UI, API, and persistence tiers and uses different deployment options. An automation framework has to support them all. Such a framework is an expensive product with a high maintenance cost. As a result, it is not quick and easy to adapt, so it becomes a drag rather than a benefit.
Now, as we are moving into a new decade, CI/CD practices aren’t new or cool. They simply need to work and be cost efficient. How? At least by:
- Instantly autoscaling via elastic-from-zero build-test automation infrastructure
- Scaling to as many application projects as necessary, for free
- Empowering application teams to do build-test automation themselves, easily
- Keeping the “CI/CD automation team” as a tiny group of developers rather than a busy operations team
Challenge: New runtime targets
It was fine while the only deployment artifact was a software package file. And it was still bearable with cloud VM images handled separately. But a modern application deployment is a combination of cloud primitives and custom code for many runtime infrastructures besides VMs – Kubernetes, serverless compute options, CDNs, and managed runtimes (such as Firebase and Dataflow on GCP). All of this variety should live in the project source code repository and fit into a consistent code-build-test-deploy flow.
Enterprises used to see a vendor lock-in risk in cloud-specific services. But it is no longer a concern – the infrastructure migration cost is eclipsed by the shorter application lifetime. An application is rebuilt from the ground up about every five years. A modern application is likely to be assembled from cloud “Lego bricks” at deployment time. Cloud-agnostic application architecture is still a good pattern for producing better systems, but now it needs different reasons to justify itself. Even with the variety of deployment targets available, application artifacts should still be built and tracked in a uniform way: maintain build and deployment reproducibility, manage dependencies, and keep a test pyramid.
Support for new runtime infrastructure targets inevitably arrives only after applications start using them, even though application teams have to make extra effort to adopt them. That’s why software delivery and CI/CD support should pioneer them, so application teams can justify their use by realizing benefits without big prior investments.
So for a “CI/CD automation team” (or however the SDLC people are referred to), this challenge is straightforward – to stay competitive in the yearly race for the IT innovation budget. To achieve this they must:
- Include support for a variety of modern deployment targets in the cloud (when the applications start to use them)
- Pioneer the new cloud deployment targets (before the applications adopt them)
Challenge: Continuous delivery in production
CI/CD is a set of measures to make SDLC efficient. SDLC starts with a user story by the business and ends at feature decommissioning. But CI/CD support and CI/CD automation usually focus on a small part of this path – build-test automation. It neither covers CI – because CI starts before the code is submitted to the source code repository, nor addresses CD – because CD happens in production. The left part of the software delivery pipeline is owned by the development teams, as they are the ones who shape it. But the right part creating the actual business value is often handled like it was done in the previous millennium.
It’s easy to overlook how much happens in the production environment. It’s the most complex part of feature delivery. The most sophisticated processes are there – shadow and canary deployment, traffic shifts, experiments, feature control by traffic and topology, and safe decommissioning to name a few. So there is good justification in saying that it deserves more attention.
Just as software artifacts for the cloud come in many forms, a feature comes to production by several means – software updates, feature flags, or topology (traffic) control. Different processes (if any) control them. This easily causes interference and surprises in production. Meanwhile, as code deployment becomes cheaper, configuration and data change management converge with code. The same concepts – artifact, deployment, and feature flag – apply from VM infrastructure to web UI. This calls for holistic, consistent change management that’s lightweight but safe, transparent, and predictable, so that any change can be traced and reverted if necessary. Configuration, code, feature flags, and topology are all on par, even though these change vehicles act at different levels. The payload is a feature, and a feature is the thing to deliver, not a code drop or configuration change.
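One way to picture such change management is a single log in which code deployments, configuration, flags, and topology changes are all recorded the same way and grouped by the feature they deliver. The sketch below is an illustration under stated assumptions, not a description of any existing tool: `Change`, `ChangeLog`, the four `KINDS`, and the sample feature and target names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical set of change vehicles, all treated on par.
KINDS = {"code", "config", "flag", "topology"}


@dataclass(frozen=True)
class Change:
    feature: str          # the payload being delivered
    kind: str             # which vehicle carries it
    target: str           # e.g. a service name, flag key, or route
    old: Optional[str]    # value before the change (None if unset)
    new: Optional[str]    # value after the change


@dataclass
class ChangeLog:
    state: Dict[Tuple[str, str], Optional[str]] = field(default_factory=dict)
    history: List[Change] = field(default_factory=list)

    def apply(self, feature: str, kind: str, target: str,
              new: Optional[str]) -> Change:
        if kind not in KINDS:
            raise ValueError(f"unknown change kind: {kind}")
        key = (kind, target)
        change = Change(feature, kind, target, self.state.get(key), new)
        self.state[key] = new
        self.history.append(change)
        return change

    def trace(self, feature: str) -> List[Change]:
        """Every change that delivered this feature, in order."""
        return [c for c in self.history if c.feature == feature]

    def revert(self, feature: str) -> None:
        """Undo a feature's changes newest first; reverts are logged too."""
        for c in reversed(self.trace(feature)):
            self.apply(f"revert:{feature}", c.kind, c.target, c.old)


if __name__ == "__main__":
    log = ChangeLog()
    log.apply("dark-mode", "code", "web-ui", "v2")
    log.apply("dark-mode", "flag", "dark_mode_enabled", "on")
    log.revert("dark-mode")
    for entry in log.history:
        print(entry)
```

The sketch deliberately ignores interleaving: if another feature touched the same target in between, restoring the old value would clobber it, and a real system would have to detect that conflict.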
The variety of deployment and control options creates complexity. This can be mitigated by carving out simple operation processes and discarding many of the possible combinations. But don’t be too restrictive. Permit process augmentation. This is achieved by reliable, algorithmic automation of basic steps like application component updates or gradual traffic shifts. Leave little room for human failure. Not by paving a single rail track along dire cliffs, but by flattening the area as much as possible so most possible actions are safe.
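A gradual traffic shift is one of the basic steps that can be automated algorithmically. The following is a minimal sketch, assuming the caller supplies two hypothetical callbacks: `set_weight`, which routes a percentage of traffic to the new version, and `healthy`, which reports whether the new version looks fine at the current weight. The step percentages are also an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def shift_traffic(set_weight: Callable[[int], None],
                  healthy: Callable[[], bool],
                  steps: Iterable[int] = (1, 5, 25, 50, 100)
                  ) -> Tuple[bool, List[int]]:
    """Move traffic to a new version step by step.

    Returns (succeeded, list of weights applied). On the first failed
    health check the weight is set back to 0 and the shift stops.
    """
    applied: List[int] = []
    for percent in steps:
        set_weight(percent)
        applied.append(percent)
        if not healthy():
            # Automatic rollback: no human has to notice and react.
            set_weight(0)
            applied.append(0)
            return False, applied
    return True, applied


if __name__ == "__main__":
    weights: List[int] = []
    ok, trail = shift_traffic(weights.append, lambda: True)
    print(ok, trail)
```

Because the operator only chooses when to start, and the algorithm decides when to stop, most of the possible mistakes are removed from the process rather than guarded by approvals.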
Unless the feature is in production, CI/CD brings no value. It is naturally seen as a disposable resource to do second class low-tech work. The challenge is to make it first class again. This can occur by:
- Delivering features (changes) rather than artifacts
- Re-focusing from deep automation of small parts to broader support of the process
- Creating a diverse menu of simple, easy processes instead of a few comprehensive and sophisticated ones
- Organic continuation of the change pipeline rather than a push-button deployment
- Creating a multi-view mixed production system rather than a “production version”
- Holistic multi-channel change delivery by code deployments, flags, and topology
- Recorded, replayable operations (do “like that”) instead of explicit operations
Challenge: User risk control
Risk management related to software is a more or less established discipline. It covers planning and implementation processes, quality assurance, and even technical debt measurement. But how, as a business user, should I react to the fact that most of the production issues come not from the software code, not from hackers, but from something as vague as “configuration”? The reaction should be “I need to control it”.
When a car maker realizes that a commodity part has a defect, it knows exactly which cars should be recalled for replacement. That should be the norm for software systems too (remember the Knight Capital and Equifax disasters?). For every update, I should know exactly what I’m shipping, whether it’s parts, features, or change history. Is it a good change? That should be trivial to answer.
Straightforward control means having checkpoints and responsible gatekeepers in place. However, the gatekeepers can quickly become bottlenecks, and they’re not always very effective. Control by restriction and pushing approvals to higher ranks that “should know everything for some magical reason” is a common but poor approach, because the higher ranks don’t know everything – they just try to guess their personal risks.
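The recall analogy can be made concrete if every deployment records the exact parts it was built from. Here is a minimal sketch, assuming a hypothetical manifest format that maps each deployment name to its component versions; the function name and all sample names and versions are invented for illustration.

```python
from typing import Dict, List, Set


def affected_deployments(manifests: Dict[str, Dict[str, str]],
                         component: str,
                         bad_versions: Set[str]) -> List[str]:
    """List the deployments that ship a defective version of a component."""
    return sorted(
        name
        for name, parts in manifests.items()
        if parts.get(component) in bad_versions
    )


if __name__ == "__main__":
    # Hypothetical manifests: deployment -> {component: version}
    manifests = {
        "billing-api": {"json-parser": "2.3.1", "http-client": "1.0.0"},
        "web-ui": {"json-parser": "2.4.0"},
        "reports": {"http-client": "1.0.0"},
    }
    print(affected_deployments(manifests, "json-parser", {"2.3.0", "2.3.1"}))
```

The query itself is trivial; the hard part is the discipline of recording a complete manifest for every deployment, which is exactly what uniform artifact tracking provides.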
Everybody should do the work that is natural to them. They should utilize their expertise and use locally available data. That way a human or a machine can make reliable decisions – like a unit test passing because a function returns the expected value or an SRE putting changes on hold because the service failure budget has been exceeded.
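The SRE example is a decision that a machine can make from local data alone. A minimal sketch follows, assuming an availability SLO expressed as a fraction and request counts taken from the service’s own monitoring; the function names and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    With an SLO of 0.999, 0.1% of requests are allowed to fail.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # nothing served yet, nothing spent
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def may_deploy(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Local gate: hold changes once the budget is used up."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    print(may_deploy(0.999, 1_000_000, 400))
    print(may_deploy(0.999, 1_000_000, 1_500))
```

No approval from a higher rank is involved: the gate uses only the service’s own SLO and its own measurements, which is what makes the check natural and inexpensive.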
So the need for control, and the need to minimize the costs associated with it, present another set of challenges:
- Making the checks natural and inexpensive
- Tracking the history of everything so that a change is a small delta, not a whole new system
- Breaking free from professional or project silos, so that if an infrastructure part needs software methods, it is handed to software people rather than stretching operations
- Making decisions local, using local data and expertise
Challenge: It is not just software
Software ruled the world of business applications for decades. Now the aging king has a rival – data, and it’s more diverse than executable computer code. It comes in the form of ML models and training datasets, structured analytical data and explorable data lakes, transformation pipelines, and blockchains. But people haven’t changed along with the technology, so the business problems essentially stay the same. The business doesn’t care much whether the solution is a piece of code or an AI model.
Regardless of the implementation, the business needs a feature delivered in a timely and reliable way. The current SDLC and CI/CD processes are built exclusively around software, which has created a bias. This bias created a whole new trend and class of tooling – “Everything as Code” (Infrastructure as Code, Configuration as Code, Security as Code, etc.). This biased unification may be fine if it is software that prevails. But what if it doesn’t? This leads to a new set of challenges:
- CI/CD for user features rather than for code changes
- Consistent delivery of changes in different forms – either code, data, or configuration
Summary
Sometimes CI/CD implementations become deeply automated Rube Goldberg machines. Sometimes they achieve great local efficiency but constrain end-to-end delivery performance. And quite often they are costly to maintain. Those are common “childhood diseases” of modern SDLC management. Yes, methodologically, Continuous Delivery is a solved problem. However, the application development landscape has changed. It is no longer enough to “have CI/CD” to get the bottom line into the black. It is not even enough to use modern tools. It is critical to address the actual challenges. Currently they are not about deeper automation but about overlooked areas (production), priorities (efficiency and risk control), and contemporary applications (new runtimes and data).