Kubernetes use cases beyond container scheduling
Dec 19, 2024 • 16 min read
- Use case 1: Application infrastructure framework for self-hosted AI/ML platform
- Use case 2: Multi-tenant application platform for B2B SaaS
- Use case 3: Data center operating system for self-managed computing infrastructure
- Use case 4: Software test labs
- Use case 5: Edge computing and multi-cloud
- What Kubernetes brings that public clouds lack
- Conclusion
- Credits
In 2025, Kubernetes adoption as an application infrastructure platform appears to be declining, as application developers migrate to the cloud and serverless platforms. Kubernetes has become more like the Linux kernel—an essential building block that’s relied upon but rarely directly interacted with.
However, Kubernetes is re-emerging with new use cases, particularly in the AI/ML domain, including large language models (LLM) and GenAI applications. These workloads require complex infrastructure for data persistence and workload scheduling, often spanning multiple clouds and on-premises locations. Traditional computing infrastructure management is also converging towards Kubernetes, positioning it as the new OpenStack.
This is because Kubernetes is not just a cluster management tool for containerized workloads but an extensible API platform for application infrastructure. These applications are often not built from Linux containers but instead rely on Kubernetes as a bridge between their business logic and environment.
Let’s review some Kubernetes use cases that don’t fit its traditional perception. We will then explain what makes Kubernetes special and still relevant for application developers.
Use case 1: Application infrastructure framework for self-hosted AI/ML platform
Generative AI and other applications of LLMs are becoming essential capabilities for businesses. Public cloud providers offer a range of services in this domain. However, the diversity of LLM use cases means that cloud offerings only scratch the surface.
Organizations may opt for a self-hosted LLM platform for several reasons. Concerns about data privacy and intellectual property often influence this decision. Organizations with substantial on-premise data centers may prefer utilizing existing infrastructure rather than incurring significantly higher costs for cloud GPUs. Furthermore, regulatory requirements may necessitate on-premise solutions. Overall, data sovereignty, compliance, scaling, cost, and risk management are powerful drivers for implementing self-hosted LLM application solutions.
AI/ML applications, especially GenAI use cases, are complex. They leverage various building blocks, including accelerated computing, complex scheduling, generic and application-specific data persistence services—relational, document, graph, and vector databases—and API management services. The SDLCs of these applications are sophisticated, involving diverse code and data artifacts, multiple teams working concurrently on data, model training, business logic development, and complex deployments. It’s not an exaggeration to state that LLM and GenAI solutions are among the most sophisticated business applications.
Given the complexity and entanglement, building an AI/ML platform for LLM applications might appear daunting, even on a whiteboard. However, without a platform, the development of bespoke applications becomes time-consuming and costly, eliminating any competitive advantage these applications could bring to a business.
With Kubernetes as the infrastructure framework, implementing a usable application platform for LLM applications can be reduced to known patterns of cloud-native application development. Almost magically, a plethora of tools for application development, delivery, and operations becomes available.
A typical blueprint of a self-hosted application platform is discussed in detail in the “LLMOps blueprint for open source large language models” blog post.
In this case, Kubernetes does several things:
- Accelerated computing abstraction
- Service network abstraction
- Storage abstraction
- A standard API for external software components
- A standard uniform API for all aspects beyond the business logic and application composition
- Tools for SDLC and operations
- Hosting middleware for external application components
Interestingly, in this situation, applications don’t utilize Kubernetes as a middleware to run containers. Instead, for computational tasks they employ Ray, a modern framework for AI/ML applications, fulfilling workload orchestration and API serving. For application developers, Kubernetes is a platform with an API offering components to assemble the application, akin to an application-specific cloud. It doesn’t instantaneously make an application relocatable from, say, Google Cloud to an on-premises data center. But it significantly simplifies this task for developers. From this perspective, GenAI application development doesn’t vary much from building a regular e-commerce website.
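To make the "Ray on top of Kubernetes" arrangement a little more concrete, here is a minimal sketch that builds a manifest for a `RayCluster` custom resource. It assumes the KubeRay operator is installed in the cluster (the post does not name an operator); the image tag, object names, and sizes are purely illustrative.

```python
import json


def ray_cluster(name: str, workers: int, gpus_per_worker: int) -> dict:
    """Build a RayCluster manifest (KubeRay operator assumed; values are illustrative)."""
    image = "rayproject/ray:latest"  # hypothetical image tag
    head = {"spec": {"containers": [{"name": "ray-head", "image": image}]}}
    worker = {"spec": {"containers": [{
        "name": "ray-worker",
        "image": image,
        # GPUs are requested like any other Kubernetes resource.
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }]}}
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {"rayStartParams": {}, "template": head},
            "workerGroupSpecs": [{
                "groupName": "gpu-workers",  # hypothetical group name
                "replicas": workers,
                "rayStartParams": {},
                "template": worker,
            }],
        },
    }


manifest = ray_cluster("llm-serving", workers=2, gpus_per_worker=1)
print(json.dumps(manifest, indent=2))
```

The application team describes the compute cluster it needs as data, and the platform turns that description into running Ray nodes; the application code itself talks to Ray, not to Kubernetes.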
For platform developers, Kubernetes is already both a foundation and a construction kit. Often, a functional platform may be delivered without algorithmic coding, simply by assembling available components. This reduces platform development time from months to weeks. Moreover, the immediate availability of various tools for application assembly, deployment, operations, monitoring, and life cycle management is invaluable.
Baseline comparison
Finally, let’s compare it to the baseline: Is it possible to create a comparable self-hosted AI/ML platform without Kubernetes? Yes, quite so. None of the generic external components require Kubernetes for deployment. For instance, the Ray framework for computing tasks can be deployed on static computers, virtual or physical. But in this case, provisioning each application instance is a sequence of change management processes—VM provisioning and configuration, system software deployment (DBMS, Ray, etc.), and finally, deployment of the application code. Specific tools are required for each task, resulting in patchy automation. Each application instance becomes a “pet” with specific configurations and legacy. Scaling to the load and optimizing resource utilization remain unresolved challenges. While this is achievable through the above means, the Kubernetes option may accomplish it more quickly.
Use case 2: Multi-tenant application platform for B2B SaaS
SaaS systems are frequently viewed as shared applications that handle data from different clients, particularly in the context of end-user (B2C) services. However, B2B SaaS systems often operate differently. In this model, each client receives a dedicated instance of the application system, which exclusively processes that client’s data. The SaaS provider manages multiple instances of their systems for different clients, provisioning them on demand. This approach guarantees data sovereignty, simplifies billing, and grants the client complete control over their specific instance of the system. Notably, this pattern is commonly utilized in internal enterprise SaaS deployments.
The instance-per-client approach requires a runtime platform capable of rapid deployment of application system instances with efficient resource management and adequate isolation. Self-service and scalable operations demand a robust API to control the platform. While Kubernetes excels in rapid deployment, resource management, and API-driven control, it struggles with multi-tenant scenarios. Kubernetes namespaces provide some isolation, but certain application resources, such as custom resource definitions (CRDs) and custom resource controllers, are global within the Kubernetes cluster. Some SaaS providers opt for a “Kubernetes cluster per client” configuration, but this approach conflicts with the “rapid deployment” requirement, as provisioning a cluster in the cloud can take up to an hour, and it introduces increased resource overhead and management complexity.
For this problem, the extensible Kubernetes API is the basis for solutions. Several open-source projects implement Kubernetes API servers, effectively providing a virtual cluster with its own API endpoint on top of resources hosted in the underlying real Kubernetes cluster. These API servers enable users to get a dedicated logical cluster quickly and use it to install cluster-scoped resources like CRDs without impacting other tenants on the same underlying cluster. The functionality and use cases supported by these implementations vary, and we’ll explore a few examples.
vCluster emulates a fully functional Kubernetes cluster within a namespace of the underlying real cluster. vCluster is a CNCF-certified Kubernetes implementation, so from a user perspective, it’s almost indistinguishable from a real cluster. Virtual cluster users can launch Pods, install CRDs and controllers, and configure their own access control.
Besides multi-tenant platforms, common use cases for vCluster include:
- Computing resource management within an enterprise, where each team has autonomy over their computing platform’s software and configuration. However, the enterprise maintains overall control and resource efficiency.
- Development environments, where development teams have complete control over their cluster with minimal overhead.
- Testbeds for Kubernetes extensions.
kcp is a CNCF project developing a Kubernetes-compatible API server that implements “workspaces.” Think of them as virtual Kubernetes clusters with flexible resource mapping. A workspace starts empty and must be populated with CRDs and resource mappings. A single kcp workspace can map resources from multiple underlying clusters. As API objects, workspaces are lightweight and provisioned quickly, making kcp particularly useful for:
- Multi-tenant API-driven platforms
- Multi-regional application platforms
- Cluster federation
Kubernetes alone cannot resolve the multi-tenancy issue. Instead, it offers an extensible API and a runtime platform to host extensions, which are essentially software components. This allows developers to construct meta-APIs and meta-clusters for specific use cases with relative ease.
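The tenancy problem and the virtual-cluster remedy can be modeled in a few lines. The sketch below is a toy model, not the vCluster or kcp implementation: the classes, the single-version CRD registry, and the pod naming scheme are all simplifying assumptions made for illustration.

```python
class Cluster:
    """Toy API server: CRDs are cluster-scoped, other objects are namespaced."""

    def __init__(self, name):
        self.name = name
        self.crds = {}        # CRD name -> version, shared by every namespace
        self.namespaces = {}  # namespace -> list of object names

    def install_crd(self, crd_name, version):
        installed = self.crds.get(crd_name)
        if installed is not None and installed != version:
            raise ValueError(f"{crd_name} {installed} already installed in {self.name}")
        self.crds[crd_name] = version


class VirtualCluster(Cluster):
    """Toy virtual cluster: own CRD registry, workloads land in one host namespace."""

    def __init__(self, name, host, host_namespace):
        super().__init__(name)
        self.host = host
        self.host_namespace = host_namespace
        host.namespaces.setdefault(host_namespace, [])

    def create_pod(self, pod_name):
        # The pod is synced into the tenant's namespace on the host cluster
        # (hypothetical naming scheme).
        self.host.namespaces[self.host_namespace].append(f"{pod_name}-x-{self.name}")


host = Cluster("host")
host.install_crd("widgets.example.org", "v1")
try:
    host.install_crd("widgets.example.org", "v2")  # a second tenant on the same host
    conflict = False
except ValueError:
    conflict = True

tenant_a = VirtualCluster("tenant-a", host, "vc-tenant-a")
tenant_b = VirtualCluster("tenant-b", host, "vc-tenant-b")
tenant_a.install_crd("widgets.example.org", "v1")
tenant_b.install_crd("widgets.example.org", "v2")  # no conflict: separate registries
tenant_a.create_pod("web")
print(conflict, host.namespaces)
```

Two tenants that need incompatible versions of the same extension collide on a shared cluster, while each virtual cluster keeps its own cluster-scoped state and only the resulting workloads reach the host.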
Baseline comparison
When comparing the Kubernetes API-based approach for multi-tenant application platforms to the baseline, the following popular options should be considered:
- Cluster per client approach: While effective, this approach is costly and challenging to scale. Provisioning tenants takes time, leading to a poor user experience.
- Cloud-based systems approach: Operators in cloud-based systems utilize cloud primitives for tenant isolation. However, for complex applications, the most specific isolation container is a cloud account/project/subscription, which acts as the fundamental unit of cloud resource management. While functional, this approach is expensive and time-consuming for provisioning and maintenance and limits scalability.
Use case 3: Data center operating system for self-managed computing infrastructure
In 2024, less than half of enterprise software workloads operated in public clouds (Gartner, Uptime Institute, Statista). The remaining part isn’t solely confined to personal computers and mainframes. Since the advent of inexpensive server machines in the 1990s, which became the workhorse of enterprise computational power, companies have accumulated a lot of code and data within their data centers. This legacy isn’t going to disappear soon.
Aside from the above, public clouds will never be an option for certain workloads. For instance:
- When low response latency is crucial, such as in manufacturing, telecommunications, healthcare, and some fintech applications.
- When the network bandwidth is insufficient to transfer the volume of data for processing. For example, Large Hadron Collider raw experiment data is processed on-site. Extensive CCTV networks can also generate substantial amounts of data.
- When regulatory compliance is cited as a reason to retain on-premises systems. While cloud providers have acquired numerous compliance certifications, it can sometimes be easier to certify a system that is locked in the basement.
In the 21st century, running contemporary business software on individual computers resembles software coding in assembly language, a practice from the 1960s. Today, application software needs a programmable environment for distributed infrastructure, akin to an operating system for the data center.
An operating system manages resources and offers shared services to applications. For a distributed networked computing infrastructure like a data center, an operating system should provide an elastic computing environment with workload isolation and a collection of common application building blocks, including storage, communication, security, and telemetry. Such an operating system has the traits of a computing cloud.
Kubernetes alone isn’t a data center operating system but may serve as its kernel. It offers a uniform API for core infrastructure facilities, such as computing, storage, and networking. However, Kubernetes does not handle its installation or upgrades, relying on underlying infrastructure like cloud IaaS, physical infrastructure management tools, or manual intervention. Nevertheless, its primary strength lies in providing a uniform API for all resources within the “cloud computer.”
Assembling the underlying infrastructure can involve third-party tools, including commercial solutions like Red Hat OpenShift, open-source options (see the “DIY: Create Your Own Cloud with Kubernetes” blog post), and cloud-provided solutions. These solutions cover the operational aspects of the data center OS itself.
Once the foundation is in place, it can be expanded to support a broader range of workloads and library services. KubeVirt, a Red Hat-backed CNCF project, enables traditional virtual machine workloads by bringing Linux virtualization under Kubernetes control. VMs can be deployed from OCI images, similar to containers. Knative, another widely supported CNCF project, is a framework for serverless “functions as a service” workloads, handling both online synchronous and message-based processing. These building blocks collectively provide a gamut of workload types comparable to a mature public cloud.
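As an illustration of a VM expressed as a Kubernetes resource, the following sketch builds a KubeVirt `VirtualMachine` manifest whose root disk comes from a container image. The image reference, names, and memory size are hypothetical, and the manifest is deliberately minimal.

```python
import json


def virtual_machine(name: str, disk_image: str, memory: str = "1Gi") -> dict:
    """Build a minimal KubeVirt VirtualMachine manifest (illustrative values)."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name},
        "spec": {
            "runStrategy": "Always",  # keep the VM running
            "template": {"spec": {
                "domain": {
                    "devices": {"disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": memory}},
                },
                # The root disk is pulled from an OCI registry, like a container image.
                "volumes": [{"name": "rootdisk", "containerDisk": {"image": disk_image}}],
            }},
        },
    }


# Hypothetical registry path for a VM disk image.
vm = virtual_machine("legacy-erp", "registry.example.org/vm-disks/debian:12")
print(json.dumps(vm, indent=2))
```

The VM is created, listed, and deleted through the same API and tools as Pods and Services.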
Pure Kubernetes manages a few essential resources—CPU, memory, storage, and network. Specialized hardware can be exposed as managed Kubernetes resources through Kubernetes device plugins created by hardware vendors such as NVIDIA, AMD, and Intel. GPUs are the most notable example of such hardware, making Kubernetes indispensable for AI/ML workloads.
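The effect of a device plugin can be pictured as one more named resource that nodes advertise and Pods request. The toy first-fit scheduler below is only a sketch of that idea; the real Kubernetes scheduler is far more elaborate, and the node names and capacities are made up.

```python
def fits(node_free: dict, pod_requests: dict) -> bool:
    """A pod fits if the node has enough of every requested resource."""
    return all(node_free.get(res, 0) >= qty for res, qty in pod_requests.items())


def schedule(nodes: dict, pod_requests: dict):
    """First-fit placement; returns the chosen node name or None."""
    for name, free in nodes.items():
        if fits(free, pod_requests):
            for res, qty in pod_requests.items():
                free[res] -= qty  # reserve the resources on that node
            return name
    return None


# Hypothetical nodes; "nvidia.com/gpu" is the resource name exposed by the NVIDIA device plugin.
nodes = {
    "cpu-node": {"cpu": 16, "memory_gib": 64},
    "gpu-node": {"cpu": 32, "memory_gib": 256, "nvidia.com/gpu": 4},
}
training_pod = {"cpu": 8, "memory_gib": 64, "nvidia.com/gpu": 2}

placed = schedule(nodes, training_pod)
second = schedule(nodes, {"nvidia.com/gpu": 4})  # only 2 GPUs remain
print(placed, second)
```

Because the GPU is just another counted resource, quotas, autoscaling, and cost tooling can treat it the same way as CPU and memory.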
Major public cloud providers offer Kubernetes API extensions for their infrastructure, such as GCP Config Connector for Kubernetes, AWS Controllers for Kubernetes, and Azure Service Operator for Kubernetes. Crossplane, a CNCF project, provides a generic Kubernetes API extension that supports various resources managed through APIs. This enables cloud resources like object storage buckets, DNS zones, managed databases, and API gateways to become Kubernetes resources, on par with Pods and Services.
Together, these components create a resource management core for a data center operating system with resource support surpassing any existing cloud platform. Unlike proprietary cloud platforms, Kubernetes is not restricted to a specific vendor’s infrastructure. It unifies cross-cloud and cross-datacenter operations, providing a consistent experience across different environments.
On top of resource and workload management, a data center OS should also provide:
- Operations management tools: For an operating system that manages a large farm of resources such as a data center, these capabilities are important. Here the ecosystem of Kubernetes tooling comes into play, ranging from basic workload management, monitoring, and API control to security and cost management tools. Thanks to the uniform API for Kubernetes-managed resources, these tools work equally well with both stock Kubernetes objects and extension-managed resources. A reasonable microservices platform includes them.
- Multi-tenancy: This is an essential trait of an operating system for shared computing infrastructure, cloud, on-premises, or hybrid. As discussed earlier, for this facet, Kubernetes does have a solution.
- Programmability: This is a key property of an operating system. For a data center OS, it means Infrastructure as Code (IaC). Here, thanks to the well-designed uniform resource API, Kubernetes is unbeatable. This topic is discussed in detail in our “IaC framework selection guideline” blog post.
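The claim that tools work the same way on stock and extension-managed resources follows from every object sharing one shape: `apiVersion`, `kind`, `metadata`, `spec`. A minimal sketch, assuming a hypothetical `Bucket` custom resource in a made-up `storage.example.org` API group:

```python
from collections import Counter

objects = [
    {"apiVersion": "v1", "kind": "Pod",
     "metadata": {"name": "api", "namespace": "shop", "labels": {"team": "checkout"}},
     "spec": {"containers": [{"name": "api", "image": "registry.example.org/api:1.0"}]}},
    {"apiVersion": "v1", "kind": "Service",
     "metadata": {"name": "api", "namespace": "shop", "labels": {"team": "checkout"}},
     "spec": {"ports": [{"port": 80}]}},
    # Hypothetical custom resource representing a cloud object storage bucket.
    {"apiVersion": "storage.example.org/v1alpha1", "kind": "Bucket",
     "metadata": {"name": "invoices", "namespace": "shop", "labels": {"team": "billing"}},
     "spec": {"region": "eu-west-1"}},
    {"apiVersion": "v1", "kind": "Pod",
     "metadata": {"name": "scratch", "namespace": "shop"},
     "spec": {"containers": [{"name": "sh", "image": "registry.example.org/tools:1.0"}]}},
]


def inventory_by_team(objs):
    """Count objects per (team, kind) using only the common metadata fields."""
    counts = Counter()
    for obj in objs:
        team = obj["metadata"].get("labels", {}).get("team", "unowned")
        counts[(team, obj["kind"])] += 1
    return dict(counts)


inventory = inventory_by_team(objects)
print(inventory)
```

The inventory function never inspects `spec`, so it needs no changes when a new resource kind is added to the platform.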
Baseline comparison
Finally, let’s contrast Kubernetes-based solutions with baseline options, such as public clouds or data center management platforms from VMware/Broadcom, Nutanix, Proxmox, and OpenStack. Common sense suggests that building and maintaining your own data center operating system is not a good idea. However, the specifics matter. If existing products lack support for a particular use case, assembling a custom solution might be reasonable. This is especially true for AI/ML systems, where assembling the solution from functional blocks is less labor-intensive than developing a data center operating system from scratch. Vendor Kubernetes platforms, such as Red Hat OpenShift and SUSE Rancher, can provide valuable assistance.
Choosing Kubernetes as the foundation for a data center operating system is not merely a technological decision; it’s also a risk management strategy. Each Kubernetes-based data center operating system comprises multiple building blocks with multiple options. These building blocks are mature, well-supported, and originate from trusted foundations or stable vendors.
Use case 4: Software test labs
Besides application software development, software test labs are widely used in various domains. Examples include the development of system software, such as service controllers and custom API resources for Kubernetes, system services for operating systems, embedded software, and operations automation. Test lab environments are frequently used for running test tools for functional, load, or resilience testing. An interesting class of use cases comes from the security domain, ranging from routine security scanning and penetration testing to risky activities like honeypot environments and controlled malware detonation. These security-related uses are often prohibited on public cloud platforms, necessitating the use of self-hosted infrastructure.
A test lab is a deployment environment designed to gather information about the workload rather than supporting its primary function. Test lab users do things that would be unnecessary or dangerous in production. In many cases, they require complete control over the environment, effectively having “root” access. Test labs should be ephemeral, as they are often left in an unpredictable state after use, making it more convenient to discard and provision a new one rather than restore its state. Additionally, test labs must be strongly isolated.
Despite the seemingly conflicting requirements, techniques discussed in the “Multi-tenant application platform” and “Datacenter operating system” cases can be leveraged to create an infrastructure platform for test labs that supports ephemeral workloads while providing a high level of isolation. If containers work as the workload enclosure, a few open-source projects, such as Kata Containers and Google gVisor, implement additional isolation. The table below provides several examples of how isolation requirements for specific use cases are addressed within the Kubernetes ecosystem.
| Use case | Isolation level | Implementation |
| --- | --- | --- |
| Application software testing | Kubernetes namespace | Kubernetes |
| Remote development environment | Kubernetes pod | gVisor |
| Kubernetes CRD development | Kubernetes cluster | vCluster |
| Operations automation | Kubernetes cluster | vCluster |
| OS software development | VM | KubeVirt |
| Security testing | VM | Kata Containers |
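For the container-based rows, stronger isolation is typically selected per Pod through a `RuntimeClass`. The sketch below shows the two objects involved. The handler names must match the container runtime configuration on the nodes; `runsc` and `kata` are assumed here, and the images are hypothetical.

```python
# Handler names depend on how the nodes' container runtime is configured;
# "runsc" (gVisor) and "kata" (Kata Containers) are assumptions for this sketch.
RUNTIME_HANDLERS = {"gvisor": "runsc", "kata": "kata"}


def runtime_class(name: str) -> dict:
    """Cluster-scoped object that maps a name to a sandboxed runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": RUNTIME_HANDLERS[name],
    }


def sandboxed_pod(name: str, image: str, runtime: str) -> dict:
    """Pod that asks to run under the given RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime,
            "containers": [{"name": "main", "image": image}],
        },
    }


dev_env = sandboxed_pod("dev-env", "registry.example.org/dev-env:latest", "gvisor")
scan_job = sandboxed_pod("sec-scan", "registry.example.org/scanner:latest", "kata")
print(runtime_class("gvisor")["handler"], dev_env["spec"]["runtimeClassName"])
```

The workload definition stays the same across isolation levels; only one field in the Pod spec changes.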
Test lab environments are useful when they can be directly provisioned by the users, in a self-service manner. Kubernetes handles deployment environment provisioning through the same API as application software deployments, unlocking a plethora of application life cycle management tools for this specific use case. “Internal developer portal” systems, like Backstage by Spotify, serve as libraries of environment blueprints.
Environment isolation on a shared platform depends on proper configuration. Accidental misconfigurations can occur since the users themselves provision the test deployment environments. Kubernetes, as an infrastructure API platform, provides two mechanisms to address this: access control and admission controllers. These can be used in conjunction with policy management frameworks, such as CNCF Open Policy Agent (OPA), to implement access management and security measures that surpass those offered by public clouds.
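To show what an admission check looks like, here is a minimal sketch of the decision logic of a validating admission webhook. It takes an `AdmissionReview` request and returns an `AdmissionReview` response. The two rules (no privileged containers, no host networking) are example policies chosen for illustration; the HTTP server and TLS setup a real webhook needs are omitted.

```python
def review(admission_review: dict) -> dict:
    """Decide whether a Pod from a test lab may be admitted (example rules only)."""
    request = admission_review["request"]
    pod = request["object"]
    problems = []

    for container in pod["spec"].get("containers", []):
        if container.get("securityContext", {}).get("privileged"):
            problems.append(f"container {container['name']} is privileged")
    if pod["spec"].get("hostNetwork"):
        problems.append("hostNetwork is not allowed")

    # The response must echo the request uid.
    response = {"uid": request["uid"], "allowed": not problems}
    if problems:
        response["status"] = {"code": 403, "message": "; ".join(problems)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


incoming = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "example-uid-1",  # hypothetical request id
        "object": {
            "kind": "Pod",
            "metadata": {"name": "load-test"},
            "spec": {"containers": [
                {"name": "runner", "image": "registry.example.org/runner:1.0",
                 "securityContext": {"privileged": True}},
            ]},
        },
    },
}
decision = review(incoming)
print(decision["response"])
```

A policy framework such as OPA plays the same role as the `review` function, with the rules written as policy instead of code.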
Baseline comparison
As before, let’s compare Kubernetes-based infrastructure against the existing baseline. For test labs, the primary infrastructure choices are either public clouds or self-hosted VM management platforms. However, test labs frequently need specialized infrastructure that mainstream platforms might not prioritize:
- In public clouds, an account (or a project, or a subscription; the terminology depends on the cloud platform) implements the fundamental isolation level. Major clouds have robust account management mechanisms, both proactive and reactive. However, a cloud account is a basic resource container rather than a fully equipped lab, and a lot remains to be done to make it useful. Also, an account is an expensive resource, limited in quantity. Fully automated account provisioning may take an hour or more. Still, for some use cases, it’s a working approach that is encoded in the best practice guidelines of cloud providers.
- Integrated VM management platforms, offered by vendors like VMware, Nutanix, and Proxmox, excel in provisioning and isolation. Their commercial nature provides polished interfaces and management capabilities, but they focus solely on VM management. Additionally, VM-based environments lack elasticity, posing a constraint for use cases like load testing.
So while the convenient cloud-based and self-hosted software test lab options work well, they have limitations. In many cases, Kubernetes-based toolstacks offer a more comprehensive and sustainable solution.
Use case 5: Edge computing and multi-cloud
Contrary to the blog post title, this use case is actually about running container workloads. Linux containers have become the most widely used portable software packaging format, making Kubernetes the default choice for deploying software at the edge.
It’s worth noting that edge doesn’t always imply low-end hardware. For example, an edge computing site in a big-box store could be a small data center. Its “edge” designation comes from its role as a “spoke” connected to the central “hub” hosting the core of the enterprise systems. Edge locations primarily host production workloads dedicated to that specific site and are mostly managed remotely.
Technically, edge sites can operate independently. Workload deployments can be automated locally using GitOps tools, while the central hub controls the overall ALM automation. However, this approach pushes too much complexity to the edge.
A more practical solution is to manage edge locations using the Kubernetes resource model. This approach offers a middle ground as it’s simpler than full-blown ALM automation, yet it’s standardized and generic enough to accommodate local variations. The techniques for workload orchestration across a fleet of loosely coupled Kubernetes clusters are collectively known as cluster federation.
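The core of a federation scheme can be sketched as "select clusters by label, then stamp out the same resource with per-site overrides." The code below is a toy model of that idea and does not follow the API of any particular federation project; cluster names, labels, and the placement structure are all hypothetical.

```python
# Hypothetical fleet: cluster name -> labels.
clusters = {
    "store-berlin": {"role": "edge", "region": "eu"},
    "store-lyon": {"role": "edge", "region": "eu"},
    "store-austin": {"role": "edge", "region": "us"},
    "hub": {"role": "hub", "region": "eu"},
}

# Hypothetical placement: which clusters get the workload, and local variations.
placement = {
    "selector": {"role": "edge", "region": "eu"},
    "template": {"kind": "Deployment", "name": "pos-gateway", "replicas": 2},
    "overrides": {"store-lyon": {"replicas": 1}},
}


def render(placement: dict, clusters: dict) -> dict:
    """Return the object to apply on each matching cluster."""
    result = {}
    for name in sorted(clusters):
        labels = clusters[name]
        if all(labels.get(k) == v for k, v in placement["selector"].items()):
            obj = dict(placement["template"])                 # copy the shared template
            obj.update(placement["overrides"].get(name, {}))  # apply local variation
            result[name] = obj
    return result


rendered = render(placement, clusters)
print(rendered)
```

The hub holds one declaration, and each edge site receives an ordinary Kubernetes object that it can run without further contact with the hub.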
Kubernetes doesn’t seem to be the optimal choice for edge computing because it’s complex for such low-maintenance facilities and isn’t designed for operating over the public Internet. Additionally, cluster federation is a less mature area in Kubernetes development. There are a few ways to address these limitations:
- A DIY solution using techniques discussed above. Utilize techniques from the “Datacenter operating system” and “Multi-tenant application platform” sections to build a resource management platform and a multi-cluster system for the edge.
- Open-source solutions for edge computing and cluster federation for Kubernetes. There are a few active open-source projects working on this problem; Open Cluster Management (OCM) and Karmada are the best-known examples.
- Commercial solutions for edge computing. The best-known examples are Red Hat OpenShift and SUSE Rancher, with Rancher supporting tiny deployments with K3s, a small-footprint Kubernetes variant.
- Leverage cloud extensions like EKS Anywhere by AWS and GKE Enterprise (Anthos) by Google. This option works best when the central systems (the “hub”) are in the cloud.
Multi-cloud is rarely an objective in itself. The major cloud platforms exceed the service level and reliability needs of most enterprise applications, both as a technology platform and as a business partner. Multi-cloud just happens when different mono-cloud parts of an organization come together. For these cases, a uniform infrastructure is not necessary, as integration happens at a higher level. A reasonably “high” level can be the Kubernetes resource model and API.
Baseline comparison
In edge computing, there are currently no standard solutions that specifically address the needs of this use case. As a result, Kubernetes has the potential to become the default choice, much like Linux became the default operating system for edge and embedded devices.
What Kubernetes brings that public clouds lack
After reviewing a few diverse use cases, we can distill the specific traits of Kubernetes, which make it a viable application infrastructure platform alternative to the public clouds backed by the largest companies in the world.
Kubernetes is an extensible API framework for the application runtime environment, with “extensibility” as its defining characteristic. It functions as the system library for application software (such as libc in Unix-like systems) that is complemented by countless third-party and custom libraries. In contrast, in the context of the public cloud as a computer, a developer is constrained to the system library, which, while offering a comprehensive array of features, remains finite in its capabilities.
When an application system is inherently connected to the runtime infrastructure, this limitation becomes apparent. Applications may interact with the runtime environment for data (such as service discovery) or resources (particularly for batch workloads). This interaction isn’t problematic for basic web services that scale automatically based on CPU load. However, it can pose challenges for many business, scientific, and software development scenarios, especially for AI/ML workloads. In these cases, a proprietary public cloud platform may be restraining compared to the open and broad Kubernetes ecosystem.
Layering, a popular architectural pattern, is important for complex systems, as it keeps them from becoming a big ball of spaghetti. Unfortunately, it doesn’t play well with public cloud APIs. For example, creating a customized “AWS” API on top of the regular AWS infrastructure, including specific DBMS and computing primitives, but still compatible with AWS tools, isn’t feasible. By contrast, implementing a Kubernetes-like API on top of the Kubernetes API is straightforward and supports the common API design patterns.
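In Kubernetes, adding a layer starts with registering a new resource type through a `CustomResourceDefinition`. The sketch below generates one for a hypothetical higher-level `TenantDatabase` API; the group name, kind, and fields are invented for illustration, and the controller that would act on such objects is not shown.

```python
import json


def crd(group: str, kind: str, plural: str, spec_properties: dict) -> dict:
    """Build a CustomResourceDefinition for a namespaced custom API."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        "metadata": {"name": f"{plural}.{group}"},  # must be <plural>.<group>
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"kind": kind, "plural": plural, "singular": kind.lower()},
            "versions": [{
                "name": "v1alpha1",
                "served": True,
                "storage": True,
                "schema": {"openAPIV3Schema": {
                    "type": "object",
                    "properties": {"spec": {"type": "object", "properties": spec_properties}},
                }},
            }],
        },
    }


# Hypothetical platform-level API layered on top of lower-level resources.
tenant_db = crd(
    group="platform.example.org",
    kind="TenantDatabase",
    plural="tenantdatabases",
    spec_properties={"engine": {"type": "string"}, "storageGiB": {"type": "integer"}},
)
print(json.dumps(tenant_db, indent=2))
```

Once registered, `TenantDatabase` objects are served by the same API server and handled by the same tools, access control, and admission policies as built-in resources.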
A comprehensive cloud API can be assembled with Kubernetes on top of a public cloud infrastructure. This can be useful when the application system outgrows the high-level building blocks of the underlying cloud. AWS Lambda is a good example. While Lambda is convenient and accelerates web application development, it can become expensive at scale. Additionally, deploying a substantial application containing tens to hundreds of Lambda functions using CloudFormation can take over an hour, slowing down the development process. Therefore, at a certain point, it may be reasonable to switch to a self-managed serverless framework like Knative and pay only for the underlying computing power with either spot or committed use discounts.
Public cloud extensibility may be addressed through the use of vendor-neutral infrastructure as code (IaC) frameworks like Terraform/OpenTofu and Pulumi. These frameworks support a wide range of existing cloud services and are easily extensible. However, it’s important to note that IaC frameworks don’t have APIs. Instead, they consist of code that controls cloud resources when executed. To understand the distinction, consider the well-defined Kubernetes CRUD API supported by the Kubernetes reconciler. Comparing IaC frameworks to this API is akin to comparing the traditional management of physical servers using SSH and CLI tools to a modern, API-controlled cloud platform. This difference powers the technological leap enabled by cloud platforms. Also, an IaC framework is an end-user tool, while an API forms the foundation for an ecosystem.
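The distinction is easier to see with a toy reconciler: clients only declare the desired state, and a loop on the platform side keeps driving the actual state towards it. This is a simplified sketch of the pattern, not Kubernetes controller code; the resource names are made up.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Drive `actual` towards `desired`; return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(("create", name))
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(("update", name))
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(("delete", name))
    return actions


# Declared through the API (desired) versus what currently exists (actual).
desired = {"db": {"size": "large"}, "cache": {"size": "small"}}
actual = {"db": {"size": "small"}, "old-queue": {"size": "small"}}

first_pass = reconcile(desired, actual)
second_pass = reconcile(desired, actual)  # nothing left to do
print(first_pass, second_pass)
```

An IaC tool performs a comparable diff only when someone runs it, whereas a reconciler runs continuously behind the API, so drift is corrected without a client being involved.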
SDLC performance is often overlooked, but industry benchmarks such as the DORA Accelerate State of DevOps report indicate that it is a determining factor in organizational success. The report also highlights the significant role of clouds in enhancing SDLC performance. However, the focus of SDLC orchestration tool vendors primarily revolves around Kubernetes as the target infrastructure, paying less attention to public clouds. This move is understandable as these vendors aim to avoid competing with the cloud providers on their home turf. Open-source projects like GitOps automation tools also exhibit a similar bias towards Kubernetes. As a result, public cloud platforms are largely restricted to cloud-supplied SDLC automation tools and basic services from code hosting platforms like GitHub. For complex systems, organizations often resort to investing in custom “DevOps automation.” This approach can be viable, and the lack of sophisticated tooling may be compensated by simplifying SDLC processes (which is a good thing anyway). However, in cases where the limitations of public cloud services become restrictive, it is crucial to consider SDLC concerns and the role of Kubernetes-based tools in improving them.
For clarity, it isn’t a “build vs buy” debate—this article is already for builders. If the public cloud of choice satisfies the needs of the use case, there’s no need to look further. Custom solutions might not justify the time investment. However, when the use case is too specific for the cloud, or when its goals don’t align with the cloud provider’s interests, building a custom Kubernetes-based application platform becomes a viable option.
Conclusion
Today, Kubernetes as cluster management software for container workloads is a generally available low-level commodity. As a developer of regular business applications, you likely won’t need to work closely with it in the foreseeable future.
But Kubernetes as an application platform framework is growing. While public cloud services are adequate for many common business applications, when it comes to technological advancements in complex problems (such as AI/ML), operational scale (multi-tenancy), and heterogeneous environments (accelerated computing, edge, multi-cloud, and on-prem), Kubernetes is the go-to framework. Learn and employ it if you work in areas where business advances with new technology.
Credits
Special thanks to Sergey Plastinkin, Dmitry Mezhensky, Alexander Danilov, and Dmytro Zamaruiev for their thorough feedback and discussions.