LLMOps blueprint for open-source large language models
Aug 07, 2024 • 10 min read
Deploying open-source large language models (LLMs) like Mixtral-8x7B, Mistral, and Llama 2 requires a significantly different LLMOps architecture compared to using closed-source models from providers like OpenAI, Google, and Anthropic, even though data vectorization, embeddings creation and observability are similar.
This paper presents a reusable, scalable LLMOps platform for building AI solutions with open-source LLMs. Key challenges addressed include supporting on-premises and hybrid deployments, managing diverse GPU hardware, and enabling efficient scaling. The proposed architecture leverages cloud-agnostic components for data engineering, model training and inference, and observability.
For organizations seeking to maintain data privacy and control costs, this platform provides a flexible approach to deriving maximum business value from open-source LLMs. Technical leaders will gain an understanding of the core components and design choices involved in implementing enterprise-grade LLMOps for open-source models.
Challenges in deploying open-source LLMs
As large language models evolve, many challenges arise that need to be solved before production deployment. While our previous article explored the broader challenges of building an end-to-end LLMOps platform, this article focuses specifically on the complexities of running open-source LLMs in cloud or on-premises environments.
Data privacy and intellectual property concerns
Most modern enterprises adhere to strict protocols for managing personally identifiable information (PII) and intellectual property. These organizations are often prohibited from sharing sensitive data or code outside their corporate security perimeter. Consequently, their only viable option for leveraging LLM technology is to deploy models within their own on-premises infrastructure.
Deployment scenarios
Our clients use various approaches to manage their data processing needs. Some operate massive on-premises data centers with modern GPUs for training, but may use cloud services when they reach capacity limits. This flexibility allows for temporary or long-term cloud use for specific applications. Based on their infrastructure choices, we’ve identified three primary deployment scenarios:
- Hybrid deployments across on-premises and cloud
- Fully cloud-based deployments
- Entirely on-premises deployments
Clients who run hybrid deployments typically aim to balance costs by using on-premises resources for expensive model training, while leveraging the cloud for less costly inference tasks. Those operating entirely on-premises typically do so due to regulatory requirements, which necessitates a cloud-agnostic toolset. Fully cloud-based clients enjoy the most flexibility and can build target LLM platforms leveraging a wide range of cloud-native and cloud-agnostic solutions.
Considering the above, clients requiring cloud-agnostic solutions, typically those with on-premises or hybrid deployments, have specific expectations:
- Support for on-premises and hybrid deployments
- Scalability from on-premises to cloud
- Management of diverse GPU hardware across environments, including on-premises options not available in the cloud (e.g. Intel GPU Max series)
- Deployment of ML models to various target GPU hardware architectures
These clients often face significant challenges in building end-to-end systems, onboarding legacy models from different environments, and implementing comprehensive MLOps, observability, on-demand scaling, and data platform integration.
In contrast, cloud-based clients avoid cross-cloud complexities and can choose industry-leading cloud-native or cloud-agnostic solutions. However, they must implement strict cost control and hardware optimization policies to prevent drastic budget increases.
The following sections compare essential components of hybrid or cloud-agnostic solutions with cloud-native options like Amazon SageMaker, focusing on key solution elements and industry best practices.
Cloud-agnostic architecture
This cloud-agnostic blueprint is built on the Ray framework for AI workloads. It provides a runtime environment for model training, inference, and observability:
It’s important to note that Ray AI is primarily a framework for distributed training and inference across all ML model types. It does not offer:
- Model training optimization
- Pipeline orchestration
- Hyperparameter tuning
- Explainability features
- Experiment management
The architecture contains four major components:
- Data engineering
- Training capabilities
- Real-time inference
- Observability
In the following sections, we’ll focus on two key aspects:
- Building inference capabilities using Ray AI
- Technical essentials for scaling LLMs
Data engineering for LLMs
Data engineering for LLMs relies heavily on traditional data engineering tools like Apache Spark, Apache Flink or cloud-native technologies. In retrieval augmented generation (RAG) architectures, massive amounts of unstructured data (e.g., thousands of PDF or Word documents) are processed, parsed, split into chunks, and vectorized to solve specific business use cases such as chatbots or knowledge bases.
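The chunking step mentioned above can be sketched in a few lines. This is a minimal illustration, assuming word-based chunks with a fixed overlap; the function name `split_into_chunks` and the default sizes are hypothetical, and a production pipeline would more likely run equivalent logic inside a Spark or Flink job.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.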
Given the large scale of document collections, Big Data techniques are essential:
- Apache Spark or Apache Flink for data transformations
- Kubernetes for distributed computing
- Apache Airflow or Dagster for data pipeline orchestration
Typical data preparation steps include:
- Creating a bag of words from each document
- Forming a vocabulary from unique words in the entire word corpus
- Creating a sparse matrix from the vocabulary
- Extending the sparse matrix to include both single words and bigrams
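The preparation steps above can be sketched with the standard library alone. This is a simplified illustration, not a description of any particular pipeline: the helper names are hypothetical, tokenization is a plain lowercase regex, and the sparse matrix is represented as one `{column_index: count}` dictionary per document.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms(text: str) -> List[str]:
    """Return single words followed by bigrams of adjacent words."""
    words = tokenize(text)
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def build_sparse_matrix(docs: List[str]) -> Tuple[Dict[str, int], List[Dict[int, int]]]:
    """Build a vocabulary and a sparse term-count row for each document."""
    bags = [Counter(terms(doc)) for doc in docs]  # bag of words per document
    unique_terms = sorted(set().union(*bags))     # vocabulary over the whole corpus
    vocabulary = {term: index for index, term in enumerate(unique_terms)}
    rows = [{vocabulary[t]: n for t, n in bag.items()} for bag in bags]
    return vocabulary, rows
```

Only non-zero counts are stored, which matters once the vocabulary grows to hundreds of thousands of terms.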
The industry provides other vectorization techniques like TF-IDF, Word2Vec, CBOW, and others. Data engineering should support each of these approaches flexibly, integrating with the upstream LLM requirements and downstream data sources. In RAG architectures, vectorization simplifies identifying the most relevant documents. All documents are vectorized, and for a search query, a subset of matching documents is identified through vector search. The LLM then determines the most relevant results from this subset.
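To make the retrieval step concrete, here is a minimal sketch of vector search using cosine similarity. It assumes plain term-count vectors in place of real embeddings, and the names `embed`, `cosine`, and `top_k` are hypothetical; a real system would use an embedding model and a vector database.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) for the k documents closest to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), index) for index, doc in enumerate(docs)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, stable on ties
    return [(index, score) for score, index in scored[:k]]
```

The subset returned by `top_k` is what would be passed to the LLM as context for the final ranking or answer.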
Inference layer
The inference layer, along with distributed training, forms the most complex part of the architecture. It requires proper automation to support LLMOps, scaling options, model observability, and tuning. Built on Ray Serve, this layer supports everything from single models to sophisticated pipelines. Each deployment is essentially a Kubernetes workload with specific interfaces for inference invocation, observability, and scaling.
An LLM serving application is defined by three key aspects:
- LLM deployment definition: Resource requirements, replica count, auto-scaling strategy, etc. Deployment handles are used for model pipeline deployments.
- Model itself: A runnable implementing the Ray AI interface and encapsulating business logic
- Communication protocols: Ray AI supports both gRPC (for async requests) and HTTP (for synchronous invocations)
Deployment and ingress traffic handling are core concepts in Ray AI, thus worth describing in more detail. First of all, there are two deployment methods:
- Single-node deployment (when possible)
- Multi-node deployment with shared state maintained externally (e.g., in cache or database)
Ray AI supports both deployment types. In Ray AI, a ‘deployment’ refers to the distribution of training or inference workloads across the cluster. This differs slightly from the industry-standard definition of deployment. Each deployment is a Python function or class responsible for serving requests. Deployments can be scaled through auto-scaling configuration, with each replica having specific hardware requirements (GPU or CPU). Below is an example configuration used to run Mixtral 8x7B on a Ray AI cluster:
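As an illustration of the kind of settings such a configuration carries, here is a minimal sketch built on Ray Serve's `serve.deployment` options (`name`, `ray_actor_options`, `autoscaling_config`). All values are assumptions chosen for the example rather than the settings used in our tests, and `MixtralServer` is a hypothetical placeholder that returns a stub string instead of loading a model.

```python
from typing import Any, Dict

# Illustrative values only (assumptions, not the tested configuration).
MIXTRAL_DEPLOYMENT_OPTIONS: Dict[str, Any] = {
    "name": "mixtral_8x7b",
    # Assumption: each replica claims all 8 GPUs of a single node.
    "ray_actor_options": {"num_gpus": 8},
    # Ray Serve adds or removes replicas within these bounds.
    "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
}


class MixtralServer:
    """Hypothetical model wrapper; real model loading is out of scope here."""

    def generate(self, prompt: str) -> str:
        # Stub: a real implementation would call the loaded model.
        return f"[stub completion for: {prompt}]"

    async def __call__(self, request) -> Dict[str, str]:
        # Ray Serve passes the incoming HTTP request as a Starlette Request.
        payload = await request.json()
        return {"completion": self.generate(payload["prompt"])}


def build_app():
    """Wrap the class as a Ray Serve deployment and bind it into an application."""
    from ray import serve  # imported lazily so the options can be inspected without Ray

    return serve.deployment(**MIXTRAL_DEPLOYMENT_OPTIONS)(MixtralServer).bind()
```

With Ray installed and a cluster available, `serve.run(build_app())` would deploy the application. Setting `autoscaling_config` instead of a fixed `num_replicas` lets Ray Serve adjust the replica count with load.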
Modern business applications often require multiple LLM or ML models to operate as an ensemble, with preprocessing, processing, and postprocessing steps. Each step has its own deployment and scaling configuration but is orchestrated as a single pipeline. Ray AI provides three options to handle sophisticated model ensembles:
- Request chaining: A single request passes through a chain of models, with each model’s output becoming the input for the next in the pipeline.
- Request streaming: An async, future-like response that may take some time to evaluate.
- Asynchronous request (pre)processing: Downstream models can begin preprocessing before upstream model invocation is complete.
Each model in the pipeline maintains its own deployment, scaling configuration, and monitoring.
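The request-chaining pattern can be illustrated with plain `asyncio`. This sketch deliberately does not use the Ray API: the three coroutines are hypothetical stand-ins for separately deployed pipeline stages, and in a Ray-based pipeline each call would go through a deployment handle instead.

```python
import asyncio
from typing import List


async def preprocess(prompt: str) -> str:
    """Stand-in for a preprocessing model: normalize the prompt."""
    await asyncio.sleep(0)  # yield control, as a remote call would
    return prompt.strip().lower()


async def generate(prompt: str) -> str:
    """Stand-in for the LLM stage."""
    await asyncio.sleep(0)
    return f"answer to '{prompt}'"


async def postprocess(text: str) -> str:
    """Stand-in for a postprocessing model."""
    await asyncio.sleep(0)
    return text.capitalize()


async def run_chain(prompt: str) -> str:
    """Pass one request through the chain; each output feeds the next stage."""
    result = prompt
    for stage in (preprocess, generate, postprocess):
        result = await stage(result)
    return result


async def run_batch(prompts: List[str]) -> List[str]:
    """Run several chains concurrently so stages of different requests overlap."""
    return list(await asyncio.gather(*(run_chain(p) for p in prompts)))
```

Because every stage is awaited independently, one request can be in preprocessing while another is already in generation, which is the property that lets each stage scale on its own.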
Not all modern LLMs fit into a single GPU-accelerated virtual machine; however, in our case, Mixtral 8x7B does. We evaluated its performance in both single-node and cluster deployments. Our performance tests revealed that single-node deployment has lower latency than cluster deployment, because cluster deployments must transfer state across nodes during inference calls. We’ll provide more details in the next section. It’s worth noting that while cluster deployments may introduce some latency, they offer advantages in scalability and resource management for larger models or higher throughput requirements.
Inference scaling
As previously mentioned, Ray AI supports two deployment and scaling methods for LLMs:
- Single-node deployment:
  - Used when the model fits on one machine
  - Scales by replicating identical copies to serve multiple requests
  - Follows a ‘cookie-cutter’ approach
- Multi-node (cluster) deployment:
  - Used when the model doesn’t fit on a single machine
  - Distributes inference across multiple nodes
  - Splits model weights between several servers for distributed inference
For our performance tests, we used Mixtral 8x7B, which fits on a single AWS g5.48xlarge GPU node. We compared two setups, both serving 10 parallel users:
- Single-node deployment:
  - Hardware: AWS g5.48xlarge
  - Model: Mixtral 8x7B
- Two-node deployment:
  - Hardware: Two AWS g5.24xlarge nodes
  - Model: Mixtral 8x7B
The following figure demonstrates the tokens per second for both deployments:
The figure above shows that single-node deployment provides twice the tokens per second compared to two-node deployment. We observe the same behavior for time to first token:
Two-node deployment exhibits 100 ms higher latency for the first token compared to single-node deployment, due to cross-node computations during inference calls.
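Both metrics discussed above can be measured from any streaming response. The following is a minimal sketch, assuming the model output is available as a Python iterable of tokens; `measure_stream` is a hypothetical helper and not the harness used for the tests described here.

```python
import time
from typing import Dict, Iterable


def measure_stream(tokens: Iterable[str]) -> Dict[str, float]:
    """Consume a token stream and report time to first token and tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # the first token has just arrived
        count += 1
    elapsed = time.perf_counter() - start
    return {
        "tokens": count,
        "time_to_first_token_s": (
            first_token_at - start if first_token_at is not None else float("nan")
        ),
        "tokens_per_second": count / elapsed if elapsed > 0 else 0.0,
    }
```

Running the same measurement against both deployments with an identical prompt set and concurrency level keeps the comparison fair.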
There are two ways LLMs might be deployed:
- Strictly on a single node
- Distributed across multiple nodes
The single-node option is straightforward, resembling a microservices deployment orchestrated by a load balancer and Kubernetes. The multi-node option is more complex. For example, scaling from 5 to 7 nodes requires a full redeployment, as the model can’t simply be extended to two new nodes.
For models that fit on a single machine, we recommend starting with single-node deployment and scaling identical copies. This approach is technically easier and covers most use cases.
Generally speaking, there are two scaling options available:
- Horizontal: Manage scaling by increasing/decreasing the number of replicas
- Vertical: Manage scaling by increasing hardware resources
Horizontal scaling can be implemented by:
- Adding more single-node model replicas (when the LLM fits on a single node)
- Completely redeploying to a bigger cluster
The vertical scaling strategy is straightforward, similar to scaling other types of runnables like microservices.
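For horizontal scaling of single-node replicas, a rough capacity estimate helps pick the replica count. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function name, the headroom factor, and every number passed to it are hypothetical, and the per-replica throughput should come from your own measurements.

```python
import math


def replicas_needed(
    concurrent_users: int,
    tokens_per_user_per_s: float,
    replica_tokens_per_s: float,
    headroom: float = 0.2,
) -> int:
    """Estimate how many identical replicas cover the expected token demand."""
    if replica_tokens_per_s <= 0:
        raise ValueError("replica_tokens_per_s must be positive")
    # Total demand, padded by a safety margin for traffic spikes.
    demand = concurrent_users * tokens_per_user_per_s * (1 + headroom)
    return max(1, math.ceil(demand / replica_tokens_per_s))
```

For example, `replicas_needed(10, 20.0, 100.0)` pads a demand of 200 tokens per second by 20% and rounds up to three replicas.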
Our experience and synthetic tests show that single-node deployments outperform cluster deployments. For instance, deployment to an AWS g5.48xlarge instance has lower latency than a model deployed across two AWS g5.24xlarge instances. Cluster node deployments require data transfer between nodes and additional algorithmic calculations, which are faster on a single node. Moreover, single-node scaling is much simpler compared to cluster scaling.
Observability
Ray AI provides monitoring and debugging tooling designed to visualize metrics for physical and logical components, as well as provide job profiling information. By default, the following metrics are available for profiling:
- Ray cluster overview and status
- Task and actor breakdown information
- Task execution timeline
- Ray Serve information:
  - Serving metrics
  - Application information
  - Replica details
- Logs
The image above demonstrates a cluster metrics overview split by GPU core during LLM invocation. More advanced metrics can be configured using Prometheus/Grafana. Generally, Ray AI provides tooling for monitoring and debugging at the infrastructure layer, including:
- Token consumption
- CPU/GPU utilization
- Memory usage
However, observability metrics specific to LLM evaluation, prompt troubleshooting, and monitoring RAG architectures are typically addressed by specialized tooling focused on LLM monitoring. It’s worth noting that some LLM serving engines, such as vLLM, offer out-of-the-box Prometheus tooling and profiling of application-level metrics, including:
- Time to first token
- Token throughput
- Request throughput
These additional metrics can provide a more comprehensive view of LLM performance beyond the infrastructure-level metrics offered by Ray AI.
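Metrics exposed in the Prometheus text format are straightforward to read programmatically. The sketch below parses a sample payload into a dictionary; the metric names in the usage example are hypothetical and do not correspond to any specific engine, and the parser assumes samples carry no trailing timestamp.

```python
from typing import Dict


def parse_prometheus_text(payload: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}."""
    metrics: Dict[str, float] = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field; labels may contain spaces.
        series, _, value = line.rpartition(" ")
        metrics[series] = float(value)
    return metrics
```

A mean time to first token can then be derived from a histogram's `_sum` and `_count` series:

```python
sample = """
# TYPE example_time_to_first_token_seconds histogram
example_time_to_first_token_seconds_sum 4.5
example_time_to_first_token_seconds_count 30
"""
parsed = parse_prometheus_text(sample)
mean_ttft = (
    parsed["example_time_to_first_token_seconds_sum"]
    / parsed["example_time_to_first_token_seconds_count"]
)
```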
Cloud-native architecture
Our cloud-native reference architecture is based on AWS, but similar architectures can be built using Google Cloud’s Vertex AI or Microsoft Azure’s AzureML. Unlike cloud-agnostic solutions, cloud-native approaches eliminate the need to maintain a separate runtime environment for testing and inference. This architecture depends heavily on the Amazon SageMaker service, which is used for training and inference.
Key components and features:
- Model deployment:
  - LLM models can be deployed from the SageMaker JumpStart hub
  - Models are provisioned via SageMaker Studio, fetching the latest versions from Hugging Face
- Data engineering:
  - Supported by AWS native services or Databricks
  - Data chunking and vectorization typically run as Apache Spark jobs on EMR or Databricks
- Transfer learning:
  - AWS offers a simplified transfer learning process for models on SageMaker
  - Enables accurate model tuning on smaller datasets (e.g., Llama 2 models)
- AWS Bedrock:
  - Manages LLMOps complexities
  - Potential alternative to custom implementations
Key considerations:
- Privacy: Reliance on AWS may violate compliance requirements for some clients
- Fine-tuning: Supported with some limitations
For clients requiring full control over third-party services due to compliance issues, AWS Bedrock may not be suitable, necessitating complete management of the LLM infrastructure.
Inference layer
SageMaker currently supports single-node deployment for inference, but with certain limitations on token count. By default, SageMaker terminates requests that run longer than 60 seconds, which imposes a hard limit on context size. Our experiments show that if a model generates more than 3,000 tokens, a timeout may occur. This behavior is consistent across all models, leading AWS to recommend setting the maximum token size to no more than 3,000.
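The relationship between the timeout and the token limit is simple arithmetic: a 3,000-token ceiling within a 60-second window corresponds to roughly 50 generated tokens per second. The sketch below turns that into a budget calculation; the function name and the `safety` factor are hypothetical, and the throughput figure should be measured for your own model and instance type.

```python
import math


def max_new_tokens(
    timeout_s: float,
    tokens_per_s: float,
    time_to_first_token_s: float = 0.0,
    safety: float = 0.9,
) -> int:
    """Largest generation length expected to finish before the request timeout."""
    # Time left for generation after the first token arrives.
    usable_s = max(0.0, timeout_s - time_to_first_token_s)
    return math.floor(usable_s * tokens_per_s * safety)
```

A slower model or a long prompt that delays the first token shrinks the budget accordingly, so the limit is best derived per deployment instead of being hard-coded.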
Model chaining and pipeline creation are supported by AWS out-of-the-box through an inference pipeline API. However, this requires creating the LLM model as a separate service and handling it appropriately. SageMaker exposes the root REST endpoint and manages the chain of requests between containers running on EC2 instances.
When it comes to scaling, AWS provides vertical scaling for LLMs, similar to the cloud-agnostic approach. For horizontal scaling, SageMaker allows running multiple independent instances of the same model with the ability to distribute traffic among them.
These limitations and features are crucial to consider when designing and implementing LLM solutions on SageMaker. The token limit, in particular, may significantly impact the use cases that can be effectively deployed on this platform.
Observability
AWS implements large language model observability through two main approaches:
- Infrastructure monitoring: Hardware utilization, log collection, and metrics tracking are handled through standard means in CloudWatch. This provides insights into the operational aspects of LLM deployment.
- Model-specific observability: Model observability and explainability are implemented using SageMaker Clarify. This tool helps developers automate the evaluation of self-hosted LLMs using metrics such as GLUE, ROUGE, and BLEU.
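To show what an overlap metric such as ROUGE measures, here is a simplified ROUGE-1 computation. It is an illustrative sketch only: it uses whitespace tokenization, the name `rouge_1` is hypothetical, and it is not the implementation used by SageMaker Clarify.

```python
from collections import Counter
from typing import Dict


def rouge_1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall rewards a generated answer for covering the reference, while precision penalizes padding, which is why evaluations usually report both or their F1.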
SageMaker Clarify provides a framework for evaluating both self-hosted LLMs and existing services like AWS Bedrock. Data scientists can select relevant datasets and metrics to run comprehensive evaluations of their models.
AWS offers an end-to-end lifecycle for introducing observability into MLOps/LLMOps projects. It’s worth noting that the general approaches to MLOps described in the article “How to enhance MLOps with ML observability features: A guide for AWS users” are still fully relevant and could be reused.
This dual approach to observability allows teams to monitor both the infrastructure performance and the model’s output quality, providing a comprehensive view of the LLM system’s health and effectiveness.
Conclusion
The field of modern LLMOps for open-source models is evolving at a rapid pace, aiming to provide a full spectrum of services for large language models. However, neither open-source technologies nor public clouds currently provide end-to-end, out-of-the-box support for LLMOps.
Cloud platforms offer an easier starting point for running LLMs, but they come with limitations:
- Restricted runtime configurations
- Limited scaling options
- Model pipelining requires adjustments
For organizations seeking to avoid vendor lock-in, open-source technologies like Ray AI provide tooling to build flexible, comprehensive solutions. These can support a wide range of business scenarios and scaling options. Cost control remains a critical challenge. To properly calculate hardware requirements and scaling options, usage scenarios must be well-understood in advance.
Despite these challenges, stable options exist for running open-source LLMs in-house. This approach offers two key advantages:
- Guaranteed privacy within the organization
- Ability to fine-tune models for specific use cases
As the field continues to mature, we can expect more integrated solutions to emerge, simplifying the process of deploying and managing open-source LLMs at scale.