A look at key features of Kubernetes that naturally fit the needs of AI inference and how they benefit inference workloads.

Many of the key features of Kubernetes naturally fit the needs of AI inference, whether it’s AI-enabled microservices or ML models, almost as if they were invented for the purpose. Let’s look at these features and how they benefit inference workloads.
1. Scalability
Scalability for AI-enabled applications and ML models ensures they can handle as much load as needed, such as a high number of concurrent user requests. Kubernetes has three native autoscaling mechanisms, each of which contributes to scalability: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA) and Cluster Autoscaler (CA).
- Horizontal Pod Autoscaler (HPA) scales the number of pods running applications or ML models based on metrics such as CPU, GPU and memory utilization. When demand increases, such as a spike in user requests, HPA scales resources up; when the load decreases, it scales them back down. (A minimal manifest sketch follows this list.)
- Vertical Pod Autoscaler (VPA) adjusts the CPU, GPU and memory requests and limits for containers in a pod based on their actual usage. By changing limits in a pod specification, you can control the amount of specific resources the pod can receive. This is useful for maximizing the utilization of each available resource on your node.
- Cluster Autoscaler (CA) adjusts the total pool of compute resources available across the cluster to meet workload demands. It dynamically adds or removes worker nodes based on the resource needs of the pods, which makes CA essential for serving inference from large ML models to a large user base.
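As an illustration, here is a minimal HPA manifest sketch that scales a model-serving Deployment on CPU utilization. The names (inference-api) are hypothetical, and scaling on GPU utilization would additionally require exposing GPU metrics through a custom or external metrics adapter; this sketch uses CPU for simplicity.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api          # hypothetical Deployment serving the model
  minReplicas: 2                 # keep a baseline for availability
  maxReplicas: 20                # cap replica count during traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70%
```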
Here are the key benefits of K8s scalability for AI inference:
- Ensuring high availability of AI workloads by automatically scaling the number of pod replicas up and down as needed
- Supporting product growth by automatically adjusting cluster size as needed
- Optimizing resource utilization based on an application's actual needs, thereby ensuring that you only pay for the resources your pods use
2. Resource Optimization
By thoroughly optimizing resource utilization for inference workloads, you can provide them with the appropriate amount of resources. This saves you money, which is particularly important when renting often-expensive GPUs. The key Kubernetes features that let you optimize resource usage for inference workloads are efficient resource allocation, detailed control over "limits" and "requests", and autoscaling.
- Efficient resource allocation: You can allocate specific amounts of GPU, CPU and RAM to pods by specifying them in a pod manifest. However, only NVIDIA accelerators currently support time-slicing and multi-instance partitioning of GPUs. If you use Intel or AMD accelerators, pods can only request entire GPUs.
- Detailed control over resource "limits" and "requests": The requests define the minimum resources a container needs, while limits prevent a container from using more than the specified amount. This gives you granular control over computing resources. (A manifest sketch follows this list.)
- Autoscaling: HPA, VPA and CA prevent wasting money on idle resources. If you configure these features correctly, you can keep idle resources to a minimum.
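As a sketch, the following pod spec (the name and image are hypothetical) requests a whole NVIDIA GPU and sets explicit CPU and memory requests and limits. Note that extended resources such as nvidia.com/gpu are requested in whole units, and their request and limit must be equal.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference            # hypothetical name
spec:
  containers:
    - name: model-server
      image: example.com/llm-server:latest   # hypothetical image
      resources:
        requests:                # minimum guaranteed to the container
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1      # whole GPU; no fractional requests here
        limits:                  # hard ceiling enforced by the kubelet
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: 1      # GPU request and limit must match
```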
With these Kubernetes capabilities, your workloads receive the computing power they need and no more. Since renting a mid-range GPU in the cloud can cost $1 to $2 per hour, you save a significant amount of money in the long run.
3. Performance Optimization
While AI inference is typically less resource-intensive than training, it still requires GPU and other computing resources to run efficiently. HPA, VPA and CA are the key contributors to Kubernetes’ ability to improve inference performance. They ensure optimal resource allocation for AI-enabled applications even as the load changes. However, you can use additional tools to help you control and predict the performance of your AI workloads, such as StormForge or Magalix Agent.
Overall, Kubernetes' elasticity and ability to fine-tune resource usage allow you to achieve optimal performance for your AI applications, regardless of their size and load.
4. Portability
It’s critical that AI workloads, such as ML models, are portable. This allows you to run them consistently across environments without worrying about infrastructure differences, saving you time and, consequently, money. Kubernetes enables portability primarily through two built-in features: containerization and compatibility with any environment.
- Containerization: Kubernetes uses containerization technologies, like containerd and Docker, to package ML models and AI-enabled applications, along with their dependencies, into portable containers. You can then use those containers on any cluster, in any environment, and even with other container orchestration tools.
- Support for multicloud and hybrid environments: Kubernetes clusters can be distributed across multiple environments, including public clouds, private clouds and on-premises infrastructure. This gives you flexibility and reduces vendor lock-in.
Here are the key benefits of K8s portability:
- Consistent ML model deployments across different environments
- Easier migration and updates of AI workloads
- Flexibility to choose cloud providers or on-premises infrastructure
5. Fault Tolerance
Infrastructure failures and downtime when running AI inference can lead to significant accuracy degradation, unpredictable model behavior or simply a service outage. This is unacceptable for many AI-enabled applications, including safety-critical applications such as robotics, autonomous driving and medical analytics. Kubernetes’ self-healing and fault tolerance features help prevent these problems.
- Pod-level and node-level fault tolerance: If a pod goes down or doesn't respond, Kubernetes automatically detects the issue and restarts the pod. This ensures that the application remains available and responsive. If a node running a pod fails, Kubernetes automatically reschedules the pod to a healthy node.
- Rolling updates: Kubernetes supports rolling updates, so you can update your container images with minimal downtime. This lets you quickly deploy bug fixes or model updates without disrupting the running inference service.
- Readiness and liveness probes: These probes are health checks that detect when a container is not ready to receive traffic or has become unhealthy, triggering a restart or replacement if necessary. (A sketch combining probes with a rolling-update strategy follows this list.)
- Cluster self-healing: K8s can automatically repair control plane and worker node problems, like replacing failed nodes or restarting unhealthy components. This helps maintain the overall health and availability of the cluster running your AI inference.
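To make this concrete, here is a minimal Deployment sketch combining a rolling-update strategy with readiness and liveness probes. The names, image and the /ready and /healthz endpoints are hypothetical; your model server would need to expose equivalent health endpoints.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replica count
      maxSurge: 1                # roll out one new pod at a time
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
        - name: model-server
          image: example.com/llm-server:v2   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:        # gate traffic until the model is loaded
            httpGet:
              path: /ready       # hypothetical endpoint
              port: 8080
            initialDelaySeconds: 30   # large models can take a while to load
            periodSeconds: 10
          livenessProbe:         # restart the container if it hangs
            httpGet:
              path: /healthz     # hypothetical endpoint
              port: 8080
            periodSeconds: 15
            failureThreshold: 3
```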
Here are the key benefits of K8s fault tolerance:
- Increased resiliency of AI-enabled applications by keeping them highly available and responsive
- Minimal downtime and disruption when things go wrong
- Increased user satisfaction by making applications and models highly available and more resilient to unexpected infrastructure failures
Conclusion
As organizations continue to integrate AI into their applications, use large ML models and face dynamic loads, adopting Kubernetes as a foundational technology is critical. As a managed Kubernetes provider, we’ve seen an increasing demand for scalable, fault-tolerant and cost-effective infrastructure that can handle AI inference at scale. Kubernetes is a tool that natively delivers all of these capabilities.