
How the Kubernetes Scheduler Works

WaltonWang
Published 2017/01/14 23:56

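This article walks through the Kubernetes Scheduler's algorithm and the principles behind it. It focuses on the two steps, predicates (filtering) and priorities (scoring), and on the policies that DefaultProvider enables by default. A follow-up post analyzes the scheduler's source code, covering the implementation details and how to develop a custom policy.

The Scheduler and Its Algorithm

The Kubernetes Scheduler is a component of the Kubernetes master. It is usually deployed on the same node as the API Server and the Controller Manager; together these three make up the master.

In one sentence: the Scheduler takes the pods whose PodSpec.NodeName is empty, one at a time, runs each through two steps, predicates and priorities, and picks the most suitable Node as that pod's destination.

Spelled out, these two steps are the Scheduler's algorithm:

  • Predicates: using the configured predicate policies (by default, the set of default predicate policies defined in DefaultProvider), filter out the Nodes that do not satisfy them. The remaining Nodes are the input to the priorities step.

  • Priorities: using the configured priority policies (by default, the set of default priority policies defined in DefaultProvider), score and rank the Nodes that passed the predicates. The highest-scoring Node is the best fit, and the pod is bound to it.

    If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.

The scheduling algorithm itself is therefore very simple; the real logic lives in the policies. The sections below go through Kubernetes' predicate and priority policies.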
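The two-step flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the scheduler's actual Go implementation: the `schedule` function, the dictionary-based node and pod records, and the `rng` parameter are all hypothetical names introduced for this example.

```python
import random


def schedule(pod, nodes, predicates, priorities, rng=random):
    """Pick a node name for `pod` (hypothetical model of the two steps).

    nodes:      dict mapping node name -> node record
    predicates: list of functions (pod, node) -> bool
    priorities: list of (function (pod, node) -> int, weight) pairs
    """
    # Step 1, predicates: keep only nodes that pass every predicate.
    feasible = {
        name: node
        for name, node in nodes.items()
        if all(pred(pod, node) for pred in predicates)
    }
    if not feasible:
        raise RuntimeError("no node satisfies the predicates")

    # Step 2, priorities: weighted sum of the per-policy scores.
    totals = {
        name: sum(weight * func(pod, node) for func, weight in priorities)
        for name, node in feasible.items()
    }

    # Ties for the highest score are broken at random.
    best = max(totals.values())
    winners = sorted(name for name, total in totals.items() if total == best)
    return rng.choice(winners)
```

With a predicate that checks free CPU and a priority that rewards leftover CPU, a pod requesting 2 cores lands on the node with the most headroom; nodes that cannot fit it never reach the scoring step.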

Predicates and Priorities Policies

Predicates Policies

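Predicate policies are what the Scheduler uses to filter out the Nodes that meet the defined conditions. Nodes are checked concurrently (at most 16 goroutines); for each Node, the scheduler iterates over all configured predicate policies, and a Node that fails any single policy is eliminated immediately.

Note: the number of concurrent goroutines equals the total number of Nodes, capped at 16 and controlled by a queue.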
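As a rough analogy for the bounded concurrency described above, the sketch below uses a Python thread pool in place of goroutines. The `filter_nodes` function and the `MAX_WORKERS` constant are hypothetical names for this example; only the cap of 16 and the "fail one policy, fail the node" rule come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # the cap mentioned in the text


def filter_nodes(pod, nodes, predicates):
    """Return the nodes (in input order) that pass every predicate."""
    if not nodes:
        return []

    def fits(node):
        # all() stops at the first failing predicate, so a node is
        # rejected as soon as one policy is not satisfied.
        return all(pred(pod, node) for pred in predicates)

    # One worker per node, but never more than MAX_WORKERS.
    workers = min(MAX_WORKERS, len(nodes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fits, nodes))
    return [node for node, ok in zip(nodes, results) if ok]
```

The pool's internal work queue plays the role of the queue mentioned in the note: with 40 nodes, at most 16 checks run at once and the rest wait their turn.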

Kubernetes provides the predicate policies listed below. To choose which ones to apply, pass --policy-config-file to kube-scheduler at startup, for example:

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
	{"name" : "PodFitsPorts"},
	{"name" : "PodFitsResources"},
	{"name" : "NoDiskConflict"},
	{"name" : "NoVolumeZoneConflict"},
	{"name" : "MatchNodeSelector"},
	{"name" : "HostName"}
	],
"priorities" : [
	...
	]
}
  1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

  2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

  3. PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal.

  4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.

  5. HostName: Filter out all nodes except the one specified in the PodSpec's NodeName field.

  6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod's nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

  7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume -- see Amazon's documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows -- see GCE's documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort pods should be placed on a node under memory pressure, as they get automatically evicted by kubelet.

  10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure, as they get automatically evicted by kubelet.
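To make the PodFitsResources rule from the list above concrete, here is a minimal sketch of "capacity minus the sum of requests". The function name and the dictionary layout are assumptions made for this example, and only CPU and memory are modelled.

```python
def pod_fits_resources(pod_request, node_capacity, running_requests):
    """Hypothetical check: does the pod's request fit the node's free resources?

    pod_request:      e.g. {"cpu": 1, "memory": 4 * 1024**3}
    node_capacity:    same keys, the node's total capacity
    running_requests: list of request dicts for pods already on the node
    """
    for resource in ("cpu", "memory"):
        # Free = capacity minus the sum of requests of all pods on the node.
        used = sum(r.get(resource, 0) for r in running_requests)
        free = node_capacity[resource] - used
        if pod_request.get(resource, 0) > free:
            return False
    return True
```

Note that the check works on requests, not on measured usage: a node whose pods are idle but have large requests still counts as full.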

DefaultProvider enables the following predicate policies by default:

  1. NoVolumeZoneConflict

  2. MaxEBSVolumeCount

  3. MaxGCEPDVolumeCount

  4. MatchInterPodAffinity

    Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of the affinity data (JSON-serialized) in the annotations of a Pod.

    AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

  5. NoDiskConflict

  6. GeneralPredicates

    • PodFitsResources
      • pod, in number
      • cpu, in cores
      • memory, in bytes
      • alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most 1 GPU.
    • PodFitsHost
    • PodFitsHostPorts
    • PodSelectorMatches
  7. PodToleratesNodeTaints

  8. CheckNodeMemoryPressure

  9. CheckNodeDiskPressure

Priorities Policies

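Nodes that pass predicate filtering move on to the priorities step. Here the scheduler starts one goroutine per priority policy, all running concurrently. Each goroutine runs its policy's implementation over all the filtered Nodes and scores each one from 0 to 10, where 0 is the lowest and 10 the highest. Once every policy's goroutine has finished, each node's per-policy scores are multiplied by the configured policy weights and summed to give the node's final score.

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: the number of concurrent goroutines here equals the number of priority policies, with no queue and no upper bound. In practice, nobody configures more than a dozen or two policies.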
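The weighted sum and the one-worker-per-policy layout can be sketched as follows. Threads stand in for goroutines, and `prioritize` with its argument shapes is a hypothetical name chosen for this example rather than anything from the scheduler's code.

```python
from concurrent.futures import ThreadPoolExecutor


def prioritize(pod, nodes, priorities):
    """Return {node name: final weighted score}.

    nodes:      dict mapping node name -> node record
    priorities: list of (function (pod, node) -> score in 0..10, weight)
    """
    totals = {name: 0 for name in nodes}
    if not priorities:
        return totals

    def run(policy):
        func, weight = policy
        # One policy scores every node that survived the predicates.
        return {name: weight * func(pod, node) for name, node in nodes.items()}

    # One worker per priority policy, with no cap on the worker count.
    with ThreadPoolExecutor(max_workers=len(priorities)) as pool:
        for partial in pool.map(run, priorities):
            for name, score in partial.items():
                totals[name] += score
    return totals
```

For a node that scores 3 under a weight-1 policy and 10 under a weight-2 policy, the final score is 3 * 1 + 10 * 2 = 23, which matches the finalScoreNodeA formula.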

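A thought: if no Node passes the predicates, the scheduler returns a FailedPredicates error and never enters the prioritizing phase, which is reasonable. But if exactly one Node passes, prioritizing still runs and follows the same path as for multiple Nodes. With a single candidate, the priorities step could simply return that Node as the scheduling result instead of running the whole scoring pass.

If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.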

Kubernetes provides the priority policies listed below. To choose which ones to apply, pass --policy-config-file to kube-scheduler at startup, for example:

{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
	...
	],
"priorities" : [
	{"name" : "LeastRequestedPriority", "weight" : 1},
	{"name" : "BalancedResourceAllocation", "weight" : 1},
	{"name" : "ServiceSpreadingPriority", "weight" : 1},
	{"name" : "EqualPriority", "weight" : 1}
	]
}
  • LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
  • BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
  • SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
  • CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
  • ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
  • NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
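The LeastRequestedPriority formula from the list above can be worked through in code. This sketch applies the stated fraction to CPU and memory with equal weight and scales the result to the 0-10 range used for all priority scores; the function name, the dictionary layout, and the absence of integer rounding are assumptions of this example, not a reproduction of the scheduler's code.

```python
def least_requested_score(capacity, requested_on_node, pod_request):
    """Score in 0..10: higher means more of the node stays free."""
    fractions = []
    for resource in ("cpu", "memory"):
        cap = capacity[resource]
        # (capacity - requests already on the node - this pod's request) / capacity
        free = cap - requested_on_node[resource] - pod_request[resource]
        fractions.append(max(free, 0) / cap if cap > 0 else 0.0)
    # CPU and memory are equally weighted, then scaled to 0..10.
    return 10 * sum(fractions) / len(fractions)
```

For example, on a node with 4000 units of CPU and 8000 of memory, where 1000 and 2000 are already requested, a pod asking for 1000 and 2000 leaves half of each resource free and the node scores 5.0.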

DefaultProvider enables the following priority policies by default:

  1. SelectorSpreadPriority, default weight 1

  2. InterPodAffinityPriority, default weight 1

    • Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.

    • AffinityAnnotationKey represents the key of affinity data (json serialized) in the Annotations of a Pod.

      scheduler.alpha.kubernetes.io/affinity="..."

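  3. LeastRequestedPriority, default weight 1

  4. BalancedResourceAllocation, default weight 1

  5. NodePreferAvoidPodsPriority, default weight 10000

    Note: the weight here is deliberately huge (10000). A node with a non-zero score ends up with a very high weighted total, while a node that scores 0 falls far behind the high scorers and is bound to lose. The reasoning:

    If the Node's annotations do not include the key:

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    then the node scores 10 for this policy. Multiplied by the weight of 10000, this policy alone contributes 100,000 points, so the node's final score is at least that.

    If the Node's annotations do include

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    and the pod's controller is a ReplicationController or ReplicaSet, the node scores 0 for this policy, which puts it absurdly far below any node without the annotation. In other words, as long as some candidate Node lacks the annotation, this Node is certain to lose.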

  6. NodeAffinityPriority, default weight 1

  7. TaintTolerationPriority, default weight 1
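The weight arithmetic for NodePreferAvoidPodsPriority in item 5 can be checked with a short calculation. The weights below are the defaults listed above; the two example nodes and the `final_score` helper are hypothetical, chosen to show the most extreme case.

```python
# Default weights as listed above.
DEFAULT_WEIGHTS = {
    "SelectorSpreadPriority": 1,
    "InterPodAffinityPriority": 1,
    "LeastRequestedPriority": 1,
    "BalancedResourceAllocation": 1,
    "NodePreferAvoidPodsPriority": 10000,
    "NodeAffinityPriority": 1,
    "TaintTolerationPriority": 1,
}
AVOID = "NodePreferAvoidPodsPriority"


def final_score(raw_scores):
    """Weighted sum of per-policy scores (each raw score is 0..10)."""
    return sum(DEFAULT_WEIGHTS[name] * raw_scores[name] for name in DEFAULT_WEIGHTS)


# Annotated node: perfect 10 on every other policy, 0 on NodePreferAvoidPods.
annotated = {name: 10 for name in DEFAULT_WEIGHTS}
annotated[AVOID] = 0

# Unannotated node: 0 on every other policy, 10 on NodePreferAvoidPods.
plain = {name: 0 for name in DEFAULT_WEIGHTS}
plain[AVOID] = 10

annotated_total = final_score(annotated)  # 6 policies * 10 * 1 = 60
plain_total = final_score(plain)          # 10 * 10000 = 100000
```

Even in this worst case for the unannotated node, it wins by 100000 to 60, since the other six policies together can contribute at most 60 points.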

##Scheduler algorithm flowchart

(Figure: scheduler algorithm flowchart)

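##Summary

  • The Kubernetes scheduler's job is to place each pod on the most suitable Node.
  • Scheduling takes two steps: predicates (filtering) and priorities (scoring).
  • The default scheduling policy set is DefaultProvider; the policies it includes are listed above.
  • To customize, start kube-scheduler with --policy-config-file pointing at a JSON file that assembles your own predicate and priority policies in the format shown above.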
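As a small illustration of the last point, a policy file in the format shown earlier can be generated rather than written by hand. The `build_policy` helper is a hypothetical convenience for this example; the field names and policy names are the ones used in the JSON samples above.

```python
import json


def build_policy(predicates, priorities):
    """Assemble a Policy document in the format shown in this article.

    predicates: list of predicate policy names
    priorities: list of (priority policy name, weight) pairs
    """
    return {
        "kind": "Policy",
        "apiVersion": "v1",
        "predicates": [{"name": name} for name in predicates],
        "priorities": [{"name": name, "weight": weight} for name, weight in priorities],
    }


policy = build_policy(
    ["PodFitsResources", "NoDiskConflict", "MatchNodeSelector"],
    [("LeastRequestedPriority", 1), ("BalancedResourceAllocation", 1)],
)
policy_text = json.dumps(policy, indent=2)
```

Saving `policy_text` to a file and passing that file's path via --policy-config-file would then select exactly these policies instead of the DefaultProvider set.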

© Copyright belongs to the author
