Kubernetes Node Controller Source Code Analysis: Creation

WaltonWang
Published 2017/07/29 22:44

Author: xidianwangtao@gmail.com

NewNodeController entry point

When the Controller Manager starts, it launches a series of controllers. The Node Controller is one of the controllers launched in the Controller Manager's StartControllers method; the code that creates it is shown below.

cmd/kube-controller-manager/app/controllermanager.go:455

nodeController, err := nodecontroller.NewNodeController(
			sharedInformers.Core().V1().Pods(),
			sharedInformers.Core().V1().Nodes(),
			sharedInformers.Extensions().V1beta1().DaemonSets(),
			cloud,
			clientBuilder.ClientOrDie("node-controller"),
			s.PodEvictionTimeout.Duration,
			s.NodeEvictionRate,
			s.SecondaryNodeEvictionRate,
			s.LargeClusterSizeThreshold,
			s.UnhealthyZoneThreshold,
			s.NodeMonitorGracePeriod.Duration,
			s.NodeStartupGracePeriod.Duration,
			s.NodeMonitorPeriod.Duration,
			clusterCIDR,
			serviceCIDR,
			int(s.NodeCIDRMaskSize),
			s.AllocateNodeCIDRs,
			s.EnableTaintManager,
			utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
		)

As shown, the Node Controller mainly list-watches the following objects from sharedInformers:

  • Pods
  • Nodes
  • DaemonSets

Also note:

  • s.EnableTaintManager defaults to true, meaning the Taint Manager is enabled by default; it can be set with --enable-taint-manager.
  • DefaultFeatureGate.Enabled(features.TaintBasedEvictions) defaults to false; adding TaintBasedEvictions=true to --feature-gates turns it on. When it is true, pod eviction on a Node is carried out through the TaintManager.

Supplement: the default Kubernetes feature gate settings are defined in the following code:

pkg/features/kube_features.go:100

var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
	ExternalTrafficLocalOnly:                    {Default: true, PreRelease: utilfeature.Beta},
	AppArmor:                                    {Default: true, PreRelease: utilfeature.Beta},
	DynamicKubeletConfig:                        {Default: false, PreRelease: utilfeature.Alpha},
	DynamicVolumeProvisioning:                   {Default: true, PreRelease: utilfeature.Alpha},
	ExperimentalHostUserNamespaceDefaultingGate: {Default: false, PreRelease: utilfeature.Beta},
	ExperimentalCriticalPodAnnotation:           {Default: false, PreRelease: utilfeature.Alpha},
	AffinityInAnnotations:                       {Default: false, PreRelease: utilfeature.Alpha},
	Accelerators:                                {Default: false, PreRelease: utilfeature.Alpha},
	TaintBasedEvictions:                         {Default: false, PreRelease: utilfeature.Alpha},

	// inherited features from generic apiserver, relisted here to get a conflict if it is changed
	// unintentionally on either side:
	StreamingProxyRedirects: {Default: true, PreRelease: utilfeature.Beta},
}
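
To make the interaction between the two settings concrete, here is a minimal, self-contained sketch. It is not the real utilfeature package: parseFeatureGates and effectiveTaintBasedEvictions are hypothetical helpers, and the defaults map copies only the TaintBasedEvictions entry from the map above. The final rule (gate && taint manager) mirrors the useTaintBasedEvictions && runTaintManager expression in the NewNodeController constructor shown later in this article.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaults copies only the TaintBasedEvictions entry from the map above.
var defaults = map[string]bool{
	"TaintBasedEvictions": false,
}

// parseFeatureGates (hypothetical helper) applies a "--feature-gates"-style
// value such as "TaintBasedEvictions=true" on top of the defaults.
func parseFeatureGates(value string) (map[string]bool, error) {
	gates := make(map[string]bool, len(defaults))
	for name, enabled := range defaults {
		gates[name] = enabled
	}
	for _, pair := range strings.Split(value, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("missing bool value for %q", pair)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(kv[1]))
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %v", kv[0], err)
		}
		gates[strings.TrimSpace(kv[0])] = enabled
	}
	return gates, nil
}

// effectiveTaintBasedEvictions (hypothetical helper) combines the gate with
// the taint manager switch: taint-based eviction needs both to be on.
func effectiveTaintBasedEvictions(gates map[string]bool, enableTaintManager bool) bool {
	return gates["TaintBasedEvictions"] && enableTaintManager
}

func main() {
	gates, err := parseFeatureGates("TaintBasedEvictions=true")
	if err != nil {
		panic(err)
	}
	fmt.Println("taint manager on: ", effectiveTaintBasedEvictions(gates, true))
	fmt.Println("taint manager off:", effectiveTaintBasedEvictions(gates, false))
}
```

In other words, turning the feature gate on has no effect if the Taint Manager itself has been disabled.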

NewNodeController definition


func NewNodeController(
	podInformer coreinformers.PodInformer,
	nodeInformer coreinformers.NodeInformer,
	daemonSetInformer extensionsinformers.DaemonSetInformer,
	cloud cloudprovider.Interface,
	kubeClient clientset.Interface,
	podEvictionTimeout time.Duration,
	evictionLimiterQPS float32,
	secondaryEvictionLimiterQPS float32,
	largeClusterThreshold int32,
	unhealthyZoneThreshold float32,
	nodeMonitorGracePeriod time.Duration,
	nodeStartupGracePeriod time.Duration,
	nodeMonitorPeriod time.Duration,
	clusterCIDR *net.IPNet,
	serviceCIDR *net.IPNet,
	nodeCIDRMaskSize int,
	allocateNodeCIDRs bool,
	runTaintManager bool,
	useTaintBasedEvictions bool) (*NodeController, error) {
		
	...
	
	nc := &NodeController{
		cloud:                           cloud,
		knownNodeSet:                    make(map[string]*v1.Node),
		kubeClient:                      kubeClient,
		recorder:                        recorder,
		podEvictionTimeout:              podEvictionTimeout,
		maximumGracePeriod:              5 * time.Minute,    // not configurable; "The maximum duration before a pod evicted from a node can be forcefully terminated"
		zonePodEvictor:                  make(map[string]*RateLimitedTimedQueue),
		zoneNotReadyOrUnreachableTainer: make(map[string]*RateLimitedTimedQueue),
		nodeStatusMap:                   make(map[string]nodeStatusData),
		nodeMonitorGracePeriod:          nodeMonitorGracePeriod,
		nodeMonitorPeriod:               nodeMonitorPeriod,
		nodeStartupGracePeriod:          nodeStartupGracePeriod,
		lookupIP:                        net.LookupIP,
		now:                             metav1.Now,
		clusterCIDR:                     clusterCIDR,
		serviceCIDR:                     serviceCIDR,
		allocateNodeCIDRs:               allocateNodeCIDRs,
		forcefullyDeletePod:             func(p *v1.Pod) error { return forcefullyDeletePod(kubeClient, p) },
		nodeExistsInCloudProvider:       func(nodeName types.NodeName) (bool, error) { return nodeExistsInCloudProvider(cloud, nodeName) },
		evictionLimiterQPS:              evictionLimiterQPS,
		secondaryEvictionLimiterQPS:     secondaryEvictionLimiterQPS,
		largeClusterThreshold:           largeClusterThreshold,
		unhealthyZoneThreshold:          unhealthyZoneThreshold,
		zoneStates:                      make(map[string]zoneState),
		runTaintManager:                 runTaintManager,
		useTaintBasedEvictions:          useTaintBasedEvictions && runTaintManager,
	}
	
	...
	
	// Register ReducedQPSFunc as enterPartialDisruptionFunc; when the zone state is "PartialDisruption", ReducedQPSFunc is invoked to setLimiterInZone.
	nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
	
	// Register HealthyQPSFunc as enterFullDisruptionFunc; when the zone state is "FullDisruption", HealthyQPSFunc is invoked to setLimiterInZone.
	nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
	
	// Register ComputeZoneState as computeZoneStateFunc; during handleDisruption, ComputeZoneState is invoked to compute the number of not-ready nodes in a zone and the zone state.
	nc.computeZoneStateFunc = nc.ComputeZoneState
	
	
	// Register the PodInformer event handlers: Add, Update, Delete.
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{

		// For Pod Add and Update events, the kubelet version on the Node is checked; if it is lower than 1.1.0, forcefullyDeletePod calls the apiserver directly to delete the Pod object from etcd.
		// For Pod Add, Update, and Delete events, if the TaintManager is running, the Tolerations of the old and new Pod are compared; if they differ, the change is added to the NoExecuteTaintManager's podUpdateQueue for the Taint Controller to handle. For Add events the old Pod is nil; for Delete events the new Pod is nil.
		AddFunc: func(obj interface{}) {
			nc.maybeDeleteTerminatingPod(obj)
			pod := obj.(*v1.Pod)
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(nil, pod)
			}
		},
		UpdateFunc: func(prev, obj interface{}) {
			nc.maybeDeleteTerminatingPod(obj)
			prevPod := prev.(*v1.Pod)
			newPod := obj.(*v1.Pod)
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(prevPod, newPod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			pod, isPod := obj.(*v1.Pod)
			// We can get DeletedFinalStateUnknown instead of *v1.Node here and we need to handle that correctly. #34692
			if !isPod {
				deletedState, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					glog.Errorf("Received unexpected object: %v", obj)
					return
				}
				pod, ok = deletedState.Obj.(*v1.Pod)
				if !ok {
					glog.Errorf("DeletedFinalStateUnknown contained non-Node object: %v", deletedState.Obj)
					return
				}
			}
			if nc.taintManager != nil {
				nc.taintManager.PodUpdated(pod, nil)
			}
		},
	})
	
	// returns true if the shared informer's store has synced.
	nc.podInformerSynced = podInformer.Informer().HasSynced
	
	
	// Register the NodeInformer event handlers: Add, Update, Delete.
	nodeEventHandlerFuncs := cache.ResourceEventHandlerFuncs{}
	if nc.allocateNodeCIDRs {
		// --allocate-node-cidrs: "Should CIDRs for Pods be allocated and set on the cloud provider."
		...
	} else {
		nodeEventHandlerFuncs = cache.ResourceEventHandlerFuncs{
		
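			// For Node Add, Update, and Delete events, if the TaintManager is running, the Taints of the old and new Node are compared; if they differ, the change is added to the NoExecuteTaintManager's nodeUpdateQueue for the Taint Controller to handle. For Add events the old Node is nil; for Delete events the new Node is nil.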
			AddFunc: func(originalObj interface{}) {
				obj, err := api.Scheme.DeepCopy(originalObj)
				if err != nil {
					utilruntime.HandleError(err)
					return
				}
				node := obj.(*v1.Node)
				if nc.taintManager != nil {
					nc.taintManager.NodeUpdated(nil, node)
				}
			},
			UpdateFunc: func(oldNode, newNode interface{}) {
				node := newNode.(*v1.Node)
				prevNode := oldNode.(*v1.Node)
				if nc.taintManager != nil {
					nc.taintManager.NodeUpdated(prevNode, node)

				}
			},
			DeleteFunc: func(originalObj interface{}) {
				obj, err := api.Scheme.DeepCopy(originalObj)
				if err != nil {
					utilruntime.HandleError(err)
					return
				}

				node, isNode := obj.(*v1.Node)
				// We can get DeletedFinalStateUnknown instead of *v1.Node here and we need to handle that correctly. #34692
				if !isNode {
					deletedState, ok := obj.(cache.DeletedFinalStateUnknown)
					if !ok {
						glog.Errorf("Received unexpected object: %v", obj)
						return
					}
					node, ok = deletedState.Obj.(*v1.Node)
					if !ok {
						glog.Errorf("DeletedFinalStateUnknown contained non-Node object: %v", deletedState.Obj)
						return
					}
				}
				if nc.taintManager != nil {
					nc.taintManager.NodeUpdated(node, nil)
				}
			},
		}
	}
	
	// Register a NoExecuteTaintManager as taintManager.
	if nc.runTaintManager {
		nc.taintManager = NewNoExecuteTaintManager(kubeClient)
	}
	nodeInformer.Informer().AddEventHandler(nodeEventHandlerFuncs)
	nc.nodeLister = nodeInformer.Lister()
	
	// returns true if the shared informer's nodeStore has synced.
	nc.nodeInformerSynced = nodeInformer.Informer().HasSynced
	
	// returns true if the shared informer's daemonSetStore has synced.
	nc.daemonSetStore = daemonSetInformer.Lister()
	nc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced

	return nc, nil
}
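
Both DeleteFunc handlers above have to cope with the informer delivering a cache.DeletedFinalStateUnknown tombstone instead of the object itself. The sketch below isolates that unwrapping step. It is a simplified illustration, not the controller's code: Pod is a stand-in for *v1.Pod, the local DeletedFinalStateUnknown struct only mirrors the Key/Obj shape of the client-go type, and podFromDeleteEvent is a hypothetical helper name.

```go
package main

import "fmt"

// Pod is a stand-in for *v1.Pod.
type Pod struct{ Name string }

// DeletedFinalStateUnknown mirrors the shape of the client-go cache type:
// the key of the deleted object plus its last known state.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

// podFromDeleteEvent (hypothetical helper) returns the deleted Pod whether the
// informer delivered the Pod itself or a tombstone wrapping it.
func podFromDeleteEvent(obj interface{}) (*Pod, error) {
	if pod, ok := obj.(*Pod); ok {
		return pod, nil
	}
	tombstone, ok := obj.(DeletedFinalStateUnknown)
	if !ok {
		return nil, fmt.Errorf("received unexpected object: %v", obj)
	}
	pod, ok := tombstone.Obj.(*Pod)
	if !ok {
		return nil, fmt.Errorf("tombstone contained non-Pod object: %v", tombstone.Obj)
	}
	return pod, nil
}

func main() {
	events := []interface{}{
		&Pod{Name: "direct"},
		DeletedFinalStateUnknown{Key: "default/wrapped", Obj: &Pod{Name: "wrapped"}},
		"not a pod",
	}
	for _, ev := range events {
		pod, err := podFromDeleteEvent(ev)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("deleted pod:", pod.Name)
	}
}
```

Pulling the type switch into one helper keeps the handler body down to a single call followed by the taintManager notification.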
	

In summary, creating the NodeController instance mainly does the following:

  • maximumGracePeriod: "The maximum duration before a pod evicted from a node can be forcefully terminated." It is not configurable; the code hard-codes it to 5 minutes.
  • Registers ReducedQPSFunc as enterPartialDisruptionFunc; when the zone state is "PartialDisruption", ReducedQPSFunc is invoked to setLimiterInZone.
  • Registers HealthyQPSFunc as enterFullDisruptionFunc; when the zone state is "FullDisruption", HealthyQPSFunc is invoked to setLimiterInZone.
  • Registers ComputeZoneState as computeZoneStateFunc; during handleDisruption, ComputeZoneState is invoked to compute the number of not-ready nodes in a zone and the zone state.
  • Registers the PodInformer event handlers: Add, Update, Delete.
    • For Pod Add and Update events, the kubelet version on the Node is checked; if it is lower than 1.1.0, forcefullyDeletePod calls the apiserver directly to delete the Pod object from etcd.
    • For Pod Add, Update, and Delete events, if the TaintManager is running, the Tolerations of the old and new Pod are compared; if they differ, the change is added to the NoExecuteTaintManager's podUpdateQueue for the Taint Controller to handle. For Delete events the new Pod is nil.
  • Registers podInformerSynced, used to check whether the shared informer's pod store has synced.
  • Registers the NodeInformer event handlers: Add, Update, Delete.
    • For Node Add, Update, and Delete events, if the TaintManager is running, the Taints of the old and new Node are compared; if they differ, the change is added to the NoExecuteTaintManager's nodeUpdateQueue for the Taint Controller to handle. For Delete events the new Node is nil.
  • Registers a NoExecuteTaintManager as taintManager.
  • Registers nodeInformerSynced, used to check whether the shared informer's node store has synced.
  • Registers daemonSetInformerSynced, used to check whether the shared informer's DaemonSet store has synced.
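
The "compare Tolerations, enqueue only on change" behaviour described in the list can be sketched as follows. This is a toy model built only from the description above, not the NoExecuteTaintManager implementation: the Toleration, Pod, podUpdate, and taintManager types are simplified stand-ins, the queue is a plain slice, and the real comparison may take more fields into account than this sketch does.

```go
package main

import (
	"fmt"
	"reflect"
)

// Toleration and Pod are simplified stand-ins for the v1 API types.
type Toleration struct {
	Key    string
	Effect string
}

type Pod struct {
	Name        string
	Tolerations []Toleration
}

// podUpdate is a hypothetical queue item holding both sides of the change.
type podUpdate struct {
	oldPod *Pod
	newPod *Pod
}

// taintManager is a toy stand-in for NoExecuteTaintManager.
type taintManager struct {
	podUpdateQueue []podUpdate
}

// PodUpdated enqueues the change unless both pods exist and carry identical
// tolerations. oldPod is nil on Add, newPod is nil on Delete.
func (tm *taintManager) PodUpdated(oldPod, newPod *Pod) {
	if oldPod != nil && newPod != nil &&
		reflect.DeepEqual(oldPod.Tolerations, newPod.Tolerations) {
		return
	}
	tm.podUpdateQueue = append(tm.podUpdateQueue, podUpdate{oldPod, newPod})
}

func main() {
	tm := &taintManager{}
	tolerations := []Toleration{{Key: "example-key", Effect: "NoExecute"}}
	first := &Pod{Name: "web", Tolerations: tolerations}
	same := &Pod{Name: "web", Tolerations: tolerations}
	changed := &Pod{Name: "web"} // tolerations removed

	tm.PodUpdated(nil, first)    // Add: enqueued
	tm.PodUpdated(first, same)   // Update with identical tolerations: skipped
	tm.PodUpdated(same, changed) // Update with different tolerations: enqueued
	tm.PodUpdated(changed, nil)  // Delete: enqueued
	fmt.Println("queued updates:", len(tm.podUpdateQueue))
}
```

Add and Delete events always reach the queue in this sketch because one side of the comparison is nil; only a no-op Update is filtered out.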

About ZoneState

ZoneState was mentioned above. The following code shows how a zone state is derived:

pkg/api/v1/types.go:3277

const (
	// NodeReady means kubelet is healthy and ready to accept pods.
	NodeReady NodeConditionType = "Ready"
	// NodeOutOfDisk means the kubelet will not accept new pods due to insufficient free disk
	// space on the node.
	NodeOutOfDisk NodeConditionType = "OutOfDisk"
	// NodeMemoryPressure means the kubelet is under pressure due to insufficient available memory.
	NodeMemoryPressure NodeConditionType = "MemoryPressure"
	// NodeDiskPressure means the kubelet is under pressure due to insufficient available disk.
	NodeDiskPressure NodeConditionType = "DiskPressure"
	// NodeNetworkUnavailable means that network for the node is not correctly configured.
	NodeNetworkUnavailable NodeConditionType = "NetworkUnavailable"
	// NodeInodePressure means the kubelet is under pressure due to insufficient available inodes.
	NodeInodePressure NodeConditionType = "InodePressure"
)



pkg/controller/node/nodecontroller.go:1149

// This function is expected to get a slice of NodeReadyConditions for all Nodes in a given zone.
// The zone is considered:
// - fullyDisrupted if there're no Ready Nodes,
// - partiallyDisrupted if at least than nc.unhealthyZoneThreshold percent of Nodes are not Ready,
// - normal otherwise
func (nc *NodeController) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, zoneState) {
	readyNodes := 0
	notReadyNodes := 0
	for i := range nodeReadyConditions {
		if nodeReadyConditions[i] != nil && nodeReadyConditions[i].Status == v1.ConditionTrue {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}

There are three zone states:

  • FullDisruption: the number of Ready Nodes is 0 and the number of NotReady Nodes is greater than 0.
  • PartialDisruption: the number of NotReady Nodes is greater than 2 and notReadyNodes/(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold, where nc.unhealthyZoneThreshold is set with --unhealthy-zone-threshold and defaults to 0.55.
  • Normal: every case other than the two above.
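
A few worked inputs make the three rules easier to check. The sketch below restates the switch from ComputeZoneState as a standalone function that takes node counts directly; the function name classifyZone, the count-based signature, and the sample zones are assumptions made for illustration, while the conditions themselves are copied from the code above and the threshold uses the 0.55 default.

```go
package main

import "fmt"

type zoneState string

const (
	stateNormal            zoneState = "Normal"
	stateFullDisruption    zoneState = "FullDisruption"
	statePartialDisruption zoneState = "PartialDisruption"
)

// classifyZone (hypothetical name) applies the same conditions as
// ComputeZoneState, but on precomputed ready / not-ready counts.
func classifyZone(ready, notReady int, unhealthyZoneThreshold float32) zoneState {
	switch {
	case ready == 0 && notReady > 0:
		return stateFullDisruption
	case notReady > 2 && float32(notReady)/float32(notReady+ready) >= unhealthyZoneThreshold:
		return statePartialDisruption
	default:
		return stateNormal
	}
}

func main() {
	const threshold = 0.55 // default of --unhealthy-zone-threshold
	samples := []struct{ ready, notReady int }{
		{0, 3}, // no Ready node at all
		{4, 6}, // 6/10 = 0.6 not ready, above the threshold
		{1, 2}, // 2/3 not ready, but only 2 NotReady nodes
		{5, 5}, // 5/10 = 0.5, below the threshold
		{0, 0}, // empty zone
	}
	for _, s := range samples {
		fmt.Printf("ready=%d notReady=%d -> %s\n",
			s.ready, s.notReady, classifyZone(s.ready, s.notReady, threshold))
	}
}
```

Note the third sample: a small zone with one Ready node and two NotReady nodes stays Normal even though two thirds of it is unhealthy, because the PartialDisruption rule requires more than two NotReady nodes.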
