A Collision Between Service Endpoints and the Container Network

2022-07-15

4 min read

Kubernetes, Service, Endpoint

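Problem description

The service is deployed with three replicas, and each pod contains two containers. Container A performs leader election through a Kubernetes lease resource. Container B serves traffic through a Service and exposes its readiness through a readinessProbe.

However, requests to the service through its cluster IP timed out, while requests sent directly to the backend endpoints succeeded.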

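Investigating the cause

The first step was to narrow down the scope. The problem only occurred on a newly built cluster (k8s v1.17.4), and the only change in the new cluster was the rollout of a new container network (CNI) plugin.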

[Figure: service-example]

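The container service could not be reached through the cluster IP.
The container service could be reached through the endpoint IPs.
The CNI plugin's logs showed that it was working normally: it did receive the Service's endpoints events.
Comparing the different endpoints showed that the unreachable endpoints were the ones tagged for leader election.

The root cause was finally identified: to reduce the overhead caused by the frequent updates of leader-election endpoints, the CNI plugin filtered those endpoints out.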
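A filter of this kind can be sketched roughly as follows. This is not the plugin's actual code: it assumes the plugin recognizes leader-election objects by the `control-plane.alpha.kubernetes.io/leader` annotation, and `endpointsMeta`, `isLeaderElectionLock` and `filterEndpoints` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// leaderAnnotation is the annotation key under which client-go stores the
// leader-election record on the lock object.
const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

// endpointsMeta is a hypothetical, stripped-down stand-in for the metadata
// of an Endpoints object; only the fields needed here are modeled.
type endpointsMeta struct {
	Namespace   string
	Name        string
	Annotations map[string]string
}

// isLeaderElectionLock reports whether the object carries a leader-election record.
func isLeaderElectionLock(ep endpointsMeta) bool {
	_, ok := ep.Annotations[leaderAnnotation]
	return ok
}

// filterEndpoints drops every object that looks like a leader-election lock.
// If a Service's own Endpoints carries the annotation, it is dropped as well,
// which is the failure mode described above.
func filterEndpoints(in []endpointsMeta) []endpointsMeta {
	out := make([]endpointsMeta, 0, len(in))
	for _, ep := range in {
		if isLeaderElectionLock(ep) {
			continue
		}
		out = append(out, ep)
	}
	return out
}

func main() {
	eps := []endpointsMeta{
		{Namespace: "default", Name: "web"},
		{
			Namespace:   "default",
			Name:        "web-leader",
			Annotations: map[string]string{leaderAnnotation: `{"holderIdentity":"pod-0"}`},
		},
	}
	for _, ep := range filterEndpoints(eps) {
		fmt.Println(ep.Namespace + "/" + ep.Name)
	}
}
```

With a filter like this, an Endpoints object that doubles as both the Service backend list and the election lock never reaches the data path, so the cluster IP has no backends even though the pod IPs remain reachable.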

Underlying logic: how Kubernetes implements leader election

The leaderelection package in client-go supports 5 types of resource locks:

const (
    LeaderElectionRecordAnnotationKey = "control-plane.alpha.kubernetes.io/leader"
    EndpointsResourceLock             = "endpoints"
    ConfigMapsResourceLock            = "configmaps"
    LeasesResourceLock                = "leases"
    EndpointsLeasesResourceLock       = "endpointsleases"
    ConfigMapsLeasesResourceLock      = "configmapsleases"
)

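These five resource locks were added gradually over time. Early versions of client-go only supported Endpoints and ConfigMaps as resource locks. **Because Endpoints and ConfigMaps are themselves watched by many components in the cluster, using them as leader-election locks significantly increases the number of events delivered to those watchers,** kube-proxy being one example. This problem has been discussed in the community many times.

To solve this, the community added a Leases resource lock.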

Add Lease implementation to leaderelection package

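Clearly, using a dedicated type as the resource lock is a better design than reusing Endpoints and ConfigMaps. The community discussion likewise recommends using the Lease object instead of Endpoints and ConfigMaps for leader election.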

It was a mitigation, not the fix. The real fix is to switch leader election to be based on Lease object instead of Endpoints or ConfiMap. So basically #80289

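For components that already used Endpoints/ConfigMaps as their leader-election lock, such as the scheduler and kcm, the community also proposed a migration plan that preserves stability: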

Migrate all uses of leader-election to use Lease API

Currently, Kubernetes components (scheduler, kcm, …) are using leader election that is based on either Endpoints or ConfigMap objects. Given that both of these are watched by different components, this is generating a lot of unnecessary load.
We should migrate all leader-election to use Lease API (that was designed exactly for this case).
The tricky part is that in order to do that safely, I think the only reasonable way of doing that would be to:
in the first phase switch components to:
acquire lock on the current object (endpoints or configmap)
acquire lock on the new lease object
only the proceed with its regular functionality [ loosing any of those two, should result in panicing and restarting the component]
in the second phase (release after) remove point 1
@kubernetes/sig-scalability-bugs

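In other words, during the first (transition) phase a component must hold both the Endpoints/ConfigMap lock and the Lease lock (a MultiLock), and losing either one causes the component to restart. When the lock is updated, the Endpoints/ConfigMap is updated first and the Lease second. When checking whether the lock has been lost, if both resources have a holder but the holders differ, an error is returned and leader election starts over.

In the second phase, the Endpoints/ConfigMap lock can be removed, which completes the migration to the Lease lock. To support this plan, the community added EndpointsLeasesResourceLock and ConfigMapsLeasesResourceLock in version 1.17 for acquiring the locks during the transition. Note that an Endpoints object used as a lock has no corresponding Service; it is created by the leader.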

migrate leader election to lease API

Also in version 1.17, the resource lock of controller-manager and scheduler was switched from Endpoints to EndpointsLeases:

Migrate components to EndpointsLeases leader election lock

In version 1.20, the resource lock of controller-manager and scheduler was switched from EndpointsLeases to Leases:

Migrate scheduler, controller-manager and cloud-controller-manager to use LeaseLock

In version 1.24, the community removed support for Endpoints/ConfigMaps as leader-election locks entirely.

Remove support for Endpoints and ConfigMaps lock from leader election

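Summary

In summary, the overall trend in the community is to replace Endpoints/ConfigMaps with Leases as the resource lock. For Kubernetes versions that still use Endpoints/ConfigMaps for leader election, many components have chosen to ignore these leader-election Endpoints/ConfigMaps in order to avoid the event-handling overhead of their frequent updates, for example:

k8s: stop watching for kubernetes management endpoints (cilium)

Therefore, a component that needs leader election can avoid this problem in the following ways:

**Use the Lease object as the leader-election lock, which also matches the direction the community is moving in.** For this component (the native scheduler), configuring the resourceLock type is enough.

**Use a dedicated Endpoints object for leader election, kept separate from the Service Endpoints.** This approach is not recommended and should only be considered when the first one does not apply.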
