Resolving Kubernetes admission webhook timeout errors

2020-06-30

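Recently, while running a cluster chaos test, one of our functional components became unavailable. The cause was that a Kubernetes admission webhook hit connection-refused and timeout errors during execution, which made CR creation fail. In principle this should not happen: the component runs as a Deployment spread across 3 different zones, so a network outage in a single zone should not affect the service as a whole. When a pod can no longer serve traffic (its readiness probe fails), the Kube Service removes its endpoint, and the Deployment controller creates a replacement pod.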

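After digging further, I found that the root cause is an http/2 issue in the golang net package. Given the dependency chain (Kubernetes->client-go->golibs), we cannot modify the code directly, so the only option is a workaround.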

Problem description

Error Injection

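The error injection performed by the chaos test is: disable networking on a batch of Kube nodes (all in one zone) for 20 minutes, while continuously generating admission webhook workloads (creating the corresponding CRs). Our webhook has 3 replicas spread across 3 different zones. Once the injection starts, one of the webhook pods is affected. Kube cannot determine that pod's state, so it gives up on it and creates a new pod to satisfy the replica count. We saw that the Service did promptly remove this pod's endpoint (172.30.82.17:8443) and that a new pod was created on a healthy node. On the LogDNA dashboard we also observed that nearly all requests successfully reached the pods.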

[Figure: LogDNA dashboard]

However, some of the workloads failed partway through. The error messages show that the admission mutating webhook timed out while responding:

Error from server (InternalError): error when creating "/tmp/mycr930288778.yml": Internal error occurred: failed calling webhook "mutate-mycrd.apigroup.com": Post https://mycrd-webhook.control-plane.svc:443/mutate-jobrun?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error from server (InternalError): error when creating "/tmp/mycr930288778.yml": Internal error occurred: failed calling webhook "mutate-mycrd.coligo.cloud.ibm.com": Post https://mycrd-webhook.control-plane.svc:443/mutate-jobrun?timeout=30s: context deadline exceeded

Possible causes of the timeout:

The webhook actively dropped the request
The API Server could not connect to the webhook endpoint

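As for the first cause, we checked the logs and found nothing indicating that the webhook actively dropped an admission request. Because the company's cluster does not give us access to the master nodes, we could only infer the cause of the request timeout from the information we already had.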

Findings

After more tracing and comparison, I found that Kubernetes has a similar issue: https://github.com/kubernetes/kubernetes/issues/80313

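In summary:

This is a golang http/2 issue: the net package has no mechanism for closing dead connections: https://github.com/kubernetes/client-go/issues/374
Kubernetes' client-go has the same problem, so it affects most Kubernetes components
A webhook that serves over the default http/2 is subject to this issue
It is triggered when a pod is deleted, either by a webhook deployment upgrade or because of a node problem
golang has accepted the corresponding fix, and client-go will pick it up as soon as possible. Kubernetes also needs to bump to the corresponding client-go.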

Waiting for that could take a long time, but in the meantime we can temporarily disable http/2 with a single command:

kubectl set env -n control-plane deployment/custom-webhook GODEBUG=http2server=0

This command uses an environment variable to disable HTTP/2 support on the server side. It comes from the official documentation:

Starting with Go 1.6, the http package has transparent support for the HTTP/2 protocol when using HTTPS. Programs that must disable HTTP/2 can do so by setting Transport.TLSNextProto (for clients) or Server.TLSNextProto (for servers) to a non-nil, empty map. Alternatively, the following GODEBUG environment variables are currently supported:

GODEBUG=http2client=0  # disable HTTP/2 client support
GODEBUG=http2server=0  # disable HTTP/2 server support
GODEBUG=http2debug=1   # enable verbose HTTP/2 debug logs
GODEBUG=http2debug=2   # … even more verbose, with frame dumps

https://golang.org/pkg/net/http/

Verification

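Update the deployment with the workaround: kubectl set env -n control-plane deployment/custom-webhook GODEBUG=http2server=0
Trigger a rolling upgrade: kubectl rollout restart -n control-plane deployment/custom-webhook
At the same time, push CR-creation workloads to verify the fix, and observe both how they execute and how the webhook receives requests.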

Results

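After testing, the errors described above no longer appeared, and the CRs were created successfully. Observing the webhook pods during the run also showed that no requests were dropped.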
