k8s node fails to join the master: "cannot construct envvars"
Problem description
Symptoms
root@master-node1:~# kubectl get pod -A -owide | grep Error
kube-flannel kube-flannel-ds-cbblv 0/1 Init:CreateContainerConfigError 0 114m 192.168.31.32 edge-node1 <none> <none>
kube-system kube-proxy-zvgz4 0/1 CreateContainerConfigError 0 114m 192.168.31.32 edge-node1 <none> <none>
kube-system raven-agent-ds-xd8j5 0/1 CreateContainerConfigError 0 114m 192.168.31.32 edge-node1 <none> <none>
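CreateContainerConfigError by itself doesn't say much, so the next step is to watch the kubelet logs on the affected node (assuming a systemd-managed kubelet):
journalctl -u kubelet -f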
Error log: the kubelet keeps printing the following
Sep 26 16:56:49 edge-node1 kubelet[16009]: E0926 16:56:49.340163 16009 kuberuntime_manager.go:1449] "Unhandled Error" err="container raven-agent start failed in pod raven-agent-ds-xd8j5_kube-system(15a05897-dcef-47c5-8bbf-a2307c1ef6ef): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" logger="UnhandledError"
Sep 26 16:56:49 edge-node1 kubelet[16009]: E0926 16:56:49.340200 16009 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"raven-agent\" with CreateContainerConfigError: \"services have not yet been read at least once, cannot construct envvars\"" pod="kube-system/raven-agent-ds-xd8j5" podUID="15a05897-dcef-47c5-8bbf-a2307c1ef6ef"
Sep 26 16:56:52 edge-node1 kubelet[16009]: E0926 16:56:52.340762 16009 kuberuntime_manager.go:1449] "Unhandled Error" err="container kube-proxy start failed in pod kube-proxy-zvgz4_kube-system(1f939284-468a-4724-a8db-172260d836dc): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" logger="UnhandledError"
Sep 26 16:56:52 edge-node1 kubelet[16009]: E0926 16:56:52.340798 16009 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CreateContainerConfigError: \"services have not yet been read at least once, cannot construct envvars\"" pod="kube-system/kube-proxy-zvgz4" podUID="1f939284-468a-4724-a8db-172260d836dc"
Sep 26 16:56:54 edge-node1 kubelet[16009]: E0926 16:56:54.341508 16009 kuberuntime_manager.go:1449] "Unhandled Error" err="init container install-cni-plugin start failed in pod kube-flannel-ds-cbblv_kube-flannel(38bad478-9d07-466c-8476-8f71ebf96313): CreateContainerConfigError: services have not yet been read at least once, cannot construct envvars" logger="UnhandledError"
Sep 26 16:56:54 edge-node1 kubelet[16009]: E0926 16:56:54.341537 16009 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"install-cni-plugin\" with CreateContainerConfigError: \"services have not yet been read at least once, cannot construct envvars\"" pod="kube-flannel/kube-flannel-ds-cbblv" podUID="38bad478-9d07-466c-8476-8f71ebf96313"
Troubleshooting
Verify the apiserver
root@edge-node1:/etc/containerd# telnet 192.168.31.61 6443
Trying 192.168.31.61...
Connected to 192.168.31.61.
Escape character is '^]'.
The master's apiserver is listening on the port, so nothing wrong there.
root@edge-node1:/etc/containerd# curl -k https://192.168.31.61:6443/healthz
ok
Requests also go through fine, which rules out a communication problem with the apiserver.
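A related check worth doing, since the error complains that services have not been read at least once: can the node's kubelet credentials list Services at all? A quick sketch, assuming the kubeadm default kubelet kubeconfig path:
# run on the node; /etc/kubernetes/kubelet.conf is the kubeadm default and may differ in your setup
kubectl --kubeconfig /etc/kubernetes/kubelet.conf get services -A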
Check the source code
// If the pod originates from the kube-api, when we know that the kube-apiserver is responding and the kubelet's credentials are valid.
// Knowing this, it is reasonable to wait until the service lister has synchronized at least once before attempting to build
// a service env var map. This doesn't present the race below from happening entirely, but it does prevent the "obvious"
// failure case of services simply not having completed a list operation that can reasonably be expected to succeed.
// One common case this prevents is a kubelet restart reading pods before services and some pod not having the
// KUBERNETES_SERVICE_HOST injected because we didn't wait a short time for services to sync before proceeding.
// The KUBERNETES_SERVICE_HOST link is special because it is unconditionally injected into pods and is read by the
// in-cluster-config for pod clients
if !kubetypes.IsStaticPod(pod) && !kl.serviceHasSynced() {
	return nil, fmt.Errorf("services have not yet been read at least once, cannot construct envvars")
}
Reading the source shows that this error is returned only when both of the following conditions hold:
- The pod is not a static pod
- The kubelet's service lister has not yet synced (at least once)
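Condition 1 is easy to rule out: these are all DaemonSet pods created through the API server, not static pods. One way to confirm this, for example:
kubectl -n kube-system get pod kube-proxy-zvgz4 -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
# prints DaemonSet -> an API-server-managed pod, not a static pod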
With condition 1 ruled out, we need to work on condition 2 to avoid hitting this error.
First attempt: delay kubelet startup so the service lister has time to sync before pods are processed. Add the following to the kubelet systemd unit, then restart kubelet:
ExecStartPre=/bin/sleep 30
systemctl daemon-reload
systemctl restart kubelet
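For reference, a minimal sketch of doing this with a systemd drop-in (the drop-in path and file name are just examples, not from my actual setup):
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' > /etc/systemd/system/kubelet.service.d/10-wait-before-start.conf
[Service]
ExecStartPre=/bin/sleep 30
EOF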
That didn't work, so on to another option:
--sync-frequency=30s # shorten the kubelet's config sync interval
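In config-file form, the equivalent KubeletConfiguration field is syncFrequency; a sketch assuming the kubeadm default config path:
# append only if the field is not already set; /var/lib/kubelet/config.yaml is the kubeadm default
grep -q '^syncFrequency:' /var/lib/kubelet/config.yaml || echo 'syncFrequency: 30s' >> /var/lib/kubelet/config.yaml
systemctl restart kubelet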
Adding this to the kubelet config and restarting still didn't solve it. At this point it started to feel like the problem might not be in the kubelet at all.
Solution
In the end, after being stuck on this for an entire afternoon, I found a similar report in the flannel issues. Roughly, the issue describes hitting the same error after upgrading k8s from v1.30.3 to v1.31.0, caused by an incompatibility between the k8s version and flannel.
- https://github.com/flannel-io/flannel/issues/2031
Since my cluster is v1.34.0 and also uses flannel, I figured I had nothing to lose: I reset the cluster, installed an older version, and sure enough the problem was gone.
root@master-node1:~# kubectl get pod -A -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-flannel kube-flannel-ds-qg7pp 1/1 Running 0 103m 192.168.31.32 edge-node1 <none> <none>
kube-flannel kube-flannel-ds-t95vl 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system coredns-7db6d8ff4d-dppqf 1/1 Running 0 15h 10.244.0.5 master-node1 <none> <none>
kube-system coredns-7db6d8ff4d-gslxm 1/1 Running 0 15h 10.244.0.6 master-node1 <none> <none>
kube-system etcd-master-node1 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system kube-apiserver-master-node1 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system kube-controller-manager-master-node1 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system kube-proxy-6fltr 1/1 Running 0 103m 192.168.31.32 edge-node1 <none> <none>
kube-system kube-proxy-kkc2j 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system kube-scheduler-master-node1 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system raven-agent-ds-5df4d 1/1 Running 0 103m 192.168.31.32 edge-node1 <none> <none>
kube-system raven-agent-ds-6xv5h 1/1 Running 0 15h 192.168.31.61 master-node1 <none> <none>
kube-system yurt-hub-edge-node1 1/1 Running 1 102m 192.168.31.32 edge-node1 <none> <none>
kube-system yurt-manager-96c7d9484-5t5g5 1/1 Running 0 15h 10.244.0.7 master-node1 <none> <none>
Conclusion
- k8s v1.31.0 API change: the newer versions changed how service selectors and environment variables are constructed
- The ghcr.io/flannel-io/flannel-cni-plugin:v1.7.1-flannel1 plugin has compatibility problems with these newer versions
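To check whether your own cluster matches this combination, look at the versions in use (the namespace and DaemonSet names below follow the default flannel manifest and may differ in your setup):
kubectl version
kubectl -n kube-flannel get ds kube-flannel-ds -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}{.spec.template.spec.containers[*].image}{"\n"}'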
Fix
- Downgrade k8s
or
- Use a different flannel version, one that at least supports the newer k8s features (see the sketch below)
Pick one of the two.
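If you go with the second option, the rough shape of it is just redeploying flannel from a release that supports your k8s version (a sketch; pin a concrete release you have verified rather than latest):
# flannel's documented install manifest; substitute a specific release tag for "latest"
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml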