K8S Second Installation
Table of Contents
- Kubernetes Cluster Initialization Troubleshooting Summary
- Overview
- Problems Encountered and Solutions
- 1. Incorrect kubelet cgroup driver configuration
- 2. CoreDNS Pods stuck in Pending
- 3. Node taint blocking CoreDNS scheduling
- Final Verification Results
- Lessons Learned
Kubernetes Cluster Initialization Troubleshooting Summary
Overview
This document summarizes the main problems encountered while initializing a Kubernetes cluster and how each was resolved. Through systematic diagnosis and fixes, a stable Kubernetes control plane was eventually brought up.
Problems Encountered and Solutions
1. Incorrect kubelet cgroup driver configuration
Problem:
After cluster initialization, the kubelet service had a cgroup driver misconfiguration: the logs showed cgroupDriver set to "systemd" in the kubelet configuration, while this environment actually requires "cgroupfs".
Diagnosis:
- Inspected the /var/lib/kubelet/config.yaml configuration file
- Confirmed that the cgroupDriver field was set to "systemd"
- Checked the kubelet service logs to confirm the configuration problem
Solution:
- Used sed to change cgroupDriver: systemd to cgroupDriver: cgroupfs in the configuration file (see the sketch below)
- Restarted the kubelet service to apply the change
- Verified the change via the logs, which now report CgroupDriver set to "cgroupfs"
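A minimal sketch of that fix, assuming the default kubeadm config path; the exact sed expression is an assumption, not taken verbatim from the original session:

```bash
# Flip the kubelet cgroup driver to cgroupfs and restart (hedged sketch)
sudo sed -i 's/cgroupDriver: systemd/cgroupDriver: cgroupfs/' /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
# Recent kubelet logs should now report the cgroupfs driver
sudo journalctl -u kubelet -n 20 --no-pager | grep -i cgroupdriver
```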
2. CoreDNS Pods stuck in Pending
Problem:
After cluster initialization, the CoreDNS Pods stayed in the Pending state and never started.
Diagnosis:
- Checked Pod status and found CoreDNS stuck in Pending
- Identified the cause: no network plugin (CNI) was installed yet
- Found an existing calico.yaml manifest on the host
Solution:
- Applied the Calico network plugin manifest: kubectl apply -f ~/calico.yaml (see the sketch below)
- Waited for the Calico components to finish starting
- Watched the CoreDNS Pod status change
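A sketch of those commands; the label selectors assume the stock Calico and kubeadm CoreDNS manifests:

```bash
kubectl apply -f ~/calico.yaml
# Watch the Calico and CoreDNS Pods come up
kubectl get pods -n kube-system -l k8s-app=calico-node
kubectl get pods -n kube-system -l k8s-app=kube-dns
```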
3. Node taint blocking CoreDNS scheduling
Problem:
Even with the network plugin deployed, the CoreDNS Pods remained Pending.
Diagnosis:
- Checked the node taint configuration and found node-role.kubernetes.io/master:NoSchedule
- Checked the toleration settings on the CoreDNS deployment
- Found that CoreDNS tolerates control-plane while the node was tainted with master
Solution:
- Removed the master taint from the node: kubectl taint nodes --all node-role.kubernetes.io/master- (see the sketch below)
- Waited for the CoreDNS Pods to be scheduled and start automatically
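A sketch of the diagnosis plus the fix from the text; the inspection command is an assumption:

```bash
# Show the taints currently set on the nodes
kubectl describe nodes | grep -A 3 Taints
# Remove the master taint from all nodes (the trailing "-" removes a taint)
kubectl taint nodes --all node-role.kubernetes.io/master-
```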
Final Verification Results
After all the issues were resolved, the cluster returned to a healthy state (a verification sketch follows the list):
- Node status: Ready
- All system components running normally:
- etcd
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kube-proxy
- CoreDNS
- Calico 网络插件
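Hedged commands that would confirm this state in a kubeadm cluster:

```bash
kubectl get nodes                  # the node should report Ready
kubectl get pods -n kube-system    # etcd, kube-apiserver, scheduler, CoreDNS and Calico Pods all Running
```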
Lessons Learned
- Configuration consistency matters: keep the kubelet cgroup driver in sync with the container runtime's cgroup driver
- A network plugin is essential: components such as CoreDNS depend on a CNI plugin to work
- Taints and tolerations must match: a Pod is only scheduled onto a node whose taints its tolerations cover (see the sketch after this list)
- Diagnose systematically: start from log analysis and work step by step toward the root cause
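For the taint/toleration point, a quick way to compare both sides (a sketch; coredns is the kubeadm default Deployment name):

```bash
# Taints on the nodes...
kubectl get nodes -o jsonpath='{.items[*].spec.taints}'
# ...versus the tolerations carried by CoreDNS
kubectl -n kube-system get deployment coredns -o jsonpath='{.spec.template.spec.tolerations}'
```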
With the steps above, the key problems hit during Kubernetes cluster initialization were resolved and the cluster now runs stably.
/home/user/k8s_init_complete_v2.sh
#!/bin/bash

YELLOW="\033[1;33m"
GREEN="\033[1;32m"
RED="\033[1;31m"
RESET="\033[0m"echo -e "${YELLOW}=== Kubernetes 初始化完善执行流程 ===${RESET}"# 阶段一:环境清理与依赖检查(基础保障)
echo -e "${YELLOW}=== 阶段一:环境清理与依赖检查 ===${RESET}"# 步骤 1:清理历史残留
echo -e "${YELLOW}=== 步骤 1/20:清理历史残留 ===${RESET}"
# Start containerd first so kubeadm reset does not hit connection errors
echo "Starting containerd..."
sudo systemctl start containerd 2>/dev/null || true
# Give containerd a moment to come up
sleep 5
# Reset any previous kubeadm state
sudo kubeadm reset -f
# Remove leftover files
sudo rm -rf /etc/kubernetes/* /var/lib/kubelet/* /var/lib/etcd/* /etc/cni/net.d/* /home/user/kubeadm-config.yaml ~/.kube/config
echo -e "${GREEN}=== 步骤 1 执行完成 ===${RESET}"# 步骤 2:检查系统依赖(必过校验)
echo -e "${YELLOW}=== 步骤 2/20:检查系统依赖 ===${RESET}"
# Check the kernel version (must be >= 4.19); sort -V gives a proper version
# comparison, where the original numeric compare via bc would mishandle e.g. 4.9 vs 4.19
kernel_version=$(uname -r | cut -d '.' -f1-2)
if [[ "$(printf '%s\n' "$kernel_version" "4.19" | sort -V | head -n1)" != "4.19" ]]; then
    echo -e "${RED}❌ Kernel version too old (current: $kernel_version), upgrade to 4.19+${RESET}"
    exit 1
fi

# Check that the required tools are installed
required_tools=("kubeadm" "kubelet" "kubectl" "containerd" "crictl" "ss" "jq")
for tool in "${required_tools[@]}"; do
    if ! command -v "$tool" &> /dev/null; then
        echo -e "${RED}❌ Missing dependency: $tool, please install it first${RESET}"
        exit 1
    fi
done

# Check port usage (6443/2379/2380/10250 must be free)
occupied_ports=()
for port in 6443 2379 2380 10250; do
    if sudo ss -tuln | grep -q ":$port"; then
        occupied_ports+=($port)
    fi
done
if [[ ${#occupied_ports[@]} -gt 0 ]]; then
    echo -e "${RED}❌ Ports in use: ${occupied_ports[*]}, free them and retry${RESET}"
    exit 1
fi
echo -e "${GREEN}✅ Step 2 passed: system dependencies satisfied${RESET}"

# Phase 2: containerd configuration and validation (core component)
echo -e "${YELLOW}=== Phase 2: containerd configuration and validation ===${RESET}"

# Step 3: configure containerd (cgroupfs driver)
echo -e "${YELLOW}=== Step 3/20: configure containerd ===${RESET}"
sudo tee /etc/containerd/config.toml > /dev/null << EOF
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dirs = []

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_tls_streaming = false
    sandbox_image = "swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/pause:3.10.1"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_selinux = false
    selinux_category_range = 1024
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    max_concurrent_uploads = 5
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    ignore_image_defined_volumes = false
    netns_mounts_under_state_dir = false
    enable_unprivileged_ports = false
    enable_unprivileged_icmp = false

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = false

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""

      [plugins."io.containerd.grpc.v1.cri".cni.ipam]

      [plugins."io.containerd.grpc.v1.cri".cni.interface]

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.cn-hangzhou.aliyuncs.com"]
          endpoint = ["https://registry.cn-hangzhou.aliyuncs.com"]

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]

  [plugins."io.containerd.service.v1.containers-service"]
    rdt_config_file = ""

  [plugins."io.containerd.snapshotter.v1.btrfs"]
    root_path = "/var/lib/containerd/snapshots/btrfs"

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = "/var/lib/containerd/snapshots/native"

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = "/var/lib/containerd/snapshots/overlayfs"

  [plugins."io.containerd.snapshotter.v1.zfs"]
    root_path = "/var/lib/containerd/snapshots/zfs"
EOF
echo -e "${GREEN}=== 步骤 3 执行完成 ===${RESET}"# 步骤 4:启动 Containerd 并校验
echo -e "${YELLOW}=== 步骤 4/20:启动并校验 Containerd ===${RESET}"
# 重启服务
sudo systemctl daemon-reload
sudo systemctl start containerd
sudo systemctl enable containerd

# Wait for startup (10 seconds)
sleep 10

# Check status (must be active (running))
if ! sudo systemctl is-active --quiet containerd; then
    echo -e "${RED}❌ containerd failed to start; check logs: sudo journalctl -u containerd -n 20${RESET}"
    exit 1
fi

# Verify the cgroup driver setting (must be false, i.e. cgroupfs)
if ! grep -qE "systemd_cgroup = false|SystemdCgroup = false" /etc/containerd/config.toml; then
    echo -e "${RED}❌ containerd cgroup driver misconfigured: cgroupfs not enabled${RESET}"
    exit 1
fi

# Verify CRI endpoint connectivity
if ! sudo crictl info &> /dev/null; then
    echo -e "${RED}❌ CRI endpoint unreachable; containerd configuration is broken${RESET}"
    exit 1
fi
echo -e "${GREEN}✅ Step 4 passed: containerd running (cgroupfs driver)${RESET}"

# Step 5: pre-pull the pause image (avoids re-pulling it when components start later)
echo -e "${YELLOW}=== 步骤 5/20:预拉取 Pause 镜像 ===${RESET}"
# 检查镜像是否已存在,如果不存在则拉取
if ! sudo ctr -n k8s.io images ls | grep -q "swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/pause:3.10.1"; thenif ! sudo ctr -n k8s.io images pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/pause:3.10.1; thenecho -e "${RED}❌ Pause 镜像拉取失败,检查网络或镜像仓库${RESET}"exit 1fi
elseecho -e "${GREEN}✅ Pause 镜像已存在,跳过拉取${RESET}"
fi
echo -e "${GREEN}=== 步骤 5 执行完成 ===${RESET}"
# The original pull command is commented out to avoid pulling twice
# pause_image="registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.9"
# if ! sudo crictl pull $pause_image; then
#     echo -e "${RED}❌ Failed to pull the pause image; check the network or the registry${RESET}"
#     exit 1
# fi
echo -e "${GREEN}✅ Step 5 passed: pause image available${RESET}"

# Phase 3: kubelet configuration and validation (node agent)
echo -e "${YELLOW}=== 阶段三:Kubelet 配置与校验 ===${RESET}"# 步骤 6:配置 Kubelet 服务
echo -e "${YELLOW}=== 步骤 6/20:配置 Kubelet 服务 ===${RESET}"
sudo mkdir -p /etc/systemd/system/kubelet.service.d/
sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf > /dev/null << EOF
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# The unsupported --container-runtime flag has been removed; the runtime endpoint is set in config.yaml instead
Environment="KUBELET_EXTRA_ARGS=--cgroup-driver=cgroupfs"
ExecStart=
ExecStart=/usr/bin/kubelet \$KUBELET_KUBECONFIG_ARGS \$KUBELET_CONFIG_ARGS \$KUBELET_EXTRA_ARGS
EOF
echo -e "${GREEN}=== 步骤 6 执行完成 ===${RESET}"# 步骤 7:创建 Kubelet 配置文件
echo -e "${YELLOW}=== 步骤 7/20:创建 Kubelet 配置文件 ===${RESET}"
sudo mkdir -p /var/lib/kubelet
sudo tee /var/lib/kubelet/config.yaml > /dev/null << EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: cgroupfs
runtimeRequestTimeout: "15m"
maxPods: 110
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"
EOF
echo -e "${GREEN}=== 步骤 7 执行完成 ===${RESET}"# 阶段四:镜像预拉与配置文件创建(初始化前置)
echo -e "${YELLOW}=== 阶段四:镜像预拉与配置文件创建 ===${RESET}"# 步骤 8:预拉取 K8s 核心镜像(避免初始化时拉取超时)
echo -e "${YELLOW}=== 步骤 8/20:预拉取 K8s 核心镜像 ===${RESET}"
images=("registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.33.5""registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.33.5""registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.33.5""registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.33.5""registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:v1.12.0""registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.12-0""swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/pause:3.10.1"
)for image in "${images[@]}"; do# 检查镜像是否已存在,如果不存在则拉取if ! sudo ctr -n k8s.io images ls | grep -q "$image"; thenif ! sudo ctr -n k8s.io images pull "$image"; thenecho -e "${RED}❌ 镜像拉取失败: $image${RESET}"exit 1fielseecho -e "${GREEN}✅ 镜像已存在,跳过拉取: $image${RESET}"fi
done
echo -e "${GREEN}=== 步骤 8 执行完成 ===${RESET}"
# 注释掉原来的拉取命令,避免重复拉取
# if ! sudo kubeadm config images pull --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers --kubernetes-version=v1.33.5; then
# echo -e "${RED}❌ K8s 镜像拉取失败,检查网络或镜像仓库地址${RESET}"
# exit 1
# fi# 校验镜像是否拉取成功(使用实际拉取的镜像版本,适配crictl输出格式)
echo -e "${YELLOW}=== 校验镜像是否拉取成功 ===${RESET}"
required_images=("kube-apiserver:v1.33.5""kube-controller-manager:v1.33.5""kube-scheduler:v1.33.5""etcd:3.5.12-0""coredns:v1.12.0"
)# 获取镜像列表
image_list=$(sudo crictl images --quiet)# 检查每个必需的镜像
for img in "${required_images[@]}"; do# 提取镜像名称和标签image_name=$(echo "$img" | cut -d':' -f1)image_tag=$(echo "$img" | cut -d':' -f2)# 构造搜索模式search_pattern="${image_name}\s\+${image_tag}"# 检查镜像是否存在if ! sudo crictl images | grep -q "$search_pattern"; thenecho -e "${RED}❌ 缺失核心镜像:$img,请重新拉取${RESET}"exit 1fi
doneecho -e "${GREEN}✅ 步骤 8 校验通过:K8s 核心镜像拉取完成${RESET}"# 步骤 9:创建 Kubeadm 配置文件
echo -e "${YELLOW}=== 步骤 9/20:创建 Kubeadm 配置文件 ===${RESET}"
sudo tee /home/user/kubeadm-config.yaml > /dev/null << EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.16.233.177
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  taints:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.33.5
imageRepository: registry.cn-hangzhou.aliyuncs.com/google_containers
networking:
  podSubnet: 192.168.0.0/16
controlPlaneEndpoint: "10.16.233.177:6443"
controllerManager:
  extraArgs:
    configure-cloud-routes: "false"
scheduler:
  extraArgs:
    bind-address: 0.0.0.0  # Fix: use bind-address instead of address (the flag supported by current versions)
EOF

# Validate the config file syntax (must pass without errors)
if ! kubeadm config validate --config=/home/user/kubeadm-config.yaml; then
    echo -e "${RED}❌ kubeadm config file has syntax errors; check the format${RESET}"
    exit 1
fi
echo -e "${GREEN}✅ Step 9 passed: kubeadm config file is valid${RESET}"

# Phase 5: K8s initialization and core validation (the critical phase)
echo -e "${YELLOW}=== 阶段五:K8s 初始化与核心校验 ===${RESET}"# 步骤 10:启动 Kubelet 服务
echo -e "${YELLOW}=== 步骤 10/20:启动 Kubelet 服务 ===${RESET}"
# 重启服务
sudo systemctl daemon-reload
sudo systemctl start kubelet
sudo systemctl enable kubelet
echo -e "${GREEN}=== 步骤 10 执行完成 ===${RESET}"# 步骤 11:执行 Kubeadm 初始化
echo -e "${YELLOW}=== 步骤 11/20:执行 Kubeadm 初始化 ===${RESET}"
# 开启 debug 日志,便于排查
sudo kubeadm init --config=/home/user/kubeadm-config.yaml --ignore-preflight-errors=all --v=5# 校验初始化是否成功(必须返回 0)
if [ $? -ne 0 ]; thenecho -e "${RED}❌ Kubeadm 初始化失败,查看日志:sudo journalctl -u kubelet -n 50${RESET}"exit 1
fiecho -e "${GREEN}✅ 步骤 11 执行完成:Kubeadm 初始化无语法错误${RESET}"# 步骤 12:校验 Kubelet 状态
echo -e "${YELLOW}=== 步骤 12/20:校验 Kubelet 状态 ===${RESET}"
# 等待启动(5秒)
sleep 5# 校验状态(必须为 active (running))
if ! sudo systemctl is-active --quiet kubelet; thenecho -e "${RED}❌ Kubelet 启动失败,查看日志:sudo journalctl -u kubelet -n 20${RESET}"exit 1
fi# 校验 cgroup 驱动配置(日志中必须含 CgroupDriver: cgroupfs)
if ! sudo journalctl -u kubelet -n 10 --no-pager | grep -q "CgroupDriver: cgroupfs"; thenecho -e "${RED}❌ Kubelet cgroup 驱动配置错误,未启用 cgroupfs${RESET}"exit 1
fiecho -e "${GREEN}✅ 步骤 12 校验通过:Kubelet 正常运行(cgroupfs 驱动)${RESET}"# 步骤 13:配置 Kubectl 客户端
echo -e "${YELLOW}=== 步骤 13/20:配置 Kubectl 客户端 ===${RESET}"
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Verify kubectl connectivity (must return the cluster version; note that the
# --short flag was removed from recent kubectl releases, so plain "version" is used)
if ! kubectl version; then
    echo -e "${RED}❌ kubectl client configuration failed; cannot reach the cluster${RESET}"
    exit 1
fi
echo -e "${GREEN}✅ Step 13 passed: kubectl client configured${RESET}"

echo -e "${GREEN}=== Kubernetes initialization workflow complete ===${RESET}"
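To run the script, a reasonable way (an assumption, not stated in the original) is to invoke it as the regular user rather than as root: it calls sudo itself where needed, and step 13 writes the kubeconfig into that user's $HOME:

```bash
chmod +x /home/user/k8s_init_complete_v2.sh
/home/user/k8s_init_complete_v2.sh
```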
