A Roundup of Errors Hit While Learning k8s

While learning k8s you run into one problem after another; even something as simple as suspending the VMs is almost guaranteed to produce errors once they come back up. I have posted several separate articles on fixing individual errors, but they are too scattered to be much help when pointing others in the right direction, and scattered notes do nothing to keep the rage level down when something breaks. So I am collecting them here, which also makes it easier for me to review later.

Problem 1:

After writing the kube-apiserver configuration file, the systemd unit file, and the rest of that setup work, systemd failed to start kube-apiserver:

root@k8s-master1 bin# systemctl status  kube-apiserver
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2020-06-11 16:13:42 CST; 14min ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 23894 ExecStart=/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS (code=exited, status=1/FAILURE)
 Main PID: 23894 (code=exited, status=1/FAILURE)

Jun 11 16:13:42 k8s-master1 kube-apiserver[23894]: --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
Jun 11 16:13:42 k8s-master1 kube-apiserver[23894]: -v, --v Level                          number for the log level verbosity (default 0)
Jun 11 16:13:42 k8s-master1 kube-apiserver[23894]: --version version[=true]           Print version information and quit
Jun 11 16:13:42 k8s-master1 kube-apiserver[23894]: --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
Jun 11 16:13:42 k8s-master1 systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
Jun 11 16:13:42 k8s-master1 systemd[1]: Stopped Kubernetes API Server.
Jun 11 16:13:42 k8s-master1 systemd[1]: start request repeated too quickly for kube-apiserver.service
Jun 11 16:13:42 k8s-master1 systemd[1]: Failed to start Kubernetes API Server.
Jun 11 16:13:42 k8s-master1 systemd[1]: Unit kube-apiserver.service entered failed state.
Jun 11 16:13:42 k8s-master1 systemd[1]: kube-apiserver.service failed.
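(A tip I would add in hindsight rather than something I did at the time: the full startup error usually lands in the journal as well, which saves re-running the binary by hand just to see past the truncated status output.)

journalctl -u kube-apiserver --no-pager -n 100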

The line "ExecStart=/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS" shows that the service runs the kube-apiserver executable and passes all of its options through the $KUBE_APISERVER_OPTS variable. The executable itself is fine, so the problem most likely lies in $KUBE_APISERVER_OPTS, which is defined in the kube-apiserver configuration file /opt/kubernetes/cfg/kube-apiserver.conf.
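For context, in this kind of binary deployment the two files are typically wired together roughly like this (a minimal sketch based on the paths mentioned above, not a copy of my actual files):

# /usr/lib/systemd/system/kube-apiserver.service
[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/kubernetes/kubernetes

[Service]
# KUBE_APISERVER_OPTS comes from the config file below
EnvironmentFile=/opt/kubernetes/cfg/kube-apiserver.conf
ExecStart=/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS
Restart=on-failure

[Install]
WantedBy=multi-user.target

# /opt/kubernetes/cfg/kube-apiserver.conf (one long quoted variable; the full flag list is the command shown below)
KUBE_APISERVER_OPTS="--logtostderr=false \
  --v=2 \
  --log-dir=/opt/kubernetes/logs \
  --etcd-servers=https://192.168.52.10:2379,https://192.168.52.20:2379,https://192.168.52.21:2379 \
  ..."

So a single typo in kube-apiserver.conf is enough to make every start attempt fail.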

To hunt for the bug, I simply ran the full command by hand:

# /opt/kubernetes/bin/kube-apiserver  --logtostderr=false --v=2 --log-dir=/opt/kubernetes/logs --etcd-servers=https://192.168.52.10:2379,https://192.168.52.20:2379,https://192.168.52.21:2379 -bind-address=192.168.52.10 --secure-port=6443 --advertise-address=192.168.52.10 --allow-privileged=true --service-cluster-ip-range=10.0.0.0/24 --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,NodeRestriction --authorization-mode=RBAC,Node --enable-bootstrap-token-auth=true --token-auth-file=/opt/kubernetes/cfg/token.csv --service-node-port-range=30000-32767 --kubelet-client-certificate=/opt/kubernetes/ssl/server.pem --kubelet-client-key=/opt/kubernetes/ssl/server-key.pem --tls-cert-file=/opt/kubernetes/ssl/server.pem --tls-private-key-file=/opt/kubernetes/ssl/server-key.pem --client-ca-file=/opt/kubernetes/ssl/ca.pem --service-account-key-file=/opt/kubernetes/ssl/ca-key.pem --etcd-cafile=/opt/etcd/ssl/ca.pem --etcd-certfile=/opt/etcd/ssl/server.pem --etcd-keyfile=/opt/etcd/ssl/server-key.pem --audit-log-maxage=30 --audit-log-maxbackup=3 --audit-log-maxsize=100 --audit-log-path=/opt/kubernetes/logs/k8s-audit.log
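A small addition of my own rather than what I did at the time: since the config file is a plain shell-style variable assignment, you can source it and filter the output so the offending flag stands out (the grep pattern is only a guess at typical error wording):

source /opt/kubernetes/cfg/kube-apiserver.conf
/opt/kubernetes/bin/kube-apiserver $KUBE_APISERVER_OPTS 2>&1 | grep -iE 'error|invalid|unknown'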

As shown below, I located the key error message, compared it against the configuration file, and found the cause of the error.
[screenshot: the key error message in the kube-apiserver output]

Solution to Problem 1

Errors like this are almost always the result of carelessness while editing the configuration file; correcting the configuration file is all that is needed.
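After correcting the config file, the standard systemd workflow is enough to confirm the fix (nothing here is specific to my setup except the address):

systemctl daemon-reload        # only needed if the unit file itself changed
systemctl restart kube-apiserver
systemctl status kube-apiserver
# recent versions expose /healthz to anonymous requests, so this should usually print "ok";
# if yours does not, the status output above is enough
curl -k https://192.168.52.10:6443/healthz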

Problem 2:

After I had the dashboard deployed I suspended the VM, and when I started it again everything was broken:

[screenshots: the errors that appeared after resuming the VM]

Judging from these errors, it could no longer connect to the apiserver. I tried for quite a while without solving it, so I decided to delete the deployed dashboard.

To remove every dashboard-related resource completely, I used the following approach:

# adjust the namespace to match your own
# kubectl get configmap,clusterrole,clusterrolebinding,secret,sa,role,rolebinding,services,deployments  --namespace=kubernetes-dashboard | grep dashboard

[screenshot: the dashboard-related resources returned by the command above]
Then delete them one by one, based on what the command returned (a shorter alternative follows after the list):

kubectl delete configmap kubernetes-dashboard-settings -n kubernetes-dashboard
kubectl delete clusterrole kubernetes-dashboard  -n kubernetes-dashboard
kubectl delete clusterrolebinding dashboard-admin  -n kubernetes-dashboard
kubectl delete clusterrolebinding kubernetes-dashboard -n kubernetes-dashboard
kubectl delete secret kubernetes-dashboard-certs  -n kubernetes-dashboard
kubectl delete secret kubernetes-dashboard-csrf  -n kubernetes-dashboard
kubectl delete secret kubernetes-dashboard-key-holder  -n kubernetes-dashboard
kubectl delete secret kubernetes-dashboard-token-8vl8b  -n kubernetes-dashboard
kubectl delete  serviceaccount kubernetes-dashboard -n kubernetes-dashboard
kubectl delete role kubernetes-dashboard  -n kubernetes-dashboard
kubectl delete rolebinding kubernetes-dashboard -n kubernetes-dashboard
kubectl delete service dashboard-metrics-scraper -n kubernetes-dashboard
kubectl delete service kubernetes-dashboard -n kubernetes-dashboard
kubectl delete deployment dashboard-metrics-scraper -n kubernetes-dashboard
kubectl delete deployment kubernetes-dashboard -n kubernetes-dashboard
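A shorter alternative I did not think of at the time: if the dashboard was deployed from a single manifest, deleting with that same manifest removes everything it created (the file name below is just the usual upstream one; use whatever you actually applied):

kubectl delete -f recommended.yaml
# or: delete the namespace, then clean up the cluster-scoped leftovers,
# since clusterrole/clusterrolebinding do not live in a namespace
kubectl delete namespace kubernetes-dashboard
kubectl delete clusterrole kubernetes-dashboard
kubectl delete clusterrolebinding kubernetes-dashboard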


Solution to Problem 2

My skills being limited, for now my only fix for a dashboard that can no longer be reached is to delete the broken deployment completely and redeploy it.
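Redeploying is just re-applying the manifest; the version in the URL below is only an example, use whichever manifest you originally deployed from:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml
kubectl get pods -n kubernetes-dashboard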

Problem 3:

Today I hit a problem that nearly drove me insane: for no obvious reason, kubectl get csr started failing:

[screenshot: the error from kubectl get csr]

So I went digging, and it turned out that at some point I had clumsily made the token in /opt/kubernetes/cfg/token.csv inconsistent with the token used when generating the bootstrap.kubeconfig file!
(The terminal had long been closed, so there was no command history left...)
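To check for this mismatch directly instead of rebuilding, comparing the two values is enough (the paths follow the layout used earlier in this post; on a node the bootstrap kubeconfig may live elsewhere):

# the token the apiserver accepts for bootstrap authentication (format: token,user,uid,groups)
cat /opt/kubernetes/cfg/token.csv
# the token the kubelet presents when bootstrapping; the two strings must match
grep token: /opt/kubernetes/cfg/bootstrap.kubeconfig

If they differ, either regenerate bootstrap.kubeconfig with the token from token.csv, or put that token back into token.csv and restart kube-apiserver.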

Although I had found the cause, I still could not get it fixed after a lot of fiddling, and since I was pressed for time I chose to redo the deployment. While redoing it, the worker nodes showed up in kubectl get node even though kubelet had not been deployed on them yet. I was not going to stand for that, so I ran kubectl delete node on node1 and node2 and then also deleted the already-approved CSRs on the master (the commands are reconstructed just below).
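Reconstructed from memory, roughly (the node names are whatever yours are called; the CSR names come from kubectl get csr):

kubectl delete node node1 node2
kubectl get csr
kubectl delete csr <csr-name>   # once for each approved request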
And then this happened:
[screenshot: the error that followed]

Then, just as I was at a complete loss, I came across this:
[screenshot: the post that gave me the hint]

It did not match my situation exactly, but it gave me some ideas:

Solution to Problem 3

1. Find the path of the kubelet.kubeconfig file (it is specified in kubelet's configuration file), then cat it

[screenshots: the kubelet.kubeconfig path and its contents]
2. Delete the two certificate files it references (the kubelet client certificate and key)
[screenshot: removing the two certificate files]
3. Restart kubelet. A new kubelet.kubeconfig is generated, and kubectl get csr on the master shows a freshly submitted CSR from that node; approve it again to bring the node back into the cluster (a command-level sketch of the whole procedure follows below).
[screenshot: the new CSR visible in kubectl get csr]
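Put together as commands, this is roughly the whole procedure (a sketch assuming the file layout used throughout this post; your kubelet config name and certificate file names may differ):

# 1. find where kubelet.kubeconfig and the client certificate live
grep -E 'kubeconfig|cert' /opt/kubernetes/cfg/kubelet.conf
cat /opt/kubernetes/cfg/kubelet.kubeconfig
# 2. remove the stale client certificate/key it references, e.g.
rm -f /opt/kubernetes/ssl/kubelet-client-*.pem
# 3. restart kubelet so it bootstraps again and submits a new CSR
systemctl restart kubelet
# on the master, approve the new request to bring the node back
kubectl get csr
kubectl certificate approve <csr-name>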