RKE2 Troubleshooting: etcd and Insufficient Memory

Background

This troubleshooting guide is based on a real incident.

The error message at the time was:

Failed to test etcd connection: rpc error: code = Unavailable desc = connection error: dial tcp 127.0.0.1:2379: connect: connection refused

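At first this looked like etcd data corruption or a cluster misconfiguration. Step-by-step investigation showed the actual root cause:

The node had only 512 Mi of memory, so etcd was OOM-killed during startup.
Raising the node's memory to 2048 Mi (2 Gi) resolved the problem.

This document records the troubleshooting steps from that incident as a reference for diagnosing similar problems.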

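Problem Description

A connection refused error on 127.0.0.1:2379 (the etcd client port) means one of the following:

etcd did not start
or etcd is starting but is not ready yet
or etcd was killed before it could listen (for example by the OOM killer)
or the etcd port is already in use
or the etcd data is corrupted
or /etc/hosts is wrong, so peers cannot connect

RKE2 Server depends on its embedded etcd, so any etcd problem causes the Control Plane to fail to start.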

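Step 1: Check whether etcd is running

Check the static pod

crictl ps | grep etcd

On an RKE2 node, crictl is installed under /var/lib/rancher/rke2/bin and may need CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml set before it can reach containerd.

If the API is reachable:

kubectl -n kube-system get pods | grep etcd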

Step 2: Check the etcd logs (most important)

journalctl -u rke2-server -f

If the container has started:

crictl logs <etcd-container-id>

Common errors:

Error	Cause
no space left on device	Disk is full
corrupted wal	etcd data is corrupted
address already in use	Port 2379/2380 is in use
failed to resolve host	/etc/hosts is wrong
bind: cannot assign requested address	IP address is misconfigured
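The table above can be turned into a small lookup when scanning many log lines. The following is a minimal sketch; `classify_etcd_error` is a hypothetical helper name (not part of RKE2 or etcd), and the matching is a plain substring check against the messages listed in the table.

```shell
#!/bin/sh
# Map one etcd / rke2-server log line to the probable cause from the table.
# classify_etcd_error is a hypothetical helper, not an RKE2 command.
classify_etcd_error() {
    case "$1" in
        *"no space left on device"*)         echo "disk full" ;;
        *"corrupted wal"*)                   echo "etcd data corrupted" ;;
        *"address already in use"*)          echo "port 2379/2380 in use" ;;
        *"failed to resolve host"*)          echo "/etc/hosts misconfigured" ;;
        *"cannot assign requested address"*) echo "wrong IP configuration" ;;
        *)                                   echo "unknown" ;;
    esac
}
```

For example, `classify_etcd_error "$(crictl logs <etcd-container-id> 2>&1 | tail -n 1)"` would classify the most recent log line; anything not in the table is reported as unknown and still needs manual reading.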

Step 3: Check the etcd ports

ss -ltnp | grep 2379
ss -ltnp | grep 2380

If a process other than etcd is listening on either port, stop that service.
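The two grep calls can be folded into one filter that reports which etcd ports already have a listener. This is a sketch under the assumption that the input is `ss -ltn` output, where the first column is the state and the fourth is "Local Address:Port"; `listening_etcd_ports` is a hypothetical name.

```shell
#!/bin/sh
# Print the etcd ports (2379/2380) that already have a listener,
# reading `ss -ltn` output from stdin.
listening_etcd_ports() {
    awk '$1 == "LISTEN" {
             # The port is the last ":"-separated piece, which also works for [::]:2380.
             n = split($4, parts, ":")
             if (parts[n] == "2379" || parts[n] == "2380") print parts[n]
         }' | sort -u
}
# Usage on a node: ss -ltn | listening_etcd_ports
```

The function only says that a port is taken. To see which process owns it, rerun `ss -ltnp` as root and read the process column.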

Step 4: Check the etcd data directory

ls -lh /var/lib/rancher/rke2/server/db/etcd/
df -h /var/lib/rancher

If you see:

Many very large WAL files → the data may be corrupted
An oversized snapshot → compaction is needed

Step 5: Rebuild etcd (single Control Plane)

⚠️ This resets all Kubernetes data

systemctl stop rke2-server
rm -rf /var/lib/rancher/rke2/server/db/etcd/
systemctl start rke2-server
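Because the rm -rf above cannot be undone, a more cautious variant is to move the directory aside so it can be restored or inspected later. The sketch below is one way to do that; `set_aside_etcd_dir` and the `.bak-<timestamp>` suffix are assumptions for illustration, not an RKE2 convention.

```shell
#!/bin/sh
# Move an etcd data directory aside instead of deleting it, and print the backup path.
# set_aside_etcd_dir is a hypothetical helper name.
set_aside_etcd_dir() {
    src="${1%/}"                                   # strip a trailing slash
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    dest="${src}.bak-$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dest" && echo "$dest"
}
# On a node, between stopping and starting the service:
#   systemctl stop rke2-server
#   set_aside_etcd_dir /var/lib/rancher/rke2/server/db/etcd/
#   systemctl start rke2-server
```

The backup stays on the same filesystem, so it does not free any space. If the disk is full, copy it elsewhere or delete it once the cluster is confirmed healthy.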

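Step 6: Diagnose an HA etcd cluster

Check member status:

/var/lib/rancher/rke2/bin/etcdctl member list

RKE2's embedded etcd requires client certificates, so the command typically also needs --cacert, --cert and --key pointing at the files under /var/lib/rancher/rke2/server/tls/etcd/. If etcdctl is not present at this path, it can be run from inside the etcd container instead.

Expected health states:

State	Meaning
started	Healthy
unstarted	Member has not started
unhealthy	Cannot connect (reported by etcdctl endpoint health, not by member list)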
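To spot problem members quickly, the member list can be filtered for anything that is not started. This sketch assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, is-learner); `members_not_started` is a hypothetical name, and the certificate paths in the usage comment are the ones RKE2 commonly uses, which should be verified on the node.

```shell
#!/bin/sh
# Print members that are not in the "started" state,
# reading `etcdctl member list` output from stdin.
members_not_started() {
    awk -F', ' '$2 != "started" {
        # An unstarted member has no name yet, so fall back to its ID.
        who = ($3 == "" ? $1 : $3)
        print who, $2
    }'
}
# Usage on a node (certificate paths are an assumption, check them first):
#   etcdctl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
#           --cert   /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
#           --key    /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
#           member list | members_not_started
```

An empty result means every member reports started; it does not prove that each endpoint is reachable, which is what `etcdctl endpoint health` checks.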

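Common Root Cause Examples

1. Wrong /etc/hosts (most common)

Incorrect example (the hostname resolves to a loopback address that peers cannot reach):

127.0.1.1 node1

Change it to:

127.0.0.1 localhost
<real-ip> node1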
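The misconfiguration above can be detected with a short check of the hosts file. This is a sketch only; `hostname_on_loopback` is a hypothetical helper that succeeds when the given name appears on a 127.x.x.x or ::1 line.

```shell
#!/bin/sh
# Exit 0 if the given hostname is mapped to a loopback address in a hosts file.
# Arguments: hostname, optional hosts file (defaults to /etc/hosts).
hostname_on_loopback() {
    name="$1"; file="${2:-/etc/hosts}"
    awk -v name="$name" '
        { sub(/#.*/, "") }                       # drop comments
        $1 ~ /^127\./ || $1 == "::1" {
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit found ? 0 : 1 }' "$file"
}
# Usage on a node:
#   hostname_on_loopback "$(hostname)" && echo "hostname resolves to loopback, fix /etc/hosts"
```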

2. IPv6 issues

Disable IPv6:

sysctl -w net.ipv6.conf.all.disable_ipv6=1

3. Disk full

df -h

Common scenarios:

/var/lib/rancher is full
Container logs or the journal have grown too large

What to do:

Clean up old containers and images
Set up logrotate to limit log size
Provision enough disk space
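Reading df output by eye is easy to get wrong on nodes with many mounts. The sketch below filters `df -P` output for filesystems at or above a usage threshold; `full_filesystems` is a hypothetical name and the 90% default is an arbitrary assumption, not an RKE2 recommendation.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (default 90%),
# reading `df -P` output from stdin. In POSIX format, column 5 is the
# capacity percentage and column 6 is the mount point.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " use "%"
    }'
}
# Usage on a node: df -P /var/lib/rancher | full_filesystems 85
```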

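4. Firewall blocks etcd peers

Control Plane nodes must be able to reach each other:

Port	Purpose
2379	etcd client
2380	etcd peer

Make sure security groups and firewalls allow traffic between nodes on these ports.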

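5. Insufficient memory (OOM)

This was the root cause in the incident described above.

Symptoms

rke2-server / etcd keeps restarting (CrashLoopBackOff)
journalctl shows:
- signal: killed
- OOMKilled
- out of memory
The node has very little memory available (for example, only 512 Mi)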

How to check

free -h
grep -i oom /var/log/messages /var/log/syslog 2>/dev/null | tail -n 20
journalctl -u rke2-server | grep -i oom
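A plain grep for "oom" does not match the "signal: killed" and "out of memory" messages listed under the symptoms. The following sketch matches all of them in one pass; `oom_events` is a hypothetical name, and the extra `oom-kill` pattern is added on the assumption that the kernel log uses its usual "oom-killer" wording.

```shell
#!/bin/sh
# Print log lines that point at an OOM kill, reading log text from stdin.
# Patterns cover the symptoms listed in this section plus the kernel "oom-kill" wording.
oom_events() {
    grep -iE 'signal: killed|oomkilled|out of memory|oom-kill'
}
# Usage on a node:
#   journalctl -u rke2-server --no-pager | oom_events | tail -n 20
#   journalctl -k --no-pager | oom_events | tail -n 20
```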

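Fix (actual case)

Original node memory: 512 Mi
Changed to: 2048 Mi (2 Gi)
After the change, etcd and the RKE2 server started normally and the error no longer appeared

General recommendations

Give Control Plane nodes at least 2 Gi of memory (adjust for the actual workload)
Avoid running many extra services on the same node, where they compete with RKE2 / etcd for memory
Add monitoring (for example node exporter + Prometheus) and alert on memory usage and OOM events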
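The memory recommendation can be checked before installing RKE2 on a node. This is a minimal sketch that reads MemTotal from /proc/meminfo; `has_min_memory_mib` is a hypothetical name. MemTotal is slightly lower than the memory assigned to the VM because the kernel reserves some, so the threshold should sit a little below the nominal size (1800 is used here as an assumed value for a 2 Gi node).

```shell
#!/bin/sh
# Exit 0 if MemTotal (reported in kB by /proc/meminfo) is at least the given number of MiB.
# Arguments: minimum MiB, optional meminfo file (defaults to /proc/meminfo).
has_min_memory_mib() {
    min_mib="$1"; meminfo="${2:-/proc/meminfo}"
    awk -v min="$min_mib" '
        $1 == "MemTotal:" { ok = ($2 / 1024 >= min) }
        END { exit ok ? 0 : 1 }' "$meminfo"
}
# Usage on a node:
#   has_min_memory_mib 1800 || echo "node has less memory than recommended for a Control Plane"
```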

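Suggested Enhancements

etcd health monitoring (for example through Prometheus)
logrotate to cap journal / container log usage
Auto-heal script (detect anomalies and attempt automatic repair)
Node resource monitoring (CPU / Memory / Disk) with alerting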
