Symptom
Recently, I faced an issue where Kubelet on a node failed to create inotify file watchers and reported errors like:
error creating file watcher: too many open files
error creating file watcher: no space left on device
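Both messages point at the kernel's inotify limits. For reference, these can be inspected (and, as a temporary mitigation, raised) with the standard sysctl knobs; the raised value below is only an arbitrary example, not a recommendation:
sysctl fs.inotify.max_user_watches    # per-user limit on watches
sysctl fs.inotify.max_user_instances  # per-user limit on inotify instances
# Temporary bump, resets on reboot (65536 is an arbitrary example)
sudo sysctl -w fs.inotify.max_user_watches=65536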
After a quick check, I found the node's max_user_watches was 10000, but the TotalinotifyWatches was beyond this value. (P.S. I'm still not sure how the number of watchers can exceed the cap.) To find out which process occupied the most watchers, I used the following command.
echo -e "COUNT\tPID\tUSER\tCOMMAND"; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t"; ps -p $PID -o user=,command=; done}
COUNT PID USER COMMAND
7491 8412 root /home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experi
2620 1 root /sbin/init
....
Surprisingly, Kubelet had initiated more than 7000 inotify watchers. I suspected an inotify watcher leak in Kubelet.
Leakage check
Clean Kubelet
To better understand the situation, I created a clean cluster on GKE with only one clean node. Roughly 70 inotify watchers were there. I then created a single nginx pod, and the number increased by 3. Theoretically, these 3 watchers are used by Kubelet to monitor changes on the rootfs, kube-api-access and the PodSandbox. But to verify this, we need to check in more detail which inodes are monitored by Kubelet.
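For reference, here is how the raw watcher count can be reproduced (a sketch: the pgrep pattern assumes the GKE kubelet path used earlier, and kubectl runs from a workstation with access to the cluster):
# On the node: count Kubelet's inotify watch entries
KPID=$(pgrep -f "/home/kubernetes/bin/kubelet" | head -1)
sudo find /proc/${KPID}/fdinfo -type f 2>/dev/null \
  | sudo xargs grep ^inotify 2>/dev/null | wc -l
# From a workstation: create a pod, then re-run the count above
kubectl run nginx --image=nginx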
Check inotify file descriptors
To do so, let's take a look at how to track a single inotify file descriptor. By opening a process's fdinfo folder, we can examine each entry to find an inotify fd.
# Find the kubelet pid
ps -aux | grep kubelet
KPID=2430
# Pick an example fd
sudo ls /proc/${KPID}/fdinfo
0 1 10 11 12 13 14 2 3 4 5 6 7 8 9
...
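Not every fd here is an inotify instance. A handy shortcut: under /proc/<pid>/fd, inotify fds appear as symlinks pointing at anon_inode:inotify, so they can be spotted directly (sample output abbreviated):
sudo ls -l /proc/${KPID}/fd | grep inotify
lrwx------ 1 root root 64 ... 8 -> anon_inode:inotify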
sudo cat /proc/${KPID}/fdinfo/8
pos: 0
flags: 02004000
mnt_id: 15
ino: 1057
inotify wd:1 ino:3f327 sdev:800001 mask:fc6 ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:27f30300e5059ea2
This is quite cryptic, so I relied on man proc to understand each piece. In the given fd, the information we need sits in the last line: an inotify entry representing one file or folder being monitored. The most useful field is ino:3f327, which is the inode number of the target file (in hexadecimal). Next to it, sdev:800001 is the ID of the device the inode sits on, also in hex.
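As a side note, sdev uses the kernel's internal dev_t encoding, which on Linux packs the major number into the high 12 bits and the minor number into the low 20 bits, so both fields can be decoded straight from the shell:
# Decode the two hex fields from the fdinfo line
ino=3f327; sdev=800001
echo "inode: $((16#$ino))"                                 # 258855
echo "device: $((16#$sdev >> 20)):$((16#$sdev & 0xFFFFF))" # 8:1, i.e. /dev/sda1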
Using lsblk, I can see there's only one disk in use on the node, so finding the target file is easy.
# Cast the hex inode number to decimal
ino=3f327
dec="$((16#${ino}))"
# Find the target file by its inode number
sudo debugfs -R "ncheck ${dec}" /dev/sda1 2>/dev/null
debugfs 1.46.5 (30-Dec-2021)
Inode Pathname
258855 /etc/srv/kubernetes/pki/ca-certificates.crt
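As a quick sanity check (my addition, not required for the hunt), stat on the resolved path should report the same inode number:
stat -c '%i %n' /etc/srv/kubernetes/pki/ca-certificates.crt
258855 /etc/srv/kubernetes/pki/ca-certificates.crt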
Putting all the steps above into a single script (please see the updated version in the appendix), I can retrieve all target files, which would help to understand whether there's a real leak. I also count how often each unique inode appears, which helps to see which inodes are monitored multiple times.
# 'EOF' is quoted so the expansions run when the script executes, not now
cat << 'EOF' | sudo tee test.sh
# Find the kubelet pid (pattern matches the GKE kubelet path)
PID=$(pgrep -f "/home/kubernetes/bin/kubelet" | head -1)
echo "kubelet pid=${PID}"
in_fds=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | cut -d " " -f 3 | cut -d ":" -f 2)
echo ${in_fds}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"
uniq_fds=$(echo "${in_fds}" | sort | uniq)
echo ${uniq_fds}
while read -r element;
do
count=$(echo "${in_fds}" | grep -cx "${element}")
dec="$((16#${element}))"
loc="debugfs -R 'ncheck ${dec}' /dev/sda1"
loc=$(eval $loc 2>/dev/null | tail -1 | cut -d " " -f 4)
printf "%-6s %-10s %-6s %s\n" "${element}" "${dec}" "${count}" "${loc}"
done <<< "${uniq_fds}"
EOF
sudo bash test.sh
kubelet pid=2430
3f327 3f321 ...
Count: 120
1 10b 1259 128a ...
1 1 72 Inode Pathname
10b 267 1 267 /etc/systemd/system/multi-user.target.wants/snapd.service
...
The results consist of the following parts:
- One line with the Kubelet pid
- One line with all target inode numbers
- One line with the total count of watch entries (120)
- One line with the sorted unique targets
- 120 following lines, each representing one unique inode: its hex number, its decimal value, the count, the decimal value once more, and the target file path.
I used the same script on the problematic node, and it showed the following result. In summary, most Kubelet watchers were targeting ino:1, and there were 6649 such target files, which pointed to a leak, because there were only 150 pods on this node. Unfortunately, debugfs couldn't find any of the target files, so the output showed only the meaningless string "Inode Pathname".
kubelet pid=8412
...
Count: 7491
...
1 1 6649 Inode Pathname
Bad apple
Why can't debugfs help anymore? The reason is simple: each pod brings its own mounts with their own root filesystems. This means the watchers reside on different filesystems, each with an independent inode index, while debugfs only looked at /dev/sda1. There are other ways to locate such files; I chose the common tool find to search the whole tree by inode number.
sudo find / -inum 1
/home/kubernetes/containerized_mounter/rootfs/dev
/home/kubernetes/containerized_mounter/rootfs/proc
...
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/5325873d-f2a0-48df-83e2-0b911df2f77f/volumes/kubernetes.io~projected/kube-api-access-227jg
...
/dev
/boot/efi
...
This made things easy, because I could simply compare the pod IDs in these paths against the currently running pods, to find out whether any terminated pods were still there. And it did show that some non-existent pods were somehow still being watched.
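That comparison can be sketched as follows, assuming the find output above was saved to watched.txt (the file names are illustrative):
# Pod UIDs that Kubelet still watches
grep -o 'pods/[0-9a-f-]*' watched.txt | cut -d/ -f2 | sort -u > watched_uids.txt
# Pod UIDs that actually exist in the cluster
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort -u > live_uids.txt
# Watched but no longer running, i.e. leak candidates
comm -23 watched_uids.txt live_uids.txt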
What to expect next
Appendix
Updated script to get inotify watchers initiated by kubelet
Thanks to yujuhong@ for the momentum and help in finishing the updated script.
# Find the kubelet pid (pattern matches the GKE kubelet path)
PID=$(pgrep -f "/home/kubernetes/bin/kubelet" | head -1)
echo "Kubelet Pid:" ${PID}
inums_raw=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify)
# echo ${inums_raw}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"
while read -r line;
do
reg="ino:([0-9a-f]*) sdev:([0-9a-f]*)"
if [[ ${line} =~ $reg ]]; then
ino="${BASH_REMATCH[1]}"
sdev="${BASH_REMATCH[2]}"
# echo $ino $sdev
else
echo "wrong line"
fi
sdev_in_dec=$((16#$sdev))
# Linux dev_t packs the major number in the high 12 bits, the minor in the low 20
minor=$((sdev_in_dec & 0xFFFFF))
major=$((sdev_in_dec >> 20))
# echo "${major}:${minor}"
in_fds_sdev+=("${ino}-${major}:${minor}")
done <<< "${inums_raw}"
# One pair per line, so sort/uniq can actually deduplicate
uniq_pairs=($(printf '%s\n' "${in_fds_sdev[@]}" | sort -u))
echo "Unique target" ${#uniq_pairs[@]}
printf "%-10s %-10s %-6s %s\n" "INUM" "DEV" "COUNT" "TARGET"
for pair in "${uniq_pairs[@]}"
do
# Match whole lines, so e.g. "1-..." isn't counted inside "31-..."
count=$(printf '%s\n' "${in_fds_sdev[@]}" | grep -Fxc "${pair}")
fd_hex=$(echo ${pair} | cut -d "-" -f 1)
dev=$(echo ${pair} | cut -d "-" -f 2)
fd_dec="$((16#${fd_hex}))"
# -w avoids matching e.g. "8:1" inside "28:1"; take one mount if several share the device
mount_info=$(grep -w "${dev}" /proc/$PID/mountinfo | head -1)
if [[ -z $mount_info ]]; then
echo "Can't find mount info for" $dev
else
tmpfs_path=$(echo $mount_info | cut -d " " -f 5)
# echo $tmpfs_path
loc=$(find ${tmpfs_path} -inum ${fd_dec} 2>/dev/null)
printf "%-10s %-10s %-6s %s\n" "${fd_dec}" "${dev}" "${count}" "${loc}"
fi
done
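To use it, save the script on the node (e.g. as inotify_watchers.sh, the name is arbitrary) and run it as root, since reading another process's fdinfo requires elevated privileges:
sudo bash inotify_watchers.sh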