Inotify watcher leaks in Kubelet

Symptom

Recently, I faced an issue where Kubelet on a node reported errors about failing to create file watchers.

error creating file watcher: too many open files
error creating file watcher: no space left on device
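
Both messages typically point at an inotify limit being exhausted: EMFILE ("too many open files") usually means the per-user instance limit, while ENOSPC ("no space left on device") usually means the per-user watch limit. As a quick sanity check, the limits can be read with sysctl (a minimal sketch; the actual values depend on the node image):

# Current inotify limits on the node
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_user_instances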

After a quick check, I found the node's max_user_watches is 10000, but the total number of inotify watches was already beyond that value. (P.S. I'm still not sure how watchers can exceed the cap.) To find out which process held the most watchers, I used the following command1.

echo -e "COUNT\tPID\tUSER\tCOMMAND"; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t"; ps -p $PID -o user=,command=; done}

COUNT	PID	USER	COMMAND
7491	8412	root /home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experi
2620	1	root /sbin/init
....

Surprisingly, Kubelet had created more than 7000 inotify watchers. I suspected an inotify leak in Kubelet.

Leak check

Clean Kubelet

To better understand the situation, I created a clean cluster with only one clean node on GKE. Roughly 70 inotify watchers were there. I then created a single nginx pod and the number increased by 3. Theoretically, these 3 watchers are used by Kubelet to monitor changes on the rootfs, the kube-api-access volume and the PodSandbox. But to verify this, we need more detail on which inodes are monitored by Kubelet.
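
The before/after numbers can be reproduced by simply counting kubelet's inotify fdinfo entries. A minimal sketch (the kubelet path matches the GKE node image used here and may differ elsewhere; nginx is just an arbitrary test image):

# Count kubelet's inotify watches
KPID=$(pgrep -f "/home/kubernetes/bin/kubelet" | head -1)
sudo find /proc/${KPID}/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | wc -l

# Create a test pod, then re-run the count above
kubectl run nginx --image=nginx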

Check inotify file descriptors

To do so, let's take a look at how to track a single inotify file descriptor. By opening the process's fdinfo folder, we can examine each entry to find an inotify fd.

# Find kubelet pid
ps -aux | grep kubelet
KPID=2430

# List its fds and pick an example one
sudo ls /proc/${KPID}/fdinfo

0 1 10 11 12 13 14 2 3 4 5 6 7 8 9

...

sudo cat /proc/${KPID}/fdinfo/8

pos:	0
flags:	02004000
mnt_id:	15
ino:	1057
inotify wd:1 ino:3f327 sdev:800001 mask:fc6 ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:27f30300e5059ea2

This is quite confusing, so I relied on man proc2 to understand each piece of it. For a given fd, the information we need sits in the last line: an inotify entry that represents one file or folder being monitored. The most useful field is ino:3f327, the inode number of the target file (in hexadecimal), together with sdev:800001, the ID of the device the inode sits on, also in hex.
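
As a quick sanity check, sdev can be decoded into the familiar major:minor pair. On current Linux kernels the fdinfo value is the kernel dev_t, with the major number in the high bits and a 20-bit minor in the low bits; a small sketch:

# Decode sdev:800001 into major:minor
sdev=$((16#800001))
echo "$((sdev >> 20)):$((sdev & 0xfffff))"   # prints 8:1, i.e. /dev/sda1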

Using lsblk, I can see there's only one disk in use on the node, so finding the target file should be easy.

# Cast to decimal
ino=3f327
dec="$((16#${ino}))"

# Find the target file on the disk
sudo debugfs -R "ncheck ${dec}" /dev/sda1 2>/dev/null

debugfs 1.46.5 (30-Dec-2021)
Inode	Pathname
258855	/etc/srv/kubernetes/pki/ca-certificates.crt

Putting all the steps above into a single script (please see the updated version in the appendix), I can retrieve all target files, which helps to tell whether there is a real leak. I also count how often each unique inode appears, which shows which inodes are monitored multiple times.

cat << 'EOF' | sudo tee test.sh
# Find the kubelet pid (path as used on this GKE node)
PID=$(pgrep -f "/home/kubernetes/bin/kubelet" | head -1)
echo "kubelet pid=${PID}"

# Hex inode number of every inotify watch held by kubelet
in_fds=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | cut -d " " -f 3 | cut -d ":" -f 2)
echo ${in_fds}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"

uniq_fds=$(echo "${in_fds}" | sort | uniq)
echo ${uniq_fds}

while read -r element;
do
  # How many watches point at this inode (exact line match)
  count=$(echo "${in_fds}" | grep -cx "${element}")
  dec="$((16#${element}))"
  # Resolve the inode to a path on the root disk
  loc=$(debugfs -R "ncheck ${dec}" /dev/sda1 2>/dev/null | tail -1)
  printf "%-6s %-10s %-6s %s\n" "${element}" "${dec}" "${count}" "${loc}"
done <<< "${uniq_fds}"
EOF

sudo bash test.sh

kubelet pid=2430
3f327 3f321 ...
Count: 120
1 10b 1259 128a ...
1	1	72	Inode	Pathname
10b	267	1	267	/etc/systemd/system/multi-user.target.wants/snapd.service
...

The results consist of the following parts:

  • One line for the Kubelet pid
  • One line listing all target inode numbers
  • One line with the total watch count (120)
  • One line with the sorted unique inodes
  • One line per unique inode, showing the hex inode number, its decimal value, the watch count, and the debugfs result (the decimal inode again plus the target file path)

I ran the same script on the problematic node, and it showed the following result. In summary, most Kubelet watchers were targeting ino:1: 6649 watches pointed at it, which looked like a leak, because there were only about 150 pods on this node. Unfortunately, debugfs couldn't find any target file for it, so the output showed only the meaningless string "Inode Pathname".

kubelet pid=8412
...
Count: 7491
...
1	1	6649	Inode Pathname

Bad apple

Why can't debugfs help anymore? The reason is simple: each pod uses its own rootfs and volume mounts, which are separate filesystems with independent inode numbering. This means the watchers reside on filesystems other than /dev/sda1, so ino:1 is just the root inode of one of those mounts. There are other ways to resolve this, but I chose the most common tool, find, to locate them.

sudo find / -inum 1
/home/kubernetes/containerized_mounter/rootfs/dev
/home/kubernetes/containerized_mounter/rootfs/proc
...
/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/5325873d-f2a0-48df-83e2-0b911df2f77f/volumes/kubernetes.io~projected/kube-api-access-227jg
...
/dev
/boot/efi
...

This makes things easy, because I can just compare the pod IDs in these paths against the pods actually running on the node, to find out whether any terminated pods are still there. And it did show that some non-existent pods were somehow still being watched.
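
Concretely, the comparison can be scripted along these lines (a rough sketch: watched.txt is a hypothetical file holding the find output from above, and in practice you would also restrict the kubectl query to pods scheduled on this node):

# Pod UIDs that still show up in watched paths (hypothetical input file watched.txt)
grep -o 'pods/[0-9a-f-]*' watched.txt | cut -d/ -f2 | sort -u > watched_uids.txt

# Pod UIDs the API server still knows about
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort -u > running_uids.txt

# UIDs that are watched but belong to no existing pod
comm -23 watched_uids.txt running_uids.txt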

What to expect next

Appendix

Updated script to get inotify watchers initiated by kubelet

Thanks to yujuhong@ for the momentum and help in finishing the updated script.

PID=$(ps -aux | grep "/home/kubernetes/bin/kubelet" | grep -v grep | head -1 | awk '{print $2}')
echo "Kubelet Pid:" ${PID}

inums_raw=$(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify)
# echo ${inums_raw}
echo "Count: $(find /proc/${PID}/fdinfo -type f 2>/dev/null | xargs grep ^inotify | wc -l)"

while read -r line;
do
        reg="ino:([0-9a-f]*) sdev:([0-9a-f]*)"
        if [[ ${line} =~ $reg ]]; then
                ino="${BASH_REMATCH[1]}"
                sdev="${BASH_REMATCH[2]}"
                # echo $ino $sdev
        else
                echo "wrong line"
                continue
        fi

        # fdinfo prints the kernel dev_t: major in the high bits, 20-bit minor in the low bits
        sdev_in_dec=$((16#$sdev))
        minor=$((sdev_in_dec & 0xfffff))
        major=$((sdev_in_dec >> 20))
        # echo "${major}:${minor}"

        in_fds_sdev+=("${ino}-${major}:${minor}")
done <<< "${inums_raw}"

# Deduplicate the (inode, device) pairs
uniq_pairs=($(printf '%s\n' "${in_fds_sdev[@]}" | sort -u))
echo "Unique target" ${#uniq_pairs[@]}

printf "%-10s %-10s %-6s %s\n" "INUM" "DEV" "COUNT" "TARGET"
for pair in "${uniq_pairs[@]}"
do
        count=$(printf '%s\n' "${in_fds_sdev[@]}" | grep -Fcx "${pair}")
        fd_hex=$(echo ${pair} | cut -d "-" -f 1)
        dev=$(echo ${pair} | cut -d "-" -f 2)
        fd_dec="$((16#${fd_hex}))"

        # Look up the mount point of this device in kubelet's mount namespace
        mount_info=$(grep " ${dev} " /proc/$PID/mountinfo | head -1)
        if [[ -z $mount_info ]]; then
                echo "Can't find mount info for" $dev
        else
                tmpfs_path=$(echo $mount_info | cut -d " " -f 5)
                # echo $tmpfs_path
                loc=$(find ${tmpfs_path} -xdev -inum ${fd_dec} 2>/dev/null)
                printf "%-10s %-10s %-6s %s\n" "${fd_dec}" "${dev}" "${count}" "${loc}"
        fi
done
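
Assuming the script is saved on the node as, say, kubelet_inotify.sh (the filename is just for illustration), it can be run the same way as before:

sudo bash kubelet_inotify.sh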

References