Kill -9 does not work

From Noah.org
Revision as of 23:21, 26 April 2010 by Root (talk | contribs)


Why doesn't `kill -9` always work?

You are supposed to be able to kill any process with `kill -9 [PID]`, but you may come across a process that can't be killed. Usually this happens when you are trying to kill a <defunct> process. These are processes that have already exited, but they remain as zombies in the process list. The kernel keeps them there until the parent process retrieves the exit status by calling the wait() system call. This does not usually happen with daemon processes because they detach themselves from their parent process and are adopted by the init process (PID=1), which automatically calls wait() to clear them out of the process list. You may sometimes see a defunct daemon PID in the process list for a brief moment before the init process cleans it up. You don't have to worry about these.

You can also end up with an unkillable process if the process is stuck waiting for the kernel to finish something. This usually happens when the kernel is waiting for I/O. You see this most often with network filesystems such as NFS and Samba that have disconnected uncleanly. It also happens when a drive fails or when someone unplugs the cable to a mounted drive. If the device held a memory-mapped file or was used for swap then you may be really screwed. Any kernel calls that flush I/O may hang forever waiting for the device to respond.
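If you want to watch a zombie appear, the following is a minimal Linux-only sketch. It assumes a mounted /proc; the function name `zombie_demo` is made up for illustration. It starts a parent that never calls wait(), prints the state of its exited child, and then ends the parent so that init can reap the zombie.

```shell
# Hypothetical helper: create a short-lived zombie and print its state letter.
zombie_demo() {
    # The inner shell starts a child that exits at once, then exec()s into
    # `sleep`, which never calls wait(), so the child stays defunct.
    sh -c 'sleep 0 & exec sleep 3' >/dev/null 2>&1 &
    parent=$!
    sleep 1   # give the child time to exit

    # Print the state letter of every process whose parent is $parent.
    for f in /proc/[0-9]*/status; do
        awk -v want="$parent" '
            /^State:/ { state = $2 }
            /^PPid:/  { if ($2 == want) print state }
        ' "$f" 2>/dev/null
    done

    # Signalling the zombie itself would do nothing. Ending the parent lets
    # init adopt the zombie and reap it.
    kill "$parent" 2>/dev/null
    wait "$parent" 2>/dev/null || true
}
```

Running `zombie_demo` should print the state letter of the defunct child. The demo cleans up after itself, so nothing is left in the process list afterwards.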

Which processes are wedged?

Look for processes in state 'D' (uninterruptible sleep) or in state 'Z' (defunct zombie). The following command lists processes in state 'D' or 'Z'. Note that if no processes are in state 'D' or 'Z' then this will still print the `ps` header, but nothing else.

ps Haxwwo stat,pid,ppid,user,wchan:25,command | grep -e "^STAT" -e "^D" -e "^Z"
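If the lone header is a nuisance, for example in a cron job, one possible alternative to the `grep` filter is a small `awk` filter that holds the header back until the first match. This is only a sketch; `filter_wedged` is a made-up name, and it assumes the state is the first column, as in the command above.

```shell
# Hypothetical helper: read `ps` output on stdin, print the header plus any
# rows whose first column starts with D or Z. Prints nothing if none match.
filter_wedged() {
    awk '
        NR == 1 { header = $0; next }      # hold the header back
        $1 ~ /^[DZ]/ {
            if (!found++) print header     # first match: emit the header once
            print
        }
    '
}
```

It would be used as `ps Haxwwo stat,pid,ppid,user,wchan:25,command | filter_wedged`.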

For testing you might want to add normally suspended/sleeping processes:

ps Haxwwo stat,pid,ppid,user,wchan:25,command | grep -e "^STAT" -e "^D" -e "^Z" -e "^S"

Kill stuck processes

After you send a kill signal to a process that is stuck on NFS you must also send a kill signal to the `rpciod` kernel thread (it will restart when needed). The following command shows its PID.

ps Haxwwo pid,command | grep "rpciod" | grep -v grep

Unmount stuck filesystems

You can sometimes kill a process by unmounting the filesystems that it is stuck waiting for. If that doesn't cause the process to generate an I/O error or a segfault then go back and try killing the process again.

Use both `mount` and `cat /proc/mounts` to see what filesystems are mounted. Sometimes `mount` will not show NFS mounts where a previous `umount` is still pending -- yet another headache when dealing with NFS.
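To pick the NFS entries out of the kernel's mount table, something like the following sketch may help. The name `list_nfs_mounts` is made up, and the optional file argument is there only so that the function can be tried against a saved copy of the table.

```shell
# Hypothetical helper: print "server:/path mount_point" for each NFS mount.
list_nfs_mounts() {
    # /proc/mounts fields: device, mount point, filesystem type, options, ...
    awk '$3 == "nfs" || $3 == "nfs4" { print $1, $2 }' "${1:-/proc/mounts}"
}
```

Calling `list_nfs_mounts` with no argument reads /proc/mounts, so it reflects what the kernel believes is mounted, whatever `mount` reports.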

You can use `fuser` to show which processes have file descriptors open on a given filesystem. In the command below DEV must be the device name such as '/dev/sda1' or an NFS network name such as 'some_nfs:/home/user'. Do not use the mount point directory name for NFS mounts because this will cause `fuser` to hang. Again, for NFS, use only the nfs_server:/path name.

fuser -v -m [DEV]
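If you only know the mount point, the device or NFS name can be looked up in /proc/mounts without touching the mount point itself. This is a sketch under a few assumptions: `dev_for_mount` is a made-up name, the optional second argument is only there for trying it against a saved copy of the table, and mount points containing spaces (which /proc/mounts writes with octal escapes) are not handled.

```shell
# Hypothetical helper: print the device (or server:/path) mounted on $1.
dev_for_mount() {
    awk -v mp="$1" '
        $2 == mp { dev = $1 }              # last match wins: later mounts overlay earlier ones
        END {
            if (dev == "") exit 1          # nothing mounted there
            print dev
        }
    ' "${2:-/proc/mounts}"
}
```

The result can then be passed on, as in `fuser -v -m "$(dev_for_mount /mnt/home)"`, where /mnt/home is only an example path.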

You can force an NFS share to unmount by using the lazy option with `umount`. This may cause the stuck process, as well as other processes, to segfault as memory-mapped files and the like suddenly disappear.

umount -l [MOUNT_POINT_OR_DEV]


The system can't reboot

Sometimes the only thing to do is reboot, but by default even `reboot` and `halt` first try to sync filesystems, so they end up stuck too. This sounds like a Catch-22, but the fix is simple: specify the option '-n' to skip syncing mounted filesystems and '-f' to force a reboot without calling `shutdown`.

reboot -n -f

Why is a process wedged?

Use the `ps` instructions given previously to find a Process ID you are interested in.

cat /proc/[PID]/wchan
strace -p [PID]
gdb [PROCESS_COMMAND_NAME] [PID]
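As a first look before attaching `strace` or `gdb`, the state and wait channel can be read straight from /proc. The following is a Linux-only sketch; `show_wait` is a made-up name, and on some kernels or without sufficient permissions /proc/[PID]/wchan reads as 0 or cannot be read at all.

```shell
# Hypothetical helper: print the state and wait channel of one process.
show_wait() {
    pid=$1
    [ -r "/proc/$pid/status" ] || { echo "no such process: $pid" >&2; return 1; }
    # State line, e.g. "S (sleeping)" or "D (disk sleep)".
    state=$(awk -F'\t' '/^State:/ { print $2 }' "/proc/$pid/status")
    # Kernel function the process is sleeping in, if the kernel exposes it.
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    printf 'pid:   %s\nstate: %s\nwchan: %s\n' "$pid" "$state" "${wchan:-unknown}"
}
```

For example, `show_wait $$` reports on the current shell.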