Difference between revisions of "Kill -9 does not work"

From Noah.org
Revision as of 21:34, 26 April 2010


Why doesn't `kill -9` always work?

You are supposed to be able to kill any process with `kill -9 [PID]`, but you may come across a process that can't be killed. Usually this happens when you are trying to kill a <defunct> process. These are processes that have already exited, but they remain as zombies in the process list. The kernel keeps them there until the parent process retrieves the exit status code by calling the wait() system call. This does not usually happen with daemon processes because they detach themselves from their parent process and are adopted by the init process (PID=1), which automatically calls wait() to clear them out of the process list. You may sometimes see a defunct daemon PID in the process list for a brief moment before init cleans it up. You don't have to worry about these.

You can also end up with an unkillable process if the process is stuck waiting for the kernel to finish something. This usually happens when the kernel is waiting for I/O. You see this most often with network filesystems such as NFS and Samba that have disconnected uncleanly. It also happens when a drive fails or if someone unplugs a cable to a mounted drive. If the device held a memory-mapped file or was used for swap then you may be really screwed: any kernel call that flushes I/O may hang forever waiting for the device to respond.
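Since only the parent can clear a zombie, it helps to know which parent is responsible. The following is a minimal sketch, assuming a Linux system with /proc mounted and `awk` available; the function name `list_zombies` is made up for illustration.

```shell
# Hypothetical helper: print "PID PPID NAME" for every zombie process.
# Assumes Linux, where /proc/<pid>/status has Name:, State:, Pid: and PPid: lines.
list_zombies() {
    for status in /proc/[0-9]*/status; do
        # A process can vanish while we iterate, so read errors are ignored.
        awk '
            /^Name:/  { name = $2 }   # first word of the command name only
            /^State:/ { state = $2 }  # one letter, e.g. R, S, D, Z
            /^Pid:/   { pid = $2 }
            /^PPid:/  { ppid = $2 }
            END { if (state == "Z") print pid, ppid, name }
        ' "$status" 2>/dev/null
    done
    return 0
}

list_zombies
```

The second column is the process that still has to call wait(). If that parent exits or is killed, init adopts the zombie and clears it.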

Which processes are wedged?

Look for processes in state 'D' (uninterruptible sleep) or state 'Z' (defunct zombie). The following command lists processes in either state. If no processes are in state 'D' or 'Z', it still prints the `ps` header, but nothing else.

ps Haxwwo stat,pid,ppid,user,command | grep -e "^STAT" -e "^D" -e "^Z"
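For use in a script it can be handier to print nothing at all, and return a non-zero status, when no process is wedged. A minimal sketch, assuming the state is the first column as in the `ps` command above; `wedged_filter` is a hypothetical name, not a standard tool.

```shell
# Hypothetical filter: reads `ps` output on stdin and keeps the header plus
# any line whose first column (STAT) starts with D or Z.
# Prints nothing and returns 1 when no such line exists.
wedged_filter() {
    awk '
        NR == 1 { header = $0; next }   # remember the ps header line
        $1 ~ /^[DZ]/ {
            if (!found) print header    # emit the header once, on first match
            found = 1
            print
        }
        END { exit !found }
    '
}
```

Usage would look like `ps Haxwwo stat,pid,ppid,user,command | wedged_filter`, for example as the condition of an `if`.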

Unmount stuck filesystems

You can sometimes kill a process by unmounting filesystems that it is stuck waiting for. If that doesn't cause the process to generate an IO error or a segfault then go back and try killing the process again.

Use both `mount` and `cat /proc/mounts` to see what filesystems are mounted. Sometimes `mount` will not show NFS mounts where a previous `umount` is still pending -- yet another headache when dealing with NFS.
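To pull only the network filesystems out of the kernel's view, something like the following may help. It is a sketch that assumes the usual /proc/mounts layout (device, mount point, filesystem type, then options) and that the types of interest are `nfs`, `nfs4`, and `cifs`; the function name `network_mounts` is made up for illustration.

```shell
# Hypothetical helper: print "DEVICE MOUNT_POINT" for NFS and CIFS (Samba) mounts.
# Reads /proc/mounts by default; another file in the same format may be passed.
# Note: spaces in mount points appear as \040 in /proc/mounts and are left as-is.
network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs)$/ { print $1, $2 }' "${1:-/proc/mounts}"
}
```

The first column of the output is the server:/path form of the name, and the second is the mount point directory.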

You can use `fuser` to show which processes have file descriptors open on a given filesystem. In the command below DEV must be the device name such as '/dev/sda1' or an NFS network name such as 'some_nfs:/home/user'. Do not use the mount point directory name for NFS mounts because this will cause `fuser` to hang. Again, for NFS, use only the nfs_server:/path name.

fuser -v -m [DEV]

You can force an NFS share to unmount by using the lazy option with `umount`. This may cause the stuck process as well as other processes to segfault as mem-mapped files and the like suddenly disappear.

umount -l [MOUNT_POINT_OR_DEV]


The system can't reboot

Sometimes the only thing to do is reboot, but even `reboot` and `halt` will first try to sync filesystems by default, and they will end up stuck too. This sounds like a Catch-22, but the fix is simple: specify the option '-n' to skip syncing mounted filesystems and '-f' to force a reboot without calling `shutdown`.

reboot -n -f