[[Category:Engineering]]


== Calculate the MD5 sum of an entire directory structure ==

Set '''ROOT_PATH''' to the directory to be processed. The first '''find''' hashes the contents of every file; the second hashes the sorted list of pathnames, so renames change the final checksum as well.

<pre>
(
        cd "${ROOT_PATH}"
        find . -name .git -prune -o -type f -exec md5sum -b "{}" \; | sed 's/^\([[:xdigit:]]\+\).*/\1/'
        find . -name .git -prune -o -print | LC_ALL=C sort | md5sum -b | sed 's/^\([[:xdigit:]]\+\).*/\1/'
) | LC_ALL=C sort | md5sum -b | sed 's/^\([[:xdigit:]]\+\).*/\1/'
</pre>
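Wrapped in a function, the same snippet can compare two copies of a tree. '''tree_md5''' is a hypothetical helper name, not part of the original notes:
<pre>
# Hypothetical wrapper around the snippet above.
tree_md5() {
    (
        cd "$1" || exit 1
        find . -name .git -prune -o -type f -exec md5sum -b "{}" \; | sed 's/^\([[:xdigit:]]\+\).*/\1/'
        find . -name .git -prune -o -print | LC_ALL=C sort | md5sum -b | sed 's/^\([[:xdigit:]]\+\).*/\1/'
    ) | LC_ALL=C sort | md5sum -b | sed 's/^\([[:xdigit:]]\+\).*/\1/'
}

# Compare two copies of a tree (paths are placeholders):
[ "$(tree_md5 /srv/copy1)" = "$(tree_md5 /srv/copy2)" ] && echo identical
</pre>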

== exec versus xargs ==

You may notice that some people pipe `find` output into `xargs`, while others tell `find` to start a command with '''-exec'''. What is the difference? Mainly that xargs is faster: it groups arguments and feeds batches to the subcommand, so it doesn't have to start a new instance of the subcommand for every argument.

I think '''-exec''' is easier because you can use the filename more than once in the '''-exec''' argument. It's easier for me to express exactly what I want to be executed.

Find also has its own built-in form of xargs: use '''{} +''' instead of '''{} \;''' at the end of an '''-exec''' or '''-execdir''' section. As with xargs, the replacement string must be the last token in the statement to be executed.
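For example, both of these delete temporary files, but the second batches filenames and starts far fewer rm processes:
<pre>
find . -type f -name '*.tmp' -exec rm -f '{}' \;
find . -type f -name '*.tmp' -exec rm -f '{}' +
</pre>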

== create a sym link to the latest file ==

I couldn't figure out a way to do this using only '''find'''. Here is the best I could come up with. In this example, I find the newest filename that matches the pattern '''version-*'''. A symbolic link from '''version-latest''' is created that points to the newest file found. To create dummy files to test this you can run the command '''date "+%F %T%:::z" > version-$(date "+%s")''' a few times.

<pre>
ln -s -f $(find . -name "version-*" -type f -exec stat -c "%y %n" {} + | sort -r | head -n 1 | cut -d" " -f4) version-latest
</pre>
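If GNU find is available, '''-printf''' can emit a sortable timestamp directly, which avoids running `stat` per file. A sketch under that assumption:
<pre>
ln -s -f "$(find . -name "version-*" -type f -printf '%T@ %p\n' | sort -rn | head -n 1 | cut -d' ' -f2-)" version-latest
</pre>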

== find sym links to a given file ==

The trick is to use a wildcard because a link may be created with a relative or absolute path, so you want to be sure to pick up either style.

<pre>
find /dev -lname "*dm-1"
</pre>

This is another trick that finds symlinks or hardlinks. It asks, in effect, whether two names refer to the same file.

<pre>
find -L / -samefile /etc/apache2/mods-available/ssl.conf
</pre>

== find and delete old files with `find` and cron ==

Put this in '''/etc/cron.daily'''. It automatically deletes spam older than 30 days from my Spam folder.

<pre>
#!/bin/sh
find /home/vpopmail/domains/noah.org/noah/Maildir/.Spam/cur/ -type f -mtime +30 -exec rm -f {} \;
</pre>

More CPU efficient:

<pre>
#!/bin/sh
find /home/vpopmail/domains/noah.org/noah/Maildir/.Spam/cur/ -mtime +30 | xargs rm
</pre>
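The xargs form breaks on filenames that contain whitespace; with GNU find the null-terminated form is safer (a sketch):
<pre>
find /home/vpopmail/domains/noah.org/noah/Maildir/.Spam/cur/ -type f -mtime +30 -print0 | xargs -0 rm -f
</pre>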

Test it by setting the dates of a dummy file to delete.

<pre>
touch -m -d "60 days ago" [FILENAME]
</pre>

== find older or newer files by minutes ==

Find files older than 30 minutes:

<pre>
find . -type f -mmin +30
</pre>

Find files newer than 30 minutes:

<pre>
find . -type f -mmin -30
find . -type f -not -mmin +30
</pre>

== delete files older than given minutes ==

This deletes files older than one hour (60 minutes):

<pre>
find . -type f -mmin +60 -exec rm -f {} \;
</pre>
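GNU find can also do the deletion itself, which avoids spawning rm entirely (note that '''-delete''' implies '''-depth'''):
<pre>
find . -type f -mmin +60 -delete
</pre>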

== find files newer than 1 day ==

Find files less than 1 day old.

<pre>
find . -type f -mtime -1
</pre>

== find devices recently added ==

This is a good trick to poll any directory for file updates. See also [[inotify]].

<pre>
find /dev -maxdepth 1 -mmin -1
</pre>

You might want to increase the depth depending on what type of device you are looking for.

<pre>
find /dev -maxdepth 3 -mmin -1
</pre>

On the other hand it might just be easier to run this:

<pre>
ls -latr /dev | tail
</pre>
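For event-driven watching instead of polling, '''inotifywait''' from the inotify-tools package can monitor the directory; a minimal sketch, assuming that package is installed:
<pre>
inotifywait -m -e create /dev
</pre>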

== set full access to all for an entire subdirectory ==

Directories need 'a+rwx' whereas files need 'a+rw'. You don't want to add execute permission to every file or strip it from every file (otherwise you could just do something blunt like `chmod -R 777 .`).

<pre>
find . \( -type d -exec chmod a+rwx '{}' \; \) -or \( -type f -exec chmod a+rw '{}' \; \)
find . \( -type d -exec chmod a+rwx '{}' \; \) , \( -type f -exec chmod a+rw '{}' \; \)
</pre>
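chmod's symbolic '''X''' mode comes close in a single pass: it applies execute only to directories and to files that already have an execute bit set, so this sketch is nearly equivalent (it differs only for files that are already executable):
<pre>
chmod -R a+rwX .
</pre>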

See also the note on how to [[#copy user permissions to group permissions|copy user permissions to group permissions]].

== copy user permissions to group permissions ==

Often you want group permissions to be identical to user permissions for an entire directory structure. This often happens with htdoc directories on web sites. The typical newbie mistake is to execute a massive `chmod -R a+rwx .` in an attempt to "get rid of permission problems". The following is slightly more surgical:

<pre>
find . -exec /bin/sh -c 'chmod g=`ls -ld "{}" | cut -c2-4 | tr -d "-"` "{}"' \;
</pre>

This is also really slow. It forks a shell for every single file and directory under the current directory. Run time: 95m 36.895s on a directory tree with 111645 files. This was a system with a slow drive (disk read: 8.43 MB/sec), but even so most of the poor performance is due to exec'ing a shell.
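Note that chmod's symbolic modes can copy one class's bits from another directly ('''g=u''' is standard POSIX chmod syntax), so the whole job may reduce to a single recursive command:
<pre>
chmod -R g=u .
</pre>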

== List all extensions in the current directory ==

This came in handy when I was trying to find out exactly what mime-types I need to care about.

<pre>
find . -print0 | xargs -L 1 -0 basename | sed -e "s/.*\(\\.\\s*\)/\\1/" | sort | uniq > /tmp/types
</pre>

The '''-print0''' option tells find to null-terminate filenames. The '''-0''' option for xargs tells it to read null-terminated strings. These two options are used to handle filenames that have special characters such as quotes or line-feeds. If you don't do this then you may get the following error:

<pre>
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
</pre>

== massive recursive grep ==

Grep has a recursive option, but you can fine tune a recursive grep with `find` -- you can make much more complicated expressions for the types of files you want to grep through. The main thing to remember when using `grep` with `find` is that you probably want the '''-H''' option on grep. This prints the filename along with the match.

<pre>
find . -exec grep -H PatternToFind {} \;
</pre>
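As noted in the exec-versus-xargs section above, terminating the '''-exec''' with '''{} +''' batches filenames and starts far fewer grep processes ('''-type f''' is added here to keep grep off directories):
<pre>
find . -type f -exec grep -H PatternToFind {} +
</pre>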

== find duplicates of files (dedupe) ==

This will list files with duplicates. It compares all files under the given directory. This ignores .svn directories and files of size 0.

This needs a little more work... It would be more efficient if it ignored all files that have a unique size, but then it's a slippery slope into writing a full-blown script. I would also like to get rid of the tmp file. This could be done by using `sed` to rearrange the fields so that the crc is at the end of the line, and then use the '''-f''' option of `uniq` to ignore all but the last field when comparing lines.

<pre>
find . -name .svn -prune -o -size +0 \! -type d -exec cksum {} \; | sort | tee /tmp/f.tmp | cut -f 1,2 -d ' ' | uniq -d | grep -hif - /tmp/f.tmp
</pre>
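The temp file can be dropped by grouping in awk instead of grepping back into a file; a sketch of that approach:
<pre>
find . -name .svn -prune -o -size +0 \! -type d -exec cksum {} \; \
    | awk '{ key = $1 " " $2; grp[key] = grp[key] $0 "\n"; n[key]++ }
           END { for (k in grp) if (n[k] > 1) printf "%s", grp[k] }'
</pre>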

=== find duplicates on Mac OS X ===

The Mac does not have the `md5sum` command, but it has the `md5` command, which formats the output differently. Note that this only prints the filename of duplicates, not the original filename. It should be possible to modify this command to use `cksum`, which is faster. The crc field would have to be moved to the end of the line using `sed` because the `uniq` command has an option to skip leading fields, but it has no option to look at only a single field in the middle of other fields.

<pre>
find . -type f -exec md5 '{}' ';' | sort -k 4 | uniq -f 3 -d | sed -e "s/.*(\(.*\)).*/\1/"
</pre>

If you want to delete the duplicate you can pipe the output through `xargs`. This works because the original filename is not printed.

<pre>
find . -type f -exec md5 '{}' ';' | sort -k 4 | uniq -f 3 -d | sed -e "s/.*(\(.*\)).*/\1/" | xargs rm
</pre>

== find unique files in between two directories of hard-linked copies ==

Rotating backup schemes often hard-link unchanged files between rotation sets. This saves disk space. Files that have changed are copied normally, so they don't have additional hard-links. Sometimes it is useful to compare two backup sets and generate a list of the files that changed between them. This could be done with file hashes, but that would be slow. Instead we can use the fact that files that are identical in two separate backup sets will have the same inode number, while files that changed will have different inode numbers.

This is a work in progress...

<pre>
cat <(find BACKUP_SET_1 ! -type d -exec ls -1i "{}" \;) <(find BACKUP_SET_2 ! -type d -exec ls -1i "{}" \;) | sort | cut -d ' ' -f 2-,1 | uniq -u -f 1
</pre>
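Note that cut cannot actually reorder fields ('''-f 2-,1''' outputs the same as '''-f 1-'''), so as written the uniq comparison still sees the inode first. A sketch that counts inode occurrences in awk instead:
<pre>
cat <(find BACKUP_SET_1 ! -type d -exec ls -1i "{}" \;) \
    <(find BACKUP_SET_2 ! -type d -exec ls -1i "{}" \;) \
    | awk '{ ino = $1; sub(/^ *[0-9]+ */, ""); path[ino] = $0; n[ino]++ }
           END { for (i in n) if (n[i] == 1) print path[i] }'
</pre>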

== count number of directory names and regular filenames, excluding .git ==

echo "Regular files"
find . -name .git -prune -o -type f -print | wc -l
echo "Directories"
find . -name .git -prune -o -type d -print | wc -l

=== print file counts for each subdirectory in the current directory (count inodes, sort of) ===

This will print a table of the entry count of each directory under the current working directory. Symlinks and directories are counted too, since each takes an inode. It will also give a grand total count for the entire current working directory.

<pre>
for ii in $(find . -maxdepth 1 -type d); do echo -e "${ii}\t$(find "${ii}" -type l -o -type f -o -type d | wc -l)"; done | sort -n -k 2 | column -t
</pre>

This is a way to print counts of the inodes in the current directory. I am not sure if this will exactly match the inode count in the filesystem, since this counts entries in the directory structure; it does not directly count inodes. Because this command is split up into two find commands it cannot correctly discriminate against directories mounted on other filesystems (the '''-mount''' option will not work). For example, this should really ignore "/proc" and "/sys". You can see that when running this command in "/", including "/proc" and "/sys" grossly skews the grand total count.

Note that the total number of inodes used for an entire filesystem mount can be found with '''df''':

<pre>
df -B 1 -i
</pre>

== count max directory depth ==

The '''%d''' argument to '''-printf''' prints the directory depth of the current file.

<pre>
find . -printf '%d\n' | sort -n | tail -1
</pre>

This can also be done without '''-printf''':

<pre>
find -type d | awk -F'/' '{print NF-1}' | sort -n | tail -1
</pre>
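A small variation on the '''-printf''' form also shows which entry is the deepest:
<pre>
find . -printf '%d %p\n' | sort -n | tail -1
</pre>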