Deduplicate (dedup)

This shows various ways to deduplicate files (dedup or dedupe, as I often misspell it). Deduping involves computing a hash of every file and then finding duplicate hashes. This finds duplicate files even if their names differ and/or the files are in completely different parts of a directory tree.
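
As a quick illustration of the basic idea, here is a minimal sketch of my own (not part of Elonen's scripts below). It uses sha256sum, but md5sum works the same way if you change -w 64 to -w 32:

find . -type f -exec sha256sum {} + |
    sort | uniq -w 64 --all-repeated=separate

Each group of identical files is printed together, with the groups separated by blank lines.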

Jarno Elonen's remove-duplicate-files

This script was written by Jarno Elonen, 2003-04-06...2003-12-29. His web page for this script was last seen at http://elonen.iki.fi/code/misc-notes/remove-duplicate-files/ .

The following shell script finds duplicate (2 or more identical) files and outputs a new shell script containing commented-out rm statements for deleting them. You then have to select which files to keep - the script can't safely do it automatically!

You will then have to edit the generated file, uncomment the rm lines for the files you want to delete, and then run the script.
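
A hypothetical session (make-dup-script.sh is just an assumed name for whichever generator one-liner below you saved to a file):

cd ~/photos                # the tree to deduplicate (illustrative path)
sh ~/make-dup-script.sh    # writes rem-duplicates.sh in the current directory
vi rem-duplicates.sh       # uncomment the rm lines you want to apply
./rem-duplicates.sh        # removes the selected duplicates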

The code was written for Debian GNU/Linux and has been tested with Bash and Zsh. Needless to say, you are welcome to do whatever you like with it as long as you don't blame me for disasters... (released into the public domain)

Known bugs: the script doesn't work correctly with file names whose last characters are space(s), due to a bug (misfeature?) in the read command, nor with file names containing backslashes, for reasons unknown. Fortunately, both kinds of file names are very rare.
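
Both problems come from round-tripping file names through sed escaping and the read command. If you have GNU findutils and coreutils, a sketch like the following (my addition, not Elonen's) sidesteps them by never word-splitting file names at all, though it only lists the duplicates rather than generating the rm script:

find . -type f -print0 | xargs -0 md5sum |
    sort --key=1,32 | uniq -w 32 --all-repeated=separate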

Newer version. I made a few modifications. Instead of having the generated script remove files, it recalculates the md5sum of every file identified as a duplicate, as a sanity check for the paranoid. It also pre-filters by size, so only files whose size occurs more than once get hashed at all. You can run the generated show-duplicates.sh script and do a visual spot check. Then edit the script and replace md5sum with rm as you see fit.

OUTF=show-duplicates.sh; echo "#! /bin/sh" > $OUTF
find "$@" -type f -printf "%s\n" | sort -n | uniq -d |
    xargs -I@@ -n1 find "$@" -type f -size @@c -exec md5sum {} \; |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/md5sum \1/' >> $OUTF
chmod a+x $OUTF; ls -l $OUTF
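
For example, assuming the generator above was saved as make-show-duplicates.sh (an assumed name):

sh make-show-duplicates.sh ~/photos   # writes show-duplicates.sh
./show-duplicates.sh | less           # spot check: sums within each group should match
vi show-duplicates.sh                 # change md5sum to rm on the copies to delete
./show-duplicates.sh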

Older version:

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9-\.\/_])/\\\\\1/g"
echo "#! /bin/sh" > $OUTF
find -type f | sed -r "$ESCSTR" | while read x; do md5sum "$x"; done |
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
    sed -r "s/^[0-9a-f]*( )*//;$ESCSTR;s/(.+)/#rm \1/" >> $OUTF
chmod a+x $OUTF; ls -l $OUTF

...and the same for older uniq and sort versions (e.g. those in Debian Woody):

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9-\.\/_])/\\\\\1/g"
SUM="dummy"; echo "#! /bin/sh" > $OUTF
find -type f | sed -r "$ESCSTR" | while read x; do md5sum "$x"; done |
    sort -k 1,32 | uniq -w 32 -d --all-repeated |
    while read y; do
        NEW=`echo "$y-dummy" | sed "s/ .*$//"`
        if [ $NEW != $SUM ]; then echo "" >> $OUTF; fi
        SUM="$NEW"
        echo "$y" | sed -r "s/^[0-9a-f]*( )*//;$ESCSTR;s/(.+)/#rm \1/" >> $OUTF
    done
chmod a+x $OUTF; ls -l $OUTF

Example output

#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./reppulilta/gdc2001/113-1303_IMG.JPG

#rm ./lissabon/01-01-2001/108-0883_IMG.JPG
#rm ./kuvat\ reppulilta/lissabon/01-01-2001/108-0883_IMG.JPG

#rm ./gdc2001/113-1328_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG

Explanation

      1. write output script header
      2. list all files recursively under current directory
      3. escape all the potentially dangerous characters with a backslash
      4. calculate MD5 sums
      5. find duplicate sums
      6. strip off MD5 sums and leave only file names
      7. escape the strange characters again
      8. write out a commented-out delete command
      9. make the output script executable and ls -l it
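
For reference, here is the older version again with each pipeline stage labeled by the step numbers above (the same command as before, just annotated; no new functionality):

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9-\.\/_])/\\\\\1/g"
echo "#! /bin/sh" > $OUTF                                     # 1. script header
find -type f |                                                # 2. list files recursively
    sed -r "$ESCSTR" |                                        # 3. backslash-escape dangerous characters
    while read x; do md5sum "$x"; done |                      # 4. calculate MD5 sums (read un-escapes)
    sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | # 5. find duplicate sums
    sed -r "s/^[0-9a-f]*( )*//;$ESCSTR;s/(.+)/#rm \1/" \
        >> $OUTF                                              # 6-8. strip sums, re-escape, comment out rm
chmod a+x $OUTF; ls -l $OUTF                                  # 9. make executable and show it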