Deduplicate (dedup)

From Noah.org
Revision as of 01:35, 10 November 2006 by Root

Unix shell script for removing duplicate files

by Jarno Elonen, 2003-04-06...2003-12-29

The following shell script finds duplicate files (two or more files with identical content) and writes a new shell script containing commented-out rm statements for deleting them. The script can't safely decide which copies to keep, so that choice is left to you.

Edit the generated file, uncomment the rm lines for the files you want to delete, and then run it.
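If all the copies you want to delete sit under one directory, the uncommenting step can be done mechanically rather than by hand. The following is only a sketch: the helper name uncomment_under and its arguments are made up for this example, and it assumes the "#rm path" line format shown in the example output further down. It prints the result instead of modifying the generated file.

```shell
#!/bin/sh
# uncomment_under PREFIX SCRIPT   (hypothetical helper, not part of the original script)
# Prints SCRIPT with every "#rm PREFIX..." line turned into "rm PREFIX...".
# PREFIX must be written exactly as it appears in the generated file,
# i.e. already backslash-escaped (for example './kuvat\ reppulilta/').
uncomment_under() {
    # The prefix is passed through the environment because awk -v would
    # interpret the backslashes; index() is a literal substring test.
    PREFIX_LINE="#rm $1" awk '
        index($0, ENVIRON["PREFIX_LINE"]) == 1 { sub(/^#/, "") }
        { print }
    ' "$2"
}
```

A call such as uncomment_under './kuvat\ reppulilta/' rem-duplicates.sh > rem-selected.sh leaves the original file untouched, so the selection can still be reviewed before anything is run.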

The code was written for Debian GNU/Linux and has been tested with Bash and Zsh. You are welcome to do whatever you like with it, as long as you don't blame me for disasters... (released into the public domain)

Known bugs: the script doesn't work correctly with file names that end in one or more spaces, due to a bug (misfeature?) in the read command, nor with file names containing backslashes, for some unknown reason. Both kinds of file names are fortunately very rare.

The code

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9-\.\/_])/\\\\\1/g";
echo "#! /bin/sh" > $OUTF;
find -type f | sed -r "$ESCSTR" |
  while read x; do md5sum "$x"; done |
  sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
  sed -r "s/^[0-9a-f]*(\\ )*//;$ESCSTR;s/(.+)/#rm \1/" >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF

...and same for older uniq and sort versions (e.g. those in Debian Woody):

OUTF=rem-duplicates.sh; ESCSTR="s/([^a-zA-Z0-9-\.\/_])/\\\\\1/g";
SUM="dummy"; echo "#! /bin/sh" > $OUTF;
find -type f | sed -r "$ESCSTR" |
  while read x; do md5sum "$x"; done |
  sort -k 1,32 | uniq -w 32 -d --all-repeated |
  while read y; do
    NEW=`echo "$y-dummy" | sed "s/ .*$//"`;
    if [ $NEW != $SUM ]; then echo "" >> $OUTF; fi;
    SUM="$NEW";
    echo "$y" | sed -r "s/^[0-9a-f]*(\\ )*//;$ESCSTR;s/(.+)/#rm \1/" >> $OUTF;
  done;
chmod a+x $OUTF; ls -l $OUTF
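The extra while loop in the older variant inserts the blank line between groups by hand, printing one whenever the checksum differs from the previous line's. The same idea in isolation might look like the sketch below. The function name separate_groups is hypothetical, the input is assumed to be "checksum  filename" lines already sorted by checksum, and unlike the loop above it does not print a blank line before the first group.

```shell
#!/bin/sh
# separate_groups   (hypothetical helper illustrating the grouping loop)
# Reads "checksum  filename" lines on stdin, sorted by checksum, and
# copies them to stdout with an empty line between different checksums.
separate_groups() {
    prev=
    while IFS= read -r line; do
        sum=${line%% *}              # text before the first space: the checksum
        if [ -n "$prev" ] && [ "$sum" != "$prev" ]; then
            echo                     # checksum changed: start a new group
        fi
        prev=$sum
        printf '%s\n' "$line"
    done
}
```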

Example output

#! /bin/sh
#rm ./gdc2001/113-1303_IMG.JPG
#rm ./reppulilta/gdc2001/113-1303_IMG.JPG

#rm ./lissabon/01-01-2001/108-0883_IMG.JPG
#rm ./kuvat\ reppulilta/lissabon/01-01-2001/108-0883_IMG.JPG

#rm ./gdc2001/113-1328_IMG.JPG
#rm ./kuvat\ reppulilta/gdc2001/113-1328_IMG.JPG
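Before running an edited script, it may be worth checking that no group has had every one of its lines uncommented, since that would delete all copies of a file. A minimal sketch of such a check follows. The function name check_groups is hypothetical, and it assumes groups are separated by blank lines and that lines start with either "#rm " or "rm ", as in the output above.

```shell
#!/bin/sh
# check_groups SCRIPT   (hypothetical helper, not part of the original script)
# Reports each blank-line-separated group that contains active "rm" lines
# but no remaining "#rm" line, and returns non-zero if there is any.
check_groups() {
    awk '
        function flush() {
            if (active > 0 && kept == 0) {
                printf "group %d would lose every copy\n", group
                bad = 1
            }
            active = 0; kept = 0; ingroup = 0
        }
        /^#?rm / { if (!ingroup) { group++; ingroup = 1 } }
        /^#rm /  { kept++ }      # still commented out: this copy survives
        /^rm /   { active++ }    # uncommented: this copy will be deleted
        /^$/     { flush() }     # blank line ends the current group
        END      { flush(); exit bad + 0 }
    ' "$1"
}
```

Running check_groups rem-duplicates.sh && ./rem-duplicates.sh would then refuse to start the deletion when a group has no surviving copy.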

Explanation

      1. write output script header
      2. list all files recursively under current directory
      3. escape all the potentially dangerous characters with a backslash
      4. calculate MD5 sums
      5. find duplicate sums
      6. strip off MD5 sums and leave only file names
      7. escape the strange characters again
      8. write out a commented-out delete command
      9. make the output script executable and ls -l it
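The steps above can also be laid out one per line, which makes them easier to follow. The sketch below is not the author's script but a variation on it: the function name find_dups and its two arguments are made up, it assumes GNU find, md5sum, uniq and sed, and it swaps in IFS= read -r for the first escaping pass (step 3) so that read takes each name verbatim. That should let names ending in spaces through. Names containing backslashes or newlines are still not handled, because md5sum prints those in an escaped form that the stripping pattern does not match.

```shell
#!/bin/sh
# find_dups DIR OUTFILE   (hypothetical wrapper around the steps listed above)
find_dups() {
    dir=$1
    outf=$2
    # Put a backslash in front of every character outside a conservative safe set.
    esc='s|([^a-zA-Z0-9._/-])|\\\1|g'

    echo "#! /bin/sh" > "$outf"                       # step 1: script header
    find "$dir" -type f |                             # step 2: list files recursively
    while IFS= read -r f; do md5sum "$f"; done |      # step 4: checksums (read -r replaces step 3)
    sort |                                            # equal checksums become adjacent
    uniq -w 32 --all-repeated=separate |              # step 5: keep only repeated checksums
    sed -E -e 's/^[0-9a-f]{32}  //' \
           -e "$esc" \
           -e 's/^(.+)$/#rm \1/' >> "$outf"           # steps 6-8: strip sum, escape, comment out
    chmod a+x "$outf"                                 # step 9: make the result executable
}
```

Called as find_dups . rem-duplicates.sh, it writes the same kind of file as the one-liner. Passing an output path outside the scanned directory keeps the generated script itself out of the file list.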