Re: Tool to compare directories and delete duplicate files from one directory
Date: Sun, 14 May 2023 22:48:52 UTC
> ----------------------------------------
> From: David Christensen <dpchrist@holgerdanske.com>
> Date: May 13, 2023, 6:55:26 PM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> I wrestled with a Perl script years ago when I did not know of
> fdupes(1), jdupes(1), etc..  Brute force O(N^2) comparison worked for
> toy datasets, but was impractical when I applied it to a directory
> containing thousands of files and hundreds of gigabytes.  (The OP
> mentioned 12 TB.)  Practical considerations of run time, memory usage,
> disk I/O, etc., drove me to find the kinds of optimizations fdupes(1)
> and jdupes(1) mention.
>
>
> I do not know Awk, so it is hard to comment on your script.  I suggest
> commenting out any create/update/delete code, running the script against
> larger and larger datasets, and seeing what optimizations you can add.
>
>
> David

All good points, and why I rewrote it without multi-dimensional arrays.
Initially, `stat' was run on each file encountered, and sizes were then
compared on matched path/filename pairs. The multi-dimensional arrays
stored the filenames, paths, and sizes (hence, multi-d). But that's
wasteful, since we only care about size if there is a duplicate
somewhere. This version runs `stat' only when an apparent duplicate is
found, which cuts down the number of `stat' calls significantly.

The reason awk is so efficient at these types of tasks is that it's
doing string comparisons and string manipulation, which are very
efficient when done properly. The most resource-intensive part of the
program is the initial `find' command, which traverses the directories
given on the command line once; the results are then cached, so running
`find' twice in succession uses the cache the second time. The awk
program even trims the list of files as it goes, which makes the match
tests smaller as it runs. I've run it on paths containing 40,000+
files, and it takes less than 5 seconds on the first run and less than
1 second on the second. File sizes don't matter, since we're only doing
a `stat' call to retrieve each file's known size, not comparing
contents.

That said, I found a bug when command-line paths share similar leading
components, and realized I wasn't protecting names containing
whitespace. This version fixes both.

Try it out using the [-n] flag: it doesn't do anything but find files,
compare names, and compare sizes on duplicates. Even on non-dry-run
calls, the `rm' command is protected with the [-i] flag, which prompts
the user before deleting anything.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]

# -n: dry-run mode (report only, never delete)
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f | awk -v n=$n 'BEGIN {
    cmd = "stat -f %z "
    # build an alternation of the top-level directories: "dir1|dir2|..."
    for (x = 1; x < ARGC; x++)
        args = args ? args "|" ARGV[x] : ARGV[x];
    ARGC = 0
}
{
    # record each path; the value is the offset just past its top-level directory prefix
    files[$0] = match($0, "(" args ")/?") + RLENGTH
}
END {
    for (i in ARGV)
        sub("/*$", "/", ARGV[i])
    print "Comparing files ..."
    for (i = 1; i < x; i++)
        for (file in files)
            if (file ~ "^" ARGV[i]) {
                for (j = i + 1; j < x; j++)
                    if (ARGV[j] substr(file, files[file]) in files) {
                        dup = ARGV[j] substr(file, files[file])
                        # stat the pair only now that an apparent duplicate exists
                        cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
                        cmd "\"" dup "\"" | getline dup_s; close(cmd "\"" dup "\"")
                        if (dup_s == fil_s)
                            act("dup")
                        else
                            act("diff")
                    }
                # drop this file from the list so later passes have less to scan
                delete files[file]
            }
}
function act(message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n)
        system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"
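In case the awk is hard to follow, the size check boils down to awk's
getline-from-a-command idiom, with stat(1) doing the work. A
stand-alone illustration (the path below is only an example):

    printf '%s\n' /etc/passwd | awk '{
        cmd = "stat -f %z \"" $0 "\""   # FreeBSD stat: print size in bytes
        cmd | getline size              # read the single line of output
        close(cmd)                      # close the pipe so it can be reused
        print size, $0
    }'

Quoting the filename inside the command string is what protects names
containing whitespace.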
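For example, assuming the script is saved as dedup.sh (the name and the
paths here are just placeholders), a dry run and a real run look like:

    sh dedup.sh -n /data/photos /backup/photos    # report only, nothing deleted
    sh dedup.sh /data/photos /backup/photos       # rm -vi prompts before each delete

The way the loops are nested, the copy under the later-listed directory
is the one offered for removal.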
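The cache effect is easy to see by timing two successive dry runs over
the same trees (again, dedup.sh and the paths are just placeholders):

    /usr/bin/time sh dedup.sh -n /data/photos /backup/photos   # cold: find hits the disks
    /usr/bin/time sh dedup.sh -n /data/photos /backup/photos   # warm: metadata served from cache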
"duplicates:" : "difference:"), dup, file if (!n) system("rm -vi \"" dup "\" </dev/tty") }' "${@}" -- Sent with https://mailfence.com Secure and private email