Re: Tool to compare directories and delete duplicate files from one directory
Date: Mon, 15 May 2023 22:26:07 UTC
> ----------------------------------------
> From: David Christensen <dpchrist@holgerdanske.com>
> Date: May 15, 2023, 1:43:38 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> It looks like your script only finds duplicates when the subpath is
> identical (?):
>

Yeah. Wasn't that the original problem description? I went off the
example given by Paul earlier in this thread, and it looked like only
files with matching subpaths were being considered (because the OP
accidentally rsync'd files from a source to a bunch of destination dirs).

If we're simply looking for files that have the same name anywhere in
the set of dirs, and then comparing their sizes to decide whether they're
assumed (!) duplicates or differ in size, that's way easier to program.

As a side note on performance, I ran the program on a set of 8 dirs
containing over 750,000 files and 300G of data. Here are the results:

real 0m10.791s
user 0m5.361s
sys 0m5.928s

And here are the results for counting the files in the dirs using `wc':

real 0m12.464s
user 0m0.834s
sys 0m11.671s

That means the program processed the list of files quicker than `wc'
could count them, which is wild. Obviously, as the number of apparent
duplicates encountered grows, the number of `stat' calls increases, and
the run-time will, too. But this shows how efficient awk is at comparing
strings.

> 2023-05-15 01:38:20 dpchrist@vf1 /vf1zpool1/dpchrist
> $ cp -Ra foo bar
>
> 2023-05-15 01:39:18 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:39:41 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>       26      24      82
>
> 2023-05-15 01:39:44 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>       26      24      82
>
> 2023-05-15 01:40:10 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:40:22 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>       26      24      82
>
> 2023-05-15 01:40:29 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>       26      24      82
>
> 2023-05-15 01:40:34 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> remove bar/1/2/a? n
> duplicates: bar/1/i-j foo/1/i-j
> remove bar/1/i-j? n
> duplicates: bar/1/2/e foo/1/2/e
> remove bar/1/2/e? n
> duplicates: bar/1/a-b foo/1/a-b
> remove bar/1/a-b? n
> duplicates: bar/1/g foo/1/g
> remove bar/1/g? n
> duplicates: bar/1/2/i foo/1/2/i
> remove bar/1/2/i? n
> duplicates: bar/q-r foo/q-r
> remove bar/q-r? n
> duplicates: bar/m-n foo/m-n
> remove bar/m-n? n
> duplicates: bar/1/2/m foo/1/2/m
> remove bar/1/2/m? n
> duplicates: bar/c foo/c
> remove bar/c? n
> duplicates: bar/e-f foo/e-f
> remove bar/e-f? n
> duplicates: bar/1/s foo/1/s
> remove bar/1/s? n
> duplicates: bar/k foo/k
> remove bar/k? n
> duplicates: bar/o foo/o
> remove bar/o? n
> duplicates: bar/q foo/q
> remove bar/q? n
> duplicates: bar/1/c-d foo/1/c-d
> remove bar/1/c-d? n
> duplicates: bar/1/2/s-t foo/1/2/s-t
> remove bar/1/2/s-t? n
> duplicates: bar/1/2/o-p foo/1/2/o-p
> remove bar/1/2/o-p? n
> duplicates: bar/1/2/k-l foo/1/2/k-l
> remove bar/1/2/k-l? n
> duplicates: bar/g-h foo/g-h
> remove bar/g-h? n
>
>
> David
>

Thanks for running that test. It's working as designed. However, it
doesn't check if the apparent duplicate is literally the same file (same
inode) encountered through an overlapping directory, or a hard link.
This one does (although it might be a moot point if I misunderstood the
original problem).

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f |
awk -d1 -v n=$n 'BEGIN {
	# stat(1) format: inode number and size in bytes
	cmd = "stat -f \"%i %z\" "
	# build an alternation of the directory arguments: dir1|dir2|...
	for (x = 1; x < ARGC; x++)
		args = args ? args "|" ARGV[x] : ARGV[x]
	ARGC = 0
}
{
	# value: position in the path where the subpath below the dir starts
	files[$0] = match($0, "(" args ")/?") + RLENGTH
}
END {
	# give every directory argument exactly one trailing slash
	for (i in ARGV) sub("/*$", "/", ARGV[i])
	print "Comparing files ..."
	for (i = 1; i < x; i++)
		for (file in files)
			if (file ~ "^" ARGV[i]) {
				for (j = i +1; j < x; j++)
					if (ARGV[j] substr(file, files[file]) in files) {
						dup = ARGV[j] substr(file, files[file])
						cmd "\"" file "\"" | getline; close(cmd "\"" file "\"")
						fil_i = $1; fil_s = $2
						cmd "\"" dup "\"" | getline; close(cmd "\"" dup "\"")
						dup_i = $1; dup_s = $2
						# same inode: same file (hard link or overlapping dir), skip
						if (fil_i == dup_i) continue
						if (fil_s == dup_s) { act("dup") } else act("diff")
					}
				delete files[file]
			}
}
function act(message) {
	print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
	if (!n) system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"
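
Since equal size only makes two files assumed (!) duplicates, a possible extra safeguard is to confirm that the contents really match before removing anything. What follows is only a minimal sketch, not part of the script above: the helper name `is_true_dup' is made up for illustration, and it assumes a POSIX sh with `ls -i' and `cmp'. It compares inode numbers only, so two unrelated files on different filesystems that happen to share an inode number would be reported as "not a duplicate", which errs on the side of keeping the file.

```sh
#!/bin/sh
# Hypothetical helper (not part of the script above):
#   is_true_dup FILE DUP
# Succeeds only when DUP is a different inode than FILE and both have
# byte-for-byte identical contents.
is_true_dup() {
	# "ls -i" prints the inode number first; -d guards against
	# listing the contents of a directory by mistake.
	ino1=$(ls -di -- "$1" | awk '{ print $1 }')
	ino2=$(ls -di -- "$2" | awk '{ print $1 }')

	# Same inode number: treat as the same file (hard link or
	# overlapping directory), so it must not be removed.
	[ "$ino1" != "$ino2" ] || return 1

	# cmp -s prints nothing and exits 0 only if the contents match.
	cmp -s -- "$1" "$2"
}
```

It could be used by hand on any pair the script reports, for example: is_true_dup foo/1/2/a bar/1/2/a && rm -i bar/1/2/a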