Re: Tool to compare directories and delete duplicate files from one directory

From: Sysadmin Lists <sysadmin.lists_at_mailfence.com>
Date: Mon, 15 May 2023 22:26:07 UTC
> ----------------------------------------
> From: David Christensen <dpchrist@holgerdanske.com>
> Date: May 15, 2023, 1:43:38 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
> 
> 
> It looks like your script only finds duplicates when the subpath is 
> identical (?):
> 

Yeah. Wasn't that the original problem description? I went off the example
given by Paul earlier in this thread, and it looked like only files with
matching subpaths were being considered (because the OP accidentally rsync'd
files from a source to a bunch of destination dirs).

If we're simply looking for files that have the same name anywhere in the set
of dirs, and then comparing their sizes to decide whether they're assumed (!)
duplicates or different files, that's way easier to program.
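
For what it's worth, here is a minimal sketch of that simpler approach, not a
tested replacement for the script in this thread. The function name
`same_name_size' is made up for illustration, and it assumes path names
contain no tabs or newlines. It uses `wc -c' for the size so that it does not
depend on BSD or GNU `stat' syntax, and as above, equal size only means
"assumed" duplicate.

```shell
#!/bin/sh
# Sketch only: report files that share both a basename and a byte size
# anywhere under the given dirs. "same_name_size" is a hypothetical name.
# Assumes no tabs or newlines in path names.
# usage: same_name_size dir1 dir2 ...
same_name_size() {
    # Emit "<size><TAB><path>" for every regular file found.
    find "$@" -type f -exec sh -c '
        for f do
            printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done' sh {} + |
    awk -F"\t" '{
        name = $2; sub(".*/", "", name)   # strip the directory part
        key = name SUBSEP $1              # basename plus size
        if (key in seen) print "duplicates:", $2, seen[key]
        else seen[key] = $2               # first file seen with this key
    }'
}
```

Files with the same name but different sizes are simply not reported here;
printing a "difference:" line for them would need a second array keyed on the
basename alone.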

As a side note on performance, I ran the program on a set of 8 dirs containing
over 750,000 files and 300G of data. Here are the results:

real    0m10.791s
user    0m5.361s
sys     0m5.928s

And here are the results for counting the files in the dirs using `wc':

real    0m12.464s
user    0m0.834s
sys     0m11.671s

That means the program processed the list of files quicker than `wc' could
count them, which is wild. Obviously, as the number of apparent duplicates
grows, the number of `stat' calls increases, and the run-time will, too.
But this shows how efficient awk is at comparing strings.

> 2023-05-15 01:38:20 dpchrist@vf1 /vf1zpool1/dpchrist
> $ cp -Ra foo bar
> 
> 2023-05-15 01:39:18 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
> 
> 2023-05-15 01:39:41 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>        26      24      82
> 
> 2023-05-15 01:39:44 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>        26      24      82
> 
> 2023-05-15 01:40:10 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
> 
> 2023-05-15 01:40:22 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>        26      24      82
> 
> 2023-05-15 01:40:29 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
>        26      24      82
> 
> 2023-05-15 01:40:34 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> remove bar/1/2/a? n
> duplicates: bar/1/i-j foo/1/i-j
> remove bar/1/i-j? n
> duplicates: bar/1/2/e foo/1/2/e
> remove bar/1/2/e? n
> duplicates: bar/1/a-b foo/1/a-b
> remove bar/1/a-b? n
> duplicates: bar/1/g foo/1/g
> remove bar/1/g? n
> duplicates: bar/1/2/i foo/1/2/i
> remove bar/1/2/i? n
> duplicates: bar/q-r foo/q-r
> remove bar/q-r? n
> duplicates: bar/m-n foo/m-n
> remove bar/m-n? n
> duplicates: bar/1/2/m foo/1/2/m
> remove bar/1/2/m? n
> duplicates: bar/c foo/c
> remove bar/c? n
> duplicates: bar/e-f foo/e-f
> remove bar/e-f? n
> duplicates: bar/1/s foo/1/s
> remove bar/1/s? n
> duplicates: bar/k foo/k
> remove bar/k? n
> duplicates: bar/o foo/o
> remove bar/o? n
> duplicates: bar/q foo/q
> remove bar/q? n
> duplicates: bar/1/c-d foo/1/c-d
> remove bar/1/c-d? n
> duplicates: bar/1/2/s-t foo/1/2/s-t
> remove bar/1/2/s-t? n
> duplicates: bar/1/2/o-p foo/1/2/o-p
> remove bar/1/2/o-p? n
> duplicates: bar/1/2/k-l foo/1/2/k-l
> remove bar/1/2/k-l? n
> duplicates: bar/g-h foo/g-h
> remove bar/g-h? n
> 
> 
> David
> 

Thanks for running that test. It's working as designed. However, that version
doesn't check whether the apparent duplicate is literally the same file (same
inode) reached through an overlapping directory or a hard link. This one does
(although it might be a moot point if I misunderstood the original problem).

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f |
awk -d1 -v n=$n 'BEGIN { cmd = "stat -f \"%i %z\" "
for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
     { files[$0] = match($0, "(" args ")/?") + RLENGTH }
END  { for (i in ARGV) sub("/*$", "/", ARGV[i])
       print "Comparing files ..."
       for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
           for (j = i +1; j < x; j++)
               if (ARGV[j] substr(file, files[file]) in files) {
                   dup = ARGV[j] substr(file, files[file])
                   cmd "\"" file "\"" | getline; close(cmd "\"" file "\"")
                   fil_i = $1; fil_s = $2
                   cmd "\"" dup  "\"" | getline; close(cmd "\"" dup  "\"")
                   dup_i = $1; dup_s = $2
                   if (fil_i == dup_i) continue
                   if (fil_s == dup_s) { act("dup") } else act("diff") }
           delete files[file]
     } }
function act(message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"
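
To show the same-inode idea in isolation, here is a small sketch. The helper
name `same_inode' is hypothetical, and it uses `ls -di' instead of the
`stat -f' call in the script so that it also runs where BSD stat is not
available. It assumes both paths live on the same filesystem, since inode
numbers are only unique within one.

```shell
#!/bin/sh
# Sketch only: succeed if two paths refer to the same inode.
# "same_inode" is a hypothetical helper, not part of the script above.
# Assumes both paths are on the same filesystem.
same_inode() {
    # ls -di prints "<inode> <name>"; keep only the inode number.
    a=$(ls -di -- "$1" | awk '{ print $1 }')
    b=$(ls -di -- "$2" | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A hard link made with `ln file link' should compare as the same inode, while
a copy made with `cp' should not, which is why the script skips the former
and only reports the latter.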

