Re: Tool to compare directories and delete duplicate files from one directory

From: Sysadmin Lists <sysadmin.lists_at_mailfence.com>
Date: Fri, 12 May 2023 17:24:25 UTC
> ----------------------------------------
> From: Kaya Saman <kayasaman@optiplex-networks.com>
> Date: May 7, 2023, 1:25:18 PM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
> 
> 
> 
> On 5/6/23 21:33, David Christensen wrote:
> > I thought I sent this, but it never hit the list (?) -- David
> >
> >
> > On 5/4/23 21:06, Kaya Saman wrote:
> >
> >> To start with this is the directory structure:
> >>
> >>
> >>   ls -lhR /tmp/test1
> >> total 1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:57 dupdir1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:57 dupdir2
> >>
> >> /tmp/test1/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel     8B Apr 30 03:17 dup
> >>
> >> /tmp/test1/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel     7B May  5 03:23 dup1
> >>
> >>
> >> ls -lhR /tmp/test2
> >> total 1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:56 dupdir1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:56 dupdir2
> >>
> >> /tmp/test2/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel     4B Apr 30 02:53 dup
> >>
> >> /tmp/test2/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel     7B Apr 30 02:47 dup1
> >>
> >>
> >> So what I want to happen is for the script to recurse from the
> >> top-level directories test1 and test2; the expected behavior is to
> >> remove the file dup1, since dup differs between the directories.
> >
> >
> > My previous post missed the mark, but I have been watching this thread 
> > with interest (trepidation?).
> >
> >
> > I think Tim already identified a tool that will safely get you close 
> > to your goal, if not all the way:
> >
> > On 5/4/23 09:28, Tim Daneliuk wrote:
> >> I've never used it, but there is a port of fdupes in the ports tree.
> >> Not sure if it does exactly what you want though.
> >
> >
> > fdupes(1) is also available as a package:
> >
> > 2023-05-04 21:25:31 toor@vf1 ~
> > # freebsd-version; uname -a
> > 12.4-RELEASE-p2
> > FreeBSD vf1.tracy.holgerdanske.com 12.4-RELEASE-p1 FreeBSD 
> > 12.4-RELEASE-p1 GENERIC  amd64
> >
> > 2023-05-04 21:25:40 toor@vf1 ~
> > # pkg search fdupes
> > fdupes-2.2.1,1                 Program for identifying or deleting 
> > duplicate files
> >
> >
> > Looking at the man page:
> >
> > https://man.freebsd.org/cgi/man.cgi?query=fdupes&sektion=1&manpath=FreeBSD+13.2-RELEASE+and+Ports 
> >
> >
> >
> > I am fairly certain that you will want to give the destination 
> > directory as the first argument and the source directories after that:
> >
> > $ fdupes --recurse /dir /dir_1 /dir_2 /dir_3
> >
> >
> > The above will provide you with information, but not delete anything.
> >
> >
> > Practice under /tmp to gain familiarity with fdupes(1) is a good idea.
> >
> >
> > As you are using ZFS, I assume you know how to take snapshots and do 
> > rollbacks (?).  These could serve as backup and restore operations if 
> > things go badly.
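> >
> > For example (the dataset name tank/data here is hypothetical):
> >
> > # zfs snapshot tank/data@before-dedup
> > # zfs rollback tank/data@before-dedup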
> >
> >
> > Given 12+ TB of data, you may want the --noprompt option when you do
> > give the --delete option and actual arguments.
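> >
> > Something like this (destructive; --noprompt preserves the first file
> > in each set of duplicates and deletes the rest without asking):
> >
> > $ fdupes --recurse --delete --noprompt /dir /dir_1 /dir_2 /dir_3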
> >
> >
> > David
> >
> 
> Thanks David!
> 
> 
> I tried using fdupes like this, but I wasn't able to see anything,
> probably because it took so long to run that it never completed. It does
> feature a -d flag too, which does delete files, but from my testing it
> deletes all duplicates and doesn't let you choose which directory the
> duplicates are deleted from, unless I failed to understand the man page.
> 
> 
> At present, the Perl script from Paul in its last iteration has solved
> my problem and was pretty fast at the same time.
> 
> 
> Of course, at first I tested it on my test dirs in /tmp; then I took zfs
> snapshots of the actual working dirs and finally ran the script. It
> worked flawlessly.
> 
> 
> Regards,
> 
> 
> Kaya
> 
> 

Curiosity got the better of me. I'd been searching for a project that
required multi-dimensional arrays in BSD awk (not explicitly supported),
but after writing it I realized there was a more efficient way without
them: only run `stat' on files whose relative paths and names match
[nonplussed]. Here's that version.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
# -n: report only; never prompt to delete anything
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from ... ${@}"

find "${@}" -xdev -type f |
awk -v n="$n" 'BEGIN { cmd = "stat -f %z "   # BSD stat: size in bytes
     # join the dir arguments into one alternation regex (dir1|dir2|...),
     # then zero ARGC so awk reads the find output on stdin, not the dirs;
     # NB: dir names are used as regexes, so metacharacters will misbehave
     for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]
     ARGC = 0 }
     { files[$0] = match($0, "(" args ")/?") + RLENGTH }  # index of relative name
END  { for (i in ARGV) sub("/+$", "", ARGV[i])            # remove trailing-/s
       print "Comparing files ..."
       for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
            for (j = i + 1; j < x; j++)
                 # same relative name under a lower-priority dir?
                 if (ARGV[j] "/" substr(file, files[file]) in files) {
                     dup = ARGV[j] "/" substr(file, files[file])
                     # compare sizes; double-quote the paths so spaces
                     # survive the shell (quotes in names would still break)
                     cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
                     cmd "\"" dup "\""  | getline dup_s; close(cmd "\"" dup "\"")
                     if (dup_s == fil_s) act(file, dup, "dup")
                     else act(file, dup, "diff") }
            delete files[file]
     } }

function act(file, dup, message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"

Priority is given by the order of the arguments (first highest, last
lowest). Unless '-n' is given, the user is prompted to delete each
lower-priority dupe as it's encountered; with '-n' it just reports what
it finds. Comparing by name and size only seems odd (a simple `diff'
would be a surer test). Surprisingly, accounting for a mixture of
dirnames with and without trailing slashes (dir1 dir2/) was the tricky
part.
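
For example, a dry run over one primary dir and two backups (the paths
and script name are made up) would go like this:

$ sh dedup.sh -n /tank/main /tank/backup1 /tank/backup2
Building files list from ... /tank/main /tank/backup1 /tank/backup2
Comparing files ...
duplicates: /tank/backup1/dupdir2/dup1 /tank/main/dupdir2/dup1
difference: /tank/backup2/dupdir1/dup /tank/main/dupdir1/dup

Drop '-n' and it prompts with `rm -vi' before removing each duplicate.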

Fun challenge. Learned a lot about awk.



