Re: Tool to compare directories and delete duplicate files from one directory
Date: Fri, 12 May 2023 17:24:25 UTC
> ----------------------------------------
> From: Kaya Saman <kayasaman@optiplex-networks.com>
> Date: May 7, 2023, 1:25:18 PM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
> On 5/6/23 21:33, David Christensen wrote:
> > I thought I sent this, but it never hit the list (?) -- David
> >
> > On 5/4/23 21:06, Kaya Saman wrote:
> >
> >> To start with this is the directory structure:
> >>
> >> ls -lhR /tmp/test1
> >> total 1
> >> drwxr-xr-x  2 root  wheel  3B May  5 04:57 dupdir1
> >> drwxr-xr-x  2 root  wheel  3B May  5 04:57 dupdir2
> >>
> >> /tmp/test1/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel  8B Apr 30 03:17 dup
> >>
> >> /tmp/test1/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel  7B May  5 03:23 dup1
> >>
> >> ls -lhR /tmp/test2
> >> total 1
> >> drwxr-xr-x  2 root  wheel  3B May  5 04:56 dupdir1
> >> drwxr-xr-x  2 root  wheel  3B May  5 04:56 dupdir2
> >>
> >> /tmp/test2/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel  4B Apr 30 02:53 dup
> >>
> >> /tmp/test2/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel  7B Apr 30 02:47 dup1
> >>
> >> So what I want to happen is the script to recurse from the top level
> >> directories test1 and test2, then expected behavior should be to
> >> remove file dup1 as dup is different between directories.
> >
> > My previous post missed the mark, but I have been watching this thread
> > with interest (trepidation?).
> >
> > I think Tim already identified a tool that will safely get you close
> > to your goal, if not all the way:
> >
> > On 5/4/23 09:28, Tim Daneliuk wrote:
> >> I've never used it, but there is a port of fdupes in the ports tree.
> >> Not sure if it does exactly what you want though.
> >
> > fdupes(1) is also available as a package:
> >
> > 2023-05-04 21:25:31 toor@vf1 ~
> > # freebsd-version; uname -a
> > 12.4-RELEASE-p2
> > FreeBSD vf1.tracy.holgerdanske.com 12.4-RELEASE-p1 FreeBSD 12.4-RELEASE-p1 GENERIC amd64
> >
> > 2023-05-04 21:25:40 toor@vf1 ~
> > # pkg search fdupes
> > fdupes-2.2.1,1    Program for identifying or deleting duplicate files
> >
> > Looking at the man page:
> >
> > https://man.freebsd.org/cgi/man.cgi?query=fdupes&sektion=1&manpath=FreeBSD+13.2-RELEASE+and+Ports
> >
> > I am fairly certain that you will want to give the destination
> > directory as the first argument and the source directories after that:
> >
> > $ fdupes --recurse /dir /dir_1 /dir_2 /dir_3
> >
> > The above will provide you with information, but not delete anything.
> >
> > Practice under /tmp to gain familiarity with fdupes(1) is a good idea.
> >
> > As you are using ZFS, I assume you know how to take snapshots and do
> > rollbacks (?). These could serve as backup and restore operations if
> > things go badly.
> >
> > Given 12+ TB of data, you may want the --noprompt option when you do
> > give the --delete option and actual arguments,
> >
> > David
>
> Thanks David!
>
> I tried using fdupes like this but I wasn't able to see anything.
> Probably because it took so long to run and never completed? It does
> actually feature a -d flag too which does delete stuff, but from my
> testing this deletes all duplicates and doesn't allow you to choose the
> directory to delete the duplicate files from, unless I failed to
> understand the man page.
>
> At present the Perl script from Paul in its last iteration solved my
> problem and was pretty fast at the same time.
>
> Of course at first I tested it on my test dirs in /tmp, then I took zfs
> snapshots on the actual working dirs and finally ran the script. It
> worked flawlessly.
>
> Regards,
>
> Kaya

Curiosity got the better of me.  I've been searching for a project that
requires the use of multi-dimensional arrays in BSD-awk (not explicitly
supported).  But after writing it, I realized there was a more efficient
way without them (only run `stat' on files with matching paths plus
names) [nonplussed].  Here's that one.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi
echo "Building files list from ... ${@}"
find "${@}" -xdev -type f |
awk -v n=$n 'BEGIN {
    cmd = "stat -f %z "                     # BSD stat(1): %z = size in bytes
    for (x = 1; x < ARGC; x++)              # build "dir1|dir2|..." alternation
        args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0
}
{ files[$0] = match($0, "(" args ")/?") + RLENGTH }   # index of filename
END {
    for (i in ARGV) sub("/+$", "", ARGV[i])            # remove trailing-/s
    print "Comparing files ..."
    for (i = 1; i < x; i++)
        for (file in files)
            if (file ~ "^" ARGV[i]) {
                # same relative path under a lower-priority directory?
                for (j = i + 1; j < x; j++)
                    if (ARGV[j] "/" substr(file, files[file]) in files) {
                        dup = ARGV[j] "/" substr(file, files[file])
                        cmd file | getline fil_s; close(cmd file)
                        cmd dup  | getline dup_s; close(cmd dup)
                        if (dup_s == fil_s) act(file, dup, "dup")
                        else act(file, dup, "diff")
                    }
                delete files[file]
            }
}
function act(file, dup, message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi " dup "</dev/tty")
}' "${@}"

Priority is given by the order of the arguments (first highest, last
lowest).  The user is prompted to delete lower-priority dupes as they are
encountered if '-n' isn't given; otherwise it just reports what it finds.
Comparing by size and name only seems odd (a simple `diff' would be
easier).  Surprisingly, accounting for a mixture of dirnames with and
w/o trailing-slashes was a bit tricky (dir1 dir2/).  Fun challenge.
Learned a lot about awk.
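A quick usage sketch, in case anyone wants to try it; the script name
(dedupe.sh) and the tank/data dataset and paths are just placeholders for
whatever you actually have:

  # Safety first, as Kaya did: snapshot the dataset before deleting anything.
  zfs snapshot tank/data@pre-dedupe

  chmod +x dedupe.sh

  # Report only: with -n it prints "duplicates:"/"difference:" lines and
  # never prompts.
  ./dedupe.sh -n /tank/data/master /tank/data/copy1 /tank/data/copy2

  # Interactive: without -n, rm -vi asks before removing each duplicate
  # found under the lower-priority directories (copy1, copy2).
  ./dedupe.sh /tank/data/master /tank/data/copy1 /tank/data/copy2

Since matches are decided by relative path plus size only, it may be worth
spot-checking a few pairs with cmp(1) before answering "y" to the prompts.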