Date: Fri, 12 May 2023 19:24:25 +0200 (CEST)
From: Sysadmin Lists
To: questions@freebsd.org
Message-ID:
 <347612746.1721811.1683912265841@fidget.co-bxl>
In-Reply-To: <6a0aba81-485a-8985-d20d-6da58e9b5580@optiplex-networks.com>
Subject: Re: Tool to compare directories and delete duplicate files from one directory
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

> ----------------------------------------
> From: Kaya Saman
> Date: May 7, 2023, 1:25:18 PM
> To:
> Subject: Re: Tool to compare directories and
>   delete duplicate files from one directory
>
> On 5/6/23 21:33, David Christensen wrote:
> > I thought I sent this, but it never hit the list (?) -- David
> >
> > On 5/4/23 21:06, Kaya Saman wrote:
> >
> >> To start with this is the directory structure:
> >>
> >>   ls -lhR /tmp/test1
> >> total 1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:57 dupdir1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:57 dupdir2
> >>
> >> /tmp/test1/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel     8B Apr 30 03:17 dup
> >>
> >> /tmp/test1/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel     7B May  5 03:23 dup1
> >>
> >>
> >> ls -lhR /tmp/test2
> >> total 1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:56 dupdir1
> >> drwxr-xr-x  2 root  wheel     3B May  5 04:56 dupdir2
> >>
> >> /tmp/test2/dupdir1:
> >> total 1
> >> -rw-r--r--  1 root  wheel     4B Apr 30 02:53 dup
> >>
> >> /tmp/test2/dupdir2:
> >> total 1
> >> -rw-r--r--  1 root  wheel     7B Apr 30 02:47 dup1
> >>
> >>
> >> So what I want to happen is the script to recurse from the top level
> >> directories test1 and test2 then expected behavior should be to
> >> remove file dup1 as dup is different between directories.
> >
> >
> > My previous post missed the mark, but I have been watching this thread
> > with interest (trepidation?).
> >
> >
> > I think Tim already identified a tool that will safely get you close
> > to your goal, if not all the way:
> >
> > On 5/4/23 09:28, Tim Daneliuk wrote:
> >> I've never used it, but there is a port of fdupes in the ports tree.
> >> Not sure if it does exactly what you want though.
> >
> >
> > fdupes(1) is also available as a package:
> >
> > 2023-05-04 21:25:31 toor@vf1 ~
> > # freebsd-version; uname -a
> > 12.4-RELEASE-p2
> > FreeBSD vf1.tracy.holgerdanske.com 12.4-RELEASE-p1 FreeBSD
> > 12.4-RELEASE-p1 GENERIC  amd64
> >
> > 2023-05-04 21:25:40 toor@vf1 ~
> > # pkg search fdupes
> > fdupes-2.2.1,1                 Program for identifying or deleting
> > duplicate files
> >
> >
> > Looking at the man page:
> >
> > https://man.freebsd.org/cgi/man.cgi?query=fdupes&sektion=1&manpath=FreeBSD+13.2-RELEASE+and+Ports
> >
> >
> > I am fairly certain that you will want to give the destination
> > directory as the first argument and the source directories after that:
> >
> > $ fdupes --recurse /dir /dir_1 /dir_2 /dir_3
> >
> >
> > The above will provide you with information, but not delete anything.
> >
> >
> > Practice under /tmp to gain familiarity with fdupes(1) is a good idea.
> >
> >
> > As you are using ZFS, I assume you know how to take snapshots and do
> > rollbacks (?).  These could serve as backup and restore operations if
> > things go badly.
> >
> >
> > Given a 12+ TB of data, you may want the --noprompt option when you do
> > give the --delete option and actual arguments,
> >
> >
> > David
> >
>
> Thanks David!
>
>
> I tried using fdupes like this but I wasn't able to see anything.
> Probably because it took so long to run and never completed? It does
> actually feature a -d flag too which does delete stuff but from my
> testing this deletes all duplicates and doesn't allow you to choose the
> directory to delete the duplicate files from, unless I failed to
> understand the man page.
>
>
> At present the Perl script from Paul in it's last iteration solved my
> problem and was pretty fast at the same time.
>
>
> Of course at first I tested it on my test dirs in /tmp, then I took zfs
> snapshots on the actual working dirs and finally ran the script. It
> worked flawlessly.
>
>
> Regards,
>
>
> Kaya
>

Curiosity got the better of me. I've been searching for a project that
requires the use of multi-dimensional arrays in BSD-awk (not explicitly
supported). But after writing it, I realized there was a more efficient way
without them (only run `stat' on files with matching paths plus names)
[nonplussed].

Here's that one.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from ... ${@}"

find "${@}" -xdev -type f |
awk -v n=$n 'BEGIN { cmd = "stat -f %z "
for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
     { files[$0] = match($0, "(" args ")/?") + RLENGTH }  # index of filename
END  { for (i in ARGV) sub("/+$", "", ARGV[i])            # remove trailing-/s
       print "Comparing files ..."
       for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
            for (j = i +1; j < x; j++)
                 if (ARGV[j] "/" substr(file, files[file]) in files) {
                     dup = ARGV[j] "/" substr(file, files[file])
                     cmd file | getline fil_s; close(cmd file)
                     cmd dup  | getline dup_s; close(cmd dup)
                     if (dup_s == fil_s) act(file, dup, "dup")
                     else act(file, dup, "diff")
                 }
            delete files[file]
     } }

function act(file, dup, message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi " dup "