From: David Christensen
Date: Mon, 15 May 2023 01:29:38 -0700
Subject: Re: Tool to compare directories and delete duplicate files from one directory
To: questions@freebsd.org
List-Id: User questions
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

On 5/14/23 15:48, Sysadmin Lists wrote:
> #!/bin/sh -e
> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
> if [ "X$1" = "X-n" ]; then n=1; shift; fi
>
> echo "Building files list from: ${@}"
>
> find "${@}" -xdev -type f |
> awk -v n=$n 'BEGIN { cmd = "stat -f %z "
>   for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
>   { files[$0] = match($0, "(" args ")/?") + RLENGTH }
>   END { for (i in ARGV) sub("/*$", "/", ARGV[i])
>     print "Comparing files ..."
>     for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
>       for (j = i +1; j < x; j++)
>         if (ARGV[j] substr(file, files[file]) in files) {
>           dup = ARGV[j] substr(file, files[file])
>           cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
>           cmd "\"" dup "\"" | getline dup_s; close(cmd "\"" dup "\"")
>           if (dup_s == fil_s) act("dup")
>           else act("diff") }
>       delete files[file]
>   } }
>   function act(message) {
>     print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
>     if (!n) system("rm -vi \"" dup "\" }' "${@}"

A virtual machine for testing:

2023-05-15 00:59:39 dpchrist@vf1 /vf1zpool1/dpchrist
$ freebsd-version ; uname -a ; perl -v | grep .
| head -n 1
12.4-RELEASE-p2
FreeBSD vf1.tracy.holgerdanske.com 12.4-RELEASE-p1 FreeBSD 12.4-RELEASE-p1 GENERIC amd64
This is perl 5, version 32, subversion 1 (v5.32.1) built for amd64-freebsd-thread-multi

A Perl script to generate a test tree (tuned to generate a small tree by default):

2023-05-15 01:09:12 dpchrist@vf1 /vf1zpool1/dpchrist
$ cat ~/bin/t_dir_tree
#!/usr/bin/env perl
# $Id: t_dir_tree,v 1.4 2023/05/15 08:09:08 dpchrist Exp $
# Generate tree of random directories and files with duplicates
# By David Paul Christensen dpchrist@holgerdanske.com
# Public Domain
use strict;
use warnings;
use File::Path qw( make_path );
use Getopt::Long;
my $dd = '/usr/bin/env dd';
my $d=3;
my $f=10;
my $m=1E3;
my $u=1;
GetOptions('d=i'=>\$d,'f=i'=>\$f,'m=i'=>\$m,'u=i'=>\$u) && @ARGV == 1
    or die "Usage: t_dir_tree [-d=NDIR] [-f=NFILE] [-m=MAXFILESIZE]",
           " [-u=MAXDUP] PATH";
my $p = shift;
die "$0: refusing to overwrite existing path '$p'" if -e $p;
my %dp = (0 => $p);
map {$dp{$_} = $dp{int(rand($_))} . "/$_"} 1 .. $d-1;
print map {"$_ directory$/"} make_path(values %dp);
my $n = 'a';
for (0 .. $f-1) {
    my $nsave = $n;
    my $of = $dp{int(rand($d))} . '/' . $n++;
    my $bs = int(rand($m));
    print "$of file size=$bs$/";
    qx($dd if=/dev/random of=$of bs=$bs count=1 2>/dev/null);
    die if $?;
    for (0 .. int(rand($u))) {
        my $df = $dp{int(rand($d))} . '/' . $nsave . '-' .
            $n++;
        print "$df file size=$bs$/";
        qx($dd if=$of of=$df bs=$bs count=1 2>/dev/null);
        die if $?;
    }
}

Create a test tree:

2023-05-15 01:10:29 dpchrist@vf1 /vf1zpool1/dpchrist
$ t_dir_tree foo
foo directory
foo/1 directory
foo/1/2 directory
foo/1/2/a file size=784
foo/1/a-b file size=784
foo/c file size=655
foo/1/c-d file size=655
foo/1/2/e file size=885
foo/e-f file size=885
foo/1/g file size=267
foo/g-h file size=267
foo/1/2/i file size=438
foo/1/i-j file size=438
foo/k file size=902
foo/1/2/k-l file size=902
foo/1/2/m file size=520
foo/m-n file size=520
foo/o file size=91
foo/1/2/o-p file size=91
foo/q file size=928
foo/q-r file size=928
foo/1/s file size=22
foo/1/2/s-t file size=22

Do a recursive listing and word count of tree to monitor for changes:

2023-05-15 01:16:27 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
      26      24      82

fdupes(1) finds duplicates and does not change the tree:

2023-05-15 01:16:32 dpchrist@vf1 /vf1zpool1/dpchrist
$ fdupes -fr foo | grep .
foo/q-r
foo/e-f
foo/o
foo/m-n
foo/1/2/a
foo/c
foo/1/2/i
foo/1/2/s-t
foo/g-h
foo/1/2/k-l

2023-05-15 01:17:57 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
      26      24      82

Your script does not appear to do anything (?):

2023-05-15 01:19:00 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo
Building files list from: foo
Comparing files ...

2023-05-15 01:19:33 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
      26      24      82

2023-05-15 01:19:35 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo
Building files list from: foo
Comparing files ...

2023-05-15 01:19:48 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
      26      24      82

David
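
P.S. For comparison, duplicate content inside a single tree can also be listed with POSIX tools alone. The following is a minimal sketch, not how fdupes(1) is implemented and not a fix for the quoted script. The function name list_dupes is hypothetical. It treats two files as duplicates when cksum(1) reports the same CRC and byte count, so a collision is possible, and it assumes that file names contain no newlines.

```shell
#!/bin/sh
# Sketch: print every file under a tree whose content matches an earlier file.
# "Matches" here means same cksum(1) CRC and same byte count; confirm with
# cmp(1) before deleting anything, because CRC collisions can occur.
list_dupes() {
    find "$1" -xdev -type f -exec cksum {} + |
    sort |
    awk '{
        key = $1 " " $2                     # CRC and size stand in for content
        name = $0
        sub(/^[0-9]+ [0-9]+ /, "", name)    # keep the whole path, spaces included
        if (key in seen) print name         # report second and later copies only
        else seen[key] = name               # first copy of this content is kept
    }'
}
```

Running `list_dupes foo` on the test tree above should print one path per extra copy, much like `fdupes -fr foo`, although which copy of each pair is reported may differ. Nothing is deleted; the output is only a report.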