Date: Fri, 19 May 2023 19:19:33 +0200 (CEST)
From: Sysadmin Lists <sysadmin.lists@mailfence.com>
To: questions@freebsd.org
Message-ID: <2055648982.2509909.1684516773170@fidget.co-bxl>
In-Reply-To: <126434505.494354.1684104532813@ichabod.co-bxl>
Subject: Re: Tool to compare directories and delete duplicate files from one directory
List-Archive: https://lists.freebsd.org/archives/freebsd-questions
> ----------------------------------------
> From: Sysadmin Lists
> Date: May 14, 2023, 3:48:52 PM
> To:
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
> "[...] looking for files that have the same name anywhere in the set of
> dirs, then comparing their sizes [...] that's way easier to program."

It would have been, but I decided to mimic fdupes and jdupes instead. I
figured that for a scripted language to have any chance against a compiled C
program, it would need to make comparatively fewer hash checks and, for awk
specifically, to use awk's array-membership test (if (elem in array)) instead
of for-looping over lists of files and performing string comparisons. I used
a mixture of both. Performance is pretty good:

$ time dedup_multidirs.sh -V dedup{1..13}
DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
# 773723 differences: same filenames, different sizes or hashes

real    1m32.719s
user    0m50.671s
sys     0m44.054s

$ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
219195746  # 200G+ of data

What I do is:

  i)  only process files that have more than one name-match in the set of dirs
 ii)  use the least-expensive tests first: `stat' to check file inodes and sizes
iii)  generate and store hashes only after a possible duplicate is found
 iv)  perform file-dir membership tests to abort for-loops early
  v)  keep track of the number of per-filename children in each dir to abort early
 vi)  "if A = B, and A = C, then B = C" -- i.e., we report A and B as dups, and
      A and C as dups, but not B and C; saves one comparison
vii)  likewise, "if A != B, and A = C, then B != C"; saves another comparison

- if a duplicate is found, the user is prompted to delete the lower-priority
  dup, determined by the order of arguments: highest first, lowest last
  (dir1 dir2).
- if a name-only match is found (different sizes or hashes), it is reported
  but not deleted.
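The array-membership trick mentioned above can be shown in isolation (a
minimal sketch; the `count' array name is illustrative, not taken from the
script): a keyed "in" lookup replaces a for-loop of string comparisons when
checking whether a filename has already been seen.

```shell
# Sketch of awk's array-membership test for name-matching.
# The 'count' array is illustrative only, not the script's variable.
printf '%s\n' a/f.txt b/f.txt b/g.txt | awk '
{
    name = substr($0, match($0, /[^\/]+$/))   # strip the directory part
    if (name in count)                        # O(1) lookup, no string loop
        print "name-match:", $0
    count[name]++
}'
# prints: name-match: b/f.txt
```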
- Safety first: safeguards are turned on by default and must be turned off
  with options: N = no dry-run, V = no messages, I = no prompt before
  deleting files; an optional DEBUG=1 turns on debugging:
  DEBUG=1 ./program dir1 dir2 ...
- to dedup a single dir, list it twice: ./program dir1 dir1

Resource usage is tiny: each system() call is closed immediately, releasing
its resources. Although thousands of `stat' and `xxhash' calls are made, they
are made sequentially, and the memory footprint of the entire run is minimal.
During testing the program used less than 125M of RAM, and disk I/O was tiny.

Caveat: it's probably riddled with bugs that my limited testing didn't catch.
Limitation: two files with different names are never compared, even if their
contents are identical: cp -a file1 file2

It probably has limited applicability, but it was a fun head-scratcher (to do
well). Final caveat: it might not even do what I described here, but it wants
to! :P

Flame-war starter: would love to see you Perl fanbois beat those numbers. ;)

#!/bin/sh -e
# remove or report duplicate files: $0 [-VIN] dir1 dir2 ... dir[n]

if [ "X${1%%-*}" = "X" ]; then opts=$1; shift; fi
echo "Building files list from: ${@}"

find "${@}" -xdev -type f | awk -v opts=$opts -v e=$DEBUG '
BEGIN {
    stat = "stat -f \"%i %z\" "; rm = "rm -i "
    if (opts) { split(opts, flags, ""); for (f in flags) o[flags[f]] }
    if ("V" in o) V = 1; if ("I" in o) rm = "rm -v "; if ("N" in o) N = 1
    for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]
    if (ARGV[1] == ARGV[2]) { ch = 2 } else ch = 1
    ARGC = 0
}
{
    filename = substr($0, match($0, /[^\/]+$/))      # basename
    dups[filename, ++files[filename]] = $0           # per-name path list
    basepath = substr($0, match($0, "(" args ")/?"), RLENGTH)
    sub("/*$", "/", basepath); hasf[basepath, filename]++
    total++
}
END {
    for (i in ARGV) sub("/*$", "/", ARGV[i])
    system("date")
    print "Comparing files."
    for (file in files)
        if (files[file] > ch)
            for (i = 1; i < x; i++) {
                if (ARGV[i] SUBSEP file in hasf)
                    for (j = 1; j < files[file]; j++)
                        if (dups[file, j] ~ "^" ARGV[i] &&
                            ! (dups[file, j] in processed)) {
                            for (k = i + 1; k < x; k++) {
                                if (ARGV[k] SUBSEP file in hasf) {
                                    getstats(dups[file, j])
                                    inode = $1; size = $2
                                    for (l = 1; l <= files[file]; l++) {
                                        if (e) debug(1)
                                        if (dups[file, j] == dups[file, l]) continue
                                        if (dups[file, l] ~ "^" ARGV[k]) {
                                            if (dups[file, l] in processed) continue
                                            f = dups[file, j]; d = dups[file, l]
                                            getstats(d)
                                            if (e) debug(2)
                                            if (inode == $1) {
                                                # same inode: hard link, not a copy
                                            } else if (size == $2) {
                                                hashcheck(f, d)
                                                processed[d]
                                                hits++
                                            } else act("diff")
                                            if (c++ == hasf[ARGV[k], file]) break
                                        }
                                    }
                                }
                            }
                            if (e) debug(3)
                            processed[dups[file, j]]; delete dups[file, j]
                        }
            }
    system("date")
    printf("DEBUG: %d %s, %d %s, %d %s, %d %s\n", total, "files",
        hits, "duplicates", (total - hits), "unique", statcalls, "stat calls")
}
function act(message) {
    if (e) debug(4)
    if (message == "dup") {
        if (!V) print "duplicates: " d, "\t", f
        if ( N) system(rm "\"" d "\"
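The cheap-before-expensive ordering from points ii) and iii) can be sketched
on its own in plain sh. This is a hypothetical helper (the `same_file' name
is mine, not the script's), with cmp(1) standing in for the xxhash step;
note that -ef is a common test(1) extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch of the comparison ladder: inode test, then size test, then
# (only if sizes match) a full content comparison. 'same_file' is a
# hypothetical helper; cmp stands in for the real script's hash check.
same_file() {
    a=$1 b=$2
    # 1) same inode: a hard link, nothing to deduplicate
    if [ "$a" -ef "$b" ]; then echo "same inode"; return 0; fi
    # 2) different sizes: cannot be duplicates, skip the expensive step
    sa=$(wc -c < "$a"); sb=$(wc -c < "$b")
    if [ "$sa" -ne "$sb" ]; then echo "sizes differ"; return 1; fi
    # 3) same size: pay for a content comparison only now
    if cmp -s "$a" "$b"; then echo "duplicate"; else echo "name-only match"; fi
}
```

For example, after `ln a d', `same_file a d' reports "same inode" without
reading either file's contents.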