Date: Mon, 15 May 2023 00:48:52 +0200 (CEST)
From: Sysadmin Lists
To: questions@freebsd.org
Subject: Re: Tool to compare directories and delete duplicate files from one directory

>
> ----------------------------------------
> From: David Christensen
> Date: May 13, 2023, 6:55:26 PM
> To:
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> I wrestled with a Perl script years ago when I did not know of
> fdupes(1), jdupes(1), etc. Brute force O(N^2) comparison worked for
> toy datasets, but was impractical when I applied it to a directory
> containing thousands of files and hundreds of gigabytes. (The OP
> mentioned 12 TB.) Practical considerations of run time, memory usage,
> disk I/O, etc., drove me to find the kinds of optimizations fdupes(1)
> and jdupes(1) mention.
>
>
> I do not know Awk, so it is hard to comment on your script. I suggest
> commenting out any create/update/delete code, running the script against
> larger and larger datasets, and seeing what optimizations you can add.
>
>
> David

All good points, and why I rewrote it without multi-dimensional arrays.
Initially, `stat' was run on each file encountered, and sizes were then
compared on matched path/filename pairs. The multi-dimensional arrays
stored the filenames, paths, and sizes (hence, multi-d). But that's
wasteful since we only care about size if there are duplicates somewhere.
This version runs `stat' only if an apparent duplicate is found, which
cuts down the `stat' calls significantly.

The reason awk is so efficient on these types of tasks is that it's doing
string comparisons and string manipulation, which are very efficient when
done properly. The most resource-intensive part of the program is the
initial `find' command, which traverses the directories given on the
command line once; the operating system then caches what it finds
(running `find' twice in succession uses the cache the second time). The
script even trims the list of files as it goes, which makes the match
tests smaller as it runs. I've run it on paths containing 40,000+ files
and it takes less than 1 second on the second run, and less than
5 seconds on its first.
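To make the "only look up sizes for apparent duplicates" idea concrete, here is a rough sketch. It is not the script from this message: `compare_dirs' is a hypothetical helper that handles exactly two directories, it uses `wc -c' as a portable stand-in for `stat -f %z', and it assumes paths contain no newlines or backslashes. Like the real script, it compares names and sizes only, never contents.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Usage: compare_dirs dir1 dir2
compare_dirs() {
    a=${1%/} b=${2%/}
    find "$a" "$b" -type f |
    awk -v a="$a/" -v b="$b/" '
        # Key each path by its name relative to the directory it came from.
        index($0, a) == 1 { in_a[substr($0, length(a) + 1)] = 1; next }
        index($0, b) == 1 { in_b[substr($0, length(b) + 1)] = 1 }
        # Only names present under both directories are candidates.
        END { for (rel in in_a) if (rel in in_b) print rel }
    ' | sort |
    while IFS= read -r rel; do
        # Sizes are looked up for candidates only, never for every file.
        size_a=$(wc -c < "$a/$rel" | tr -d ' ')
        size_b=$(wc -c < "$b/$rel" | tr -d ' ')
        if [ "$size_a" = "$size_b" ]; then
            echo "duplicates: $b/$rel $a/$rel"
        else
            echo "difference: $b/$rel $a/$rel"
        fi
    done
}
```

Calling `compare_dirs /some/dir1 /some/dir2' prints one line per relative path found under both directories; files that exist in only one of them never trigger a size lookup.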
File sizes don't affect the run time, since we only make a `stat' call to
retrieve each file's recorded size rather than comparing contents.

That said, I found a bug when command-line paths have similar leading
path components, and realized I wasn't protecting names containing
whitespace. This version fixes both. Try it out using the [-n] flag. In
that mode it does nothing but find files, compare names, and compare
sizes on apparent duplicates. Even in non-dry-run calls, the `rm' command
is protected by its [-i] flag, which prompts the user before deleting
anything.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f |
awk -v n=$n 'BEGIN {
    cmd = "stat -f %z "
    # Join the directory arguments into one regex alternation, then
    # clear ARGC so awk reads the find output from stdin.
    for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]
    ARGC = 0
}
# Map each path to the offset where its directory-relative name begins.
{ files[$0] = match($0, "(" args ")/?") + RLENGTH }
END {
    # Normalize every directory argument to end in exactly one slash.
    for (i in ARGV) sub("/*$", "/", ARGV[i])
    print "Comparing files ..."
    for (i = 1; i < x; i++)
        for (file in files)
            if (file ~ "^" ARGV[i]) {
                for (j = i + 1; j < x; j++)
                    if (ARGV[j] substr(file, files[file]) in files) {
                        dup = ARGV[j] substr(file, files[file])
                        # Run stat only on this apparent duplicate pair.
                        cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
                        cmd "\"" dup "\"" | getline dup_s; close(cmd "\"" dup "\"")
                        if (dup_s == fil_s) act("dup")
                        else act("diff")
                    }
                # Trim the list so later passes have fewer entries to test.
                delete files[file]
            }
}
function act(message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi \"" dup "\"