Re: Tool to compare directories and delete duplicate files from one directory
- In reply to: Sysadmin Lists : "Re: Tool to compare directories and delete duplicate files from one directory"
Date: Sat, 20 May 2023 21:59:41 UTC
> ----------------------------------------
> From: Sysadmin Lists <sysadmin.lists@mailfence.com>
> Date: May 19, 2023, 10:19:33 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
> # 773723 differences: same filenames, different sizes or hashes
>
> real 1m32.719s
> user 0m50.671s
> sys 0m44.054s
>
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746 # 200G+ of data

Found a bug; shaved 10 seconds:

--------------------------------------------------------------------------------
diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END { for (i in ARGV) sub("/*$", "/", ARGV[i])
                         processed[d]
                         hits++
                     } else act("diff")
-                    if (c++ == hasf[ARGV[k], file])
-                        break
+                    if (++c == hasf[ARGV[k], file])
+                        { c = 0; break }
     } } } } }
     if (e) debug(3)
     processed[dups[file, j]]; delete dups[file, j]
--------------------------------------------------------------------------------

As a sanity-check, I checked to see how much time it would take to merely store every encountered file, grouped by filename.
It's so slow:

total files: 14347

real 1m37.176s
user 1m36.823s
sys 0m0.212s

--------------------------------------------------------------------------------
{ files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }
END {
    for (f in files)
        if (f in processed == 0) {
            processed[f]; dups[f]; hits[files[f]]++
            for (s in files) {
                if (f != s && s in processed == 0)
                    if (s ~ "/" files[f] "$") {
                        processed[s]; dups[s]; hits[files[f]]++
                    }
            }
            compare(dups)
            for (f in dups) {
                delete dups[f]; delete files[f]
            }
        }
    for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
    close("sort")
    print "total files:", tfiles
}
function compare(array, f) {
    for (f in array) { } # do nothing
}

-- 
Sent with https://mailfence.com
Secure and private email
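
For comparison with the sanity-check script above: most of its run time probably comes from the nested loop, which tests every remaining path against every basename with a regex, so the work grows roughly with the square of the file count. A minimal sketch of a single-pass alternative follows, assuming the goal is only to group paths by basename. The function name count_by_basename and the arrays count and paths are made up for illustration and are not part of dedup_multidirs.sh; this is untested against the 14347-file set.

```sh
#!/bin/sh
# Hypothetical helper (not from dedup_multidirs.sh): reads one path per line
# on stdin, groups the paths by basename in a single pass, and prints
# "count basename" for each group followed by the total.
count_by_basename() {
    awk '
    {
        # Basename = everything after the last "/", same expression as the
        # sanity-check script uses.
        name = substr($0, match($0, /[^\/]+$/))
        n = ++count[name]        # one hash update per path, no inner loop
        paths[name, n] = $0      # group members: paths[name, 1..count[name]]
        tfiles++
    }
    END {
        for (name in count)
            printf("%6d %s\n", count[name], name) | "sort"
        close("sort")
        print "total files:", tfiles + 0
    }'
}

# Example use (directory names are placeholders):
#   find dedup1 dedup2 -type f | count_by_basename
```

Keying on the exact basename string also sidesteps the dynamic regex "/" files[f] "$", where characters such as "." or "+" in a filename are treated as regex operators. A later compare step could walk each group with for (i = 1; i <= count[name]; i++) and look at paths[name, i].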