From: Paul Procacci
Date: Thu, 4 May 2023 20:13:02 -0400
Subject: Re: Tool to compare directories and delete duplicate files from one directory
To: Kaya Saman
Cc: freebsd-questions@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-questions


On Thu, May 4, 2023 at 7:53 PM Kaya Saman <kayasaman@optiplex-networks.com> wrote:

> On 5/4/23 23:32, Paul Procacci wrote:
>
> On Thu, May 4, 2023 at 5:47 PM Kaya Saman <kayasaman@optiplex-networks.com> wrote:
>
>> On 5/4/23 17:29, Paul Procacci wrote:
>>
>> On Thu, May 4, 2023 at 11:53 AM Kaya Saman <kayasaman@optiplex-networks.com> wrote:
>>
>>> Hi,
>>>
>>> I'm wondering if anyone knows of a tool like diff or so that can also
>>> delete files based on name and size from either left/right or
>>> source/destination directory?
>>>
>>> Basically what I have done is performed an rsync without using the
>>> --remove-source-files option onto a newly bought and created disk pool
>>> (yes zpool) on which I am trying to consolidate my data - as it's
>>> currently spread out over multiple pools with the same folder name.
>>>
>>> The issue I am facing mainly is that if I perform another rsync and use
>>> the --remove-source-files option, rsync will delete files based on name
>>> while there are some files that have the same name but not the same size
>>> and I would like to retain these files.
>>>
>>> Right now I have looked at many different options in both rsync and
>>> other tools but found nothing suitable. I even tested using a few test
>>> dirs and files that I put into /tmp and whatever I tried, the files of
>>> different size either got transferred or deleted.
>>>
>>> What would be a good way to approach this problem?
>>>
>>> Even if I create some kind of shell script and use diff, I think it will
>>> only compare names and not file sizes.
>>>
>>> I'm really lost here....
>>>
>>> Regards,
>>>
>>> Kaya
>>
>> It sounds like you want fdupes.  It's in the ports tree.
>>
>> ~Paul
>>
>> --
>> __________________
>>
>> :(){ :|:& };:
>>
>> I tried fdupes and installed it a while back. For me it felt like it only
>> works on a single directory.
>>
>> My dir structure is that I have:
>>
>> /dir   <- main directory where everything has now been rsync'ed to
>> /dir_1 <- old directory with partial content
>> /dir_2 <- more partial content
>> /dir_3 <- more partial content
>>
>> The key thing here is that I need to compare:
>>
>> /dir_(x) with /dir
>>
>> if the files are different sizes in /dir_(x) then leave them, otherwise
>> delete if both name and file size are the same.
>
> Then a tiny shell script does the job assuming your files don't have any
> spaces and no weird characters exist:
>
> #!/bin/sh
>
> for i in b c d;
> do
>   ls $i/ | while read file;
>   do
>     [ ! -f a/$file ] && cp $i/$file a/$file && continue
>
>     ref=`stat -f '%z' a/$file`
>     src=`stat -f '%z' $i/$file`
>     [ $ref -eq $src ] && rm -f $i/$file
>
>   done
> done
>
> Change paths accordingly and back up your stuff. ;)
>
> ~Paul
>
> --
> __________________
>
> :(){ :|:& };:
>
> Thanks Paul,
>
> I should be able to work with this. There are actually spaces and weird
> characters in the file names so I assume doing something like "file"
> should allow for that?
>
> I don't think I need the line after the 'do' statement, do I? From what I
> understand it copies the file from directory i to directory a? As I
> explained initially, the files have already been rsync'ed so I just need
> to compare and delete accordingly.
>
> When I performed the rsync it took around a week to complete per run.
> Currently zfs list shows around 12TB usage for my /dir (the merged
> directory), but that's with compression enabled.
>
> A quick Google shows that I can use something like this:
>
> search_dir=/the/path/to/base/dir
> for entry in "$search_dir"/*
> do
>   echo "$entry"
> done
>
> to list the files in the directory, though this might be Bash and not Csh.
>
> Otherwise clunkily (my scripting style is pretty rubbish and
> inefficient), I could do something like (it probably won't work!):
>
> #!/bin/sh
>
> #fb = file base
> #fm - file merge - file that has already been merged using rsync unless
> #     size was different
>
> dir_base=/dir
> for fb in "$dir_base"/*
> do
>   echo "$fs"
> done
>
> dir_merge=/dir_1
> for fm in "$dir_merge"/*
> do
>   echo "$fm"
> done
>
>   do
>
>     ref=`stat -f '%z' $dir_base/$fb`
>     src=`stat -f '%z' %i$dir_merge/$fm`
>     [ $ref -eq $src ] && rm -f $dir_merge/$fm
>
>   done
>
> Regards,
>
> Kaya

What I provided is exactly what you needed, as it loops through all the
directories.  You just have to provide the list of source directories on
that first for loop.
You can alter it, removing the first for loop, but then you'll need to run
it for each directory you want to apply the checks to.

Enclosing the variables in quotes will help with spaces, but may not be
enough on its own.  Filenames can contain almost any character (quotes,
backslashes, even newlines), and the `ls | while read` loop may not handle
all of them as expected.
If you're reasonably sure your filenames contain nothing that exotic, then
you have a better chance of it working.

Worst comes to worst, you'll need to use something like:

    find /path -print0 | xargs -0 -n 1 <args>

to overcome weird characters in filenames.
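
A rough sketch of that find/xargs route, in case it helps.  It deletes
nothing and only prints each file's size and path.  Assumptions on my part:
`print_sizes` is just a name I made up, and sizes are read with `wc -c` so
the same lines run outside FreeBSD too (`stat -f '%z'` is the native way
here).

```shell
#!/bin/sh
# print_sizes DIR (hypothetical helper): print "<bytes> <path>" for every
# regular file under DIR. Names travel NUL-delimited from find to xargs,
# so spaces, quotes and other odd characters arrive intact.
print_sizes() {
    find "$1" -type f -print0 |
        xargs -0 -n 1 sh -c 'printf "%s %s\n" "$(wc -c < "$1" | tr -d " ")" "$1"' sh
}
```

Something like `print_sizes /dir_1` should then list every file under
/dir_1 with its size, which is a harmless way to check that the odd names
come through before anything gets removed.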

In either case, adding quotes at this point, knowing you have at least
spaces and some special characters, is probably the correct course of
action.

As an aside, I don't use this syntax:    for entry in "$search_dir"/*
You're certainly free to do so, but I personally avoid globs when possible.
Maybe not so much in scripts like this, but on the command line those globs
can expand to a size that exceeds the allowable size of command line
arguments.
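
To make the glob point a bit more concrete: the size limit bites when the
expanded list is handed to an external command, not when a for loop walks
it inside the shell.  A small sketch, assuming a POSIX sh with `getconf`
available; `count_entries` is a made-up helper name.

```shell
#!/bin/sh
# Limit on the combined size of arguments and environment passed to a new
# program; the exact value varies by system.
getconf ARG_MAX

# count_entries DIR (hypothetical helper): count the non-hidden entries in
# DIR using a glob inside a for loop. No external command receives the
# expanded list, so the limit above does not come into play.
count_entries() {
    n=0
    for entry in "$1"/*; do
        # An unmatched glob stays literal; skip it unless something exists.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}
```

By contrast, something like `ls /huge/dir/*` passes the whole expansion to
ls as arguments, and that is where "Argument list too long" tends to show up.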

Revised script adding comments:
-----------------------------------------------------
#!/bin/sh

#
# dir_1, dir_2, and dir_3 are the directories I want to search through.
for i in dir_1 dir_2 dir_3;
do
  # Retrieve the filenames within each of those directories
  ls "$i/" | while IFS= read -r file;
  do
    # If the file doesn't exist in the base dir, copy it and continue
    # with the top of the loop.
    [ ! -f "dir_base/$file" ] && cp "$i/$file" dir_base/ && continue

    #
    # Getting to this point means the file exists in both locations.
    #

    # Get the file size as it is in the dir_base
    ref=`stat -f '%z' "dir_base/$file"`

    # Get the file size as it is in $i
    src=`stat -f '%z' "$i/$file"`

    # If the sizes are the same, remove the file from the source directory
    [ "$ref" -eq "$src" ] && rm -f "$i/$file"

  done
done
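
Since the files are already rsync'ed, a compare-and-delete-only variant
might look like the following.  Treat it as a sketch under my own
assumptions rather than a drop-in: `size_of`, `prune_same_size` and the
`DRY_RUN` variable are names I made up, it walks each directory with a glob
instead of `ls`, it is not recursive, and it reads sizes with `wc -c`
(swap in `stat -f '%z'` on FreeBSD if you prefer).  With DRY_RUN unset or
set to 1 it only reports what it would remove.

```shell
#!/bin/sh
# size_of FILE (hypothetical helper): print the size of FILE in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# prune_same_size BASE SRC... (hypothetical helper): for every regular file
# directly inside each SRC that also exists in BASE under the same name and
# with the same size, remove the copy in SRC. Everything else is left alone.
prune_same_size() {
    base=$1
    shift
    for dir in "$@"; do
        for path in "$dir"/*; do
            [ -f "$path" ] || continue          # skip subdirs and unmatched globs
            name=${path##*/}
            [ -f "$base/$name" ] || continue    # not in the base dir: keep it
            if [ "$(size_of "$base/$name")" -eq "$(size_of "$path")" ]; then
                if [ "${DRY_RUN:-1}" = 1 ]; then
                    printf 'would remove: %s\n' "$path"
                else
                    rm -f -- "$path"
                fi
            fi
        done
    done
}
```

Run `prune_same_size /dir /dir_1 /dir_2 /dir_3` first and read the output;
once it looks right, set DRY_RUN=0 and run it again to actually delete.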



--
__________________

:(){ :|:& };: