From: Vitaliy Gusev
To: Tomek CEDRO
Cc: virtualization@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: BHYVE SNAPSHOT image format proposal
Date: Wed, 24 May 2023 18:10:49 +0300
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

Hi Tomek,

I'll try to answer all the questions below; please let me know if I miss
something important.


> On 23 May 2023, at 21:58, Tomek CEDRO <tomek@cedro.info> wrote:
>
> On Tue, May 23, 2023 at 6:06 PM Vitaliy Gusev wrote:
>> Hi,
>> Here is a proposal for bhyve snapshot/checkpoint image format improvements.
>> It implies moving snapshot code to nvlist engine.
>
> Hey there Vitaliy :-) bhyve getting more and more traction, I am new
> user of bhyve and no expert, but new and missing features are welcome
> I guess.. there was a discussion on the mailing lists recently on
> better snapshots mechanism :-)
>
>> Current snapshot implementation has disadvantages:
>> 3 files per snapshot: .meta, .kern, vram
>
> No problem, unless new single file will be protected against
> corruption (filesystem, transfer, application crash) and possible to
> be easily and cheaply modified in place?

The current snapshot implementation doesn't have that either. I would go
further: the current pkg implementation doesn't track or notify you if some
files are changed. Binary files on a system, for example ELF files, can be
changed without any notification.

Tar doesn't protect the data it stores either. Some filesystems, like ZFS,
guarantee that data is not silently modified by the underlying disks.

Protection requires more effort, and its purpose should be clearly defined.
If the purpose is a checksum with 99.9% reliability, the NVLIST header can be
widened to carry a "checksum" key/value for a Section.

If the purpose is cryptographic verification, I believe the sha256 program
should be your choice.
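
To make the two options concrete, here is a minimal sketch in Python (not bhyve code). A plain dict stands in for the nvlist section header, CRC32 is only an assumed choice of checksum since none has been picked, and the function names are hypothetical.

```python
import hashlib
import zlib


def add_checksum(header: dict, payload: bytes) -> dict:
    """Return a copy of a section header with a "checksum" key added."""
    stamped = dict(header)
    stamped["checksum"] = zlib.crc32(payload)  # assumed algorithm, for illustration
    return stamped


def verify_section(header: dict, payload: bytes) -> bool:
    """True if the payload still matches the checksum stored in the header."""
    return header.get("checksum") == zlib.crc32(payload)


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


payload = b"packed device state"
header = add_checksum({"size": len(payload), "type": "nvlist"}, payload)
print(verify_section(header, payload))                 # intact payload
print(verify_section(header, b"Packed device state"))  # one byte changed
print(sha256_hex(payload))
```

The checksum catches accidental corruption of a section; the SHA-256 digest is what you would compare if you need cryptographic verification of the whole file.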


>> Binary Stream format of data.
>
> This is small and fast? Will new format too?

Small is not everything. As a first attempt the snapshot code is good. But if
you want to get the values related to a specific device, for example a NIC or
the HPET, you cannot get them easily. Please try :)

A stream has no flexibility. It is good for well-specified and long-discussed
protocols like XDR (NFS), which have an RFC where each position in the stream
is described. Example: RFC 1813.

The new format with NVLIST is flexible and fast enough. Note that ZFS uses
nvlist for keeping attributes and many other things.
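
A small Python sketch of the difference (illustrative only, not the bhyve snapshot code). The NIC fields `irq` and `mac` and all function names are made up, and a dict stands in for an nvlist.

```python
import struct


def save_old(irq: int, mac: bytes) -> bytes:
    """Older writer: irq first, then the 6-byte MAC, purely positional."""
    return struct.pack("<I6s", irq, mac)


def restore_new(blob: bytes) -> dict:
    """Newer reader that expects the MAC first, then irq."""
    mac, irq = struct.unpack("<6sI", blob)
    return {"irq": irq, "mac": mac}


def restore_keyed(saved: dict) -> dict:
    """Keyed reader: values are looked up by name, so order is irrelevant."""
    return {"irq": saved["irq"], "mac": saved["mac"]}


mac = bytes.fromhex("020000000001")

# The positional restore succeeds without any error but yields garbage values.
print(restore_new(save_old(5, mac)))

# The keyed restore gives the same result regardless of the saved order.
print(restore_keyed({"mac": mac, "irq": 5}) == restore_keyed({"irq": 5, "mac": mac}))
```

The positional reader cannot even notice that something went wrong, because both layouts have the same length.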


>> Adding optional variable - breaks resume
>> Removing variable - breaks resume
>> Changing saved order of variables - breaks resume
>
> Obviously need improvement :-)
>
>> Hard to get information about what is saved and decode.
>> Hard to debug if something goes wrong
>
> Additional tools missing? Will new format allow text editor interaction?

Why do you need to modify a snapshot image? Could you describe that in more
detail? Do you modify the current 3 snapshot files?


>> No versions. If change code, resume of an old images can be
>> passed, but with UB.
>
> Is new format future proof and provides backward compatibility?

The intention of moving to the new format is to keep backward compatibility
when some code is changed:
  • Adding an optional variable
  • Removing a variable that is not used anymore
  • Changing the order in which variables are saved
  • "Hot fixes"

If changes are critical and incompatible, the restore stage should have clear
information about the incompatibility and abort the resume. Ideally it should
be informed even before the restore process starts. For this purpose, the new
format introduces versions.
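
Roughly, the restore side could behave like the Python sketch below. This is an illustration under assumptions, not the proposed implementation: the `version` key, the `SUPPORTED_VERSION` constant, the NIC keys and the default MTU are all hypothetical, and dicts stand in for nvlists.

```python
SUPPORTED_VERSION = 2  # hypothetical: newest image version this code understands


class IncompatibleImage(Exception):
    """Raised before any device state is restored."""


def check_version(header: dict) -> None:
    version = header.get("version")
    if not isinstance(version, int) or version > SUPPORTED_VERSION:
        raise IncompatibleImage(f"unsupported image version: {version!r}")


def restore_nic(saved: dict) -> dict:
    state = {"irq": saved["irq"]}          # required key, present in every version
    state["mtu"] = saved.get("mtu", 1500)  # optional key added later, with a default
    # Keys the code no longer uses (e.g. "legacy_flag") are simply never read.
    return state


def resume(image: dict) -> dict:
    check_version(image["header"])         # fail early, before touching device state
    return restore_nic(image["devices"]["nic"])


old_image = {
    "header": {"version": 1},
    "devices": {"nic": {"legacy_flag": 1, "irq": 5}},
}
print(resume(old_image))
```

An image written by older code resumes with defaults filled in, while an image with an unknown or newer version is rejected before the restore starts.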



>> New nvlist implementation should solve all things above. The first step -
>> improve snapshot/checkpoint saving format. It eliminates three files usage
>> per a snapshot.
>>
>> (..)
>
> So this will be new text config based format with variable = value and sections?

This is the NVLIST approach with key=value, where the key is a string and the
value can be an integer, array, string, etc.


> How much bigger will be the overall file size increase?

Not so huge. NVLIST internals are well specified. For example, for my VM:

  [kernel]
        kernel.offset = 0x11f6 (4598)
        kernel.size = 0x19a7 (6567)
        kernel.type = "nvlist"

  [devices]
        devices.offset = 0x2b9d (11165)
        devices.size = 0x10145ba (16860602)
        devices.type = "nvlist"

So the packed size is 6567 bytes for kernel and 16860602 bytes for devices,
including the 16MB framebuffer. If fbuf is removed, the packed nvlist devices
Section is 83386 bytes.
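
The numbers can be checked with a few lines of Python. This is only a sketch: the dict layout and helper names are made up, and treating the framebuffer as exactly 16 MiB is an assumption based on the "16MB" figure.

```python
MIB = 1024 * 1024

# Section index copied from the listing above.
sections = {
    "kernel":  {"offset": 0x11f6, "size": 0x19a7,    "type": "nvlist"},
    "devices": {"offset": 0x2b9d, "size": 0x10145ba, "type": "nvlist"},
}


def section_end(section: dict) -> int:
    """Offset of the first byte after the section."""
    return section["offset"] + section["size"]


def read_section(image: bytes, section: dict) -> bytes:
    """Slice one section's packed bytes out of the whole image."""
    return image[section["offset"]:section_end(section)]


kernel, devices = sections["kernel"], sections["devices"]
print(section_end(kernel) == devices["offset"])  # does kernel end where devices starts?
print(devices["size"] - 16 * MIB)                # bytes left after a 16 MiB framebuffer
```

In this dump the kernel section ends exactly where the devices section begins (4598 + 6567 = 11165), and 16860602 minus 16 MiB gives the 83386 bytes mentioned above.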



> How much longer it will take to decode/encode/process files?

It is fast, just several milliseconds. NVLIST is a very fast format. It is
already integrated into bhyve as the Config engine.



> What is the possibility of format change and backward/forward compatibility?

If you are talking about compatibility of the image format, it should be
compatible in both directions, at least for format changes that are not too
big.

If we consider overall snapshot/resume compatibility, I believe forward
compatibility is neither a use case nor a target. Indeed, why would you need
to resume an image created by a newer version of a program?

The most important thing is backward compatibility, i.e. when an image is
created by an older version of a program but should be resumed on a newer one.

This is the target and intention of this improvement.


> Have you considered efficiency comparison of current format, proposed
> format, and maybe using SQLITE or JSON storage/parsers? For instance
> sqlite would be blazingly fast but hard to migrate. json would be most
> versatile but more time/memory consuming?

Yes, I know about other formats, like JSON. NVLIST is the most effective and
suitable for the current purposes.


> Maybe EFL approach of storing configuration files for limited
> resources embedded system storage that use binary storage data but can
> be decompressed in chunks that can be replaced in place?
> https://www.enlightenment.org/develop/efl/start

There are many things that could be used, but the choice should be well known,
easy, stable, fast and supportable. I believe NVLIST is the best choice.


> Sorry for asking those questions but there may be already good and
> verified solutions out there not to reinvent the wheel? :-)

Thank you for your questions. If you would like, you can test the new
implementation and give feedback.

---
Vitaliy Gusev