From nobody Sun Apr 10 20:27:07 2022 X-Original-To: dev-commits-src-main@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 402851A8A18D for ; Sun, 10 Apr 2022 20:27:26 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-vk1-xa33.google.com (mail-vk1-xa33.google.com [IPv6:2607:f8b0:4864:20::a33]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Kc3SY2bQtz3LFG for ; Sun, 10 Apr 2022 20:27:25 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-vk1-xa33.google.com with SMTP id b81so6659547vkf.1 for ; Sun, 10 Apr 2022 13:27:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=gyerOyteZnsKYZEuozkrGzj4jd90s9lQGq+ehSHHqdk=; b=y4ed8zBXbSCKND10kdFaHsmYpmLuEM4iM6OBdpUYDi5m2/JhvXYkbfNveUo5NHvoGz cNJ4fX/psBQ045Al+tzeDweF3RMElGdSdnmcw4qHulo5u2st3BxvsP3uTQFyODHYQNRF 7TmzYCHK912tePGVYREj+irI/oPMu57ajSNyC26S+iqxf8YLMcuPGn6kDXCf0TcRUjiA V1p3ITSyvQCiYHnWVyuMqYlrgcb48bQoGlBGlu5Fx1YKydaiER7REVrm+2LHiH47OH/E UpS61x7TP4oJ/454lzpt7V36NYzC9x4S80/wKlOO/iVSn/amX6GsxIgBW4UQrA82Xa8d dGHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=gyerOyteZnsKYZEuozkrGzj4jd90s9lQGq+ehSHHqdk=; b=jfJDXGOsO5FVuMnW/oqP9DeKkzmPOYsdH17j20c+aCC+7ZjSN8rZ20txSS7CfKWzU0 74LKVRFozsOwEDjEhvtNVRZHxAwPArUsj3OoH6k0JOne9rlgINVyKGcMN1/HE3byqnEj ShPcLfbCSsgW0dYicWmxgFTk+b0Du2BJ/J/2E8jgHvfoXNJ50dniBf4CwJ8m3F3G4m/3 Vj+HQYviKk4xKMSy/R+oxYWQHACAsLXeO6vFIjKR2WA5HkxB9ViEtLxyai8DxWKTznck Qnie4Sw1RzwMigVnDlH8jMW3mT1UIxLuGtPuEQbMNQi3rjseDmj0xiPv/Czo1TR5aNcE K7fw== X-Gm-Message-State: AOAM531E82JssF8g+vyMH3YlSDZwi+LlJsu30pIhHapJU4A2ZzkepzLk 4Bw+oHSwh0vE4ba+c7zVIM+sdSixspFipMsWdzrrnw== X-Google-Smtp-Source: ABdhPJx204iPPeGQ7cmHjOrO9Ta3MOuwxBftNAneaSvpoITy4LHnM11LMFFOKczre816s2w2D/FOMLZcZWGvug3lsc4= X-Received: by 2002:a1f:e6c4:0:b0:345:8164:7d95 with SMTP id d187-20020a1fe6c4000000b0034581647d95mr705424vkh.40.1649622438658; Sun, 10 Apr 2022 13:27:18 -0700 (PDT) List-Id: Commit messages for the main branch of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-main@freebsd.org X-BeenThere: dev-commits-src-main@freebsd.org MIME-Version: 1.0 References: <202204090506.239567Ag038413@gitrepo.freebsd.org> <3BC728DB-39BB-43BC-BECE-720FECB5B20D@panasas.com> In-Reply-To: From: Warner Losh Date: Sun, 10 Apr 2022 14:27:07 -0600 Message-ID: Subject: Re: 214df80a9cb3 - main - nvme: new define for size of host memory buffer sizes To: Ravi Pokala Cc: Warner Losh , src-committers , "" , dev-commits-src-main@freebsd.org Content-Type: multipart/alternative; boundary="000000000000785a2f05dc52a7de" X-Rspamd-Queue-Id: 4Kc3SY2bQtz3LFG X-Spamd-Bar: -- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=bsdimp-com.20210112.gappssmtp.com header.s=20210112 header.b=y4ed8zBX; dmarc=none; spf=none (mx1.freebsd.org: domain of wlosh@bsdimp.com has no SPF policy when checking 2607:f8b0:4864:20::a33) smtp.mailfrom=wlosh@bsdimp.com X-Spamd-Result: default: False [-3.00 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[bsdimp-com.20210112.gappssmtp.com:s=20210112]; FROM_HAS_DN(0.00)[]; TO_DN_SOME(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; PREVIOUSLY_DELIVERED(0.00)[dev-commits-src-main@freebsd.org]; DMARC_NA(0.00)[bsdimp.com]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; DKIM_TRACE(0.00)[bsdimp-com.20210112.gappssmtp.com:+]; NEURAL_HAM_SHORT(-1.00)[-0.996]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::a33:from]; MLMMJ_DEST(0.00)[dev-commits-src-main]; FORGED_SENDER(0.30)[imp@bsdimp.com,wlosh@bsdimp.com]; R_SPF_NA(0.00)[no SPF record]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; FROM_NEQ_ENVFROM(0.00)[imp@bsdimp.com,wlosh@bsdimp.com]; RCVD_TLS_ALL(0.00)[]; RCVD_COUNT_TWO(0.00)[2] X-ThisMailContainsUnwantedMimeParts: N --000000000000785a2f05dc52a7de Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sat, Apr 9, 2022 at 5:15 PM Warner Losh wrote: > > > On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala wrote: > >> -----Original Message----- >> From: on behalf of Warner Losh >> >> Date: 2022-04-08, Friday at 22:06 >> To: , , >> >> Subject: git: 214df80a9cb3 - main - nvme: new define for size of host >> memory buffer sizes >> >> The branch main has been updated by imp: >> >> URL: >> https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19de= ec2bbfae76 >> >> commit 214df80a9cb3e95a140b13af7d19deec2bbfae76 >> Author: Warner Losh >> AuthorDate: 2022-04-09 05:01:06 +0000 >> Commit: Warner Losh >> CommitDate: 2022-04-09 05:05:25 +0000 >> >> nvme: new define for size of host memory buffer sizes >> >> The nvme spec defines the various fields that specify sizes for >> host >> memory buffers in terms of 4096 chunks. So, rather than use a >> bare 4096 >> here, use NVME_HMB_UNITS. This is explicitly not the host page >> size of >> 4096, nor the default memory page size (mps) of the NVMe drive, >> but its >> own thing and needs its own define. >> >> Hi Warner, >> >> Are you sure about that? >> > > NVMe-1.4, Figure 297: Host Memory Buffer =E2=80=93 Host Memory Buffer Des= criptor >> Entry >> >> | Buffer Size (BSIZE): Indicates the number of contiguous >> | memory page size (CC.MPS) units for this descriptor. >> | >> | Buffer Address (BADD): Indicates the host memory address for >> | this descriptor aligned to the memory page size (CC.MPS). >> | The lower bits (n:0) of this field indicate the offset >> | within the memory page is 0h (e.g., if the memory page size >> | is 4 KiB, then bits 11:00 shall be 0h; if the memory page >> | size is 8 KiB, then bits 12:00 shall be 0h). >> >> They both reference mps, not 4096 bytes. >> > > So, some I'm 100% sure about. There's one that was previously incorrectly > hard wired to > 4k. More below. > > From Table 275 of Rev 2.0: > > Host Memory Buffer Preferred Size (HMPRE): This field indicates > the preferred size that the host is requested to allocate for the > Host Memory Buffer feature in 4 KiB units. This value shall be > greater than or equal to the Host Memory Buffer Minimum Size. > If this field is non-zero, then the Host Memory Buffer feature is > supported. If this field is cleared to 0h, then the Host Memory > Buffer feature is not supported. > > Host Memory Buffer Minimum Size (HMMIN): This field indicates > the minimum size that the host is requested to allocate for the > Host Memory Buffer feature in 4 KiB units. If this field is cleared > to 0h, then the host is requested to allocate any amount of host > memory possible up to the HMPRE value. > > Host Memory Buffer Minimum Descriptor Entry Size (HMMINDS): > This field indicates the minimum usable size of a Host Memory > Buffer Descriptor Entry in 4 KiB units. If this field is cleared to 0h, > then the controller does not indicate any limitations on the Host > Memory Buffer Descriptor Entry size. > > These are the hmmin, hmminds and hmpre fields of 'cdata' in the > driver. > > diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c >> index 95a2b5c4285d..6996b3151b0d 100644 >> --- a/sys/dev/nvme/nvme_ctrlr.c >> +++ b/sys/dev/nvme/nvme_ctrlr.c >> @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_controller >> *ctrlr) >> max =3D (uint64_t)physmem * PAGE_SIZE / 20; >> TUNABLE_UINT64_FETCH("hw.nvme.hmb_max", &max); >> > > max is a percent of available memory... > > >> - min =3D (long long unsigned)ctrlr->cdata.hmmin * 4096; >> + min =3D (long long unsigned)ctrlr->cdata.hmmin * NVME_HMB_UNITS; >> if (max =3D=3D 0 || max < min) >> return; >> - pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre * 4096, max)= ; >> - minc =3D MAX(ctrlr->cdata.hmminds * 4096, PAGE_SIZE); >> + pref =3D MIN((long long unsigned)ctrlr->cdata.hmpre * >> NVME_HMB_UNITS, max); >> + minc =3D MAX(ctrlr->cdata.hmminds * NVME_HMB_UNITS, PAGE_SIZE); >> > > So for all of these, we're good. They are in 4kiB chunks. > > >> if (min > 0 && ctrlr->cdata.hmmaxd > 0) >> minc =3D MAX(minc, min / ctrlr->cdata.hmmaxd); >> ctrlr->hmb_chunk =3D pref; >> @@ -1023,7 +1023,7 @@ again: >> for (i =3D 0; i < ctrlr->hmb_nchunks; i++) { >> ctrlr->hmb_desc_vaddr[i].addr =3D >> htole64(ctrlr->hmb_chunks[i].hmbc_paddr); >> - ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu= nk >> / 4096); >> + ctrlr->hmb_desc_vaddr[i].size =3D htole32(ctrlr->hmb_chu= nk >> / NVME_HMB_UNITS); >> > > This one, you are correct is wrong. I'll fix it when I bring in the > changes to fully support > MPS !=3D 0. For MPS =3D=3D 0, which are the only drives this driver suppo= rts > correctly, this > is a nop. It was wrong before, though. The text you quoted is correct > about this. There > are a couple of PAGE_SIZEs tinbetween these two chunks hat should be > ctrlr->min_page_size instead (since that's the page size we're really > using, not the > host's). But PAGE_SIZE !=3D 4k doesn't currently work with the nvme drive= r > due to > confusion like this. > https://reviews.freebsd.org/D34871 should address that. https://reviews.freebsd.org/D34865 through https://reviews.freebsd.org/D34873 are my current series. With them applied, I'm able to boot with 16k pages on a ZFS-based nvme arm64 system. I believe it would also allow us to use different drive page sizes too, but I haven't tried to figure that out :)... Warner > We also currently set the MPS field in the CC incorrectly when it isn't 0= . > > Good catch. I'll update my pending changes. > > Warner > > >> } >> bus_dmamap_sync(ctrlr->hmb_desc_tag, ctrlr->hmb_desc_map, >> BUS_DMASYNC_PREWRITE); >> >> >> --000000000000785a2f05dc52a7de Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Sat, Apr 9, 2022 at 5:15 PM Warner= Losh <imp@bsdimp.com> wrote:
=


On Sat, Apr 9, 2022 at 4:50 PM Ravi Pokala <rpokala@freebsd.org&g= t; wrote:
-----O= riginal Message-----
From: <owner-src-committers@freebsd.org> on behalf of Warner Losh <= imp@FreeBSD.org>
Date: 2022-04-08, Friday at 22:06
To: <src-committers@FreeBSD.org>, <dev-commits-src-all@FreeBSD.org= >, <dev-commits-src-main@FreeBSD.org>
Subject: git: 214df80a9cb3 - main - nvme: new define for size of host memor= y buffer sizes

=C2=A0 =C2=A0 The branch main has been updated by imp:

=C2=A0 =C2=A0 URL: https://cgit.FreeBSD.org/src/commit/?id=3D214df80a9cb3e95a140b13af7d19deec= 2bbfae76

=C2=A0 =C2=A0 commit 214df80a9cb3e95a140b13af7d19deec2bbfae76
=C2=A0 =C2=A0 Author:=C2=A0 =C2=A0 =C2=A0Warner Losh <imp@FreeBSD.org>= ;
=C2=A0 =C2=A0 AuthorDate: 2022-04-09 05:01:06 +0000
=C2=A0 =C2=A0 Commit:=C2=A0 =C2=A0 =C2=A0Warner Losh <imp@FreeBSD.org>= ;
=C2=A0 =C2=A0 CommitDate: 2022-04-09 05:05:25 +0000

=C2=A0 =C2=A0 =C2=A0 =C2=A0 nvme: new define for size of host memory buffer= sizes

=C2=A0 =C2=A0 =C2=A0 =C2=A0 The nvme spec defines the various fields that s= pecify sizes for host
=C2=A0 =C2=A0 =C2=A0 =C2=A0 memory buffers in terms of 4096 chunks. So, rat= her than use a bare 4096
=C2=A0 =C2=A0 =C2=A0 =C2=A0 here, use NVME_HMB_UNITS. This is explicitly no= t the host page size of
=C2=A0 =C2=A0 =C2=A0 =C2=A0 4096, nor the default memory page size (mps) of= the NVMe drive, but its
=C2=A0 =C2=A0 =C2=A0 =C2=A0 own thing and needs its own define.

Hi Warner,

Are you sure about that?

NVMe-1.4, Figure 297: Host Memory Buffer = =E2=80=93 Host Memory Buffer Descriptor Entry

| Buffer Size (BSIZE): Indicates the number of contiguous
| memory page size (CC.MPS) units for this descriptor.
|
| Buffer Address (BADD): Indicates the host memory address for
| this descriptor aligned to the memory page size (CC.MPS).
| The lower bits (n:0) of this field indicate the offset
| within the memory page is 0h (e.g., if the memory page size
| is 4 KiB, then bits 11:00 shall be 0h; if the memory page
| size is 8 KiB, then bits 12:00 shall be 0h).

They both reference mps, not 4096 bytes.

So, some I'm 100% sure about. There's one that was previously inc= orrectly hard wired to
4k. More below.

F= rom Table 275 of Rev 2.0:

Host Memory Buffer Prefe= rred Size (HMPRE): This field indicates
the preferred size that t= he host is requested to allocate for the
Host Memory Buffer featu= re in 4 KiB units. This value shall be
greater than or equal to t= he Host Memory Buffer Minimum Size.
If this field is non-zero, th= en the Host Memory Buffer feature is
supported. If this field is = cleared to 0h, then the Host Memory
Buffer feature is not support= ed.

Host Memory Buffer Minimum Size (HMMIN): T= his field indicates
the minimum size that the host is requested t= o allocate for the
Host Memory Buffer feature in 4 KiB units. If = this field is cleared
to 0h, then the host is requested to alloca= te any amount of host
memory possible up to the HMPRE value.
<= /div>

Host Memory Buffer Minimum Descriptor Entry Size (= HMMINDS):
This field indicates the minimum usable size of a Host = Memory
Buffer Descriptor Entry in 4 KiB units. If this field is c= leared to 0h,
then the controller does not indicate any limitatio= ns on the Host
Memory Buffer Descriptor Entry size.

These are the hmmin, hmminds and hmpre fields of 'cdata= ' in the
driver.

=C2=A0 =C2=A0 diff --git a/sys/dev/nvme/nvme_ctrl= r.c b/sys/dev/nvme/nvme_ctrlr.c
=C2=A0 =C2=A0 index 95a2b5c4285d..6996b3151b0d 100644
=C2=A0 =C2=A0 --- a/sys/dev/nvme/nvme_ctrlr.c
=C2=A0 =C2=A0 +++ b/sys/dev/nvme/nvme_ctrlr.c
=C2=A0 =C2=A0 @@ -936,11 +936,11 @@ nvme_ctrlr_hmb_alloc(struct nvme_contro= ller *ctrlr)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 max =3D (uint64_t)physmem * PAGE_SIZE / 20;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 TUNABLE_UINT64_FETCH("hw.nvme.hmb_max"= ;, &max);

max is a percent of avail= able memory...
=C2=A0
=C2=A0 =C2=A0 -=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr->cdata.hmm= in * 4096;
=C2=A0 =C2=A0 +=C2=A0 =C2=A0min =3D (long long unsigned)ctrlr->cdata.hmm= in * NVME_HMB_UNITS;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (max =3D=3D 0 || max < min)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return;
=C2=A0 =C2=A0 -=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr->cdat= a.hmpre * 4096, max);
=C2=A0 =C2=A0 -=C2=A0 =C2=A0minc =3D MAX(ctrlr->cdata.hmminds * 4096, PA= GE_SIZE);
=C2=A0 =C2=A0 +=C2=A0 =C2=A0pref =3D MIN((long long unsigned)ctrlr->cdat= a.hmpre * NVME_HMB_UNITS, max);
=C2=A0 =C2=A0 +=C2=A0 =C2=A0minc =3D MAX(ctrlr->cdata.hmminds * NVME_HMB= _UNITS, PAGE_SIZE);

So for all of these= , we're good. They are in 4kiB chunks.
=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 if (min > 0 && ctrlr->cdata.hmmax= d > 0)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 minc =3D MAX(minc, = min / ctrlr->cdata.hmmaxd);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr->hmb_chunk =3D pref;
=C2=A0 =C2=A0 @@ -1023,7 +1023,7 @@ again:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 for (i =3D 0; i < ctrlr->hmb_nchunks; i++= ) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 ctrlr->hmb_desc_= vaddr[i].addr =3D
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 htole= 64(ctrlr->hmb_chunks[i].hmbc_paddr);
=C2=A0 =C2=A0 -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr->hmb_desc_= vaddr[i].size =3D htole32(ctrlr->hmb_chunk / 4096);
=C2=A0 =C2=A0 +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0ctrlr->hmb_desc_= vaddr[i].size =3D htole32(ctrlr->hmb_chunk / NVME_HMB_UNITS);

This one, you are correct is wrong. I'll fix = it when I bring in the changes to fully support
MPS !=3D 0. For M= PS =3D=3D 0, which are the only drives this driver supports correctly, this=
is a nop. It was wrong before, though. The text you quoted=C2=A0= is correct about this. There
are a couple of PAGE_SIZEs tinbetwee= n=C2=A0these two chunks hat should be
ctrlr->min_page_size ins= tead (since that's the page size we're really using, not the
<= div>host's). But PAGE_SIZE !=3D 4k doesn't currently work with the = nvme driver due to
confusion like this.


=
= series. With them applied, I'm able to boot with 16k pages on a ZFS-bas= ed nvme arm64 system.
I believe it would also allow us to use dif= ferent drive page sizes too, but I haven't tried to figure
th= at out :)...

Warner
=C2=A0
We also currently set the MPS field in the CC incorrectly w= hen it isn't 0.

Good catch. I'll update my= pending changes.

Warner
=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 }
=C2=A0 =C2=A0 =C2=A0 =C2=A0 bus_dmamap_sync(ctrlr->hmb_desc_tag, ctrlr-&= gt;hmb_desc_map,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 BUS_DMASYNC_PREWRITE);


--000000000000785a2f05dc52a7de--