Re: git: a8089ea5aee5 - main - nvmfd: A simple userspace daemon for the NVMe over Fabrics controller

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Fri, 03 May 2024 19:22:32 UTC
On 5/2/24 5:16 PM, John Baldwin wrote:
> The branch main has been updated by jhb:
> 
> URL: https://cgit.FreeBSD.org/src/commit/?id=a8089ea5aee578e08acab2438e82fc9a9ae50ed8
> 
> commit a8089ea5aee578e08acab2438e82fc9a9ae50ed8
> Author:     John Baldwin <jhb@FreeBSD.org>
> AuthorDate: 2024-05-02 23:35:40 +0000
> Commit:     John Baldwin <jhb@FreeBSD.org>
> CommitDate: 2024-05-02 23:38:39 +0000
> 
>      nvmfd: A simple userspace daemon for the NVMe over Fabrics controller

I'm sure there are some subtle bugs I've missed somewhere, but I have tested the
host and controller against each other (both userspace and kernel) as well as
against Linux.  Some of the patches Warner approved in Phab he specifically
noted as having been looked over, but not reviewed in detail due to their
size, etc.  I kind of
think we might want a separate tag for those types of reviews.  In GDB, we use
an 'Acked-by' tag to mean that a commit is approved, but it's not had a detailed
technical review the way 'Reviewed-by' implies.  If we had such a tag here,
some of these commits probably would have used Acked-by instead of Reviewed-by.

Here are some initial notes on using NVMeoF.  They might be a good candidate for
the handbook.  (If we don't yet have notes on iSCSI for the handbook we should
add those as well; maybe I will get around to that too):

# Overview

NVMe over Fabrics supports access to remote block storage devices as
NVMe namespaces across a network connection similar to using iSCSI to
access remote block storage devices as SCSI LUNs.  FreeBSD includes
support for accessing remote namespaces via a host driver as well as
support for exporting local storage devices as namespaces to remote
hosts.

NVMe over Fabrics supports multiple transport layers including
Fibre Channel, RDMA (over both iWARP and RoCE), and TCP.  FreeBSD
currently includes support only for the TCP transport.

Enabling support requires loading a kernel module for the transport to
use in addition to the host or controller module.  The TCP transport
is provided by `nvmf_tcp.ko`.
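
For example, the modules can be loaded by hand with kldload(8), or appended
to `kld_list` in rc.conf(5) so they are loaded at boot.  This is a minimal
sketch for the host side; on the controller side load `nvmft` instead of
`nvmf` (both are shown in the examples below):

```
# kldload nvmf nvmf_tcp
# sysrc kld_list+="nvmf nvmf_tcp"
```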

# Host (Initiator)

The fabrics host on FreeBSD exposes remote controllers as `nvmeX`
new-bus devices similar to PCI-express NVMe controllers.  Remote
namespaces are exposed via `ndaX` disk devices via CAM.  The fabrics
host driver does not support the `nvd` disk driver.
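
Once an association has been established (see Example 2 below), the attached
controllers and namespaces can be verified with the usual listing commands,
for example:

```
# nvmecontrol devlist
# camcontrol devlist
```

`nvmecontrol devlist` lists the `nvmeX` controllers and their namespaces,
while `camcontrol devlist` shows the corresponding `ndaX` peripherals.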

## Discovery Service

NVMe over Fabrics defines a discovery service.  A discovery controller
exports a log page enumerating a set of one or more controllers.  Each
log page entry contains the type of a controller (I/O or discovery) as
well as the transport type and transport-specific address.  For the
TCP transport the address includes the IP address and TCP port number.

nvmecontrol(8) supports a `discover` command to query the log page
from a discovery controller.

Example 1: The Discovery Log Page from a Linux Controller

```
# nvmecontrol discover ubuntu:4420
Discovery
=========
Entry 01
========
  Transport type:       TCP
  Address family:       AF_INET
  Subsystem type:       NVMe
  SQ flow control:      optional
  Secure Channel:       Not specified
  Port ID:              1
  Controller ID:        Dynamic
  Max Admin SQ Size:    32
  Sub NQN:              nvme-test-target
  Transport address:    10.0.0.118
  Service identifier:   4420
  Security Type:        None
```

## Connecting To an I/O Controller

nvmecontrol(8) supports a `connect` command to establish an association
with a remote controller.  Once the association is established, it is
handed off to the in-kernel host, which creates a new `nvmeX` device.

Example 2: Connecting to an I/O Controller

```
# kldload nvmf nvmf_tcp
# nvmecontrol connect ubuntu:4420 nvme-test-target
```

This results in the following lines in dmesg:

```
nvme0: <Fabrics: nvme-test-target>
nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0: <Linux 5.15.0-8 843bf4f791f9cdb03d8b>
nda0: Serial Number 843bf4f791f9cdb03d8b
nda0: nvme version 1.3
nda0: 1024MB (2097152 512 byte sectors)
```

The new `nvme0` device can now be used with other nvmecontrol(8)
commands such as `identify`, just as with PCI-express controllers.

Example 3: Identify a Remote I/O Controller

```
# nvmecontrol identify nvme0
Controller Capabilities/Features
================================
...
Model Number:                Linux
Firmware Version:            5.15.0-8
...

Fabrics Attributes
==================
I/O Command Capsule Size:    16448 bytes
I/O Response Capsule Size:   16 bytes
In Capsule Data Offset:      0 bytes
Controller Model:            Dynamic
Max SGL Descriptors:         1
Disconnect of I/O Queues:    Not Supported
```

The `nda0` disk device can be used like any other NVMe disk device.
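
For example, it can be partitioned and a filesystem created on it the same
way as a local NVMe disk (a sketch; the partition scheme and mount point
are arbitrary):

```
# gpart create -s gpt nda0
# gpart add -t freebsd-ufs nda0
# newfs /dev/nda0p1
# mount /dev/nda0p1 /mnt
```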

## Connecting via Discovery

nvmecontrol(8)'s `connect-all` command fetches the discovery log page
from the specified discovery controller and creates an association for
each log page entry.
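
For example, to connect to every I/O controller advertised by the discovery
controller from Example 1 (a sketch using the same address as above):

```
# nvmecontrol connect-all ubuntu:4420
```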

## Disconnecting

nvmecontrol(8)'s `disconnect` command detaches the namespaces from a
remote controller and destroys the association.

Example 4: Disconnecting From a Remote I/O Controller

```
# nvmecontrol disconnect nvme0
```

The `disconnect-all` command destroys associations with all remote
controllers.
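
For example, to tear down every fabrics association at once:

```
# nvmecontrol disconnect-all
```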

## Reconnecting

If a connection is interrupted (for example, the TCP connection drops),
the association is torn down (all queues are disconnected), but the
`nvmeX` device is left in a quiesced state.  Any outstanding I/O
requests for remote namespaces remain pending as well.  In this state, the
`reconnect` command can be used to establish a new association to
resume operation with a remote controller.

Example 5: Reconnecting to a Remote I/O Controller

```
# nvmecontrol reconnect nvme0 ubuntu:4420 nvme-test-target
```

# Controller (Target)

The fabrics controller on FreeBSD exposes local block devices as NVMe
namespaces to remote hosts.  The controller support on FreeBSD
includes a userland implementation of a discovery controller as well
as an in-kernel I/O controller.  Similar to the existing iSCSI target
in FreeBSD, the in-kernel I/O controller uses CAM's target layer
(ctl(4)).

Block devices are created by adding ctl(4) LUNs via ctladm(8).  The
discovery service and initial handling of I/O controller connections
are managed by the nvmfd(8) daemon.

Example 6: Exporting a local ZFS Volume

```
# kldload nvmft nvmf_tcp
# ctladm create -b block -o file=/dev/zvol/bhyve/iscsi
LUN created successfully
backend:       block
device type:   0
LUN size:      4294967296 bytes
blocksize      512 bytes
LUN ID:        0
Serial Number: MYSERIAL0000
Device ID:     MYDEVID0000
# nvmfd -F -p 4420 -n nqn.2001-03.com.chelsio:frodo0 -K
```

Open associations can be listed via `ctladm nvlist` and can be
disconnected via `ctladm nvterminate`.
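
For example (a sketch; the `-a` flag to terminate all associations is an
assumption based on the existing iSCSI `isterminate` subcommand, so check
ctladm(8) for the exact options):

```
# ctladm nvlist
# ctladm nvterminate -a
```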

Eventually NVMe support should be added to ctld(8) by merging nvmfd(8)
into ctld(8).


-- 
John Baldwin