Can In-Kernel TLS (kTLS) work with any OpenSSL Application?
Neel Chauhan
nc at freebsd.org
Thu Jan 28 22:33:23 UTC 2021
Hi Mark,
Thank you so much for your response describing how QAT encryption works.
I learned that my server (HPE ProLiant ML110 Gen10) does not have QAT,
mainly because its chipset (Intel C621) doesn't include it.
For reference, my firewall box (Intel D-1518-based HPE ProLiant EC200a)
probably does, but I'm not going to use it for Tor.
Tor uses 512-byte packets (a.k.a. "cells"), so even if I had QAT it may
not work well, not to mention Tor is single-threaded.
I think I'll stick with kTLS with AES-NI when 13.0-RELEASE is out. Worst
case, I'll buy an AMD Ryzen-based PC and offload my Tor servers to it
(assuming the latest Ryzen beats a Skylake-era Xeon Scalable in
single-thread performance).
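
For context, here's a minimal sketch of how an OpenSSL application can
opt into kTLS on FreeBSD. This assumes an OpenSSL build with KTLS
support (SSL_OP_ENABLE_KTLS and the BIO_get_ktls_* helpers are OpenSSL
3.0 API), kern.ipc.tls.enable=1 on the kernel side, and, if I
understand it right, the ktls_ocf module for software/AES-NI offload.
It's just an illustration of the API surface, not what Tor does today:

    /*
     * Sketch: opting an OpenSSL 3.0 application into kernel TLS on
     * FreeBSD.  Assumes kern.ipc.tls.enable=1 and an OpenSSL built
     * with KTLS support.
     */
    #include <openssl/ssl.h>
    #include <stdio.h>

    static SSL_CTX *
    make_ktls_ctx(void)
    {
        SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());

        if (ctx == NULL)
            return (NULL);
        /* Ask OpenSSL to hand the record layer to the kernel. */
        SSL_CTX_set_options(ctx, SSL_OP_ENABLE_KTLS);
        return (ctx);
    }

    /*
     * After the handshake, BIO_get_ktls_send()/BIO_get_ktls_recv()
     * report whether the kernel actually took over TX/RX.
     */
    static void
    report_ktls(SSL *ssl)
    {
        printf("kTLS TX: %s, RX: %s\n",
            BIO_get_ktls_send(SSL_get_wbio(ssl)) ? "yes" : "no",
            BIO_get_ktls_recv(SSL_get_rbio(ssl)) ? "yes" : "no");
    }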
-Neel
On 2021-01-27 11:04, Mark Johnston wrote:
> On Sat, Jan 23, 2021 at 03:25:59PM +0000, Rick Macklem wrote:
>> Ronald Klop wrote:
>> >On Wed, 20 Jan 2021 21:21:15 +0100, Neel Chauhan <nc at freebsd.org> wrote:
>> >
>> >> Hi freebsd-current@,
>> >>
>> >> I know that In-Kernel TLS was merged into the FreeBSD HEAD tree a while
>> >> back.
>> >>
>> >> With 13.0-RELEASE around the corner, I'm thinking about upgrading my
>> >> home server, at least if I can accelerate any SSL application with it.
>> >>
>> >> I'm asking because I have a home server on a symmetrical Gigabit
>> >> connection (Google Fiber/Webpass), and that server runs a Tor relay. If
>> >> you're interested in how Tor works, the EFF has a writeup:
>> >> https://www.eff.org/pages/what-tor-relay
>> >>
>> >> But the main point for you all is: Tor relays more or less deal with
>> >> thousands of TLS connections going into and out of the server.
>> >>
>> >> Would In-Kernel TLS help with an application like Tor (or even load
>> >> balancers/TLS termination), or is it more for things like web servers
>> >> sending static files via sendfile() (e.g. the CDN used by Netflix)?
>> >>
>> >> My server could also work with Intel's QuickAssist (since it has an
>> >> Intel Xeon "Scalable" CPU). Would QuickAssist SSL be more helpful here?
>> There is now qat(4), which KTLS should be able to use, but I do
>> not think it has been tested for this. I also have no idea whether
>> it can be used effectively for userland encryption.
>
> KTLS requires support for separate output buffers and AAD buffers,
> which I hadn't implemented in the committed driver. I have a working
> patch which adds that, so when that's committed qat(4) could in
> principle be used with KTLS. So far I only tested with /dev/crypto
> and a couple of debug sysctls used to toggle between the different
> cryptop buffer layouts, not with KTLS proper.
>
> qat(4) can be used by userspace via cryptodev(4). This comes with a
> fair bit of overhead since it involves a round-trip through the kernel
> and some extra copying. AFAIK we don't have any framework for exposing
> crypto devices directly to userspace, akin to DPDK's polling mode
> drivers or netmap.
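
For anyone curious what that kernel round trip looks like, here is a
minimal sketch of a single synchronous AES-CBC request through
cryptodev(4), using the CRIOGET/CIOCGSESSION/CIOCCRYPT ioctls from
crypto(4); the key, IV and buffers are placeholders and most error
handling is omitted:

    /*
     * Sketch: one synchronous AES-CBC-256 encryption via cryptodev(4).
     * Placeholder key/IV/buffers; len must be a multiple of 16.
     */
    #include <sys/ioctl.h>
    #include <crypto/cryptodev.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    encrypt_once(char *key, char *iv, char *src, char *dst, size_t len)
    {
        struct session_op sess;
        struct crypt_op cop;
        int dev, fd;

        /* Each consumer clones its own fd off /dev/crypto. */
        dev = open("/dev/crypto", O_RDWR);
        if (dev < 0 || ioctl(dev, CRIOGET, &fd) < 0)
            return (-1);

        /* Create a session for AES-CBC with a 256-bit key. */
        memset(&sess, 0, sizeof(sess));
        sess.cipher = CRYPTO_AES_CBC;
        sess.keylen = 32;
        sess.key = key;
        if (ioctl(fd, CIOCGSESSION, &sess) < 0)
            return (-1);

        /* One request: the kernel copies src in and the result out to dst. */
        memset(&cop, 0, sizeof(cop));
        cop.ses = sess.ses;
        cop.op = COP_ENCRYPT;
        cop.src = src;
        cop.dst = dst;
        cop.len = len;
        cop.iv = iv;
        if (ioctl(fd, CIOCCRYPT, &cop) < 0)
            return (-1);

        ioctl(fd, CIOCFSESSION, &sess.ses);
        close(fd);
        close(dev);
        return (0);
    }

This is the copy-in/copy-out overhead Mark mentions: each CIOCCRYPT is
a full syscall plus two buffer copies.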
>
> I've seen a few questions about the comparative (dis)advantages of QAT
> and AES-NI so I'll sidetrack a bit and try to characterize qat(4)'s
> performance here based on some microbenchmarking I did this week. This
> was all done in the kernel and so might need some qualification if
> you're interested in using qat(4) from userspace. Numbers below are
> gleaned from an Atom C3558 at 2.2GHz with an integrated QAT device. I
> mostly tested AES-CBC-256 and AES-GCM-256.
>
> The high-level tradeoffs are:
> - qat(4) introduces a lot of latency. For a single synchronous
>   operation it can take between 2x and 100x more time than aesni(4) to
>   complete. aesni takes 1000-2000 cycles to handle a request plus
>   3-5 cycles per byte depending on the algorithm. qat takes at least
>   ~150,000 cycles between calling crypto_dispatch() and the cryptop
>   completion callback, plus 5-8 cycles per byte. qat dispatch itself
>   is quite cheap, typically 1000-2000 cycles depending on the size of
>   the buffer. Handling a completion interrupt involves a context
>   switch to the driver ithread but this is also a small cost relative
>   to the entire operation. So, for anything where latency is crucial
>   QAT is probably not a great bet. (A rough per-request cost model
>   along these lines is sketched after this list.)
> - qat can save a correspondingly large number of CPU cycles. It takes
>   qat roughly twice as long as aesni to complete encryption of a 32KB
>   buffer using AES-CBC-256 (more with GCM), but with qat the CPU is
>   idle much of the time. Dispatching the request to firmware takes
>   less than 1% of the total time elapsed between request dispatch and
>   completion, even with small buffers. OTOH with really small buffers
>   aesni can complete a request in the time that it takes qat just to
>   dispatch the request to the device, so at best qat will give
>   comparable throughput and CPU usage and worse latency.
> - qat can handle multiple requests in parallel. This can improve
> throughput dramatically if the producer can keep qat busy.
> Empirically, the maximum throughput improvement is a function of the
> request size. For example, counting the number of cycles required to
> encrypt 100,000 buffers using AES-GCM-256:
>
>     max # in flight      1       16       64      128
>
>     aesni, 16B        206M      n/a      n/a      n/a
>     aesni, 4KB        1.52B     n/a      n/a      n/a
>     aesni, 32KB       10.8B     n/a      n/a      n/a
>     qat, 16B          17.1B    1.11B     219M     184M
>     qat, 4KB          20.9B    1.68B     710M     694M
>     qat, 32KB         38.2B    8.37B    4.25B    4.23B
>
>   As a side note, OpenCrypto supports async dispatch for software
>   crypto drivers, in which crypto_dispatch() hands work off to other
>   threads. This is enabled by net.inet.ipsec.async_crypto, for
>   example. Of course, the maximum parallelism is limited by the
>   number of CPUs in the system, but this can improve throughput
>   significantly as well if you're willing to spend the corresponding
>   CPU cycles.
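
To make the fixed-cost vs. per-byte numbers above concrete, here is a
tiny back-of-the-envelope model. The constants are rough midpoints of
the ranges Mark quotes (aesni ~1500 cycles + ~4/byte, qat ~150,000
cycles + ~6.5/byte for a single synchronous request) and are for
illustration only:

    /*
     * Back-of-the-envelope single-request cost model using rough
     * midpoints of the cycle counts quoted above (illustrative only).
     */
    #include <stdio.h>

    /* aesni: ~1500 cycles fixed + ~4 cycles/byte. */
    static double
    aesni_cycles(size_t len)
    {
        return (1500.0 + 4.0 * len);
    }

    /* qat: ~150,000 cycles dispatch-to-completion + ~6.5 cycles/byte. */
    static double
    qat_cycles(size_t len)
    {
        return (150000.0 + 6.5 * len);
    }

    int
    main(void)
    {
        size_t sizes[] = { 16, 4096, 32768 };

        for (int i = 0; i < 3; i++) {
            size_t n = sizes[i];
            printf("%6zu bytes: aesni ~%.0f cycles, qat ~%.0f cycles "
                "(qat %.1fx slower per request)\n",
                n, aesni_cycles(n), qat_cycles(n),
                qat_cycles(n) / aesni_cycles(n));
        }
        return (0);
    }

That reproduces the shape of the latency gap: roughly 100x at 16 bytes,
shrinking to a small single-digit factor at 32KB, which is why
per-request latency favors aesni even where qat wins on CPU time.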
>
> To summarize, QAT can be beneficial when some or all of the following
> apply:
> 1. You have large requests. qat can give comparable throughput for
>    small requests if the producer can exploit parallelism in qat,
>    though OpenCrypto's backpressure mechanism is really primitive
>    (arguably non-existent) and performance will tank if things get to
>    a point where qat can't keep up.
> 2. You're able to dispatch requests in parallel. But see point 1.
> 3. CPU cycles are precious and the extra latency is tolerable.
> 3b. aesni doesn't implement some transform that you care about, but qat
> does. Some (most?) Xeons don't implement the SHA extensions for
> instance. I don't have a sense for how the plain cryptosoft driver
> performs relative to aesni though.
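
As an aside on that last point: whether a given CPU actually has the
SHA extensions (and AES-NI) is easy to check from userland with CPUID.
Here is a minimal sketch using the GCC/Clang <cpuid.h> helpers (AES-NI
is CPUID leaf 1, ECX bit 25; SHA is CPUID leaf 7 subleaf 0, EBX bit 29):

    /* Sketch: probe AES-NI and SHA extension support via CPUID. */
    #include <cpuid.h>
    #include <stdio.h>

    int
    main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        int aesni = 0, sha = 0;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            aesni = (ecx >> 25) & 1;        /* AES-NI */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            sha = (ebx >> 29) & 1;          /* SHA extensions */

        printf("AES-NI: %s, SHA extensions: %s\n",
            aesni ? "yes" : "no", sha ? "yes" : "no");
        return (0);
    }

FreeBSD's boot-time CPU identification also lists these flags (AESNI,
SHA) in dmesg.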