BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative
- Reply: Poul-Henning Kamp: "Re: BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative"
- Reply: Gavin D. Howard: "Re: BPF64: proposal of platform-independent hardware-friendly backwards-compatible eBPF alternative"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Tue, 10 Sep 2024 01:05:44 UTC
Hello! We don't need ELF relocations! We want better loop control! No so little parameters, Verifier! Leave our code alone! -- Ping Floyd I've recently had some experience with Linux's ePBF in it's XDP, and this left quite negative impression. I was following via https://github.com/xdp-project/xdp-tutorial and after 3rd lesson was trying to create a simple program for searching TCP timestamp option and incrementing it by one. As you know, eBPF tool stack consists of at least clang and eBPF verifier in the kernel, and after two dozen tries eBPF verifier still didn't accept my code. I was digging into verifier sources, and the abysses opened in front of me! Carefully and boringly going via disassembler and verifier output, I've found that clang optimizer ignores just checked register - patching one byte in assembler sources (and target .o) did help. I've filed https://github.com/iovisor/bcc/issues/5062 with details if one curious. So, looking at eBPF ecosystem, I must say it's a Frankenstein. Sewn from good, sometimes brilliant parts, it's a monster in outcome. Verifier is in it's own right, compiler/optimizer is in it's own right... But at the end you even don't have a high-level programming language! You must write in C, relatively low-level C, and restricted subset of C. This requires very skilled professionals - it's far from something not even user-friendly, but at least sysadmin-friendly, like `ipfw` or `iptables` firewall rules. Thus I looked at the foundation of eBPF architecture, with which presuppositions in mind it was created with. In fact, it tries to be just usual programming after checks - that is, with all that pointers. It's too x86-centric and Linux-centric - number of registers was added just ten. So if you look at the GitHub ticket above, when I tried to add debug to program - you know, just specific `printf()`s - it failed verifier checks again because compiler now had to move some variables between registers and memory, as there is limit on just 5 arguments to call due to limit of 5 registers! And verifier, despite being more than 20,000 lines of code, still was not smart enough to track info between registers and stack. So, if we'd started from beginning, what should we do? Remember classic BPF: it has very simple validator due to it's Virtual Machine design - only forward jumps, checks for packet boundaries at runtime, etc. You'd say eBPF tries for performance if verifier's checks were passed? But in practice you have to toss in as much packet boundary checks as near to actual access as possible, or verifier may "forget" it, because of compiler optimizer. So this is not of much difference for checking if access is after packet in classic BPF - the same CMP/JUMP in JIT if buffer is linear, and if your OS has put packet in several buffers, like *BSD or DPDK `mbuf`'s, the runtime check overhead is negligible in comparison. Ensuring kernel stability? Just don't allow arbitrary pointers, like original BPF. Guaranteed termination time? It's possible if you place some restrictions. For example, don't allow backward jumps but allow function calls - in case of stack overflow, terminate program. Really need backward jumps? Let's analyze for what purpose. You'll find these are needed for loops on packet contents. Solve it but supporting loops in "hardware"-controlled loops, which can't be infinite. Finally, platforms. It's beginning of sunset of x86 era now - RISC is coming. ARM is now not only on mobiles, but on desktops and servers. Moreover, it's era of specialized hardware accelerators - e.g. GPU, neural processors. Even general purpose ARM64 has 31 register, and specialized hardware can implement much more. Then, don't tie to Linux kernel - BPF helpers are very rigid interface, from ancient era, like syscalls. So, let's continue *Berkeley* Packet Filter with Berkeley RISC design - having register window idea, updated by SPARC and then by Itanium (to not waste registers). Take NetBSD's coprocessor functions which set is passed with a context, instead of hardcoded enums of functions - for example, BPF maps is not something universal, both NetBSD and FreeBSD have their own tables in firewall. Add more features actually needed for *network* processor - e.g. 128-bit registers for IPv6 (eBPF axed out even BPF_MSH!). And do all of this in fully backwards-compatible way - new language should allow to run older programs from e.g. `tcpdump` to run without any modifications, binary-compatible (again, eBPF does not do this - it's incompatible with classiv BPF and uses a translator from it). Next, eBPF took "we are masquerading usual x86 programming" way not only just in assembly language. They have very complex ELF infrastructure around it which may be not suitable for every network card - having pc-addressed literals, as in RISC processors allows for much simpler format: just BLOB of instructions. BPF64 adds BPF_LITERAL "instruction" of varying length (it's interpreted by just skipping over contents as if it was jump), which, if have special signatures and format, allow for this BLOB of instructions to contain some metadata about itself for loading, much simpler than ELF (esp. with DWARF). Then, ecosystem. eBPF defines functions callable from user code like: > enum bpf_func_id___x { BPF_FUNC_snprintf___x = 42 /* avoid zero */ }; That is, ancient syscall-like way of global constant, instead of context. A "context" here is the structure passed with code to execution which contains function pointers of what is available to this user code, in spirit of NetBSD's `bpf_ctx_t` for their BPF_COP/BPF_COPX extensions. This is not only provides better way than "set in stone" syscall-like number, but BPF64 goes further and defines an "packages" in running kernel with namespaces to allow e.g. Foo::Bar::baz() function to call Foo::quux() from another BPF program, populating ("linking") it's context with needed function without relocations. These "packages" expected to be available to admin in e.g. sysctl tree, with descriptions, versioning and other attributes. Some other quotes about how restricted eBPF is: > First, a BPF program using bpf_trace_printk() has to have a GPL-compatible license. > Another hard limitation is that bpf_trace_printk() can accept only up to 3 input arguments (in addition to fmt and fmt_size). This is quite often pretty limiting and you might need to use multiple bpf_trace_printk() invocations to log all the data. This limitation stems from the BPF helpers ability to accept only up to 5 input arguments in total. > Previously, bpf_trace_printk() allowed the use of only one string (%s) argument, which was quite limiting. Linux 5.13 release lifts this restriction and allows multiple string arguments, as long as total formatted output doesn't exceed 512 bytes. Another annoying restriction was the lack of support for width specifiers, like %10d or %-20s. This restriction is gone now as well > Helper function bpf_snprintf > Outputs a string into the str buffer of size str_size based on a format string stored in a read-only map pointed by fmt. > > Each format specifier in fmt corresponds to one u64 element in the data array. For strings and pointers where pointees are accessed, only the pointer values are stored in the data array. The data_len is the size of data in bytes - must be a multiple of 8. > > Formats %s and %p{i,I}{4,6} require to read kernel memory. Reading kernel memory may fail due to either invalid address or valid address but requiring a major memory fault. If reading kernel memory fails, the string for %s will be an empty string, and the ip address for %p{i,I}{4,6} will be 0. Not returning error to bpf program is consistent with what bpf_trace_printk() does for now. > > Returns > > The strictly positive length of the formatted string, including the trailing zero character. If the return value is greater than str_size, str contains a truncated string, guaranteed to be zero-terminated except when str_size is 0. > > Or -EBUSY if the per-CPU memory copy buffer is busy. > > static long (* const bpf_snprintf)(char *str, __u32 str_size, const char *fmt, __u64 *data, __u32 data_len) = (void *) 165; So. In summary: BPF64 has not only traditional ("firewall on network packets") area of use but also suitable - and having goals to design in mind - for: * non-network. e.g. syscall arguments checking * network protocols passing untrusted code which can be run by other side in restricted sandbox (e.g. muSCTP custom (de)compression rules) * an alternative to https://capnproto.org/rpc.html for multi-method calls on high-latency links (e.g. MQTT5-like and/or for environments when Cap’n Proto itself is not applicable, e.g. no direct connections) I've put a sketch of design to https://github.com/nuclight/bpf64 with files: * The `nuc_ts_prog_kern.c` (and it's include `nuc_ts_common_kern_user.h`) is XDP/eBPF program (for Linux 6.5) for parsing TCP packet and incrementing it's Timestamp option, if any, recording statisitics intop eBPF map. * The `nuc_ts_incr.baw` (and it's include `nuc_ts_incr.bah`) is the equivalent program doing the same thing, but in a new BPF64 Assembler Wrapper language, not yet written and subject to change. Note this is a lower-level language than C, viewed as intermediate solution until BPF64 becomes stable, after which more higher-level language (higher than C) should be written, at least as expressible as `tcpdump` (`libpcap`) one. * `bpf64spec.md` for draft of Specification, but currently it's more in a "a draft of draft" state :-) I am requesting comments about this architecture before implementing, especially from people knowledgeable in JIT compilers, because, though while I see BPF64 much more simpler than eBPF (probably just several man-months to implement), my sabbatical has ended - and design mistakes are much harder to fix when implementation already exists. -- WBR, @nuclight