I read Learning eBPF, so I’d like to leave some reading notes. The book was released last year and I’d been wanting to read it for a while. The companion repository, lizrice/learning-ebpf on github.com, has plenty of sample code for reference. I’ll just note the things that personally caught my attention, without paying much attention to context.

BCC

The book starts with an example using BCC. bpf_trace_printk() writes text to the pseudo-file /sys/kernel/debug/tracing/trace_pipe. That is fine for simple cases, but if you want a separate output destination for each eBPF program, you can use a BPF map and handle the data exchange between kernel and user space yourself. Perf ring buffers and BPF ring buffers both let you pass arbitrary data structures (structs). The former has a separate buffer per CPU, while the latter is a single buffer shared by all CPUs, which keeps events in the correct order. The BPF ring buffer apparently also performs better.

A mechanism called tail calls lets one eBPF program call another. A tail call does not return to the original eBPF program when it completes, so it behaves like a jump. To use tail calls, you need to prepare a map of type BPF_MAP_TYPE_PROG_ARRAY in advance.
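To make the ordering difference between the two buffer types concrete, here is a toy model in plain Python. It does not touch the kernel or BCC. The event list, the function names, and the "drain one CPU after another" consumer are all assumptions made for illustration; a real perf buffer consumer may interleave reads differently.

```python
from collections import deque

# Hypothetical events as (timestamp, cpu, message); not real kernel data.
EVENTS = [
    (1, 0, "open"),
    (2, 1, "read"),
    (3, 0, "write"),
    (4, 1, "close"),
]


def consume_per_cpu(events, num_cpus=2):
    """Model a perf buffer: one queue per CPU, drained CPU by CPU."""
    buffers = [deque() for _ in range(num_cpus)]
    for ts, cpu, msg in events:
        buffers[cpu].append((ts, msg))
    drained = []
    for buf in buffers:
        drained.extend(buf)
    return [msg for _, msg in drained]


def consume_shared(events):
    """Model a BPF ring buffer: a single queue shared by all CPUs."""
    ring = deque()
    for ts, _cpu, msg in events:
        ring.append((ts, msg))
    return [msg for _, msg in ring]


per_cpu_order = consume_per_cpu(EVENTS)
shared_order = consume_shared(EVENTS)
print(per_cpu_order)  # ['open', 'write', 'read', 'close']
print(shared_order)   # ['open', 'read', 'write', 'close']
```

In the per-CPU model the consumer would have to sort by timestamp to recover the global order, whereas the shared queue hands events over in the order they were submitted.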

Virtual Machine

There are 10 general-purpose registers, plus a stack frame pointer 1. The book explains the calling convention: reg1 holds the argument (the context) passed to the eBPF program, and reg0 holds its return value. For function calls, reg1 to reg5 carry the arguments. Each instruction is 64 bits long, but there are also wide instructions that span two 64-bit slots. The SEC() macro sets the name of the section the code ends up in after compilation. You can compile C to eBPF bytecode, and Rust as well.
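The 64-bit instruction format can be sketched in a few lines of Python: an 8-bit opcode, two 4-bit register fields, a 16-bit offset, and a 32-bit immediate. The helper names `encode` and `encode_ld_imm64` are my own for this sketch; the opcode values come from the eBPF instruction set. It builds the smallest possible program (set reg0 to 0, then exit) and one wide instruction that loads a 64-bit constant.

```python
import struct

# Opcode values from the eBPF instruction set.
BPF_MOV64_IMM = 0xB7  # BPF_ALU64 | BPF_MOV | BPF_K
BPF_EXIT = 0x95       # BPF_JMP | BPF_EXIT
BPF_LD_IMM64 = 0x18   # BPF_LD | BPF_IMM | BPF_DW (wide instruction)


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte instruction: opcode, regs, offset, immediate."""
    regs = (src << 4) | dst  # dst in the low nibble, src in the high nibble
    return struct.pack("<BBhI", opcode, regs, off, imm & 0xFFFFFFFF)


def encode_ld_imm64(dst, value):
    """Pack a wide instruction: two 8-byte slots carrying one 64-bit constant."""
    lo = value & 0xFFFFFFFF
    hi = (value >> 32) & 0xFFFFFFFF
    return encode(BPF_LD_IMM64, dst=dst, imm=lo) + encode(0, imm=hi)


# reg0 = 0; exit  (reg0 carries the program's return value)
program = encode(BPF_MOV64_IMM, dst=0, imm=0) + encode(BPF_EXIT)
wide = encode_ld_imm64(dst=1, value=0x1122334455667788)

print(program.hex())  # b7000000000000009500000000000000
print(len(wide))      # 16
```

The second slot of the wide instruction has an opcode of 0 and only contributes the upper 32 bits of the constant.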

bpf(2)

The bpftool command is a convenient tool that can load and attach eBPF programs, and it uses bpf(2) internally. Many operations can be done with bpf(2), but attachment comes in several variations: sometimes bpf(2), sometimes a combination of perf_event_open(2) and ioctl(2). eBPF programs and maps loaded into the kernel are deleted automatically when their reference count drops to 0, but you can add a reference by creating a BPF link or by pinning to a special file. This is how bpftool keeps an eBPF program loaded even after the command itself has exited.
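The lifetime rule above can be modelled with a toy reference counter. This is only an illustration in plain Python, not the kernel's implementation: the `KernelObject` class, the holder strings, and the pin path under /sys/fs/bpf are all hypothetical.

```python
class KernelObject:
    """Toy stand-in for a loaded eBPF program or map (not a real kernel API)."""

    def __init__(self, name):
        self.name = name
        self.holders = set()

    def get(self, holder):
        """Take a reference on behalf of `holder`."""
        self.holders.add(holder)

    def put(self, holder):
        """Drop the reference held by `holder`."""
        self.holders.discard(holder)

    @property
    def alive(self):
        """The object survives only while at least one reference remains."""
        return len(self.holders) > 0


def run_loader(pin):
    """Load a program, optionally pin it, then exit the loader process."""
    prog = KernelObject("hello")
    prog.get("fd held by loader process")
    if pin:
        prog.get("pin at /sys/fs/bpf/hello")  # hypothetical pin path
    prog.put("fd held by loader process")  # loader exits, its fd is closed
    return prog.alive


print(run_loader(pin=False))  # False
print(run_loader(pin=True))   # True
```

Without the pin, the loader's file descriptor is the only reference, so the program disappears as soon as the process exits.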

CO-RE, libbpf

To use CO-RE (Compile Once - Run Everywhere), use libbpf instead of BCC. It is supported in kernel 5.4 and later. Including vmlinux.h gives you the definitions of many kernel structures. bpf_core_read reads kernel data with relocation taken into account, which bridges the gap between the kernel version at compile time and the kernel version actually running. Compiling with -g makes it easier to debug eBPF verifier errors. Loops used to have to be unrolled, but bpf_loop and bpf_for_each are now provided.
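The idea behind CO-RE relocation can be sketched as "remember the field name, fix up the offset at load time". The following is a toy model in plain Python, not how libbpf is implemented. Both struct layouts, the field names, and the helper functions are made up for illustration.

```python
import struct

# Hypothetical layouts of the same struct on two kernel versions:
# field name -> (byte offset, struct format).
BUILD_LAYOUT = {"pid": (0, "<i"), "uid": (4, "<i")}
TARGET_LAYOUT = {"flags": (0, "<i"), "pid": (4, "<i"), "uid": (8, "<i")}


def compile_access(field):
    """'Compile time': record the field name alongside the build-time offset."""
    offset, fmt = BUILD_LAYOUT[field]
    return {"field": field, "offset": offset, "fmt": fmt}


def relocate(access, target_layout):
    """'Load time': look the field up by name and patch in the target offset."""
    offset, fmt = target_layout[access["field"]]
    return {**access, "offset": offset, "fmt": fmt}


def read(access, memory):
    """Read the field from raw memory at the recorded offset."""
    return struct.unpack_from(access["fmt"], memory, access["offset"])[0]


# Memory image of the struct as laid out on the target kernel.
memory = struct.pack("<iii", 0x7, 1234, 1000)

access = compile_access("pid")
stale = read(access, memory)                           # uses build-time offset
fixed = read(relocate(access, TARGET_LAYOUT), memory)  # uses patched offset
print(stale, fixed)  # 7 1234
```

The unrelocated read silently returns the wrong field (`flags` instead of `pid`), which is the kind of mismatch the relocation step is there to prevent.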

Program Types

There are about 30 program types and over 40 attachment types. You can list the helper functions available for each program type with bpftool feature. A mechanism called kfuncs lets kernel functions be registered with the BPF subsystem. There are also core BPF kfuncs that are compatible across kernel versions. It’s x86 only, but it’s better to use fentry/fexit rather than kprobe/kretprobe. fentry/fexit were introduced together with the idea of the BPF trampoline in kernel 5.5. fexit is also convenient in that it gives you both the arguments and the return value together. Tracepoints provide a stable interface. There are also BTF tracepoints that absorb differences in structure members across kernel versions. You can attach to user space functions with uprobe/uretprobe and USDT. There are program types for LSM (Linux Security Module), which allow security policies to be enforced from eBPF. This was originally done with kernel modules.

Network

Targets include sockets, TC, XDP, Flow dissector 2, LWT (Light Weight Tunnel), cgroup, infrared controllers, etc. However, LWT is apparently not used much. cgroup has other uses too, but what relates to eBPF is mostly network stuff. The relationship between program types and attachment types is summarized in bpf_prog_load_check_attach(). For networking, there are many examples such as Cilium, Meta’s XDP usage, and Cloudflare’s DDoS mitigation. The book walks through an XDP implementation of an inline load balancer. TC differs from XDP in that it supports both ingress and egress and can use the sk_buff structure.

Attaching a uprobe to functions inside the OpenSSL library, such as SSL_read() or SSL_write(), lets you capture the plaintext. Of course, this won’t work if the application is statically linked or uses a different library. Pixie’s OpenSSL Tracer and BCC’s sslsniff implement this.

Cilium uses eBPF to bypass the network stack in the host netns. iptables in particular doesn’t work well with Kubernetes, where IP addresses change frequently. The computational cost of evaluating rules is also a problem 3. eBPF is also involved in transparent encryption for Kubernetes with IPsec and WireGuard. I’m not sure how much it relates to eBPF, but ideas like BIG TCP 4 were also mentioned.

Security

There are tools like Falco that hook syscall entry/exit with eBPF and fire alerts. There’s also cilium/tetragon, a project that attaches to kprobes. Here, TOCTOU (time-of-check to time-of-use) issues can occur 5 6, where the data pointed to by an argument pointer changes between the check and the use. As a related technique, bpf_send_signal() lets you send a signal from within a hook. This prevents attacks that exploit slight time differences.
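The TOCTOU problem can be shown with a small deterministic simulation. This is plain Python with no eBPF involved; `hooked_open`, `policy`, `swap_path`, and the file paths are all hypothetical names chosen for the sketch. A bytearray plays the role of user-space memory that both the hook and the kernel read through the same pointer.

```python
def hooked_open(user_buffer, policy_check, attacker=None):
    """Simulate a syscall whose argument is inspected before it is used."""
    # Time of check: the hook reads the path through the user-space pointer.
    checked_path = bytes(user_buffer)
    allowed = policy_check(checked_path)
    # Window: user space can still rewrite its own buffer.
    if attacker is not None:
        attacker(user_buffer)
    # Time of use: the kernel copies the path in for the actual operation.
    used_path = bytes(user_buffer)
    return allowed, checked_path, used_path


def policy(path):
    """Hypothetical policy: everything except /etc/shadow is allowed."""
    return path != b"/etc/shadow"


def swap_path(buf):
    """Hypothetical attacker thread: overwrite the path after the check."""
    buf[:] = b"/etc/shadow"


buf = bytearray(b"/tmp/harmless")
allowed, checked, used = hooked_open(buf, policy, swap_path)
print(allowed, checked, used)  # True b'/tmp/harmless' b'/etc/shadow'
```

The policy approved one path while the operation acted on another, which is why the check is more reliable when it runs on data that user space can no longer change.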

Languages

bpftrace is a high-level language. It’s useful when you want to trace something quickly. cilium/ebpf-go can automatically build eBPF code written in C and embed the bytecode into Go code. Its features include CO-RE support and being written entirely in Go. Aya lets you implement both the kernel side and the user side in Rust. aya-tool automatically generates Rust data structures.

During development, BPF_PROG_RUN lets you test-run a program from user space, which helps with debugging. Enabling /proc/sys/kernel/bpf_stats_enabled makes the kernel collect statistics, which is convenient. Future evolution includes eBPF program signing (made difficult because CO-RE modifies the binary), long-lived kernel pointers, and memory allocation.