I read Linux Observability with BPF, so here are some notes. Brendan Gregg's BPF book is also coming out this month, and I'd like to read that too.
- Using BPF allows hooking kernel events and safely executing code
- BPF verifies that the code won’t destroy or crash the system
- Unlike kernel modules, BPF programs don’t require kernel recompilation
- After BPF code is verified, BPF bytecode is JIT-compiled to machine instructions
- BPF programs are loaded into the BPF VM with the `bpf` syscall
- Alexei Starovoitov introduced eBPF in early 2014
- Old BPF only allowed 2 32-bit registers, but eBPF allows up to 10 64-bit registers
- In June 2014, eBPF was also extended to user space
- BPF program types: Can be broadly classified into tracing and networking
- Socket Filter: First program type to enter the kernel. For observation only
- Kprobe: Run a BPF program as a kprobe handler. Entry and exit correspond to `SEC(kprobe/sys_exec)` and `SEC(kretprobe/sys_exec)` respectively (see the restricted-C sketch after this list)
- Tracepoint: Run a BPF program on predefined tracepoints on the kernel side. The list can be checked at `/sys/kernel/debug/tracing/events`
- XDP: Run a BPF program at an early stage when packets arrive
- Perf: Run BPF program for perf events. BPF program executes every time perf outputs data
- Cgroup Socket: Attached to all processes in that cgroup
- Socket Option: Facebook uses this to control RTOs (recovery time objectives) in connections within data centers
- Socket Map: Used when implementing load balancers in BPF. Cilium and Facebook Katran use this
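
As a rough sketch of the `SEC()` convention in restricted C (the probed symbol `sys_execve` and the function names are illustrative; exact kernel symbol names vary by kernel version and architecture, and the section names are interpreted by whichever loader is used). It assumes libbpf's `bpf/bpf_helpers.h`:

```c
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>   /* SEC(), bpf_trace_printk() */

/* Entry probe: runs whenever the probed kernel function is entered */
SEC("kprobe/sys_execve")
int kprobe_sys_execve(struct pt_regs *ctx)
{
    char msg[] = "execve entered";
    bpf_trace_printk(msg, sizeof(msg));
    return 0;
}

/* Return probe: runs when the probed kernel function returns */
SEC("kretprobe/sys_execve")
int kretprobe_sys_execve(struct pt_regs *ctx)
{
    char msg[] = "execve returned";
    bpf_trace_printk(msg, sizeof(msg));
    return 0;
}

/* bpf_trace_printk() requires a GPL-compatible license */
char LICENSE[] SEC("license") = "GPL";
```

This compiles with the clang command shown in the shell section below.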
- BPF Verifier
- In the past there were vulnerabilities, such as CVE-2017-16995, that could bypass the verifier's checks and access kernel memory
- Verifies with a depth-first search that the program terminates and that no dangerous code paths exist
- All loops are prohibited to reject infinite loops. Loop permission is still at the proposal stage at the time of writing
- Instruction count is limited to 4096
- Verification results can be checked by setting the `log_*` fields of the `bpf` syscall (see the sketch after this list)
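
A minimal user-space sketch of reading the verifier log: it loads a trivial two-instruction program and sets the `log_*` fields of `union bpf_attr` when calling the `bpf` syscall. The buffer size, program type, and log level are arbitrary choices here:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
    /* Trivial program: r0 = 0; exit */
    struct bpf_insn insns[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };
    static char log_buf[65536];

    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (__u64)(unsigned long)insns;
    attr.insn_cnt  = sizeof(insns) / sizeof(insns[0]);
    attr.license   = (__u64)(unsigned long)"GPL";
    attr.log_buf   = (__u64)(unsigned long)log_buf;
    attr.log_size  = sizeof(log_buf);
    attr.log_level = 2;   /* ask the verifier for verbose output */

    int fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    if (fd < 0)
        fprintf(stderr, "BPF_PROG_LOAD failed: %s\n", strerror(errno));

    /* The verifier writes its analysis (register states, rejected paths, ...) here */
    printf("%s\n", log_buf);
    return 0;
}
```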
- BPF programs can call other BPF programs via tail calls
- The caller's state is not carried over across the call, so something like a BPF map is needed to share information (see the sketch below)
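
A sketch of how a tail call looks in restricted C, assuming libbpf's `bpf/bpf_helpers.h` and BTF-style map declarations (program and map names are illustrative): the caller jumps through a `BPF_MAP_TYPE_PROG_ARRAY`, which user space has to populate with the callee program's fd beforehand.

```c
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

/* Program array holding tail-call targets; user space must store the
 * callee program's fd at index 0 with bpf_map_update_elem() */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
} progs SEC(".maps");

SEC("kprobe/callee")
int callee(struct pt_regs *ctx)
{
    /* Runs with a fresh stack: nothing from the caller's stack survives
     * the jump, so shared state has to go through a BPF map */
    char msg[] = "reached via tail call";
    bpf_trace_printk(msg, sizeof(msg));
    return 0;
}

SEC("kprobe/caller")
int caller(struct pt_regs *ctx)
{
    /* On success this never returns to the caller */
    bpf_tail_call(ctx, &progs, 0);

    /* Reached only if index 0 is empty or the tail call fails */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```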
- BPF Maps
- BPF maps can be created directly with the `bpf` syscall
- The `bpf_map_create` helper function is easier to use
- Note that the signature of `bpf_map_update_elem` differs between the kernel side (`bpf/bpf_helpers.h`) and the user side (`tools/lib/bpf/bpf.h`)
- Converting error numbers to strings with `strerror(errno)` is useful
- Unlike other helper functions, `bpf_map_get_next_key` can only be used on the user side, so be careful
- array, hash, and cgroup storage maps have spin locks, so concurrent access is possible
- array map pre-allocates memory for the number of elements and is initialized with zeros. Used for global variable allocation?
- There are also LRU hash maps and LPM (Longest Prefix Match) trie maps
- From Linux 4.4, two new commands (`BPF_OBJ_PIN` / `BPF_OBJ_GET`) were added to the `bpf` syscall to persist and fetch maps and BPF programs via the BPF virtual filesystem (see the sketch after this list)
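
A user-space sketch tying these points together with the raw `bpf` syscall: create a hash map, update an element (this is what the user-side `bpf_map_update_elem(fd, ...)` from `tools/lib/bpf/bpf.h` does, whereas the kernel-side helper takes a map pointer), and pin the map to the BPF virtual filesystem. Map parameters and the pin path are illustrative.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper around the bpf(2) syscall */
static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;

    /* BPF_MAP_CREATE: hash map with int keys and long values */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(long);
    attr.max_entries = 100;
    int fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0) {
        fprintf(stderr, "map create failed: %s\n", strerror(errno));
        return 1;
    }

    /* BPF_MAP_UPDATE_ELEM: the user-side helper wraps this command and
     * takes the map fd; the kernel-side helper takes the map itself */
    int key = 1;
    long value = 1234;
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key    = (__u64)(unsigned long)&key;
    attr.value  = (__u64)(unsigned long)&value;
    attr.flags  = BPF_ANY;
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) < 0)
        fprintf(stderr, "update failed: %s\n", strerror(errno));

    /* BPF_OBJ_PIN: persist the map via the BPF virtual FS
     * (requires /sys/fs/bpf to be mounted, see the mount command below) */
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_hash_map";
    attr.bpf_fd   = fd;
    if (sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr)) < 0)
        fprintf(stderr, "pin failed: %s\n", strerror(errno));

    return 0;
}
```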
- Tracing
- kprobes/kretprobes: Not a stable ABI, so need to check probe target function signature in advance. May change with each Linux version
- The BPF program `context` changes depending on the program type
- The tracepoint API is stable across Linux versions. The list can be checked at `/sys/kernel/debug/tracing/events`
- USDTs (user statically defined tracepoints): Static trace points in user programs
- BPFTool
- Actively developed, so compile from Linux src
- What's available can be checked with `bpftool feature`
- If JIT is disabled, it can be enabled with `echo 1 > /proc/sys/net/core/bpf_jit_enable`
- The list of BPF programs and maps can be checked with `bpftool map show` or `bpftool prog show`
- Keeping bpftool batch files under version control is recommended
- BPFTrace
- High-level DSL for BPF
- awk-like structure with `BEGIN`, `END`, and the actual tracing blocks
- kubectl-trace
- Execute BPFTrace programs as kubernetes jobs
- eBPF Exporter
- Forward BPF tracing results to Prometheus. Used at Cloudflare
- XDP
- The `xdp_buff` passed as `context` is a simplified `sk_buff` (a minimal program is sketched after this list)
- Has 3 operation modes
- Native XDP: Runs the BPF program in the driver's receive path, right after a packet arrives. Whether a NIC is supported can be checked with `git grep -l XDP_SETUP_PROG drivers/`
- Offloaded XDP: Offloads the BPF program to the NIC itself. Supported NICs can be checked with `git grep -l XDP_SETUP_PROG_HW drivers/`
- Generic XDP: Test mode for developers
- XDP allows unit testing of programs (see the `bpf_prog_test_run` sketch below)
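
A minimal XDP program as a sketch (the drop condition is just an example; the section name matches the `ip link ... sec mysection` command shown later): the context exposes the raw packet through `data`/`data_end`, and the verifier insists on explicit bounds checks before any access.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("mysection")
int xdp_example(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds check required by the verifier before reading packet data */
    if (data + sizeof(struct ethhdr) > data_end)
        return XDP_DROP;   /* drop frames shorter than an Ethernet header */

    return XDP_PASS;       /* let everything else continue up the stack */
}

char LICENSE[] SEC("license") = "GPL";
```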
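For the unit-testing point, a sketch using the older libbpf helper `bpf_prog_test_run` from `tools/lib/bpf/bpf.h` (newer libbpf replaces it with `bpf_prog_test_run_opts`); `prog_fd` is assumed to have been obtained by loading the XDP object elsewhere:

```c
#include <assert.h>
#include <string.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>   /* bpf_prog_test_run() */

/* Feed a dummy packet through an already-loaded XDP program and check
 * its return code without attaching it to a real interface */
void test_xdp_pass(int prog_fd)
{
    char packet_in[64];
    char packet_out[64];
    __u32 out_size = sizeof(packet_out);
    __u32 retval = 0;
    __u32 duration = 0;

    memset(packet_in, 0, sizeof(packet_in));   /* dummy Ethernet-sized frame */

    int err = bpf_prog_test_run(prog_fd, 1, packet_in, sizeof(packet_in),
                                packet_out, &out_size, &retval, &duration);
    assert(err == 0);
    assert(retval == XDP_PASS);
}
```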
- Use Cases
- Sysdig is developing troubleshooting tools as OSS with eBPF
- Flowmill is developing data center network monitoring tools. CPU overhead is about 0.1%–0.25%
# SHELL
# Compile BPF program to ELF binary
clang -O2 -target bpf -c bpf_program.c -o bpf_program.o
# Mount virtual FS for BPF
mount -t bpf /sys/fs/bpf /sys/fs/bpf
# Check USDT
tplist -l ./hello_usdt
# Get stack trace and create flamegraph
./profiler.py `pgrep -nx go` > /tmp/profile.out
./flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg
# kubectl-trace execution example
kubectl trace run pod/pod_identifier -n application_name -e <<PROGRAM
uretprobe:/proc/$container_pid/exe:"main.main" {
printf("exit: %d\n", retval)
}
PROGRAM
# Load XDP BPF program
# The xdp flag tries native mode first and falls back to generic mode
# A specific mode can be forced with xdpdrv / xdpgeneric / xdpoffload instead of xdp
ip link set dev eth0 xdp obj program.o sec mysection
// C
// Create BPF map with bpf syscall
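// my_map is a union bpf_attr describing the map (type, key/value sizes, max_entries);
// bpf() here is presumably a thin wrapper around syscall(__NR_bpf, ...)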
int fd = bpf(BPF_MAP_CREATE, &my_map, sizeof(my_map));
// Get bpf map list
int next_key, lookup_key = -1;
while (bpf_map_get_next_key(map_data[0].fd, &lookup_key, &next_key) == 0) {
printf("The next key in the map: %d\n", next_key);
lookup_key = next_key;
}
# BCC (python)
from bcc import BPF
# kprobes example
bpf_source = """
int do_sys_execve(struct pt_regs *ctx, void *filename, void *argv, void *envp) {
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));
bpf_trace_printk("executing program: %s", comm);
return 0;
}
"""
bpf = BPF(text = bpf_source)
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "do_sys_execve")
bpf.trace_print()
# tracepoint example
bpf_source = """
int trace_bpf_prog_load(void *ctx) {
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));
bpf_trace_printk("%s is loading a BPF program", comm);
return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_tracepoint(tp = "bpf:bpf_prog_load",
fn_name = "trace_bpf_prog_load")
bpf.trace_print()
# uprobes example
bpf_source = """
int trace_go_main(struct pt_regs *ctx) {
u64 pid = bpf_get_current_pid_tgid();
bpf_trace_printk("New hello-bpf process running with PID: %d", pid);
return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf",
sym = "main.main", fn_name = "trace_go_main")
bpf.trace_print()
# USDT example
from bcc import BPF, USDT
bpf_source = """
#include <uapi/linux/ptrace.h>
int trace_binary_exec(struct pt_regs *ctx) {
u64 pid = bpf_get_current_pid_tgid();
bpf_trace_printk("New hello_usdt process running with PID: %d", pid);
return 0;
}
"""
usdt = USDT(path = "./hello_usdt")
usdt.enable_probe(probe = "probe-main", fn_name = "trace_binary_exec")
bpf = BPF(text = bpf_source, usdt = usdt)
bpf.trace_print()
# BPFTrace DSL
# Execute with bpftrace /tmp/example.bt
BEGIN
{
printf("starting BPFTrace program\n")
}
kprobe:do_sys_open
{
printf("opening file descriptor: %s\n", str(arg1))
@opens[str(arg1)] = count()
}
END
{
printf("exiting BPFTrace program\n")
}