I read Linux Observability with BPF, so here are some notes on it. Brendan Gregg’s BPF book also comes out this month, so I’d like to read that one too.

  • Using BPF allows hooking kernel events and safely executing code
    • The BPF verifier checks that the code cannot crash or destabilize the system
    • Unlike kernel modules, BPF programs don’t require kernel recompilation
    • After BPF code is verified, BPF bytecode is JIT-compiled to machine instructions
    • BPF programs are loaded into the BPF VM by the bpf syscall
  • Alexei Starovoitov introduced eBPF in early 2014
    • Classic BPF had only two 32-bit registers, while eBPF provides ten 64-bit registers (R0–R9) plus a read-only frame pointer (R10)
    • In June 2014, eBPF was also extended to user space
  • BPF program types: Can be broadly classified into tracing and networking
    • Socket Filter: First program type to enter the kernel. For observation only
    • Kprobe: Run a BPF program as a kprobe handler. Entry and exit correspond to the SEC("kprobe/sys_execve") and SEC("kretprobe/sys_execve") section names respectively
    • Tracepoint: Run BPF program on predefined tracepoints on the kernel side. Can check list at /sys/kernel/debug/tracing/events
    • XDP: Run BPF program at an early stage when packets arrive
    • Perf: Run BPF program for perf events. BPF program executes every time perf outputs data
    • Cgroup Socket: Attached to all processes in that cgroup
    • Socket Option: Facebook uses this to set shorter RTOs (retransmission timeouts) for connections within its data centers
    • Socket Map: Used when implementing load balancers in BPF. Cilium and Facebook Katran use this
  • BPF Verifier
    • In the past, vulnerabilities such as CVE-2017-16995 made it possible to bypass the verifier and read or write kernel memory
    • Verifies, via a depth-first search of the program’s control-flow graph, that the program terminates and that no dangerous code paths exist
    • All loops are rejected so that infinite loops are impossible. Support for bounded loops was still only a proposal at the time of writing
    • Instruction count is limited to 4096
    • Verification results can be inspected by setting the log_level, log_buf, and log_size attributes of the bpf syscall
  • BPF programs can call other BPF programs via tail calls
    • A tail call replaces the current program entirely, so all context is lost; some mechanism (typically a BPF map) is needed to share information between the programs
  • BPF Maps
    • Can create BPF maps directly with the bpf syscall
    • The bpf_create_map helper function is easier to use
    • Note that the signature of bpf_map_update_elem differs between kernel side bpf/bpf_helpers.h and user side tools/lib/bpf/bpf.h
    • Useful to convert error numbers to strings with strerror(errno)
    • Note that bpf_map_get_next_key, unlike the other map helpers, can only be used from user space
    • Array, hash, and cgroup storage maps support spin locks (bpf_spin_lock), so concurrent access can be made safe
    • array map pre-allocates memory for the number of elements and is initialized with zeros. Used for global variable allocation?
    • There are also LRU hash maps and LPM (Longest Prefix Match) trie maps
    • From Linux 4.4, two new bpf syscall commands (BPF_OBJ_PIN and BPF_OBJ_GET) allow maps and BPF programs to be pinned to and retrieved from a virtual FS
  • Tracing
    • kprobes/kretprobes: Not a stable ABI, so need to check probe target function signature in advance. May change with each Linux version
    • BPF program context changes depending on the program
    • Tracepoint API is compatible across Linux versions. Can check at /sys/kernel/debug/tracing/events
    • USDTs (user statically defined tracepoints): Static trace points in user programs
  • BPFTool
    • Actively developed, so compile it from the Linux kernel source tree
    • Can check what’s available with bpftool feature
    • If JIT is disabled, can enable with echo 1 > /proc/sys/net/core/bpf_jit_enable
    • Can check list of BPF programs and maps with bpftool map show or bpftool prog show
    • Recommended to keep bpftool batch files under version control
  • BPFTrace
    • High-level DSL for BPF
    • Like awk with BEGIN, END, and actual tracing part structure
    • Convenient because it automatically creates BPF maps
  • kubectl-trace
    • Execute BPFTrace programs as kubernetes jobs
  • eBPF Exporter
    • Forward BPF tracing results to Prometheus. Used at Cloudflare
  • XDP
    • xdp_buff passed as context is a simplified sk_buff
    • Has 3 operation modes
    • Native XDP: The BPF program runs in the NIC driver’s receive path, before the kernel allocates an sk_buff. Check whether a NIC is supported with git grep -l XDP_SETUP_PROG drivers/
    • Offloaded XDP: Check supported NIC with git grep -l XDP_SETUP_PROG_HW drivers/. Offload BPF program to NIC
    • Generic XDP: Implemented in the core network stack after sk_buff allocation; slower, and mainly for testing or for drivers without native support
    • XDP programs can be unit tested via the BPF_PROG_TEST_RUN command
  • Use Cases
    • Sysdig is developing troubleshooting tools as OSS with eBPF
    • Flowmill is developing data center network monitoring tools. CPU overhead is about 0.1%~0.25%
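The verifier’s termination check mentioned above is essentially a depth-first search over the program’s control-flow graph: a back edge (a jump to an instruction still on the current DFS path) would mean a possible loop, so the program is rejected. A toy model in Python (the graphs are invented for illustration, not real BPF bytecode):

```python
# Toy model of the verifier's termination check: DFS over a control-flow
# graph, rejecting any back edge (a jump to a node still on the DFS path).
def has_loop(cfg, entry=0):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on path / done
    color = {n: WHITE for n in cfg}

    def dfs(node):
        color[node] = GRAY
        for succ in cfg[node]:
            if color[succ] == GRAY:        # back edge -> loop
                return True
            if color[succ] == WHITE and dfs(succ):
                return True
        color[node] = BLACK
        return False

    return dfs(entry)

# Branch that rejoins: no back edge, so this would pass the check.
acyclic = {0: [1, 2], 1: [3], 2: [3], 3: []}
# Instruction 3 jumps back to 1: rejected.
looping = {0: [1], 1: [2], 2: [3], 3: [1]}

print(has_loop(acyclic))  # False
print(has_loop(looping))  # True
```

The real verifier does much more (it simulates every path and tracks register state), but the loop rejection reduces to this kind of cycle detection.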
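Because a tail call replaces the running program and never returns to the caller, state has to be handed over through a BPF map. A rough userspace analogy (all names are invented; plain dicts stand in for a BPF map and a BPF_MAP_TYPE_PROG_ARRAY):

```python
# Userspace analogy of BPF tail calls: control transfers to another program
# and never comes back, so shared state must go through a map.
shared_map = {}   # stands in for a BPF map
prog_array = {}   # stands in for BPF_MAP_TYPE_PROG_ARRAY

def tail_call(ctx, index):
    # Like bpf_tail_call(): jump to the program at this index; on success
    # execution never resumes in the caller's remaining code.
    return prog_array[index](ctx)

def parser(ctx):
    shared_map["pkt_len"] = len(ctx)  # leave data for the next program
    return tail_call(ctx, 1)          # the caller's locals are gone after this

def classifier(ctx):
    # The only way to see what parser computed is through the map.
    return "big" if shared_map["pkt_len"] > 4 else "small"

prog_array[1] = classifier
print(parser(b"\x00" * 10))  # big
```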
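The lookup rule of the LPM trie map can be sketched in plain Python: among all stored prefixes that contain the key, the longest one wins, which is exactly what routing-table lookups need. The routes below are made up for illustration:

```python
import ipaddress

# Sketch of BPF_MAP_TYPE_LPM_TRIE lookup semantics: the longest
# prefix containing the key wins.
routes = {
    ipaddress.ip_network("10.0.0.0/8"): "default-pop",
    ipaddress.ip_network("10.1.0.0/16"): "rack-switch",
    ipaddress.ip_network("10.1.2.0/24"): "local-lb",
}

def lpm_lookup(addr):
    addr = ipaddress.ip_address(addr)
    best = None
    for net, value in routes.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None

print(lpm_lookup("10.1.2.3"))   # local-lb (the /24 beats the /16 and the /8)
print(lpm_lookup("10.9.9.9"))   # default-pop
```

The kernel map implements this as a trie walk rather than a linear scan, but the matching rule is the same.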
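The fixed packet-in/verdict-out shape of XDP programs is what makes them easy to unit test: feed in raw bytes and assert on the returned action. A userspace model of that logic (the drop-IPv6 policy is just an example; the constants mirror the kernel’s xdp_action values):

```python
import struct

# xdp_action values from the kernel UAPI (include/uapi/linux/bpf.h)
XDP_DROP, XDP_PASS = 1, 2

def xdp_filter(packet: bytes) -> int:
    """Model of an XDP program: inspect the Ethernet header, return a verdict.
    Example policy: drop IPv6 frames, pass everything else."""
    if len(packet) < 14:                     # truncated Ethernet header
        return XDP_DROP
    (ethertype,) = struct.unpack("!H", packet[12:14])
    return XDP_DROP if ethertype == 0x86DD else XDP_PASS

ipv4_frame = b"\x00" * 12 + b"\x08\x00" + b"payload"
ipv6_frame = b"\x00" * 12 + b"\x86\xdd" + b"payload"
print(xdp_filter(ipv4_frame) == XDP_PASS)  # True
print(xdp_filter(ipv6_frame) == XDP_DROP)  # True
```

With a real XDP object, BPF_PROG_TEST_RUN lets you do the same thing against the actual bytecode: hand in packet bytes and check the returned action.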
# SHELL
# Compile BPF program to ELF binary
clang -O2 -target bpf -c bpf_program.c -o bpf_program.o

# Mount virtual FS for BPF
mount -t bpf /sys/fs/bpf /sys/fs/bpf

# Check USDT
tplist -l ./hello_usdt

# Get stack trace and create flamegraph
./profiler.py `pgrep -nx go` > /tmp/profile.out
./flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg

# kubectl-trace execution example ($container_pid is substituted by
# kubectl-trace, so the heredoc delimiter is quoted to stop shell expansion)
kubectl trace run pod/pod_identifier -n application_name -e <<'PROGRAM'
  uretprobe:/proc/$container_pid/exe:"main.main" {
    printf("exit: %d\n", retval)
  }
PROGRAM

# Load XDP BPF program
# If native mode fails, run in generic mode. Can force
ip link set dev eth0 xdp obj program.o sec mysection
// C
// Create a BPF map with the bpf syscall; my_map is a populated union bpf_attr.
// glibc provides no bpf() wrapper, so invoke it via syscall(2)
int fd = syscall(__NR_bpf, BPF_MAP_CREATE, &my_map, sizeof(my_map));

// Iterate over a map's keys (user space only); start from a key not in the map
int next_key, lookup_key = -1;
while (bpf_map_get_next_key(map_data[0].fd, &lookup_key, &next_key) == 0) {
  printf("The next key in the map: %d\n", next_key);
  lookup_key = next_key;
}
# BCC (python)
from bcc import BPF

# kprobes example
bpf_source = """
int do_sys_execve(struct pt_regs *ctx, void *filename, void *argv, void *envp) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("executing program: %s", comm);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "do_sys_execve")
bpf.trace_print()

# tracepoint example
bpf_source = """
int trace_bpf_prog_load(void *ctx) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("%s is loading a BPF program", comm);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_tracepoint(tp = "bpf:bpf_prog_load",
fn_name = "trace_bpf_prog_load")
bpf.trace_print()

# uprobes example
bpf_source = """
int trace_go_main(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello-bpf process running with PID: %d", pid);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf",
sym = "main.main", fn_name = "trace_go_main")
bpf.trace_print()

# USDT example
from bcc import BPF, USDT
bpf_source = """
#include <uapi/linux/ptrace.h>
int trace_binary_exec(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello_usdt process running with PID: %d", pid);
  return 0;
}
"""
usdt = USDT(path = "./hello_usdt")
usdt.enable_probe(probe = "probe-main", fn_name = "trace_binary_exec")
bpf = BPF(text = bpf_source, usdt = usdt)
bpf.trace_print()
# BPFTrace DSL
# Execute with bpftrace /tmp/example.bt
BEGIN
{
  printf("starting BPFTrace program\n")
}
kprobe:do_sys_open
{
  printf("opening file descriptor: %s\n", str(arg1))
  @opens[str(arg1)] = count()
}
END
{
  printf("exiting BPFTrace program\n")
}