I read Linux Observability with BPF, so here are some notes on it. Brendan Gregg’s BPF book also comes out this month, so I’d like to read that one too.

  • Using BPF allows hooking kernel events and safely executing code
    • The BPF verifier checks that the code cannot crash or destabilize the system
    • Unlike kernel modules, BPF programs don’t require kernel recompilation
    • After BPF code is verified, BPF bytecode is JIT-compiled to machine instructions
    • BPF programs are loaded into the BPF VM by the bpf syscall
  • Alexei Starovoitov introduced eBPF in early 2014
    • Classic BPF had only two 32-bit registers, while eBPF provides ten 64-bit registers (R0–R9) plus a read-only frame pointer (R10)
    • In June 2014, eBPF was also extended to user space
  • BPF program types: Can be broadly classified into tracing and networking
    • Socket Filter: First program type to enter the kernel. For observation only
    • Kprobe: Run a BPF program as a kprobe handler. Entry and exit correspond to the SEC("kprobe/sys_execve") and SEC("kretprobe/sys_execve") section names respectively
    • Tracepoint: Run BPF program on predefined tracepoints on the kernel side. Can check list at /sys/kernel/debug/tracing/events
    • XDP: Run BPF program at an early stage when packets arrive
    • Perf: Run BPF program for perf events. BPF program executes every time perf outputs data
    • Cgroup Socket: Attached to all processes in that cgroup
    • Socket Option: Facebook uses this to set shorter RTOs (retransmission timeouts) for connections within its data centers
    • Socket Map: Used when implementing load balancers in BPF. Cilium and Facebook Katran use this
  • BPF Verifier
    • In the past, vulnerabilities such as CVE-2017-16995 made it possible to bypass the verifier and read or write kernel memory
    • Verifies, via a depth-first search of the program’s control-flow graph, that the program terminates and that no dangerous code paths exist
    • All loops are rejected so that infinite loops are impossible. Support for bounded loops was still only a proposal at the time of writing
    • Instruction count is limited to 4096
    • Verification results can be inspected by setting the log_level, log_buf, and log_size attributes of the bpf syscall
  • BPF programs can call other BPF programs via tail calls
    • A tail call replaces the current program entirely, so all context is lost; some mechanism (typically a BPF map) is needed to share information between the programs
  • BPF Maps
    • Can create BPF maps directly with the bpf syscall
    • The bpf_create_map helper function is easier to use
    • Note that the signature of bpf_map_update_elem differs between kernel side bpf/bpf_helpers.h and user side tools/lib/bpf/bpf.h
    • Useful to convert error numbers to strings with strerror(errno)
    • Note that bpf_map_get_next_key, unlike the other map helpers, can only be used from user space
    • Array, hash, and cgroup storage maps support spin locks (bpf_spin_lock), so concurrent access can be made safe
    • array map pre-allocates memory for the number of elements and is initialized with zeros. Used for global variable allocation?
    • There are also LRU hash maps and LPM (Longest Prefix Match) trie maps
    • From Linux 4.4, two new bpf syscall commands (BPF_OBJ_PIN and BPF_OBJ_GET) allow maps and BPF programs to be pinned to and retrieved from a virtual FS
  • Tracing
    • kprobes/kretprobes: Not a stable ABI, so need to check probe target function signature in advance. May change with each Linux version
    • BPF program context changes depending on the program
    • Tracepoint API is compatible across Linux versions. Can check at /sys/kernel/debug/tracing/events
    • USDTs (user statically defined tracepoints): Static trace points in user programs
  • BPFTool
    • Actively developed, so compile it from the Linux kernel source tree
    • Can check what’s available with bpftool feature
    • If JIT is disabled, can enable with echo 1 > /proc/sys/net/core/bpf_jit_enable
    • Can check list of BPF programs and maps with bpftool map show or bpftool prog show
    • Recommended to keep bpftool batch files under version control
  • BPFTrace
    • High-level DSL for BPF
    • Like awk with BEGIN, END, and actual tracing part structure
    • Convenient because it automatically creates BPF maps
  • kubectl-trace
    • Execute BPFTrace programs as kubernetes jobs
  • eBPF Exporter
    • Forward BPF tracing results to Prometheus. Used at Cloudflare
  • XDP
    • xdp_buff passed as context is a simplified sk_buff
    • Has 3 operation modes
    • Native XDP: The BPF program runs in the NIC driver’s receive path, before the kernel allocates an sk_buff. Check whether a NIC is supported with git grep -l XDP_SETUP_PROG drivers/
    • Offloaded XDP: Check supported NIC with git grep -l XDP_SETUP_PROG_HW drivers/. Offload BPF program to NIC
    • Generic XDP: Implemented in the core network stack after sk_buff allocation; slower, and mainly for testing or for drivers without native support
    • XDP programs can be unit tested via the BPF_PROG_TEST_RUN command
  • Use Cases
    • Sysdig is developing troubleshooting tools as OSS with eBPF
    • Flowmill is developing data center network monitoring tools. CPU overhead is about 0.1%~0.25%
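The verifier’s termination check mentioned above is essentially a depth-first search over the program’s control-flow graph: a back edge (a jump to an instruction still on the current DFS path) would mean a possible loop, so the program is rejected. A toy model in Python (the graphs are invented for illustration, not real BPF bytecode):

```python
# Toy model of the verifier's termination check: DFS over a control-flow
# graph, rejecting any back edge (a jump to a node still on the DFS path).
def has_loop(cfg, entry=0):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on path / done
    color = {n: WHITE for n in cfg}

    def dfs(node):
        color[node] = GRAY
        for succ in cfg[node]:
            if color[succ] == GRAY:        # back edge -> loop
                return True
            if color[succ] == WHITE and dfs(succ):
                return True
        color[node] = BLACK
        return False

    return dfs(entry)

# Branch that rejoins: no back edge, so this would pass the check.
acyclic = {0: [1, 2], 1: [3], 2: [3], 3: []}
# Instruction 3 jumps back to 1: rejected.
looping = {0: [1], 1: [2], 2: [3], 3: [1]}

print(has_loop(acyclic))  # False
print(has_loop(looping))  # True
```

The real verifier does much more (it simulates every path and tracks register state), but the loop rejection reduces to this kind of cycle detection.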
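Because a tail call replaces the running program and never returns to the caller, state has to be handed over through a BPF map. A rough userspace analogy (all names are invented; plain dicts stand in for a BPF map and a BPF_MAP_TYPE_PROG_ARRAY):

```python
# Userspace analogy of BPF tail calls: control transfers to another program
# and never comes back, so shared state must go through a map.
shared_map = {}   # stands in for a BPF map
prog_array = {}   # stands in for BPF_MAP_TYPE_PROG_ARRAY

def tail_call(ctx, index):
    # Like bpf_tail_call(): jump to the program at this index; on success
    # execution never resumes in the caller's remaining code.
    return prog_array[index](ctx)

def parser(ctx):
    shared_map["pkt_len"] = len(ctx)  # leave data for the next program
    return tail_call(ctx, 1)          # the caller's locals are gone after this

def classifier(ctx):
    # The only way to see what parser computed is through the map.
    return "big" if shared_map["pkt_len"] > 4 else "small"

prog_array[1] = classifier
print(parser(b"\x00" * 10))  # big
```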
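The lookup rule of the LPM trie map can be sketched in plain Python: among all stored prefixes that contain the key, the longest one wins, which is exactly what routing-table lookups need. The routes below are made up for illustration:

```python
import ipaddress

# Sketch of BPF_MAP_TYPE_LPM_TRIE lookup semantics: the longest
# prefix containing the key wins.
routes = {
    ipaddress.ip_network("10.0.0.0/8"): "default-pop",
    ipaddress.ip_network("10.1.0.0/16"): "rack-switch",
    ipaddress.ip_network("10.1.2.0/24"): "local-lb",
}

def lpm_lookup(addr):
    addr = ipaddress.ip_address(addr)
    best = None
    for net, value in routes.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None

print(lpm_lookup("10.1.2.3"))   # local-lb (the /24 beats the /16 and the /8)
print(lpm_lookup("10.9.9.9"))   # default-pop
```

The kernel map implements this as a trie walk rather than a linear scan, but the matching rule is the same.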
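The fixed packet-in/verdict-out shape of XDP programs is what makes them easy to unit test: feed in raw bytes and assert on the returned action. A userspace model of that logic (the drop-IPv6 policy is just an example; the constants mirror the kernel’s xdp_action values):

```python
import struct

# xdp_action values from the kernel UAPI (include/uapi/linux/bpf.h)
XDP_DROP, XDP_PASS = 1, 2

def xdp_filter(packet: bytes) -> int:
    """Model of an XDP program: inspect the Ethernet header, return a verdict.
    Example policy: drop IPv6 frames, pass everything else."""
    if len(packet) < 14:                     # truncated Ethernet header
        return XDP_DROP
    (ethertype,) = struct.unpack("!H", packet[12:14])
    return XDP_DROP if ethertype == 0x86DD else XDP_PASS

ipv4_frame = b"\x00" * 12 + b"\x08\x00" + b"payload"
ipv6_frame = b"\x00" * 12 + b"\x86\xdd" + b"payload"
print(xdp_filter(ipv4_frame) == XDP_PASS)  # True
print(xdp_filter(ipv6_frame) == XDP_DROP)  # True
```

With a real XDP object, BPF_PROG_TEST_RUN lets you do the same thing against the actual bytecode: hand in packet bytes and check the returned action.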
# SHELL
# Compile BPF program to ELF binary
clang -O2 -target bpf -c bpf_program.c -o bpf_program.o

# Mount virtual FS for BPF
mount -t bpf /sys/fs/bpf /sys/fs/bpf

# Check USDT
tplist -l ./hello_usdt

# Get stack trace and create flamegraph
./profiler.py `pgrep -nx go` > /tmp/profile.out
./flamegraph.pl /tmp/profile.out > /tmp/flamegraph.svg

# kubectl-trace execution example ($container_pid is substituted by
# kubectl-trace, so the heredoc delimiter is quoted to stop shell expansion)
kubectl trace run pod/pod_identifier -n application_name -e <<'PROGRAM'
  uretprobe:/proc/$container_pid/exe:"main.main" {
    printf("exit: %d\n", retval)
  }
PROGRAM

# Load XDP BPF program
# If native mode fails, run in generic mode. Can force
ip link set dev eth0 xdp obj program.o sec mysection
// C
// Create a BPF map with the bpf syscall; my_map is a populated union bpf_attr.
// glibc provides no bpf() wrapper, so invoke it via syscall(2)
int fd = syscall(__NR_bpf, BPF_MAP_CREATE, &my_map, sizeof(my_map));

// Iterate over a map's keys (user space only); start from a key not in the map
int next_key, lookup_key = -1;
while (bpf_map_get_next_key(map_data[0].fd, &lookup_key, &next_key) == 0) {
  printf("The next key in the map: %d\n", next_key);
  lookup_key = next_key;
}
# BCC (python)
from bcc import BPF

# kprobes example
bpf_source = """
int do_sys_execve(struct pt_regs *ctx, void *filename, void *argv, void *envp) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("executing program: %s", comm);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
execve_function = bpf.get_syscall_fnname("execve")
bpf.attach_kprobe(event = execve_function, fn_name = "do_sys_execve")
bpf.trace_print()

# tracepoint example
bpf_source = """
int trace_bpf_prog_load(void *ctx) {
  char comm[16];
  bpf_get_current_comm(&comm, sizeof(comm));
  bpf_trace_printk("%s is loading a BPF program", comm);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_tracepoint(tp = "bpf:bpf_prog_load",
fn_name = "trace_bpf_prog_load")
bpf.trace_print()

# uprobes example
bpf_source = """
int trace_go_main(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello-bpf process running with PID: %d", pid);
  return 0;
}
"""
bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "hello-bpf",
sym = "main.main", fn_name = "trace_go_main")
bpf.trace_print()

# USDT example
from bcc import BPF, USDT
bpf_source = """
#include <uapi/linux/ptrace.h>
int trace_binary_exec(struct pt_regs *ctx) {
  u64 pid = bpf_get_current_pid_tgid();
  bpf_trace_printk("New hello_usdt process running with PID: %d", pid);
  return 0;
}
"""
usdt = USDT(path = "./hello_usdt")
usdt.enable_probe(probe = "probe-main", fn_name = "trace_binary_exec")
bpf = BPF(text = bpf_source, usdt = usdt)
bpf.trace_print()
# BPFTrace DSL
# Execute with bpftrace /tmp/example.bt
BEGIN
{
  printf("starting BPFTrace program\n")
}
kprobe:do_sys_open
{
  printf("opening file descriptor: %s\n", str(arg1))
  @opens[str(arg1)] = count()
}
END
{
  printf("exiting BPFTrace program\n")
}