This blog has mostly covered relatively new networking technologies, but it is worth revisiting the fundamentals. The Linux kernel will be in use for a long time to come, and even an imperfect grasp of the basics is valuable. This book runs to over 1,000 pages across Parts 1-7, so staying motivated to read it all in one go is hard. In this article, I’ll summarize what I’ve read so far. I kept the kernel 2.6.39 source code at hand while reading. For how to build it and so on, see my previous article 1.
Part 1
The two most important data structures for networking are struct sk_buff and struct net_device, so it’s essential to understand them first. struct sk_buff (setting aside fragmentation) corresponds to one packet, and an instance is usually named skb. skb->data points to the header of the layer that is currently processing the packet. For example, during L2 processing, skb->data points to the beginning of the L2 header. As processing moves from layer to layer, this pointer moves. The buffer also has spare space before and after the actual data (headroom and tailroom).
+------------+                           skb->mac  skb->nh
|            |                              |         |
| head ----------> +------------+           |         |
|            |     |  headroom  |           v         v
| data ----------> +------------+           +---------+---------+---------+---
|            |     |            |           |   L2    |   L3    |   L4    |
|            |     |    Data    |           | header  | header  | header  | ...
|            |     |            |           +---------+---------+---------+---
| tail ----------> +------------+           ^         ^
|            |     |  tailroom  |           |         |
| end  ----------> +------------+           +---------+
|            |                               skb->data
+------------+
struct sk_buff
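To make the pointer movement concrete, here is a minimal user-space sketch of the four buffer pointers. It is not kernel code: struct toy_skb and the toy_* helpers are hypothetical names that only imitate what skb_reserve, skb_put, skb_push and skb_pull do to head, data, tail and end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for struct sk_buff: only the buffer pointers and len. */
struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated buffer */
    size_t len;           /* tail - data */
};

static struct toy_skb *toy_alloc_skb(size_t size)
{
    struct toy_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;  /* empty: no headroom, no data */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

static void toy_free_skb(struct toy_skb *skb)
{
    free(skb->head);
    free(skb);
}

static size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

static size_t toy_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Like skb_reserve(): open up headroom in a still-empty buffer. */
static void toy_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= toy_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put(): grow the data area at the tail (payload). */
static unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *start = skb->tail;

    assert(n <= toy_tailroom(skb));
    skb->tail += n;
    skb->len += n;
    return start;
}

/* Like skb_push(): prepend a header on transmit, eating headroom. */
static unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_headroom(skb));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull(): strip a header on receive, moving data forward. */
static unsigned char *toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

On transmit, a sender would typically reserve room for all headers first, put the payload, then push the L3 and L2 headers in turn; on receive, each layer pulls its own header before handing the packet up.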
Let’s look at the members of this structure. users is a reference counter, manipulated with skb_get and kfree_skb. There are also per-layer pointers such as mac_header. cb is short for control buffer, a 48-byte region that each layer can use privately (without being aware of other layers). struct sk_buff instances are kept on a doubly-linked list, and struct sk_buff_head represents the list as a whole. Let’s use a debugger to examine the contents. By setting a breakpoint on a function responsible for transmission, we can see the contents of the IP header held inside struct sk_buff.
(gdb) list dev_hard_start_xmit
2086 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
2087 struct netdev_queue *txq)
2088 {
2089 const struct net_device_ops *ops = dev->netdev_ops;
2090 int rc = NETDEV_TX_OK;
2091
2092 if (likely(!skb->next)) {
(gdb) break dev_hard_start_xmit
(gdb) continue
Breakpoint 2 at 0xffffffff8140e036: file net/core/dev.c, line 2092.
(gdb) print *((struct iphdr *)(skb->head + skb->network_header))
$3 = {ihl = 5 '\005', version = 4 '\004', tos = 0 '\000', tot_len = 18433,
id = 0, frag_off = 0, ttl = 64 '@', protocol = 17 '\021', check = 42617,
saddr = 0, daddr = 4294967295}
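The values gdb prints above are raw integers read on a little-endian x86-64 host, while the IP header stores multi-byte fields in network byte order. The following sketch decodes them; swap16 and format_ipv4_le are hypothetical helper names, and the little-endian host is an assumption taken from the x86-64 backtraces in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Swap the two bytes of a 16-bit value. On a little-endian host this
 * is what ntohs() does to a field stored in network byte order. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Format an IPv4 address that was printed as a raw 32-bit integer on a
 * little-endian host: the lowest byte is the first octet on the wire. */
static void format_ipv4_le(uint32_t raw, char *buf, size_t len)
{
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)(raw & 0xff),
             (unsigned)((raw >> 8) & 0xff),
             (unsigned)((raw >> 16) & 0xff),
             (unsigned)((raw >> 24) & 0xff));
}
```

With these, tot_len = 18433 (0x4801) decodes to 0x0148 = 328 bytes, saddr = 0 is 0.0.0.0, and daddr = 4294967295 is 255.255.255.255. Together with protocol = 17 (UDP), this looks like a broadcast such as a DHCP request, although the session itself does not say which packet it was.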
Next, let’s look at struct net_device. One instance is created for each network interface, whether virtual or physical. The instances are managed in a list headed by dev_base (in 2.6.39 this is the per-namespace dev_base_head, as the session below shows). The structure includes the device name, IRQ number, and various flags (flags, gflags, priv_flags). If it’s a virtual interface, you can find the original device by following the master field. Interfaces are often looked up by name or index, so there are hash tables for both, dev_name_head and dev_index_head. As before, let’s set a breakpoint on the transmission function and examine the contents of struct net_device.
(gdb) print init_net->dev_base_head
$8 = {next = 0xffff88001d89f080, prev = 0xffff88001ce5f080}
(gdb) print init_net->dev_name_head
$9 = (struct hlist_head *) 0xffff88001d878800
(gdb) print init_net->dev_index_head
$10 = (struct hlist_head *) 0xffff88001d89f800
(gdb) print *dev
$11 = {name = "eth0", '\000' <repeats 11 times>,
pm_qos_req = {list = {prio = 0, prio_list = { ...
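As a rough illustration of the index hash table, here is a user-space sketch of a lookup by ifindex. The names (toy_net, toy_dev, toy_dev_get_by_index) are hypothetical, the 256-bucket size and the masking of ifindex are assumptions modelled on my reading of 2.6.39, and the kernel's hlist, locking and reference counting are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_HASHBITS    8                     /* assumed bucket count: 256 */
#define TOY_HASHENTRIES (1 << TOY_HASHBITS)

struct toy_dev {
    char name[16];                 /* interface name, IFNAMSIZ-sized */
    int ifindex;
    struct toy_dev *index_next;    /* next device in the same bucket */
};

/* Stand-in for the per-namespace table behind dev_index_head. */
struct toy_net {
    struct toy_dev *dev_index_head[TOY_HASHENTRIES];
};

/* Pick the bucket by masking the low bits of the interface index. */
static struct toy_dev **toy_index_bucket(struct toy_net *net, int ifindex)
{
    return &net->dev_index_head[ifindex & (TOY_HASHENTRIES - 1)];
}

/* Add a device at the front of its bucket. */
static void toy_list_netdevice(struct toy_net *net, struct toy_dev *dev)
{
    struct toy_dev **bucket = toy_index_bucket(net, dev->ifindex);

    dev->index_next = *bucket;
    *bucket = dev;
}

/* Walk one bucket only, instead of the whole device list. */
static struct toy_dev *toy_dev_get_by_index(struct toy_net *net, int ifindex)
{
    struct toy_dev *dev;

    for (dev = *toy_index_bucket(net, ifindex); dev; dev = dev->index_next)
        if (dev->ifindex == ifindex)
            return dev;
    return NULL;
}
```

The name table works the same way in principle, except that the bucket is chosen by hashing the name string rather than masking an integer.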
Changing topics, let’s move to the interface between user and kernel space. Historically, there are many interfaces here. To summarize, it’s something like this:
| Name | Contents |
|---|---|
| procfs(/proc) | Usually read-only. Network-related files are gathered in /proc/net. Can be registered with proc_net_fops_create. |
| sysctl(/proc/sys) | Can be used from the sysctl command. Corresponds to actual kernel variables. Can be registered with register_sysctl_table. |
| sysfs(/sys) | A reorganization of procfs and sysctl contents in kernel 2.6. |
| ioctl | Used by ifconfig, ethtool, mii-tool. |
| netlink | A recent mechanism provided by the socket API. Used by iproute2. The only one that can send notifications from kernel to user. |
When ioctl is issued from the ifconfig command, depending on the request content such as SIOCGIFADDR, devinet_ioctl() is called via sock_ioctl(). By setting up a debugger, we can confirm it’s actually being called. By the way, the functions called from the ethtool command seem to be callbacks within struct ethtool_ops defined in the driver code.
(gdb) list devinet_ioctl
685 int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
686 {
687 struct ifreq ifr;
688 struct sockaddr_in sin_orig;
689 struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
690 struct in_device *in_dev;
(gdb) break devinet_ioctl
Breakpoint 3 at 0xffffffff81475264: file net/ipv4/devinet.c, line 702.
(gdb) continue
(gdb) bt
#0 devinet_ioctl (net=0xffffffff81f8b040 <init_net>, cmd=35093, arg=0x7fff25324a10) at net/ipv4/devinet.c:702
#1 0xffffffff814769d8 in inet_ioctl (sock=<optimized out>, cmd=<optimized out>, arg=<optimized out>)
at net/ipv4/af_inet.c:870
#2 0xffffffff813f7cc0 in sock_do_ioctl (net=0xffffffff81f8b040 <init_net>, sock=<optimized out>, cmd=35093,
arg=140733817440784) at net/socket.c:945
#3 0xffffffff813f8119 in sock_ioctl (file=<optimized out>, cmd=35093, arg=<optimized out>) at net/socket.c:1030
#4 0xffffffff8116cd8c in vfs_ioctl (arg=<optimized out>, cmd=<optimized out>, filp=0xffff88001dbea200)
at fs/ioctl.c:43
#5 do_vfs_ioctl (filp=0xffff88001dbea200, fd=3, cmd=<optimized out>, arg=<optimized out>) at fs/ioctl.c:598
#6 0xffffffff8116d0e1 in sys_ioctl (fd=3, cmd=35093, arg=140733817440784) at fs/ioctl.c:618
#7 0xffffffff814dd5c2 in system_call () at arch/x86/kernel/entry_64.S:487
#8 0x000000000049e417 in ?? ()
#9 0x0000000000000000 in ?? ()
(gdb) p /x cmd
$14 = 0x8915 # corresponds to SIOCGIFADDR 0x8915
(gdb) p ifr.ifr_ifrn.ifrn_name
$15 = "\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000"
Part 2
Inside the kernel, there are several interdependent subsystems, so when an event occurs or is detected in one subsystem, other subsystems may need to be notified. This is done through the notification chain mechanism. For example, when a link goes down, the affected entries are deleted from the routing table. The chains are lists of function pointers with names like xxx_chain, xxx_notifier_chain, and xxx_notifier_list, and the mechanism is simple: you just add your callback to the list. Function pointers are registered with notifier_chain_register. In practice, there are often wrappers such as register_inetaddr_notifier and register_netdevice_notifier. Invocation is done with notifier_call_chain. Note that the registered functions run in the context of whoever calls notifier_call_chain. Network-related chains such as inetaddr_chain and netdev_chain may have functions from other subsystems registered, and conversely, network-related functions may be registered in reboot_notifier_list.
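The mechanism is small enough to model in user space. The sketch below is a simplified imitation, not the kernel implementation: all toy_* names are hypothetical, locking is omitted, and the return codes are reduced to two values. It keeps the two properties described above: callbacks are kept in priority order, and they run synchronously in the caller's context.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOTIFY_DONE 0   /* keep walking the chain */
#define TOY_NOTIFY_STOP 1   /* stop after this callback */
#define TOY_NETDEV_DOWN 2   /* illustrative event number */

struct toy_notifier_block {
    int (*notifier_call)(struct toy_notifier_block *nb,
                         unsigned long event, void *data);
    struct toy_notifier_block *next;
    int priority;           /* higher priority runs first */
};

/* Insert nb so that the list stays sorted by descending priority. */
static void toy_notifier_chain_register(struct toy_notifier_block **head,
                                        struct toy_notifier_block *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every registered function in order, in the caller's context. */
static int toy_notifier_call_chain(struct toy_notifier_block *head,
                                   unsigned long event, void *data)
{
    struct toy_notifier_block *nb;
    int ret = TOY_NOTIFY_DONE;

    for (nb = head; nb; nb = nb->next) {
        ret = nb->notifier_call(nb, event, data);
        if (ret == TOY_NOTIFY_STOP)
            break;
    }
    return ret;
}

/* Two example subscribers that record the order in which they ran. */
static char toy_trace[8];
static int toy_trace_len;

static int toy_route_event(struct toy_notifier_block *nb,
                           unsigned long event, void *data)
{
    (void)nb; (void)data;
    if (event == TOY_NETDEV_DOWN)
        toy_trace[toy_trace_len++] = 'R';  /* e.g. flush routes */
    return TOY_NOTIFY_DONE;
}

static int toy_log_event(struct toy_notifier_block *nb,
                         unsigned long event, void *data)
{
    (void)nb; (void)event; (void)data;
    toy_trace[toy_trace_len++] = 'L';      /* e.g. log the event */
    return TOY_NOTIFY_DONE;
}
```

Because the callbacks run inside toy_notifier_call_chain itself, a slow subscriber delays the subsystem that raised the event, which is the practical consequence of the caller-context rule.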
Let’s look at the boot process from a system-wide perspective. At boot, start_kernel is called, and it starts the init kernel thread. Inside do_initcalls, the functions collected in the .initcallN.init sections are executed in order. Here, .init.setup corresponds to kernel parameters, and device_initcall corresponds to initialization of statically linked device drivers. General parameters are defined using the __setup macro, and parameters needed at an early stage are defined using the early_param macro. Kernel module parameters can be defined with the module_param macro and are exposed as /sys/module/module_name/parameters/parameter_name.
(gdb) list do_initcalls
695 static void __init do_initcalls(void)
696 {
697 initcall_t *fn;
698
699 for (fn = __early_initcall_end; fn < __initcall_end; fn++)
700 do_one_initcall(*fn);
701 }
(gdb) b do_initcalls
Breakpoint 2 at 0xffffffff81c23690: file init/main.c, line 700.
(gdb) continue
(gdb) bt
#0 do_initcalls () at init/main.c:700
#1 do_basic_setup () at init/main.c:718
#2 0xffffffff81c23889 in kernel_init (unused=<optimized out>) at init/main.c:801
#3 0xffffffff814de704 in kernel_thread_helper () at arch/x86/kernel/entry_64.S:1161
#4 0x0000000000000000 in ?? ()
                                                      Macro
__init_begin     -----> +---------------------+
                        |     .init.text      |   __init
                        |                     |
                        +---------------------+
                        |     .init.data      |   __initdata
                        |                     |
__setup_start    -----> +---------------------+
                        |     .init.setup     |   __setup_param
                        |                     |
                        |                     |
__initcall_start -----> +---------------------+
                        |   .initcall1.init   |   core_initcall
                        |                     |
                        +---------------------+
                        |   .initcall2.init   |   postcore_initcall
                        |                     |
                        +---------------------+
                        |         ...         |   ...
                        |                     |
                        |                     |
                        |                     |
                        |                     |
                        +---------------------+
                        |   .initcall6.init   |   device_initcall
                        |                     |
                        +---------------------+
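The trick of collecting function pointers in a dedicated section can be tried in user space. The sketch below assumes GCC or Clang on an ELF platform with GNU ld, which defines __start_<name> and __stop_<name> symbols for sections whose names are valid C identifiers. All toy_* names are hypothetical. One deliberate difference: the kernel gets its ordering from the linker script placing .initcall1.init before .initcall2.init and so on, while this sketch uses a single section and checks a level field at run time.

```c
#include <assert.h>
#include <stddef.h>

struct toy_initcall {
    int level;              /* 1 = core ... 6 = device, as in the diagram */
    const char *name;
    int (*fn)(void);
};

/* Emit a descriptor plus a pointer to it in the section "toy_initcalls".
 * Storing pointers keeps the section a densely packed array. */
#define TOY_INITCALL(lvl, func)                                        \
    static const struct toy_initcall toy_desc_##func =                 \
        { lvl, #func, func };                                          \
    static const struct toy_initcall *const toy_ptr_##func             \
        __attribute__((used, section("toy_initcalls"))) =              \
        &toy_desc_##func

/* Section boundary symbols provided by the linker (assumption: GNU ld). */
extern const struct toy_initcall *const __start_toy_initcalls[];
extern const struct toy_initcall *const __stop_toy_initcalls[];

/* Record the order in which the init functions actually run. */
static char toy_order[8];
static int toy_order_len;

static int toy_driver_init(void) { toy_order[toy_order_len++] = 'D'; return 0; }
static int toy_core_init(void)   { toy_order[toy_order_len++] = 'C'; return 0; }

/* Registered in "wrong" source order on purpose. */
TOY_INITCALL(6, toy_driver_init);
TOY_INITCALL(1, toy_core_init);

/* Run every registered function, lowest level first. */
static void toy_do_initcalls(void)
{
    const struct toy_initcall *const *p;
    int level;

    for (level = 1; level <= 7; level++)
        for (p = __start_toy_initcalls; p < __stop_toy_initcalls; p++)
            if ((*p)->level == level)
                (*p)->fn();
}
```

The point of the design is that adding a new init function needs no central table: the macro next to the function is enough, and the linker gathers all entries into one array.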
How do device drivers interact with the PCI layer? Here, the following three data structures are important. All are initialized and registered within the device driver.
// Corresponds to a specific model of PCI device
struct pci_device_id {
__u32 vendor, device; /* Vendor and device ID or PCI_ANY_ID*/
__u32 subvendor, subdevice; /* Subsystem ID's or PCI_ANY_ID */
__u32 class, class_mask; /* (class,subclass,prog-if) triplet */
kernel_ulong_t driver_data; /* Data private to the driver */
};
// Corresponds to a PCI device
struct pci_dev {
struct list_head bus_list; /* node in per-bus list */
struct pci_bus *bus; /* bus this device is on */
struct pci_bus *subordinate; /* bus this device bridges to */
void *sysdata; /* hook for sys-specific extension */
struct proc_dir_entry *procent; /* device entry in /proc/bus/pci */
struct pci_slot *slot; /* Physical slot this device is in */
unsigned int devfn; /* encoded device & function index */
unsigned short vendor;
unsigned short device;
unsigned short subsystem_vendor;
unsigned short subsystem_device;
unsigned int class; /* 3 bytes: (base,sub,prog-if) */
u8 revision; /* PCI revision, low byte of class word */
u8 hdr_type; /* PCI header type ('multi' flag masked out) */
u8 pcie_cap; /* PCI-E capability offset */
u8 pcie_type; /* PCI-E device/port type */
u8 rom_base_reg; /* which config register controls the ROM */
u8 pin; /* which interrupt pin this device uses */
struct pci_driver *driver; /* which driver has allocated this device */
...
};
// Corresponds to a PCI device driver
struct pci_driver {
struct list_head node;
const char *name;
const struct pci_device_id *id_table; /* must be non-NULL for probe to be called */
int (*probe) (struct pci_dev *dev, const struct pci_device_id *id); /* New device inserted */
void (*remove) (struct pci_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */
int (*suspend) (struct pci_dev *dev, pm_message_t state); /* Device suspended */
int (*suspend_late) (struct pci_dev *dev, pm_message_t state);
int (*resume_early) (struct pci_dev *dev);
int (*resume) (struct pci_dev *dev); /* Device woken up */
void (*shutdown) (struct pci_dev *dev);
struct pci_error_handlers *err_handler;
struct device_driver driver;
struct pci_dynids dynids;
};
Here’s an example from the Intel e100 driver:
static int __init e100_init_module(void)
{
if (((1 << debug) - 1) & NETIF_MSG_DRV) {
pr_info("%s, %s\n", DRV_DESCRIPTION, DRV_VERSION);
pr_info("%s\n", DRV_COPYRIGHT);
}
return pci_register_driver(&e100_driver);
}
module_init(e100_init_module);
static struct pci_driver e100_driver = {
.name = DRV_NAME,
.id_table = e100_id_table,
.probe = e100_probe,
.remove = __devexit_p(e100_remove),
#ifdef CONFIG_PM
/* Power Management hooks */
.suspend = e100_suspend,
.resume = e100_resume,
#endif
.shutdown = e100_shutdown,
.err_handler = &e100_err_handler,
};
#define INTEL_8255X_ETHERNET_DEVICE(device_id, ich) {\
PCI_VENDOR_ID_INTEL, device_id, PCI_ANY_ID, PCI_ANY_ID, \
PCI_CLASS_NETWORK_ETHERNET << 8, 0xFFFF00, ich }
static DEFINE_PCI_DEVICE_TABLE(e100_id_table) = {
INTEL_8255X_ETHERNET_DEVICE(0x1029, 0),
INTEL_8255X_ETHERNET_DEVICE(0x1030, 0),
INTEL_8255X_ETHERNET_DEVICE(0x1031, 3),
INTEL_8255X_ETHERNET_DEVICE(0x1032, 3),
...
};
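How an id_table entry is compared against a detected device can be sketched in user space as below. This is a simplified model, assuming the match rule is "each ID field equals the device's value or is the wildcard, and the class matches under class_mask". The toy_* names are hypothetical, the class field is renamed class_code, and only two of the e100 entries are reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PCI_ANY_ID             0xffffffffu  /* wildcard */
#define TOY_VENDOR_INTEL           0x8086
#define TOY_CLASS_NETWORK_ETHERNET 0x0200

struct toy_pci_device_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;
    unsigned long driver_data;
};

struct toy_pci_dev {
    uint16_t vendor, device;
    uint16_t subsystem_vendor, subsystem_device;
    uint32_t class_code;           /* (base, sub, prog-if) in 3 bytes */
};

/* One table entry matches if every field is a wildcard or equal, and
 * the class bits selected by class_mask agree. */
static int toy_match_one(const struct toy_pci_device_id *id,
                         const struct toy_pci_dev *dev)
{
    return (id->vendor == TOY_PCI_ANY_ID || id->vendor == dev->vendor) &&
           (id->device == TOY_PCI_ANY_ID || id->device == dev->device) &&
           (id->subvendor == TOY_PCI_ANY_ID ||
            id->subvendor == dev->subsystem_vendor) &&
           (id->subdevice == TOY_PCI_ANY_ID ||
            id->subdevice == dev->subsystem_device) &&
           !((id->class_code ^ dev->class_code) & id->class_mask);
}

/* Walk a table that ends with an all-zero entry. */
static const struct toy_pci_device_id *
toy_match_id(const struct toy_pci_device_id *ids,
             const struct toy_pci_dev *dev)
{
    for (; ids->vendor || ids->subvendor || ids->class_mask; ids++)
        if (toy_match_one(ids, dev))
            return ids;
    return NULL;
}

/* Same shape as INTEL_8255X_ETHERNET_DEVICE(device_id, ich). */
#define TOY_8255X(device_id, ich)                                     \
    { TOY_VENDOR_INTEL, device_id, TOY_PCI_ANY_ID, TOY_PCI_ANY_ID,    \
      TOY_CLASS_NETWORK_ETHERNET << 8, 0xFFFF00, ich }

static const struct toy_pci_device_id toy_e100_ids[] = {
    TOY_8255X(0x1029, 0),
    TOY_8255X(0x1031, 3),
    { 0, 0, 0, 0, 0, 0, 0 },       /* terminator */
};
```

The entry that matched is what gets passed to probe as its id argument, so a driver can read driver_data (the ich value in the e100 table) to distinguish chip variants.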
When a device is detected in xxx_probe, memory for struct net_device is allocated and registered to dev_base using register_netdev. Since struct net_device has a large number of parameters, the initialization is split between ether_setup and the device driver’s probe function.
(gdb) list ether_setup
334 void ether_setup(struct net_device *dev)
335 {
336 dev->header_ops = &eth_header_ops;
337 dev->type = ARPHRD_ETHER;
338 dev->hard_header_len = ETH_HLEN;
339 dev->mtu = ETH_DATA_LEN;
(gdb) break ether_setup
Breakpoint 2 at 0xffffffff8142a0a9: file net/ethernet/eth.c, line 336.
(gdb) continue
Continuing.
(gdb) bt
#0 ether_setup (dev=0xffff88001ce61000) at net/ethernet/eth.c:336
#1 0xffffffff81412215 in alloc_netdev_mqs (sizeof_priv=<optimized out>, name=0xffffffff817f4daa "eth%d",
setup=0xffffffff8142a0a0 <ether_setup>, txqs=1, rxqs=1) at net/core/dev.c:5824
#2 0xffffffff8142a091 in alloc_etherdev_mqs (sizeof_priv=<optimized out>, txqs=<optimized out>,
rxqs=<optimized out>) at net/ethernet/eth.c:367
#3 0xffffffff813686a1 in virtnet_probe (vdev=0xffff88001cdc7c00) at drivers/net/virtio_net.c:904
#4 0xffffffff812cd903 in virtio_dev_probe (_d=0xffff88001cdc7c08) at drivers/virtio/virtio.c:139
#5 0xffffffff8131d537 in really_probe (dev=0xffff88001cdc7c08, drv=0xffffffff81a95100 <virtio_net_driver>)
at drivers/base/dd.c:129
#6 0xffffffff8131d73e in driver_probe_device (drv=0xffffffff81a95100 <virtio_net_driver>,
dev=0xffff88001cdc7c08) at drivers/base/dd.c:212
#7 0xffffffff8131d84b in __driver_attach (dev=0xffff88001cdc7c08, data=0xffffffff81a95100 <virtio_net_driver>)
at drivers/base/dd.c:286
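The backtrace shows alloc_etherdev_mqs passing ether_setup as a callback to alloc_netdev_mqs, together with the size of the driver's private data. A user-space sketch of that pattern follows. All toy_* names are hypothetical, the 32-byte alignment constant is an assumption, and the queue setup, the alignment of the allocation itself, and the later expansion of the "eth%d" template at registration time are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NETDEV_ALIGN 32    /* assumed alignment of the private area */

struct toy_netdev {
    char name[16];
    unsigned int mtu;
    unsigned short hard_header_len;
};

/* Example of a driver's private state stored behind the device. */
struct toy_priv {
    int rx_packets;
};

static size_t toy_align(size_t n)
{
    return (n + TOY_NETDEV_ALIGN - 1) & ~(size_t)(TOY_NETDEV_ALIGN - 1);
}

/* The private area sits right after the (padded) generic structure. */
static void *toy_netdev_priv(struct toy_netdev *dev)
{
    return (char *)dev + toy_align(sizeof(*dev));
}

/* Generic Ethernet defaults, the part ether_setup() is responsible for. */
static void toy_ether_setup(struct toy_netdev *dev)
{
    dev->hard_header_len = 14;   /* ETH_HLEN */
    dev->mtu = 1500;             /* ETH_DATA_LEN */
}

/* One allocation holds both the generic part and the driver's part. */
static struct toy_netdev *toy_alloc_netdev(size_t sizeof_priv,
                                           const char *name,
                                           void (*setup)(struct toy_netdev *))
{
    struct toy_netdev *dev = calloc(1, toy_align(sizeof(*dev)) + sizeof_priv);

    if (!dev)
        return NULL;
    setup(dev);                  /* link-type specific defaults */
    snprintf(dev->name, sizeof(dev->name), "%s", name);
    return dev;
}

static struct toy_netdev *toy_alloc_etherdev(size_t sizeof_priv)
{
    return toy_alloc_netdev(sizeof_priv, "eth%d", toy_ether_setup);
}
```

A probe function in this model would call toy_alloc_etherdev(sizeof(struct toy_priv)), fill in the hardware-specific fields through toy_netdev_priv(dev), and only then register the device, which mirrors the split between ether_setup and the driver's probe described above.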
Device drivers are also responsible for registering their interrupt handlers, which is done with request_irq. Passing the sharing flag (SA_SHIRQ in older kernels, IRQF_SHARED in 2.6.39) allows one IRQ number to be shared by multiple interrupt handlers.
int request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn, unsigned long irqflags,
const char *devname, void *dev_id);
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
const char *name, void *dev)
{
return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}
struct irqaction {
irq_handler_t handler;
unsigned long flags;
void *dev_id;
struct irqaction *next;
int irq;
irq_handler_t thread_fn;
struct task_struct *thread;
unsigned long thread_flags;
unsigned long thread_mask;
const char *name;
struct proc_dir_entry *dir;
} ____cacheline_internodealigned_in_smp;
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
[0 ... NR_IRQS-1] = {
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
}
};
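The role of the next pointer and dev_id in struct irqaction is easier to see in a small model of a shared IRQ line. The sketch below is user-space code with hypothetical toy_* names. It assumes the usual convention that every handler on a shared line is called and each one checks its own device, returning "handled" or "none"; threading, locking and the irq_desc layer are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_IRQS     16
#define TOY_IRQF_SHARED 0x80   /* model flag: this handler accepts sharing */

enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };
typedef enum toy_irqreturn (*toy_irq_handler_t)(int irq, void *dev_id);

struct toy_irqaction {
    toy_irq_handler_t handler;
    unsigned long flags;
    void *dev_id;                  /* tells handlers on one line apart */
    struct toy_irqaction *next;    /* other handlers on the same IRQ */
};

static struct toy_irqaction *toy_irq_table[TOY_NR_IRQS];

/* Returns 0 on success, -1 if the line is taken and either side
 * did not ask for sharing. */
static int toy_request_irq(int irq, struct toy_irqaction *act)
{
    struct toy_irqaction **p;

    if (irq < 0 || irq >= TOY_NR_IRQS)
        return -1;
    p = &toy_irq_table[irq];
    if (*p && !((*p)->flags & act->flags & TOY_IRQF_SHARED))
        return -1;
    while (*p)
        p = &(*p)->next;
    act->next = NULL;
    *p = act;
    return 0;
}

/* Call every handler on the line; return how many claimed the IRQ. */
static int toy_handle_irq(int irq)
{
    struct toy_irqaction *act;
    int handled = 0;

    for (act = toy_irq_table[irq]; act; act = act->next)
        if (act->handler(irq, act->dev_id) == TOY_IRQ_HANDLED)
            handled++;
    return handled;
}

/* A toy device and a handler that checks whether it raised the IRQ. */
struct toy_nic {
    int irq_pending;
    int serviced;
};

static enum toy_irqreturn toy_nic_interrupt(int irq, void *dev_id)
{
    struct toy_nic *nic = dev_id;

    (void)irq;
    if (!nic->irq_pending)
        return TOY_IRQ_NONE;       /* not ours: another device fired */
    nic->irq_pending = 0;
    nic->serviced++;
    return TOY_IRQ_HANDLED;
}
```

Since the same handler function can be registered for several devices, dev_id is what lets each invocation find the right device, and it is also what identifies which registration to remove when an IRQ is freed.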
When a device is connected, which driver is chosen? It’s interesting that this mechanism goes through user space once: kernel-user-kernel. For example, when executing modprobe eth0, the device driver 3c59x is loaded based on alias eth0 3c59x written in /etc/modprobe.conf. The kernel functions corresponding to this are request_module and call_usermodehelper.
How does a NIC detect link status? When the hardware detects a change in carrier or signal, it either notifies the driver (typically with an interrupt) or updates a configuration register that the driver checks. The device driver then notices the change and calls linkwatch_fire_event to queue an event. The event is processed later by linkwatch_event in the keventd_wq kernel thread. linkwatch_event is responsible for updating the state in struct net_device and sending the notifications.
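The split between "fire the event now" and "process it later" can be modelled in user space as below. This is a sketch under assumptions: the toy_* names are hypothetical, a plain boolean stands in for the kernel's pending bit, there is no locking or timer, and the worker is called by hand instead of running in a kernel thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_lw_dev {
    bool carrier_ok;             /* current link state */
    bool linkwatch_pending;      /* already queued for the worker? */
    int notifications;           /* how many times listeners were told */
    struct toy_lw_dev *link_watch_next;
};

static struct toy_lw_dev *toy_lweventlist;   /* devices awaiting processing */

/* Driver side: queue the device unless it is queued already. */
static void toy_linkwatch_fire_event(struct toy_lw_dev *dev)
{
    if (dev->linkwatch_pending)
        return;                  /* coalesce repeated events */
    dev->linkwatch_pending = true;
    dev->link_watch_next = toy_lweventlist;
    toy_lweventlist = dev;
}

static void toy_carrier_on(struct toy_lw_dev *dev)
{
    if (!dev->carrier_ok) {
        dev->carrier_ok = true;
        toy_linkwatch_fire_event(dev);
    }
}

static void toy_carrier_off(struct toy_lw_dev *dev)
{
    if (dev->carrier_ok) {
        dev->carrier_ok = false;
        toy_linkwatch_fire_event(dev);
    }
}

/* Worker side: runs later, outside the interrupt path. */
static void toy_linkwatch_event(void)
{
    struct toy_lw_dev *list = toy_lweventlist;

    toy_lweventlist = NULL;
    while (list) {
        struct toy_lw_dev *dev = list;

        list = dev->link_watch_next;
        dev->linkwatch_pending = false;
        dev->notifications++;    /* stand-in for state update + notify */
    }
}
```

One consequence of this design is that a flapping link does not flood the rest of the system: several carrier changes between two runs of the worker collapse into a single notification that reports the final state.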