This blog has mostly covered relatively new networking technologies, but it’s good to revisit the fundamentals. The Linux kernel will be in use for a long time to come, and even without a perfect understanding, having a grasp of the basics is worthwhile. This book is over 1,000 pages and consists of parts 1-7, so it is hard to stay motivated to read it all in one go. In this article, I’ll summarize what I’ve read so far. I kept the source code of kernel version 2.6.39 at hand while reading. For build methods and such, see my previous article 1.

Part 1

The important data structures for networking are struct sk_buff and struct net_device, so it’s essential to grasp these two first. struct sk_buff (setting aside fragmentation) corresponds to one packet. Its instance is usually named skb. skb->data points to the header of the layer that is currently processing the packet. For example, during L2 processing, skb->data points to the beginning of the L2 header. As processing moves from layer to layer, this pointer moves too. There is spare room before and after the actual data (headroom and tailroom).

+------------+                           skb->mac   skb->nh
|            |                              |         |
| head----------->  +------------+          |         |
|            |      | headroom   |          v         v
| data----------->  +------------+          +---------+---------+---------+---
|            |      |            |          | L2      | L3      | L4      |
| tail       |      | Data       |          | header  | header  | header  | ...
|     |      |      |            |          +---------+---------+---------+---
| end |      |      |            |          ^         ^
|  |  +---------->  +------------+          |         |
|  |         |      | tailroom   |          |         |
|  +------------->  +------------+          +---------+
|            |                               skb->data
+------------+
struct sk_buff
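
To get a feel for how the four pointers in the diagram move, here is a minimal user-space sketch. It is not kernel code: struct toy_skb and the toy_* functions are names I made up, and they only imitate what skb_reserve, skb_put, skb_push and skb_pull do to head, data, tail and end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for struct sk_buff: only the four buffer pointers. */
struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the data currently being processed */
    unsigned char *tail;  /* end of the data */
    unsigned char *end;   /* end of the allocated buffer */
};

size_t toy_headroom(const struct toy_skb *skb) { return (size_t)(skb->data - skb->head); }
size_t toy_tailroom(const struct toy_skb *skb) { return (size_t)(skb->end - skb->tail); }
size_t toy_len(const struct toy_skb *skb)      { return (size_t)(skb->tail - skb->data); }

/* Allocate `size` bytes; the buffer starts empty with no headroom. */
int toy_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

/* Imitates skb_reserve(): create headroom in an empty buffer. */
void toy_reserve(struct toy_skb *skb, size_t len)
{
    assert(toy_len(skb) == 0);
    assert(len <= toy_tailroom(skb));
    skb->data += len;
    skb->tail += len;
}

/* Imitates skb_put(): append len bytes, return where they start. */
unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;
    assert(len <= toy_tailroom(skb));
    skb->tail += len;
    return old_tail;
}

/* Imitates skb_push(): prepend a header using the headroom (transmit path). */
unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert(len <= toy_headroom(skb));
    skb->data -= len;
    return skb->data;
}

/* Imitates skb_pull(): skip a header a layer has finished with (receive path). */
unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert(len <= toy_len(skb));
    skb->data += len;
    return skb->data;
}
```

On transmit, a payload would be added with toy_put after reserving headroom, and each lower layer would call toy_push to prepend its header. On receive the opposite happens: each layer calls toy_pull once it has consumed its own header.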

Let’s look at what members this structure has. users is a reference counter: skb_get increments it, and kfree_skb decrements it and frees the buffer once the count reaches zero. There are also pointers to each layer’s header, such as mac_header. cb is short for control buffer, a 48-byte region that each layer can use privately (without being aware of other layers). struct sk_buff instances are kept on doubly-linked lists, with the head of each list represented by struct sk_buff_head. Let’s use a debugger to examine the contents. By setting a breakpoint on a function responsible for transmission, we can see the contents of the IP header held inside struct sk_buff.

(gdb) list dev_hard_start_xmit
2086    int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
2087                            struct netdev_queue *txq)
2088    {
2089            const struct net_device_ops *ops = dev->netdev_ops;
2090            int rc = NETDEV_TX_OK;
2091
2092            if (likely(!skb->next)) {
(gdb) break dev_hard_start_xmit
Breakpoint 2 at 0xffffffff8140e036: file net/core/dev.c, line 2092.
(gdb) continue
(gdb) print *((struct iphdr *)(skb->head + skb->network_header))
$3 = {ihl = 5 '\005', version = 4 '\004', tos = 0 '\000', tot_len = 18433,
  id = 0, frag_off = 0, ttl = 64 '@', protocol = 17 '\021', check = 42617,
  saddr = 0, daddr = 4294967295}
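
The multi-byte fields in this output look odd because gdb prints the raw network-byte-order values as host integers. Here is a small sketch for decoding them, assuming a little-endian host (the x86-64 machine used here is one). swap16 and format_addr are helper names of my own, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Swap the two bytes of a 16-bit value; this is what ntohs() does on a
 * little-endian host. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Format an IPv4 address that was read as a host integer from
 * network-order bytes on a little-endian host: the lowest byte of the
 * integer is the first octet on the wire. */
void format_addr(uint32_t raw, char *buf, size_t len)
{
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)(raw & 0xff), (unsigned)((raw >> 8) & 0xff),
             (unsigned)((raw >> 16) & 0xff), (unsigned)((raw >> 24) & 0xff));
}
```

With these, tot_len = 18433 becomes swap16(18433) = 328 bytes, saddr = 0 is 0.0.0.0, and daddr = 4294967295 is 255.255.255.255. Together with protocol = 17 (UDP), the captured packet is a UDP broadcast sent from a host that has no address yet.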

Next, let’s look at struct net_device. One instance is created for each network interface, whether virtual or physical. In older kernels all instances hang off the global list dev_base; in 2.6.39 the list head is kept per network namespace (dev_base_head in struct net, with init_net as the initial namespace). The structure holds the device name, IRQ number, and several sets of flags (flags, gflags, priv_flags). If it’s a virtual interface, you can find the original device by following the master field. Interfaces are often looked up by name or index, so hash tables are provided for both (dev_name_head and dev_index_head). As before, let’s attach a debugger to the transmission function and examine the contents of struct net_device.

(gdb) print init_net->dev_base_head
$8 = {next = 0xffff88001d89f080, prev = 0xffff88001ce5f080}
(gdb) print init_net->dev_name_head
$9 = (struct hlist_head *) 0xffff88001d878800
(gdb) print init_net->dev_index_head
$10 = (struct hlist_head *) 0xffff88001d89f800
(gdb) print *dev
$11 = {name = "eth0", '\000' <repeats 11 times>,
       pm_qos_req = {list = {prio = 0, prio_list = { ...
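
The two hash tables can be pictured with the following user-space sketch. It only shows the idea of keeping every device in a name-keyed table and an index-keyed table at once. The toy_* names, the table size and the hash function are all my own assumptions; the kernel has its own hashing and uses hlist chains (the lookups correspond roughly to __dev_get_by_name and __dev_get_by_index).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_HASH_SIZE 16  /* hypothetical table size */

struct toy_dev {
    char name[16];
    int ifindex;
    struct toy_dev *name_next;   /* next device in the same name bucket */
    struct toy_dev *index_next;  /* next device in the same index bucket */
};

struct toy_net {
    struct toy_dev *name_head[TOY_HASH_SIZE];
    struct toy_dev *index_head[TOY_HASH_SIZE];
};

/* Simple string hash, chosen only for illustration. */
static unsigned toy_name_hash(const char *name)
{
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % TOY_HASH_SIZE;
}

/* Link the device into both tables. */
void toy_register(struct toy_net *net, struct toy_dev *dev)
{
    unsigned nb = toy_name_hash(dev->name);
    unsigned ib = (unsigned)dev->ifindex % TOY_HASH_SIZE;

    dev->name_next = net->name_head[nb];
    net->name_head[nb] = dev;
    dev->index_next = net->index_head[ib];
    net->index_head[ib] = dev;
}

/* Look a device up by name: walk only the bucket the name hashes to. */
struct toy_dev *toy_get_by_name(struct toy_net *net, const char *name)
{
    struct toy_dev *d;
    for (d = net->name_head[toy_name_hash(name)]; d; d = d->name_next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/* Look a device up by interface index. */
struct toy_dev *toy_get_by_index(struct toy_net *net, int ifindex)
{
    struct toy_dev *d;
    for (d = net->index_head[(unsigned)ifindex % TOY_HASH_SIZE]; d; d = d->index_next)
        if (d->ifindex == ifindex)
            return d;
    return NULL;
}
```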

Changing topics, let’s move to the interface between user and kernel space. Historically, there are many interfaces here. To summarize, it’s something like this:

Name                Contents
------------------  --------------------------------------------------------
procfs (/proc)      Usually read-only. Network-related files are gathered in /proc/net. Can be registered with proc_net_fops_create.
sysctl (/proc/sys)  Can be used from the sysctl command. Corresponds to actual kernel variables. Can be registered with register_sysctl_table.
sysfs (/sys)        A reorganization of procfs and sysctl contents in kernel 2.6.
ioctl               Used by ifconfig, ethtool, mii-tool.
netlink             A newer mechanism exposed through the socket API. Used by iproute2. The only one that can send notifications from kernel to user.

When ioctl is issued from the ifconfig command, depending on the request content such as SIOCGIFADDR, devinet_ioctl() is called via sock_ioctl(). By setting up a debugger, we can confirm it’s actually being called. By the way, the functions called from the ethtool command seem to be callbacks within struct ethtool_ops defined in the driver code.

(gdb) list devinet_ioctl
685     int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
686     {
687             struct ifreq ifr;
688             struct sockaddr_in sin_orig;
689             struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
690             struct in_device *in_dev;
(gdb) break devinet_ioctl
Breakpoint 3 at 0xffffffff81475264: file net/ipv4/devinet.c, line 702.
(gdb) continue
(gdb) bt
#0  devinet_ioctl (net=0xffffffff81f8b040 <init_net>, cmd=35093, arg=0x7fff25324a10) at net/ipv4/devinet.c:702
#1  0xffffffff814769d8 in inet_ioctl (sock=<optimized out>, cmd=<optimized out>, arg=<optimized out>)
    at net/ipv4/af_inet.c:870
#2  0xffffffff813f7cc0 in sock_do_ioctl (net=0xffffffff81f8b040 <init_net>, sock=<optimized out>, cmd=35093,
    arg=140733817440784) at net/socket.c:945
#3  0xffffffff813f8119 in sock_ioctl (file=<optimized out>, cmd=35093, arg=<optimized out>) at net/socket.c:1030
#4  0xffffffff8116cd8c in vfs_ioctl (arg=<optimized out>, cmd=<optimized out>, filp=0xffff88001dbea200)
    at fs/ioctl.c:43
#5  do_vfs_ioctl (filp=0xffff88001dbea200, fd=3, cmd=<optimized out>, arg=<optimized out>) at fs/ioctl.c:598
#6  0xffffffff8116d0e1 in sys_ioctl (fd=3, cmd=35093, arg=140733817440784) at fs/ioctl.c:618
#7  0xffffffff814dd5c2 in system_call () at arch/x86/kernel/entry_64.S:487
#8  0x000000000049e417 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) p /x cmd
$14 = 0x8915 # corresponds to SIOCGIFADDR 0x8915
(gdb) p ifr.ifr_ifrn.ifrn_name
$15 = "\002\000\000\000\000\000\000\000\001\000\000\000\000\000\000"

Part 2

Inside the kernel, there are several interdependent subsystems, so when an event occurs or is detected in one subsystem, other subsystems need to be notified. This is done through the notification chain mechanism. For example, when a link goes down, the affected entries are deleted from the routing table. There are lists of function pointers named xxx_chain, xxx_notifier_chain, and xxx_notifier_list, and the mechanism is simply that you add callbacks to them. Function pointers are registered in these lists with notifier_chain_register. In practice, there are often wrappers like register_inetaddr_notifier and register_netdevice_notifier. Invocation is done with notifier_call_chain. Note that the functions registered in the list are executed in the context of the caller of this function. For example, network-related notification chains like inetaddr_chain and netdev_chain may have functions from other subsystems registered, and conversely, network-related functions may be registered in reboot_notifier_list.
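
The mechanism is small enough to imitate in user space. The sketch below is a simplified model, not the kernel implementation: the toy_* names and the event number are hypothetical, and there is none of the locking that the kernel's atomic, blocking and raw chain variants provide. It shows the two essential operations: registering a callback in priority order, and calling every callback in turn from the notifier's own context.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOTIFY_DONE 0   /* keep walking the chain */
#define TOY_NOTIFY_STOP 1   /* stop calling further callbacks */
#define TOY_NETDEV_DOWN 2   /* hypothetical event number */

struct toy_notifier_block {
    int (*notifier_call)(struct toy_notifier_block *nb,
                         unsigned long event, void *data);
    struct toy_notifier_block *next;
    int priority;           /* higher priority runs first */
};

/* Insert nb so that the chain stays sorted by descending priority. */
void toy_chain_register(struct toy_notifier_block **head,
                        struct toy_notifier_block *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every registered function, directly from the caller's context. */
int toy_call_chain(struct toy_notifier_block *head,
                   unsigned long event, void *data)
{
    struct toy_notifier_block *nb;
    int ret = TOY_NOTIFY_DONE;

    for (nb = head; nb; nb = nb->next) {
        ret = nb->notifier_call(nb, event, data);
        if (ret == TOY_NOTIFY_STOP)
            break;
    }
    return ret;
}

/* Hypothetical subscriber: drop all routes when a link goes down.
 * `data` points to the number of routes using that link. */
int toy_route_event(struct toy_notifier_block *nb,
                    unsigned long event, void *data)
{
    int *routes = data;

    (void)nb;
    if (event == TOY_NETDEV_DOWN)
        *routes = 0;
    return TOY_NOTIFY_DONE;
}
```

Because toy_call_chain is an ordinary loop of function calls, a slow callback delays whoever raised the event, which is the point made above about execution context.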

Let’s look at the boot process from a system-wide perspective. At boot, start_kernel is called, and it starts the init kernel thread. That thread runs do_initcalls, which executes the functions placed in the .initcallN.init sections in order. Here, .init.setup holds the handlers for kernel parameters, and device_initcall is the level used to initialize statically linked device drivers. General parameters are defined with the __setup macro, and parameters needed at an early stage are defined with the early_param macro. Kernel module parameters are defined with the module_param macro and, when the permission argument is non-zero, are exposed as /sys/module/module_name/parameters/parameter_name.

(gdb) list do_initcalls
695     static void __init do_initcalls(void)
696     {
697             initcall_t *fn;
698
699             for (fn = __early_initcall_end; fn < __initcall_end; fn++)
700                     do_one_initcall(*fn);
701     }
(gdb) b do_initcalls
Breakpoint 2 at 0xffffffff81c23690: file init/main.c, line 700.
(gdb) continue
(gdb) bt
#0  do_initcalls () at init/main.c:700
#1  do_basic_setup () at init/main.c:718
#2  0xffffffff81c23889 in kernel_init (unused=<optimized out>) at init/main.c:801
#3  0xffffffff814de704 in kernel_thread_helper () at arch/x86/kernel/entry_64.S:1161
#4  0x0000000000000000 in ?? ()

                                                    Macro

   __init_begin  ----->  +---------------------+
                         | .init.text          |   __init
                         |                     |
                         +---------------------+
                         | .init.data          |   __initdata
                         |                     |
  __setup_start  ----->  +---------------------+
                         | .init.setup         |  __setup_param
                         |                     |
                         |                     |
__initcall_start ----->  +---------------------+
                         | .initcall1.init     |  core_initcall
                         |                     |
                         +---------------------+
                         | .initcall2.init     |  postcore_initcall
                         |                     |
                         +---------------------+
                         | ...                 |  ...
                         |                     |
                         |                     |
                         |                     |
                         |                     |
                         +---------------------+
                         | .initcall6.init     |  device_initcall
                         |                     |
                         +---------------------+
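
The ordering guarantee is the interesting part: functions are not called in source order but in the order of their level. The sketch below models that in user space. It is only an analogy: in the kernel the linker script lays out the .initcallN.init sections, whereas here a sort by level stands in for the linker, and all toy_* names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*toy_initcall_t)(void);

struct toy_initcall {
    int level;           /* 1 = core, 2 = postcore, ..., 6 = device */
    toy_initcall_t fn;
};

/* Records the order in which the init functions actually ran. */
int toy_order[8];
int toy_count;

int toy_core_init(void)   { toy_order[toy_count++] = 1; return 0; }
int toy_subsys_init(void) { toy_order[toy_count++] = 4; return 0; }
int toy_driver_init(void) { toy_order[toy_count++] = 6; return 0; }

static int toy_by_level(const void *a, const void *b)
{
    const struct toy_initcall *x = a, *y = b;
    return x->level - y->level;
}

/* Stands in for the linker, which places .initcall1.init before
 * .initcall2.init and so on, regardless of source file order. */
void toy_link(struct toy_initcall *calls, size_t n)
{
    qsort(calls, n, sizeof(*calls), toy_by_level);
}

/* Imitates do_initcalls(): walk the table from start to end. */
void toy_do_initcalls(const struct toy_initcall *start,
                      const struct toy_initcall *end)
{
    const struct toy_initcall *c;

    for (c = start; c < end; c++)
        c->fn();
}
```

Even if a driver's entry appears first in the table, after toy_link it runs last, which is why a statically linked driver registered with device_initcall can rely on core and subsystem initialization having finished.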

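How do device drivers interact with the PCI layer? The following three data structures are important. struct pci_device_id and struct pci_driver are initialized in the device driver and registered by it, while struct pci_dev is created by the PCI layer for each device it finds and is passed to the driver’s probe function.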

// Corresponds to a specific model of PCI device
struct pci_device_id {
  __u32 vendor, device;   /* Vendor and device ID or PCI_ANY_ID*/
  __u32 subvendor, subdevice; /* Subsystem ID's or PCI_ANY_ID */
  __u32 class, class_mask;  /* (class,subclass,prog-if) triplet */
  kernel_ulong_t driver_data; /* Data private to the driver */
};

// Corresponds to a PCI device
struct pci_dev {
  struct list_head bus_list;  /* node in per-bus list */
  struct pci_bus  *bus;   /* bus this device is on */
  struct pci_bus  *subordinate; /* bus this device bridges to */

  void    *sysdata; /* hook for sys-specific extension */
  struct proc_dir_entry *procent; /* device entry in /proc/bus/pci */
  struct pci_slot *slot;    /* Physical slot this device is in */

  unsigned int  devfn;    /* encoded device & function index */
  unsigned short  vendor;
  unsigned short  device;
  unsigned short  subsystem_vendor;
  unsigned short  subsystem_device;
  unsigned int  class;    /* 3 bytes: (base,sub,prog-if) */
  u8    revision; /* PCI revision, low byte of class word */
  u8    hdr_type; /* PCI header type ('multi' flag masked out) */
  u8    pcie_cap; /* PCI-E capability offset */
  u8    pcie_type;  /* PCI-E device/port type */
  u8    rom_base_reg; /* which config register controls the ROM */
  u8    pin;      /* which interrupt pin this device uses */

  struct pci_driver *driver;  /* which driver has allocated this device */
  ...
};

// Corresponds to a PCI device driver
struct pci_driver {
  struct list_head node;
  const char *name;
  const struct pci_device_id *id_table; /* must be non-NULL for probe to be called */
  int  (*probe)  (struct pci_dev *dev, const struct pci_device_id *id); /* New device inserted */
  void (*remove) (struct pci_dev *dev); /* Device removed (NULL if not a hot-plug capable driver) */
  int  (*suspend) (struct pci_dev *dev, pm_message_t state);  /* Device suspended */
  int  (*suspend_late) (struct pci_dev *dev, pm_message_t state);
  int  (*resume_early) (struct pci_dev *dev);
  int  (*resume) (struct pci_dev *dev);                 /* Device woken up */
  void (*shutdown) (struct pci_dev *dev);
  struct pci_error_handlers *err_handler;
  struct device_driver  driver;
  struct pci_dynids dynids;
};

Here’s an example from the Intel e100 driver:

static int __init e100_init_module(void)
{
  if (((1 << debug) - 1) & NETIF_MSG_DRV) {
    pr_info("%s, %s\n", DRV_DESCRIPTION, DRV_VERSION);
    pr_info("%s\n", DRV_COPYRIGHT);
  }
  return pci_register_driver(&e100_driver);
}

module_init(e100_init_module);

static struct pci_driver e100_driver = {
  .name =         DRV_NAME,
  .id_table =     e100_id_table,
  .probe =        e100_probe,
  .remove =       __devexit_p(e100_remove),
#ifdef CONFIG_PM
  /* Power Management hooks */
  .suspend =      e100_suspend,
  .resume =       e100_resume,
#endif
  .shutdown =     e100_shutdown,
  .err_handler = &e100_err_handler,
};

#define INTEL_8255X_ETHERNET_DEVICE(device_id, ich) {\
  PCI_VENDOR_ID_INTEL, device_id, PCI_ANY_ID, PCI_ANY_ID, \
  PCI_CLASS_NETWORK_ETHERNET << 8, 0xFFFF00, ich }
static DEFINE_PCI_DEVICE_TABLE(e100_id_table) = {
  INTEL_8255X_ETHERNET_DEVICE(0x1029, 0),
  INTEL_8255X_ETHERNET_DEVICE(0x1030, 0),
  INTEL_8255X_ETHERNET_DEVICE(0x1031, 3),
  INTEL_8255X_ETHERNET_DEVICE(0x1032, 3),
  ...
};
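
How the PCI layer decides that an entry of the ID table matches a device can be sketched as follows. This is a user-space model written from the field comments above (PCI_ANY_ID as a wildcard, class compared under class_mask), not the kernel's own matching code. The toy_* names are mine, and the subsystem IDs used for the sample device in the usage note are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PCI_ANY_ID 0xffffffffu   /* wildcard, like PCI_ANY_ID */

struct toy_pci_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t dev_class, class_mask;  /* "class" in the kernel structure */
    unsigned long driver_data;
};

struct toy_pci_dev {
    uint16_t vendor, device;
    uint16_t subsystem_vendor, subsystem_device;
    uint32_t dev_class;              /* 3 bytes: base, sub, prog-if */
};

static int toy_field_ok(uint32_t want, uint32_t have)
{
    return want == TOY_PCI_ANY_ID || want == have;
}

/* Return the first table entry matching dev, or NULL.
 * The table ends with an all-zero entry. */
const struct toy_pci_id *toy_match_id(const struct toy_pci_id *id,
                                      const struct toy_pci_dev *dev)
{
    for (; id->vendor || id->subvendor || id->class_mask; id++) {
        if (toy_field_ok(id->vendor, dev->vendor) &&
            toy_field_ok(id->device, dev->device) &&
            toy_field_ok(id->subvendor, dev->subsystem_vendor) &&
            toy_field_ok(id->subdevice, dev->subsystem_device) &&
            !((id->dev_class ^ dev->dev_class) & id->class_mask))
            return id;
    }
    return NULL;
}

/* Two entries written out by hand in the shape of the e100 macro:
 * Intel vendor ID 0x8086, Ethernet class 0x0200 shifted left by 8. */
const struct toy_pci_id toy_e100_ids[] = {
    { 0x8086, 0x1029, TOY_PCI_ANY_ID, TOY_PCI_ANY_ID, 0x0200 << 8, 0xFFFF00, 0 },
    { 0x8086, 0x1031, TOY_PCI_ANY_ID, TOY_PCI_ANY_ID, 0x0200 << 8, 0xFFFF00, 3 },
    { 0, 0, 0, 0, 0, 0, 0 }
};
```

A device with vendor 0x8086 and device 0x1031 matches the second entry whatever its subsystem IDs are, and the driver's probe function then receives that entry, including driver_data = 3.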

When a device is detected, xxx_probe allocates memory for struct net_device (for Ethernet devices, through alloc_etherdev) and registers it in dev_base with register_netdev. Since struct net_device has a large number of fields, the initialization is split between ether_setup, which fills in the values common to Ethernet devices, and the device driver’s probe function.

(gdb) list ether_setup
334     void ether_setup(struct net_device *dev)
335     {
336             dev->header_ops         = &eth_header_ops;
337             dev->type               = ARPHRD_ETHER;
338             dev->hard_header_len    = ETH_HLEN;
339             dev->mtu                = ETH_DATA_LEN;
(gdb) break ether_setup
Breakpoint 2 at 0xffffffff8142a0a9: file net/ethernet/eth.c, line 336.
(gdb) continue
Continuing.

(gdb) bt
#0  ether_setup (dev=0xffff88001ce61000) at net/ethernet/eth.c:336
#1  0xffffffff81412215 in alloc_netdev_mqs (sizeof_priv=<optimized out>, name=0xffffffff817f4daa "eth%d",
    setup=0xffffffff8142a0a0 <ether_setup>, txqs=1, rxqs=1) at net/core/dev.c:5824
#2  0xffffffff8142a091 in alloc_etherdev_mqs (sizeof_priv=<optimized out>, txqs=<optimized out>,
    rxqs=<optimized out>) at net/ethernet/eth.c:367
#3  0xffffffff813686a1 in virtnet_probe (vdev=0xffff88001cdc7c00) at drivers/net/virtio_net.c:904
#4  0xffffffff812cd903 in virtio_dev_probe (_d=0xffff88001cdc7c08) at drivers/virtio/virtio.c:139
#5  0xffffffff8131d537 in really_probe (dev=0xffff88001cdc7c08, drv=0xffffffff81a95100 <virtio_net_driver>)
    at drivers/base/dd.c:129
#6  0xffffffff8131d73e in driver_probe_device (drv=0xffffffff81a95100 <virtio_net_driver>,
    dev=0xffff88001cdc7c08) at drivers/base/dd.c:212
#7  0xffffffff8131d84b in __driver_attach (dev=0xffff88001cdc7c08, data=0xffffffff81a95100 <virtio_net_driver>)
    at drivers/base/dd.c:286

Device drivers are also responsible for registering their interrupt handlers, which is done with request_irq. Passing the IRQF_SHARED flag (called SA_SHIRQ in older kernels) allows one IRQ number to be shared by multiple interrupt handlers.

int request_threaded_irq(unsigned int irq, irq_handler_t handler,
       irq_handler_t thread_fn, unsigned long irqflags,
       const char *devname, void *dev_id);

static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
      const char *name, void *dev)
{
  return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}

struct irqaction {
  irq_handler_t handler;
  unsigned long flags;
  void *dev_id;
  struct irqaction *next;
  int irq;
  irq_handler_t thread_fn;
  struct task_struct *thread;
  unsigned long thread_flags;
  unsigned long thread_mask;
  const char *name;
  struct proc_dir_entry *dir;
} ____cacheline_internodealigned_in_smp;

struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
  [0 ... NR_IRQS-1] = {
    .handle_irq = handle_bad_irq,
    .depth    = 1,
    .lock   = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
  }
};
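
The role of the next pointer and dev_id in struct irqaction becomes clearer with a model of a shared interrupt line. The following is a user-space sketch under my own assumptions: the toy_* names are hypothetical, errors are reduced to -1, and real interrupt handling involves locking and hardware state that is left out.

```c
#include <assert.h>
#include <stddef.h>

enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };
typedef enum toy_irqreturn (*toy_handler_t)(int irq, void *dev_id);

#define TOY_IRQF_SHARED 0x80   /* stands in for IRQF_SHARED */
#define TOY_NR_IRQS     16

struct toy_irqaction {
    toy_handler_t handler;
    unsigned long flags;
    void *dev_id;                /* tells handlers on one line apart */
    struct toy_irqaction *next;  /* next handler sharing the line */
};

static struct toy_irqaction *toy_irq_table[TOY_NR_IRQS];

/* Imitates request_irq(): append the action to the line's list. A busy
 * line is refused unless both the old and new handler allow sharing. */
int toy_request_irq(unsigned irq, struct toy_irqaction *action)
{
    struct toy_irqaction **p;

    if (irq >= TOY_NR_IRQS || !action->handler)
        return -1;
    p = &toy_irq_table[irq];
    if (*p && !((*p)->flags & action->flags & TOY_IRQF_SHARED))
        return -1;
    while (*p)
        p = &(*p)->next;
    action->next = NULL;
    *p = action;
    return 0;
}

/* Every handler on the line is called; each one checks its own device.
 * Returns how many handlers claimed the interrupt. */
int toy_handle_irq(unsigned irq)
{
    struct toy_irqaction *a;
    int handled = 0;

    if (irq >= TOY_NR_IRQS)
        return 0;
    for (a = toy_irq_table[irq]; a; a = a->next)
        if (a->handler((int)irq, a->dev_id) == TOY_IRQ_HANDLED)
            handled++;
    return handled;
}

/* Hypothetical NIC whose "status register" is just a flag. */
struct toy_nic { int irq_pending; int serviced; };

enum toy_irqreturn toy_nic_isr(int irq, void *dev_id)
{
    struct toy_nic *nic = dev_id;

    (void)irq;
    if (!nic->irq_pending)
        return TOY_IRQ_NONE;     /* not our device, let others look */
    nic->irq_pending = 0;
    nic->serviced++;
    return TOY_IRQ_HANDLED;
}
```

Since the kernel cannot tell which device on a shared line raised the interrupt, each handler has to check its own hardware through dev_id and return quickly when the interrupt is not its own.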

When a device is connected, how is its driver chosen? Interestingly, this mechanism makes a round trip through user space: kernel, user, kernel. The kernel calls request_module, which uses call_usermodehelper to run modprobe in user space, and modprobe then loads the module back into the kernel. For example, when modprobe eth0 is executed, the device driver 3c59x is loaded based on the line alias eth0 3c59x written in /etc/modprobe.conf.
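
The user-space step, turning a name like eth0 into a module name, can be illustrated with a tiny alias lookup. This is only a sketch of the idea and not how modprobe is implemented: toy_resolve_alias is a made-up function that scans configuration text for lines of the form "alias <name> <module>" and ignores everything else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Look `name` up in modprobe.conf-style text. On success, copy the module
 * name into `module` and return 0; return -1 if no alias line matches. */
int toy_resolve_alias(const char *conf, const char *name,
                      char *module, size_t len)
{
    const char *line = conf;

    while (line && *line) {
        char alias[64], target[64];

        /* Only lines that start with "alias" and have two fields count. */
        if (sscanf(line, "alias %63s %63s", alias, target) == 2 &&
            strcmp(alias, name) == 0) {
            snprintf(module, len, "%s", target);
            return 0;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

Given the text "alias eth0 3c59x", a request for eth0 resolves to 3c59x, which is the module that would then be inserted into the kernel.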

How does a NIC detect link status? When the hardware detects a change in carrier or signal, it either notifies the driver or updates a configuration register. The device driver then notices the change and calls linkwatch_fire_event to queue an event. The event is processed later by linkwatch_event, which runs in the keventd_wq kernel thread and is responsible for updating the state in struct net_device and sending notifications.
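
The split between "fire" and "event" is a deferred-work pattern: the driver only records that something changed, and the heavier processing happens later in another context. Here is a user-space sketch of that pattern. All toy_* names, the fixed-size queue and the notification counter are my own simplifications, and a plain function call stands in for the worker thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_EVENTS 8

struct toy_netdev {
    const char *name;
    int carrier_ok;         /* what the hardware currently reports */
    int oper_up;            /* state published to the rest of the stack */
    int linkwatch_pending;  /* set while the device sits in the queue */
    int notifications;      /* stands in for notifications sent */
};

struct toy_netdev *toy_queue[TOY_MAX_EVENTS];
size_t toy_queued;

/* Imitates linkwatch_fire_event(): cheap, only queues the device once. */
void toy_linkwatch_fire_event(struct toy_netdev *dev)
{
    if (dev->linkwatch_pending || toy_queued == TOY_MAX_EVENTS)
        return;
    dev->linkwatch_pending = 1;
    toy_queue[toy_queued++] = dev;
}

/* What a driver would call when it sees the carrier disappear. */
void toy_carrier_off(struct toy_netdev *dev)
{
    dev->carrier_ok = 0;
    toy_linkwatch_fire_event(dev);
}

/* Imitates linkwatch_event(): runs later, updates state and notifies. */
void toy_linkwatch_event(void)
{
    size_t i;

    for (i = 0; i < toy_queued; i++) {
        struct toy_netdev *dev = toy_queue[i];

        dev->linkwatch_pending = 0;
        if (dev->oper_up != dev->carrier_ok) {
            dev->oper_up = dev->carrier_ok;
            dev->notifications++;
        }
    }
    toy_queued = 0;
}
```

The pending flag means that a link flapping several times before the worker runs still produces a single queued event per device.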