Introduction
Continuing gokvm development [1][2][3][4].
Through recent development, I was able to provide a virtual NIC to VMs on gokvm via virtio-net. Networking support was one of my initial goals, so this feels like a real accomplishment. With it, VMs on gokvm can now communicate with the host (or with the outside world via a software switch), which opens up possibilities such as serving a web server or logging in over SSH; I think it is a major change. As usual, I'll review the work by walking through the important commits.
c5217550 Add Virt Queue Data Structure
What is a Virt Queue in the first place? A Virt Queue is a ring-structured queue used for exchanging data between guest and host. For example, a basic virtio-net device that uses one queue each for transmit and receive needs one Virt Queue per direction, two in total. Naturally, supporting multi-queue or a control queue requires more Virt Queues.
One Virt Queue consists of a Descriptor Table, an Avail Ring, and a Used Ring. A descriptor summarizes the address and length of one piece of data to be handled; arranging descriptors in a table gives the Descriptor Table. The Avail Ring and Used Ring are similar to each other: both communicate descriptor IDs between guest and host, but each in a fixed direction, the Avail Ring from guest to host and the Used Ring from host to guest. Incidentally, the virtio specification [5] calls the guest the driver and the host the device.
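To make this concrete, here is a minimal sketch in Go of that layout. The field meanings follow the virtio split-virtqueue design, but the type names and the queueSize value are my own illustrative assumptions, not gokvm's actual definitions.

const queueSize = 32 // an assumption; see the queue-size discussion below

type Desc struct {
    Addr  uint64 // guest physical address of the buffer
    Len   uint32 // length of the buffer in bytes
    Flags uint16 // e.g. NEXT (chained) or WRITE (device-writable)
    Next  uint16 // index of the next descriptor when chained
}

type AvailRing struct {
    Flags uint16
    Idx   uint16            // incremented by the guest after adding entries
    Ring  [queueSize]uint16 // descriptor IDs offered to the host
}

type UsedElem struct {
    ID  uint32 // head of the descriptor chain that was consumed
    Len uint32 // total number of bytes the host wrote
}

type UsedRing struct {
    Flags uint16
    Idx   uint16 // incremented by the host after consuming entries
    Ring  [queueSize]UsedElem
}

type virtQueue struct {
    Desc  [queueSize]Desc
    Avail AvailRing
    Used  UsedRing
}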
dc12c9a5 Load Virt Queue
Now, where does a Virt Queue live, and how should the host reference and write to it? A Virt Queue actually resides in a region of the guest's physical address space, and the guest's device driver (I still need to confirm this) is responsible for allocating that region. While the guest kernel probes the Virtio device, it writes to QUEUE_PFN in the Virtio Header [6] to convey the guest physical address of the Virt Queue to the host [4]. To initialize multiple Virt Queues, the same procedure is repeated while changing QUEUE_SEL.
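As a rough sketch, the host side of this handshake could look like the following. The register offsets come from the legacy virtio-PCI header layout as I understand it, but the device type and handler are hypothetical, not gokvm's actual implementation.

const (
    pageShift      = 12   // QUEUE_PFN holds a 4 KiB page frame number
    queuePFNOffset = 0x08 // legacy virtio header: queue address (PFN)
    queueSelOffset = 0x0E // legacy virtio header: queue select
)

type device struct {
    mem           []byte    // guest physical memory
    selectedQueue uint32    // set via QUEUE_SEL
    queueGPA      [2]uint64 // guest physical address of each Virt Queue
    lastAvailIdx  uint16    // last Avail Ring index processed (in reality, one per Virt Queue)
}

// ioOut models a port write landing in the virtio header.
func (d *device) ioOut(offset uint16, value uint32) {
    switch offset {
    case queueSelOffset:
        d.selectedQueue = value
    case queuePFNOffset:
        // Convert the page frame number into a guest physical address.
        d.queueGPA[d.selectedQueue] = uint64(value) << pageShift
    }
}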
When the Virtio device's probe completes successfully, the guest starts emitting packets; detecting one via Queue Notify and looking at the Descriptor Table in the Virt Queue, I could confirm that packet data was actually stored there:
Queue Notify was written!
{{Addr:0xf092568, Len:0xa, Flags:0x1, Next:0x1}, ... }
71d91aeb Generate Tap Interface
There seem to be several ways for the VMM to pass packets read from the guest via the Virt Queue on to the host (or a software switch), but here I simply used a Tap interface [7]. When a program creates a Tap interface, it can read and write packets through the read(2) and write(2) system calls. The host kernel also registers the Tap interface with its network subsystem, so from the host's perspective packets can be sent and received as with a normal NIC. This time, I ported the following sample code written in C.
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int tun_alloc(char *dev)
{
    struct ifreq ifr;
    int fd, err;

    if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
        return -1; /* the original sample falls back to tun_alloc_old(dev) here */

    memset(&ifr, 0, sizeof(ifr));

    /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
     *        IFF_TAP   - TAP device
     *        IFF_NO_PI - Do not provide packet information
     * For a Tap interface carrying Ethernet frames, as virtio-net needs,
     * use IFF_TAP | IFF_NO_PI instead of IFF_TUN. */
    ifr.ifr_flags = IFF_TUN;
    if (*dev)
        strncpy(ifr.ifr_name, dev, IFNAMSIZ);

    if ((err = ioctl(fd, TUNSETIFF, (void *)&ifr)) < 0) {
        close(fd);
        return err;
    }
    strcpy(dev, ifr.ifr_name);
    return fd;
}
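For reference, here is one way the same thing might look in Go using golang.org/x/sys/unix; this is my illustrative port, not necessarily how gokvm does it.

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

func createTap(name string) (*os.File, error) {
    fd, err := unix.Open("/dev/net/tun", unix.O_RDWR, 0)
    if err != nil {
        return nil, err
    }
    ifr, err := unix.NewIfreq(name)
    if err != nil {
        unix.Close(fd)
        return nil, err
    }
    // IFF_TAP: deal in Ethernet frames; IFF_NO_PI: no packet-info header.
    ifr.SetUint16(unix.IFF_TAP | unix.IFF_NO_PI)
    if err := unix.IoctlIfreq(fd, unix.TUNSETIFF, ifr); err != nil {
        unix.Close(fd)
        return nil, err
    }
    return os.NewFile(uintptr(fd), ifr.Name()), nil
}

func main() {
    tap, err := createTap("tap0")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer tap.Close()
    fmt.Println("created", tap.Name())
}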
14ae4091 Receive Packets Sent by Guest on Host
When the guest sends a packet to the host, it first adds an entry for the packet to the Descriptor Table. It then inserts that descriptor's ID into the Avail Ring, and notifies the host by writing to the QUEUE_NOTIFY field of the Virtio Header. The host checks how far it has processed the Avail Ring, takes out any entries it has not handled yet, and finally extracts the packet data through the descriptors.
I'll omit the details, but when data does not fit in one descriptor, or is not contiguous in the guest physical address space, multiple descriptors can be chained together, linked-list style, and treated as a single unit; see the sketch below.
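Here is a simplified sketch of that host-side loop, reusing the virtQueue types and the device struct from the earlier sketches (plus io.Writer for the Tap side); descFlagNext and the pared-down error handling are assumptions, not gokvm's actual code.

const descFlagNext = 0x1 // this descriptor chains to Desc[Next]

// drainAvail processes new Avail Ring entries and writes each gathered
// frame to the Tap interface.
func (d *device) drainAvail(vq *virtQueue, tap io.Writer) {
    for d.lastAvailIdx != vq.Avail.Idx {
        head := vq.Avail.Ring[d.lastAvailIdx%queueSize]
        var frame []byte
        // Walk the descriptor chain and gather the buffers.
        for idx := head; ; {
            desc := vq.Desc[idx]
            frame = append(frame, d.mem[desc.Addr:desc.Addr+uint64(desc.Len)]...)
            if desc.Flags&descFlagNext == 0 {
                break
            }
            idx = desc.Next
        }
        tap.Write(frame) // hand the frame to the host side
        // Return the chain to the guest through the Used Ring.
        vq.Used.Ring[vq.Used.Idx%queueSize] = UsedElem{ID: uint32(head), Len: uint32(len(frame))}
        vq.Used.Idx++
        d.lastAvailIdx++
    }
}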
What tripped me up here was the queue size, that is, the number of entries in each of the Virt Queue's Descriptor Table, Avail Ring, and Used Ring. Because I had set it too small, I ran into a situation where no more than a certain number of packets could be sent. Nothing showed up in the kernel messages, so it took a while to find the cause. Extracting the relevant snippet from the kernel code [8] gives the following. Since MAX_SKB_FRAGS is 17 (65536/PAGE_SIZE + 1 with 4 KiB pages), one can read off that packet transmission stops whenever the Virt Queue has fewer than 2+MAX_SKB_FRAGS free entries. Indeed, when I increased the queue size from 8 to 32, packets were sent normally.
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ...
    if (sq->vq->num_free < 2+MAX_SKB_FRAGS) { /* transmission stops here */
        netif_stop_subqueue(dev, qnum);
8bbafbd5 Improve Tap Interface
Going forward, I'll want to extract packets from the Tap interface and send them from host to guest, so first I made the Tap interface easier to use. I added Read and Write methods to the Tap device so that it satisfies io.ReadWriter (in the sense of a Golang interface); later, when writing test code, I can benefit from this abstraction. To extract packets from the Tap interface I use the read(2) system call, but its default behavior is blocking IO, so the calling thread (goroutine) would be blocked until a packet exists. I therefore configured the Tap interface's file descriptor with the fcntl(2) system call so that IO can be done non-blocking: if no packet exists, EAGAIN is returned immediately, which is easier to handle. I also configured it to fire a SIGIO signal when a packet is received, as sketched below.
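The fcntl setup could look roughly like this; the flag names come from fcntl(2), while the function itself is an illustrative assumption rather than gokvm's exact code.

import "golang.org/x/sys/unix"

// configureAsync makes reads on the Tap fd non-blocking and asks the
// kernel to raise SIGIO for this process when the fd becomes readable.
func configureAsync(fd int) error {
    flags, err := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
    if err != nil {
        return err
    }
    // O_NONBLOCK: read(2) returns EAGAIN instead of blocking.
    // O_ASYNC: deliver SIGIO when IO becomes possible.
    if _, err := unix.FcntlInt(uintptr(fd), unix.F_SETFL, flags|unix.O_NONBLOCK|unix.O_ASYNC); err != nil {
        return err
    }
    // Direct the SIGIO signal at this process.
    _, err = unix.FcntlInt(uintptr(fd), unix.F_SETOWN, unix.Getpid())
    return err
}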
3bcebf7b Add Tx-dedicated goroutine
Until now, when sending a packet from guest to host, every time QUEUE_NOTIFY in the Virtio Header was written to, I wrote to the Tap interface in that same thread (goroutine). I separated this part out into a Tx-dedicated goroutine running in a different context. Notification between goroutines is implemented with channels in the usual Golang style, like txKick chan interface{}; a sketch follows.
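A minimal sketch of that pattern, assuming the txKick channel from the post; the netDevice type and notifyTX are hypothetical names.

type netDevice struct {
    txKick chan interface{}
}

// The Tx-dedicated goroutine: each kick means "the TX queue may have work".
func (n *netDevice) txLoop() {
    for range n.txKick {
        // drain the TX Virt Queue and write the frames to the Tap interface
    }
}

// Called when the guest writes QUEUE_NOTIFY for the TX queue.
func (n *netDevice) notifyTX() {
    select {
    case n.txKick <- struct{}{}:
    default: // a kick is already pending; the Tx goroutine will catch up
    }
}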
6a9710ea Add Rx-dedicated goroutine
Once the implementation has progressed this far, packet transmission from host to guest can actually be realized straightforwardly. In this direction of data transfer, though, the usage of the Avail Ring and Used Ring becomes a bit special.
The Avail Ring is used to offer empty buffers from guest to host. The guest kernel (driver) prepares these empty buffers, registers their guest physical addresses and lengths in the Descriptor Table, and puts the IDs into the Avail Ring. The host side then takes as many descriptors out of the Avail Ring as it needs, copies the packet data in, and simply appends them to the Used Ring one after another. In this direction the guest-side driver does the heavy lifting, while the host's job is small (just write the data and increment the index), as the sketch below shows.
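Here is a simplified sketch of that host-side receive step, again reusing the types from the earlier sketches; real code would also handle frames spanning multiple buffers and inject an interrupt afterwards.

// deliverToGuest copies one received frame into a guest-provided buffer.
func (d *device) deliverToGuest(vq *virtQueue, frame []byte) {
    if d.lastAvailIdx == vq.Avail.Idx {
        return // the guest has not offered any empty buffers yet
    }
    head := vq.Avail.Ring[d.lastAvailIdx%queueSize]
    desc := vq.Desc[head]
    n := copy(d.mem[desc.Addr:desc.Addr+uint64(desc.Len)], frame)
    // Tell the guest how much we wrote by appending to the Used Ring.
    vq.Used.Ring[vq.Used.Idx%queueSize] = UsedElem{ID: uint32(head), Len: uint32(n)}
    vq.Used.Idx++
    d.lastAvailIdx++
}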
Since SIGIO is fired when a packet lands on the Tap interface from outside, converting the signal into a channel with signal.Notify(res.rxKick, syscall.SIGIO) lets me notify the Rx-dedicated goroutine, so I can avoid polling. A sketch of this wiring follows.
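In outline (imports of os, os/signal, and syscall omitted), the wiring might look like this; the Rx goroutine body is a placeholder. Note that the channel passed to signal.Notify should be buffered, since signal delivery does not block.

rxKick := make(chan os.Signal, 1) // buffered: signal delivery never blocks
signal.Notify(rxKick, syscall.SIGIO)

go func() {
    for range rxKick {
        // Read frames from the Tap fd until read(2) returns EAGAIN,
        // then deliver each frame to the guest via the RX Virt Queue.
    }
}()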
Since communication is now possible in both directions, I verified operation with ping as follows:
on guest:
$ ip addr add 192.168.1.1/24 dev eth0
$ ip link set eth0 up
on host:
$ sudo ip link set tap up
$ sudo ip addr add 192.168.1.2/24 dev tap
$ sudo ping 192.168.1.1 -i 0.1 -c 3
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=17.4 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=19.6 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=20.7 ms
--- 192.168.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 201ms
rtt min/avg/max/mdev = 17.440/19.246/20.702/1.354 ms
e9d54ab9 Add Black Box Test
I added a black box test so that CI can automatically boot a virtual machine with the VMM and verify, over the Tap interface, that it can communicate with the host side via ping.
Conclusion
It took longer than I expected, but I brought it to a working state. With networking now supported, the range of what this custom VMM can do has expanded greatly. Having grasped the basic behavior of virtio, I think it can also be applied to implementations such as virtio-blk. I want to keep making steady progress on the implementation.
