This article also has a Chinese version.
This series of articles mainly records my attempt to implement a Hypervisor using Rust. The table of contents:
- Mini VMM in Rust - Basic
- Mini VMM in Rust - Mode Switch
- Mini VMM in Rust - Run Real Linux Kernel
- Mini VMM in Rust - Implement Virtio Devices
This article is the fourth in the series. It will cover implementing a Virtio Queue and virtio-net from scratch and using TAP as the backend for a virtio-net device. To better assemble these components, additional components such as a Bus and EventLoop will also be added.
During this experiment, I also contributed some PRs to firecracker and cloud-hypervisor; see the list at the end of the article.
There is a lot of code in this article; use the directory navigation on the right when needed.
The previous three articles were completed in the second half of 2022, while this chapter and the corresponding experimental code have been in draft form until recently (now it is 2024). Over a few weekends, I added some code and completed this article.
The next article may support PCI devices and direct I/O for VF devices if I have the time (but don’t expect it, haha).
Introduction to Virtio
For a reference to Virtio, see the official documentation version 1.1 (version 1.0 is also usable): https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html.
In full virtualization mode, I/O operations attempted by the guest trigger VM_EXIT, interrupting the VM’s execution context and incurring significant context-switching overhead.
Virtio is a standard for virtualizing devices. When both host and guest adhere to this standard, I/O requests and responses can be handled via shared memory, greatly reducing context switches and enhancing I/O performance. Since it requires cooperation from the guest, it is known as paravirtualization.
In addition to optimizing performance through shared memory, other high-performance I/O strategies exist, such as hardware-supported virtualization (SR-IOV), which allows a single physical device to be virtualized into multiple Virtual Functions. This architecture permits individual Functions to be dedicated to VMs (this requires IOMMU support for address translation).
Virtio implementation is typically supported by kernel drivers in the Guest and handled by VMM developers in the Host. Sometimes, for better performance, the Host implementations could be kernel-based (vhost—saving on context switches between user and kernel space and utilizing hardware instructions that are not accessible in user space) or a standalone process (vhost-user—segregating unrelated logic from the VMM code and refining process permissions). It can even be hardware-supported (vDPA).
Both Windows and Linux have virtio drivers (e.g., virtio-net, virtio-blk in Linux), enabling optimized I/O performance when the VMM supports virtio while running within VMs.
A Virtio device consists of five components:
- Device Status Field: represents the status of the device;
- Feature Bits: used for feature negotiation;
- Configuration Space: holds parameters that change infrequently or are specific to initialization (e.g., the number of pages for a Balloon device), or other device-specific information;
- Virtqueues: facilitate data transfer (a device can have zero or more virtqueues);
- Notifications: divided into two main categories, ring notifications (about available and used buffers) and config space notifications.
How does bidirectional notification work? A Guest notifying the Host is commonly referred to as a “kick”. Typically, this uses KVM’s ioeventfd capability: when the Guest writes a specific value to a designated MMIO/PIO address, it triggers an EPT misconfiguration exit, which KVM handles and converts into an eventfd write to notify the Host (this could also be routed to user space as a plain VM_EXIT, but the performance is generally worse). As previously mentioned, the Host notifies the Guest by injecting interrupts, which can be driven through irqfd.
Data Flow
Handshake
Virtio can operate over MMIO or PCI (there’s also a channel I/O version for the S390 platform). The PCI interface supports device hot-plugging and conveniently maps physical devices directly into VMs, but it adds substantial complexity to the implementation (a comparison on LWN notes that in qemu, virtio-MMIO is implemented in a single file with 421 lines of code, while virtio-PCI spans 24 files with 8952 lines, more than 20 times as much). Complex implementations increase the attack surface and can potentially slow down the driver.
As we aim to implement a fully functional yet basic VMM with no substantial physical device mapping requirements, we will use MMIO.
The device needs to be discovered by the Guest and successfully handshake to be usable. Device initialization follows this order:
The driver MUST follow this sequence to initialize a device:
- Reset the device.
- Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
- Set the DRIVER status bit: the guest OS knows how to drive the device.
- Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it.
- Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step.
- Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable.
- Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.
- Set the DRIVER_OK status bit. At this point the device is “live”.
If any of these steps go irrecoverably wrong, the driver SHOULD set the FAILED status bit to indicate that it has given up on the device (it can reset the device later to restart if desired). The driver MUST NOT continue initialization in that case.
The driver MUST NOT notify the device before setting DRIVER_OK.
Device initialization and state changes are initiated by the Guest/Driver side. Therefore, in the VMM/Device context, it is only necessary to handle the corresponding events.
Under the PCI protocol, its mechanisms can be used to achieve device discovery. For MMIO, there are no corresponding mechanisms, but devices can be made discoverable to Linux by inserting a device description into the boot cmdline.
For example, inserting virtio_mmio.device=4K@0xd0000000:5
indicates that there is a virtio device based on MMIO with the start address of the MMIO region at 0xd0000000, a length of 4K, and an irq of 5.
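As a sketch, that cmdline fragment can be generated from the device’s MMIO range and IRQ (the helper name here is mine, not from the article’s codebase):

```rust
// Build the kernel cmdline fragment that tells Linux to probe a
// virtio-mmio device; `virtio_mmio_cmdline` is an illustrative helper.
fn virtio_mmio_cmdline(size_kb: u64, base: u64, irq: u32) -> String {
    // Format: virtio_mmio.device=<size>K@<base>:<irq>
    format!("virtio_mmio.device={}K@{:#x}:{}", size_kb, base, irq)
}

fn main() {
    // Matches the example above: 4K region at 0xd0000000, irq 5.
    println!("{}", virtio_mmio_cmdline(4, 0xd000_0000, 5));
}
```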
The Guest drives the device by reading and writing specific offsets within the MMIO address range. For instance, Offset = 0 is read-only and must return 0x74726976
(the little-endian ASCII representation of “virt”); Offset = 0x10 is read-only, where the Device returns its supported features; Offset = 0x20 is write-only, where the driver writes the features it accepts (0x14 and 0x24 are the corresponding feature-selector registers).
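A minimal sketch of how such register accesses might be dispatched by offset (offsets follow section 4.2.2 of the spec; the struct and the handled subset are illustrative, a real transport covers the full layout and a wrapped device):

```rust
// Dispatch MMIO reads/writes by register offset.
const MMIO_MAGIC_VALUE: u32 = 0x7472_6976; // "virt" in little-endian ASCII

struct MmioRegs {
    device_features: u32, // features the device offers (low 32 bits)
    driver_features: u32, // features the driver accepted
}

impl MmioRegs {
    fn read(&self, offset: u64) -> u32 {
        match offset {
            0x00 => MMIO_MAGIC_VALUE,     // MagicValue (read-only)
            0x04 => 2,                    // Version: 2 = modern virtio-mmio
            0x10 => self.device_features, // DeviceFeatures (read-only)
            _ => 0,                       // registers unhandled in this sketch
        }
    }

    fn write(&mut self, offset: u64, value: u32) {
        match offset {
            0x20 => self.driver_features = value, // DriverFeatures (write-only)
            _ => {}
        }
    }
}

fn main() {
    let mut regs = MmioRegs { device_features: 0b1, driver_features: 0 };
    assert_eq!(regs.read(0x00), 0x7472_6976);
    regs.write(0x20, 0b1);
    println!("negotiated: {:#b}", regs.driver_features);
}
```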
Once the Guest discovers the device, it can choose an appropriate moment to perform the initialization logic. The order of state changes has already been clearly stated in the earlier documentation and needs no further elaboration.
The device needs to support these read and write operations, but as the processing of these operations is quite similar between devices, it’s possible to abstract a Wrapper (or Adapter, if you like) structure, which could be named MMIOTransport. This structure wraps the specific device implementation and stores some necessary information for the handshake process (such as queue_select), converting MMIO read/write operations into operations on its own fields and the underlying device. Specific device implementations also need to be somewhat abstracted, exposing a general interface for MMIOTransport to invoke.
Communication
During the handshake:
- All reads and writes are processed by the vCPU emulation thread after triggering a VM_EXIT through writing to MMIO.
After the handshake is complete, when the Guest wants to send data:
- [Guest] Takes a free Descriptor Chain (descriptors are recycled through the Used Ring) and writes data into the buffer it points to.
- [Guest] The Descriptor index (if it’s a chain of multiple Descriptors, only the index of the first one is needed) is then written into the Available Ring, and the index on the Available Ring is updated.
- [Guest] The Guest needs to notify the Host to consume the data, which is triggered either through PCI or by writing to MMIO, detected by KVM.
- [Host Kernel] Upon detection by KVM, the corresponding IoEventFd pre-registered on the VMM side is notified.
- [Host User] Upon receiving the notification, the VMM consumes the Available Ring, retrieves the Descriptor Table Index, processes the data in the buffer (for example, forwarding it to a network card), and afterwards places the Descriptor Table Index back into the Used Ring.
- [Host User] After writing into the Used Ring, it is necessary to notify the Guest to process the data through IrqFd.
When the Host wants to send data to the Guest:
- [Host User] Retrieves a Descriptor Chain from the Available Ring and writes data into the buffer it points to.
- [Host User] The Descriptor index (if it’s a chain of multiple Descriptors, only the index of the first one) together with the total length of the data written is then placed into the Used Ring, and the index on the Used Ring is updated.
- [Host User] Needs to notify the Guest to consume the data, this is done via IrqFd.
- [Host Kernel] After the VMM writes to the eventfd, KVM injects an interrupt associated with it into the virtual machine.
- [Guest] Upon receiving the interrupt, the Guest driver consumes the Used Ring, retrieves the Descriptor Table Index and the length of the data, and reads the desired data.
- [Guest] After reading, it needs to place the Descriptor Table Index back into the Available Ring.
This detailed explanation delineates the sequence of actions required for bidirectional communication between the Guest and Host in a system utilizing MMIO for virtual device interactions.
Implementation
To implement a virtio device, you need to develop several components:
- Virtio Queue: used for unidirectional communication (for devices requiring bidirectional communication, two queues can be utilized).
- MMIO/PCI Transport: manages feature negotiation, status retrieval and changes, configuration reading and writing, etc.
- Virtio Device: the specific implementation of the virtio device, such as virtio-net, virtio-blk, etc.
Virtio Queue
The Virtio Queue is the essence of virtio, defining the usage of shared memory. Virtio currently has three versions (1.0, 1.1, and 1.2, with earlier versions known as legacy). Versions 1.1 and later even differentiate between Split and Packed types, but for simplicity, we will only discuss the Split version.
A virtqueue consists of three parts:
```rust
// ref: https://wiki.osdev.org/Virtio
// (layout diagram elided)
```
- Descriptor Table is responsible for storing multiple Descriptors; a corresponding Descriptor can be accessed via its index. Each Descriptor contains a buffer pointer (GPA) and length, as well as the index of the next Descriptor in the chain.
- Available Ring stores multiple Descriptor Indexes and a Ring Index (indicating the next position for writing). The Guest writes to the Available Ring, while the Host only reads from it.
- Used Ring stores multiple Descriptor Indexes along with the total length of data written for each Descriptor, and a Ring Index (indicating the next position for writing). The Guest only reads from the Used Ring, while the Host only writes to it.
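The fixed part of these structures can be written down in C layout; the descriptor below follows the split-virtqueue definitions in the spec (the variable-length ring arrays are omitted):

```rust
use std::mem::{align_of, size_of};

// Split-virtqueue descriptor, exactly as laid out in guest memory.
#[repr(C)]
#[derive(Clone, Copy)]
struct VirtqDesc {
    addr: u64,  // buffer guest-physical address
    len: u32,   // buffer length in bytes
    flags: u16, // VIRTQ_DESC_F_NEXT / _WRITE / _INDIRECT
    next: u16,  // index of the next descriptor when F_NEXT is set
}

fn main() {
    // Each descriptor is 16 bytes, hence "16 * (Queue Size)" for the table.
    assert_eq!(size_of::<VirtqDesc>(), 16);
    println!("desc: {} bytes, align {}", size_of::<VirtqDesc>(), align_of::<VirtqDesc>());
}
```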
Descriptor Chain
How to describe a Descriptor Chain? A Descriptor stores the Next Descriptor index, so by additionally storing the Descriptor Table address and memory address mapping, this structure can autonomously retrieve the next Descriptor.
The memory address mapping is stored as a generic type M, which can be either an owned type or a reference type. There are generally three ways we use M: one is for temporary use, where passing &M
and constraining M: GuestMemory
suffices (auto deref will occur); another is for consuming M to construct a structure that holds M and needs to operate on M, similar to Queue::pop<M>
, where passing M
and constraining M: Deref<Target = impl GuestMemory>
is required (manual deref); the last is for constructing a structure that holds M without consuming it, such as DescriptorChain::<M>::next
, which compared to the previous usage, requires an additional constraint M: Clone
. It is important to pay attention to the type of M being used; if M corresponds to an ownership type, then the cost of cloning might be significant, and it is necessary to consider whether this meets expectations. If it is expected that M should definitely be a reference type or another type that can be copied cheaply, the previous Clone constraint can be changed to a Copy constraint.
Additionally, storing the queue size to check if the index is valid and storing the current index for retrieval is necessary.
```rust
pub struct DescriptorChain<M> {
    // ...
}
```
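Since the real struct is generic over guest memory, here is a toy version where a slice stands in for the Descriptor Table, just to show the chain-walking logic (the names and the iterator shape are mine, not the article’s exact API):

```rust
const VIRTQ_DESC_F_NEXT: u16 = 1;

#[derive(Clone, Copy)]
struct Desc {
    len: u32,
    flags: u16,
    next: u16,
}

// Walks a chain inside an in-memory descriptor table; real code would
// read each Desc from guest memory through a GuestMemory-like type M.
struct Chain<'a> {
    table: &'a [Desc],
    index: Option<u16>,
}

impl<'a> Iterator for Chain<'a> {
    type Item = Desc;
    fn next(&mut self) -> Option<Desc> {
        let i = self.index? as usize;
        let d = *self.table.get(i)?; // an out-of-range index ends the chain
        self.index = (d.flags & VIRTQ_DESC_F_NEXT != 0).then(|| d.next);
        Some(d)
    }
}

fn main() {
    let table = [
        Desc { len: 10, flags: VIRTQ_DESC_F_NEXT, next: 2 }, // head, links to 2
        Desc { len: 99, flags: 0, next: 0 },                 // unrelated entry
        Desc { len: 20, flags: 0, next: 0 },                 // end of chain
    ];
    let lens: Vec<u32> = Chain { table: &table, index: Some(0) }.map(|d| d.len).collect();
    assert_eq!(lens, vec![10, 20]);
}
```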
This allows us to leverage the DescriptorChain structure to solve the problem of loading the next Descriptor. Subsequently, we need to implement push/pop methods for the Queue to read and write the DescriptorChain.
Queue Definition
First, we need to define the Queue struct. What information should the Queue store?
- The memory of the Queue is managed by the Guest, so the Device side implementation only needs to record the memory addresses of these three items;
- To store the per-Queue information set during the handshake, it is also necessary to include corresponding fields in the Queue, including size and ready;
- Record the current read and write indices (next_avail, next_used), as well as the current batch write count (num_added);
- Store information obtained through initial values or feature negotiation during the handshake, including max_size and noti_suppres (noti_suppres indicates the
VIRTIO_F_EVENT_IDX
feature).
```rust
pub struct Queue {
    // ...
}
```
Interrupt Suppression
It is necessary to introduce the VIRTIO_F_EVENT_IDX feature.
In implementations that do not support this feature, after writing messages (which could be multiple), it is necessary to notify the counterpart to consume them, as the counterpart cannot detect local memory writes. All communications based on shared memory require either busy-waiting or some form of notification mechanism. However, notifications are costly, and to minimize the number of notifications while avoiding delays, this feature was designed.
Once this feature is negotiated, the driver only needs to notify the device when it writes the Available Ring entry at the index specified by UsedRing.AvailEvent; and the device only needs to notify the driver when it writes the Used Ring entry at the index specified by AvailableRing.UsedEvent (these fields correspond to avail_event and used_event in the C definitions).
With this feature enabled, special attention must be paid to memory visibility issues. Inserting appropriate memory barriers can ensure consistency.
Moreover, once VIRTIO_F_EVENT_IDX is enabled, it implies that all events must be consumed at once (or consumed subsequently through an internally triggered callback). This is because there will be no further notifications from the counterpart, similar to handling epoll ET—either consume all at once or record readiness and trigger fd consumption by an internally generated signal.
Queue Pop
The implementation of pop
includes two tasks:
- Detecting when there are messages available for consumption.
- Retrieving messages from the Queue.
Task 2 is quite straightforward; it simply involves reading the data corresponding to the next_avail
index in the Descriptor Table. Since it must ensure that an element is present, let’s name this function pop_unchecked
.
```rust
impl Queue {
    // ...
}
```
So, how do we determine if there are messages available for consumption? It’s simple: just check if the AvailableRing
index is greater than next_avail
, right? We implement a function to read the AvailableRing
index and based on this, we implement the len
function:
```rust
impl Queue {
    // ...
}
```
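One detail worth spelling out: the ring index is a free-running u16 counter, so “greater than” must really be wrapping subtraction. A sketch of the elided len:

```rust
// Number of pending entries between the driver's avail index and the
// device's consume cursor; both are free-running u16 counters that wrap.
fn queue_len(avail_idx: u16, next_avail: u16) -> u16 {
    avail_idx.wrapping_sub(next_avail)
}

fn main() {
    assert_eq!(queue_len(5, 3), 2);
    // Correct across u16 wrap-around: 0xFFFF -> 0x0002 is 3 entries.
    assert_eq!(queue_len(2, 0xFFFF), 3);
}
```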
When .len() > 0
, calling pop_unchecked
suffices for cases where VIRTIO_F_EVENT_IDX is not negotiated. However, when this feature is enabled, the following steps are necessary:
- Attempt to get the length; if len > 0, read and return the data directly.
- Write to UsedRing.AvailEvent, so that the Guest knows to notify the Device when it publishes new data.
- Attempt to read the length again.
Why is there a need to try again? As mentioned earlier, when VIRTIO_F_EVENT_IDX is enabled, all messages must be consumed in one go. If the producer decides there is no need to trigger an event while the consumer stops partway, messages can be left unprocessed, causing a hang.
We define the reading of len
by the Host Device as Event 1, and the update of UsedRing.AvailEvent
by the Host Device as Event 2; the update of len
by the Guest Driver as Event A, and the reading of UsedRing.AvailEvent
by the Guest Driver as Event B. Thus, Events 1 and 2 are sequentially guaranteed, as are Events A and B. This leads to several possible sequences:
- A12B / A1B2 / AB12: len is updated before it is read, ensuring no messages are missed;
- 1A2B / 12AB: the Device reads len first and finds no messages, assuming processing is complete, but the Driver successfully reads the latest AvailEvent. Although consumption stops, it will be correctly resumed by the next notification, so no messages are missed;
- 1AB2: the Device thinks there are no messages to consume; the Guest then generates a new message and reads AvailEvent (finding it unchanged); finally, the Device updates AvailEvent and exits the processing flow. The Device neither consumes all messages nor sets AvailEvent in time, leading to missed messages.
To solve this problem, we attempt to read messages again after updating the AvailEvent
. This ensures continued consumption in case 3.
To access the latest len
, we first add an Acquire fence during Pop to ensure visibility of the memory writes made before the writer’s Release fence; after writing to the AvailEvent
, to make it visible post-reader’s Acquire fence, a Release fence is necessary. At the same time, we still need to access the current latest len
, so we use an AcqRel fence.
```rust
impl Queue {
    // ...
}
```
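Since the full implementation is elided here, a toy sketch of the control flow described above (ring reads are simulated by plain fields so it runs stand-alone; only the fences discussed are kept):

```rust
use std::sync::atomic::{fence, Ordering};

// Toy queue: avail_idx simulates AvailableRing.idx written by the driver,
// avail_event simulates UsedRing.AvailEvent read by the driver.
struct Queue {
    avail_idx: u16,
    next_avail: u16,
    avail_event: u16,
}

impl Queue {
    fn len(&self) -> u16 {
        self.avail_idx.wrapping_sub(self.next_avail)
    }

    // Real code returns a DescriptorChain read from guest memory.
    fn pop_unchecked(&mut self) -> u16 {
        let head = self.next_avail;
        self.next_avail = self.next_avail.wrapping_add(1);
        head
    }

    fn pop(&mut self) -> Option<u16> {
        fence(Ordering::Acquire); // see the driver's writes up to its Release
        if self.len() > 0 {
            return Some(self.pop_unchecked());
        }
        // Ask the driver to notify us at the next index, then re-check so a
        // message racing in before avail_event became visible is not lost.
        self.avail_event = self.next_avail;
        fence(Ordering::AcqRel);
        (self.len() > 0).then(|| self.pop_unchecked())
    }
}

fn main() {
    let mut q = Queue { avail_idx: 1, next_avail: 0, avail_event: 0 };
    assert_eq!(q.pop(), Some(0));
    assert_eq!(q.pop(), None); // empty: avail_event re-armed to next_avail
    assert_eq!(q.avail_event, 1);
}
```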
Queue Push
On the Device side, Queue Push only requires writing the message into the Used Ring. Since the Descriptor Index is always retrieved from the Available Ring, and the sizes of the Available Ring, Used Ring, and Descriptor Table are consistent, there is guaranteed to be space available for writing.
```rust
impl Queue {
    // ...
}
```
This implementation is not difficult; it simply involves calculating the address and writing data in the C memory layout directly to it. It is important to ensure that the data is fully written before updating the Ring Index; therefore, a Release fence should be used before updating the Ring Index.
After the data is written and the Ring Index is updated, it is necessary to notify the peer. The status of the VIRTIO_F_EVENT_IDX feature must also be considered:
- When VIRTIO_F_EVENT_IDX is not enabled, it is mandatory to notify the peer after batch writing is complete.
- There is an exception: do not notify the peer when the Available Ring’s flags field is set to 1 (VIRTQ_AVAIL_F_NO_INTERRUPT; the field can only be 0 or 1).
- When VIRTIO_F_EVENT_IDX is enabled, notification is only necessary after writing data to a specific Index.
Therefore, a prepare_notify
function can be implemented, and the device is responsible for notification when this function returns true.
```rust
impl Queue {
    // ...
}
```
To ensure the visibility of the previous memory writes and the Ring Index update, as well as the subsequent visibility of reading used_event
, an AcqRel fence is used here. The expression that decides whether a notification is needed is quite clever; it is borrowed from the kernel driver implementation and correctly handles wrap-around at low cost.
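That expression, adapted from vring_need_event in the kernel’s virtio_ring.h, checks whether the peer’s event index falls among the entries published since the last notification; wrapping arithmetic makes it correct across u16 roll-over for free:

```rust
use std::num::Wrapping;

// Notify iff event_idx is among the entries published in (old_idx, new_idx].
// All three values are free-running u16 ring indices.
fn vring_need_event(event_idx: u16, new_idx: u16, old_idx: u16) -> bool {
    (Wrapping(new_idx) - Wrapping(event_idx) - Wrapping(1))
        < (Wrapping(new_idx) - Wrapping(old_idx))
}

fn main() {
    // We advanced 5 -> 6 and the peer asked to be woken at 5: notify.
    assert!(vring_need_event(5, 6, 5));
    // The peer's event index (4) was already passed earlier: skip.
    assert!(!vring_need_event(4, 6, 5));
    // Still correct when the index wraps: 0xFFFF -> 0 crosses event 0xFFFF.
    assert!(vring_need_event(0xFFFF, 0, 0xFFFF));
}
```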
Queue Validate
Since both reading and writing to the Queue are performance-critical paths, and the sizes and addresses of the two Rings and the Descriptor Table are fixed after a successful handshake, it is beneficial to perform validity checks on these parameters in advance to ensure their memory layout and length are correct for subsequent operations.
Based on the documentation, the memory layout for the three components should be validated as follows:
- Descriptor Table should be 16-byte aligned, with a size of
16 * (Queue Size)
. - Available Ring should be 2-byte aligned, with a size of
6 + 2 * (Queue Size)
. - Used Ring should be 4-byte aligned, with a size of
6 + 8 * (Queue Size)
.
Additionally, since MMIO writes directly manipulate the size, the validity of the size written needs to be checked:
- The size must be a power of 2.
- The maximum size is 32768.
Finally, the device status must be verified as ready by reading and writing through the offset 0x044
.
These checks ensure that the Queue’s memory structures are correct.
```rust
impl Queue {
    // ...
}
```
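The elided checks might look like this (field names are mine; a full version would also verify that each region fits inside guest memory):

```rust
// Validate the queue geometry that is fixed at handshake time.
#[derive(Clone, Copy)]
struct QueueLayout {
    size: u16,
    max_size: u16,
    desc_table: u64, // guest-physical addresses
    avail_ring: u64,
    used_ring: u64,
}

impl QueueLayout {
    fn is_valid(&self) -> bool {
        self.size.is_power_of_two()       // also rejects size == 0
            && self.size <= self.max_size // max_size itself is capped at 32768
            && self.desc_table % 16 == 0  // table is 16 * size bytes
            && self.avail_ring % 2 == 0   // ring is 6 + 2 * size bytes
            && self.used_ring % 4 == 0    // ring is 6 + 8 * size bytes
    }
}

fn main() {
    let ok = QueueLayout {
        size: 256,
        max_size: 256,
        desc_table: 0x1000,
        avail_ring: 0x2000,
        used_ring: 0x3000,
    };
    assert!(ok.is_valid());
    assert!(!QueueLayout { size: 3, ..ok }.is_valid());            // not a power of two
    assert!(!QueueLayout { desc_table: 0x1008, ..ok }.is_valid()); // misaligned table
}
```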
MMIO Transport
All virtio devices require feature negotiation, state management, and config space operations, and besides MMIO, they may also need to support PCI (with roughly similar operational semantics). A good design is to abstract the access to virtio devices into a trait and write an Adapter (i.e., Transport) to connect it to MMIO or PCI access.
The interface design for virtio devices depends on the transport access. MMIOTransport needs to implement responses to specific MMIO offset reads and writes according to the document 4.2.2 MMIO Device Register Layout.
Here, I directly copied the interface from Firecracker (with very minor changes):
```rust
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// ...
```
MMIOTransport needs to hold a specific device in order to access it via the VirtioDevice interface; additionally, it needs to record some necessary states during feature negotiation and handshake. The implementation here also needs to follow the document mentioned earlier.
Here, I still copied the code from Firecracker (also with very minor changes):
Virtio-net Impl
Virtio-net includes two virtio queues for bidirectional communication.
How should we handle the packets read and written by virtio-net? Virtio-net targets layer 2 packets, and one implementation is to create a TAP device (which also operates at layer 2), and then connect the virtio-net’s tx and rx to this TAP device.
How can we use the TAP device to allow the VM to access networks outside the host?
- One method is to assign an address and subnet to the TAP, and set the VM’s IP as another address in that subnet; then use iptables to perform NAT, with forwarding by the host kernel.
- Another method is to bridge the TAP with an outgoing network card (such as eth0) and configure the same segment IP inside the VM.
Define the Net structure (including the TAP device, two queues, their respective ioeventfds, device-bound irqfds, features, and status):
```rust
const NET_NUM_QUEUES: usize = 2;
// ...
```
After the handshake, MMIOTransport calls activate
to notify that the device is ready. The handling of this event involves registering callbacks for processing tx, rx, and tap to the main thread’s epoll. This way, when an event occurs (tx/rx ready, or tap read/write ready), it can trigger the Net device to perform the relay action.
Implementing relay is not difficult; the simplest method is to maintain a buffer, reading from one side and writing to the other. However, since the Descriptor Chain is essentially equivalent to [iovec]
, we can define a conversion trait to transform the Descriptor Chain into [iovec]
at low cost. This allows us to directly use readv/writev
and avoid additional copying overhead (here, the design involves passing in &mut Vec
, expecting the caller to manage it to avoid frequent allocation overhead of Vec
):
```rust
pub trait IntoIoVec {
    // ...
}
```
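As a sketch of the idea, with a byte slice standing in for guest memory (a real IntoIoVec would yield libc::iovec entries pointing into the guest mapping, for use with readv/writev):

```rust
// A descriptor stand-in: (offset into "guest memory", length).
struct Desc {
    offset: usize,
    len: usize,
}

// Borrow each buffer in the chain without copying -- the "[iovec]" view.
fn as_iovecs<'a>(mem: &'a [u8], chain: &[Desc]) -> Vec<&'a [u8]> {
    chain.iter().map(|d| &mem[d.offset..d.offset + d.len]).collect()
}

fn main() {
    let mem = *b"helloworld";
    let chain = [Desc { offset: 0, len: 5 }, Desc { offset: 5, len: 5 }];
    let iovecs = as_iovecs(&mem, &chain);
    // writev(tap_fd, iovecs) would send this frame without an extra copy;
    // here we just gather it to show what the kernel would see.
    assert_eq!(iovecs.concat(), b"helloworld".to_vec());
}
```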
Another issue is state management. The tx and rx queues are effectively edge-triggered (regardless of whether they are registered as ET or LT on epoll: as long as VIRTIO_F_EVENT_IDX is negotiated, they behave as edge-triggered, and a normal implementation must drain them completely to receive the next notification). We therefore need to decide whether the tap fd should be edge-triggered or level-triggered. Typically, edge-triggered plus recorded state is more efficient. One relevant detail: a TUN/TAP device is always writable (according to the documentation, the kernel silently drops packets when the buffer is full, which can be confirmed by observing the SKB_DROP_REASON_FULL_RING
metric through ebpf).
Here, tx, rx, and tap all use edge-triggered, and copying can only occur when both sides are ready, so there are three implementation methods:
- Do not record any state: decide by the return values of read/write. This incurs wasted work; for example, when tap is not readable, an rx-ready event still triggers an attempt to read from tap, and when rx is empty, tap readability still triggers an attempt to pop the rx queue.
- Record the state of one side (the side with the higher operational cost): reading and writing the virtio queue is cheap, since checking the readability/writability of the tx and rx queues amounts to popping the Available Ring, negligible next to the syscall cost of TAP. We therefore record the readable state of the TAP device, which removes part of the waste of the first method, namely popping and then un-popping the queue when TAP is not readable, plus some structural conversion overhead (recording the TAP state matters most when VIRTIO_F_EVENT_IDX is not negotiated).
- Record the state of both sides (tx and rx plus TAP read/write, four states in total): the most efficient option. It additionally removes the issues left in method 2:
  - the check when the tx avail_ring has no data;
  - the check when the rx avail_ring has no data.
After analyzing the three implementation methods, considering that the determination of no data on avail_ring is actually quite cheap (the cost involved is memory barriers and numerical comparison overhead), we implement according to method 2.
Event triggering:
- RX Ready: Determine if TAP is readable and copy data from TAP to RX in a loop.
- TX Ready: Copy data from TX to TAP in a loop.
- TAP Read Ready: Copy data from TAP to RX in a loop.
Here, the sending side loop copying is implemented as process_tx
, and the receiving side is implemented as process_rx
. Later, these three events need to be mapped to their respective logic (for example, events triggered by EventFd need to read the EventFd first, and events triggered by TAP readable need to update the readable state) and these two processing functions. This part of the code will be supplemented after the EventLoop implementation, and here only the two common process functions are implemented.
In actual production environments like firecracker and cloud-hypervisor, rate limiters are also used. After integrating rate limiters, both implementations are relatively complex and have their own issues. Firecracker has a problem with copying data when reading, as it first reads into its own buffer before copying to the desc chain; cloud-hypervisor has frequent epoll_ctl syscalls due to using level-triggered, and also has a lot of ineffective code for handling TAP TX, because TUN/TAP device’s write will never return WOULD_BLOCK (when the kernel ring buffer is full, it will silently drop).
I will try to submit some PRs to address these issues (see the PR list at the end of this article).
```rust
impl Net {
    // ...
}
```
TAP Device Impl
TAP devices are used to provide layer 2 networking. Creating a TAP network interface in Linux is straightforward; simply open /dev/net/tun
and configure it appropriately using ioctl.
Unlike conventional usage, since we directly relay packets with virtio net headers, it is necessary to remove and add the relevant packet headers when entering and exiting the TAP device. This process can be handled in user space or directly by the kernel: add the IFF_VNET_HDR
flag when configuring and set the header length via ioctl.
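A sketch of the flag setup (values from linux/if_tun.h; the actual open("/dev/net/tun") and TUNSETIFF/TUNSETVNETHDRSZ ioctls are omitted here since they need a real kernel device):

```rust
// Flags placed in ifreq.ifr_flags for the TUNSETIFF ioctl.
const IFF_TAP: u16 = 0x0002;      // layer-2 device (vs IFF_TUN at layer 3)
const IFF_NO_PI: u16 = 0x1000;    // no extra packet-information prefix
const IFF_VNET_HDR: u16 = 0x4000; // kernel adds/strips the virtio_net_hdr

fn tap_flags() -> u16 {
    IFF_TAP | IFF_NO_PI | IFF_VNET_HDR
}

fn main() {
    println!("ifr_flags = {:#06x}", tap_flags());
}
```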
However, it is important to note that according to the virtio-net protocol specification, corresponding data must be filled in during the forwarding process. For example, the device must set the num_buffers
field in the virtio_net_hdr (however, after reviewing the corresponding code in QEMU and Firecracker, I found that they do not populate this field when VIRTIO_NET_F_MRG_RXBUF is not enabled, whereas the specification requires it to be set to 1; so I also looked at the Kernel’s driver implementation, which does not attempt to read this field when VIRTIO_NET_F_MRG_RXBUF is not negotiated. Although not populating it does not comply with the standard, it works for the current virtio-net driver implementation).
```rust
pub struct Tap {
    // ...
}
```
Segment Offload
TAP devices can also be made more efficient:
Adding the IFF_MULTI_QUEUE feature supports multiple queues; handling different queues on different threads maximizes multi-core CPU utilization and peak throughput. Whether to support this can be decided based on the target scenario: for example, instances aiming for high-density deployment with low per-machine I/O requirements do not need it. There are no plans to support this feature in this series of articles.
Enabling advanced features on the TAP device, such as TSO, USO, and UFO. In this case, the corresponding bits need to be advertised in the device-supported features, and the matching TAP features activated after the bits are successfully negotiated. A blog post by Cloudflare provides some reference data on the performance of TSO/USO.
To ensure compatibility, referencing the behavior of QEMU, we need to detect the availability of these features under the current kernel when creating the TAP, and thus construct the device feature.
We need to detect the TAP features and their corresponding virtio features:
- TUN_F_CSUM - VIRTIO_NET_F_GUEST_CSUM
- TUN_F_TSO4 - VIRTIO_NET_F_GUEST_TSO4
- TUN_F_TSO6 - VIRTIO_NET_F_GUEST_TSO6
- TUN_F_TSO_ECN - VIRTIO_NET_F_GUEST_ECN
- TUN_F_UFO - VIRTIO_NET_F_GUEST_UFO
- TUN_F_USO4 - VIRTIO_NET_F_GUEST_USO4
- TUN_F_USO6 - VIRTIO_NET_F_GUEST_USO6
Note: Definitions can be found in version 1.3 and later of the virtio documentation, or in the kernel header.
In the VIRTIO documentation, there are two types of features: one type such as VIRTIO_NET_F_CSUM
, and another such as VIRTIO_NET_F_GUEST_CSUM
.
The former, without “GUEST”, indicates features for data received by the host. For example, VIRTIO_NET_F_CSUM
indicates that the host can accept packets with partial checksums. If the host declares support for this feature, it must complete the checksum when necessary; if the feature is not negotiated, the guest must ensure that the data it provides already carries a correct checksum. Therefore, we can enable the TUN_F_CSUM
flag on the corresponding TAP device, which indicates that the TAP user can accept unchecksummed packets (corresponding to the comment in if_tun.h: “You can hand me unchecksummed packets.”).
The latter, with “GUEST,” indicates the features for guest data reception. For example, VIRTIO_NET_F_GUEST_CSUM
indicates that the guest/driver can accept data with partial checksums, and it will perform checksum verification internally when necessary.
To implement support for segmentation offload, two things need to be done:
- Guest → TAP direction: the guest sends longer segments (meaning that checksum calculation is also offloaded to the host), which requires negotiating `VIRTIO_NET_F_CSUM` and `VIRTIO_NET_F_HOST_{TSO4,TSO6,ECN,UFO,USO}` (device can receive); the TAP needs to support sending these kinds of packets, which requires enabling `TUN_F_CSUM` and related flags.
- TAP → Guest direction: the guest receives longer segments (meaning that checksums of received packets are computed by the guest driver), which requires negotiating `VIRTIO_NET_F_GUEST_CSUM` and `VIRTIO_NET_F_GUEST_{TSO4,TSO6,ECN,UFO,USO4,USO6}` (driver can receive); the TAP needs to support reading these kinds of packets, which requires enabling `TUN_F_TSO4` and related flags.
Final process (using TSO4 as an example):
- Probe the flags supported by the TAP (`TUN_F_TSO4`);
- Convert the flags to virtio features (`VIRTIO_NET_F_GUEST_TSO4 | VIRTIO_NET_F_HOST_TSO4`): because the TAP supports TSO4, we can both send and receive TSO4 data;
- Convert the negotiated features back to TAP flags (`VIRTIO_NET_F_GUEST_TSO4` → `TUN_F_TSO4`) and set them: enabling `TUN_F_TSO4` means we may receive data in this format, and since we pass it directly to the guest, the guest must support it, so `VIRTIO_NET_F_GUEST_TSO4` must have been negotiated before this flag is set on the TAP.
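The probe-and-convert flow above boils down to plain bit manipulation. Below is an illustrative sketch, not the article's actual code: the constant values mirror `<linux/if_tun.h>` and the virtio spec to the best of my knowledge (double-check against your kernel headers), the function names are my own, and only the CSUM/TSO4/TSO6 subset is shown.

```rust
// Flag/feature values assumed from <linux/if_tun.h> and the virtio spec.
const TUN_F_CSUM: u32 = 0x01;
const TUN_F_TSO4: u32 = 0x02;
const TUN_F_TSO6: u32 = 0x04;

const VIRTIO_NET_F_CSUM: u64 = 0;
const VIRTIO_NET_F_GUEST_CSUM: u64 = 1;
const VIRTIO_NET_F_GUEST_TSO4: u64 = 7;
const VIRTIO_NET_F_GUEST_TSO6: u64 = 8;
const VIRTIO_NET_F_HOST_TSO4: u64 = 11;
const VIRTIO_NET_F_HOST_TSO6: u64 = 12;

fn bit(b: u64) -> u64 {
    1 << b
}

/// Probed TAP flags -> advertised device features: if the TAP supports a
/// flag, we can both send and receive that kind of packet, so both the
/// HOST_* and GUEST_* bits are advertised.
fn tap_flags_to_features(flags: u32) -> u64 {
    let mut features = 0;
    if flags & TUN_F_CSUM != 0 {
        features |= bit(VIRTIO_NET_F_CSUM) | bit(VIRTIO_NET_F_GUEST_CSUM);
    }
    if flags & TUN_F_TSO4 != 0 {
        features |= bit(VIRTIO_NET_F_HOST_TSO4) | bit(VIRTIO_NET_F_GUEST_TSO4);
    }
    if flags & TUN_F_TSO6 != 0 {
        features |= bit(VIRTIO_NET_F_HOST_TSO6) | bit(VIRTIO_NET_F_GUEST_TSO6);
    }
    features
}

/// Negotiated features -> TAP offload flags to set (via TUNSETOFFLOAD).
/// Only the GUEST_* bits matter here: enabling a TUN_F_* flag means the
/// TAP may hand us such packets, which we forward to the guest unmodified.
fn features_to_tap_flags(features: u64) -> u32 {
    let mut flags = 0;
    if features & bit(VIRTIO_NET_F_GUEST_CSUM) != 0 {
        flags |= TUN_F_CSUM;
    }
    if features & bit(VIRTIO_NET_F_GUEST_TSO4) != 0 {
        flags |= TUN_F_TSO4;
    }
    if features & bit(VIRTIO_NET_F_GUEST_TSO6) != 0 {
        flags |= TUN_F_TSO6;
    }
    flags
}
```

Note the asymmetry: probing widens one TAP flag into two virtio bits, while the reverse conversion consults only the GUEST_* bits, which is exactly the TSO4 example above.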
Peripheral Component Implementation
Bus
Since we may need to mount multiple devices, a component is required to:
- Register devices to address segments.
- Distribute read and write requests based on address segments.
Considering the current existence of PIO and MMIO address spaces and corresponding devices, we can define such a Bus:
```rust
pub struct Bus<A, D> {
    // …
}
```
The reason for using `Arc<Mutex<D>>` is that a device needs to be shared between the Bus and event callbacks; a `Vec` is used inside the Bus because, kept sorted by address, it allows the device corresponding to an address to be found quickly with a binary search.
Implementation of insertion and lookup:
```rust
impl<A, D> Bus<A, D> {
    // …
}
```
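To make the sorted-Vec idea concrete, here is a minimal, self-contained sketch of such a bus; the field and method names are assumptions, not the article's exact code:

```rust
use std::sync::{Arc, Mutex};

// Devices are kept in a Vec sorted by base address, so both insertion
// (overlap check) and lookup can use binary search.
pub struct Bus<A, D> {
    entries: Vec<(A, A, Arc<Mutex<D>>)>, // (base, len, device), sorted by base
}

impl<A, D> Bus<A, D>
where
    A: Copy + Ord + std::ops::Add<Output = A> + std::ops::Sub<Output = A>,
{
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Insert a device covering [base, base + len); fails on overlap.
    pub fn insert(&mut self, base: A, len: A, dev: Arc<Mutex<D>>) -> Result<(), ()> {
        let idx = match self.entries.binary_search_by_key(&base, |e| e.0) {
            Ok(_) => return Err(()), // a device with the same base exists
            Err(idx) => idx,
        };
        // Reject ranges overlapping the previous or next neighbor.
        if idx > 0 {
            let (pb, pl, _) = &self.entries[idx - 1];
            if *pb + *pl > base {
                return Err(());
            }
        }
        if idx < self.entries.len() && base + len > self.entries[idx].0 {
            return Err(());
        }
        self.entries.insert(idx, (base, len, dev));
        Ok(())
    }

    /// Find the device containing `addr`, returning the offset into it.
    pub fn find(&self, addr: A) -> Option<(A, &Arc<Mutex<D>>)> {
        let idx = match self.entries.binary_search_by_key(&addr, |e| e.0) {
            Ok(idx) => idx,
            Err(0) => return None,
            Err(idx) => idx - 1,
        };
        let (base, len, dev) = &self.entries[idx];
        if addr < *base + *len {
            Some((addr - *base, dev))
        } else {
            None
        }
    }
}
```

Returning the offset from `find` fits the point made next: the device itself only understands offsets, not absolute addresses.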
When we need to distribute read and write requests, the device does not care about its absolute address, or rather, it cannot understand the meaning of the absolute address. What it cares about is the offset of the address, so we need to calculate and constrain the offset:
```rust
pub trait DeviceIO {
    // …
}
```
Thus, we can provide `PIOBus` and `MMIOBus` aliases based on the `Bus`:
```rust
pub type PIOBus = Bus<u16, PIODevice>;
// …
```
We can implement a more convenient insertion method for these two Buses:
```rust
impl PIOBus {
    // …
}
```
EventLoop
Previously, our event processing logic was expressed in this form:
```rust
// only pseudocode here
// …
```
The flaw of this form is that all the logic is mixed together and handlers cannot be registered or removed dynamically, which makes the code unmaintainable once there are many events.
We have now added new devices and need to handle multiple events, so it is necessary to implement a more user-friendly EventLoop to solve this problem (the Rust asynchronous runtime I wrote about before is also a kind of EventLoop in some sense, which you can refer to in the Rust Runtime Design and Implementation Series).
The core of implementing an EventLoop lies in how to describe a Callback; we will temporarily define it as `T`. Based on this, we can implement event registration and define the infrastructure. For the implementation, we use a Slab to store each `T` and use the slab id as the epoll user_data.
```rust
use slab::Slab;
// …
```
The EventLoop also needs to provide a function that waits for and processes events, which involves the event handling logic; at this point, constraints on `T` become necessary.
Since we can obtain `&mut T`, the most intuitive constraint we can apply is:
```rust
pub trait EventHandlerMut {
    // …
}
```
When we try to use any combination of Arc/Rc and Mutex/RefCell as `T`, a problem arises: through an `Arc` we can only get a read-only reference to the inner value, which prevents us from calling its `handle_event_mut` method!
There are two ways to solve this problem. One is to manually expand the combinations and implement the trait for `Arc<Mutex<T>>` and `Arc<RefCell<T>>`, forwarding to T's implementation. The other is to introduce a read-only version of the trait:
```rust
pub trait EventHandler {
    // …
}
```
For `Arc<T>`: when T implements `EventHandler`, `Arc<T>` can implement `EventHandlerMut` on top of it. For `Mutex<T>`: when T implements `EventHandlerMut`, `Mutex<T>` can implement `EventHandler`. The effect is that if `T: EventHandlerMut`, then `Arc<Mutex<T>>: EventHandlerMut`, achieving our goal.
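The trait pair and the bridging impls described above can be sketched as follows; the signatures are assumptions (the real code likely passes richer event data), and `Counter` is a hypothetical handler added only to demonstrate the chain:

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};

pub trait EventHandlerMut {
    fn handle_event_mut(&mut self, events: u32);
}

pub trait EventHandler {
    fn handle_event(&self, events: u32);
}

// Arc<T> only hands out &T, so it can implement EventHandlerMut as long
// as T implements the read-only EventHandler.
impl<T: EventHandler> EventHandlerMut for Arc<T> {
    fn handle_event_mut(&mut self, events: u32) {
        self.handle_event(events);
    }
}

// Mutex<T> restores interior mutability: &Mutex<T> yields &mut T via
// lock(), so it can implement EventHandler when T is EventHandlerMut.
impl<T: EventHandlerMut> EventHandler for Mutex<T> {
    fn handle_event(&self, events: u32) {
        self.lock().unwrap().handle_event_mut(events);
    }
}

// Same idea for RefCell<T> in single-threaded contexts.
impl<T: EventHandlerMut> EventHandler for RefCell<T> {
    fn handle_event(&self, events: u32) {
        self.borrow_mut().handle_event_mut(events);
    }
}

// Demonstration handler: Counter: EventHandlerMut, so by composing the
// two impls above, Arc<Mutex<Counter>>: EventHandlerMut as well.
pub struct Counter(pub u32);

impl EventHandlerMut for Counter {
    fn handle_event_mut(&mut self, _events: u32) {
        self.0 += 1;
    }
}
```

The nice property of this scheme is that no manual expansion of combinations is needed: any new wrapper only has to bridge one trait to the other.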
We can then write its `tick` method:
```rust
impl<T: EventHandlerMut> EventLoop<T> {
    // …
}
```
Additionally, regarding the previously mentioned need for dynamic registration after device activation, there are some challenges in supporting this:
- Device activation is executed within the event loop's handler, and the handler holds `&mut callbacks`.
- At the same time, this handler expects to register new event sources and callbacks with the EventLoop, which requires writing to `callbacks`.
At this point, we can introduce a pre-registration to solve this problem (alternatively, we could make the handler a shared ownership structure, cloning it each time to process and thereby releasing the reference to callbacks, but this would introduce some performance overhead):
- pre_register: only adds the callback to `callbacks` and obtains a slab id; this requires holding `&mut EventLoop`.
- associate: associates the slab id with epoll (i.e., performs `epoll_ctl` with `EPOLL_CTL_ADD`).
The first stage is done at startup, while the second stage can be executed at any time.
```rust
pub struct EventLoop<T = Arc<dyn EventHandler>> {
    // …
}
```
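The two-phase registration can be modeled with just the slab bookkeeping; in the sketch below all names are my own assumptions, the slab is hand-rolled as a `Vec<Option<T>>`, and the epoll side of `associate` is stubbed out as a comment (a real implementation would store an epoll fd and issue `epoll_ctl` there):

```rust
// Simplified model of two-phase registration. Index == slab id ==
// epoll user_data in the real event loop.
pub struct EventLoop<T> {
    callbacks: Vec<Option<T>>,
}

impl<T> EventLoop<T> {
    pub fn new() -> Self {
        Self { callbacks: Vec::new() }
    }

    /// Phase 1: store the callback and obtain a stable id. This needs
    /// &mut EventLoop, so it is done at startup, outside any handler.
    pub fn pre_register(&mut self, cb: T) -> usize {
        if let Some(idx) = self.callbacks.iter().position(|c| c.is_none()) {
            self.callbacks[idx] = Some(cb);
            idx
        } else {
            self.callbacks.push(Some(cb));
            self.callbacks.len() - 1
        }
    }

    /// Phase 2: attach an fd to the id. With a real epoll this would call
    /// epoll_ctl(epfd, EPOLL_CTL_ADD, fd, event_with_user_data(id)); it
    /// needs no &mut, so a running handler (e.g. device activation) can
    /// call it at any time.
    pub fn associate(&self, id: usize, _fd: i32) {
        assert!(self.callbacks.get(id).map_or(false, |c| c.is_some()));
        // epoll_ctl(EPOLL_CTL_ADD) goes here in the real implementation.
    }
}
```

The key point the model shows is the split in mutability: only phase 1 touches `callbacks` mutably, so phase 2 never conflicts with a handler already holding `&mut callbacks`.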
An EventLoop corresponds to one thread, which can handle all IO, thus achieving lower resource usage and higher deployment density. However, the peak IO performance is not very high because only one or a few threads perform IO operations. Alternatively, each device could have its own thread running an EventLoop (even enabling the device’s MultiQueue feature and starting independent threads for each Queue), which would allow for higher peak performance but at a greater resource cost.
In my code, I used the first approach, simulating all IO on the main thread.
Assembly
We have now implemented all the necessary components, and it’s time to assemble them in the main function.
Here’s a summary of the components we have implemented:
- VM configuration, Kernel, and initrd loading: This part was completed before this section.
- Serial implementation: This was also completed before this section and needs to be mounted on the PIOBus.
- EventLoop: Used for elegantly managing events and executing corresponding Callbacks.
- Bus: Includes MMIOBus and PIOBus, used for dispatching events after VM_EXIT.
- TAP component and Virtio Queue implementation: Used internally by the Net component.
- Net component: Holds the TAP and is responsible for transferring data between the virtio queue and TAP.
- MMIOTransport: Wraps the Net component and mounts it to the MMIOBus, exposing the device operation interface in MMIO form.
In the main function, we can initialize in the following order:
- Initialize logging to provide some observability.
- Create KVM fd, VM fd, and create irq_chip, pit, and initialize memory.
- Create vCPU.
- Load initrd and kernel.
- Initialize registers and page tables and complete mode switching.
- Write boot cmdline and configure Linux boot parameters.
- Create PIOBus and insert Serial device.
- Create MMIOBus and insert MMIO devices.
- Create the EventLoop and register three fds: stdin, virtio-net-activate, and exit_evt.
- Start a new thread to run the vCPU, and notify exit_evt when it exits.
- The main thread runs the EventLoop, stopping the loop and exiting after receiving the notification from exit_evt.
```rust
fn main() {
    // …
}
```
Running
After starting the Guest, configure the Host TAP and verify that the Guest can access the Host normally:
Enable kernel forwarding on the Host and configure NAT rules; after setting up the default route in the Guest, the Guest can connect to the external network:
After a simple configuration of `/etc/resolv.conf`, you can verify that everything works by downloading a large file:
The complete code can be found here: https://github.com/ihciah/mini-vmm
Optimizing Open-Source
In this experiment, I referenced some implementations from Firecracker, Cloud-Hypervisor, and QEMU. I adopted their excellent designs in some aspects and proposed what I believe to be superior new implementations in others. For the points worth improving, I submitted PRs (to the first two projects). Here is the PR list:
- Remove redundant Descriptor Chain checks for Firecracker: virtio: skip redundant memory check
- Optimize iovec buffer allocation for Net devices in Cloud-Hypervisor: virtio-devices: net: reduce vec allocations for iovec conversion
- General iovec buffer allocation optimization for Firecracker (not submitted due to a conflicting implementation merged later): improve: use persistent buffer as iovec container
- Major refactor of virtio-net for Firecracker: refactor(virtio-net): avoid copy on rx
- Changed from initially reading into a buffer and then copying to the desc chain to directly reading into the desc chain, which is expected to improve performance.
- Added Readiness management to avoid frequent ready checks on the RX Queue when there is no data in the TAP, and to enhance code readability.
- Remove TAP RX Readiness management and switch to Edge Trigger to avoid repetitive epoll_ctl at runtime for Cloud-Hypervisor: working
- Correct TAP offload flag error for Firecracker: fix(tap): use correct virtio feature for CSUM offload
- Correct Queue operation errors for Cloud-Hypervisor: fix(virtq): only enable_notification when about to stop consumption
- The correct approach is to enable_notification after consuming to empty, then check again. The current implementation enables notification every time a desc chain is popped.
- More TODO(this list will be updated)