This article is the third in the series, where we’ll do some preparatory work and actually get a real Linux system running.
In the previous chapters, we managed to run arbitrary code in 64-bit mode. The goal of this chapter is to get a real Linux kernel up and running.
Some might wonder: Linux can also be started in real mode, so why go through all this trouble? Under normal circumstances, Linux relies on a bootloader to perform the mode switch and load the kernel, whereas our VMM can handle these steps directly and more efficiently. Once the mode switch is done, we only need to make sure the kernel and initrd are loaded at the right places in guest memory (and covered by the page tables), and then we can start the vCPU directly.
Environment Preparation
Since we only aim to start Linux and not a full-blown Linux with a hard disk, we just need to prepare:
Kernel file: You can compile it yourself, or download a pre-compiled and optimized file from the address provided by firecracker.
initrd image: (This is not actually a disk image, it just carries the old name) You can package it yourself, or use a script to create one.
Here is a simple guide I wrote while trying out Rust for Linux, which includes kernel compilation, manual initrd building, and booting. It can serve as a reference. However, building the Kernel and initrd is not our current focus. To ensure we are not sidetracked by these issues, we’ll just use ready-made ones.
initrd.img: Build according to https://github.com/marcov/firecracker-initrd.git (Note: this is somewhat outdated, it may prompt that the root user’s password is too simple, just modify it manually)
We place both vmlinux.bin and initrd.img files under /tmp/mini-kvm.
IRQ and PIT Creation
PIC and APIC
In addition to the CPU and memory, another important part of a computer is its I/O devices. There are two ways to find out whether a device has data: either the CPU polls, which is costly when done frequently and incurs latency when done too infrequently, or the device notifies the CPU when the data is ready through what is known as an interrupt. At the end of each instruction cycle, the CPU checks whether an interrupt is pending and, if interrupts are enabled (the IF flag is set), jumps to the corresponding interrupt handler.
External devices come in all shapes and sizes, so it’s not feasible for the CPU to have a dedicated pin for each type of device. A piece of hardware is therefore needed to act as a dispatcher. Intel designed the 8259A interrupt controller (used in the IBM PC) with 8 interrupt input lines; it is programmable, allowing dynamic registration of pins and priorities, interrupt masking, and more. To support more peripherals, multiple 8259As are often cascaded to work together. This type of programmable interrupt controller is known as a PIC (Programmable Interrupt Controller).
In the era of multiple CPUs, Intel proposed APIC (Advanced Programmable Interrupt Controller) technology. APIC consists of two parts: the LAPIC (Local APIC), which exists in each CPU (nowadays one per logical core), and the IOAPIC, of which there may be one or more, connecting to external devices. The two are interconnected via the APIC bus. External devices deliver interrupts to the LAPICs via the IOAPIC, and each LAPIC decides whether to handle them.
IRQ Virtualization
KVM provides a virtualized IRQ chip for us; we only need to create it:
vm.create_irq_chip().unwrap();
To trigger a given interrupt, we only need to register an EventFd together with the corresponding IRQ number:
vm.register_irqfd(&evtfd, 0).unwrap();
Clock Signal Virtualization
In computer systems, there are two types of time-related devices: clocks and timers. We can obtain the current time information through a clock, such as the TSC (Time Stamp Counter) device; with a timer, we can trigger interrupts at a specific time or at regular intervals to make the CPU aware of the passage of time while executing userspace code, such as a PIT (Programmable Interval Timer).
The PIT has relatively low precision and is only used during system startup; after startup, the LAPIC Timer is used, which operates within the CPU with higher precision.
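To make the "relatively low precision" concrete, here is a small sketch of the arithmetic behind a periodic PIT tick. The PIT counts down from a 16-bit divisor at a fixed input clock of about 1.193182 MHz, so one count lasts roughly 838 ns. The names `pit_divisor` and `tick_period_ns` are made up for this illustration and are not part of KVM or of this project's code.

```rust
/// Input clock of the PIT in Hz (about 1.193182 MHz).
const PIT_TICK_RATE: u64 = 1_193_182;

/// Divisor to load into the PIT's 16-bit counter to get roughly `hz`
/// interrupts per second, rounded to the nearest integer.
/// Returns None when the rate cannot be expressed as a 16-bit divisor.
fn pit_divisor(hz: u64) -> Option<u16> {
    if hz == 0 {
        return None;
    }
    let divisor = (PIT_TICK_RATE + hz / 2) / hz;
    if divisor == 0 {
        // Requested rate is faster than the PIT input clock allows.
        return None;
    }
    // Rates below about 19 Hz need a divisor that no longer fits in 16 bits.
    u16::try_from(divisor).ok()
}

/// Length of one timer period in nanoseconds for a given divisor.
fn tick_period_ns(divisor: u16) -> u64 {
    u64::from(divisor) * 1_000_000_000 / PIT_TICK_RATE
}
```

For a 100 Hz tick this gives a divisor of 11932, which is slightly longer than 10 ms per period because the divisor has to be an integer.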
To create a virtual PIT device, we just need to use KVM’s capabilities:
Information about the CPU is obtained through the CPUID instruction, and we need to modify the CPUID values seen inside the VM. We tell KVM up front which CPUID information the guest should see, so that when the guest later executes CPUID and causes a VM exit, KVM can handle it by itself without involving the userspace VMM.
You might be curious, isn’t executing the CPUID instruction just a matter of setting EAX/ECX? Where do this function and index come from? Referring here https://elixir.bootlin.com/linux/latest/source/arch/x86/kvm/cpuid.c#L1392 we can see that the function is obtained from *EAX, and the index is obtained from *ECX. So when we cross-reference the previous specifications, we can map function and index to *EAX, *ECX, respectively.
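As a rough mental model of that mapping, the CPUID table can be thought of as a list searched by the pair (function, index). The sketch below is a simplified stand-in, not KVM's actual code: `CpuidEntry` and `lookup` are hypothetical names, and this toy version always compares both fields, whereas KVM only compares the index for leaves that actually have sub-leaves.

```rust
/// Hypothetical, simplified stand-in for one entry of the CPUID table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CpuidEntry {
    /// Value the guest puts in EAX before executing CPUID.
    function: u32,
    /// Value the guest puts in ECX (the sub-leaf).
    index: u32,
    /// Result registers in the order [EAX, EBX, ECX, EDX].
    regs: [u32; 4],
}

/// Find the entry that answers a CPUID call with the given input registers.
fn lookup(table: &[CpuidEntry], eax_in: u32, ecx_in: u32) -> Option<&CpuidEntry> {
    table
        .iter()
        .find(|e| e.function == eax_in && e.index == ecx_in)
}
```

With this picture in mind, "function=4, index=1" simply means "the answer to CPUID with EAX=4 and ECX=1".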
Setting CPUID
This part mainly references firecracker code, where some configurations may be necessary and some may not.
Referring to the wiki mentioned earlier, we can find when EAX=1 (since this is an input, it corresponds to our function=1):
Bit 31 of ECX should be set to 1 to indicate a hypervisor.
Bits 31:24 of EBX should be set to the Local APIC ID, numbered from 0 when there are multiple vCPUs.
Bits 15:8 of EBX should be set to the CLFLUSH line size. On x86 the cache line is usually 64 bytes; according to the wiki, the value we set is multiplied by 8 to get the actual size, so we should set it to 8.
Bit 19 of EDX should be set to 1 to enable CLFLUSH, this setting only takes effect if the CLFLUSH line size is set. TODO: Why haven’t the reference projects set this one?
Bits 23:16 of EBX should be set to the number of logical processors in a single physical package, usually the smallest power of two greater than or equal to the number of vCPUs (although it’s okay not to set it).
Bit 28 of EDX should be set to 1 to enable hyper-threading, which only takes effect if the previous logical processor number is configured. Usually set when vCPU > 1.
Bit 24 of ECX should be set to 1 to enable tsc-deadline.
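The EAX=1 settings listed above boil down to plain bit manipulation. The following is a minimal sketch under stated assumptions: `Leaf1` and `patch_leaf1` are hypothetical names, and in a real VMM these registers would live in the KVM CPUID entry whose function is 1. EAX (family, model, stepping) is left untouched, so it is omitted here.

```rust
/// Hypothetical holder for the output registers of CPUID function 1
/// that we modify.
#[derive(Debug, Default, Clone, Copy)]
struct Leaf1 {
    ebx: u32,
    ecx: u32,
    edx: u32,
}

const ECX_TSC_DEADLINE: u32 = 1 << 24;
const ECX_HYPERVISOR: u32 = 1 << 31;
const EDX_CLFLUSH: u32 = 1 << 19;
const EDX_HTT: u32 = 1 << 28;

/// Apply the function=1 settings for one vCPU.
fn patch_leaf1(regs: &mut Leaf1, vcpu_id: u8, vcpu_count: u8) {
    // Advertise that we run under a hypervisor and support TSC-deadline.
    regs.ecx |= ECX_HYPERVISOR | ECX_TSC_DEADLINE;

    // EBX[31:24]: Local APIC ID, one per vCPU starting from 0.
    regs.ebx = (regs.ebx & !(0xff << 24)) | (u32::from(vcpu_id) << 24);

    // EBX[15:8]: CLFLUSH line size in units of 8 bytes (8 * 8 = 64 bytes).
    regs.ebx = (regs.ebx & !(0xff << 8)) | (8 << 8);
    regs.edx |= EDX_CLFLUSH;

    if vcpu_count > 1 {
        // EBX[23:16]: logical processors per package, rounded up to a
        // power of two and capped at 255 so it fits the 8-bit field.
        let logical = u32::from(vcpu_count).next_power_of_two().min(0xff);
        regs.ebx = (regs.ebx & !(0xff << 16)) | (logical << 16);
        regs.edx |= EDX_HTT;
    }
}
```

Calling `patch_leaf1(&mut regs, 1, 3)` on a zeroed `Leaf1` yields APIC ID 1 and a logical processor count of 4, with the hyper-threading bit set.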
EAX=4 mainly relates to cache and cores, such as how many cores are on one socket:
Omitted for brevity.
EAX=6 Thermal and power management:
Set bit 3 of ECX to 0, which disables Performance-Energy Bias capability.
Set bit 1 of EAX to 0, which disables Intel Turbo Boost Technology capability.
EAX=10 Performance monitoring:
Set everything to 0 to turn it off.
EAX=11 Extended Topology Entry:
Omitted for brevity.
EAX=0x80000002..=0x80000004 CPU model information:
You can make it up yourself.
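For the leaves that are simple to handle, a sketch might look like the following. It assumes each leaf's output is represented as an array `[EAX, EBX, ECX, EDX]`; `Regs`, `mask_leaf`, and `brand_string_regs` are illustrative names only. The brand string is 48 bytes of ASCII spread over leaves 0x80000002..=0x80000004, 16 bytes per leaf, stored little-endian in EAX, EBX, ECX, EDX order.

```rust
/// Output registers of one CPUID leaf in the order [EAX, EBX, ECX, EDX].
type Regs = [u32; 4];

/// Clear the capabilities we do not want the guest to see.
fn mask_leaf(function: u32, regs: &mut Regs) {
    match function {
        0x6 => {
            regs[0] &= !(1 << 1); // EAX bit 1: Intel Turbo Boost
            regs[2] &= !(1 << 3); // ECX bit 3: Performance-Energy Bias
        }
        // Performance monitoring: report nothing at all.
        0xA => *regs = [0; 4],
        _ => {}
    }
}

/// Encode a model name into the three brand-string leaves
/// (0x80000002, 0x80000003, 0x80000004, in that order).
fn brand_string_regs(name: &str) -> [Regs; 3] {
    // 48 bytes in total; keep the last byte as a NUL terminator.
    let mut buf = [0u8; 48];
    let bytes = name.as_bytes();
    let len = bytes.len().min(47);
    buf[..len].copy_from_slice(&bytes[..len]);

    let mut leaves = [[0u32; 4]; 3];
    for (i, chunk) in buf.chunks_exact(4).enumerate() {
        leaves[i / 4][i % 4] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    leaves
}
```

A made-up name such as `"mini-kvm vCPU"` fits in the first leaf, leaving the other two zeroed.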
Simple Handling
In fact, you can just pass the cpuid out directly without handling it:
We have to do the bootloader’s work: load the kernel, the initrd, and the boot parameters into memory, and place some necessary information in memory to pass to the kernel.
This normally involves the bootloader obtaining available memory information through a BIOS interrupt (interrupt number 0x15, AX=0xE820, hence the derived structure name e820 entry). Here, we manually represent the available memory as multiple e820 entries and pass them to the kernel.
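As an illustration of what "multiple e820 entries" can look like, here is a minimal sketch. The layout is an assumption, not necessarily the one this project uses: conventional low memory up to 0x9fc00, a hole for the legacy BIOS/VGA area, and the rest of RAM starting at 1 MiB. `E820Entry` mirrors the three fields of the kernel's e820 entry (address, size, type), and type 1 means usable RAM; the struct and function names themselves are hypothetical.

```rust
/// e820 type for usable RAM.
const E820_RAM: u32 = 1;
/// End of conventional low memory (assumed layout, see lead-in).
const EBDA_START: u64 = 0x9fc00;
/// Start of memory above 1 MiB.
const HIMEM_START: u64 = 0x10_0000;

/// One memory range reported to the kernel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct E820Entry {
    addr: u64,
    size: u64,
    kind: u32,
}

/// Describe `mem_size` bytes of guest RAM as e820 entries.
fn build_e820(mem_size: u64) -> Vec<E820Entry> {
    // Low memory below the legacy hole.
    let mut map = vec![E820Entry {
        addr: 0,
        size: EBDA_START.min(mem_size),
        kind: E820_RAM,
    }];
    // Everything from 1 MiB up to the end of guest RAM.
    if mem_size > HIMEM_START {
        map.push(E820Entry {
            addr: HIMEM_START,
            size: mem_size - HIMEM_START,
            kind: E820_RAM,
        });
    }
    map
}
```

For a 128 MiB guest this produces two entries: `0..0x9fc00` and `0x100000..0x8000000`.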
TODO: Memory layout
Creating Input/Output Devices
There are generally two ways to talk to input/output devices: port I/O (PortIO) and memory-mapped I/O (MMIO). Here, we will focus solely on PortIO communication.
PortIO has a 64K Port address space, with typical addresses including (reference link):
COM1: I/O port 0x3F8, IRQ 4
COM2: I/O port 0x2F8, IRQ 3
COM3: I/O port 0x3E8, IRQ 4
COM4: I/O port 0x2E8, IRQ 3
In Linux, /dev/ttyS{0/1…} corresponds to COM{1/2…}. Therefore, to get Linux console input and output via PortIO, one simply has to handle COM1 (0x3F8, IRQ 4) and specify console=ttyS0 in the boot arguments.
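Each of these UARTs occupies eight consecutive ports starting at its base address, so a port number can be decoded into "which serial port" and "which register". A small sketch, with `decode_com_port` being a name invented for this example:

```rust
/// Base I/O ports of COM1..COM4, i.e. ttyS0..ttyS3.
const COM_BASES: [u16; 4] = [0x3F8, 0x2F8, 0x3E8, 0x2E8];

/// Map an I/O port to (ttyS index, register offset within the UART),
/// or None if the port does not belong to any of the four serial ports.
fn decode_com_port(port: u16) -> Option<(usize, u16)> {
    COM_BASES.iter().enumerate().find_map(|(i, &base)| {
        // Each UART exposes 8 registers: base + 0 ..= base + 7.
        (base..base + 8).contains(&port).then(|| (i, port - base))
    })
}
```

For example, port 0x3FD decodes to ttyS0, register offset 5, which is the line status register.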
Here, we create an EventFd and register it to IRQ 4. When we need to trigger COM1 to perform PortIO IN, we can inject an interrupt into the VM through this EventFd. Subsequently, the driver inside the guest VM will perform PIO IN, which then triggers a VM EXIT.
In practice, we use the vm_superio crate that provides an emulated serial port. We use this EventFd as its trigger and standard output as its output.
To adapt to its interface, we need to create two additional structures: EventWrapper and DummySerialEvent. The main purpose is to implement Trigger and SerialEvents. The part of the code for SerialEvents is not critical; it is only to satisfy its interface constraints. As for Trigger, it merely requires writing to the eventfd.
When encountering VcpuExit::IoIn and VcpuExit::IoOut, we can obtain the corresponding PortIO address and data. At this point, after making the necessary checks, we can hand it over to stdio_serial for processing. For output, stdio_serial writes directly to stdout; for input, we need to handle it ourselves.
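The dispatch itself is small. The sketch below shows only its shape, using stand-ins so that it compiles on its own: `IoExit` is a hypothetical, reduced mirror of the two exit variants, and `ToySerial` is a toy device that only records bytes written to the data register and always reports "transmitter empty". It is not the vm_superio implementation.

```rust
/// Hypothetical, reduced mirror of the two vCPU exits we care about.
enum IoExit<'a> {
    /// Guest executed IN: we must fill `data`.
    In(u16, &'a mut [u8]),
    /// Guest executed OUT: `data` holds what it wrote.
    Out(u16, &'a [u8]),
}

const COM1_BASE: u16 = 0x3F8;
const COM1_END: u16 = COM1_BASE + 7;
/// Offset of the line status register.
const LSR_OFFSET: u16 = 5;
/// LSR bits 5 and 6: transmit holding register and transmitter empty.
const LSR_THR_EMPTY: u8 = 0x60;

/// Toy serial device: collects output, never has input pending.
#[derive(Default)]
struct ToySerial {
    output: Vec<u8>,
}

impl ToySerial {
    fn write(&mut self, offset: u16, value: u8) {
        // Offset 0 is the data register; other registers are ignored here.
        if offset == 0 {
            self.output.push(value);
        }
    }

    fn read(&mut self, offset: u16) -> u8 {
        if offset == LSR_OFFSET { LSR_THR_EMPTY } else { 0 }
    }
}

/// Forward a PortIO exit to the serial device.
/// Returns false when the port is not ours, so the caller can log or ignore it.
fn handle_io(serial: &mut ToySerial, exit: IoExit<'_>) -> bool {
    match exit {
        IoExit::Out(port, data) if (COM1_BASE..=COM1_END).contains(&port) => {
            // Serial registers are one byte wide; skip wider accesses.
            if let [byte] = data {
                serial.write(port - COM1_BASE, *byte);
            }
            true
        }
        IoExit::In(port, data) if (COM1_BASE..=COM1_END).contains(&port) => {
            if let [byte] = data {
                *byte = serial.read(port - COM1_BASE);
            }
            true
        }
        _ => false,
    }
}
```

In the real loop, the same range check and offset calculation sit in front of the calls into `stdio_serial`.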
vCPU Run
As previously mentioned, we need to forward the IoIn and IoOut events to the Serial for processing.
In addition, one thread is consumed by running the vCPU while we also need to interact with the terminal, so two threads are needed. To exit cleanly we need some form of cross-thread communication: once the vCPU stops, the vCPU thread must tell the main thread to exit. Another eventfd is used for this. For convenience, we reuse the earlier EventWrapper structure (this is not necessary; a plain eventfd works too).
// run vcpu in another thread
let exit_evt = EventWrapper::new();
let vcpu_exit_evt = exit_evt.try_clone().unwrap();
let stdio_serial_read = stdio_serial.clone();
Upon the completion of the thread’s execution, we can be notified via exit_evt, which allows our main thread to wait for stdin input while also waiting for the vcpu exit event.
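The notification pattern can be tried out without KVM at all. The sketch below swaps the eventfd for a standard-library channel, purely so that it is self-contained; an eventfd has the advantage that it can be waited on together with stdin, which a channel cannot. `run_until_exit` and `vcpu_loop` are names invented for this example.

```rust
use std::sync::mpsc;
use std::thread;

/// Run `vcpu_loop` on its own thread and block the calling thread
/// until that loop has finished.
fn run_until_exit<F>(vcpu_loop: F)
where
    F: FnOnce() + Send + 'static,
{
    // Stand-in for the exit eventfd: the sender plays the role of
    // vcpu_exit_evt, the receiver the role of exit_evt.
    let (exit_tx, exit_rx) = mpsc::channel::<()>();

    let handle = thread::spawn(move || {
        vcpu_loop();
        // The vCPU has stopped: notify the main thread.
        let _ = exit_tx.send(());
    });

    // The real main loop would wait for stdin and the exit event together;
    // here we only wait for the exit notification.
    exit_rx
        .recv()
        .expect("vcpu thread ended without sending the exit notification");
    handle.join().expect("vcpu thread panicked");
}
```

Passing a closure that runs the vCPU until it hits a shutdown exit gives the same control flow as the eventfd version.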
Stdin Handling
The Serial device requires us to handle the input data ourselves, and while waiting for user-side stdin, we also need to wait for the vcpu exit so that the main thread can exit when the vm stops. As you may have guessed, we can use epoll as the multiplexing mechanism here since KVM is already Linux-only, eliminating the need to consider cross-platform issues.
Here, we use PollContext encapsulated by vmm_sys_util.
For stdin handling, we need to use raw mode, as we need to forward keystrokes such as CTRL+C.
OpenRC 0.44.10 is starting up Linux 4.14.174 (x86_64)

* Mounting /proc ... [ ok ]
* Mounting /run ...
* /run/openrc: creating directory
* /run/lock: creating directory
* /run/lock: correcting owner
* Caching service dependencies ... [ ok ]
* Clock skew detected with `(null)'
* Adjusting mtime of `/run/openrc/deptree' to Fri Sep 23 07:15:15 2022

* WARNING: clock skew detected!
* WARNING: clock skew detected!
* Mounting devtmpfs on /dev ... [ ok ]
* Mounting /dev/mqueue ... [ ok ]
* Mounting /dev/pts ... [ ok ]
* Mounting /dev/shm ... [ ok ]
* Loading modules ...modprobe: can't change directory to '/lib/modules': No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
[ ok ]
* Mounting misc binary format filesystem ... [ ok ]
* Mounting /sys ... [ ok ]
* Mounting security filesystem ... [ ok ]
* Mounting debug filesystem ... [ ok ]
* Mounting SELinux filesystem ... [ ok ]
* Mounting persistent storage (pstore) filesystem ... [ ok ]
* WARNING: clock skew detected!
* Starting fcnet ... [ ok ]
* Checking local filesystems ... [ ok ]
* Remounting filesystems ... [ ok ]
* Mounting local filesystems ... [ ok ]
* Setting hostname ... [ ok ]
* Starting networking ...
* eth0 ...Cannot find device "eth0"
Device "eth0" does not exist.
[ ok ]
* Starting networking ...
* lo ... [ ok ]
* eth0 ... [ ok ]

Welcome to Alpine Linux 3.16
Kernel 4.14.174 on an x86_64 (ttyS0)

[ 2.744128] random: fast init done
localhost login: root
Password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general information about administrating Alpine systems. See <http://wiki.alpinelinux.org/>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

login[1080]: root login on 'ttyS0'
localhost:~# pwd
/root
localhost:~# reboot -f
[ 15.780943] reboot: Restarting system
[ 15.780943] reboot: machine restart
KVM_EXIT_SHUTDOWN
vcpu stopped, main loop exit