Next Stop - Ihcblog!


Mini VMM in Rust 2 - Mode Switch

This article also has a Chinese version.

This series of articles mainly records my attempt to implement a Hypervisor using Rust. The table of contents:

  1. Mini VMM in Rust - Basic
  2. Mini VMM in Rust - Mode Switch
  3. Mini VMM in Rust - Run Real Linux Kernel
  4. Mini VMM in Rust - Implement Virtio Devices

This article is the second in the series; it introduces several common x86 operating modes and how to switch between them.

Chap 2: Real Mode to Protected/Long Mode

2.1. Running Modes

This section briefly introduces several common x86 running modes and their mechanisms.

Protected Mode and Real Mode

The 8086 CPU is a 16-bit processor with segmented addressing: the linear address is computed as (segment << 4) + offset, a trick that extends the addressable space from 64K to 1M. Segment registers such as CS and DS hold the raw segment values.
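
As a quick sanity check, the 8086 address calculation can be sketched in a few lines of Rust (a toy illustration, not part of the VMM code):

```rust
// Real Mode linear address: (segment << 4) + offset.
// Two 16-bit values combine into a 20-bit address, extending
// the reachable space from 64K to 1M.
fn real_mode_linear(segment: u16, offset: u16) -> u32 {
    ((segment as u32) << 4) + offset as u32
}

fn main() {
    // The classic x86 reset vector: CS=0xF000, IP=0xFFF0 -> 0xFFFF0.
    assert_eq!(real_mode_linear(0xF000, 0xFFF0), 0xFFFF0);
    // Different segment:offset pairs can alias the same linear address.
    assert_eq!(
        real_mode_linear(0x1000, 0x0000),
        real_mode_linear(0x0FFF, 0x0010)
    );
    println!("ok");
}
```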

The 80286 introduced Protected Mode; the 80386 began the 32-bit era (and added paging support, more on this later). For compatibility, however, the CPU still starts in Real Mode after reset and must be switched to Protected Mode manually.

Segmentation, GDT, and LDT

Under Protected Mode, segmentation is still mandatory, but the design changed significantly. A segment is now described by a Base, a Limit, and permission bits. Instead of writing a base value into a segment register and computing <<4 plus an offset, segment registers now hold segment selectors, which reference segment descriptors by index.

A segment selector carries an index, a requested privilege level (RPL), and one bit that chooses which descriptor table to use. Since there is an index, a table is needed to hold the descriptors, and the table's address lives in a register. There are in fact two descriptor tables, the GDT (G for Global) and the LDT (L for Local); the aforementioned bit (TI) selects between them.

These are not two fully independent tables: the LDT is subordinate to the GDT, and the GDT can store the descriptor of an LDT.

When the LDT is selected, the CPU uses the index in the LDTR to find that LDT's descriptor in the GDT, obtaining its base and limit, and then looks up the LDT using the index in the segment selector. The GDT case is simpler: the CPU locates the table directly through the GDTR and looks it up using the selector's index.

GDT and LDT

GDT and LDT2

By design, an operating system could store each process's segment descriptors in its own LDT, record every LDT in the GDT, and switch processes by switching the LDTR register. In practice this mechanism is not used (so how is privilege control done? It relies on the paging mechanism). Linux's GDT has only four meaningful segments: user code, user data, kernel code, and kernel data (user and kernel differ only in their permission bits).

Once paging is enabled, segmentation essentially performs only a small part of the permission checks and serves no other purpose. So the bases of those few GDT segments are usually set to 0: whether an address goes through cs or ds, the resulting linear address is the same. This is the so-called flat model.

Long Mode

In the 64-bit era, another mode called Long Mode (named IA-32e by Intel) was introduced. Sounds dizzying, right? Let's look at the diagram:

Different Modes

Different Modes2

Overall there are two top-level modes, Legacy Mode and Long Mode, each with several sub-modes. Legacy is an umbrella term for the older modes, including the two we just discussed. Here we mainly care about Long Mode's Compatibility mode and 64-bit mode.

Real Mode supports segmentation but not paging; from the 80386 on, segmentation remains mandatory while paging is optional; in Long Mode, paging becomes mandatory and the segmentation mechanism becomes largely vestigial.

To fully utilize the hardware (such as added registers, more than 4GB of memory), the program itself must be 64-bit, which requires running in 64-bit mode. If the goal is to run 32-bit programs without modification, then you would use Compatibility mode.

So, how to switch between these modes? After all, the CPU is in Real Mode after booting up; and the common scenario is that we want to run both 64-bit and 32-bit programs on a 64-bit machine.

Mode Switch

Mode Switch

By setting certain bits in specific registers, we can control the CPU to complete the switch.

For detailed parameters, refer to here: https://wiki.osdev.org/CPU_Registers_x86-64#IA32_EFER

For more details, you can check the SDM.
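
As a small sketch of what we will poke at later: LME and LMA live in IA32_EFER at bits 8 and 10 respectively (the same EFER_LME = 0x100 and EFER_LMA = 0x400 values that appear in the code in section 2.4):

```rust
// IA32_EFER bits relevant to Long Mode (per the SDM / osdev wiki):
// bit 8  = LME (Long Mode Enable, set by software),
// bit 10 = LMA (Long Mode Active, reported by the CPU).
const EFER_LME: u64 = 1 << 8; // 0x100
const EFER_LMA: u64 = 1 << 10; // 0x400

fn long_mode_active(efer: u64) -> bool {
    efer & EFER_LMA != 0
}

fn main() {
    let mut efer = 0u64;
    efer |= EFER_LME; // software requests Long Mode
    assert!(!long_mode_active(efer)); // LMA appears only once paging is enabled
    efer |= EFER_LMA; // what the CPU reports after CR0.PG is also set
    assert!(long_mode_active(efer));
    println!("ok");
}
```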

2.2. Entering Protected Mode

Entering Protected Mode requires correctly configuring the GDT and some special registers, and then setting the lowest bit of CR0 to 1.

Configuring Segment Registers

Segment registers correspond to the kvm_segment structure, which includes base, limit, selector, and so on. It is essentially an abstraction over a GDT entry, since KVM also has to serve processors from vendors other than Intel.

const fn seg_with_st(selector_index: u16, type_: u8) -> kvm_segment {
    kvm_segment {
        base: 0,
        limit: 0x000fffff,
        selector: selector_index << 3,
        // 0b1011: Code, Execute/Read, accessed
        // 0b0011: Data, Read/Write, accessed
        type_,
        present: 1,
        dpl: 0,
        db: 1,
        s: 1,
        l: 0,
        g: 1,
        avl: 0,
        unusable: 0,
        padding: 0,
    }
}

We define a const fn to create it quickly.

Our goal is a flat model, so the base must be 0 and the limit must be 0xfffff (with 4K granularity this covers the full 4GB).

Segment Selector

Bits 0-1 of the selector hold the RPL and bit 2 is the TI flag, so the table index lives in bits 3 and up; that is why we shift the passed-in index left by 3.
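
The selector layout can be illustrated with a small helper (illustrative only; the real code simply writes selector_index << 3 for ring-0 GDT selectors):

```rust
// A segment selector: bits 0-1 = RPL, bit 2 = TI (0 = GDT, 1 = LDT),
// bits 3-15 = descriptor table index.
fn selector(index: u16, ti_ldt: bool, rpl: u16) -> u16 {
    (index << 3) | ((ti_ldt as u16) << 2) | (rpl & 0b11)
}

fn main() {
    assert_eq!(selector(1, false, 0), 0x08); // GDT entry 1 -> our code segment
    assert_eq!(selector(2, false, 0), 0x10); // GDT entry 2 -> our data segment
    assert_eq!(selector(2, true, 3), 0x17); // LDT entry 2, RPL 3
    println!("ok");
}
```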

Segment Type

Type has 4 bits. For a data segment, the top bit is 0 and the remaining bits are E (expansion direction), W (writable), and A (accessed, set to 1 by hardware when the segment is used); for a code segment, the top bit is 1 and the remaining bits are C (conforming, i.e., whether lower-privileged code may call into it), R (readable), and A (accessed). We therefore use 0b1011 for the code segment and 0b0011 for the data segment.

present indicates whether the segment is present, corresponding to the P bit of the GDT entry; S set to 1 means this is a code or data segment (rather than a system segment); L indicates a 64-bit (Long Mode) code segment; G selects the unit of the limit: set to 1, the limit is counted in 4K pages (covering up to 4K * 0x100000 bytes = 4GB), and set to 0 it is counted in bytes (up to 2^20 bytes). dpl, db, and the rest can likewise be filled in by consulting the GDT entry definition.
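
To make the granularity concrete, here is a toy calculation of how many bytes a descriptor limit covers:

```rust
// Effective segment size for a descriptor limit, depending on the
// granularity bit: G=0 counts bytes, G=1 counts 4K pages.
fn segment_bytes(limit: u64, g: bool) -> u64 {
    if g {
        (limit + 1) * 4096
    } else {
        limit + 1
    }
}

fn main() {
    // limit = 0xfffff with G=1 spans the full 4 GB address space,
    // which is why the flat-model descriptors use exactly these values.
    assert_eq!(segment_bytes(0xfffff, true), 1u64 << 32);
    // With G=0 the same limit only covers 2^20 bytes (1 MB).
    assert_eq!(segment_bytes(0xfffff, false), 1u64 << 20);
    println!("ok");
}
```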

let mut sregs = vcpu.get_sregs().unwrap();
const CODE_SEG: kvm_segment = seg_with_st(1, 0b1011);
const DATA_SEG: kvm_segment = seg_with_st(2, 0b0011);

// construct kvm_segment and set to segment registers
sregs.cs = CODE_SEG;
sregs.ds = DATA_SEG;
sregs.es = DATA_SEG;
sregs.fs = DATA_SEG;
sregs.gs = DATA_SEG;
sregs.ss = DATA_SEG;

With this, the segment registers are correctly configured.

Configuring the GDT

The GDT is a table, and each entry stores the description of a segment.

Segment Descriptor

Each entry is a 64-bit value, and the meaning of each bit can be referred to in the diagram above. Simply put, each entry corresponds to a segment, mainly including its base, limit, and permission bits, among other information. We need to construct this table.

Most fields in each entry are very similar to those in kvm_segment. We can write a conversion function to get an entry from kvm_segment.

// Ref: <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html> 3-10 Vol. 3A
const fn to_gdt_entry(seg: &kvm_segment) -> u64 {
    let base = seg.base;
    let limit = seg.limit as u64;
    // flags: G, DB, L, AVL
    let flags = (seg.g as u64 & 0x1) << 3
        | (seg.db as u64 & 0x1) << 2
        | (seg.l as u64 & 0x1) << 1
        | (seg.avl as u64 & 0x1);
    // access: P, DPL, S, Type
    let access = (seg.present as u64 & 0x1) << 7
        | (seg.dpl as u64 & 0b11) << 5
        | (seg.s as u64 & 0x1) << 4
        | (seg.type_ as u64 & 0b1111);
    ((base & 0xff00_0000u64) << 32)
        | ((base & 0x00ff_ffffu64) << 16)
        | (limit & 0x0000_ffffu64)
        | ((limit & 0x000f_0000u64) << 32)
        | (flags << 52)
        | (access << 40)
}

Afterward, we just need to construct the GDT table and copy it into guest memory:

// construct gdt table, write to memory, and set it to register
let gdt_table: [u64; 3] = [
    0,                       // NULL
    to_gdt_entry(&CODE_SEG), // CODE
    to_gdt_entry(&DATA_SEG), // DATA
];
let boot_gdt_addr = GuestAddress(BOOT_GDT_OFFSET);
for (index, entry) in gdt_table.iter().enumerate() {
    let addr = guest_mem
        .checked_offset(boot_gdt_addr, index * std::mem::size_of::<u64>())
        .unwrap();
    guest_mem.write_obj(*entry, addr).unwrap();
}
sregs.gdt.base = BOOT_GDT_OFFSET;
sregs.gdt.limit = std::mem::size_of_val(&gdt_table) as u16 - 1;

Finally, flip the Protected Mode switch and submit sregs.

// enable protected mode
sregs.cr0 |= X86_CR0_PE;
vcpu.set_sregs(&sregs).unwrap();

Thus far, we have completed the initialization of the necessary components for Protected Mode.

Generating Corresponding Architecture Code

In the previous chapter on Real Mode, we manually assembled a snippet into binary using nasm. We now need a piece of code that runs in Protected Mode; again we assemble it with nasm:

bits 32
mov eax, 0x42
mov [ds:0x10000], eax
hlt

ndisasm -b32 demo results in:

00000000  B842000000        mov eax,0x42
00000005  3EA300000100      mov [ds:0x10000],eax
0000000B  F4                hlt

Besides writing by hand or using nasm, we can also perform this assembling operation with pwntools (which those who have played CTF should be very familiar with):

from pwn import context, asm
context.update(arch = 'i386')
asm('mov eax,0x42; mov [ds:0x10000],eax; hlt')

So we get [0xb8, 0x42, 0x00, 0x00, 0x00, 0x3e, 0xa3, 0x00, 0x00, 0x01, 0x00, 0xf4].

Copying and Running Code

This part is the same as what we did in Real Mode.

// copy code
// B842000000   mov eax,0x42
// 3EA300000100 mov [ds:0x10000],eax
// F4           hlt
let code = [
    0xb8, 0x42, 0x00, 0x00, 0x00, 0x3e, 0xa3, 0x00, 0x00, 0x01, 0x00, 0xf4,
];
guest_mem.write_slice(&code, GuestAddress(0x0)).unwrap();
let reason = vcpu.run().unwrap();
let regs = vcpu.get_regs().unwrap();
println!("exit reason: {:?}", reason);
println!("rax: {:x}, rip: 0x{:X?}", regs.rax, regs.rip);
println!(
    "memory at 0x10000: 0x{:X}",
    guest_mem.read_obj::<u32>(GuestAddress(0x10000)).unwrap()
);

We can obtain the following results:

exit reason: Hlt
rax: 42, rip: 0xC
memory at 0x10000: 0x42

Thus we have correctly performed computation and a memory write in Protected Mode (paging not yet enabled). You can also try making the data segment's GDT entry identical to the code segment's; that run will fail, because the segment then has no write permission.

The complete code of this section:

use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};

const MEMORY_SIZE: usize = 0x30000;

const KVM_TSS_ADDRESS: usize = 0xfffb_d000;
const X86_CR0_PE: u64 = 0x1;
const BOOT_GDT_OFFSET: u64 = 0x500;

fn main() {
    // create vm
    let kvm = Kvm::new().expect("open kvm device failed");
    let vm = kvm.create_vm().expect("create vm failed");

    // create memory
    let guest_addr = GuestAddress(0x0);
    let guest_mem = GuestMemoryMmap::<()>::from_ranges(&[(guest_addr, MEMORY_SIZE)]).unwrap();
    let host_addr = guest_mem.get_host_address(guest_addr).unwrap();
    let mem_region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: 0,
        memory_size: MEMORY_SIZE as u64,
        userspace_addr: host_addr as u64,
        flags: KVM_MEM_LOG_DIRTY_PAGES,
    };
    unsafe {
        vm.set_user_memory_region(mem_region)
            .expect("set user memory region failed")
    };
    vm.set_tss_address(KVM_TSS_ADDRESS as usize)
        .expect("set tss failed");

    // create vcpu and set cpuid
    let vcpu = vm.create_vcpu(0).expect("create vcpu failed");
    let kvm_cpuid = kvm.get_supported_cpuid(KVM_MAX_CPUID_ENTRIES).unwrap();
    vcpu.set_cpuid2(&kvm_cpuid).unwrap();

    // set regs
    let mut regs = vcpu.get_regs().unwrap();
    regs.rip = 0;
    regs.rflags = 2;
    vcpu.set_regs(&regs).unwrap();

    // set sregs
    let mut sregs = vcpu.get_sregs().unwrap();
    const CODE_SEG: kvm_segment = seg_with_st(1, 0b1011);
    const DATA_SEG: kvm_segment = seg_with_st(2, 0b0011);

    // construct kvm_segment and set to segment registers
    sregs.cs = CODE_SEG;
    sregs.ds = DATA_SEG;
    sregs.es = DATA_SEG;
    sregs.fs = DATA_SEG;
    sregs.gs = DATA_SEG;
    sregs.ss = DATA_SEG;

    // construct gdt table, write to memory and set it to register
    let gdt_table: [u64; 3] = [
        0,                       // NULL
        to_gdt_entry(&CODE_SEG), // CODE
        to_gdt_entry(&DATA_SEG), // DATA
    ];
    let boot_gdt_addr = GuestAddress(BOOT_GDT_OFFSET);
    for (index, entry) in gdt_table.iter().enumerate() {
        let addr = guest_mem
            .checked_offset(boot_gdt_addr, index * std::mem::size_of::<u64>())
            .unwrap();
        guest_mem.write_obj(*entry, addr).unwrap();
    }
    sregs.gdt.base = BOOT_GDT_OFFSET;
    sregs.gdt.limit = std::mem::size_of_val(&gdt_table) as u16 - 1;

    // enable protected mode
    sregs.cr0 |= X86_CR0_PE;
    vcpu.set_sregs(&sregs).unwrap();

    // copy code
    // B842000000   mov eax,0x42
    // 3EA300000100 mov [ds:0x10000],eax
    // F4           hlt
    let code = [
        0xb8, 0x42, 0x00, 0x00, 0x00, 0x3e, 0xa3, 0x00, 0x00, 0x01, 0x00, 0xf4,
    ];
    guest_mem.write_slice(&code, GuestAddress(0x0)).unwrap();
    let reason = vcpu.run().unwrap();
    let regs = vcpu.get_regs().unwrap();
    println!("exit reason: {:?}", reason);
    println!("rax: {:x}, rip: 0x{:X?}", regs.rax, regs.rip);
    println!(
        "memory at 0x10000: 0x{:X}",
        guest_mem.read_obj::<u32>(GuestAddress(0x10000)).unwrap()
    );
}

const fn seg_with_st(selector_index: u16, type_: u8) -> kvm_segment {
    kvm_segment {
        base: 0,
        limit: 0x000fffff,
        selector: selector_index << 3,
        // 0b1011: Code, Execute/Read, accessed
        // 0b0011: Data, Read/Write, accessed
        type_,
        present: 1,
        dpl: 0,
        db: 1,
        s: 1,
        l: 0,
        g: 1,
        avl: 0,
        unusable: 0,
        padding: 0,
    }
}

// Ref: <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html> 3-10 Vol. 3A
const fn to_gdt_entry(seg: &kvm_segment) -> u64 {
    let base = seg.base;
    let limit = seg.limit as u64;
    // flags: G, DB, L, AVL
    let flags = (seg.g as u64 & 0x1) << 3
        | (seg.db as u64 & 0x1) << 2
        | (seg.l as u64 & 0x1) << 1
        | (seg.avl as u64 & 0x1);
    // access: P, DPL, S, Type
    let access = (seg.present as u64 & 0x1) << 7
        | (seg.dpl as u64 & 0b11) << 5
        | (seg.s as u64 & 0x1) << 4
        | (seg.type_ as u64 & 0b1111);
    ((base & 0xff00_0000u64) << 32)
        | ((base & 0x00ff_ffffu64) << 16)
        | (limit & 0x0000_ffffu64)
        | ((limit & 0x000f_0000u64) << 32)
        | (flags << 52)
        | (access << 40)
}

2.3. Enabling Paging

Enabling PAE (Physical Address Extension)

No PAE

With PAE

Typically, on 32-bit x86 with 4K pages, the page table has two levels; using larger pages (such as 4M) or enabling PAE changes the number of levels. Enabling PAE allows a 32-bit machine to address up to 64GB of physical memory, but due to the limits of the virtual address space, the maximum available memory for a single process in a flat memory model remains 4GB.

To enable PAE, set bit 5 of the CR4 register to 1; to stay in 32-bit PAE paging (rather than Long Mode), paging must be enabled (CR0.PG = 1) while IA32_EFER.LME remains 0.

These two figures let us compare the page-table structure before and after PAE is enabled. Note that besides the extra PDPTE level, each entry has also grown from 32 bits to 64 bits (which is how larger physical addresses can be stored).
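
To make the structure concrete: under PAE with 4K pages, a 32-bit linear address splits into 2 + 9 + 9 + 12 bits. A toy decomposition (illustrative only, not used by the VMM):

```rust
// Decompose a 32-bit linear address under PAE 4K paging:
// [31:30] PDPT index, [29:21] PD index, [20:12] PT index, [11:0] offset.
fn pae_indices(addr: u32) -> (u32, u32, u32, u32) {
    (
        (addr >> 30) & 0x3,   // one of 4 PDPTEs
        (addr >> 21) & 0x1ff, // one of 512 PDEs
        (addr >> 12) & 0x1ff, // one of 512 PTEs
        addr & 0xfff,         // byte offset within the 4K page
    )
}

fn main() {
    // 0x10000 (the address our demo code writes to) lives in the first
    // PDPTE, the first PDE, and PTE number 16.
    assert_eq!(pae_indices(0x10000), (0, 0, 16, 0));
    // The very last byte of the 4GB space hits the last entry everywhere.
    assert_eq!(pae_indices(0xFFFF_FFFF), (3, 0x1ff, 0x1ff, 0xfff));
    println!("ok");
}
```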

Initializing the Page Table

Here we choose the most common configuration: 32-bit with PAE enabled and 4K pages (in fact, 2M pages would be simpler and more efficient here, since this boot environment involves no context switching and has a single global page table). In the 64-bit experiments later, we will try larger page sizes.

We will now initialize the page table for PAE. Referring to the images above, we need to construct the page-directory-pointer table (PDPT), the Page Directory, and the Page Table, and initialize some of their entries (otherwise we would immediately hit tedious page faults; this is merely a temporary boot environment, and real page-fault handling is left to the operating system booted later).

For more options, refer to the SDM Volume 3, Chapter 4.4.

According to the table above, we need to:

  1. Write the address of the page-directory-pointer table (PDPT) into the CR3 register.
  2. Initialize the PDPT entries (only one of the four needs to be filled, mapping the low 1GB of the address space; bits 1-11 are zero, and the present bit, bit 0, is set to 1).
  3. Initialize the first Page Directory entry to point to the Page Table (present and writable bits set).
  4. Fill all 512 Page Table entries, each mapping a 4K page, identity-mapping the first 2MB of memory.

// set page table
let boot_pdpte_addr = GuestAddress(0xa000);
let boot_pde_addr = GuestAddress(0xb000);
let boot_pte_addr = GuestAddress(0xc000);

// PDPTE 0 -> page directory, present bit set
guest_mem
    .write_slice(
        &(boot_pde_addr.raw_value() as u64 | 1).to_le_bytes(),
        boot_pdpte_addr,
    )
    .unwrap();

// PDE 0 -> page table, present + writable
guest_mem
    .write_slice(
        &(boot_pte_addr.raw_value() as u64 | 0b11).to_le_bytes(),
        boot_pde_addr,
    )
    .unwrap();

// 512 PTEs identity-mapping the first 2MB as 4K pages (present + writable)
for i in 0..512 {
    guest_mem
        .write_slice(
            &((i << 12) + 0b11u64).to_le_bytes(),
            boot_pte_addr.unchecked_add(i * 8),
        )
        .unwrap();
}

Configuring the Control Register

We set CR3 to the base address of the PDPT, enable PAE (Physical Address Extension) in CR4 and paging in CR0, then submit the registers.

sregs.cr3 = boot_pdpte_addr.raw_value() as u64;
sregs.cr4 |= X86_CR4_PAE;
sregs.cr0 |= X86_CR0_PG;
vcpu.set_sregs(&sregs).unwrap();

Code Generation and Execution

This time, the memory access no longer carries an explicit segment override.

bits 32
mov eax, 0x42
mov [0x10000], eax
hlt

Generating the corresponding machine code:

00000000  B842000000        mov eax,0x42
00000005  A300000100        mov [0x10000],eax
0000000A  F4                hlt

Run:

exit reason: Hlt
rax: 42, rip: 0xB
memory at 0x10000: 0x42

The complete code of this section:

use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Address, Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};

const MEMORY_SIZE: usize = 0x30000;

const KVM_TSS_ADDRESS: usize = 0xfffb_d000;
const X86_CR0_PE: u64 = 0x1;
const X86_CR4_PAE: u64 = 0x20;
const X86_CR0_PG: u64 = 0x80000000;
const BOOT_GDT_OFFSET: u64 = 0x500;

fn main() {
    // create vm
    let kvm = Kvm::new().expect("open kvm device failed");
    let vm = kvm.create_vm().expect("create vm failed");

    // create memory
    let guest_addr = GuestAddress(0x0);
    let guest_mem = GuestMemoryMmap::<()>::from_ranges(&[(guest_addr, MEMORY_SIZE)]).unwrap();
    let host_addr = guest_mem.get_host_address(guest_addr).unwrap();
    let mem_region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: 0,
        memory_size: MEMORY_SIZE as u64,
        userspace_addr: host_addr as u64,
        flags: KVM_MEM_LOG_DIRTY_PAGES,
    };
    unsafe {
        vm.set_user_memory_region(mem_region)
            .expect("set user memory region failed")
    };
    vm.set_tss_address(KVM_TSS_ADDRESS as usize)
        .expect("set tss failed");

    // create vcpu and set cpuid
    let vcpu = vm.create_vcpu(0).expect("create vcpu failed");
    let kvm_cpuid = kvm.get_supported_cpuid(KVM_MAX_CPUID_ENTRIES).unwrap();
    vcpu.set_cpuid2(&kvm_cpuid).unwrap();

    // set regs
    let mut regs = vcpu.get_regs().unwrap();
    regs.rip = 0;
    regs.rflags = 2;
    vcpu.set_regs(&regs).unwrap();

    // set sregs
    let mut sregs = vcpu.get_sregs().unwrap();
    const CODE_SEG: kvm_segment = seg_with_st(1, 0b1011);
    const DATA_SEG: kvm_segment = seg_with_st(2, 0b0011);

    // construct kvm_segment and set to segment registers
    sregs.cs = CODE_SEG;
    sregs.ds = DATA_SEG;
    sregs.es = DATA_SEG;
    sregs.fs = DATA_SEG;
    sregs.gs = DATA_SEG;
    sregs.ss = DATA_SEG;

    // construct gdt table, write to memory and set it to register
    let gdt_table: [u64; 3] = [
        0,                       // NULL
        to_gdt_entry(&CODE_SEG), // CODE
        to_gdt_entry(&DATA_SEG), // DATA
    ];
    let boot_gdt_addr = GuestAddress(BOOT_GDT_OFFSET);
    for (index, entry) in gdt_table.iter().enumerate() {
        let addr = guest_mem
            .checked_offset(boot_gdt_addr, index * std::mem::size_of::<u64>())
            .unwrap();
        guest_mem.write_obj(*entry, addr).unwrap();
    }
    sregs.gdt.base = BOOT_GDT_OFFSET;
    sregs.gdt.limit = std::mem::size_of_val(&gdt_table) as u16 - 1;

    // enable protected mode
    sregs.cr0 |= X86_CR0_PE;

    // set page table
    let boot_pdpte_addr = GuestAddress(0xa000);
    let boot_pde_addr = GuestAddress(0xb000);
    let boot_pte_addr = GuestAddress(0xc000);

    guest_mem
        .write_slice(
            &(boot_pde_addr.raw_value() as u64 | 1).to_le_bytes(),
            boot_pdpte_addr,
        )
        .unwrap();

    guest_mem
        .write_slice(
            &(boot_pte_addr.raw_value() as u64 | 0b11).to_le_bytes(),
            boot_pde_addr,
        )
        .unwrap();

    for i in 0..512 {
        guest_mem
            .write_slice(
                &((i << 12) + 0b11u64).to_le_bytes(),
                boot_pte_addr.unchecked_add(i * 8),
            )
            .unwrap();
    }
    sregs.cr3 = boot_pdpte_addr.raw_value() as u64;
    sregs.cr4 |= X86_CR4_PAE;
    sregs.cr0 |= X86_CR0_PG;
    vcpu.set_sregs(&sregs).unwrap();

    // copy code
    // B842000000 mov eax,0x42
    // A300000100 mov [0x10000],eax
    // F4         hlt
    let code = [
        0xb8, 0x42, 0x00, 0x00, 0x00, 0xa3, 0x00, 0x00, 0x01, 0x00, 0xf4,
    ];
    guest_mem.write_slice(&code, GuestAddress(0x0)).unwrap();
    let reason = vcpu.run().unwrap();
    let regs = vcpu.get_regs().unwrap();
    println!("exit reason: {:?}", reason);
    println!("rax: {:x}, rip: 0x{:X?}", regs.rax, regs.rip);
    println!(
        "memory at 0x10000: 0x{:X}",
        guest_mem.read_obj::<u32>(GuestAddress(0x10000)).unwrap()
    );
}

const fn seg_with_st(selector_index: u16, type_: u8) -> kvm_segment {
    kvm_segment {
        base: 0,
        limit: 0x000fffff,
        selector: selector_index << 3,
        // 0b1011: Code, Execute/Read, accessed
        // 0b0011: Data, Read/Write, accessed
        type_,
        present: 1,
        dpl: 0,
        db: 1,
        s: 1,
        l: 0,
        g: 1,
        avl: 0,
        unusable: 0,
        padding: 0,
    }
}

// Ref: <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html> 3-10 Vol. 3A
const fn to_gdt_entry(seg: &kvm_segment) -> u64 {
    let base = seg.base;
    let limit = seg.limit as u64;
    // flags: G, DB, L, AVL
    let flags = (seg.g as u64 & 0x1) << 3
        | (seg.db as u64 & 0x1) << 2
        | (seg.l as u64 & 0x1) << 1
        | (seg.avl as u64 & 0x1);
    // access: P, DPL, S, Type
    let access = (seg.present as u64 & 0x1) << 7
        | (seg.dpl as u64 & 0b11) << 5
        | (seg.s as u64 & 0x1) << 4
        | (seg.type_ as u64 & 0b1111);
    ((base & 0xff00_0000u64) << 32)
        | ((base & 0x00ff_ffffu64) << 16)
        | (limit & 0x0000_ffffu64)
        | ((limit & 0x000f_0000u64) << 32)
        | (flags << 52)
        | (access << 40)
}

2.4. Entering 64-bit Mode

To enter 64-bit mode, we need PE (Protection Enable), PAE, PG (Paging), and LME (Long Mode Enable) all set. If everything is done correctly, the CPU reports LMA (Long Mode Active) as 1, which we can read back to verify.

LME and LMA are two bits in the IA32_EFER register (EFER stands for Extended Feature Enable Register).

Merely modifying the register is quite simple; what is more complex is the page table. The structure of the page table also changes after entering 64-bit mode.

Referring to Figure 4-1, here we need 4-level paging. We have the option to choose between 4K, 2M, and 1G pages.

Based on the above figure, we opt for 2M pages (corresponding to Figure 4-9).
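
To see why one page directory of 2M pages covers 1GB, here is a toy decomposition of a virtual address under 4-level paging with 2M pages (the fourth level, the page table, disappears; illustrative only):

```rust
// Decompose a 64-bit virtual address for 2M pages under 4-level paging:
// [47:39] PML4 index, [38:30] PDPT index, [29:21] PD index, [20:0] offset.
fn indices_2m(addr: u64) -> (u64, u64, u64, u64) {
    (
        (addr >> 39) & 0x1ff,
        (addr >> 30) & 0x1ff,
        (addr >> 21) & 0x1ff,
        addr & 0x1f_ffff, // 21-bit offset inside the 2M page
    )
}

fn main() {
    // 0x10000 falls inside the very first 2M page.
    assert_eq!(indices_2m(0x10000), (0, 0, 0, 0x10000));
    // The 4th PDE maps the 2M page starting at 3 << 21.
    assert_eq!(indices_2m(3 << 21), (0, 0, 3, 0));
    // One 512-entry page directory of 2M pages covers exactly 1 GB.
    assert_eq!(512u64 * (1 << 21), 1 << 30);
    println!("ok");
}
```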

// set page table
let boot_pml4_addr = GuestAddress(0xa000);
let boot_pdpte_addr = GuestAddress(0xb000);
let boot_pde_addr = GuestAddress(0xc000);

// PML4 entry 0 -> PDPT, present + writable
guest_mem
    .write_slice(
        &(boot_pdpte_addr.raw_value() as u64 | 0b11).to_le_bytes(),
        boot_pml4_addr,
    )
    .unwrap();
// PDPT entry 0 -> page directory, present + writable
guest_mem
    .write_slice(
        &(boot_pde_addr.raw_value() as u64 | 0b11).to_le_bytes(),
        boot_pdpte_addr,
    )
    .unwrap();

// 512 PDEs, each a 2M page (PS | present + writable),
// identity-mapping the first 1GB
for i in 0..512 {
    guest_mem
        .write_slice(
            &((i << 21) | 0b10000011u64).to_le_bytes(),
            boot_pde_addr.unchecked_add(i * 8),
        )
        .unwrap();
}
sregs.cr3 = boot_pml4_addr.raw_value() as u64;
sregs.cr4 |= X86_CR4_PAE;
sregs.cr0 |= X86_CR0_PG;
sregs.efer |= EFER_LMA | EFER_LME;

GDT Adaptation

In our GDT entries, we originally set the L (Long Mode) bit to 0; it now needs to be 1. Furthermore, per the manual's requirement, when the L bit is set, the D bit must be cleared to zero.

const fn seg_with_st(selector_index: u16, type_: u8) -> kvm_segment {
    kvm_segment {
        base: 0,
        limit: 0x000fffff,
        selector: selector_index << 3,
        // 0b1011: Code, Execute/Read, accessed
        // 0b0011: Data, Read/Write, accessed
        type_,
        present: 1,
        dpl: 0,
        // If L-bit is set, then D-bit must be cleared.
        db: 0,
        s: 1,
        l: 1,
        g: 1,
        avl: 0,
        unusable: 0,
        padding: 0,
    }
}

Code Generation and Running

bits 64
mov eax, 0x42
mov [0x10000], rax
hlt

(mov eax, 0x42 zero-extends into rax, and the store of rax is a genuine 64-bit access.) Got:

00000000  B842000000        mov eax,0x42
00000005  4889042500000100  mov [0x10000],rax
0000000D  F4                hlt

Run:

exit reason: Hlt
rax: 42, rip: 0xE
memory at 0x10000: 0x42

The complete code:

use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Address, Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};

const MEMORY_SIZE: usize = 0x30000;

const KVM_TSS_ADDRESS: usize = 0xfffb_d000;
const X86_CR0_PE: u64 = 0x1;
const X86_CR4_PAE: u64 = 0x20;
const X86_CR0_PG: u64 = 0x80000000;
const BOOT_GDT_OFFSET: u64 = 0x500;
const EFER_LME: u64 = 0x100;
const EFER_LMA: u64 = 0x400;

fn main() {
    // create vm
    let kvm = Kvm::new().expect("open kvm device failed");
    let vm = kvm.create_vm().expect("create vm failed");

    // create memory
    let guest_addr = GuestAddress(0x0);
    let guest_mem = GuestMemoryMmap::<()>::from_ranges(&[(guest_addr, MEMORY_SIZE)]).unwrap();
    let host_addr = guest_mem.get_host_address(guest_addr).unwrap();
    let mem_region = kvm_userspace_memory_region {
        slot: 0,
        guest_phys_addr: 0,
        memory_size: MEMORY_SIZE as u64,
        userspace_addr: host_addr as u64,
        flags: KVM_MEM_LOG_DIRTY_PAGES,
    };
    unsafe {
        vm.set_user_memory_region(mem_region)
            .expect("set user memory region failed")
    };
    vm.set_tss_address(KVM_TSS_ADDRESS as usize)
        .expect("set tss failed");

    // create vcpu and set cpuid
    let vcpu = vm.create_vcpu(0).expect("create vcpu failed");
    let kvm_cpuid = kvm.get_supported_cpuid(KVM_MAX_CPUID_ENTRIES).unwrap();
    vcpu.set_cpuid2(&kvm_cpuid).unwrap();

    // set regs
    let mut regs = vcpu.get_regs().unwrap();
    regs.rip = 0;
    regs.rflags = 2;
    vcpu.set_regs(&regs).unwrap();

    // set sregs
    let mut sregs = vcpu.get_sregs().unwrap();
    const CODE_SEG: kvm_segment = seg_with_st(1, 0b1011);
    const DATA_SEG: kvm_segment = seg_with_st(2, 0b0011);

    // construct kvm_segment and set to segment registers
    sregs.cs = CODE_SEG;
    sregs.ds = DATA_SEG;
    sregs.es = DATA_SEG;
    sregs.fs = DATA_SEG;
    sregs.gs = DATA_SEG;
    sregs.ss = DATA_SEG;

    // construct gdt table, write to memory and set it to register
    let gdt_table: [u64; 3] = [
        0,                       // NULL
        to_gdt_entry(&CODE_SEG), // CODE
        to_gdt_entry(&DATA_SEG), // DATA
    ];
    let boot_gdt_addr = GuestAddress(BOOT_GDT_OFFSET);
    for (index, entry) in gdt_table.iter().enumerate() {
        let addr = guest_mem
            .checked_offset(boot_gdt_addr, index * std::mem::size_of::<u64>())
            .unwrap();
        guest_mem.write_obj(*entry, addr).unwrap();
    }
    sregs.gdt.base = BOOT_GDT_OFFSET;
    sregs.gdt.limit = std::mem::size_of_val(&gdt_table) as u16 - 1;

    // enable protected mode
    sregs.cr0 |= X86_CR0_PE;

    // set page table
    let boot_pml4_addr = GuestAddress(0xa000);
    let boot_pdpte_addr = GuestAddress(0xb000);
    let boot_pde_addr = GuestAddress(0xc000);

    guest_mem
        .write_slice(
            &(boot_pdpte_addr.raw_value() as u64 | 0b11).to_le_bytes(),
            boot_pml4_addr,
        )
        .unwrap();
    guest_mem
        .write_slice(
            &(boot_pde_addr.raw_value() as u64 | 0b11).to_le_bytes(),
            boot_pdpte_addr,
        )
        .unwrap();

    for i in 0..512 {
        guest_mem
            .write_slice(
                &((i << 21) | 0b10000011u64).to_le_bytes(),
                boot_pde_addr.unchecked_add(i * 8),
            )
            .unwrap();
    }
    sregs.cr3 = boot_pml4_addr.raw_value() as u64;
    sregs.cr4 |= X86_CR4_PAE;
    sregs.cr0 |= X86_CR0_PG;
    sregs.efer |= EFER_LMA | EFER_LME;
    vcpu.set_sregs(&sregs).unwrap();

    // copy code
    // B842000000       mov eax,0x42
    // 4889042500000100 mov [0x10000],rax
    // F4               hlt
    let code = [
        0xB8, 0x42, 0x00, 0x00, 0x00, 0x48, 0x89, 0x04, 0x25, 0x00, 0x00, 0x01, 0x00, 0xf4,
    ];
    guest_mem.write_slice(&code, GuestAddress(0x0)).unwrap();
    let reason = vcpu.run().unwrap();
    let regs = vcpu.get_regs().unwrap();
    println!("exit reason: {:?}", reason);
    println!("rax: {:x}, rip: 0x{:X?}", regs.rax, regs.rip);
    println!(
        "memory at 0x10000: 0x{:X}",
        guest_mem.read_obj::<u32>(GuestAddress(0x10000)).unwrap()
    );
}

const fn seg_with_st(selector_index: u16, type_: u8) -> kvm_segment {
    kvm_segment {
        base: 0,
        limit: 0x000fffff,
        selector: selector_index << 3,
        // 0b1011: Code, Execute/Read, accessed
        // 0b0011: Data, Read/Write, accessed
        type_,
        present: 1,
        dpl: 0,
        // If L-bit is set, then D-bit must be cleared.
        db: 0,
        s: 1,
        l: 1,
        g: 1,
        avl: 0,
        unusable: 0,
        padding: 0,
    }
}

// Ref: <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html> 3-10 Vol. 3A
const fn to_gdt_entry(seg: &kvm_segment) -> u64 {
    let base = seg.base;
    let limit = seg.limit as u64;
    // flags: G, DB, L, AVL
    let flags = (seg.g as u64 & 0x1) << 3
        | (seg.db as u64 & 0x1) << 2
        | (seg.l as u64 & 0x1) << 1
        | (seg.avl as u64 & 0x1);
    // access: P, DPL, S, Type
    let access = (seg.present as u64 & 0x1) << 7
        | (seg.dpl as u64 & 0b11) << 5
        | (seg.s as u64 & 0x1) << 4
        | (seg.type_ as u64 & 0b1111);
    ((base & 0xff00_0000u64) << 32)
        | ((base & 0x00ff_ffffu64) << 16)
        | (limit & 0x0000_ffffu64)
        | ((limit & 0x000f_0000u64) << 32)
        | (flags << 52)
        | (access << 40)
}
