This article is the second in the series. It introduces a few common x86 operating modes and how to switch between them.
Chap 2: Real Mode to Protected/Long Mode
2.1. Running Modes
This section briefly introduces several common x86 running modes and their mechanisms.
Protected Mode and Real Mode
The 8086 is a 16-bit CPU with segmented addressing. The linear address is computed as (segment << 4) + offset, which extends the addressable space from 64K to 1M. Registers such as CS and DS hold the segment values.
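As a quick illustration of the formula, here is a minimal sketch of the real-mode address calculation. The function name `real_mode_linear` is made up for this example. Note that in Rust (as in C) `+` binds tighter than `<<`, so the parentheses around the shift are required.

```rust
/// Real-mode linear address: (segment << 4) + offset.
/// The sum can exceed 1 MiB (up to 0x10FFEF for 0xFFFF:0xFFFF), so the
/// result is returned as a u32 instead of being truncated to 20 bits.
fn real_mode_linear(segment: u16, offset: u16) -> u32 {
    (u32::from(segment) << 4) + u32::from(offset)
}
```

Different segment:offset pairs can name the same byte: 0x07C0:0x0000 and 0x0000:0x7C00 both resolve to 0x7C00.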
The 80286 introduced Protected Mode, and the 80386 began the 32-bit era (it also added paging, more on this later). However, for compatibility, the CPU still starts in Real Mode and must be switched to Protected Mode manually.
Segmentation, GDT, and LDT
Under Protected Mode, segmentation is still mandatory, but the design has changed significantly. A segment is now described by a Base, a Limit, and permission bits, which together form a segment descriptor. Instead of writing a base directly into a segment register and computing <<4 plus the offset, the segment register now holds a segment selector, which is an index to a descriptor rather than a raw address.
In fact, a segment selector contains “index” information and a requested privilege level (RPL); the “index” consists of an index number and a bit that selects the table. Since there is an index, a table is needed to hold the descriptors, and the address of the table is stored in a register. There are two descriptor tables, the GDT (G for Global) and the LDT (L for Local). The special bit mentioned above (TI) selects which table to use.
These are not two independent tables. In fact, the LDT is a subordinate table under the GDT, and the GDT table can store the descriptor of the LDT table.
When the LDT is selected, the CPU uses the index value in the LDTR to find the corresponding LDT descriptor in the GDT, obtains the base and limit of that LDT, and then looks up the LDT using the index value in the segment selector. Using the GDT is simpler: the CPU locates the GDT through the GDTR and then looks it up using the index value in the segment selector.
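To make the selector layout concrete, here is a small sketch that splits a raw 16-bit selector into its three fields. The `Selector` and `Table` types and the function names are invented for this example; only the bit layout (RPL in bits 0-1, TI in bit 2, index in bits 3-15) comes from the architecture.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Table {
    Gdt,
    Ldt,
}

#[derive(Debug, PartialEq, Eq)]
struct Selector {
    index: u16,   // bits 3..=15: entry number in the chosen table
    table: Table, // bit 2 (TI): 0 = GDT, 1 = LDT
    rpl: u8,      // bits 0..=1: requested privilege level
}

fn decode_selector(raw: u16) -> Selector {
    Selector {
        index: raw >> 3,
        table: if raw & 0b100 != 0 { Table::Ldt } else { Table::Gdt },
        rpl: (raw & 0b11) as u8,
    }
}

/// Byte offset of the descriptor inside the selected table.
/// Each descriptor is 8 bytes, so the offset is index * 8.
fn descriptor_offset(sel: &Selector) -> u32 {
    u32::from(sel.index) * 8
}
```

For example, the selector 0x08 decodes to entry 1 of the GDT with RPL 0, and its descriptor sits 8 bytes into the table.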
In design, the operating system stores each process’s segment descriptor table as a bunch of LDTs, then records each LDT in the GDT, and process switching is achieved by switching the LDTR register. But in reality, this set of mechanisms is not used (how is privilege control done then? It relies on the paging mechanism). Linux’s GDT only has four valid segments: user code, user data, kernel code, and kernel data (the difference between user and kernel is in the permission bits).
After enabling the paging mechanism, this system essentially only does a small part of the permission check and serves no other purpose. So we often directly configure the base of the few segments in the GDT to 0, so whether using cs or ds, the final calculated address is consistent, which is the so-called flat model.
Long Mode
In the 64-bit era, another mode called Long Mode (which Intel calls IA-32e) was introduced. Sounds confusing, right? Let’s look at the diagram:
Overall, there are two modes: Legacy Mode and Long Mode, each with several sub-modes. Legacy is a general term for some old artifacts, and we are now familiar with the two common modes there. We are primarily looking at Long Mode’s Compatibility mode and 64-bit mode.
Real Mode supports segmentation, not paging; after the 80386, segmentation is still supported while also supporting paging, but segmentation is mandatory, and paging is optional; in Long Mode, paging becomes mandatory, while the segmentation mechanism gradually becomes superfluous.
To fully utilize the hardware (such as added registers, more than 4GB of memory), the program itself must be 64-bit, which requires running in 64-bit mode. If the goal is to run 32-bit programs without modification, then you would use Compatibility mode.
So, how to switch between these modes? After all, the CPU is in Real Mode after booting up; and the common scenario is that we want to run both 64-bit and 32-bit programs on a 64-bit machine.
By setting certain bits in specific registers, we can control the CPU to complete the switch.
Entering Protected Mode requires correctly configuring the GDT and some special registers, and then setting the lowest bit of CR0 to 1.
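The CR0 part of this step is a single bit operation. The sketch below only manipulates a plain integer; the constant and function names are chosen for this example, and writing the value back to the vCPU is left out.

```rust
/// CR0.PE (Protection Enable) is bit 0 of CR0.
const CR0_PE: u64 = 1 << 0;

/// Returns the CR0 value with protected mode enabled, leaving all
/// other bits untouched.
fn enter_protected_mode(cr0: u64) -> u64 {
    cr0 | CR0_PE
}

fn in_protected_mode(cr0: u64) -> bool {
    cr0 & CR0_PE != 0
}
```

Using an OR rather than assigning 1 keeps whatever other CR0 bits were already set in the value read from the vCPU.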
Configuring Segment Registers
Segment registers correspond to the kvm_segment structure, which includes base, limit, selector, etc. Essentially, it is a layer of abstraction over a GDT entry, since processors from manufacturers other than Intel also exist.
Our goal is a flat model, so the base must be 0, and the limit must be 0xfffff.
Bits 0-1 and bit 2 of the selector hold the RPL and TI respectively, so we need to shift the passed-in index 3 bits to the left.
Type has 4 bits. For a data segment, the first bit is 0, and the last three bits represent E (expand-down, the direction of growth), W (writable), and A (accessed flag, set to 1 by hardware after the segment is accessed). For a code segment, the first bit is 1, and the last three bits represent C (conforming, whether lower-privileged code is allowed to call in), R (readable), and A (accessed flag). Therefore, we can use 1011 for code segments and 0011 for data segments.
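The two values can be derived from named bits instead of being memorized. This is a minimal sketch; the constant and function names are made up for illustration.

```rust
const TYPE_ACCESSED: u8 = 1 << 0; // A: set by hardware on first access
const TYPE_RW: u8 = 1 << 1;       // W for data segments, R for code segments
const TYPE_EC: u8 = 1 << 2;       // E for data segments, C for code segments
const TYPE_CODE: u8 = 1 << 3;     // 1 = code segment, 0 = data segment

fn data_type(expand_down: bool, writable: bool, accessed: bool) -> u8 {
    let mut t = 0;
    if expand_down { t |= TYPE_EC; }
    if writable { t |= TYPE_RW; }
    if accessed { t |= TYPE_ACCESSED; }
    t
}

fn code_type(conforming: bool, readable: bool, accessed: bool) -> u8 {
    let mut t = TYPE_CODE;
    if conforming { t |= TYPE_EC; }
    if readable { t |= TYPE_RW; }
    if accessed { t |= TYPE_ACCESSED; }
    t
}
```

A readable, non-conforming, accessed code segment gives 0b1011, and a writable, expand-up, accessed data segment gives 0b0011, matching the values chosen above.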
present indicates whether the segment is present and corresponds to the P bit of the GDT entry; S is set to 1 to indicate a code or data segment; L indicates whether this is a 64-bit (Long Mode) code segment; G selects the unit of the limit: when set to 1 the unit is 4K, so a limit of 0xfffff covers 4 GB, and when set to 0 the unit is bytes, with a maximum of 2^20 bytes. dpl, db, and the other fields can be assigned by referring to the GDT definition.
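The effect of G on the 20-bit limit can be checked with a few lines of arithmetic. This is an illustrative helper (the name `segment_size` is not from any library); it relies on the limit being the last valid offset, so the size is limit + 1 units.

```rust
/// Size in bytes of a segment with the given 20-bit limit.
/// With G = 1 the limit counts 4 KiB units and the low 12 bits of the
/// last valid offset are all ones; with G = 0 it counts bytes.
fn segment_size(limit: u32, g: bool) -> u64 {
    let limit = u64::from(limit & 0xF_FFFF);
    if g {
        ((limit << 12) | 0xFFF) + 1
    } else {
        limit + 1
    }
}
```

With limit 0xfffff this yields 4 GiB when G is 1 and 1 MiB when G is 0, which is why the flat model sets G.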
// construct kvm_segment and set to segment registers
sregs.cs = CODE_SEG;
sregs.ds = DATA_SEG;
sregs.es = DATA_SEG;
sregs.fs = DATA_SEG;
sregs.gs = DATA_SEG;
sregs.ss = DATA_SEG;
Here the segment registers have been correctly processed.
Configuring the GDT
The GDT is a table, and each entry stores the description of a segment.
Each entry is a 64-bit value, and the meaning of each bit can be referred to in the diagram above. Simply put, each entry corresponds to a segment, mainly including its base, limit, and permission bits, among other information. We need to construct this table.
Most fields in each entry are very similar to those in kvm_segment. We can write a conversion function to get an entry from kvm_segment.
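Such a conversion function could look like the sketch below. To keep the example self-contained it uses a local `Segment` struct that mirrors only the kvm_segment fields needed for the descriptor; the struct, `to_gdt_entry`, and `flat_segment` are names invented here, not part of kvm_bindings.

```rust
/// Local stand-in for the kvm_segment fields that end up in a GDT entry.
#[derive(Clone, Copy, Default)]
struct Segment {
    base: u64,
    limit: u32,
    type_: u8,
    present: u8,
    dpl: u8,
    db: u8,
    s: u8,
    l: u8,
    g: u8,
    avl: u8,
}

/// Packs the fields into the 64-bit descriptor layout.
fn to_gdt_entry(seg: &Segment) -> u64 {
    let base = seg.base;
    let limit = u64::from(seg.limit);
    (limit & 0xFFFF)                           // limit 15:0   -> bits 0..=15
        | ((base & 0xFF_FFFF) << 16)           // base 23:0    -> bits 16..=39
        | (u64::from(seg.type_ & 0xF) << 40)   // type         -> bits 40..=43
        | (u64::from(seg.s & 1) << 44)         // S            -> bit 44
        | (u64::from(seg.dpl & 3) << 45)       // DPL          -> bits 45..=46
        | (u64::from(seg.present & 1) << 47)   // P            -> bit 47
        | (((limit >> 16) & 0xF) << 48)        // limit 19:16  -> bits 48..=51
        | (u64::from(seg.avl & 1) << 52)       // AVL          -> bit 52
        | (u64::from(seg.l & 1) << 53)         // L            -> bit 53
        | (u64::from(seg.db & 1) << 54)        // D/B          -> bit 54
        | (u64::from(seg.g & 1) << 55)         // G            -> bit 55
        | (((base >> 24) & 0xFF) << 56)        // base 31:24   -> bits 56..=63
}

/// Flat-model segment: base 0, limit 0xfffff, 4K granularity, 32-bit.
fn flat_segment(type_: u8) -> Segment {
    Segment {
        base: 0,
        limit: 0xF_FFFF,
        type_,
        present: 1,
        dpl: 0,
        db: 1,
        s: 1,
        l: 0,
        g: 1,
        avl: 0,
    }
}
```

With the type values from earlier, the flat code segment packs to 0x00CF9B000000FFFF and the flat data segment to 0x00CF93000000FFFF. An all-zero `Segment` packs to 0, which is the null descriptor expected in GDT slot 0.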
Thus far, we have completed the initialization of the necessary components for Protected Mode.
Generating Corresponding Architecture Code
In our previous section on Real Mode, we manually translated a section of code into binary using nasm. We now need a piece of code that can run in Protected Mode. We still use nasm to assemble the following code section:
Besides writing by hand or using nasm, we can also perform this assembling operation with pwntools (which those who have played CTF should be very familiar with):
Thus, we correctly completed computation and memory operation under Protected Mode (without paging enabled). Of course, you can also try writing the data segment’s GDT entry the same as the code segment’s, which will encounter an error because the segment does not have write permission.
use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};
Typically, on 32-bit x86 without PAE, the page table is two levels deep for 4K pages (a page directory and a page table). However, if you’re using larger pages (such as 4M), or if PAE is enabled, the number of levels changes. Enabling PAE allows a 32-bit machine to support up to 64 GB of physical memory, but because the linear address space is still 32 bits wide, the maximum memory available to a single process in a flat memory model remains 4GB.
To enable PAE, set bit 5 of the CR4 register (CR4.PAE) to 1, while ensuring that paging is enabled (CR0.PG) and that LME is 0 (IA32_EFER.LME).
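The three bits together decide which paging mode the CPU uses. Below is a minimal sketch of that decision, ignoring 5-level paging; the enum and function names are made up for this example, while the bit positions (CR0.PG bit 31, CR4.PAE bit 5, IA32_EFER.LME bit 8) are architectural.

```rust
const CR0_PG: u64 = 1 << 31;
const CR4_PAE: u64 = 1 << 5;
const EFER_LME: u64 = 1 << 8;

#[derive(Debug, PartialEq, Eq)]
enum PagingMode {
    Disabled,     // CR0.PG = 0
    ThirtyTwoBit, // CR0.PG = 1, CR4.PAE = 0
    Pae,          // CR0.PG = 1, CR4.PAE = 1, EFER.LME = 0
    LongMode,     // CR0.PG = 1, CR4.PAE = 1, EFER.LME = 1
}

fn paging_mode(cr0: u64, cr4: u64, efer: u64) -> PagingMode {
    if cr0 & CR0_PG == 0 {
        PagingMode::Disabled
    } else if cr4 & CR4_PAE == 0 {
        PagingMode::ThirtyTwoBit
    } else if efer & EFER_LME == 0 {
        PagingMode::Pae
    } else {
        PagingMode::LongMode
    }
}
```

A helper like this is handy as a sanity check on the register values before handing them to the vCPU.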
These two figures allow us to compare the structure of the page table before and after PAE is enabled. Note that, besides the addition of a PDPTE level, the PDE has also changed from 32bit to 64bit (which is how it can support a larger physical memory).
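Since every PAE table entry is 64 bits wide, a 4K table holds 512 entries and each index is 9 bits; the top-level table has only 4 entries, indexed by the top 2 bits. The sketch below splits a 32-bit linear address accordingly for 4K pages. The struct and function names are invented for illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
struct PaeIndices {
    pdpt: usize, // bits 31:30, selects one of 4 PDPTEs
    pd: usize,   // bits 29:21, selects one of 512 PDEs
    pt: usize,   // bits 20:12, selects one of 512 PTEs
    offset: u32, // bits 11:0, byte offset inside the 4K page
}

fn split_pae_4k(addr: u32) -> PaeIndices {
    PaeIndices {
        pdpt: ((addr >> 30) & 0x3) as usize,
        pd: ((addr >> 21) & 0x1FF) as usize,
        pt: ((addr >> 12) & 0x1FF) as usize,
        offset: addr & 0xFFF,
    }
}
```

The field widths add up to 2 + 9 + 9 + 12 = 32 bits. With 2M pages the page-table level disappears and the offset grows to 21 bits.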
Initializing the Page Table
Here we choose the most common configuration: 32-bit with PAE enabled, and 4K pages (actually, using 2M pages would be simpler and more efficient in this case, as this process generally doesn’t involve context switching, and there’s only one page table globally). In later experiments with 64-bit, we will try using even larger page sizes.
We will now initialize the page table after PAE is enabled. Refer to the images above; we need to construct the PDPTE Registers, Page Directory, Page Table, and Pages, and initialize some elements for them (otherwise we would encounter tedious page faults at this stage, which is merely a temporary boot environment; the actual page fault handling will be left to the operating system that boots up later).
For more options, refer to the SDM Volume 3, Chapter 4.4.
According to the table above, we need to:
Write the address of the page-directory-pointer table (whose four entries the CPU loads into its PDPTE registers) into the CR3 register.
Initialize the entries of the page-directory-pointer table. Only one of the four entries needs to be filled in; it maps the lowest 1GB of the address space. The page directory it points to is 4K-aligned, so the low 12 bits of that address are zero, and bit 0 (P, present) of the entry is then set to 1.
Initialize the entries in the Page Directory, which holds 512 entries; fill them all out, each covering 2M of address space and pointing to the corresponding physical memory address (the next 4K will be for that page).
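The steps above can be sketched against a plain byte buffer standing in for guest memory. Several things here are assumptions made for the example: the table addresses (`PDPT_ADDR`, `PD_ADDR`), the helper names, and the choice to map each page-directory entry directly as a 2M page (PS = 1) so that no page-table level is needed. A 4K-page layout would instead point each entry at a page table.

```rust
const PDPT_ADDR: u64 = 0x1000; // hypothetical location of the PDPT
const PD_ADDR: u64 = 0x2000;   // hypothetical location of the page directory

const ENTRY_P: u64 = 1 << 0;  // present
const ENTRY_RW: u64 = 1 << 1; // writable
const PDE_PS: u64 = 1 << 7;   // entry maps a 2M page directly

fn write_u64(mem: &mut [u8], addr: u64, value: u64) {
    let a = addr as usize;
    mem[a..a + 8].copy_from_slice(&value.to_le_bytes());
}

fn read_u64(mem: &[u8], addr: u64) -> u64 {
    let a = addr as usize;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&mem[a..a + 8]);
    u64::from_le_bytes(bytes)
}

/// Identity-maps the lowest 1 GiB and returns the value to load into CR3.
fn build_pae_identity_map(mem: &mut [u8]) -> u64 {
    // One PDPTE: points at the page directory, present bit only
    // (bits 1 and 2 of a PAE PDPTE are reserved).
    write_u64(mem, PDPT_ADDR, PD_ADDR | ENTRY_P);
    // 512 PDEs, each mapping 2 MiB of physical memory onto itself.
    for i in 0..512u64 {
        write_u64(mem, PD_ADDR + i * 8, (i << 21) | ENTRY_P | ENTRY_RW | PDE_PS);
    }
    PDPT_ADDR
}
```

In the real program the writes would go through the guest memory object instead of a `Vec<u8>`, and the returned address would be stored in `sregs.cr3`.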
We set CR3 to the address of the page-directory-pointer table (the table that holds the PDPTEs), enable paging and PAE (Physical Address Extension), and then configure the registers accordingly.
use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Address, Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};
To enter 64-bit mode, we need to enable PE (Protection Enable), PG (Paging), and LME (Long Mode Enable), with PAE still enabled in CR4. If done correctly, we can verify that LMA (Long Mode Active) is set to 1 by reading it back.
LME and LMA are two bits in the IA32_EFER register (EFER stands for Extended Feature Enable Register).
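Put together, the register changes are only a handful of bit operations. The sketch below works on plain integers; `ControlRegs` and the function names are invented for this example. On real hardware the CPU sets LMA itself once paging is turned on with LME set. When the state is loaded directly through KVM rather than by executing guest code, the VMM may need to set LMA in EFER explicitly, which is an assumption worth checking against your setup.

```rust
const CR0_PE: u64 = 1 << 0;
const CR0_PG: u64 = 1 << 31;
const CR4_PAE: u64 = 1 << 5;
const EFER_LME: u64 = 1 << 8;
const EFER_LMA: u64 = 1 << 10;

#[derive(Clone, Copy)]
struct ControlRegs {
    cr0: u64,
    cr4: u64,
    efer: u64,
}

/// Sets the bits required for long mode, leaving other bits untouched.
fn request_long_mode(regs: ControlRegs) -> ControlRegs {
    ControlRegs {
        cr0: regs.cr0 | CR0_PE | CR0_PG,
        cr4: regs.cr4 | CR4_PAE,
        efer: regs.efer | EFER_LME,
    }
}

/// LMA is the logical AND of EFER.LME and CR0.PG.
fn expected_lma(regs: &ControlRegs) -> bool {
    regs.efer & EFER_LME != 0 && regs.cr0 & CR0_PG != 0
}

fn long_mode_active(efer: u64) -> bool {
    efer & EFER_LMA != 0
}
```

`expected_lma` tells you what the LMA bit should read as after the switch, and `long_mode_active` is the check to run on the EFER value read back from the vCPU.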
Merely modifying the register is quite simple; what is more complex is the page table. The structure of the page table also changes after entering 64-bit mode.
Referring to Figure 4-1, here we need 4-level paging. We have the option to choose between 4K, 2M, and 1G pages.
Based on the above figure, we opt for 2M pages (corresponding to Figure 4-9).
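With 2M pages, the 4-level walk stops at the page directory: PML4 entry, then PDPT entry, then a PDE with the PS bit set. Here is a minimal sketch that identity-maps the lowest 1 GiB into a byte buffer standing in for guest memory. The table addresses and all function names are assumptions made for this example.

```rust
const PML4_ADDR: u64 = 0x1000; // hypothetical table locations
const PDPT_ADDR: u64 = 0x2000;
const PD_ADDR: u64 = 0x3000;

const ENTRY_P: u64 = 1 << 0;  // present
const ENTRY_RW: u64 = 1 << 1; // writable
const PDE_PS: u64 = 1 << 7;   // PDE maps a 2M page directly

fn write_u64(mem: &mut [u8], addr: u64, value: u64) {
    let a = addr as usize;
    mem[a..a + 8].copy_from_slice(&value.to_le_bytes());
}

fn read_u64(mem: &[u8], addr: u64) -> u64 {
    let a = addr as usize;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&mem[a..a + 8]);
    u64::from_le_bytes(bytes)
}

/// Builds PML4 -> PDPT -> PD for the lowest 1 GiB using 2M pages and
/// returns the value to load into CR3.
fn build_long_mode_identity_map(mem: &mut [u8]) -> u64 {
    write_u64(mem, PML4_ADDR, PDPT_ADDR | ENTRY_P | ENTRY_RW);
    write_u64(mem, PDPT_ADDR, PD_ADDR | ENTRY_P | ENTRY_RW);
    for i in 0..512u64 {
        write_u64(mem, PD_ADDR + i * 8, (i << 21) | ENTRY_P | ENTRY_RW | PDE_PS);
    }
    PML4_ADDR
}
```

Compared with a 4K layout this needs only three 4K tables for the whole gigabyte, which is why 2M pages are convenient for a temporary boot environment.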
In our GDT entries, we initially set the L bit (Long Mode) to 0; for the 64-bit code segment, this needs to be changed to 1. Furthermore, in accordance with the manual’s requirement, when the L bit is set, the D bit must be cleared to zero.
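The L and D bits live in the flags nibble of the descriptor (bits 52 to 55: AVL, L, D/B, G). A small sketch can encode the rule that L and D must not both be set for a code segment; the function name and the choice to return `None` for the reserved combination are decisions made for this example.

```rust
/// Flags nibble of a code-segment descriptor, laid out as G, D/B, L, AVL
/// from the most significant bit down. Returns None for L = 1 with D = 1,
/// which is a reserved combination.
fn code_segment_flags(g: bool, db: bool, l: bool, avl: bool) -> Option<u8> {
    if l && db {
        return None;
    }
    Some(((g as u8) << 3) | ((db as u8) << 2) | ((l as u8) << 1) | (avl as u8))
}
```

The 32-bit flat code segment (G = 1, D = 1, L = 0) gives 0xC, while the 64-bit one (G = 1, D = 0, L = 1) gives 0xA.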
use kvm_bindings::{
    kvm_segment, kvm_userspace_memory_region, KVM_MAX_CPUID_ENTRIES, KVM_MEM_LOG_DIRTY_PAGES,
};
use kvm_ioctls::Kvm;
use vm_memory::{Address, Bytes, GuestAddress, GuestMemory, GuestMemoryMmap};