Next Stop - Ihcblog!

Some creations and thoughts sharing | sub site:ihc.im


Rust Runtime Design and Implementation - Design Part 1

This article also has a Chinese version.

This series of articles mainly introduces how to design and implement a Runtime based on io-uring and the thread-per-core model.

Our final Runtime product Monoio is now open source, and you can find it at github.com/bytedance/monoio.

  1. Rust Runtime Design and Implementation - General Introduction
  2. Rust Runtime Design and Implementation - Design Part 1
  3. Rust Runtime Design and Implementation - Design Part 2
  4. Rust Runtime Design and Implementation - Component Part
  5. Rust Runtime Design and Implementation - IO Compatibility Part

This article is the second in the series. Taking Monoio as an example, it discusses the design choices and trade-offs involved in building a Runtime, with references to and comparisons against the designs of Tokio and Glommio.

Monoio

Motivation

During my time at ByteDance, I participated in the development of Mesh Proxy (based on Envoy) and felt that, due to issues with C++, we were forced to adopt very inelegant code organization and design.

Therefore, I explored the possibility of using Linkerd2-proxy (a Mesh Proxy based on Rust + Tokio) to replace the current version. Benchmark data showed that the performance gain in the HTTP scenario was only about 10%; Envoy benchmarks, on the other hand, revealed that a lot of the CPU consumption was on syscalls.

Using Rust’s generics, we can eliminate the runtime overhead from dynamic dispatch abstractions present in C++; for I/O, we considered using io-uring to replace epoll.

Preliminary Research

In the early stages of the project, we compared the performance of several solutions: 1. Tokio, 2. Glommio, 3. bare epoll, 4. bare io-uring. It turned out that bare io-uring did indeed have a performance lead, but Glommio, which is based on io-uring, did not perform as expected. We tried to fork Glommio’s code and optimize it, only to find significant issues with the project itself, such as misinterpretations of the flags used to create urings; moreover, its Task implementation was less performant than Tokio’s.

Building Our Own Wheel

Eventually, we decided to build our own Runtime system to meet our internal needs and deliver extreme performance.

This project was completed jointly by myself and @dyxushuai. In the process of implementation, we referenced many projects such as Tokio and Tokio-uring and tried some of our own designs.

Model Discussion

Different design models have their own suitable application scenarios.

Tokio uses a fair scheduling model, with internal scheduling logic similar to Golang, allowing tasks to be transferred between threads, making the most of multicore performance.

Glommio is another Rust Runtime, implemented based on io-uring, with a scheduling logic that is simpler compared to Tokio and follows a thread-per-core model.

Each model has its pros and cons. The former is more flexible and more broadly applicable, but that flexibility comes at a cost:

  1. Suboptimal scaling on multicore machines.

    In my 1K Echo test (2021-11-26 latest version), Tokio with 4 cores only performed about 2.2 times better than with 1 core. Meanwhile, our own Monoio could maintain near-linear scalability.

    (Figures: benchmark results on 1 core vs 4 cores.)

    The detailed test report can be found here.

  2. The constraints on the Task itself should not be ignored. If a Task can be scheduled across threads, it must be Send, and any state it shares must be Sync. This is a significant limitation for user code.

    As an example, if you want to implement a cache service, under the fair scheduling model, the corresponding cache map would need to be protected with Atomic or Mutex operations to ensure Send + Sync; however, if implemented with the thread-per-core model, using just thread-local storage would suffice. Additionally, nginx and envoy are also based on this type of model.

But the thread-per-core model is not a panacea either. For example, in business systems, different requests may involve different processing logic: some long-lived connections require heavy computation, while others consume almost no CPU. Under this model, load can easily become unbalanced across cores, with one core overloaded and another nearly idle.

Event-driven Model

This discussion focuses mainly on io-uring and epoll.

epoll is just a notification mechanism; the work is ultimately still done by user code issuing syscalls directly, such as read. In scenarios with high-frequency syscalls, the constant switching between user space and kernel space consumes considerable resources. io-uring allows syscalls to be executed asynchronously, which can greatly reduce the number of syscalls, even without enabling SQPOLL.

Here are some issues with io-uring:

  1. Compatibility issues. Cross-platform support aside (io-uring is Linux-only, whereas epoll has counterparts on other platforms that the well-established mio already abstracts over seamlessly), it also requires a sufficiently recent Linux kernel, and performance varies between kernel versions. Large companies tend to run their own patched kernels, so keeping up with backports can also be a headache. And for Mac/Windows users, it complicates the development experience.
  2. Buffer lifecycle issues. io-uring is fully asynchronous; once an Op is pushed onto the SQ, the buffer cannot be moved and must be ensured valid until the syscall is complete or the Cancel Op has been executed. Whether in C/C++ or Rust, this presents a challenge for buffer lifecycle management. epoll does not have this problem because syscalls are user-executed; the buffer can’t be manipulated during the syscall anyway, so it remains valid until the syscall returns.

Lifecycle, IO Interface, and GAT

The previous section mentioned an issue with io-uring: there needs to be a mechanism to ensure a buffer is valid during the execution of an Op.

Consider the following situation:

  1. The user creates a Buffer.
  2. The user obtains a reference to the buffer (whether & or &mut) for read and write operations.
  3. The Runtime returns a Future, but the user immediately drops it.
  4. Now no one holds a reference to the buffer, and the user can drop it directly.
  5. However, the address and length of the buffer have already been submitted to the kernel, and it may be about to be processed or already in process. We can push a CancelOp to attempt to mitigate this, but we cannot ensure the CancelOp will be consumed immediately.
  6. The kernel could now be operating on a section of memory in error, and if this memory is reused by the user program, it could lead to memory corruption.

If Rust implemented async Drop, this could still work: we could use the buffer as normal through a reference, and await cancellation on drop. But since we cannot ensure the kernel's operations on the buffer are canceled in time, it is difficult to guarantee the buffer's validity without taking ownership.

This presents a new challenge for the IO interface: conventional IO interfaces only require &self or &mut self, but we need to take full ownership.

We referenced tokio-uring for this part of the design and defined the interface as a trait. Defining this trait requires enabling GAT (Generic Associated Types).

/// AsyncReadRent: async read with ownership of a buffer
pub trait AsyncReadRent {
    /// The future of read Result<size, buffer>
    type ReadFuture<'a, T>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;
    /// The future of readv Result<size, buffer>
    type ReadvFuture<'a, T>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;

    /// Same as read(2)
    fn read<T: IoBufMut>(&self, buf: T) -> Self::ReadFuture<'_, T>;
    /// Same as readv(2)
    fn readv<T: IoVecBufMut>(&self, buf: T) -> Self::ReadvFuture<'_, T>;
}

/// AsyncWriteRent: async write with ownership of a buffer
pub trait AsyncWriteRent {
    /// The future of write Result<size, buffer>
    type WriteFuture<'a, T>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;
    /// The future of writev Result<size, buffer>
    type WritevFuture<'a, T>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;

    /// Same as write(2)
    fn write<T: IoBuf>(&self, buf: T) -> Self::WriteFuture<'_, T>;

    /// Same as writev(2)
    fn writev<T: IoVecBuf>(&self, buf_vec: T) -> Self::WritevFuture<'_, T>;
}

Similar to Tokio’s approach, we also provide an Ext with a default implementation:

pub trait AsyncReadRentExt<T: 'static> {
    /// The future of Result<size, buffer>
    type Future<'a>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;

    /// Read until buf capacity is fulfilled
    fn read_exact(&self, buf: T) -> <Self as AsyncReadRentExt<T>>::Future<'_>;
}

impl<A, T> AsyncReadRentExt<T> for A
where
    A: AsyncReadRent,
    T: 'static + IoBufMut,
{
    type Future<'a>
    where
        A: 'a,
    = impl Future<Output = BufResult<usize, T>>;

    fn read_exact(&self, mut buf: T) -> Self::Future<'_> {
        async move {
            let len = buf.bytes_total();
            let mut read = 0;
            while read < len {
                let slice = buf.slice(read..len);
                let (r, slice_) = self.read(slice).await;
                buf = slice_.into_inner();
                match r {
                    Ok(r) => {
                        read += r;
                        if r == 0 {
                            return (Err(std::io::ErrorKind::UnexpectedEof.into()), buf);
                        }
                    }
                    Err(e) => return (Err(e), buf),
                }
            }
            (Ok(read), buf)
        }
    }
}

pub trait AsyncWriteRentExt<T: 'static> {
    /// The future of Result<size, buffer>
    type Future<'a>: Future<Output = BufResult<usize, T>>
    where
        Self: 'a,
        T: 'a;

    /// Write all
    fn write_all(&self, buf: T) -> <Self as AsyncWriteRentExt<T>>::Future<'_>;
}

impl<A, T> AsyncWriteRentExt<T> for A
where
    A: AsyncWriteRent,
    T: 'static + IoBuf,
{
    type Future<'a>
    where
        A: 'a,
    = impl Future<Output = BufResult<usize, T>>;

    fn write_all(&self, mut buf: T) -> Self::Future<'_> {
        async move {
            let len = buf.bytes_init();
            let mut written = 0;
            while written < len {
                let slice = buf.slice(written..len);
                let (r, slice_) = self.write(slice).await;
                buf = slice_.into_inner();
                match r {
                    Ok(r) => {
                        written += r;
                        if r == 0 {
                            return (Err(std::io::ErrorKind::WriteZero.into()), buf);
                        }
                    }
                    Err(e) => return (Err(e), buf),
                }
            }
            (Ok(written), buf)
        }
    }
}

Turning on GAT makes many things easier for us.

In our trait, we defined lifetimes for the associated type Future, allowing it to capture references to &self instead of having to Clone self or separately define a structure with lifetime annotations.

Defining Future

How do we define a Future? Normally, we define a structure and implement the Future trait for it. The key here is to implement the poll function, which takes a Context and returns a Poll synchronously. To implement poll, we usually have to manually manage the state, which can be challenging and prone to error.

At this point, you might ask, can’t we just use async and await? In fact, an async block does generate a state machine, more or less the same as one you might write manually. But the problem is that the generated structure is nameless, so if you want to use the type of this Future as an associated type, it becomes difficult. In this case, you can enable type_alias_impl_trait and use an opaque type as an associated type; alternatively, you can incur some runtime overhead by using Box<dyn Future>.

Producing Future

Besides using async blocks, the conventional way is to manually construct a structure that implements Future. There are two types of such Futures:

  1. Futures that own their state and don’t require additional lifetime annotations. These Futures are not associated with any other structures, and if you need them to depend on some non-Copy data, you might consider using shared ownership structures like Rc or Arc.
  2. Futures that hold a reference and have lifetime annotations on the structure itself. For example, in Tokio’s AsyncReadExt, the signature for read is fn read<'a>(&'a mut self, buf: &'a mut [u8]) -> Read<'a, Self>. The constructed Read<'a, Self> captures references to both self and buf, offering no runtime overhead compared to shared ownership. However, this type of Future is difficult to use as a trait’s type alias unless you enable generic_associated_types and type_alias_impl_trait and then use an opaque type.

Defining IO Trait

Typically, our IO interfaces are defined in the form of poll functions (such as poll_read), and any wrappers for IO should be based on this trait (let’s temporarily call it the base trait).

However, to provide a user-friendly interface, an additional Ext trait is usually offered, used primarily through its default implementation. The Ext trait is automatically implemented for every type that implements the base trait. For example, read returns a Future, and awaiting that future is clearly easier than manually managing state and calling poll.

So why is the base trait defined in the form of poll? Can’t we directly define a Future? Because the poll form is synchronous, it does not need to capture anything, which makes it easy to define and fairly universal. If we define a Future directly, then we either set the return type of the Future in stone like the Ext trait (which would prevent wrapping and custom implementations, defeating the purpose of defining a trait), or we make the Future type an associated type (as mentioned earlier, without enabling GAT, we can’t have lifetimes; it has to be static).

In summary, in the current stable version of Rust, only a poll-form base trait combined with a future-form Ext trait can be used to define IO interfaces.

Once GAT is enabled, we can take a different route: define the Future directly as a lifetime-parameterized associated type on the trait, letting it capture self.

Is this a silver bullet? No. The only problem is that once you start using this GAT pattern, you have to always use it. If you keep flipping back and forth between the poll form and the GAT form, it will be quite painful. While it’s possible to maintain a state based on a poll form interface and implement a Future (the simplest example being poll_fn), the reverse is hard: it’s difficult to store a Future with a lifetime. Even though it’s possible to do it using some unsafe hacks (which come with costs), there are many limitations and it’s not recommended. If you really want to give it a try, you can refer to monoio-compat, which implements Tokio’s AsyncRead and AsyncWrite based on a GAT future.

Waker and vtable

TODO

Buffer Management

Buffer management is based on the design of tokio-uring.

The user hands ownership of the buffer to the Runtime, and ownership is returned when the Future completes.
Global state is maintained in a Slab: if an Op is dropped before completion, ownership of its buffer is transferred into this state, and the buffer is truly destroyed later, during CQE handling; on successful completion, ownership is handed back to the user.
