Some interesting attempts at eBPF with Rust

This article also has a Chinese version.

I’ve recently made two attempts with eBPF and found it quite interesting, so I thought I’d share.

What is eBPF?

eBPF extends classic BPF with a custom, RISC-like instruction set and runs inside an in-kernel virtual machine. When an eBPF program is loaded, the kernel first runs it through a verifier; once verification succeeds, the program can be executed efficiently via just-in-time compilation.

In the past, BPF was commonly used for perf and trace. Now, with eBPF, we can perform some custom business logic. Compared to inserting logic as a user-space proxy, doing so in kernel space has two benefits: it avoids the overhead of context switching between kernel and user space, and it can take control of data packets earlier, potentially avoiding protocol stack overhead.

Using eBPF to Filter DNS Pollution

https://github.com/ihciah/clean-dns-bpf

During the National Day holiday, out of boredom, I came across a tweet that caught my interest: someone had managed to filter DNS pollution with hand-written BPF bytecode. I am clearly not that kind of craftsman; I wanted to build something both highly readable and functional.

The GFW pollutes DNS by sending fake responses. All we need to do is identify and discard these poisoned responses to get the correct DNS resolution. One way to implement this custom logic is to create a proxy. Many people have done this, so I’m not going to bother. Another method is to embed the code into the kernel.

Embedding code into the kernel isn't easy. You can patch and compile your own kernel, write a kernel module, or use BPF/eBPF. The first two options are too cumbersome, carry a risk of crashing the machine, and I don't have the energy to keep up with Linux kernel updates, so eBPF is the only viable choice. It does have limitations, but those same constraints are what make eBPF universal and portable, and because it runs inside the kernel VM it cannot crash the kernel.

Since I don’t know C, I gave Rust + RedBPF a try, and it seemed feasible. Manipulating XDP could theoretically yield the best performance since it’s the earliest point we can intervene. Additionally, for some network cards, the execution can even be offloaded to hardware.

Then I fired up Wireshark and queried a few domain names through 8.8.8.8 from behind the Great Firewall.

Taking twitter.com as an example: when requesting the A record for twitter.com from 8.8.8.8, a normal response returns 2 answers (1 query, 2 answers), while the GFW injects 2 fake responses, each carrying only 1 answer. In the fake responses, one packet has IP Identification = 0x0000 and the other has IP Flags = 0x40 (Don't Fragment); in a normal response the IP ID is non-zero and IP Flags = 0.

We just need to drop packets that match these characteristics. We can then verify that twitter.com can be resolved correctly (as well as fb and other non-Google services).

However, for google.com, this method did not perform as expected. Normal responses have DNS Flags = 0x8180, while the fake responses have Flags 0x8590 (with additional markings for Authoritative and Answer Authenticated), 0x85a0 (Authoritative and Non-authenticated data: Acceptable), and 0x8580 (Authoritative); additionally, normal responses reuse the name in Query with c00c (0b11 + offset) in the Answer section, but the fake responses write it out again.

To avoid false positives, we first let packets with multiple Answers pass (the observed fake responses only ever contained a single Answer). Then, if a packet is marked Authoritative but has Authority RRs = 0 (I'm not sure I understand this field correctly), we drop it. The c00c characteristic could also serve as a basis for judgment, but it would require more parsing and computation, so it isn't currently used.

With these filters in place, we can now correctly obtain the A record for google.com~

At this point, we can verify that Google’s domains can also resolve correctly.

Making It More Usable

Our tool now works, but it isn't user-friendly enough: users have to run my executable as a daemon, and it is that executable that loads the eBPF program.

But injecting eBPF logic is all we need to do! There’s no need to run a daemon. After some research, I found that the ip command can be used to directly load eBPF onto network interfaces. However, the elf file I compiled just wouldn’t load properly. After some digging, I found someone with a similar issue: you need to rename the main section to ‘prog’ and remove some sections. And while I was at it, I removed some extraneous debug information:

llvm-objcopy \
--remove-section .debug_loc \
--remove-section .debug_info \
--remove-section .debug_ranges \
--remove-section .BTF.ext \
--remove-section .eh_frame \
--remove-section .debug_line \
--remove-section .debug_pubnames \
--remove-section .debug_pubtypes \
--remove-section .debug_abbrev \
--remove-section .debug_str \
--remove-section .text \
--remove-section .BTF \
--remove-section .symtab \
--remove-section .rel.BTF \
--rename-section xdp/clean_dns=prog \
./clean-dns.elf

After that, you can load it with ip link set dev eth0 xdp obj ./clean-dns.elf.

After the project was released, many people reported in the issues that they couldn't load it, which puzzled me since I couldn't reproduce the problem. Later, a kind soul submitted a PR explaining that not every network card can handle the xdpdrv mode; it worked after manually specifying the xdpgeneric mode (ip link set dev eth0 xdpgeneric obj ./clean-dns.elf).

Using eBPF for Data Forwarding

https://github.com/ihciah/socks5-forwarder

I had a personal requirement to allow a client that does not support SOCKS5 proxy to connect to a fixed remote server via a SOCKS5 proxy.

Meeting the requirement is easy: build it on Tokio and relay the data between the two connections, and there's an existing SOCKS5 component to handle the handshake. After about half an hour I had the code written (found here), and it ran smoothly.

Although this was by no means a performance bottleneck, I simply wasn't content with running a proxy; it felt like a waste of computing resources. Our only real job is to assist with the handshake, so after the handshake is done, couldn't we hand the copying off to the kernel? Copying data back and forth and paying the context-switch overhead seemed entirely unnecessary. Three months after finishing the proxy solution, I began experimenting with new approaches.

Linux offers splice, which allows zero-copy data transfer between a file descriptor and a pipe. In a similar project, I also found an implementation that leverages splice and a pipe for zero-copy forwarding.

With eBPF’s vast capabilities, it ought to be able to do the job, and even do it better (for example, if I want to apply simple encryption to subsequent data, that’s something splice couldn’t handle).

I wanted the kernel to recognize the data for a particular socket and directly redirect it to another socket. In this case, userspace code manipulates a sockmap and a hashmap, which are shared between the userspace code and the BPF code.

Thus, when we need to hijack a socket and forward it directly, we only need to inform the kernel about the ip:port and the destination socket for the transfer through these two maps. The BPF code, when processing data, can then identify the connection and redirect it directly.

#[stream_verdict]
fn verdict(skb: SkBuff) -> SkAction {
    // Read the connection 4-tuple straight out of the __sk_buff.
    let (ip, port, lip, lport) = unsafe {
        let remote_ip_addr = (skb.skb as usize + offset_of!(__sk_buff, remote_ip4)) as *const u32;
        let remote_port_addr = (skb.skb as usize + offset_of!(__sk_buff, remote_port)) as *const u32;
        let local_ip_addr = (skb.skb as usize + offset_of!(__sk_buff, local_ip4)) as *const u32;
        let local_port_addr = (skb.skb as usize + offset_of!(__sk_buff, local_port)) as *const u32;
        (
            ptr::read(remote_ip_addr),
            ptr::read(remote_port_addr),
            ptr::read(local_ip_addr),
            ptr::read(local_port_addr),
        )
    };

    // First try to match the connection by its remote address.
    let key = IdxMapKey { addr: ip, port };
    if let Some(idx) = unsafe { IDX_MAP.get(&key) } {
        return match unsafe { SOCKMAP.redirect(skb.skb as *mut _, *idx) } {
            Ok(_) => SkAction::Pass,
            Err(_) => SkAction::Drop,
        };
    }

    // Fall back to matching by the local address.
    let key = IdxMapKey { addr: lip, port: lport };
    if let Some(idx) = unsafe { IDX_MAP.get(&key) } {
        return match unsafe { SOCKMAP.redirect(skb.skb as *mut _, *idx) } {
            Ok(_) => SkAction::Pass,
            Err(_) => SkAction::Drop,
        };
    }
    SkAction::Pass
}

Subsequently, in user space, we only need to manipulate the map to control the BPF:

async fn bpf_relay<O, IR, IW, OR, OW>(
    bpf: Arc<Mutex<O>>,
    in_conn_info: ConnInfo<IR, IW>,
    out_conn_info: ConnInfo<OR, OW>,
) -> anyhow::Result<()>
where
    O: BPFOperator<K = IdxMapKey>,
    IR: AsyncRead + Unpin,
    IW: AsyncWrite + Unpin,
    OR: AsyncRead + Unpin,
    OW: AsyncWrite + Unpin,
{
    // used for delete from idx_map and sockmap
    let mut inbound_addr_opt = None;
    let mut outbound_addr_opt = None;

    // add socket and key to idx_map and sockmap for ipv4
    // Note: Local port is stored in host byte order while remote port is in network byte order.
    // https://github.com/torvalds/linux/blob/v5.10/include/uapi/linux/bpf.h#L4110
    if let (V4(in_addr), V4(out_addr)) = (in_conn_info.addr, out_conn_info.addr) {
        let inbound_addr = IdxMapKey {
            addr: u32::to_be(u32::from(in_addr.ip().to_owned())),
            port: u32::to_be(in_addr.port().into()),
        };
        let outbound_addr = IdxMapKey {
            addr: u32::to_be(u32::from(out_addr.ip().to_owned())),
            port: out_addr.port().into(),
        };
        inbound_addr_opt = Some(inbound_addr);
        outbound_addr_opt = Some(outbound_addr);
        let mut guard = bpf.lock().unwrap();
        let _ = guard.add(out_conn_info.fd, inbound_addr);
        let _ = guard.add(in_conn_info.fd, outbound_addr);
    }

    // block on copy data
    // Note: Here we copy bidirectional manually, remove from map ASAP to
    // avoid outbound port reuse and packet mis-redirected.
    tracing::info!("Relay started");

    let (mut ri, mut wi) = (in_conn_info.read_half, in_conn_info.write_half);
    let (mut ro, mut wo) = (out_conn_info.read_half, out_conn_info.write_half);
    let client_to_server = async {
        let _ = tokio::io::copy(&mut ri, &mut wo).await;
        tracing::info!("Relay inbound -> outbound finished");
        let _ = wo.shutdown().await;
        if let Some(addr) = inbound_addr_opt {
            let _ = bpf.lock().unwrap().delete(addr);
        }
    };

    let server_to_client = async {
        let _ = tokio::io::copy(&mut ro, &mut wi).await;
        tracing::info!("Relay outbound -> inbound finished");
        let _ = wi.shutdown().await;
        if let Some(addr) = outbound_addr_opt {
            let _ = bpf.lock().unwrap().delete(addr);
        }
    };

    tokio::join!(client_to_server, server_to_client);
    tracing::info!("Relay finished");

    Ok::<(), anyhow::Error>(())
}

pub(crate) trait BPFOperator {
    type K;

    fn add(&mut self, fd: RawFd, key: Self::K) -> Result<(), Error>;
    fn delete(&mut self, key: Self::K) -> Result<(), Error>;
}

pub struct Shared<'a, K>
where
    K: Clone,
{
    sockmap: SockMap<'a>,
    idx_map: HashMap<'a, K, u32>,

    idx_slab: slab::Slab<()>,
}

impl<'a, K> Shared<'a, K>
where
    K: Clone,
{
    pub fn new(sockmap: SockMap<'a>, idx_map: HashMap<'a, K, u32>, capacity: usize) -> Self {
        Self {
            sockmap,
            idx_map,
            idx_slab: slab::Slab::with_capacity(capacity),
        }
    }
}

impl<'a, KS> BPFOperator for Shared<'a, KS>
where
    KS: Clone,
{
    type K = KS;

    fn add(&mut self, fd: RawFd, key: Self::K) -> Result<(), Error> {
        let idx = self.idx_slab.insert(()) as u32;
        self.idx_map.set(key, idx)?;
        self.sockmap.set(idx, fd)
    }

    fn delete(&mut self, key: Self::K) -> Result<(), Error> {
        if let Some(idx) = self.idx_map.get(key.clone()) {
            self.idx_slab.remove(idx as usize);
            self.idx_map.delete(key);
            self.sockmap.delete(idx)
        } else {
            Ok(())
        }
    }
}

However, it seems that this tool has some compatibility issues—it depends on the kernel’s support for BTF (BPF Type Format). It worked on Arch with kernel version 5.14 and on Debian with 5.10 after being compiled, but it didn’t work on Debian with kernel version 5.4—I suspect this is due to missing BTF support or changes in the signature of helper functions.

This setup works for Layer 4 (L4) proxying, and it should also be feasible for Layer 7 (L7): the user-space code handles the headers, reads out the length of the body, and informs BPF through a map, after which the kernel forwards the body. The same job could be done with splice, but eBPF might perform better here, since the verdict on the stream happens earlier (the verdict program operates on IP packets directly).

However, from the standpoint of syscall count, manipulating the map is also a syscall. Compared to the splice solution, which requires 3 syscalls (1 to create the Pipe and 2 for splice operations), it doesn’t offer much of an advantage. Therefore, it seems that this setup only makes sense for long-duration forwarding and large packet forwarding.

The most frustrating part of working on this was discovering that local_port and remote_port have different byte orders, and there was no documentation available. I had to dig through kernel code to find this out—a humbling experience.

Conclusion

This article briefly introduced my two simple experiments with eBPF. For more information on which BPF hook points and types are available within Linux, you can refer to this article.
