This article also has a Chinese version.
A year ago, I released an open-source project called Rust2Go (related blog: Design and Implementation of a Rust-Go FFI Framework), which provides high-performance asynchronous and synchronous call support from Rust to Go. This project serves as the foundation for several community projects and multiple internal projects within my company. I’ve continued optimizing its performance and developing new features.
I’ll be speaking on this topic at Rust Asia Conf 2025—everyone interested is warmly welcome to join!
Recently, I explored CGO-related topics and, based on a newly developed high-performance CGO mechanism, added support in Rust2Go for actively invoking Rust from Go. This article focuses primarily on the former: the high-performance CGO mechanism itself.
Note: This article is not limited to Rust. It is applicable to all Go cross-language projects. A corresponding repository and example are also provided. Users with such needs are welcome to adopt and use it.
This article will proceed in the following order:
- Introduce the principles of CGO calls and their performance issues, which are the main targets of optimization;
- Show how to optimize CGO calls using the simplest assembly techniques;
- Highlight stack space issues and explain how switching to the G0 stack can resolve them;
- Introduce Async Preemption and how to block it to ensure the G0 stack remains unpolluted;
- Discuss the optimization results and application scenarios.
CGO Calls
Through CGO, Golang can both call foreign code and be called by it, following the C calling convention. Regardless of the direction of the logical call, both capabilities are involved: one is used to initiate the call, and the other to pass results back.
Unlike typical FFI in native languages, Golang code execution relies on the runtime. Therefore, calling Go functions from other languages is essentially more of a task dispatch than a direct FFI call. This introduces significant overhead, though much of it is necessary—for example, contention costs involved in managing the task queue, and cross-thread overhead incurred when waking up Go threads. This part of the implementation depends on the internals of the Go runtime and is relatively difficult to optimize.
On the other hand, calls initiated from Golang are much easier to optimize. The optimizations discussed in this article are primarily targeted at this case.
Let’s start with a simple function call:
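A minimal version of such an example looks roughly like the following; the C function name `dummy` is illustrative:

```go
package main

/*
// A dummy C function with no parameters and no return value.
static void dummy() {}
*/
import "C"

func main() {
	// This call crosses the Go/C boundary through CGO.
	C.dummy()
}
```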
This code includes a dummy C function implementation in the cgo preamble, which is then invoked in the main function using the `C.` prefix.
Normally, such code can be executed at very low cost in native languages. Whether it’s a C function or something else, the result is binary machine instructions—so the compiler simply needs to recognize and align with the target ABI. In this example, since there are no parameters or return values, the cost of calling convention conversion should be minimal.
However, if you benchmark this process, you’ll find that it takes nearly 30ns to complete—an overhead that’s relatively expensive.
Assembly Optimization
We have observed the relatively high overhead of CGO calls and recognized that, at their core, they are just binary instructions. So why not directly insert assembly to initiate the call?
In fact, projects like fastcgo and rustgo attempted this approach as far back as seven years ago. However, their code no longer works with more recent versions of Golang.
If we ignore other factors for a moment, all we need to do is write a bit of assembly to convert parameters from the Go ABI to the C ABI, and then execute a CALL instruction.
In fact, this is not sufficient—we’ll explain why later.
Let’s try this approach from scratch!
Golang ABI
When it comes to assembly calls, the first thing we need to pay attention to is the ABI. An ABI is a specification that defines the interface between two binary modules—it dictates how function parameters and return values are passed, how registers and the stack are used, and more.
In early versions of Golang, there was no explicit or public ABI. In 2018, the team recognized this issue and formalized the existing behavior as `ABI0`, while also introducing a new standard called `ABIInternal`. `ABIInternal` is designed as a rolling internal ABI standard, whereas `ABI0` and any future `ABI{n}` are considered snapshots of this internal ABI.
Link to the proposal: https://github.com/golang/proposal/blob/master/design/27539-internal-abi.md
Using the new ABI can bring potential performance improvements. For example, under ABI0, all parameters and return values are passed via the stack, which is significantly slower than passing through registers. Additionally, under ABI0, all registers are considered clobbered during a CALL, meaning that even if the callee does not modify a particular register, the caller must still save and restore it—an entirely unnecessary overhead in some cases.
In Golang, handwritten assembly must follow `ABI0`.
Unfortunately, both the Plan 9 and Go ABIs prevent us from passing arguments via registers or manually marking clobbered registers to further squeeze out performance. The good news is that our core logic isn’t complex, so even the simplest implementation is sufficient.
Further reading: If you’re interested in clobbered registers, here is an interesting example. It manually marks clobbered registers to delegate register operations during stackful coroutine switches to the compiler, thereby reducing unnecessary register saves and restores.
Minimal ASM
Below, I’ll use an example with two input parameters to implement the conversion from the Go ABI to the System V ABI. The main focus here is on argument passing: we need to read the parameters from the stack according to the Go ABI, and then write them into the designated registers as specified by the System V ABI.
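The two files below sketch one way to write this; the exported name `CallFuncP2`, the package name, and the explicit 16-byte stack alignment are illustrative additions rather than the exact Rust2Go code.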
`amd64.go`:
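```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncP2 calls fn(arg0, arg1) following the System V calling
// convention. It is implemented in amd64.s.
//
//go:noescape
func CallFuncP2(fn, arg0, arg1 unsafe.Pointer)
```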
`amd64.s`:
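```asm
#include "textflag.h"

// func CallFuncP2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncP2(SB), NOSPLIT|NOFRAME, $0-24
	MOVQ fn+0(FP), AX    // target function pointer
	MOVQ arg0+8(FP), DI  // 1st argument register in the System V ABI
	MOVQ arg1+16(FP), SI // 2nd argument register in the System V ABI
	MOVQ SP, R12         // remember the Go SP (R12 is callee-saved in C)
	ANDQ $~15, SP        // 16-byte alignment expected by the C ABI
	CALL AX
	MOVQ R12, SP         // restore the Go SP
	RET
```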
This piece of code reads the parameters from the stack and stores them into the `RDI` and `RSI` registers according to the System V ABI. It then loads the target function address into `RAX` and executes `CALL AX`.
PLAN9 assembly requires adherence to certain conventions. Specifically:
- NOSPLIT prevents the function from being instrumented with stack size checks.
- NOPTR indicates that the function does not contain pointers, so the garbage collector does not need to scan it.
- NOFRAME avoids stack frame allocation (which requires declaring a stack size of 0).
- $0 declares that the function requires 0 bytes of stack space.
PLAN9 introduces the concept of pseudo-registers. For example, although parameters are actually accessed via SP with a compile-time-determined offset, in assembly they are written as `symbol+offset(FP)`. Here, `FP` is a pseudo-register; `SB`, `PC`, and `TLS` are pseudo-registers as well.
If we benchmark this approach, we can observe that it reduces the per-call CGO overhead from nearly 30ns to around 1ns!
Stack Switching
At this point, you might be wondering: if it’s really this simple, why is the CGO implementation so inefficient? In fact, the previous implementation does have issues—one of them being the stack size problem.
Stack Probing vs Manual Stack Switching
Golang allocates a small initial stack on the heap for each goroutine and uses compiler-inserted stack size check code (probes) at the beginning of functions. When the available stack is insufficient, the runtime automatically grows and relocates the stack to ensure that the function always has enough space to execute.
However, such stack probing cannot be implemented from a foreign language. Even if the probe check could somehow be replicated, the foreign language’s pointer tagging and memory management would need to fully align with Go’s, in order for the Go runtime to correctly update pointer targets during stack growth. This is nearly impossible—and provides no real benefit.
If we ignore the stack size problem, then the only option is for external code to allocate, manage, and switch stacks manually. This approach avoids consuming any goroutine’s stack space, thus preventing stack overflows. However, this places a huge burden on the foreign-side implementer.
Switching to the G0 Stack
Let’s think about this: when a goroutine’s stack is insufficient, the Go runtime needs to manage memory and perform stack growth—does it do all this without using any stack space? Where does that stack come from?
The answer is: the G0 stack.
Golang uses the GMP model—G for Goroutine, M for Thread, and P for logical Processor. G0 is the first goroutine created when each M (thread) starts and is used to run scheduling-related code. The G0 stack resides in the thread stack. Its size is smaller than the full thread stack, but larger than a typical goroutine stack.
```asm
MOVD $runtime·g0(SB), g
```
This assembly code allocates 64KB of space within the thread stack to serve as the G0 stack.
64KB is considered a relatively large stack size—enough to run most regular functions. At the beginning of our assembly function, we can switch the SP (stack pointer) register to point to the G0 stack address.
The relevant structure definitions can be found in runtime2.go; we need their field offsets in order to reach these fields from assembly.
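Abridged to the fields we need (layout from runtime2.go; the offsets in the comments are for 64-bit platforms and can change between Go versions):

```go
type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack   // offset 0x00 (16 bytes)
	stackguard0 uintptr // 0x10
	stackguard1 uintptr // 0x18
	_panic      *_panic // 0x20
	_defer      *_defer // 0x28
	m           *m      // 0x30
	sched       gobuf   // 0x38
	// ...
}

type m struct {
	g0 *g // 0x00: goroutine with scheduling stack
	// ...
}

type gobuf struct {
	sp uintptr // 0x00
	// ...
}
```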
We can see that by accessing offset `0x30` from the `g` pointer, we obtain the `m` pointer; from offset `0x0` of `m`, we get the `g0` pointer; and from offset `0x38` of `g0`, we retrieve the gobuf, whose first field is the `SP`.
At this point, we can write the assembly code to switch to the `G0` stack:
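Again a sketch rather than the exact project code: it reuses the `CallFuncG0P{n}` naming pattern of the asmcall package mentioned later, and hard-codes the offsets derived above, which are Go-version-dependent.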
`amd64.go`:
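```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncG0P2 calls fn(arg0, arg1) on the G0 stack.
// Implemented in amd64.s.
//
//go:noescape
func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
```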
`amd64.s`:
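```asm
#include "textflag.h"

// func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncG0P2(SB), NOSPLIT|NOFRAME, $0-24
	// Load the arguments while FP still refers to the goroutine stack.
	MOVQ fn+0(FP), AX
	MOVQ arg0+8(FP), DI
	MOVQ arg1+16(FP), SI

	// Save the current SP, then walk g -> m -> g0 -> sched.sp using the
	// offsets derived above (0x30, 0x0, 0x38); the TLS idiom mirrors the
	// runtime's get_tls/g macros from go_tls.h.
	MOVQ SP, R8
	MOVQ TLS, CX
	MOVQ 0(CX)(TLS*1), R9 // R9 = current g
	MOVQ 0x30(R9), R9     // R9 = g.m
	MOVQ 0x0(R9), R9      // R9 = m.g0
	MOVQ 0x38(R9), SP     // SP = g0.sched.sp

	// Align for the C ABI and keep the old SP on the G0 stack.
	ANDQ $~15, SP
	SUBQ $16, SP
	MOVQ R8, 0(SP)

	CALL AX

	// Switch back to the goroutine stack and return.
	MOVQ 0(SP), SP
	RET
```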
This piece of code performs the following steps:
- Loads the parameters from the stack into registers
- Saves the old `SP` into `R8`
- Loads the `G0` stack address into the `SP` register
- Aligns the stack, pushes `R8` (the old `SP`), and ensures proper stack alignment
- Executes the `CALL` instruction
- Restores `SP` from the stack and returns with `RET`
At this point, if we invoke a simple external function that requires a relatively large stack, we can verify that it executes correctly.
Great—but is that the end of it? If you compare this with the fastcgo code mentioned earlier (in fact, I only discovered that earlier work once my PoC had reached this point), you’ll find that the two approaches are quite similar.
Stack Size Probe & Crash?
I wanted to try something a bit more experimental—could I conditionally switch stacks? That is, only switch if the goroutine’s own stack space is insufficient? This might reduce cache misses and yield even better performance.
To do that, we need to answer one question: how big is “not enough”? To find out, I wanted to probe how much stack space is actually used by the external function in my current scenario. (Typically, I use CGO to invoke a Rust-side callback, which usually just wakes a task internally.)
I came up with an idea: before executing the `CALL`, I could write a canary value at `SP - {n}`, and after the call completes, read it back. If the value is unchanged, it means that memory region was untouched, i.e. the stack usage was less than n. I didn’t even need to write actual code for this; I used `memory write/read` directly in lldb to perform the experiment.
But then it crashed. Maybe the idea was too wild? After several debugging attempts, I found that even just setting a breakpoint in lldb—without doing any memory writing—would also cause a crash.
The crash was a memory fault on the Rust side. After inspecting the memory of the affected struct, I realized this was no trivial bug: the real culprit was Go’s asynchronous preemption mechanism.
Async Preemption
Golang’s scheduler is typically considered cooperative—that’s where the “co” in “coroutine” comes from. Since its early versions, the Go compiler has inserted checks at function entry points, memory allocation sites, and other locations to track how long the current goroutine has been running; if that exceeds a certain threshold, the runtime will attempt to switch to another goroutine.
The downside of this mechanism is that task switching depends on hitting these checkpoints. If a goroutine enters a tight loop of pure computation without function calls or allocations, it may block the entire thread.
The Go team was well aware of this issue. Starting with Go 1.14, they introduced preemptive scheduling, known in the codebase as asynchronous preemption.
Async Preemption is implemented using signal-based mechanisms. When a monitor thread detects that a goroutine has been running for too long, it sends a `SIGURG` signal to the target thread. This causes the kernel to pause the thread’s execution and transfer control to a signal handler. Inside this handler, the Go runtime has the opportunity to perform a goroutine switch or detach the current M (thread) from the P (logical processor)—commonly used when a blocking syscall is made directly by user code.
The crash mentioned earlier occurred precisely because of this: when Go sends a signal to a target thread, the signal handler runs on a separate gsignalStack—which is fine. But if the Go runtime decides to perform a goroutine switch, or executes any logic that depends on the G0 stack, it will corrupt the G0 state we depend on.
Even though the signal handler restores the register state afterward—thinking it has cleanly returned to the pre-interruption state—it doesn’t realize that our logic also relies on the G0 stack remaining untouched, which is no longer guaranteed.
Blocking Async Preemption
We must block Async Preemption (alternatively, we could temporarily mask signals—but doing so would involve a syscall on every call, which would severely impact performance).
Let’s take a look at how Go’s Async Preemption is triggered on Unix amd64 systems:
```asm
TEXT runtime·sigtramp(SB),NOSPLIT|NOFRAME,$0
```
```go
const sigPreempt = _SIGURG
```
The `sigtramp` function is the entry point for signal handling. After switching the current G to `gsignal`, it calls `runtime.sighandler`. For `sighandler` to proceed with preemption, the following conditions must be met:
- The signal received is `SIGURG`
- The asyncpreempt feature is not disabled
- The signal is not currently delayed
If all conditions are satisfied, it then checks `wantAsyncPreempt` and `isAsyncSafePoint`. If both are true, preemptive scheduling is triggered.
As we can see, modifying the state of the G can break these preemption conditions, and so can modifying the state of the M. Here, we’ll use one of the simplest possible approaches:
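`isAsyncSafePoint` begins by checking that the g interrupted by the signal is the M’s current user goroutine; roughly (again from runtime/preempt.go, abridged):

```go
func isAsyncSafePoint(gp *g, pc, sp, lr uintptr) (bool, uintptr) {
	mp := gp.m

	// Only user Gs can have safe-points. We check this first
	// because it's extremely common that we'll catch mp in
	// the scheduler processing this G preemption.
	if mp.curg != gp {
		return false, 0
	}
	// ...
}
```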
By setting the current g to G0, we effectively break the condition required for async preemption to proceed.
Implementation
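The sketch below keeps the same illustrative `CallFuncG0P2` entry point as before and adds the g swap described above; the TLS accesses mirror the runtime’s get_tls/g macros, and the struct offsets remain Go-version-dependent.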
`amd64.go`:
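```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncG0P2 calls fn(arg0, arg1) on the G0 stack with async
// preemption blocked. Implemented in amd64.s.
//
//go:noescape
func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
```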
`amd64.s`:
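```asm
#include "textflag.h"

// func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncG0P2(SB), NOSPLIT|NOFRAME, $0-24
	MOVQ fn+0(FP), AX
	MOVQ arg0+8(FP), DI
	MOVQ arg1+16(FP), SI

	// Locate m.g0 through the current g (offsets are Go-version-dependent).
	MOVQ SP, R8            // R8 = goroutine SP
	MOVQ TLS, CX
	MOVQ 0(CX)(TLS*1), R9  // R9 = current g
	MOVQ 0x30(R9), R10     // R10 = g.m
	MOVQ 0x0(R10), R10     // R10 = m.g0

	// Point the TLS g at g0 before touching the G0 stack, so that any
	// signal arriving from here on sees g == g0 and preemption bails out.
	MOVQ R10, 0(CX)(TLS*1)

	// Switch to the G0 stack and keep the old SP and the old g on it.
	MOVQ 0x38(R10), SP     // SP = g0.sched.sp
	ANDQ $~15, SP
	SUBQ $16, SP
	MOVQ R8, 0(SP)         // old SP
	MOVQ R9, 8(SP)         // old g

	CALL AX

	// Leave the G0 stack first, then restore the original g, so the G0
	// stack is only ever in use while g == g0.
	MOVQ 8(SP), R9         // old g
	MOVQ 0(SP), SP         // back to the goroutine stack
	MOVQ TLS, CX
	MOVQ R9, 0(CX)(TLS*1)
	RET
```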
In this code, after loading the arguments into registers, we update g to point to g0, then switch the SP to the G0 stack. On the G0 stack, we save the old SP and the original g.
This ensures that if the thread is interrupted by a signal at any point:
- Either it hasn’t yet started using the G0 stack, in which case preemption can proceed safely;
- Or it has already broken the preemption condition by setting g = g0, in which case preemption is blocked and the G0 stack may be in use.
With this design, we can safely execute arbitrary functions without worrying about signal-based interruption corrupting the G0 stack due to async preemption.
Benchmark
Based on the benchmark (run here on Go 1.18), we can see that compared to the CGO version, the ASM G0 version significantly reduces per-call latency (from 28ns to 2ns). Meanwhile, the in-place version (which does not switch to the G0 stack) doesn’t offer a substantial performance advantage over the G0-switching version.
Revisiting the earlier idea of conditionally switching stacks: given the cost of branch prediction and the risk of stack overflows—especially since stack usage can only be imprecisely estimated and is compiler-dependent—the potential performance gains aren’t worth the complexity or risk.
As a result, I’ve decided to expose two separate call interfaces:
- A G0-switching version, which is safe for functions with unknown or potentially large stack usage;
- An in-place version, which can be used only when the user is certain the target function uses minimal stack space.
CGO or ASM?
From the earlier data, it’s clear that even after addressing stack switching and async preemption, the CGO-based implementation still has significant overhead compared to the assembly-based solution. So, the natural question is: can we always use ASM to invoke external functions?
In reality, the earlier solution blocks preemption—it does not make the function safely preemptible. Therefore, if the invoked function is long-running (e.g., heavy computation or a blocking syscall), it can monopolize the Go worker (P) for too long, leading to increased latency for local tasks on that P. Hence, this is not a universal solution.
There are also some risks with the ASM approach. A known one is that offset values in Go’s internal structs (like g, m, g0, etc.) may change between Go versions. The good news is that these fields are fairly stable—tests show that Go 1.18, 1.22, and 1.23 all work correctly with the current implementation.
Conclusion: For short-lived functions, especially those that are called frequently and have strict performance requirements, the ASM-based approach is recommended.
Here’s an example of how to make such a call:
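The sketch below shows the shape of such a call. The import path, the `CallFuncG0P1` signature, and the helper used to obtain the function address are illustrative; check the Rust2Go examples for the exact form.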
go:
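```go
package main

/*
// Implemented and exported on the Rust side; only declared here.
void rust_callback(const void* data);

// Helper to obtain the raw function address once, at startup.
static void* rust_callback_addr() { return (void*)rust_callback; }
*/
import "C"

import (
	"unsafe"

	"github.com/ihciah/rust2go/asmcall" // illustrative import path
)

// Resolve the address once via a regular CGO call; afterwards every call
// goes through the assembly trampoline instead.
var rustCallback = C.rust_callback_addr()

func main() {
	value := uint64(42)
	// rust_callback(&value), executed on the G0 stack.
	asmcall.CallFuncG0P1(rustCallback, unsafe.Pointer(&value))
}
```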
rust/C/…:
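```rust
use std::ffi::c_void;

// Exported with the C ABI under an unmangled name so that the Go side can
// declare it in the cgo preamble and call it through the trampoline.
#[no_mangle]
pub extern "C" fn rust_callback(data: *const c_void) {
    // Keep the body short: this call path blocks async preemption, so it
    // must not do long-running work or blocking syscalls.
    let value = unsafe { *(data as *const u64) };
    println!("got {} from Go", value);
}
```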
- In Go, declare the C function using `import "C"`
- Use `asmcall.CallFuncG0P{n}` in Go code to initiate the call
- Implement the function on the peer side (Rust/C/C++/…), export it using the C ABI, and keep the function name unchanged
- Link appropriately (you can refer to the examples in Rust2Go for guidance).
Using asmcall is expected to save over 25ns per call.
If you need to perform the call using the standard CGO approach instead, you can simply replace the `asmcall` package in the above code with `cgocall`.
Use Cases & Conclusion
This technique is applicable to cross-language calls initiated from Golang, and can be used to invoke functions implemented in Rust, C, C++, and other languages from Go. When frequently calling lightweight or moderately complex external functions, replacing CGO with ASM can lead to significant performance improvements.
Using ASM in place of CGO can reduce per-call overhead from 28ns down to around 2ns.
Developers with relevant needs are welcome to adopt and integrate this approach!