This article also has a Chinese version.
A year ago, I released an open-source project called Rust2Go (related blog: Design and Implementation of a Rust-Go FFI Framework), which provides high-performance asynchronous and synchronous call support from Rust to Go. This project serves as the foundation for several community projects and multiple internal projects within my company. I’ve continued optimizing its performance and developing new features.
I’ll be speaking on this topic at Rust Asia Conf 2025—everyone interested is warmly welcome to join!
Recently, I explored CGO-related topics and, based on a newly developed high-performance CGO mechanism, added support in Rust2Go for actively invoking Rust from Go. This article focuses primarily on the former: the high-performance CGO mechanism itself.
Note: This article is not limited to Rust. It is applicable to all Go cross-language projects. A corresponding repository and example are also provided. Users with such needs are welcome to adopt and use it.
This article will proceed in the following order:
- Introduce the principles of CGO calls and their performance issues, which are the main targets of optimization;
- Show how to optimize CGO calls using the simplest assembly techniques;
- Highlight stack space issues and explain how switching to the G0 stack can resolve them;
- Introduce Async Preemption and how to block it to ensure the G0 stack remains unpolluted;
- Discuss the optimization results and application scenarios.
CGO Calls
Through CGO, Golang can both call foreign code and be called by it, following the C calling convention. Regardless of the direction of the logical call, both capabilities are involved: one is used to initiate the call, and the other to pass results back.
Unlike typical FFI in native languages, Golang code execution relies on the runtime. Therefore, calling Go functions from other languages is essentially more of a task dispatch than a direct FFI call. This introduces significant overhead, though much of it is necessary—for example, contention costs involved in managing the task queue, and cross-thread overhead incurred when waking up Go threads. This part of the implementation depends on the internals of the Go runtime and is relatively difficult to optimize.
On the other hand, calls initiated from Golang are much easier to optimize. The optimizations discussed in this article are primarily targeted at this case.
Let’s start with a simple function call:
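A minimal version of such an example looks roughly like the following; the C function name `dummy` is illustrative:

```go
package main

/*
// A dummy C function with no parameters and no return value.
static void dummy() {}
*/
import "C"

func main() {
	// This call crosses the Go/C boundary through CGO.
	C.dummy()
}
```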
This code includes a dummy C function implementation in the cgo preamble, which is then invoked in the main function using the `C.` prefix.
Normally, such code can be executed at very low cost in native languages. Whether it’s a C function or something else, the result is binary machine instructions—so the compiler simply needs to recognize and align with the target ABI. In this example, since there are no parameters or return values, the cost of calling convention conversion should be minimal.
However, if you benchmark this process, you’ll find that it takes nearly 30ns to complete—an overhead that’s relatively expensive.
Assembly Optimization
We have observed the relatively high overhead of CGO calls and recognized that, at their core, they are just binary instructions. So why not directly insert assembly to initiate the call?
In fact, projects like fastcgo and rustgo attempted this approach as far back as seven years ago. However, their code no longer works with more recent versions of Golang.
If we ignore other factors for a moment, all we need to do is write a bit of assembly to convert parameters from the Go ABI to the C ABI, and then execute a CALL instruction.
In fact, this is not sufficient—we’ll explain why later.
Let’s try this approach from scratch!
Golang ABI
When it comes to assembly calls, the first thing we need to pay attention to is the ABI. An ABI is a specification that defines the interface between two binary modules—it dictates how function parameters and return values are passed, how registers and the stack are used, and more.
In early versions of Golang, there was no explicit or public ABI. In 2018, the team recognized this issue and formalized the existing behavior as `ABI0`, while also introducing a new standard called `ABIInternal`. `ABIInternal` is designed as a rolling internal ABI standard, whereas `ABI0` and any future `ABI{n}` are considered snapshots of this internal ABI.
Link to the proposal: https://github.com/golang/proposal/blob/master/design/27539-internal-abi.md
Using the new ABI can bring potential performance improvements. For example, under ABI0, all parameters and return values are passed via the stack, which is significantly slower than passing through registers. Additionally, under ABI0, all registers are considered clobbered during a CALL, meaning that even if the callee does not modify a particular register, the caller must still save and restore it—an entirely unnecessary overhead in some cases.
In Golang, handwritten assembly must follow `ABI0`.
Unfortunately, both the Plan 9 and Go ABIs prevent us from passing arguments via registers or manually marking clobbered registers to further squeeze out performance. The good news is that our core logic isn’t complex, so even the simplest implementation is sufficient.
Further reading: If you’re interested in clobbered registers, here is an interesting example. It manually marks clobbered registers to delegate register operations during stackful coroutine switches to the compiler, thereby reducing unnecessary register saves and restores.
Minimal ASM
Below, I’ll use an example with two input parameters to implement the conversion from the Go ABI to the System V ABI. The main focus here is on argument passing: we need to read the parameters from the stack according to the Go ABI, and then write them into the designated registers as specified by the System V ABI.
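The two files below sketch one way to write this; the exported name `CallFuncP2`, the package name, and the explicit 16-byte stack alignment are illustrative additions rather than the exact Rust2Go code.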
`amd64.go`:
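```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncP2 calls fn(arg0, arg1) following the System V calling
// convention. It is implemented in amd64.s.
//
//go:noescape
func CallFuncP2(fn, arg0, arg1 unsafe.Pointer)
```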
`amd64.s`:
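```asm
#include "textflag.h"

// func CallFuncP2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncP2(SB), NOSPLIT|NOFRAME, $0-24
	MOVQ fn+0(FP), AX    // target function pointer
	MOVQ arg0+8(FP), DI  // 1st argument register in the System V ABI
	MOVQ arg1+16(FP), SI // 2nd argument register in the System V ABI
	MOVQ SP, R12         // remember the Go SP (R12 is callee-saved in C)
	ANDQ $~15, SP        // 16-byte alignment expected by the C ABI
	CALL AX
	MOVQ R12, SP         // restore the Go SP
	RET
```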
This piece of code reads the parameters from the stack and stores them into the `RDI` and `RSI` registers according to the System V ABI. It then loads the target function address into `RAX` and executes `CALL AX`.
PLAN9 assembly requires adherence to certain conventions. Specifically:
- NOSPLIT prevents the function from being instrumented with stack size checks.
- NOPTR indicates that the function does not contain pointers, so the garbage collector does not need to scan it.
- NOFRAME avoids stack frame allocation (which requires declaring a stack size of 0).
- $0 declares that the function requires 0 bytes of stack space.
PLAN9 introduces the concept of pseudo-registers. For example, although parameters are actually accessed via SP with a compile-time-determined offset, in assembly they are written as `symbol+offset(FP)`. Here, `FP` is a pseudo-register; `SB`, `PC`, and `TLS` are pseudo-registers as well.
If we benchmark this approach, we can observe that it reduces the per-call CGO overhead from nearly 30ns to around 1ns!
Stack Switching
At this point, you might be wondering: if it’s really this simple, why is the CGO implementation so inefficient? In fact, the previous implementation does have issues—one of them being the stack size problem.
Stack Probing vs Manual Stack Switching
Golang allocates a small initial stack on the heap for each goroutine and uses compiler-inserted stack size check code (probes) at the beginning of functions. When the available stack is insufficient, the runtime automatically grows and relocates the stack to ensure that the function always has enough space to execute.
However, such stack probing cannot be implemented from a foreign language. Even if the probe check could somehow be replicated, the foreign language’s pointer tagging and memory management would need to fully align with Go’s, in order for the Go runtime to correctly update pointer targets during stack growth. This is nearly impossible—and provides no real benefit.
If we ignore the stack size problem, then the only option is for external code to allocate, manage, and switch stacks manually. This approach avoids consuming any goroutine’s stack space, thus preventing stack overflows. However, this places a huge burden on the foreign-side implementer.
Switching to the G0 Stack
Let’s think about this: when a goroutine’s stack is insufficient, the Go runtime needs to manage memory and perform stack growth—does it do all this without using any stack space? Where does that stack come from?
The answer is: the G0 stack.
Golang uses the GMP model—G for Goroutine, M for Thread, and P for logical Processor. G0 is the first goroutine created when each M (thread) starts and is used to run scheduling-related code. The G0 stack resides in the thread stack. Its size is smaller than the full thread stack, but larger than a typical goroutine stack.
```asm
MOVD $runtime·g0(SB), g
```
This assembly code allocates 64KB of space within the thread stack to serve as the G0 stack.
64KB is considered a relatively large stack size—enough to run most regular functions. At the beginning of our assembly function, we can switch the SP (stack pointer) register to point to the G0 stack address.
The relevant structure definitions can be found in runtime2.go; we need their field offsets in order to reach these fields from assembly.
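Abridged to the fields we need (layout from runtime2.go; the offsets in the comments are for 64-bit platforms and can change between Go versions):

```go
type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack   // offset 0x00 (16 bytes)
	stackguard0 uintptr // 0x10
	stackguard1 uintptr // 0x18
	_panic      *_panic // 0x20
	_defer      *_defer // 0x28
	m           *m      // 0x30
	sched       gobuf   // 0x38
	// ...
}

type m struct {
	g0 *g // 0x00: goroutine with scheduling stack
	// ...
}

type gobuf struct {
	sp uintptr // 0x00
	// ...
}
```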
We can see that by accessing offset `0x30` from the `g` pointer, we obtain the `m` pointer; from offset `0x0` of `m`, we get the `g0` pointer; and from offset `0x38` of `g0`, we retrieve the gobuf, whose first field is the `SP`.
At this point, we can write the assembly code to switch to the `G0` stack:
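Again a sketch rather than the exact project code: it reuses the `CallFuncG0P{n}` naming pattern of the asmcall package mentioned later, and hard-codes the offsets derived above, which are Go-version-dependent.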
`amd64.go`:
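```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncG0P2 calls fn(arg0, arg1) on the G0 stack.
// Implemented in amd64.s.
//
//go:noescape
func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
```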
`amd64.s`:
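```asm
#include "textflag.h"

// func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncG0P2(SB), NOSPLIT|NOFRAME, $0-24
	// Load the arguments while FP still refers to the goroutine stack.
	MOVQ fn+0(FP), AX
	MOVQ arg0+8(FP), DI
	MOVQ arg1+16(FP), SI

	// Save the current SP, then walk g -> m -> g0 -> sched.sp using the
	// offsets derived above (0x30, 0x0, 0x38); the TLS idiom mirrors the
	// runtime's get_tls/g macros from go_tls.h.
	MOVQ SP, R8
	MOVQ TLS, CX
	MOVQ 0(CX)(TLS*1), R9 // R9 = current g
	MOVQ 0x30(R9), R9     // R9 = g.m
	MOVQ 0x0(R9), R9      // R9 = m.g0
	MOVQ 0x38(R9), SP     // SP = g0.sched.sp

	// Align for the C ABI and keep the old SP on the G0 stack.
	ANDQ $~15, SP
	SUBQ $16, SP
	MOVQ R8, 0(SP)

	CALL AX

	// Switch back to the goroutine stack and return.
	MOVQ 0(SP), SP
	RET
```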
This piece of code performs the following steps:
- Loads the parameters from the stack into registers
- Saves the old `SP` into `R8`
- Loads the `G0` stack address into the `SP` register
- Aligns the stack, pushes `R8` (the old `SP`), and ensures proper stack alignment
- Executes the `CALL` instruction
- Restores `SP` from the stack and returns with `RET`
At this point, if we invoke a simple external function that requires a relatively large stack, we can verify that it executes correctly.
Great—but is that the end of it? If you compare this with the fastcgo code mentioned earlier (in fact, I only discovered that earlier work once my PoC had reached this point), you’ll find that the two approaches are quite similar.
Stack Size Probe & Crash?
I wanted to try something a bit more experimental—could I conditionally switch stacks? That is, only switch if the goroutine’s own stack space is insufficient? This might reduce cache misses and yield even better performance.
To do that, we need to answer one question: how big is “not enough”? To find out, I wanted to probe how much stack space is actually used by the external function in my current scenario. (Typically, I use CGO to invoke a Rust-side callback, which usually just wakes a task internally.)
I came up with an idea: before executing the `CALL`, I could write a canary value at `SP - {n}`, and after the call completes, read it back. If the value is unchanged, it means that memory region was untouched, i.e. the stack usage was less than n. I didn’t even need to write actual code for this; I used `memory write/read` directly in lldb to perform the experiment.
But then it crashed. Maybe the idea was too wild? After several debugging attempts, I found that even just setting a breakpoint in lldb—without doing any memory writing—would also cause a crash.
The crash was a memory fault on the Rust side. After inspecting the memory of the affected struct, I realized this was no trivial bug: the real culprit was Go’s asynchronous preemption mechanism.
Async Preemption
Golang’s scheduler is typically considered cooperative—that’s where the “co” in “coroutine” comes from. Since its early versions, the Go compiler has inserted checks at function entry points, memory allocation sites, and other locations to track how long the current goroutine has been running; if that exceeds a certain threshold, the runtime will attempt to switch to another goroutine.
The downside of this mechanism is that task switching depends on hitting these checkpoints. If a goroutine enters a tight loop of pure computation without function calls or allocations, it may block the entire thread.
The Go team was well aware of this issue. Starting with Go 1.14, they introduced preemptive scheduling, known in the codebase as asynchronous preemption.
Async Preemption is implemented using signal-based mechanisms. When a monitor thread detects that a goroutine has been running for too long, it sends a `SIGURG` signal to the target thread. This causes the kernel to pause the thread’s execution and transfer control to a signal handler. Inside this handler, the Go runtime has the opportunity to perform a goroutine switch or detach the current M (thread) from the P (logical processor)—commonly used when a blocking syscall is made directly by user code.
The crash mentioned earlier occurred precisely because of this: when Go sends a signal to a target thread, the signal handler runs on a separate gsignalStack—which is fine. But if the Go runtime decides to perform a goroutine switch, or executes any logic that depends on the G0 stack, it will corrupt the G0 state we depend on.
Even though the signal handler restores the register state afterward—thinking it has cleanly returned to the pre-interruption state—it doesn’t realize that our logic also relies on the G0 stack remaining untouched, which is no longer guaranteed.
Blocking Async Preemption
We must block Async Preemption (alternatively, we could temporarily mask signals—but doing so would involve a syscall on every call, which would severely impact performance).
Let’s take a look at how Go’s Async Preemption is triggered on Unix amd64 systems:
```asm
TEXT runtime·sigtramp(SB),NOSPLIT|NOFRAME,$0
```
```go
const sigPreempt = _SIGURG
```
The `sigtramp` function is the entry point for signal handling. After switching the current G to `gsignal`, it calls `runtime.sighandler`. For `sighandler` to proceed with preemption, the following conditions must be met:
- The signal received is `SIGURG`
- The asyncpreempt feature is not disabled
- The signal is not currently delayed
If all conditions are satisfied, it then checks `wantAsyncPreempt` and `isAsyncSafePoint`. If both are true, preemptive scheduling is triggered.
As we can see, modifying the state of the G can break these preemption conditions, and so can modifying the state of the M. Here, we’ll use one of the simplest possible approaches:
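`isAsyncSafePoint` begins by checking that the g interrupted by the signal is the M’s current user goroutine; roughly (again from runtime/preempt.go, abridged):

```go
func isAsyncSafePoint(gp *g, pc, sp, lr uintptr) (bool, uintptr) {
	mp := gp.m

	// Only user Gs can have safe-points. We check this first
	// because it's extremely common that we'll catch mp in
	// the scheduler processing this G preemption.
	if mp.curg != gp {
		return false, 0
	}
	// ...
}
```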
By setting the current g to G0, we effectively break the condition required for async preemption to proceed.
Implementation
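The sketch below keeps the same illustrative `CallFuncG0P2` entry point as before and adds the g swap described above; the TLS accesses mirror the runtime’s get_tls/g macros, and the struct offsets remain Go-version-dependent.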
`amd64.go`:
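```go
package asmcall // illustrative package name

import "unsafe"

// CallFuncG0P2 calls fn(arg0, arg1) on the G0 stack with async
// preemption blocked. Implemented in amd64.s.
//
//go:noescape
func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
```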
`amd64.s`:
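```asm
#include "textflag.h"

// func CallFuncG0P2(fn, arg0, arg1 unsafe.Pointer)
TEXT ·CallFuncG0P2(SB), NOSPLIT|NOFRAME, $0-24
	MOVQ fn+0(FP), AX
	MOVQ arg0+8(FP), DI
	MOVQ arg1+16(FP), SI

	// Locate m.g0 through the current g (offsets are Go-version-dependent).
	MOVQ SP, R8            // R8 = goroutine SP
	MOVQ TLS, CX
	MOVQ 0(CX)(TLS*1), R9  // R9 = current g
	MOVQ 0x30(R9), R10     // R10 = g.m
	MOVQ 0x0(R10), R10     // R10 = m.g0

	// Point the TLS g at g0 before touching the G0 stack, so that any
	// signal arriving from here on sees g == g0 and preemption bails out.
	MOVQ R10, 0(CX)(TLS*1)

	// Switch to the G0 stack and keep the old SP and the old g on it.
	MOVQ 0x38(R10), SP     // SP = g0.sched.sp
	ANDQ $~15, SP
	SUBQ $16, SP
	MOVQ R8, 0(SP)         // old SP
	MOVQ R9, 8(SP)         // old g

	CALL AX

	// Leave the G0 stack first, then restore the original g, so the G0
	// stack is only ever in use while g == g0.
	MOVQ 8(SP), R9         // old g
	MOVQ 0(SP), SP         // back to the goroutine stack
	MOVQ TLS, CX
	MOVQ R9, 0(CX)(TLS*1)
	RET
```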
In this code, after loading the arguments into registers, we update g to point to g0, then switch the SP to the G0 stack. On the G0 stack, we save the old SP and the original g.
This ensures that if the thread is interrupted by a signal at any point:
- Either it hasn’t yet started using the G0 stack, in which case preemption can proceed safely;
- Or it has already broken the preemption condition by setting g = g0, in which case preemption is blocked and the G0 stack may be in use.
With this design, we can safely execute arbitrary functions without worrying about signal-based interruption corrupting the G0 stack due to async preemption.
Benchmark
Based on the benchmark (run here on Go 1.18), we can see that compared to the CGO version, the ASM G0 version significantly reduces per-call latency (from 28ns to 2ns). Meanwhile, the in-place version (which does not switch to the G0 stack) doesn’t offer a substantial performance advantage over the G0-switching version.
Revisiting the earlier idea of conditionally switching stacks: given the cost of branch prediction and the risk of stack overflows—especially since stack usage can only be imprecisely estimated and is compiler-dependent—the potential performance gains aren’t worth the complexity or risk.
As a result, I’ve decided to expose two separate call interfaces:
- A G0-switching version, which is safe for functions with unknown or potentially large stack usage;
- An in-place version, which can be used only when the user is certain the target function uses minimal stack space.
CGO or ASM?
From the earlier data, it’s clear that even after addressing stack switching and async preemption, the CGO-based implementation still has significant overhead compared to the assembly-based solution. So, the natural question is: can we always use ASM to invoke external functions?
In reality, the earlier solution blocks preemption—it does not make the function safely preemptible. Therefore, if the invoked function is long-running (e.g., heavy computation or a blocking syscall), it can monopolize the Go worker (P) for too long, leading to increased latency for local tasks on that P. Hence, this is not a universal solution.
There are also some risks with the ASM approach. A known one is that offset values in Go’s internal structs (like g, m, g0, etc.) may change between Go versions. The good news is that these fields are fairly stable—tests show that Go 1.18, 1.22, and 1.23 all work correctly with the current implementation.
Conclusion: For short-lived functions, especially those that are called frequently and have strict performance requirements, the ASM-based approach is recommended.
Here’s an example of how to make such a call:
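The sketch below shows the shape of such a call. The import path, the `CallFuncG0P1` signature, and the helper used to obtain the function address are illustrative; check the Rust2Go examples for the exact form.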
go:
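```go
package main

/*
// Implemented and exported on the Rust side; only declared here.
void rust_callback(const void* data);

// Helper to obtain the raw function address once, at startup.
static void* rust_callback_addr() { return (void*)rust_callback; }
*/
import "C"

import (
	"unsafe"

	"github.com/ihciah/rust2go/asmcall" // illustrative import path
)

// Resolve the address once via a regular CGO call; afterwards every call
// goes through the assembly trampoline instead.
var rustCallback = C.rust_callback_addr()

func main() {
	value := uint64(42)
	// rust_callback(&value), executed on the G0 stack.
	asmcall.CallFuncG0P1(rustCallback, unsafe.Pointer(&value))
}
```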
rust/C/…:
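```rust
use std::ffi::c_void;

// Exported with the C ABI under an unmangled name so that the Go side can
// declare it in the cgo preamble and call it through the trampoline.
#[no_mangle]
pub extern "C" fn rust_callback(data: *const c_void) {
    // Keep the body short: this call path blocks async preemption, so it
    // must not do long-running work or blocking syscalls.
    let value = unsafe { *(data as *const u64) };
    println!("got {} from Go", value);
}
```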
- In Go, declare the C function using `import "C"`
- Use `asmcall.CallFuncG0P{n}` in Go code to initiate the call
- Implement the function on the peer side (Rust/C/C++/…), export it using the C ABI, and keep the function name unchanged
- Link appropriately (you can refer to the examples in Rust2Go for guidance).
Using asmcall is expected to save over 25ns per call.
If you need to perform the call using the standard CGO approach instead, you can simply replace the `asmcall` package in the above code with `cgocall`.
Use Cases & Conclusion
This technique is applicable to cross-language calls initiated from Golang, and can be used to invoke functions implemented in Rust, C, C++, and other languages from Go. When frequently calling lightweight or moderately complex external functions, replacing CGO with ASM can lead to significant performance improvements.
Using ASM in place of CGO can reduce per-call overhead from 28ns down to around 2ns.
Developers with relevant needs are welcome to adopt and integrate this approach!