Posted 2024-12-24, updated 2025-01-03
Recently I diagnosed and fixed two frame pointer unwinding crashes in Go. The root causes were two flavors of the same problem: buggy assembly code clobbered a frame pointer. By "clobbered" I mean wrote over the value without saving & restoring it. One bug clobbered the frame pointer register. The other bug clobbered a frame pointer saved on the stack. This post explains the bugs, talks a bit about ABIs and calling conventions, and makes some recommendations for how to avoid the bugs.
Here's the short version of what you should do when writing assembly for Go to avoid the problems discussed in this post:
BP
for AMD64, R29
for ARM64).
Leaving a non-frame pointer value in those registers prior to calling a Go function can crash the runtime execution tracer and profilers.
The first issue was reported at go.dev/issue/69629.
After upgrading to Go 1.23, and building with profile-guided optimization (PGO),
the block profiler consistently crashed when the program was under load.
The crash happened when collecting a call stack via frame pointer unwinding.
This affected the program when built for the amd64
architecture,
but not when it was built for arm64
.
I had contributed frame pointer unwinding for the block and mutex profilers for Go 1.23,
so I took a look at the issue.
I won't go into the full investigation here.
There were many dead ends.
Thanks to the issue reporter, Liz Fong-Jones (@lizthegrey), for being so willing to collect and share information.
The key clue turned out to be the address of the faulting memory access, 0x30bc5b31
,
which was the same in every occurrence of the crash.
The program was reading 8 bytes above a frame pointer,
so the supposed frame pointer was 0x30bc5b29
.
In a moment of desperation (or inspiration?)
I put this value into GitHub code search and turned up several
Metrohash implementations.
The github.com/dgryski/go-metro implementation uses assembly. Turns out the affected program was using this library! Here's some assembly from the library:
// func Hash64(buffer_base uintptr, buffer_len int64, buffer_cap int64, seed uint64) uint64 TEXT ·Hash64(SB),4,$0-40 MOVQ seed+24(FP), AX MOVQ buffer_base+0(FP), BX MOVQ buffer_len+8(FP), CX MOVQ $3603962101, DX IMULQ DX, AX MOVQ $5961697176435608501, DX ADDQ DX, AX CMPQ CX, $32 JLT after32 MOVQ AX, DX MOVQ AX, DI MOVQ AX, SI MOVQ AX, BP
The MOVQ AX, BP
instruction is a problem.
The BP
register is used as a "frame pointer" register for AMD64.
Frame pointers form a linked list used to efficiently collect backtraces,
and the head of the list lives in that register.
The function also has no call frame (see the $0
in the signature),
so the assembler won't add any code to save and restore the frame pointer.
That MOVQ
instruction overwrites BP
with a value used for the hash function,
without saving the old value.
So the old value is gone forever, and BP
now holds an invalid frame pointer.
Later in the function,
there is a MOVQ $817650473, BP
instruction.
This is the last write to BP
before the function returns.
And 817650473 == 0x30bc5b29
, our mystery value.
Because this function leaves a non-frame pointer value in the frame pointer register, any unwinding that passes through it between when when this function returns and when its caller returns (restoring the caller's caller's presumably valid frame pointer) will fail. The affected program crashed with PGO enabled because the function which called this buggy hash function was inlined, extending the scope where the frame pointer register was invalid. I reported the problem.
The assembly in the go-metro
repo is not written by hand.
It was generated programmatically with PeachPy.
With PeachPy, you write Python scripts which generate assembly.
The go-metro
PeachPy code doesn't explicitly use the frame pointer register.
Rather, PeachPy treats it as a general-purpose register and uses it in the generated code.
So this is really a bug in PeachPy.
Despite this bug in PeachPy,
writing assembly with generator tools is a really good idea.
Assembly generators handle register allocation,
so you can ask for registers and refer to them by a useful names,
rather than having to remember that CX
is your loop counter and DX
is the hash seed, etc.
The generator can learn which registers are special and save you from breaking things.
The generator can automatically allocate the right size stack frame based on locals your functions uses.
You can encapsulate repeated logic in functions used by the generator,
which are way better than preprocessor macros since they actually have type checking,
and they end up inlined in the assembly output so you don't pay for an actual function call.
While PeachPy doesn't appear to be actively maintained,
the Avo assembly generator is.
And Avo does know that the frame pointer register needs to be saved.
Avo specifically targets Go assembly and only supports AMD64 as of this writing.
I recommend using Avo for Go assembly when you can.
I opened a PR to port go-metro
's assembly generation to Avo.
Also worth noting:
go vet
has a check for overwriting the frame pointer register without saving it.
The check unfortunately misses this bug, though.
The check was written to be simple and conservative to avoid false positives,
and opted to stop checking a function at the first branch.
But in this case, the frame pointer is clobbered after the first branch.
I opened go.dev/issue/69838 to track improving this check.
I found the next bug while looking for occurrences of the previous bug at Datadog. One of our programs was crashing with stack dumps like this:
SIGSEGV: segmentation violation PC=0x69f0e0 m=30 sigcode=1 addr=0xc3ffa97ffc05 goroutine 0 [idle]: runtime.fpTracebackPCs(...) runtime.traceStack() runtime.traceLocker.stack(...) runtime.traceLocker.GoStop() runtime.traceLocker.GoPreempt(...) runtime.goschedImpl(...) runtime.gopreempt_m() runtime.mcall() goroutine 321 [running]: runtime.asyncPreempt2() runtime.asyncPreempt() github.com/apache/arrow/go/v13/arrow/memory.memory_memset_neon() github.com/apache/arrow/go/v13/arrow/memory.Set(...) [ ... ]
Seeing the execution tracer and asynchronous preemption gave me flashbacks to
"the one-instruction window".
The memory_memset_neon
function jumped out at me.
I recognized "neon" as the Neon SIMD extension for ARM.
I smelled assembly and took a look at the code.
The memory_memset_neon
function calls
this
_memset_neon
function,
implemented in assembly.
Here's the beginning of the function:
// func _memset_neon(buf unsafe.Pointer, len, c uintptr) TEXT ·_memset_neon(SB), $0-24 MOVD buf+0(FP), R0 MOVD len+8(FP), R1 MOVD c+16(FP), R2 WORD $0xa9bf7bfd // stp x29, x30, [sp, #-16]! WORD $0x8b010008 // add x8, x0, x1 WORD $0xeb00011f // cmp x8, x0 WORD $0x910003fd // mov x29, sp BLS LBB0_7 [ ... ]
This assembly is extra odd since it's (partly) automatically translated from the output of a C compiler.
The WORD
instructions are the bytecode for machine instructions.
This is usually used for instructions the Go assembler doesn't yet support,
but the assembly translator probably does it for simplicity.
The original instructions are in the comments.
The first stp
instruction is key here.
It decrements the stack pointer by 16,
then saves the current values of x29
and x30
on the stack
starting at the newly decremented stack pointer.
The x29
register is the frame pointer register for ARM64.
So, it looks like this function saves the frame pointer as it should, right?
Turns out it's wrong in a non-obvious way.
The Go compiler saves the caller's frame pointer one word below the function's stack frame.
[2]
See "the one-instruction window" for some diagrams that illustrate how the frame is laid out.
So, the top word of a function's stack frame on entry actually belongs to the caller.
That stp
instruction is actually saving x30
(the link register)
where the caller saved the frame pointer.
When the function returns,
it will restore its caller's frame pointer to the frame pointer register,
and that frame pointer now points to an invalid frame pointer (an instruction address from x30
).
That invalid frame pointer will be there until the caller returns.
This includes if the caller is asynchronously preempted while the execution tracer is running.
When that happens, the frame pointer will be visited when collecting a backtrace,
hence the crash.
I reported the bug, and sent a PR to fix it. At the time I sent the PR, I did what seemed like the least intrusive thing to fix the problem. I decremented the stack pointer a little bit more so the caller's saved frame pointer wouldn't be clobbered.
However, while writing this post I realized there are probably better ways to deal with the problem.
For one, the functions don't even need stack frames
(see again the $0
)
so the frame pointer and link register save/restore logic could just be eliminated.
But it can be nice to have the frame pointer set up so that you get complete stack traces from something like Linux perf
.
It's probably okay to manually create the call frame and save the registers like the code does. There will typically be room below the stack pointer in any function call [3], so we're unlikely to write out of bounds. But it's better as a programmer not to rely on implementation details like that. It would be just as easy to give each function a non-zero frame size, even if they don't have any locals. That way, the assembler can insert the frame pointer save and restore. If the frame size is small enough, and the compiler/assembler know that small frames are safe to allocate without stack growth, extra bounds check or growth logic can be safely eliminated.
Of course, the underlying source of the bug is the assembly translator used for that code. The assembly was translated from the output of a C compiler. The C compiler isn't going to know about the requirements for Go assembly code, though. And certainly not something like the top word of the stack frame being off-limits.
There are Apache Arrow clients for many programming languages besides just Go, and it makes sense to try to share low-level, performance-sensitive code where possible, rather than writing the same primitives multiple times. And calling assembly functions is much, much cheaper than calling C functions. So I can understand why that approach was taken. PeachPy even tried to do something similar, supporting generating assembly for several different targets from a single script. Unfortunately there doesn't seem to be a tool that supports that well right now.
These kinds of rules for assembly-level code—which registers are special, when registers need to be saved and restored, how to pass arguments to and return values from functions, the exact layout of call frames and other data—are part of an "application binary interface" (ABI). There are many ABIs, even for a single processor architecture: The System V ABI is used by Linux and some other Unix-descendent operating systems. Microsoft has its own ABI. Programming languages can also have their own ABIs, on top of the operating system ABIs.
As a Go programmer, you should refer to the Go assembly guide. This describes the stable ABI for Go assembly, which is what you should target.
If you're ever looking at assembly generated by the compiler, or hand-written assembly in the runtime, refer to the internal ABI guide.
If you're planning to write Go assembly, be aware of the details of the Go ABI, but also lean on tools like Avo to handle the subtle details for you so you can focus on the code you actually want to write.
Thanks to Felix Geisendörfer for feedback
perf
.
The compiler can also omit the stack growth check for functions with very small frames,
so it's cheap to declare a small frame even if you don't need it.
Back
PUSHQ BP
instruction,
and restored before any return via POPQ BP
.
This is almost the same as ARM64,
in the sense that the first word of the frame is a frame pointer and the word above it is a return address.
But on AMD64 that word is the caller's frame pointer,
rather than the caller's caller's frame pointer,
and the function only writes strictly within its own frame.
Frame pointer support for ARM64 was added in 2018.
At the time,
the Go compiler already saved the link register (R30
) at the bottom of a function's call frame.
By convention, the frame pointer is stored one word below the return address.
So, for simplicity, the frame pointer is saved one word below the function's stack frame.
It was done this way for ARM64 for simplicity,
since the link register is already at the bottom of the frame.
As for why that is done,
I haven't yet found any record of the rationale.
In a sense it's similar to pushing the return address to the stack for AMD64 CALL
.
But again, it's different for ARM64 since the return address is in a register,
and the previous return address goes on the stack...
Back