The standard Go toolchain comes with an assembler out of the box. Said assembler is highly idiosyncratic, using syntax inherited from Plan 9 and choosing its own names for platform-specific instructions and registers. But it’s great to have it readily available.
More mundanely, Go comes with a garbage collector. This post explains how to make these two components play nice, if we want to manipulate Go pointers from our assembly.
Preamble: Go’s GC write barriers #
Go’s garbage collector strives to minimize long pauses, which calls for concurrent garbage collection: the garbage is picked up while your code is running. Even if you’re not a garbage collection expert you would have recognized this as a tricky problem. As the Go GC is marking reachable objects, new objects might become reachable due to code running alongside the GC.
A common technique deployed to address this problem consists of instrumenting all pointer stores to inform the GC that the destination is now being pointed to.1 This instrumentation will augment all assignments with a bit of code informing the GC of the new reference. Or more concretely code like
will become more like
Where add_to_gc_queue(y) makes it so that y will be picked up by the GC even if x had already been examined. The widget above is often called a “write barrier” in the context of garbage collection.2
As you can imagine this instrumentation has a cost, a cost that is particularly dear when it comes to the very common task of storing pointers in stack variables. So Go chooses to not instrument stores where the receiver of the store is on the stack. Instead until Go 1.8, the GC stopped the world before collection, taking care to re-scan stacks for goroutines which had run after the time they had been first examined. This ensured that no new stack references would … go undetected.
This final stack rescanning procedure could often take uncomfortably long, and therefore Go switched to a broader write barrier, which roughly consists of adding both the old and the new reference to the GC queue:
Above we have x to be a pointer to a pointer to highlight that the receiver of the pointer itself lives on the heap, rather than on the stack.
The details are fiddly, but involving both the old and new pointer let the Go developers remove the final stack-scanning, decreasing the duration of stop-the-world pauses by two orders of magnitude.
The problem #
When the Go compiler generates code, it automatically instruments all the relevant stores. However, fun awaits when we want to perform some pointer stores ourselves.
As a motivating example, consider a concurrent hash table storing key-value pairs together:
We might want to leverage 128-bit atomic load/stores – which are not available directly in Go – by writing our get/put functions in assembly.3 However we face a challenge: when writing to a slot from assembly, we need to inform Go’s GC of the store that just happened, or else pay the cost of rare and hard-to-debug faults, given that we’re breaking invariants central to the correctness of Go’s garbage collection.
In practice, the assembly for a store not happening on the stack looks like this:
All we need to do is replicate this widget in our own assembly, replacing the scalar store above with our wider store for our concurrent hash map.
The problem is that somewhat understandably the Go designers really don’t want you to do that, and they therefore started forbidding linking to runtime symbols. This is to avoid having libraries relying on runtime internals, and being henceforth unable to change them.
Fortunately, they whitelisted functions which were previously referenced by popular Go packages, and it just so happened that a single package used exactly the symbols we need.
To link to them, you must blacklist future Go versions:
Presumably the plan is to eventually completely phase out these functions too. But in the meantime, you can hand-craft write barriers with abandon.
Bonus: allocating aligned memory in Go #
While investigating the above, a more fundamental problem arose: allocating our slots to be 128-bit aligned, to be able to use AVX instructions on them. Go makes this surprisingly tricky. Allocating a []byte and using unsafe.Slice() to create an aligned []slot from it is doomed to fail, since the Go runtime identifies allocated regions by their original type, and therefore will be blind to the pointers in our array of slots.
Regrettably, there seem to be no “blessed” ways of doing this, but after a few experiments Peter Cawley produced this devious bit of code to bend Go’s malloc to our will:
Note that while the runtime tracks where pointers are based on Go’s type system, it does not care about the precise source of pointers, making the above trick viable. There are surely other ways to do this, but the above is possibly the cutest.
Acknowledgements #
Many thanks to everybody I nerd-sniped by bringing up this topic, and to Peter Cawley and Samir Jindel in particular.