Help needed to identify golang fork / memory corruption issue on FreeBSD
Steven Hartland
killing at multiplay.co.uk
Tue Mar 28 22:50:40 UTC 2017
On 28/03/2017 12:38, Konstantin Belousov wrote:
> On Tue, Mar 28, 2017 at 09:48:23AM +0100, Steven Hartland wrote:
>> On 28/03/2017 09:38, Konstantin Belousov wrote:
>>> On Tue, Mar 28, 2017 at 09:23:24AM +0100, Steven Hartland wrote:
>>>> As I stopped the panic before that I couldn't tell so I've re-run with
>>>> some debug added just before the panic to capture the addresses of the
>>>> workbuf structure that the issue was detected in, here goes (parent:
>>>> 62620, child: 98756):
>>>>
>>>> workbuf: 0x800b51800
>>>> fatal error: workbuf is not empty
>>>> workbuf: 0x800a72000
>>>> fatal error: workbuf is empty
>>>> workbuf: 0x800a72000
>>>> fatal error: workbuf is not empty
>>> I do not understand. Why do you show several addresses ? Wouldn't the
>>> runtime panic after detecting the discrepancy, so there could be only one
>>> address ?
>> There are several goroutines (threads) running each detected an error,
>> as I'm blocking the panic by entering a sleep in the faulting goroutine
>> to enable the capture of procstat, other routines continue and detect an
>> error too.
> Ok.
>
> So I tried to simulate the load with an isolated test. Code below is
> naive, but it should illustrate the idea. Parent allocates some
> number of private-mapped areas, then runs threads which write bytes into
> the areas. Simultaneously parent forks children which write distinct
> byte into the same anonymous memory.
>
> Parent checks that it cannot see a byte written by children.
>
> So far it has not tripped on my test machine. Feel free to play with it,
> if you have more insights what go runtime does, modify the code to simulate
> the failing test more accurately.
I've updated it to be more like the Go runtime: a single forking thread
(non-main), with ancillary threads mainly idle until triggered by the
forking thread to perform a check. Still no failure.
What's curious is why I don't get the issue if either:
1. The machine has just a single core.
2. The work (GC) is moved after the child wait.
Given the above I added some debug:
func (b *workbuf) checknonempty() {
	if b.nobj == 0 {
		print("workbuf is empty: b: ", b, ", nobj: ", b.nobj,
			", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n")
		throw("workbuf is empty")
	}
}

func (b *workbuf) checkempty() {
	if b.nobj != 0 {
		print("workbuf is not empty: b: ", b, ", nobj: ", b.nobj,
			", nobj2: ", b.nobj2, ", pushcnt: ", b.node.pushcnt, "\n")
		throw("workbuf is not empty")
	}
}
Here's the output:
workbuf is not empty: b: 0x800c51000, nobj: 4, nobj2: -2, pushcnt: 104881
fatal error: workbuf is not empty
Nothing strange, but now let's have a look using gdb after the parent has
exited:
(gdb) frame 8
#8 0x000000000041f1e8 in runtime.(*workbuf).checkempty (b=0x800c51000)
at /usr/local/go/src/runtime/mgcwork.go:328
328 throw("workbuf is not empty")
(gdb) print b
$3 = (struct runtime.workbuf *) 0x800c51000
(gdb) print *b
$4 = {runtime.workbufhdr = {node = {next = 0, pushcnt = 104881}, nobj = 0, nobj2 = -8},....
So after the error was printed the value of nobj was somehow corrected
(4 in the printed output, 0 in gdb). However, nobj2 being -8 indicates
that the last call which altered nobj was func (w *gcWork) get() uintptr,
whereas the -2 printed earlier indicates it was a putfull, which is very
muddled up.
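For anyone unfamiliar with the trick, nobj2 acts as a "last writer" tag: every code path that changes nobj also stores a distinct marker. A minimal sketch of the idea follows. The buf type and its methods are hypothetical stand-ins and not the runtime's code; only the two marker values (-2 for putfull, -8 for get) are taken from the description above.

```go
package main

import "fmt"

// Marker values as described above; only their distinctness matters.
const (
	tagPutFull = -2
	tagGet     = -8
)

// buf is a hypothetical stand-in for a work buffer.
type buf struct {
	nobj   int        // number of valid entries in obj
	lastOp int        // plays the role of nobj2: tag of the last writer
	obj    [4]uintptr // fixed-size storage
}

// fill stores up to len(b.obj) values and tags the write as a putfull.
func (b *buf) fill(vals ...uintptr) {
	b.nobj = copy(b.obj[:], vals)
	b.lastOp = tagPutFull
}

// get removes the last entry and tags the write as a get.
// It reports false if the buffer is empty.
func (b *buf) get() (uintptr, bool) {
	if b.nobj == 0 {
		return 0, false
	}
	b.nobj--
	b.lastOp = tagGet
	return b.obj[b.nobj], true
}

func main() {
	var b buf
	b.fill(10, 20)
	fmt.Println("after fill: nobj:", b.nobj, "lastOp:", b.lastOp)
	b.get()
	fmt.Println("after get:  nobj:", b.nobj, "lastOp:", b.lastOp)
}
```

With a tag like this, a count that disagrees with the tag between two reads (as in the print versus gdb output above) shows that something else touched the structure in between.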
I was curious what the child had at 0x800c51000 but couldn't persuade
gdb to cast and output it as a struct runtime.workbuf.
Regards
Steve