Kernel crashes after sleep: how to debug?

Mon Jul 22 17:45:49 UTC 2013

On Friday, July 19, 2013 10:16:15 pm Yuri wrote:
> On 07/19/2013 14:04, John Baldwin wrote:
> > Hmm, that definitely looks like garbage.  How are you with gdb scripting?
> > You could write a script that walks the PQ_ACTIVE queue and see if this
> > pointer ends up in there.  It would then be interesting to see if the
> > previous page's next pointer is corrupted, or if the pageq.tqe_prev references
> > that page then it could be that this vm_page structure has been stomped on
> > instead.
> 
> As you suggested, I printed the list of pages. Actually, iteration in 
> frame 8 goes through PQ_INACTIVE pages. So I printed those.
> <...skipped...>
> ### page#2245 ###
> $4492 = (struct vm_page *) 0xfffffe00b5a27658
> $4493 = {pageq = {tqe_next = 0xfffffe00b5a124d8, tqe_prev = 
> 0xfffffe00b5b79038}, listq = {tqe_next = 0x0, tqe_prev = 
> 0xfffffe00b5a276e0},
>    left = 0x0, right = 0x0, object = 0xfffffe005e3f7658, pindex = 5, 
> phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfffffe005e439ce8,
>        tqh_last = 0xfffffe00795eacc0}, pat_mode = 6}, queue = 0 '\0', 
> segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>    cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 
> 0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2246 ###
> $4494 = (struct vm_page *) 0xfffffe00b5a124d8
> $4495 = {pageq = {tqe_next = 0xfffffe00b460abf8, tqe_prev = 
> 0xfffffe00b5a27658}, listq = {tqe_next = 0x0, tqe_prev = 
> 0xfffffe005e3f7cf8},
>    left = 0x0, right = 0x0, object = 0xfffffe005e3f7cb0, pindex = 1, 
> phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfffffe005e42dd48,
>        tqh_last = 0xfffffe007adb03a8}, pat_mode = 6}, queue = 0 '\0', 
> segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>    cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 
> 0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2247 ###
> $4496 = (struct vm_page *) 0xfffffe00b460abf8
> $4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8}, 
> listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
>    left = 0x6, right = 0xd00000201, object = 0x100000000, pindex = 
> 4294901765, phys_addr = 18446741877712530608, md = {pv_list = {
>        tqh_first = 0xfffffe00b460abc0, tqh_last = 0xfffffe00b5579020}, 
> pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
>    hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535, 
> wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
>    act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
> ### page#2248 ###
> $4498 = (struct vm_page *) 0xfe26
> 
> Page #2247 is the same one that caused the problem in frame 8. Its tqe_next 
> is apparently invalid, so iteration stopped here.
> It appears that this structure has been stomped on. This page is 
> probably supposed to be a valid inactive page.

Yes, its phys_addr is also way off. I think you might even be able to
figure out which phys_addr it is supposed to have based on the virtual
address (see PHYS_TO_VM_PAGE() in vm/vm_page.c) by using the vm_page
address and phys_addr of the prior entries to establish the relative
offset.  It is certainly a page "earlier" in the array.

> > Ultimately I think you will need to look at any malloc/VM/page operations
> > done in the suspend and resume paths to see where this happens.  It might
> > be slightly easier if the same page gets trashed every time as you could
> > print out the relevant field periodically during suspend and resume to
> > narrow down where the breakage occurs.
> 
> I am thinking of adding code that walks all the page queues and verifies 
> that they are not damaged in this way, and running it as each device 
> wakes up from sleep.
> dev/acpica/acpi.c has acpi_EnterSleepState, which, as I understand, 
> contains top-level code for S3 sleep. Before sleep it invokes event 
> 'power_suspend' on all devices, and after sleep it calls 'power_resume' 
> on devices. So maybe I will call the page check procedure after 
> 'power_suspend' and 'power_resume'.
> 
> But it is possible that memory gets damaged somewhere else after 
> power_resume happens.
> Do you have any thoughts/suggestions?

Well, I think you should try what you've suggested above first.  If that
doesn't narrow it down then we can brainstorm some other places to inspect.
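
The queue check itself might look roughly like the sketch below. It is a
userland mock, not kernel code: struct page stands in for struct vm_page
with only the pageq linkage, and ptr_ok() and check_queue() are hypothetical
helpers. The lo/hi bounds are assumed to describe the address range of the
page array, and a real in-kernel version would presumably also need the
appropriate page queue lock held while walking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vm_page: just the queue linkage. */
struct page {
    TAILQ_ENTRY(page) pageq;
};
TAILQ_HEAD(pglist, page);

/* Plausibility test: pointer lies inside [lo, hi) and is pointer-aligned. */
static int
ptr_ok(const void *p, uintptr_t lo, uintptr_t hi)
{
    uintptr_t a = (uintptr_t)p;

    return (a >= lo && a < hi && a % sizeof(void *) == 0);
}

/*
 * Walk one queue and return the index of the first entry that looks
 * damaged, or -1 if the whole queue is consistent.  Every pointer is
 * range-checked before it is dereferenced, so a garbage tqe_next such
 * as 0xfe26 is reported instead of followed.  If only the head's
 * tqh_last is wrong, the number of entries walked is returned.
 */
static long
check_queue(const struct pglist *head, uintptr_t lo, uintptr_t hi)
{
    struct page *const *prevp;
    const struct page *m;
    long i = 0;

    assert(head != NULL);
    prevp = &head->tqh_first;
    for (m = head->tqh_first; m != NULL; m = m->pageq.tqe_next, i++) {
        if (!ptr_ok(m, lo, hi))
            return i;           /* next pointer leads outside the array */
        if (m->pageq.tqe_prev != prevp)
            return i;           /* back link does not match predecessor */
        prevp = &m->pageq.tqe_next;
    }
    return (head->tqh_last == prevp) ? -1 : i;
}
```

Calling something like this once after 'power_suspend' and once after
'power_resume', and logging the returned index, would show on which side of
the sleep the queue first goes bad.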

-- 
John Baldwin