Kernel crashes after sleep: how to debug?
John Baldwin
jhb at freebsd.org
Mon Jul 22 17:45:49 UTC 2013
On Friday, July 19, 2013 10:16:15 pm Yuri wrote:
> On 07/19/2013 14:04, John Baldwin wrote:
> > Hmm, that definitely looks like garbage. How are you with gdb scripting?
> > You could write a script that walks the PQ_ACTIVE queue and see if this
> > pointer ends up in there. It would then be interesting to see if the
> > previous page's next pointer is corrupted, or if the pageq.tqe_prev references
> > that page then it could be that this vm_page structure has been stomped on
> > instead.
>
> As you suggested, I printed the list of pages. Actually, iteration in
> frame 8 goes through PQ_INACTIVE pages. So I printed those.
> <...skipped...>
> ### page#2245 ###
> $4492 = (struct vm_page *) 0xfffffe00b5a27658
> $4493 = {pageq = {tqe_next = 0xfffffe00b5a124d8, tqe_prev =
> 0xfffffe00b5b79038}, listq = {tqe_next = 0x0, tqe_prev =
> 0xfffffe00b5a276e0},
> left = 0x0, right = 0x0, object = 0xfffffe005e3f7658, pindex = 5,
> phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfffffe005e439ce8,
> tqh_last = 0xfffffe00795eacc0}, pat_mode = 6}, queue = 0 '\0',
> segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
> cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags =
> 0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2246 ###
> $4494 = (struct vm_page *) 0xfffffe00b5a124d8
> $4495 = {pageq = {tqe_next = 0xfffffe00b460abf8, tqe_prev =
> 0xfffffe00b5a27658}, listq = {tqe_next = 0x0, tqe_prev =
> 0xfffffe005e3f7cf8},
> left = 0x0, right = 0x0, object = 0xfffffe005e3f7cb0, pindex = 1,
> phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfffffe005e42dd48,
> tqh_last = 0xfffffe007adb03a8}, pat_mode = 6}, queue = 0 '\0',
> segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
> cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags =
> 0, act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2247 ###
> $4496 = (struct vm_page *) 0xfffffe00b460abf8
> $4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfffffe00b5a124d8},
> listq = {tqe_next = 0xfffffe0081ad8f70, tqe_prev = 0xfffffe0081ad8f78},
> left = 0x6, right = 0xd00000201, object = 0x100000000, pindex =
> 4294901765, phys_addr = 18446741877712530608, md = {pv_list = {
> tqh_first = 0xfffffe00b460abc0, tqh_last = 0xfffffe00b5579020},
> pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
> hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535,
> wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
> act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
> ### page#2248 ###
> $4498 = (struct vm_page *) 0xfe26
>
> The page #2247 is the same that caused the problem in frame 8. tqe_next
> is apparently invalid, so iteration stopped here.
> It appears that this structure has been stomped on. This page is
> probably supposed to be a valid inactive page.
Yes, its phys_addr is also way off. I think you might even be able to
figure out which phys_addr it is supposed to have based on the virtual
address (see PHYS_TO_VM_PAGE() in vm/vm_page.c) by using the vm_page
address and phys_addr of the prior entries to establish the relative
offset. It is certainly a page "earlier" in the array.
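
The arithmetic described above can be sketched offline from the numbers in
the gdb dump. This is only an illustration, not kernel code: the function
names are made up, it assumes 4 KB pages, and it assumes the intact and
damaged entries sit in one contiguous run of the vm_page array so that
array index and page frame number move in step.

```python
# Values copied from the gdb output in this thread.
PAGE_SIZE = 4096  # assumption: 4 KB base pages (amd64)

# (vm_page address, phys_addr) for the two intact entries, #2245 and #2246.
GOOD_A = (0xfffffe00b5a27658, 1884901376)
GOOD_B = (0xfffffe00b5a124d8, 1881952256)
SUSPECT = 0xfffffe00b460abf8  # address printed for the damaged entry, #2247


def entry_size(a, b, page_size=PAGE_SIZE):
    """Infer sizeof(struct vm_page) from two intact array entries."""
    index_delta, rem = divmod(a[1] - b[1], page_size)
    if rem or index_delta == 0:
        raise ValueError("need two distinct, page-aligned phys_addr values")
    size, rem = divmod(a[0] - b[0], index_delta)
    if rem:
        raise ValueError("entries are not a whole number of elements apart")
    return size


def expected_phys_addr(suspect, ref, size, page_size=PAGE_SIZE):
    """Return (phys_addr, leftover) implied by the suspect's array position.

    leftover is how many bytes the suspect pointer lies past the start of
    the array element it falls in; 0 means it is on an element boundary.
    """
    index_delta, leftover = divmod(suspect - ref[0], size)
    return ref[1] + index_delta * page_size, leftover


if __name__ == "__main__":
    size = entry_size(GOOD_A, GOOD_B)
    phys, leftover = expected_phys_addr(SUSPECT, GOOD_B, size)
    print("sizeof(struct vm_page) ~", size)
    print("expected phys_addr:", phys, "leftover bytes:", leftover)
```

With the values from the dump this infers a 120-byte element and an expected
phys_addr of 1165066240, but it also reports that the damaged pointer lies
56 bytes past an element boundary. If the assumptions above hold, that is
worth verifying in the debugger before concluding which side was overwritten.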
> > Ultimately I think you will need to look at any malloc/VM/page operations
> > done in the suspend and resume paths to see where this happens. It might
> > be slightly easier if the same page gets trashed every time as you could
> > print out the relevant field periodically during suspend and resume to
> > narrow down where the breakage occurs.
>
> I am thinking of adding code that walks all the page queues and verifies
> that they are not damaged in this way, and calling it as each device
> wakes up from sleep.
> dev/acpica/acpi.c has acpi_EnterSleepState, which, as I understand,
> contains top-level code for S3 sleep. Before sleep it invokes event
> 'power_suspend' on all devices, and after sleep it calls 'power_resume'
> on devices. So maybe I will call the page check procedure after
> 'power_suspend' and 'power_resume'.
>
> But it is possible that memory gets damaged somewhere else after
> power_resume happens.
> Do you have any thought/suggestions?
Well, I think you should try what you've suggested above first. If that
doesn't narrow it down then we can brainstorm some other places to inspect.
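
The invariant such a page-queue check would test can be modeled offline
against the entries already dumped in this thread. This is a rough sketch
under stated assumptions, not the in-kernel check itself: check_tailq,
DUMP and KERNEL_MIN are hypothetical names, and the address cutoff assumes
amd64 kernel pointers live in the upper half of the address space. It relies
on the TAILQ rule that tqe_prev holds the address of the previous element's
tqe_next field; since pageq is the first member printed for struct vm_page,
that is the previous element's own address.

```python
KERNEL_MIN = 0xffff800000000000  # assumption: amd64 kernel addresses are above this

# address -> (pageq.tqe_next, pageq.tqe_prev), copied from the gdb dump.
DUMP = {
    0xfffffe00b5a27658: (0xfffffe00b5a124d8, 0xfffffe00b5b79038),
    0xfffffe00b5a124d8: (0xfffffe00b460abf8, 0xfffffe00b5a27658),
    0xfffffe00b460abf8: (0xfe26, 0xfffffe00b5a124d8),
}


def check_tailq(entries, start):
    """Walk the dumped entries from start and return a list of (addr, problem)."""
    problems = []
    seen = set()
    addr, expected_prev = start, None  # prev of the first dumped entry is not checked
    while addr in entries and addr not in seen:
        seen.add(addr)
        nxt, prev = entries[addr]
        if expected_prev is not None and prev != expected_prev:
            problems.append((addr, "tqe_prev does not point back at the previous element"))
        if nxt == 0:
            break  # normal end of queue
        if nxt < KERNEL_MIN:
            problems.append((addr, "tqe_next is not a kernel address"))
            break
        addr, expected_prev = nxt, addr
    return problems


if __name__ == "__main__":
    for addr, why in check_tailq(DUMP, 0xfffffe00b5a27658):
        print(hex(addr), why)
```

On the three dumped entries this flags only the last one, for its tqe_next
value. A kernel version would apply the same two tests to each element of
every queue, and could be called after 'power_suspend' and after
'power_resume' to narrow down when the damage appears.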
--
John Baldwin