arm SMP on Cortex-A15
Ian Lepore
ian at FreeBSD.org
Thu Mar 27 13:45:38 UTC 2014
The r263251 changes are not a workaround, they are a required fix.
There are really two things happening in 263251...
The first, and absolutely required, part of the change is that the arm11
functions for TLB maintenance won't work on armv7 with MP extensions.
The instructions used in the arm11 tlb maintenance are not the new
"Inner Sharable" functions that broadcast the TLB operation to the other
cores. I think that may be at the root of the problem you're seeing.
The second part of the change, the part I'm not quite as sure about, is
my interpretation of the ARM ARM in regard to unified versus separate
TLBs. The way I read it, "Invalidate unified TLB by MVA Inner
Sharable" (TLBIMVAAIS) invalidates both the data and instruction TLB
entries if they are separate. My evidence for that is that there are no
separate TLB operations that broadcast the maintenance operations
through the inner sharable domain, there are only the unified ops. And
then there are these two items after the table of operations in section
B3.12.34 CP15 c8, TLB maintenance operations:
* If an Instruction TLB or Data TLB operation is used on a system
that implements a Unified TLB then the operation is performed on
the Unified TLB
* If a Unified TLB operation is used on a system that implements
separate Instruction and Data TLBs then the operation is
performed on both the Instruction TLB and the Data TLB.
In other words, it looks to me like ARM's intent is to remove the
distinction between separate and unified TLBs when it comes to
maintenance ops, but for now they've left the old opcodes in place for
compatibility with old non-SMP code. But for SMP you have to use the
new unified ops, because they haven't provided "Inner Sharable" flavors
of the separate-TLB ops.
So in r263251 the flushI and flushD and flushID vectors all point to the
flushID function, and that means we use the unified-inner-sharable
operations for SMP, and I think the only potential downside to doing it
that way is that in the flushD case we're also doing a flush of the
branch target cache that doesn't really need to happen; that's probably
a minor performance hit but not a correctness problem.
-- Ian
On Thu, 2014-03-27 at 14:10 +0100, Wojciech Macek wrote:
> I started cleaning up and have some concerns. Sorry guys, the r263251 still
> looks like a workaround to me :)
> The fact that TLB cache issue is seen only on executable pages and that
> Olivier's fix helps suggests, that we are missing flush-I somewhere (crash
> appears only when deallocating userspace vm, so my guess is pmap_kremove,
> where definitely should be flushID). For test purposes I reverted r263251
> and added additional flushID/flushD logic to the pmap-v6 in
> performance-critical sections, or modified flushD to flushID where the
> speed is not the case. All changes integrated with previous patches can be
> found here:
> https://drive.google.com/file/d/0B-7yTLrPxaWtRFRHTUtpZ3VKbG8/edit?usp=sharing
> With above patch I'm able to run all tests without any problems.
>
> Maybe it would be better to fix these suspicious places in pmap instead?
>
> Regards,
> Wojtek
>
>
> 2014-03-25 14:35 GMT+01:00 Ian Lepore <ian at freebsd.org>:
>
> > Good, that means we can stop searching for the missing tlb flush when a
> > pte is invalidated. :) You guys can commit whenever you're ready (or I
> > can do it if you'd like).
> >
> > -- Ian
> >
> > On Tue, 2014-03-25 at 11:14 +0100, Wojciech Macek wrote:
> > > Hi Ian,
> > >
> > > The r263251 fix helped, thanks!
> > > So now, if you don';t have any objections, I will clean up a little the
> > > cpufunc.S & pmap-v6 changes and make them ready for submitting.
> > >
> > > Regards,
> > > Wojtek
> > >
> > >
> > > 2014-03-24 14:31 GMT+01:00 Wojciech Macek <wma at semihalf.com>:
> > >
> > > > Without the unconditional invalidation, the panic shows up just at the
> > > > beginning, after rootfs is mounted and init scripts are running. When a
> > > > userspace process is exitting, its memory resources are freed - this
> > is the
> > > > moment pmap_remove_pages fails due to tharanslation fault. It is the
> > > > "typical" crash I observed when TLB-cache holds an old entry. Below
> > there
> > > > is a backtrace, but I doubt if it can be helpful.
> > > >
> > > > Regarding old pte/tlb, the TLB cache contains entry from old process
> > > > context, when in-memory-PTE value is already correct - at least this
> > was
> > > > the scenario when I debugged it last year. So, invalidating after
> > *pte=0 is
> > > > definitely not our case. The issue shows up only on a15, where the
> > > > tlb-prefetcher can cache pte entries anytime.
> > > >
> > > > I believe I don't have r263251 integrated. I'll give it a try -
> > typically,
> > > > the tlb-caused crash appears only on pages containing shared libraries
> > code
> > > > (with executable attr), so there is a chance Olivier's fix help.
> > > >
> > > >
> > > > The fault:
> > > >
> > > > vm_fault(0xc5b894f0, 0, 2, 0) -> 1
> > > > Fatal kernel mode data abort: 'Translation Fault (P)'
> > > > trapframe: 0xef2cca40
> > > > FSR=00000817, FAR=00000030, spsr=60000013
> > > > r0 =00000000, r1 =c320a048, r2 =00000000, r3 =c3208074
> > > > r4 =c5b7cd08, r5 =c5b7cd04, r6 =c5b05800, r7 =c5b895ac
> > > > r8 =c320a044, r9 =fffffffe, r10=c5b895ac, r11=ef2ccae0
> > > > r12=00000000, ssp=ef2cca90, slr=c0604148, pc =c0628a60
> > > >
> > > > [ thread pid 83 tid 100050 ]
> > > > Stopped at pmap_remove_pages+0x270: streq r3, [r0,
> > #0x030]
> > > > db> bt
> > > > Tracing pid 83 tid 100050 td 0xc5bc4320
> > > > db_trace_self() at db_trace_self
> > > > pc = 0xc061f62c lr = 0xc024ddbc (db_hex2dec+0x498)
> > > > sp = 0xef2cc738 fp = 0xef2cc750
> > > > r10 = 0xc0708270
> > > > db_hex2dec() at db_hex2dec+0x498
> > > > pc = 0xc024ddbc lr = 0xc024d76c (db_command_loop+0x2f0)
> > > > sp = 0xef2cc758 fp = 0xef2cc7f8
> > > > r4 = 0x00000000 r5 = 0x00000000
> > > > r6 = 0xc0695cf1
> > > > db_command_loop() at db_command_loop+0x2f0
> > > > pc = 0xc024d76c lr = 0xc024d4dc (db_command_loop+0x60)
> > > > sp = 0xef2cc800 fp = 0xef2cc810
> > > > r4 = 0xc0666f88 r5 = 0xc067b997
> > > > r6 = 0xc0752954 r7 = 0xc0748f80
> > > > r8 = 0xef2cca40 r9 = 0xc07084e0
> > > > r10 = 0xc0748f84
> > > > db_command_loop() at db_command_loop+0x60
> > > > pc = 0xc024d4dc lr = 0xc024ffb8 (X_db_symbol_values+0x254)
> > > > sp = 0xef2cc818 fp = 0xef2cc938
> > > > r4 = 0x00000000 r5 = 0xef2cc820
> > > > r6 = 0xc0748fb0
> > > > X_db_symbol_values() at X_db_symbol_values+0x254
> > > > pc = 0xc024ffb8 lr = 0xc0430554 (kdb_trap+0x164)
> > > > sp = 0xef2cc940 fp = 0xef2cc968
> > > > r4 = 0x00000000 r5 = 0x00000817
> > > > r6 = 0xc0748fb0 r7 = 0xc0748f80
> > > > kdb_trap() at kdb_trap+0x164
> > > > pc = 0xc0430554 lr = 0xc0632ef0 (data_abort_handler+0x7dc)
> > > > sp = 0xef2cc970 fp = 0xef2cc988
> > > > r4 = 0xef2cca40 r5 = 0x600000d3
> > > > r6 = 0x00000030 r7 = 0x00000817
> > > > r8 = 0xc5b894f0 r9 = 0x00000001
> > > > r10 = 0xef2cca40
> > > > data_abort_handler() at data_abort_handler+0x7dc
> > > > pc = 0xc0632ef0 lr = 0xc0632cc0 (data_abort_handler+0x5ac)
> > > > sp = 0xef2cc990 fp = 0xef2cca38
> > > > r4 = 0x00000817 r5 = 0xc5bc4320
> > > > r6 = 0xc5a47a0c r7 = 0x00000004
> > > > data_abort_handler() at data_abort_handler+0x5ac
> > > > pc = 0xc0632cc0 lr = 0xc0621214 (exception_exit)
> > > > sp = 0xef2cca40 fp = 0xef2ccae0
> > > > r4 = 0xc5b7cd08 r5 = 0xc5b7cd04
> > > > r6 = 0xc5b05800 r7 = 0xc5b895ac
> > > > r8 = 0xc320a044 r9 = 0xfffffffe
> > > > r10 = 0xc5b895ac
> > > > exception_exit() at exception_exit
> > > > pc = 0xc0621214 lr = 0xc0604148 (PHYS_TO_VM_PAGE+0x48)
> > > > sp = 0xef2cca94 fp = 0xef2ccae0
> > > > r0 = 0x00000000 r1 = 0xc320a048
> > > > r2 = 0x00000000 r3 = 0xc3208074
> > > > r4 = 0xc5b7cd08 r5 = 0xc5b7cd04
> > > > r6 = 0xc5b05800 r7 = 0xc5b895ac
> > > > r8 = 0xc320a044 r9 = 0xfffffffe
> > > > r10 = 0xc5b895ac r12 = 0x00000000
> > > > pmap_remove_pages() at pmap_remove_pages+0x270
> > > > pc = 0xc0628a60 lr = 0xc05f2d08 (vmspace_exit+0xd8)
> > > > sp = 0xef2ccae8 fp = 0xef2ccb10
> > > > r4 = 0xc5b895a8 r5 = 0xc5bc4320
> > > > r6 = 0x00000001 r7 = 0xc5a47960
> > > > r8 = 0xc5b895ac r9 = 0xc5b894f0
> > > > r10 = 0xc0753be0
> > > > vmspace_exit() at vmspace_exit+0xd8
> > > > pc = 0xc05f2d08 lr = 0xc03a7348 (exit1+0x930)
> > > > sp = 0xef2ccb18 fp = 0xef2ccb70
> > > > r4 = 0xc5a479fc r5 = 0x00000004
> > > > r6 = 0xc583861c r7 = 0x00000001
> > > > r8 = 0xc5a47960 r9 = 0xc5bc4320
> > > > r10 = 0xc5a47a0c
> > > > exit1() at exit1+0x930
> > > > pc = 0xc03a7348 lr = 0xc03f1604 (sigexit+0x8c4)
> > > > sp = 0xef2ccb78 fp = 0xef2ccd68
> > > > r4 = 0x00000002 r5 = 0xc5bc4320
> > > > r6 = 0xc5a47960 r7 = 0xc5a47a0c
> > > > r8 = 0xc5bc4320 r9 = 0xc5b7a000
> > > > r10 = 0x00000002
> > > > sigexit() at sigexit+0x8c4
> > > > pc = 0xc03f1604 lr = 0xc03f23a0 (postsig+0x39c)
> > > > sp = 0xef2ccd70 fp = 0xef2cce18
> > > > r4 = 0x00000001 r5 = 0xc5bc4320
> > > > r6 = 0xc5a47960 r7 = 0xc5b7aab8
> > > > r8 = 0xc5a47a0c r9 = 0xc5b7a000
> > > > r10 = 0x00000002
> > > > postsig() at postsig+0x39c
> > > > pc = 0xc03f23a0 lr = 0xc044388c (ast+0x4f4)
> > > > sp = 0xef2cce20 fp = 0xef2cce58
> > > > r4 = 0x00000001 r5 = 0xc5bc4320
> > > > r6 = 0xc5a47960 r7 = 0xc5a47a0c
> > > > r8 = 0xc5a47a0c r9 = 0x01020804
> > > > r10 = 0x00000ab8
> > > > ast() at ast+0x4f4
> > > > pc = 0xc044388c lr = 0xc0621080 (swi_entry+0x6c)
> > > > sp = 0xef2cce60 fp = 0xbfffe438
> > > > r4 = 0x40000013 r5 = 0xc5bc4320
> > > > r6 = 0x00000001 r7 = 0x00000154
> > > > r8 = 0x20037008 r9 = 0xbfffee5c
> > > > r10 = 0xbfffea10
> > > > swi_entry() at swi_entry+0x6c
> > > > pc = 0xc0621080 lr = 0xc0621080 (swi_entry+0x6c)
> > > > sp = 0xef2cce60 fp = 0xbfffe438
> > > > Unable to unwind further
> > > > db>
> > > >
> > > >
> > > >
> > > > 2014-03-22 14:20 GMT+01:00 Ian Lepore <ian at freebsd.org>:
> > > >
> > > > On Fri, 2014-03-21 at 07:20 +0100, Wojciech Macek wrote:
> > > >> > No, changing flushD to flushID did not make any difference, but I
> > think
> > > >> it
> > > >> > should be there - D-only flushing might not be sufficient.
> > > >> >
> > > >>
> > > >> Olivier reminded me right after I posted that: last week I made a
> > change
> > > >> to cpufunc.c that makes flushD and flushID the same. So of course it
> > > >> made no difference. :) It really should be flushID though, in case
> > > >> that ever changes.
> > > >>
> > > >> You didn't say whether you have that change, which was r263251.
> > > >>
> > > >> > Currently, I'm running pmap_kernel_internal attached below. It is
> > doing
> > > >> > unconditional flushID at the end, just like the old comment was
> > saying
> > > >> :)
> > > >> > SMP seems to be stable.
> > > >> >
> > > >>
> > > >> That seems to say that somehow there is a valid TLB entry even though
> > > >> the old pte for that entry is zero. That means there's a problem
> > > >> somewhere else in the code, but I don't see it. It looks to me like
> > we
> > > >> do a TLB flush everywhere that we zero out a pte.
> > > >>
> > > >> You said without the unconditional flush it panics at startup. Where
> > in
> > > >> startup? Early, or after init is launched or what? Where does the
> > > >> panic backtrace to?
> > > >>
> > > >> If we've got some other pte/tlb maintenance problem, I'd hate to hide
> > it
> > > >> with this unconditional flush and have it appear as some other problem
> > > >> later that will be even harder to track down.
> > > >>
> > > >> -- Ian
> > > >>
> > > >>
> > > >>
> > > >
> >
> >
> >
> _______________________________________________
> freebsd-arm at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arm
> To unsubscribe, send any mail to "freebsd-arm-unsubscribe at freebsd.org"
More information about the freebsd-arm
mailing list