svn commit: r280279 - head/sys/sys
Jung-uk Kim
jkim at FreeBSD.org
Mon Apr 13 20:04:47 UTC 2015
On 04/13/2015 13:36, Alan Cox wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin
>>>>> wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin
>>>>>>> wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints
>>>>>>>>   and longs and create a "hidden" API that can be used in other
>>>>>>>>   system headers without adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise
>>>>>>>>     fall back to software implementations.
>>>>>>> Are you aware of the Haswell erratum HSD146? I see the
>>>>>>> described behaviour on machines back to SandyBridge,
>>>>>>> but not on Nehalems.
>>>>>>>
>>>>>>> HSD146. POPCNT Instruction May Take Longer to Execute
>>>>>>> Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit
>>>>>>> operand may be delayed until previous non-dependent
>>>>>>> instructions have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a
>>>>>>> workaround by xoring the dst register. I have a
>>>>>>> patch for amd64 pmap, see the end of the message.
>>>>>> No, I was not aware, but I think it's hard to fix this
>>>>>> anywhere but the compiler. I set CPUTYPE in src.conf on
>>>>>> my Ivy Bridge desktop and clang uses POPCNT for this
>>>>>> function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on
>>>>>> my system failed to boot in older qemu because the kernel
>>>>>> panicked with an illegal instruction fault in this
>>>>>> function.)
>>> Does it do the same for the similar home made popcount in
>>> pmap?:
>> Yes:
>>
>> ffffffff807658d4: f6 04 25 46 e2 d6 80  testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db: 80
>> ffffffff807658dc: 74 32                 je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
>> ffffffff807658de: 48 89 4d b8           mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2: f3 48 0f b8 4d b8     popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8: 48 8b 50 20           mov    0x20(%rax),%rdx
>> ffffffff807658ec: 48 89 55 b0           mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0: f3 48 0f b8 55 b0     popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6: 01 ca                 add    %ecx,%edx
>> ffffffff807658f8: 48 8b 48 28           mov    0x28(%rax),%rcx
>> ffffffff807658fc: 48 89 4d a8           mov    %rcx,-0x58(%rbp)
>> ffffffff80765900: f3 48 0f b8 4d a8     popcnt -0x58(%rbp),%rcx
>> ffffffff80765906: eb 1b                 jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
>> ffffffff80765908: 0f 1f 84 00 00 00 00  nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f: 00
>> ffffffff80765910: f3 48 0f b8 c9        popcnt %rcx,%rcx
>> ffffffff80765915: f3 48 0f b8 50 20     popcnt 0x20(%rax),%rdx
>> ffffffff8076591b: 01 ca                 add    %ecx,%edx
>> ffffffff8076591d: f3 48 0f b8 48 28     popcnt 0x28(%rax),%rcx
>> ffffffff80765923: 01 d1                 add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and
>> blist_meta_fill():
>>
>> 	/* Count the number of blocks we're about to allocate */
>> 	bitmap = scan->u.bmu_bitmap & mask;
>> 	for (nblks = 0; bitmap != 0; nblks++)
>> 		bitmap &= bitmap - 1;
>>
>>> Always using new API would lose the micro-optimizations given
>>> by the runtime decision for default CFLAGS (used by
>>> distributions for portability). To keep them, it seems best to
>>> keep the inline asm but replace popcnt_pc_map_elem(elem) by
>>> __bitcount64(elem). -mno-popcnt can then be used to work
>>> around slowness in the software (that is actually hardware)
>>> case.
>> I'm not sure if bitcount64() is strictly better than the loop in
>> this case even though it is O(1) given the claimed nature of the
>> values in the comment.
>>
>
>
> I checked. Even with zeroes being more common than ones,
> bitcount64() is faster than the simple loop. Using bitcount64,
> reserve_pv_entries() takes on average 4265 cycles during
> "buildworld" on my test machine. In contrast, with the simple
> loop, it takes on average 4507 cycles. Even though bitcount64 is a
> lot larger than the simple loop, we do the 3 bit count operations
> many times in a loop, so the extra i-cache misses are being made up
> for by the repeated execution of the faster code.
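
For reference, the two software approaches being compared can be sketched
roughly as below. This is only an illustration: popcount64_loop() and
popcount64_swar() are hypothetical names, and the SWAR version is the
generic textbook form, not necessarily the exact code behind
__bitcount64() in sys/types.h.

```c
#include <stdint.h>

/*
 * Simple loop: clears the lowest set bit once per iteration, so the
 * cost is proportional to the number of set bits.
 */
static inline int
popcount64_loop(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}

/*
 * Generic SWAR bit count: a fixed number of operations regardless of
 * the input.  Assumption: textbook variant, not copied from FreeBSD.
 */
static inline int
popcount64_swar(uint64_t x)
{
	/* Sum adjacent bits into 2-bit fields. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	/* Sum adjacent 2-bit fields into 4-bit fields. */
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	/* Sum adjacent 4-bit fields into bytes. */
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* Add all bytes together; the total lands in the top byte. */
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}
```

The loop is smaller but its run time depends on how many bits are set,
while the SWAR form costs the same for every input, which fits the
measurement quoted above.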
>
> However, in the popcnt case, we are spilling the bit map to memory
> in order to popcnt it. That's rather silly:
>
> 3570: 48 8b 48 18           mov    0x18(%rax),%rcx
> 3574: f6 04 25 00 00 00 00  testb  $0x80,0x0
> 357b: 80
> 357c: 74 42                 je     35c0 <pmap_demote_pde_locked+0x2f0>
> 357e: 48 89 4d b8           mov    %rcx,-0x48(%rbp)
> 3582: 31 c9                 xor    %ecx,%ecx
> 3584: f3 48 0f b8 4d b8     popcnt -0x48(%rbp),%rcx
> 358a: 48 8b 50 20           mov    0x20(%rax),%rdx
> 358e: 48 89 55 b0           mov    %rdx,-0x50(%rbp)
> 3592: 31 d2                 xor    %edx,%edx
> 3594: f3 48 0f b8 55 b0     popcnt -0x50(%rbp),%rdx
> 359a: 01 ca                 add    %ecx,%edx
> 359c: 48 8b 48 28           mov    0x28(%rax),%rcx
> 35a0: 48 89 4d a8           mov    %rcx,-0x58(%rbp)
> 35a4: 31 c9                 xor    %ecx,%ecx
> 35a6: f3 48 0f b8 4d a8     popcnt -0x58(%rbp),%rcx
> 35ac: 01 d1                 add    %edx,%ecx
> 35ae: e9 12 01 00 00        jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
Please try the attached patch.
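
The attachment itself is not reproduced here. As a rough sketch of the
two points raised in the thread (keep the operand in a register instead
of spilling it, and zero the destination first as the HSD146 workaround),
something along these lines could be used. The names popcnt64_reg() and
count_set3() are hypothetical and this should not be read as the
contents of pmap.diff.

```c
#include <stdint.h>

/*
 * Hypothetical helper: count set bits in a 64-bit word.  The asm path
 * is only compiled when the build targets x86-64 with POPCNT enabled;
 * otherwise the compiler builtin is used.
 */
static inline uint64_t
popcnt64_reg(uint64_t x)
{
#if defined(__x86_64__) && defined(__POPCNT__)
	uint64_t result;

	/*
	 * Zero the destination before POPCNT so the instruction does not
	 * wait on the previous value of that register.  "=&r" keeps the
	 * result in a different register than x, and the "r" input
	 * constraint forces a register source rather than a memory
	 * operand.
	 */
	__asm__("xorl %k0,%k0\n\tpopcntq %1,%0"
	    : "=&r" (result) : "r" (x) : "cc");
	return (result);
#else
	return ((uint64_t)__builtin_popcountll(x));
#endif
}

/* Hypothetical helper: total set bits across a three-word bitmap. */
static inline int
count_set3(const uint64_t map[3])
{
	return ((int)(popcnt64_reg(map[0]) + popcnt64_reg(map[1]) +
	    popcnt64_reg(map[2])));
}
```

Note that the asm path assumes POPCNT is available at run time whenever
it was enabled at compile time; a kernel built with default CFLAGS would
still need the runtime CPU feature check discussed above.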
Jung-uk Kim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pmap.diff
Type: text/x-patch
Size: 2181 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/svn-src-head/attachments/20150413/39f35620/attachment.bin>