svn commit: r280279 - head/sys/sys
Adrian Chadd
adrian at freebsd.org
Mon Apr 13 19:06:48 UTC 2015
Hi,
These CPUs are supposed to have loop stream detection hardware. Is it
not detecting/streaming this loop for us?
-a
On 13 April 2015 at 10:36, Alan Cox <alc at rice.edu> wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>>   adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>>     to software implementations.
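
(For reference, the two halves that log message describes might look
roughly like the sketch below. This is an illustrative reconstruction,
assuming the __POPCNT__ predefine that gcc/clang emit under -mpopcnt and
the usual SWAR fallback; it is not necessarily the committed code.)

    #include <stdint.h>

    static inline uint64_t
    bitcount64_sketch(uint64_t x)
    {
    #ifdef __POPCNT__
            /* POPCNT enabled at compile time: let the compiler emit it. */
            return ((uint64_t)__builtin_popcountll(x));
    #else
            /* Software fallback: sum bits in progressively wider fields. */
            x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL);
            x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
            x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
            return ((x * 0x0101010101010101ULL) >> 56);
    #endif
    }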
>>>>>>> Are you aware of the Haswell erratum HSD146? I see the described behaviour
>>>>>>> on machines back to Sandy Bridge, but not on Nehalem.
>>>>>>> HSD146. POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround by
>>>>>>> XORing the destination register first. I have a patch for the amd64 pmap;
>>>>>>> see the end of the message.
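
(The compiler workaround amounts to zeroing the destination register
before the popcnt, which breaks the false output dependency. A minimal
inline-asm sketch of the same idea, not kib's actual patch:)

    #include <stdint.h>

    static inline uint64_t
    popcnt64_workaround(uint64_t x)
    {
            uint64_t count;

            /*
             * xorl zeroes the full 64-bit destination register and, more
             * importantly, tells the CPU the register has no live prior
             * value, so popcnt need not wait on older instructions that
             * wrote it (the HSD146 behaviour).
             */
            __asm__("xorl %k0,%k0; popcntq %1,%0" : "=&r" (count) : "rm" (x));
            return (count);
    }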
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>>> compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop, and clang
>>>>>> uses POPCNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>>> to boot in older qemu: the kernel panicked with an illegal-instruction
>>>>>> fault in this function.)
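
(That panic is the usual argument for a runtime guard rather than a
compile-time one. A sketch of such a guard, assuming the kernel's
cpu_feature2 flags word and CPUID2_POPCNT from <machine/specialreg.h>;
the helper name is hypothetical:)

    static int
    count_bits_runtime(uint64_t x)
    {
            uint64_t count;

            if ((cpu_feature2 & CPUID2_POPCNT) == 0) {
                    /* Old CPU (or old qemu): portable bit-clearing loop. */
                    for (count = 0; x != 0; count++)
                            x &= x - 1;
            } else {
                    __asm__("popcntq %1,%0" : "=r" (count) : "rm" (x));
            }
            return ((int)count);
    }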
>>> Does it do the same for the similar home-made popcount in pmap?
>> Yes:
>>
>> ffffffff807658d4: f6 04 25 46 e2 d6 80 testb $0x80,0xffffffff80d6e246
>> ffffffff807658db: 80
>> ffffffff807658dc: 74 32 je ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
>> ffffffff807658de: 48 89 4d b8 mov %rcx,-0x48(%rbp)
>> ffffffff807658e2: f3 48 0f b8 4d b8 popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8: 48 8b 50 20 mov 0x20(%rax),%rdx
>> ffffffff807658ec: 48 89 55 b0 mov %rdx,-0x50(%rbp)
>> ffffffff807658f0: f3 48 0f b8 55 b0 popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6: 01 ca add %ecx,%edx
>> ffffffff807658f8: 48 8b 48 28 mov 0x28(%rax),%rcx
>> ffffffff807658fc: 48 89 4d a8 mov %rcx,-0x58(%rbp)
>> ffffffff80765900: f3 48 0f b8 4d a8 popcnt -0x58(%rbp),%rcx
>> ffffffff80765906: eb 1b jmp ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
>> ffffffff80765908: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
>> ffffffff8076590f: 00
>> ffffffff80765910: f3 48 0f b8 c9 popcnt %rcx,%rcx
>> ffffffff80765915: f3 48 0f b8 50 20 popcnt 0x20(%rax),%rdx
>> ffffffff8076591b: 01 ca add %ecx,%edx
>> ffffffff8076591d: f3 48 0f b8 48 28 popcnt 0x28(%rax),%rcx
>> ffffffff80765923: 01 d1 add %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>>         /* Count the number of blocks we're about to allocate */
>>         bitmap = scan->u.bmu_bitmap & mask;
>>         for (nblks = 0; bitmap != 0; nblks++)
>>                 bitmap &= bitmap - 1;
>>
>>> Always using the new API would lose the micro-optimizations given by the
>>> runtime decision for default CFLAGS (used by distributions for portability).
>>> To keep them, it seems best to keep the inline asm but replace
>>> popcnt_pc_map_elem(elem) with __bitcount64(elem). -mno-popcnt can then
>>> be used to work around slowness in the software (that is actually
>>> hardware) case, as sketched below.
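
(Concretely, that suggestion might look something like the following
sketch: popcnt_pc_map_elem is the existing pmap helper, __bitcount64 is
the new API from r280279, the cpu_feature2 test keeps the runtime
decision, and the xor mirrors the workaround discussed above.)

    static __inline u_long
    popcnt_pc_map_elem(u_long elem)
    {
            u_long count;

            if ((cpu_feature2 & CPUID2_POPCNT) == 0)
                    return (__bitcount64(elem));    /* software path */
            /* Hardware path; the leading xor avoids the HSD146 stall. */
            __asm__("xorl %k0,%k0; popcntq %1,%0" : "=&r" (count) : "rm" (elem));
            return (count);
    }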
>> I'm not sure bitcount64() is strictly better than the loop in this case,
>> even though it is O(1), given the claimed nature of the values in the comment.
>>
>
>
> I checked. Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop. Using bitcount64, reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine. In
> contrast, with the simple loop, it takes on average 4507 cycles. Even
> though bitcount64 is a lot larger than the simple loop, we do the three
> bit-count operations many times in a loop, so the extra i-cache misses
> are made up for by the repeated execution of the faster code.
>
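
(One plausible way to collect such per-call cycle counts, for anyone who
wants to reproduce the comparison; this is a generic rdtsc sketch, not
necessarily how the numbers above were gathered:)

    #include <stdint.h>

    static inline uint64_t
    rdtsc(void)
    {
            uint32_t lo, hi;

            /* rdtsc returns the 64-bit timestamp counter in edx:eax. */
            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return (((uint64_t)hi << 32) | lo);
    }

    static uint64_t total_cycles, calls;

    /* Wrap the code under test; report total_cycles / calls afterwards. */
    #define MEASURE(stmt) do {                              \
            uint64_t start_ = rdtsc();                      \
            stmt;                                           \
            total_cycles += rdtsc() - start_;               \
            calls++;                                        \
    } while (0)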
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it. That's rather silly:
>
> 3570: 48 8b 48 18 mov 0x18(%rax),%rcx
> 3574: f6 04 25 00 00 00 00 testb $0x80,0x0
> 357b: 80
> 357c: 74 42                je     35c0 <pmap_demote_pde_locked+0x2f0>
> 357e: 48 89 4d b8 mov %rcx,-0x48(%rbp)
> 3582: 31 c9 xor %ecx,%ecx
> 3584: f3 48 0f b8 4d b8 popcnt -0x48(%rbp),%rcx
> 358a: 48 8b 50 20 mov 0x20(%rax),%rdx
> 358e: 48 89 55 b0 mov %rdx,-0x50(%rbp)
> 3592: 31 d2 xor %edx,%edx
> 3594: f3 48 0f b8 55 b0 popcnt -0x50(%rbp),%rdx
> 359a: 01 ca add %ecx,%edx
> 359c: 48 8b 48 28 mov 0x28(%rax),%rcx
> 35a0: 48 89 4d a8 mov %rcx,-0x58(%rbp)
> 35a4: 31 c9 xor %ecx,%ecx
> 35a6: f3 48 0f b8 4d a8 popcnt -0x58(%rbp),%rcx
> 35ac: 01 d1 add %edx,%ecx
> 35ae: e9 12 01 00 00       jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
>
> Caveat: I'm still using clang 3.5. Maybe the newer clang doesn't spill?
>
>
>