svn commit: r280279 - head/sys/sys
Konstantin Belousov
kostikbel at gmail.com
Fri Mar 20 13:02:27 UTC 2015
On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
> Author: jhb
> Date: Fri Mar 20 10:27:06 2015
> New Revision: 280279
> URL: https://svnweb.freebsd.org/changeset/base/280279
>
> Log:
> Expand the bitcount* API to support 64-bit integers, plain ints and longs
> and create a "hidden" API that can be used in other system headers without
> adding namespace pollution.
> - If the POPCNT instruction is enabled at compile time, use
> __builtin_popcount*() to implement __bitcount*(), otherwise fall back
> to software implementations.
Are you aware of the Haswell erratum HSD146?  I see the described
behaviour on machines back to Sandy Bridge, but not on Nehalem.
HSD146. POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non-dependent instructions have executed.
Jilles noted that gcc head and 4.9.2 already provide a workaround by
XORing the destination register.  I have a patch for the amd64 pmap;
see the end of the message.
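For reference, a minimal sketch of the compile-time dispatch the log
describes.  The __POPCNT__ guard is an assumption on my part (gcc and
clang predefine it when -mpopcnt or -msse4.2 is in effect); the actual
test in <sys/types.h> may differ:

#ifdef __POPCNT__
/* POPCNT enabled at compile time: use the compiler builtins. */
#define	__bitcount32(x)	__builtin_popcount((unsigned int)(x))
#define	__bitcount64(x)	__builtin_popcountll((unsigned long long)(x))
#else
/* Otherwise fall back to the software (SWAR) versions sketched below. */
#endif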
> - Use the existing bitcount16() and bitcount32() from <sys/systm.h> to
> implement the non-POPCNT __bitcount16() and __bitcount32() in
> <sys/types.h>.
Why is it in sys/types.h?
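For readers unfamiliar with the SWAR ("SIMD within a register") method
used by the existing functions, the idea is to sum bit counts in
progressively wider fields.  A sketch in the spirit of bitcount32()
from <sys/systm.h> (the _sketch name is mine, not the header's):

#include <stdint.h>

static inline uint32_t
bitcount32_sketch(uint32_t x)
{
	/* Sum adjacent 1-bit fields into 2-bit fields. */
	x = (x & 0x55555555) + ((x & 0xaaaaaaaa) >> 1);
	/* Sum adjacent 2-bit fields into 4-bit fields. */
	x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2);
	/* Sum adjacent 4-bit fields into per-byte counts. */
	x = (x + (x >> 4)) & 0x0f0f0f0f;
	/* Fold the four byte counts together; the total fits in a byte. */
	x += x >> 8;
	x += x >> 16;
	return (x & 0xff);
}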
> - For the non-POPCNT __bitcount64(), use a similar SWAR method on 64-bit
> systems. For 32-bit systems, use two __bitcount32() operations on the
> two halves.
> - Use __bitcount32() to provide a __bitcount() that operates on plain ints.
> - Use either __bitcount32() or __bitcount64() to provide a
> __bitcountl() that operates on longs.
> - Add public bitcount*() wrappers for __bitcount*() for use in the kernel
> in <sys/libkern.h>.
> - Use __bitcountl() instead of __builtin_popcountl() in BIT_COUNT().
>
> Discussed with: bde
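To make the bullets above concrete, a sketch of the non-POPCNT paths
(illustrative only; the committed header differs in naming and exact
guards, and bitcount32_sketch() is the function sketched earlier):

#include <stdint.h>

/*
 * 64-bit count: the widened SWAR reduction on 64-bit systems, two
 * 32-bit counts on 32-bit systems.
 */
static inline uint64_t
bitcount64_sketch(uint64_t x)
{
#if __SIZEOF_LONG__ == 8
	x = (x & 0x5555555555555555) + ((x & 0xaaaaaaaaaaaaaaaa) >> 1);
	x = (x & 0x3333333333333333) + ((x & 0xcccccccccccccccc) >> 2);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;
	x += x >> 8;
	x += x >> 16;
	x += x >> 32;
	return (x & 0xff);
#else
	return (bitcount32_sketch(x >> 32) + bitcount32_sketch(x));
#endif
}

/*
 * The plain-int and long variants pick the matching width; the
 * kernel-facing bitcount*() names can then wrap these one-for-one in
 * <sys/libkern.h>.
 */
#define	bitcount_sketch(x)	bitcount32_sketch((unsigned int)(x))
#if __SIZEOF_LONG__ == 8
#define	bitcountl_sketch(x)	bitcount64_sketch((unsigned long)(x))
#else
#define	bitcountl_sketch(x)	bitcount32_sketch((unsigned long)(x))
#endif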
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..f6fbc33 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -413,6 +417,7 @@ static void free_pv_chunk(struct pv_chunk *pc);
static void free_pv_entry(pmap_t pmap, pv_entry_t pv);
static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
static int popcnt_pc_map_elem(uint64_t elem);
+static int popcnt_pc_map_elem_pq(uint64_t elem);
static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
static void reserve_pv_entries(pmap_t pmap, int needed,
struct rwlock **lockp);
@@ -2997,6 +3020,29 @@ popcnt_pc_map_elem(uint64_t elem)
}
/*
+ * The errata for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected". It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual. It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for errata are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
+ */
+static int
+popcnt_pc_map_elem_pq(uint64_t elem)
+{
+ u_long result;
+
+ __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+ : "=&r" (result) : "rm" (elem));
+ return (result);
+}
+
+/*
* Ensure that the number of spare PV entries in the specified pmap meets or
* exceeds the given count, "needed".
*
@@ -3029,9 +3075,9 @@ retry:
free += popcnt_pc_map_elem(pc->pc_map[1]);
free += popcnt_pc_map_elem(pc->pc_map[2]);
} else {
- free = popcntq(pc->pc_map[0]);
- free += popcntq(pc->pc_map[1]);
- free += popcntq(pc->pc_map[2]);
+ free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+ free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+ free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
}
if (free == 0)
break;
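Two details of the inline asm above may be worth noting: writing the
32-bit subregister with "xorl %k0,%k0" also clears the upper half of
the 64-bit register on amd64, so the single xorl (a recognized zeroing
idiom) clears the whole destination; and the "=&r" early-clobber
constraint keeps the compiler from assigning the same register to the
input operand, which the xor would otherwise wipe out before popcntq
reads it.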