SSE in libthr
Rui Paulo
rpaulo at me.com
Fri Mar 27 20:49:08 UTC 2015
On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen at FreeBSD.org> wrote:
>
> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of
> pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd
> like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
> #define MUTEX_INIT_LINK(m) do { \
> (m)->m_qe.tqe_prev = NULL; \
> (m)->m_qe.tqe_next = NULL; \
> } while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
> movq $0x0,0x8(%rax)
> movq $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
> xorps %xmm0,%xmm0
> movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce performance by
> incurring extra overhead due to context-switching the FPU state.
>
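For anyone following along, the reason the compiler can merge the two stores is that the two link pointers sit next to each other, so clearing both is one 16-byte write on amd64. Here is a minimal sketch of that layout. The struct names `link_sketch` and `mutex_sketch`, the `m_owner` field, and the helper `links_are_adjacent` are stand-ins I made up for illustration, not the real libthr definitions. Only the macro is taken from the quoted code.

```c
#include <stddef.h>

struct mutex_sketch;

/* Hypothetical stand-in for the queue linkage: two adjacent pointers. */
struct link_sketch {
	struct mutex_sketch *tqe_next;	/* next element */
	struct mutex_sketch **tqe_prev;	/* address of the previous next pointer */
};

/* Hypothetical stand-in for the mutex; m_owner is only a placeholder. */
struct mutex_sketch {
	int m_owner;
	struct link_sketch m_qe;
};

/* The macro as quoted from thr_mutex.c. */
#define MUTEX_INIT_LINK(m) do {		\
	(m)->m_qe.tqe_prev = NULL;	\
	(m)->m_qe.tqe_next = NULL;	\
} while (0)

/* Returns 1 if tqe_prev immediately follows tqe_next in memory. */
static int
links_are_adjacent(void)
{
	return (offsetof(struct link_sketch, tqe_prev) ==
	    offsetof(struct link_sketch, tqe_next) + sizeof(void *));
}
```

With 8-byte pointers the two fields span 16 bytes, which is exactly what one movups of a zeroed %xmm0 covers.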
> As I mentioned, this code is used in the common path of pthread_mutex_unlock. I
> have a simple test program that creates four threads, all contending for a
> single mutex, and measures the total number of lock acquisitions over several
> seconds. When libthr is built with SSE, as it is today, I get around 53 million
> locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace
> shows around 790,000 calls to fpudna with SSE versus 10 without. There could be other
> factors involved, but I presume that the FPU context switches account for most
> of the change in performance.
>
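In case someone wants to reproduce the numbers, a test of the kind described above might look roughly like the sketch below. This is not Eric's program, which was not posted. The names `worker`, `count_lock_acquisitions`, and `MAX_THREADS` are my own, and the real test may measure things differently.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_THREADS 16			/* arbitrary limit for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;		/* tells the workers to exit */
static unsigned long total;		/* acquisitions; protected by lock */

/* Each worker takes and releases the one shared mutex until told to stop. */
static void *
worker(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&lock);
		total++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/*
 * Run nthreads contending threads for the given number of seconds and
 * return the total number of lock acquisitions (0 on bad arguments).
 */
static unsigned long
count_lock_acquisitions(int nthreads, unsigned int seconds)
{
	pthread_t tids[MAX_THREADS];
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return (0);
	total = 0;
	atomic_store(&stop, false);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, worker, NULL) != 0)
			break;
		started++;
	}
	sleep(seconds);
	atomic_store(&stop, true);
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return (total);
}
```

Calling `count_lock_acquisitions(4, 5)` from main and printing the result would correspond to the four-thread, five-second setup. Comparing the output against a libthr built with and without -mno-sse is the interesting part.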
> Even when I add some SSE usage in the application--incidentally, these same
> instructions--building libthr without SSE improves performance from 53.5 million
> to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance improves
> by 3-5%.
>
> I would appreciate your thoughts and feedback. The proposed patch is below.
>
> Eric
>
>
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===================================================================
> --- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
> @@ -1,3 +1,8 @@
> #$FreeBSD$
>
> SRCS+= _umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse

Good catch!

Regarding your patch, I think we should disable even more, if possible. How about:

CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

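One way to check that the flags actually reach the compiler is to look at the predefined macros that gcc and clang set for each extension (`__MMX__`, `__3dNOW__`, `__SSE__`, `__SSE2__`, `__SSE3__`). A small sketch follows. The function name `enabled_simd_mask` and the bit assignments are made up for this example and are not part of libthr.

```c
/*
 * Returns a bitmask of the SIMD extensions this translation unit was
 * compiled with, based on the compiler's predefined macros.
 * Bit assignments are arbitrary and only for this sketch.
 */
static int
enabled_simd_mask(void)
{
	int mask = 0;

#ifdef __MMX__
	mask |= 1 << 0;
#endif
#ifdef __3dNOW__
	mask |= 1 << 1;
#endif
#ifdef __SSE__
	mask |= 1 << 2;
#endif
#ifdef __SSE2__
	mask |= 1 << 3;
#endif
#ifdef __SSE3__
	mask |= 1 << 4;
#endif
	return (mask);
}
```

Built with the flags above, I would expect this to return 0. With a default amd64 build, at least the MMX, SSE, and SSE2 bits should be set.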
--
Rui Paulo