Re: Software performance complexity (was The Case for Rust (in any system))
- In reply to: Gavin D. Howard: "Re: The Case for Rust (in any system)"
Date: Sat, 14 Sep 2024 12:19:07 UTC
On 9/13/24 21:24, Gavin D. Howard wrote:
>> Try and explain this for example:
>>
>> Sorting int array with clang++18 and subscripts...
>> User time = 4.74 seconds (.07900 minutes) (.00131 hours).
>> RSS = 4204 KB
>>
>> Sorting long array with clang++18 and subscripts...
>> User time = 2.22 seconds (.03700 minutes) (.00061 hours).
>> RSS = 4608 KB
>
> A new, curious participant here.
>
> My guess is that the ints are being extended to longs inside the loop,
> which would require an extra sign extension instruction.

According to the C standards, an int should never be promoted unless
necessary to perform an operation with a wider type, e.g. int + long,
int * float, or passing it as an argument of a wider type. The purpose
of int is to provide the fastest data type on any platform when you
don't care whether it uses 16, 32, or 64 bits. (A tiny example is in
the P.S. below.)

> I don't think that explains the time doubling, but simply running that
> one instruction may not be the only cause of performance loss from an
> extra instruction.
>
> That one instruction may actually be the straw that broke the L1 camel's
> back; without it, the L1 instruction cache may not overflow, but with
> it, the L1 instruction cache may overflow, causing cache misses into L2
> on every iteration of the loop. It would also occupy one of the
> arithmetic units, which could lead to less instruction level
> parallelism or give the compiler less room for unrolling the loop.
>
> Just a theory; I have no clue. If you have code to share, I'd love to
> see it and try to reproduce the effect.

I suspect the parameters for triggering certain optimizations are
different for C++ and long int than for other cases. See the link to
the LLVM GitHub issue earlier in the message you replied to; the link
to the code is there as well.

Also, clang is slightly faster than gcc on an old AMD Phenom. That
machine runs FreeBSD 14.0 plus the latest packages, just like the i5,
where gcc is much faster. The MD5s of the binaries are identical
regardless of where they were compiled. I'd expect that when just
using -O2 (and no -march=native).

Bottom line: there is no reliable way to predict software performance
in the real world. Measuring it empirically is the only way to be sure.

Cheers,

J

--
Life is a game. Play hard. Play fair. Have fun.
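
P.S. To make the promotion point concrete, here is a minimal sketch
(hypothetical code of my own, not the benchmark from the linked issue).
Only the mixed-type operations force the int to be converted:

    // promo.cpp -- illustration only: the usual arithmetic conversions
    #include <cstdio>

    int main() {
        int  i = -1;
        long n = 1L << 40;

        int    a = i + i;   // int + int: stays int, no promotion
        long   b = i + n;   // int + long: i is converted to long
        double c = i * 2.0; // int * double: i is converted to double

        std::printf("%d %ld %f\n", a, b, c);
        return 0;
    }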
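
P.P.S. And on the bottom line: a rough sketch of the kind of empirical
measurement I mean (a hypothetical harness around std::sort; swap in
the real workload from the issue):

    // bench.cpp -- time one run of the sort; repeat the runs yourself
    // and compare medians, since single runs are noisy.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        std::vector<long> v(10'000'000);
        std::mt19937_64 rng(42);
        for (auto &x : v) x = static_cast<long>(rng());

        auto t0 = std::chrono::steady_clock::now();
        std::sort(v.begin(), v.end());
        auto t1 = std::chrono::steady_clock::now();

        std::chrono::duration<double> dt = t1 - t0;
        std::printf("sort took %.3f seconds\n", dt.count());
        return 0;
    }

Compile both variants the same way (e.g. clang++18 -O2 bench.cpp) and
compare the numbers on the actual hardware; that is the only comparison
I'd trust.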