stream benchmarking on RPi

Fri Sep 6 14:30:16 UTC 2013

On Fri, Sep 6, 2013 at 6:37 AM, Zbigniew Bodek <zbb at semihalf.com> wrote:
> Hello Jia-Shiun.
>
> Thanks for your effort in testing.
> I am actually in the middle of superpages tests and another benchmark and
> set of results will be very helpful especially for comparison.
>
> Just for the record: did you enable superpages for your kernel?
> SP are not yet enabled by default, therefore one needs to set
> vm.pmap.sp_enabled to non-zero value in loader.conf (if you are using
> loader) or set this value in src by editing sys/arm/arm/pmap-v6.c ->
> sp_enabled.
>
> Nevertheless I've made short tests on Armada XP (clang).
> I used two array sizes (default and 2 x default). I also made few runs to
> ensure that the results are steady.
> Please check below (improvement in copy can be seen but from what one can
> observe via sysctl vm.pmap.section not so many superpages are "requested"
> during the test):

Yes, I confirmed that superpages were not enabled yet. I thought they were
on by default; I should have paid more attention. So the improvement I've
seen has to be attributed to someone else. Any nominees? ;)

After enabling it in loader.rc ("set vm.pmap.sp_enabled=1"), the
benchmark did not show a big difference. As in your results,
differences are visible, but not big.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             372.6     0.043278     0.042943     0.043590
Scale:             31.1     0.529411     0.514686     0.545614
Add:               69.2     0.363791     0.346574     0.381367
Triad:             27.4     0.909578     0.875739     0.995989
-------------------------------------------------------------
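
As a sanity check, the "Best Rate" column can be rebuilt from the "Min time"
column. This is only a sketch in Python, assuming 8-byte doubles, 1,000,000
array elements and STREAM's usual byte counting (Copy and Scale touch two
arrays per pass, Add and Triad touch three). The variable and function names
are made up for illustration and do not come from stream.c.

```python
# Rebuild STREAM's "Best Rate" column from the minimum times in the table.
# Assumptions: 8-byte doubles, N = 1,000,000 elements, and STREAM's usual
# per-kernel byte counts.
N = 1_000_000          # array elements (assumed reduced size)
BYTES_PER_ELEM = 8     # sizeof(double), assumed

# Arrays touched per pass: Copy/Scale read 1 + write 1, Add/Triad read 2 + write 1.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Minimum times in seconds, copied from the table.
MIN_TIME = {"Copy": 0.042943, "Scale": 0.514686,
            "Add": 0.346574, "Triad": 0.875739}


def best_rate_mb_s(kernel):
    """Bytes moved in one pass divided by the fastest pass, in 10^6 bytes/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEM * N
    return bytes_moved / MIN_TIME[kernel] / 1e6


rates = {kernel: round(best_rate_mb_s(kernel), 1) for kernel in MIN_TIME}
for kernel, rate in rates.items():
    print(f"{kernel + ':':8s}{rate:10.1f} MB/s")
```

Under those assumptions the computed rates agree with the table to one
decimal place, so the numbers are at least self-consistent.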

sp showed only a little activity. I suppose its effect would be more
obvious for workloads that heavily use and fragment memory, rather than
sequential large-block accesses like the ones stream does? After several
stream runs:
# sysctl vm.pmap.section
vm.pmap.section.demotions: 0
vm.pmap.section.mappings: 0
vm.pmap.section.p_failures: 120
vm.pmap.section.promotions: 277

BTW I reduced the array size from 10M to 1M elements; otherwise it would
allocate more than 200MB and run for several minutes. It should not affect
the results much on processors of this speed.
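
The memory figure is easy to estimate. A rough sketch in Python, assuming
the stock benchmark layout of three arrays of 8-byte doubles (the function
name is hypothetical, for illustration only):

```python
# Estimate STREAM's static array footprint for a given array size.
# Assumptions: three arrays (a, b, c) of 8-byte doubles.
BYTES_PER_ELEM = 8
NUM_ARRAYS = 3


def footprint_mb(n_elements):
    """Total array memory in 10^6 bytes for arrays of n_elements each."""
    return NUM_ARRAYS * BYTES_PER_ELEM * n_elements / 1e6


for n in (10_000_000, 1_000_000):
    print(f"N = {n:>10,d}: {footprint_mb(n):6.1f} MB")
```

That gives about 240 MB for 10M elements and about 24 MB for 1M, which is
why the smaller size is the practical choice on a board with this little RAM.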

I was checking whether anything can be done to improve the performance
of the RPi. Building world takes days and nights. (But works! Ya!)
For stream, the results look limited by how the OS/compiler/etc. are used
rather than by a hard limit of the hardware. Let's see what else can be found.

Thanks,

Jia-Shiun.