Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features

From: Mark Millard via arm <arm_at_freebsd.org>
Date: Thu, 25 Nov 2021 00:13:11 UTC
On 2021-Nov-24, at 15:25, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-24, at 13:23, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2021-Nov-24, at 13:19, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On 2021-Nov-24, at 01:51, Mark Millard <marklmi@yahoo.com> wrote:
>>> 
>>>> [Actually, the main [so: 14] equivalent.]
>>>> 
>>>> All Cortex-A72 based . . .
>>>> 
>>>> First, older system versions (before that update)
>>>> then after the update:
>>>> 
>>>> 
>>>> RPi4B 8 GiByte (older FreeBSD first, otherwise new),
>>>> Cortex-A72's:
>>>> 
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      51925.92k    58449.46k    60430.32k    61050.13k    61180.98k    61482.75k
>>>> 
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      28880.07k    30837.33k    31630.29k    31855.62k    31921.54k    32034.53k
>>>> 
>>>> So: slowed down, unlike the other examples below.
>>>> 
>>>> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      51894.33k    58540.45k    60815.22k    61534.47k    61906.84k    62042.10k
>>>> 
>>>> So: back to the prior speed.
>>>> 
>>>> But all these are based on config.txt containing:
>>>> 
>>>> over_voltage=6 
>>>> arm_freq=2000 
>>>> sdram_freq_min=3200 
>>>> force_turbo=1
>>>> 
>>>> (The RPi4B has a heat-sink and a fan.)
>>>> 
>>>> Note: See later about the RPi4B CPU features.
>>>> 
>>>> 
>>>> MACCHIATObin Double Shot (older first), Cortex-A72's:
>>>> 
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      50808.49k    58466.08k    60769.11k    61444.92k    61767.94k    61707.61k
>>>> 
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm     163579.14k   456319.27k   786544.01k   940234.41k  1003230.55k  1005671.31k
>>>> 
>>>> 
>>>> HoneyComb (older first), Cortex-A72's:
>>>> 
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      57659.60k    64599.05k    67719.81k    68373.74k    68724.24k    68793.80k
>>>> 
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm     177925.57k   502311.65k   866287.95k  1036500.35k  1106598.06k  1106721.91k
>>>> 
>>>> Rock64 (older first), Cortex-A53's:
>>>> 
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      18378.23k    23401.45k    24834.99k    25206.10k    25337.86k    25258.19k
>>>> 
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      52711.29k   163586.49k   318738.69k   420277.93k   461373.44k   463192.06k
>>>> 
>>>> 
>>>> OPi+2E (older first), Cortex-A7's (so armv7):
>>>> 
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm       9343.10k    11156.39k    11827.64k    11995.30k    12025.86k    12031.32k
>>>> 
>>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      11013.41k    13598.44k    14034.26k    15045.97k    15262.90k    15302.66k
>>>> 
>>>> 
>>>> 
>>>> For reference:
>>>> 
>>>> For the RPi4B examples (2 notes added):
>>>> 
>>>> CPU  0: ARM Cortex-A72 r0p3 affinity:  0
>>>>                Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32>
>>>> *** NOTE the lack of ",SHA2,SHA1,AES+PMULL" above ***
>>>> Instruction Set Attributes 1 = <>
>>>>      Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>>      Processor Features 1 = <>
>>>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>>   Memory Model Features 1 = <8bit VMID>
>>>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>>          Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>>          Debug Features 1 = <>
>>>>      Auxiliary Features 0 = <>
>>>>      Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
>>>> *** NOTE the lack of ",SHA2,SHA1,AES+VMULL" above ***
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>> 
>>>> For the MACCHIATObin Double Shot examples:
>>>> 
>>>> CPU  0: ARM Cortex-A72 r0p1 affinity:  0  0
>>>>                Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>>      Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>>      Processor Features 1 = <>
>>>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>>   Memory Model Features 1 = <8bit VMID>
>>>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>>          Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>>          Debug Features 1 = <>
>>>>      Auxiliary Features 0 = <>
>>>>      Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>> 
>>>> 
>>>> For the HoneyComb examples:
>>>> 
>>>> CPU  0: ARM Cortex-A72 r0p3 affinity:  0  0
>>>>                Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>>      Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>>      Processor Features 1 = <>
>>>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>>   Memory Model Features 1 = <8bit VMID>
>>>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>>          Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>>          Debug Features 1 = <>
>>>>      Auxiliary Features 0 = <>
>>>>      Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> For the Rock64 examples:
>>>> 
>>>> CPU  0: ARM Cortex-A53 r0p4 affinity:  0
>>>>                Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>>      Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>>      Processor Features 1 = <>
>>>>   Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
>>>>   Memory Model Features 1 = <8bit VMID>
>>>>   Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>>          Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>>          Debug Features 1 = <>
>>>>      Auxiliary Features 0 = <>
>>>>      Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>> 
>>>> 
>>>> For the OPi+2E examples:
>>>> 
>>>> CPU: ARM Cortex-A7 r0p5 (ECO: 0x00000000)
>>>> CPU Features: 
>>>> Multiprocessing, Thumb2, Security, Virtualization, Generic Timer, VMSAv7,
>>>> PXN, LPAE, Coherent Walk
>>>> Optional instructions: 
>>>> SDIV/UDIV, UMULL, SMULL, SIMD(ext)
>>>> LoUU:2 LoC:3 LoUIS:2 
>>>> Cache level 1:
>>>> 32KB/64B 4-way data cache WB Read-Alloc Write-Alloc
>>>> 32KB/32B 2-way instruction cache Read-Alloc
>>>> Cache level 2:
>>>> 512KB/64B 8-way unified cache WB Read-Alloc Write-Alloc
>>> 
>>> Note: as the issue applies to stable/13 and main [so: 14]
>>> (for example), I continue to use the freebsd-arm list
>>> instead of a list that reports commits to stable/* but
>>> not to main.
>>> 
>>> Relative to:
>>> 
>>> #define HWCAP_FP                0x00000001
>>> #define HWCAP_ASIMD             0x00000002
>>> #define HWCAP_EVTSTRM           0x00000004
>>> #define HWCAP_AES               0x00000008
>>> #define HWCAP_PMULL             0x00000010
>>> #define HWCAP_SHA1              0x00000020
>>> #define HWCAP_SHA2              0x00000040
>>> #define HWCAP_CRC32             0x00000080
>>> 
>>> The single-bit enabled OPENSSL_armcap that gets the slow
>>> result is:
>>> 
>>> # env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
>>> . . .
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      28427.04k    30712.32k    31446.00k    31683.40k    31829.10k    31839.55k
>>> 
>>> The illegal instruction ones for aes-256-gcm were:
>>> 
>>> # env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
>>> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>>> 
>>> # env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
>>> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>>> 
>>> (For sha256, the values that produce illegal instructions differ.)
>>> 
>>> Ignoring the illegal-instruction producing bits, HWCAP_FP mixed
>>> with any one of the other bits was also similarly slow.
>>> 
>>> Enabling all of the bits that do not produce illegal instructions was
>>> also similarly slow:
>>> 
>>> # env OPENSSL_armcap=219 openssl speed -evp aes-256-gcm
>>> . . .
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      28922.63k    30711.51k    31522.15k    31722.15k    31788.97k    31845.03k
>>> 
>>> Disabling just HWCAP_FP from that got the fast category of
>>> result:
>>> 
>>> # env OPENSSL_armcap=218 openssl speed -evp aes-256-gcm
>>> . . .
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      49543.14k    58068.22k    60236.56k    60724.37k    61216.09k    61212.99k
>>> 
>>> 
>>> As for sha256 . . .
>>> 
>>> # env OPENSSL_armcap=0 openssl speed -evp sha256
>>> . . .
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> sha256           22434.19k    59895.91k   117258.16k   156264.31k   172624.81k   173848.52k
>>> 
>>> (I'll not list all the similar performing ones but
>>> will list all illegal-instruction producing ones.)
>>> 
>>> # env OPENSSL_armcap=4 openssl speed -evp sha256
>>> Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
>>> Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s
>>> Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s
>>> Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s
>>> Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s
>>> Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s
>>> Illegal instruction (core dumped)
>>> 
>>> # env OPENSSL_armcap=16 openssl speed -evp sha256
>>> Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
>>> 
>>> (16 worked for aes-256-gcm but 32 did not.)
>>> 
>>> So: no significantly slower examples of single enabled
>>> bit cases.
>>> 
>>> No (non-illegal-instruction) 2-enabled-bits examples were
>>> dissimilar for the speed.
>> 
>> Incorrect description of what I tested: I tested only
>> 2-bit combinations involving HWCAP_FP being enabled.
>> (Same as for aes-256-gcm .)
>> 
>>> For reference (avoiding illegal-instructions):
>>> 
>>> # env OPENSSL_armcap=235 openssl speed -evp sha256
>>> . . .
>>> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> sha256           23185.66k    62689.73k   125814.72k   167981.88k   187833.65k   188968.95k
>>> 
>>> So: also similar speed.
>>> 
>>> Need any other specific bit combinations?
>> 
> 
> 
> chroot'd into an armv7 context on the RPi4B gets different results
> for aes-256-gcm: having HWCAP_FP enabled speeds things up.
> 
> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      35983.70k    41987.64k    44077.00k    44693.54k    44685.68k    44717.40k
> 
> # env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
> . . .
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      55339.93k    64644.18k    68001.37k    72708.53k    74237.56k    74247.87k
> 
> # env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
> 
> # env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
> 
> In general OPENSSL_armcap=2**N was slower and OPENSSL_armcap=(2**N)+1
> was faster in a similar manner. Similarly for 218 vs. 219.
> 
> sha256 did not show such a distinction.
> 
> The armv7 illegal-instruction generation cases for
> sha256 were:
> 
> # env OPENSSL_armcap=4 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: 3313106 sha256's in 3.02s
> Doing sha256 for 3s on 64 size blocks: 2403376 sha256's in 3.02s
> Doing sha256 for 3s on 256 size blocks: 1289917 sha256's in 3.02s
> Doing sha256 for 3s on 1024 size blocks: 446543 sha256's in 3.00s
> Doing sha256 for 3s on 8192 size blocks: 64123 sha256's in 3.03s
> Doing sha256 for 3s on 16384 size blocks: 32756 sha256's in 3.08s
> Illegal instruction (core dumped)
> 
> # env OPENSSL_armcap=16 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
> 
> 
> 
> Note: I focused on large scale differences in general. I was not trying
> to find the optimal combination. For that I'd also have to test out
> repeatability/variability for each OPENSSL_armcap value that was in
> the faster range.



FYI: on the OPi+2E (Cortex-A7), more OPENSSL_armcap values generate
illegal instructions than in the armv7 chroot on the RPi4B:

# env OPENSSL_armcap=2 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=2 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 579668 sha256's in 3.01s
Doing sha256 for 3s on 64 size blocks: 436508 sha256's in 3.00s
Doing sha256 for 3s on 256 size blocks: 240826 sha256's in 3.03s
Doing sha256 for 3s on 1024 size blocks: 85768 sha256's in 3.04s
Doing sha256 for 3s on 8192 size blocks: 12248 sha256's in 3.04s
Doing sha256 for 3s on 16384 size blocks: 6096 sha256's in 3.00s
Illegal instruction (core dumped)

# env OPENSSL_armcap=4 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 582757 sha256's in 3.00s
Doing sha256 for 3s on 64 size blocks: 443027 sha256's in 3.04s
Doing sha256 for 3s on 256 size blocks: 241189 sha256's in 3.04s
Doing sha256 for 3s on 1024 size blocks: 85722 sha256's in 3.04s
Doing sha256 for 3s on 8192 size blocks: 12074 sha256's in 3.00s
Doing sha256 for 3s on 16384 size blocks: 6097 sha256's in 3.00s
Illegal instruction (core dumped)

# env OPENSSL_armcap=16 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
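
The runs that die partway through still print per-block-size counts, and those can be turned into the same "k" figures that the summary tables show. What follows is only a sketch: it assumes the summary figure is count * block size / elapsed seconds, in 1000s of bytes per second, and the names DOING_RE and throughput_k are made up for illustration.

```python
import re

# Matches progress lines such as:
#   Doing sha256 for 3s on 16 size blocks: 579668 sha256's in 3.01s
# Lines that end in "Illegal instruction (core dumped)" do not match.
DOING_RE = re.compile(
    r"Doing (?P<alg>\S+) for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) \S+ in (?P<secs>[\d.]+)s"
)


def throughput_k(line):
    """Return (algorithm, block size, 1000s of bytes per second) or None.

    Assumption: throughput = count * size / seconds / 1000, which is how
    the "k" columns of the summary table are taken to be derived here.
    """
    match = DOING_RE.search(line)
    if match is None:
        return None
    size = int(match.group("size"))
    count = int(match.group("count"))
    secs = float(match.group("secs"))
    return match.group("alg"), size, count * size / secs / 1000.0


if __name__ == "__main__":
    sample = [
        "Doing sha256 for 3s on 16 size blocks: 579668 sha256's in 3.01s",
        "Doing sha256 for 3s on 16384 size blocks: 6096 sha256's in 3.00s",
        "Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)",
    ]
    for line in sample:
        result = throughput_k(line)
        if result is None:
            print("no figure:", line)
        else:
            alg, size, kbytes = result
            print(f"{alg} {size:>6} bytes: {kbytes:.2f}k")
```

This way the partial OPi+2E sha256 runs can be compared with one another even though no summary table was printed for them.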


I should note that my buildworlds and buildkernels are set
up to use -mcpu=cortex-a72, -mcpu=cortex-a53, or -mcpu=cortex-a7,
as appropriate for the target hardware. The armv7 chroot's
builds used -mcpu=cortex-a7 as well. (The installs are of the
same system build as for the OPi+2E.)
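
The combined OPENSSL_armcap values used earlier (218, 219, 235) are easier to compare when split into single bits. A minimal sketch follows, using the bit labels from the HWCAP_* list quoted above. The labels are an assumption: they follow this message's naming, and OpenSSL keeps its own capability-bit definitions, so this is not a confirmed mapping. The function names are hypothetical.

```python
# Bit labels copied from the HWCAP_* list quoted in this message.
# Assumption: used only as names for the mask bits, not as a confirmed
# statement of how OpenSSL interprets OPENSSL_armcap.
HWCAP_BITS = {
    0x01: "HWCAP_FP",
    0x02: "HWCAP_ASIMD",
    0x04: "HWCAP_EVTSTRM",
    0x08: "HWCAP_AES",
    0x10: "HWCAP_PMULL",
    0x20: "HWCAP_SHA1",
    0x40: "HWCAP_SHA2",
    0x80: "HWCAP_CRC32",
}


def set_bits(value):
    """Single-bit values that are enabled in the mask, lowest first."""
    return [bit for bit in sorted(HWCAP_BITS) if value & bit]


def cleared_bits(value):
    """Single-bit values (from the 8 listed) that the mask leaves off."""
    return [bit for bit in sorted(HWCAP_BITS) if not value & bit]


def describe(value):
    """Human-readable breakdown of one OPENSSL_armcap value."""
    names = ",".join(HWCAP_BITS[bit] for bit in set_bits(value))
    return f"{value} = 0x{value:02x}: on={set_bits(value)} off={cleared_bits(value)} ({names})"


if __name__ == "__main__":
    for value in (218, 219, 235):
        print(describe(value))
```

Read this way, 219 leaves off exactly 4 and 32 (the two values that hit illegal instructions for aes-256-gcm on the RPi4B), 235 leaves off 4 and 16 (the two for sha256), and 218 is 219 with the lowest bit cleared.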

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)