Re: git: 32a2fed6e71f - stable/13 - openssl: Fix detection of ARMv7 and ARM64 CPU features
Date: Thu, 25 Nov 2021 00:13:11 UTC
On 2021-Nov-24, at 15:25, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-24, at 13:23, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On 2021-Nov-24, at 13:19, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On 2021-Nov-24, at 01:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>>> [Actually, the main [so: 14] equivalent.]
>>>>
>>>> All Cortex-A72 based . . .
>>>>
>>>> First, older system versions (before that update)
>>>> then after the update:
>>>>
>>>>
>>>> RPi4B 8 GiByte (older FreeBSD first, otherwise new),
>>>> Cortex-A72's:
>>>>
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      51925.92k    58449.46k    60430.32k    61050.13k    61180.98k    61482.75k
>>>>
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      28880.07k    30837.33k    31630.29k    31855.62k    31921.54k    32034.53k
>>>>
>>>> So: slowed down, unlike the other examples below.
>>>>
>>>> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      51894.33k    58540.45k    60815.22k    61534.47k    61906.84k    62042.10k
>>>>
>>>> So: back to the prior speed.
>>>>
>>>> But all these are based on config.txt containing:
>>>>
>>>> over_voltage=6
>>>> arm_freq=2000
>>>> sdram_freq_min=3200
>>>> force_turbo=1
>>>>
>>>> (The RPi4B has a heat-sink and a fan.)
>>>>
>>>> Note: See later about the RPi4B CPU features.
>>>>
>>>>
>>>> MACCHIATObin Double Shot (older first), Cortex-A72's:
>>>>
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      50808.49k    58466.08k    60769.11k    61444.92k    61767.94k    61707.61k
>>>>
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm     163579.14k   456319.27k   786544.01k   940234.41k  1003230.55k  1005671.31k
>>>>
>>>>
>>>> HoneyComb (older first), Cortex-A72's:
>>>>
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      57659.60k    64599.05k    67719.81k    68373.74k    68724.24k    68793.80k
>>>>
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm     177925.57k   502311.65k   866287.95k  1036500.35k  1106598.06k  1106721.91k
>>>>
>>>> Rock64 (older first), Cortex-A53's:
>>>>
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      18378.23k    23401.45k    24834.99k    25206.10k    25337.86k    25258.19k
>>>>
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      52711.29k   163586.49k   318738.69k   420277.93k   461373.44k   463192.06k
>>>>
>>>>
>>>> OPi+2E (older first), Cortex-A7's (so armv7):
>>>>
>>>> # openssl speed -evp aes-256-gcm
>>>> . . .
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm       9343.10k    11156.39k    11827.64k    11995.30k    12025.86k    12031.32k
>>>>
>>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>>> aes-256-gcm      11013.41k    13598.44k    14034.26k    15045.97k    15262.90k    15302.66k
>>>>
>>>>
>>>>
>>>> For reference:
>>>>
>>>> For the RPi4B examples (2 notes added):
>>>>
>>>> CPU 0: ARM Cortex-A72 r0p3 affinity: 0
>>>> Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32>
>>>> *** NOTE the lack of ",SHA2,SHA1,AES+PMULL" above ***
>>>> Instruction Set Attributes 1 = <>
>>>> Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>> Processor Features 1 = <>
>>>> Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>> Memory Model Features 1 = <8bit VMID>
>>>> Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>> Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>> Debug Features 1 = <>
>>>> Auxiliary Features 0 = <>
>>>> Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SEVL>
>>>> *** NOTE the lack of ",SHA2,SHA1,AES+VMULL" above ***
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>>
>>>> For the MACCHIATObin Double Shot examples:
>>>>
>>>> CPU 0: ARM Cortex-A72 r0p1 affinity: 0 0
>>>> Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>> Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>> Processor Features 1 = <>
>>>> Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>> Memory Model Features 1 = <8bit VMID>
>>>> Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>> Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>> Debug Features 1 = <>
>>>> Auxiliary Features 0 = <>
>>>> Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>>
>>>>
>>>> For the HoneyComb examples:
>>>>
>>>> CPU 0: ARM Cortex-A72 r0p3 affinity: 0 0
>>>> Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>> Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>> Processor Features 1 = <>
>>>> Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
>>>> Memory Model Features 1 = <8bit VMID>
>>>> Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>> Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>> Debug Features 1 = <>
>>>> Auxiliary Features 0 = <>
>>>> Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>>
>>>>
>>>>
>>>>
>>>> For the Rock64 examples:
>>>>
>>>> CPU 0: ARM Cortex-A53 r0p4 affinity: 0
>>>> Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
>>>> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
>>>> Instruction Set Attributes 1 = <>
>>>> Processor Features 0 = <AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
>>>> Processor Features 1 = <>
>>>> Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
>>>> Memory Model Features 1 = <8bit VMID>
>>>> Memory Model Features 2 = <32bit CCIDX,48bit VA>
>>>> Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
>>>> Debug Features 1 = <>
>>>> Auxiliary Features 0 = <>
>>>> Auxiliary Features 1 = <>
>>>> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
>>>> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
>>>> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
>>>>
>>>>
>>>> For the OPi+2E examples:
>>>>
>>>> CPU: ARM Cortex-A7 r0p5 (ECO: 0x00000000)
>>>> CPU Features:
>>>>   Multiprocessing, Thumb2, Security, Virtualization, Generic Timer, VMSAv7,
>>>>   PXN, LPAE, Coherent Walk
>>>> Optional instructions:
>>>>   SDIV/UDIV, UMULL, SMULL, SIMD(ext)
>>>> LoUU:2 LoC:3 LoUIS:2
>>>> Cache level 1:
>>>>   32KB/64B 4-way data cache WB Read-Alloc Write-Alloc
>>>>   32KB/32B 2-way instruction cache Read-Alloc
>>>> Cache level 2:
>>>>   512KB/64B 8-way unified cache WB Read-Alloc Write-Alloc
>>>
>>> Note: as the issue applies to stable/13 and main [so: 14]
>>> (for example), I continue to use the freebsd-arm list
>>> instead of a list that reports commits to stable/* but
>>> not to main.
>>>
>>> Relative to:
>>>
>>> #define HWCAP_FP        0x00000001
>>> #define HWCAP_ASIMD     0x00000002
>>> #define HWCAP_EVTSTRM   0x00000004
>>> #define HWCAP_AES       0x00000008
>>> #define HWCAP_PMULL     0x00000010
>>> #define HWCAP_SHA1      0x00000020
>>> #define HWCAP_SHA2      0x00000040
>>> #define HWCAP_CRC32     0x00000080
>>>
>>> The single-bit enabled OPENSSL_armcap that gets the slow
>>> result is:
>>>
>>> # env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
>>> . . .
>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      28427.04k    30712.32k    31446.00k    31683.40k    31829.10k    31839.55k
>>>
>>> The illegal instruction ones for aes-256-gcm were:
>>>
>>> # env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
>>> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>>>
>>> # env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
>>> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>>>
>>> (sha256 does not match for what is illegal.)
>>>
>>> Ignoring the illegal-instruction producing bits, HWCAP_FP mixed
>>> with any one of the other bits was also similarly slow.
>>>
>>> As for all the non-illegal-instruction producing bits: also similarly
>>> slow:
>>>
>>> # env OPENSSL_armcap=219 openssl speed -evp aes-256-gcm
>>> . . .
>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      28922.63k    30711.51k    31522.15k    31722.15k    31788.97k    31845.03k
>>>
>>> Disabling just HWCAP_FP from that got the fast category of
>>> result:
>>>
>>> # env OPENSSL_armcap=218 openssl speed -evp aes-256-gcm
>>> . . .
>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> aes-256-gcm      49543.14k    58068.22k    60236.56k    60724.37k    61216.09k    61212.99k
>>>
>>>
>>> As for sha256 . . .
>>>
>>> # env OPENSSL_armcap=0 openssl speed -evp sha256
>>> . . .
>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> sha256           22434.19k    59895.91k   117258.16k   156264.31k   172624.81k   173848.52k
>>>
>>> (I'll not list all the similar performing ones but
>>> will list all illegal-instruction producing ones.)
>>>
>>> # env OPENSSL_armcap=4 openssl speed -evp sha256
>>> Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
>>> Doing sha256 for 3s on 64 size blocks: 2752520 sha256's in 3.02s
>>> Doing sha256 for 3s on 256 size blocks: 1372584 sha256's in 3.03s
>>> Doing sha256 for 3s on 1024 size blocks: 470215 sha256's in 3.11s
>>> Doing sha256 for 3s on 8192 size blocks: 64700 sha256's in 3.07s
>>> Doing sha256 for 3s on 16384 size blocks: 31847 sha256's in 3.00s
>>> Illegal instruction (core dumped)
>>>
>>> # env OPENSSL_armcap=16 openssl speed -evp sha256
>>> Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
>>>
>>> (16 worked for aes-256-gcm but 32 did not.)
>>>
>>> So: no significantly slower examples of single enabled
>>> bit cases.
>>>
>>> No (non-illegal-instruction) 2-enabled-bits examples were
>>> dissimilar for the speed.
>>
>> Incorrect description of what I tested: I tested only
>> 2-bit combinations involving HWCAP_FP being enabled.
>> (Same as for aes-256-gcm.)
>>
>>> For reference (avoiding illegal-instructions):
>>>
>>> # env OPENSSL_armcap=235 openssl speed -evp sha256
>>> . . .
>>> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>>> sha256           23185.66k    62689.73k   125814.72k   167981.88k   187833.65k   188968.95k
>>>
>>> So: also similar speed.
>>>
>>> Need any other specific bit combinations?
>>
>
>
> chroot'd into an armv7 context on the RPi4B gets different results
> for aes-256-gcm: having the HWCAP_FP enabled sped things up.
>
> # env OPENSSL_armcap=0 openssl speed -evp aes-256-gcm
> . . .
> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      35983.70k    41987.64k    44077.00k    44693.54k    44685.68k    44717.40k
>
> # env OPENSSL_armcap=1 openssl speed -evp aes-256-gcm
> . . .
> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> aes-256-gcm      55339.93k    64644.18k    68001.37k    72708.53k    74237.56k    74247.87k
>
> # env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>
> # env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
> Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)
>
> In general OPENSSL_armcap=2**N was slower and OPENSSL_armcap=(2**N)+1
> was faster in a similar manner. Similarly for 218 vs. 219.
>
> sha256 did not show such a distinction.
>
> The armv7 illegal-instruction generation cases for
> sha256 were:
>
> # env OPENSSL_armcap=4 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: 3313106 sha256's in 3.02s
> Doing sha256 for 3s on 64 size blocks: 2403376 sha256's in 3.02s
> Doing sha256 for 3s on 256 size blocks: 1289917 sha256's in 3.02s
> Doing sha256 for 3s on 1024 size blocks: 446543 sha256's in 3.00s
> Doing sha256 for 3s on 8192 size blocks: 64123 sha256's in 3.03s
> Doing sha256 for 3s on 16384 size blocks: 32756 sha256's in 3.08s
> Illegal instruction (core dumped)
>
> # env OPENSSL_armcap=16 openssl speed -evp sha256
> Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)
>
>
>
> Note: I focused on large scale differences in general. I was not trying
> to find the optimal combination. For that I'd also have to test out
> repeatability/variability for each OPENSSL_armcap value that was in
> the faster range.
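
For reference, a minimal sketch (not part of the runs above) of how the
OPENSSL_armcap values tested can be decoded into the HWCAP_* names listed
in the quoted text. The decode_armcap name is hypothetical. The sketch
assumes, as the quoted message does, that OPENSSL_armcap follows that
HWCAP_* bit layout; OpenSSL keeps its own ARMV7_*/ARMV8_* capability bits
in arm_arch.h, so these labels may not match what each bit really enables.

```shell
#!/bin/sh
# decode_armcap MASK
# Hypothetical helper: prints the HWCAP_* names (prefix dropped) whose bits
# are set in MASK, using the bit values from the #define list quoted above.
# ASSUMPTION: OPENSSL_armcap uses this layout; that is not verified here.
decode_armcap() {
    mask=$1
    names=""
    for pair in 1:FP 2:ASIMD 4:EVTSTRM 8:AES 16:PMULL 32:SHA1 64:SHA2 128:CRC32; do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( mask & bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    printf '%s\n' "${names:-none}"
}

# The multi-bit values used in the tests above.
for value in 218 219 235; do
    printf '%s: %s\n' "$value" "$(decode_armcap "$value")"
done
```

Under that assumption, 219 decodes to FP,ASIMD,AES,PMULL,SHA2,CRC32, 218 is
the same set without FP, and 235 is FP,ASIMD,AES,SHA1,SHA2,CRC32.
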
FYI: on the OPi+2E (Cortex-A7), more OPENSSL_armcap values generate
illegal instructions than a chroot to armv7 does on the RPi4B:

# env OPENSSL_armcap=2 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=4 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=32 openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: Illegal instruction (core dumped)

# env OPENSSL_armcap=2 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 579668 sha256's in 3.01s
Doing sha256 for 3s on 64 size blocks: 436508 sha256's in 3.00s
Doing sha256 for 3s on 256 size blocks: 240826 sha256's in 3.03s
Doing sha256 for 3s on 1024 size blocks: 85768 sha256's in 3.04s
Doing sha256 for 3s on 8192 size blocks: 12248 sha256's in 3.04s
Doing sha256 for 3s on 16384 size blocks: 6096 sha256's in 3.00s
Illegal instruction (core dumped)

# env OPENSSL_armcap=4 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: 582757 sha256's in 3.00s
Doing sha256 for 3s on 64 size blocks: 443027 sha256's in 3.04s
Doing sha256 for 3s on 256 size blocks: 241189 sha256's in 3.04s
Doing sha256 for 3s on 1024 size blocks: 85722 sha256's in 3.04s
Doing sha256 for 3s on 8192 size blocks: 12074 sha256's in 3.00s
Doing sha256 for 3s on 16384 size blocks: 6097 sha256's in 3.00s
Illegal instruction (core dumped)

# env OPENSSL_armcap=16 openssl speed -evp sha256
Doing sha256 for 3s on 16 size blocks: Illegal instruction (core dumped)

I should note that my buildworld's and buildkernel's are set up
to involve -mcpu=cortex-a72 or -mcpu=cortex-a53 or -mcpu=cortex-a7
as appropriate to matching the target hardware. The armv7 chroot's
builds used -mcpu=cortex-a7 as well. (The installs are of the same
system build as for the OPi+2E.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
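
P.S. The runs that die with an illegal instruction never reach the summary
table, but the per-size "Doing ..." lines that did complete carry enough
to estimate it. A rough sketch, assuming the summary figures are
count * block size / elapsed seconds, reported in 1000s of bytes per
second; speed_lines_to_kbps is a hypothetical helper name, not part of
openssl:

```shell
#!/bin/sh
# speed_lines_to_kbps: read "openssl speed" progress lines on stdin and
# print "SIZE THROUGHPUTk" for each completed block size.
# Expected line shape (as in the transcripts above):
#   Doing ALG for 3s on SIZE size blocks: COUNT ALG's in SECSs
# Lines that ended in "Illegal instruction" have no COUNT and are skipped.
speed_lines_to_kbps() {
    awk '$1 == "Doing" && $7 == "size" && $9 ~ /^[0-9]+$/ {
        size = $6
        count = $9
        secs = $12 + 0      # numeric prefix of a field such as 2.99s
        if (secs > 0)
            printf "%s %.2fk\n", size, count * size / secs / 1000
    }'
}

# Example: first line of the aarch64 OPENSSL_armcap=4 sha256 run on the RPi4B.
speed_lines_to_kbps <<'EOF'
Doing sha256 for 3s on 16 size blocks: 4082055 sha256's in 2.99s
EOF
```

For that line this prints "16 21843.77k", in the same range as the
22434.19k reported for OPENSSL_armcap=0 at 16 bytes.
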