Re: [Intel AlderLake] Read&Write files to FAT32 or UFS partition cause data corrupt due to P-Core&E-Core

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 08 Aug 2023 14:02:32 UTC
On Tue, Aug 08, 2023 at 10:46:12PM +0900, Tomoaki AOKI wrote:
> On Tue, 8 Aug 2023 15:38:46 +0300
> Konstantin Belousov <kostikbel@gmail.com> wrote:
> 
> > On Tue, Aug 08, 2023 at 06:37:35AM +0900, Tomoaki AOKI wrote:
> > > On Sun, 6 Aug 2023 12:55:07 +0300
> > > Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > 
> > > > On Sun, Aug 06, 2023 at 06:12:38PM +0900, Tomoaki AOKI wrote:
> > > > > On Wed, 23 Feb 2022 01:30:28 +0200
> > > > > Konstantin Belousov <kostikbel@gmail.com> wrote:
> > > > > 
> > > > > > On Tue, Feb 22, 2022 at 06:23:17PM -0500, Alexander Motin wrote:
> > > > > > > On 22.02.2022 17:46, Konstantin Belousov wrote:
> > > > > > > > Ok, the next step is to get the CPU feature reports from P- vs. E- cores.
> > > > > > > > Patch below should work, with verbose boot.
> > > > > > > 
> > > > > > > Not much difference on that level:
> > > > > > > 
> > > > > > > --- zzzp        2022-02-22 18:18:24.531704000 -0500
> > > > > > > +++ zzze        2022-02-22 18:18:18.631236000 -0500
> > > > > > > @@ -1,22 +1,21 @@
> > > > > > > -CPU 2: 12th Gen Intel(R) Core(TM) i7-12700K (3609.60-MHz K8-class CPU)
> > > > > > > +CPU 16: 12th Gen Intel(R) Core(TM) i7-12700K (3609.60-MHz K8-class CPU)
> > > > > > >    Origin="GenuineIntel"  Id=0x90672  Family=0x6  Model=0x97  Stepping=2
> > > > > > > Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> > > > > > > Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
> > > > > > >    AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
> > > > > > >    AMD Features2=0x121<LAHF,ABM,Prefetch>
> > > > > > >    Structured Extended Features=0x239ca7eb<FSGSBASE,TSCADJ,BMI1,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,NFPUSG,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,SHA>
> > > > > > >    Structured Extended Features2=0x98c027ac<UMIP,PKU,WAITPKG,GFNI,VAES,VPCLMULQDQ,TME,RDPID,MOVDIRI,MOVDIR64B>
> > > > > > >    Structured Extended Features3=0xfc1cc410<FSRM,MD_CLEAR,PCONFIG,IBT,IBPB,STIBP,L1DFL,ARCH_CAP,CORE_CAP,SSBD>
> > > > > > >    XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
> > > > > > >    IA32_ARCH_CAPS=0xd6b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TAA_NO>
> > > > > > >    VT-x: Basic Features=0x3da0500<SMM,INS/OUTS,TRUE>
> > > > > > >          Pin-Based Controls=0xff<ExtINT,NMI,VNMI,PreTmr,PostIntr>
> > > > > > >          Primary Processor Controls=0xfffbfffe<INTWIN,TSCOff,HLT,INVLPG,MWAIT,RDPMC,RDTSC,CR3-LD,CR3-ST,CR8-LD,CR8-ST,TPR,NMIWIN,MOV-DR,IO,IOmap,MTF,MSRmap,MONITOR,PAUSE>
> > > > > > >          Secondary Processor Controls=0xf5d7fff<APIC,EPT,DT,RDTSCP,x2APIC,VPID,WBINVD,UG,APIC-reg,VID,PAUSE-loop,RDRAND,INVPCID,VMFUNC,VMCS,XSAVES>
> > > > > > >          Exit Controls=0x3da0500<PAT-LD,EFER-SV,PTMR-SV>
> > > > > > >          Entry Controls=0x3da0500
> > > > > > >          EPT Features=0x6f34141<XO,PW4,UC,WB,2M,1G,INVEPT,AD,single,all>
> > > > > > >          VPID Features=0x10f01<INVVPID,individual,single,all,single-globals>
> > > > > > >    TSC: P-state invariant, performance statistics
> > > > > > > -64-Byte prefetching
> > > > > > > -L2 cache: 1280 kbytes, 8-way associative, 64 bytes/line
> > > > > > > +L2 cache: 2048 kbytes, 16-way associative, 64 bytes/line
> > > > > > > 
> > > > > > 
> > > > > > Show me the full verbose dmesg of the boot then.
> > > > > > 
> > > > > > As another blind guess, try to disable pcid, vm.pmap.pcid_enabled=0.
> > > > > > 
> > > > > 
> > > > > Hi.
> > > > > 
> > > > > An Intel N100 system running 13.2 was reported on the
> > > > > freebsd-users-jp ML to crash without this tunable (the ML is in
> > > > > Japanese, so the report is in Japanese). [1]
> > > > > It crashes with UFS, but ZFS is claimed to be OK.
> > > > > 
> > > > > N100 is an Alder Lake-N processor WITHOUT P-CORE. [2] [3]
> > > > > So could it be that the detection logic in the workaround code
> > > > > (IIRC, all of it was MFC'ed before 13.2) is not working?
> > > > 
> > > > Show me the output from x86info -r on the machine, I do not care which
> > > > specific core it is, they should be all the same.  x86info is available
> > > > as sysutils/x86info.
> > > 
> > > I asked the original reporter for it and got the result below.
> > > HTH.
> > > 
> > > -----------------------
> > > root@eq12:~ # x86info -r
> > > x86info v1.31pre
> > > /dev/cpuctl0: No such file or directory
> > > Found 4 identical CPUs
> > > Extended Family: 0 Extended Model: 11 Family: 6 Model: 190 Stepping: 0
> > > Type: 0 (Original OEM)
> > > CPU Model (x86info's best guess): Unknown model.
> > ...
> > > eax in: 0x0000001a, eax = 20000001 ebx = 00000000 ecx = 00000000 edx = 00000000
> > 
> > The CPU is reported as small core/atom, so the workaround is turned on.
> > I do not think that the issue reported is related to the TLB/PG_G errata.
> > 
> > Why do you think that this is a hw issue at all, and not some software bug
> > in the build, etc.?
> 
> Because the issue looks similar (crashes on UFS but not ZFS, and as far
> as the original reporter tested, vm.pmap.pcid_enabled=0
> in /boot/loader.conf helped).
> 
> Moreover, the N100 is an Alder Lake-N CPU, so it potentially shares the
> same design issue (common circuits, firmware, ...).
> 
> So I suspected that the same problem persists even without P-cores and
> advised the original reporter to add the workaround
> to /boot/loader.conf.
> It seems to have helped so far.
The workaround is switched on automatically when the kernel detects 'small cores'
reported by CPUID.