From: Mark Millard
Date: Tue, 1 Feb 2022 20:10:35 -0800
Subject: Re: Error detection for microSD-based swap, buildworld failures on pi3
To: MJ
Cc: "freebsd-arm@freebsd.org"
In-Reply-To: <9b604f9e-45bf-b197-562f-1f6381ee5515@gmail.com>
List-Id: Porting FreeBSD to ARM processors
List-Archive: https://lists.freebsd.org/archives/freebsd-arm

On 2022-Feb-1, at 18:52, MJ wrote:

> On 2/02/2022 12:25 pm, Mark Millard wrote:
>> On 2022-Feb-1, at 16:47, MJ wrote:
>>> On 2/02/2022 3:18 am, bob prohaska wrote:
>>>> [new subject, different emphasis, old problem]
>>>> On Mon, Jan 31, 2022 at 03:06:01PM -0800, Mark Millard wrote:
>>>>>
>>>>> One thing that could fit the behavior is if small part(s)
>>>>> of the system c++ compiler (or libraries it uses) were
>>>>> corrupted on that specific media. In that case, nothing
>>>>> elsewhere would replicate the failures but a lot might
>>>>> work without using the corrupted part(s), making the
>>>>> failures not random.
>>>> [spaced for emphasis]
>>>>> Checking on that is part of why
>>>>> I'd hoped to get a lldb report for a .sh/.cpp pair
>>>>> leading to failure on your RPi3* in question.
>>>>>
>>>> If/when the stable/13 Pi3 finishes its -j1 single-user
>>>> build/install cycle I'll make a point of trying the
>>>> .sh/.cpp test under lldb.
>>>> For most of their operational history both troublesome Pi3
>>>> systems have had some of their swap on microSD. If there
>>>> is no error detection at all for microSD-based storage
>>>
>>> Is this true? I would have thought it used some form of
>>> error detection in the firmware or in the controller.
>> The type of error and stage at which the error occurs matters.
>> The firmware can not cover all issues that lead to corrupted
>> content on media.
>
> I did not state it covers all corruption. However, I would be
> totally surprised if the controller in ALL SD cards does not do
> error checking, whether ECC or even BCH. That remains my point.

I have a lot more context on Bob's problem and was mostly supplying
context that he left out. (My name is on the most nested text
that he sent. I've been working with him for a while on this.)

The corrupted-data hypothesis is a potential explanation for the
compiler "139" (SEGV) failures on the RPi3* configuration in
question for a few specific files being compiled during buildworld.
But at this point we have no specific evidence of corruption at
any specific place. We have no specific evidence of a media-error
type of corruption either.

The symptoms make it unlikely that the swap partition pages are
involved; block(s) in the compiler's file(s) would be more likely.
But, again, there is no specific evidence identifying any specific
block on the media.

>>>> then undetected corruption of data from swap is a real
>>>> possibility. I expected that storage errors would be
>>>> reported but maybe not, especially outside file systems.
>>>
>>> If indeed your suppositions are correct, would a file for swap
>>> be more prudent as it has to go through the file system
>>> (UFS/VFS) to read/write to swap?
>> No. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048 and
>> its comments #7 and #8.
>
> This seems to address potential memory over-use because of a
> swapfile, not the safety of it over a swap partition.

The problem is system deadlocks and such. A deadlock that forces
cutting and reapplying power is a file system safety issue.

> I still contend the UFS file system has better protection against
> corruption than a raw partition labelled swap.

Not when the likely result is system deadlocks. I speak from
experience with trying file system (vnode) based page files.

> If Bob's requirement is a "safer" swap, then a file would be the
> answer.

My experience with the deadlocks and their consequences indicates
otherwise. That is why I added comments #7 and #8 to that bugzilla
submittal.

> Whether there are other issues to contend with are likely out of
> context in this particular discussion.

The deadlock consequences I suffered would most definitely matter:
starting over from scratch because the system could not readily
be recovered after the deadlock. The deadlocks defeat the
mechanisms that guarantee file system update coherency. (UFS for
me back then and for Bob now.) buildworld and the like tend to
have a lot of pending I/O around, and a deadlock during this has
not gone well in my experience.

>>>> Mechanical disks have some internal error detection and
>>>> report explicitly when data can't be retrieved. As I think
>>>> back on it at least one flash device (a USB thumb drive)
>>>> failed silently, no reported errors but also no-write.
>>>> That was on a filesystem, so the OS noticed and so did I.
>>>
>>> But this could "simply" be because one of the NAND blocks has
>>> failed, not that it could not detect an error. Is there a lack
>>> of error detection in the driver handling USB thumb drives and
>>> reported back to the kernel? I do not know.
>> Bob's context is reproducible at the same places in
>
> No, he was talking about a "failed silently" event and this is
> what I was replying to.

I've been working with Bob for some time on the issue and have
a lot of context on how to interpret things that are not clear
from what he extracted from our off-list exchanges and sent to
the list. At this point it is hard for me to write notes without
using that context. For anyone who has only what Bob sent to the
list, this likely makes for an odd read with unintended
implications.

We have no specific evidence of a media error at any specific
block/page. We have had no evidence of random variations in the
behavior beyond the expected sorts of things from the likes of
a -j4 buildworld: whichever failing file was processed first is
the one that got the report of the compiler problem.

> I am not up-to-date with the previous discussion on the failure
> of llvm/clang.

I am. Some of it has been off-list. The corrupted-data hypothesis
is a potential explanation for the compiler "139" (SEGV) failures
on the RPi3* configuration in question.

>> Such is unlikely for hitting the same problem page(s)
>> in the swap space each way things are run.
>
> I couldn't agree more. The chances would seem remote, unless that
> partition is on a part of the SD card/USB drive that is failing
> and the USB driver is not detecting these as reported by the
> controller.

===
Mark Millard
marklmi at yahoo.com
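
For illustration only, and not something that was run in this thread: one possible way to look for the specific evidence mentioned above would be to compare a suspect file on the microSD card, block by block, against a copy of the same build taken from known-good media. The sketch below assumes such a reference copy exists. The function name `differing_blocks` and the 4096-byte block size are assumptions made for the example and are not tied to the card's real geometry.

```python
"""Report which fixed-size blocks differ between two copies of a file."""

# Assumption: an illustrative block size, not the card's erase-block
# or file system block size.
BLOCK_SIZE = 4096


def differing_blocks(suspect_path, reference_path, block_size=BLOCK_SIZE):
    """Return the indices of the blocks whose bytes differ.

    A length mismatch shows up as differing blocks at the tail,
    because the shorter file yields empty reads there.
    """
    bad = []
    with open(suspect_path, "rb") as suspect, \
            open(reference_path, "rb") as reference:
        index = 0
        while True:
            a = suspect.read(block_size)
            b = reference.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

A call such as `differing_blocks("/usr/bin/c++", "/mnt/good/usr/bin/c++")` (both paths hypothetical) would return an empty list when the two copies match. A non-empty list would at least narrow down which part of the file to examine, though it would not by itself say whether the media, the controller, or something else caused the difference.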