Re: FYI for aarch64 main [14] running a mid March version: I ended up with [usb{usbus2}] stuck at (near) 100% cpu
Date: Fri, 14 May 2021 21:51:27 UTC
On 2021-May-14, at 13:48, Rodney W. Grimes <freebsd-rwg at gndrsh.dnsmgr.net> wrote: >> >> On 2021-May-14, at 05:52, Rodney W. Grimes <freebsd-rwg at gndrsh.dnsmgr.net> wrote: >> >>>> Note: The context was using a non-debug main build >>>> from mid-2021-Mar. (More details identified >>>> later.) >>>> >>>> The issue happend while attempting a: >>>> >>>> # zfs send -R zpold@for-copy | zfs recv -Fdv zpnew >>>> >>>> where the drives involved in the command were: >>>> >>>> zpold: a USB3 SSD, using /dev/da0p3 >>>> zpnew: an 480 GiByte Optane in the PCIe slot, using /dev/nda0p3 >>>> >>>> with: >>>> >>>> # gpart show -pl >>>> => 40 468862048 da0 GPT (224G) >>>> 40 532480 da0p1 4C8GCA72EFI (260M) >>>> 532520 2008 - free - (1.0M) >>>> 534528 29360128 da0p2 4C8GCA72swp14 (14G) >>>> 29894656 4194304 - free - (2.0G) >>>> 34088960 33554432 da0p4 4C8GCA72swp16 (16G) >>>> 67643392 401217536 da0p3 4C8GCA72zfs (191G) >>>> 468860928 1160 - free - (580K) >>>> >>>> => 40 2000409184 ada0 GPT (954G) >>>> 40 409600 ada0p1 (null) (200M) >>>> 409640 1740636160 ada0p2 FBSDmacchroot (830G) >>>> 1741045800 58720256 ada0p3 FBSDmacchswp0 (28G) >>>> 1799766056 176160768 ada0p4 FBSDmacchswp1 (84G) >>>> 1975926824 24482400 - free - (12G) >>>> >>>> => 40 937703008 nda0 GPT (447G) >>>> 40 532480 nda0p1 CA72opt0EFI (260M) >>>> 532520 2008 - free - (1.0M) >>>> 534528 117440512 nda0p2 CA72opt0swp56 (56G) >>>> 117975040 16777216 - free - (8.0G) >>>> 134752256 134217728 nda0p4 CA72opt0swp64 (64G) >>>> 268969984 668731392 nda0p3 CA72opt0zfs (319G) >>>> 937701376 1672 - free - (836K) >>>> >>>> The system running was that on /dev/ada0p2 (FBSDmacchroot, >>>> which is UFS instead of ZFS). >>>> >>>> The [usb{usbus2}] process eventually got stuck-busy, no >>>> more I/O: >>>> >>>> CPU 0: 0.0% user, 0.0% nice, 100% system, 0.0% interrupt, 0.0% idle >>>> CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle >>>> CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle >>>> CPU 3: 0.4% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.6% idle >>>> >>>> PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND >>>> 15 root -72 - 0B 262144B CPU0 0 8:51 99.95% [usb{usbus2}] >>>> >>>> 1295 root -8 0 20108Ki 8092Ki q->bq_ 2 0:04 0.00% zfs recv -Fdv zpnew{receive_writer_thre} >>>> 1295 root 48 0 20108Ki 8092Ki piperd 2 0:22 0.00% zfs recv -Fdv zpnew{zfs} >>>> 1294 root -8 0 17544Ki 7740Ki q->bq_ 2 0:01 0.00% zfs send -R zpold@for-copy{send_reader_thread} >>>> 1294 root -8 0 17544Ki 7740Ki q->bq_ 0 0:00 0.00% zfs send -R zpold@for-copy{send_merge_thread} >>>> 1294 root -8 0 17544Ki 7740Ki hdr->b 2 0:00 0.00% zfs send -R zpold@for-copy{send_traverse_threa} >>>> 1294 root 52 0 17544Ki 7740Ki range- 3 0:20 0.00% zfs send -R zpold@for-copy{zfs} >>>> >>>> 1036 root -8 - 0B 1488Ki t->zth 0 0:00 0.00% [zfskern{z_checkpoint_discar}] >>>> 1036 root -8 - 0B 1488Ki t->zth 1 0:00 0.00% [zfskern{z_livelist_condense}] >>>> 1036 root -8 - 0B 1488Ki t->zth 2 0:00 0.00% [zfskern{z_livelist_destroy}] >>>> 1036 root -8 - 0B 1488Ki t->zth 1 0:00 0.00% [zfskern{z_indirect_condense}] >>>> 1036 root -8 - 0B 1488Ki mmp->m 3 0:00 0.00% [zfskern{mmp_thread_enter}] >>>> 1036 root -8 - 0B 1488Ki tx->tx 1 0:00 0.00% [zfskern{txg_thread_enter}] >>>> 1036 root -8 - 0B 1488Ki tx->tx 2 0:00 0.00% [zfskern{txg_thread_enter}] >>>> >>>> I was unable to ^c or ^z the process where I >>>> typed the command. I eventually stopped the >>>> system with "shutdown -p now" from a ssh >>>> session (that had already been in place). >>> >>> Should this occur again before doing the shutdown run a >>> zpool status & >>> I have gotten in this state when the recv pool was a usb device >> >> The USB device had the send pool in my example. >> >>> and for some reason it had a timeout and gone offline. >> >> No messages about timeouts or other such were made. >> >>> The clue >>> this occured are in dmesg, and zpool status. >> >> No console, dmesg -a, or /var/log/messages output were >> generated. (And the system was running from a SATA SSD >> that was operating well.) >> >> For reference, the USB EtherNet device that was in a >> USB2 port continued to operate just fine, allowing the >> use of existing ssh sessions that were displaying >> gstat -spod and top -Samio -ototal until I started >> looking at the problem. (I did not try making a new >> ssh session). >> >> I did not do the "zpool status" so I can not report >> about it. >> >>> Unplug/plug the USB device, check dmesg that it came online, >>> and do a zpool clear. >> >> Okay. I normally avoid unplugging USB storage media >> if the system overall does not hang up: hopes of >> a clean shutdown leaving things better. > > Do the zpool status, and only if that indicates a device offline > or other problem would you proceed to do the unplug/plug, as at > that point zfs has stopped doing anything to the device and > your shutdown wont do anything as far as zfs for that pool anyway. > >> >> The system did appear to shutdown to completion. > > Yes, you can shutdown a system with a zpool in failed state. I still have the console output available and I was wrong: the shutdown hung up and apparently I cut power before "All buffers synced". All it got to was: Stopping cron. Waiting for PIDS: 858. Stopping sshd. Waiting for PIDS: 852. fstab: /etc/fstab:6: Inappropriate file type or format Stopping ntpd. Waiting for PIDS: 807. Stopping nfsd. Waiting for PIDS: 781 782. Stopping mountd. Waiting for PIDS: 779. Stopping rpcbind. Waiting for PIDS: 741. Stopping devd. Waiting for PIDS: 431. Writing entropy file: . Writing early boot entropy file: . . Terminated May 13 20:16:26 FBSDmacch syslogd: exiting on signal 15 It did not get to any of the usual sort of (from a different shutdown): Waiting (max 60 seconds) for system process `vnlru' to stop... done Waiting (max 60 seconds) for system process `syncer' to stop... Syncing disks, vnodes remaining... 0 0 0 0 0 0 0 0 0 0 done Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-4' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-5' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done Waiting (max 60 seconds) for system thread `bufspacedaemon-6' to stop... done All buffers synced. Uptime: 8m12s >> >>>> >>>> When I retried after rebooting and scrubbing (no >>>> problems found), the problem did not repeat. >>>> >>>> I do not have more information nor a way to repeat >>>> the problem on demand, unfortunately. >>>> >>>> Details of the vintage of the system software and >>>> such: >>>> >>>> # ~/fbsd-based-on-what-freebsd-main.sh >>>> FreeBSD FBSDmacch 14.0-CURRENT FreeBSD 14.0-CURRENT mm-src-n245445-def0058cc690 GENERIC-NODBG arm64 aarch64 1400005 1400005 >>>> def0058cc690 (HEAD -> mm-src) mm-src snapshot for mm's patched build in git context. >>>> merge-base: 7381bbee29df959e88ec59866cf2878263e7f3b2 >>>> merge-base: CommitDate: 2021-03-12 20:29:42 +0000 >>>> 7381bbee29df (freebsd/main, freebsd/HEAD, pure-src, main) cam: Run all XPT_ASYNC ccbs in a dedicated thread >>>> n245444 (--first-parent --count for merge-base) >>>> >>>> The system was a MACCHIATObin Double Shot. === Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)