[Bug 263908] Something spawning many "sh" process, system no longer boots, in single user /var/log empty
- Reply: bugzilla-noreply_a_freebsd.org: "[Bug 263908] Something spawning many "sh" process, system no longer boots, in single user /var/log empty"
- Reply: bugzilla-noreply_a_freebsd.org: "[Bug 263908] Something spawning many "sh" process (possibly zfsd), stalled system (No more processes), would not boot normally afterward"
Date: Wed, 11 May 2022 00:11:02 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263908

            Bug ID: 263908
           Summary: Something spawning many "sh" process, system no longer
                    boots, in single user /var/log empty
           Product: Base System
           Version: 13.1-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: misc
          Assignee: bugs@FreeBSD.org
          Reporter: greg@teamworkweb.com

Not sure how, or even if, I should report this, but I figured I should say something, since the process I am using to install and run 13.1-RC6 is basically the same as what I had going with 13.0. But now with a serious issue! All things being equal, the issue points to a flaw or difference in 13.1-RC6 compared to 13.0.

Did a fresh install of 13.1-RC6 on Sunday (05/08) evening. Ran into an issue with the MFI driver (reported as bug 263906) but was able to work around it with the MRSAS driver (which I intended to use anyway). Installed common packages for benchmarks. Built a zpool using dRAID out of HDDs and a special vdev using a 3x mirror of SSDs. Applied a mix of system tunables that had been working reliably under 13.0 (can provide if requested). Started a test set of back-to-back fio and iozone benchmarks.

Next morning went to check the results. Found I could not run anything; I was getting "No more processes" in my shell. Left it running, and later Monday evening found I was able to run processes again. But there were over 37,000 instances of "sh" running! Mostly in sleep.
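For anyone trying to reproduce or diagnose this, the limits behind the "maxproc limit exceeded" message can be inspected as below. This is only a sketch: the sysctl names are stock FreeBSD, the login.conf grep assumes the default class layout, and the ps/awk pipeline at the end is just one generic way to count runaway "sh" instances per parent PID.

```shell
# System-wide and per-uid process limits referenced by the kernel message;
# see tuning(7):
sysctl kern.maxproc kern.maxprocperuid
# Per-login-class cap, if one is set; see login.conf(5):
grep maxproc /etc/login.conf
# Count running "sh" instances grouped by parent PID, busiest parents first,
# to see which process is spawning them:
ps -ax -o ppid=,comm= | awk '$2 == "sh" { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn | head
```

The PPID grouping is the quickest way to tell whether one looping parent (e.g. cron) is responsible for all 37,000 children or whether they are re-parented orphans.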
I was able to pull /var/log/messages, and found:

May  9 20:11:00 freebsd kernel: maxproc limit exceeded by uid 2 (pid 21916); see tuning(7) and login.conf(5)

Results from top at the time:

last pid: 22684;  load averages: 0.26, 0.18, 0.11   up 0+22:20:59  20:15:46
37976 processes: 1 running, 37975 sleeping
CPU:  0.1% user,  0.0% nice,  6.0% system,  0.0% interrupt, 93.8% idle
Mem: 1112K Active, 19G Inact, 8491M Laundry, 2648M Wired, 40K Buf, 817M Free
ARC: 236M Total, 50M MFU, 108M MRU, 2067K Header, 75M Other
     90M Compressed, 222M Uncompressed, 2.46:1 Ratio
Swap: 8192M Total, 2784M Used, 5408M Free, 33% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
22684 root          1  20    0    72M    46M CPU1     1   0:16  85.79% top
25011 ntpd          1  20    0    21M  1724K select   3   0:02   0.00% ntpd
 8242 root          1  52    0    13M  2004K wait     1   0:01   0.00% sh

Did a reboot, and it has been all downhill from there. The system will no longer boot, at least not to a login prompt. It stalls at several points while loading, after the usb driver load and after starting the network. I can coax it along somewhat with ctrl-c/x/z; the last thing it will do is "Starting devd". The kernel seems to be running, as it will reboot if you hit ctrl-alt-del, or power down if you tap the power button.

I can get into single user mode, but find /var/log is empty. I let it sit for a while at one point, and over time it displayed a few lines saying it was killing off "sh" processes.

Because I had rebooted several times on the first night, right now I suspect some stock ("out of the box") cron job is running and looping, creating all the "sh" processes. But I don't have enough detail yet. Honestly, I am still figuring out how to get the root file system out of read-only mode when booted single user. I want to comment out everything in /etc/crontab and try booting, to see if one of those jobs is the cause. (Again, all "stock"; I didn't create any custom cron jobs yet.)

Because of the issues with the MFI driver, I did pull the LSI 9361 HBA out of the server. I even destroyed the dRAID pool.
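As a possible next step for the single user session described above, a sketch of the usual commands follows. Assumptions are flagged: the ZFS root pool name "zroot" is the installer default and may differ here, and the sed one-liner is demonstrated on a scratch sample file rather than the real /etc/crontab.

```shell
#!/bin/sh
# In single user mode a ZFS root comes up read-only. On the affected machine
# (pool name "zroot" assumed; substitute the real pool) it can be made
# writable with:
#   zfs set readonly=off zroot && zfs mount -a
# For a UFS root the equivalent would be:
#   fsck -p && mount -u / && mount -a
# Extra detail for the next boot attempt can come from two stock knobs:
#   rc_debug="YES"        in /etc/rc.conf      (trace each rc.d script)
#   boot_verbose="YES"    in /boot/loader.conf (verbose kernel messages)
#
# To comment out every active line in the crontab for a test boot, a sed
# one-liner works; shown here on a sample so nothing real is touched:
printf 'SHELL=/bin/sh\n*/5 * * * * root /usr/libexec/atrun\n' > /tmp/crontab.test
sed -i.bak -e 's/^[^#[:space:]]/#&/' /tmp/crontab.test  # prefix non-comment lines with #
cat /tmp/crontab.test
```

The sed pattern leaves existing comments and indented continuation lines alone, and the -i.bak suffix keeps a backup to restore from afterward.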
That doesn't seem related; the issue persists.

So why am I reporting this as a "bug" when I lack enough detail to confirm the actual issue? Because every single step I did was the same as performed under 13.0, on the same hardware, which had been 100% stable for 3+ months. All things being equal, there is something "wrong" or "different" in 13.1-RC6 which is now breaking my setup.

In the interest of helping rule this out as a flaw in RC6, I am willing to do what I can to troubleshoot further, but honestly would need more input as to the proper diagnostic steps. I do have a little more time to "play" with this hardware before I have to select a version and put it into production. I was holding out so I could run 13.1 when it goes to release, but if I cannot figure this out I will roll back to 13.0 for production, since that was fully stable.

Please let me know what other details to provide, and any suggestions for troubleshooting or further diagnostics. Just looking to contribute to RC6 testing and determine if this is a bug or a "just me" problem. Thanks!

-Greg-

-- You are receiving this mail because: You are the assignee for the bug.