[Bug 255445] lang/python 3.8/3.9 SIGSEGV core dumps in libthr TrueNAS

Tue Apr 27 18:41:24 UTC 2021

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255445

            Bug ID: 255445
           Summary: lang/python 3.8/3.9 SIGSEGV core dumps in libthr
                    TrueNAS
           Product: Ports & Packages
           Version: Latest
          Hardware: amd64
                OS: Any
            Status: New
          Keywords: crash
          Severity: Affects Many People
          Priority: ---
         Component: Individual Port(s)
          Assignee: python at FreeBSD.org
          Reporter: yocalebo at gmail.com
             Flags: maintainer-feedback?(python at FreeBSD.org)

Many TrueNAS (previously FreeNAS) users are seeing the main middlewared
process (python) dump core, starting with our version 12.0 release.

Relevant OS information:
12.2-RELEASE-p6 FreeBSD 12.2-RELEASE-p6 f2858df162b(HEAD) TRUENAS  amd64

Python versions that experience the core dump:
Python 3.8.7
Python 3.9.4

When initially researching this, I found a threading regression in Python 3.8
on FreeBSD and was able to resolve that particular problem by backporting the
commits:
https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8
and
https://github.com/python/cpython/commit/9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6.

I backported those commits because all of the core dumps that I've analyzed
crash in the same spot (or very close to it). For example, here are two
backtraces showing a NULL pointer dereference.

Core was generated by `python3.8: middlewared'.
 Program terminated with signal SIGSEGV, Segmentation fault.
 #0 cond_signal_common (cond=<optimized out>) at
/truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:457
warning: Source file is more recent than executable.
 457 mp = td->mutex_obj;
 [Current thread is 1 (LWP 100733)]
 (gdb) list
 452                _sleepq_unlock(cvp);
 453                    return (0);
 454                }
 455
 456                td = _sleepq_first(sq);
 457                mp = td->mutex_obj;
 458                cvp->__has_user_waiters = _sleepq_remove(sq, td);
 459                if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
 460                    if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
 461                        _thr_wake_all(curthread->defer_waiters, 

(gdb) p *td
Cannot access memory at address 0x0

And another one:
Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>) at
/truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:459
warning: Source file is more recent than executable.
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
[Current thread is 1 (LWP 101105)]
(gdb) list
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
462                                 curthread->nwaiter_defer);
463                             curthread->nwaiter_defer = 0;
(gdb) p *mp
Cannot access memory at address 0x0

I'm trying to write a program that stress-tests threading (repeatedly tearing
down and recreating threads, and so on), but so far I've been unable to trigger
this particular problem. End users who have seen this core dump sometimes go a
month or more without a problem. I'm hoping someone more knowledgeable can at
least give me a pointer or help me figure this one out. I have access to a VM
that has all the relevant core dumps available, so if someone needs remote
access to it to poke around, please let me know. You can reach me at
caleb [at] ixsystems.com
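
For illustration, a stress program of the kind described above might look
roughly like the sketch below. It is not the reporter's actual test program:
the names churn_once and stress, the worker counts, and the round counts are
all hypothetical. It repeatedly creates a pool of threads, hands them work
through a threading.Condition, and tears the pool down again. Whether this
reaches the same cond_signal_common path in libthr is an assumption, and it is
not known to reproduce the crash.

```python
import threading


def churn_once(n_workers=8, n_items=200):
    """Start workers, feed them items via a Condition, then tear them down.

    Returns the number of items the workers consumed.
    """
    cond = threading.Condition()
    queue = []
    consumed = []
    state = {"done": False}  # shared flag, only touched while holding cond

    def worker():
        while True:
            with cond:
                # Sleep until there is work or the producer has finished.
                while not queue and not state["done"]:
                    cond.wait()
                if not queue:
                    return  # finished and nothing left to consume
                item = queue.pop()
            consumed.append(item)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Producer: signal one waiter per item.
    for i in range(n_items):
        with cond:
            queue.append(i)
            cond.notify()

    # Tell every worker to exit, then tear the pool down.
    with cond:
        state["done"] = True
        cond.notify_all()
    for t in threads:
        t.join()
    return len(consumed)


def stress(rounds=50, n_workers=8, n_items=200):
    """Repeat the create/signal/teardown cycle; hypothetical parameters."""
    total = 0
    for _ in range(rounds):
        total += churn_once(n_workers, n_items)
    return total


if __name__ == "__main__":
    print(stress())
```

Given that affected systems can run for a month or more between crashes, a
loop like this would probably need to run for a long time, or alongside the
real middlewared workload, before it showed anything.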
