Helping understand cause of SIGSEGV
Pete Wright
pete at nomadlogic.org
Sun Nov 8 00:12:22 UTC 2020
On 11/7/20 11:57 AM, Patrick Mahan wrote:
> On Sat, Nov 7, 2020 at 9:59 AM Pete Wright <pete at nomadlogic.org> wrote:
>
>>
>> On 11/5/20 9:44 PM, Patrick Mahan wrote:
>>
>> On Thu, Nov 5, 2020 at 5:01 PM Pete Wright <pete at nomadlogic.org> wrote:
>>
>>>
>>> On 11/5/20 4:01 PM, Patrick Mahan wrote:
>>>
>>>
>>>
>>>> | thread #1, name = 'fluent-bit', stop reason = signal SIGABRT
>>>> * frame #0: 0x000000004087100a libc.so.7`__sys_thr_kill at
>>>> thr_kill.S:4
>>>> frame #1: 0x00000000407e6c84 libc.so.7`__raise(s=6) at raise.c:52:10
>>>> frame #2: 0x000000004089a5d9 libc.so.7`abort at abort.c:67:8
>>>> frame #3: 0x000000000034a7a8
>>>> fluent-bit`flb_signal_handler(signal=11) at fluent-bit.c:418:9
>>>> frame #4: 0x00000000406d1c20
>>>> libthr.so.3`handle_signal(actp=0x00007fffdfffc600, sig=11,
>>>> info=0x00007fffdfffc9f0, ucp=0x00007fffdfffc680) at thr_sig.c:303:3
>>>> frame #5: 0x00000000406d11ef libthr.so.3`thr_sighandler(sig=11,
>>>> info=0x00007fffdfffc9f0, _ucp=0x00007fffdfffc680) at thr_sig.c:246:2
>>>> frame #6: 0x00007fffffffe193
>>>> frame #7: 0x000000000036fe0c fluent-bit`tasks_start [inlined]
>>>> output_params_set(th=0x00000000416091c0, data=0x000000004165d980,
>>>> bytes=128, tag="random.0", tag_len=8, i_ins=0x0000000040e58000,
>>>> out_plugin=0x0000000040e2dfc0, out_context=0x00000000416051e0,
>>>> config=0x0000000040e19180) at flb_output.h:429:5
>>>>
>>> I would look at what is happening here in output_params_set(). Something
>>> is accessing out of bounds memory.
>>>
>>>
>>>
>>> thanks for your response Patrick i really appreciate it.
>>>
>>> So here is where output_params_set() is defined - with an interesting
>>> comment that i haven't chased down yet:
>>>
>>> 521 /* Workaround for makecontext() */
>>> 522 output_params_set(th,
>>> 523 buf,
>>> 524 size,
>>> 525 tag,
>>> 526 tag_len,
>>> 527 i_ins,
>>> 528 o_ins->p,
>>> 529 o_ins->context,
>>> 530 config);
>>> 531 return th;
>>> 532 }
>>> 533
>>>
>>> and the frame from the backtrace is this for reference:
>>> frame #8: 0x000000000036fd14 fluent-bit`tasks_start [inlined]
>>> flb_output_thread(task=0x00000000416410a0, i_ins=0x0000000040e58000,
>>> o_ins=0x0000000040e5b000, config=0x0000000040e19180,
>>> buf=0x000000004165d980, size=128, tag="random.0", tag_len=8) at
>>> flb_output.h:522
>>>
>>> and then later on line 429 of flb_output.h it does this:
>>> 428 FLB_TLS_SET(flb_libco_params, params);
>>> 429 co_switch(th->callee);
>>>
>>> like i said i'm not really sure how to grok this, but it sounds like one
>>> of the params in output_params_set isn't being set correctly. hopefully
>>> the code snippet makes the error more obvious :)
>>>
>>>
>> Okay, I don't know lldb very well. But according to the GDB to LLDB
>> command map <http://lldb.llvm.org/use/map.html> it uses the same commands
>> to move between frames. So at startup you want to ensure you are in thread
>> 1 (thread select 1). That should place you in the last frame on the stack
>> (frame #0). You just move up the stack using the command 'up' until you
>> are in frame #7.
>>
>> Once there you need to dump the contents of 'th' using the command 'p *th'
>> or 'frame variable -T *th'. I suspect the value of th->callee is
>> incorrect. The next frame on the stack is -
>>
>> frame #6: 0x00007fffffffe193
>>
>> This is different from the rest of the stack addresses. So I suspect it
>> is out of bounds.
>>
>> Patrick
>>
>>
>>
>> that's totally it - thanks Patrick!
>>
>> frame #7: 0x000000000036fe0c fluent-bit`tasks_start [inlined]
>> output_params_set(th=0x00000000416091c0, data=0x000000004165d980,
>> bytes=128, tag="random.0", tag_len=8, i_ins=0x0000000040e58000,
>> out_plugin=0x0000000040e2dfc0, out_context=0x00000000416051e0,
>> config=0x0000000040e19180) at flb_output.h:429:5
>> 426 params->th = th;
>> 427
>> 428 FLB_TLS_SET(flb_libco_params, params);
>> -> 429 co_switch(th->callee);
>> 430 }
>> 431
>> 432 static FLB_INLINE void output_pre_cb_flush(void)
>> (lldb) p *th
>> (flb_thread) $0 = {
>> caller = 0x00000000406b2950
>> callee = 0x000000004169f640
>> data = 0xa5a5a5a5a5a5a5a5
>> cb_destroy = 0x0000000000000000
>> }
>> (lldb)
>>
>> i guess the next question to answer is why is this out of bounds. i'm
>> gonna poke around and see what i can learn today.
>>
>>
> The value of th->callee should be a function, I think. That is just from a
> cursory glance at libco.
>
> Good luck.
interesting - so it looks like fluent-bit includes their own version of
libco under lib/flb_libco. i didn't observe any major differences from
its upstream via a cursory glance. the included doc has this to say
about co_switch():
void co_switch(cothread_t cothread)
Switch to specified cothread.
Null (0) or invalid cothread handle is not allowed.
Passing handle of active cothread to this function is not allowed.
looking through their flb_thread_libco.h file the implementation looks
like this:
#define flb_thread_return(th) co_switch(th->caller)

static FLB_INLINE void flb_thread_resume(struct flb_thread *th)
{
    pthread_setspecific(flb_thread_key, (void *) th);

    /*
     * In the past we used to have a flag to mark when a coroutine
     * has finished (th->ended == MK_TRUE), now we let the coroutine
     * to submit an event to the event loop indicating what's going on
     * through the call FLB_OUTPUT_RETURN(...).
     *
     * So we just swap context and let the event loop to handle all
     * the cleanup required.
     */
    th->caller = co_active();
    co_switch(th->callee);
}
the above code is old (from 2016) so i don't think that's the issue.
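for anyone following along, here is a rough sketch of the caller/callee
hand-off that flb_thread_resume() and flb_thread_return() perform. it is
not fluent-bit or libco code: it assumes POSIX ucontext (makecontext /
swapcontext) as a stand-in for co_switch(), and struct demo_thread,
demo_resume(), demo_return() and run_demo() are hypothetical names made
up for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Hypothetical stand-in for struct flb_thread (not the real layout). */
struct demo_thread {
    ucontext_t caller;  /* where to go back to, like th->caller        */
    ucontext_t callee;  /* the coroutine itself, like th->callee       */
    void *stack;        /* stack memory owned by the coroutine         */
    int steps;          /* how many chunks of the body have run so far */
};

/* Plays the role of the thread-local slot set by pthread_setspecific(). */
static struct demo_thread *current;

/* Like flb_thread_return(): save the coroutine, jump back to the caller. */
static void demo_return(struct demo_thread *th)
{
    swapcontext(&th->callee, &th->caller);
}

/* Like flb_thread_resume(): record the caller, then switch to the callee. */
static void demo_resume(struct demo_thread *th)
{
    assert(th != NULL);  /* a null handle must never be switched to */
    current = th;
    swapcontext(&th->caller, &th->callee);
}

/* Coroutine body: does one step, yields, then does a second step. */
static void coroutine_body(void)
{
    struct demo_thread *th = current;

    th->steps++;
    demo_return(th);
    th->steps++;
    /* Returning here continues at uc_link, i.e. th->caller. */
}

static struct demo_thread *demo_create(size_t stack_size)
{
    struct demo_thread *th = calloc(1, sizeof(*th));

    if (th == NULL)
        return NULL;
    th->stack = malloc(stack_size);
    if (th->stack == NULL || getcontext(&th->callee) != 0) {
        free(th->stack);
        free(th);
        return NULL;
    }
    th->callee.uc_stack.ss_sp = th->stack;
    th->callee.uc_stack.ss_size = stack_size;
    th->callee.uc_link = &th->caller;
    makecontext(&th->callee, coroutine_body, 0);
    return th;
}

/* Resumes the coroutine twice and reports how many steps it completed. */
static int run_demo(void)
{
    struct demo_thread *th = demo_create(64 * 1024);
    int steps;

    if (th == NULL)
        return -1;
    demo_resume(th);  /* runs the body up to the first demo_return() */
    demo_resume(th);  /* runs the rest; the body returns via uc_link */
    steps = th->steps;
    free(th->stack);
    free(th);
    return steps;
}
```

calling run_demo() from a main() should return 2. the point of the sketch
is that the switch blindly trusts whatever is stored in the callee slot,
so if that handle (or the memory behind it) is stale or was never set up,
the jump can land on an address like the one in frame #6.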
thanks for your help on this Patrick - i suspect that to make much more
progress i'll need someone from the fluent-bit team to take a closer
look at what's happening.
cheers,
-pete
--
Pete Wright
pete at nomadlogic.org
@nomadlogicLA