Seg Fault while upgrading FreeBSD TCP stack to v13.1 in mbuf chain

From: Muhammad Waseem <Muhammad.Waseem_at_Sophos.com>
Date: Thu, 08 Feb 2024 15:18:02 UTC
We run software that operates middleboxes, and it uses the FreeBSD network stack for its TCP implementation. It has served us well, but we have not upgraded since 2013. In the past we addressed issues as we found them by applying patches from upstream, but that is no longer workable, so we decided to upgrade to 13.1. The upgrade was going fine until we ran our smoke tests and hit a core dump.

The smoke test in question uses Linux Traffic Control to add delay and other impairments to packet transmission. The exact options are: delay 10ms reorder 10%. We use curl to send a 512 MB randomly generated file from the client with a 45-second timeout. Even if we expect the test to fail because of changes in the upgrade, it should not result in a segfault.

We primarily see the segmentation fault on a low-memory machine and not in other environments. We are not at liberty to run memory analysis tools in the environment where the core is reproduced.

As for the segmentation fault itself, it occurs in the mbuf chain. The crash point differs between test runs, but it is usually in one of these functions:
1. sbdrop_internal
2. tcp_m_copym
3. m_split (very rarely)

What is notable, however, is that the crash always happens when accessing a member of the current mbuf, e.g. m->m_len or m->m_flags. I have traced the faulty address from these functions back to a socket that is obtained from the inpcb struct in tcp_input_with_port(). The address is, of course, inaccessible in gdb. It belongs to the mbuf chains in the so_snd socket buffer, usually the chain at sb_sndptr: either the first mbuf itself or one further down the chain. In one or two cores the same applies to the sb_mb chain (which I assume is the main chain itself). From the addresses it clearly looks like a heap overflow: in one of the cores I was able to walk the sb_sndptr chain until I reached the faulty address.
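The chain walk described above can be turned into a consistency check that runs before the crashing functions touch the chain. The following is only a minimal sketch under stated assumptions: mini_mbuf and mini_sockbuf are hypothetical, simplified stand-ins (not the FreeBSD struct mbuf or struct sockbuf layouts), and it assumes all mbufs come from one contiguous pool whose bounds are known. It also assumes the invariant that sb_sndptr, when set, points at an mbuf reachable from sb_mb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the FreeBSD layouts. */
struct mini_mbuf {
    struct mini_mbuf *m_next;   /* next mbuf in the chain */
    int m_len;
    int m_flags;
};

struct mini_sockbuf {
    struct mini_mbuf *sb_mb;     /* head of the chain */
    struct mini_mbuf *sb_sndptr; /* cached pointer into the chain, may be NULL */
};

/* True if p lies inside [pool, pool + count) on an element boundary. */
static int
ptr_in_pool(const struct mini_mbuf *p, const struct mini_mbuf *pool,
    size_t count)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)pool;
    uintptr_t hi = lo + count * sizeof(*pool);

    return (a >= lo && a < hi && (a - lo) % sizeof(*pool) == 0);
}

/*
 * Walk the chain from head, never dereferencing a pointer that fails
 * the pool check. Returns the 0-based position of the first bad link,
 * -1 if the chain ends cleanly, or max_links if the walk did not end
 * within max_links steps (possible cycle).
 */
static int
first_bad_link(const struct mini_mbuf *head, const struct mini_mbuf *pool,
    size_t count, int max_links)
{
    const struct mini_mbuf *m = head;
    int i;

    for (i = 0; m != NULL && i < max_links; i++) {
        if (!ptr_in_pool(m, pool, count))
            return (i);
        m = m->m_next;
    }
    return (m == NULL ? -1 : i);
}

/* True if sb_sndptr is NULL or reachable from sb_mb via valid links. */
static int
sndptr_in_chain(const struct mini_sockbuf *sb, const struct mini_mbuf *pool,
    size_t count, int max_links)
{
    const struct mini_mbuf *m = sb->sb_mb;
    int i;

    if (sb->sb_sndptr == NULL)
        return (1);
    for (i = 0; m != NULL && i < max_links && ptr_in_pool(m, pool, count);
        i++) {
        if (m == sb->sb_sndptr)
            return (1);
        m = m->m_next;
    }
    return (0);
}
```

Calling checks of this kind at the entry of the crashing functions and logging the first failure could move the failure report closer to the point of corruption, which may help when only cores are available. In a real stack the pool test would have to be replaced by whatever validity test the allocator allows.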

The last address before the faulty one is 0x7f402a4d0700, after which comes 0x9fff22eb779f. The first place I see this faulty address is in the frame of tcp_input_with_port(), in the inpcb struct (inp). The obvious difference between the two addresses suggests that an overflow occurred somewhere while accessing or assigning the mbuf. This is the most common back trace:
1. sbdrop_internal()
2. sbdrop_locked()
3. tcp_do_segment()
4. tcp_input_with_port()
5. in_input()
6. netisr_dispatch_src()
7. ether_demux()
8. ether_input_internal()
9. ether_nh_input()
10. netisr_dispatch_src()
11. netisr_dispatch()
12. ns_net_tcp_push_frame()
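The two addresses quoted above can be compared with a quick canonical-address check. This is a sketch that assumes an x86-64 machine with 48-bit virtual addresses (4-level paging); the post does not state the architecture, so that is an assumption, and is_canonical_48 is a hypothetical helper name. Under that assumption 0x7f402a4d0700 passes the check and 0x9fff22eb779f does not, which would mean the second value cannot be a valid pointer at all. That is consistent with a pointer field having been overwritten by non-pointer data, but it does not identify what overwrote it.

```c
#include <stdint.h>

/*
 * Assumption: x86-64 with 48-bit virtual addresses. An address is
 * canonical when bits 63..47 are all zeros or all ones.
 */
static int
is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* the 17 most significant bits */

    return (top == 0 || top == 0x1ffff);
}
```

The same test can be applied to every m_next value while walking a chain in a core, to tell overwritten pointers apart from stale but well-formed ones.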

I have tried to trace the source of the faulty address further back than tcp_input_with_port(), but to no avail. I only have cores available, and running under gdb prevents the seg fault from occurring in the test. I have gone through the code and, to my limited understanding, nothing points to a heap buffer overflow in any of the above functions. Any help pointing me in the right direction would be greatly appreciated. If you need more information, or if there is a more appropriate mailing list, please let me know.

Thanks,
Waseem