[Bug 283903] rtw88: possible skb leak

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 31 Jan 2025 22:54:32 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283903

--- Comment #21 from Guillaume Outters <guillaume-freebsd@outters.eu> ---
(In reply to Bjoern A. Zeeb from comment #16, comment #17, comment #19)

> This may or may not be the same problem (and I do have four different Realtek cards in this system) but another possible data point:
> [...]
> It started like (but no SKB alloc failures in the log):
> rtw881: failed to dequeue 3967 skb TX queue 5, BD=0xffffffff, rp 127 -> 4095

"skb alloc failed" is not the primary symptom; rather, it's a consequence that
appears once the situation has degraded too much.

For me the first symptom is `vmstat -m | grep skb` growing at the pace of
network activity, sometimes as soon as 10 min after the reboot (but sometimes
far later than that; I still don't understand what triggers the change).

At boot I always have a disturbingly flat value of 16781312 (16 MiB + 1 buffer
of 4096 bytes); some activity will make it eat 1 or 2 or more buffers, but it
always ends up returning to this value.

Then as soon as I see it going over 18 MB, I know that something has gone
rogue, and that _any new activity will make unreleased skbuffs grow
proportionally to network traffic_.
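
To make the arithmetic behind those two values explicit, here is a minimal
Python sketch. It assumes the skb figure from `vmstat -m` has already been
extracted as a number of bytes; the function names are mine, and reading "18
MB" as exactly 18 MiB is an assumption for illustration only.

```python
PAGE = 4096                    # buffer size quoted above, in bytes
POOL = 16 * 1024 * 1024        # the flat 16 MiB part of the boot-time value
THRESHOLD = 18 * 1024 * 1024   # assumption: "over 18 MB" read as 18 MiB


def buffers_over_pool(mem_bytes: int) -> float:
    """Number of 4096-byte buffers held beyond the 16 MiB part."""
    return (mem_bytes - POOL) / PAGE


def gone_rogue(mem_bytes: int) -> bool:
    """True once usage exceeds the (assumed) 18 MiB warning level."""
    return mem_bytes > THRESHOLD


print(buffers_over_pool(16781312))  # boot-time value: 1.0 buffer over 16 MiB
print(gone_rogue(16781312))         # False: still in the healthy phase
```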

In my last 5 attempts, for a 10 MiB nc transfer (either from or to my laptop),
it had grown by 19.43 MiB, then 18.60, 18.52, 14.52 and 15.04,
after temporarily peaking at 23.20 MiB, 45.36, 42.29, 21.64 and 26.52 above the
baseline (the `vmstat -m` value just before the transfer, at around 2.2 GiB;
that is, before the first transfer vmstat was at 2.2 GiB, during the transfer
it peaked at 2.22320 GiB, and after the transfer it went back to "only" 2.21943
GiB).

During that phase, I noticed some sluggishness or small Wayland freezes (from 3
seconds to dozens of seconds) from time to time, as if (but that's just a
guess) it did a lookup over all the already allocated memory (to look for the
oldest buffer to free? For a free memory block in which to allocate a new
buffer? To relocate buffers? To reach the end of a linked-list pool of buffers,
looking for one to reuse?).

The "skb alloc failed" only occurs ONCE IT CANNOT ALLOCATE ANYMORE IN THE FIRST
4 GB OF RAM (due to compat.linuxkpi.skb.mem_limit=1, as I understand it), after
having grown MB by MB (and while competing with userspace processes: this
afternoon after some "skb alloc failed", I quit a long-running Firefox, and as
a result got 1 or 2 hours without "skb alloc failed").

> Did you also instrument the RX path?
> Are your SCPs pushing or pulling data?  as in do you copy a file off from the rtw88 device or do you copy a file to the rtw88 device?
> But also 170 bytes is really not much each time.

I just instrumented those functions, because that file was the one I could grep
'skb_free|free_skb' in, and I didn't look further.

And my original tests were on pushes (via scp).
But today with my 10 MiB of a .gz file pushed and pulled via nc,
I could measure the same non-freed allocations: ALL of the ~3400 allocations
traced for one transfer were between 126 and 196 bytes long (but it's not an
absolute limit: another test later saw 4 out of 4000 at 243, 249, 368 and 434
bytes).

However, the `vmstat -m | grep skb` increase was way more than the sum of
all those traced packets; for a 10 MiB transfer that resulted in a 15 - 20 MiB
increase of skb allocated space, only 650 KB were traced by my artisanal probe.
On the other hand, I don't know how the allocator works: IF FOR EACH 170-BYTE
ALLOCATION REQUESTED, THE ALLOCATOR RETURNS A FULL 4096-BYTE PAGE, THEN THIS
EXPLAINS OUR 15 - 20 MB INCREASE (I saw from 3500 to 7000 calls through my
trace, which multiplied by 4 K gives 14.3 to 28.7 MB, or about 13.7 to 27.3
MiB: it would match the vmstat report well!).
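
The estimate above can be checked with a short Python sketch. Note that the
one-page-per-allocation behaviour is only my hypothesis, not something I have
confirmed in the allocator; the function name is made up for the example, and
the call counts are the ones from my trace.

```python
PAGE = 4096          # hypothesised allocation granularity, in bytes
MIB = 1024 * 1024


def page_rounded_bytes(calls: int, page: int = PAGE) -> int:
    """Memory used if every traced allocation consumed one full page."""
    return calls * page


for calls in (3500, 7000):
    total = page_rounded_bytes(calls)
    # Show the same total in decimal MB and in binary MiB.
    print(calls, "calls:", round(total / 1e6, 1), "MB =",
          round(total / MIB, 1), "MiB")
```

With 3500 and 7000 calls this gives 14.3 MB (13.7 MiB) and 28.7 MB (27.3 MiB),
a range that brackets the 15 - 20 MiB growth reported by vmstat.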

> BTW. you do not have to patch the kernel for this.  Dtrace provides adequate tracing functionality in this case.
> Here's a sample I shared earlier on which you can probably use as a start:
> [...]

Nice! I've been telling myself for a long time that I should look into Dtrace,
but I never did; you're adding to the good reasons to do that in 2025.

> Coming back after a while I see on the 1 minute update differences for vmstat -m for lkpiskb (but not mbuf-tags):
> # It's exactly one page a time!
> % expr 74 \* 4096
> 303104

From my experience (see the first block of this reply), during the "non
problematic phase", continuous use of the network (a transfer) made the skb
allocation grow by multiples of 4 KB, but as soon as the network pressure
dropped, they were released and it went back to 16 MB + 4 KB.

So it may be normal to have a small, permanent increase... as long as the
dequeue has opportunities to run, and, more importantly, as long as it does not
just give up.

Now I'll have to reboot to post this long comment: any HTTP request now triggers
at least one "skb alloc failed", and the network isn't usable anymore.
