pf state disappearing [ adaptive timeout bug ]
Matthew Grooms
mgrooms at shrew.net
Fri Jan 22 22:02:19 UTC 2016
On 1/22/2016 3:35 PM, Nick Rogers wrote:
> On Thu, Jan 21, 2016 at 11:44 AM, Matthew Grooms <mgrooms at shrew.net> wrote:
>
>> # pfctl -si
>> Status: Enabled for 0 days 02:25:41 Debug: Urgent
>>
>> State Table                        Total             Rate
>>   current entries                  77759
>>   searches                     483831701        55352.0/s
>>   inserts                         825821           94.5/s
>>   removals                        748060           85.6/s
>> Counters
>>   match                         27118754         3102.5/s
>>   bad-offset                           0            0.0/s
>>   fragment                             0            0.0/s
>>   short                                0            0.0/s
>>   normalize                            0            0.0/s
>>   memory                               0            0.0/s
>>   bad-timestamp                        0            0.0/s
>>   congestion                           0            0.0/s
>>   ip-option                         6655            0.8/s
>>   proto-cksum                          0            0.0/s
>>   state-mismatch                       0            0.0/s
>>   state-insert                         0            0.0/s
>>   state-limit                          0            0.0/s
>>   src-limit                            0            0.0/s
>>   synproxy                             0            0.0/s
>>
>> # pfctl -st
>> tcp.first 120s
>> tcp.opening 30s
>> tcp.established 86400s
>> tcp.closing 900s
>> tcp.finwait 45s
>> tcp.closed 90s
>> tcp.tsdiff 30s
>> udp.first 600s
>> udp.single 600s
>> udp.multiple 900s
>> icmp.first 20s
>> icmp.error 10s
>> other.first 60s
>> other.single 30s
>> other.multiple 60s
>> frag 30s
>> interval 10s
>> adaptive.start 90000 states
>> adaptive.end 120000 states
>> src.track 0s
>>
>> I think there may be a problem with the code that calculates the adaptive
>> timeout values that is making it way too aggressive. If by default the
>> timeouts are supposed to scale down linearly as the state count grows from
>> 60% to 120% of the state table limit, I shouldn't be losing TCP connections
>> that are only idle for a few minutes when the state table is < 70% full.
>> Unfortunately that appears to be the case. At most this should have
>> decreased the 86400s timeout by 17%, to roughly 72000s, for established
>> TCP connections.
> That doesn't make sense to me either. Even if the math is off by a factor
> of 10, the state should still live for a couple of hours.
>
>> I've tested this for a few hours now and all my idle SSH sessions have
>> been rock solid. If anyone else is scratching their head over a problem
>> like this, I would suggest disabling the adaptive timeout feature or
>> increasing it to a much higher value. Maybe one of the pf maintainers can
>> chime in and shed some light on why this is happening. If not, I'm going to
>> file a bug report as this certainly feels like one.
>>
> Did you go with making the adaptive timeout less aggressive, or with
> disabling it entirely? I would think that if the adaptive timeout were
> really that broken, more people would notice this problem, especially me,
> since I have many servers running a very short tcp.established timeout.
> Still, the fact that you are noticing this kind of weirdness has me
> concerned about how the adaptive setting is affecting my environment.
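To recap the scaling rule that the 17% / 72000s figures above assume: per my
reading of pf.conf(5), once the state count passes adaptive.start every
timeout is multiplied by (adaptive.end - states) / (adaptive.end -
adaptive.start). The little helper below is only my own illustration of that
formula, using the documented defaults for a 100K state limit, not a copy of
the pf source:

#include <stdio.h>
#include <stdint.h>

/*
 * Linear scaling as described in pf.conf(5): timeouts are unchanged below
 * adaptive.start, drop to zero above adaptive.end, and are multiplied by
 * (adaptive.end - states) / (adaptive.end - adaptive.start) in between.
 * This is my reading of the man page, not the kernel code.
 */
static uint32_t
scaled_timeout(uint32_t timeout, uint32_t states, uint32_t start, uint32_t end)
{
	if (start == 0 || end <= start || states <= start)
		return (timeout);	/* scaling not active */
	if (states >= end)
		return (0);		/* table overloaded, expire at once */
	return ((uint64_t)timeout * (end - states) / (end - start));
}

int
main(void)
{
	uint32_t start = 60000, end = 120000;	/* defaults for a 100K limit */

	/* 70% full: 86400s should only drop to 72000s (the 17% figure). */
	printf("at 70000 states: %us\n",
	    (unsigned)scaled_timeout(86400, 70000, start, end));

	/* my table right now, ~78K states: still around 60800s */
	printf("at 77759 states: %us\n",
	    (unsigned)scaled_timeout(86400, 77759, start, end));

	return (0);
}

Even at my current ~78K states the established timeout should still be over
60000 seconds, nowhere near short enough to drop an SSH session that has only
been idle for a few minutes.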
I increased the adaptive.start value to 90K for the 100K state limit. Yes,
it's concerning.
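For anyone who wants to adjust the same knobs, this is roughly the shape of
it in pf.conf (the 100K limit is my setup; per pf.conf(5), setting both
adaptive values to 0 disables the scaling entirely):

  set limit states 100000
  set timeout tcp.established 86400
  set timeout { adaptive.start 90000, adaptive.end 120000 }
  # or, to take adaptive scaling out of the picture while testing:
  # set timeout { adaptive.start 0, adaptive.end 0 }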
Today I set up a test environment at about 1/10th the connection volume to
see if I could reproduce the issue on a smaller scale, but had no luck. I'm
trying to find a command-line test program that will generate enough TCP
connections to reproduce it at a scale similar to my production environment.
So far I haven't found anything that will do the trick, so I may end up
rolling my own. I'll reply back to the list if I can find a way to reproduce
this.
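The kind of tool I have in mind is just a loop that opens a few tens of
thousands of TCP connections through the firewall and then sits idle, so the
states have a chance to age out (or vanish early, if the bug is real). A
rough, untested sketch of that idea; host, port and count are placeholders,
and the open-files limit has to be raised well above the connection count
before running it:

#include <err.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

/*
 * Open "count" idle TCP connections to host:port and hold them open, so the
 * firewall in between has to track that many states. Sketch only: no rate
 * limiting, no keepalives, no cleanup.
 */
int
main(int argc, char *argv[])
{
	struct addrinfo hints, *res;
	int i, s, error, count;

	if (argc != 4)
		errx(1, "usage: %s host port count", argv[0]);
	count = atoi(argv[3]);

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	error = getaddrinfo(argv[1], argv[2], &hints, &res);
	if (error != 0)
		errx(1, "getaddrinfo: %s", gai_strerror(error));

	for (i = 0; i < count; i++) {
		s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
		if (s < 0)
			err(1, "socket (after %d connections)", i);
		if (connect(s, res->ai_addr, res->ai_addrlen) < 0)
			err(1, "connect (after %d connections)", i);
		/* Keep the descriptor open on purpose; the state should linger. */
	}

	printf("%d connections established, sleeping\n", count);
	for (;;)
		sleep(60);	/* watch pfctl -si / pfctl -ss on the firewall */
}

A single source address tops out around the ephemeral port range, so getting
to production scale probably means running this from several machines or
binding multiple source addresses.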
Thanks again,
-Matthew