Can an app crash from a single TCP packet lost in transmission?

Fri Jul 17 20:09:08 UTC 2009

The first thing I noticed was that my nameserver was gone.
I looked for the reason and found this in the log:

>Jul 15 04:04:52 <kern.crit> edge kernel: swap_pager_getswapspace(3): failed
< ... hundreds more of these ... >
>Jul 15 04:05:07 <kern.err> edge kernel: pid 47113 (named), uid 53, was 
                killed: out of swap space

That didn't make sense - the machine has enough swap space.
But since this repeated every other night, I started logging
ps output once a minute.
That is how I found a postgres database backup misbehaving:

03:23   70 78433 78432   0  96  0  8220  4196 -      R     ??    0:22.84 pg_dump -b 
< ... >
03:49   70 78433 78432   0  96  0  8220  4024 -      R     ??   17:06.61 pg_dump -b 
03:50   70 78433 78432   0  96  0  8220  4024 -      R     ??   17:46.15 pg_dump -b 
03:51   70 78433 78432   0  96  0  8220  4024 -      R     ??   18:26.69 pg_dump -b 
03:52   70 78433 78432   0  47  0 139292 57888 select S     ??   18:37.65 pg_dump -b 
03:53   70 78433 78432   0  48  0 139292 57828 select S     ??   18:40.36 pg_dump -b 
03:54   70 78433 78432   0 -20  0 401436 69092 swread DL    ??   18:42.49 pg_dump -b 
03:55   70 78433 78432   0 -20  0 401436 63232 swread DL    ??   18:43.99 pg_dump -b 
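
To pinpoint where the size changes in a listing like this one, the
samples can be compared pairwise. This is only a sketch: it assumes the
two numbers before the wait channel are VSZ and RSS in KB (as in a
FreeBSD ps -l style layout), the values are copied by hand from the
lines above, and the function name and the growth factor are made up
for the illustration.

```python
# Samples copied from the listing above: (wall-clock time, VSZ in KB, RSS in KB).
# Assumption: the two columns before the wait channel are VSZ and RSS.
samples = [
    ("03:49", 8220, 4024),
    ("03:50", 8220, 4024),
    ("03:51", 8220, 4024),
    ("03:52", 139292, 57888),
    ("03:53", 139292, 57828),
    ("03:54", 401436, 69092),
    ("03:55", 401436, 63232),
]


def growth_steps(rows, factor=2.0):
    """Return (prev_time, time, prev_vsz, vsz) wherever VSZ grew by at least `factor`.

    `factor` is an arbitrary threshold chosen for this illustration.
    """
    steps = []
    for (t0, v0, _rss0), (t1, v1, _rss1) in zip(rows, rows[1:]):
        if v1 >= factor * v0:
            steps.append((t0, t1, v0, v1))
    return steps


for step in growth_steps(samples):
    print(step)
```

With these samples it reports two steps: 8220 KB to 139292 KB between
03:51 and 03:52, and 139292 KB to 401436 KB between 03:53 and 03:54.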

That process starts with about 8 MB of memory and stays there for half
an hour; then suddenly, between 03:51 and 03:52, its memory usage explodes.
That night it did not run out of swap space - instead it gave an
error message:

>pg_dump: Error message from server: lost synchronization with server: 
>         got message type "0", length 154143043
>pg_dump: The command was: COPY public.file (fileid, fileindex, jobid, 
>         pathid, filenameid, markid, lstat, md5) TO stdout;

But at that point the database backup is in the middle of
dumping a table containing lots of small records - there is no
reason why a 154 MB "message" should be transferred between server
and client while copying records of ~60 bytes each.
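
A quick check of that length value, as a sketch: in the PostgreSQL
version 3 wire protocol a message starts with one type byte followed by
a 4-byte big-endian length. Assuming the client read its "header" from
the wrong place in the stream, the reported length can be turned back
into the four bytes it came from. The helper name and the sample
fragment below are made up for the illustration; only the number
154143043 and the type "0" come from the error message.

```python
import struct


def parse_header(buf):
    """Split a v3-style message header: 1 type byte + 4-byte big-endian length."""
    msg_type = buf[0:1].decode("ascii")
    (length,) = struct.unpack("!I", buf[1:5])
    return msg_type, length


# The length from the error message, turned back into its four raw bytes.
raw_length = struct.pack("!I", 154143043)
print(raw_length)  # b'\t0\tC'

# Hypothetical fragment of tab-separated COPY text sitting where a header
# was expected (not taken from the real table contents).
stray = b"0\t0\tC"
print(parse_header(stray))  # ('0', 154143043)
```

The four length bytes are TAB, "0", TAB, "C", which looks like a piece
of tab-separated COPY text rather than a real header. That would fit a
client that lost its place in the byte stream and then tried to allocate
a buffer for a 154 MB message; it does not explain how the stream got
out of step in the first place.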

One other thing happened between 03:51 and 03:52 - the DSL
internet connection disconnected and reconnected, and obtained a new
IP address. Afterwards, a script flushes and reloads an ipfw table()
with the new local addresses - and during this process one(!) packet
of the database session was dropped.

I could verify that correlation: on every night with memory
problems, a few packets from the database backup were lost during the
firewall reconfiguration - on nights when no packets were lost, there
were no memory problems.

I will now change the firewall handling to get rid of that packet loss,
but I also need a refresher on how TCP works:

I thought TCP would not be disturbed by a lost packet, but would
automatically resend that packet until an ACK is received; and I thought
this would happen below the application, so the application practically
CANNOT go weird from a lost packet...

Is there any reason why this would not be true on a localhost connection?

rgds, 
PMc