Can an app crash from a single TCP packet lost in transmission?
Peter Much
pmc at citylink.dinoex.sub.org
Fri Jul 17 20:09:08 UTC 2009
The first thing I noticed was that my nameserver was no longer running.
I searched for the reason and found:
>Jul 15 04:04:52 <kern.crit> edge kernel: swap_pager_getswapspace(3): failed
< ... hundreds more of these ... >
>Jul 15 04:05:07 <kern.err> edge kernel: pid 47113 (named), uid 53, was killed: out of swap space
That didn't make sense - the machine has enough swap space.
But since this repeated every other night, I started logging
ps output once a minute.
That is how I found a postgres database backup misbehaving:
03:23 70 78433 78432 0 96 0 8220 4196 - R ?? 0:22.84 pg_dump -b
< ... >
03:49 70 78433 78432 0 96 0 8220 4024 - R ?? 17:06.61 pg_dump -b
03:50 70 78433 78432 0 96 0 8220 4024 - R ?? 17:46.15 pg_dump -b
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65 pg_dump -b
03:53 70 78433 78432 0 48 0 139292 57828 select S ?? 18:40.36 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
03:55 70 78433 78432 0 -20 0 401436 63232 swread DL ?? 18:43.99 pg_dump -b
The process starts with about 8 MB of memory and runs that way for
half an hour; then, between 03:51 and 03:52, its memory usage
suddenly explodes.
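The same jump can be located mechanically in a longer log. The
following is only a minimal sketch: it assumes the column layout seen
in the sample lines above (timestamp first, virtual size in KB as the
eighth field), and the function name vsz_jumps and the growth factor
of 2 are arbitrary choices for illustration.

```python
# Sample lines copied from the ps log above. Assumed columns:
# time UID PID PPID CPU PRI NI VSZ(KB) RSS(KB) WCHAN STAT TT TIME COMMAND
PS_LOG = """\
03:49 70 78433 78432 0 96 0 8220 4024 - R ?? 17:06.61 pg_dump -b
03:50 70 78433 78432 0 96 0 8220 4024 - R ?? 17:46.15 pg_dump -b
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65 pg_dump -b
03:53 70 78433 78432 0 48 0 139292 57828 select S ?? 18:40.36 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
03:55 70 78433 78432 0 -20 0 401436 63232 swread DL ?? 18:43.99 pg_dump -b
"""


def vsz_jumps(log, factor=2):
    """Return (prev_time, time, prev_vsz, vsz) wherever VSZ grows by more than `factor`."""
    jumps = []
    prev = None  # (timestamp, vsz) of the previous sample
    for line in log.splitlines():
        fields = line.split()
        # Skip anything that does not look like a ps sample line.
        if len(fields) < 9 or not fields[7].isdigit():
            continue
        stamp, vsz = fields[0], int(fields[7])
        if prev is not None and vsz > factor * prev[1]:
            jumps.append((prev[0], stamp, prev[1], vsz))
        prev = (stamp, vsz)
    return jumps


for before, after, old, new in vsz_jumps(PS_LOG):
    print(f"{before} -> {after}: VSZ {old} KB -> {new} KB")
```

On the sample lines this reports the growth between 03:51 and 03:52
and the second one between 03:53 and 03:54.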
That night it did not run out of swap space; instead it gave an
error message:
>pg_dump: Error message from server: lost synchronization with server:
> got message type "0", length 154143043
>pg_dump: The command was: COPY public.file (fileid, fileindex, jobid,
> pathid, filenameid, markid, lstat, md5) TO stdout;
But at that time the database backup is right in the middle of
dumping a db table containing lots of small records - there is no
reason why a 154 MB "message" should be transferred between server
and client while copying records of ~60 bytes each.
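As a rough check on where that length could come from, the reported
header can be converted back into raw bytes. This sketch assumes the
framing of the PostgreSQL protocol version 3, where each message is
one type byte followed by a 4-byte big-endian length; the helper name
header_bytes is made up for illustration.

```python
import struct

# Values taken from the pg_dump error message above.
REPORTED_TYPE = "0"
REPORTED_LENGTH = 154143043


def header_bytes(msg_type, length):
    """Rebuild the 5 header bytes: 1 type byte + 4-byte big-endian length."""
    return msg_type.encode("ascii") + struct.pack(">I", length)


raw = header_bytes(REPORTED_TYPE, REPORTED_LENGTH)
print(raw)  # b'0\t0\tC'
```

The five bytes come out as the text 0, TAB, 0, TAB, C. That looks like
tab-separated row data, which is what COPY ... TO stdout emits in its
default text format, so it would be consistent with the client having
read the middle of a data row as if it were a message header. This is
an interpretation, not proof.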
One other thing happened between 03:51 and 03:52 - the DSL
internet connection disconnected and reconnected, obtaining a new
IP address. Afterwards, a script flushes and reloads an ipfw table()
with the new local addresses - and during this process one(!) packet
of the database session was dropped.
I could verify that relation: every night when there were memory
problems, a few packets from the database backup were lost during
the firewall reconfiguration - on nights when no packets were lost,
there were no memory problems.
I will now change the firewall handling to get rid of that packet loss,
but I also need a refresher on how TCP works:
I thought TCP would not be disturbed by a lost packet, but would
automatically resend that packet until an ACK is received; and I
thought this would happen below the application, so that in practice
the application CANNOT go weird from a lost packet...
Is there any reason why this would not be true on a localhost connection?
rgds,
PMc