Bind DoS?

Sat Sep 3 16:12:25 PDT 2005

Hello,

I am currently trying to set up two caching nameservers and noticed an 
interesting behaviour.

The configuration is the following:
two FreeBSD/amd64 6-CURRENT machines, with single Opteron processors.

Bind was compiled from ports, without threading, with gcc34 (from 
ports), with -O2 -static. It runs in a jail, with nothing more than the 
config and a nearly empty devfs mount.

Machine A has the following simple config:
options {
         directory "/etc/bind";
         tcp-clients 256;
         recursive-clients 8192;
         max-cache-size 600M;
         minimal-responses yes;
         pid-file "/tmp/named.pid";
         forwarders { MACHINE_B_IP; };
};

Machine B has the same bind, but runs as an authoritative NS with a 
wildcard record of:
*	IN	TXT	"256xA"
in the . zone (so it answers with 256 "A"s for every query).

The test:
From machine B I start queryperf like this:
queryperf -d list -s MACHINE_A_IP

where list contains the following:
www.RANDOMNUMBER.hu TXT
[...] repeated 9000000 times.
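
For reference, a list in this format can be produced with a short script. 
This is only a sketch: the function name, the seed parameter and the range 
of the random number are assumptions for illustration, not what was used 
in the test above.

```python
import random


def write_query_list(path, count, seed=None, max_label=10**9):
    """Write `count` lines of the form 'www.<random number>.hu TXT'.

    `max_label` (upper bound of the random number) is an arbitrary choice.
    """
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(count):
            # A random label makes nearly every name a cache miss on the
            # caching server, so each query is forwarded and then cached.
            f.write("www.%d.hu TXT\n" % rng.randrange(max_label))
```

Calling write_query_list("list", 9000000) writes a file suitable for the 
-d option of the queryperf command above.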

During the test, machine A fills its cache up to about 860 MB. Until 
then I see this in top:
CPU states: 27.7% user,  0.0% nice, 58.1% system, 14.2% interrupt,  0.0% idle

On machine B, queryperf receives answers within the default timeout (5 
seconds).

After bind reaches about 860 MB, it starts to eat CPU: there is 100% 
user and nearly 0% system and interrupt usage.

queryperf starts to time out with the following:
[Timeout] Query timed out: msg id 64837
Warning: Received a response with an unexpected (maybe timed out) id: 64837

The server effectively dies: it can answer only a very small number of 
queries, and only very slowly. If I stop queryperf, bind remains 
in the CPU-eating state:
76423 bind        1 129    0   861M   862M RUN      8:30 97.71% named

Because the machine has much more RAM, I first tried with 1200M in the 
config (instead of 600M). The server reached its "zombie" state at around 
1600 MB of usage and was very unresponsive.

On another (real) server, I noticed similar behaviour this week. Bind 
started to eat all CPU resources and there were only "recursive quota 
reached" messages in the logs, but rndc status reported only very low 
usage (for example 60/1024 on that server).

I can reproduce this with and without patch-lib_dns_resolver.c.

If I stop the queries, the server starts answering again after a few 
minutes, once it has finished its strange "CPU eating" loop.

ktrace shows it doing this many, many times between two successful queries:
  76423 named    CALL  gettimeofday(0x7fffffffe450,0)
  76423 named    RET   gettimeofday 0
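
To put a number on "many, many times", the trace can be summarised per 
system call. The sketch below assumes the dump lines have the same layout 
as the two shown above (pid, command, record type, details); the function 
name is hypothetical.

```python
from collections import Counter


def count_syscalls(lines):
    """Count CALL records per system call name in ktrace dump lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: pid, command, record type (CALL/RET/...), details.
        if len(fields) >= 4 and fields[2] == "CALL":
            # "gettimeofday(0x7fffffffe450,0)" -> "gettimeofday"
            counts[fields[3].split("(", 1)[0]] += 1
    return counts
```

Passing it the lines of the dump (for example an open file object) and 
looking at counts.most_common() shows which calls dominate the loop.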

Any ideas?

Thanks,
-- 
Attila Nagy                                   e-mail: Attila.Nagy at fsn.hu
Free Software Network (FSN.HU)           phone @work: +361 371 3536
ISOs: http://www.fsn.hu/?f=download            cell.: +3630 306 6758