A little research into how rm -rf and tar kill the server
Artem Kuchin
artem at artem.ru
Sat Mar 28 20:23:09 UTC 2015
I had a FreeBSD 9 server, last updated via buildkernel/buildworld in June 2014, and
everything was fine.
The server hosted 100+ web sites (Apache only), a MySQL server (40+
databases), and an IMAP server.
2x2TB Seagate disks in a geom mirror configuration, UFS with SU+J.
Every night an incremental or full backup was made into a tar.gz archive (gtar -czf).
It never caused any problems. Sometimes I had to delete entire sites
with tons of files
using rm -rf, and that never caused any problems either.
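For context, both operations were just plain commands, roughly of this shape (the paths and archive name here are illustrative, not the real ones):
gtar -czf /backup/nightly-full.tgz /usr/local/www
rm -rf /usr/local/www/old-site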
In February 2015 I migrated to a new server.
Hardware: the CPU is the same (some 4-core Xeon + HT), RAM is the same 32 GB,
the disks are now 2x3TB Toshiba, same geom mirror, same UFS with SU+J.
FreeBSD is now 10-STABLE (updated via buildkernel/buildworld).
I changed the front end to nginx, which now serves all static files; the back end
is still Apache.
I also switched from MySQL to MariaDB (it seems a lot faster).
In the first week I discovered two things:
1) untarring a backup (tar -xf) from tar.gz overloads the server
2) rm -rf overloads the server
Here is what the overload looks like. First, a typical picture of normal operation:
TOP:
last pid: 60506; load averages: 1.10, 0.93, 0.86 up 30+16:00:32 18:25:22
563 processes: 2 running, 560 sleeping, 1 stopped
CPU: 13.0% user, 0.0% nice, 2.4% system, 0.2% interrupt, 84.3% idle
Mem: 2034M Active, 25G Inact, 3412M Wired, 20M Cache, 1656M Buf, 943M Free
Swap: 4096M Total, 51M Used, 4045M Free, 1% Inuse
systat -io
/0% /10 /20 /30 /40 /50 /60 /70 /80 /90 /100
ada0 MB/s
tps|XXXXXX
ada1 MB/s
tps|XXXXX
mysql:
show processlist
26 connections,
all in the "Sleep" state (many requests are being executed, but they finish too fast to catch)
Now, when I started
tar -xf backup.tgz
after about 5 minutes the number of processes rose to 800, many of them in the "ufs" state,
mysql's show processlist showed about 200 requests in the "Opening tables" state,
and systat -io showed tps over 1000.
Sites that use MySQL stopped responding, static sites still worked but slowly,
and ssh login took a very long time.
I used pv to limit the tgz read bandwidth to 5 MB/s. It did not help; the end of
the world just started a little later.
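The rate-limited run was a pipeline roughly like this (my reconstruction; pv -L caps the read rate at 5 MB/s):
pv -L 5m backup.tgz | tar -xzf -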
Then I accidentally ran sync from the shell and the situation got better.
So I ran the same untar command again with a parallel script that did a sync
every 120 seconds, and all the problems went away.
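The parallel script was nothing more than a loop along these lines:
while true; do sync; sleep 120; done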
This is VERY strange, because sync(2) says a sync is done by the kernel every 30 seconds
anyway. Apparently not.
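As far as I understand, that 30-second figure comes from the syncer, whose delays are exposed as sysctls (the defaults should be around 30/29/28 seconds, worth checking on this box):
sysctl kern.filedelay kern.dirdelay kern.metadelay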
Next, I did a cp -Rp of a huge tree with tons of files: no such problems as with untar.
Then I ran
rm -rf test1
where test1 has 4 levels of subdirectories with tons of relatively small files.
Here is what I got after 3 minutes:
systat -io
/0% /10 /20 /30 /40 /50 /60 /70 /80 /90 /100
ada0 MB/sXXXX
tps|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX298.54
ada1 MB/sXXXX
tps|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX298.94
top
last pid: 69540; load averages: 2.48, 1.48, 1.16 up 30+16:55:02 19:19:52
767 processes: 2 running, 764 sleeping, 1 stopped
CPU: 0.9% user, 0.0% nice, 0.3% system, 0.2% interrupt, 98.6% idle
Mem: 8129M Active, 14G Inact, 3548M Wired, 333M Cache, 1655M Buf, 5722M Free
Swap: 4096M Total, 51M Used, 4045M Free, 1% Inuse
mysql:
show processlist
205 rows, mostly in the "Opening tables" and "creating table" states.
The web servers are dead.
I Ctrl-C'd the rm command and ran
sync
which took 230 seconds. That did not help: now there were 1000 processes and over 300 stuck SQL requests.
killall -TERM httpd
I waited; that did not help either.
sync
took 200 seconds. The mysql query queue emptied,
systat tps jumped to 1000,
then returned to idle.
I restarted httpd
and everything was OK.
TEST 2 for rm:
rm -rf test1
sync every 60 seconds in parallel.
The same problem; sync every 60 s did not help.
TEST 3 for rm
/usr/bin/nice -n20 rm -rf test1
All the same
TEST 4 for rm
/usr/bin/nice -n20 rm -rf test1
fsync every 60 seconds in parallel
All the same
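To be explicit, tests 3 and 4 had roughly this shape (assuming the periodic flush was a plain sync in a background loop; the exact script is not shown here):
( while true; do sync; sleep 60; done ) &   # background flusher
SYNCPID=$!
/usr/bin/nice -n20 rm -rf test1
kill $SYNCPID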
So, questions and thoughts:
1) Why did I have no such problems on FreeBSD 9? I think the cause of the
problem is in the kernel, not in the hardware or in MariaDB+nginx, because the server
load did not
increase at all; it even decreased a little.
2) I consider this a severe bug, because even a normal user (and I have
plenty of them with ssh access) can eventually run rm -rf and kill all the sites. Which means
there must be some way to limit I/O usage per user (see the sketch below).
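On the per-user I/O limit: rctl(8) with RACCT can express per-user rules, and if I am not mistaken the disk I/O throttling resources (readbps, writebps, readiops, writeiops) only appeared in releases newer than 10, so take this as a sketch to verify rather than something guaranteed to work here (the user name is just an example):
# requires a kernel built with options RACCT and RCTL
rctl -a user:someuser:readbps:throttle=10m
rctl -a user:someuser:writebps:throttle=10m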
Artem