high load system do not take all CPU time

Mon Dec 26 19:44:31 UTC 2011

Hello, Коньков.

You wrote on 26 December 2011, 20:52:11:

КЕ> Hello, Коньков.

КЕ> You wrote on 25 December 2011, 18:10:17:

КЕ>> Hello, wishmaster.

КЕ>> You wrote on 19 December 2011, 6:54:08:

w>>>   --- Original message ---
w>>>  From: "Коньков Евгений" <kes-kes at yandex.ru>
w>>>  To: "Daniel Staal" <DStaal at usa.net>
w>>>   Date: 18 December 2011, 19:47:40
w>>>  Subject: Re[2]: high load system do not take all CPU time
w>>>  
w>>>  

>>>> Hello, Daniel.
>>>> 
>>>> You wrote on 18 December 2011, 17:52:00:
>>>> 
>>>> DS> --As of December 17, 2011 10:29:42 AM +0200, Коньков Евгений 
>>>> DS> is alleged to have said:
>>>> 
>>>> >> How can I debug why the system does not use free CPU resources?
>>>> >>
>>>> >> In these pictures you can see that the CPU cannot exceed 400 ticks
>>>> >> http://piccy.info/view3/2368839/c9022754d5fcd64aff04482dd360b5b2/
>>>> >> http://piccy.info/view3/2368837/a12aeed98681ed10f1a22f5b5edc5abc/
>>>> >> http://piccy.info/view3/2368836/da6a67703af80eb0ab8088ab8421385c/
>>>> >>
>>>> >>
>>>> >> In these pictures you can see that the problems with traffic on re0
>>>> >> begin when the CPU load rises to "maximum"
>>>> >> http://piccy.info/view3/2368834/512139edc56eea736881affcda490eca/
>>>> >> http://piccy.info/view3/2368827/d27aead22eff69fd1ec2b6aa15e2cea3/
>>>> >>
>>>> >> But there is still 25% CPU idle at that moment.
>>>> 
>>>> DS> <snip>
>>>> 
>>>> >># top -SIHP
>>>> >> last pid: 93050;  load averages:  1.45,  1.41,  1.29
>>>> >> up 9+16:32:06  10:28:43 237 processes: 5 running, 210 sleeping, 2
>>>> >> stopped, 20 waiting
>>>> >> CPU 0:  0.8% user,  0.0% nice,  8.7% system, 17.7% interrupt, 72.8% idle
>>>> >> CPU 1:  0.0% user,  0.0% nice,  9.1% system, 20.1% interrupt, 70.9% idle
>>>> >> CPU 2:  0.4% user,  0.0% nice,  9.4% system, 19.7% interrupt, 70.5% idle
>>>> >> CPU 3:  1.2% user,  0.0% nice,  6.3% system, 22.4% interrupt, 70.1% idle
>>>> >> Mem: 843M Active, 2476M Inact, 347M Wired, 150M Cache, 112M Buf, 80M Free
>>>> >> Swap: 4096M Total, 15M Used, 4080M Free
>>>> 
>>>> DS> --As for the rest, it is mine.
>>>> 
>>>> DS> You are I/O bound; most of your time is spent in interrupts.  The CPU is
>>>> DS> dealing with things as fast as it can get them, but it has to wait for the
>>>> DS> disk and/or network card to get them to it.  The CPU is not your problem;
>>>> DS> if you need more performance, you need to tune the I/O.  (And possibly get
>>>> DS> better I/O cards, if available.)
>>>> 
>>>> DS> Daniel T. Staal
>>>> 
>>>> Can I find out the interrupt limit, or calculate it before that limit is
>>>> reached?
>>>> 
>>>> The interrupt source is the internal card:
>>>> # vmstat -i
>>>> interrupt                          total       rate
>>>> irq14: ata0                       349756         78
>>>> irq16: ehci0                        7427          1
>>>> irq23: ehci1                       12150          2
>>>> cpu0:timer                      18268704       4122
>>>> irq256: re0                     85001260      19178
>>>> cpu1:timer                      18262192       4120
>>>> cpu2:timer                      18217064       4110
>>>> cpu3:timer                      18210509       4108
>>>> Total                          158329062      35724
>>>> 
>>>> Do you have any good links on I/O tuning to read?
>>>> 
>>>> -- 
>>>> Best regards,
>>>> Коньков                          mailto:kes-kes at yandex.ru
w>>>   
w>>>  Your problem is the poorly performing LAN card. The guy from
w>>> Calomel Org told you about it. He advised you to switch to an Intel network card.

КЕ>> Look at time 17:20:
КЕ>> http://piccy.info/view3/2404329/dd9f28f8ac74d3d2f698ff14c305fe31/

КЕ>> At this point freeradius starts to work slowly because no CPU time (or too
КЕ>> little) is allocated to it, and mpd5 starts to drop users because it gets no
КЕ>> response from radius. Sadly, I do not know what the idle values in 'top' were.

КЕ>> Does SNMP return the right values for CPU usage?

КЕ> last pid: 14445;  load averages:  6.88,  5.69,  5.33              up 0+12:11:35  20:37:57
КЕ> 244 processes: 12 running, 211 sleeping, 3 stopped, 15 waiting, 3 lock
КЕ> CPU 0:  4.7% user,  0.0% nice, 13.3% system, 46.7% interrupt, 35.3% idle
КЕ> CPU 1:  2.0% user,  0.0% nice,  9.8% system, 69.4% interrupt, 18.8% idle
КЕ> CPU 2:  2.7% user,  0.0% nice,  8.2% system, 74.5% interrupt, 14.5% idle
КЕ> CPU 3:  1.2% user,  0.0% nice,  9.4% system, 78.0% interrupt, 11.4% idle
КЕ> Mem: 800M Active, 2708M Inact, 237M Wired, 60M Cache, 112M Buf, 93M Free
КЕ> Swap: 4096M Total, 25M Used, 4071M Free

КЕ>   PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
КЕ>    12 root       -72    -     0K   160K CPU1    1 159:49 100.00% {swi1: netisr 3}
КЕ>    12 root       -72    -     0K   160K *per-i  2 101:25 84.57% {swi1: netisr 1}
КЕ>    12 root       -72    -     0K   160K *per-i  3  60:10 40.72% {swi1: netisr 2}
КЕ>    12 root       -72    -     0K   160K *per-i  2  41:54 39.26% {swi1: netisr 0}
КЕ>    11 root       155 ki31     0K    32K RUN     0 533:06 24.46% {idle: cpu0}
КЕ>  3639 root        36    0 10460K  3824K CPU3    3   7:43 22.17% zebra
КЕ>    12 root       -92    -     0K   160K CPU0    0  93:56 14.94% {irq256: re0}
КЕ>    11 root       155 ki31     0K    32K RUN     1 563:29 14.16% {idle: cpu1}
КЕ>    11 root       155 ki31     0K    32K RUN     2 551:46 12.79% {idle: cpu2}
КЕ>    11 root       155 ki31     0K    32K RUN     3 558:54 11.52% {idle: cpu3}
КЕ>    13 root       -16    -     0K    32K sleep   3  16:56  4.93% {ng_queue2}
КЕ>    13 root       -16    -     0K    32K RUN     2  16:56  4.69% {ng_queue0}
КЕ>    13 root       -16    -     0K    32K RUN     0  16:56  4.54% {ng_queue1}
КЕ>    13 root       -16    -     0K    32K RUN     1  16:59  4.44% {ng_queue3}
КЕ>  6818 root        22    0 15392K  4836K select  2  25:16  4.10% snmpd
КЕ> 49448 freeradius  29    0 27748K 16984K select  3   2:37  2.59% {initial thread}
КЕ> 16118 firebird    20  -10   233M   145M usem    2   0:06  0.83% {fb_smp_server}
КЕ> 14282 cacti       21    0 12000K  3084K select  3   0:00  0.68% snmpwalk
КЕ> 16118 firebird    20  -10   233M   145M usem    0   0:03  0.54% {fb_smp_server}
КЕ>  5572 root        21    0   136M 78284K wait    1   5:23  0.49% {mpd5}
КЕ> 14507 root        20    0  9536K  1148K nanslp  0   0:51  0.15% monitord
КЕ> 14441 root        25    0 11596K  4048K CPU0    0   0:00  0.00% perl5.14.1
КЕ> 14443 cacti       21    0 11476K  2920K piperd  0   0:00  0.00% perl5.14.1
КЕ> 14444 root        22    0  9728K  1744K select  0   0:00  0.00% sudo
КЕ> 14445 root        21    0  9672K  1240K kqread  0   0:00  0.00% ping

КЕ>    # vmstat -i
КЕ> interrupt                          total       rate
КЕ> irq14: ata0                      1577446         35
КЕ> irq16: ehci0                       66968          1
КЕ> irq23: ehci1                       94012          2
КЕ> cpu0:timer                     180767557       4122
КЕ> irq256: re0                    683483519      15587
КЕ> cpu1:timer                     180031511       4105
КЕ> cpu3:timer                     175311179       3998
КЕ> cpu2:timer                     179460055       4092
КЕ> Total                         1400792247      31947

КЕ>     1 users    Load  6.02  5.59  5.31                  Dec 26 20:38

КЕ> Mem:KB    REAL            VIRTUAL                       VN PAGER  SWAP PAGER
КЕ>         Tot   Share      Tot    Share    Free           in   out  in   out
КЕ> Act 1022276   12900  3562636    39576  208992  count           4
КЕ> All 1143548   20380  5806292   100876          pages          48
КЕ> Proc:                                                            Interrupts
КЕ>   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt   1135 cow   37428 total
КЕ>             186      129k  10k  17k  21k  14k 5857   2348 zfod    15 ata0 14
КЕ>                                                       184 ozfod   1 ehci0 16
КЕ>  8.1%Sys  68.4%Intr  5.9%User  0.0%Nice 17.6%Idle       7%ozfod   2 ehci1 23
КЕ> |    |    |    |    |    |    |    |    |    |    |       daefr  4120 cpu0:timer
КЕ> ====++++++++++++++++++++++++++++++++++>>>            2423 prcfr 21013 re0 256
КЕ>                                        208 dtbuf     4425 totfr  4100 cpu1:timer
КЕ> Namei     Name-cache   Dir-cache    142271 desvn          react  4083 cpu3:timer
КЕ>    Calls    hits   %    hits   %      3750 numvn          pdwak  4094 cpu2:timer
КЕ>    36571   36546 100                  1998 frevn          pdpgs
КЕ>                                                           intrn
КЕ> Disks   ad0   da0 pass0                            241412 wire
КЕ> KB/t  26.81  0.00  0.00                            826884 act
КЕ> tps      15     0     0                           2714240 inact
КЕ> MB/s   0.39  0.00  0.00                             97284 cache
КЕ> %busy     1     0     0                            111708 free
КЕ>                                                    114976 buf

КЕ> # netstat -w 1 -I re0
КЕ>             input          (re0)           output
КЕ>    packets  errs idrops      bytes    packets  errs      bytes colls
КЕ>      52329     0     0   40219676      58513     0   40189497     0
КЕ>      50207     0     0   37985881      57340     0   38438634     0

КЕ> http://piccy.info/view3/2409691/69d31186d8943a53c31ec193c8dfe79d/
КЕ> http://piccy.info/view3/2409746/efb444ffe892592fbd6f025fd14535c4/
КЕ> Before the overload happened, as you can see, the server passed through more traffic.

КЕ> Programs work very slowly at this moment!
КЕ> During the day there can be more interrupts on re0 than now, and the server works fine.

КЕ> I think there is some problem with the scheduler.
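
To put the quoted samples into comparable units, here is a rough Python sketch that converts the first `netstat -w 1 -I re0` line and the re0 rate from the systat screen into Mbit/s, packets per second and packets per interrupt. The function names are only illustrative, and dividing the sum of input and output packets by the re0 interrupt rate is an assumption about how to read these counters, not a measured limit of the card.

```python
# Figures copied from the first `netstat -w 1 -I re0` sample quoted above.
IN_PACKETS, IN_BYTES = 52329, 40219676
OUT_PACKETS, OUT_BYTES = 58513, 40189497
# re0 interrupt rate as shown on the systat screen quoted above.
RE0_INTR_PER_SEC = 21013


def mbit_per_sec(bytes_per_sec):
    """Convert a bytes-per-second counter to megabits per second."""
    return bytes_per_sec * 8 / 1_000_000


def summarize():
    """Return a few derived figures for one one-second sample."""
    total_pps = IN_PACKETS + OUT_PACKETS
    return {
        "in_mbit": round(mbit_per_sec(IN_BYTES), 1),
        "out_mbit": round(mbit_per_sec(OUT_BYTES), 1),
        "total_pps": total_pps,
        "avg_in_packet_bytes": round(IN_BYTES / IN_PACKETS),
        # Assumption: both directions contribute to the re0 interrupt count.
        "packets_per_interrupt": round(total_pps / RE0_INTR_PER_SEC, 2),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Running the same calculation on a sample taken during the day, when the server works fine, would show whether the slowdown really matches a higher packet or interrupt rate.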

And here three netisr threads are in the *radix state:

last pid: 51533;  load averages:  4.67,  5.24,  5.29                                       up 0+12:59:43  21:26:05
284 processes: 6 running, 255 sleeping, 3 stopped, 17 waiting, 3 lock
CPU 0:  0.5% user,  0.0% nice, 15.2% system, 27.2% interrupt, 57.1% idle
CPU 1:  0.0% user,  0.0% nice, 20.1% system, 22.3% interrupt, 57.6% idle
CPU 2:  1.6% user,  0.0% nice, 29.3% system, 20.7% interrupt, 48.4% idle
CPU 3:  2.7% user,  0.0% nice, 21.7% system, 16.3% interrupt, 59.2% idle
Mem: 788M Active, 2660M Inact, 239M Wired, 81M Cache, 112M Buf, 129M Free
Swap: 4096M Total, 51M Used, 4045M Free, 1% Inuse

  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
51239 root       -72    0 10460K  3416K CPU0    0   0:15 66.80% zebra
   11 root       155 ki31     0K    32K CPU3    3 565:03 46.53% {idle: cpu3}
   11 root       155 ki31     0K    32K RUN     1 571:46 45.70% {idle: cpu1}
   11 root       155 ki31     0K    32K RUN     2 558:13 44.73% {idle: cpu2}
   11 root       155 ki31     0K    32K CPU0    0 546:21 43.85% {idle: cpu0}
   12 root       -72    -     0K   160K *radix  1 204:13 42.14% {swi1: netisr 3}
   12 root       -72    -     0K   160K *radix  2 141:57 37.55% {swi1: netisr 1}
   12 root       -72    -     0K   160K *radix  3  61:10 25.15% {swi1: netisr 0}
   12 root       -72    -     0K   160K WAIT    3  78:28 19.92% {swi1: netisr 2}
   12 root       -92    -     0K   160K WAIT    0 100:28  9.13% {irq256: re0}
 6818 root        22    0 15392K  4836K select  1  26:59  2.10% snmpd
   13 root       -16    -     0K    32K sleep   3  19:24  1.56% {ng_queue1}
51531 cacti       36    0 17092K  5944K select  0   0:00  1.51% {initial thread}
   13 root       -16    -     0K    32K sleep   3  19:27  1.46% {ng_queue3}
   13 root       -16    -     0K    32K sleep   3  19:24  1.46% {ng_queue2}
   13 root       -16    -     0K    32K sleep   1  19:25  1.42% {ng_queue0}
51531 cacti       52    0 17092K  5944K usem    0   0:00  1.42% {perl5.14.1}
51510 cacti       46    0 32256K 16304K piperd  3   0:00  1.22% php
51514 cacti       46    0 11476K  2940K piperd  2   0:00  1.22% perl5.14.1
51515 root        46    0  9728K  1748K select  3   0:00  1.22% sudo
51516 root        45    0  9672K  1220K kqread  1   0:00  1.22% ping
51508 cacti       52    0 32256K 16312K piperd  2   0:00  1.03% php
51248 root         4    0 10564K  4980K select  0   0:00  0.44% bgpd
 5572 root        20  -15   136M 64812K select  1   6:10  0.34% {mpd5}
51502 cacti       25    0 32256K 16568K nanslp  0   0:00  0.34% php
51513 cacti       23    0 17772K  4436K piperd  1   0:00  0.34% rrdtool
 5572 root        20  -15   136M 64812K select  2   0:00  0.34% {mpd5}
 5572 root        20  -15   136M 64812K select  1   0:00  0.34% {mpd5}
 5572 root        20  -15   136M 64812K select  1   0:00  0.34% {mpd5}

I tried to google *radix and *per-i, but I did not find
anything :(
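
To follow these states over time, a small Python sketch like the one below could pick out of saved `top -SIHP` output the threads whose STATE starts with `*`. It assumes (my understanding, not something confirmed in this thread) that top marks a thread blocked on a lock this way and that the name after `*` is the lock name, possibly cut off by the column width. The sample lines are copied from the output above; the function name is hypothetical.

```python
# Thread lines copied from the `top -SIHP` output above.
SAMPLE = """\
51239 root       -72    0 10460K  3416K CPU0    0   0:15 66.80% zebra
   12 root       -72    -     0K   160K *radix  1 204:13 42.14% {swi1: netisr 3}
   12 root       -72    -     0K   160K *radix  2 141:57 37.55% {swi1: netisr 1}
   12 root       -72    -     0K   160K *radix  3  61:10 25.15% {swi1: netisr 0}
   12 root       -72    -     0K   160K WAIT    3  78:28 19.92% {swi1: netisr 2}
   12 root       -92    -     0K   160K WAIT    0 100:28  9.13% {irq256: re0}
"""


def lock_blocked_threads(top_output):
    """Return (command, state, wcpu) for threads whose STATE starts with '*'.

    Assumption: a leading '*' in the STATE column means the thread is
    blocked on a lock with that (possibly truncated) name.
    """
    hits = []
    for line in top_output.splitlines():
        # Columns: PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[0].isdigit():
            continue  # skip headers, CPU/Mem summary lines and blanks
        state, wcpu, command = fields[6], fields[9], fields[10]
        if state.startswith("*"):
            hits.append((command.strip(), state, float(wcpu.rstrip("%"))))
    return hits


if __name__ == "__main__":
    for command, state, wcpu in lock_blocked_threads(SAMPLE):
        print(f"{state:8} {wcpu:6.2f}%  {command}")
```

Feeding it periodic snapshots (for example from `top -b` run in a loop) would show how often the netisr threads sit in `*radix` or `*per-i` when the programs become slow.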

-- 
Best regards,
 Коньков                          mailto:kes-kes at yandex.ru