[ATA] and re(4) stability issues

Victor Balada Diaz victor at bsdes.net
Wed Dec 10 06:08:26 PST 2008


On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
> On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
>  > On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
>  > > On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
>  > >  > On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
>  > >  > > On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
>  > >  > >  > Hello,
>  > >  > >  > 
>  > >  > >  > I have several machines[1] at hetzner.de and I've been having problems
>  > >  > >  > with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 on amd64. I've
>  > >  > >  > been trying to narrow down the problem so someone more knowledgeable than me
>  > >  > >  > is able to fix it. This mail is another attempt to ask a question
>  > >  > >  > regarding the ATA code to see if this time I get somewhere.
>  > >  > >  > 
>  > >  > >  > For those who don't know what happened:
>  > >  > >  > 
>  > >  > >  > With FreeBSD 7.0 -RELEASE for amd64 and the default kernel
>  > >  > >  > the system shared the re0 interrupt with OHCI and this caused
>  > >  > >  > re(4) to corrupt packets and create interrupt storms. I tried
>  > >  > > 
>  > >  > > re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily
>  > >  > > triggered on systems with more than 4GB of memory. But I don't know
>  > >  > > whether this is related to the interrupt storms.
>  > >  > > 
>  > >  > >  > updating to 7.1 -BETA2 and still had some problems with it.
>  > >  > >  > 
>  > >  > >  > I've opened the PR kern/128287[2] and Remko quickly answered
>  > >  > >  > with a workaround: that workaround was removing USB support from
>  > >  > >  > my kernel. I did it and re(4) wasn't sharing interrupts any longer,
>  > >  > >  > and the interrupt storms were gone. Now, some time later, the interface
>  > >  > >  > still goes up and down from time to time, but less often. Also sometimes
>  > >  > >  > the machine loses the network interface but continues to work.
>  > >  > >  > 
>  > >  > > 
>  > >  > > It seems that your controller supports MSI so you can set a tunable
>  > >  > > hw.re.msi_disable to 0 to enable MSI. With MSI you can avoid
>  > >  > > interrupt sharing (e.g. add hw.re.msi_disable="0" to the
>  > >  > > /boot/loader.conf file). However there were several issues in re(4)
>  > >  > > w.r.t. MSI so it was off by default.
>  > >  > 
>  > >  > This is undocumented and with sysctl -a I can't find the tunable. Is this
>  > >  > a HEAD feature or is it also in 7.1 -BETA2? Should I add
>  > > 
>  > > Yeah, it's an undocumented feature. But most drivers written by me
>  > > have similar knobs. Both HEAD and stable/7 including 7.1 BETA2 have
>  > > the tunable.
>  > 
>  > I think it could be great if you could document it or at least
>  > show it by default when you do sysctl -ad with a small description.
>  > 
> 
> If MSI worked as expected I would have documented it as I did
> in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc.
> Using MSI on RealTek does not seem to be stable. I tried hard to fix
> that but some users still reported watchdog timeouts. Working
> without documentation and hardware also made it hard to complete
> the work. This was the main reason why MSI was disabled on re(4).

What do you think about adding a note to the man page saying that
it's experimental, and that in some cases it could improve the situation
but in others it will cause errors?
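
Until the knob is documented, a small check of /boot/loader.conf can at least catch a misspelled name. The following is only a minimal sketch: it assumes the simple name="value" line format used in this thread, the function names are hypothetical, and treating an absent hw.pci.enable_msi as enabled is my assumption about the default rather than something confirmed here.

```python
def parse_loader_conf(text):
    """Return {name: value} for simple name="value" lines.

    Assumption: '#' starts a comment and values contain no '#'.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def check_re_msi(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Near-miss spelling: underscore instead of a dot after "re".
    if "hw.re_msi_disable" in settings:
        problems.append("hw.re_msi_disable is misspelled; "
                        "the tunable is hw.re.msi_disable")
    # MSI is off by default in re(4), so the tunable must be set to "0".
    if settings.get("hw.re.msi_disable") != "0":
        problems.append('hw.re.msi_disable is not set to "0"')
    # Assumption: a missing hw.pci.enable_msi means system-wide MSI is on.
    if settings.get("hw.pci.enable_msi", "1") != "1":
        problems.append("hw.pci.enable_msi is not 1 (system-wide MSI is off)")
    return problems


if __name__ == "__main__":
    example = 'hw.re.msi_disable="0"\n'
    for problem in check_re_msi(parse_loader_conf(example)):
        print(problem)
```

To check a real system, read /boot/loader.conf and pass its contents to parse_loader_conf() instead of the example string.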

> 
>  > > 
>  > >  > hw.re_msi_disable="0" to /boot/loader.conf?
>  > >    ^^^^^^^^^^^^^^^^^^^^^
>  > >    Should be hw.re.msi_disable="0"
>  > >  > 
>  > > 
>  > > Yes, just add it to /boot/loader.conf. Note, you should not disable
>  > > system-wide MSI control (i.e. keep hw.pci.enable_msi set to 1).
>  > > 
>  > >  > This was sharing an interrupt with USB. Does USB need any special MSI
>  > >  > handling, or is re using MSI enough to avoid sharing the interrupt?
>  > > 
>  > > If re(4) can use MSI, you don't need to worry about interrupt
>  > > sharing with USB. Check the output of "vmstat -i". You normally get
>  > > an irq256 or higher for an MSI-enabled driver.
>  > > 
>  > >  > 
>  > >  > 
>  > >  > > 
>  > >  > >  > I know it continues to work because some days later I can see that
>  > >  > >  > it tried to deliver the status reports but was unable to resolve the
>  > >  > >  > alias hostnames. I can't ping the machine and I know the network
>  > >  > >  > is OK. If I reboot the machine everything works again.
>  > >  > >  > 
>  > >  > > 
>  > >  > > Recently I've made small changes to re(4) which may help detect
>  > >  > > link state change events. Would you try re(4) in HEAD?
>  > >  > 
>  > >  > Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
>  > > 
>  > > Yes, you can. It should build without problems. Just replace re(4) on
>  > > stable/7 with HEAD version.
>  > > 
>  > >  > or do i need to test the whole HEAD kernel?
>  > >  > 
>  > > 
>  > > No, you don't have to do that.
>  > 
>  > While backporting the changes I found that it didn't compile, so in
>  > the end I took the following files from HEAD:
>  > 
>  > base/head/sys/dev/re/if_re.c
>  > base/head/sys/pci/if_rl.c
>  > base/head/sys/pci/if_rlreg.h
>  > 
> 
> Ah, sorry about that. Recently there were some changes. I forgot
> about that.
> 
>  > After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled
>  > the knob you suggested in /boot/loader.conf.
>  > 
>  > With the new kernel and MSI the interrupts are like this:
>  > 
>  > # vmstat -i
>  > interrupt                          total       rate
>  > irq9: acpi0                            1          0
>  > irq16: ohci0                           1          0
>  > irq17: ohci1 ohci3                     1          0
>  > irq18: ohci2 ohci4                     1          0
>  > irq22: atapci0                     19215         15
>  > cpu0: timer                      2502718       1998
>  > irq256: re0                      4967726       3967
>  > cpu1: timer                      2502525       1998
>  > Total                            9992188       7980
>  > 
>  > The high interrupt numbers are because I've been running iperf to
>  > check that everything is fine, not because of interrupt storms. So far
>  > I haven't found any interrupt storms related to USB or the re(4) driver,
>  > but while doing the tests I found this error:
>  > 
>  > re0: watchdog timeout (missed Tx interrupts) -- recovering
>  > 
>  > This didn't create any error on the interfaces (netstat -i).
>  > 
> 
> This was triggered by new code in HEAD. It indicates re(4) missed
> a Tx completion interrupt. It could be a bug in the driver or a hardware
> bug. If you can live with that message you can safely ignore it,
> as re(4) now does not reinitialize the hardware if it detects a
> missing Tx completion interrupt.

Yeah, it just happened once, and I was used to receiving a lot of interface
UP/DOWN messages that are now gone, so this is an improvement.
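
For anyone repeating the "vmstat -i" check, here is a minimal sketch that parses the output quoted in this mail and reports, for one device, which IRQ it sits on, who shares it, and whether the number looks like an MSI vector. The function names are hypothetical, and the 256 threshold is only the rule of thumb mentioned earlier in this thread ("irq256 or higher"), not a guarantee.

```python
SAMPLE = """\
interrupt                          total       rate
irq9: acpi0                            1          0
irq16: ohci0                           1          0
irq17: ohci1 ohci3                     1          0
irq18: ohci2 ohci4                     1          0
irq22: atapci0                     19215         15
cpu0: timer                      2502718       1998
irq256: re0                      4967726       3967
cpu1: timer                      2502525       1998
Total                            9992188       7980
"""


def parse_vmstat_i(text):
    """Return one dict per 'irqN: dev [dev ...] total rate' line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Need at least "irqN:", one device name, total and rate.
        if len(fields) < 4:
            continue
        label = fields[0]
        if not (label.startswith("irq") and label.endswith(":")):
            continue  # skips the header, "cpuN: timer" and "Total"
        number = label[3:-1]
        if not number.isdigit():
            continue
        rows.append({
            "irq": int(number),
            "devices": fields[1:-2],
            "total": int(fields[-2]),
            "rate": int(fields[-1]),
        })
    return rows


def describe(rows, device, msi_base=256):
    """Summarise the interrupt line used by `device`, or None if absent."""
    for row in rows:
        if device in row["devices"]:
            return {
                "irq": row["irq"],
                "shared_with": [d for d in row["devices"] if d != device],
                # Heuristic from this thread: MSI shows up as irq256 or higher.
                "looks_like_msi": row["irq"] >= msi_base,
            }
    return None


if __name__ == "__main__":
    print(describe(parse_vmstat_i(SAMPLE), "re0"))
```

With the sample above, re0 is alone on irq256, while ohci1 still shares irq17 with ohci3.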

> 
>  > Also I didn't see any problem with interfaces going up and down,
>  > but that usually happens after some hours of uptime, so I'll let
>  > you know if the error happens again.
>  > 
> 
> Ok.
> 
>  > As this seems to improve the current situation, is there any
>  > chance of merging the -current driver into 7.1 before release?
>  > 
> 
> I think re(4) in HEAD needs more testing. As you might know RealTek
> produced too many chipsets. :-(

Ok, I'll use the backported driver as it works better for me :-)

If I can help you test any patches, I'm more than happy to do it.

Thanks a lot for your help Pyun YongHyeon.

Regards.
-- 
The most reliable proof that intelligent life exists on other
planets is that they have not tried to contact us.


More information about the freebsd-amd64 mailing list