[ATA] and re(4) stability issues
Victor Balada Diaz
victor at bsdes.net
Wed Dec 10 06:08:26 PST 2008
On Wed, Dec 10, 2008 at 09:07:19PM +0900, Pyun YongHyeon wrote:
> On Wed, Dec 10, 2008 at 12:32:25PM +0100, Victor Balada Diaz wrote:
> > On Wed, Dec 10, 2008 at 07:28:00PM +0900, Pyun YongHyeon wrote:
> > > On Wed, Dec 10, 2008 at 09:59:35AM +0100, Victor Balada Diaz wrote:
> > > > On Wed, Dec 10, 2008 at 03:12:26PM +0900, Pyun YongHyeon wrote:
> > > > > On Tue, Dec 09, 2008 at 07:52:37PM +0100, Victor Balada Diaz wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I got various machines[1] at hetzner.de and I've been having problems
> > > > > > with interrupts on FreeBSD 7.0 and now FreeBSD 7.1 -BETA2 in amd64. I've
> > > > > > been trying to narrow the problem so someone more knowledgeable than me
> > > > > > is able to fix it. This mail is another attempt to ask a question
> > > > > > regarding the ATA code, to see if this time i get somewhere.
> > > > > >
> > > > > > For the ones that don't actually know what happened:
> > > > > >
> > > > > > With FreeBSD 7.0 -RELEASE for amd64 and default kernel
> > > > > > the system shared re0 interrupt with OHCI and this caused
> > > > > > re(4) to corrupt packets and create interrupt storms. Tried
> > > > >
> > > > > re(4) in 7.0-RELEASE had a bus_dma(9) bug which could be easily
> > > > > triggered on systems with > 4GB memory. But I don't know whether
> > > > > this is related to the interrupt storms.
> > > > >
> > > > > > updating to 7.1 -BETA2 and still had some problems with it.
> > > > > >
> > > > > > I've opened the PR kern/128287[2] and Remko quickly answered
> > > > > > with a workaround: that workaround was removing USB support from
> > > > > > my kernel. I did it and re(4) wasn't sharing interrupts any longer,
> > > > > > and the interrupt storms were gone. Now, some time later, the interface
> > > > > > goes up and down from time to time, but less often. Also sometimes
> > > > > > the machine loses the network interface but continues to work.
> > > > > >
> > > > >
> > > > > It seems that your controller supports MSI so you can set the tunable
> > > > > hw.re.msi_disable to 0 to enable MSI. With MSI you can remove
> > > > > interrupt sharing (e.g. add hw.re.msi_disable="0" to the
> > > > > /boot/loader.conf file). However, there were several issues on re(4)
> > > > > w.r.t. MSI so it was off by default.
> > > >
> > > > This is undocumented and with sysctl -a i can't find the tunable. Is this
> > > > a HEAD feature or is it also in 7.1 -BETA2? Should i add
> > >
> > > Yeah, it's an undocumented feature. But most drivers written by me
> > > have similar knobs. Both HEAD and stable/7 including 7.1 BETA2 have
> > > the tunable.
> >
> > I think it could be great if you could document it or at least
> > show it by default when you do sysctl -ad with a small description.
> >
>
> If MSI worked as expected I would have documented it as I did
> in msk(4)/nfe(4)/ale(4)/age(4)/jme(4) etc.
> Using MSI on RealTek does not seem to be stable. I tried hard to fix
> that but some users still reported watchdog timeouts. Working
> without documentation and hardware also made it hard to complete
> the work. This was the main reason why MSI was disabled on re(4).
What do you think about adding a note to the man page saying that
it's experimental, and that in some cases it could improve the situation
but in others it will cause errors?
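
For anyone reading along, here is a rough sketch of the setting plus a quick
check, based only on what has been said in this thread. The function name
check_re_msi is made up for illustration, and it assumes the interface is re0
and that "vmstat -i" lines look like "irqN: name ... total rate" as in the
output quoted further down. Treat it as a starting point, not a tested tool.

```shell
# /boot/loader.conf entry discussed in this thread (read at boot time):
#   hw.re.msi_disable="0"
# System-wide MSI must stay enabled as well (hw.pci.enable_msi at 1).

# check_re_msi: hypothetical helper, not part of FreeBSD.
# Reads "vmstat -i" output on stdin and reports whether re0 is on an
# MSI vector (irq256 or higher) or on a legacy, possibly shared, IRQ.
check_re_msi() {
    awk '
        $1 ~ /^irq[0-9]+:$/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "re0") {
                    irq = $1
                    gsub(/[^0-9]/, "", irq)   # keep only the IRQ number
                    found = 1
                    if (irq + 0 >= 256)
                        print "re0 uses MSI (irq" irq ")"
                    else
                        print "re0 uses legacy irq" irq ", possibly shared"
                }
            }
        }
        END { if (!found) print "re0 not found" }
    '
}

# Usage on the machine itself:
#   vmstat -i | check_re_msi
```

The irq256 threshold comes from the remark in this thread that MSI-enabled
drivers normally show up as irq256 or higher.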
>
> > >
> > > > hw.re_msi_disable="0" to /boot/loader.conf?
> > > ^^^^^^^^^^^^^^^^^^^^^
> > > Should be hw.re.msi_disable="0"
> > > >
> > >
> > > Yes, just add it to /boot/loader.conf. Note that you should not disable
> > > system-wide MSI control (i.e. leave hw.pci.enable_msi set to 1).
> > >
> > > > This was sharing interrupt with USB, does USB need any special MSI handling
> > > > or with re using MSI is enough to not share the interrupt?
> > >
> > > If re(4) can use MSI, you don't need to worry about interrupt
> > > sharing with USB. Check the output of "vmstat -i". You normally get
> > > irq256 or higher for an MSI-enabled driver.
> > >
> > > >
> > > >
> > > > >
> > > > > > I know it continues to work because some days later i can see that
> > > > > > it tried to deliver the status reports but was unable to resolve the
> > > > > > aliases hostnames. I can't ping the machine and i know the network
> > > > > > is OK. If i reboot the machine everything is working again.
> > > > > >
> > > > >
> > > > > Recently I've made small changes to re(4) which may help to detect
> > > > > link state change event. Would you try re(4) in HEAD?
> > > >
> > > > Can i just drop HEAD's /stable/7/sys/dev/re/ in -STABLE and test that
> > >
> > > Yes, you can. It should build without problems. Just replace re(4) on
> > > stable/7 with HEAD version.
> > >
> > > > or do i need to test the whole HEAD kernel?
> > > >
> > >
> > > No, you don't have to do that.
> >
> > While backporting the changes i found that it didn't compile, so in
> > the end i took the following files from HEAD:
> >
> > base/head/sys/dev/re/if_re.c
> > base/head/sys/pci/if_rl.c
> > base/head/sys/pci/if_rlreg.h
> >
>
> Ah, sorry about that. Recently there were some changes. I forgot
> that.
>
> > After that i've recompiled 7.1 -BETA2 GENERIC kernel and enabled
> > the knob you suggested in /boot/loader.conf.
> >
> > With the new kernel and MSI the interrupts are like this:
> >
> > # vmstat -i
> > interrupt                total       rate
> > irq9: acpi0                  1          0
> > irq16: ohci0                 1          0
> > irq17: ohci1 ohci3           1          0
> > irq18: ohci2 ohci4           1          0
> > irq22: atapci0           19215         15
> > cpu0: timer            2502718       1998
> > irq256: re0            4967726       3967
> > cpu1: timer            2502525       1998
> > Total                  9992188       7980
> >
> > The high interrupt numbers are because i've been running iperf to
> > check that everything is fine, not because of interrupt storms. So far
> > i haven't found any interrupt storms related to USB or the re(4) driver,
> > but while doing the tests i found this error:
> >
> > re0: watchdog timeout (missed Tx interrupts) -- recovering
> >
> > This didn't create any error on the interfaces (netstat -i).
> >
>
> This was triggered by new code in HEAD. It indicates re(4) missed a
> Tx completion interrupt. It could be a bug in the driver or a hardware
> bug. If you can live with that message you can safely ignore it,
> as re(4) now does not reinitialize the hardware if it detects a
> missing Tx completion interrupt.
Yeah, just happened once, and i'm used to receiving a lot of interface
UP/DOWN messages that now are gone, so this is an improvement.
>
> > Also i didn't see any problem with interfaces going up and down,
> > but that usually happens after some hours of uptime, so i'll let
> > you know if the error happens again.
> >
>
> Ok.
>
> > As this seems to improve the current situation, is there any
> > chance of merging the -current driver into 7.1 before release?
> >
>
> I think re(4) in HEAD needs more testing. As you might know RealTek
> produced too many chipsets. :-(
Ok, i'll use the backported driver as it works better for me :-)
If i can help you test any patches, i'm more than happy to do it.
Thanks a lot for your help Pyun YongHyeon.
Regards.
--
The most convincing proof that intelligent life exists on other
planets is that they have not tried to contact us.