Boot hangs on ips0: resetting adapter, this may take up to 5 minutes
Oleg Sharoiko
os at rsu.ru
Thu Mar 23 09:14:29 UTC 2006
Hi!
On Mon, 13 Mar 2006, John Baldwin wrote:
JB>> To make GENERIC usable it's enough to comment
JB>> options PREEMPTION
JB>> Not sure if this helps much.
JB>It could point to a bug in a driver.
I have been experimenting all this time, but the more I did, the less I
understood. I now suspect the problem is not with one particular device,
but rather with the combination of devices installed in the system. The
behaviour differs depending on hardware setup and kernel configuration.
Just a few examples:
The only configuration which I've never seen fail was with no PCI cards
installed and several devices disabled in the BIOS (mouse, floppy, ATA,
serial ATA). This way the system boots fine with the GENERIC kernel. As
soon as I install an additional SCSI card (Adaptec 29160), SCB timeouts
start happening on the internal SCSI adapter during "Waiting 5 seconds for
SCSI devices to settle". The system still boots after "ahd0: Recovery
Initiated - Card was not paused". If I remove the bge driver from the
kernel (keeping the additional SCSI card in the system), these timeouts go
away.
The GENERIC kernel on the system with no PCI cards and all devices
enabled in the BIOS sometimes boots and sometimes hangs, with the last
line printed being "lo0: bpf attached". The same happens with a kernel
without bge, except that this one is more likely to boot.
When the ips PCI card is installed, the GENERIC kernel always hangs at
boot. A kernel without bge boots almost every time. On an SMP kernel I
was even able to kldload bge after boot had completed. The same action on
a UP system produces rather strange results. If I boot to single-user
mode and load if_bge, the system returns to the command prompt, I can edit
the command line, and everything looks normal. But as soon as I try to
execute something (I suppose disk I/O is the trigger, but I'm not sure)
the system becomes extremely slow: it takes about 30 seconds to print a
single character on the console. The same happens if I load if_bge in
multi-user mode.
One thing is common to all cases: when the system hangs (or becomes slow),
Ctrl+Alt+Esc does not work, but sending a break on the COM port still
does, and it is possible to get into the kernel debugger. Unfortunately
this doesn't help me. To be honest, I don't think I can cope with this on
my own. I set up remote gdb for this box, but it tells me nothing because
I don't know enough about how interrupt delivery and interrupt handling
work in FreeBSD. Would it be possible for you, John, or maybe for someone
else, to look at this box? I can provide full remote access to it with
remote gdb, serial console and IP KVM.
One more thing, as a reminder: disabling preemption makes everything work
normally.
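For reference, the two kernel changes mentioned in this message (disabling
preemption, and building without bge) amount to commenting out one line
each in a copy of GENERIC. This is only a sketch: the config name MYKERNEL
is hypothetical, and the exact lines and comments are assumed to match the
stock GENERIC of this period. The two changes were tested separately, not
necessarily together.

```
# MYKERNEL: copy of GENERIC (hypothetical name) with two lines commented out
ident		MYKERNEL

#options 	PREEMPTION		# Enable kernel thread preemption
#device		bge			# Broadcom BCM570xx Gigabit Ethernet
```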
All tests were done with sources checked out with -r HEAD -D '2006-03-10
15:34:00 UTC'. I have also tested GENERIC built from fresh sources; it has
the same problems.
This issue is not specific to SCSI, so I think it would be better to move
it to a more appropriate mailing list. It happens on amd64 and not on
i386. Should this conversation be moved to freebsd-amd64? Or maybe to
another list?
--
Oleg Sharoiko.
Software and Network Engineer
Computer Center of Rostov State University.