Boot hangs on ips0: resetting adapter, this may take up to 5 minutes

Oleg Sharoiko os at rsu.ru
Thu Mar 23 09:14:29 UTC 2006


Hi!

On Mon, 13 Mar 2006, John Baldwin wrote:

JB>> To make GENERIC usable it's enough to comment
JB>> options      PREEMPTION
JB>> Not sure if this helps much.
JB>It could point to a bug in a driver.

 All this time I have been running experiments, but the more I did, the 
less I understood. I now suspect the problem is not with a particular 
device, but rather with a number of devices installed in the system. The 
behaviour differs depending on hardware setup and kernel configuration. 
Just a few examples:

 The only configuration I have never seen fail is the one with no PCI cards 
installed and several devices disabled in the BIOS (mouse, floppy, ATA, 
serial ATA). In that setup the system boots fine with the GENERIC kernel. 
As soon as I install an additional SCSI card (Adaptec 29160), SCB timeouts 
start happening on the internal SCSI adapter during "Waiting 5 seconds for 
SCSI devices to settle". The system still boots after "ahd0: Recovery 
Initiated - Card was not paused". If I remove the bge driver from the 
kernel (keeping the additional SCSI card in the system), these timeouts go 
away.

 The GENERIC kernel on the system with no PCI cards and all devices 
enabled in the BIOS sometimes boots and sometimes hangs with the last line 
"lo0: bpf attached". The same happens with a kernel without bge, except 
that this one is more likely to boot.

 When the ips PCI card is installed, the GENERIC kernel definitely hangs 
at boot. A kernel without bge boots almost every time. On an SMP kernel I 
was even able to kldload bge once boot had completed. The same action on a 
UP system produces rather strange results. If I boot to single-user mode 
and load if_bge, the system returns to the command prompt, I can edit the 
command line, and everything looks normal. But as soon as I try to execute 
something (I suppose disk I/O is the trigger here, but I'm not sure) the 
system becomes extremely slow: it takes about 30 seconds to print a single 
character on the console. The same happens if I load if_bge in multi-user 
mode.

 One thing is common to all cases: when the system hangs (or becomes slow), 
Ctrl+Alt+Esc does not work, but sending a break on the COM port still does, 
and it is possible to get into the kernel debugger. Unfortunately this 
doesn't help me. To be honest, I don't think I can cope with this on my 
own. I set up remote gdb for this box, but it tells me nothing, because I 
lack knowledge of how interrupt delivery works and how interrupt handling 
is done in FreeBSD. Would it be possible for you, John, or maybe for 
someone else, to look at this box? I can provide full remote access to it 
with remote gdb, serial console and IP KVM.

 And one more thing, as a reminder: disabling preemption makes things 
normal.
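
 For reference, the two kernel variants mentioned above (no preemption, 
no bge) could be sketched as changes to a copy of GENERIC. This is only a 
minimal sketch, assuming the amd64 GENERIC of this period contains these 
two lines; the name MYKERNEL and the comments are illustrative assumptions, 
not copied from the configs actually tested. Either change can be tried on 
its own.

```
# Hypothetical copy of sys/amd64/conf/GENERIC saved as MYKERNEL.
# Only the lines that differ from GENERIC are shown.

ident		MYKERNEL

# Variant 1: disable kernel thread preemption.
#options 	PREEMPTION

# Variant 2: build without the bge network driver.
#device		bge
```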

 All tests were done with sources checked out with -r HEAD -D '2006-03-10 
15:34:00 UTC'. I have also tested GENERIC built from fresh src; it has the 
same problems.

 This issue is not specific to SCSI. I think it would be better to move 
the discussion to a more appropriate mailing list. The problem happens on 
amd64 and not on i386. Should this conversation be moved to freebsd-amd64? 
Or maybe to another list?

-- 
Oleg Sharoiko.
Software and Network Engineer
Computer Center of Rostov State University.

