Long Day's Journey into <Bleep>

Fri Jun 10 23:53:17 UTC 2011

On 09/06/2011 02:56, Gary Kline wrote:
> Well, people,
>
> It's been a long, long century.  I've been down for 5 days.
> Couldn't understand _why_ I couldn't ping anywhere [except the
> Server itself].  Finally, tho, it became more and more likely that
> my FreeBSD was fine ... even tho I kept stripping the most likely
> problem points.  My large 16-port LinkSys router was either *it* or
> it was some kind of bug unknown to geekdom.  After a friend bought
> me a new (and tiny) 8-port switch, yes!  I could ping everywhere.
>
> I'm still bringing back the dozens of things I removed from ethic.
> And testing new ideas.  But I have a general question: have any of
> you wizards who run your own domains or otherwise use a switch [or
> hub] *ever* had it just-quit?!  It is solid-state.  Yes, the box is
> within my feet/foot reach.  I have accidentally kicked it, I suppose,
> but still.
>
> After wandering in the wilderness for 5 days,<<mmph>>, dunno.
>
> gary
>
> PS: yes, this is a serious question.  1) I like things-Cisco, and
> LinkSys.  I just bought this switch about 2.5 years ago, so I really
> am looking for feedback.
>
> PPS:  Another question to ask about upgrading is next.
>
>
I have had a lot of faulty switches, either dying outright on their own
or doing stranger things.
The most common failure is of course the defective port: one port will
start spewing errors and eventually die, with little to no impact on
the rest of the ports. (Easy to detect: compare a ping through that
port with a ping through another port.)
Another common error is the "I want full duplex" error. The switch will
advertise full duplex and then immediately fall back to half duplex.
Most of the time the port will act fine, but under heavy load you will
get a nice assortment of network errors, one after the other. (Also
easy to detect: force the connected devices to half duplex as a test;
if everything starts working again, you have found your problem.)
Of course there is also the problem of "not so anti-loopback" switches,
which let packets go round and round and round and round. (Ping timing
will be very inconsistent, ranging from a few ms to entire seconds.)
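
As a rough illustration of the "inconsistent ping timing" symptom, here
is a minimal Python sketch that flags a set of round-trip times as
suspicious when the slowest reply dwarfs the typical one. The function
name, the ratio of 20 and the 1 ms floor are assumptions chosen for the
example, not measured thresholds; the RTT values would be copied by hand
from ordinary ping output.

```python
import statistics


def rtt_looks_unstable(rtts_ms, ratio=20.0, floor_ms=1.0):
    """Return True when the slowest reply is at least `ratio` times the median.

    rtts_ms  -- round-trip times in milliseconds, e.g. copied from ping output
    ratio    -- hypothetical threshold; tune it for your own network
    floor_ms -- lower bound on the median, so sub-millisecond LANs do not
                trigger on harmless jitter
    """
    if len(rtts_ms) < 2:
        return False  # not enough samples to compare
    typical = max(statistics.median(rtts_ms), floor_ms)
    return max(rtts_ms) / typical >= ratio


if __name__ == "__main__":
    steady = [2.1, 2.3, 2.2, 2.6]
    looping = [2.1, 3.0, 2.4, 1800.0, 2.2]
    print("steady link unstable?", rtt_looks_unstable(steady))
    print("looping link unstable?", rtt_looks_unstable(looping))
```

A single slow reply can have other causes (a busy host, a wireless hop),
so this is only a hint that the switch deserves a closer look.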

On pure layer 2 switches I have had few other problems, though two took
me days to figure out:
1 - Faulty power supply: the switch simply could not bear full load
anymore. Various errors, packet corruption, DHCP errors, misrouting and
so on. When tested port by port and function by function, the switch
would work wonders. I spent an entire week testing every box for
viruses/trojans/rootkits/rogue DHCP servers. The problem was only solved
after I swapped out every element of the network one by one. Final
diagnosis made by Netgear.
2 - Memory corruption (suspected, not validated): everything would work
fine from 9 A.M. to 3 or 4 P.M. for an entire branch, then the network
would slow to a crawl. Rebooting the switches would solve the problem
for a while, and then it would be a nightmare again after less than an
hour. Some boxes would complain about duplicate IP addresses. We managed
to find that most of the affected IP addresses converged on just one
switch. From there we theorized that there was a problem with the ARP
cache of the switch that would make it explode after a sufficient number
of updates (since a lot of VPN connections were made after 3 P.M., we
imagined that this was the triggering factor). We took out the switch
and replaced it, but no word came from the manufacturer to either
confirm or refute our theories.
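
For the duplicate-IP symptom in case 2, a minimal Python sketch of the
bookkeeping involved: collect (IP, MAC) pairs as seen from several
machines and report any IP that shows up with more than one MAC. The
function name and the sample addresses are made up for illustration;
the pairs themselves would have to be gathered by hand, for example
from each box's ARP table.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Map each IP seen with more than one MAC to its sorted list of MACs.

    observations -- iterable of (ip, mac) string pairs
    MAC addresses are compared case-insensitively.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real network.
    seen = [
        ("192.168.1.10", "00:11:22:33:44:55"),
        ("192.168.1.10", "00:11:22:33:44:66"),
        ("192.168.1.11", "AA:BB:CC:DD:EE:FF"),
        ("192.168.1.11", "aa:bb:cc:dd:ee:ff"),
    ]
    for ip, macs in find_duplicate_ips(seen).items():
        print(ip, "claimed by", ", ".join(macs))
```

If the conflicting entries all sit behind the same switch, that at least
narrows down where to start swapping hardware.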

Jerome