
ProCurve 5406zl Issue Rears Its Head Again

by Jeremy L. Gaddis on August 30, 2010

in Networking

Monday morning. 7am. Reading e-mail. I get the “new mail” notification from Outlook and glance down in the bottom right corner of the screen to look at the preview.

The e-mail is from RANCID, and I can see from the preview that a change has been made to a core switch at one particular location.

“That’s weird,” I think to myself as I click on the preview to open the e-mail.

I’ve not included the whole message, but the diff shows that every line of the running configuration was effectively removed. For those of you who don’t run RANCID (you should; see “Installing RANCID on Ubuntu 10.04 LTS”), allow me to give a simplified explanation.

RANCID has logged into this switch, an HP ProCurve 5406zl, and issued the “show run” command. The switch happily returns the running configuration to RANCID. RANCID compares it with the last running configuration and alerts us (via e-mail) to any differences between the two in UNIX diff format.
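To make that compare-and-alert step concrete, here is a minimal Python sketch. It is not how RANCID is actually implemented; the function names and the sample configuration are made up for illustration, and it assumes the two configurations are already in hand as strings. It produces a unified diff and flags the case seen here, where the new capture comes back empty.

```python
import difflib


def config_diff(previous: str, current: str) -> list[str]:
    """Return a unified diff between two configuration snapshots."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def looks_wiped(previous: str, current: str) -> bool:
    """True when the last snapshot had content but the new one has none."""
    return bool(previous.strip()) and not current.strip()


# Hypothetical sample data, not taken from the real switch.
previous = 'hostname "core-sw"\nvlan 1\n   name "DEFAULT_VLAN"\n'
current = ""

diff = config_diff(previous, current)
print("\n".join(diff))
if looks_wiped(previous, current):
    print("WARNING: new capture is empty; every line shows as removed")
```

A diff in which every line is a removal is far more likely to mean the device returned nothing useful than that someone deleted the entire configuration, so it is worth treating as an alarm rather than a routine change.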

In a nutshell, the switch is screwed up. Again.

I’ve seen this once before, on June 29, 2010. Same exact switch. At the time, I didn’t think much of it. I honestly thought it was just a fluke and RANCID had messed up. I was very wrong, though.

Eight hours later, in the early afternoon, I got an e-mail asking about a particular issue and logged into the device to check the access-lists. I issued the “show access-list config” command and the switch immediately rebooted. That’s sort of a problem.

I opened a case with ProCurve support, providing the following information:

I just had a production ProCurve 5406zl running K.13.71 spontaneously reboot.
I had logged in via SSH and executed the "show access-list config" command
and my terminal locked up.  A quick visual check showed that the device was
reloading.  The following was present in the logs:

M 06/29/10 18:24:41 sys: 'PPC Program exception 0x700: esf=0x0820f410'
I 06/29/10 18:26:41 00061 system: ----------------------------------------------
I 06/29/10 18:26:41 00063 system: System went down:  06/29/10 18:24:41
I 06/29/10 18:26:41 00064 system: PPC Program exception 0x700: esf=0x0820f410 

Upon reload, the following was displayed on the console: 

System went down:  06/29/10 18:24:41
Saved crash information:
PPC Program exception 0x700: esf=0x0820f410
addr=0x0badbad0 ip=0x00000000 Task='tSvcWorkQ' tid=0xab9bc60
fp=0x00008150 sp=0x0820f4d0 lr=0x009f20d4 

After reload, the device appears to be functioning normally.

HP responded:

Research indicates that the replacement of the Management Module is advisable for this particular crash.

They sent me a new management module and CompactFlash card, which I swapped out a few days later. No issues had been observed on this device until now.

As a precaution, I’ve let everyone know not to even log into this device. Fortunately, we have spare chassis and I can grab a new management module and CF card out of one of them and get it swapped out — hopefully tonight, before the switch craps itself again.

I have a feeling this is going to be a great week. =)
