When it’s an HP ProCurve switch, of course!
A long time ago I was taught that loopback interfaces are always up.
HP’s Management and Configuration Guide for their 3500/5400/6200/6600/8200 switches seems to agree. Matter of fact, it says so right here in Chapter 8:
User-defined loopback addresses provide the following benefits:
- A loopback interface is a virtual interface that is always up and reachable so long as at least one of the IP interfaces on the switch is operational. As a result, a loopback interface is useful for debugging tasks since its IP address can always be pinged if any other switch interface is up.
- You can use a loopback interface to establish a Telnet session, ping the switch, and access the switch through SNMP, SSH, and HTTP (web interface).
- A loopback IP address can be used by routing protocols. For example, you can configure the loopback IP address as the router ID used to identify the switch in an OSPF area. Because the loopback interface is always up, you ensure that the switch’s router ID remains constant and that the OSPF network is protected from changes caused by downed interfaces.
Yeah, that’s pretty much why we use ‘em, right?
I have one particular 5400 that has given me problems since I got it (a 5406zl, if you’re wondering). This thing has more issues than a teenage emo girl whose step-dad beats her. I’ve lost count of how many cases I’ve opened over this switch and, at one point, even had a “Solutions Architect” from HP fly in to spend the day and witness one particular issue (maybe I was lying or something, right?). The management module and CF card have been replaced twice that I can recall — maybe three times (without checking, I’m not really sure anymore).
This 5406zl has been running the K.13.71 software for quite a while. I found it to be (mostly) stable and I don’t go around upgrading to new versions of software until there’s a good reason to.
One particular issue that we’ve encountered is that the switch will, almost exactly at 60-day intervals, start crapping out. We use rancid to audit/monitor changes and eventually rancid will generate an e-mail that clearly shows that something isn’t right. Logging into the switch and issuing “show config” will get you a wonderful error message: “Translator failed”. If left alone, the switch seems to continue functioning (for the most part), but eventually either 1) you’ll get locked out of SSH access or 2) you’ll enter a harmless command that will cause the switch to reload.
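If you wanted to catch this condition without waiting for the rancid e-mail, a check along these lines might do it. This is only a rough sketch: the function name and the idea of feeding it captured “show config” output are my own assumptions for illustration, not anything rancid or HP provides.

```python
def find_failed_dumps(outputs, marker="Translator failed"):
    """Return the names of switches whose captured CLI output contains marker.

    outputs maps a switch name to the text captured from a command such as
    "show config". How that text gets collected (rancid, SSH, etc.) is left
    out here and is an assumption of this sketch.
    """
    needle = marker.lower()
    return sorted(
        name for name, text in outputs.items()
        if needle in text.lower()  # case-insensitive substring match
    )
```

You would call it with something like find_failed_dumps({"SWITCH4": captured_text}) and alert on any non-empty result.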
The first time this occurred was on June 29th when I logged into the switch to verify the configuration of an access list. I issued the “show access-list config” command and the switch immediately crashed and reloaded (at about 14:20, conveniently).
To close out the case that I opened because of that, I was sent a new management module and CF card that I swapped out on July 6th. On August 30th, I again received the infamous “Translator failed” e-mail and, late that night, swapped out the management module yet again.
Knowing that HP Support will eventually want me to upgrade the firmware, I attempted to do just that on October 6th. I grabbed K.14.49m, a “maintenance release”, and loaded it onto the switch. To quote the first line of the follow-up e-mail I sent to my co-workers:
“Well, that was a miserable failure.”
The switch still worked, depending on how you define “work”. The big issue was that the loopback interface would not respond to any traffic. Wait, what did the manual say again?
“You can use a loopback interface to establish a Telnet session, ping the switch, and access the switch through SNMP, SSH, and HTTP (web interface).”
Um, yeah. None of that worked, actually.
I ended up rolling back to K.13.71 and life went on. About sixty days later, of course, we had to reboot the damn thing again. I think I had to do it again a month and a half or so ago but at this point I’ve quit keeping track.
Last week, I decided to be pre-emptive and attempt the upgrade to K.14.49m again. I’m not sure why I expected anything different, but I got the same result. This time, however, I wondered if it were just a fluke with this particular switch (since I’ve already had numerous issues with it) and left it at K.14.49m.
Because everything uses the loopback IP address, however, a number of other things had to be changed: our monitoring system, DNS entries, the RADIUS configuration (on both the switch and the server), our SNMP software that graphs interface utilization, RANCID, etc. Obviously, this wasn’t too damn convenient and I had to update all of this stuff.
Now I’m thinking that it’s just this particular switch. It can ping its own loopback (usually), but nothing else can. It was even sending out RADIUS Access-Requests with the IP of the loopback (which is what it was configured to do) and I saw the return traffic with tcpdump, but it was apparently just silently discarding it.
Before opening a new case with HP, I decided I’d upgrade another switch to the same software version and see what happens. I fully expected this upgrade to go smoothly. How wrong I was!
Tonight, I VPN in and upgrade this switch, a 5412zl with a similar configuration, to K.14.49m as well. It comes back up but, once again, the loopback IP address is unreachable. I go hit the documentation again and see that “lo1” through “lo7” are identified as being user-definable. Maybe that’s my issue, since I used “lo0”. I remove the IP from loopback0, bring up loopback1, assign it the IP address, and advertise that into OSPF. I make sure it shows up in OSPF and try to ping it: nothing.
Just for good measure, I’ll show some verification:
SWITCH4# show ip | in lo1
lo1 | Manual 10.144.5.4 255.255.255.255
10.144.5.4 is the loopback IP address of the switch and we can ping that IP from itself:
SWITCH4# ping 10.144.5.4
10.144.5.4 is alive, time = 1 ms
Let’s make sure we can see it in OSPF (from two other devices, just to be doubly sure):
SWITCH3# show ip route ospf | in 5.4
10.144.5.4/32 10.144.4.22 4001 ospf IntraArea 2 110
SWITCH1# show ip route ospf | in 5.4
10.144.5.4/32 10.144.4.10 4012 ospf IntraArea 2 110
Yep, everything looks good. We should be able to ping that without any issues, right?
SWITCH3# ping 10.144.5.4 repetitions 3 | in packet
3 packets transmitted, 0 packets received, 100% packet loss
SWITCH1# ping 10.144.5.4 repetitions 3 | in packet
3 packets transmitted, 0 packets received, 100% packet loss
Maybe it’s just not responding to ping, but we can still get to it via SSH, right?
$ ssh -v 10.144.5.4
OpenSSH_5.4p1 FreeBSD-20100308, OpenSSL 0.9.8n 24 Mar 2010
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 10.144.5.4 [10.144.5.4] port 22.
debug1: connect to address 10.144.5.4 port 22: Operation timed out
ssh: connect to host 10.144.5.4 port 22: Operation timed out
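For anyone who wants to repeat that SSH test against several addresses without staring at ssh -v each time, here is a minimal sketch of the same TCP connect check in Python. The function name and the 3-second default timeout are my own choices for illustration; it only tells you whether the three-way handshake completes, nothing about SSH itself.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        # create_connection raises OSError (including timeouts and refused
        # connections) if the handshake does not complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_port_open("10.144.5.4", 22) from the same host that ran the ssh command above would be expected to return False while the loopback is unreachable, and running it against one of the switch’s other interface addresses gives a quick comparison.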
In addition, tcpdump confirms that the RADIUS server receives Access-Request messages from the switch and responds accordingly, but the switch still seems to just throw ’em out.
No such luck. At least now I know that it’s not the 5406zl that’s screwed up, it’s the software. Maybe I shouldn’t be using the “maintenance release”, though. Let’s see what the Release Notes say:
Maintenance branch software is recommended for users who want to:
- Standardize on a major release after is (sic) has gained some maturity,
- Remain on a major release for an extended period of time to reduce time and expense of qualifying a new major release,
- Do no (sic) need to use new hardware or features that are not supported in Maintenance branch.
Yeah, that pretty much describes me.
Anyway, that’s my rant for tonight (it’s just past 3am now). If you’ll excuse me, I’ve got a switch here that I have to roll back to K.13.71. Oh, and another case to open with HP (surely there’s got to be an achievement for this?).
UPDATE: I discovered the issue and how to work around it. HP changed the functionality in a software update, then later reverted to the original behavior after I complained.