When isn’t a loopback always reachable?

by Jeremy L. Gaddis on January 31, 2011 · 12 comments

in Networking

When it’s an HP ProCurve switch, of course!

A long time ago I was taught that loopback interfaces are always up.

HP’s Management and Configuration Guide for their 3500/5400/6200/6600/8200 switches seems to agree. Matter of fact, it says so right here in Chapter 8:

User-defined loopback addresses provide the following benefits:

  • A loopback interface is a virtual interface that is always up and reachable so long as at least one of the IP interfaces on the switch is operational. As a result, a loopback interface is useful for debugging tasks since its IP address can always be pinged if any other switch interface is up.
  • You can use a loopback interface to establish a Telnet session, ping the switch, and access the switch through SNMP, SSH, and HTTP (web interface).
  • A loopback IP address can be used by routing protocols. For example, you can configure the loopback IP address as the router ID used to identify the switch in an OSPF area. Because the loopback interface is always up, you ensure that the switch’s router ID remains constant and that the OSPF network is protected from changes caused by downed interfaces.

Yeah, that’s pretty much why we use ‘em, right?

I have one particular 5400 that has given me problems since I got it (a 5406zl, if you’re wondering). This thing has more issues than a teenage emo girl whose step-dad beats her. I’ve lost count of how many cases I’ve opened over this switch and, at one point, even had a “Solutions Architect” from HP fly in to spend the day and witness one particular issue (maybe I was lying or something, right?). The management module and CF card have been replaced twice that I can recall — maybe three times (without checking, I’m not really sure anymore).

This 5406zl has been running the K.13.71 software for quite a while. I found it to be (mostly) stable and I don’t go around upgrading to new versions of software until there’s a good reason to.

One particular issue that we’ve encountered is that the switch will, almost exactly at 60-day intervals, start crapping out. We use RANCID to audit/monitor changes and eventually RANCID will generate an e-mail that clearly shows that something isn’t right. Logging into the switch and issuing “show config” will get you a wonderful error message: “Translator failed”. If left alone, the switch seems to continue functioning (for the most part), but eventually either 1) you’ll get locked out of SSH access or 2) you’ll enter a harmless command that will cause the switch to reload.

The first time this occurred was on June 29th when I logged into the switch to verify the configuration of an access list. I issued the “show access-list config” command and the switch immediately crashed and reloaded (at about 14:20, conveniently).

To close out the case that I opened because of that, I was sent a new management module and CF card that I swapped out on July 6th. On August 30th, I again received the infamous “Translator failed” e-mail and, late that night, swapped out the management module yet again.

Knowing that HP Support will eventually want me to upgrade the firmware, I attempted to do just that on October 6th. I grabbed K.14.49m, a “maintenance release”, and loaded it onto the switch. To quote the first line of the follow-up e-mail I sent to my co-workers:

“Well, that was a miserable failure.”

The switch still worked, depending on how you define “work”. The big issue was that the loopback interface would not respond to any traffic. Wait, what did the manual say again?

“You can use a loopback interface to establish a Telnet session, ping the switch, and access the switch through SNMP, SSH, and HTTP (web interface).”

Um, yeah. None of that worked, actually.

I ended up rolling back to K.13.71 and life went on. About sixty days later, of course, we had to reboot the damn thing again. I think I had to do it again a month and a half or so ago but at this point I’ve quit keeping track.

Last week, I decided to be pre-emptive and attempt the upgrade to K.14.49m again. I’m not sure why I expected anything different, but I got the same result. This time, however, I wondered if it were just a fluke with this particular switch (since I’ve already had numerous issues with it) and left it at K.14.49m.

Because everything uses the loopback IP address, however, a number of other things had to be changed: our monitoring system, DNS entries, the RADIUS configuration (on both the switch and the server), our SNMP software that graphs interface utilization, RANCID, etc. Obviously, this wasn’t too damn convenient and I had to update all of this stuff.

Now I’m thinking that it’s just this particular switch. It can ping its own loopback (usually), but nothing else can. It was even sending out RADIUS Access-Requests with the IP of the loopback (which is what it was configured to do) and I saw the return traffic with tcpdump, but it was apparently just silently discarding it.

Before opening a new case with HP, I decide I’ll upgrade another switch to the same software version and see what happens. I fully expect this upgrade to go smoothly. How wrong I was!

Tonight, I VPN in and upgrade this switch, a 5412zl with a similar configuration, to K.14.49m as well. It comes back up but, once again, the loopback IP address is unreachable. I go hit the documentation again and see that “lo1” through “lo7” are identified as being user-definable. Maybe that’s my issue, since I used “lo0”. I remove the IP from loopback0, bring up loopback1, assign it the IP address, and advertise that into OSPF. I make sure it shows up in OSPF and try to ping it: nothing.

Just for good measure, I’ll show some verification:

SWITCH4# show ip | in lo1
  lo1          | Manual       10.144.5.4      255.255.255.255

10.144.5.4 is the loopback IP address of the switch and we can ping that IP from itself:

SWITCH4# ping 10.144.5.4
10.144.5.4 is alive, time = 1 ms

Let’s make sure we can see it in OSPF (from two other devices, just to be doubly sure):

SWITCH3# show ip route ospf | in 5.4
  10.144.5.4/32      10.144.4.22     4001 ospf      IntraArea  2          110
SWITCH1# show ip route ospf | in 5.4
  10.144.5.4/32      10.144.4.10     4012 ospf      IntraArea  2          110

Yep, everything looks good. We should be able to ping that without any issues, right?

SWITCH3# ping 10.144.5.4 repetitions 3 | in packet
3 packets transmitted, 0 packets received, 100% packet loss
SWITCH1# ping 10.144.5.4 repetitions 3 | in packet
3 packets transmitted, 0 packets received, 100% packet loss
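
If you wanted to script that check from several vantage points instead of eyeballing it, something like the following would do. This is only a rough sketch: the function names are made up for illustration, and it assumes you have already collected the summary line from each device by some other means (it does not log into anything).

```python
import re

# Matches the summary line shown in the ping output above, e.g.
# "3 packets transmitted, 0 packets received, 100% packet loss".
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) packets received, (\d+)% packet loss"
)


def parse_ping_summary(line):
    """Return (sent, received, loss_percent), or None if the line doesn't match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    sent, received, loss = (int(group) for group in match.groups())
    return sent, received, loss


def unreachable_from(results):
    """Given {vantage_point: summary_line}, list vantage points with 100% loss."""
    failed = []
    for name, line in results.items():
        summary = parse_ping_summary(line)
        if summary is not None and summary[2] == 100:
            failed.append(name)
    return sorted(failed)


if __name__ == "__main__":
    # Summary lines copied from the two switches above.
    results = {
        "SWITCH3": "3 packets transmitted, 0 packets received, 100% packet loss",
        "SWITCH1": "3 packets transmitted, 0 packets received, 100% packet loss",
    }
    print(unreachable_from(results))
```

Lines that don't look like a ping summary (the “is alive” line, for example) are simply ignored rather than counted as failures.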

Maybe it’s just not responding to ping, but we can still get to it via SSH, right?

$ ssh -v 10.144.5.4
OpenSSH_5.4p1 FreeBSD-20100308, OpenSSL 0.9.8n 24 Mar 2010
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to 10.144.5.4 [10.144.5.4] port 22.
debug1: connect to address 10.144.5.4 port 22: Operation timed out
ssh: connect to host 10.144.5.4 port 22: Operation timed out
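
The same kind of check can be done without an SSH client at all, by just trying to open a TCP connection to port 22. A minimal sketch, using only Python’s standard library; the helper name and the 5-second timeout are my own choices here, not anything the switch or HP prescribes:

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


if __name__ == "__main__":
    # 10.144.5.4 is the loopback address used in this post; 22 is SSH.
    print(tcp_port_open("10.144.5.4", 22))
```

Note that this only tells you whether the TCP handshake completes. It says nothing about whether authentication (RADIUS or otherwise) would succeed afterwards.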

No such luck. In addition, tcpdump confirms that the RADIUS server receives Access-Request messages from the switch and responds accordingly, but the switch still seems to just throw ’em out.

At least now I know that it’s not the 5406zl that’s screwed up, it’s the software. Maybe I shouldn’t be using the “maintenance release”, though. Let’s see what the Release Notes say:

Maintenance branch software is recommended for users who want to:

  • Standardize on a major release after is (sic) has gained some maturity,
  • Remain on a major release for an extended period of time to reduce time and expense of qualifying a new major release,
  • Do no (sic) need to use new hardware or features that are not supported in Maintenance branch.

Yeah, that pretty much describes me.

Anyway, that’s my rant for tonight (it’s just past 3am now). If you’ll excuse me, I’ve got a switch here that I have to roll back to K.13.71. Oh, and another case to open with HP (surely there’s got to be an achievement for this?).

UPDATE: I discovered the issue and how to work around it. HP changed the functionality in a software update, then later reverted to the original behavior after I complained.

{ 11 comments }

Ralph January 31, 2011 at 12:08 pm

Sounds like about as much fun as the issue I had with the 5400 series a few years ago with the multicast support. I have also run into the unresponsive console port issue a few times, which is always fun.

Wyatt January 31, 2011 at 5:00 pm

I tried to sit on that major release as well and HP had me upgrade to an “unreleased version” to fix my issue with random reboots caused by memory leaks. I don’t think there is any reason HP should be calling that a maintenance release as it seems far from polished.

Rob February 3, 2011 at 5:35 pm

Place I used to work for runs 14.49m software on 5406zl’s. Rather than assign the Loopback a /32 address, we use an address out of one of the VLANs on the switch. No problems reported so far.

JJ February 4, 2011 at 1:37 pm

Hey I saw this post.

So… The previous incarnations of ProCurve firmware supported use of IPs on a VLAN even if no interfaces were up (meaning a physical port). Somewhere between 13 and 14 that changed on the K code for the ProVision ASIC series.

We bitched, we moaned, we complained too. In fact, I’m going to take this up the chain AGAIN through the development lab to see if we can get a resolution another way, since I’ve learned that we’re not the only ones with this issue.

Support may tell you that’s the expected operation, and it is now, but it WASN’T before; they changed it.

I’ll keep you posted if we can get a feature change request going!
jj

JL February 7, 2011 at 5:24 am

I feel your pain. These are toy switches: the CLI is terrible and the hardware is no better. Just last week I had two new linecards fail just 72 hours apart.

JJ February 7, 2011 at 3:50 pm

Aww they’re not toy switches. You’d be surprised at some of the places that run their entire organizations (routing and all) on HP switches.

I like them because they’re easier to manage. In one line of CLI I can do things that would take 2-5 lines on a Cisco switch, plus the HP doesn’t force you to be in a specific context to do actions. I guess if you’re just used to Cisco then that’s what you know and you’re familiar with it, but as someone who has to do a lot of cross-platform, there are things I prefer on the HP that make my life simple.

j

workape February 21, 2011 at 5:56 pm

You are right, they aren’t toy switches; they are pretty much crap. We were one of those orgs that was running ProCurve everywhere and after 2.5 years of it we are starting to gut our sites and replace with Cisco kit. I’ve had 0 issues with the sites that we’ve put in with Cisco and nothing but heartache and pain with the ProCurve kit.

We’ve got everything from the 5412’s down to the little 1708 pieces of crap which aren’t fit to be used at a person’s desk as a hub. You get Half Price, you also get Half Effort in support and less than Half Effort in product and software testing to find all sorts of wonderful bugs.

Aaron February 23, 2011 at 5:15 am

Hi! I work at a university and we have around 400 user-access switches and about 25 5406/5412zl, one in each building and a couple more per campus as Core.

Some time ago I upgraded one of our Core switches from K.13.71 to K.14.49m and the loopback (lo1) is reachable. We use it for monitoring (SNMP, ping, syslog...) and management, and we reach it through OSPF. Besides some really weird CPU peaks we had no complaints. The only difference is that it is a 5412zl, not a 5406zl…

I guess you haven’t even thought about upgrading to K.15.xxxx. We are using it as a “lab” in one building; the CPU usage went down quite a bit and it is pretty stable. We don’t use a loopback there, so I can’t assure you it will work but… hey, what can you lose by trying ;)

Aaron February 23, 2011 at 5:17 am

By the way, great blog

Joachim Tingvold July 4, 2011 at 8:59 am

Ehr, this is stupid.

I’ve encountered this issue on some 3500yl’s we have, and it’s not even consistent. I recently upgraded them to K.15.05.0002, and on some of them, the lo0-interface behaves the same way as you describe, while on others, it works just fine (even with the same configuration). So, same switch, same software, same config, but different behavior between the switches. I’m baffled.

Jeremy L. Gaddis July 14, 2011 at 3:05 am

@Joachim,

Same here. Welcome to HP.

-Jeremy
