The Phone Call
One morning, around 2am, I received a call from my manager: “I know you’re not on call, so you know I wouldn’t be calling you if it wasn’t urgent, but…”.
I was a network engineer for a long time, and those words are like the Bat-signal to me – I didn’t need an explanation; I was in my car and driving within 10 minutes.
Of the many times that has happened to me, this one doesn’t even crack the top 3 of wacky outages I’ve been in. It started with “…err, the building has been struck by lightning and it’s taken everything out” (believe me, it’s nothing like Back to the Future).
An hour and a half later I walked in to find my manager holding a list of issues. Number one was a Cisco Catalyst 4009 running CatOS (Google it, kids) that we thought had only blown its Layer 3 daughter card. We had a very good maintenance contract, had the part on site within about an hour, and swapped it in. We soon found out that this wasn’t the Cat’s only problem.
Reading this three years later in the cold light of day, you could be tempted to point out that at this stage you would have noticed that all the servers connected to the Cat’ were still offline, and that you wouldn’t have been saying to yourself, “well, that was easy – celebrate with a brew? …stupid servers!”. But in these high-pressure, laugh-it-off-instead-of-puking downtime experiences, where literally every minute that passes carries a financial penalty, everything the books and courses teach you about troubleshooting isn’t actually that practical. You concentrate on core routing – your core building blocks. Fix the hardest thing first and everything else will fall into place; your servers are just expensive table tops until the network has healthy routing that knows how to connect A to B (why is B always on the other side of the world?!).
Long diagnosis story short, the power outage had left only the odd-numbered slots in the chassis backplane working. We figured out that, because the servers were evenly distributed, there was enough room to shuffle things around and put them on the remaining working slots. Easier said than done: after years of use, every cable had to be labelled, removed, and manoeuvred to one side; the switch modules moved around until they all came up; a new config written for every port (in CatOS, no less); and then every cable plugged into its new home.
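To give a flavour of the “new config for every port” step: a minimal sketch of how you might generate the remapped port commands rather than typing them by hand. This assumes the simple case where each port keeps its VLAN and description and only the module (slot) number changes, and uses the classic CatOS `set`-style syntax (`set vlan`, `set port name`, `set port enable`); the port data and slot mapping below are hypothetical examples, not the real config from that night.

```python
def remap_port_config(ports, slot_map):
    """Generate CatOS-style config lines for ports moving between slots.

    ports: list of dicts with keys "slot", "port", "vlan", "name"
    slot_map: {old_slot: new_slot} for the modules being relocated
    """
    lines = []
    for p in ports:
        mod_port = f"{slot_map[p['slot']]}/{p['port']}"
        lines.append(f"set vlan {p['vlan']} {mod_port}")       # VLAN membership
        lines.append(f"set port name {mod_port} {p['name']}")  # description
        lines.append(f"set port enable {mod_port}")            # bring it up
    return lines

# Hypothetical example: module in slot 2 (dead) moves to slot 3 (working)
ports = [
    {"slot": 2, "port": 1, "vlan": 10, "name": "web01"},
    {"slot": 2, "port": 2, "vlan": 10, "name": "web02"},
]
for line in remap_port_config(ports, {2: 3}):
    print(line)
```

Even a throwaway script like this beats hand-editing 48 ports’ worth of config at 4am, and it doubles as a record of which cable went where.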
This was actually only one of many things that needed fixing in what turned into a 16-hour (I think) day. Thankfully, by this time, something called al-cooo-hol?? had been discovered in the world, and I proceeded to see what all the fuss was about.
What I would do differently now
These days I use these experiences to design solutions around ‘what can I do to prevent this?’ instead of ‘if only I’d done this to prevent that’.
With that mentality in mind, I recently purchased a load of these for a project: the SergeantClip – www.sergeantclip.com
It’s a straightforward plastic clip: you insert a bundle of up to 6 or 12 cables (copper or fibre) and clamp it down. A 12-cable bundle can then be disconnected and reconnected in quick fashion – you just have to keep an eye on which bundle goes into which port group and label up accordingly.
Instead of labelling 48 cables per switch or line card, you label just 4 clips.
The clip has many uses – e.g. speeding up an undocumented data centre migration – but for me it’s more about saving time in the future. It might seem odd to buy a plastic clip to save yourself an hour or so during an outage that may or may not happen, but believe me, if you’ve been where I have, you’d wish you’d had them.
Full disclosure: I don’t have any personal financial interest in the product.