Information Technology « Random Brain Drool

We’ll yesterday was the first time in many years that I had to spend an entire day in the office on a weekend due to a major hardware outage. This is never a good thing and it really *feels* bad when your team fails to keep core operations running smoothly. As a IT Manager, I strive for 0% unscheduled downtime and 95% up-time overall (Yep, that 5% might sound high but due to the volume of updates, backups, patches, hardware migrations, performance improvement tasks, etc… Thing need to go down for schedule work).

** I think Benjamin Franklin’s quote about a Apple a day, can be adapted to “A update a day, keeps another hacker away!”.

As for my data unavailable (DU) outage, the core symptom was our SQL Server 2008 host connect via fiber HBA to the SAN, loosing all of it’s LUNS. When my DBA saw this, she was on the phone with me in minutes yelling that all of her table had disappeared from every database in the system. She said that the system went down at 1:45 AM and all of our backups failed and all of our databases were empty except for the system tables. I immediately hopped online and realized this symptom was caused by SQL Server showing unmounted databases (why not say offline, unavailable), I’ll leave that question to Microsoft to explain…. but it definitely left her in a panic seeing all the data missing from every database on the server. Since I’m familiar with the issue (I think I’ve caused or fixed a few thousand issues, over my past 10 years of managing SQL Server), I packed the car and headed into the office.

Thinking we had DR built into our very expensive EMC SAN, I was going crazy trying to figure out why we were not online. I put in a severity 1 request with EMC telling them that we had DU and to get somebody on the phone ASAP. I grabbed a SP collect and sent it off to EMC and tech support person 1/7 found a bad fiber port on storage processor A (SP A – SFP FE 0) and said this was my issue. Since I know we have 2 storage processors, I asked him about moving the resource to B, but he said he only knew the Celerra part of our system and declined helping saying I would need to work with the Clarion tech who would be dispatched to replace SFP FE0 in 4 hours, and be on-site with parts within 24 hours. I asked multiple times if he could directly transfer me to somebody on the Clarion team to get my system online and data available, but he kept saying NO. I finally gave up on the call, so I could log some NASTY messages in the call ticket to get somebody moving ASAP (This works very well with EMC, so if you can’t get the online person to solve your problem then post all your issues, comments, and concerns in the ticketing system).

Into the server room I went, one of the few times I wish the room was a bit warmer than 67 degrees… I manually moved the host fiber cable connected to the HBA over to storage processor B and saw my LUNs show up on the HOST in seconds, but they were all showing as unrecognized and prompting for initialization. Knowing there was something wrong, I spent about a hour looking in the NaviSpehere, UniSphere, and CLI (both GUIs were new to me because 38 days prior EMC did a major update to our system). After getting half-way there, I decided to be brave to re-seat storage processor A. This ended up restoring data, but it only last for ~45 minutes before going offline again. At this point, the Clarion tech called and said he be in site in 1 hour, at 2PM with parts. Upon his arrival with a new SFP FE0 port, I asked him to first help me get online with storage processor B so we could get data available while working on storage processor A. He also didn’t know the new interface and we proceeded to go through 5 Clarion tech support people, 2 shift changes, and 4 more hours before ending up with somebody who knew what to do. As I had joked to the on-site tech a few times about it probably being 1 button or switch to set, the online tech showed us a single magic button to fix the problem! ARGH, 15 hours of downtime and 7 tech support people from EMC kept my data unavailable. The issue with the fail-over to storage processor B, was permission with the virtual disk being assigned only to storage processor A, so we moved it from A -> B and I was finally online in seconds. Since this guy seemed to KNOW the system pretty well, I asked why automatic DR fail-over did not work and why manually assigning a virtual disk to a node was needed. He realized the question meant there was else wrong, since he too was confused why this whole permission change was needed in the first place. Another hour of research from the online tech, and he determined I was missing EMC PowerPath software on all my hosts connected to the SAN via iSCSI and Fiber.

At this point, totally confused I had to get a 30 minutes lesson on PowerPath and how it works and where it was supposed to be installed. Finally, after hashing out the whole situation to determine the root cause with the; original designer, past employee who lead the project for me, my current team, Celerra support and the on-site Clarion support it was determined that the automatic fail-over features of the SAN were never setup correctly and we have never had an outage to trigger this scenario. This was a *shock* to everybody involved, but it explains a few mysterious events that have happened over the past 6 months (cluster resources becoming unavailable, complete SAN failure during upgrade, high IO on individual storage processors, etc…) that EMC support could not explain.

So, the end result and lesson learned is to always install PowerPath on all non-clustered hosts that connect to your EMC SAN if you want automatic fail-over. If you do have this setup and have a storage processor fail, follow the following steps:

1. Connect to Navisphere
2. Click “Virtual Disks”
3. Choose your virtual to edit and click the name.
4. Click “Virtual Disk Access Details”

5. Click “Change Ports for Access”
** On this screen, it will show where the virtual disk is currently assigned (e.g. SP A or SP B).

That’s it, hopefully this saves somebody 15 hours of downtime confused tech support people and/or a mis-configured DR setup!

HTTrack website copier is awesome and FREE! I’ve been using it for a few months and I’ve already saved myself and my team members hundreds of hours by downloading full working copies of various websites locally.

Here are a few examples of websites I download.

A client was having problems with their hosting company who refused to give them access to their source code (developed by a 3rd party) to their CF website. After weeks of trying to solve the problem, they decided to switch hosting companies. During their migration to a hosting company and a new CMS solution, they wanted to keep a copy of the old site online. Regardless of the engine running the website, HTTrack was able to download a working copy of the site to HTML, minus the dynamic features (Search, CMS Admin, etc…) in a few minutes.
We inherited a old web server at working running Plone 2.0.4. The machine was running on old hardware that was failing (2/3 dead HDD, broken RAID 1) and Zope was randomly crashing. After spending 3 days trying to get the machine stable, I decide to convert the site to HTML and put on one of our newer “stable” web servers. The process was a little mess, since our plone instance pointed to folders paths vs. actual files (/index_html vs /index.hmtl) the final static site had a lot of weird and randomly directed links! The final static HTML site probably took ~8 solid hours of HTML clean-up, but there was also a lot of bad/obsolete content in the existing portal that made this process painful!
I have a few websites I like to use as samples/demo that show better as interactive! Using HTTrack and controlling the depth of link following/downloading (like first page only), I can download an offline working copy of a sites main page! This is a great way to demo a feature/capability/competitor website…

A few other good things to know, a couple sites I’ve downloaded require you to login first. This isn’t an issue with HTTrack, since you can build your own cookies to apply to your capture. It was a bit tricky getting the cookie format correct (Used Netscape Cookie Format), but after a few minutes I was able to take the cookie data from IE/FireFox and have a perfect working cookie for HTTrack to use!

Once again, this is FREE utility that is handy to have around!

HTTrack WEBSITE COPIER

Random Brain Drool

Zach Hunter's Busy Life

Archive for category Information Technology

Storage Processor Failover with a EMC Celerra NX4 / Clarion AX4

Download Your Site to HTML

Archives

Categories