Thursday, July 29, 2010

The Overheating EVA

I had an "interesting" experience recently: an EVA 4400 overheated due to environmental issues (fancy-talk for aircon failure).  The client phoned me, complaining that half of their Hyper-V VMs were not running.  Further investigation revealed that the CSVs were offline.  Hmmm, this was getting serious.  I logged into Command View and saw that most of my VDisks were faulted.  This was due in no small part to the fact that all the drives in one of my shelves were faulted.

Event Logs

I had a look at the relevant EVA logs and discovered the following relevant entries:
  • Temperature within an HSV300 controller becoming too hot.
    View corrective actions.  Corrective action code: 2e
  • A drive enclosure temperature sensor out of range condition has been reported by one of the drive enclosure link modules.
  • A physical disk drive has disappeared.
    View corrective actions.  Corrective action code: 42
  • A Volume has transitioned to the MISSING state.
    View corrective actions.  Corrective action code: bf

What Happened

In retrospect it was a fairly simple sequence of events, as evidenced by the entries above.  The air conditioner failed, which caused the temperature within both the HSV300 controller and the drive shelf to rise (the first two event log entries).  To prevent damage to themselves, the drives then switched themselves off, which prompted the log entry about a physical drive disappearing.

We then started seeing volumes transitioning to the missing state, i.e. our VDisks went missing.  Hardly surprising considering that the drives containing them switched themselves off.
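
To make the chain of events easier to spot next time, here is a minimal Python sketch that checks whether a list of corrective action codes follows the same pattern.  The three codes (2e, 42, bf) are taken from the log entries above; the stage labels, the function names and the idea of feeding in a plain list of codes are my own assumptions for illustration, not anything Command View provides.

```python
# Corrective action codes copied from the event log entries above.
# The stage labels and the plain-list input format are assumptions.
STAGES = {
    "2e": "controller over temperature",
    "42": "physical disk disappeared",
    "bf": "volume MISSING",
}


def summarize(codes):
    """Return the known stages in the order they were first seen."""
    seen = []
    for code in codes:
        stage = STAGES.get(code.lower())
        if stage and stage not in seen:
            seen.append(stage)
    return seen


def looks_like_thermal_cascade(codes):
    """True when over-temperature comes first, then disk loss, then volume loss."""
    expected = [STAGES["2e"], STAGES["42"], STAGES["bf"]]
    return summarize(codes) == expected
```

Calling looks_like_thermal_cascade(["2e", "42", "42", "bf"]) matches the pattern, while a log that starts with disks disappearing and no temperature event does not, which would point at a different root cause.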

Resolution

  1. Restored the air conditioning (goes without saying, I guess)
  2. Powered off the EVA and all attached disk shelves
  3. Powered on the disk shelves and waited for the numeric ID LEDs at the back to display the proper IDs
  4. Powered up the controller
  5. Lo and behold!  All the previously failed physical disks came online, meaning that my missing VDisks also made a most welcome return
  6. Unfortunately my Hyper-V hosts still couldn't access the VDisks, so I had to unpresent and re-present them via Command View.  I assume the EVA assigned new WWNs to the LUNs.
  7. Re-scanned for storage from the Disk Management MMC on the Hyper-V hosts
  8. Brought the disks and CSVs online via Failover Cluster Manager
  9. Started up the VMs
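
The order matters here (shelves before controller, storage before VMs), so a rough sketch of the sequence as an ordered checklist may be useful.  This is only an illustration in Python: the step names paraphrase the list above, and RECOVERY_STEPS and next_step are hypothetical names of my own, not part of any HP or Microsoft tool.

```python
# Step names paraphrase the resolution list above; all names are hypothetical.
RECOVERY_STEPS = [
    "restore air conditioning",
    "power off EVA and disk shelves",
    "power on disk shelves, wait for shelf IDs",
    "power on controller",
    "unpresent and re-present VDisks",
    "rescan storage on Hyper-V hosts",
    "bring disks and CSVs online",
    "start VMs",
]


def next_step(completed):
    """Return the next step to perform, or None when everything is done.

    Raises ValueError if `completed` is not a prefix of RECOVERY_STEPS,
    i.e. a step was skipped or performed out of order.
    """
    completed = list(completed)
    if completed != RECOVERY_STEPS[:len(completed)]:
        raise ValueError("steps were skipped or performed out of order")
    if len(completed) == len(RECOVERY_STEPS):
        return None
    return RECOVERY_STEPS[len(completed)]
```

For example, once the first three steps are recorded as done, next_step returns "power on controller"; passing a list that starts with the controller raises an error instead.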

Conclusion

This was quite a harrowing experience, obviously.  What struck me as ridiculous is that HP does not have *ANY* thermal shutdown logic / capabilities on the EVA controller itself.  It keeps on trucking until the drives themselves fail, causing a very ungraceful failure of the VDisks.  There is also no guarantee that your drives and VDisks will come back online.  In essence - if your EVA overheats, there is a distinct possibility that you will lose your data.  Caveat Emptor...