IT (Mis)Adventures in Namibia: VMware


Sunday, February 24, 2013

Issues and Workarounds – vSphere Site Recovery Manager with the NetApp Storage Replication Adapter 2.01

I have just completed a project where I had to install and configure VMware vSphere Site Recovery Manager (SRM). Storage was provided by NetApp FAS and V-Series filers, so I had to use the NetApp-provided Storage Replication Adapter (SRA). At the time of writing the latest version was 2.01. True to form, I ran into a couple of bugs, which took a bit of figuring out.

Unable to add a controller: “Error: SRA command ‘discoverArrays’ failed”

Execute the following commands on your filers:

options httpd.admin.enable on
options httpd.enable on
options httpd.admin.ssl.enable off
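
To confirm the settings took effect on both filers, something like the following could be used. It is only a sketch: it assumes the filer prints each option as a name followed by its value (the way `options httpd` normally does on 7-Mode), and the function name `check_httpd_options` is my own invention.

```shell
# Reads the output of `options httpd` on stdin and reports any of the three
# options above that does not have the value this workaround needs.
# Assumption: each line looks like "<option name> <whitespace> <value>".
check_httpd_options() {
    awk '
        BEGIN {
            want["httpd.admin.enable"]     = "on"
            want["httpd.enable"]           = "on"
            want["httpd.admin.ssl.enable"] = "off"
        }
        ($1 in want) { seen[$1] = $2 }
        END {
            bad = 0
            for (k in want)
                if (seen[k] != want[k]) {
                    printf "%s is \"%s\", expected \"%s\"\n", k, seen[k], want[k]
                    bad = 1
                }
            exit bad   # non-zero exit status if anything is off
        }
    '
}
```

If SSH is enabled on the filer, `ssh root@<filer> options httpd | check_httpd_options` should print nothing and return 0 once all three options are set.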

Error when adding an Array Pair: “Internal error: std::exception 'class Dr::Xml::XmlValidateException' "Element 'SourceDevices' is not valid for content model: '(SourceDevice,)”

There are two solutions to this issue

Downgrade back to NetApp SRA version 2.0.0
Manually include the lists of volumes you want discovered by the SRA. You’ll need to do this on both controllers in the pair.

This is a documented bug.

Reprotect Job fails after recovering to Disaster Recovery Site

The SRM / SRA timeouts seem a bit aggressive to me. This is highlighted when you do a reprotect on a failed-over Protection Group. Part of the task sequence is to reverse the direction of replication, but this fails consistently because SRM does not wait long enough for the reversal to take place.

You can kludge it by:

Re-running the reprotect until it works
Manually refreshing the Array Manager while the reprotect job is running

Recovered Datastores have snap-xxx prefixes

More of a cosmetic irritant than a true bug, I wanted this fixed nonetheless.

Within SRM, right-click your site and select Advanced Settings
Click StorageProvider
Select the storageProvider.fixRecoveredDatastoreNames check box

Tip

I would suggest increasing your SRM SAN provider timeout settings to something a bit more sane, like double. Instructions can be found here.

Also make sure that the ALUA settings on your iGroups in both the protected and recovery sites are the same.

Wednesday, October 10, 2012

Directly Connecting a Brocade 815 HBA to an EMC VNX5300

I’m busy with a project which involves getting two ESXi hosts hooked up to a VNX5300 configured in block mode. The order we placed with Dell specified Emulex 12000 HBAs, but Dell got creative and shipped Brocade 815s instead. The only problem was that they didn’t work when directly connected to the front-end ports on the VNX. I’m documenting the symptoms and the fix here so that the next person does not have to battle for two days.

The Symptoms

When directly connecting the HBAs to the VNX fibre ports, the following events pop up in the SP event logs:

EV_VirtualArrayFeature::_mergeInternalObjects() - No parent for HBA,
EV_TargetMapEntry::GetHostInitiatorPort() - NULL HBAPort pointer

Running NaviSECCli.exe -Address 172.20.10.27 port -list -sfpstate outputs the following:

SP Name:             SP A
SP Port ID:          1
SP UID:              50:06:01:60:BE:A0:72:F9:50:06:01:61:3E:A0:72:F9
Link Status:         Up
Port Status:         Online
Switch Present:      NO
SFP State:           Online

This tells us that things are fine on a physical layer, but not much else is happening higher up the stack.
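
With several SP ports to look at, it can help to boil that output down to one line per port. The function below is a rough sketch rather than anything EMC ships: the name `summarize_sp_ports` is made up, it assumes the CLI is reachable as `naviseccli` from a Linux / Mac shell, and it only relies on the field labels visible in the sample above (with "SFP State" treated as the last field of each port record).

```shell
# Reads `port -list -sfpstate` output on stdin and prints one line for every
# SP port whose link is up but which reports no switch, i.e. the state shown
# in the sample above.
summarize_sp_ports() {
    awk -F': *' '
        /^SP Name/        { sp = $2 }
        /^SP Port ID/     { port = $2 }
        /^Link Status/    { link = $2 }
        /^Switch Present/ { sw = $2 }
        /^SFP State/ {
            # Assumed to be the final field of a port record.
            if (link == "Up" && sw == "NO")
                printf "%s port %s: link up, no switch present\n", sp, port
        }
    '
}
```

Usage would be along the lines of `naviseccli -Address 172.20.10.27 port -list -sfpstate | summarize_sp_ports`.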

The Fix

First we need to upgrade the HBA firmware to version 3.1. There are various OS-specific ways to do this; the easiest is probably to download the LiveCD from Brocade. Since this HBA is not on the ESXi 5.1 HCL, we also need to install the driver, and it needs to be at least v3.1. I include the steps for the sake of completeness:

Enable SSH on your ESXi host
Use scp for Windows or the following command from a Linux / Mac host: scp brocade_driver_esx50_v3-1-0-0.tar root@<ip address>:/tmp
SSH into your ESXi host and navigate to the /tmp folder with cd /tmp
Execute tar xf brocade_driver_esx50_v3-1-0-0.tar
Execute ./brocade_install_esxi.sh
Wait for the installation to finish (takes about 1 – 2 mins) and reboot host once done

Now we need to configure the HBA for direct connection, or more technically, FC-AL mode:

SSH into your ESXi host and navigate to /opt/brocade/bin/ by entering cd /opt/brocade/bin/
./bcu port --topology 1/0 loop
./bcu port --disable 1/0
./bcu port --enable 1/0
./bcu port --topology 2/0 loop
./bcu port --disable 2/0
./bcu port --enable 2/0

Your ESXi host should now show up as a host on the VNX where you can add it to a storage group and assign LUNs.
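
If you have more than a couple of ports to flip, the same three bcu commands can be wrapped in a small loop. This is just a sketch of the steps above: the function name `set_loop_topology` and the `RUN` / `BCU` variables are my own, and by default it only prints the commands so you can check them before running anything on the host.

```shell
# Dry run by default: prints the bcu commands instead of executing them.
# Set RUN=1 to actually run them on the ESXi host.
BCU=${BCU:-/opt/brocade/bin/bcu}

set_loop_topology() {
    for port in "$@"; do
        # Same sequence as above: set loop topology, then bounce the port.
        for args in "--topology $port loop" "--disable $port" "--enable $port"; do
            if [ "${RUN:-0}" = "1" ]; then
                $BCU port $args    # unquoted on purpose so the arguments split
            else
                echo "$BCU port $args"
            fi
        done
    done
}

set_loop_topology 1/0 2/0
```

Once the printed commands look right, `RUN=1 set_loop_topology 1/0 2/0` would execute them.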

Wednesday, April 11, 2012

Issues when upgrading to Veeam Backup & Replication v6

Let me start off by saying that Veeam Backup & Replication (VBR) is one of the most awesome pieces of software I know of! As with any piece of software, there might be the occasional bug or issue that needs to be worked around. I’ve upgraded a couple of clients to VBR v6 already, and I’ve consistently run into two little niggles. These are:

Backup of vCenter SQL DB fails when using VBR v6

This is actually detailed on the Veeam forums here.

What it boils down to is that when using VBR for an application-aware backup of the VM which hosts the vCenter SQL DB, the backup will fail with a “VSSControl: Failed to freeze guest, wait timeout” error. This occurs because VBR has to communicate with vCenter to create a snapshot, but at the same time the vCenter SQL DB is frozen because of the VSS snapshot.

To solve this problem, add the IP-address of the ESX host which runs the vCenter Server SQL database to the Veeam console (the servers list in the left panel). Then adjust the job and select the virtual machine with the SQL database from the ESX host instead of via the vCenter Server. Veeam B&R will then use the host for communication with the VM to do the VSS snapshot.

This will, of course, cause your backup job to fail should the vCenter VM move to another host. You can work around this by using DRS Host Affinity rules to tie your vCenter VM to a physical box.

Replication job fails – cannot connect to port 2500

When VBR does a replication job, it might fail with “Creating snapshot Cannot connect to server [x.x.x.x:2500]”

In my case it was always happening when the destination server was running ESX (not ESXi) v4.x. This is easily resolved by logging onto the target server via SSH and issuing the following commands:

esxcfg-firewall -openport 2500:2510,tcp,in,VeeamSCP
service mgmt-vmware restart

Easy enough!