Thursday 8 March 2012

Virtualised Exchange 2010 SP2 – Configure intersite DAG for DR without CAS array

I was asked to build a DR site for MS Exchange 2010 SP2 and replicate all DBs there. The client had no requirement for load balancing or a fully automated fail-over solution, so I was curious whether it was feasible to build a functional DR system, to which users could switch over easily and quickly enough, without using 3rd party NLBs or a CAS array.

Each site runs on VMware ESXi v5 clusters.

There is a 10 Mbps WAN link between the sites and Riverbed WAN accelerators at each side.


Build and configure the servers at the DR site:

At the main site, Exchange was running on 2 servers, one mailbox and one CAS/HT. So I built 2 VMs (Windows 2008 R2 Datacenter) at the DR site and configured CPU, RAM and drives. Since there was a single path between the 2 clusters, I assigned a single NIC to each server, for both MAPI and replication traffic.

As VMware recommends in its Microsoft Exchange 2010 on VMware Best Practices Guide, switch the NIC to the VMXNET3 adapter for better performance. You may need to add a new adapter and remove the old one. You will get a prompt saying that “The IP address XXX.XXX.XXX.XXX you have entered for this network adapter is already assigned to another adapter”, and it will then ask you if you want to delete the ghost adapter. Click Yes.

Use the VMXNET3 network adapter – this is a paravirtualized device that works only if VMware Tools is installed on the guest operating system. The VMXNET3 adapter is optimized for virtual environments and designed to provide high performance.

Both Microsoft and VMware now support VMware HA, DRS and vMotion for DAG servers. Bear in mind, though, that a server goes offline for a moment during, for example, a vMotion, and consider how that will affect your configuration and any other applications (e.g. BlackBerry) plugged into your Exchange system (with or without DAG).

Using VMware HA, DRS and vMotion with Exchange 2010 DAGs

Announcing Enhanced Hardware Virtualization Support for Exchange 2010

Dynamic memory allocation should be disabled if possible.

MS Exchange 2010 database cache and VMware ESX Server memory resource management

I installed the Exchange servers, with the Mailbox role on one server and the CAS and HT roles on the other, and configured Server Configuration (entered license keys, imported and configured certificates, set up server limits etc).

During the installation of the first Exchange server (CAS/HT) I got the error below:


After some troubleshooting, I found that this could happen if the new server did not have the ‘Manage auditing and security log’ right on the DCs. That right is configured by granting it to the Exchange Servers group in Domain Controllers Policy > Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies/User Rights Assignment > Manage auditing and security log. The GPO was there and the group had been added, but for some reason someone had set a ‘deny read’ permission on it for several DCs, including the one at the DR site. Once the restriction was removed, the installation completed successfully.

I found that the Riverbed accelerators were showing an error next to the DAG traffic entry, so I contacted their support guys to confirm that the software on these devices supported DAG traffic. The software version in use was v6.1.2, and they confirmed that only v6.5.x and later supported DAG replication traffic. So I had to turn optimisation off for the 2 mailbox servers until the devices were updated.

To turn the optimisation off, I had to create 2 rules:

1. In-Path Rule

Configure > Optimization > In-Path Rules > New rule

Type: Pass Through
Source Subnet: DR_site_Riverbed_IP (add /32 to the IP)
Destination Subnet: Main_site_Riverbed_IP (add /32 to the IP)

2. Peering Rule

Configure > Optimization > Peering Rules

Type: Pass Through
Source Subnet: Main_site_Riverbed_IP (add /32 to the IP)
Destination Subnet: DR_site_Riverbed_IP (add /32 to the IP)


Designate 2 IPs that will be assigned to the DAG cluster and make exclusions for these in DHCP servers. Each address needs to be from the network range where mailbox servers reside, one IP for the main site and one for the DR site.


Create and configure DAG

I created a new DAG in EMC, used the main CAS server as the witness server, and assigned the mailbox servers and cluster IPs to it. The DAG itself can be created through EMS as well:

New-DatabaseAvailabilityGroup -Name DAGname -WitnessServer WitnessName -WitnessDirectory "WitnessDirectory" -DatabaseAvailabilityGroupIpAddresses cluster_IP_1, cluster_IP_2 -Verbose

A virtual computer object with the name of the group (and the description: ‘Failover cluster virtual network name account’) is created in Active Directory and should be moved to the ‘Exchange Servers’ OU. An A record for the DAG name is also created in DNS, pointing to the DAG cluster IP at the site of the active server.
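For reference, assigning the mailbox servers to the DAG can also be done from EMS. The following is a minimal sketch, not the exact commands I ran: DAGname and DR_mailbox_server are the placeholders used elsewhere in this post, and Main_mailbox_server is an assumed placeholder for the primary mailbox server.

```powershell
# Add each mailbox server to the DAG (placeholder names, replace with your own).
# The cmdlet installs the Windows Failover Clustering feature if it is missing.
Add-DatabaseAvailabilityGroupServer -Identity DAGname -MailboxServer Main_mailbox_server
Add-DatabaseAvailabilityGroupServer -Identity DAGname -MailboxServer DR_mailbox_server

# Confirm membership, the witness and which servers are currently operational.
Get-DatabaseAvailabilityGroup -Identity DAGname -Status |
    Format-List Name, Servers, OperationalServers, WitnessServer, DatabaseAvailabilityGroupIpv4Addresses
```

Both mailbox servers should appear under Servers and OperationalServers before any database copies are added.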


Then I configured several features:

1. Turned off automatic fail-over:
In order to stay on the safe side, I did not automate fail/switch-over of databases. By default, the automatic fail-over is enabled.

Set-MailboxServer -Identity DR_mailbox_server -DatabaseCopyAutoActivationPolicy Blocked

2. Disable encryption and compression:
Exchange encryption and compression of DAG replication traffic conflict with Riverbed’s compression and should be turned off. By default, both compression and encryption are enabled only for traffic between different subnets (InterSubnetOnly), which includes traffic between the sites.

Set-DatabaseAvailabilityGroup -Identity DAGname -NetworkEncryption Disabled
Set-DatabaseAvailabilityGroup -Identity DAGname -NetworkCompression Disabled

3. Enable cross-site direct connect:
In order to allow the primary CAS server to connect to the DR mailbox server and the DR CAS server to connect to the primary mailbox server, cross-site direct connect needs to be enabled. It is disabled by default.

Set-DatabaseAvailabilityGroup -Identity DAGname -AllowCrossSiteRpcClientAccess $true
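It is worth reading the three settings back before going any further. A small sketch, using the same DAGname and DR_mailbox_server placeholders as above:

```powershell
# Show the DAG-level settings changed in steps 2 and 3.
Get-DatabaseAvailabilityGroup -Identity DAGname |
    Format-List Name, NetworkEncryption, NetworkCompression, AllowCrossSiteRpcClientAccess

# Show the activation policy on the DR mailbox server; it should read Blocked after step 1.
Get-MailboxServer -Identity DR_mailbox_server |
    Format-List Name, DatabaseCopyAutoActivationPolicy
```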

A database can be added to a DAG by right clicking it and selecting ‘Add Mailbox Database Copy’. While testing it, I got several errors and was not able to set it up successfully. The database itself was replicated, but logs were not and the status of the passive copy remained in the ‘Resynchronising’ state indefinitely. The errors below describe the issue.



A source-side operation failed. Error An error occurred while performing the seed operation. Error: Failed to open a log truncation context to source server….

I found an article that suggests removing the database GUID from the registry of the database host in order to fix the issue, but I have not tried it.

Instead, I found that the safest way to set it up is to create a new (empty) database and configure DAG replication for it. Once both copies are okay, showing as Mounted and Healthy, move all the mailboxes into it from the database that was originally supposed to be replicated.

Note: While mailboxes are being moved between databases, Outlook will lose connection and then re-establish it once the move is completed. Duration of the outage depends on the size of the mailbox. So, moving mailboxes should probably be scheduled for after hours.
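The new-database approach could look roughly like this in EMS. This is a sketch under assumptions rather than a record of what I typed: NewDB, OldDB, Main_mailbox_server and the D:\ and E:\ paths are all made-up placeholders.

```powershell
# 1. Create and mount a new, empty database on the main mailbox server.
#    Every copy of a DAG database uses the same paths, so the same drive
#    letters must be available on both mailbox servers.
New-MailboxDatabase -Name "NewDB" -Server Main_mailbox_server `
    -EdbFilePath "D:\NewDB\NewDB.edb" -LogFolderPath "E:\NewDB"
Mount-Database -Identity "NewDB"

# 2. Add a passive copy on the DR mailbox server; seeding starts automatically.
Add-MailboxDatabaseCopy -Identity "NewDB" -MailboxServer DR_mailbox_server -ActivationPreference 2

# 3. Wait until the active copy is Mounted and the passive copy is Healthy.
Get-MailboxDatabaseCopyStatus -Identity "NewDB" |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

# 4. Queue moves for the mailboxes in the original database, then watch progress.
Get-Mailbox -Database "OldDB" -ResultSize Unlimited | New-MoveRequest -TargetDatabase "NewDB"
Get-MoveRequest | Get-MoveRequestStatistics |
    Format-Table DisplayName, Status, PercentComplete -AutoSize
```

Every move generates transaction logs that have to cross the WAN link, so moving mailboxes in small batches keeps the copy queue at the DR site manageable.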


Switching/failing-over between sites:

I wanted to test how Outlook behaves in case a database, a server or the whole site goes down, so I created a test DB and replicated it to the DR site, then moved 2 mailboxes to it and let the passive copy reach the Healthy state with 0 logs in its queue. Then I logged on to each account and opened Outlook under their profiles, one at the main site and the other at the DR site. Finally, I moved the test DB from the main to the DR site and back, and manually switched between CAS servers.

Note: Automatic switch over of DAG databases is disabled. If an active database or the whole primary server goes offline, log on to any available Exchange server and mount a copy of the database in question at the DR server.
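Activating a copy at the DR server can be done from EMS as well. A minimal sketch, where "databasename" and DR_mailbox_server are the placeholders used throughout this post:

```powershell
# Check that the DR copy is Healthy and how far behind it is before activating it.
Get-MailboxDatabaseCopyStatus -Identity "databasename\DR_mailbox_server" |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength

# Activate the copy on the DR mailbox server (the database is mounted there as part of the move).
Move-ActiveMailboxDatabase -Identity "databasename" -ActivateOnServer DR_mailbox_server

# Confirm which server now hosts the active copy.
Get-MailboxDatabase -Identity "databasename" -Status | Format-List Name, Server, Mounted
```

If the primary server died with logs that were not yet copied, the mount may be refused. Move-ActiveMailboxDatabase has a -MountDialOverride parameter for that case, but read up on the data-loss implications before using it.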

1. If the mailbox server at the primary site goes offline but the CAS server is available, once the databases are switched over to DR mailbox server, Outlook clients will auto-reconnect and no user action is required. It took only 2-3 seconds for Outlook to reconnect. Clients from all sites, including the DR site, will still be connecting through the primary CAS.

2. If the CAS at the main location goes offline, but the mailbox server is still online, a manual switch-over to the DR CAS is required. Log on to any Exchange server and in its EMS execute the following command for each affected database:

Set-MailboxDatabase -Identity "databasename" -RpcClientAccessServer DR_CAS_name

Clients will need to restart their Outlook, but no profile repair is required. In this case Outlook clients will be connecting to the DR CAS server and from there to the main mailbox server.

3. If both the main CAS and mailbox servers go offline, a manual switch-over for affected databases and the CAS server is required. Log on to any DR Exchange server and mount the affected databases on the DR mailbox server, then in EMS execute the following command for each affected database:

Set-MailboxDatabase -Identity "databasename" -RpcClientAccessServer DR_CAS_name

Clients at all sites will need to restart their Outlook, but no profile repair is required. In this case Outlook clients will be connecting to the DR CAS server and from there to the DR mailbox server.

Note: After I switched CAS for a test DB, an Outlook client with caching turned off connected to the new CAS straight away, but another client which had caching turned on failed to connect even after several restarts of Outlook and profile repairs. Then I turned caching off and restarted Outlook, and it connected successfully. Then I re-enabled caching and restarted Outlook and it connected again.
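When many databases are affected, the per-database command from scenarios 2 and 3 can be piped instead of typed once per database. A sketch, assuming every mailbox database in the organisation should be repointed (DR_CAS_name is the placeholder used above):

```powershell
# Repoint every mailbox database in the organisation to the DR CAS.
# Narrow the Get-MailboxDatabase call if only some databases are affected.
Get-MailboxDatabase | Set-MailboxDatabase -RpcClientAccessServer DR_CAS_name

# Verify the result.
Get-MailboxDatabase | Format-Table Name, Server, RpcClientAccessServer -AutoSize
```

Running the same pipeline with the main CAS name reverses the change once the primary site is back.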


OWA and ActiveSync connectivity:

OWA is part of the Exchange organisation and is auto-configured on the new CAS servers as they are joined to the organisation. The default OWA URL for DR CAS is

https://drcas.domain.com/owa

If you use OWA, you should probably publish it under a DNS alias. Set the TTL for the alias to 5 minutes and, in case of a switch-over, simply recreate it to point to the DR CAS.

The same goes for the ActiveSync service. Use an alias and set its TTL to 5 minutes, then, in case the main CAS goes down, just recreate the record in DNS and point it to the DR CAS.
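Recreating the alias can be scripted with dnscmd, which is available on Windows 2008 R2 DNS servers and can be run from a PowerShell prompt. This is only a sketch: dc01, the alias name mail and the host maincas.domain.com are assumed placeholders, and it is worth checking the resulting record in the DNS console afterwards.

```powershell
# Placeholders (assumptions): dc01 is a DNS server, domain.com the zone,
# "mail" the alias used for OWA/ActiveSync, maincas/drcas the CAS host names.

# Remove the alias that points at the main CAS...
dnscmd dc01 /RecordDelete domain.com mail CNAME maincas.domain.com /f

# ...and recreate it with a 300-second (5-minute) TTL pointing at the DR CAS.
dnscmd dc01 /RecordAdd domain.com mail 300 CNAME drcas.domain.com
```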


Setting up email flow for the DR site:

In order to restore email flow, in case the main CAS server or the whole main site goes down, it is necessary to reconfigure send and receive connectors, so the configuration used by the primary CAS server should be well documented.
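One way to document the connector configuration is to export it from EMS while the primary site is still healthy. A sketch under assumptions: Main_CAS_name is a placeholder for the primary CAS/HT server and C:\ExchangeDocs is an assumed folder that already exists.

```powershell
# Snapshot connector settings to files (C:\ExchangeDocs is an assumed, existing folder).
Get-SendConnector | Export-Clixml -Path C:\ExchangeDocs\SendConnectors.xml
Get-ReceiveConnector -Server Main_CAS_name | Export-Clixml -Path C:\ExchangeDocs\ReceiveConnectors.xml

# Human-readable summary of the settings most likely to need recreating at the DR site.
Get-SendConnector |
    Format-List Name, AddressSpaces, SourceTransportServers, SmartHosts, DNSRoutingEnabled
Get-ReceiveConnector -Server Main_CAS_name |
    Format-List Name, Bindings, RemoteIPRanges, PermissionGroups, AuthMechanism
```

Keep a copy of the output somewhere that does not depend on the main site being online.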
