Wednesday, 28 January 2015

High-Availability in Exchange 2013

HA requirements

• Multiple SMTP links to the internet and between internal sites.
• Multiple DNS servers for multiple copies of MX records.
• At least 2 Mailbox servers in at least 2 sites with configured DAGs.
• If CAS server is installed separately, 2 CAS servers in 2 sites with a load balancer.

Shadow redundancy keeps a redundant copy of the message while the message is in transit. Safety Net keeps a redundant copy of a message after the message is successfully processed. So, Safety Net begins where shadow redundancy ends.

Shadow Redundancy

Shadow redundancy minimizes message loss due to server outages. It requires multiple Exchange Mailbox servers:

• If the second Mailbox server is not in a DAG, it must be in the same AD site.
• If it is in a DAG, it can be in either local or remote site (preferred).
• The primary and the shadow server communicate through a heartbeat.

The major improvement to shadow redundancy in Microsoft Exchange Server 2013 is that the transport server now makes a redundant copy of any messages it receives before it acknowledges successfully receiving the message back to the sending server. The sending server's support or lack of support for shadow redundancy doesn't matter. This helps to ensure that all messages in the Exchange 2013 transport pipeline are made redundant while they're in transit. If Exchange 2013 determines the original message was lost in transit, the redundant copy of the message is redelivered.

When the primary server successfully transmits the message to the next hop, and the next hop acknowledges receipt of the message, the primary server updates the discard status of the message as delivery complete. The discard status is basically a message that contains of list of messages that are being monitored. A successfully delivered message doesn't need to be kept in a shadow queue, so once the shadow server knows the primary server has successfully transmitted the message to the next hop, the shadow server moves the shadow message from the shadow queue into Safety Net.

Shadow Redundancy Manager is the core component of an Exchange 2013 transport server that's responsible for managing shadow redundancy. Shadow Redundancy Manager is responsible for maintaining the following information for all the primary messages that a server is currently processing:

• The shadow server for each primary message being processed.
• The discard status to be sent to shadow servers.

Shadow Redundancy Manager is responsible for the following for all the shadow messages that a shadow server has in its shadow queues:

• Maintaining the list of primary servers for each shadow message.

• Comparing the original database ID and the current database ID of the queue database where the primary copy of the message is stored.

• Checking the availability of each primary server for which a shadow message is queued.

• Processing discard notifications from primary servers.

• Removing the shadow messages from the shadow queues after all expected discard notifications are received.

• Deciding when the shadow server should take ownership of shadow messages, becoming a primary server.

• Tracking message bifurcations and other side-effect messages like delivery status notifications (DSNs) and journal reports to verify the redundant copy of the message isn't released until all forks of the message are fully processed.

Disable shadow redundancy (it’s on by default):
Set-TransportConfig –ShadowRedundancyEnabled $false

Reject messages that have not been shadowed (these are accepted by default):
Set-TransportConfig –RejectMessageOnShadowFailure $true

Set the message status check wait time for a shadow server (default is 2 mins):
Set-TransportConfig –ShadowHeartbeatFrequency

Set how long a shadow server waits for an unreachable primary server to respond before assuming it’s failed (default is 3 hours):
Set-TransportConfig - ShadowResubmitTimeSpan

Set how long a server retains discard events for successfully delivered messages (default is 2 days):
Set-TransportConfig –ShadowMessageAutoDiscardInterval

Set how long to keep successfully processed primary messages in Primary Safety Net, and acknowledged shadow messages in Shadow Safety Net (default is 2 days):
Set-TransportConfig –SafetyNetHoldTime

How long a message can remain in a queue before it expires (default is 2 days):
Set-TransportService –MessageExpirationTimeout

Database Availability Groups

In Microsoft Exchange Server 2013, the primary mechanism of mailbox high availability is the database availability group (DAG).

A database availability group (DAG) is a set of up to 16 Microsoft Exchange Server 2013 Mailbox servers that provides automatic, database-level recovery from a database, server, or network failure. DAGs use continuous replication and a subset of Windows failover clustering technologies to provide high availability and site resilience. Mailbox servers in a DAG monitor each other for failures. When a Mailbox server is added to a DAG, it works with the other servers in the DAG to provide automatic, database-level recovery from database failures.

When you create a DAG, it's initially empty. When you add the first server to a DAG, a failover cluster is automatically created for the DAG. In addition, the infrastructure that monitors the servers for network or server failures is initiated. The failover cluster heartbeat mechanism and cluster database are then used to track and manage information about the DAG that can change quickly, such as database mount status, replication status, and last mounted location.

• Create a DAG (specify a 15 character name that's unique within the Active Directory forest).
• Add Mailbox servers to the DAG (up to 16).
• Determine passive copies of the active databases.

If you are creating a DAG that will contain Mailbox servers that are running Windows Server 2012 R2, you also have the option of creating a DAG without a cluster administrative access point. In that case, the cluster will not have a cluster name object (CNO) in Active Directory, and the cluster core resource group will not contain a network name resource or an IP address resource.

When you create a DAG, an empty object representing the DAG with the name you specified and an object class of msExchMDBAvailabilityGroup is created in Active Directory.

DAGs use a subset of Windows failover clustering technologies, such as the cluster heartbeat, cluster networks, and cluster database (for storing data that changes or can change quickly, such as database state changes from active to passive or the reverse, or from mounted to dismounted or the reverse). Because DAGs rely on Windows failover clustering, they can only be created on Exchange 2013 Mailbox servers running the Windows Server 2008 R2 Enterprise or Datacenter operating system, Windows Server 2012 Standard or Datacenter operating system, or Windows Server 2012 R2 Standard or Datacenter operating system.

The witness server (ensures only 1 instance of a DB is active at any time) cannot be a member of the DAG – this prevents split-brain syndrome.

The witness server and its directory are used only when there's an even number of members in the DAG and then only for quorum purposes. You don't need to create the witness directory in advance. Exchange automatically creates and secures the directory for you on the witness server. The directory shouldn't be used for any purpose other than for the DAG witness server.

The requirements for the witness server are as follows:

• The witness server can't be a member of the DAG.
• The witness server must be in the same Active Directory forest as the DAG.
• The witness server must be running Windows Server 2012, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003 R2, or Windows Server 2003.
• A single server can serve as a witness for multiple DAGs. However, each DAG requires its own witness directory.

Regardless of what server is used as the witness server, if the Windows Firewall is enabled on the intended witness server, you must enable the Windows Firewall exception for File and Printer Sharing.

If the witness server you specify isn't an Exchange 2013 or Exchange 2010 server, you must add the Exchange Trusted Subsystem universal security group (USG) to the local Administrators group on the witness server prior to creating the DAG. These security permissions are necessary to ensure that Exchange can create a directory and share on the witness server as needed.

When a DAG has been deployed across two datacenters, a new configuration option in Exchange 2013 is to use a third location for hosting the witness server. If your organization has a third location with a network infrastructure that is isolated from network failures that affect the two datacenters in which your DAG is deployed, then you can deploy the DAG’s witness server in that third location, thereby configuring your DAG with the ability automatically failover databases to the other datacenter in response to a datacenter-level failure event. If your organization only has two physical locations, you can use a Microsoft Azure virtual network as a third location to place your witness server.

If your DAG members are running Windows Server 2012, you must pre-stage the CNO (Cluster Name Object) prior to adding the first server to the DAG. If your DAG members are running Windows Server 2012 R2, and you create a DAG without a cluster administrative access point, then a CNO will not be created, and you do not need to create a CNO for the DAG.

Safety Net aka Transport Dumpster

• Provides message redundancy after the message has been processed in a queue on the Mailbox server within a transport high-availability boundary.
• Processed by Transport Service.
• By default, it keeps the second copy of each message for 2 days (can be configured) – including delivered messages in order to ensure delivery in case of a database failover to its passive copy.

The transport dumpster was first introduced in Exchange 2007, and was further improved in Exchange 2010 to provide redundant copies of messages after they're successfully delivered to mailboxes in DAGs. In Exchange 2010, the transport dumpster helped protect against data loss by maintaining a queue of successfully delivered messages that hadn't replicated to the passive mailbox database copies in the DAG. When a mailbox database or server failure required the promotion of an out-of-date copy of the mailbox database, the messages in the transport dumpster were automatically resubmitted to the new active copy of the mailbox database. The transport dumpster has been improved in Exchange 2013 and is now called Safety Net.

• Safety Net is a queue that's associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully processed by the server.

• You can specify how long Safety Net stores copies of the successfully processed messages before they expire and are automatically deleted. The default is 2 days.

• Safety Net doesn't require DAGs. For Mailbox servers that don't belong to a DAGs, Safety Net stores copies of the delivered messages on other Mailbox servers in the local Active Directory site.

• Safety Net itself is now redundant, and is no longer a single point of failure. This introduces the concept of the Primary Safety Net and the Shadow Safety Net. If the Primary Safety Net is unavailable for more than 12 hours, resubmit requests become shadow resubmit requests, and messages are re-delivered from the Shadow Safety Net.

• Safety Net takes over some responsibility from shadow redundancy in DAG environments. Shadow redundancy doesn't need to keep another copy of the delivered message in a shadow queue while it waits for the delivered message to replicate to the passive copies of mailbox database on the other Mailbox servers in the DAG. The copy of the delivered message is already stored in Safety Net, so the message can be resubmitted from Safety Net if necessary.

• In Exchange 2013, transport high availability is more than just a best effort for message redundancy. Exchange 2013 attempts to guarantee message redundancy. Because of this, you can't specify a maximum size limit for Safety Net. You can only specify how long Safety Net stores messages before they're automatically deleted.

The Primary Safety Net exists on the Mailbox server that held the primary message before the message was successfully processed by the Transport service. This could mean the message was delivered to the Mailbox Transport service on the destination Mailbox server. Or, the message could have been relayed through the Mailbox server in an Active Directory site that's designated as a hub site on the way to the destination DAG or Active Directory site. After the primary server processes the primary message, the message is moved from the active queue into the Primary Safety Net on the same server.

The Shadow Safety Net exists on the Mailbox server that held the shadow message. After the shadow server determines the primary server has successfully processed the primary message, the shadow server moves the shadow message from the shadow queue into the Shadow Safety Net on the same server. Although it may seem obvious, the existence of the Shadow Safety Net requires shadow redundancy to be enabled, and shadow redundancy is enabled by default in Exchange 2013.

The message is retained in Primary Safety Net and Shadow Safety Net until the message expires based on a configurable timeout value. If a mailbox database failover occurs before the message expires, the Primary Safety Net on Mailbox01 resubmits the message. If the Mailbox01 isn't available, the Shadow Safety Net on Mailbox03 takes over and resubmits the message.

There are two basic Safety Net message resubmission scenarios:

• After the automatic or manual failover of a mailbox database in a DAG.
• After you activate a lagged copy of a mailbox database.

Like message resubmission from Primary Safety Net, message resubmissions from Shadow Safety Net are fully automated, and require no manual intervention. Without optimization, resubmitting messages from Safety Net would result in potentially large numbers of duplicate deliveries. Duplicate deliveries within the Exchange organization aren't a problem, because duplicate message detection prevents mailbox users from seeing duplicate copies of a message. But duplicate message delivery to recipients outside the Exchange organization will result in duplicate copies of messages. Fortunately, the resubmission of messages from Safety Net has been optimized in Exchange 2013 to reduce duplicate message delivery.

Managing database availability groups
Planning for high availability and site resilience
Safety Net
Shadow redundancy
Transport high availability

No comments:

Post a Comment