### Introduction to HA/DR for Azure SQL This cheatsheet summarizes High Availability (HA) and Disaster Recovery (DR) solutions for Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure Virtual Machines, drawing from Chapter 16 of "Administering Microsoft Azure SQL Solutions." HA/DR solutions ensure business continuity by minimizing downtime and data loss in the event of failures. Azure offers various built-in and configurable options to achieve different levels of resilience. ### Active Geo-Replication - **Purpose:** Creates database replicas for DR and read scale-out. - **Definition:** Little "r" replication for continuously synchronizing data sets, not to be confused with big "R" Transactional Replication. - **Replica Placement:** - DR replicas: Must be in different Azure regions from the primary. - Read scale-out: Can be in the same or different regions. - **Configuration Prerequisites:** - Two configured Azure SQL Database Servers (one for primary, one for secondary). - Avoid namespace contention (secondary server cannot have a database with the same name as the primary). - **Deployment Process:** 1. Navigate to Data Management > Replicas blade in the Azure portal. 2. Select "Create replica" to open the deployment blade. 3. **Sections of the deployment blade:** - Project details (subscription, resource group - pre-populated). - Primary database details (name, region - pre-populated). - Geo-secondary database details (subscription, resource group, target region, elastic pool membership, compute/storage provisioning). - **Critical:** Only the replica target server name must be chosen; most other settings can inherit defaults. - Transparent data encryption key management (only if using Bring Your Own Key - BYOK). - Backup storage redundancy (defaults to primary's level, but configurable). - A summary page with monthly cost estimation is shown before deployment. - **T-SQL Deployment:** ```sql ALTER DATABASE ADD SECONDARY ON SERVER ``` - Inherits defaults for most fields except `SecondaryReplicaServerName`. - Option changes are not supported via T-SQL during replica creation but can be modified later via portal or T-SQL. - **Failover:** - **Failover:** Requires active connection to primary for full synchronization. - **Forced Failover:** Can result in lost transactions, used only during true DR when primary access is lost. - Option to stop data replication and leave secondary as a standalone database. ### Auto-failover Groups - **Enhancement over Active Geo-Replication:** Adds group failover control and endpoint abstraction. - **Failover Scope:** Databases fail over as sets; scope is one or more databases replicated between a single source and target server. - **Endpoints:** Each failover group has a read-only and a read-write endpoint, eliminating the need to update application connection strings during failover. - **Coexistence:** Both Active Geo-Replication and Auto-failover Groups exist due to historical deployment and feature overlap. - **Read Replicas:** - **Local read-only replicas:** Accessed via `ApplicationIntent=READONLY` connection string option. - **Global read-only replicas:** Selected by choice of endpoint. - Can use local Logical Server name (endpoint) for non-failing-over read-only access. - **Configuration:** - You do NOT need to configure Active Geo-Replication first; Failover Group configuration creates underlying replicas. - Configuration begins at the **Server level** (Active Geo-Replication starts at Database level). - Access via Data management entry > Failover Groups. - **Failover Policy:** - Microsoft recommends defining a failover policy matching specific availability requirements. - Microsoft-initiated failover depends on underlying region failure, with a minimum one-hour "Grace Period" delay before forcing. ### Configure an Always-On Availability Group (AOAG) - **Context:** Focuses on Azure-specific implementation details for SQL Server on Azure Virtual Machines. - **Endpoint Redirection (Listener):** - **Listener:** Abstracts SQL endpoint connection via a Windows Server Failover Clustering (WSFC) Network Name object. - **On-premises:** Listener IP "floats" between nodes, advertised by ARP. - **Azure Challenge:** Azure does not support "floating" IP addresses or ARP. - **Azure Solutions:** 1. **Basic Load Balancer:** Hosts the "floating" IP address and redirects connections to the correct replica. 2. **Dynamic Network Name (DNN):** Maps a named endpoint to an IP address and port on a host server. - PowerShell only, cannot use port 1433, limited SQL Server service support. 3. **Multiple Fixed IP Addresses:** Allows multiple fixed IPs on the same virtual NIC, enabling cluster nodes in different subnets without a load balancer. This is the recommended approach. - **Azure Template for SQL High Availability:** - Marketplace template simplifies AOAG cluster creation. - **Prerequisites:** - Resource Group - Virtual Network - Domain Controller(s) in the same Virtual Network - Domain User with specific permissions (Create Computer Object, Domain Service Account for SQL Server). - **Process:** Search "SQL Server with High Availability" in Azure Marketplace, then use the template wizard. - The wizard creates a WSFC cluster and an empty Availability Group with a Listener. - **Single-Subnet Azure AOAG Cluster IP Addresses (Example with Load Balancer):** | Name | Purpose | IP (10.90.10.0/24) | |------------|---------------------|--------------------| | SQL01 | SQL Host 1 | 10.90.10.10 | | SQL02 | SQL Host 2 | 10.90.10.11 | | Clus01 | WSCF Cluster Name | 10.90.10.12 | | AOAG01 | AOAG Listener Name | 10.90.10.13 | *A load balancer hosts the actual IP address and redirects external connections.* - **Multi-Subnet Azure AOAG Cluster IP Addresses (Example without Load Balancer):** | Name | Purpose | IP (10.90.10.0/24) | IP (10.90.20.0/24) | |------------|---------------------|--------------------|--------------------| | SQL01 | SQL Host 1 | 10.90.10.10 | NA | | SQL02 | SQL Host 2 | NA | 10.90.20.11 | | Clus01 | WSCF Cluster Name | 10.90.10.12 | 10.90.20.12 | | AOAG01 | AOAG Listener Name | 10.90.10.13 | 10.90.20.13 | *The Cluster Name Artifact shows both IP addresses in an OR configuration.* - **SQL Client Libraries:** Must be SQL 2012 or later compliant for multi-subnet clusters. ### Configure Quorum Options for a Windows Server Failover Cluster - **Quorum:** A voting system to determine cluster actions during failures. A simple majority rules. - **Witness:** A tie-breaker for clusters with an even number of nodes (e.g., two-node cluster) to prevent "split-brain" scenarios. - **Witness Types for Azure SQL VMs:** - **Disk Witness:** Typically used for Failover Clustered Instances where disk arbitration is already configured. - **File Share Witness:** Almost always used for AOAGs (no shared disks). - **Cloud Witness:** A File Share Witness hosted in an Azure Blob Container. Useful for multi-site clusters as it provides an external viewpoint. - WSFC creates a correctly configured storage container for Cloud Witness files. ### Configure Failover Clustered Instances (FCI) on Azure Virtual Machines - **No Marketplace Template:** Unlike AOAGs, there's no direct Marketplace template for FCIs. - **IaaS Extension Limitations:** The IaaS extension does not fully support FCIs; features like automatic backup, storage provisioning, and inventory are unavailable. Manual deployment is required. - **Deployment Approach:** 1. Start with a SQL Marketplace image (even though SQL will be reinstalled). - **Benefits:** Pre-installs IaaS extension (for licensing support), SQL Install media is pre-copied. 2. Provision two (or more) machines. 3. Delete any AOAG artifacts and uninstall all SQL instances. 4. Install SQL from local media. 5. Configure an FCI (similar to on-premises). 6. **Azure-specific step:** Provision shared disks. - Options: Storage Spaces Direct (emulates shared disk), shared Premium, Premium V2, or Ultra disks. - **Configuration Choices:** Load balancer vs. multi-subnet configurations still apply, with multi-subnet being recommended. ### Configure Log Shipping - **Overview:** A fundamental HA/DR technology based on continuous transaction log backups, copies, and restores. - **Mechanism:** Exports transaction logs via backup, copies them, and restores them to a different system. - **SSMS Log Shipping Wizard:** - Found under Tasks by right-clicking the database name in SQL Server Management Studio (SSMS). - **Steps:** 1. Configures log backups (ensures FULL recovery model, sets up log backups to a shared folder). 2. Configures each target server and database (creates job to copy files from primary landing folder to a local folder on target). - Azure storage can be used for landing and copy folders. - Requires proper security access configuration. 3. Configures RESTORE jobs on each target system. - Options: `NORECOVERY` or `STANDBY` mode. - Optional delay between primary and secondary. - The wizard generates a script that can be run on source and target servers. ### Monitor and Troubleshoot HA/DR Solutions - **Importance:** Essential for effective HA/DR; "unmonitored systems are out of control." - **Monitoring Scope:** - Baseline functionality of each participating server. - Data synchronization process (backup, copy, restore for log shipping; transaction log forward/replay for AOAGs). - Utilize HA/DR Dynamic Management Views (DMVs) to measure velocity and latency. - **Troubleshooting Approach:** - HA/DR issues are complex; each element is critical. - **Common Root Causes:** Often due to network, storage, or permissions issues, rather than SQL Server itself. - Maintain a "Growth Mindset" and consider all possibilities.