The lease is used between the SQL Server resource DLL and the SQL Server instance to prevent split-brain from occurring for the availability group (AG).
The lease is a standard signaling mechanism between the SQL Server resource DLL and the SQL Server availability group. The figure below depicts the general flow of the lease.
The lease is only present on the primary replica, making sure the SQL Server and Windows cluster state for the availability group remain synchronized.
The Windows cluster components poll at regular intervals to determine whether the resource IsAlive or LooksAlive. The resource dll must report the state of the resource to the Windows clustering components. For those familiar with the older SQL Server failover cluster instances (FCIs), this was accomplished with generic query execution every ## seconds to see if the server 'looks alive.'
The new lease design removes all the connectivity components and problems associated with that additional overhead and provides a streamlined design to determine if the SQL Server 'looks alive.' The resource dll and the SQL Server instance use the named memory objects, in shared memory, to communicate. The objects are signaled and checked at regular intervals.
The default signaling interval is 1/3 of the configured 'Health Check Timeout' of the availability group.
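As a quick arithmetic check of the one-third rule, here is a minimal Python sketch. The function name is hypothetical, and the 30000 ms input assumes the availability group is using the default HealthCheckTimeout value; substitute whatever your availability group is configured with.

```python
def signal_interval_ms(health_check_timeout_ms: int) -> float:
    """Return the lease signaling interval: one third of the health check timeout."""
    if health_check_timeout_ms <= 0:
        raise ValueError("health check timeout must be positive")
    return health_check_timeout_ms / 3


# Assumed HealthCheckTimeout of 30000 ms (30 seconds).
print(signal_interval_ms(30000))  # 10000.0
```

With a 30 second timeout the objects would be signaled roughly every 10 seconds, so about three exchanges have to be missed before the timeout is exceeded.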
If the HealthCheckTimeout is exceeded without the signal exchange, the lease is declared 'expired' and the SQL Server resource dll reports to the Windows cluster manager that the SQL Server availability group no longer 'looks alive.' The cluster manager undertakes the configured corrective actions. SQL Server prevents further data modifications (avoiding split-brain issues) on the current primary. The cluster manager activity helps select the proper primary location and attempts to bring the availability group online.
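The expiration rule can be pictured with a toy model. This is only a sketch of the idea, not the resource dll's actual logic: the class and method names are hypothetical, times are plain seconds supplied by the caller, and the 30 second timeout is an assumed configuration.

```python
from dataclasses import dataclass


@dataclass
class LeaseMonitor:
    """Toy model of the lease check; not the actual resource DLL logic."""

    health_check_timeout_s: float  # configured timeout, in seconds
    last_signal_s: float = 0.0     # time of the most recent signal exchange

    def renew(self, now_s: float) -> None:
        """Record a successful signal exchange at time now_s."""
        self.last_signal_s = now_s

    def looks_alive(self, now_s: float) -> bool:
        """True while the timeout has not been exceeded since the last signal."""
        return (now_s - self.last_signal_s) <= self.health_check_timeout_s


lease = LeaseMonitor(health_check_timeout_s=30.0)
lease.renew(now_s=10.0)               # signal exchanged at t = 10 s
print(lease.looks_alive(now_s=35.0))  # True: 25 s since the last signal
print(lease.looks_alive(now_s=41.0))  # False: 31 s, the lease has expired
```

Once looks_alive returns False in this model, the real system would report the failure to the cluster manager and stop data modifications on the primary, as described above.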
The following is a sample message from the SQL Server error log when the lease has expired.
Error: 19407, Severity: 16, State: 1. The lease between availability group 'MyAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
AlwaysOn: The local replica of availability group 'MyAG' is going offline because either the lease expired or lease renewal failed. This is an informational message
Looking at the …\MSSQL\LOG\*DIAG*.XEL file for this issue you can see failures reported. Notice the 'Resource Alive result: 0' - The SQL Server resource dll is going to report to the cluster manager that the availability group DOES NOT LOOK ALIVE.
Note: SQL Server Management Studio may adjust time values based on your client's time zone settings.
The matching cluster log can output similar information as well. Notice you have to adjust for UTC time.
000015ec.00002a64::2012/09/06-05:34:56.019 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'warning' to 'clean' at 2012-09-06 06:34:56.017
000015ec.00002a64::2012/09/06-05:35:36.050 WARN [RES] SQL Server Availability Group: [hadrag] Failed to retrieve data column. Return code -1
000015ec.00001a04::2012/09/06-05:35:36.050 ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
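To line up cluster log entries with the SQL Server error log, the UTC adjustment can be scripted. The following is a minimal Python sketch: the helper name is hypothetical, the sample line is abbreviated from the log above, and the UTC+1 offset is an assumption inferred from the first entry (05:34:56 in the log header versus 06:34:56 in the message text). Use your server's actual offset.

```python
from datetime import datetime, timedelta, timezone


def cluster_log_time(line: str) -> datetime:
    """Parse the UTC timestamp that follows '::' in a cluster log line."""
    stamp = line.split("::", 1)[1].split(" ", 1)[0]
    parsed = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Assumption: the server's local time was UTC+1 when the log was written.
server_zone = timezone(timedelta(hours=1))

# Sample entry, abbreviated after the availability group name.
line = ("000015ec.00002a64::2012/09/06-05:34:56.019 INFO [RES] "
        "SQL Server Availability Group: [hadrag] ...")
local = cluster_log_time(line).astimezone(server_zone)
print(local.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])  # 2012-09-06 06:34:56.019
```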
The million dollar question is still why? The answer is 'it depends.' In this instance, SQL Server has encountered a system-level problem and is stuck attempting to allocate memory and generating dump files. Looking at the stacks in the dump, hundreds of threads are stalled attempting to allocate memory, indicating a memory stall or problem on the overall system at some level that is preventing SQL Server from processing work.
Bob Dorr - Principal SQL Server Escalation Engineer