This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services team.


    It has been well over a year since I wrote a series of blog posts about a product called System Center Advisor. You can read these posts at this link

    When Advisor was first released, this cloud service was free for a 60 day trial period but required a Software Assurance contract to use past that.

    Well we have decided that this should be free for everyone. Read more about this announcement at this link:

    For SQL Server, we now have more than 100 rules baked into this service, representing the collective knowledge of CSS SQL engineers worldwide on common customer issues. Have you ever wanted to know what the CSS teams know based on common issues reported by customers? That is what SCA is all about: providing you that knowledge in the form of a cloud-based service.

    Give this a try on your SQL Server and look at the advice that is presented. We specifically baked in rules (called alerts) with the intention of helping you prevent problems before they happen.

    Take a look through my previous blog posts above on this topic for some examples. What is incredibly powerful about this service is:

    • Once you install, it just “runs”
    • You view your alerts through a web portal, so you can do this from anywhere
    • As part of the service we capture configuration change history (i.e. a problem started but what changed?)
    • We keep the rules “fresh” by updating the service each month but you don’t have to do anything. The service automatically pulls in these new rules for you.

    I look forward to any comments you post to this blog regarding your experiences with it.


    Bob Ward


    The data source provider list in PowerPivot is often a source of confusion, because users assume that a provider appearing in the list means the provider is installed and available. Unfortunately, the list is a static list of the data sources PowerPivot supports, so the user must still install the desired provider before importing data into PowerPivot. Thus, the most common fix for a "provider is not installed" error in the import wizard is to make sure the proper data provider is installed and that the installed provider matches the platform architecture (32-bit or 64-bit) of PowerPivot and Excel.

    If you are certain that the selected provider is installed on your client machine and are able to import data directly into Excel using the desired provider via the Data tab, then you may be encountering another issue which was recently discovered.

    In this new scenario data import in PowerPivot will fail for any provider selected. The exact error seen varies depending on the provider selected but examples include:

    Text File:  "Details: Failed to connect to the server. Reason: Provider information is missing from the connection string"

    Excel:  "Cannot connect to the data source because the Excel provider is not installed."

    SQL Server: "Cannot connect to the data source because the SQLServer provider is not installed."


    The cause is actually a problem with the .NET machine configuration. PowerPivot attempts to instantiate providers by using the .NET DbProviderFactory class. If an error is encountered while instantiating the DbProviderFactory class, that error is not returned; instead, the message returned is that the selected provider is not installed. If you are encountering this scenario, it is very likely that there is a problem instantiating the .NET DbProviderFactory class.

    The DbProviderFactory class configuration is read from the Machine.Config.xml file, which depending on whether you are running the 32-bit or 64-bit version of Excel and PowerPivot is located at:




    Checking the Machine.Config.xml file you will find the <DbProviderFactories> element under <>.  The <DbProviderFactories> element should appear only once, but problematic machines may have more than one DbProviderFactories XML tag.

    Example of bad element list:

        <DbProviderFactories>
            <add name="Microsoft SQL Server Compact Data Provider" invariant="System.Data.SqlServerCe.3.5" description=".NET Framework Data Provider for Microsoft SQL Server Compact" type="System.Data.SqlServerCe.SqlCeProviderFactory, System.Data.SqlServerCe, Version=, Culture=neutral, PublicKeyToken=89845dcd8080cc91"/>
        </DbProviderFactories>
        <DbProviderFactories/>


    NOTE: The bad list has a begin and end tag around the add element for the SQL Server Compact provider, followed by a second, empty element tag <DbProviderFactories/>.

    Correct Example:


        <DbProviderFactories>
            <add name="Microsoft SQL Server Compact Data Provider" invariant="System.Data.SqlServerCe.3.5" description=".NET Framework Data Provider for Microsoft SQL Server Compact" type="System.Data.SqlServerCe.SqlCeProviderFactory, System.Data.SqlServerCe, Version=, Culture=neutral, PublicKeyToken=89845dcd8080cc91"/>
        </DbProviderFactories>


    NOTE: The add element(s) between the open <DbProviderFactories> and close </DbProviderFactories> tags will vary depending on what providers are installed on your machine.

    If you find that you have something similar to the bad example above, please use the following steps to resolve the issue:

    1. Make a backup copy of the existing machine.config.xml file in case you need to restore it for any reason.
    2. Open the machine.config.xml file in Notepad or another editor of your choice.
    3. Delete the empty element tag <DbProviderFactories/> from the file.
    4. Save the updated file.
    5. Retry the import from PowerPivot.
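
    Before editing the file by hand, it can help to confirm that the duplicate element is really there. The following is a minimal Python sketch, not part of PowerPivot or .NET, that counts DbProviderFactories elements in a configuration document and lists the empty ones. The sample documents, the <parent> element name and the provider names are placeholders made up for illustration; the real file has more content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down samples. <parent> is a placeholder element name.
BAD_CONFIG = """
<configuration>
  <parent>
    <DbProviderFactories>
      <add name="Example Provider" invariant="Example.Provider" />
    </DbProviderFactories>
    <DbProviderFactories/>
  </parent>
</configuration>
"""

GOOD_CONFIG = """
<configuration>
  <parent>
    <DbProviderFactories>
      <add name="Example Provider" invariant="Example.Provider" />
    </DbProviderFactories>
  </parent>
</configuration>
"""


def provider_factory_sections(config_xml):
    """Return every DbProviderFactories element found in the document."""
    root = ET.fromstring(config_xml)
    # Compare case-insensitively because the tag casing varies between write-ups.
    return [el for el in root.iter() if el.tag.lower() == "dbproviderfactories"]


def count_provider_factory_sections(config_xml):
    """Count the sections; anything other than 1 deserves a closer look."""
    return len(provider_factory_sections(config_xml))


def empty_provider_factory_sections(config_xml):
    """Return the sections that have no child elements (the likely culprit)."""
    return [el for el in provider_factory_sections(config_xml) if len(el) == 0]
```

    To check a real file, read its text and pass it to count_provider_factory_sections; a result greater than 1 matches the bad example above.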


    Wayne Robertson - Sr. Escalation Engineer


    "Denzil and I were working on this issue for a customer and Denzil has been gracious enough to write-up a blog for all of us." – Bob Dorr

    From Denzil:

    I recently worked with a customer on a database restore issue where the database being restored had 2 TB of FILESTREAM data. The restore would not complete successfully and failed with the error below.

    10 percent processed.

    20 percent processed.

    30 percent processed.

    40 percent processed.

    Msg 3634, Level 16, State 1, Line 1

    The operating system returned the error '32(The process cannot access the file because it is being used by another process.)' while attempting 'OpenFile' on 'F:\SQLData11\DataFiles\535cc368-de43-4f03-9a64-f5506a3f532e\547fc3ed-da9f-44e0-9044-12babdb7cde8\00013562-0006edbb-0037'.

    Msg 3013, Level 16, State 1, Line 1

    RESTORE DATABASE is terminating abnormally.

    Subsequent restore attempts would fail with the same error though on "different" files and at a different point in the restore cycle.

    Given that it was "not" the same file or the same point in the restore on each attempt, my thoughts immediately went to some filter driver under the covers wreaking havoc. I ran a command to see which filter drivers were loaded (trimmed output below).

    C:\>fltmc instances

    Filter      Volume Name     Altitude    Instance Name                  Frame
    ----------  --------------  ----------  -----------------------------  -----
    BHDrvx64    F:\SQLData11    365100      BHDrvx64                       0
    eeCtrl      F:\SQLData11    329010      eeCtrl                         0
    SRTSP       F:\SQLData11    329000      SRTSP                          0
    SymEFA      F:\SQLData11    260600      SymEFA                         0
    RsFx0105    \Device\Mup     41001.05    RsFx0105 MiniFilter Instance   0


    SymEFA   = Symantec extended file attributes driver
    SRTSP    = Symantec Endpoint Protection
    RsFx0105 = SQL Server FILESTREAM filter driver

    In discussing this with the customer, Anti-virus exclusions were controlled by GPO so he had put in a request to exclude the respective folders, yet the issue still continued.

    To do my due diligence, the other question was whether we "released" the file handle after we created it, and whether someone else then grabbed it. So we (Venu, Bob and I) did take a look at the code and this can be the case. On SQL Server 2008 R2, when we call the CreateFile API we hardcode the shareAccess parameter to 0, which means exclusive access: while we have the file open, secondary access is prevented.

    If this parameter is zero and CreateFile succeeds, the file or device cannot be shared and cannot be opened again until the handle to the file or device is closed. For more information, see the Remarks section.

    Once the file is created, we release the EX latch and can close the file handle, but sqlservr.exe continues to hold the lock on the file itself during the restore process. Once the restore operation is completed, we no longer hold an exclusive lock on the file.

    We can reopen file handles during the Recovery process so the other thought was perhaps it was a transaction affected by recovery and GC and potentially some race condition but in this case we know that the restore was failing prior to that as it didn't reach 100% so that could be ruled out as well.

    Getting a dump at the failure time showed me the same restore stack, but different dumps showed different files in question, so it wasn't a particular log record sequence per se causing this.



    Given that it was now unlikely to be SQL Server, I concentrated on the filter driver theory. I tried to capture a Process Monitor trace, but given the time it took and the number of files touched, Process Monitor was not all that useful. I couldn't filter on a specific folder, as the restore failed on different folders and there were 10+ mount points involved.

    However, from Process Monitor while the restore was going on, I looked at the stack for some I/O operations (not ones that failed by any means) and I still saw fltmgr.sys sitting there for an OpenFile call on a file in the FILESTREAM directory:

    fltmgr.sys + 0x2765
    fltmgr.sys + 0x424c
    fltmgr.sys + 0x1f256
    ntoskrnl.exe + 0x2c8949
    ntoskrnl.exe + 0x2c0e42
    ntoskrnl.exe + 0x2c19d5
    ntoskrnl.exe + 0x2c6fb7
    ntoskrnl.exe + 0x2b61a8
    ntoskrnl.exe + 0x57573
    ntdll.dll + 0x471aa
        C:\Windows\System32\ntdll.dll     ZwOpenFile
    kernel32.dll + 0x10d48
    kernel32.dll + 0x10a7c
    _____SQL______Process______Available + 0x695c7e
    _____SQL______Process______Available + 0x6d6898
        ParseContainerPath
    _____SQL______Process______Available + 0x6d714a




    Looking at other Symantec-related issues, I also found an article, not necessarily to do with SQL restores, showing that this was a possibility. Again, that article covers a specific issue on a specific build, but it illustrates that filter drivers can cause unexpected behaviors.

    As far as Anti-virus exclusions go, we actually have guidance in the article below:

    And also in our File stream best practices article:

    When you set up FILESTREAM storage volumes, consider the following guidelines:

    • Turn off short file names on FILESTREAM computer systems. Short file names take significantly longer to create. To disable short file names, use the Windows fsutil utility.
    • Regularly defragment FILESTREAM computer systems.
    • Use 64-KB NTFS clusters. Compressed volumes must be set to 4-KB NTFS clusters.
    • Disable indexing on FILESTREAM volumes and set disablelastaccess. To set disablelastaccess, use the Windows fsutil utility.
    • Disable antivirus scanning of FILESTREAM volumes when it is not necessary. If antivirus scanning is necessary, avoid setting policies that will automatically delete offending files.
    • Set up and tune the RAID level for fault tolerance and the performance that is required by an application.

    I looked at another run of the "fltmc instances" command output and still saw the antivirus components in the list for those mount points. Given that we "thought" we had put an exclusion in for the whole drive, and they were still showing up, it was time to look at this more closely:

    1. Excluded the drives where the data was being stored – Restore still failed
    2. Stopped the AV Services - Restore still failed
    3. Uninstalled Anti-virus – Restore now succeeded

    Voila, once we uninstalled AV on this machine, the restore succeeded. The customer is broaching this with the AV vendor to figure out more of the root cause.


    Denzil Ribeiro – Senior PFE


    I was asked a very specific question about the SQLIOSIM.exe checksum validation logic.  It is pretty simple logic (on purpose) but effective, so here are the basics.

    The key is that there are multiple memory locations used to hold the data and do the comparison.



    1. Allocate a buffer in memory of 8K.
       • Stamp the page with random data using Crypto Random function(s)
       • Save Page Id, File Id, Random seed and calculated checksum values in the header of the page

    2. Send the page to stable media (async I/O).
       • Check for proper write completion success

    3. Sometime after successful write(s), allocate another buffer and read the data from stable media.
       (Note: This is a separate buffer from that of the write call)

    4. Validate the bytes read.
       • Do header checks for file id, page id, seed and checksum values

    Expected CheckSum:   0xEFC6D39C   ---------- Checksum stored in the WRITE buffer
    Received CheckSum:   0xEFC6D39C   ---------- Checksum stored in the READ buffer (what stable media is returning)
    Calculated CheckSum: 0xFBD2A468   ---------- Checksum as calculated on the READ buffer

    The detailed (.TXT) output file(s) show the WRITE image, the READ image and the DIFFERENCES found between them  (think memcmp).   When only a single buffer image is added to the detailed TXT file this indicates that the received header data was damaged or the WRITE buffer is no longer present in memory so only the on disk checksum vs calculated checksum are being compared.

    If there appears to be damage, SQLIOSim will attempt to read the same data 15 more times and validate before triggering the error condition.   Studies from SQL Server and Exchange showed success of read-retries in some situations.  SQL Server and Exchange will perform up to 4 read-retries in the same situation.

    The window for damage possibilities is from the time the checksum is calculated to the time the read is validated.   While this could be SQLIOSim, the historical evidence shows this is a low probability.   The majority of the time is in kernel and I/O path components, and the majority of bugs over the last 8 years have been non-SQL related.

    For vendor debugging the detailed TXT file contains the various page images as well as the sequence of Win32 API calls, thread ids and other information.   Using techniques such as a bus analyzer or detailed I/O tracing the vendor can assist at pin-pointing the component causing the damage.

    The top of the 8K page header is currently the following (Note: This may change with future versions of SQLIOSim.exe)

    DWORD   Page Number                 (Take * 8192 for file offset)
    DWORD   File Id
    DWORD   Seed value
    DWORD   Checksum CRC value
    BYTE    Data[8192 – size(HEADER)]   <---------  Checksum protects this data
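
    The steps and header layout above can be sketched in a few lines of Python. This is only an illustration of the idea, not SQLIOSim's code: the function names are made up, zlib.crc32 stands in for whatever checksum the tool really uses, the I/O is synchronous rather than async, and the operating system cache may satisfy the read instead of the stable media.

```python
import os
import struct
import zlib

PAGE_SIZE = 8192
# Four little-endian DWORDs: page number, file id, seed, checksum.
HEADER = struct.Struct("<IIII")


def build_page(page_number, file_id):
    """Stamp a page with random data and store its checksum in the header."""
    seed = int.from_bytes(os.urandom(4), "little")
    data = os.urandom(PAGE_SIZE - HEADER.size)
    checksum = zlib.crc32(data)  # assumption: CRC32 as a stand-in checksum
    return HEADER.pack(page_number, file_id, seed, checksum) + data


def write_then_read(path, page):
    """Write the page at page_number * 8192, then read it into a new buffer."""
    offset = HEADER.unpack_from(page)[0] * PAGE_SIZE
    with open(path, "wb") as f:
        f.seek(offset)
        f.write(page)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(PAGE_SIZE)  # separate buffer from the write buffer


def validate_page(write_buffer, read_buffer):
    """Compare the WRITE and READ buffers; return a list of problems found."""
    problems = []
    w_page, w_file, w_seed, expected = HEADER.unpack_from(write_buffer)
    r_page, r_file, r_seed, received = HEADER.unpack_from(read_buffer)
    calculated = zlib.crc32(read_buffer[HEADER.size:])
    if (r_page, r_file, r_seed) != (w_page, w_file, w_seed):
        problems.append("header mismatch")
    if received != expected:
        problems.append("stored checksum differs")
    if calculated != received:
        problems.append("calculated checksum differs from stored")
    return problems
```

    In the example output shown earlier, the Expected and Received values match while the Calculated value differs, which in this sketch corresponds to only the last check failing: the header survived but the data bytes behind it changed.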

    Bob Dorr - Principal SQL Server Escalation Engineer



    I keep running into the question: “When will my secondary allow automatic failover?”   Based on the question I did some extended research and I will try to summarize it in this blog post.  I don’t want to turn this post into a novel so I am going to take some liberties and assume you have read the SQL Server Books Online topics related to Always On failover.

    The easy answer:  Only when the secondary is marked SYNCHRONIZED.  - End of blog right? – not quite!

    At a 10,000 foot level that statement is easy enough to understand, but the real issue is understanding what constitutes SYNCHRONIZED.   There are several state machines that determine the NOT SYNCHRONIZED vs SYNCHRONIZED state.  These states are maintained using multiple worker threads and at different locations to keep the system functional and fast.

    • Secondary Connection State
    • End Of Log Reached State

    To understand these states I need to discuss a few concepts to make sure we are all on the same page.

    Not Replicating Commits – Log Blocks Are The Replication Unit

    The first concept is to remember SQL Server does not ship transactions. It ships log blocks. 

    The design is not really different from a stand alone server.  On a stand alone server a commit transaction issues (FlushToLSN/StartLogFlush) to make sure all LSNs up to and including the commit LSN are flushed.    This causes the commit to block the session, waiting for the log manager to indicate that all blocks of the log have been properly flushed to stable media.  Once the LSN has been reached any pending transaction(s) can be signaled to continue.

    Let’s use the diagram on the left for discussion.   The ODD LSNs are from Session 1 and the EVEN LSNs are from Session 2.

    The Log Block is a contiguous chunk of memory (often 64K and disk sector size aligned), maintained by the Log Manager.  Each database has multiple log blocks maintained in LSN order.  As multiple workers are processing they can use various portions of the log block, as shown here.

    To make this efficient a worker requests space in the block to store its record.  This request returns the current location in the log block, increments the next write position in the log block (to be used by the next caller) and acquires a reference count.   This makes the allocation of space for a log record only a few CPU instructions.  The storage position movement is thread safe and the reference count is used to determine when the log block can be closed out.

    In general, closing out a log block means all the space has been reserved and new space is being handed out for another log block.  When all references are released the block can be compressed, encrypted, … and flushed to disk.   
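
    The reservation scheme described above can be modeled with a small Python sketch. It is a toy model, not the Log Manager's implementation: the class and method names are invented for illustration, and a lock stands in for the handful of CPU instructions the real code uses to move the write position.

```python
import threading


class LogBlock:
    """Toy model of a log block that hands out space and counts references."""

    def __init__(self, capacity=64 * 1024):
        self.capacity = capacity
        self._next_write = 0   # next free position in the block
        self._references = 0   # workers still copying their records
        self._closed = False   # True once no more space is handed out
        self._lock = threading.Lock()

    def reserve(self, size):
        """Return the offset for a record, or None if the block is closed out."""
        with self._lock:
            if self._closed or self._next_write + size > self.capacity:
                # Not enough room: close the block so a new one is used.
                self._closed = True
                return None
            offset = self._next_write
            self._next_write += size
            self._references += 1
            return offset

    def release(self):
        """Called by a worker after it has finished writing its record."""
        with self._lock:
            self._references -= 1

    def close(self):
        """Close out early, for example when a commit forces a flush."""
        with self._lock:
            self._closed = True

    def ready_to_flush(self):
        """A block can be flushed once it is closed and has no references."""
        with self._lock:
            return self._closed and self._references == 0
```

    A worker calls reserve, copies its record at the returned offset and then calls release; the block becomes eligible for flushing only after it is closed and every reservation has been released.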

    Note:  A commit transaction (FlushToLSN/StartLogFlush) can trigger similar behavior, even when the block is not full, so a commit transaction does not have to wait for the block to become full.   Reference: 

    In this example both commits would be waiting on the log block to be written to stable media.

    Session 1 – FlushToLSN (05)
    Session 2 – FlushToLSN (06)

    The log writer’s completion routine is invoked when the I/O completes for the block.   The completion routine checks for errors and, when successful, signals any sessions waiting on an LSN <= 6.   In this case both session 1 and 2 are signaled to continue processing.

    Write Log Waits accumulate during this wait for the flush activities.   You can read more about write log waits at:


    I had an e-mail discussion where the individual was thinking we only shipped committed transactions.  Not true (for Always On or Database Mirroring).  If we only shipped committed transactions it would require a different set of log blocks for EACH transaction.  This would add terrible, performance-impacting overhead.   It would also be extremely difficult to handle changes on the same page.   If SQL Server doesn’t have the ACID series of log records, how would SQL Server ever be able to run recovery, both redo and undo?   Throw in row versioning and shipping just the committed log records becomes very cumbersome.

    We don’t ship log records, we ship log blocks and optimize the recovery needs.


    Parallel Flushing / Hardening

    Always On is a bit different than database mirroring (DBM) with respect to sending the log blocks to the secondary replica(s).   DBM flushes the log block to disk and once completed locally, sends the block to the secondary.

    Always On changed this to flush the block(s) in parallel.  In fact, a secondary could have hardened log block(s) before the primary I/O completes.    This design increases performance and narrows the NOT IN SYNC window(s).

    SQL Server uses an internal, callback mechanism with the log manager.   When a log block is ready to be flushed (fully formatted and ready to write to disk) the notification callbacks are fired.   A callback you might expect is Always On.   These notifications start processing in parallel with the actual flushing of the log to the local (LDF) stable media.


    As the diagram shows, the race is on.  One worker (log writer) is flushing to the local media and the secondary consumer is reading new blocks and flushing on the secondary.   A stall in the I/O on the primary can allow the secondary to flush before the primary just as a delay on the secondary could cause the primary to flush the I/O before the secondary.

    My first reaction to this was: oh no, not in sync, this is bad.   However, the SQL Server developers didn’t stop at this juncture; Always On is built to handle this situation from the ground up.

    Not shown in the diagram are the progress messages.   The secondary sends messages to the primary indicating the hardened LSN level.   The primary uses that information to help determine synchronization state.   Again, these messages execute in parallel to the actual log block shipping activities. 

    Cluster Registry Key for the Availability Group

    The cluster, AG resource is the central location used to maintain the synchronization states.   Each secondary has information stored in the AG resource key (binary blob) indicating information about the current LSN levels, synchronization state and other details.   This registry key is already replicated, atomically across the cluster so as long as we use the registry at the front of our WAL protocol design the AG state is maintained.

    Note:  We don’t update the registry for every transaction.  In fact, it is seldom updated, only at required state changes.  What I mean by WAL protocol here is that the registry is atomically updated before further action is taken on the database so the actions taken in the database are in sync with the registry across the cluster.

    Secondary Connection State (Key to Synchronized State)

    The design of Always On is a pull, not a push model.  The primary does NOT connect to the secondary, the secondary must connect to the primary and ask for log blocks.

    Whenever the secondary is NOT connected the cluster registry is immediately updated to NOT SYNCHRONIZED.  Think of it this way.  If we can’t communicate with the secondary we are unable to guarantee the state remains synchronized, and we protect the system by marking it NOT SYNCHRONIZED.

    Primary Database Startup

    Whenever a database is taken offline/shutdown the secondary connections are closed.   When the database is started we immediately set the state of the secondary to NOT SYNCHRONIZED and then recover the database on the primary.  Once recovery has completed the secondary(s) are allowed to connect and start the log scanning activity.

    Note: There is an XEvent session, definition included at the end of this blog, that you can use to track several of the state changes.


    Once the secondary is connected it asks (pull) for a log scan to begin.   As the XEvents show, you can see the states change for the secondary scanner on the primary.

    Uninitialized: The secondary has connected to SQL Server but it has not sent LSN information yet.
    WaitForWatermark: Waiting for the secondary to reconcile the hardened log LSN position on the secondary with the cluster key and recovery information.   The secondary will send its end-of-log (EOL) LSN to the primary.
    SendingLog: The primary has received the end-of-log position from the secondary so it can send log from the specified LSN on the primary to the secondary.

    Note:  None of these states alone dictate that the secondary is IN SYNC.   The secondary is still marked as NOT SYNCHRONIZED in the cluster registry.


    Hardened Log On Secondary

    You will notice the 3rd column is indicating the commit, harden policy.  The harden policy indicates how a commit transaction should act on the primary database.

    DoNothing: There is no active ‘SendingLog’ so the commits on the primary don’t wait for acknowledgement from the secondary.  There is no secondary connected so it can’t wait for an acknowledgement even if it wanted to.

    The state of the secondary must remain NOT SYNCHRONIZED as the primary is allowed to continue.

    I tell people this is why it is called HADR and not DRHA.  High Availability (HA) is the primary goal so if a secondary is not connected the primary is allowed to continue processing.   While this does put the installation in danger of data loss it allows production uptime and alternate backup strategies to compensate.

    Delay: When a secondary is not caught up to the primary end-of-log (EOL) the transaction commits are held for a short delay period (sleep), helping the secondary catch up.  This is directly seen while the secondary is connected and catching up (SYNCHRONIZING).

    WaitForHarden: As mentioned earlier the secondary sends progress messages to the primary.   When the primary detects that the secondary has caught up to the end of the log the harden policy is changed to WaitForHarden.

    SYNCHRONIZING – DMVs will show synchronizing state until the end-of-log (EOL) is reached.  Think of it as a catch up phase.   You can’t be synchronizing unless you are connected.

    SYNCHRONIZED – This is the point at which the secondary is marked as SYNCHRONIZED.  (Secondary is connected and known to have achieved log block hardening with the primary EOL point.)
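
    As a rough summary of the three harden policies and the state each one implies, here is a minimal Python sketch. The function and parameter names are hypothetical, and the real logic is a set of state machines spread over several worker threads; this sketch only captures how the policy is chosen while a secondary connects and catches up.

```python
from enum import Enum


class HardenPolicy(Enum):
    DO_NOTHING = "DoNothing"
    DELAY = "Delay"
    WAIT_FOR_HARDEN = "WaitForHarden"


def harden_policy(sending_log, secondary_hardened_lsn, primary_eol_lsn):
    """Pick the commit harden policy for a secondary (simplified model)."""
    if not sending_log:
        # No active log scan: commits on the primary do not wait.
        return HardenPolicy.DO_NOTHING
    if secondary_hardened_lsn < primary_eol_lsn:
        # Secondary is connected but still catching up.
        return HardenPolicy.DELAY
    # Secondary has reached the primary's end of log.
    return HardenPolicy.WAIT_FOR_HARDEN


def synchronization_state(policy):
    """Map the policy to the state described in the text."""
    return {
        HardenPolicy.DO_NOTHING: "NOT SYNCHRONIZED",
        HardenPolicy.DELAY: "SYNCHRONIZING",
        HardenPolicy.WAIT_FOR_HARDEN: "SYNCHRONIZED",
    }[policy]
```

    For example, a connected secondary hardened to LSN 4 while the primary end of log is at 6 gets the Delay policy and shows as SYNCHRONIZING; once it reports LSN 6 the policy becomes WaitForHarden.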


    From this point forward all transactions have to wait for the primary (log writer) and secondary to advance the LSN flushes to the desired harden location.

    Going back to the first example, Session 2 waits for all LSNs up to and including 06 to be hardened.   When involving the synchronous replica this is a wait for LSNs up to 06 to be hardened on the primary and the secondary.   Until the progress of both the primary and secondary achieve LSN 06 the committing session is held (Wait For Log Flush.)
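
    The commit wait can be pictured as a comparison against the lower of the two hardened LSNs. The following Python sketch uses hypothetical names and plain integers for LSNs; it only illustrates the rule that a session is signaled once every replica it must wait for has hardened up to its commit LSN.

```python
def hardened_point(primary_lsn, secondary_lsn, wait_for_secondary):
    """LSN up to which commits may be acknowledged."""
    if wait_for_secondary:
        # Synchronous commit: both sides must have hardened the log.
        return min(primary_lsn, secondary_lsn)
    return primary_lsn


def sessions_to_signal(waiting, primary_lsn, secondary_lsn, wait_for_secondary):
    """Return the sessions whose commit LSN has been reached.

    waiting maps a session name to the LSN passed to FlushToLSN.
    """
    point = hardened_point(primary_lsn, secondary_lsn, wait_for_secondary)
    return sorted(name for name, lsn in waiting.items() if lsn <= point)
```

    With Session 1 waiting on LSN 05 and Session 2 on LSN 06, a primary hardened to 06 and a secondary hardened to 05 releases only Session 1; Session 2 continues to wait until the secondary reports 06 as well.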

    Clean vs Hard Unexpected Database Shutdowns

    When you think about database shutdown there are 2 main scenarios, clean and unexpected (hard).   When a clean shutdown occurs the primary does not change the synchronized state in the cluster registry.  Whatever the current synchronization state is at the time the shutdown was issued remains sticky.   This allows clean failovers, AG moves and other maintenance operations to occur cleanly.

    An unexpected shutdown can’t change the state if the unexpected action occurs at the service level (SQL Server process terminated, power outage, etc.).   However, if the database is taken offline for some reason (for example, log writes start failing), the connections to the secondary(s) are terminated, and terminating the connection immediately updates the cluster registry to NOT SYNCHRONIZED.  A failure to write to the log (LDF) could be caused by something as simple as an administrator incorrectly removing a mount point.  Adding the mount point back to the system and restarting the database restores the system quickly.


    Now I started running scenarios on my white board.   I think a few of these are applicable to this post to help solidify understanding.

    In Synchronized State
    Primary flushed LSN but not flushed on Secondary
    • Primary experiences power outage.
    • Automatic failover is allowed.
    • The primary has flushed log records that the secondary doesn’t have and the commits are being held.
    • Secondary will become the new primary and recovers.
    • When old primary is restarted the StartScan/WaitForHarden logic will roll back to the same location as the new primary, effectively ignoring the commits flushed but never acknowledged.

    In Synchronized State
    Secondary flushed LSN but not flushed on Primary
    • Primary experiences power outage.
    • Automatic failover is allowed.
    • Secondary has committed log records that primary doesn’t have.
    • Secondary becomes new primary and recovers.
    • When old primary is restarted the StartScan/WaitForHarden logic will detect a catch up is required.

      Note:  This scenario is no different than a synchronized secondary that has not hardened as much as the primary.  If the primary is restarted on the same node, upon connect, the secondary will start the catch up activity (SYNCHRONIZING), get back to end of log parity and return to SYNCHRONIZED state.

    The first reaction when I draw this out for my peers is, we are losing transactions.  Really we are not.  We never acknowledge the transaction until the primary and secondary indicate the log has been hardened to the LSN at both locations.

    If you take the very same scenarios to a stand alone environment you have the same timing situations.   The power outage could happen right after the log is hardened but before the client is sent the acknowledgement.   It looks like a connection drop to the client and upon restart of the database the committed transaction is redone/present.   In contrast, the flush may not have completed when the power outage occurred so the transaction would be rolled back.   In neither case did the client receive an acknowledgement of success or failure for the commit.


    Going back to the intent of this blog, only when the cluster registry has the automatic, targeted secondary, marked SYNCHRONIZED is automatic failover allowed.   You can throw all kinds of other scenarios at this but as soon as you drop the connection (restart the log scan request, …) the registry is marked NOT SYNCHRONIZED and it won’t be marked SYNCHRONIZED again until the end-of-log (EOL) sync point is reached.

    Many customers have experienced a failure to allow failover because they stopped the secondary and then tried a move.  They assumed that because they no longer had transaction activity on the primary it was safe.   Not true, as ghost, checkpoint and other processes can still be adding log records.   As soon as you stop the secondary, by definition you no longer have HA, so the primary marks the secondary NOT SYNCHRONIZED.

    Automatic failover is possible as long as the secondary is in the SYNCHRONIZED state and the AG failover detection can use proper cluster resource offline behaviors, whether SQL Server is shut down cleanly or terminated harshly.  If SQL Server is not shut down but a database is taken offline, the state is updated to NOT SYNCHRONIZED.
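
    The rules from the last few sections can be condensed into a small state tracker. This Python sketch is only a reading aid: the event names are hypothetical labels for the situations described in this post, and the real state lives in the cluster registry rather than in a Python variable.

```python
NOT_SYNCHRONIZED = "NOT SYNCHRONIZED"
SYNCHRONIZED = "SYNCHRONIZED"

# Events that force the registry state back to NOT SYNCHRONIZED.
RESET_EVENTS = {"secondary_disconnected", "database_offline", "database_started"}
# Events that leave the registry state as it was (the state is sticky).
STICKY_EVENTS = {"clean_shutdown", "process_terminated", "power_outage"}


def next_state(state, event):
    """Apply one event to the recorded synchronization state."""
    if event == "end_of_log_reached":
        return SYNCHRONIZED
    if event in RESET_EVENTS:
        return NOT_SYNCHRONIZED
    if event in STICKY_EVENTS:
        return state
    raise ValueError("unknown event: " + event)


def automatic_failover_allowed(events):
    """Replay events and report whether the target ends up SYNCHRONIZED."""
    state = NOT_SYNCHRONIZED
    for event in events:
        state = next_state(state, event)
    return state == SYNCHRONIZED
```

    Replaying database_started, end_of_log_reached and power_outage leaves the target SYNCHRONIZED, so automatic failover is allowed; adding secondary_disconnected or database_offline before the outage does not.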

    Single Failover Target

    Remember that you can only have a single, automatic failover target.  To help your HA capabilities you may want to setup a second, synchronous replica.  While it can’t be the target of automatic failover it could help High Availability (HA).  

    For example, the automatic failover, secondary target machine has a power outage.   Connection on primary is no longer valid so the secondary is marked NOT SYNCHRONIZED.   The alternate synchronous, replica can still be SYNCHRONIZED and a target for a manual move WITHOUT DATA LOSS.   The automatic failover target, in this example, is only a move WITH ALLOW DATA LOSS target.

    Don’t forget that to enable true HA for this example the replica(s) should have redundant hardware.  Second network cards, cabling and such.  If you use the same network and a networking problem arises the connections on the primary are dropped and that immediately marks the replica(s) NOT SYNCHRONIZED.

    Resolving State

    Most of the time the question addressed in this post comes up because the secondary is NOT becoming the primary and is in the RESOLVING state.   Looking at the state changes leading up to the issue, the secondary was in SYNCHRONIZING.  When the primary goes down the secondary knows it was not SYNCHRONIZED.  The secondary is attempting to connect to a primary, and the primary is down, so the state is RESOLVING.


    Customizing Failover – All Kinds of Options

    A secondary question that always follows this main question is:  “If a disk fails on my database, within an AG why does automatic failover not occur?”

    The short answer is that the secondary connections are dropped during database shutdown – NOT SYNCHRONIZED.  (SQL Server development is looking into keeping the SYNCHRONIZED state in this situation instead of forcing NOT SYNCHRONIZED in vNext, opening up the window for automatic failover possibilities.)

    The other part of the answer is that the built-in, failover logic is not designed to detect a single database failure.   If you look at the failure conditions in SQL Server Books Online none of these are database level detections.

    I was part of the work we did to enhance the failover diagnostics and decision conditions/levels.  We specifically considered the custom solution needs.  We evaluated dozens of scenarios, then ranked and targeted those conditions safe for the broad customer base using Always On.   This design specifically involved allowing any customer to extend the logic for their specific business needs.  We made sure the mechanisms that SQL Server and the resource DLL use are publicly consumable interfaces and are documented in SQL Server Books Online.

    Note:  All of the following can be done with PowerShell.


    For XEvents you can use the XEvent Linq Reader and monitor a live feed from the SQL Server.   The easiest way to accomplish this would be to set up a SQL Agent job (continuously running, so if the process exits it restarts itself) which launches a C# executable or PowerShell script.

    • The job can make sure it is only starting the executable on the primary server.
    • The executable can make sure the proper XEvent sessions are running (these sessions can even be defined to startup during SQL Server, service startup).
    • The executable can monitor the stream of events for the custom trigger points you consider critical to your business needs and, when the parameters fall outside the desired boundary(s), issue the Cluster command to MOVE the AG to another node.
    • The XEvent session can also write to a file (.XEL) so the system has history of the event stream as well.

    Note: The executable should be drop connection resilient.   The design of the XEvent live stream is to terminate the connection for the stream if the server detects the event stream is stalled (client not processing events fast enough.)   This means the client needs to detect the connection failure and reset.   This usually means actions are posted to a worker thread in the application and the main reader only accepts the events and hands them to background tasks.
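    The reader/worker split described in this note can be sketched in Python. This is only an illustration of the pattern: `fake_event_stream` is a hypothetical stand-in for the live feed (a real client would use the XEvent Linq reader from C# or PowerShell), and the function names and the `max_reconnects` parameter are assumptions.

```python
import queue
import threading


def fake_event_stream(events, drop_after=None):
    """Hypothetical stand-in for a live XEvent feed. Optionally raises
    ConnectionError after `drop_after` events to mimic the server
    terminating a stalled stream."""
    for count, event in enumerate(events):
        if drop_after is not None and count == drop_after:
            raise ConnectionError("event stream terminated by server")
        yield event


def worker(pending, handled):
    # Background task: slow work (evaluating triggers, issuing a move)
    # happens here, never on the reader.
    while True:
        event = pending.get()
        if event is None:          # sentinel: the reader is finished
            break
        handled.append(event)


def read_with_reconnect(connect, pending, max_reconnects=3):
    """Main reader: only accepts events and hands them off. On a dropped
    connection it reconnects instead of exiting."""
    reconnects = 0
    while True:
        try:
            for event in connect():
                pending.put(event)
            return reconnects      # stream ended normally
        except ConnectionError:
            reconnects += 1
            if reconnects > max_reconnects:
                raise


attempts = []


def connect():
    # First connection drops after two events; the second delivers the rest.
    attempts.append(1)
    if len(attempts) == 1:
        return fake_event_stream(["e1", "e2", "e3"], drop_after=2)
    return fake_event_stream(["e3", "e4"])


pending, handled = queue.Queue(), []
background = threading.Thread(target=worker, args=(pending, handled))
background.start()
reconnects = read_with_reconnect(connect, pending)
pending.put(None)
background.join()
print(reconnects, handled)   # 1 ['e1', 'e2', 'e3', 'e4']
```

    Keeping the reader loop this thin is what stops the server from seeing a stalled client: the only work done per event is a queue insert.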


    sp_server_diagnostics

    This was specifically designed to flow across a T-SQL connection (TDS) so anyone using a SQL Server client (.NET, ODBC, OLEDB, …) can execute the procedure and process the results.   You don’t want dozens of these running on the SQL Server but you could easily monitor this stream as well and take any custom actions necessary.

    Note:  The I/O result row is NOT used by the SQL Server resource DLL to make failover decisions.  It is used for logging purposes only.   It is not a safe assumption that an I/O stall would be resolved by a failover of the system or even a restart of the service.  We have many examples of virus scanners and similar components that can cause this issue, and it would lead to a Ping-Pong among nodes if we triggered automated failover.

    DMVs and Policy Based Management (PBM)

    In most cases it will be more efficient to setup an XEvent to monitor various aspects of the system.  (Specific errors, database status changes, AG status changes, ….).   However, the DMVs are also useful and a great safety net.  We use many of the DMVs and the PBM rules to drive the Always On dashboard.    You can create your own policies and execute them as well as using the XEvent predicates to limit the events produced.

    Between some DMV queries and the policies you can easily detect things like corruption errors occurring, loss of a drive, etc…      

    External Factor Detections

    Using PowerShell and WMI you can query information about the machine.  For example you can check each drive for reported failure conditions, such as too many sector remaps or temperature problems.   When detected you can take preemptive action to move the AG and pause the node, marking it for proper maintenance.

    $a = get-wmiobject win32_DiskDrive
    $a[0] | get-member

    Loss of LDF  (No Automatic Failover)

    A specific tenet of Always On is: protect the data; don’t automate things that can lead to data loss.

    The scenario is a mount point, used to hold the LDF, is mistakenly removed from the primary node.   This causes the SQL Server database to become suspect, missing log file but does not trigger automatic failover.  

    If the mount point can simply be added back to the node the database can be brought back online and business continues as usual, no data loss.   If we had forced failover (ALLOW DATA LOSS) it could have led to data loss for a situation that the administrators could have cleanly resolved.

    When the secondary drops a connection (loss of network, database LDF is damaged, …) the state is updated to ‘not synchronized’, preventing automatic failover.   We are careful because allowing anything else may lead to split brain and other such scenarios that cause data loss.  Furthermore, if you change a primary to a secondary it goes into recovery state and at that point if we had serious damage and needed to recover the data it is much more difficult to access the database.

    A situation like this requires a business decision.  Can the issue be quickly resolved or does it require a failover with allow data loss?  

    To help in preventing data loss the replicas are marked suspended.  As described in the following link you can use a snapshot database, before resuming, to capture the changes that will be lost.   Then using T-SQL queries and facilities such as TableDiff one can determine the best reconciliation.

    Also reference:

    Note: You want to make sure the snapshot has a short life span to avoid the additional overhead for a long period of time and the fact that it can hold up other operations, such as File Stream garbage collection actions.

    One could build additional monitoring to:

    • Make sure primary was marked suspended
    • Force the failover with allow data loss
    • Create snapshot on OLD primary
    • Resume OLD primary as a new secondary

    Then take appropriate business steps to use the data in the snapshot to determine what the data loss would/could be.    This is likely to involve a custom, data resolver design (much like the custom conflict resolution options of database replication) to determine how the data should be resolved.

    Don’t Kill SQL

    Killing SQL Server is a dangerous practice.    It is highly unlikely, but I can never rule out that it may introduce unwanted behavior, such as when SQL Server is attempting to update the cluster registry key, leaving the key corrupted.   A corrupted registry key blob for the Availability Group (AG) would then render every replica of the AG damaged because the AG configuration is damaged, not the data!   You would then have to carefully drop and recreate the AG in a way that did not require you to rebuild the actual databases but instead allows the cluster configuration to be corrected.  It is only a few-minute operation to fix once discovered, but it means immediate downtime and is usually a panic-stricken situation.

    SQL Server is designed to handle power outages and is tested well to accommodate this.  Kill is a bit like simulating a power outage and not something Microsoft would recommend as a business practice.  Instead you should be using something like PowerShell and issuing a ‘move’ of the availability group in a clean and designed way.

    Example: (Move-ClusterResource)  

    XEvent Session


    ADD EVENT sqlserver.hadr_db_commit_mgr_set_policy(ACTION(package0.callstack,sqlserver.database_name)),
    ADD EVENT sqlserver.hadr_db_commit_mgr_update_harden(ACTION(package0.callstack,sqlserver.database_name)),
    ADD EVENT sqlserver.hadr_db_partner_set_sync_state(ACTION(package0.callstack,sqlserver.database_name)),
    ADD EVENT sqlserver.hadr_db_manager_state(ACTION(package0.callstack,sqlserver.database_name)),





    Bob Dorr - Principal SQL Server Escalation Engineer

    The distinction between these two wait types is subtle but very helpful in tuning your Always On environment.

    The committing of a transaction means the log block must be written locally as well as remotely for synchronous replicas.   When in synchronized state this involves specific waits for both the local and remote, log block, harden operations.

    HADR_SYNC_COMMIT = Waiting on response from remote replica that the log block has been hardened.  This does not mean the remote redo has occurred but instead that the log block has been successfully stored on stable media at the remote replica.  You can watch the remote response behavior using the XEvent: hadr_db_commit_mgr_update_harden.

    WRITELOG = Waiting on local I/O to complete for the specified log block.

    The design puts the local and remote log block writes in motion at the same time (async) and then waits for their completion.   The wait order is 1) remote replica(s) and 2) the local log.

    The HADR_SYNC_COMMIT is usually the longer of the waits because it involves shipping the log block to the replica, writing to stable media on the replica and getting a response back.   By waiting on the longer operation first the wait for the local write is often avoided. 

    Once the response is received any wait on the local (primary), log (WRITELOG) occurs as necessary.

    Accumulation of HADR_SYNC_COMMIT wait time is the remote activity and you should look at the network and log flushing activities on the remote replica.

    Accumulation of WRITELOG wait time is the local log flushing and you should look at the local I/O path constraints.
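    The wait ordering can be illustrated with a little arithmetic. The sketch below is a simplified model, assuming a single synchronous replica and that both writes start at the same instant; `commit_waits` and its parameters are hypothetical names, not anything exposed by SQL Server.

```python
def commit_waits(remote_harden_ms, local_write_ms):
    """Model one synchronous commit. Both writes start together; the session
    waits on the remote harden first, then on whatever is left of the local
    log write. Returns (HADR_SYNC_COMMIT wait, WRITELOG wait) in ms."""
    hadr_sync_commit = remote_harden_ms
    writelog = max(0, local_write_ms - remote_harden_ms)
    return hadr_sync_commit, writelog


print(commit_waits(remote_harden_ms=5, local_write_ms=2))   # (5, 0): remote is slower
print(commit_waits(remote_harden_ms=2, local_write_ms=6))   # (2, 4): local is slower
```

    In the first case the local write finished while the session was still waiting on the replica, so no WRITELOG time accumulates. WRITELOG only shows up when the local I/O path is slower than the full remote round trip.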



    Bob Dorr - Principal SQL Server Escalation Engineer

    The topic I received most in my inbox this week was redo blocked on a secondary while attempting to acquire SCH-M (schema modify) lock.

    First of all, this is expected behavior and you can monitor for this with your standard blocking activities (sys.dm_exec_requests, the blocked process TRC event, the blocked process threshold configuration setting(s) and the lock_redo_blocked XEvent).

    The SQL Server 2012 implementation of Always On extended the database mirroring (DBM) capabilities by allowing read only queries and backups against a secondary replica.   With this new activity comes additional overhead.

    1. When a replica is marked for read only capabilities the updated/inserted rows on primary add additional overhead for the row versioning to help support snapshot isolation activities of the read only connections.

    2. When queries are run against the secondary the SCH-S (schema stability) lock is held during the query to make sure the schema of the object can’t be changed during the processing of results.

    In the case of blocked redo, the read only clients typically have long running queries and the object is changed (ALTER, create index, …) on the primary.   When the DDL activity arrives on the secondary, the SCH-M lock is required to complete the requested redo change.    This causes the redo worker to become blocked on the long running, read only query(s).

    You can monitor the redo queue size and other performance counters to determine the relative impact of redo being blocked and make any necessary business decisions to KILL the head blocker(s).  It will look no different than a production server with a head blocker that you resolve today.
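    As a rough illustration of picking out the head blockers, here is a Python sketch. The tuple layout, the session numbers and the `redo_blockers` function are hypothetical; in practice the information would come from your usual blocking queries on the secondary.

```python
def redo_blockers(sessions, target_object):
    """Return the sessions holding a lock on `target_object`, longest-running
    first. Redo needs SCH-M, which is incompatible with every other lock mode,
    so each of these sessions blocks the redo worker.

    `sessions` is a list of (session_id, object_name, lock_mode, seconds)
    tuples; this shape is hypothetical, not a DMV result set."""
    holders = [s for s in sessions if s[1] == target_object]
    return sorted(holders, key=lambda s: s[3], reverse=True)


sessions = [
    (71, "dbo.Orders", "SCH-S", 940),     # long-running report
    (85, "dbo.Customers", "SCH-S", 12),   # different object, not a blocker
    (92, "dbo.Orders", "SCH-S", 30),
]
for session_id, _, mode, seconds in redo_blockers(sessions, "dbo.Orders"):
    print(session_id, mode, seconds)
# 71 SCH-S 940
# 92 SCH-S 30
```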

    Microsoft is evaluating, for future builds, the ability to configure a replica to automatically kill a redo blocker, allowing redo to progress.

    Bob Dorr - Principal SQL Server Escalation Engineer

    I was recently going through an exercise of documenting how to discover certain aspects of Reporting Services for some Kerberos work.  RS 2012 in SharePoint is a totally different game though.  The easiest way I could discover certain items about Reporting Services within SharePoint was with PowerShell.  For previous versions, and RS 2012 in Native Mode, we can use other avenues such as WMI to discover configuration of Reporting Services.

    Service Enumeration

    One of the big things we want to do when we are discovering what is out there is to be able to tell if we even have the service installed.  This may also lead to multiple services within SharePoint.  Within SQL 2012, Reporting Services is a Shared Service within SharePoint and is deployed as such. We could run the following to see the Service Applications that are configured.

    Get-SPServiceApplication |where {$_.TypeName -like "SQL Server Reporting*"}


    This aligns with what we see in Central Admin as well.


    From the Kerberos/Claims configuration perspective, we are also interested in the Service Account for the Reporting Services Application. Also keeping in mind that there may be more than one.

    $apps = Get-SPServiceApplication |where {$_.TypeName -like "SQL Server Reporting*"}
    foreach ($app in $apps){"{0,-20} {1,-20}" -f $app.Name, $app.ApplicationPool.ProcessAccountName}


    We can see that the name is “Reporting Services” and the AppPool Process Account is “BATTLESTAR\rsservice”. The next piece of information we want to know is which SharePoint boxes is this service running on?  We could have any number of App Servers that we need to go check.

    $services = Get-SPServiceInstance |where {$_.TypeName -like "*Reporting*"}
    foreach ($service in $services){"{0,-20} {1,-20}" -f $service.parent.address, $service.status}


    In this example, I only have one server that has the Reporting Services Service started within SharePoint and that is on the CAPTHELO server. We can then compare with the Claims to Windows Token Service (C2WTS).

    Claims to Windows Token Service (C2WTS)

    As part of SQL 2012 with SharePoint Integration, the Claims to Windows Token Service plays a big part in our functionality when it comes to Kerberos. As such, if we are going to validate RS 2012 in SharePoint configuration, we need to look at C2WTS as well. We can use the following PowerShell command to get the C2WTS instances:

    Get-SPServiceInstance |where {$_.TypeName -like "*Claims*"}

    One thing to remember is that C2WTS is not a service app.  You may see multiple items here if you have more than one SharePoint Server.  For example, in my environment I have two SharePoint Servers:


    The service only needs to be started on the SharePoint Box where the Reporting Services Shared Service is started as well. In the Reporting Services example above, we know that Reporting Services is only active on CAPTHELO, so that is where C2WTS needs to be started.

    $services = Get-SPServiceInstance |where {$_.TypeName -like "*Claims*"}
    foreach ($service in $services){"{0,-20} {1,-20} {2}" -f $service.parent.address, $service.status, $service.Service.ProcessIdentity.username}


    We can see that we have two SharePoint servers here.  C2WTS is started on LTBOOMER but it is not running on CAPTHELO.  This will result in an error such as the following:


    Throwing Microsoft.ReportingServices.Diagnostics.Utilities.ClaimsToWindowsTokenLoginTypeException: , Microsoft.ReportingServices.Diagnostics.Utilities.ClaimsToWindowsTokenLoginTypeException: Can not convert claims identity to windows token. This may be due to user not logging in using windows credentials.; 

    From the last command above, we can also see the Process Identity for the C2WTS.  So, this will provide the service accounts for both Reporting Services and C2WTS as well as let us know which servers Reporting Services is enabled on and which servers C2WTS needs to be enabled on.  This could be very helpful when it comes to automation.
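    As one possible starting point for that automation, the comparison between the two listings can be expressed as a set difference. The Python sketch below assumes the (server, status) pairs have already been collected from the PowerShell loops above; the status strings "Online" and "Disabled" are assumed values for started and stopped instances, and `servers_needing_c2wts` is a hypothetical helper.

```python
def servers_needing_c2wts(rs_instances, c2wts_instances):
    """Each argument is a list of (server, status) pairs. Returns the servers
    where Reporting Services is started but C2WTS is not."""
    rs_online = {server.upper() for server, status in rs_instances
                 if status == "Online"}
    c2wts_online = {server.upper() for server, status in c2wts_instances
                    if status == "Online"}
    return sorted(rs_online - c2wts_online)


# The situation described in this post: Reporting Services runs on CAPTHELO,
# but C2WTS is only started on LTBOOMER.
rs = [("CAPTHELO", "Online")]
c2wts = [("LTBOOMER", "Online"), ("CAPTHELO", "Disabled")]
print(servers_needing_c2wts(rs, c2wts))   # ['CAPTHELO']
```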


    Adam W. Saxton | Microsoft Escalation Services

    This blog outlines a new twist to my previous blog outlining issues with 4K sector sizes.

    SQL Server - New Drives Use 4K Sector Size:

    In the previous post I discussed that it was unsafe for the I/O subsystem to present a sector size that was smaller than the actual, physical sector size.   This leads to unsupported, Read-Modify-Write (RMW) behavior.

    I was doing testing on a Windows 2012 Server - Storage Space setup and found that both Storage Spaces and the VHDx format can report a 4K sector size to the SQL Server.   This allows the various drives set up in the pool for Storage Spaces to be of disparate sector sizes (Drive 1 = 512 bytes, Drive 2 = 1K, Drive 3 = 2K, and Drive 4 = 4K.)

    Is this safe for SQL Server?

    The answer is yes.  An I/O subsystem can return a larger sector size than actual, physical sector size as long as all reported values can be evenly divided by 512 bytes.

    As the diagram below shows, SQL Server maintains parity on 512 byte boundaries, for the log, regardless of the reported sector size.   This allows SQL Server to detect a partial write (torn behavior.)   For example, if the system reported a sector size of 4K but the physical sector size was 512 bytes, the I/O subsystem is only guaranteed to flush to a 512 byte mark.   If the first 4, physical sectors are flushed (2K of the 4K aligned block) and a power outage occurs, SQL Server will be able to detect the entire 4K was not properly flushed.



    Without the logical parity every 512 bytes SQL Server would be unable to detect the torn situation, leading to unexpected recovery and logging behavior(s).
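    The torn write example can be mimicked with a toy model in Python. This is not the on-disk log format, which this post does not describe; it only assumes that every 512-byte piece of a log block carries a parity stamp that changes when the block is rewritten. `stamp_block` and `is_torn` are hypothetical names.

```python
SECTOR = 512  # parity is tracked per 512 bytes, whatever sector size is reported


def stamp_block(block_bytes, parity):
    """Build a log block as a list of per-512-byte parity stamps."""
    sectors = -(-block_bytes // SECTOR)   # ceiling division
    return [parity] * sectors


def is_torn(block_on_disk, expected_parity):
    """A block is torn if any 512-byte piece still carries an old stamp."""
    return any(stamp != expected_parity for stamp in block_on_disk)


# A 4K aligned write over a region that previously held parity 0.
old = stamp_block(4096, parity=0)
new = stamp_block(4096, parity=1)

# Power is lost after the first 4 physical sectors (2K) reach the media.
on_disk = new[:4] + old[4:]

print(is_torn(on_disk, expected_parity=1))   # True: only 2K of the 4K made it
print(is_torn(new, expected_parity=1))       # False: the full block was flushed
```

    If the stamp were kept only once per reported 4K sector, the half-written block above would look complete, which is the situation the 512-byte parity avoids.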

    WARNING:  While SQL Server protects your data against such a failure the reporting of sector size, larger than physical sector size, can lead to unwanted/unexpected space usage.   SQL Server will align the log writes to the reported sector size (4K in this example.) 

    SQL Server packs records within the log blocks and then aligns/pads the writes on the reported sector boundary.  Lots of small transactions, leading to many log flushes, can result in wasted log space for a system reporting larger sector sizes.   Moving the scenario to an I/O subsystem reporting smaller sector sizes can reduce space usage.

    The easiest way to see this in action is a single worker doing tiny transactions.



       insert into tblTest values (1)   -- Each insert is a transaction and a log flush


    Each insert is a separate commit transaction, causing the log to be flushed for each iteration.   In this example each insert will require at least 4K of log space to properly align during the flush.    Wrapping a transaction around the while loop or only committing at reasonable boundaries (say 10,000 inserts) reduces the log flushing behavior and uses the log space more effectively.
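    To put rough numbers on this, here is a small Python estimate. It is a simplification: the 100-byte log record is a hypothetical size, and the model ignores log block headers, begin/commit records and the maximum log block size. It only shows the effect of padding each flush to the reported sector size.

```python
def log_bytes_used(inserts, record_bytes, sector_bytes, inserts_per_commit=1):
    """Estimate log space when every commit causes a flush and each flush is
    padded up to the reported sector size."""
    total = 0
    remaining = inserts
    while remaining > 0:
        batch = min(inserts_per_commit, remaining)
        payload = batch * record_bytes
        # Round the flush up to the next sector boundary.
        total += (payload + sector_bytes - 1) // sector_bytes * sector_bytes
        remaining -= batch
    return total


print(log_bytes_used(10_000, 100, 4096))            # 40960000
print(log_bytes_used(10_000, 100, 512))             # 5120000
print(log_bytes_used(10_000, 100, 4096, 10_000))    # 1003520
```

    Under these assumptions, 10,000 single-row commits consume about 41 MB of log with a 4K reported sector size versus about 5 MB at 512 bytes, while committing once for all 10,000 inserts needs about 1 MB.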

    Bob Dorr - Principal SQL Server Escalation Engineer

    I’m always amazed that issues usually come in batches.  I was looped into a few cases that had the following symptoms.   They were running SharePoint 2010 and Reporting Services 2012 SP1.  When they went to use a data source with Windows Authentication, they were seeing the following error:


    System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)   

    This caused me to raise an eyebrow (visions of Spock as the new Star Trek movie is opening today <g>).  A lot of thoughts were floating in my head that all told me that this error didn’t make sense, for a bunch of reasons.

    1. The default protocol order for connecting to SQL from a client is TCP and then Named Pipes.  So, because we failed with a Named Pipes error, that meant something was either wrong with TCP or someone changed the Protocol order (which I have never seen in a customer case – so very unlikely)
    2. This is RS 2012, which means we are a Shared Service and rely on the Claims to Windows Token Service (C2WTS).  This forces Constrained Delegation.  Pretty sure most people would not have created the delegation requirements for the Named Pipes SQL SPN as most people go down the TCP route.  You can read more about SQL’s SPNs being Protocol based here.  Also more on this related aspect in a later post as I found some interesting things about this as well.
    3. This error tells me that we couldn’t establish a connection to SQL via Named Pipes.  Think of this as a “Server Not Found” type error.  I immediately tossed out any Kerberos/Claims related issue due to that thinking – again more on the kerb piece of this in a later post.
    4. This is really the first time I’ve had someone hit me up with a Named Pipes connection failure from an RS/SharePoint Integration perspective ever.  And I just got hit with 3 of them within the same week.  Something is up.

    Because this told me we had an actual connection issue via Named Pipes, I started down the normal connectivity troubleshooting path.  As with any connectivity issue, I started with a UDL (Universal Data Link) file.  This is basically just a text file renamed with an extension of UDL.  It’s important to run this from the same machine that is hitting the SqlException.  In my case it was my SharePoint App server, not the WFE server.


    You’ll notice the “np:” in front of the server name.  This forces the Named Pipes Protocol and ignores the default protocol order.  And this worked.  I also tried “tcp:” to force TCP in the UDL and this worked too.  I went back to my data source and tried forcing TCP there.


    System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The requested name is valid, but no data of the requested type was found.)

    This made no sense.  I even made sure I was logged in as the RS Service Account as that is the context in which we would have been connecting to SQL.  Same result.  Also, within a network trace, I saw nothing on either the TCP or Named Pipes side of the house in the trace that related to this connection attempt.  Which meant we never hit the wire. 

    As I was going to collect some additional diagnostic logging (Kerberos ETW tracing and LSASS Logging) I ended up doing an IISRESET and a recycle of the C2WTS service.  We went to reproduce the issue, but got a different error this time.


    System.IO.FileLoadException: Could not load file or assembly 'System.EnterpriseServices, Version=, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. Either a required impersonation level was not provided, or the provided impersonation level is invalid. (Exception from HRESULT: 0x80070542)  File name: 'System.EnterpriseServices, Version=, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' ---> System.Runtime.InteropServices.COMException (0x80070542): Either a required impersonation level was not provided, or the provided impersonation level is invalid. (Exception from HRESULT: 0x80070542)   

    This error I did know and can work with.  I had blogged about this error last July here.  Checking the “Act as part of the operating system” showed that the C2WTS service account in fact was not given that right.  Adding that account to that policy right and restarting the C2WTS Windows Service and performing an IISRESET then yielded the following:


    The connectivity errors were clearly related to the lack of the Policy Setting.  It was unexpected and didn’t line up with normal connectivity related issues and also wasn’t very helpful with regards of where to go look for more information as all of the normal paths didn’t show anything useful.

    Of note, I tried reproducing this on SharePoint 2013, but only got the FileLoadException.  I think this is partly a timing issue with how IIS AppPools are started and the C2WTS service is started.  Doesn’t mean you won’t see this on SharePoint 2013 necessarily.  Even on SharePoint 2010, the first time I hit the FileLoadException.


    Adam W. Saxton | Microsoft Escalation Services

    I ran into a new Kerberos Scenario that I hadn’t hit before when I was working on the cases related to this blog post. It’s rare that I actually see a case related to the Named Pipes protocol.  When I do, it is usually a customer trying to get it setup with a Cluster deployment.  I have never had a Named Pipes case related to Kerberos.  On top of that, I’ve never had a SQL related Kerberos issue that looked like an actual network related issue.  I usually see a traditional “Login failed for user” type error from the SQL Server itself.

    As part of my troubleshooting for the other blog post with the Claims configuration, I stumbled upon some information and theories about how Named Pipes responds when Kerberos is in the picture that I hadn’t ever seen or dealt with before.  I love when I see new things! It is very humbling and always reminds me there are a lot of things that I don’t know.  And, if you have read my other blog posts, or have seen me present at conferences like PASS, you know I have a passion for Kerberos!

    Here is what I saw from an error perspective using SharePoint 2013 and Reporting Services 2012 SP1.


    System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception: Access is denied

    This is a typical error if we can’t connect to SQL.  Think of this like a “Server doesn’t exist” type error.  We didn’t get the normal “Login failed for user” error that would possibly point towards Kerberos.  In this error, we didn’t even make it to SQL.  The interesting piece here though is the “Access is denied” inner exception.  That does possibly point to a permission issue. 

    I had talked in the last Blog Post about protocol order with connecting to SQL and that the default was TCP.  In this case, I was forcing Named Pipes, so the fact that the error is a Named Pipes error is expected.

    I dropped down to a network trace to see how far we actually got and to see if that revealed any other information.  One thing to keep in mind here is that we are in a Claims to Windows Token Service (C2WTS) scenario with the SharePoint/RS 2012 integration.  So, Kerberos/Constrained Delegation will be in the picture here.  A lot of people aren’t necessarily familiar with how Named Pipes actually works.  Named Pipes actually uses the SMB (Server Message Block) protocol from a network perspective.  This is the same protocol used for file shares and you’ll see the traffic on port 445.  It can be a little confusing because SMB sits on top of TCP, but we aren’t actually using the TCP 1433 port.  It is just a different way to connect to SQL Server. The IP was the SharePoint Server hosting the Reporting Services Service.

    300    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB    SMB:C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12, SMB 2.002, SMB 2.???    {SMBOverTCP:42, TCP:41, IPv4:1}

    302    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:R   NEGOTIATE (0x0), Revision: (0x2ff) - SMB2 wildcard revision number., ServerGUID={97B805C2-296C-477B-82B4-DEB6170A2A01} Authentication Method: GSSAPI,     {SMBOverTCP:42, TCP:41, IPv4:1}

    303    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:C   NEGOTIATE (0x0), ClientGUID= {9CB563F9-BEF4-11E2-9403-00155D4CB97B},     {SMBOverTCP:42, TCP:41, IPv4:1}

    304    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:R   NEGOTIATE (0x0), Revision: (0x300) - SMB 3.0 dialect revision number., ServerGUID={97B805C2-296C-477B-82B4-DEB6170A2A01} Authentication Method: GSSAPI,     {SMBOverTCP:42, TCP:41, IPv4:1}

    323    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:C   SESSION SETUP (0x1) Authentication Method: GSSAPI,     {SMBOverTCP:42, TCP:41, IPv4:1}

    326    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:R  - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED  SESSION SETUP (0x1), SessionFlags=0x0 Authentication Method: GSSAPI,     {SMBOverTCP:42, TCP:41, IPv4:1}

    327    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:C   SESSION SETUP (0x1) Authentication Method: GSSAPI,     {SMBOverTCP:42, TCP:41, IPv4:1}
         - ResponseToken: NTLM AUTHENTICATE MESSAGE Version:NTLM v2, Workstation: CAPTHELO
              Signature: NTLMSSP

    328    9:04:40 AM 5/17/2013    captthrace.battlestar.local    SMB2    SMB2:R  - NT Status: System - Error, Code = (34) STATUS_ACCESS_DENIED  SESSION SETUP (0x1) ,     {SMBOverTCP:42, TCP:41, IPv4:1}

    329    9:04:40 AM 5/17/2013    captthrace.battlestar.local    TCP    TCP:Flags=...A.R.., SrcPort=49665, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=2945236632, Ack=2852397926, Win=0 (scale factor 0x8) = 0    {TCP:41, IPv4:1}

    In the Network Trace we can see that we were trying to connect via NTLM.  I already know that that will be a problem as we have to go Kerberos.  We started supporting Kerberos with Named Pipes starting in SQL 2008, so it should work. At this point, I’m thinking we actually have a Kerberos issue even though it looked like a network issue from the original error message.  So, let’s go see if we can validate that.  I already had Kerberos Event Logging enabled.  These entries will be located in the System Event Log.  You can ignore errors that show “KDC_ERR_PREAUTH_REQUIRED”.  That is just noise and expected.  Also realize that errors may be cached and if they are, you will not see them in the Event Log or a Network Trace. It may require an IISRESET, a reset of the C2WTS Windows Service, or even a reboot of the box to get the items to show in the Event log or Network Trace. See this Blog Post.

    Log Name:      System
    Source:        Microsoft-Windows-Security-Kerberos
    Date:          5/17/2013 9:04:40 AM
    Event ID:      3
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      CaptHelo.battlestar.local
    A Kerberos error message was received:
    on logon session
    Client Time:
    Server Time: 14:4:40.0000 5/17/2013 Z
    Error Code: 0xd KDC_ERR_BADOPTION
    Extended Error: 0xc0000225 KLIN(0)
    Client Realm:
    Client Name:
    Server Realm: BATTLESTAR.LOCAL
    Server Name: cifs/captthrace.battlestar.local
    Target Name: cifs/captthrace.battlestar.local@BATTLESTAR.LOCAL
    Error Text:
    File: 9
    Line: 12be
    Error Data is in record data.

    This entry was the only non-PREAUTH_REQUIRED error.  Two things were interesting about it.  The first was KDC_ERR_BADOPTION.  When I see this, especially in a Claims type configuration, it tells me we have a Constrained Delegation issue.  The other interesting item was the CIFS SPN.  CIFS, which stands for “Common Internet File System”, is used for File Sharing.  This was our SMB traffic.  We can also see this in the Network Trace.

    319    9:04:40 AM 5/17/2013    KerberosV5    KerberosV5:TGS Request Realm: BATTLESTAR.LOCAL Sname: cifs/captthrace.battlestar.local     {TCP:44, IPv4:14}

    321    9:04:40 AM 5/17/2013    KerberosV5    KerberosV5:KRB_ERROR  - KDC_ERR_BADOPTION (13)    {TCP:44, IPv4:14}

    This was interesting, because I never gave Constrained Delegation rights to CIFS for the C2WTS or the Computer Account.  When we talk about SPNs, Delegation and placement, the rule is that the SPN should be on the account that is running the service.  For CIFS, that is the system itself, and therefore the SPN belongs to the machine account of the SQL Server that we are trying to connect to.

    CIFS is one of those special Service Classes, similar to HTTP.  It is covered by the HOST SPN on the Machine Account and we won’t see an actual CIFS SPN defined, but when we go to the delegation side of things you will see it.



    I added this to both the Claims Service account and the Computer Account.  I say Computer Account because the actual SMB request will come from the machine and not directly from the RS Process.  Under the hood, it is effectively making a call to the CreateFile Windows API.

    After resetting IIS and cycling the C2WTS Service, I still saw the same exact error.  This was one of those reboot moments.  After rebooting the server, I then got the following:


    I didn’t necessarily expect this, as I expected to fail on the Kerberos side to SQL.  So, I ran a report and stuck a WAITFOR DELAY in there so I could see the connection.  I had a look at sys.dm_exec_connections on the SQL Server and saw that we had connected with NTLM:


    For our purposes this will work, as I’m not going further than SQL.  This is technically a single hop between the SharePoint Server System context and the SQL Server.  You can configure it for Kerberos if you really want that auth_scheme by creating the appropriate Named Pipes SPN and configuring the appropriate Delegation for the C2WTS Service Account and the Machine Account from which the SMB request originates.  Also realize that if you have a misplaced Named Pipes SQL SPN, you will encounter a “Cannot Generate SSPI Context” error similar to the following:




    Adam W. Saxton | Microsoft Escalation Services


    Recently we've seen an issue where some Excel 2010 workbooks containing PowerPivot models encounter errors when attempting to upgrade to Excel 2013.

    When opening a PowerPivot model that was created in Excel 2010 in Excel 2013 you will be prompted to upgrade with the following message:

    This workbook has a PowerPivot data model created using a previous version of
    the PowerPivot add-in. You will need to upgrade this data model with PowerPivot
    for Excel 2013.


    After clicking OK to upgrade the model, the following error message is displayed:

    Error Message:

    The handle is invalid
    The '' local cube file cannot be opened.
    A connection cannot be made. Ensure that the server is running.
    Sorry, PowerPivot couldn't connect to server A connection cannot be made. Ensure that the server is running..

    Call Stack:

       at Microsoft.AnalysisServices.LocalCubeStream..ctor(String cubeFile, OpenFlags settings, Int32 timeout, String password, String serverName)
       at Microsoft.AnalysisServices.LocalCubeStream..ctor(String cubeFile, OpenFlags settings, Int32 timeout, String password, String serverName)
       at Microsoft.AnalysisServices.XmlaClient.OpenLocalCubeConnection(ConnectionInfo connectionInfo)
       at Microsoft.AnalysisServices.XmlaClient.OpenConnection(ConnectionInfo connectionInfo, Boolean& isSessionTokenNeeded)
       at Microsoft.AnalysisServices.XmlaClient.Connect(ConnectionInfo connectionInfo, Boolean beginSession)
       at Microsoft.AnalysisServices.XmlaClient.Connect(ConnectionInfo connectionInfo, Boolean beginSession)
       at Microsoft.AnalysisServices.Server.Connect(String connectionString, String sessionId, ObjectExpansion expansionType)
       at Microsoft.AnalysisServices.BackEnd.DataModelingSandboxConnection.OpenAMOConnection()
       at Microsoft.AnalysisServices.BackEnd.DataModelingSandboxConnection.OpenAMOConnection()
       at Microsoft.AnalysisServices.BackEnd.DataModelingSandboxConnection.Open()
       at Microsoft.Office.PowerPivot.ExcelAddIn.InProcServer.LoadSandboxAfterConnection(String errorCache)
       at Microsoft.Office.PowerPivot.ExcelAddIn.InProcServer.LoadSafeSandboxAfterConnection(String errorCache)
       at Microsoft.Office.PowerPivot.ExcelAddIn.InProcServer.LoadOLEDBConnection(Boolean raiseCompleteEvent, String errorCache)



    This issue occurs when a Pivot Table in the workbook has an invalid set or calculated field definition.

    When the attempt to upgrade begins, Excel opens the PowerPivot model and executes the commands in the pivotcache for the pivot tables in the workbook, including creating session sets or calculated items. If a definition contains an error then the embedded PowerPivot engine returns an error and Excel disconnects from the PowerPivot embedded engine, returning the message saying the server connection could not be opened.


    We are working with the Excel team to address this issue, but for now one option to work around this problem is to use the following steps:

    1. Open the workbook in Excel 2010
    2. Click on a PivotTable
    3. In the PivotTable Tools Options menu, click the "Fields, Items, and Sets" button and choose Manage Sets.
    4. Delete any invalid set definitions.
    5. Save the file then open it in Excel 2013.  It should upgrade successfully. 


    -Wayne Robertson



    Note: This blog is based on behavior as of June 2013.  At Microsoft we continue to evolve and enhance our products so the behavior may change over time.

    The I/O path for SQL Server, running on Windows Azure Virtual Machines, uses Windows Azure storage (often referred to as XStore.)  The following is a link to a whitepaper, written by the SQL Server development team, explaining various I/O aspects and tuning needs for this environment: 

    I have been testing and driving workloads using SQL Server in our IaaS environment(s), exercising test patterns from SQLIOSim, T-SQL, Tempdb stress, BCP, and the RML utility test suites.  This work has afforded me the opportunity to work closely with both the SQL Server and Windows Azure development teams to better understand how to tune SQL Server for a Windows IaaS deployment.

    The previously mentioned white paper points out many of the interactions but I would like to dive into some specific areas to help you understand your Windows Azure IaaS, SQL Server deployment capabilities and options.


    Caching happens at multiple levels (SQL Server, VHD/VHDx, XStore, …).   SQL Server requires stable media (data retained across power outages) in order to uphold the ACID properties of the database.  It should not surprise you that I spent a fair amount of time understanding the various caches and how SQL Server I/O patterns interact with them from both data safety/stability and performance angles.

    SQL Server Data and Log Files: FILE_FLAG_WRITE_THROUGH/Forced Unit Access (FUA) Writes

    SQL Server does all database and log file writes (including TEMPDB) with write-through forced.  This means the system is not allowed to return the success of a WriteFile* operation until the system guarantees the data has been stored in stable media.  This is regardless of the cache settings at any level of the system.  The data must be stored in a cache that is battery backed or written to physical media that will survive a power outage.

    SQL Server Backups

    When performing a backup SQL Server does NOT set FILE_FLAG_WRITE_THROUGH; instead the backup allows caching of the data to take place.   When the backup has written all of the data it issues the FlushFileBuffers command on the backup file(s).  This acts like FUA in that FlushFileBuffers must guarantee all data stored in cache has been written to stable media before returning success from the API.
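    The backup pattern (cached writes followed by one flush at the end) can be sketched in Python.  This is only an analogy, not SQL Server's code: the function name and file name are made up for illustration, and os.fsync stands in for FlushFileBuffers (Python documents os.fsync as using the MS _commit() function on Windows).

```python
import os
import tempfile


def write_backup_style(path, chunks):
    """Write chunks with ordinary cached writes, then flush once at the end."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            # Each write may sit in Python's buffer or in the OS cache.
            total += f.write(chunk)
        # Push Python's own buffer down to the operating system.
        f.flush()
        # Ask the OS to commit its cached data for this file to the device.
        os.fsync(f.fileno())
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bak")  # hypothetical file name
        print(write_backup_style(target, [b"page-1", b"page-2"]))
```

    The point of the design is that the individual writes can be cached for speed, and durability is only demanded once, when the whole file has been written.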

    Other Files

    There are other files such as the SQL Server error log, BCP output, … that a SQL Server deployment will use.  These usually do NOT use FILE_FLAG_WRITE_THROUGH or FlushFileBuffers.   This behavior is not unique to Windows IaaS; it has been the design and implementation for many releases of the SQL Server product.   I am only mentioning this because it can make a difference to you as to where you store the data or leverage various caching mechanisms.

    Drive: VHD/VHDx Cache (Drive Caching)

    The first level of caching appears at the drive level.   Accessing the drive’s properties exposes the drive level caching policy.   When enabled the drive is allowed to cache data within the physical drive cache (for physical disks) and virtual implementations.  


    Drive: XStore Caching

    The virtual disks attached to your Windows IaaS VM are stored in Windows Azure storage (the XStore.)   The XStore provides host level caching and performance optimizations for your VM.   XStore caching is controlled outside of the individual drive cache settings, previously shown.   (Drive level cache settings do not impact the XStore cache settings.)

    There are three cache settings (None, ReadOnly, ReadWrite) for the XStore.  By default a data disk is set to None and an OS disk is set to ReadWrite.   A data disk can be configured to any of the 3 settings, while an OS disk is limited to the ReadOnly or ReadWrite options.

    To alter the XStore cache settings you must use a command such as the following (the cmdlets are part of the Windows Azure Management cmdlets package):

    Get-AzureVM 'TestSub' 'TestMach' | Set-AzureOSDisk -HostCaching "ReadOnly" | Update-AzureVM

    Note: The configuration change does not take place until the VM is restarted.
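    As a quick way to keep the allowed combinations straight, here is a small Python sketch of the rules described above.  The function and dictionary names are my own for illustration; only the setting names and defaults come from the text.

```python
# Allowed XStore host-cache settings per disk type, as described in the text.
ALLOWED = {
    "data": ("None", "ReadOnly", "ReadWrite"),
    "os": ("ReadOnly", "ReadWrite"),
}
DEFAULTS = {"data": "None", "os": "ReadWrite"}


def resolve_host_caching(disk_type, requested=None):
    """Return the effective setting, or raise if the disk type does not allow it."""
    if disk_type not in ALLOWED:
        raise ValueError(f"unknown disk type: {disk_type!r}")
    if requested is None:
        # No explicit request: fall back to the default for this disk type.
        return DEFAULTS[disk_type]
    if requested not in ALLOWED[disk_type]:
        raise ValueError(f"{requested!r} is not valid for a {disk_type} disk")
    return requested
```

    For example, resolve_host_caching("os", "None") raises an error, because an OS disk cannot have caching turned off.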

    You will find that most documentation points to the use of one or more data drives for SQL Server database and log files.  I agree with this assessment, but I would like to explain a bit more as to why I agree.

    OS Drive Default: Read Write

    To understand the recommendation it helps to understand a bit of how the XStore cache is implemented.    The XStore cache can use a combination of the host’s RAM as well as disk space, on the local host to support the cache.   When a read or write takes place the local cache is consulted.  As you can imagine, the local cache can be faster than making a request to the Windows Azure storage cluster.  To see this in action it helps if I show you some high level scenarios.

    Note:  These scenarios are designed to provide you a 10,000 foot view of the architecture.

    Scenario table (columns: Action | FUA | XStore Read Cache | XStore Write Cache | Outcome)

    Action: WriteFile* | FUA: Yes | XStore Read Cache: Yes | XStore Write Cache: Yes
    Outcome:
    • Is covering block in XStore write cache  (No)
    • Obtain cache block (could require write of an older block to XStore before reuse is allowed)
    • Issue XStore read for block ~512K and store in local cache (write)
    • Do original write to local cache
    • Do original write to XStore

    Data is written all the way to the XStore (stable media.)

    Action: WriteFile* | FUA: No | XStore Read Cache: Yes | XStore Write Cache: Yes
    Outcome:
    • Is covering block in XStore write cache  (No)
    • Obtain cache block (could require write of an older block to XStore before reuse is allowed)
    • Issue XStore read for block ~512K and store in local cache (write)
    • Do original write to local cache

    Data is NOT in stable media, will be written by XStore caching, LRU algorithms.   This is not used by the SQL Server database or log files because FUA=YES but BCP out is a usage example.   This may be helpful because the BCP can leverage the cache and allow the cache to optimally send data to the XStore.

    Note:  For temporary files, such as BCP not TEMPDB, you may consider the (D:) scratch drive, provided.  It is local and no XStore caching is involved.

    Action: WriteFile* | FUA: Yes | XStore Read Cache: Yes | XStore Write Cache: No
    Outcome: The write is propagated directly to the XStore.

    Action: WriteFile* | FUA: Yes | XStore Read Cache: No | XStore Write Cache: No
    Outcome: The write is propagated directly to the XStore.

    Action: ReadFile* | FUA: N/A | XStore Read Cache: Yes | XStore Write Cache: Yes
    Outcome:
    • Is covering block in XStore read cache  (No)
    • Obtain cache block (could require reclaim of an older block before reuse is allowed)
    • Issue XStore read for block ~512K and store in local cache (write)
    • Provide portion of block requested to reader

    If the block is present in the read cache the portion of the block requested to the reader is directly serviced.
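    The WriteFile* rows of the scenario table can be restated as a small lookup in Python.  This is a 10,000 foot model only: the names are hypothetical, the cached path assumes the covering block was not already in the cache, and combinations the table does not list are rejected rather than guessed.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    steps: tuple
    stable_on_return: bool  # True when the data is on stable media at return


CACHED_WRITE = (
    "obtain a local cache block (may first flush an older block to XStore)",
    "read the covering ~512K block from XStore into the local cache",
    "apply the original write to the local cache",
)

# Keys are (fua, read_cache, write_cache), one per WriteFile* row in the table.
WRITE_SCENARIOS = {
    (True, True, True): Outcome(CACHED_WRITE + ("apply the original write to XStore",), True),
    (False, True, True): Outcome(CACHED_WRITE, False),
    (True, True, False): Outcome(("propagate the write directly to XStore",), True),
    (True, False, False): Outcome(("propagate the write directly to XStore",), True),
}


def describe_write(fua, read_cache, write_cache):
    """Look up the outcome for one WriteFile* scenario from the table."""
    try:
        return WRITE_SCENARIOS[(fua, read_cache, write_cache)]
    except KeyError:
        raise ValueError("combination not covered by the scenario table") from None
```

    The key takeaway is visible in the stable_on_return flag: with the write cache enabled, only FUA writes (the SQL Server data and log file case) are guaranteed to be in the XStore when the call returns.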


    The SQL Server use cases become clearer once you understand the XStore caching behavior and combine it with SQL Server caching behavior.

    SQL Server Log File

    The log file is typically a serially moving, write-only entity.   SQL Server maintains a cache of log blocks to avoid physical I/O for actions such as a transaction rollback or providing data to Always On, Database Mirroring, Replication, …   If you place the log on a drive that allows XStore write caching you are seldom taking advantage of the write cache.   First, the log is written with FUA enabled, so a write has to go through the cache, all the way to the XStore.  Second, since the log has a much higher ratio of writes to reads, you are likely not using the XStore cache effectively but forcing a local XStore cache write for every backend FUA XStore write.

    SQL Server Data File

    The data file is opened with FUA enabled and the SQL Server uses a large buffer pool to maintain database pages in cache.  The larger your VM size the more buffer pool cache SQL Server can take advantage of, avoiding physical I/O activities.  

    The same write behavior is true for database pages as described in the log section above.   When you then apply the XStore read cache capabilities, they may not improve performance beyond what SQL Server is already caching in the buffer pool.   As shown in the scenario table, a read via XStore read cache enablement can result in a fetch of a larger block into the local XStore read cache.   This could be helpful for subsequent SQL Server read requests, but you also incur a possible write to the local XStore cache to maintain the data in the XStore cache.  Your application read pattern may also be sporadic and defeat the intent of the XStore read cache.

    BCP Out

    You may gain an advantage by writing to an XStore write-enabled drive, allowing the XStore to cache and optimally flush information to the backend store.

    BCP In

    This is one of the tests for which the XStore read cache improved performance.   The read-ahead action of the larger blocks used by the XStore allowed the streaming of the read bytes to be faster than from an XStore drive with read caching disabled.

    TEMPDB Scratch Drive – No Thanks

    It can be a bit confusing: you have a (D: scratch) drive, so why not use it for TEMPDB?  The reason is that the scratch drive is a shared resource between all VMs on the host.   You are given a sandbox drive in the VM so others can’t see your data and you can’t see theirs, but the physical media is shared.   While the system attempts to avoid it, this does mean a noisy neighbor could impact consistent I/O throughput on the scratch drive and change your performance predictability.

    Replicas of Data

    The Windows Azure storage for the VHDs atomically replicates your data to 3 separate, physical media destinations.  This is done during the write request in a way that the write is assured quorum to the devices before the write is considered complete, providing your VHDs with a high degree of data storage safety.

    Remote Replicas (Geo-Replication)

    The default for the VHD storage is to provide 3 local replicas of the data.  A remote replica can also be established.   The remote replica is currently NOT safe for SQL Server use.  The remote replica is maintained asynchronously and the system, currently, does not provide the ability to group VHDs into a consistency group.  Without consistency groups it is unsafe to assume that, for a SQL Server with files on multiple VHDs, the write ordering is maintained across all the VHDs; if the ordering is lost, the database won’t be recoverable.

    At the current time you should not leverage the remote, Windows Azure storage replication capabilities for SQL Server as it is not supported.   You should leverage SQL Server technologies that provide the capability (Always On, Database Mirroring, Log Shipping, Backup to Blob Storage, …)
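    To make the write ordering concern concrete, here is a toy Python model.  It is not how Windows Azure storage replicates; it simply assumes each VHD replicates its own writes in order but independently of the other VHDs, and the names are made up for illustration.

```python
def remote_is_prefix_consistent(write_order, replicated):
    """Check whether a remote replica holds a prefix of the global write order.

    write_order: VHD names in the order the writes were issued.
    replicated:  per-VHD count of writes that have reached the remote replica.
    """
    issued = {}
    missing_earlier_write = False
    for vhd in write_order:
        issued[vhd] = issued.get(vhd, 0) + 1
        arrived = issued[vhd] <= replicated.get(vhd, 0)
        if not arrived:
            missing_earlier_write = True
        elif missing_earlier_write:
            # A later write is present remotely while an earlier one is not.
            return False
    return True


# Log record first, then the data page it protects, repeated twice.
order = ["log", "data", "log", "data"]
print(remote_is_prefix_consistent(order, {"log": 2, "data": 1}))  # log ahead of data
print(remote_is_prefix_consistent(order, {"log": 1, "data": 2}))  # data ahead of log
```

    The second call models the dangerous case: the data VHD's replica has a page whose log record never arrived on the log VHD's replica.  A consistency group would prevent the two replicas from drifting apart like this.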

    Recap / Recommendation

    What am I trying to say? – Test!  As pointed out in the SQL Server whitepaper, the XStore is perfectly safe for SQL Server ACID requirements.   However, the types of I/O pattern(s) your application(s) drive dictate how you can leverage the XStore caching capabilities.  

    Most implementations start with the SQL Server database and log files on a data drive with XStore caching disabled, and only after testing enable various levels of XStore caching behavior.

    Bob Dorr - Principal SQL Server Escalation Engineer


    A case came up where the user was trying to use Report Builder in a Reporting Services instance that was not integrated with SharePoint.  It was in Native Mode configuration.  They indicated that they were getting a 401 error.  My initial thought was that we were hitting a Kerberos issue.  Of note, they were trying to hit a List that was in SharePoint 2013. 

    SharePoint 2013 is defaulted to use Claims Authentication Sites.  So, most would probably ignore the Kerberos aspects of the SharePoint site.  I was able to reproduce the issue locally because I had done the same thing.

    I created the Data Source within Report Builder to hit my SharePoint 2013 site:  http://capthelo/, and when I click on “Test Connection” within the Data Source Dialog Window, I get the following error.


    dataextension!ReportServer_0-1!9cc!06/11/2013-14:25:58:: e ERROR: Throwing Microsoft.ReportingServices.DataExtensions.SharePointList.SPDPException: , Microsoft.ReportingServices.DataExtensions.SharePointList.SPDPException: An error occurred when accessing the specified SharePoint list. The connection string might not be valid. Verify that the connection string is correct.  ---> System.Net.WebException: The request failed with HTTP status 401: Unauthorized.

    This happens because when you click “Test Connection” the connection test is actually performed on the Report Server itself, not directly from Report Builder.  I blogged a while back regarding Report Builder and Firewalls, where I talk about how some of the items in Report Builder will try to connect directly, but “Test Connection” is not one of them.

    At this point, we could ignore the error, hit OK on the Data Source Dialog and try to create a DataSet. When I go to the Query Designer, it appears to have worked.  This is because the DataSets and Query Designer are coming from Report Builder itself.  It is a direct Web Request from the Report Builder Process and not the Report Server, so I don’t get an error.


    However, this is misleading.  It may make you believe that everything is working properly, but when you deploy and try to run the report, you will be back to the 401 error, because the request now comes from the Report Server and follows the same path that produced the original “Test Connection” error.  From the DataSet/Query Designer perspective, this is a straight shot from Report Builder to SharePoint, so we can get away with an NTLM connection for the Web Request and the Windows Credential is valid.

    From the Report Server, however, this is called a Double Hop, and to forward Windows Credentials you need Kerberos, even when your SharePoint 2013 site is configured for Claims.  This actually has nothing to do with SharePoint; it has everything to do with Reporting Services.  The Report Server is the one trying to delegate the Windows Credential to whoever the receiving party is for the Web Request (or SQL Connection if that is your Data Source).  In this case, it is SharePoint 2013.  Because Kerberos isn’t configured properly, IIS (which is hosting SharePoint) receives an anonymous credential for the Web Request and rejects it accordingly with a 401 error.

    In my case, I was using a Domain User Account for the RS Service Account (BATTLESTAR\RSService – http://chieftyrol).  It had the proper HTTP SPN on it.  Also my SharePoint site was using a Domain User account for the AppPool identity within IIS (BATTLESTAR\spservice – http://capthelo) and this had the proper HTTP SPN on it.


    So, now I just need to verify the Delegation properties for the RSService Account. Because I’m using the RSService account for other things, including Claims within SharePoint 2013, I’m forced to use Constrained Delegation on this account and need to continue using that.  If you are not bound to Constrained Delegation, you could choose the option “Trust this user for delegation to any service (Kerberos Only)”, which is considered Full Trust and should correct the issue.  If you are using Constrained Delegation, you have to add the proper service that you want to delegate to.  In my case that is my SharePoint site, which is http/capthelo.battlestar.local.  After I added it, it looked like the following.


    Then I restarted the Reporting Services Service and created the Data Source again.  At that point, the “Test Connection” returned Success!


    Adam W. Saxton | Microsoft Escalation Services


    We have been troubleshooting a customer’s case and uncovered a GC behavior with SQL Server CPU affinity that is worth sharing here in a blog.


    The customer reported that they had two instances of SQL Server 2008 running on a two-node cluster.  Let’s call them Instance1 and Instance2.  When they ran both instances on the same node, kicking off a job on Instance1 would cause query timeouts on Instance2. The queries that timed out were CLR queries (spatial usage).

    The node had plenty of memory for both instances.  Each instance had max server memory set.  Additionally, Instance1 and Instance2 had CPU affinity set so that they don’t overlap CPU usage.



    The puzzle here was that Instance2 was negatively impacted by a job run by Instance1 though on the surface it shouldn’t have been.  The system had plenty of memory and CPU’s were divided by CPU affinity settings. When Instance1’s job was not running or running on a different node, Instance2 would work perfectly fine.

    We captured data such as DMV data, perfmon, userdump etc. via pssdiag during the problem period (when both instances ran on the same node and Instance1 kicked off the offending job).

    In sys.dm_exec_requests, we saw high waits on  CLR_MANUAL_EVENT and CLR_CRST for spatial queries which used CLR on Instance1. 

    From perfmon and userdump, we were able to see the threads were waiting for garbage collection to finish.   When the issue occurred, the time spent in GC was greatly increased.

    For a while, we were puzzled as to why Instance1 would impact Instance2 by increasing its time spent in GC.   The system had plenty of free memory and CPU affinity was set so that the two instances didn’t overlap CPU usage.


    We did notice that a few CPUs affinitized to Instance1 were pegged to 100% for a long time.  Note that these CPUs should not have been used by Instance2 because CPU affinity was set. 

    One breakthrough came when I noticed that the number of GC threads seemed odd.  SQL CLR uses server GC.   For each CPU, there will be a CLR heap and GC thread created.   What I noticed was that, even though Instance1 had 16 CPUs affinitized to it, there were 64 GC threads (the customer had 64 CPUs on each node).  The same was true for Instance2 (64 GC threads with affinity set to 30 CPUs).

    Here was a little discovery:

    When you set CPU affinity on SQL Server, it doesn’t actually set process affinity.  Instead the affinity is set at thread creation time to a specific CPU/scheduler.  Doing it this way helps enable dynamic CPU affinity.

    But CLR initializes heaps and GC threads based on process affinity.  It will create a number of heaps and GC threads depending on the number of CPUs set by process affinity.  If no process affinity is set, it will create the number of heaps and GC threads based on the number of CPUs on the machine.

    In other words, even though Instance2 was affinitized (at SQL level) to use CPUs 16-45, some GC threads still ran on CPUs 0-15 (where Instance1 ran).

    This is where ‘interaction’ comes in.  Basically the CLR heaps and GC threads are not isolated at all.  For example, even though CPU 1 was configured for Instance1, there will be one CLR heap and GC thread for Instance2 running on that CPU as well.

    Since quite a few of the CPUs used by Instance1 were pegged, GC threads from Instance2 had to fight for CPU quantum, leading to increased GC time.
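    The arithmetic behind this can be sketched in a few lines of Python.  This is a simplified model of the behavior described in this post, not CLR code; the function names are made up, and I am assuming Instance1 was on CPUs 0-15 and Instance2 on CPUs 16-45 as in the example.

```python
def gc_thread_cpus(machine_cpus, process_affinity=None):
    """CPUs that get a server-GC heap and thread in this simplified model."""
    if process_affinity:
        return set(process_affinity)
    # No process affinity set: one heap/GC thread per CPU on the machine.
    return set(range(machine_cpus))


def gc_threads_on_foreign_cpus(machine_cpus, other_instance_cpus, process_affinity=None):
    """GC-thread CPUs that fall inside another instance's SQL-level affinity."""
    return sorted(gc_thread_cpus(machine_cpus, process_affinity) & set(other_instance_cpus))


instance1_cpus = range(0, 16)    # 16 CPUs affinitized to Instance1
instance2_cpus = range(16, 46)   # CPUs 16-45 affinitized to Instance2

# SQL-level affinity does not set process affinity, so Instance2 gets 64 GC threads.
print(len(gc_thread_cpus(64)))                               # 64
# 16 of them land on CPUs that Instance1 is keeping busy.
print(len(gc_threads_on_foreign_cpus(64, instance1_cpus)))   # 16
```

    If process affinity had matched the SQL-level affinity, the second number would be zero, which is the isolation the customer expected to have.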


    Conclusion & Solution

    To summarize, if you have multiple instances of SQL Server on the same machine, garbage collection in one instance may be impacted by another instance even if you have CPU affinity set to isolate the instances.

    There are a couple of things you can do:

    When you examine your system health, look at individual CPU utilization.  If any individual CPU is pegged at 100% for a sustained period of time, you need to work to bring that CPU usage down.  If you have balanced resource usage, this behavior (GC thread starvation) shouldn’t happen.  SQL CLR has been in use since 2005, and this is the first time I have seen this behavior.

    If you want true isolation, consider using VMs (one instance per VM).


    A couple of  additional notes:

    By the way, the same customer also helped uncover a situation that Bob Dorr documented in the blog titled “AppDomain unloading messages flooding the SQL Server error log”.

    Note that CLR itself has a bug that can cause GC to misbehave.  This is fixed in KB 2504603.   This customer was on that build.  If you have GC misbehavior, apply this hotfix first before doing additional troubleshooting.  That may be all you need.



    Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support


    High CPU Troubleshooting with DMV Queries



    Recently, a customer called Microsoft Support with a critical performance issue that I worked on. Their SQL Server instance had 100% CPU utilization for a period, but the problem stopped without any action taken by the customer. By the time I was engaged, the issue was no longer occurring. Very often we use PSSDiag to gather data related to performance issues, but we cannot gather trace data with PSSDiag after the fact. XEvents will also not reveal anything.

    PSSDiag will gather other details like top CPU queries that will be useful after the problem has ceased. In this case, we reviewed the execution plans that consumed the most CPU by using DMVs like sys.dm_exec_query_stats.


    In my discussion with the customer, I learned that he had been made aware of the problem and had started to investigate it, but the problem seemed to resolve itself. When I was engaged on the call, the issue had been over for 2 hours. I asked if the server had been restarted and found that it had not. This raised the possibility that the execution plans of the queries and procedures that had driven CPU were still in the cache. In that case, we can run queries against sys.dm_exec_query_stats to review those queries and procedures as well as their execution plans.

    Sys.dm_exec_query_stats Query:

    Here’s an example (and the attached file has the script as well):

    -- Run the following query to get the TOP 50 cached plans that consumed the most cumulative CPU.
    -- All times are in microseconds.

    SELECT TOP 50
        qs.creation_time, qs.execution_count,
        qs.total_worker_time AS total_cpu_time, qs.max_worker_time AS max_cpu_time,
        qs.total_elapsed_time, qs.max_elapsed_time,
        qs.total_logical_reads, qs.max_logical_reads,
        qs.total_physical_reads, qs.max_physical_reads,
        t.[text], qp.query_plan, t.dbid, t.objectid, t.encrypted,
        qs.plan_handle, qs.plan_generation_num
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.plan_handle) AS t
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    ORDER BY qs.total_worker_time DESC


    Many variants of this query can be found which retrieve the stats (reads, CPU, executions) for execution plans from sys.dm_exec_query_stats and the text of the query from sys.dm_exec_sql_text(). To this template, I added sys.dm_exec_query_plan() to also provide the graphical execution plan. This query puts the results in order by total CPU usage.
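    If you export the results and want to work with them outside of SQL Server, the same ordering is easy to reproduce.  Here is a small Python sketch; the function name and the sample rows are made up for illustration, and the only thing taken from the query is that total_worker_time is in microseconds.

```python
def top_cpu_plans(rows, limit=50):
    """Order cached-plan rows by cumulative CPU and add a seconds column."""
    ranked = sorted(rows, key=lambda r: r["total_worker_time"], reverse=True)
    return [
        # total_worker_time is reported in microseconds.
        dict(r, total_cpu_seconds=r["total_worker_time"] / 1_000_000)
        for r in ranked[:limit]
    ]


# Hypothetical rows; the values are invented for this example.
sample = [
    {"text": "query A", "total_worker_time": 2_500_000, "execution_count": 10},
    {"text": "query B", "total_worker_time": 90_000_000, "execution_count": 3},
]

for row in top_cpu_plans(sample):
    print(row["text"], row["total_cpu_seconds"])
```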

    This is not necessarily an exhaustive list. If a query is recompiled the details for that plan are removed; if there are further executions later, its cumulative stats in sys.dm_exec_query_stats start off at zero. If the procedure cache is flushed or SQL Server is restarted, all plans will be similarly affected.

    After running this query with my customer, we saw that there were about five queries that were noticeably higher in total CPU than the rest.

    Reviewing Execution Plans:

    Once we identified the highest CPU consumers, we started reviewing their execution plans by clicking on the link in our results.** There are a number of items to look for in any execution plan that could indicate where performance can be improved:

    • High cost operations
    • Index scans
    • Multiple executions of scans/seeks
    • Bookmark Lookups
    • Operations with very high rowcount
    • Sort operations
    • Implicit conversions
    • Estimated rows/executions that do not match actual rows/executions (which could indicate out of date statistics)

    In this case, we found the execution plans for each didn’t have obvious red flags, like expensive scans. So, we reviewed the index seeks with higher costs. We reviewed the queries to see what columns were used in the WHERE and JOIN clauses. Using sp_help against the tables, we looked at the existing indexes and found that the indexes seemed to support these queries relatively well.


    Since the indexes appeared to support the queries, I decided to take another step and check statistics on these tables using DBCC SHOW_STATISTICS.

    For the first several queries from our high CPU output, we checked the statistics on any indexes that seemed relevant. The most important parts of the statistics output are the “All Density” value, the “Rows” and “Rows Sampled” values, and the “Updated” value.

    The “Updated” value shows when the statistics were sampled. “Rows” and “Rows Sampled” allow you to determine the sampling rate; a higher sampling rate tends to lead to the statistics being more accurate and providing better performance.

    The “All Density” is less direct; inverting this number (dividing 1 by the “All Density”) gives the number of unique values in the column. This is useful in determining how unique an index is, and a more unique index is more likely to be used by the engine to complete a request faster.
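    Both calculations are simple enough to show in a couple of lines of Python.  The function names and the sample numbers are made up for illustration; the formulas are just the ones described above.

```python
def distinct_values(all_density):
    """Invert "All Density" to estimate the number of unique values."""
    if all_density <= 0:
        raise ValueError("All Density must be positive")
    return 1.0 / all_density


def sampling_rate(rows, rows_sampled):
    """Fraction of the table's rows that were read to build the statistics."""
    if rows <= 0:
        raise ValueError("Rows must be positive")
    return rows_sampled / rows


# Hypothetical statistics output values.
print(round(distinct_values(0.0001)))       # about 10,000 unique values
print(sampling_rate(1_000_000, 250_000))    # 0.25, i.e. a 25% sample
```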

    What we found in the customer's case was a wide variety of dates for statistics; some were 4 hours old, some were 4 days old, and some were 4 months old. After seeing this repeated on indexes related to several queries, the statistics were updated which resolved the issue.

    To Summarize:

    Here were the steps we took to analyze and resolve this issue.

    • Used a query against sys.dm_exec_query_stats and sys.dm_exec_sql_text to identify the highest CPU consumers
    • Reviewed the execution plans to identify what drives the cost of each plan and which operators are typically problematic
    • Reviewed the columns used in JOIN or WHERE clauses in the query and ensured there are indexes built on these columns
    • Checked the statistics and updated them when we realized they were out of date

    Additional steps to consider in similar cases:

    • Check the SELECT list and consider if a covering index would further improve performance

    • Use the cost of plans to compare old and new plans

    Tools to help identify the causes of performance issues and improve performance:

    • The Database Engine Tuning Advisor can use a workload or a specific query to determine if adding or modifying an index can improve the SQL Server’s performance.

    • A variant of the high CPU query and a number of other queries I use to gather data quickly are available in the Performance Dashboard. This provides a graphical interface to access these queries as reports.
    Jared Poché, Sr. Support Escalation Engineer. @jpocheMS
    with thanks to Jack Li and Bob Ward

    ** If clicking on the link in the query opens a window with raw XML, save this data to an .XML file. Change the file extension to .SQLPLAN, then open the file in SQL Server Management Studio to view the graphical plan.


    I want to make you aware of a recent SQL Server 2008 hotfix, documented in the KB article “Using large number of constants in IN clause can result in SQL Server termination unexpectedly”. When this happens, you won’t see anything in the errorlog, nor any dumps generated by SQL Dumper.

    The condition that triggers this is not that common, so you may never experience this type of issue. In order to hit this condition, you must have a mismatched numeric data type in the IN clause.

    Let’s assume that you have a table defined as “create table t (c1 numeric(3, 0))”. But in the IN clause, you have something like t.c1 in ( 6887 , 18663 , 9213 , 526 , 30178 , 17358 , 0.268170 , 25638000000000.000000 ). Note that the precision and scale of the constants exceed the column c1’s precision and scale.

    If you have queries like these, then you may experience this unexpected behavior, depending on the final query plan. This usually happens when you allow your users to run ad hoc queries and add an arbitrary number of constant values, some of which may exceed the column’s precision and scale.
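
    To make the mismatch concrete, the following Python sketch checks which of the constants from the example need more digits than a numeric(3, 0) column can hold. It is only an illustration of the triggering condition under a simple digit-counting assumption; it is not how SQL Server types literals internally, and it is not a substitute for the hotfix. The function name exceeds_numeric is hypothetical.

```python
from decimal import Decimal


def exceeds_numeric(literal: str, precision: int, scale: int) -> bool:
    """Return True if the literal needs more integer or fractional digits
    than a numeric(precision, scale) column can hold."""
    value = Decimal(literal)
    _sign, digits, exponent = value.as_tuple()
    fractional_digits = max(-exponent, 0)
    # A value of zero needs no integer digits at all.
    integer_digits = 0 if value == 0 else max(len(digits) + exponent, 0)
    return integer_digits > precision - scale or fractional_digits > scale


# Constants from the IN clause in the example, checked against numeric(3, 0).
constants = ["6887", "18663", "9213", "526", "30178", "17358",
             "0.268170", "25638000000000.000000"]
offenders = [c for c in constants if exceeds_numeric(c, precision=3, scale=0)]
print(offenders)
```

    In this example only 526 fits the column definition; every other constant exceeds either the number of integer digits or the scale.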



    The solution is to apply the hotfix described above. Note that the issue doesn’t happen on SQL Server 2012, and we are working on a fix for SQL Server 2008 R2 as well.

  • 02/13/13--12:13: Breaking Down 18065
    We have had several posts on this blog regarding the 18056 error: two from Bob Dorr (part 1 and part 2) and another from Tejas Shah. However, we still see a lot of questions about this error message, which can show up for different reasons. After those blog posts were made, we released the following:

    FIX: Errors when a client application sends an attention signal to SQL Server 2008 or SQL Server 2008 R2

    This fix was specific to the following message and having to do with Attentions:

    Error: 18056, Severity: 20, State: 29.
    The client was unable to reuse a session with <SPID>, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

    Since this was released, there has continued to be confusion over this error. The intent of the fix above was to limit the amount of noise in the ERRORLOG, and it was specific to receiving State 29 with 18056 when an Attention was received. The Attention is the important part here. If an Attention occurred during a reset of a connection, we would normally log that to the ERRORLOG under State 29. However, with this fix applied, if the Attention occurs during the reset of a connection, you should no longer see the error within the ERRORLOG. This does NOT mean that you will no longer see a State 29.

    I will use this post to explain further how we handle these errors to give you a better understanding.  To do that, I will expand on Bob Dorr's blog post that I linked above which lists out the states.  


    Default = 1,
    GetLogin1, 2
    UnprotectMem1, 3
    UnprotectMem2, 4
    GetLogin2, 5
    LoginType, 6
    LoginDisabled, 7
    PasswordNotMatch, 8
    BadPassword, 9
    BadResult, 10
    FCheckSrvAccess1, 11
    FCheckSrvAccess2, 12
    LoginSrvPaused, 13
    LoginType, 14
    LoginSwitchDb, 15
    LoginSessDb, 16
    LoginSessLang, 17
    LoginChangePwd, 18
    LoginUnprotectMem, 19
    RedoLoginTrace, 20
    RedoLoginPause, 21
    RedoLoginInitSec, 22
    RedoLoginAccessCheck, 23
    RedoLoginSwitchDb, 24
    RedoLoginUserInst, 25
    RedoLoginAttachDb, 26
    RedoLoginSessDb, 27
    RedoLoginSessLang, 28
    RedoLoginException, 29    (Kind of generic but you can use dm_os_ring_buffers to help track down the source and perhaps –y. Think E_FAIL or General Network Error)
    ReauthLoginTrace, 30
    ReauthLoginPause, 31
    ReauthLoginInitSec, 32
    ReauthLoginAccessCheck, 33
    ReauthLoginSwitchDb, 34
    ReauthLoginException, 35

    **** Login assignments from master ****

    LoginSessDb_GetDbNameAndSetItemDomain, 36
    LoginSessDb_IsNonShareLoginAllowed, 37
    LoginSessDb_UseDbExplicit, 38
    LoginSessDb_GetDbNameFromPath, 39
    LoginSessDb_UseDbImplicit, 40    (We can cause this by changing the default database for the login at the server)
    LoginSessDb_StoreDbColl, 41
    LoginSessDb_SameDbColl, 42
    LoginSessDb_SendLogShippingEnvChange, 43

    **** Connection String Values ****

    RedoLoginSessDb_GetDbNameAndSetItemDomain, 44
    RedoLoginSessDb_IsNonShareLoginAllowed, 45
    RedoLoginSessDb_UseDbExplicit, 46    (The database specified in the connection string, Database=XYX, no longer exists)
    RedoLoginSessDb_GetDbNameFromPath, 47
    RedoLoginSessDb_UseDbImplicit, 48
    RedoLoginSessDb_StoreDbColl, 49
    RedoLoginSessDb_SameDbColl, 50
    RedoLoginSessDb_SendLogShippingEnvChange, 51

    **** Common Windows API Calls ****

    ImpersonateClient, 52
    RevertToSelf, 53
    GetTokenInfo, 54
    DuplicateToken, 55
    RetryProcessToken, 56
    LoginChangePwdErr, 57
    WinAuthOnlyErr, 58

    **** New with SQL 2012 ****

    DbAuthGetLogin1, 59
    DbAuthUnprotectMem1, 60
    DbAuthUnprotectMem2, 61
    DbAuthGetLogin2, 62
    DbAuthLoginType, 63
    DbAuthLoginDisabled, 64
    DbAuthPasswordNotMatch, 65
    DbAuthBadPassword, 66
    DbAuthBadResult, 67
    DbAuthFCheckSrvAccess1, 68
    DbAuthFCheckSrvAccess2, 69
    OldHash, 70
    LoginSessDb_ObtainRoutingEnvChange, 71
    DbAcceptsGatewayConnOnly, 72
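
    If you are scanning an ERRORLOG for these messages, a small lookup can translate the state number into the name from the list above. The Python sketch below is only an illustration: the table holds a handful of entries copied from the list (extend it as needed), and the helper name describe_18056 is hypothetical.

```python
import re

# A few entries copied from the state list above; this table is deliberately partial.
STATE_NAMES = {
    29: "RedoLoginException",
    35: "ReauthLoginException",
    40: "LoginSessDb_UseDbImplicit",
    46: "RedoLoginSessDb_UseDbExplicit",
}

ERROR_LINE = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+),\s*State:\s*(\d+)")


def describe_18056(line: str):
    """Return (state, name) for an 18056 ERRORLOG line, or None for any other line."""
    match = ERROR_LINE.search(line)
    if match is None or int(match.group(1)) != 18056:
        return None
    state = int(match.group(3))
    return state, STATE_NAMES.get(state, "unknown (not in this partial table)")


print(describe_18056("Error: 18056, Severity: 20, State: 29."))
```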

    Pooled Connections

    An 18056 error can only occur when we are trying to reset a pooled connection. Most applications I see these days are set up to use pooled connections. For example, a .NET application will use connection pooling by default. The reason for using pooled connections is to avoid some of the overhead of creating a physical hard connection.

    With a pooled connection, when you close the connection in your application, the physical hard connection will stick around. When the application then goes to open a connection, using the same connection string as before, it will grab an existing connection from the pool and then reset the connection.

    When a connection is reset, you will not see sp_reset_connection over the wire. You will only see the "reset connection" bit set in the TDS Packet Header.

    Frame: Number = 175, Captured Frame Length = 116, MediaType = ETHERNET
    + Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-15-5D-4C-B9-60],SourceAddress:[00-15-5D-4C-B9-52]
    + Ipv4: Src =, Dest =, Next Protocol = TCP, Packet ID = 18133, Total IP Length = 102
    + Tcp: [Bad CheckSum]Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=62, Seq=4058275796 - 4058275858, Ack=1214473613, Win=509 (scale factor 0x8) = 130304
    - Tds: SQLBatch, Version = 7.3 (0x730b0003), SPID = 0, PacketID = 1, Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=62, Seq=4058275796 - 4058275858, Ack=1214473613, Win=130304
    - PacketHeader: SPID = 0, Size = 62, PacketID = 1, Window = 0
    PacketType: SQLBatch, 1(0x01)
    Status: End of message true, ignore event false, reset connection true, reset connection skip tran false
    Length: 62 (0x3E)
    SPID: 0 (0x0)
    PacketID: 1 (0x1)
    Window: 0 (0x0)
    - TDSSqlBatchData:
    + AllHeadersData: Head Type = MARS Header
    SQLText: select @@version

    In the above example, we are issuing a SQL Batch on a pooled connection. Because it was a pooled connection, we have to signal that we need to reset the connection before the Batch is executed. This is done via the "reset connection" bit.
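
    For readers who want to see where that bit lives, here is a minimal Python sketch that decodes the 8-byte TDS packet header. The byte string is rebuilt by hand from the fields shown in the trace above (it is an assumption, not captured data), and the function and constant names are hypothetical.

```python
import struct

# Status bits defined for the TDS packet header.
STATUS_END_OF_MESSAGE = 0x01
STATUS_IGNORE_EVENT = 0x02
STATUS_RESET_CONNECTION = 0x08
STATUS_RESET_CONNECTION_SKIP_TRAN = 0x10


def parse_tds_header(header: bytes) -> dict:
    """Decode the 8-byte TDS packet header; multi-byte fields are big-endian."""
    packet_type, status, length, spid, packet_id, window = struct.unpack(
        ">BBHHBB", header[:8]
    )
    return {
        "packet_type": packet_type,
        "end_of_message": bool(status & STATUS_END_OF_MESSAGE),
        "ignore_event": bool(status & STATUS_IGNORE_EVENT),
        "reset_connection": bool(status & STATUS_RESET_CONNECTION),
        "reset_connection_skip_tran": bool(status & STATUS_RESET_CONNECTION_SKIP_TRAN),
        "length": length,
        "spid": spid,
        "packet_id": packet_id,
        "window": window,
    }


# SQLBatch (0x01) with status 0x09 = end of message + reset connection, length 62.
sql_batch_header = bytes([0x01, 0x09, 0x00, 0x3E, 0x00, 0x00, 0x01, 0x00])
print(parse_tds_header(sql_batch_header))
```

    The status byte is a bit field, so "end of message" and "reset connection" can both be set on the same packet, which is exactly what the trace shows.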

    After the above SQLBatch is issued, the app could then turn around and issue an Attention to cancel the request. This is what resulted in the 18056 with State 29 in the past under the condition of an attention.

    Frame: Number = 176, Captured Frame Length = 62, MediaType = ETHERNET
    + Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-15-5D-4C-B9-60],SourceAddress:[00-15-5D-4C-B9-52]
    + Ipv4: Src =, Dest =, Next Protocol = TCP, Packet ID = 18143, Total IP Length = 48
    + Tcp: [Bad CheckSum]Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=8, Seq=4058275858 - 4058275866, Ack=1214473613, Win=509 (scale factor 0x8) = 130304
    - Tds: Attention, Version = 7.3 (0x730b0003), SPID = 0, PacketID = 1, Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=8, Seq=4058275858 - 4058275866, Ack=1214473613, Win=130304
    - PacketHeader: SPID = 0, Size = 8, PacketID = 1, Window = 0
    PacketType: Attention, 6(0x06)
    Status: End of message true, ignore event false, reset connection false, reset connection skip tran false
    Length: 8 (0x8)
    SPID: 0 (0x0)
    PacketID: 1 (0x1)
    Window: 0 (0x0)

    In this case, we would still be in the process of doing the connection reset which would be a problem. Bob Dorr's Part 2 blog that is linked above goes into good detail for how this actually occurs.

    So, no more State 29?

    The thing to realize about State 29 is that it is a generic state, indicating only that an exception occurred while trying to redo a login (Pooled Connection). The exception was not accounted for in any other logic that would have produced one of the other states listed above. Think of it as similar to an E_FAIL or a General Network Error.

    Going forward, assuming you have the above fix applied, or are running on SQL 2012 which has it as well, a State 29 will not be because of an Attention, because we no longer log the 18056 for the Attention. However, if you look at dm_os_ring_buffers, you will still see the actual Attention (Error 3617). We just don't log the 18056 any longer to avoid noise.

    <Record id= "3707218" type="RING_BUFFER_EXCEPTION" time="267850787"><Exception><Task address="0x52BDDC8"></Task><Error>3617</Error><Severity>25</Severity><State>23</State><UserDefined>0</UserDefined></Exception><Stack

    There are things that occur in the course of resetting a login that could trigger a State 29. One example that we have seen is a Lock Timeout (1222).

    In the Lock Timeout scenario, the only thing logged to the ERRORLOG was the 18056. We had to review the dm_os_ring_buffers DMV to see the Lock Timeout.

    <Record id= "3707217" type="RING_BUFFER_EXCEPTION" time="267850784"><Exception><Task address="0x4676A42C8"></Task><Error>1222</Error><Severity>16</Severity><State>55</State><UserDefined>0</UserDefined></Exception><Stack

    The Lock Timeout was a result of statements issuing "SET LOCK_TIMEOUT 0" which affects the connection itself. When the connection is "reset", the SET statements are carried forward. Then based on timing, and whether an exclusive lock is taken based on what the Login logic is looking for, it could end up affecting Logins off of a Pooled Connection when that connection is reused. The default lock timeout for a connection is -1.

    Now what?

    If you receive a State 29, you should follow that up by looking in the dm_os_ring_buffers. You will want to look at the RING_BUFFER_EXCEPTION buffer type.

    select cast(record as XML) as recordXML
    from sys.dm_os_ring_buffers
    where ring_buffer_type = 'RING_BUFFER_EXCEPTION'

    The error that you find should help explain the condition, and/or allow you to troubleshoot the problem further. If you see 3617, then you will want to look at applying the hotfix above to prevent those messages from being logged. If you see a different error, then you may want to collect additional data (Profiler Trace, Network Trace, etc…) to assist with determining what could have led to that error.
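
    If you export those records and want to pull out the error numbers quickly, something like the following Python sketch could help. It is only an illustration: the sample record is modeled on the Lock Timeout entry shown earlier, with the <Stack> element omitted and the closing tags added by hand so that the XML is well formed, and the function name summarize_exception is hypothetical.

```python
import xml.etree.ElementTree as ET

# Labels taken from the discussion above; any other error number is left unnamed.
KNOWN_ERRORS = {3617: "Attention", 1222: "Lock Timeout"}


def summarize_exception(record_xml: str) -> dict:
    """Extract the error, severity, and state from a RING_BUFFER_EXCEPTION record."""
    record = ET.fromstring(record_xml)
    exception = record.find("Exception")
    if record.get("type") != "RING_BUFFER_EXCEPTION" or exception is None:
        raise ValueError("not a RING_BUFFER_EXCEPTION record")
    error = int(exception.findtext("Error"))
    return {
        "error": error,
        "severity": int(exception.findtext("Severity")),
        "state": int(exception.findtext("State")),
        "meaning": KNOWN_ERRORS.get(error, "see the error number for details"),
    }


# Hand-completed sample record (assumption: <Stack> dropped, closing tags added).
sample = (
    '<Record id="3707217" type="RING_BUFFER_EXCEPTION" time="267850784">'
    '<Exception><Task address="0x4676A42C8"></Task><Error>1222</Error>'
    "<Severity>16</Severity><State>55</State><UserDefined>0</UserDefined>"
    "</Exception></Record>"
)
print(summarize_exception(sample))
```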


    Adam W. Saxton | Microsoft Escalation Services





    I've run across the following error a few times and thought I would post it here so people can understand what is happening.

    ERROR: Throwing Microsoft.ReportingServices.Diagnostics.Utilities.OperationNotSupportedException: , Microsoft.ReportingServices.Diagnostics.Utilities.OperationNotSupportedException: The feature: "The Database Engine instance you selected is not valid for this edition of Reporting Services. The Database Engine does not meet edition requirements for report data sources or the report server database. " is not supported in this edition of Reporting Services.;

    You may also see a similar message in your Event Logs.

    This error is a result of mismatched SKUs between the Report Server and the Database Engine, as the message mentions. You can also look at two entries in the RS Server logs to see what it found when it performed the SKU check.

    resourceutilities!WindowsService_0!e44!02/19/2013-08:58:28:: i INFO: Reporting Services starting SKU: Enterprise
    library!WindowsService_0!e54!02/19/2013-08:58:34:: i INFO: Catalog SQL Server Edition = Enterprise

    Where this has caused confusion is when the Catalog SQL Server Edition SKU shows the following:

    resourceutilities!WindowsService_0!e44!02/19/2013-08:58:28:: i INFO: Reporting Services starting SKU: Enterprise
    library!WindowsService_0!e54!02/19/2013-08:58:34:: i INFO: Catalog SQL Server Edition = Developer

    This will cause the above error. From a usability perspective, Developer edition is essentially the Enterprise product that you can use for testing. However, Reporting Services has a specific SKU check to prevent running an Enterprise version of a Report Server against a Developer or Eval version of the Storage Engine.

    It looks something like this:

    case Standard:
    case Enterprise:
    case EnterpriseCore:
    case DataCenter:
    case BusinessIntelligence:


    Of note, Developer edition cannot use Eval and Eval edition cannot use Developer.
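
    As a quick way to check your own logs, here is a Python sketch that pulls both SKUs out of the two log entries and applies only the rules stated in this post. Treat it as an assumption-laden illustration: the edition spelling "Evaluation" is a guess (the post only says "Eval"), the reading of the case list as the set of Report Server SKUs that reject a Developer or Eval catalog is my interpretation, and the function names are hypothetical. It is not the product's actual check.

```python
import re

RS_SKU = re.compile(r"Reporting Services starting SKU:\s*(\w+)")
CATALOG_SKU = re.compile(r"Catalog SQL Server Edition\s*=\s*(\w+)")

# Report Server SKUs from the case list above.
PRODUCTION_SKUS = {"Standard", "Enterprise", "EnterpriseCore", "DataCenter",
                   "BusinessIntelligence"}
# Assumed spellings of the non-production editions.
TEST_SKUS = {"Developer", "Evaluation"}


def read_skus(log_text: str):
    """Return (report_server_sku, catalog_sku); either may be None if not found."""
    rs = RS_SKU.search(log_text)
    catalog = CATALOG_SKU.search(log_text)
    return (rs.group(1) if rs else None, catalog.group(1) if catalog else None)


def blocked_by_sku_check(rs_sku, catalog_sku) -> bool:
    """Apply only the combinations this post says are rejected."""
    if rs_sku in PRODUCTION_SKUS and catalog_sku in TEST_SKUS:
        return True
    # Developer cannot use Eval, and Eval cannot use Developer.
    return rs_sku in TEST_SKUS and catalog_sku in TEST_SKUS and rs_sku != catalog_sku


log = """resourceutilities!WindowsService_0!e44!02/19/2013-08:58:28:: i INFO: Reporting Services starting SKU: Enterprise
library!WindowsService_0!e54!02/19/2013-08:58:34:: i INFO: Catalog SQL Server Edition = Developer"""
rs_sku, catalog_sku = read_skus(log)
print(rs_sku, catalog_sku, blocked_by_sku_check(rs_sku, catalog_sku))
```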

    So, how do we correct this if we run into this situation? You will need to uninstall and reinstall Reporting Services with the Product Key that matches the SKU you want. If you want Developer Edition, you will need to run setup with the Product Key for Developer. The Edition SKU is part of the Setup process and is stamped for that instance.

    For Reporting Services 2012 in SharePoint Integrated mode, it is considered a Shared Feature, but still is reliant on the SKU that you used to run SQL Setup with. The catch here is that it is not an Instance per se. You still need to make sure that you are installing it with the proper Product Key for the edition you are using.

    The product will then make a WMI call to get the "EditionName" property to determine what is appropriate.


    Adam W. Saxton | Microsoft Escalation Services


    SQL Server has supported CLR usage since version 2005.  But support of .NET framework assemblies within SQL Server is limited per our support policy in KB


    Some users chose to use .NET Framework assemblies outside the list in that KB article. This can cause various issues. Lately we have had a few reports of the following error after an upgrade to SQL Server 2012.

    Msg 6544, Level 16, State 1, Line 2

    CREATE ASSEMBLY for assembly '<assembly name>' failed because assembly '<assembly name>' is malformed or not a pure .NET assembly. Unverifiable PE Header/native stub.


    A little background. When you develop your user assembly, you can reference .NET Framework assemblies. If the referenced .NET Framework assemblies are all from the supported list, you only need to register your own user assembly by using the CREATE ASSEMBLY statement. When you use a .NET Framework assembly that is not in the supported list, the following happens:

    1. You are required to mark your assembly as unsafe.
    2. You are required to use the CREATE ASSEMBLY statement to register the .NET Framework assembly and its referenced assemblies (those not in the supported list) within the SQL Server database. In other words, the .NET Framework assembly has to physically reside in a SQL Server database, just the same as your own assembly.
    3. When you do this, you are presented with a warning: “Warning: The Microsoft .Net frameworks assembly 'AssemblyName' you are registering is not fully tested in SQL Server hosted environment.”


    There are two types of .NET assemblies.  Pure .NET assemblies only contain MSIL instructions.   Mixed assemblies contain both unmanaged machine instructions and MSIL instructions.  Mixed assemblies in general are compiled by C++ compiler with /clr switch but contain machine instructions resulting from native C++ code.


    Regardless of the version of SQL Server, CREATE ASSEMBLY only allows pure .NET assemblies to be registered. SQL Server has always required that an assembly loaded into a SQL Server database with CREATE ASSEMBLY contain only MSIL instructions (a pure assembly). CREATE ASSEMBLY will raise the above error if the assembly to be registered is a mixed assembly.


    Why are we seeing this issue now more often than before?

    SQL Server 2005, 2008 and 2008 R2 use CLR 2.0. In SQL Server 2012, we upgraded to CLR 4.0. As a result, all the .NET Framework assemblies need to be version 4.0. If you have used a .NET Framework assembly that is not in the supported list, you must re-register the 4.0 version using the CREATE ASSEMBLY statement following the upgrade. Some .NET Framework assemblies, such as those for WCF, started referencing mixed-mode assemblies in 4.0. Therefore, you started to experience the issue in SQL 2012 rather than in earlier versions.

    A couple of clarifications

    1. The above error can occur in any version of the .NET Framework if the assembly you are trying to register (with CREATE ASSEMBLY) is not a pure .NET assembly. A .NET Framework assembly is not guaranteed to be a pure .NET assembly in every version. Additionally, a newer version of an assembly may reference a non-pure .NET assembly. In such situations, the upgrade will fail with the above error.
    2. The issue occurs only if you use unsupported .NET Framework assemblies, which results in validation because CREATE ASSEMBLY is involved. If your user assembly references only the assemblies in the list documented in the KB (which will be updated to reflect the issue documented in this blog), we ensure it will work.



    Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support
