Channel: CSS SQL Server Engineers

Temp table caching improvement for table valued parameters in SQL Server 2012


I wanted to point out a nice performance improvement related to table-valued parameters (TVPs) in SQL Server 2012. It's not currently covered in our online documentation, but we have had customers inquire about it.

When you use a TVP, SQL Server internally uses a temp table to store the data. Starting with SQL Server 2005, temp tables can be cached for reuse. Caching reduces contention, such as page latch contention on system tables, which can occur when temp tables are created and dropped at a high rate.

 

If you use a TVP with a stored procedure, the temp table for the TVP has been cached since SQL Server 2008, when TVPs were introduced.

But if you use a TVP together with parameterized queries, temp tables for the TVP are not cached in SQL 2008 or 2008 R2. This leads to the page latch contention on system tables mentioned earlier.

Starting with SQL Server 2012, temp tables for TVPs are cached even for parameterized queries.

Below are two perfmon results for a sample application that uses a TVP in a parameterized query. Figure 1 shows that SQL 2008 R2 sustained a high temp table creation rate until the test completed. Figure 2 shows that SQL 2012 had only a brief spike in temp table creation rate, which then dropped to zero for the rest of the test. (A query you can use to sample the same counter from T-SQL follows the figures.)

 

Figure 1:  SQL Server 2008 R2’s Temp Table Creation Rate

 


 

Figure 2:  SQL Server 2012’s Temp Table Creation Rate

 

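The same counter is also exposed through the sys.dm_os_performance_counters DMV. Here is a minimal sketch (the counter name matches the perfmon counter in the figures above; the raw value is cumulative, so sample it twice and take the difference to get a rate):

-- Cumulative count behind the "Temp Tables Creation Rate" perfmon counter
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Temp Tables Creation Rate';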

 

 

Just as a side note, a parameterized query is executed via sp_executesql from the SQL Server perspective. From the application perspective, ADO.NET pseudo-code like the following generates a parameterized query:

SqlCommand cmd = ….;   // command on an open SqlConnection
cmd.CommandText = "SELECT Value FROM @TVP";
cmd.CommandType = System.Data.CommandType.Text;
DataTable tvp = new DataTable();
// add columns and rows to the DataTable here
SqlParameter tvpParam = cmd.Parameters.AddWithValue("@TVP", tvp);
tvpParam.SqlDbType = SqlDbType.Structured;
tvpParam.TypeName = "dbo.TvpType";   // table type name (hypothetical); required for ad hoc text commands
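For reference, here is a minimal T-SQL sketch of the shape such a parameterized TVP call takes on the server (the table type dbo.TvpType is hypothetical; TVP parameters must be declared READONLY):

-- Hypothetical table type the application's TVP would be based on
CREATE TYPE dbo.TvpType AS TABLE (Value int);
GO
DECLARE @tvp dbo.TvpType;
INSERT INTO @tvp (Value) VALUES (1), (2), (3);

-- Roughly the call shape the client driver sends for a parameterized TVP query
EXEC sp_executesql
    N'SELECT Value FROM @TVP',
    N'@TVP dbo.TvpType READONLY',
    @TVP = @tvp;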

 

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support


switchoffset built-in function can cause incorrect cardinality estimate


Recently, we received a call from a customer who reported that a query was slow. Upon investigation, the query had a predicate that looked like this:

select * from t o where c1 >switchoffset (Convert(datetimeoffset, GETDATE()), '-04:00')

 

Upon further investigation, we discovered that it was a cardinality estimation issue. The customer's data contained no dates beyond today; all the dates were in the past (as is the case for most scenarios).

SQL Server has many built-in/intrinsic functions. During query compilation, the optimizer can actually 'peek' at the value by 'executing' the function to provide a better estimate. For example, if you use getdate() as in "select * from t where c1 > getdate()", the optimizer can actually get the value of getdate() and then use the histogram to obtain an accurate estimate.

DATEADD is another intrinsic function for which the optimizer can do the same trick.

But switchoffset is not one of those intrinsic functions, so the optimizer can't 'peek' at the value and utilize the histogram.

 

Just to compare the difference, the query "select * from t o where c1 > switchoffset (Convert(datetimeoffset, GETDATE()), '-04:00')" shows an incorrect estimate (74397 rows).

 

[Screenshot: estimated execution plan showing 74397 estimated rows]

 

 

But "select * from t o where c1 > convert (datetimeoffset, dateadd (dd, 0, getdate()))" shows the correct estimate. Note that the two queries are logically equivalent; I used them only to illustrate the difference in cardinality estimates.

 

[Screenshot: estimated execution plan showing an accurate estimate]

 

Solution

When you use switchoffset together with getdate(), it's best to 'precompute' the value and then plug it into your query. Here is an example:

declare @dt datetimeoffset = switchoffset (Convert(datetimeoffset, GETDATE()), '-04:00')
select * from t  where c1 > @dt option (recompile)

Complete demo script

 


if object_id ('t') is not null
    drop table t
go
create table t (c1 datetimeoffset)
go
declare @dt datetime, @now datetime
set @dt = '1900-01-01'
set @now = SYSDATETIMEOFFSET()
set nocount on
begin tran
while @dt < @now
begin
    -- six rows per day from 1900-01-01 through today
    insert into t values (@dt)
    insert into t values (@dt)
    insert into t values (@dt)
    insert into t values (@dt)
    insert into t values (@dt)
    insert into t values (@dt)
    set @dt = dateadd (dd, 1, @dt)
end
commit tran
go
create index ix on t (c1)
go

set statistics profile on
go
--inaccurate estimate
select * from t  where c1 >switchoffset (Convert(datetimeoffset, GETDATE()), '-04:00')
--accurate estimate
select * from t  where c1 > convert (datetimeoffset, dateadd (dd, 0, getdate()))
--accurate estimate
declare @dt datetimeoffset = switchoffset (Convert(datetimeoffset, GETDATE()), '-04:00')
select * from t  where c1 > @dt option (recompile)

go
set statistics profile off

 

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

SQL Server 2012 partitioned table statistics update behavior change when rebuilding index


In this blog, I will talk about a couple of things related to statistics updates when rebuilding an index on a partitioned table.

 

In past versions, when you rebuild an index you get a statistics update equivalent to FULLSCAN for free. This is true regardless of whether the table is partitioned.

But SQL Server 2012 changed this behavior for partitioned tables. If a table is partitioned, ALTER INDEX REBUILD will only update statistics for that index with the default sampling rate. In other words, it is no longer a FULLSCAN. This is documented in http://technet.microsoft.com/en-us/library/ms188388.aspx, but lots of users do not realize it. If you want a FULLSCAN, you will need to run UPDATE STATISTICS WITH FULLSCAN. This change was made because we started to support a large number of partitions, up to 15000, by default. Previous versions could support 15000 partitions, but it was not on by default. Supporting a large number of partitions would cause high memory consumption if we tracked the stats with the old behavior (FULLSCAN). With a partitioned table, ALTER INDEX REBUILD first rebuilds the index and then does a sampled scan to update the stats in order to reduce memory consumption.
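For example, on a partitioned table you would follow the rebuild with an explicit FULLSCAN update if you still want the old behavior (the table and index names here are hypothetical):

-- On a partitioned table in SQL Server 2012 this updates the index statistics
-- with the default sampling rate only.
ALTER INDEX ix_Orders_OrderDate ON dbo.Orders REBUILD;

-- Run this explicitly if you still want FULLSCAN statistics.
UPDATE STATISTICS dbo.Orders ix_Orders_OrderDate WITH FULLSCAN;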

 

Another behavior change is actually a bug. In SQL 2012, ALTER INDEX REBUILD doesn't preserve the NORECOMPUTE property for partitioned tables. In other words, if you specify NORECOMPUTE on an index, it will be gone after you run ALTER INDEX REBUILD on SQL 2012. We have corrected this issue in the newly released CU3 of SQL Server 2012 SP1. Here is the KB: http://support.microsoft.com/kb/2814780
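A quick way to check whether the NORECOMPUTE property survived a rebuild, and to put it back if it did not (same hypothetical names as the sketch above):

-- no_recompute = 1 means automatic statistics updates are disabled for that statistic
SELECT s.name, s.no_recompute
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Orders');

-- Reapply the property if the rebuild dropped it (on an unpatched build)
UPDATE STATISTICS dbo.Orders ix_Orders_OrderDate WITH FULLSCAN, NORECOMPUTE;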

 

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

System Center Advisor is now free


It has been well over a year since I wrote a series of blog posts about a product called System Center Advisor. You can read those posts at this link:

http://blogs.msdn.com/b/psssql/archive/tags/system+center+advisor/

When Advisor was first released, this cloud service was free for a 60-day trial period but required a Software Assurance contract to use beyond that.

Well we have decided that this should be free for everyone. Read more about this announcement at this link:

http://blogs.technet.com/b/momteam/archive/2013/03/06/system-center-advisor.aspx

For SQL Server, we now have more than 100 rules baked into this service, representing the collective knowledge of CSS SQL Engineers worldwide on common customer issues. Have you ever wanted to know what the CSS team knows based on common issues reported by customers? That is what SCA is all about: providing you that knowledge in the form of a cloud-based service.

Give this a try on your SQL Server and look at the advice that is presented. We specifically baked in rules (called alerts) with the intention of helping you prevent problems before they happen.

Take a look through my previous blog posts above on this topic for some examples. What is incredibly powerful about this service is:

  • Once you install, it just “runs”
  • You view your alerts through a web portal so you can do this anywhere
  • As part of the service we capture configuration change history (i.e. a problem started but what changed?)
  • We keep the rules “fresh” by updating the service each month but you don’t have to do anything. The service automatically pulls in these new rules for you.

I look forward to any comments you post to this blog regarding your experiences with it.

 

Bob Ward
Microsoft

PowerPivot Table Import Wizard cannot find provider


The data source provider list in PowerPivot can often be a source of confusion for users, since they assume that a provider appearing in the list means the provider is installed and available. Unfortunately, the list of providers is actually a static list of data sources supported by PowerPivot, so the user is still required to install the desired provider to successfully import data into PowerPivot. Thus, the most common fix for a "provider is not installed" error in the import wizard is to ensure you have the proper data provider installed and that the installed provider matches the platform architecture (32-bit or 64-bit) of PowerPivot and Excel.

If you are certain that the selected provider is installed on your client machine and are able to import data directly into Excel using the desired provider via the Data tab, then you may be encountering another issue which was recently discovered.

In this new scenario, data import in PowerPivot fails for any provider selected. The exact error varies depending on the provider selected, but examples include:

Text File:  "Details: Failed to connect to the server. Reason: Provider information is missing from the connection string"

Excel:  "Cannot connect to the data source because the Excel provider is not installed."

SQL Server: "Cannot connect to the data source because the SQLServer provider is not installed."

 

The problem is actually caused by an issue in the .NET machine configuration. PowerPivot attempts to instantiate providers by using the .NET DbProviderFactory class. If an error is encountered while instantiating the DbProviderFactory class, that error is not returned; instead, the message returned is that the selected provider is not installed. If you are encountering this scenario, it is very likely that there is a problem instantiating the .NET DbProviderFactory class.

The DbProviderFactory class configuration is read from the machine.config file which, depending on whether you are running the 32-bit or 64-bit version of Excel and PowerPivot, is located at:

c:\Windows\Microsoft.NET\Framework\v4.0.30319\Config

or

c:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config

Checking the machine.config file, you will find the <DbProviderFactories> element under <system.data>. The <DbProviderFactories> element should only appear once, but problematic machines may have more than one XML tag for DbProviderFactories.

Example of bad element list:

<system.data>
     <DbProviderFactories>
        <add name="Microsoft SQL Server Compact Data Provider" invariant="System.Data.SqlServerCe.3.5" description=".NET Framework Data Provider for Microsoft SQL Server Compact" type="System.Data.SqlServerCe.SqlCeProviderFactory, System.Data.SqlServerCe, Version=3.5.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91"/>
     </DbProviderFactories>
     <DbProviderFactories/>
</system.data>

NOTE: The begin and end tags wrap the add element for the SQL Server Compact provider and are followed by an empty <DbProviderFactories/> element tag.

Correct Example:

<system.data>
     <DbProviderFactories>
        <add name="Microsoft SQL Server Compact Data Provider" invariant="System.Data.SqlServerCe.3.5" description=".NET Framework Data Provider for Microsoft SQL Server Compact" type="System.Data.SqlServerCe.SqlCeProviderFactory, System.Data.SqlServerCe, Version=3.5.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91"/>
     </DbProviderFactories>
</system.data>

NOTE: The add element(s) between the open <DbProviderFactories> and close </DbProviderFactories> tags will vary depending on what providers are installed on your machine.

If you find that you have something similar to the bad example above, please use the following steps to resolve the issue:

  1. Make a backup copy of the existing machine.config file in the event you need to restore it for any reason.
  2. Open the machine.config file in Notepad or another editor of your choice.
  3. Delete the empty element tag <DbProviderFactories/> from the file.
  4. Save the updated file.
  5. Retry the import from PowerPivot.

 

Wayne Robertson - Sr. Escalation Engineer

The Case of Anti-Virus filter drive interference with File Stream Restore


"Denzil and I were working on this issue for a customer and Denzil has been gracious enough to write-up a blog for all of us." – Bob Dorr

From Denzil:

I recently worked with a customer on a Database restore issue where the database being restored had 2TB of File stream data. The restore in this case would just not complete successfully and would fail with the error below.

10 percent processed.

20 percent processed.

30 percent processed.

40 percent processed.

Msg 3634, Level 16, State 1, Line 1

The operating system returned the error '32(The process cannot access the file because it is being used by another process.)' while attempting 'OpenFile' on 'F:\SQLData11\DataFiles\535cc368-de43-4f03-9a64-f5506a3f532e\547fc3ed-da9f-44e0-9044-12babdb7cde8\00013562-0006edbb-0037'.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally.

Subsequent restore attempts would fail with the same error though on "different" files and at a different point in the restore cycle.

Given that it was "not" the same file or the same point of the restore on the various attempts, my thoughts immediately went to some filter driver under the covers wreaking havoc. I ran a command to see what filter drivers were loaded (trimmed output below).

C:\>fltmc instances

Filter                Volume Name              Altitude        Instance Name                   Frame
--------------------  -----------------------  --------------  ------------------------------  -----
BHDrvx64              F:\SQLData11             365100          BHDrvx64                        0
eeCtrl                F:\SQLData11             329010          eeCtrl                          0
SRTSP                 F:\SQLData11             329000          SRTSP                           0
SymEFA                F:\SQLData11             260600          SymEFA                          0
RsFx0105              \Device\Mup              41001.05        RsFx0105 MiniFilter Instance    0

 

SymEFA         = Symantec extended file attributes driver
SRTSP        = Symantec Endpoint protection                
RsFx0105     = SQL Server File Stream filter driver.

In discussing this with the customer, we learned that anti-virus exclusions were controlled by GPO, so he put in a request to exclude the respective folders, yet the issue still continued.

In order to do my due diligence, the other question was whether we "released" the file handle after we created it, and whether someone else grabbed it. So we (Venu, Bob and I) did take a look at the code, and this can be the case. On SQL Server 2008 R2, when we call the CreateFile API we hardcode the share mode parameter to 0, which is exclusive access while we have the file open, to prevent secondary access.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

If this parameter is zero and CreateFile succeeds, the file or device cannot be shared and cannot be opened again until the handle to the file or device is closed. For more information, see the Remarks section.

Once the file is created, we release the EX latch and can close the file handle, but sqlservr.exe continues to hold the lock on the file itself during the restore process. Once the restore operation is completed, we no longer hold an exclusive lock on the file.

We can reopen file handles during the recovery process, so the other thought was that perhaps a transaction affected by recovery and GC, and potentially some race condition, was involved. But in this case we knew the restore was failing prior to that, as it didn't reach 100%, so that could be ruled out as well.

Getting a dump at the failure time showed me the same restore stack, but different dumps showed multiple different files in question, so it wasn't a particular log record sequence per se causing this.

sqlservr!ex_raise
sqlservr!HandleOSError

sqlservr!FileHandleCache::OpenFile
sqlservr!FileHandleCache::ProbeForFileHandle
sqlservr!FileHandleCache::GetFileHandle
sqlservr!RestoreCopyContext::RestoreFilesystemData
BackupIoRequest::StartDatabaseScatteredWrite

Given that it was now unlikely to be SQL Server, I concentrated more on the filter driver theory. I tried to capture Process Monitor, but given the time it took and the number of files touched, Process Monitor was not all that useful. I couldn't filter on a specific folder because it failed on different folders, and there were 10+ mount points involved.

However, from Process Monitor while the restore was going on, I looked at the stack for some I/O operations (not ones that failed, by any means) and I still saw fltmgr.sys sitting there for an OpenFile call on a file in the FILESTREAM directory:

fltmgr.sys + 0x2765                                0xfffffa6001009765   C:\Windows\system32\drivers\fltmgr.sys
fltmgr.sys + 0x424c                                0xfffffa600100b24c   C:\Windows\system32\drivers\fltmgr.sys
fltmgr.sys + 0x1f256                               0xfffffa6001026256   C:\Windows\system32\drivers\fltmgr.sys
ntoskrnl.exe + 0x2c8949                            0xfffff80002918949   C:\Windows\system32\ntoskrnl.exe
ntoskrnl.exe + 0x2c0e42                            0xfffff80002910e42   C:\Windows\system32\ntoskrnl.exe
ntoskrnl.exe + 0x2c19d5                            0xfffff800029119d5   C:\Windows\system32\ntoskrnl.exe
ntoskrnl.exe + 0x2c6fb7                            0xfffff80002916fb7   C:\Windows\system32\ntoskrnl.exe
ntoskrnl.exe + 0x2b61a8                            0xfffff800029061a8   C:\Windows\system32\ntoskrnl.exe
ntoskrnl.exe + 0x57573                             0xfffff800026a7573   C:\Windows\system32\ntoskrnl.exe
ntdll.dll + 0x471aa                                0x77b371aa           C:\Windows\System32\ntdll.dll     ZwOpenFile
kernel32.dll + 0x10d48                             0x779d0d48           C:\Windows\system32\kernel32.dll
kernel32.dll + 0x10a7c                             0x779d0a7c           GetVolumeNameForRoot
_____SQL______Process______Available + 0x695c7e    0x1a080fe            GetVolumeDeviceNameAndMountPoint
_____SQL______Process______Available + 0x6d6898    0x1a48d18            ParseContainerPath
_____SQL______Process______Available + 0x6d714a    0x1a495ca            sqlservr!CFsaShareFilter::RsFxControlContainerOwnership

  

Also, looking at some other Symantec-related issues, I found an article showing that this was a possibility, not necessarily to do with any SQL restores. Again, that has to do with a specific issue on a specific build, but it illustrates that filter drivers can cause some unexpected behaviors.
 

As far as anti-virus exclusions go, we actually have guidance in the article below: http://support.microsoft.com/kb/309422

And also in our File stream best practices article: http://msdn.microsoft.com/en-us/library/dd206979(v=SQL.105).aspx

When you set up FILESTREAM storage volumes, consider the following guidelines:

  • Turn off short file names on FILESTREAM computer systems. Short file names take significantly longer to create. To disable short file names, use the Windows fsutil utility.
  • Regularly defragment FILESTREAM computer systems.
  • Use 64-KB NTFS clusters. Compressed volumes must be set to 4-KB NTFS clusters.
  • Disable indexing on FILESTREAM volumes and set disablelastaccess. To set disablelastaccess, use the Windows fsutil utility.
  • Disable antivirus scanning of FILESTREAM volumes when it is not necessary. If antivirus scanning is necessary, avoid setting policies that will automatically delete offending files.
  • Set up and tune the RAID level for fault tolerance and the performance that is required by an application.

Looking at another run of the "fltmc instances" command output, we still saw the anti-virus components listed for those mount points. Given that we "thought" we had put an exclusion in for the whole drive, and it was still showing up, it was time to look at this more closely:

  1. Excluded the drives where the data was being stored – Restore still failed
  2. Stopped the AV Services - Restore still failed
  3. Uninstalled Anti-virus – Restore now succeeded

Voila: once we uninstalled the AV on this machine, the restore succeeded. The customer is following up with the AV vendor to figure out more of the root cause.

 

Denzil Ribeiro – Senior PFE

SQLIOSim Checksum Validations


I had a very specific question asked of me related to the SQLIOSim.exe checksum validation logic. It is pretty simple logic (on purpose) but effective, so here are the basics.

The key is that there are multiple memory locations used to hold the data and do the comparison. 

[Diagram: separate WRITE and READ buffers used for the checksum comparison against stable media]

1. Allocate an 8K buffer in memory. Stamp the page with random data using crypto random function(s). Save the page id, file id, random seed and calculated checksum values in the header of the page.

2. Send the page to stable media (async I/O). Check for proper write completion.

3. Sometime after the successful write(s), allocate another buffer and read the data from stable media. (Note: this is a separate buffer from that of the write call.)

4. Validate the bytes read. Do header checks for the file id, page id, seed and checksum values.

Expected CheckSum:   0xEFC6D39C     ---------- Checksum stored in the WRITE buffer

Received CheckSum:   0xEFC6D39C     ---------- Checksum stored in the READ buffer (what stable media returned)

Calculated CheckSum: 0xFBD2A468     ---------- Checksum as calculated on the READ buffer


The detailed (.TXT) output file(s) show the WRITE image, the READ image and the DIFFERENCES found between them (think memcmp). When only a single buffer image is added to the detailed TXT file, this indicates that the received header data was damaged or the WRITE buffer is no longer present in memory, so only the on-disk checksum vs. calculated checksum are being compared.

If there appears to be damage, SQLIOSim will attempt to read the same data 15 more times and validate it before triggering the error condition. Studies from SQL Server and Exchange showed that read-retries succeed in some situations; SQL Server and Exchange will perform up to 4 read-retries in the same situation.
      
The window for damage possibilities is from the time the checksum is calculated to the time the read is validated. While the problem could be SQLIOSim itself, the historical evidence shows this is a low probability. The majority of the time is spent in kernel and I/O path components, and the majority of bugs over the last 8 years have been non-SQL related.

For vendor debugging, the detailed TXT file contains the various page images as well as the sequence of Win32 API calls, thread ids and other information. Using techniques such as a bus analyzer or detailed I/O tracing, the vendor can assist in pinpointing the component causing the damage.
     

The top of the 8K page header is currently the following (Note: This may change with future versions of SQLIOSim.exe)

DWORD       Page Number     (Take * 8192 for file offset)

DWORD       File Id

DWORD       Seed value

DWORD       Checksum CRC value

BYTE             Data[8192 – size(HEADER)]   <---------  Checksum protects this data

Bob Dorr - Principal SQL Server Escalation Engineer

 

How It Works: Always On–When Is My Secondary Failover Ready?


I keep running into the question: "When will my secondary allow automatic failover?" Based on the question I did some extended research, and I will try to summarize it in this blog post. I don't want to turn this post into a novel, so I am going to take some liberties and assume you have read the SQL Server Books Online topics related to Always On failover.

The easy answer: only when the secondary is marked SYNCHRONIZED. End of blog, right? Not quite!

At a 10,000-foot level that statement is easy enough to understand, but the real issue is understanding what constitutes SYNCHRONIZED. There are several state machines that determine the NOT SYNCHRONIZED vs. SYNCHRONIZED state. These states are maintained using multiple worker threads and at different locations to keep the system functional and fast.

  • Secondary Connection State
  • End Of Log Reached State

To understand these states I need to discuss a few concepts to make sure we are all on the same page.

Not Replicating Commits – Log Blocks Are The Replication Unit

The first concept is to remember SQL Server does not ship transactions. It ships log blocks. 

The design is not really different from a stand-alone server. On a stand-alone server a commit transaction issues (FlushToLSN/StartLogFlush) to make sure all LSNs up to and including the commit LSN are flushed. This causes the commit to block the session, waiting for the log manager to indicate that all blocks of the log have been properly flushed to stable media. Once the LSN has been reached, any pending transaction(s) can be signaled to continue.

[Diagram: a log block containing interleaved LSNs from two sessions]

Let's use the diagram for discussion. The ODD LSNs are from Session 1 and the EVEN LSNs are from Session 2.

The log block is a contiguous chunk of memory (often 64K and disk-sector-size aligned), maintained by the log manager. Each database has multiple log blocks maintained in LSN order. As multiple workers are processing, they can use various portions of the log block, as shown here.

To make this efficient a worker requests space in the block to store its record.  This request returns the current location in the log block, increments the next write position in the log block (to be used by the next caller) and acquires a reference count.   This makes the allocation of space for a log record only a few CPU instructions.  The storage position movement is thread safe and the reference count is used to determine when the log block can be closed out.

In general, closing out a log block means all the space has been reserved and new space is being handed out for another log block.  When all references are released the block can be compressed, encrypted, … and flushed to disk.   

Note:  A commit transaction (FlushToLSN/StartLogFlush) can trigger similar behavior, even when the block is not full, so a commit transaction does not have to wait for the block to become full.   Reference: http://support.microsoft.com/kb/230785 

In this example both commits would be waiting on the log block to be written to stable media.

Session 1 – FlushToLSN (05)
Session 2 – FlushToLSN (06)

The log writer's completion routine is invoked when the I/O completes for the block. The completion routine checks for errors and, when successful, signals any sessions waiting on an LSN <= 6. In this case both session 1 and 2 are signaled to continue processing.
         
Write Log Waits accumulate during this wait for the flush activities.   You can read more about write log waits at: http://blogs.msdn.com/b/psssql/archive/2009/11/03/the-sql-server-wait-type-repository.aspx

Misconception

I had a discussion over e-mail where the individual was thinking we only ship committed transactions. Not true (for Always On or database mirroring). If we only shipped committed transactions it would require a different set of log blocks for EACH transaction. This would be terrible, performance-impacting overhead. It would also be extremely difficult to handle changes on the same page. If SQL Server doesn't have the ACID series of log records, how would SQL Server ever be able to run recovery, both redo and undo? Throw in row versioning, and shipping just the committed log records becomes very cumbersome.

We don’t ship log records, we ship log blocks and optimize the recovery needs.

 

Parallel Flushing / Hardening

Always On is a bit different than database mirroring (DBM) with respect to sending the log blocks to the secondary replica(s).   DBM flushes the log block to disk and once completed locally, sends the block to the secondary.

Always On changed this to flush the block(s) in parallel.  In fact, a secondary could have hardened log block(s) before the primary I/O completes.    This design increases performance and narrows the NOT IN SYNC window(s).

SQL Server uses an internal, callback mechanism with the log manager.   When a log block is ready to be flushed (fully formatted and ready to write to disk) the notification callbacks are fired.   A callback you might expect is Always On.   These notifications start processing in parallel with the actual flushing of the log to the local (LDF) stable media.

[Diagram: the primary's local log flush and the secondary's log block shipping/hardening racing in parallel]

As the diagram shows, the race is on.  One worker (log writer) is flushing to the local media and the secondary consumer is reading new blocks and flushing on the secondary.   A stall in the I/O on the primary can allow the secondary to flush before the primary just as a delay on the secondary could cause the primary to flush the I/O before the secondary.

My first reaction to this was: oh no, not in sync, this is bad. However, the SQL Server developers didn't stop at this juncture; Always On is built to handle this situation from the ground up.

Not shown in the diagram are the progress messages.   The secondary sends messages to the primary indicating the hardened LSN level.   The primary uses that information to help determine synchronization state.   Again, these messages execute in parallel to the actual log block shipping activities. 

Cluster Registry Key for the Availability Group

The cluster AG resource is the central location used to maintain the synchronization states. Each secondary has information stored in the AG resource key (a binary blob) describing the current LSN levels, synchronization state and other details. This registry key is already replicated atomically across the cluster, so as long as we use the registry at the front of our WAL protocol design, the AG state is maintained.

Note:  We don’t update the registry for every transaction.  In fact, it is seldom updated, only at required state changes.  What I mean by WAL protocol here is that the registry is atomically updated before further action is taken on the database so the actions taken in the database are in sync with the registry across the cluster.

Secondary Connection State (Key to Synchronized State)

The design of Always On is a pull, not a push model.  The primary does NOT connect to the secondary, the secondary must connect to the primary and ask for log blocks.

Whenever the secondary is NOT connected, the cluster registry is immediately updated to NOT SYNCHRONIZED. Think of it this way: if we can't communicate with the secondary, we are unable to guarantee the state remains synchronized, so we protect the system by marking it NOT SYNCHRONIZED.

Primary Database Startup

Whenever a database is taken offline/shutdown the secondary connections are closed.   When the database is started we immediately set the state of the secondary to NOT SYNCHRONIZED and then recover the database on the primary.  Once recovery has completed the secondary(s) are allowed to connect and start the log scanning activity.

Note: There is an XEvent session, with its definition included at the end of this blog, that you can use to track several of the state changes.

Connected

Once the secondary is connected it asks (pull) for a log scan to begin.   As the XEvents show, you can see the states change for the secondary scanner on the primary.

Uninitialized – The secondary has connected to SQL Server but it has not sent LSN information yet.

WaitForWatermark – Waiting for the secondary to reconcile the hardened log LSN position on the secondary with the cluster key and recovery information. The secondary will send its end-of-log (EOL) LSN to the primary.

SendingLog – The primary has received the end-of-log position from the secondary, so it can send log from the specified LSN on the primary to the secondary.

Note:  None of these states alone dictate that the secondary is IN SYNC.   The secondary is still marked as NOT SYNCHRONIZED in the cluster registry.

[Screenshot: XEvent output showing the secondary scan state changes]

Hardened Log On Secondary

You will notice the third column indicates the commit harden policy. The harden policy indicates how a commit transaction should act on the primary database.

DoNothing – There is no active 'SendingLog', so commits on the primary don't wait for acknowledgement from the secondary. There is no secondary connected, so it can't wait for an acknowledgement even if it wanted to.

The state of the secondary must remain NOT SYNCHRONIZED as the primary is allowed to continue.

I tell people this is why it is called HADR and not DRHA. High Availability (HA) is the primary goal, so if a secondary is not connected, the primary is allowed to continue processing. While this does put the installation in danger of data loss, it allows production uptime and alternate backup strategies to compensate.

Delay – When a secondary is not caught up to the primary end-of-log (EOL), the transaction commits are held for a short delay period (sleep), helping the secondary catch up. This is directly seen while the secondary is connected and catching up (SYNCHRONIZING).

WaitForHarden – As mentioned earlier, the secondary sends progress messages to the primary. When the primary detects that the secondary has caught up to the end of the log, the harden policy is changed to WaitForHarden.

SYNCHRONIZING – DMVs will show the synchronizing state until the end-of-log (EOL) is reached. Think of it as a catch-up phase. You can't be synchronizing unless you are connected.

SYNCHRONIZED – This is the point at which the secondary is marked as SYNCHRONIZED. (The secondary is connected and known to have hardened log blocks up to the primary EOL point.)

!!! SYNCHRONIZED IS THE ONLY STATE ALLOWING AUTOMATIC FAILOVER !!!

From this point forward all transactions have to wait for the primary (log writer) and secondary to advance the LSN flushes to the desired harden location.

Going back to the first example, Session 2 waits for all LSNs up to and including 06 to be hardened. When a synchronous replica is involved, this is a wait for LSNs up to 06 to be hardened on both the primary and the secondary. Until both the primary and the secondary reach LSN 06, the committing session is held (Wait For Log Flush).
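On the current primary you can see the per-database, cluster-registered failover readiness through the DMVs. A minimal sketch (column names as documented for sys.dm_hadr_database_replica_cluster_states):

-- is_failover_ready = 1 means the database is SYNCHRONIZED in the cluster registry,
-- i.e. that replica is a valid automatic failover target for the database.
SELECT dcs.database_name,
       ar.replica_server_name,
       dcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = dcs.replica_id;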

Clean vs Hard Unexpected Database Shutdowns

When you think about database shutdown there are two main scenarios: clean and unexpected (hard). When a clean shutdown occurs, the primary does not change the synchronized state in the cluster registry. Whatever the synchronization state was at the time the shutdown was issued remains sticky. This allows clean failovers, AG moves and other maintenance operations to occur cleanly.

An unexpected shutdown can't change the state if the action occurs at the service level (SQL Server process terminated, power outage, etc.). However, if the database is taken offline for some reason (log writes start failing), the connection to the secondary(s) is terminated, and terminating the connection immediately updates the cluster registry to NOT SYNCHRONIZED. Something like a failure to write to the log (LDF) could be as simple as an administrator incorrectly removing a mount point. Adding the mount point back to the system and restarting the database restores the system quickly.

Scenarios

Now I started running scenarios on my white board.   I think a few of these are applicable to this post to help solidify understanding.

In Synchronized State
Primary flushed LSN but not flushed on Secondary
  • Primary experiences power outage.
  • Automatic failover is allowed.  
  • The primary has flushed log records that the secondary doesn’t have and the commits are being held.
  • Secondary will become the new primary and recovers.
  • When the old primary is restarted, the StartScan/WaitForHarden logic will roll back to the same location as the new primary, effectively ignoring the commits that were flushed but never acknowledged.
In Synchronized State
Secondary flushed LSN but not flushed on Primary
  • Primary experiences power outage.
  • Automatic failover is allowed.
  • Secondary has committed log records that primary doesn’t have. 
  • Secondary becomes new primary and recovers.
  • When old primary is restarted the StartScan/WaitForHarden logic will detect a catch up is required.

    Note: This scenario is no different from a synchronized secondary that has not hardened as much as the primary. If the primary is restarted on the same node, then upon connect the secondary will start the catch-up activity (SYNCHRONIZING), get back to end-of-log parity and return to the SYNCHRONIZED state.
          

The first reaction when I draw this out for my peers is: we are losing transactions. Really, we are not. We never acknowledge the transaction until the primary and secondary indicate the log has been hardened to the LSN at both locations.

If you take the very same scenarios to a stand alone environment you have the same timing situations.   The power outage could happen right after the log is hardened but before the client is sent the acknowledgement.   It looks like a connection drop to the client and upon restart of the database the committed transaction is redone/present.   In contrast, the flush may not have completed when the power outage occurred so the transaction would be rolled back.   In neither case did the client receive an acknowledgement of success or failure for the commit.

SYNCHRONIZED – AUTOMATIC FAILOVER ALLOWED

Going back to the intent of this blog, only when the cluster registry has the automatic, targeted secondary, marked SYNCHRONIZED is automatic failover allowed.   You can throw all kinds of other scenarios at this but as soon as you drop the connection (restart the log scan request, …) the registry is marked NOT SYNCHRONIZED and it won’t be marked SYNCHRONIZED again until the end-of-log (EOL) sync point is reached.

Many customers have experienced a failure to allow failover because they stopped the secondary and then tried a move. They assumed that because they no longer had primary transaction activity, it was safe. Not true, as ghost cleanup, checkpoint and other processes can still be adding log records. As soon as you stop the secondary, by definition you no longer have HA, so the primary marks the secondary NOT SYNCHRONIZED.

As long as the AG failover detection can use proper cluster resource offline behaviors, and SQL Server is shut down cleanly or terminated harshly while the secondary is in the SYNCHRONIZED state, automatic failover is possible. If SQL Server is not shut down but a database is taken offline, the state is updated to NOT SYNCHRONIZED.

Single Failover Target

Remember that you can only have a single automatic failover target. To help your HA capabilities you may want to set up a second synchronous replica. While it can't be the target of automatic failover, it can still help high availability (HA).

For example, say the automatic failover, secondary target machine has a power outage. The connection on the primary is no longer valid, so that secondary is marked NOT SYNCHRONIZED. The alternate synchronous replica can still be SYNCHRONIZED and a target for a manual move WITHOUT DATA LOSS. The automatic failover target, in this example, is only a move WITH ALLOW DATA LOSS target.

Don't forget that to enable true HA for this example, the replica(s) should have redundant hardware: secondary network cards, cabling and such. If you use the same network and a networking problem arises, the connections on the primary are dropped, and that immediately marks the replica(s) NOT SYNCHRONIZED.

Resolving State

Most of the time the question addressed in this post comes up because the secondary is NOT becoming the primary and is in the RESOLVING state. Looking at the state changes leading up to the issue, the secondary was in SYNCHRONIZING. When the primary goes down, the secondary knows it was not SYNCHRONIZED. The secondary is attempting to connect to the primary, and the primary is down, so the state is RESOLVING.

-------------------------------------------------------------------------------------------------------

Customizing Failover – All Kinds of Options

A secondary question that always follows this main question is:  “If a disk fails on my database, within an AG why does automatic failover not occur?”

The short answer is that the secondary connections are dropped during database shutdown – NOT SYNCHRONIZED.  (SQL Server development is looking into keeping the SYNCHRONIZED state in this situation instead of forcing NOT SYNCHRONIZED in vNext, opening up the window for automatic failover possibilities.)

The other part of the answer is that the built-in, failover logic is not designed to detect a single database failure.   If you look at the failure conditions in SQL Server Books Online none of these are database level detections.

I was part of the work we did to enhance the failover diagnostics and the decision conditions/levels. We specifically considered custom solution needs. We evaluated dozens of scenarios, then ranked and targeted those conditions safe for the broad customer base using Always On. This design specifically allows any customer to extend the logic for their specific business needs. We made sure the mechanisms the SQL Server and resource DLL use are publicly consumable interfaces documented in SQL Server Books Online.

Note:  All of the following can be done with PowerShell.

XEvents

For XEvents you can use the XEvent Linq reader and monitor a live feed from the SQL Server. The easiest way to accomplish this would be to set up a SQL Agent job (continuously running, so if the process exits it restarts itself) that launches a C# executable or PowerShell script.

  • The job can make sure it is only starting the executable on the primary server.
  • The executable can make sure the proper XEvent sessions are running (these sessions can even be defined to startup during SQL Server, service startup).
  • The executable can monitor the stream of events for the custom trigger points you consider critical to your business needs and, when the parameters fall out of the desired boundary(s), issue the cluster command to MOVE the AG to another node.
  • The XEvent session can also write to a file (.XEL) so the system has history of the event stream as well.

Note: The executable should be resilient to dropped connections. The design of the XEvent live stream is to terminate the stream's connection if the server detects the event stream is stalled (the client is not processing events fast enough). This means the client needs to detect the connection failure and reset. This usually means actions are posted to a worker thread in the application and the main reader only accepts the events and hands them to background tasks.


Example: http://sqlblog.com/blogs/extended_events/archive/2011/07/20/introducing-the-extended-events-reader.aspx

sp_server_diagnostics (http://msdn.microsoft.com/en-us/library/ff878233.aspx)

This was specifically designed to flow across a T-SQL connection (TDS), so anyone using a SQL Server client (.NET, ODBC, OLE DB, …) can execute the procedure and process the results. You don't want dozens of these running on the SQL Server, but you could easily monitor this stream as well and take any custom actions necessary.
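For example, a simple way to sample it from any T-SQL connection (with no repeat interval it runs once and returns the current state rows):

-- Returns one row per component (system, resource, query_processing, io_subsystem, events)
-- with a state_desc and an XML data column you can parse or log.
EXEC sp_server_diagnostics;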

Note: The I/O result row is NOT used by the SQL Server resource DLL to make failover decisions; it is used for logging purposes only. It is not a safe assumption that an I/O stall would be resolved by a failover of the system or even a restart of the service. We have many examples of virus scanners and similar components that can cause this issue, and it would lead to a ping-pong among nodes if we triggered automated failover.

DMVs and Policy Based Management (PBM)

In most cases it will be more efficient to set up an XEvent session to monitor various aspects of the system (specific errors, database status changes, AG status changes, …). However, the DMVs are also useful and a great safety net. We use many of the DMVs and the PBM rules to drive the Always On dashboard. You can create your own policies and execute them, as well as using the XEvent predicates to limit the events produced.

Between some DMV queries and the policies you can easily detect things like corruption errors occurring, loss of a drive, and so on.
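As a small illustration (a sketch along those lines, not the dashboard's actual policies):

-- Recent damaged-page events recorded by the engine (823/824-style errors, torn pages, bad checksums)
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;

-- Overall synchronization health of each database replica in the availability group
SELECT database_id, synchronization_state_desc, synchronization_health_desc
FROM sys.dm_hadr_database_replica_states;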

External Factor Detections

Using PowerShell and WMI you can query information about the machine.  For example you can check each drive for reported failure conditions, such as too many sector remaps or temperature problems.   When detected you can take preemptive action to move the AG and pause the node, marking it for proper maintenance.

Example
$a = get-wmiobject win32_DiskDrive
$a[0] | get-member


Loss of LDF  (No Automatic Failover)

A specific tenet of Always On is: protect the data. Don't automate things that can lead to data loss.

The scenario is that a mount point used to hold the LDF is mistakenly removed from the primary node. This causes the SQL Server database to become suspect (missing log file), but it does not trigger automatic failover.

If the mount point can simply be added back to the node, the database can be brought back online and business continues as usual with no data loss. If we had forced failover (ALLOW DATA LOSS), it could have led to data loss for a situation that the administrators could have cleanly resolved.

When the secondary drops a connection (loss of network, the database LDF is damaged, …) the state is updated to NOT SYNCHRONIZED, preventing automatic failover. We are careful because allowing anything else may lead to split brain and other such scenarios that cause data loss. Furthermore, if you change a primary to a secondary it goes into a recovery state, and at that point, if we had serious damage and needed to recover the data, it is much more difficult to access the database.

A situation like this requires a business decision.  Can the issue be quickly resolved or does it require a failover with allow data loss?  

To help prevent data loss, the replicas are marked suspended. As described in the following link, you can use a snapshot database, before resuming, to capture the changes that will be lost (a minimal example follows the note below): http://msdn.microsoft.com/en-us/library/ff877957.aspx   Then, using T-SQL queries and facilities such as TableDiff, one can determine the best reconciliation.

Also reference:http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/Building%20a%20High%20Availability%20and%20Disaster%20Recovery%20Solution%20using%20AlwaysOn%20Availability%20Groups.docx

Note: You want to make sure the snapshot has a short life span to avoid the additional overhead over a long period of time and the fact that it can hold up other operations, such as FILESTREAM garbage collection actions.
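A minimal sketch of creating such a snapshot before resuming (the database and file names are hypothetical; NAME must match the logical data file name of the source database):

-- Capture the state of the suspended old primary before resuming it as a secondary
CREATE DATABASE MyAGDatabase_PreResume_Snapshot
ON (NAME = MyAGDatabase_Data, FILENAME = N'S:\Snapshots\MyAGDatabase_PreResume.ss')
AS SNAPSHOT OF MyAGDatabase;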

One could build additional monitoring to:

  • Make sure primary was marked suspended
  • Force the failover with allow data loss
  • Create snapshot on OLD primary
  • Resume OLD primary as a new secondary

Then take appropriate business steps to use the data in the snapshot to determine what the data loss would/could be.    This is likely to involve a custom, data resolver design (much like the custom conflict resolution options of database replication) to determine how the data should be resolved.

Don’t Kill SQL

Killing SQL Server is a dangerous practice. It is highly unlikely, but I can never rule out that it may be possible to introduce unwanted behavior, such as when SQL Server is attempting to update the cluster registry key, leaving the key corrupted. A corrupted registry key (the blob for the Availability Group) would then render every replica of the AG damaged, because the AG configuration is damaged, not the data! You would then have to carefully drop and recreate the AG in a way that did not require you to rebuild the actual databases but instead allows the cluster configuration to be corrected. It is only a few-minute operation to fix once discovered, but it means immediate downtime and is usually a panic-stricken situation.

SQL Server is designed to handle power outages and is well tested to accommodate them. Kill is a bit like simulating a power outage and not something Microsoft would recommend as a business practice. Instead you should be using something like PowerShell and issuing a 'move' of the availability group in a clean and designed way.

Example: (Move-ClusterResource) http://technet.microsoft.com/en-us/library/ee461049.aspx  

XEvent Session

CREATE EVENT SESSION [HadronSyncStateChanges_CommitHardenPolicies] ON SERVER
ADD EVENT sqlserver.hadr_db_commit_mgr_set_policy    (ACTION(package0.callstack, sqlserver.database_name)),
ADD EVENT sqlserver.hadr_db_commit_mgr_update_harden (ACTION(package0.callstack, sqlserver.database_name)),
ADD EVENT sqlserver.hadr_db_partner_set_sync_state   (ACTION(package0.callstack, sqlserver.database_name)),
ADD EVENT sqlserver.hadr_db_manager_state            (ACTION(package0.callstack, sqlserver.database_name)),
ADD EVENT sqlserver.hadr_ag_wsfc_resource_state      (ACTION(package0.callstack, sqlserver.database_name)),
ADD EVENT sqlserver.hadr_scan_state                  (ACTION(package0.callstack, sqlserver.database_name))
ADD TARGET package0.event_file (SET filename = N'C:\temp\SyncStates', max_rollover_files = (100))
WITH (MAX_MEMORY = 4096 KB, EVENT_RETENTION_MODE = NO_EVENT_LOSS, MAX_DISPATCH_LATENCY = 5 SECONDS,
      MAX_EVENT_SIZE = 0 KB, MEMORY_PARTITION_MODE = PER_CPU, TRACK_CAUSALITY = ON, STARTUP_STATE = ON)

Bob Dorr - Principal SQL Server Escalation Engineer


AlwaysON - HADRON Learning Series: HADR_SYNC_COMMIT vs WRITELOG wait


The distinction between these two wait types is subtle but very helpful in tuning your Always On environment.

Committing a transaction means the log block must be written locally as well as remotely for synchronous replicas. When in the synchronized state, this involves specific waits for both the local and the remote log block harden operations.

HADR_SYNC_COMMIT = Waiting on a response from the remote replica that the log block has been hardened. This does not mean the remote redo has occurred, but rather that the log block has been successfully stored on stable media at the remote replica. You can watch the remote response behavior using the XEvent hadr_db_commit_mgr_update_harden.

WRITELOG = Waiting on local I/O to complete for the specified log block.

The design puts the local and remote log block writes in motion at the same time (async) and then waits for their completion.   The wait order is 1) remote replica(s) and 2) the local log.

The HADR_SYNC_COMMIT is usually the longer of the waits because it involves shipping the log block to the replica, writing to stable media on the replica and getting a response back.   By waiting on the longer operation first the wait for the local write is often avoided. 

Once the response is received any wait on the local (primary), log (WRITELOG) occurs as necessary.

Accumulation of HADR_SYNC_COMMIT wait time reflects the remote activity, so you should look at the network and the log flushing activity on the remote replica.

Accumulation of WRITELOG wait time reflects the local log flushing, so you should look at the local I/O path constraints.
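To see how much time has accumulated in each, you can look at the wait statistics DMV (values are cumulative since the last restart or wait-stats clear):

SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'HADR_SYNC_COMMIT', N'WRITELOG');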

Reference: http://blogs.msdn.com/b/psssql/archive/2011/04/01/alwayson-hadron-learning-series-how-does-alwayson-process-a-synchronous-commit-request.aspx

Reference: http://blogs.msdn.com/b/psssql/archive/2013/04/22/how-it-works-always-on-when-is-my-secondary-failover-ready.aspx

Bob Dorr - Principal SQL Server Escalation Engineer

DReplay Message: “Active connections exceed 8192, connection 8409 is waiting.”


This message prompted an interesting dive into the DReplay session boundary logic that I thought I would share.

Internally DReplay maintains a progressive, session queue.  This queue is limited to 8192 entries and populated in connection replay order based on the connect/disconnect boundaries.   A background worker maintains the queue for the replay workers, adding new sessions and cleaning up completed sessions.

DReplay is designed to allow 8192 concurrent sessions to replay.  During the capture, this means you must have 8192 or fewer entries in sys.dm_exec_sessions.  Exceeding the limit can result in the message and the wait state.

If you actually have 8192+ sessions that require synchronization with the 8193rd, 8194th, … session(s) the replay can stall because the 8193rd, 8194th, … session won’t have a replay worker until one of the 8192, previous sessions completes.   This does not mean the replay will be stuck forever.   Actions such as query completion, session completion, query timeouts, kill scripts and other such options can be used to achieve forward progress.

Why Do I Get This Message With Only 100 Concurrent Sessions Doing Make/Break?

You can encounter the message with fewer than 8192 concurrent sessions. The logic is to prepare sessions to be executed (read ahead, if you will). If, as a group, the current 8192 sessions take longer to replay than it takes to prepare sessions, the background worker will reach the limit and sleep until a session slot is available.

Here is an example where the background worker reaches the 8192 limit and waits.  3 sessions complete replay activities and the background worker prepares 3 new sessions and again reaches the 8192 limit.   The messages are showing forward progress and that the “read ahead” limit has been reached.   The background worker limits the queue size and waits to avoid encountering potential memory, resource limitations.

2013-07-05 19:00:16:255 CRITICAL     [Client Replay]       Active connections exceed 8192, connection 8467 is waiting.
2013-07-05 19:00:16:255 INFORMATION  [Client Replay]       All events for spid=298 have been replayed
2013-07-05 19:00:16:255 INFORMATION  [Client Replay]       All events for spid=362 have been replayed
2013-07-05 19:00:16:255 INFORMATION  [Client Replay]       All events for spid=293 have been replayed
2013-07-05 19:00:16:255 CRITICAL     [Client Replay]       Active connections exceed 8192, connection 8470 is waiting

I would also point out that the 'connection #### is waiting' message is a nice progress indicator. The DReplay log previously shows the number of dispatched connections: "2013-07-05 18:59:47:956 OPERATIONAL  [Client Replay]       35212 events are dispatched in 8800 connections." From the messages above you can see that DReplay has prepared sessions up to 8469 of the 8800 total to be replayed.

One reproduction of the message used 100 concurrent make/break connections, each repeating 160 times for a total of 16000 sessions. DReplay sees these as 16000 unique sessions and sequences them accordingly. In doing this, DReplay will queue 8192 sessions, wait for sessions to complete, add a few more and repeat the logic. The message in this case is simply showing that you have more than 8192 connect/disconnect boundaries (unique sessions) and DReplay has reached the prepared depth limit.

Bob Dorr - Principal SQL Server Escalation Engineer      

When Does sp_prepare Return Metadata


I was running an RML Utilities Suite test pass and encountered varying behavior from our sp_prepare suite.  Here is what I uncovered.

The command sp_prepare returns (or does not return) metadata depending on the server version. For the client version, the only thing that matters is whether it is prior to SQL 2012 or SQL 2012 and later (i.e. 2012 RTM, SP1, etc.).

1. Prior to SQL 2012, sp_prepare returns metadata to the user. This was implemented by internally setting FMTONLY ON and executing the statement.

2. In SQL 2012 RTM and SP1, sp_prepare does NOT return metadata, if client version is 2012 or greater. FMTONLY ON is deprecated and used only for backward compatibility with the older (i.e. 2008) clients.

3. In SQL 2012 CU6 (build 11.0.2401.0) and later, and SP1 CU3 and later, sp_prepare DOES return metadata to the user, if the batch contains one statement.  This is to address a performance issue with some scenarios (see hotfix KB2772525).
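For reference, a single-statement prepare in the same style as the multi-statement example further below:

declare @p1 int
set @p1 = NULL
exec sp_prepare @p1 output, NULL, N'select * from sys.objects;', 1
select @p1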

The following matrix shows when sp_prepare should return metadata for batches containing one statement.

Client\Server Version    2008/R2    2012 RTM    2012 CU6+    2012 SP1    2012 SP1 CU3+    SQL 14
2008 R2                  yes        yes         yes          yes         yes              yes
2012 (all versions)      yes        no          yes          no          yes              yes
SQL 14 CTP               yes        no          yes          no          yes              yes

yes - sp_prepare returns metadata
no - sp_prepare does NOT return metadata

The following matrix shows when sp_prepare should return metadata for multi-statement batches, such as

declare @p1 int
set @p1 = NULL
exec sp_prepare @p1 output, NULL, N'select * from sys.objects; select 1;', 1
select @p1

Client\Server Version | 2008/R2 | 2012 RTM | 2012 CU6+ | 2012 SP1 | 2012 SP1 CU3+ | SQL 14
2008 R2               | yes     | yes      | yes       | yes      | yes           | yes
2012 (all versions)   | yes     | no       | no        | no       | no            | no
SQL 14 CTP            | yes     | no       | no        | no       | no            | no

Bob Dorr - Principal SQL Server Escalation Engineer

Tracking down Power View Performance Problems


The scenario was that we saw sluggishness on the initial load of a Power View report and also when we went to use one of the filters on the report - like a Pie Chart slice. For the given report the customer had shown us, the initial load was taking 13-15 seconds to come up, versus 3-5 seconds on my system with the same report and PowerPivot workbook. For the Pie slice, it was taking upwards of 20-28 seconds on the customer's system and about 5-8 seconds on my system.

That may not sound like a whole lot if we were just opening an Excel workbook off of a SharePoint site, but for a Power View report, it is all about the experience of the report, and this killed the experience.

I was able to reproduce this using the HelloWorldPicnic Samples Report and Workbook. To do this, I modified the workbook slightly to include 3 more tabs and some more data on those tabs. This modification did not have the extreme impact that the customer's workbook did, but it was enough to illustrate the differences.

Where to start?

From a Reporting Services Report perspective, we want to start with the ExecutionLog of Reporting Services.

clip_image001[4]

However, this doesn't appear to have anything. There are some different versions of this view, so I changed to ExecutionLog3.
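For reference, a query along the following lines can pull the same columns shown below. This is a minimal sketch: the catalog database is assumed to be named ReportingService here, and in SharePoint integrated mode the name typically carries a GUID suffix, so adjust it to your install.

-- Most recent report executions with their timing breakdown.
-- ReportingService is an assumed catalog database name; adjust as needed.
SELECT TOP (10)
       ItemPath,
       TimeStart,
       TimeEnd,
       TimeDataRetrieval,
       TimeProcessing,
       TimeRendering,
       ByteCount,
       [RowCount]
FROM   ReportingService.dbo.ExecutionLog3
ORDER BY TimeStart DESC;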

ItemPath | TimeStart | TimeEnd | TimeDataRetrieval | TimeProcessing | TimeRendering | ByteCount | RowCount
/{986edfff-e480-4c77-b8ef-8e09b3a5a27d}/Reports/HelloWorldPicnicReport.rdlx | 9:17:09 AM | 9:17:24 AM | 12811 | 166 | 1877 | 67313 | 284
 | 9:17:28 AM | 9:17:37 AM | 7076 | 257 | 2308 | 47971 | 328

We see two entries here. The first entry is the initial load. The second is when I clicked on the report to filter it with a slicer. We can see that the report we ran was HelloWorldPicnicReport.rdlx. Let's have a look at the original report to compare the difference!

 

ItemPath | TimeStart | TimeEnd | TimeDataRetrieval | TimeProcessing | TimeRendering | ByteCount | RowCount
/{986edfff-e480-4c77-b8ef-8e09b3a5a27d}/Reports/HelloWorldPicnicReport.rdlx | 9:17:09 AM | 9:17:24 AM | 12811 | 166 | 1877 | 67313 | 284
 | 9:17:28 AM | 9:17:37 AM | 7076 | 257 | 2308 | 47971 | 328
/{986edfff-e480-4c77-b8ef-8e09b3a5a27d}/Reports/HelloWorldPicnicReport - Fast.rdlx | 9:17:58 AM | 9:18:01 AM | 3280 | 45 | 302 | 67313 | 284
 | 9:18:04 AM | 9:18:07 AM | 1714 | 83 | 535 | 47971 | 328

We can see the report I labeled as "- Fast.rdlx". Notice the difference? Also note that ByteCount and RowCount are identical, so the data didn't change. To put this into perspective, have a look at the timing information from the customer case that I worked on.

 

ItemPath | TimeStart | TimeEnd | TimeDataRetrieval | TimeProcessing | TimeRendering | ByteCount | RowCount
/{986edfff-e480-4c77-b8ef-8e09b3a5a27d}/Reports/Repro - Slow.rdlx | 9:25:27 AM | 9:25:52 AM | 21317 | 268 | 2810 | 23642 | 74
 | 9:25:55 AM | 9:26:19 AM | 20005 | 383 | 4217 | 35511 | 103
/{986edfff-e480-4c77-b8ef-8e09b3a5a27d}/Reports/Repro - Fast.rdlx | 9:26:53 AM | 9:26:59 AM | 4651 | 39 | 516 | 23642 | 74
 | 9:27:01 AM | 9:27:09 AM | 6323 | 94 | 968 | 35511 | 103

You can see a bigger variance here, and these numbers were pretty consistent when we ran the report. Of note, we saw the fast numbers when I tried to reproduce the issue locally; I was not originally able to see the times they were seeing on their end.

Since TimeDataRetrieval showed the biggest difference, that's what we decided to focus on. My first thought when I saw data retrieval running high was that we should get a Profiler trace from the Analysis Services PowerPivot instance.

Analysis Services

Here is an example of what we found in the Profiler Trace from SSAS:

Lock Acquired                 2013-03-20 01:28:56.247
Query Begin 3 - DAXQuery      2013-03-20 01:28:56.247 2013-03-20 01:28:56.247
Lock Released                 2013-03-20 01:28:56.263            
Lock Released                 2013-03-20 01:28:56.263            
Query End   3 - DAXQuery      2013-03-20 01:28:56.263 2013-03-20 01:28:56.247 16

The 16 was the query duration, in milliseconds - in fact, everything we are really looking at here is in milliseconds. The queries within SSAS varied between 0ms and 16ms. Hmmm. So it looked like that ruled out SSAS, which made this a little bit more difficult.

AdditionalInfo

If TimeDataRetrieval is showing over 20 seconds, but Analysis Services does not reflect that, how can we tell what is going on? Enter the AdditionalInfo field within the ExecutionLog. To really see what is going on, you need to bump the logging level up to Verbose. This can be done in the properties of the Reporting Services service application, under System Settings, at the very bottom.

clip_image002[4]

This will enable added detail in the AdditionalInfo field. For example, each dataset for the report will be broken out. Here is an example of the slow report.

<Connection>
  <ConnectionOpenTime>141</ConnectionOpenTime>
  <DataSource>
    <Name>EntityDataSource</Name>
    <ConnectionString>Data Source="
http://asaxtontest1/PowerPivotGallery/HelloWorldPicnicPowerView.xlsx"</ConnectionString>
    <DataExtension>DAX</DataExtension>
  </DataSource>
  <DataSets>
    <DataSet>
      <Name>Band1_SliderDataSet</Name>
      <CommandText>…</CommandText>
      <RowsRead>24</RowsRead>
      <TotalTimeDataRetrieval>541</TotalTimeDataRetrieval>
      <QueryPrepareAndExecutionTime>541</QueryPrepareAndExecutionTime>
      <ExecuteReaderTime>541</ExecuteReaderTime>
      <DataReaderMappingTime>0</DataReaderMappingTime>
      <DisposeDataReaderTime>0</DisposeDataReaderTime>
    </DataSet>
  </DataSets>
</Connection>

Let's compare this to the fast report.

<Connection>
  <ConnectionOpenTime>23</ConnectionOpenTime> <-- 23 vs. 141.  This is ultimately what we are troubleshooting
  <DataSource>
    <Name>EntityDataSource</Name>
    <ConnectionString>Data Source="
http://asaxtontest1/Reports/HelloWorldPicnicPowerView.xlsx"</ConnectionString>
    <DataExtension>DAX</DataExtension>
  </DataSource>
  <DataSets>
    <DataSet>
      <Name>Band1_SliderDataSet</Name>
      <CommandText>…</CommandText> <-- The actual DAX Query
      <RowsRead>24</RowsRead>
      <TotalTimeDataRetrieval>125</TotalTimeDataRetrieval> <-- this is the total time for the items below except for QueryPrepareAndExecutionTime.
      <QueryPrepareAndExecutionTime>125</QueryPrepareAndExecutionTime>
      <ExecuteReaderTime>125</ExecuteReaderTime> <-- 125 vs. 541
      <DataReaderMappingTime>0</DataReaderMappingTime>
      <DisposeDataReaderTime>0</DisposeDataReaderTime>
    </DataSet>
  </DataSets>
</Connection>

That definitely gives us a good breakdown. Overall, that doesn't really look like much. 23 milliseconds vs. 141 milliseconds for opening a connection? Why do we care about that? Well, for this particular report, there are 12 datasets like the above, and some are worse than others. For example:

<Connection>
  <ConnectionOpenTime>1354</ConnectionOpenTime> <-- 1354 milliseconds
  <DataSource>
    <Name>EntityDataSource</Name>
    <ConnectionString>Data Source="
http://asaxtontest1/PowerPivotGallery/HelloWorldPicnicPowerView.xlsx"</ConnectionString>
    <DataExtension>DAX</DataExtension>
  </DataSource>
  <DataSets>
    <DataSet>
      <Name>Tablix1DataSet</Name>
      <CommandText>…</CommandText>
      <RowsRead>21</RowsRead>
      <TotalTimeDataRetrieval>1626</TotalTimeDataRetrieval> <-- ExecuteReaderTime + CancelCommandTime
      <QueryPrepareAndExecutionTime>447</QueryPrepareAndExecutionTime>
      <ExecuteReaderTime>447</ExecuteReaderTime>
      <DataReaderMappingTime>0</DataReaderMappingTime>
      <DisposeDataReaderTime>0</DisposeDataReaderTime>
      <CancelCommandTime>1179</CancelCommandTime> <-- There were a bunch of datasets that had a high CancelCommandTime
    </DataSet>
  </DataSets>
</Connection>
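If you would rather rank the offenders than eyeball each XML fragment, the verbose AdditionalInfo XML can be shredded with a query along these lines. Treat this as a rough sketch: the catalog database name (ReportingService) and the exact element hierarchy under AdditionalInfo are assumptions, so verify both against your install.

-- Rank datasets by connection open time from the verbose AdditionalInfo XML.
-- Assumes an /AdditionalInfo/Connections/Connection/DataSets/DataSet hierarchy.
;WITH logx AS
(
    SELECT ItemPath,
           TimeStart,
           CAST(AdditionalInfo AS xml) AS Info   -- cast in case the view exposes it as text
    FROM   ReportingService.dbo.ExecutionLog3
)
SELECT  l.ItemPath,
        d.ds.value('(Name)[1]', 'nvarchar(256)')                 AS DataSetName,
        c.conn.value('(ConnectionOpenTime)[1]', 'int')           AS ConnectionOpenTimeMs,
        d.ds.value('(TotalTimeDataRetrieval)[1]', 'int')         AS TotalTimeDataRetrievalMs
FROM    logx AS l
CROSS APPLY l.Info.nodes('/AdditionalInfo/Connections/Connection') AS c(conn)
CROSS APPLY c.conn.nodes('DataSets/DataSet')                       AS d(ds)
ORDER BY ConnectionOpenTimeMs DESC;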

SharePoint ULS Logs

Where do we go from here to get more information? How about the SharePoint ULS log? To start, I turned off everything and then bumped the following categories to Verbose. I went this route for two reasons: first, I wanted to see what this looked like without added noise; second, I wanted to see if this alone would provide what I needed.

  • Excel Services Application
    • Data Model
    • Excel Calculation Services
  • PowerPivot Service
    • Request Processing
    • Unknown
    • Usage
  • SQL Server Reporting Services
    • Power View
    • Report Server Data Extension
    • Report Server Processing

I used UlsViewer to combine the log files that I had for my WFE and my App Server to see one view of the logs. There are a couple of items I noticed from a logging perspective:

1.  The Power View Category was not useful in this case as we were dealing with Processing/Data Retrieval which was outside of the Silverlight control

2.  When we were about to process the datasets, we saw the following block of entries indicating the items on the report.

SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart3_Month_Name' which uses or contains NonNaturalGroup.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart3_ItemID' which uses or contains NonNaturalGroup.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart3' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope '' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Band1_ItemID' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart4_ItemID' which uses or contains NonNaturalGroup.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart4_Category' which uses or contains NonNaturalGroup.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart4' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope '' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Band2_Month_Name' which uses or contains PeerChildScopes.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'Chart4_SyncDataSet' which uses or contains Aggregates.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Report  contains a scope 'SmallMultipleContainer1_SyncDataSet' which uses or contains Aggregates.  This may prevent optimizations from being applied to parent or child scopes.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317

3.  After these entries, I saw the start of a given DataSet being processed

SQL Server Reporting Services        Report Server Processing        00000        Verbose        Data source 'EntityDataSource': Transaction = False, MergeTran = False, NumDataSets = 1        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317

4.  Then followed the timing for how long it took to open a connection

SQL Server Reporting Services        Report Server Processing        00000        Verbose        Opening a connection for DataSource: EntityDataSource took 141 ms.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317
SQL Server Reporting Services        Report Server Processing        00000        Verbose        Data source 'EntityDataSource': Created a connection.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317

5. This was then ended with the following entry

SQL Server Reporting Services        Report Server Processing        00000        Verbose        Data source 'EntityDataSource': Processing of all data sets completed.        8dcc0d9c-7d7f-706b-8c45-2aa4abdb9317

This was our signature for a given dataset being processed. However, with what I had enabled, it didn't really shed any light on why it was slow. The only thing I potentially noticed was that in between these entries, the number of log entries was different. For the slower run, I saw more entries than for the faster run. The entries just looked like background information.

I was then pointed to VerboseEX logging by a friend of mine, Gyorgy Homolya from the SharePoint Support group. He indicated that this would show additional entries related to the SQL commands that SharePoint was issuing within different scopes. This turns everything up to VerboseEX and is enabled using the following PowerShell command:

get-sploglevel | set-sploglevel -traceseverity verboseex

This didn't show up before because I had turned off the main SharePoint logs, but another thing you can use to spot the start of a Power View report is to search for "RenderEdit". This is the command that will be issued to begin processing of a given Power View report - the main entry point.

SharePoint Foundation        General        6t8b        Verbose        Looking up context  site http://asaxtontest1:80/_vti_bin/reportserver/?rs:Command=RenderEdit&rs:ProgressiveSessionId=d9b277f73c204cfb92a6193c284c043entvusm24nykqfbbxmfzr0h55 in the farm SharePoint_Config

NOTE:  VerboseEX will severely bloat your ULS log.  Very quickly, and without much going on with the server, my WFE log got up to ~250MB.  Be very careful when turning this on, and I don't recommend doing this on a production box.

Based on what we found above, I did some searches for "DataSet" to get back to the first entries we had above. Of note, the VerboseEX output now includes the SQL commands being issued, so I had some false hits until I got to what I was looking for. Then we had the start entry for our first DataSet:

SQL Server Reporting Services        Report Server Processing        00000        Verbose        Data source 'EntityDataSource': Transaction = False, MergeTran = False, NumDataSets = 1        2bce0d9c-dda0-706b-8c45-2c28019a07c4

This was then followed by a SQL Scope:

SharePoint Foundation        Monitoring        nasq        Verbose        Entering monitored scope (SPSqlClient). Parent No        2bce0d9c-dda0-706b-8c45-2c28019a07c4

This was then followed by entries pertaining to the SQL Command and then the following.

SharePoint Foundation        Monitoring        b4ly        Verbose        Leaving Monitored Scope (SPSqlClient). Execution Time=101.5835        2bce0d9c-dda0-706b-8c45-2c28019a07c4
SharePoint Foundation        Monitoring        nass        Verbose        ____Execution Time=101.5835        2bce0d9c-dda0-706b-8c45-2c28019a07c4

Looking further down, we came to the following entries, which were consistently different between the slow run and the fast run across the DataSets.

SharePoint Foundation        Monitoring        b4ly        Verbose        Leaving Monitored Scope (SPSqlClient). Execution Time=48.2024        2bce0d9c-dda0-706b-8c45-2c28019a07c4
SharePoint Foundation        Monitoring        nass        Verbose        ____Execution Time=48.2024        2bce0d9c-dda0-706b-8c45-2c28019a07c4

SharePoint Foundation        Monitoring        b4ly        Verbose        Leaving Monitored Scope (EnsureListItemsData). Execution Time=77.6741        2bce0d9c-dda0-706b-8c45-2c28019a07c4
SharePoint Foundation        Monitoring        nass        Verbose        ____SQL Query Count=3        2bce0d9c-dda0-706b-8c45-2c28019a07c4
SharePoint Foundation        Monitoring        nass        Verbose        ____Execution Time=77.6741        2bce0d9c-dda0-706b-8c45-2c28019a07c4

Here is the fast run:

SharePoint Foundation        Monitoring        b4ly        Verbose        Leaving Monitored Scope (SPSqlClient). Execution Time=6.264        6dcf0d9c-9d35-706b-8c45-270465d749b0
SharePoint Foundation        Monitoring        nass        Verbose        ____Execution Time=6.264        6dcf0d9c-9d35-706b-8c45-270465d749b0

SharePoint Foundation        Monitoring        b4ly        Verbose        Leaving Monitored Scope (EnsureListItemsData). Execution Time=21.718        6dcf0d9c-9d35-706b-8c45-270465d749b0
SharePoint Foundation        Monitoring        nass        Verbose        ____SQL Query Count=3        6dcf0d9c-9d35-706b-8c45-270465d749b0
SharePoint Foundation        Monitoring        nass        Verbose        ____Execution Time=21.718        6dcf0d9c-9d35-706b-8c45-270465d749b0

EnsureListItemsData seemed to be chewing up more time in the slow run. We also had the query that was taking the large amount of time (in milliseconds) out of the 3 queries that were within the EnsureListItemsData scope:

SharePoint Foundation        Database        tzkv        Verbose        SqlCommand: '  SELECT TOP(@NUMROWS) t3.[nvarchar9] AS c14c8, t1.[SortBehavior] AS c0, t3.[nvarchar12] AS c14c10, CASE WHEN DATALENGTH(t1.DirName) = 0 THEN t1.LeafName WHEN DATALENGTH(t1.LeafName) = 0 THEN t1.DirName ELSE t1.DirName + N'/' + t1.LeafName END  AS c15, t4.[nvarchar4] AS c21c6, t4.[tp_Created] AS c21c11, t1.[CheckinComment] AS c28, UserData.[tp_ItemOrder], UserData.[tp_ModerationStatus], UserData.[tp_Created], t1.[Size] AS c23, UserData.[nvarchar1], UserData.[nvarchar6], UserData.[tp_WorkflowInstanceID], t1.[ETagVersion] AS c36, t2.[nvarchar4] AS c3c6, t3.[nvarchar4] AS c14c6, UserData.[ntext1], UserData.[tp_AppAuthor], t2.[tp_Created] AS c3c11, t1.[MetaInfo] AS c18, t4.[nvarchar11] AS c21c9, UserData.[tp_AppEditor], t1.[TimeLastModified] AS c13, UserData.[tp_ID], t4.[nvarchar1] AS c21c4, t1.[Size] AS c26, UserData.[nvarchar5], UserData.[tp_GUID], t1.[ParentVersionString] AS c34, UserData.[bit1], t1.[TimeCreated] AS c1, UserData.[tp_Editor], t2.[nvarchar11] AS c3c9, t3.[nvarchar1] AS c14c4, t3.[nvarchar11] AS c14c9, UserData.[tp_Author], t2.[nvarchar1] AS c3c4, t3.[tp_Created] AS c14c11, t1.[ItemChildCount] AS c29, t7.[Title] AS c33c32, t1.[IsCheckoutToLocal] AS c16, UserData.[tp_ContentTypeId], t4.[nvarchar6] AS c21c7, t1.[LTCheckoutUserId] AS c24, t6.[Title] AS c31c32, UserData.[tp_WorkflowVersion], UserData.[nvarchar4], UserData.[tp_CheckoutUserId], t3.[nvarchar6] AS c14c7, UserData.[tp_Version], t5.[nvarchar1] AS c4, UserData.[tp_IsCurrentVersion], UserData.[nvarchar9], t2.[nvarchar6] AS c3c7, UserData.[tp_HasCopyDestinations], UserData.[tp_Level], t4.[nvarchar12] AS c21c10, t1.[Id] AS c19, t4.[tp_ID] AS c21c5, t1.[DirName] AS c22, t1.[ParentLeafName] AS c35, t1.[LeafName] AS c2, UserData.[nvarchar3], UserData.[tp_Modified], UserData.[tp_UIVersion], t1.[FolderChildCount] AS c30, UserData.[nvarchar8], t2.[tp_ID] AS c3c5, t3.[tp_ID] AS c14c5, UserData.[tp_CopySource], UserData.[tp_InstanceID], t2.[nvarchar12] AS c3c10, t1.[Type] AS c12, t1.[ProgId] AS c17, t4.[nvarchar9] AS c21c8, t1.[ClientId] AS c25, UserData.[tp_UIVersionString], t1.[ScopeId] AS c20, UserData.[nvarchar2], UserData.[nvarchar7], t2.[nvarchar9] AS c3c8 FROM AllDocs AS t1 WITH(FORCESEEK(AllDocs_Url(SiteId,DeleteTransactionId)),NOLOCK) INNER LOOP JOIN AllUserData AS UserData ON (UserData.[tp_RowOrdinal] = 0) AND (t1.SiteId=UserData.tp_SiteId) AND (t1.SiteId = @SITEID) AND (t1.ParentId = UserData.tp_ParentId) AND (t1.Id = UserData.tp_DocId) AND ( (UserData.tp_Level = 1 OR  UserData.tp_Level =255) ) AND (t1.Level = UserData.tp_Level) AND ((UserData.tp_Level = 255 AND t1.LTCheckoutUserId =@IU OR (UserData.tp_Level = 1 AND (UserData.tp_DraftOwnerId IS NULL) OR UserData.tp_Level = 2)AND (t1.LTCheckoutUserId IS NULL OR t1.LTCheckoutUserId <> @IU ))) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) AND (t1.[DeleteTransactionId] = 0x ) LEFT OUTER LOOP JOIN AllUserData AS t2 WITH(FORCESEEK(AllUserData_PK(tp_SiteId,tp_ListId,tp_DeleteTransactionId,tp_IsCurrentVersion,tp_ID,tp_CalculatedVersion)),NOLOCK) ON (UserData.[tp_Editor]=t2.[tp_ID]) AND (UserData.[tp_RowOrdinal] = 0) AND (t2.[tp_RowOrdinal] = 0) AND ( (t2.tp_Level = 1) ) AND (t2.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (t2.[tp_CalculatedVersion] = 0 ) AND (t2.[tp_DeleteTransactionId] = 0x ) AND (t2.tp_ListId = @L3 AND t2.tp_SiteId = @SITEID) AND (UserData.tp_ListId = @L4 AND UserData.tp_SiteId = 
@SITEID) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) LEFT OUTER LOOP JOIN AllUserData AS t3 WITH(FORCESEEK(AllUserData_PK(tp_SiteId,tp_ListId,tp_DeleteTransactionId,tp_IsCurrentVersion,tp_ID,tp_CalculatedVersion)),NOLOCK) ON (UserData.[tp_CheckoutUserId]=t3.[tp_ID]) AND (UserData.[tp_RowOrdinal] = 0) AND (t3.[tp_RowOrdinal] = 0) AND ( (t3.tp_Level = 1) ) AND (t3.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (t3.[tp_CalculatedVersion] = 0 ) AND (t3.[tp_DeleteTransactionId] = 0x ) AND (t3.tp_ListId = @L3 AND t3.tp_SiteId = @SITEID) AND (UserData.tp_ListId = @L4 AND UserData.tp_SiteId = @SITEID) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) LEFT OUTER LOOP JOIN AllUserData AS t4 WITH(FORCESEEK(AllUserData_PK(tp_SiteId,tp_ListId,tp_DeleteTransactionId,tp_IsCurrentVersion,tp_ID,tp_CalculatedVersion)),NOLOCK) ON (UserData.[tp_Author]=t4.[tp_ID]) AND (UserData.[tp_RowOrdinal] = 0) AND (t4.[tp_RowOrdinal] = 0) AND ( (t4.tp_Level = 1) ) AND (t4.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (t4.[tp_CalculatedVersion] = 0 ) AND (t4.[tp_DeleteTransactionId] = 0x ) AND (t4.tp_ListId = @L3 AND t4.tp_SiteId = @SITEID) AND (UserData.tp_ListId = @L4 AND UserData.tp_SiteId = @SITEID) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) LEFT OUTER LOOP JOIN AllUserData AS t5 WITH(FORCESEEK(AllUserData_PK(tp_SiteId,tp_ListId,tp_DeleteTransactionId,tp_IsCurrentVersion,tp_ID,tp_CalculatedVersion)),NOLOCK) ON (t1.[LTCheckoutUserId]=t5.[tp_ID]) AND (t5.[tp_RowOrdinal] = 0) AND ( (t5.tp_Level = 1) ) AND (t5.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (t5.[tp_CalculatedVersion] = 0 ) AND (t5.[tp_DeleteTransactionId] = 0x ) AND (t5.tp_ListId = @L3 AND t5.tp_SiteId = @SITEID) LEFT OUTER LOOP JOIN AppPrincipals AS t6 WITH(NOLOCK) ON (UserData.[tp_AppAuthor]=t6.[Id]) AND (UserData.[tp_RowOrdinal] = 0) AND (t6.SiteId = @SITEID) AND (UserData.tp_ListId = @L4 AND UserData.tp_SiteId = @SITEID) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) LEFT OUTER LOOP JOIN AppPrincipals AS t7 WITH(NOLOCK) ON (UserData.[tp_AppEditor]=t7.[Id]) AND (UserData.[tp_RowOrdinal] = 0) AND (t7.SiteId = @SITEID) AND (UserData.tp_ListId = @L4 AND UserData.tp_SiteId = @SITEID) AND (UserData.[tp_IsCurrentVersion] = CONVERT(bit,1) ) AND (UserData.[tp_CalculatedVersion] = 0 ) AND (UserData.[tp_DeleteTransactionId] = 0x ) WHERE ( (UserData.tp_Level = 1 OR  UserData.tp_Level =255)  AND ( UserData.tp_Level= 255 AND UserData.tp_CheckoutUserId = @IU OR  ( UserData.tp_Level  = 2 AND UserData.tp_DraftOwnerId IS NOT NULL OR UserData.tp_Level  = 1 AND UserData.tp_DraftOwnerId IS  NULL  ) AND ( UserData.tp_CheckoutUserId IS  NULL  OR UserData.tp_CheckoutUserId <> @IU))) AND (UserData.tp_SiteId=@SITEID) AND (UserData.tp_RowOrdinal=0) AND (((t1.[LeafName] = @L5LNP) AND (t1.[DirName] = @L6DNP)) AND t1.SiteId=@SITEID AND (t1.DirName=@DN OR t1.DirName LIKE @DNEL+N'/%')) ORDER BY t1.[SortBehavior]  DESC ,UserData.[tp_ID]  ASC  OPTION (FORCE ORDER, MAXDOP 1)'     CommandType: Text CommandTimeout: 0     Parameter: '@LFFP' Type: UniqueIdentifier Size: 0 Direction: Input Value: '00000000-0000-0000-0000-000000000000'     Parameter: '@SITEID' Type: UniqueIdentifier Size: 0 Direction: Input 
Value: '986edfff-e480-4c77-b8ef-8e09b3a5a27d'     Parameter: '@IU' Type: Int Size: 0 Direction: Input Value: '1073741823'     Parameter: '@L3' Type: UniqueIdentifier Size: 0 Direction: Input Value: '44daded8-6a19-474f-81cb-6c22d9208dbd'     Parameter: '@L4' Type: UniqueIdentifier Size: 0 Direction: Input Value: '60467b41-a312-452e-a22c-6e96f08ed0c3'     Parameter: '@L5LNP' Type: NVarChar Size: 4000 Direction: Input Value: 'HelloWorldPicnicPowerView.xlsx'     Parameter: '@L6DNP' Type: NVarChar Size: 4000 Direction: Input Value: 'PowerPivotGallery'     Parameter: '@DN' Type: NVarChar Size: 4000 Direction: Input Value: 'PowerPivotGallery'     Parameter: '@DNEL' Type: NVarChar Size: 4000 Direction: Input Value: 'PowerPivotGallery'     Parameter: '@NUMROWS' Type: BigInt Size: 0 Direction: Input Value: '2'     Parameter: '@RequestGuid' Type: UniqueIdentifier Size: 0 Direction: Input Value: '2bce0d9c-edc6-706b-8c45-26dd20966162'        2bce0d9c-dda0-706b-8c45-2c28019a07c4

Here is where the VerboseEX comes in handy. First off it dumps out the Managed Call Stack, so we can see who even issued the call that got us here.

SharePoint Foundation        Database        tzkk        VerboseEx        SqlCommand StackTrace-Managed:   
at Microsoft.SharePoint.Utilities.SqlSession.OnPreExecuteCommand(SqlCommand command)    
at Microsoft.SharePoint.Utilities.SqlSession.ExecuteReader(SqlCommand command, CommandBehavior behavior, SqlQueryData monitoringData, Boolean retryForDeadLock)    
at Microsoft.SharePoint.SPSqlClient.ExecuteQueryInternal(Boolean retryfordeadlock)    
at Microsoft.SharePoint.SPSqlClient.ExecuteQuery(Boolean retryfordeadlock)    
at Microsoft.SharePoint.Library.SPRequestInternalClass.GetListItemDataWithCallback2(IListItemSqlClient pSqlClient, String bstrUrl, String bstrListName, String bstrViewName, String bstrViewXml, SAFEARRAYFLAGS fSafeArrayFlags, ISP2DSafeArrayWriter pSACallback, ISPDataCallback pPagingCallback, ISPDataCallback pPagingPrevCallback, ISPDataCallback pFilterLinkCallback, ISPDataCallback pSchemaCallback, ISPDataCallback pRowCountCallback, Boolean& pbMaximalView)    
at Microsoft.SharePoint.Library.SPRequestInternalClass.GetListItemDataWithCallback2(IListItemSqlClient pSqlClient, String bstrUrl, String bstrListName, String bstrViewName, String bstrViewXml, SAFEARRAYFLAGS fSafeArrayFlags, ISP2DSafeArrayWriter pSACallback, ISPDataCallback pPagingCallback, ISPDataCallback pPagingPrevCallback, ISPDataCallback pFilterLinkCallback, ISPDataCallback pSchemaCallback, ISPDataCallback pRowCountCallback, Boolean& pbMaximalView)    
at Microsoft.SharePoint.Library.SPRequest.GetListItemDataWithCallback2(IListItemSqlClient pSqlClient, String bstrUrl, String bstrListName, String bstrViewName, String bstrViewXml, SAFEARRAYFLAGS fSafeArrayFlags, ISP2DSafeArrayWriter pSACallback, ISPDataCallback pPagingCallback, ISPDataCallback pPagingPrevCallback, ISPDataCallback pFilterLinkCallback, ISPDataCallback pSchemaCallback, ISPDataCallback pRowCountCallback, Boolean& pbMaximalView)    
at Microsoft.SharePoint.SPListItemCollection.EnsureListItemsData()    
at Microsoft.SharePoint.SPListItemCollection.get_Count()    
at Microsoft.SharePoint.SPWeb.GetItem(String strUrl, Boolean bFile, Boolean cacheRowsetAndId, Boolean bDatesInUtc, String[] fields)    
at Microsoft.SharePoint.SPFile.get_Item()    
at Microsoft.AnalysisServices.SPClient.SPViewOnlyFileAccessor.Load(String url)    
at Microsoft.AnalysisServices.SPClient.KeepAliveThread.Session.<GetFreshETagExceptions>d__e.MoveNext()
    
at System.Linq.Enumerable.<ConcatIterator>d__71`1.MoveNext()     at System.Linq.Enumerable.Any[TSource](IEnumerable`1 source)    
at Microsoft.AnalysisServices.SPClient.WorkbookSession.BeginActivity()    
at Microsoft.AnalysisServices.AdomdClient.XmlaClient.WriteEndOfMessage(Boolean callBaseDirect)    
at Microsoft.AnalysisServices.AdomdClient.XmlaClient.EndRequest()    
at Microsoft.AnalysisServices.AdomdClient.XmlaClient.SendMessage(Boolean endReceivalIfException, Boolean readSession, Boolean readNamespaceCompatibility)    
at Microsoft.AnalysisServices.AdomdClient.XmlaClient.ExecuteStatement(String statement, IDictionary connectionProperties, IDictionary commandProperties, IDataParameterCollection parameters, Boolean isMdx)
    

We can also see the SQL statistics IO for this command. The item that differed between the slow and fast runs was the following (here is the slow run).

SharePoint Foundation        Database        fdz2        VerboseEx        SQL IO Statistics: Procedure , Table 'AllDocs'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 449, lob physical reads 0, lob read-ahead reads 96.        2bce0d9c-dda0-706b-8c45-2c28019a07c4

LOB Logical Reads and LOB Read-Ahead Reads. Based on the query above, we can easily just pump this into Management Studio to compare the query runs and Execution Plans on the SQL Side.

clip_image003[4]

The PowerPivot Gallery is on the left and the regular Document Library is on the right. 234ms vs. 78ms. Of note, the Execution Plans were identical and we saw seeks on the AllDocs table. Also, updating statistics and rebuilding the indexes did not change anything.

clip_image004[4]

Looking at a Profiler trace of these two queries, we see that the major difference is the Reads, which was highlighted by the SharePoint ULS log with the statistics IO.

clip_image005[4]

This brought us down to which fields in the query related to AllDocs were causing us to read a lot more data. It was narrowed down to the MetaInfo field on the AllDocs table. The MetaInfo field is used to store properties about a document added to a given library; it is not the actual document itself, as that is stored in the DocStreams table. Looking at the DATALENGTH of the MetaInfo field, we see the following.

-- PowerPivot Gallery
select datalength(MetaInfo) from alldocs where id = '9319533E-35CC-46F8-89E0-C165AE588B0D'

-- Document Library
select datalength(MetaInfo) from alldocs where id = '1B4CEFB4-F6D0-4DCC-A7BA-4C6F02FFF846'

PowerPivot Gallery: 899,821 bytes
Document Library: 293 bytes

That's a pretty significant difference and also explains the number of reads that are needed, as we have more data to go grab. For the PowerPivot Gallery, the MetaInfo field is where we store the snapshots (screenshots) that we see in the Gallery.
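If you want to check for other oversized property bags, a quick sketch like the following ranks documents by MetaInfo size. The table and column names come from the query shown earlier; run it read-only for diagnostics, since querying SharePoint content databases directly is not something to make a habit of in production.

-- Rank documents by the size of their MetaInfo property bag.
-- Run against the SharePoint content database that backs the library.
SELECT TOP (10)
       DirName,
       LeafName,
       DATALENGTH(MetaInfo) AS MetaInfoBytes
FROM   AllDocs WITH (NOLOCK)
ORDER BY DATALENGTH(MetaInfo) DESC;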

clip_image006[4]

This actually hasn't changed between SharePoint 2010 and SharePoint 2013 with regard to the PowerPivot Gallery. What did change is ADOMD.NET, to accommodate some of the architectural differences in SharePoint 2013. The main change was that we now check whether the workbook has been updated at all. Part of that resulted in us hitting the MetaInfo field significantly more, which incurred the delay.

Based on the findings above and how we narrowed down the performance problem we were seeing within Power View, we were able to get some items addressed to bring performance back in line with what we were seeing in SharePoint 2010. The KB article for the issue itself is the following:

FIX: Slow performance when you render a Power View report that uses a SQL Server 2012 SP1 PowerPivot workbook as its data source
http://support.microsoft.com/kb/2846345

This was released as part of the following Cumulative Update:

Cumulative update package 5 for SQL Server 2012 SP1
http://support.microsoft.com/kb/2861107/en-us

 

Adam W. Saxton | Microsoft Escalation Services
http://twitter.com/awsaxton

After applying Service Pack 1 for SQL Server 2012 you may encounter a known issue! Details inside…..


I’d like to make you aware of an issue that may occur after installing Service Pack 1 for SQL Server 2012. Some of the symptoms are:

  • The Windows Installer process (MSIExec.exe) repeatedly starts and attempts to repair specific SQL Server assemblies. Your Windows Application event log will contain entries like:

EventId: 1004

Source: MsiInstaller

Description: Detection of product '{A7037EB2-F953-4B12-B843-195F4D988DA1}', feature 'SQL_Tools_Ans', Component '{0CECE655-2A0F-4593-AF4B-EFC31D622982}' failed. The resource '' does not exist.

EventId: 1001

Source: MsiInstaller

Description: Detection of product '{A7037EB2-F953-4B12-B843-195F4D988DA1}', feature 'SQL_Tools_Ans' failed during request for component '{6E985C15-8B6D-413D-B456-4F624D9C11C2}'

  • Users are unable to log into Windows with their profile.
  • Insufficient resource errors when starting various services/applications.
  • The Windows registry has grown close to the 2 GB limit.

*If you installed Service Pack 1 through the Product Update/Slipstream method, the symptoms do not occur.

 

The good news is that the original cause of the problem has been resolved. The fix can be downloaded via the following link:

Windows Installer starts repeatedly after you install SQL Server 2012 SP1

http://support.microsoft.com/kb/2793634/en-us

You’ll also notice an update to the Service Pack 1 download page:

http://www.microsoft.com/en-us/download/details.aspx?id=35575

NOTE: MANDATORY SP1 Hotfix available: SP1 installations are currently experiencing an issue in certain configurations as described in Knowledge Base article KB2793634. The article provides a fix for this issue that is currently available for download, and is MANDATORY for application immediately following a Service Pack 1 installation. The fix is also being made available on Microsoft Update.

If your symptoms are severe enough in that you’re unable to apply the patch, contact Customer Support Services and we’ll be glad to help.

Troy Moen – Support Escalation Engineer

Error trying to access SharePoint List from Power Query


When trying to pull data from a SharePoint List Data Source, using the Microsoft Online Services ID, you may see the following output in your query:

SNAGHTML6ca120d

DataFormat.Error: OData: The given URL neither points to an OData Service or a feed: ‘https://login.microsoftonline.com/login.srf?wa=wsignin1.0&rpsnv=2&ct=1375471406&rver=6.1.6203.0&wp=MBI&wreply=<URL>&lc=1033&id=500046&guests=1’.

You may also see an error saying “The user was not authorized”.

SNAGHTML6af584f

You may see one of these errors when you do not select “Keep me signed in” when logging into O365.

SNAGHTML6c95faf

After putting the check in “Keep me signed in”, you should then see the proper output.

SNAGHTML6c9a23b

Adam W. Saxton | Microsoft Escalation Services
http://twitter.com/awsaxton

How It Works: SQL Server 2012 Database Engine Task Scheduling


Over the years the SQL Server scheduling algorithms have been documented in various publications.  Specifically, ‘The Guru’s Guide to SQL Server Architecture and Internals’ has a chapter covering the details, written by the scheduler developer (Sameer) and Ken, with the technical content reviewed by me.

This post outlines a few of the changes that were made in SQL Server 2012.  The post is not intended to cover all the nuances (there are far too many); instead, I will be highlighting a portion of the new algorithm so you understand the SQL Server behavior.  I also take several liberties with the description of the algorithm so this post does not turn into a whitepaper.

Algorithm Recap

Scheduling assignment starts at the NUMA node level.  The basic algorithm is round-robin assignment for new connections: as each NEW connection arrives, it is assigned to a NUMA node in round-robin fashion.

     Note:  SQL Server Books Online outlines how to associate listeners with specific NUMA nodes.  

A new connection is assigned to the scheduler with the smallest load factor within the same NUMA node.   Load factor is loosely equivalent to the number of tasks assigned to the scheduler.  You can view the scheduler information in the sys.dm_os_schedulers DMV.  The scheduler choice becomes the preferred (or hint) scheduler for the life of the connection.
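A quick way to see this load information per node is a query like the following (a minimal sketch against sys.dm_os_schedulers):

-- Show per-scheduler load grouped by NUMA node.
SELECT parent_node_id,
       scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       load_factor
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE'          -- schedulers used for user work
ORDER BY parent_node_id, scheduler_id;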

The task assignment boundary is when the client submits a new command (Batch, RPC, etc…)  The associated task in SQL Server is assigned to a scheduler and the task remains associated with the scheduler for the lifetime of the command (until the batch completes.)

The task assignment is also based on load factor.   If the preferred scheduler (after connection) has 20% more load than the other schedulers for the same NUMA node, the task is assigned to the scheduler with the least load on the same NUMA node.   This is often referred to as the 120% rule.  Once your preferred scheduler reaches 120% of the load of the other schedulers on the same NUMA node, new tasks are assigned to other schedulers within the same NUMA node.

SQL Server 2012 Changes – EE SKU Only

Out of the gate, SQL Server 2012 does not make any changes.  What - Why did I write the blog?  Am I just wasting your time?   No, I say this because all SKUs other than Enterprise Edition use the same fundamental logic that was introduced in SQL Server 7.0 (UMS scheduling) and as described above.  The EE SKU has been updated to further accommodate CPU resource governance.   If, and only if, you are running the SQL Server Enterprise Edition SKU do these new changes apply.

New Connection Assignments: Assigned round-robin to the nodes.  1st to Node 1, 2nd to Node 2, 3rd to Node 1 and so forth.   The core algorithm for connection assignment remains the same in SQL Server 2012, across all SKUs.   As you can see from the diagram, port 1433 is bound to both NUMA nodes, so the round-robin occurs between nodes.   The load of the schedulers within the target node is queried and the new connection is assigned to the scheduler with the least load.   This scheduler becomes the preferred scheduler for the life of the connection.

A new connection does not have a preferred scheduler.   This means the scheduler assignment queries the schedulers within the target node, and the scheduler with the least load is used.  In the example, Session 1 is assigned to Node 1, Scheduler 2.   This algorithm is unchanged across all SQL Server 2012 SKUs.   The connection has not been assigned to a pool yet, so the initial logic to assign the connection's task has to work from the basic load factor of the schedulers.

image

 

Least Load – After Connection Made and Preferred Scheduler Assigned

Prior to SQL 2012 or SQL Server 2012 NON-EE SKUs 

The new task request (Batch, RPC, Disconnect, …) uses the preferred scheduler assignment.   In our example the preferred scheduler is Scheduler #2.   If the current scheduler hint (preferred scheduler) has exceeded 120% of the load factor of the other schedulers within the same node, the task is assigned to a different scheduler.  The preferred scheduler remains the same for the connection.

In this example the new batch arrives.  The preferred scheduler is Scheduler #2, but the load factor has increased to 13, which is > 120% of 10.  So the batch would be assigned to Scheduler #1 because it is deemed to have less load and is likely to provide more CPU resources for the task.

image

 

SQL Server 2012 EE SKU

Starting with the SQL Server 2012 EE SKU, the behavior changes for a new task (Batch, RPC, Disconnect, …) only after the connection has been established and pool assignment made.   (No connection, no preferred scheduler yet!)

Note: Even if Resource Governor (RG) is disabled, the DEFAULT pool is used internally.  

The process of assigning a new task to a specific scheduler is where the SQL Server 2012 EE SKU logic was upgraded from the basic load factor algorithm.  The idea is to find the scheduler that can provide the best CPU capacity for the pool the connection is assigned to, within the same NUMA node, while minimizing overall scheduler assignment randomization.

Each scheduler has per-resource-pool CPU targeting and the associated tracking capabilities, along with the traditional load factor tracking.  Instead of the traditional, task-count-based load factor, a CPU resource targeting load factor is used for task assignments.  In fact, SQL Server tracks the average CPU targeting per resource pool, per node.

image

The scheduler assignment starts with the preferred scheduler (hinted scheduler for the connection.)   As long as adding the new task to the preferred scheduler does not drop the targeted resource pool consumption capabilities below 80% of the average for the pool across all schedulers, on the same NUMA node, the preferred scheduler is used as the target.

Once adding another task to the preferred scheduler, within the same resource pool, would cause the targeted resources to fall below 80% of the average, the algorithm instead selects the scheduler with the most targeted resources available for the pool.

Let’s do a mathematical example to show this in action, at a high level.

Note:  The math included here is to show the basics and is not the exact algorithm.

In this example, a new task for the pool would currently be assigned to Scheduler #2, the scheduler with the most resources available for the pool.   On average, Scheduler #2 can currently provide 6.25 (resources) per task vs. the 5 (resources) on Scheduler #1 for the same pool.   In very simple terms, the 10 tasks on Scheduler #1 each get 5% of the CPU and the 8 tasks on Scheduler #2 each get 6.25% of the CPU resources.

Scheduler | RG Pool Target | Pool Runnable Tasks | Avg Pool/Task |
1         | 50             | 10                  | 5             |
2         | 50             | 8                   | 6.25          | Currently Best Target – More resources to provide for tasks in the same pool

Current Avg: 5.625 = (6.25 + 5) / 2

Now, assume I have a new task to assign and the preferred scheduler is Scheduler #1.   SQL Server 2012 EE SKU uses the current average and increased resource usage to help determine the assignment.

80th percentile average = 4.5000   (the average resource units a task within the same pool achieves across all schedulers of the same node)

Scheduler | RG Pool Target | Pool Runnable Tasks +1 Additional | Avg Pool/Task +1 |
1         | 50             | 11                                | 4.5454           | Not below 80th percentile
2         | 50             | 9                                 | 5.55             |

Adding another task to the same pool on Scheduler #1 does not drop the average below the 80th percentile for the entire NUMA node.  The assignment of this request will still occur on Scheduler #1.

Current Avg Adjusted: 5.3977 = (6.25 + 4.5454) / 2, and 80th percentile = 4.3181

Now let's attempt to add a task to Scheduler #1 again.   This time, assigning the task to Scheduler #1 would drop the average below the 80th percentile.  SQL Server will instead find the scheduler with the most resources for the pool and assign the task there.   In our example this is Scheduler #2.

Scheduler | RG Pool Target | Pool Runnable Tasks +1 Additional | Avg Pool/Task +1 |
1         | 50             | 12                                | 4.1666           | Below 80th percentile
2         | 50             | 9                                 | 5.55             |

The adjusted load would then become the following.

Scheduler | RG Pool Target | Pool Runnable Tasks | Avg Pool/Task |
1         | 50             | 11                  | 4.5454        |
2         | 50             | 9                   | 5.55          | Added task

Current Avg Adjusted: 5.047 = (5.55 + 4.5454) / 2

Overall, the assignment of two new tasks within the same resource pool and NUMA node did not cause the targeted resources for the pool to vary significantly, and it maintained the goal of scheduler assignment locality with reduced movement.

TRACE FLAGS USE WITH CAUTION

I considered not documenting the trace flags as I have never seen us use them for a production system.  There are some trace flags to control the behavior.   As always, these are not intended for extended use and should only be used under the guidance of Microsoft SQL Server support.  Because these change the task assignment algorithm they can impact the concurrency and performance of your system.

-T8008      - Force the scheduler hint to be ignored.  Always assign to the scheduler with the least load (pool based on SQL 2012 EE SKU or Load Factor for previous versions and SKUs.)

-T8016       - Force load balancing to be ignored.  Always assign to the preferred scheduler.

Bob Dorr - Principal SQL Server Escalation Engineer


Optimizing partition split when the partition is not empty


Some of our Field Engineers, Kal Yella and Denzil Ribeiro (@DenzilRibeiro), have posted a blog that discusses how to optimize adding a partition when either the rightmost or leftmost partition is not empty.  It is well worth the read, so we are posting it on PSSSQL to help get it out there.

Oops… I forgot to leave an empty SQL table partition, how can I split it with minimal IO impact?
http://blogs.msdn.com/b/sql_pfe_blog/archive/2013/08/13/oops-i-forgot-to-leave-an-empty-sql-table-partition-how-can-i-split-it-with-minimal-io-impact.aspx

Venu Cherukupalli - SQL Support Escalation Services
https://twitter.com/cherku

Power View display issues with Page Viewer Web Part


I ran into two separate issues this week that both dealt with displaying a Power View report through a Page Viewer Web Part within a SharePoint page in SharePoint 2013. The resolution to both was the same, so I thought I would lump them both into this post.

Issue 1: Visual Artifacts while scrolling page

The first issue was that some of the client machines were browsing the page within Internet Explorer 8 & 9.  When they did this, and tried scrolling the actual page, they saw some video artifacts and it didn’t look right.

SNAGHTML4e3d938

Issue 2: Managed Navigation cut off by Page Viewer Web Part

The second issue dealt with the Managed Navigation feature within SharePoint 2013.  The navigation looked fine when on a normal page.

SNAGHTML4e48b66

However, when going to the page that had the Page Viewer Web Part on it that hosted the Power View report, the menu would get cut off.

SNAGHTML4e50e5b

Solution: Silverlight Web Part

Within SharePoint 2013, there is a web part called the Silverlight Web Part. Switching over to use this Web Part instead of the Page Viewer Web Part corrected both issues that we were hitting.

SNAGHTML4e6c9ae

There are two main configuration points for this web part to get it to work, outside of the visual settings like height and width. The first is the Application Configuration area.

SNAGHTML4e84948

The relative URL to the XAP file is the following for Power View.

/_layouts/ReportServer/ClientBin/Microsoft.Reporting.AdHoc.Shell.Bootstrapper.xap

The other setting is under “Other Settings”, called “Custom Initialization Parameters”.

SNAGHTML4e950d7

It should be something similar to the following.

ItemPath=http://admadama/PowerPivot/HelloWorldPicnicReport.rdlx,ReportServerUri=http://admadama/_vti_bin/reportserver/,ViewMode=Presentation,PreviewBar=False,Fit=True

Admadama was my site URL, so you would need to adjust that along with the path to your Power View report.

After that was set up, we no longer saw any artifacts when scrolling the page, and the menu navigation looked as expected.

SNAGHTML4eb6698

Adam W. Saxton | Microsoft Escalation Services
http://twitter.com/awsaxton

Service status watcher in SQL Server Management Studio – How it works


Have you ever wondered about the mechanism by which the SQL Server Management Studio (SSMS) Object Explorer shows the service status for the SQL Server and SQL Agent services? We recently worked with a customer on an issue related to this and thought that this might be useful information to share. So here it is.

Here is a screenshot of what we are discussing in this post:

image

You can see the service status through the green and red arrow icons next to the service name.

All of the magic to populate the information happens through the WMI layer. When you launch SSMS and connect to a SQL Server, the Object Explorer window performs a lot of initializations. One of them involves getting the service information for the two services of interest from the machine where SQL Server is running. In order to get this information, Object Explorer connects to the WMI namespace \\TOKENLEAKSERVER\root\cimv2 and performs various WMI queries. In this scenario, I am launching SSMS from a remote machine named TOKENLEAKCLIENT and connecting to a SQL Server named TOKENLEAKSERVER.

First the Object Explorer extracts information about the two services of interest from the WMI provider CIMWin32 using calls similar to the following:

Provider::GetObject - Win32_Service.Name=""MSSQLSERVER""

Provider::GetObject - Win32_Service.Name=""SQLSERVERAGENT""

After this, it sets up a notification to get state change information using the ManagementEventWatcher classes from System.Management. The notification query used is of the format:

IWbemServices::ExecNotificationQuery - select * from __InstanceModificationEvent within 10 where TargetInstance isa 'Win32_Service'

This essentially allows the Object Explorer to receive service status information every 10 seconds. Internally this will show up as the following query executed every 10 seconds under the wmiprvse.exe process that has the cimwin32.dll provider loaded:

IWbemServices::ExecQuery - select * from Win32_Service

This allows the Object Explorer to get service state change information at frequent intervals.

The polling interval of 10 seconds comes from the default value used by Object Explorer. You have the flexibility to change this polling interval using the following configuration:

On 64-bit machines: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\100\Tools\Shell => The PollingInterval DWORD should be set to value x.

On 32-bit machines: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\Tools\Shell => The PollingInterval DWORD should be set to value x.

The value x will correspond to the PollingInterval in seconds. If you set it to a value of zero, then no polling would occur and Object Explorer will not be able to obtain the service status information. Exercise appropriate caution when modifying registry values.

 

For all of the above mechanics to work, the Windows account launching SSMS needs to have appropriate permissions on the cimv2 namespace in WMI. You will notice that, by default, “Authenticated Users” do not have remote access to this namespace; only the Administrators group has this permission.

image

So, if you do not have the required permissions, you will see the following status information in the Object Explorer of SSMS.

image

If you have several SQL DBAs connecting to the same server remotely via SSMS, then every one of these clients will perform this service polling at the default polling interval (10 seconds). You might notice that wmiprvse.exe and lsass.exe consume some resources to satisfy these requests.

 

While troubleshooting this problem, we also came across the Enterprise Hotfix Rollup for Windows Server 2008 R2. Close to 90 fixes and improvements are present in this rollup. It looks like a mini-service pack! And it contains WMI related fixes as well.

You can observe all the WMI activity I mentioned above using the WMI Tracing.

Thanks & regards

Suresh B. Kandoth

Sr. Escalation Engineer, SQL Server

When is throttling in Windows Azure SQL Database really throttling….


During the course of this year I have spoken at several customer and internal events on troubleshooting the Windows Azure SQL Database environment. As I built my presentations, one of the confusing topics I found was a term used in Azure Database called throttling. It seemed that this term was used for multiple reasons and multiple different error messages. So we got together with the technical writers within Microsoft and revamped the documentation on this topic. Thanks to writer Kumar Vivek, we now have a document called Resource Management in Windows Azure SQL Database.

This document explains the conditions under which a Windows Azure SQL Database application could receive different types of errors, including the “real engine throttling” set of errors. If you have developed or are considering developing an application for Azure Database, I highly recommend you read this. It will save you time and effort in understanding how to write your application to avoid errors and how to react to certain situations unique to the Azure Database environment.

Bob Ward
Microsoft

[SQL 2012 query plan enhancement] I want to know why my query is not parallelized


In the past, we have gotten repeated questions from customers about why a particular query is not parallelized.   We didn’t have a good way to let customers know the reason until SQL 2012.

Starting SQL Server 2012, XML showplan is enhanced to include the reason why the plan is not or cannot be parallelized.

When you open the showplan XML, you will see an attribute called “NonParallelPlanReason” on the QueryPlan element.  See the example below.

     
<Statements>
  <StmtSimple StatementText="select * from sys.objects option (maxdop 1)" StatementId="1" StatementCompId="1" StatementType="SELECT" RetrievedFromCache="false" StatementSubTreeCost="0.107922" StatementEstRows="2201" StatementOptmLevel="FULL" QueryHash="0xC34FFA269AC9A844" QueryPlanHash="0x39C2C734F752156C" StatementOptmEarlyAbortReason="GoodEnoughPlanFound">
    <StatementSetOptions QUOTED_IDENTIFIER="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" NUMERIC_ROUNDABORT="false" />
    <QueryPlan NonParallelPlanReason="MaxDOPSetToOne" CachedPlanSize="96" CompileTime="6" CompileCPU="6" CompileMemory="824">

I will pick out a few of the most common ones.  Most of them are self-explanatory.

  1. MaxDOPSetToOne:  Max Degree of Parallelism is set to 1 at the query or server level
  2. NoParallelDynamicCursor:  Dynamic cursors don’t support parallel plans
  3. NoParallelFastForwardCursor:  Fast forward cursors don’t support parallel plans
  4. NoParallelCreateIndexInNonEnterpriseEdition:  We don’t support parallel index operations for non-Enterprise editions
  5. NoParallelPlansInDesktopOrExpressEdition:  No parallel plans for Express edition (SQL 2000 desktop edition is the same as express edition for later builds)
  6. TSQLUserDefinedFunctionsNotParallelizable:  A scalar T-SQL user defined function is used in the query
  7. CLRUserDefinedFunctionRequiresDataAccess:  If a CLR user defined function ends up accessing data via the context connection, the query can’t be parallelized.  A CLR user defined function that doesn’t do data access via the context connection can be parallelized.
  8. NoParallelForMemoryOptimizedTables:  This applies to any query accessing memory optimized tables (part of the SQL 2014 in-memory OLTP feature)
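If you want to see which reasons are showing up on your own server, the cached plans can be interrogated for the attribute. Here is a rough sketch; it scans the plan cache, so run it sparingly on a busy system.

-- Find cached plans that carry a NonParallelPlanReason attribute (SQL Server 2012 and later).
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       cp.usecounts,
       qp.query_plan.value('(//QueryPlan/@NonParallelPlanReason)[1]', 'nvarchar(200)') AS NonParallelPlanReason,
       st.text AS query_text
FROM   sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)   AS st
WHERE  qp.query_plan.exist('//QueryPlan[@NonParallelPlanReason]') = 1
ORDER BY cp.usecounts DESC;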

We will blog more about 2012 XML plan enhancements in the future.  Stay tuned. 

 

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

 

 
