
Interpreting the counter values from sys.dm_os_performance_counters


The performance counters exposed by SQL Server are invaluable tools for monitoring the health of an instance. The counter data is exposed as a shared memory object that the Windows performance monitoring tools can query. It is also available as a Dynamic Management View (DMV) within SQL Server, namely sys.dm_os_performance_counters. The VIEW SERVER STATE permission is required to query this view.

The counter data exposed in the view is in raw form and needs to be interpreted appropriately before it can be used. The cntr_type column indicates how the values are to be interpreted. Some questions around the values reported by this column prompted this blog post. In this article, we will look at how to interpret the counter values.
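For example, a minimal query against the view (the Buffer Manager filter is just an illustration; any object_name works):

SELECT object_name, counter_name, instance_name, cntr_value, cntr_type
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%';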

The columns exposed by the view are described in the MSDN documentation but are reproduced here for reference.

Column name   | Data type  | Description
object_name   | nchar(128) | Category to which this counter belongs.
counter_name  | nchar(128) | Name of the counter.
instance_name | nchar(128) | Name of the specific instance of the counter. Often contains the database name.
cntr_value    | bigint     | Current value of the counter. Note: for per-second counters, this value is cumulative. The rate value must be calculated by sampling the value at discrete time intervals. The difference between any two successive sample values is equal to the rate for the time interval used.
cntr_type     | int        | Type of counter as defined by the Windows performance architecture. See WMI Performance Counter Types on MSDN or your Windows Server documentation for more information on performance counter types.

 

The type of each counter is indicated in the cntr_type column as a decimal value. The distinct values used by all versions between SQL Server 2005 and SQL Server 2012 are the following:

 

Decimal    | Hexadecimal | Counter type define
1073939712 | 0x40030500  | PERF_LARGE_RAW_BASE
537003264  | 0x20020500  | PERF_LARGE_RAW_FRACTION
1073874176 | 0x40020500  | PERF_AVERAGE_BULK
272696576  | 0x10410500  | PERF_COUNTER_BULK_COUNT
65792      | 0x00010100  | PERF_COUNTER_LARGE_RAWCOUNT

 

Let us look at them individually.

1)     PERF_LARGE_RAW_BASE

                Decimal Value                   : 1073939712
                Hexadecimal value            : 0x40030500

This counter value is raw data that is used as the denominator of a counter that presents an instantaneous arithmetic fraction. See PERF_LARGE_RAW_FRACTION for more information.

   Eg :

object_name                 | counter_name                | instance_name | cntr_value | cntr_type
MSSQL$SQLSVR:Buffer Manager | Buffer cache hit ratio base |               | 3170       | 1073939712

  This value is the base for the MSSQL$SQLSVR:Buffer Manager\Buffer cache hit ratio calculation.  

 

2)     PERF_LARGE_RAW_FRACTION

                Decimal Value                   : 537003264
                Hexadecimal value          : 0x20020500

   This counter value represents a fractional value as a ratio to its corresponding PERF_LARGE_RAW_BASE counter value.

   Eg :

object_name                 | counter_name           | instance_name | cntr_value | cntr_type
MSSQL$SQLSVR:Buffer Manager | Buffer cache hit ratio |               | 2911       | 537003264

                               

                Using the value here and the base value from the previous example, we can now calculate the MSSQL$SQLSVR:Buffer Manager\Buffer cache hit ratio as follows:

Hit ratio %  = 100 * MSSQL$SQLSVR:Buffer Manager\Buffer cache hit ratio / MSSQL$SQLSVR:Buffer Manager\Buffer cache hit ratio base
                   = 100 * 2911 / 3170
                   = 91.83%
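The same calculation can be scripted directly against the view; a minimal sketch, joining the fraction counter to its base (the LIKE filter keeps it working for both default and named instances):

SELECT 100.0 * f.cntr_value / NULLIF(b.cntr_value, 0) AS [Buffer cache hit ratio %]
FROM sys.dm_os_performance_counters AS f
JOIN sys.dm_os_performance_counters AS b
     ON b.object_name = f.object_name
WHERE f.object_name LIKE '%Buffer Manager%'
  AND f.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';   -- NULLIF guards against a zero base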

 

3)     PERF_AVERAGE_BULK

                Decimal Value                   : 1073874176
                Hexadecimal value            : 0x40020500

This counter value represents an average metric. The cntr_value is cumulative, and its base value, of type PERF_LARGE_RAW_BASE, is also cumulative. The value is obtained by first taking two samples of both the PERF_AVERAGE_BULK value (A1 and A2) and the PERF_LARGE_RAW_BASE value (B1 and B2). The differences A2 - A1 and B2 - B1 are calculated, and the final value is the ratio of the two differences. The example below will help make this clearer.

    Eg :

Sample 1

object_name          | counter_name                 | instance_name | cntr_value | cntr_type  |
MSSQL$SQLSVR:Latches | Average Latch Wait Time (ms) |               | 14257      | 1073874176 | <== A1
MSSQL$SQLSVR:Latches | Average Latch Wait Time Base |               | 359        | 1073939712 | <== B1

               

Sample 2

object_name          | counter_name                 | instance_name | cntr_value | cntr_type  |
MSSQL$SQLSVR:Latches | Average Latch Wait Time (ms) |               | 14272      | 1073874176 | <== A2
MSSQL$SQLSVR:Latches | Average Latch Wait Time Base |               | 360        | 1073939712 | <== B2

 

               

Average Latch Wait Time (ms) for the interval = (A2 - A1) / (B2 - B1)
                                                                          = (14272 - 14257) / (360 - 359)
                                                                          = 15.00 ms
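Here is a hedged T-SQL sketch of that two-sample technique (the 10 second delay is arbitrary; adjust the object_name filter for your instance name):

DECLARE @a1 bigint, @b1 bigint, @a2 bigint, @b2 bigint;

-- first sample: average (A1) and base (B1)
SELECT @a1 = cntr_value FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%' AND counter_name = 'Average Latch Wait Time (ms)';
SELECT @b1 = cntr_value FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%' AND counter_name = 'Average Latch Wait Time Base';

WAITFOR DELAY '00:00:10';   -- sampling interval

-- second sample: average (A2) and base (B2)
SELECT @a2 = cntr_value FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%' AND counter_name = 'Average Latch Wait Time (ms)';
SELECT @b2 = cntr_value FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%' AND counter_name = 'Average Latch Wait Time Base';

-- (A2 - A1) / (B2 - B1); NULL if no latch waits occurred in the interval
SELECT (@a2 - @a1) * 1.0 / NULLIF(@b2 - @b1, 0) AS [Average Latch Wait Time (ms) for interval];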

               

4)     PERF_COUNTER_BULK_COUNT

                Decimal Value                   : 272696576
                Hexadecimal value            : 0x10410500

This counter value represents a rate metric. The cntr_value is cumulative. The value is obtained by taking two samples of the PERF_COUNTER_BULK_COUNT value. The difference between the sample values is divided by the time gap between the samples, in seconds, which gives the per-second rate.

   Eg : For this example, I obtain the ms_ticks column from sys.dm_os_sys_info for the calculation. You may use any method of your choice to determine the difference in time between the counter value snapshots, including getdate().

Sample 1

ms_ticks  | object_name            | counter_name     | instance_name | cntr_value | cntr_type
488754390 | MSSQL$SQLSVR:Databases | Transactions/sec | AdvWrks       | 1566       | 272696576

Sample 2

ms_ticks  | object_name            | counter_name     | instance_name | cntr_value | cntr_type
488755468 | MSSQL$SQLSVR:Databases | Transactions/sec | AdvWrks       | 2055       | 272696576

                                               

The value for Transactions/sec for the interval = (Value2 - Value1) / (seconds between samples)
                                                = (Value2 - Value1) / ((ms_value2 - ms_value1) / 1000)
                                                = (2055 - 1566) / ((488755468 - 488754390) / 1000)
                                                = 489 / 1.078 ≈ 454 transactions/sec
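As a script, the same rate calculation might look like this sketch (AdvWrks is the sample database from the tables above; substitute your own):

DECLARE @t1 bigint, @t2 bigint, @v1 bigint, @v2 bigint;

-- first sample: time and cumulative counter
SELECT @t1 = ms_ticks FROM sys.dm_os_sys_info;
SELECT @v1 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Transactions/sec' AND instance_name = 'AdvWrks';

WAITFOR DELAY '00:00:05';   -- sampling interval

-- second sample
SELECT @t2 = ms_ticks FROM sys.dm_os_sys_info;
SELECT @v2 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Transactions/sec' AND instance_name = 'AdvWrks';

-- counter delta divided by elapsed seconds (ms_ticks is in milliseconds)
SELECT (@v2 - @v1) * 1000.0 / NULLIF(@t2 - @t1, 0) AS [Transactions/sec for interval];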

 

5)     PERF_COUNTER_LARGE_RAWCOUNT

                Decimal Value                   : 65792
                Hexadecimal value            : 0x00010100

   This counter value shows the last observed value directly. Primarily used to track counts of objects.

  Eg :

object_name                 | counter_name | instance_name | cntr_value | cntr_type
MSSQL$SQLSVR:Buffer Manager | Total pages  |               | 5504       | 65792

               The value of the counter MSSQL$SQLSVR:Buffer Manager\Total pages = 5504.

               

 

Related links :

The sys.dm_os_performance_counters DMV documentation

                sys.dm_os_performance_counters (Transact-SQL)         
                http://msdn.microsoft.com/en-us/library/ms187743%28v=sql.110%29.aspx

More information about the various SQL Server counters and what information they convey.

                Use SQL Server Objects
                http://technet.microsoft.com/en-us/library/ms190382.aspx

 

Information about the performance counter defined values from Microsoft Performance Counter Query Protocol documentation

                2.2.4.2  _PERF_COUNTER_REG_INFO
                http://msdn.microsoft.com/en-us/library/cc238313.aspx

 

Ajith Krishnan | Escalation Engineer | Microsoft SQL Server Support

 


How It Works: Maximizing Max Degree Of Parallelism (MAXDOP)


I was working on an index build issue for an 80 CPU system and kept seeing that only 64 CPUs were getting used. I had carefully studied sys.dm_os_spinlock_stats and sys.dm_os_wait_stats along with performance counters, memory usage patterns, and I/O activity. In fact, I had an 80 CPU, 2TB RAM, 4TB SSD system, so I was convinced SQL Server was CPU bound and that adding more CPUs for the index build could be a benefit.

Note: I must caution you that adding more CPUs can lead to reduced performance because a bottleneck such as memory or I/O can become a larger problem.  

I like to think of it like pumping gas at my favorite filling station.  The pump at the storage tank can only move so much liquid through a finite pipe size.   Adding more filling outlets for patrons to use does not mean the overall flow of gas (gallons/sec) increases.  In fact, I like filling when no one else is filling because I maximize my flow and reduce my overall time at the pump. 

TEST any MAXDOP setting well as you might find less goes faster.

After looking at various SQL Server Books Online references and then stepping through the code, I realized our documentation is not as accurate as it could be. I hope this post can reduce some of the confusion.

There are plenty of references for tuning MAXDOP to allow queries to run at their best while reducing the overhead of parallelism. You have all seen the references for capping MAXDOP at 8, or at the number of schedulers in the NUMA node if smaller, or ….   The fact is, these are all great and recommended best practices.

This post is not intended to contradict any of the current recommendations.  This blog is solely focused on a specific maintenance target.  The rest of my system has to be idle so I can safely consume the majority of schedulers.  It is probably the middle of the night, users are sleeping, and you want to schedule a job that can take full advantage of the overall system.  I have reviewed various performance points and I believe using a high level of parallelism could allow my index build to complete quickly.

Warning: Index fragmentation with increased levels of parallelism: http://blogs.msdn.com/b/psssql/archive/2012/09/05/how-it-works-online-index-rebuild-can-cause-increased-fragmentation.aspx

There are several stages to determining the degree of parallelism (MAXDOP) a query can utilize.

Stage 1 – Compile

During compilation SQL Server considers the hints, sp_configure, and resource workload group settings to see if a parallel plan should even be considered, and only if the query operations allow parallel execution:

If hint is present and > 1 then build a parallel plan

else if no hint or hint (MAXDOP = 0)

          if sp_configure setting is 1 but workload group > 1 then build a parallel plan

           else if sp_configure setting is 0 or > 1 then build parallel plan

Stage 2 – Query Execution

When the query begins execution, the runtime degree of parallelism is determined.  This involves many factors, already outlined in SQL Server Books Online: http://technet.microsoft.com/en-US/library/ms178065(v=SQL.105).aspx

Before SQL Server looks at the idle workers and other factors it determines the target for the degree of parallelism.

if sp_configure or query hint forcing serial plan use (1)

else if resource workgroup set

    if query hint present use min(hint, resource workgroup)

     else use resource workgroup

If the target is still 0 after the detailed calculations, it is set to 64 (the default max for SQL Server, as documented in Books Online).  This fooled me some, because the 80 CPU system has 2 Windows scheduler groups x 40 CPUs.  I might have expected a 40 CPU cap to avoid crossing over a Windows scheduler group.  This is not the case; SQL Server hard codes the 64 CPU target when the runtime target of MAXDOP is still 0 (the default).

The MAXDOP target is now adjusted for:

  • Actual CPU count (affinity settings from sp_configure and the resource pool). 

  • Certain query types (index build for example) look at the partitions

  • Other query type limitations that may exist

Now SQL Server takes a look at the available workers (free workers for query execution).  You can loosely calculate the free worker count on a scheduler as (free workers = current_workers_count - current_tasks_count) from sys.dm_os_schedulers, for example:
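(A point-in-time snapshot; hidden and offline schedulers are excluded.)

SELECT scheduler_id,
       current_workers_count,
       current_tasks_count,
       current_workers_count - current_tasks_count AS free_workers
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';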

Once the target is calculated, the actual degree of parallelism is determined by looking at the available resources to support a parallel execution.  This involves determining the node(s) and CPUs with available workers.

Older versions of SQL Server used a polling mechanism every ~1 second to determine the node with the most free workers to target.   This meant you could encounter race conditions where multiple queries all went parallel on the same node when going parallel on separate nodes would have resulted in better CPU usage.

 

Newer builds of SQL Server actively track the free workers.   This significantly reduces the possibility of assigning parallel queries to the same set of schedulers.

Trace flag 2466 - Force older version logic to determine number of available resources.

The worker location information is then used to target an appropriate set of CPUs to assign the parallel task to.

In general the placement decisions are:

  • SMP (FPlaceThreadsOneNodeSystem): On a single-node system, treat it as SMP and only enqueue to a single node. Soft-NUMA and affinity may cause SQL Server to treat the system as SMP.

  • CONNECTION (FPlaceThreadsOneNodeSystem): If trace flag 2479 is enabled, force all parallel decisions to be limited to the node the connection is associated with.  This may be helpful when using soft-NUMA or connection-to-node affinity.

  • FULL (FPlaceThreadsAllNodes): If the MAXDOP target is equal to all schedulers, enqueue work to all schedulers.
  • LEAST (FPlaceThreadsWithinLeastLoadedNode): If the MAXDOP target is less than what a single node can provide, and trace flag 2467 is enabled, attempt to locate the least loaded node.

  • SPREAD (FPlaceThreadsMultipleNodes): The load is spread across any available node.

 

Using XEvents you can monitor the MAXDOP decision logic.  For example:

  • XeSqlPkg::calculate_dop_begin
  • XeSqlPkg::calculate_dop
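A sketch of an event session for these, assuming the events surface in the sqlserver package as calculate_dop_begin and calculate_dop (they are debug-channel events, so verify the names against sys.dm_xe_objects on your build before relying on them):

-- confirm the event names exist on this build
SELECT name FROM sys.dm_xe_objects WHERE name LIKE 'calculate_dop%';

CREATE EVENT SESSION [TrackDop] ON SERVER
ADD EVENT sqlserver.calculate_dop_begin,
ADD EVENT sqlserver.calculate_dop
ADD TARGET package0.event_file (SET filename = N'TrackDop');
GO
ALTER EVENT SESSION [TrackDop] ON SERVER STATE = START;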

Back to trying to get my index build to use all 80 CPUs.  I can do several things (sketched in T-SQL after the list):

1. Use MAXDOP=80 query hint

2. Set sp_configure ‘max degree of parallelism’, 80    -- Warning this applies to any query

3. Create resource pool/workload group and set MAXDOP=80 and assign only the index build connection to it using a resource governor classifier.
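A hedged T-SQL sketch of the three options (dbo.MyBigTable and IndexBuildGroup are hypothetical names; the pool, group, and classifier for option 3 are assumed to exist already):

-- 1. Query hint on the index build itself
ALTER INDEX ALL ON dbo.MyBigTable REBUILD WITH (MAXDOP = 80);   -- hypothetical table

-- 2. Instance-wide setting (warning: applies to any query)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 80;
RECONFIGURE;

-- 3. Workload group setting, scoped by a resource governor classifier
ALTER WORKLOAD GROUP [IndexBuildGroup] WITH (MAX_DOP = 80);
ALTER RESOURCE GOVERNOR RECONFIGURE;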

Testing Results

Here are the MAXDOP results on my 80 CPU system at different setting levels.

Query Hint | sp_configure | Workgroup | RUNTIME
0          | 0            | 80        | 80
0          | 0            | 0         | 64
1          | 0            | 80        | 1
2          | 0            | 80        | 2
0          | 80           | 0         | 80
0          | 1            | 0         | 1
0          | 2            | 0         | 2
80         | 2            | 0         | 80
80         | 10           | 2         | 2

You can monitor the number of parallel workers by querying sys.dm_os_tasks, for example:
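(A sketch that counts the tasks attached to each request; more than one task for a request implies parallel workers.)

SELECT session_id, request_id, COUNT(*) AS task_count
FROM sys.dm_os_tasks
GROUP BY session_id, request_id
HAVING COUNT(*) > 1
ORDER BY task_count DESC;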

Note: Some configuration changes may require a flush of the procedure cache (DBCC FREEPROCCACHE) or a disconnect/connect pairing to take effect.

 

Bob Dorr - Principal SQL Server Escalation Engineer

Error during installation of an SQL server Failover Cluster Instance


A common issue I've run into while helping with SQL Server Failover Cluster Instance (FCI) installations is the failure of the Network Name resource. In the following post I'll discuss a bit of background, the common root cause, and how to resolve it.

Background

The SQL Server Database Engine service is dependent on the Network Name resource. A failure of the Network Name will result in the SQL Server Resource not coming online.

When the Windows Failover Cluster (WFC) is initially configured, a Cluster Name Object (CNO) is created. The CNO is visible as a computer object in your Active Directory Users and Computers snap-in (dsa.msc). By default the CNO will be created in the Computers container and granted specific permissions.

After a successful SQL Server FCI installation you will now see a Virtual Computer Object (VCO) for the SQL Server Network Name.

*Note: After the CNO is created, any additional Network Name resource in the cluster is considered a Virtual Computer Object. VCOs are simply computer objects for which the CNO has permission to change properties or reset the password.

Problem

But what if the CNO does not possess the required permissions to create computer objects in the “Computers” container?

It is in the above scenario where we commonly see the following errors during SQL Server FCI installation:


The following error has occurred:

The cluster resource 'SQL Server (SQL2012)' could not be brought online due to an error bringing the dependency resource 'SQL Network Name(VSQL2012)' online. Refer to the Cluster Events in the Failover Cluster Manager for more information.

A user encountering the same issue while installing a pre-SQL Server 2012 version may see:

The cluster resource 'SQL Server (MSSQLSERVER)' could not be brought online.  Error: The resource failed to come online due to the failure of one or more provider resources. (Exception from HRESULT: 0x80071736)

System log:

Cluster network name resource 'SQL Network Name (VSQL2012)' failed to create its associated computer object in domain 'motox.com' during: Resource online.

The text for the associated error code is: A constraint violation occurred.

Please work with your domain administrator to ensure that:

- The cluster identity 'CLUS2012$' has Create Computer Objects permissions. By default all computer objects are created in the same container as the cluster identity 'CLUS2012$'.

- The quota for computer objects has not been reached.

- If there is an existing computer object, verify the Cluster Identity 'CLUS2012$' has 'Full Control' permission to that computer object using the Active Directory Users and Computers tool.

Cluster log:

[RES] Network Name: [NNLIB] Creating object VSQL2012 using ADSI in OU OU=SQL,DC=motox,DC=com on DC: \\MOTOXDC.motox.com, result: 8239

[RES] Network Name: [NNLIB] Failed to create Computer Object VSQL2012 in the Active Directory, error 8239

Cause

The common cause of the Network Name resource failure is insufficient permissions. More specifically, the permission "Create Computer Objects" has not been granted to the Cluster Name Object (CNO).

http://technet.microsoft.com/en-us/library/cc731002(v=ws.10).aspx

“…when you create a failover cluster and configure clustered services or applications, the failover cluster wizards create the necessary Active Directory computer accounts (also called computer objects) and give them specific permissions. The wizards create a computer account for the cluster itself (this account is also called the cluster name object or CNO) and a computer account for most types of clustered services and applications”

When the SQL Server Network Name is first brought online during the FCI installation process, the CNO identity is used to create the VCO (as long as the VCO doesn't already exist). If the required permissions are not granted to the CNO, the creation of the VCO will fail, and so will your SQL Server FCI installation.

*Note: The Create Computer objects right only applies to Domain Functional Levels above Windows Server 2003. For Windows Server 2003 the required privilege is “Add Workstations to the Domain”.

Resolution(s)

Option #1

We must grant the permissions "Read all properties" and "Create Computer objects" to the CNO via the container. Here's an example of granting the required permissions for demonstration purposes:

1. Open the Active Directory Users and Computers Snap-in (dsa.msc).

2. Locate the “Computers” container.

3. Make sure "Advanced Features" is selected.

4. Open the properties of the container and click the "Security" tab. Click "Add" and add the CNO. Make sure to select the “Computers” option in the “Object Types” window.

5. Click "Advanced", highlight the CNO, and click "Edit".

6. Make sure "Read all properties" and "Create Computer objects" are checked. Click OK until you're back to the AD Users and Computers window.

7. Retry your previously failed installation. Note that with SQL Server 2012 there will be a “retry” button.

Option # 2

We can also “Pre-Stage” the VCO, which is useful in situations where the Domain Administrator does not allow the CNO “Read All Properties” and “Create computer Objects” permissions:

1. Ensure that you are logged in as a user that has permissions to create computer objects in the domain.

2. Open the Active Directory Users and Computers Snap-in (dsa.msc).

3. Select View -> Advanced Features.

4. Right click the OU/Container you want the VCO to reside in and click “New” -> “Computer”.

5. Provide a name for the object (this will be your SQL Server Network Name) and click “OK”.

6. Right click the VCO you just created and select “Properties”. Click the Security tab and then click “Add”.

7. Enter the CNO (make sure to select the “Computers” option in the “Object Types” window) and click “OK”.

8. Highlight the CNO, check the following permissions, and click “OK”.

Read
Allowed To Authenticate
Change Password
Receive As
Reset Password
Send As
Validated Write To DNS Host Name
Validated Write To Service Principal Name
Read Account Restrictions
Write Account Restrictions
Read DNS Host Name Attributes
Read MS-TS-GatewayAccess
Read Personal Information
Read Public Information

*Note: You can replace step #8 by giving the CNO “Full Control” over the VCO

9. Install SQL Server and the Network Name resource should start without issue.

References:

Failover Cluster Step-by-Step Guide: Configuring Accounts in Active Directory

http://technet.microsoft.com/en-us/library/cc731002(WS.10).aspx

Before Installing Failover Clustering

http://msdn.microsoft.com/en-us/library/ms189910.aspx/html

Add workstations to domain

http://technet.microsoft.com/en-us/library/cc780195(v=WS.10).aspx

Troy Moen – Support Escalation Engineer

Invalid or loopback address when configuring SharePoint against a SQL Server


I was presented with a connectivity issue when trying to configure SharePoint 2013 using a CTP build of SQL 2014.  The customer got the following error when SharePoint was trying to create the Configuration Database.

Exception: System.ArgumentException: myserver,50000 is an invalid or loopback address.  Specify a valid server address.
   at Microsoft.SharePoint.Administration.SPServer.ValidateAddress(String address)
   at Microsoft.SharePoint.Administration.SPServer..ctor(String address, SPFarm farm, Guid id)
   at Microsoft.SharePoint.Administration.SPConfigurationDatabase.RegisterDefaultDatabaseServices(SqlConnectionStringBuilder connectionString)
   at Microsoft.SharePoint.Administration.SPConfigurationDatabase.Provision(SqlConnectionStringBuilder connectionString)
   at Microsoft.SharePoint.Administration.SPFarm.Create(SqlConnectionStringBuilder configurationDatabase, SqlConnectionStringBuilder administrationContentDatabase, IdentityType identityType, String farmUser, SecureString farmPassword, SecureString masterPassphrase)
   at Microsoft.SharePoint.Administration.SPFarm.Create(SqlConnectionStringBuilder configurationDatabase, SqlConnectionStringBuilder administrationContentDatabase, String farmUser, SecureString farmPassword, SecureString masterPassphrase)
   at Microsoft.SharePoint.PostSetupConfiguration.ConfigurationDatabaseTask.CreateOrConnectConfigDb()
   at Microsoft.SharePoint.PostSetupConfiguration.ConfigurationDatabaseTask.Run()
   at Microsoft.SharePoint.PostSetupConfiguration.TaskThread.ExecuteTask()

They had indicated that they had hit this before and worked around it by creating a SQL Alias.  However, this time it was not working.  It was presented to me as a possible issue with using SQL 2014, and I was asked to have a look to see if this would affect other customers using SQL 2014.

I found some references regarding the error; the majority of comments said to have SQL Server use the default port of 1433, and some said to create an Alias.  Some of the SharePoint documentation even shows how to change the SQL port, and also how to create an Alias, but none really explained why this was necessary, or what SharePoint was actually looking for.

For this issue, it has nothing to do with SQL 2014 specifically; it could happen with any version of SQL Server.  The issue is what SharePoint is looking for: whatever you put in for the server name needs to be a valid DNS name.  For a port other than the default (1433), you would need to create a SQL Alias.  If you create a SQL Alias, the name should be resolvable, not a made-up name that doesn't exist in DNS.  Otherwise, you will get the same error.

 

Techie Details

I started by looking at the error first.  Of note, this is a SharePoint specific error and not a SQL error.

Exception: System.ArgumentException: myserver,50000 is an invalid or loopback address.  Specify a valid server address.
   at Microsoft.SharePoint.Administration.SPServer.ValidateAddress(String address)

This was an ArgumentException thrown when SPServer.ValidateAddress was called.  I'm going to assume that the string being passed in is whatever we entered for the database server; in my case it would be "myserver,50000".  I've seen this type of behavior before; here is one example.  My first question was: what is ValidateAddress actually doing?  I had an assumption, based on the behavior, that it was doing a name lookup on what was being passed in, but I don't like assumptions, so I wanted to verify.

Enter JustDecompile!  This is a great tool if you want to see what .NET assemblies are really doing.  The trick sometimes is to figure out what the actual assembly is.  I know SharePoint 2013 uses the .NET 4.0 Framework, so the assemblies that are GAC'd will be in C:\Windows\Microsoft.NET\assembly\GAC_MSIL.  After that, I go off of the namespace, as assemblies are typically aligned to the namespaces within them.  I didn't see an assembly for Microsoft.SharePoint.Administration, so I grabbed the Microsoft.SharePoint assembly within C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Microsoft.SharePoint\v4.0_15.0.0.0__71e9bce111e9429c.  This prompted me to load a few others, but it told me which ones to go get.

Within the Microsoft.SharePoint assembly, we can see that we have the Administration namespace.


So, now we want the SPServer object and the ValidateAddress method.


internal static void ValidateAddress(string address)
{
    Uri uri;
    if (address == null)
    {
        throw new ArgumentNullException("address");
    }

    UriHostNameType uriHostNameType = Uri.CheckHostName(address); // <-- This is what gets us into trouble
    if (uriHostNameType == UriHostNameType.Unknown)
    {
        object[] objArray = new object[] { address };
        throw new ArgumentException(SPResource.GetString("InvalidServerAddress", objArray)); // <-- The exception will be thrown here
    }

    uri = (uriHostNameType != UriHostNameType.IPv6 ||
        address.Length <= 0 ||
        address[0] == '[' ||
        address[address.Length - 1] == ']' ?
        new Uri(string.Concat("http://", address)) : new Uri(string.Concat("http://[", address, "]")));
    if (uri.IsLoopback)
    {
        object[] objArray1 = new object[] { address };
        throw new ArgumentException(SPResource.GetString("InvalidServerAddress", objArray1));
    }
}

Uri.CheckHostName Method
http://msdn.microsoft.com/en-us/library/system.uri.checkhostname.aspx

Determines whether the specified host name is a valid DNS name.

So, if the string we pass in is not a valid DNS name, the check fails; the comma in "myserver,50000" is exactly what makes it invalid.  We never get to the point where we actually hit SQL itself.

 

Adam W. Saxton | Microsoft Escalation Services
http://twitter.com/awsaxton

Microsoft CSS @ PASS Summit 2013


During October 15th-18th, the US PASS Summit 2013 will be held in Charlotte, NC at the Charlotte Convention Center.  The Microsoft CSS team has a long history with PASS.  We have been speaking and working at PASS since 2003.  This year we will have the added advantage of being on our home turf with one of our main CSS sites being in Charlotte.  This means you will see a lot more folks from our CSS group helping at the Clinic.  Here is a look at what we will be doing this year.

 

Pre-Conference Seminar – AlwaysOn

Curt Mathews and Shon Hauck will be leading a session about AlwaysOn.  These are our experts for AlwaysOn within the Support group.  No one knows this topic better!  With over 31 years of experience between them, they bring a wealth of knowledge to this session.

They will walk you through a lot of aspects of AlwaysOn, but more importantly how you can monitor and gain insights to resolve issues you may encounter.  From Clustering to SQL Engine, these two will be able to answer your questions.  You will also get to see some of the common issues we are seeing on the Support side and what techniques we used to approach & resolve them.

 

Main Conference Talks

(DBA-500-HD) Inside SQL Server 2012 Memory: The Sequel – Bob Ward

Wednesday, October 16th – 1:30pm-4:15pm – Ballroom B

This is a half-day session in which Bob will revisit the SQL memory topic he presented in 2008 and go deep into the architecture, implementation, and "how it works" of SQL Server 2012 database engine memory management.  There will be plenty of demos and a lot of insights into SQL Server memory.  This is a must-see for the bitheads out there, or those of you that like your brain to hurt!

(Chalk Talk) SharePoint and Power Pivot and Power View, Oh My – Chuck Heinzelman, Kay Unkroth, Riccardo Muti and Adam Saxton

Wednesday, October 16th – 1:00pm-1:30pm – Microsoft Expo Booth

This is an informal setting.  Don’t expect a full presentation.  But, this is your chance to ask extra questions on a topic with Microsoft’s Engineers and Program Managers.  This will be more of a white board-style discussion with people who are building and supporting the features that you are using!

(BIA-304-M) Death by a Thousand Cuts: A Look at Power View Performance – Adam Saxton

Friday, October 18th – 10:15am-11:30am – Ballroom B

This will recap lessons I learned from a customer case I worked earlier this year.  When we did supportability reviews for Power View before it was initially released, I dubbed Power View performance the nightmare scenario, because of all the different technologies involved: from SharePoint to Reporting Services, from Excel Services to Analysis Services, from the browser to Silverlight, with SQL Server and the operating system thrown in for good measure.

(DBA-406-M) SQL Server Transaction Log Internals – Tim Chapman & Denzil Ribeiro

Friday, October 18th – 1:00pm-2:15pm – 203 A

Tim and Denzil will take a hard look at the SQL transaction log and the roles it plays.  This will span logging and recovery, and look at the checkpoint process and write-ahead logging.  If you are interested in learning more about the SQL transaction log, this is going to be a session for you!

(CLD-307-M) SQL Server Performance and Monitoring in Windows Azure at Scale – Daniel Sol

Friday, October 18th – 1:00pm-2:15pm – 217 A

Daniel is coming from across the pond to talk about lessons learned from one of our largest Windows Azure SQL Database deployments.  This will look at performance tuning and what you can do in Windows Azure SQL Database to get the level of performance you are looking for, along with what to monitor to stay on top of your deployment.

 

SQL Server Clinic

This is one of the highlights of the SQL PASS conference.  Last year, we had the heaviest use of the SQL Clinic that we have ever seen!  Imagine a room where, on one side, you have the Azure CAT team to ask "design" or "advisory" type questions, and on the other side of the room, you have the SQL Support team that you can ask about an error you are getting, or how to fix something.  That is the SQL Server Clinic!  As mentioned above, with Charlotte having one of our support sites, we will have a large contingent of our support personnel on hand to help you out.

This year we will be in room 219AB! We will also be using a new tool to try to accommodate everyone who needs assistance.  It is a queue system similar to what I've seen in AT&T stores: you will be able to see where you are in the queue on the TVs.  Hopefully this will help us better assist you, as well as track what types of things we are seeing in the Clinic.


We will have a full contingent of members of the Azure CAT team.  This will be matched by the CSS Conference Speakers as well as Support Engineers and Premier Field Engineers from across the globe!  This is the ultimate unique opportunity to interact with the CSS and Azure CAT teams like no other.  We cannot guarantee this is like “getting a free case” from CSS, but we can help point you in the right direction.  The questions we get in the Clinic range from “how does this work” to “I have a crash, can you look at it?”.  In some situations in the past, we have been able to use our laptops, or the customer’s, and either demonstrate how to solve the problem, or actually fix it on the spot.  They don’t all work out that way, but it is probably fair to say that when you walk out of the room, you will have left with more than when you walked in the door.  Even if that just means you wanted to come in to meet a new face or network with some of the people from Microsoft that put out customer fires or test the limits of SQL Server.

The clinic hours this year are:

  • Wed 10/16, 9:45AM-6PM (Clinic Happy Hour from 4-6pm)
  • Thur 10/17, 9:45AM-6PM
  • Fri 10/18, 8AM-2PM

Be sure to stop by the Clinic at least once during the conference!  We want to hear from you and your experience with SQL Server, even if you don’t have a question.  We will be available to talk about any topic related to SQL Server, or to strike up a conversation about your favorite sports teams (although the Texas Rangers and Cowboys may be sore topics this year).

We hope to see you there!

Every time I ‘ATTACH DATABASE’ SQL logs error 1314 for SetFileIoOverlappedRange


It turns out this is an issue in the SQL Server code, and the error is a bit noisy during attach database.

When opening the database files, SQL Server calls SetFileIoOverlappedRange (when enabled properly) in order to help improve I/O performance.  This is commonly done under the SQL Server service account, which requires the locked pages privilege.  When the privilege is not held, the Windows error (1314 - A required privilege is not held by the client) is logged in the SQL Server error log, shown below.

Starting with SQL Server 2005, when attaching a database, SQL Server impersonates the client connection when opening the files in order to validate proper security (ACLs).   In doing so, SQL Server invokes SetFileIoOverlappedRange under the impersonated account and not the SQL Server service account.   This can lead to the 1314 error condition.

The error is more noise than a problematic issue.   Using ALTER DATABASE OFFLINE and ONLINE will re-open the database files under the SQL Server service account and allow SetFileIoOverlappedRange to complete successfully for the database (a T-SQL sketch follows the error log excerpt below).

Microsoft SQL Server 2012 (SP1) - 11.0.3000.0 (X64)

                Oct 19 2012 13:38:57

                Copyright (c) Microsoft Corporation

                Enterprise Edition (64-bit) on Windows NT 6.2 <X64> (Build 9200: )

2013-10-16 03:23:20.010 Server     Using locked pages in the memory manager.

                .

                .

                .

2013-10-16 09:21:35.970 spid52    Starting up database 'dbAttachTest'.

2013-10-16 09:25:10.300 spid52    SetFileIoOverlappedRange failed, GetLastError is 1314
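A minimal sketch of that offline/online cycle (dbAttachTest is the database from the log excerpt above; WITH ROLLBACK IMMEDIATE disconnects active sessions, so schedule it accordingly):

ALTER DATABASE [dbAttachTest] SET OFFLINE WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [dbAttachTest] SET ONLINE;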

 

WARNING – Make sure you have the following applied to avoid unexpected issues as well.

http://support.microsoft.com/kb/2679255

http://blogs.msdn.com/b/psssql/archive/2012/03/20/setfileiooverlappedrange-can-lead-to-unexpected-behavior-for-sql-server-2008-r2-or-sql-server-2012-denali.aspx

Bob Dorr - Principal SQL Server Escalation Engineer

SQL Connection Pool Timeout Debugging


This is a follow-up to two blog posts from back in 2009 that talked about leaked connections.  Part 1 and Part 2 of that series showed how to determine that you have actually filled your pool.  It was centered around the following error:

Exception type: System.InvalidOperationException
Message: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
InnerException: <none>
StackTrace (generated):
    SP               IP               Function
    000000001454DDC0 00000642828425A8 System.Data.ProviderBase.DbConnectionFactory.GetConnection(System.Data.Common.DbConnection)
    000000001454DE10 0000064282841BA2 System.Data.ProviderBase.DbConnectionClosed.OpenConnection(System.Data.Common.DbConnection, System.Data.ProviderBase.DbConnectionFactory)
    000000001454DE60 000006428284166C System.Data.SqlClient.SqlConnection.Open()

The issue I just worked on involved the same exception, but in this case the pools were not exhausted.  The issue was occurring within BizTalk 2006 R2.  We narrowed it down to the following exception:

0:138> !pe e09e13f0
Exception object: 00000000e09e13f0
Exception type: System.Data.SqlClient.SqlException
Message: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
InnerException: <none>
StackTrace (generated):
    SP               IP               Function
    0000000015CBDF10 00000642828554A3 System_Data!System.Data.SqlClient.SqlInternalConnection.OnError(System.Data.SqlClient.SqlException, Boolean)+0x103
    0000000015CBDF60 0000064282854DA6 System_Data!System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(System.Data.SqlClient.TdsParserStateObject)+0xf6
    0000000015CBDFC0 0000064282CDCCF1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadSniError(System.Data.SqlClient.TdsParserStateObject, UInt32)+0x291
    0000000015CBE0A0 000006428284ECCA System_Data!System.Data.SqlClient.TdsParserStateObject.ReadSni(System.Data.Common.DbAsyncResult, System.Data.SqlClient.TdsParserStateObject)+0x13a
    0000000015CBE140 000006428284E9E1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()+0x91
    0000000015CBE1A0 0000064282852763 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadBuffer()+0x33
    0000000015CBE1D0 00000642828526A1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadByte()+0x21
    0000000015CBE200 0000064282851B5C System_Data!System.Data.SqlClient.TdsParser.Run(System.Data.SqlClient.RunBehavior, System.Data.SqlClient.SqlCommand, System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.BulkCopySimpleResultSet, System.Data.SqlClient.TdsParserStateObject)+0xbc
    0000000015CBE2D0 00000642828519E6 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.CompleteLogin(Boolean)+0x36
    0000000015CBE320 000006428284A997 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(System.Data.SqlClient.ServerInfo, System.String, Boolean, Int64, System.Data.SqlClient.SqlConnection)+0x147
    0000000015CBE3C0 000006428284859F System_Data!System.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(System.String, System.String, Boolean, System.Data.SqlClient.SqlConnection, System.Data.SqlClient.SqlConnectionString, Int64)+0x52f
    0000000015CBE530 0000064282847505 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(System.Data.SqlClient.SqlConnection, System.Data.SqlClient.SqlConnectionString, System.String, Boolean)+0x135
    0000000015CBE5D0 00000642828471E3 System_Data!System.Data.SqlClient.SqlInternalConnectionTds..ctor(System.Data.ProviderBase.DbConnectionPoolIdentity, System.Data.SqlClient.SqlConnectionString, System.Object, System.String, System.Data.SqlClient.SqlConnection, Boolean)+0x153
    0000000015CBE670 0000064282846E36 System_Data!System.Data.SqlClient.SqlConnectionFactory.CreateConnection(System.Data.Common.DbConnectionOptions, System.Object, System.Data.ProviderBase.DbConnectionPool, System.Data.Common.DbConnection)+0x296
    0000000015CBE730 0000064282846947 System_Data!System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(System.Data.Common.DbConnection, System.Data.ProviderBase.DbConnectionPool, System.Data.Common.DbConnectionOptions)+0x37
    0000000015CBE790 000006428284689D System_Data!System.Data.ProviderBase.DbConnectionPool.CreateObject(System.Data.Common.DbConnection)+0x29d
    0000000015CBE830 000006428292905D System_Data!System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(System.Data.Common.DbConnection)+0x5d
    0000000015CBE870 0000064282846412 System_Data!System.Data.ProviderBase.DbConnectionPool.GetConnection(System.Data.Common.DbConnection)+0x6b2
    0000000015CBE930 00000642828424B4 System_Data!System.Data.ProviderBase.DbConnectionFactory.GetConnection(System.Data.Common.DbConnection)+0x54
    0000000015CBE980 0000064282841BA2 System_Data!System.Data.ProviderBase.DbConnectionClosed.OpenConnection(System.Data.Common.DbConnection, System.Data.ProviderBase.DbConnectionFactory)+0xf2
    0000000015CBE9D0 000006428284166C System_Data!System.Data.SqlClient.SqlConnection.Open()+0x10c
    0000000015CBEA60 0000064282928C2D Microsoft_BizTalk_Bam_EventObservation!Microsoft.BizTalk.Bam.EventObservation.DirectEventStream.StoreSingleEvent(Microsoft.BizTalk.Bam.EventObservation.IPersistQueryable)+0x8d
    0000000015CBEAE0 0000064282928947 Microsoft_BizTalk_Bam_EventObservation!Microsoft.BizTalk.Bam.EventObservation.DirectEventStream.StoreCustomEvent(Microsoft.BizTalk.Bam.EventObservation.IPersistQueryable)+0x47

The end result was to either increase the connection timeout for that connection string, or to look at the performance of the SQL Server and determine why SQL wasn't able to satisfy the connection.  The customer had indicated that this occurred during month-end operations, which probably means pressure on SQL Server ramped up.  It may have come down to not having enough workers within SQL to handle the connection request, which resulted in a timeout after the default of 15 seconds.
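On the SQL Server side, a rough way to check whether worker starvation played a part is to look for queued tasks and THREADPOOL waits; a sketch (sample these during the month-end window, since both views reflect current/cumulative state):

-- tasks waiting for a worker on each scheduler
SELECT scheduler_id, current_tasks_count, current_workers_count, work_queue_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- cumulative waits for a worker thread since startup
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';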

Techie details:

This will look at how we determined what the problem was once we had a memory dump of the process. These debugging instructions are based on a 64-bit dump.  The steps should be similar for a 32-bit dump as well.  For the dumps, we used the SOS debugging extension which ships with the .NET Framework.  You can load the extension in the debugger by using the following command:

0:000> .loadby sos mscorwks

Let’s first find the Connection Pools that are in the dump:

0:138> !dumpheap -stat -type DbConnectionPool

000006428281fce8        4          416 System.Data.ProviderBase.DbConnectionPool+TransactedConnectionPool
000006428085dbc8       28          672 System.Data.ProviderBase.DbConnectionPoolCounters+Counter
000006428281f6d8        8          704 System.Data.ProviderBase.DbConnectionPool+PoolWaitHandles
0000064282810450        4          704 System.Data.ProviderBase.DbConnectionPool
000006428281d320      165         5280 System.Data.ProviderBase.DbConnectionPoolIdentity

This shows the MethodTable that we can use to go get the different items.  Of note, you may see multiple items, and may have to go through each one.

0:138> !dumpheap -mt 0x0000064282810450
------------------------------
Heap 4
         Address               MT     Size
00000000c021b348 0000064282810450      176    
total 1 objects
------------------------------
Heap 6
         Address               MT     Size
00000000e05add10 0000064282810450      176    
total 1 objects
------------------------------
Heap 12
         Address               MT     Size
000000014004b1d8 0000064282810450      176    
total 1 objects
------------------------------
Heap 13
         Address               MT     Size
00000001502e6af0 0000064282810450      176
 

We have 4 pools.  Let’s have a look at each pool and see how many connections we have for each.

Pool 1:

0:138> !do 0x00000000c021b348
Name: System.Data.ProviderBase.DbConnectionPool
MethodTable: 0000064282810450
EEClass: 00000642827da538
Size: 176(0xb0) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name

00000642827ef760  400153f       18 ...nnectionPoolGroup  0 instance 0000000160036630 _connectionPoolGroup
0000064282818d18  4001540       20 ...nPoolGroupOptions  0 instance 0000000160036608 _connectionPoolGroupOptions

000006427843d998  4001551       98         System.Int32  1 instance                7 _totalObjects <-- Only 7 Objects out of a total pool size of 500

0:138> !do 0000000160036608
Name: System.Data.ProviderBase.DbConnectionPoolGroupOptions
MethodTable: 0000064282818d18
EEClass: 000006428282ce58
Size: 40(0x28) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00000642784358f8  4001598       14       System.Boolean  1 instance                1 _poolByIdentity
000006427843d998  4001599        8         System.Int32  1 instance                1 _minPoolSize
000006427843d998  400159a        c         System.Int32  1 instance              500 _maxPoolSize <-- Total pool size

Pool 2:

0:138> !do 0x00000000e05add10
Name: System.Data.ProviderBase.DbConnectionPool
0000064282818d18  4001540       20 ...nPoolGroupOptions  0 instance         e05ad798 _connectionPoolGroupOptions
000006427843d998  4001551       98         System.Int32  1 instance                6 _totalObjects <-- Only 6 Objects out of a total pool size of 100

0:138> !do e05ad798
Name: System.Data.ProviderBase.DbConnectionPoolGroupOptions
              MT            Field           Offset                 Type VT             Attr            Value Name
00000642784358f8  4001598       14       System.Boolean  1 instance                1 _poolByIdentity
000006427843d998  4001599        8         System.Int32  1 instance                0 _minPoolSize
000006427843d998  400159a        c         System.Int32  1 instance              100 _maxPoolSize <-- Total pool size

Pool 3:

0:138> !do 0x000000014004b1d8
Name: System.Data.ProviderBase.DbConnectionPool
0000064282818d18  4001540       20 ...nPoolGroupOptions  0 instance         d01e8288 _connectionPoolGroupOptions
000006427843d998  4001551       98         System.Int32  1 instance                7 _totalObjects <-- Only 7 Objects out of a total pool size of 500

0:138> !do d01e8288
Name: System.Data.ProviderBase.DbConnectionPoolGroupOptions
              MT            Field           Offset                 Type VT             Attr            Value Name
00000642784358f8  4001598       14       System.Boolean  1 instance                1 _poolByIdentity
000006427843d998  4001599        8         System.Int32  1 instance                1 _minPoolSize
000006427843d998  400159a        c         System.Int32  1 instance              500 _maxPoolSize <-- Total pool size

Pool 4:

0:138> !do 0x00000001502e6af0
Name: System.Data.ProviderBase.DbConnectionPool
0000064282818d18  4001540       20 ...nPoolGroupOptions  0 instance        1600f1940 _connectionPoolGroupOptions
000006427843d998  4001551       98         System.Int32  1 instance                4 _totalObjects <-- Only 4 Objects out of a total pool size of 100

0:138> !do 1600f1940
Name: System.Data.ProviderBase.DbConnectionPoolGroupOptions
              MT            Field           Offset                 Type VT             Attr            Value Name
00000642784358f8  4001598       14       System.Boolean  1 instance                1 _poolByIdentity
000006427843d998  4001599        8         System.Int32  1 instance                0 _minPoolSize
000006427843d998  400159a        c         System.Int32  1 instance              100 _maxPoolSize <-- Total pool size

The connection pools are dictated by the Connection String used.  So, this means 4 different connection strings were used.  We can look at the stack objects to see if we can pick apart some more information.

0:138> !dso
OS Thread Id: 0x70b0 (138)
RSP/REG          Object           Name
...
000000001454df30 00000001602a0f00 System.Data.SqlClient.SqlConnection
000000001454df40 00000000c0ace890 System.String
000000001454df48 00000001602a0cf0 Microsoft.BizTalk.Bam.EventObservation.BAMTraceFragment
000000001454df50 0000000150511568 System.String
000000001454df60 00000001602a0b00 Microsoft.BizTalk.Bam.EventObservation.DirectEventStream
000000001454df70 00000001602a0b00 Microsoft.BizTalk.Bam.EventObservation.DirectEventStream
000000001454df78 00000001602a0cf0 Microsoft.BizTalk.Bam.EventObservation.BAMTraceFragment
000000001454df80 00000001505112d0 System.String
000000001454df88 0000000150511568 System.String
000000001454df90 00000001602a0cf0 Microsoft.BizTalk.Bam.EventObservation.BAMTraceFragment
000000001454dfa8 00000001602a13d0 System.InvalidOperationException
000000001454dfb0 00000001602a0b38 System.Object
000000001454dfb8 000000015050d780 System.Data.SqlClient.SqlCommand
...

Here is the SQL Command Object that was issuing the command when we had the exception.

0:138> !do 000000015050d780
Name: System.Data.SqlClient.SqlCommand
MethodTable: 000006428279dbd0
EEClass: 00000642827d1dc0
Size: 224(0xe0) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000064278436018  400018a        8        System.Object  0 instance 0000000000000000 __identity
00000642828144d8  40008de       10 ...ponentModel.ISite  0 instance 0000000000000000 site
00000642826664d8  40008df       18 ....EventHandlerList  0 instance 0000000000000000 events
0000064278436018  40008dd      210        System.Object  0   static 00000000f0269548 EventDisposed
000006427843d998  40016f2       b0         System.Int32  1 instance              672 ObjectID
0000064278436728  40016f3       20        System.String  0 instance 00000000f0020178 _commandText <-- The query/command issued
000006428279c370  40016f4       b4         System.Int32  1 instance                4 _commandType
000006427843d998  40016f5       b8         System.Int32  1 instance               30 _commandTimeout
000006428279d908  40016f6       bc         System.Int32  1 instance                3 _updatedRowSource
00000642784358f8  40016f7       d0       System.Boolean  1 instance                0 _designTimeInvisible
000006428288d490  40016f8       28 ...ent.SqlDependency  0 instance 0000000000000000 _sqlDep
00000642784358f8  40016f9       d1       System.Boolean  1 instance                0 _inPrepare
000006427843d998  40016fa       c0         System.Int32  1 instance               -1 _prepareHandle
00000642784358f8  40016fb       d2       System.Boolean  1 instance                0 _hiddenPrepare
00000642827e3128  40016fc       30 ...rameterCollection  0 instance 000000015050d940 _parameters
00000642827eea48  40016fd       38 ...ent.SqlConnection  0 instance 000000015050f308 _activeConnection <-- The SqlConnection that we used for this command
00000642784358f8  40016fe       d3       System.Boolean  1 instance                0 _dirty

In this case, we know the SqlConnection isn't valid because we erred trying to get it from the pool.  The command text would be interesting had this been a query timeout, but for a connection timeout it is irrelevant.  We can poke at the strings on the stack and we will find the connection string used for this operation.

0:138> !do 00000001505112d0
Name: System.String
MethodTable: 0000064278436728
EEClass: 000006427803e520
Size: 330(0x14a) bytes
(C:\WINDOWS\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: server=MyServer; database= MyDatabase;Integrated Security=SSPI;Connect Timeout=25; pooling=true; Max Pool Size=500; Min Pool Size=1

From this, we can see Max Pool Size is 500, so that narrows it down to two of the four pools listed above. When we went through the pools previously, I noticed that one of the pools had something the others didn't, and it happened to be one of the pools with a pool size of 500.  Let's look at the full dump of the pool in question.

0:138> !do 0x000000014004b1d8
Name: System.Data.ProviderBase.DbConnectionPool
MethodTable: 0000064282810450
EEClass: 00000642827da538
Size: 176(0xb0) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000006427843d998  400153c       88         System.Int32  1 instance           200000 _cleanupWait
000006428281d320  400153d        8 ...ctionPoolIdentity  0 instance 000000014004b1b8 _identity
00000642827ef2d0  400153e       10 ...ConnectionFactory  0 instance 0000000140022860 _connectionFactory
00000642827ef760  400153f       18 ...nnectionPoolGroup  0 instance 00000000d01e82b0 _connectionPoolGroup <-- We can get the connection string from this object
0000064282818d18  4001540       20 ...nPoolGroupOptions  0 instance 00000000d01e8288 _connectionPoolGroupOptions
000006428281d3c0  4001541       28 ...nPoolProviderInfo  0 instance 0000000000000000 _connectionPoolProviderInfo
00000642828102f8  4001542       8c         System.Int32  1 instance                1 _state
000006428281d4b8  4001543       30 ...InternalListStack  0 instance 000000014004b288 _stackOld
000006428281d4b8  4001544       38 ...InternalListStack  0 instance 000000014004b2a0 _stackNew
0000064278424d50  4001545       40 ...ding.WaitCallback  0 instance 000000014004c570 _poolCreateRequest
0000064278425c90  4001546       48 ...Collections.Queue  0 instance 0000000000000000 _deactivateQueue
0000064278424d50  4001547       50 ...ding.WaitCallback  0 instance 0000000000000000 _deactivateCallback
000006427843d998  4001548       90         System.Int32  1 instance                0 _waitCount
000006428281f6d8  4001549       58 ...l+PoolWaitHandles  0 instance 000000014004b3a8 _waitHandles
00000642784369f0  400154a       60     System.Exception  0 instance 00000000e09e13f0 _resError <-- We had an error on this pool
00000642784358f8  400154b       a0       System.Boolean  1 instance                1 _errorOccurred
000006427843d998  400154c       94         System.Int32  1 instance            10000 _errorWait
0000064278468a80  400154d       68 ...m.Threading.Timer  0 instance 00000001505bc420 _errorTimer
0000064278468a80  400154e       70 ...m.Threading.Timer  0 instance 000000014004c5f0 _cleanupTimer
000006428281fce8  400154f       78 ...tedConnectionPool  0 instance 000000014004c3e8 _transactedConnectionPool
0000000000000000  4001550       80                       0 instance 000000014004b400 _objectList
000006427843d998  4001551       98         System.Int32  1 instance                7 _totalObjects
000006427843d998  4001553       9c         System.Int32  1 instance                8 _objectID
0000064278425e20  400153b      c00        System.Random  0   static 00000000e0188968 _random
000006427843d998  4001552      968         System.Int32  1   static               18 _objectTypeCount

First, let's see if we can line up the connection string for this pool with what was on the stack, to make sure we are looking at the right pool.

0:138> !do 00000000d01e82b0
Name: System.Data.ProviderBase.DbConnectionPoolGroup
MethodTable: 00000642827ef760
EEClass: 00000642827da418
Size: 72(0x48) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000064282816978  4001584        8 ...ConnectionOptions  0 instance 0000000170021600 _connectionOptions
0000064282818d18  4001585       10 ...nPoolGroupOptions  0 instance 00000000d01e8288 _poolGroupOptions
00000642823f2650  4001586       18 ....HybridDictionary  0 instance 00000000b00fb528 _poolCollection
000006427843d998  4001587       30         System.Int32  1 instance                1 _poolCount
000006427843d998  4001588       34         System.Int32  1 instance                1 _state
00000642828193b0  4001589       20 ...GroupProviderInfo  0 instance 00000000d01e82f8 _providerInfo
0000000000000000  400158a       28 ...DbMetaDataFactory  0 instance 0000000000000000 _metaDataFactory
000006427843d998  400158c       38         System.Int32  1 instance                7 _objectID
000006427843d998  400158b      978         System.Int32  1   static               20 _objectTypeCount

0:138> !do 0000000170021600
Name: System.Data.SqlClient.SqlConnectionString
MethodTable: 0000064282817158
EEClass: 00000642828234e0
Size: 184(0xb8) bytes
(C:\WINDOWS\assembly\GAC_64\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000064278436728  4000bef        8        System.String  0 instance 0000000150020230 _usersConnectionString
000006427843e080  4000bf0       10 ...ections.Hashtable  0 instance 00000001700216b8 _parsetable
00000642828180a0  4000bf1       18 ...mon.NameValuePair  0 instance 0000000170021878 KeyChain
00000642784358f8  4000bf2       28       System.Boolean  1 instance                0 HasPasswordKeyword
00000642784358f8  4000bf3       29       System.Boolean  1 instance                0 UseOdbcRules
000006427843cf18  4000bf4       20 ...ity.PermissionSet  0 instance 00000000d01e8330 _permissionset
00000642825a4958  4000beb      3e0 ...Expressions.Regex  0   static 00000000f026d658 ConnectionStringValidKeyRegex
00000642825a4958  4000bec      3e8 ...Expressions.Regex  0   static 00000000d01e7798 ConnectionStringValidValueRegex
00000642825a4958  4000bed      3f0 ...Expressions.Regex  0   static 0000000080032770 ConnectionStringQuoteValueRegex
00000642825a4958  4000bee      3f8 ...Expressions.Regex  0   static 0000000080034800 ConnectionStringQuoteOdbcValueRegex

0:138> !do 0000000150020230
Name: System.String
MethodTable: 0000064278436728
EEClass: 000006427803e520
Size: 330(0x14a) bytes
(C:\WINDOWS\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: server=MyServer; database= MyDatabase;Integrated Security=SSPI;Connect Timeout=25; pooling=true; Max Pool Size=500; Min Pool Size=1

We have a match!  So, now let's look at the error that was on the pool.

0:138> !pe 00000000e09e13f0
Exception object: 00000000e09e13f0
Exception type: System.Data.SqlClient.SqlException
Message: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
InnerException: <none>
StackTrace (generated):
    SP               IP               Function
    0000000015CBDF10 00000642828554A3 System_Data!System.Data.SqlClient.SqlInternalConnection.OnError(System.Data.SqlClient.SqlException, Boolean)+0x103
    0000000015CBDF60 0000064282854DA6 System_Data!System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(System.Data.SqlClient.TdsParserStateObject)+0xf6
    0000000015CBDFC0 0000064282CDCCF1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadSniError(System.Data.SqlClient.TdsParserStateObject, UInt32)+0x291
    0000000015CBE0A0 000006428284ECCA System_Data!System.Data.SqlClient.TdsParserStateObject.ReadSni(System.Data.Common.DbAsyncResult, System.Data.SqlClient.TdsParserStateObject)+0x13a
    0000000015CBE140 000006428284E9E1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()+0x91
    0000000015CBE1A0 0000064282852763 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadBuffer()+0x33
    0000000015CBE1D0 00000642828526A1 System_Data!System.Data.SqlClient.TdsParserStateObject.ReadByte()+0x21
    0000000015CBE200 0000064282851B5C System_Data!System.Data.SqlClient.TdsParser.Run(System.Data.SqlClient.RunBehavior, System.Data.SqlClient.SqlCommand, System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.BulkCopySimpleResultSet, System.Data.SqlClient.TdsParserStateObject)+0xbc
    0000000015CBE2D0 00000642828519E6 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.CompleteLogin(Boolean)+0x36
    0000000015CBE320 000006428284A997 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(System.Data.SqlClient.ServerInfo, System.String, Boolean, Int64, System.Data.SqlClient.SqlConnection)+0x147
    0000000015CBE3C0 000006428284859F System_Data!System.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(System.String, System.String, Boolean, System.Data.SqlClient.SqlConnection, System.Data.SqlClient.SqlConnectionString, Int64)+0x52f
    0000000015CBE530 0000064282847505 System_Data!System.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(System.Data.SqlClient.SqlConnection, System.Data.SqlClient.SqlConnectionString, System.String, Boolean)+0x135
    0000000015CBE5D0 00000642828471E3 System_Data!System.Data.SqlClient.SqlInternalConnectionTds..ctor(System.Data.ProviderBase.DbConnectionPoolIdentity, System.Data.SqlClient.SqlConnectionString, System.Object, System.String, System.Data.SqlClient.SqlConnection, Boolean)+0x153
    0000000015CBE670 0000064282846E36 System_Data!System.Data.SqlClient.SqlConnectionFactory.CreateConnection(System.Data.Common.DbConnectionOptions, System.Object, System.Data.ProviderBase.DbConnectionPool, System.Data.Common.DbConnection)+0x296
    0000000015CBE730 0000064282846947 System_Data!System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(System.Data.Common.DbConnection, System.Data.ProviderBase.DbConnectionPool, System.Data.Common.DbConnectionOptions)+0x37
    0000000015CBE790 000006428284689D System_Data!System.Data.ProviderBase.DbConnectionPool.CreateObject(System.Data.Common.DbConnection)+0x29d
    0000000015CBE830 000006428292905D System_Data!System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(System.Data.Common.DbConnection)+0x5d
    0000000015CBE870 0000064282846412 System_Data!System.Data.ProviderBase.DbConnectionPool.GetConnection(System.Data.Common.DbConnection)+0x6b2
    0000000015CBE930 00000642828424B4 System_Data!System.Data.ProviderBase.DbConnectionFactory.GetConnection(System.Data.Common.DbConnection)+0x54
    0000000015CBE980 0000064282841BA2 System_Data!System.Data.ProviderBase.DbConnectionClosed.OpenConnection(System.Data.Common.DbConnection, System.Data.ProviderBase.DbConnectionFactory)+0xf2
    0000000015CBE9D0 000006428284166C System_Data!System.Data.SqlClient.SqlConnection.Open()+0x10c
    0000000015CBEA60 0000064282928C2D Microsoft_BizTalk_Bam_EventObservation!Microsoft.BizTalk.Bam.EventObservation.DirectEventStream.StoreSingleEvent(Microsoft.BizTalk.Bam.EventObservation.IPersistQueryable)+0x8d
    0000000015CBEAE0 0000064282928947 Microsoft_BizTalk_Bam_EventObservation!Microsoft.BizTalk.Bam.EventObservation.DirectEventStream.StoreCustomEvent(Microsoft.BizTalk.Bam.EventObservation.IPersistQueryable)+0x47

As we can see, it is a normal connection timeout error, which makes sense, as our pools were not exhausted.  Of note, they had set their Connect Timeout to 25 seconds in the connection string, which means they would need to bump it higher, or look at what is going on with SQL Server at the time this occurs.  Not much more we can get from the dump.
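
For reference, here is a hedged illustration of the same connection string with a larger timeout; the 60 second value is purely illustrative, not a recommendation:

server=MyServer; database=MyDatabase;Integrated Security=SSPI;Connect Timeout=60; pooling=true; Max Pool Size=500; Min Pool Size=1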

 

Adam W. Saxton | Microsoft Escalation Services
http://twitter.com/awsaxton

Cumulative Update 2 to the RML Utilities for Microsoft SQL Server Released


Version 9.04.004 of the RML Utilities for Microsoft SQL Server has been released.  This release of the RML Utilities provides support for:

  • SQL Server 2005
  • SQL Server 2008
  • SQL Server 2008 R2
  • SQL Server 2012
  • SQL Server 2014 CTP2

on

  • Windows 7
  • Windows 8
  • Windows 8.1
  • Windows Server 2008
  • Windows Server 2008 R2
  • Windows Server 2012
  • Windows Server 2012 R2

To download the current web release of the RML Utilities for SQL Server visit the following Microsoft Web site:

RML Utilities for SQL Server (x86) - http://download.microsoft.com/download/4/6/a/46a3217e-f523-4cc6-96e9-df73dd0fdd04/RMLSetup_X86.msi 

RML Utilities for SQL Server (x64) - http://download.microsoft.com/download/0/a/4/0a41538e-2d57-40ff-ae85-ec4459f7cdaa/RMLSetup_AMD64.msi 

This package includes:

• SQL Azure – Connectivity and ReadTrace/Reporter database storage support
• SQL Server 2012/2014 Client and Server Versions supported
• SQL 2005 through SQL 2012/2014 Engine TRC support
• Limited SQL Server 2012/2014 XEL ReadTrace support, including limited XEL to TRC conversion
• Improved compressed file handling (types and performance – See Expander documentation for complete details)
• Includes ReadTrace and Reporter support for MARS connections  (OStress and Replay do not support MARS connections)
• Reduced memory footprint across all utilities
• .NET 4.0 Framework
• Performance enhancements (> 64 CPU support, processor group awareness, NUMA awareness)
• Parallel worker pools used internally to boost performance and reduce system impact
• Rollup of RML hotfixes
• Logging enhancements and additions

Note: A compatible release of SQLNexus is scheduled for release in the near future.

Bob Dorr - Principal SQL Server Escalation Engineer


All about RollbackSnapshotTempDB...


 

I’ve recently been involved in several cases where databases named RollbackSnapshotTempDB<someGUID> were generating confusion. The purpose of this post is to clarify their origin and use, and to enable SQL Server admins to know what to do if they need to deal with them. The following topics will be discussed (click for a quick jump):

 

·         RollbackSnapshotTempDB: why such a name?

·         RollbackSnapshotTempDB under the hood.

·         But I see permanent RollbackSnapshotTempDB on my SQL Server instance?

·         Why do we have leftover RollbackSnapshotTempDB?

·         I have a leftover RollbackSnapshotTempDB, what should I do?

·         Should my backup application really request ‘autorecovered snapshots’?

 

RollbackSnapshotTempDB: why such a name?

 

The name itself could be slightly misleading in a SQL Server context. It’s not related to any sort of ongoing server transaction rollback, or snapshot (in SQL Server’s “database snapshot” sense), or to the tempdb system database.

This database is a temporary construct created by SQL Server during backups initiated through the VSS framework. So not your usual ‘backup database’ TSQL statement, but instead a backup initiated at the system level by NtBackup, Windows Server Backup, Microsoft DPM, or any 3rd party tool relying on VSS.

Furthermore, not every VSS backup will trigger the creation of such a database; only a very specific (and supposedly not mainstream) option will: the request for “autorecovered snapshots”.

·         Here’s the link to the option in the VSS API: http://msdn.microsoft.com/en-us/library/windows/desktop/aa385012(v=vs.85).aspx

·         And the SQL Writer implementation’s specifics can be found here: http://technet.microsoft.com/en-us/library/cc966520.aspx#ECAA (look for the ‘Auto-Recovered Snapshots’ entry)

 

So the name must be understood in the VSS context: snapshot here relates to the VSS snapshot. TempDB just means temporary database. Rollback is the VSS feature it enables (‘application rollback’, cf. the VSS API link, where a read-only, point-in-time state of the structure being backed up is later made available). The GUID is a uniquifier which derives from the VSS context in which the temporary DB is created.

 

RollbackSnapshotTempDB under the hood.

 

The ‘Guide for SQL Server Backup Application Vendors’ previously linked actually gives us the high-level view of the logic that is specific to autorecovered snapshots:

    • Attach the snapshot database to the original SQL Server instance (i.e., the instance to which the original database is attached).
    • Recover the database (this happens as part of the “attach” operation).
    • Shrink log files.
      • (This implies switching the DB to simple recovery in the first place)
    • Detach the database.

This would translate to the following patterns in SQL Server ERRORLOGs:

10:39:19.23 spid301 I/O is frozen on database ABC. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.

10:39:19.64 spid301 I/O was resumed on database ABC. No user action is required.

10:39:23.38 Backup Database backed up. Database: ABC, creation date(time): 2009/09/17(20:41:36), pages dumped: 243107, first LSN: 96553:62855:81, last LSN: 96553:62490:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{7B28E39D-7E44-411F-AF80-373C204B95DF}1'}). This is an informational message only. No user action is required.

10:39:23.66 spid301 Starting up database 'RollbackSnapshotTempDB{AF504647-90AC-4354-8BC0-FB6A42FAC1B7}'.

10:39:24.07 spid301 Setting database option RECOVERY to SIMPLE for database RollbackSnapshotTempDB{AF504647-90AC-4354-8BC0-FB6A42FAC1B7}.

10:39:24.30 spid301 Setting database option SINGLE_USER to ON for database RollbackSnapshotTempDB{AF504647-90AC-4354-8BC0-FB6A42FAC1B7}.

 

 

The important thing to understand is that the RollbackSnapshotTempDB is created by attaching database files which are located in the ongoing, active VSS snapshot of the drive(s) hosting the original database files.

So the RollbackSnapshotTempDB database is really a duplicate of the DB being backed up. It will have the same physical file names (.mdf, .ndf and .ldf) but the path to those files is not the SQL Server data folder, but the snapshot path.

 

Here’s the attach statement:

create database [RollbackSnapshotTempDB{C9934090-EE6D-47B3-8346-FCBF9DFB2FF3}] on (filename = '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy17\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\ABC.mdf'), (filename = '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy17\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\ABC_log.ldf') for attach with presume_abort

 

Notice the ABC_log.ldf, but with the path starting with \\?\GLOBALROOT. The RollbackSnapshotTempDB is therefore completely unrelated to the initial (ABC) database as far as SQL Server is concerned (way down in the OS layers, the RollbackSnapshotTempDB physical files are ‘snapshot’ versions of ABC’s, relying on a copy-on-write mechanism, but again, for SQL Server they’re two distinct DBs). Note that Windows’ File Explorer won’t let you explore the \\?\GLOBALROOT path: failure to access it from Explorer doesn’t necessarily mean there’s a problem.

 

But why bother with all this overhead?

Well, the purpose is to obtain a database that can be mounted from read-only media, so that we can fulfill the VSS requirements for the rather specific ‘application rollback’ feature.

This is not possible if the database must be recovered when attached, as recovery implies write activity against the DB files. Furthermore, a snapshot of a busy DB, like the one we just took in the steps described above, is most likely to yield a DB that is not in a ‘cleanly shutdown’ state, and therefore a candidate for recovery upon attach.

So before we ‘deliver’ the DB files to the VSS framework, there is extra logic to let the database undergo its recovery and checkpoint, and the only way this can happen is to let SQL Server attach the database files from the snapshot we just created (yes, that might seem a bit counterintuitive). Note that this requires that the snapshot we’re working with is enabled for read+write activity; the VSS layer automatically takes care of this.

As an optional bonus, the logic also shrinks the snapshot DB’s LDF once the recovery has finished (space savings), thanks to the switch to the simple recovery model. You will also see an explicit CHECKPOINT passing by, to make really sure there’s nothing left for recovery. The single_user mode is there to prepare for the detach operation, preventing any rogue SPID from blocking the detach; the DB is finally detached and the VSS backup workflow continues from there. As you can see in the log extract, the whole attach/detach operation is rather fast (less than one second here, though this will vary based on DB recovery duration), so most users/DBAs shouldn’t even realize the RollbackSnapshotTempDB<GUID> DB has appeared and disappeared (and yes, this database is treated like any other user DB and would appear in the Management Studio DB list for the duration of its lifetime, even though it would require a very accurate or lucky refresh operation to catch it).

 

At the end of the day, the difference from the output of a snapshot backup without autorecovery is that the autorecovered snapshot has been recovered and is therefore cleanly shut down. This enables the backup application to attach the final files in read-only mode within SQL Server, giving access to a read-only 'snapshot' of the DB.

 

Note that if your backup requestor selects more than one database for the VSS backup operation, and the ‘autorecovered snapshots’ option is activated, each temporary database will use the same GUID since the VSS context is the same. This will lead to repeated entries for the exact same DB name, but each iteration actually handles a different user database, so the following pattern is perfectly normal and expected:

 

spid55      I/O is frozen on database ABC. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.

spid56      I/O is frozen on database DEF. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.

spid56      I/O was resumed on database DEF. No user action is required.

spid55      I/O was resumed on database ABC. No user action is required.

Backup      Database backed up. Database: ABC, creation date(time): 2012/09/20(13:54:10), pages dumped: 10553, first LSN: 48:6825:1, last LSN: 48:6828:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{67780D33-20D5-48B0-8CA4-73F78345CEA9}1'}). This is an informational message only. No user action is required.

Backup      Database backed up. Database: DEF, creation date(time): 2012/10/24(13:54:21), pages dumped: 465, first LSN: 37:59:1, last LSN: 37:62:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{67780D33-20D5-48B0-8CA4-73F78345CEA9}2'}). This is an informational message only. No user action is required.

spid55      Starting up database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'.                                   <= Database ABC

spid55      Setting database option RECOVERY to SIMPLE for database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'

spid55      Setting database option SINGLE_USER to ON for database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'.

spid55      Starting up database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'.                                   <= Database DEF

spid55      Setting database option RECOVERY to SIMPLE for database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'.

spid55      Setting database option SINGLE_USER to ON for database 'RollbackSnapshotTempDB{80C5D84B-6CB0-4D5C-B34C-94060F798344}'.

 

 

But I see permanent RollbackSnapshotTempDB on my SQL Server instance?

 

In an ideal world, DBAs should not even see a RollbackSnapshotTempDB database, unless they happen to query sysdatabases at the very second a RollbackSnapshotTempDB is active, or unless they audit database attach/detach events. As described, RollbackSnapshotTempDBs are transient constructs and their lifetime should not exceed a few seconds.

 

However, of late I’ve been working on more than a few occurrences where leftover RollbackSnapshotTempDB databases were present on SQL Server instances, surviving service restarts and not going away at all.

 

And this is bound to grab attention, even if the DBA doesn’t wonder about that new DB in the database list, because the leftover database will very likely generate (benign) errors at various levels:

 

  • Right after the “leftover DB” issue takes place, the snapshot still exists, but may not be read/write anymore. Any attempt from SQL Server to access that DB will generate the following:

2013-08-29 13:15:21.66 spid52      Error: 823, Severity: 24, State: 2.

2013-08-29 13:15:21.66 spid52      The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x000000000da000 in file '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

2013-08-29 13:15:23.47 spid60      Error: 823, Severity: 24, State: 2.

2013-08-29 13:15:23.47 spid60      The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x000000000da000 in file '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

  •  This often brings confusion, as the user recognizes the physical filename (PUB.mdf) without considering the GLOBALROOT path, and may think there is a problem with the original user DB (here, PUB).

  • After the SQL Server service stops, SQL Server releases all handles against its files, including those within the snapshot, which usually leads to the snapshot’s de-allocation. Therefore, upon service restart we see slightly different errors:

2013-08-29 13:18:41.62 spid37s     Error: 17204, Severity: 16, State: 1.

2013-08-29 13:18:41.62 spid37s     FCB::Open failed: Could not open file \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB.mdf for file number 1.  OS error: 3(The system cannot find the path specified.).

2013-08-29 13:18:41.62 spid37s     Error: 5120, Severity: 16, State: 101.

2013-08-29 13:18:41.62 spid37s     Unable to open the physical file "\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB.mdf". Operating system error 3: "3(The system cannot find the path specified.)".

2013-08-29 13:18:41.62 spid37s     Error: 17207, Severity: 16, State: 1.

2013-08-29 13:18:41.62 spid37s     FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file '\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB_log.ldf'. Diagnose and correct the operating system error, and retry the operation.

2013-08-29 13:18:41.62 spid37s     File activation failure. The physical file name "\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy2\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\PUB_log.ldf" may be incorrect.

  • These errors will repeat at each service start and whenever database iteration tasks are executed, leading to many entries in errorlogs and event logs.
  • It may also prevent some GUI operations in Management Studio, with errors like:
    •  Invalid object name '#unify_temptbl63513453714256.1'. (Microsoft SQL Server, Error: 208)

 

 

Why do we have leftover RollbackSnapshotTempDB?

 

There isn’t a single scenario leading to this situation. Basically, ‘something’ interrupted the logical flow described above and prevented SQL Writer from detaching the RollbackSnapshotTempDB as a final step.

One must understand that the ‘autorecovered snapshot’ logic belongs to the SQL Writer service, which is more or less ‘stateless’ (a given ongoing SQL Writer backup operation has no knowledge of past or simultaneous VSS backups). On the SQL Server side, the RollbackSnapshotTempDB is just a regular DB (with a slightly exotic file path, granted). So if anything breaks SQL Writer’s logic and generates a leftover RollbackSnapshotTempDB, SQL Writer won’t do any clean-up upon restart (stateless logic), and SQL Server won’t decide to drop that ‘regular’ DB on its own (I’m sure you are all relieved to know that SQL Server doesn’t drop databases randomly).

I suspect that any severe event (shutdown, crash, power outage) taking place precisely while a RollbackSnapshotTempDB is active will lead to a leftover DB. But that should be a rare enough situation.

I’ve also identified one specific scenario where the replication status of a DB interferes with the autorecovered snapshot. Namely, a published database which is backed up with the autorecovery option will fail the VSS backup and generate a leftover DB in most cases. But read the last topic of this blog to understand why this should not be a common scenario.

 

Also note that server-wide DDL triggers on operations like create DB (for attach) may interfere with the SQLWriter logic.

 

I have a leftover RollbackSnapshotTempDB, what should I do?

 

Based on what we have described so far, you now know that a leftover RollbackSnapshotTempDB is a regular DB pointing to an obsolete snapshot path. It is similar to a DB for which you have deleted the physical files while SQL Server was offline: it’s an empty shell.

By all means delete (drop) any and all leftover RollbackSnapshotTempDBs present on your SQL Server instances (if you are skimming, please make sure to review the whole post to know what ‘leftover’ means!). They will only cause trouble (cf. the list of errors above), and there’s nothing to learn from their presence for troubleshooting purposes: the SQL Server errorlog has recorded the physical file names along with the activation errors, so you will know which original user DB was associated with the leftover temporary DB based on those physical names, and that is pretty much all the useful information you could extract after the fact.
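
As a minimal sketch, a query like the following (using only the standard sys.databases and sys.master_files catalog views) can locate candidate leftovers and show the snapshot paths their files point to:

-- List databases matching the RollbackSnapshotTempDB naming pattern,
-- along with the \\?\GLOBALROOT snapshot paths their files reference.
select d.name, mf.physical_name
from sys.databases as d
      inner join sys.master_files as mf on (mf.database_id = d.database_id)
where d.name like 'RollbackSnapshotTempDB{%';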

 

If the database that was the source of the RollbackSnapshotTempDB was a replication publisher, you’ll need to take the RollbackSnapshotTempDB database offline first (‘alter database xxx set offline’) to avoid the following message:

Msg 3724, Level 16, State 3, Line 1

Cannot drop the database 'RollbackSnapshotTempDB{E122ECA8-9BFC-451A-81A1-2158B44A8570}' because it is being used for replication
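
For clarity, here is a minimal sketch of that cleanup, reusing the leftover name from the message above (substitute your own database name):

-- The set offline step is only needed when the source DB was a replication publisher.
alter database [RollbackSnapshotTempDB{E122ECA8-9BFC-451A-81A1-2158B44A8570}] set offline;
drop database [RollbackSnapshotTempDB{E122ECA8-9BFC-451A-81A1-2158B44A8570}];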

 

The cause of the leftover DB may be a one-off, in which case you’re now good to go. But if you realize they come back on a regular basis, it will be useful to record how they become orphaned.

A SQL Profiler trace recording SQL Writer activity against the SQL Server instance while the ‘orphaning’ backup is taking place is usually the best way to start (VSS traces won’t help a lot here), along with the SQL errorlogs and OS event logs covering the same period. That means proactive tracing needs to be activated. I won’t elaborate more here; this is ad hoc troubleshooting which may be a valid reason to involve Microsoft Support Services.

 

Should my backup application really request ‘autorecovered snapshots’?

 

As you can see, there’s quite a bit of extra logic involved when this option is enabled. Furthermore, it is apparent that the resulting VSS backup content will not be a ‘strict’ copy of the original database(s): the backup will contain a database which has been switched to the simple recovery model, whose transaction log has been shrunk, and that will even remember its ‘RollbackSnapshotTempDB’ logical name upon restore. Noticeable differences indeed. But should that come as a surprise?

 

Let us go back to our most detailed reference document, http://technet.microsoft.com/en-us/library/cc966520.aspx.

 

Extract :

With SQL Server 2005 and the VSS framework running on Windows 2003 SP1, it is possible to auto-recover the snapshots as part of the snapshot creation process. As part of the Writer Metadata Document, the SQL writer will specify the component flag “VSS_CF_APP_ROLLBACK_RECOVERY” to indicate that recovery needs to be performed for the database on snapshot before the database can be accessed. When specifying the snapshot set, the requestor can indicate that the snapshot should be an app-rollback snapshot (i.e., all database files in a snapshot are meant to be in a consistent state for application usage) or a backup snapshot (a snapshot used for backing up data to be restored later in case of a system failure). The requestor should set VSS_VOLSNAP_ATTR_ROLLBACK_RECOVERY to indicate that this component is being backed up for a non-backup purpose.

It is very clear that app-rollback snapshots (the ones for which the ‘autorecovered snapshot’ option has been set) are not meant for backup purposes, but only for application-rollback usage.

Therefore, if your backup utility’s purpose is to generate system-wide backups to protect against disaster scenarios, by capturing a complete system image as it was at the time the backup was taken, it is unlikely that activating ‘autorecovered snapshots’ makes sense as far as SQL Server is concerned: the backup won’t be able to help recreate SQL Server databases in their exact original state. The logical data they contain will be the same and available in the same way, though.

 

One example of a valid use of this option is the new Microsoft DPM 2012 architecture, where the use of autorecovered snapshots against SharePoint content databases enables a very optimized method of retrieving single document items (“Item Level Recovery”) from DPM backups (see http://technet.microsoft.com/en-us/library/hh758215.aspx). Actually, this is the only valid use of autorecovery I’m aware of.

 

Hope this helps!

 

-- Guillaume Fourrat
-- SQL Server Escalation Engineer

 

Spatial Indexing: From 4 Days to 4 Hours


Over the past month I have been involved in the optimization of a Spatial index creation/rebuild.  Microsoft has several fixes included in the SQL Server 2012 SP1 CU7 release.

I have been asked by several people to tell the story of how I was able to determine the problem code lines that allowed the Spatial Index build/rebuild to go from taking 4+ days to only 4 hours on the same machine.

Environment

  • 1.6 billion rows of data containing spatial geography polygon regions
  • Data stored on SSD drives, ~100GB
  • Memory on machine 256GB
  • 64 schedulers across 4 NUMA nodes

The customer was on an 8 scheduler system that took 22 hours to build the index and expected to move to an 80 scheduler system and get at least a 50% improvement (11 hours).  Instead it took 4.5 days!

Use of various settings (max dop, max memory percentage) showed only limited improvement.   For example, 16 schedulers took 28 hours, 8 schedulers took 22 hours, and 80 took 4.5 days.

Just from the summary of the problem I suspected a hot spot that required the threads to convoy, and that is exactly what I found.

Started with DMVs and XEvents  (CMemThread)

I started with the common DMVs (dm_exec_requests, dm_os_tasks, dm_os_wait_stats and dm_os_spinlock_stats).

CMEMTHREAD quickly jumped to the top of the list, taking an average of 3.8 days of wait time, per CPU, over the life of the run.   If I could address this I would get back most of my wall clock time.
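
A minimal sketch of the kind of wait stats query used here; note that sys.dm_os_wait_stats is cumulative since instance start, so deltas between two captures give the rate:

-- Top waits by accumulated wait time since instance start.
select top (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
from sys.dm_os_wait_stats
order by wait_time_ms desc;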

Using XEvents I set up a capture, bucketed by the waits on CMEMTHREAD, and looked at the stacks closely.  This can be done with public symbols.

-- Create a session to capture the call stack when these waits occur.  Send events to the async bucketizer target "grouping" on the callstack.
-- To clarify what we're doing/why this works I will also dump the same data in a ring buffer target
CREATE EVENT SESSION wait_stacks
ON SERVER
ADD EVENT sqlos.wait_info
(
      action (package0.callstack)
      where opcode = 1              -- wait completed
            and wait_type = 190     -- CMEMTHREAD on SQL 2012
)
add target package0.asynchronous_bucketizer (SET source_type = 1, source = 'package0.callstack'),
add target package0.ring_buffer (SET max_memory = 4096)
with (MAX_DISPATCH_LATENCY = 1 SECONDS)
go

-- Start the XEvent session so we can capture some data
alter event session wait_stacks on server state = start
go

-- Run the repro script to create a couple of tables and cause some brief blocking so that the XEvent fires.

select event_session_address, target_name, execution_count, cast(target_data as XML)
from sys.dm_xe_session_targets xst
      inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'wait_stacks'

What I uncovered was that the memory object used by spatial is created as a thread-safe, global memory object (PMO) but it was not partitioned.   A PMO is like a private heap, and the spatial object is marked thread safe for multi-threaded access.  Reference: http://blogs.msdn.com/b/psssql/archive/2012/12/20/how-it-works-cmemthread-and-debugging-them.aspx

A parameter to the creation of the memory object is the partitioning scheme (by NUMA node, CPU or none).   Under the debugger I forced the scheme to be partitioned by NUMA node and the runtime came down to ~25 hours on 80 schedulers.   I then forced partitioning by scheduler (CPU) and achieved an 18 hour completion.

Issue #1 was identified and filed with the SQL Server development team: when creating the memory object used for ANY spatial activity, partition it by NUMA node.  Partitioning adds a bit more memory overhead, so we don’t target per-CPU partitioning unless we know it is really required.   We also provide the startup trace flag –T8048 that upgrades any NUMA partitioned memory object to CPU partitioned.

Why is –T8048 important?   Heavily used synchronization objects tend to get very hot at around 8 CPUs.  You may have read about the MAX DOP targets for SQL Server that recommend 8, for example.   If the NUMA node CPU density is 8 or less, node partitioning may suffice, but if the density increases (like on this system with 16 CPUs per node) the trace flag can be used to further partition the memory object.
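
As an illustrative check, the CPU density per NUMA node can be read from sys.dm_os_schedulers when deciding whether node-level partitioning is likely to suffice:

-- Count visible online schedulers (the CPUs SQL Server uses) per NUMA node.
select parent_node_id, count(*) as scheduler_count
from sys.dm_os_schedulers
where status = 'VISIBLE ONLINE'
group by parent_node_id
order by parent_node_id;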

In general we avoid use of –T8048 because in many instances partitioning by CPU starts to approach a design of scheduler-local memory.  Once we get into this state we always revisit the design to see if a private memory object is a better long-term solution.

Memory Descriptor

Now that I had CMEMTHREAD out of the way I went back to the DMVs and XEvents and looked for the next target. This time I found a spinlock encountering millions of back-offs and trillions of spins in just 5 minutes.    Using a similar technique as shown above, I bucketed the spinlock back-off stacks.
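
For reference, a minimal sketch of the spinlock DMV query; like the wait stats, these counters are cumulative, so compare two captures taken a few minutes apart:

-- Hottest spinlocks by spins; high backoffs indicate contention severe enough to yield.
select top (10) name, collisions, spins, backoffs
from sys.dm_os_spinlock_stats
order by spins desc;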

The spinlock was SOS_ACTIVEDESCRIPTOR.  The descriptor spinlock is used to protect aspects of the memory manager and how it is able to track and hand out memory.    Each scheduler can have an active descriptor, pointing to hot, free memory to be used.  Your question at this juncture might be: why, if you have a CPU partitioned memory object, would this be a factor?

Issue #2 is uncovered but to understand it you have to understand the spatial index build flow first.  The SQL Server spatial data type is exposed in the Microsoft.SqlServer.Types.dll (managed assembly) but much of the logic is implemented in the SqlServerSpatial110.dll (native).

Sqlserver Process (native) –> sqlaccess.dll (managed) –> Microsoft.SqlServer.Types.dll(managed) –> SqlServerSpatial110.dll (native) –> sqlos.dll (native)

Shown here is part of the index build plan: the nested loops in the native SQL engine, doing a native clustered index scan and then calling the 'GetGeographyTessellation_VarBinary' managed implementation for each row.

[Figure: portion of the spatial index build plan]

CREATE SPATIAL INDEX [idxTest] ON [dbo].[tblTest]
(      [GEOG] ) USING GEOGRAPHY_GRID
WITH (GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

The implementation of the tessellation is in the SqlServerSpatial110.dll (native) code that understands it is hosted in the SQL Server process when used from SQL Server.    When in the SQL Server process space the SQL Server (SOS) memory objects are used.  When used in a process outside the SQL Server the MSVCRT memory allocators are used.

What is the bug?  Without going into specifics, when looping back to the SQLOS memory allocators, the code didn’t properly honor the scheduler id hint and the active partition always ended up being partition 0, making the spinlock hot.

To prove I could gain some speed by fixing this issue, I used the debugger to make SqlServerSpatial110.dll think it was running outside of the SQL Server process and use the MSVCRT memory allocators.  The MSVCRT allocators allowed me to dip near the 17 hour mark.

A fix for the SOS_ACTIVEDESCRIPTOR issue dropped the index creation/rebuild to 16 hours on the 80 scheduler system.

First set of (2) fixes released: http://support.microsoft.com/kb/2887888 

Geometry vs Geography

Before I continued on I wanted to understand the plan execution a bit better.   I found that the Table Valued Function (TVF) is called for each row to help build the spatial index grid, accepting 3 parameters and a return value.  You can read about these at:  http://technet.microsoft.com/en-us/library/bb964712(v=SQL.105).aspx

Changing the design of TVF execution is not something I would be able to achieve in a hotfix, so I set off to learn more about the execution and whether I could streamline it.  In my learning I found that the logic to build the grid is highly tied to the data stored in the spatial data type, and the type used.    For example, a LineString requires fewer mathematical computations than a Polygon.

Geometry is a flat map and geography is a map of the round earth.   My customer could not move to a flat map as they are tracking polygons down to 1/120th of a degree of latitude and longitude.   The distance between longitude 0 degrees and 30 degrees at the equator is different from that same span 2000 miles north of the equator, for example.

The geography calculations are often more CPU intense because of the additional logic using cos, radius, diameter, and so forth.  I am unlikely to be able to make the MSVC runtime cos function faster, and the customer requires polygons.

Index Options

I was able to adjust the ALTER/CREATE index grid levels and shave some time off the index creation/rebuild.  The problem is that going to a LOW grid, or limiting the number of grid entries per row, can force additional filtering for runtime queries (residual where clause activities).  In this case going to LOW was trading index build speed for increased query durations.   This was a non-goal for the customer, and the savings in time was not significant in my testing.
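
For illustration only, here is the shape of a lower density variant of the earlier index definition; the LOW levels and the smaller CELLS_PER_OBJECT are hypothetical values, not a recommendation:

-- Hypothetical lower density grid: faster to build, but more residual filtering at query time.
CREATE SPATIAL INDEX [idxTestLow] ON [dbo].[tblTest] ( [GEOG] )
USING GEOGRAPHY_GRID
WITH (GRIDS = (LEVEL_1 = LOW, LEVEL_2 = LOW, LEVEL_3 = LOW, LEVEL_4 = LOW),
      CELLS_PER_OBJECT = 8);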

P/Invokes

Those familiar with managed and native code development may have asked the same question I did: how many P/Invokes are occurring to execute the TVF as the function is called and we go in and out of native code?

This was something I could not change in a hotfix, but I broke the reproduction down to 500,000 rows and filed a work item with the SQL Server development team, who is investigating whether we can go directly to the native SqlServerSpatial110.dll and skip the managed logic to avoid the P/Invokes.

SOS_CACHESTORE, SOS_SELIST_SIZED_SLOCK, and SOS_RW

Still trying to reach the 11 hour mark, I went back to looking at the DMV and XEvent information for hot spots.  I noticed a few more spinlocks that were hot (SOS_CACHESTORE, SOS_SELIST_SIZED_SLOCK, and SOS_RW).

I found the cache store and selist were related to the CLR procedure cache.   The CLR procedure execution plans are stored in a cache.   Each parallel thread (80 in this test) uncached, used, and re-cached the TVF plan.   As you can imagine, the same hash bucket is used for all 80 copies of the execution plan (same object id), and the CLR procedure cache is not partitioned as much as the SQL Server TSQL procedure cache.

Issue #3 - The bucket and list were getting hot as each execution uncached and cached the CLR TVF plan.  The SQL Server development team was able to update this behavior to avoid the additional overhead, reducing the execution by ~45 minutes.

The SOS_RW was found to protect the application domain for the database.   Each time the CLR procedure was executed, the application domain (AppDomain) had a basic validation step, and the spinlock for the reader/writer lock in the SOS_RW object was hot.

Issue #4 – SQL Server development was able to hold a better reference to the application domain across the entire query execution, reducing the execution by ~25 minutes.

Array Of Pages

A Polygon can contain 1000s of points and is supported in SqlServerSpatial110.dll with an array of page allocations.   Think of it a bit like laying down pairs of doubles in a memory location.   This is supported by ->AllocPage logic inside a template class that allows ordinal access to the array but hides the physical allocation layout from the template user.   As the polygon requires more pages they are added to an internal list as needed.

Using the Windows Performance Analyzer and queries against the data (<<Column>>.ToString(), datalength, …) we determined a full SQL Server page was generally not needed.   A better target than 8K is something like 1024 bytes.  

Issue #5 - By reducing the target size used by each TVF invocation, SQL Server can better utilize the partitioned memory object.   This reduces the number of times pages need to be allocated and freed all the way down at the memory manager.   The per-scheduler or per-NUMA-node memory objects hold a bit of memory in a cache.   This allows better utilization of the memory manager and less memory usage overall, and gained us another ~35 minutes of runtime reduction.

What Next Is Big - CLR_CRST?

As you can see I am quickly running out of big targets to get me under the 11 hour mark.  I am slowly getting to 13 hours but running out of hot spots to attack. 

I am running all CPUs at 100% on the system, taking captures of DMVs, XEvents and Windows Performance Analyzer.  Hot spots like scheduler switches, cos and most of the captured functions are less than 1% of the overall runtime.    At this rate I am looking at trying to optimize 20+ more functions to approach the 11 hours.

That is, except for one wait type I was able to catch: CLR_CRST.  Using XEvents I was able to see that CLR_CRST is associated with GC activities, as SQL Server hosts the CLR.

Oh great, how am I going to reduce GC activity?  This is way outside of the SQL Server code path.   Yet I knew most, if not all, of the core memory allocations were occurring in SqlServerSpatial110.dll using the SQLOS hosting interfaces.  Why so much GC activity?

What I found was the majority of the GC was being triggered from interop calls.  When you traverse from managed to native the parameters have to be pinned to prevent GC from moving the object while referenced in native code.

The Microsoft.SqlServer.Types.dll used a pinner class that performed a GCHandle::Alloc to pin the object and, when the function was complete, GCHandle::Free.   It turns out this had to be used for 12 parameters to the tessellation calls.  When I calculated the number of calls (1.6 billion rows, adding in the number of grid calculations per row, etc.) the math landed at 7,200 GCHandle::Alloc/GCHandle::Free calls, per millisecond, per CPU, for a run on 80 schedulers over a 16 hour window.

Working with the .NET development team, they pointed out the fixed/__pin keywords that can be used to create stack-local references that pin objects in the .NET heap without requiring the overhead of GCHandle::Alloc and ::Free.   http://msdn.microsoft.com/en-us/library/f58wzh21.aspx

Issue #6 – Change the geography grid method in Microsoft.SqlServer.Types.dll to use the stack-local object pin references.  BIG WIN – the entire index build now takes 3 hours 50 minutes on the 80 scheduler system.

The following is the new CPU usage with all the bug fixes.  The valley is the actual .NET garbage collection activity.  The performance analysis now shows the CPUs are pegged doing mathematical calculations, and the index build will scale reasonably as more schedulers are available.

[Figure: CPU usage with all fixes applied; the valley is .NET garbage collection]

In order to maximize the schedulers for the CPU bound index activity I utilized a resource pool/group – MAX DOP setting.  http://blogs.msdn.com/b/psssql/archive/2013/09/27/how-it-works-maximizing-max-degree-of-parallelism-maxdop.aspx

create workload group IndexBuilder
with
(
       MAX_DOP = 80,
       REQUEST_MAX_MEMORY_GRANT_PERCENT = 66
)
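
A workload group only takes effect when requests are routed to it.  Here is a hedged sketch of the wiring; the login name is purely illustrative and dbo.fn_IndexBuilderClassifier is a hypothetical helper:

-- Classifier function: route a dedicated maintenance login to the IndexBuilder group.
USE master;
GO
CREATE FUNCTION dbo.fn_IndexBuilderClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    -- N'IndexBuildLogin' is a hypothetical login used only for index maintenance.
    RETURN CASE WHEN SUSER_SNAME() = N'IndexBuildLogin'
                THEN N'IndexBuilder' ELSE N'default' END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_IndexBuilderClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;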

Why was this important to the customer?

Like any endeavor, getting the best result for your money is always a goal.   However, some of the current spatial limitations fueled the endeavor.

  • Spatial indexes must be built offline.  They are not online aware and lock down core table access during the index creation/rebuild.
  • Spatial data is not allowed as part of a partitioning function.  This means a divide and scale out type of approach using partitioned tables is not a clear option.  (Note you can use separate tables and a partitioned view, using a constraint on a different column, to achieve a scale out like design where smaller indexes can be maintained on each base table.)
  • We looked at a Parallel Data Warehouse (PDW) approach, but PDW does not currently support spatial data types.
  • HDInsight has spatial libraries, but the nature of the work the customer's application was doing required a more interactive approach.   However, for batch and analysis activities HDInsight may be a great, scale out, compute intensive solution.
  • SQL Server provides a full featured spatial library for polygon storage, manipulation and query activities.   Other products may only support points or are not as feature rich.

Fix References

The following Microsoft Knowledge Base articles contain the appropriate links to these fixes.

Fixes are included in SQL Server 2012 SP1 CU7 - http://support.microsoft.com/kb/2894115/en-us

Note: Don’t forget you have to enable the proper trace flags to enable the fixes.

2887888 FIX: Slow performance in SQL Server when you build an index on a spatial data type of a large table in a SQL Server 2012 instance
2887899 FIX: Slow performance in SQL Server 2012 when you build an index on a spatial data type of a large table
2896720 FIX: Slow performance in SQL Server 2012 when you build an index on a spatial data type of a large table

Recap

I realize this was a bit of a crash course in spatial and internals.   It was a bit like that for me as well; this was my first in-depth experience with tuning spatial activities in SQL Server.   As you have read, SQL Server, with the fixes, is very effective with spatial data and can scale to large numbers.

The DMVs and XEvents are perfect tools to help narrow down the hot spots and possible problems.   Understanding the hot spot and then building a plan to avoid it can provide you with significant performance gains.

If you are using spatial data you should consider applying the newer build(s), enabling the trace flags, and creating a specific workload group to assign maintenance activities to.

Additional Tuning

The following is an excellent document discussing the ins and outs of spatial query tuning.

http://social.technet.microsoft.com/wiki/contents/articles/9694.tuning-spatial-point-data-queries-in-sql-server-2012.aspx

Bob Dorr - Principal SQL Server Escalation Engineer

Kerberos Configuration Manager updated for Reporting Services


Back in May, we released the Kerberos Configuration Manager tool to help with diagnosing and correcting Kerberos related issues for SQL Server.  Today, I’m happy to announce that version 2 of this tool has been released and has been updated for Reporting Services.  A lot of work went into this tool to get us to this point, and the team that worked on it was awesome! You can download it from the following link:

Microsoft® Kerberos Configuration Manager for SQL Server®
http://www.microsoft.com/en-us/download/details.aspx?id=39046

Here is a look at which versions of Reporting Services are supported with this release of the tool:

SQL RS Version   Native Mode         SharePoint Integrated Mode
2005             No support in v.2   No support in v.2
2008             Supported           Supported
2008 R2          Supported           Supported
2012             Supported           No support in v.2

The tool does not add a shortcut to the Start Menu, so you will need to go to the directory in which it is installed.  By default, this is C:\Program Files\Microsoft\Kerberos Configuration Manager for SQL Server.  There are a couple of things here that are pretty neat.  First is that you can decide whether to include SQL Server and Reporting Services instances in the list.  This comes in handy if you have several of both on the same server and you just care about one or the other.  Great for dev and test environments!

[Screenshot: option to include SQL Server and/or Reporting Services instances in the discovery list]

We will also show you the mode of the Report Server that we discovered: Native or SharePoint.

[Screenshot: discovered Report Server instances showing the Mode column]

Common Problems

Unauthorized

If you see the “Unauthorized” message under Status, this just means that you didn’t start the tool with admin rights.  Be sure to launch the tool with admin rights; the tool will not prompt automatically.

[Screenshot: “Unauthorized” status message]

Kerberos not enabled

If you see the “Kerberos not enabled” message under Status, this means that the rsreportserver.config does not have either RSWindowsNegotiate or RSWindowsKerberos in the Authentication Types.

[Screenshot: “Kerberos not enabled” status message]

If you do want to use Kerberos with Reporting Services, which I’m assuming you do if you got this far, you will need to modify the rsreportserver.config to get past this.

[Screenshot: Authentication Types section of rsreportserver.config]

For more information on this, check out one of my previous blogs on setting up Kerberos with Reporting Services.  It covers this.

Logging

If you happen to encounter something that I didn’t highlight above, you may be able to find additional information.  Each time you run the tool, we will create a log file.  The default location for this is the following:  C:\Users\<user>\AppData\Roaming\Microsoft\KerberosConfigMgr.

The details of the log file will be flushed when you close the program.  So, if it is blank, just close the tool and the log should populate.  You may also find some details in the Event Log under the Source “Kerberos Configuration Manager”.  If we encounter an error, it should be logged in the Application Event Log as well as the tool’s log file.

 

Limitations

There are a few limitations that you will run into.  I wanted to walk through those so you are aware.

SharePoint Integrated Mode with RS 2012

Unfortunately this release does not cover SharePoint Integrated mode with RS 2012, regardless of the SharePoint version.  The reason for this is that it is a completely different approach with regards to discovery.  We still want to get this in, as it is important, but for now it is not included.

Reporting Services 2005

For 2005, we had to make the same call as with SharePoint Integrated mode with RS 2012.  RS 2005 is a different architecture and would have caused us extra work to get the discovery in.  As such, we opted not to include RS 2005 in this tool.  Unfortunately, I don’t believe it will make it into the tool.

Multiple Domains

Right now, the tool will only work in a single domain scenario.  So, if you have the service installed in Domain A, but want to use a service account from Domain B, we won’t be able to discover and correct the issue appropriately.  As long as the machine the instance is on and the service account are in the same domain, you should be good to go.  This is true for both the Reporting Services and the SQL Server discovery.

Delegation

We will discover the Delegation settings for the service account itself, but that is the extent of the Delegation checks that this tool will make for Reporting Services.  To determine if delegation is indeed configured correctly, we would need to crack all of the Data Sources.  This means not just the shared data sources, but any embedded data source within the RDL files.  That part did not make the cut for v2, but it is something we would like to include down the road.

[Screenshot: delegation settings discovered for the service account]

 

I hope that you find this tool useful!  I’ll also point you back to an older blog post I created regarding my Kerberos Checklist to help you understand the different areas to consider when tracking down Kerberos issues.  As you can probably gather from this point, there is more that we want to do with this tool, so we aren’t done yet!  Stay tuned for future updates!

Adam W. Saxton | Microsoft SQL Server Escalation Services
http://twitter.com/awsaxton

How Simple Parameterization works


Recently we had a customer who upgraded from SQL Server 2005 to 2008, but their performance degraded greatly. What happened was that they had an update query that was run many times in a batch. The query was submitted ad hoc by the application with different values.

Upon further investigation, we discovered that in SQL Server 2005 the query was parameterized, but in SQL Server 2008 it wasn't. So the cost came from compiling the update every time. Eventually we resolved the issue, but it prompted a post on simple parameterization.

In SQL Server 2000, there isn't a concept of simple parameterization; there is just one option (auto parameterization). Starting with SQL Server 2005, we offer two options for parameterization (simple and forced). Simple parameterization is the default and is the same as auto parameterization in SQL Server 2000.
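
As a quick sketch, the setting can be checked and changed at the database level; MyDatabase is an illustrative name:

-- is_parameterization_forced: 0 = simple (the default), 1 = forced.
select name, is_parameterization_forced
from sys.databases
where name = N'MyDatabase';

alter database MyDatabase set parameterization forced;  -- or simple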

How do I know if my query is even parameterized?

The easiest way to find out if a query is parameterized is to use the graphical XML plan. Just point to the operator and take a look at the seek predicate.

Let's use this update example:

update t1 set c1 = 12, c2 = 23 where c1 = 12

The table t1 has c1 as the clustered primary key (the reason why I also update c1 is related to the customer's problem, which I will talk about later). So the plan has a clustered index seek predicate on c1.

If the plan is parameterized, you will see the seek predicate on c1 as "Scalar Operator (CONVERT_IMPLICIT(int,..)" or "Scalar Operator (@2)" as shown in figure 1. But if the plan is not parameterized, the hard coded value will show up in the seek predicate like "Scalar Operator (12)" as shown in figure 2 below.

Figure 1

 

Figure 2

 

When is a query parameterized?

If you set your database's parameterization to forced, SQL Server will try to parameterize every query except under the conditions documented at http://technet.microsoft.com/en-us/library/ms175037(v=SQL.105).aspx.

But what about when your database's parameterization is set to simple (the default)? Our Books Online documentation (http://technet.microsoft.com/en-us/library/ms186219(v=sql.105).aspx) states that only a very small set of queries will qualify. There is no easy answer to which queries qualify, but in general, if your query involves multiple tables, chances are it won't be parameterized. A more precise answer is that simple parameterization can only occur if the plan is a trivial plan. In case you are wondering why your query is not parameterized, you need to look no further than the XML plan itself. In the XML plan (you will need to open it as XML), you will see an attribute called "StatementOptmLevel" as shown below. If StatementOptmLevel="FULL", then the query will not be parameterized with the default simple parameterization option.
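
If you prefer to check from the plan cache rather than opening each plan by hand, a minimal sketch along these lines pulls StatementOptmLevel out of cached showplan XML (the LIKE filter is illustrative):

-- Extract StatementOptmLevel (TRIVIAL vs. FULL) from cached query plans.
select top (10)
       st.text,
       qp.query_plan.value(
           'declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan";
            (//p:StmtSimple/@StatementOptmLevel)[1]', 'varchar(25)') as StatementOptmLevel
from sys.dm_exec_query_stats as qs
      cross apply sys.dm_exec_sql_text(qs.sql_handle) as st
      cross apply sys.dm_exec_query_plan(qs.plan_handle) as qp
where st.text like N'%update t1%';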

 

What happened to this customer mentioned earlier?

For this customer, the application was really doing something suboptimal. I simplified the scenario as shown below; t1 is referenced by t2.

create table t1 (c1 int primary key, c2 int)

go

create table t2 (c1 int references t1(c1))

go

 

In the update statement, they seek on the primary key column but also update the primary key column with the exact same value. The statement looks like the one below.

update t1 set c1 = 12, c2 = 23 where c1 = 12

In 2005, the update got a simple trivial plan. But in 2008 and 2008 R2, we made an optimizer change to add some Halloween protection due to incorrect results. As a result, this type of query has to go through full optimization. Therefore, with the simple parameterization setting, the query can no longer be parameterized.

Fortunately, it's easy to fix. The easiest option is to set forced parameterization, but this customer didn't want to do that, citing that it could impact other queries. Fortunately, a template plan guide solves the issue.

All you need to do is to create a template plan guide to force parameterization for that particular query (like below)

DECLARE @stmt nvarchar(max);
DECLARE @params nvarchar(max);

EXEC sp_get_query_template
    N'update t1 set c1 = 12, c2 = 23 where c1 = 12',
    @stmt OUTPUT,
    @params OUTPUT;

EXEC sp_create_plan_guide
    N'TemplateGuide2',
    @stmt,
    N'TEMPLATE',
    NULL,
    @params,
    N'OPTION(PARAMETERIZATION FORCED)';
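To confirm the plan guide was created and see the hint it will apply, you can query sys.plan_guides (a quick sanity check; the name matches the example above):

select name, scope_type_desc, hints, query_text
from sys.plan_guides
where name = N'TemplateGuide2';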

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

 

Spatial Index is NOT used when SUBQUERY used


I have found the following link to be invaluable when working with and tuning SQL Server Spatial indexes:  http://technet.microsoft.com/en-us/library/bb895265.aspx

However, the link is not as clear as it could be about the Spatial index selections made by the SQL Server query processor. Here are a few additional tidbits that may assist you. (Note: similar tips may apply to non-Spatial queries as well.)
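The examples below reference a table named Spat with a geography column col2 and a spatial index SpatIDX. A minimal setup sketch along those lines, to make the examples reproducible (hypothetical schema; note that a spatial index requires a clustered primary key on the table):

create table Spat (Id int primary key, col2 geography)
go
create spatial index SpatIDX on Spat (col2)   -- default grid settings
go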

1. The Spatial method must be on the left side of the predicate (where clause)

       col.STIntersects(@val) = 1   --    Can use the index if costing is appropriate
       1 = col.STIntersects(@val)   --    Unlikely to use index, use previous form

2. The value passed to the spatial method must be ‘constant like’

       col.STDistance(@val) = 1 * 10000   --    Can use the index if costing is appropriate
       col.STDistance(@val / 10000) = 1   --    Unlikely to use index, use previous form

3. Extension of #2 for more complex operations

/* The subquery form does not consider the index */
Select * from Spat where col2.STIntersects((select col2 from Spat where Id = 23 and col2 is not null))=1

/* Using index hint - getting an error message for this query form */
-- Msg 8622, Level 16, State 1, Line 1
-- Query processor could not produce a query plan because of the hints defined in this query.

Select * from Spat with (index(SpatIDX)) where col2.STIntersects( (select col2 from Spat where Id =23) ) = 1

/* Variable or Join forms attempt to use the index */
Declare @i geography
Set @i = (select col2 from Spat where Id =23)
Select * from Spat  where col2.STIntersects((@i))=1  order by Id

Select s1.* from Spat as s1
join Spat as s2 ON
      s1.col2.STIntersects(s2.col2) = 1
   and s2.Id = 23
order by s1.Id

As you can see, the variable or join syntax is a construct the SQL Server query processor can evaluate for Spatial index usage, whereas the subquery form is generally not considered.

Be sure to check the form of your queries to make sure the indexes are properly considered.

Bob Dorr - Principal SQL Server Escalation Engineer

As The World Turns: SQL Server NUMA Memory Node and the Operating System Proximity


It felt a bit like ‘As The World Turns’ as I unraveled how the following worked, so the title is a bit of a tribute to my grandmother. She could not miss her ‘stories’ in the afternoon.

Proximity

Before I dive into the details, I would like to talk about NUMA node proximity. The idea of proximity is how close one NUMA node is to another, based on the memory layout of the system.

For example, Node 00 is 0 steps from itself but 3 steps from Node 03; Node 01 is 2 steps from Node 03, and so forth.

Node | 00 | 01 | 02 | 03
  00 | 00 | 01 | 02 | 03
  01 | 01 | 00 | 01 | 02
  02 | 02 | 01 | 00 | 01
  03 | 03 | 02 | 01 | 00

K-Group (Processor Group)

Windows attempts to gather the nodes, based on proximity, into the same K-Group.

Examples

Here is a snippet from the SQL Server error log on a 128-CPU, 8-node system.

Notice that nodes 0, 1, 2 and 3 are aligned to K-Group = 0.

image

Notice the change in node-to-group alignment on this same system.

image
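If you want to check the alignment on a running instance without digging through the error log, a query along these lines works (a sketch; sys.dm_os_nodes and its processor_group column are available on SQL Server 2008 R2 and later):

select node_id, memory_node_id, processor_group,
       online_scheduler_count, node_state_desc
from sys.dm_os_nodes
order by node_id;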

So What Does This Mean?

For this specific system, the second configuration, aligning nodes 0, 1, 4 and 5 in the same group, is not optimal. This system's true (SRAT) proximity is the layout from the first example. Nodes 4, 5, 6 and 7 are in a separate blade, using a glue-based NUMA node architecture. Memory accesses between the hardware blades can be slower than local blade access.

The SRAT table provided by the hardware BIOS contains the proximity information. Prior to Windows Server 2012, the Windows memory manager performs memory access tests, attempting to determine optimal proximity responsiveness, and can override the SRAT information.

The second example is after the machine had a hardware failure on nodes 2 and 3. The nodes were taken offline and Windows adjusted. However, once the problem was corrected, the system maintained the adjusted proximity layout.

The hardware manufacturers are aware of this behavior and have specific tuning instructions per system, CPU type, memory layout, etc. that establish the appropriate override of the Windows Server behavior using the Group Affinity registry key: http://support.microsoft.com/kb/2506384

These advanced configurations can involve additional processor groups to further sub-divide the current SRAT layout in order to obtain optimal performance.

For SQL Server installations running on NUMA hardware, I recommend you contact your hardware manufacturer and obtain the optimal proximity settings.

Reference: CoreInfo.exe  http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx 

Bob Dorr - Principal SQL Server Escalation Engineer

Can’t Connect to SQL when provisioning a Project App in SharePoint


The customer's issue was that they were trying to provision a Project site within the Project SharePoint Application. This was done via a PowerShell script that they ran on one of the SharePoint App Servers.

They had two SharePoint App Servers – AppServerA and AppServerB. They had indicated that the provisioning would fail on either App Server and it started failing around November of last year (4 months ago). The error that they would see when the failure occurred was the following from the SharePoint ULS Logs:

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880i        High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        880j        High        SqlError: 'A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)' Source: '.Net SqlClient Data Provider' Number: 53 State: 0 Class: 20 Procedure: '' LineNumber: 0 Server: ''        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Database        tzkv        High        SqlCommand: 'CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ' CommandType: Text CommandTimeout: 0        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.87        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6935        Critical        Error provisioning database. Script: C:\Program Files\Microsoft Office Servers\14.0\Sql\Project Server\Core\addqueue1timesheetsps12.sql, Line: 0, Error: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server), Line: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END .        f5009e1d-12cd-4a70-a0af-f0400acf99e6

02/05/2014 10:14:32.89        OWSTIMER.EXE (0x2024)        0x0BC8        Project Server        Provisioning        6971        Critical        Failed to provision site /CMS with error: Microsoft.Office.Project.Server.Administration.ProvisionException: Failed to provision databases. ---> Microsoft.Office.Project.Server.Administration.ProvisionException: CREATE PROCEDURE dbo.MSP_TimesheetQ_Acknowledge_Control_Message @serverUID UID , @ctrlMsgId int AS BEGIN IF @@TRANCOUNT > 0 BEGIN RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1) RETURN END DECLARE @lastError INT SELECT @lastError = 0 UPDATE dbo.MSP_QUEUE_TIMESHEET_HEALTH SET LAST_CONTROL_ID = @ctrlMsgId WHERE SERVER_UID = @serverUID SELECT @lastError = @@ERROR Exit1: RETURN @lastError END ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) at

One thing they had mentioned was that if they increased the connection timeout to 60 seconds, it would sometimes work. My thought was that if increasing the connection timeout sometimes allowed it to work, we might have hit a timeout when actually connecting to SQL Server; but that wasn't the error.

Looking at the actual error we can draw some conclusions.

provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server

By default, we should be using TCP. If there is a serious error with that, we fall back to Named Pipes. The error Named Pipes got back was that we couldn't open the connection, not a timeout. Think of this as "SQL Server does not exist or access denied". SQL Server in this case was also a default instance on a cluster, not a named instance, so SQL Browser was not coming into the picture. This is a straight shot to port 1433 via TCP.

Which machine was getting the error?

For troubleshooting, we need to consider which machines are involved. One thing that we noticed over the course of troubleshooting was that the error always occurred on AppServerB and we were always starting the script from AppServerA. If you think about how SharePoint works with its App Servers, when a service is running, you can have it started on individual App Servers and control the load.  The fact that we were always seeing the error on AppServerB led me to believe that the Project Application Server Service was only started on AppServerB and not AppServerA.  Looking in Central Admin, this was correct.  So, we want to concentrate data collection from AppServerB.

Network Traces

The first thing we looked at was a network trace. We collected network traces from AppServerB and the SQL Server. Going back to the error, we know TCP was not working as expected and Named Pipes was then failing. Named Pipes uses the SMB protocol to talk, which will first reach out to TCP port 445. We didn't see any traffic in the network trace going to that port. We also didn't see any SMB traffic that was relevant to the error; we only saw browser announcements, which had nothing to do with us. This tells me that we never hit the wire, so the network traces wouldn't be helpful.

BIDTrace

Enter BIDTrace.  BIDTrace is really just diagnostic logging within our client providers and server SNI stack.  Think Event Tracing for Windows (ETW).  I’m not going to dive into how to set this up as it would take its own blog post.  You can read more about it in the following MSDN Page:

Data Access Tracing in SQL Server 2012
http://msdn.microsoft.com/en-us/library/hh880086.aspx

Typically I won’t go this route unless I know what I’m looking to get out of it.  It is actually pretty rare that I’ll jump to this.  In this particular case, it was an excellent case.  We have some evidence that we are not getting far enough to hit the wire, and we know we are getting an error when trying to make a connection to SQL.  So, what I’m looking for here is if there is some Windows Error that we are getting that wasn’t presented in the actual exception.

Here is the Logman command that I used to start the capture after getting the BIDTrace items configured.

Logman start MyTrace -pf ctrl.guid -ct perf -o Out%d.etl -mode NewFile -max 150 -ets

A few things I’ll point out with this comment.  The output file has a %d in it.  This is a format string because we will end up with multiple files.  -mode is used to tell it to create a new file after hitting the max size that is listed.  We then set –max to 150 which means that we want to cap the size of the file to 150MB in size.  I did this because when we first went for it with a single file, the ETL file was 300MB and when I went to convert it to text it was over 1GB in size.  That’s a lot to look through.  I also had troubles opening it.  So, I decided to break it up.  Of note, it took about 4-5 minutes to reproduce the issue.  That’s a long time to capture a BIDTrace.  When you go to capture a BIDTrace, it is better to get a small window to capture if you can.  These files fill up fast.

Here is the ctrl.guid that I used to capture.  This is effectively the event providers that I wanted to capture:

{8B98D3F2-3CC6-0B9C-6651-9649CCE5C752}  0x630ff  0   MSDADIAG.ETW
{914ABDE2-171E-C600-3348-C514171DE148}  0x630ff  0   System.Data.1
{C9996FA5-C06F-F20C-8A20-69B3BA392315}  0x630ff  0   System.Data.SNI.1

The capture will produce ETL files which are binary files.  You need to convert them after you are done.  I use TraceRPT to do this.  It is part of Windows.  Here is the command I used to output it to a CSV file to look at.

TraceRPT out5.etl -of CSV

In our case, it generated 5 ETL files (remember the %d?). We wanted the last file produced, which was out5.etl, although at first I didn't know that and actually started with out4.etl. One problem, though, was that I didn't have timestamps within the CSV output; I had CPU clock time, which is hard to visualize compared to an actual timestamp.

Enter Message Analyzer! Message Analyzer is a replacement for Network Monitor, but it has another awesome ability: it can open ETL files. One other thing I had was the timestamp of the error from the SharePoint ULS Log for the attempt we made while capturing the BIDTrace.

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Database                          880i    High        System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)     at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)     at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)     at System.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, Boolean ignoreSniOpenTimeout, Int64 timerExpire, Boolean encrypt, Boolean trustServerCert, Boolean integratedSec...    bc7aaa60-93fc-4873-8f75-416d802aa55b

02/06/2014 13:14:20.55     OWSTIMER.EXE (0x2024)                       0x1C4C    Project Server                    Provisioning                      6993    Critical    Provisioning '/Test3': Failed to provision databases. An exception occurred: CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END .    bc7aaa60-93fc-4873-8f75-416d802aa55b

Our issue occurred at 1:14:20.55 Server Time. We can also see the statement it was going to try and run.  If we open the ETL file within Message Analyzer, we can see the timestamps that are covered within the file. 

image

We can see that this file went up to 12:14:04 local time. We were looking for 12:14:20.55, so out4.etl was not the file I was looking for, which left out5.etl. Technically you can read the data within Message Analyzer, as you can see from the lower right of the screenshot; it's Unicode data, and we see l.e.a.v.e. I still prefer the output from TraceRPT when going to CSV, as I can get readable text from that. It is just a little easier to work with.

So, I have the CSV output from out5.etl, but what do we look for? Well, we know the statement it was trying to run, so let's look for that: MSP_TimesheetQ_Get_Job_Count_Simple. We get a hit and it looks like this:

System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213871,       7080,      21510,        2, "<sc.SqlCommand.set_CommandText|API> 4187832#, '"
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213910,       7080,      21510,        2, "CREATE PROCEDURE dbo.MSP_TimesheetQ_Get_Job_Count_Simple   @correlationID UID ,    @groupState int ,    @msgType int  AS BEGIN    IF @@TRANCOUNT > 0    BEGIN              RAISERROR ('Queue operations cannot be used from within a transaction.', 16, 1)       RETURN    END     SELECT COUNT(*) FROM dbo.MSP_QUEUE_TIMESHEET_GROUP        WHERE CORRELATION_UID = @correlationID       AND   GRP_QUEUE_STATE = @groupState       AND   GRP_QUEUE_MESSAGE_TYPE = @msgType END "
System.Data,      TextW,            0,          0,          0,          0,         18,          0, 0x0000000000000000, 0x00002024, 0x00001C4C,                    0,             ,                     ,   {00000000-0000-0000-0000-000000000000},                                         ,   130361840460213935,       7080,      21510,        2, "' "

Not the prettiest, but when looking in Notepad or some other text reader, we can just go over to the right to get a better view.

image

The first time you look at this it can be a little overwhelming, especially if you aren't familiar with how SNI/TDS works. If we go through the results, we'll see a few interesting things.

<prov.DbConnectionHelper.ConnectionString_Set|API> 4184523#, 'Data Source=<server>;Initial Catalog=<database>;Integrated Security=True;Pooling=False;Asynchronous Processing=False;Connect Timeout=15;Application Name="Microsoft Project Server"' "

<GetProtocolEnum|API|SNI>

<Tcp::FInit|API|SNI>

<Tcp::SocketOpenSync|API|SNI>

<Tcp::SocketOpenSync|RET|SNI> 10055{WINERR}

<Tcp::Open|ERR|SNI> ProviderNum: 7{ProviderNum}, SNIError: 0{SNIError}, NativeError: 10055{WINERR} <-- 10055 = WSAENOBUFS

<Np::FInit|RET|SNI> 0{WINERR}

<Np::OpenPipe|API|SNI> 212439#, szPipeName: '\\<server>\PIPE\sql\query', dwTimeout: 5000

<Np::OpenPipe|ERR|SNI> ProviderNum: 1{ProviderNum}, SNIError: 40{SNIError}, NativeError: 53{WINERR} <-- ERROR_BAD_NETPATH = network path was not found

We can see the connection string, which was also available in the SharePoint ULS Log. We also see some entries around protocol enumeration; this is where we look at the client registry items to see which protocols we will go through and in what order (TCP, NP, LPC, etc.). Then we see TCP trying to connect. You'll recall I mentioned that we try TCP first by default. We then see that this received a Windows error of 10055 (WSAENOBUFS). We then see Named Pipes fail with error 53, which is ERROR_BAD_NETPATH. We got what we were looking for out of the BIDTrace.

WSAENOBUFS is the key here. It is a Winsock error for which we actually have a KB article.

When you try to connect from TCP ports greater than 5000 you receive the error 'WSAENOBUFS (10055)'
http://support.microsoft.com/kb/196271

There is a registry key called MaxUserPort which can increase the number of dynamic ports that are available. In Windows Server 2003, the dynamic port range topped out under port 5000. Starting in Windows Server 2008, the range was increased, as we used to see a lot of problems here, especially when connection pooling was not being used. Here is the port range on my Windows 8.1 machine.

image

And for a Windows 2008 R2 Server, which the customer was using:

image
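Those screenshots show the dynamic port range, which you can inspect yourself with netsh (shown here for TCP over IPv4):

netsh int ipv4 show dynamicport tcp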

I have 64510 ports available. On the customer's machine, they mentioned that for a previous issue an engineer had asked them to add this registry key, and they set the value to 4999. By setting it to 4999, we are effectively limiting the number of ports that would otherwise have been available. If you look back at the connection string, you can see that Pooling was set to False. This means connection pooling is turned off, and every time we go to connect, we establish a new hard connection. This eats up a port. You can look at NETSTAT to see what it looks like. We did that while running the provisioning scripts and saw it get up to around 3000 or so before it was done. You will also see a lot of ports in a TIME_WAIT status. When you disconnect and the port is released, it goes into a TIME_WAIT state for a set amount of time, the default of which is around 4 minutes. That's 4 minutes you can't use that port. If you are opening and closing connections a lot, you will run out of ports because many will be sitting in the TIME_WAIT state. That's typically when we would bump up the number of ports using the MaxUserPort registry key. However, this is never really a fix; you are just putting a band-aid on without understanding the problem.
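As a rough check, you can count the ports sitting in TIME_WAIT from a command prompt (a sketch; add a filter on the SQL Server's address if you want to narrow it down):

netstat -ano -p tcp | find /c "TIME_WAIT"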

End result…

In our case, Project Server was turning off connection pooling. I don't know why it does that, but that, in conjunction with MaxUserPort being set to 4999, was causing this issue. We removed the MaxUserPort registry key and rebooted AppServerB, and it started working after that. Of note, we also started the Project Application Server service on AppServerA and cleaned up the TCP registry keys on that machine as well, so that they could effectively balance their load across the SharePoint App Servers.

 

Adam W. Saxton | Microsoft SQL Server Escalation Services
http://twitter.com/awsaxton


sp_reset_connection – Rate Usage (Don’t fight over the grapes)


Discussions surrounding sp_reset_connection behavior and usage come up all the time, and over the years I have discussed the subject hundreds of times with customers. Blogs, API documentation and Books Online describe SQL Server pooled connection behavior.

Under the covers, SQL Server uses the sp_reset_connection logic to 'reset' the connection state, which is faster than establishing a completely new connection. Older drivers send the procedure call as a separate TDS round-trip to the SQL Server. Newer client drivers set a flag bit on the next command, avoiding the extra network round-trip.

The discussions quickly turn to: "What is a reasonable rate of sp_reset_connections on my server?" As usual, the answer is: it depends.

However, a cursory inspection of the rate usually puts it in one of two buckets: 'I am not that worried' or 'my stomach hurts.' The simple, old smell test works pretty well for getting a high-level understanding of your system. Then you can work with your application developers to tune the behavior accordingly.

The documentation refers to the concepts of Open Late and Close Early, meaning you open the connection right before you need it and release it as soon as it is no longer needed. This allows the connection pool to work optimally by sharing the connection whenever possible.

The problem I most often see is that this behavior is taken to the extreme. The development team often cookie-cutters the functional logic:

MyFunction
{
   Get Connection
   Run Query
   Release Connection
}

The function uses Open Late / Close Early just as the documentation pointed out. Now imagine you have dozens or hundreds of these functions in the application. The problem is that a logical business activity seldom calls just a single function.

LoadMyPage
{
    Call Func1
    Call Func2
    Call Func3
}

In this example, the application goes through the connection pool 3 times, resulting in 3 sp_reset_connection operations. This is the worst-case scenario, with a 1:1 ratio of commands to sp_reset_connection invocations.

Don’t fight over the grapes

You may have wondered why this is in the title. It is because I was recently sitting in an airport where two little girls gave me an analogy for sp_reset_connection.

They looked to be about 3 or 4 years old and were sharing a bag of grapes. They started out very polite, each taking just one grape at a time. However, they kept waiting on each other, and as time went on they got a bit more combative, until the mother finally told them not to fight over the grapes: instead of taking one at a time, take a handful. Then they would not be constrained all the time, waiting to get access to the bag of grapes.

If you will, she optimized their activity.

The grapes analogy was perfect. If each command in the application acquires a connection, executes and releases, you are placing pressure and resource constraints on the connection pool. Each time sp_reset_connection executes, it uses resources on the SQL Server and the client.

It is far better to write the application logic to avoid contention points and align with logical units of work. This maintains the concepts of Open Late and Close Early while reducing overhead and improving performance.

LoadMyPage
{
    conn = Get Connection
      Call Func1(conn)
      Call Func2(conn)
      Call Func3(conn)
    Release conn
}

In this example, the connection spans the logic to load the page. The sp_reset_connection-to-command ratio goes from 1:1 to 1:3, removing 2 of the 3 sp_reset_connection operations.

Simply put, you need to find a healthy sp_reset_connection-to-command ratio for your environment. I can tell you that 1:1 is poor; the applications that I see functioning well are usually in the 1:8, 1:10, or 1:15 range.

There is not a hard and fast rule, but using Performance Monitor you can quickly compare the overall batch rate to the connection reset rate. When I see the reset rate climb above 15% of the batch rate, it is a pretty good indication that the application may need to be revisited and tuned a bit.
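If you prefer to check from T-SQL rather than Performance Monitor, a sketch along these lines pulls the two counters (assuming the standard counter names; both are cumulative per-second counters, so sample twice and diff the values to compute actual rates):

select rtrim(object_name) as object_name,
       rtrim(counter_name) as counter_name,
       cntr_value
from sys.dm_os_performance_counters
where counter_name in (N'Batch Requests/sec', N'Connection Reset/sec');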

It is true that the connection reset may not be driving your CPU load (sp_reset_connection has been tuned and is lightweight in general) compared to the overall work done by the T-SQL commands executing on the SQL Server. You should think of this more as a gate than a CPU consumer: before the command that does the work you need can execute, the reset must complete. While the delay is small in wall-clock time, the overall performance of the application could be better with strategic use of the reset behavior.

The load-page example above will simply run faster with the 1:3 ratio because it avoids 2 trips through the pooled connection logic.

With all this said, you need to be careful that you don't extend the ratio too far. Keeping the connection when it is not needed will increase the overall number of connections, using more client and SQL Server resources. You need to find the sweet spot that optimizes client and SQL Server resources and maximizes the application's performance capabilities.

You may also consider recent changes that reduce the overhead of sp_reset_connection on the SQL Server:

http://support.microsoft.com/kb/2926217 

The following are additional references pertaining to sp_reset_connection:

http://blogs.msdn.com/b/psssql/archive/2010/12/29/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-part-2.aspx

http://blogs.msdn.com/b/psssql/archive/2013/02/13/breaking-down-18065.aspx

http://support.microsoft.com/kb/180775

Bob Dorr - Principal SQL Server Escalation Engineer

SQL Nexus 4.0 Released to codeplex


 

We have just released SQL Nexus 4.0 (https://sqlnexus.codeplex.com/), which supports the latest SQL Server release (2012) with enhanced reports.

In addition to reading the release notes, make sure you also read the top issues. Please report any problems on the Issues page at https://sqlnexus.codeplex.com/.

New Release Notes (4.0.0.64):

You must meet the following requirements:

Don’t Rely On a Static IP Address for Your SQL Database


I’ve seen a number of customers open support incidents because they couldn’t connect to their SQL Database server, which was ultimately due to the incorrect assumption that the server’s IP address is static. In fact, the IP address of your logical server is not static and is subject to change at any time. All connections should be made using the fully qualified DNS name (FQDN) rather than the IP address.
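For example, a typical ADO.NET connection string for SQL Database uses the FQDN, never the IP (a hypothetical sketch; the server and database names are placeholders):

Server=tcp:xyz.database.windows.net,1433;Database=mydb;User ID=mylogin@xyz;Password=<password>;Encrypt=True;Connection Timeout=30;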

The following picture from the Windows Azure SQL Database Connection Management Technet article shows the network topology for a SQL Database cluster.

image

Your logical server (e.g., with an FQDN of xyz.database.windows.net) resides on a SQL Database cluster in one of the backend SQL Server nodes. Within a given region (e.g., North Central US, South Central US, North Europe, etc.) there are generally many SQL Database clusters, as required to meet the aggregate capacity of all customers. All logical servers within a cluster are accessed through the network load balancer (the single blue block with the note saying “Load balancer forwards ‘sticky’ sessions…” in the diagram) via a virtual IP address.

If you do a reverse name lookup from your server’s IP address you will actually see the name of the cluster load balancer. For example, if I try to ping one of my servers (whose actual server name starts with ljvt in the screenshot below) you will see that the displayed name associated with the IP address is instead data.sn3-1.database.windows.net, where the sn3-1 portion of the name maps to the specific cluster in the region (South Central) hosting this server.

image

Microsoft may do an online migration of your logical server between clusters within a region, balancing capacity across the clusters in that region. This move is a live operation, and there is no loss of availability to your database during the operation. When the migration completes, existing connections to your logical server are terminated, and upon reconnecting via the fully qualified domain name your app will be directed to the new cluster. However, if your application caches or connects by IP address instead of FQDN, then your connection attempts will fail.

A migration moves all of your settings, including any SQL Database firewall rules that you have. Consequently, there are no Azure-specific changes required in order to connect. However, if your on-premises network infrastructure blocks or filters outgoing TCP/IP traffic to port 1433 (the port used for SQL connections) and you had it restricted to a fixed IP address, then you may need to adjust your client firewall/router. The IP address of your SQL Database server will always be part of the address ranges listed in the Windows Azure Datacenter IP Ranges list. You should allow outgoing traffic for port 1433 to these address ranges rather than to a specific IP address.

Keith Elmore – Principal Escalation Engineer

SQL Server 2014’s new cardinality estimator (Part 1)


One of the performance improvements in SQL Server 2014 is the redesign of cardinality estimation. The component that does cardinality estimation (CE) is called the cardinality estimator. It is an essential component of the SQL query processor for query plan generation. Cardinality estimates are predictions of the final row count and the row counts of intermediate results (such as joins, filtering and aggregation). These estimates have a direct impact on plan choices such as join order, join type, etc. Prior to SQL Server 2014, the cardinality estimator was largely based on the SQL Server 7.0 code base. SQL Server 2014 introduces a new design, and the new cardinality estimator is based on research on modern workloads and learning from past experience.

A whitepaper planned by the SQL Server product team will document specific scenarios where the new and old cardinality estimators differ. We will follow up with a later blog post when that paper is released. Additionally, Juergen Thomas has posted an overview of the feature on the "Running SAP on SQL Server" blog.

In this blog, we will provide a quick overview of how to control the new SQL Server 2014 behavior, along with guidelines on troubleshooting issues. We plan to release more blog posts related to the new cardinality estimator in the future.

One of the goals of this blog post is to make customers aware of this feature for upgrades and new deployments, as query plans may change. We encourage users to test sufficiently prior to upgrading to avoid performance surprises.

New deployments vs upgrade

SQL Server 2014 uses the database compatibility level to determine whether the new cardinality estimator will be used. If the database compatibility level is 120, the new cardinality estimator is used. If you create a new database on SQL Server 2014, its compatibility level will be 120. When you upgrade or restore from a previous version to SQL Server 2014, the database compatibility level is not updated. In other words, you will continue to use the old cardinality estimator in upgrade and restore situations by default. This is to avoid plan change surprises for upgrades. You can manually change the compatibility level to 120 so that the new cardinality estimator is used. Please refer to the online documentation on how to view and change the database compatibility level; a quick example follows. Be aware that changing the database compatibility level will remove all existing query plans for the database from the plan cache.
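For example, to check and change the compatibility level (a sketch; the database name is a placeholder):

-- Check the current compatibility level
select name, compatibility_level from sys.databases where name = N'MyDb';

-- Opt in to the new cardinality estimator
alter database MyDb set compatibility_level = 120;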

Please note the following:

  1. Which version of the cardinality estimator is used is based on the current database context where the query is compiled, even if the query references multiple databases. Let's assume you have db1 with compatibility level 120 and db2 with compatibility level 110, and you have a query that references both databases. If the query is compiled under db1, the new cardinality estimator is used; but if the query is compiled under db2, the old cardinality estimator is used.
  2. You cannot change the database compatibility level for system databases, which are always at the latest compatibility level. This means that queries compiled under the context of system databases use the new cardinality estimator (subject to the server, session and query level controls discussed later in the blog).
  3. If your query references temporary tables, the database context under which the query is compiled determines which version of the cardinality estimator is used. In other words, if your query is compiled under a user database, that user database's compatibility level (not tempdb's) determines which version is used, even though the query references a temp table.

How to tell if you are using new cardinality estimator

There are two ways you can tell if the new cardinality estimator is used.

In the SQL Server 2014 XML plan, there is a new attribute on StmtSimple called CardinalityEstimationModelVersion. When the value is 120, the new cardinality estimator was used; if the value is 70, the old cardinality estimator was used. This attribute is only available in SQL Server 2014 and above (see the screenshot below).

If you start capturing the new SQL Server 2014 XEvent called query_optimizer_estimate_cardinality, this event is produced during compilation when the new cardinality estimator is used. If the old cardinality estimator is used, this XEvent won't be produced even if you enable the capture (see the session sketch below). We will talk more about how to use this XEvent to help troubleshoot cardinality issues in future blogs.
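A minimal event session sketch for capturing that XEvent (the session and file names are placeholders):

create event session ce_trace on server
add event sqlserver.query_optimizer_estimate_cardinality
add target package0.event_file (set filename = N'ce_trace.xel');
go
alter event session ce_trace on server state = start;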

Additional ways to control new cardinality estimator

In addition to the database compatibility level, you can use trace flags 2312 and 9481 to control whether the new or old cardinality estimator is used. Trace flag 2312 forces the new cardinality estimator, while 9481 forces the old cardinality estimator, regardless of the database compatibility level setting. If you enable both trace flags, neither is used to decide; instead, the database compatibility level determines which version of the cardinality estimator is used. When such a case occurs, a new XEvent, query_optimizer_force_both_cardinality_estimation_behaviors, is raised to warn the user (if you have enabled that XEvent).

You can enable these trace flags at the server, session or query level. To enable a trace flag at the query level, use the QUERYTRACEON hint documented in KB 2801413. Below is an example query:

select * from FactCurrencyRate where DateKey = 20101201 option (QUERYTRACEON 2312)

Precedence

Since we have multiple ways to control the behavior, let's talk about the order of precedence. If the query has a QUERYTRACEON hint to enable or disable the new cardinality estimator, it is respected regardless of the server, session or database level settings. If you have a trace flag enabled at the server or session level, it is used regardless of the database compatibility level setting. See the diagram below.

 

 

 

Guidelines on query performance troubleshooting with new cardinality estimator

When you run into issues with the new cardinality estimator, you have the choice to revert to the old behavior. But we encourage you to spend time troubleshooting the query and find out whether the new cardinality estimator even plays a role in your slow query performance. Basic query performance troubleshooting stays the same.

Statistics

Regardless of the version of the cardinality estimator, the optimizer still relies on statistics for cardinality estimates. Make sure you enable auto update and auto create statistics for the database. Additionally, if you have large tables, the auto update statistics threshold may be too high to trigger statistics updates frequently enough. You may need to schedule jobs to manually update statistics.

Indexes

You may not have sufficient indexes on the tables involved in the slow query. Here are a few ways you can tune your indexes.

  1. The XML plan will display a missing index warning for a query.
  2. Missing index DMVs. SQL Server tracks potential indexes that can improve performance in DMVs. This blog has sample queries on how to use the DMVs. Additionally, SQL Nexus also has a report on missing indexes server-wide.
  3. Database Tuning Advisor (DTA) can be used to help you tune a specific query. Not only can DTA recommend indexes, it can also recommend statistics needed for the query. The auto create statistics feature of SQL Server doesn't create multi-column statistics, but DTA can identify and recommend multi-column statistics as well.

Constructs not significantly addressed by the new cardinality estimator

There are a few constructs that are known to have cardinality estimate issues but are not addressed by the new cardinality estimator. Below are a few common ones.

  1. Table variables. You will continue to get a low estimate (1 row) for table variables. This issue is documented in a previous blog.
  2. Multi-statement table valued functions (TVFs): a multi-statement TVF now gets a fixed estimate of 100 rows, instead of 1 in earlier versions. This can still cause issues if your TVF returns many rows. See the blog for more details.
  3. The behaviors of table valued parameters (TVPs) and local variables are unchanged. The TVP's row count at compile time is used for the cardinality estimate, regardless of whether the row count changes for future executions. Local variables will continue to be optimized for unknown.

References

  1. New cardinality estimator online documentation.
  2. Juergen Thomas's blog on New cardinality estimator and SAP applications

In future blogs, we will document more about how to use the new XEvent query_optimizer_estimate_cardinality to troubleshoot query plan issues and how plan guides may be used to control the new cardinality estimator's behavior.

Many thanks to Yi Fang, a Senior Software Design Engineer on the SQL Server Query Processor team at Microsoft, for reviewing and providing technical details for this blog.

 

Jack Li - Senior Escalation Engineer and Bob Ward - Principal Escalation Engineer, Microsoft SQL Server Support

I think I am getting duplicate query plan entries in SQL Server’s procedure cache


Before this post dives into the subject, I need to point out that Keith did most of the work. I just kept pestering him with various scenarios until he sent me the e-mail content I needed. Thanks, Keith!

Keith devised a set of steps that you can use to collect information about the plans and the associated plan cache key attributes. Using these queries you can track down entries in the procedure cache with the same handles and determine which attribute is different, indicating why there appear to be duplicate entries for the same query. It is often as simple as a SET option difference.

From: Keith Elmore

Bob asked me to take a quick look and see if I could make some headway on understanding why there appear to be duplicate plans in cache for the same sql_handle and query_hash. In researching this, I found that if you call a procedure that references a temp table created outside of that scope, the cached plan has the session_id as part of the cache key for the plan. From http://technet.microsoft.com/en-us/library/ee343986(v=SQL.100).aspx:

If a stored procedure refers to a temporary table not created statically in the procedure, the spid (process ID) gets added to the cache key. This means that the plan for the stored procedure would only be reused when executed again by the same session. Temporary tables created statically within the stored procedure do not cause this behavior.
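A minimal sketch of the pattern being described (hypothetical names):

-- The temp table is created outside the sp_executesql scope, so the
-- spid becomes part of the cache key and each session gets its own entry.
create table #t (id int);
exec sp_executesql N'select count(*) from #t';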

Because the customer is invoking this query via sp_executesql and the temp table is created outside of the sp_executesql call, the above condition applies, and the theory is that this could be causing the larger number of entries even though the sql_handle and query_hash are the same. But in order to confirm this theory we need some additional data. If the customer wants to pursue this, the following queries are what I'd want to run:

1. A single execution of this query from SSMS.

-- Look and see if there is any hash bucket with a large number of entries (> 20)
-- which may cause slower lookup of entries
select p1.*
from sys.dm_exec_cached_plans p1
join (select bucketid, count(*) as cache_entries, count(distinct plan_handle) as distinct_plans
      from sys.dm_exec_cached_plans p
      group by bucketid
      having count(*) > 20) as p2
  on p1.bucketid = p2.bucketid

2. Run the following query from SSMS, which will save all of these "duplicate" queries into a permanent table that we’ll retrieve.

-- Save all of the "duplicate" plans for this specific query in a table in tempdb
select qs.sql_handle, qs.statement_start_offset, qs.statement_end_offset,
       qs.creation_time,
       qs.execution_count,
       qs.plan_generation_num,
       p.*
into tempdb..DuplicateCachePlans
from sys.dm_exec_query_stats qs
join sys.dm_exec_cached_plans p
  on qs.plan_handle = p.plan_handle
where qs.sql_handle = 0x0200000093281821F68C927A031EDA1B661FC831C10898D0
  and qs.query_hash = 0x07BD94E2146FD875

3. From a command prompt, bcp out the data from the table above, as well as the plan_attributes data for each of these plans (add the appropriate server name with the -S parameter, and optionally add a path to where you want each output file written):

bcp "select * from tempdb..DuplicateCachePlans" queryout cached_plans.out -n –T

bcp "select p.plan_handle, pa.* from tempdb..DuplicateCachePlans p cross apply sys.dm_exec_plan_attributes (p.plan_handle) as pa" queryout plan_attributes.out -n –T

4. Back in SSMS, you can drop the temp table created in step 2

drop table tempdb..DuplicateCachePlans

-Keith

 Bob Dorr - Principal SQL Server Escalation Engineer
