CSS SQL Server Engineers

Query performance and table variables


Frequently, we see customers using table variables in their stored procedures and batches and experiencing performance problems.

In general, these performance problems arise because a large number of rows is populated into the table variable.

Table variables were introduced in SQL Server 2000 with the intention of reducing recompiles.  Over time, they gained popularity, and many users now populate them with large numbers of rows and then join them with other tables.

When the batch or stored procedure containing the table variable is compiled, the number of rows in the table variable is unknown, so the optimizer has to make some assumptions.  It estimates a very low number of rows for the table variable, which can lead to an inefficient plan.  Most of the time, a nested loop join is used with the table variable as the outer table.  If the table variable holds a large number of rows, the inner table ends up being executed many times.

So if you anticipate a large number of rows being populated into the table variable, avoid it from the start unless you do not intend to join it with other tables or views.

If you do have a large number of rows to populate into the table variable, consider this solution: add OPTION (RECOMPILE) to the statement that joins the table variable with other tables.  With this option, SQL Server detects the number of rows at recompile time, because the rows have already been populated.  This option is only available in SQL Server 2005 and later.

Additionally, you can use temp tables instead, which provide better statistics.
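
If you go the temp table route, a minimal sketch is below (it assumes the t2 table created in the script later in this post).  Because #t1 is a temp table, SQL Server maintains a row count and statistics for it, so the join estimate is reasonable without a recompile hint.

set nocount on
create table #t1 (c1 int)
declare @i int
set @i = 0
while @i < 100000
begin
insert into #t1 values (@i)
set @i = @i + 1
end
--the optimizer sees roughly 100,000 rows in #t1 and can choose a hash or merge join
select * from #t1 inner join t2 on c1 = c2
drop table #t1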

 

The script below demonstrates the cardinality issue and the solution.  I re-arranged the execution plan output here.  Note that EstimateRows for @t1 is 1, but in fact 100,000 rows were populated into the table variable.  The statement with OPTION (RECOMPILE) has an accurate cardinality estimate.

Execution plan without option recompile: [screenshot]

Execution plan with option recompile: [screenshot]

 

/******************************************************
1.  create a permanent table t2 and insert 100,000 rows
*******************************************************/
set statistics profile off
go
use tempdb
go
if OBJECT_ID ('t2') is not null
drop table t2
go
create table t2 (c2 int)
go
create index ix_t2 on t2(c2)
go
--insert 100,000 rows into the perm table
set nocount on
begin tran
declare @i int
set @i = 0
while @i < 100000
begin
insert into t2 values (@i)
set @i = @i + 1
end
commit tran
go
--update stats
update statistics t2

go
/********************************************************
2.  join permanent table with table variable
the table variable gets 100,000 rows inserted

then it is joined to t2
@t1 gets a 1-row estimate
it ends up with a nested loop join
*********************************************************/

set nocount on
declare @t1 table (c1 int)
begin tran
declare @i int
set @i = 0
while @i < 100000
begin
insert into @t1 values (@i)
set @i = @i + 1
end
commit tran
set statistics profile on
select * from @t1 inner join t2 on c1=c2
go

set statistics profile off

go

/****************************************************
3. solution
use statement-level recompile
******************************************************/
declare @t1 table (c1 int)
set nocount on
begin tran
declare @i int
set @i = 0
while @i < 100000
begin
insert into @t1 values (@i)
set @i = @i + 1
end
commit tran
set statistics profile on
select * from @t1 inner join t2 on c1=c2 option (recompile)
go

set statistics profile off

go


Slow query performance because inaccurate cardinality estimate when using anti-semi join following upgrade from SQL Server 2000


 

We have had a few customers report that some of their queries run slower following an upgrade from SQL Server 2000 to SQL Server 2005, 2008, or 2008 R2.  Specifically, the queries experiencing the issue have anti-semi joins in the query plan, and the join involves multiple columns in the join condition.

Anti-semi joins result from query constructs like NOT EXISTS and NOT IN.  Here is an example of a query that would result in an anti-semi join:

SELECT t1.*
FROM tst_TAB1 t1
WHERE NOT EXISTS( SELECT *   FROM tst_TAB2 t2     WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2 )

Note that you only experience this issue when multiple joining columns are involved, as in the example above.

If you examine the query plan, you can spot the issue.  In this query execution plan output (re-arranged for ease of explanation), the left anti semi join (merge join) returned 2808 rows, but EstimateRows shows an estimate of only 1 row.


An inaccurate estimate affects the overall query plan and can lead to slow performance.

Solution:

This is a product regression, and we have produced fixes for SQL Server 2005, 2008, and 2008 R2.  Currently the fixes for SQL Server 2005 and 2008 are released; refer to KB http://support.microsoft.com/kb/2222998 for this fix.  The SQL Server 2008 R2 fix is being planned, and the same KB will be updated to reflect that fix once it becomes available.  Please note that you will need to enable trace flag 4199 to activate the performance fix.
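
For reference, a minimal sketch of enabling the trace flag globally at runtime is below; it can also be added as a -T4199 startup parameter so it persists across restarts.

DBCC TRACEON (4199, -1)
DBCC TRACESTATUS (4199)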

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support

Error 17053 - SidePageTable::Init() DeviceIoControl() : Operating system error 1(Incorrect function.) encountered.


When attempting to run DBCC CHECKDB or create a snapshot database, you may encounter the following errors when using a UNC location that does not support all sparse file operations.

2010-09-21 17:27:26.47 spid82      Error: 17053, Severity: 16, State: 1.

2010-09-21 17:27:26.47 spid82      SidePageTable::Init() DeviceIoControl() : Operating system error 1(Incorrect function.) encountered.

2010-09-21 17:27:26.47 spid82      Error: 17204, Severity: 16, State: 1.

2010-09-21 17:27:26.47 spid82      FCB::Open failed: Could not open file \\MyServers\smb2\MSSQL10.MSSQLSERVER\MSSQL\DATA\MYDB.mdf:MSSQL_DBCC11 for file number 1.  OS error: 1(Incorrect function.).
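
For context, both operations rely on sparse files: DBCC CHECKDB uses an internal database snapshot, and a snapshot database is an explicit one.  A minimal sketch of the two commands that can hit this condition, reusing the database name and UNC path from the errors above (the logical file name is hypothetical), is:

dbcc checkdb ('MYDB')
go
create database MYDB_snap
on (name = 'MYDB', filename = N'\\MyServers\smb2\MSSQL10.MSSQLSERVER\MSSQL\DATA\MYDB.ss')
as snapshot of MYDB
go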

This error was encountered on a NAS device.  Specifically, the SidePageTable initialization makes a call to the Windows API DeviceIoControl using FSCTL_QUERY_ALLOCATED_RANGES.

The following is an example of the call used by SQL Server.

    retcode = DeviceIoControl (
                handle,
                FSCTL_QUERY_ALLOCATED_RANGES,
                &startRange,
                sizeof (startRange),
                ranges,
                sizeof (ranges),
                &bytesReturned,
                NULL);

    error = GetLastError ();

 

http://msdn.microsoft.com/en-us/library/aa364582(VS.85).aspx


If you are encountering this issue, contact your hardware manufacturer to obtain the proper device and driver updates.

Note:  SQL Server 2008 R2 supports databases on SMB 2.0 compliant devices.  The SQLIOSim.exe that ships with SQL Server 2008 R2 has been updated to allow testing against a UNC location.   Previous versions of SQLIOSim.exe failed with an error when calling the Windows API GetVolumeNameForVolumeMountPoint.

Currently SQLIOSim.exe can test sparse file implementations but does NOT attempt FSCTL_QUERY_ALLOCATED_RANGES, and as such will not expose the issue that could be encountered by DBCC CHECKDB or snapshot database creation.

Bob Dorr - Principal SQL Server Escalation Engineer

Case of using filtered statistics


SQL Server 2008 introduces a new feature called filtered statistics.  When used properly, it can dramatically improve cardinality estimates.  Let's use the example below to illustrate how a cardinality estimate can be incorrect and how filtered statistics can improve the situation.

We have two tables.  Region has only 2 rows.  The Sales table has 1,001 rows, but only 1 row has an id of 0; the rest have an id of 1.

Table Region

id name
0 Dallas
1 New York

Table Sales

id detail
0 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
1 13
1 14
1 15
1 16
1 1000

 

Now let's look at the query "select detail from Region join Sales on Region.id = Sales.id where name='Dallas'".  To a human, it is immediately obvious that only one row will qualify: for Dallas there is only one row in the Region table and one matching row in the Sales table.  But the optimizer does not know that when the query is compiled, before it is executed.  To know it, SQL Server would essentially have to execute the query halfway, filter out the values for Dallas, take the id of 0, and then evaluate how many rows with that id exist in the Sales table.  In other words, it would require incremental execution.

If you execute the query, you will get a plan like the one below.  Note that the nested loop join estimated 500.5 rows, but only 1 row was actually retrieved.


Now let’s see what happens if we create a statistics on Region.id but put a filter on name (“Dallas”).  Here is the statement “create statistics Region_stats_id on Region (id) where name = 'Dallas'”.

Now if you execute the same select statement (select detail from Region join Sales on Region.id = Sales.id where name='Dallas'), the cardinality estimate is correct as shown below for the nested loop join.


What happens here is that the filtered statistics object (create statistics Region_stats_id on Region (id) where name = 'Dallas') is used for optimization.  When SQL Server optimizes the query, it sees that there is a statistics object whose filter matches the where clause.  It then discovers that there is only 1 row with an id of 0 and thus is able to produce a correct estimate.

A correct cardinality estimate is very important for complex joins, as it dramatically affects join order and join types.
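
As a side note (not part of the original example), one way to confirm what a filtered statistics object covers is DBCC SHOW_STATISTICS; once Region_stats_id is created as shown above, the stat header reports the filter expression and the unfiltered row count.

dbcc show_statistics ('Region', Region_stats_id) with stat_header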

Here is a complete demo:

if OBJECT_ID ('Region') is not null
drop table Region
go
if OBJECT_ID ('Sales') is not null
drop table Sales
go

create table Region(id int, name nvarchar(100))
go
create table Sales(id int, detail int)
go
create clustered index d1 on Region(id)
go
create index ix_Region_name on Region(name)
go
create statistics ix_Region_id_name on Region(id, name)
go
create clustered index ix_Sales_id_detail on Sales(id, detail)
go

-- only two values in this table as lookup or dim table
insert Region values(0, 'Dallas')
insert Region values(1, 'New York')
go

set nocount on
-- Sales is skewed
insert Sales values(0, 0)
declare @i int
set @i = 1
while @i <= 1000 begin
insert Sales  values (1, @i)
set @i = @i + 1
end
go

update statistics Region with fullscan
update statistics Sales with fullscan
go

set statistics profile on
go
--note that this query will over-estimate
-- it estimates 500.5 rows but only 1 row is returned
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)
--this query will under-estimate
-- it also estimates 500.5 rows but in fact 1000 rows are returned
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go

set statistics profile off
go

create statistics Region_stats_id on Region (id)
where name = 'Dallas'
go
create statistics  Region_stats_id2 on Region (id)
where name = 'New York'
go

set statistics profile on
go
--now the estimate becomes accurate (1 row) because stats Region_stats_id is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)

--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go

set statistics profile off

Query performance and plan cache issues when parameter length not specified correctly


We recently worked with a customer who reported that his update to a linked server table was running very slowly.  This only happens when he doesn't specify the character parameter length in the .NET code.  The scenario also raises a plan cache issue, so it is worth a blog post.

 

Let’s use this simplified .NET example.  Here is the .NET code:

SqlConnection conn = new SqlConnection (@"Data Source=Server1;Integrated Security=SSPI");
conn.Open();
SqlCommand cmd = conn.CreateCommand();
cmd.CommandText = @"update Server2.master.dbo.t set c2 = @c2 where c1 = 1";
string str = "abc";
cmd.Parameters.Add("@c2", SqlDbType.VarChar).Value = str;
cmd.ExecuteNonQuery();

Furthermore, the table t on the remote server has this definition: "create table t (c1 int, c2 varchar(500))".

In the above code, the query runs on Server1 but updates a table on Server2.  The query is parameterized to take @c2 as a parameter, but when adding the parameter, the code doesn't specify the length of @c2.

When the latest SQL Server .NET provider sees this, it determines the length of the string str and uses that as the length of the @c2 parameter.  This translates into the following query:

exec sp_executesql N'update Server2.master.dbo.t set c2 = @c2 where c1 = 1',N'@c2 varchar(3)',@c2='abc'

 

Why does the above query perform slowly?

The table t on the remote server defines column c2 as varchar(500), but the parameterized update (as translated by the provider) declares @c2 as varchar(3).  The parameter length does not match the column length.  In linked server scenarios, SQL Server is very cautious with updates and inserts when character data lengths do not match, for fear of truncating characters.  So SQL Server generates a plan that brings all the data to the local server, performs the update locally, and then sends the update back to the remote server.

Here is the plan.  As you can see, a "Remote Scan" is used to bring the entire table locally.

update [Server2].master.dbo.t set c2 = @c2 where c1 = 1
  |--Remote Update(SOURCE:(Server2), OBJECT:("master"."dbo"."t"), SET:([Server2].[master].[dbo].[t].[c2] = [Expr1003]))
       |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(varchar(500),[@c2],0)))
            |--Table Spool
                 |--Filter(WHERE:([Server2].[master].[dbo].[t].[c1]=(1)))
                      |--Remote Scan(SOURCE:(Server2), OBJECT:("master"."dbo"."t"))

One can argue that SQL Server should know that in this particular case there is no risk of truncation, because the parameter length is less than the column length.  Currently SQL Server is conservative here, and this behavior is being looked at for future releases.  However, even if it were addressed, the approach the application is taking has other disadvantages, which I discuss below.

Problems of not specifying parameter length

  1. Linked server performance:  As illustrated above, linked server performance suffers because the parameter length doesn't match the table definition.  This happens for updates and inserts because both are at risk of character truncation.
  2. Multiple compiles and plan cache pollution:  In the .NET code at the very beginning, you can supply strings of different lengths.  For example, if you supply string str="abcd", the provider generates a parameterized query like this: "exec sp_executesql N'update [Server2].master.dbo.t set c2 = @c2 where c1 = 1',N'@c2 varchar(4)',@c2='abcd'".  Note that the length of parameter @c2 is now 4.  This requires a different plan for the parameterized query, so you get multiple compiles and multiple plans cached.  Note that this problem is not specific to linked server queries; any query done this way will pollute the procedure cache.  This affects all parameterized statements such as insert, update, delete, and select.  The query sketch after this list shows one way to observe the duplicate plans.
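
The following is a hedged sketch of how you might observe that cache pollution with the plan cache DMVs; the LIKE filter is only illustrative, matching the update statement used in this post.  Each distinct declared length (varchar(3), varchar(4), ...) shows up as its own cached plan.

select st.text, cp.usecounts, cp.size_in_bytes
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_sql_text(cp.plan_handle) st
where st.text like N'%update Server2.master.dbo.t set c2 = @c2%'
and st.text not like N'%dm_exec_cached_plans%'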

Solutions

The solution is to change your application so that the declared parameter length matches the column length.  In this example, you would write "cmd.Parameters.Add("@c2", SqlDbType.VarChar, 500).Value = str;".  Note that the last argument is the length of the parameter.  After doing this, you solve both problems mentioned above: the linked server query runs fast, and you have just one copy of the plan in cache.

After the change, you will get the following plan.  Note that "Remote Query" means the entire update was sent to the remote server without pulling data locally for processing.

update [Server2].master.dbo.t set c2 = @c2 where c1 = 1
  |--Remote Query(SOURCE:(Server2), QUERY:(UPDATE "master"."dbo"."t" set "c2" = ?  WHERE "c1"=(1)))

 

Jack Li | Senior Escalation Engineer |Microsoft SQL Server Support

Numbers are better than letters…


Back in a post from May (http://blogs.msdn.com/b/psssql/archive/2010/05/30/don-t-touch-that-schema.aspx), I reported that SSRS references fields by index instead of by name in code for performance benefits.  As a follow up to that, I decided to do some testing to demonstrate the performance benefit of this approach.  Here is the code I wrote for the testing (note: for simplicity's sake, I am running this against a ReportServer catalog database):

 

Imports System.Data.SqlClient
Imports System.Data

Module Module1

    Sub Main()

        Dim tsName As New DateTime
        Dim tsNum As New DateTime
        Dim teName As New DateTime
        Dim teNum As New DateTime

        Dim totalName As Long
        Dim totalNum As Long
        Dim iterations As Int16 = 10000
        Dim strCn As String = "Integrated Security=SSPI;Initial Catalog=ReportServer;Data Source=myserver"
        Dim cmdTxt As String = "select * from ExecutionLog2"

        For i As Int32 = 1 To iterations
            Dim cn As New SqlConnection(strCn)
            Dim cmd As New SqlCommand(cmdTxt, cn)
            cn.Open()
            Dim dr As SqlDataReader = cmd.ExecuteReader(CommandBehavior.CloseConnection)

            'randomly pick a field number to reference
            Dim rand As New Random()
            Dim fieldnum As Int16 = rand.Next(0, dr.FieldCount - 1)

            'now, get it by index
            dr.Read()
            tsNum = DateTime.Now
            Dim val1 As Object = dr.Item(fieldnum)
            teNum = DateTime.Now
            totalNum += teNum.Subtract(tsNum).Ticks

            'close the connection
            cn.Close()

            If (i Mod 1000) = 0 Then Console.WriteLine(i)

        Next i
        Console.WriteLine("By index (ms): " + (totalNum / 10000).ToString)

        'Now repeat the process by name
        For i As Int32 = 1 To iterations
            Dim cn As New SqlConnection(strCn)
            Dim cmd As New SqlCommand(cmdTxt, cn)
            cn.Open()
            Dim dr As SqlDataReader = cmd.ExecuteReader(CommandBehavior.CloseConnection)

            'randomly pick a field number to reference
            Dim rand As New Random()
            Dim fieldnum As Int16 = rand.Next(0, dr.FieldCount - 1)

            'now, get the field by name
            dr.Read()
            fieldnum = rand.Next(0, dr.FieldCount - 1)
            tsName = DateTime.Now
            Dim fieldname As String = KnownFieldName(fieldnum)
            Dim val2 As Object = dr.Item(fieldname)
            teName = DateTime.Now
            totalName += teName.Subtract(tsName).Ticks

            'close the connection
            cn.Close()

            If (i Mod 1000) = 0 Then Console.WriteLine(i)

        Next i
        Console.WriteLine("By name (ms): " + (totalName / 10000).ToString)

    End Sub

    Private Function KnownFieldName(ByVal num As Int16) As String
        Select Case num
            Case 0
                Return "InstanceName"
            Case 1
                Return "ReportPath"
            Case 2
                Return "UserName"
            Case 3
                Return "ExecutionId"
            Case 4
                Return "RequestType"
            Case 5
                Return "Format"
            Case 6
                Return "Parameters"
            Case 7
                Return "ReportAction"
            Case 8
                Return "TimeStart"
            Case 9
                Return "TimeEnd"
            Case 10
                Return "TimeDataRetrieval"
            Case 11
                Return "TimeProcessing"
            Case 12
                Return "TimeRendering"
            Case 13
                Return "Source"
            Case 14
                Return "Status"
            Case 15
                Return "ByteCount"
            Case 16
                Return "RowCount"
            Case 17
                Return "AdditionalInfo"
        End Select

        'if we don't hit a case statement, throw an exception
        Throw New System.NotSupportedException
    End Function



End Module

And, here’s the output:

1000
2000
3000
By index (ms): 530053
1000
2000
3000
By name (ms): 1020102


1000
2000
3000
By index (ms): 580058
1000
2000
3000
By name (ms): 710071


1000
2000
3000
By index (ms): 510051
1000
2000
3000
By name (ms): 850085

As you can see, accessing the fields by index took an average of roughly 540 seconds for 3000 iterations, or about 0.18 seconds per access.  Accessing them by name took roughly 860 seconds for 3000 iterations, or about 0.29 seconds per access.  Personally, I'll take a tenth of a second of performance improvement when I am running a service that might serve thousands of simultaneous requests.

Evan Basalik | Senior Support Escalation Engineer | Microsoft SQL Server Escalation Services

How It Works: Extended Event (sqlos.wait_info*)


I was posed a good question today about how the wait_info* events work in SQL Server 2008.  The easiest way for me to answer the question was to prove the behavior; using WAITFOR DELAY shows it nicely.

 

From: Robert Dorr
Sent: Wednesday, October 20, 2010 2:07 PM
Subject: RE: Extended Events

 

The wait types are similar to those exposed in sys.dm_os_wait_stats.

 

The code shows me that completed_count is the number of times the action has been completed.  In the case of something like NETWORK_IO, SQL Server checked and found that lots of them had completed.  The wait time for many of these could have been 0 (already completed).

 

You can use WAITFOR to observe the behavior.  You can see below that I waited for 5 seconds and the outputs show the millisecond values.

 

 

select * from sys.dm_os_wait_stats
      where wait_type = 'WAITFOR'
go

waitfor delay '00:00:05'
go

select * from sys.dm_os_wait_stats
      where wait_type = 'WAITFOR'
go

create event session WaitTest on server
add event sqlos.wait_info
add target package0.asynchronous_file_target
(set filename=N'c:\temp\Wait.xel')
with (max_dispatch_latency=1 seconds)
go

alter event session WaitTest on server state = start
waitfor delay '00:00:05'   -- wait while the session is active so the event gets captured
alter event session WaitTest on server state = stop

select * from fn_xe_file_target_read_file('c:\temp\*.xel', 'c:\temp\*.xem', NULL, NULL)
where event_data like '%5000%'
go

 

<event name="wait_info" package="sqlos" id="48" version="1" timestamp="2010-10-20T19:02:31.797Z">
  <data name="wait_type"><value>189</value><text><![CDATA[WAITFOR]]></text></data>
  <data name="opcode"><value>1</value><text><![CDATA[End]]></text></data>
  <data name="duration"><value>5000</value><text></text></data>
  <data name="max_duration"><value>5000</value><text></text></data>
  <data name="total_duration"><value>20001</value><text></text></data>
  <data name="signal_duration"><value>0</value><text></text></data>
  <data name="completed_count"><value>4</value><text></text></data>
</event>
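
Not part of the original reply, but a sketch of how the same file target data can be shredded with XQuery instead of reading the raw XML; the column names follow the event fields shown above.

select
    x.evt.value('(data[@name="wait_type"]/text)[1]', 'nvarchar(60)')    as wait_type,
    x.evt.value('(data[@name="duration"]/value)[1]', 'bigint')          as duration_ms,
    x.evt.value('(data[@name="completed_count"]/value)[1]', 'bigint')   as completed_count
from (
    select cast(event_data as xml) as event_xml
    from fn_xe_file_target_read_file('c:\temp\*.xel', 'c:\temp\*.xem', NULL, NULL)
) as f
cross apply f.event_xml.nodes('/event[@name="wait_info"]') as x(evt)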

 

 Bob Dorr - Principal SQL Server Escalation Engineer

Query Performance and multi-statement table valued functions


Lately I worked with a customer to help tune a query involving a multi-statement table valued function.  When using table valued functions, you should be aware of a couple of things.

First, there are two types of table valued functions: inline table valued functions (inline TVFs) and multi-statement table valued functions (multi-statement TVFs).  An inline TVF is one whose body is just a single SELECT statement; there is no return table variable.  A multi-statement TVF has a return table variable, and the function body contains statements that populate that table variable.  The demo at the end of this blog has examples of both.

Secondly, a multi-statement TVF in general gets a very low cardinality estimate.

If you use an inline TVF, it is as if you are just using a view, and if it takes parameters, it behaves like a parameterized view.  The final plan will not reference the TVF itself; instead, all the objects it references appear in the final plan.

But a multi-statement TVF is treated just like another table.  Because no statistics are available for it, SQL Server has to make some assumptions and in general produces a low estimate.  If your TVF returns only a few rows, it will be fine.  But if you populate the TVF with thousands of rows and join it with other tables, the low cardinality estimate can result in an inefficient plan.

In the demo, I created a TVF called tvf_multi_Test() and then joined it with other tables using the query below.

select c.ContactID, c.LastName, c.FirstName, Prod.Name,
COUNT (*) 'number of units'
from Person.Contact c inner join
dbo.tvf_multi_Test() tst on c.ContactID = tst.ContactID
inner join Production.Product prod on tst.ProductID = prod.ProductID
group by c.ContactID, c.LastName, c.FirstName, Prod.Name

As you can see from the plan here, the estimates are off (resulting from the Table Scan on tvf_multi_test)


Solutions

  1. If you don’t plan to join a multi-statement TVF with other tables, you are OK, because the low cardinality estimate doesn’t matter.
  2. If you know that your multi-statement TVF will always return a small number of rows, you are OK as well.
  3. Use an inline TVF when possible.  In the demo, a multi-statement TVF is unnecessary; by changing it to an inline TVF, the estimates become accurate.
  4. If you anticipate that a large number of rows will result from executing the multi-statement TVF and you need to join this TVF with other tables, consider putting the results from the TVF into a temp table and then joining with the temp table, as in the sketch below.
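
Here is a minimal sketch of option 4; it assumes the tvf_multi_Test() function and the AdventureWorks objects used in the demo that follows.

select ContactID, ProductID
into #SaleDetail
from dbo.tvf_multi_Test()

select c.ContactID, c.LastName, c.FirstName, Prod.Name,
COUNT (*) 'number of units'
from Person.Contact c inner join
#SaleDetail tst on c.ContactID = tst.ContactID
inner join Production.Product prod on tst.ProductID = prod.ProductID
group by c.ContactID, c.LastName, c.FirstName, Prod.Name

drop table #SaleDetail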

Demo

/*
Purpose: to demonstrate that the estimate for a multi-statement table valued function
  will be incorrect when it returns a large number of rows
setup: it requires the SQL 2008 AdventureWorks sample database
*/

/*************************************************************
1. creating a TVF to populate from a few other tables
**************************************************************/
use AdventureWorks
go

if OBJECT_ID ('tvf_multi_Test') is not null
    drop function tvf_multi_Test
go

/*
creating multi-statement TVF
*/
create function tvf_multi_Test()
returns @SaleDetail table (ContactID int, ProductId int)
as
begin
insert into @SaleDetail
select ContactID, ProductID from Sales.SalesOrderHeader soh inner join
Sales.SalesOrderDetail sod on soh.SalesOrderID = sod.SalesOrderID
return
end

go

/*************************************************************
2.  exec plan with the multi-statement TVF
**************************************************************/
set statistics profile on
set statistics io on
set statistics time on
go
/*
  the estimate is inaccurate for tvf_multi_Test (always 1 row)
  the plan is not efficient because it drove 121,317 index seeks on the Product table
  and an additional 121,317 seeks on the Contact table
*/
select c.ContactID, c.LastName, c.FirstName, Prod.Name,
COUNT (*) 'number of units'
from Person.Contact c inner join
dbo.tvf_multi_Test() tst on c.ContactID = tst.ContactID
inner join Production.Product prod on tst.ProductID = prod.ProductID
group by c.ContactID, c.LastName, c.FirstName, Prod.Name

go
set statistics profile off
set statistics io off
set statistics time off

go

/*************************************************
3. re-write to use inline table valued function
*************************************************/
if OBJECT_ID ('tvf_Inline_Test') is not null
    drop function tvf_Inline_Test
go
create function tvf_Inline_Test()
returns table
as
return select ContactID, ProductID
    from Sales.SalesOrderHeader soh
        inner join Sales.SalesOrderDetail sod
        on soh.SalesOrderID = sod.SalesOrderID

go

/*****************************************************
4. exec plan for inline TVF
this will get good plan.
In fact, you no longer see the table valued function in
the plan.  It behaves like a view
******************************************************/
set statistics profile on
set statistics io on
set statistics time on
go

select c.ContactID, c.LastName, c.FirstName, Prod.Name,
COUNT (*) 'number of units'
from Person.Contact c inner join
dbo.tvf_inline_Test() tst on c.ContactID = tst.ContactID
inner join Production.Product prod on tst.ProductID = prod.ProductID
group by c.ContactID, c.LastName, c.FirstName, Prod.Name

go
set statistics profile off
set statistics io off
set statistics time off

Jack Li |Senior Escalation Engineer|Microsoft SQL Server Support


How It Works: SQL Parsing of Number(s), Numeric and Float Conversions


       

SQL Server documentation (and other documentation) has always indicated that float values are not precise and that comparison or conversion of them can be problematic and imprecise.  A recent customer case required me to dig into the single/double precision floating point format as well as the SQL Server NUMERIC format and handling.

 

The application was designed on SQL Server 6.x, when BIGINT did not exist.  In order to handle large integer values, the application used a SQL Server float data type and never stored anything other than zeros (0's) in the decimal positions.  The developers knew that the float could be imprecise and tried to account for it when building their where clauses using >= style predicates.  However, they later found out that when the numbers got large, such as 576460752305, and there were .000000 (6 zeros) vs .00000 (5 zeros) in the decimal positions, they could encounter rounding behaviors.

Resolution: In this particular case only whole, readable integer values were being used, so the column should be changed to BIGINT.  BIGINT requires 8 bytes of storage, just like float.  If decimal precision were required, the column should be converted to NUMERIC.
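
A minimal sketch of that resolution is below; the table and column names are hypothetical, since the real schema belonged to the customer.

-- works when the float column holds only whole numbers within the bigint range
alter table dbo.OrderTracking alter column TrackingId bigint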

Several versions ago the SQL Server parser was changed to use the NUMERIC data type for all numeric literals.  Prior to this, the values were treated within domain ranges and types.  For example:

        1 = INTEGER

        1. =  FLOAT – Scale of 0

        1.0 =  FLOAT – Scale of 1

 

The problem with the old design was that it led to unwanted conversions in plans (usually table scans).  Using NUMERIC, we have a number capable of handling large values with exact precision, which can be manipulated cleanly and avoids unwanted conversions.

I want to take a specific look at the numeric to float conversion, because float is imprecise and floating point arithmetic can fool you.

The sqltypes.h file included with the sql.h header outlines the SQL_NUMERIC_STRUCT structure.

typedef struct tagSQL_NUMERIC_STRUCT
{
   SQLCHAR  precision;
   SQLSCHAR scale;
   SQLCHAR  sign;
   SQLCHAR  val[16];
} SQL_NUMERIC_STRUCT;

Simplified, a NUMERIC is a 128-bit integer value.  Take the number, remove the decimal point (ignore it for the 128-bit integer view), store the digits in the val member, and keep the decimal position (power of 10) in the scale.

Here is a small program that shows how a NUMERIC is converted to a float by the SQL Server components.  I used 4 ULONG values for the 128-bit value, and it shows how the double is accumulated and converted.

#include "stdafx.h"
#include "windows.h"
#include "math.h"
#include "assert.h"

const ULONGLONG x_dwlBaseUI4 = 0x100000000ui64;
const ULONG     x_ulBase10   = 10;

int main(int argc, char* argv[])
{
       ULONG         m_l128Val[4] = {};
       ULONGLONG     ullVal = 576654928035000000;   //  576654928035.000000
       BYTE          bScale = 6;

       assert(0xFFFFFFFFFFFFFFFF > ullVal);         // Use only 2 slots for the test

       m_l128Val[0] = ullVal & 0xFFFFFFFF;
       m_l128Val[1] = (ullVal >> 32) & 0xFFFFFFFF;

       int           iulData  = 0;
       double        dVal     = 0.0;
       ULONGLONG     ullAccum = 0;

       for (iulData = 2; iulData >= 0; iulData--)
              dVal = m_l128Val[iulData] + dVal * x_dwlBaseUI4;
       dVal /= pow ((double) x_ulBase10, bScale);

       //     Testing Sanity Check
       for (iulData = 2; iulData >= 0; iulData--)
              ullAccum = m_l128Val[iulData] + ullAccum * x_dwlBaseUI4;
       assert(ullAccum == ullVal);

       dVal = ullAccum / (pow ((double) x_ulBase10, bScale));

       return 0;
}

The value in the sample is specifically picked because floating point arithmetic results in a small change from what appears, to the reader, to be an integer value.  You can break this down to a very simple form: take the large integer that fits in a UINT64 and place the decimal point based on the power of 10 given by the scale.  Notice that the floating point arithmetic results in the addition of 0.00012 to the readable value.

ULONGLONG f6 = 576654928035000000ui64;

dRetVal1 = f6 / pow ((double) 10.0, 6);
dRetVal2 = f6 / (double)1000000;
dRetVal3 = f6 / 1000000.0;

dRetVal1   576654928035.00012   double
dRetVal2   576654928035.00012   double
dRetVal3   576654928035.00012   double

Note:  When testing I see differences between a 32-bit project running under WOW64 and a pure x64 project.

 

SQL Server stores float values as 8-byte, double precision values.  You can see the parser and data storage in action with a simple select using a cast.  The additional 1 at the end of the binary storage represents the 0.00012.

select cast( cast(576654928035.00000 as float) as varbinary)      -- 0x4260C869FD146000
select cast( cast(576654928035.000000 as float) as varbinary)    --   0x4260C869FD146001

The floating point arithmetic is outlined well on various internet sites so I am not going to repeat the details here.   You can research more using: "ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic"  if you want to try to reverse the binary to the floating point value.  I prefer to use '.formats' in the Windows Debugger and let it do the heavy lifting for me.  It is difficult to predict the behavior without running the algorithm for a very broad set of values.

I wanted to know why, when I called atof with 6 digits to the right of the decimal, it returned .000000 and not .000001 like the SQL NUMERIC to floating point conversion was doing.  The answer is an optimization in the atof logic; that is, it appears to consider trailing 0's to the right of the decimal position insignificant to a degree.  (_IsZeroMan is used to check for zero mantissa values.)

 

After running some simple iteration tests, I found that with six zeros (.000000) the imprecision relative to the raw integer really starts to show at 576460752305.000000 and larger values when 64-bit precision is in play.

 

for(ULONGLONG ullValue = 576654928035000000; ullValue > 476654928035000000 && ullValue < 0xFFFFFFFFFFFFFFFF; ullValue -= 1000000)
{
       double dVal = ullValue / pow(10.0, 6);
       ULONGLONG ullCast = ullValue / pow(10.0, 6);

       if(ullCast != dVal)
       {
              printf("----- %f\r\n", dVal);
       }
}

 

NOTE: You can't just assume the display on the client is the same as the storage in SQL Server.  For example, the following statements print the same on the client, but they don't actually match when a physical/storage comparison takes place.

select cast(576460752305.000000 as float)
select cast(cast(576460752305.000000 as float) as numeric)

if( cast(cast(576460752305.000000 as float) as numeric) = cast(576460752305.000000 as float))
begin
    print('Match')
end
else
begin
    print('NOT A Match')    -- !!! This branch is printed !!!
end

Looking at the actual storage you can see the difference. 

select cast(cast(576460752305.000000 as float) as varbinary)   -- storage (0x4260C6F7A0B61FFF)
select cast(576460752305 as float)                             -- storage (0x4260C6F7A0B62000)
atof("576460752305.000000") = 576460752305.000000              -- storage (0x4260C6F7A0B62000)

 

Breaking this down you can see the actual double precision value when using 64 bit precision.

 

S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 10000100110 0000110001101111011110100000101101100010000000000000

Exponent: 10000100110 = 1062 - 1023 = 39, 2^39 = 549755813888

Mantissa (with the implicit leading 1): 10000110001101111011110100000101101100010000000000000 = 4722366482882560

4722366482882560 / 2^52 (4503599627370496) = 1.0485760000028676586225628852844

549755813888 * 1.0485760000028676586225628852844 = 576460752304.99999999999999999999   <-- Close but not exact

If you force different rounding behavior by manipulating the precision and scale you can see the storage values change.

select cast(cast(cast(cast(576460752305.000000 as float(53)) as numeric(38,3)) as float) as varbinary)  -- storage (0x4260c6f7a0b62000)
select cast(cast(cast(cast(576460752305.000000 as float(53)) as numeric(38,4)) as float) as varbinary)  -- storage (0x4260c6f7a0b61fff)

If you look at the display in SQL Server Management Studio for these variants you get the same value of 576460752305 but the actual storage of the float sent from the SQL Server contains the additional precision.

select cast(cast(576460752305.000000 as float(53))  as varbinary)

 

0x4260C6F7A0B61FFF

 

2010-11-01 13:31:38.22 spid51      ODS Event: language_exec : Xact 0 ORS#: 1, connId: 0

2010-11-01 13:31:38.22 spid51         Text:select cast(cast(576460752305.000000 as float(53))  as varbinary)

2010-11-01 13:31:38.22 Server      2010-11-01 13:31:38.

2010-11-01 13:31:38.23 Server      spid51    

2010-11-01 13:31:38.23 Server      Printing send buffer:

2010-11-01 13:31:38.23 Server      04 01 00 2D 00 33 01 00 81 01 00 00 00 00 00 21   ...-.3.........!  

2010-11-01 13:31:38.23 Server      00 A5 1E 00 00 D1 08 00 42 60 C6 F7 A0 B6 1F FF   ........B`......  

2010-11-01 13:31:38.23 Server      FD 10 00 C1 00 01 00 00 00 00 00 00 00            ............. 

 

select cast(cast(576460752305.0000 as float(53))  as varbinary)

 

0x4260C6F7A0B62000

 

2010-11-01 13:30:00.56 spid51      ODS Event: language_exec : Xact 0 ORS#: 1, connId: 0

2010-11-01 13:30:00.57 spid51         Text:select cast(cast(576460752305.0000 as float(53))  as varbinary)

2010-11-01 13:30:00.58 Server      2010-11-01 13:30:00.

2010-11-01 13:30:00.58 Server      spid51    

2010-11-01 13:30:00.58 Server      Printing send buffer:

2010-11-01 13:30:00.59 Server      04 01 00 2D 00 33 01 00 81 01 00 00 00 00 00 21   ...-.3.........!  

2010-11-01 13:30:00.59 Server      00 A5 1E 00 00 D1 08 00 42 60 C6 F7 A0 B6 20 00   ........B`.... .  

2010-11-01 13:30:00.60 Server      FD 10 00 C1 00 01 00 00 00 00 00 00 00            ............. 

 

select cast(576460752305.0000 as float(53))

select cast(576460752305.000000 as float(53))

 

----------------------

576460752305 

----------------------

576460752305

Using a different SQL Server client (SQLCMD.exe), the value displays as 576460752304.99988, showing the additional precision.  Looking closer at SSMS, it appears to round to the 3rd decimal position by default.

select cast(576460752305.001901 as float(53))       -- 576460752305.002

For safety use the convert when selecting the data to validate the values.

select convert(varchar(32), cast(576460752305.0000 as float(53)), 2)
select convert(varchar(32), cast(576460752305.000000 as float(53)), 2)

--------------------------------
5.764607523050000e+011
--------------------------------
5.764607523049999e+011

I can now get SQL Server and printf to display the value as 576460752305, but using the following you can see that we get different values once various floating point instructions (divsd, mulsd, ...) are involved.

 

double dValueAgain = atof("576460752305.000000");
printf("%f\n", dValueAgain);                //  576460752305.000000 - storage (0x4260c6f7a0b62000)

ULONGLONG ullCheck = 576460752305000000;
double dNew = ullCheck / pow(10.0, 6);
printf("%f\n", dNew);                       //  576460752304.999880 - storage (0x4260c6f7a0b61fff)

 

Stepping into the logic for atof, there is logic to check for an all-zero mantissa as well as to round the mantissa, which appears to keep the floating point value closer to the readable integer value a user might expect from the conversion.  Looking at the assembly for the ULONGLONG division, the CPU uses xmm and/or floating point instructions.  Depending on the precision control (PC) flags you might get different behaviors.  Reference: http://msdn.microsoft.com/en-us/library/c9676k6h.aspx

 

I updated my sample to force different precision levels

 

_controlfp_s(&currentFP, _PC_24, MCW_PC);
double dNew = ullCheck / pow(10.0, 6);
printf("%f\n", dNew);                       //  576460752305.000000 - storage (0x4260c6f7a0b62000)

_controlfp_s(&currentFP, _PC_64, MCW_PC);
ULONGLONG ullCheck = 576460752305000000;
double dNew = ullCheck / pow(10.0, 6);
printf("%f\n", dNew);                       //  576460752304.999880 - storage (0x4260c6f7a0b61fff)

 

To understand this better I looked at the raw, binary format for the double precision float values.

 

select cast(cast(576460752305 as float) as varbinary)

576460752305       = 1000011000110111101111010000010110110001
0x4260C6F7A0B62000 = 100001001100000110001101111011110100000101101100010000000000000

select cast(cast(576460752304 as float) as varbinary)

576460752304       = 1000011000110111101111010000010110110000
0x4260C6F7A0B60000 = 100001001100000110001101111011110100000101101100000000000000000

 

Looking at the double precision format details again, you can see the difference between the values ...305 and ...304 below.

 

                  S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
576460752305 =    0 10000100110 0000110001101111011110100000101101100010000000000000

10000100110 = 1062 - 1023 = 39, 2^39 = 549755813888

10000110001101111011110100000101101100010000000000000 = 4722366482882560
4722366482882560 / 2^52 (4503599627370496) = 1.0485760000028676586225628852844

549755813888 * 1.0485760000028676586225628852844 = 576460752304.99999999999999999999   <-- Close but not exact

Looking at the mathematical formula for this, it would be the following.

549755813888 * X = 576460752305

Transform the equation:

576460752305 / 549755813888 = X

Solve for X = 1.0485760000028676586225628852844

576460752304 =    0 10000100110 0000110001101111011110100000101101100000000000000000

10000100110 = 1062 - 1023 = 39, 2^39 = 549755813888    <-- Same exponent

10000110001101111011110100000101101100000000000000000 = 4722366482874368
4722366482874368 / 2^52 (4503599627370496) = 1.0485760000010486692190170288086

549755813888 * 1.0485760000010486692190170288086 = 576460752304

 

Thinking about this more, it becomes clear that the exponent alone can only represent approximately 2^11 (~2048) exact values if you leave the mantissa all zeros (there are special cases for NaN and the infinite states).  With an all-zero mantissa the mathematics is exponent * 1.0, an exact match.  Anything other than 1.0 for the mantissa has the possibility of varying from a strict, whole integer value.

 


 

When looking at representing whole integer values, each power-of-2 step has to represent 1/nth of the step.  For example, the only integer between 2 and 4 is 3, so 1/n for 1/2 = .50 and 3 = 2 + (2 * .50).  Looking at 4 to 8, we need to represent 5, 6, and 7, which are the base (4) plus .25, .50, and .75 of the base respectively.  When you get out to larger powers of 2, the precision becomes more complex and plays a bigger factor.

 

Going back to the original statement: floating point quality involves many factors, from precision handling, to which conversion is involved, to what the client vs. the server uses for the default conversions; all of these can play a factor in the values you see.  Just because you see a number does not mean it is the raw value; you may need to go to the actual storage format to determine that.

 

Bob Dorr - Principal SQL Server Escalation Engineer

CSS once again comes to the SQL PASS Summit


I’m proud once again to be a part of our CSS team coming to the US SQL PASS Summit next week in Seattle. As in the past, our team will be speaking at a pre-conference seminar, main conference talks, and available to talk to you in person at the SQL Server Clinic.

Here is a list of the talks we will be giving:

Pre-Conference Seminar

Microsoft BI Deployment Lessons Learned
Mon 11/8 8:30am-4:30pm

Adam Saxton
Todd Carter
John Sirmon

If you want to learn more about how to deploy our BI technologies, I think you will really get a lot out of this session. One of the things I believe you will really like in this session is how we cover SharePoint. SharePoint has become such a major part of the BI story, and you won't be disappointed in the practical advice our team of speakers will give you regarding SharePoint integration.

Main Conference Talks

(DBA-226-M) Increasing Uptime by Managing SQL Server Configurations From the Cloud
Tues 11/9 10:15-11:30am 612
Paul Mestemaker
Bob Ward

Want to learn more about what I've been doing for the past 3 months (and not spending much time blogging)? This talk covers it. Join me and Paul Mestemaker for this session.

(BIA-399-A) SQLCAT: Configuring and Securing BI Applications in a SharePoint 2010 Environment
Wed 11/10 10:15-11:30am 602-604
Carl Rabeler
Adam Saxton

Adam Saxton from our team joins Carl Rabeler from the SQLCAT to talk about integration of our BI technologies with Sharepoint 2010.

(DBA-499-C) Kerberos, SQL and You
Wed 11/10 1:30-2:45pm 4C4
Adam Saxton

As a DBA, you would hope you don't have to learn anything about Kerberos. Unfortunately, that is not true. Adam does a great job here of making the topic practical, at a level you can understand and apply to your environment.

(BIA-388-C) Troubleshooting SSRS Performance
Wed 11/10 1:30-2:45pm 606-607
Evan Basalik

Evan Basalik is one of our engineers from the Charlotte, NC office, and one of the areas he spends a great deal of time on is SSRS. Trying to find out why your reports are not performing to your expectations is not a trivial task, so any developer or administrator deploying reports will walk away with tips they can use immediately in an SSRS environment.

(DBA-599-C) Inside SQL Server Latches
Wed 11/10 4:30-6pm 2AB
Bob Ward

This year I asked some of our MVPs what kind of deep internals session they wanted me to present, and Inside SQL Server Latches was by far the #1 vote. So you asked for it, and you got it. This is not for the timid; I'm sure a Windows Debugger session will pop up at some point in the talk, so consider yourself warned. Latches are not something you want to have to worry about, but that is not reality. I do think you will walk away with a better understanding of how they can affect your application and the operation of SQL Server.

SQL Server Clinic

Last year may have been the best I've seen in our interaction with customers. We partnered with SQLCAT last year and we are doing it again this year. This is a very unique opportunity I don't think attendees know exists. In the clinic you will meet some of our top CSS engineers along with top SQLCAT architects. Have a question about architecture and design? You can talk to SQLCAT. Having a problem with SQL Server or a question you can't answer? You can chat with CSS. So bring your laptops so we can debug your problems live. In addition to our speakers listed above, Suresh Kandoth, John Gose, and other CSS engineers will be staffing the clinic in room 611. We open right after the keynote and generally wrap up each day around 6pm. We are there Tuesday, Wednesday, and Thursday.

I look forward to this conference every year. It is a unique opportunity to meet customers face to face. I hope to see many of you there I’ve met over the years and new ones I’ve yet to meet. Please don’t hesitate to stop me during the conference and just chat about your experiences with SQL Server or CSS.

See you in Seattle.

Bob Ward
Microsoft

 

The week that was PASS…


I’m sitting in SEATAC Airport waiting for my flight, so I thought now would be a great time to write up some thoughts and share some experiences of my week at PASS.  There was a ton of great information this year and we also announced Denali – the next version of SQL Server!  From what I could tell and hear, people were very impressed with the Crescent demo and looked forward to getting their hands on it.  Still have to wait a little bit for that though as it isn’t in CTP1.

Someone also mistook me for Adam Machanic.  While I was flattered, I was not he. 

Sessions

We had a great showing at the CSS sessions.  Evan mentioned his talk was well attended, Bob Ward's is always packed, and my Kerberos talk drew way more people than I expected.  Hopefully the attendees got some great information out of it.  I personally enjoyed the back and forth with Paul Randal during Bob's talk.  If you ever meet up with Bob, ask him what Paul Randal's kid's nickname for Bob is.  Smile

Atlanta

http://www.microsoftatlanta.com

Atlanta was announced at PASS this year!  This is an incredible tool.  Bob, Suresh, and Paul Mestemaker (and crew) have been putting countless hours into Atlanta, and I think people will see the benefit and be a little surprised by what they find on their servers.  There are great rules in the system.  The Kerberos SPN rule was demoed at Ted Kummert's keynote last Tuesday.  Adding to that, in the SQL Clinic room I referred several people to Atlanta to help with the potential Kerberos issues they were having.  It came up several times.  It was a nice change from confusing customers with everything they would need to check and how to put it all together, to just being able to say: put Atlanta on there and let it tell you what needs to be done.  A real time saver!

We were able to take a minute Wednesday night and get a shot of the CSS Atlanta team along with Paul.  I don’t remember if this was before or after the Sushi that Bob had to eat because we got the beta out at PASS.  He was a trooper with that though and ate his fair share!

 


Twitter

One thing that was amazing for me was meeting my fellow tweeps.  It was great to put some faces and names to the Twitter feed, especially when you find out you are sitting behind someone you've been talking to on Twitter and didn't realize it.  It was great to meet you, @erinstellato!  I really enjoyed connecting with people and talking about their frustrations and experiences.  I'm sure next year will be even better!

SQL Clinic

Every year the SQL Clinic room seems to get more intense, and this year was no exception.  The room was packed every day and there were a lot of great conversations.  As always, it was great working next to the SQL CAT team; the collaboration between the two groups is great.  Cindy did an awesome job getting customers to someone who could help!

Here are some of the types of questions I got this week…

 

1. How do I get the column headers in a report to repeat on every page? (I get this question every year – multiple times)

The trick with this one is that there are actually two settings you need to make sure are selected within the report.  The first is "Repeat header columns on each page" under the tablix properties.  The second is a little hidden: you have to show Advanced Mode for the groupings area.  There will be a static entry for the header row, and in its properties there is an option "RepeatOnNewPage" that needs to be set to True.  That should get it going.

 

2. Is there another way to get a hash value for your row outside of BINARY_CHECKSUM? 

BINARY_CHECKSUM has limitations in that the value may not be unique depending on the length.  Another option is to use the HashBytes function, which can compute an MD5 hash of a given value.  The problem is that it doesn't necessarily cover the whole row, so you would need to come up with a way to calculate a row signature using it; a rough sketch follows.
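
As a rough sketch (the table and column names here are made up, assuming c1 is an int, c2 a varchar, and c3 a datetime; pick a separator and null handling that suit your data):

-- note: in SQL Server 2008 the HASHBYTES input is limited to 8000 bytes
select HASHBYTES('MD5',
    isnull(cast(c1 as varchar(20)), '') + '|' +
    isnull(c2, '') + '|' +
    isnull(convert(varchar(30), c3, 126), '')) as RowSignature
from dbo.MyTable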

 

3. Lots and lots of Kerberos questions…

Between the Atlanta demo, my Kerberos talk, and people describing their issues without realizing they were Kerberos issues, there was a lot of traffic on this topic.  We worked through it though.  I'm not going to go into detail in this blog post, but I will be getting more posts out on the topic, as well as a dive into the whole Claims topic.  Also be sure to check out the Atlanta beta and take advantage of the Kerberos rules and how they relate to your SQL Server.

 

4. I’m getting a “Login failed for user ‘LinkedServerAccount’”

When an actual login name is in the error, chances are it's a permission issue, or the password wasn't entered correctly if it is a SQL authentication account.  This is not a Kerberos issue.  We were able to RDP into the affected server right from the Clinic room and correct the issue.  In this case, we created a test account and changed the linked server credentials to match the new test account, and that worked.  With the same customer, we also helped look at her cluster setup, as a cluster node was not coming up.

 

5. I’ve enabled SSL for Reporting Services, but the links generated in the Report have http, not https.

When you enable SSL for Reporting Services through the Reporting Services Configuration Manager tool, it does not adjust the SecureConnectionLevel setting in the rsreportserver.config file, which keeps its default value of 0.  Setting this to 3 will get the links to use https instead of http.

 

6. I have a SQL Cluster and want to install Reporting Services on the cluster as well to avoid additional licenses.  What happens to Reporting Services when SQL fails over and will there be a virtual URL for RS similar to the VNN for the SQL Cluster?

This actually comes up pretty frequently for customers wanting some sort of high availability for Reporting Services.  The important thing to realize is that RS is not cluster-aware.  So, even though you are putting it on a cluster node, from an RS perspective, treat it like any other box.  The process will be running on all of the nodes you installed it on, and it will not be represented as a cluster resource.  Those instances will have different URLs and may have different instance names depending on how you installed them.  Refer to http://msdn.microsoft.com/en-us/library/bb522745.aspx.  The way to get high availability with RS is to do a scale-out deployment and put a network load balancer (NLB) in front of it with a virtual URL for the NLB.

 

7. How does Database Mirroring actually handle the failover from the connection side of things (failoverPartner)? (I think I got this question about 7 or 8 times from different customers)

I'm going to do a blog post on the internals of this and how it works.  It was clear that there were some misunderstandings of how this is actually handled, but the questions I got on this topic were well thought out and very good.  Look for that post in early December after I get back from vacation.  It's on my list!

There were many more Reporting Services questions, along with a ton of connectivity-related issues and general troubleshooting advice.  I was also surprised by the number of Database Mirroring questions.  All in all, I was impressed with the types of questions I got.  The PASS community definitely is a bunch of smart people!

SQL KILT

The last thing I will leave you with is that I will be participating in the SQL Kilt Wednesday event.  Mark Souza said that we would get matching bright green kilts to go with the SQL CAT shirts.  We'll see if we can find them.  Should be awesome though!

 

Thanks for another great year at SQL PASS Summit!  I will see everyone again next year…

Adam W. Saxton | Microsoft SQL Server Escalation Services
http://twitter.com/awsaxton

Microsoft Atlanta Begins….



There is an incredible level of satisfaction in being able to turn an idea into something real. Something valuable. Years ago at Microsoft I talked to my colleagues about a vision where we could provide our CSS knowledge using online technologies instead of having customers struggle to find KB articles on the web; a service that would give customers specific advice based on their installation. That vision has now become a reality in the form of Microsoft Codename "Atlanta". Partnering with the System Center product team, engineers in our SQL CSS team have been able to provide our knowledge in the form of alerts (or rules) as advice for what you should check on your SQL Server 2008 or SQL Server 2008 R2 installation. This advice is based on common issues we encounter when working with customers on our support cases. I've been working on the Atlanta project for some time (which is why my blogging rate has gone down) but of course could not talk about it until now.

Over the next month I’ll be posting various blogs on this site about how Atlanta works, what kind of alerts and advice we are providing, why it may be important for you, and where we are heading in the future for this project.

But for now I’ll let you look at some resources to help you get started:

1) Watch a video from this week’s PASS Summit keynote on Tuesday. Ted Kummert, Senior VP, Business Platform Division at Microsoft, asked me to give a brief demo of Atlanta during the keynote. To view the entire recording of the keynote including this demo, use this link for on-demand viewing:

http://www.sqlpass.org/summit/na2010/LiveKeynotes/Tuesday.aspx

2) Try out the beta. How?

Just go to http://www.microsoftatlanta.com. Sign-up using a Windows Live ID (you can create one from this site), run the very simple setup program, and start looking at what advice Atlanta provides you about your SQL Server installation using the same web site.

Turning ideas and visions into a real product is only half the story. Someone has to actually like it and want to use it. So I decided to search on twitter to see what folks were saying after we launched on Tuesday. The results I’m happy to say are mostly very positive (look towards the end of these tweets for comments at the conference).

http://twitter.com/#!/search/Microsoft%20atlanta

Here is also a blog post from one of our customers who installed Atlanta and did an early review:

http://sqlblog.com/blogs/john_paul_cook/archive/2010/11/10/microsoft-atlanta-first-look.aspx

We have all kinds of knowledge in support we wish our customers could find easily. Atlanta is a game changer for us (and hopefully you) because we now have that online connection to provide our knowledge, update it regularly, and target it to each customer’s configuration and installation.

Bob Ward
Microsoft

 

HOW IT WORKS: IO Affinity Mask - Should I Use It?


The IO Affinity mask question has come across my desk several times in the last week so it is time to blog about it again.

The IO Affinity mask is a very targeted optimization for SQL Server.  I have only seen 6 cases where the use of it improved performance and was warranted.  It is much more applicable to very large databases needing high rates of IO on 32 bit systems.   Allow me to explain with a simplified example.

If you are running on 32 bit you have a limited amount of RAM, so the LazyWriter process could be very busy with buffer pool maintenance as large data scans and such take place.   This means SQL Server could be producing large rates of IO to flush older buffers and supply new buffers in support of your queries.  In contrast, a 64 bit system can utilize larger RAM and reduce the IO churn for the same scenario.

image

image

Each IO that SQL Server processes requires completion processing.  When the IO completes SQL Server has to check for proper completion (bytes transferred, no operating system errors, proper page number, correct page header, checksum is valid, etc…)   This takes CPU resources.

IO Affinity was designed to offload the completion CPU resources to a hidden scheduler.   When IO Affinity is enabled a hidden scheduler is created with a special lazy writer thread that only does IO operations.  This is why the documentation tells you never to assign the affinity mask and IO affinity mask to the same CPUs.  If you do, the schedulers will compete for the same CPU resources, which is just what you were trying to avoid.  When you don't use IO affinity the SQL Server worker handles (posts) the IO and takes care of the IO completion on the scheduler the worker was assigned to.

Customer Scenario:  I want Instance #1 to only use CPU #1 on my system so I set affinity mask and IO affinity mask.  (WRONG)

REASON: Setting both results in a context switch for each IO request to a different worker on the hidden scheduler.   Setting just the affinity mask would be sufficient, as the IO would be processed on the normal scheduler the worker is already assigned to.

The following shows the affinity mask and IO affinity mask assigned to the same scheduler (improper configuration) as they compete for the same CPU resources.

image

The following shows the proper setup of affinity mask and IO affinity mask.

image

Notice that in the proper configuration the only SQL Server activity assigned to the IO affinity scheduler is the IO activity.  This configuration would assume that the amount of IO activity on the SQL Server is intense enough to consume significant resources on its own CPU.   In the vast majority of installations, especially 64 bit, this is simply not the case and IO affinity is not necessary.
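For reference, both masks are set through sp_configure.  The sketch below assumes a hypothetical 8-CPU box where CPU 0 is reserved for IO completion and CPUs 1-7 run the query schedulers; adjust the bitmask values for your own hardware, keep the two masks from overlapping, and note that a change to the IO affinity mask requires a restart to take effect.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- CPUs 1-7 (binary 11111110) for the query schedulers
EXEC sp_configure 'affinity mask', 254;

-- CPU 0 (binary 00000001) for the hidden IO completion scheduler
EXEC sp_configure 'affinity I/O mask', 1;
RECONFIGURE;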

Bob Dorr - Principal SQL Server Escalation Engineer

AlwaysON - HADRON Learning Series - What Is HADRON?



I have been reviewing and working on supportability aspects of HADRON (High Availability Disaster Recovery - AlwaysON) for months and I am kicking off a blog post series related to my 'HADRON Learning Series' which I am putting together for the SQL Server support team.

You can download CTP1 and try it yourself:  http://www.microsoft.com/downloads/en/details.aspx?FamilyID=6a04f16f-f6be-4f92-9c92-f7e5677d91f9&displaylang=en 

AMAZING: Just a Few Clicks And You Have HADR

The HADRON design took a look at all the SQL Server technologies customers had put in place, from replication, log shipping, and database mirroring to other HA implementations, and set forth a goal of one technology that really meets the HA needs, allowing the other technologies to focus on what they were originally intended for.

My first reaction was that the Windows Server Failover Cluster (WSFC) requirement was - ugh!    However, this was based on a lack of knowledge and not reality.  I set up a multi-node cluster using Windows 2008 R2 with a majority node quorum in only a few clicks.  IT WAS SO EASY.  The new Windows 2008 wizards and validations are fantastic and the HADRON integration is seamless.

Then I learned that HADRON is NOT a clustered instance of SQL Server, so you install a standalone instance and the Availability Group becomes the cluster resource.   None of that 'install for cluster, add node' setup of SQL Server.  You just click through a simple standalone instance installation.  You pop into Configuration Manager, enable the instance to allow HADRON capabilities, restart the SQL Server, and you are ready to go.   - JUST THAT SIMPLE!

image

Create a database or two (HADRON will allow multiple databases per Availability Group so you can truly fail over at an application centric level).

Pop into SQL Server Management Studio |  Management | Availability Groups and add a New Availability Group.

image

This wizard will help you set up the HADRON Availability Group.  You can select the target secondary(s) and which databases, save it as a T-SQL script, and create the endpoints; it even allows you to start the synchronization by taking a backup and restoring it for you.   It takes about 5 clicks and you have a fully functioning HA solution for your database(s).
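To give you a feel for what the wizard scripts on your behalf, here is a minimal sketch of an Availability Group definition in T-SQL.  The server names, database names, ports, and endpoint URLs here are hypothetical, and since this series is based on pre-release software the exact options and syntax generated by the wizard may differ.

-- One-time: create the HADR endpoint on each replica (port is hypothetical)
CREATE ENDPOINT Hadr_endpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = ALL);
GO

-- Define the Availability Group on the primary (names and URLs are hypothetical)
CREATE AVAILABILITY GROUP MyAG
FOR DATABASE MyDb1, MyDb2
REPLICA ON
    'NODE1' WITH (ENDPOINT_URL = 'TCP://node1.contoso.com:5022',
                  AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                  FAILOVER_MODE = AUTOMATIC),
    'NODE2' WITH (ENDPOINT_URL = 'TCP://node2.contoso.com:5022',
                  AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                  FAILOVER_MODE = AUTOMATIC);
GO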

Example: Availability Group As A Cluster Resource

image

Here are a couple of images I am using internally to explain what HADRON is, does and will do.

image

image

Video(s)

I have been doing targeted training using a series of short, roughly one-minute videos.   My intention is to make these available to the community as I write this series of blog posts on HADRON.

WARNING:  The series is based on pre-release software so things could change but I will attempt to provide you with the best information I can! 

Bob Dorr - Principal SQL Server Escalation Engineer

AlwaysON - HADRON Learning Series - New DMVs



As I have been reviewing and learning about HADRON I have found a great wealth of information exposed in the DMVs.  I have pulled together a starting entity relationship diagram and the HADRON PMs have been helping me refine it.  We continue to update it for changes, but I thought you would find it helpful as well as you start to use HADRON.

WARNING:  The series is based on pre-release software so things could change but I will attempt to provide you with the best information I can! 

image

Bob Dorr - Principal SQL Server Escalation Engineer


AlwaysON - HADRON Learning Series - Running DBCC On A Secondary



HADRON allows DBCC to be executed directly against a secondary replica.  The DBCC can be run online, as is, or with TABLOCK if the HADR activity is suspended so the DBCC can acquire the database lock necessary to support the TABLOCK option.

WARNING:  The series is based on pre-release software so things could change but I will attempt to provide you with the best information I can! 

A secondary that allows connections enables the administrator to execute DBCC CHECKDB.   The log blocks are being shipped and redone on the secondary so the DBCC is able to execute as if it was being run on a  primary replica.   The DBCC can be executed in one of two ways, ONLINE or with TABLOCK.

ONLINE is the most common as it does not require the HADR activity to be suspended in order to execute.   Online DBCC works just like the online DBCC you are used to today.  It creates an internal snapshot and performs copy-on-write activity in order to check a specific point in time while allowing redo to progress.  The difference when running it on the secondary is that the point in time on the secondary replica may be behind the primary, based on your synchronization settings and capabilities.

To avoid the internal snapshot the DBCC can be executed with TABLOCK.  In order to allow DBCC checkdb to obtain the proper database lock you must first suspend the HADR activity on the database.  Run the DBCC checkdb(MyDb) with TABLOCK and then resume the HADR activity.   It goes without saying that suspending the HADR activity can lead to a backlog of log blocks and cause the database log file(s) to grow on the primary.

Command Examples

DBCC CHECKDB(MyDb)

ALTER DATABASE MyDb SET HADR SUSPEND
DBCC CHECKDB(MyDB) with TABLOCK
ALTER DATABASE MyDb SET HADR RESUME

Bob Dorr - Principal SQL Server Escalation Engineer

How It Works: Error 18056 - The client was unable to reuse a session - Part 2


I have had several questions on my blog post: http://blogs.msdn.com/b/psssql/archive/2010/08/03/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-with-spid-which-had-been-reset-for-connection-pooling.aspx related to SQL Server 2008's honoring of a query cancel (attention) during the processing of the reset connection.  This blog will augment my prior post.

Facts

  • You will not see the sp_reset_connection on the wire when tracing the network packets.   It is only a bit set in the TDS header and not RPC text in the packet.
  • sp_reset_connection is an internal operation and generates RPC events to show its activity.
  • Newer builds of SQL Server added logical disconnect and connect events. http://blogs.msdn.com/b/psssql/archive/2007/03/29/sql-server-2005-sp2-trace-event-change-connection-based-events.aspx
  • An attention from the client (specific cancel or query timeout) records the time it arrives (out-of-band), but the attention event is not produced until the query has ceased execution and honored the attention.   This makes the start time of the attention the received time, the end time the time the attention was fully honored, and the duration how long it took to interrupt the execution, handle rollback operations if necessary, and return control of the session to the client.

The questions normally center around the Error 18056, State 29 and how one can encounter it.   I have outlined the high level flow in the diagram below for producing the error.

The application will reuse a connection from the pool.   When this occurs the client driver will set the reset bit in the TDS header when the next command is executed.  In the diagram I used an ODBC example of SQLExecDirect.

  1. The command is received at the SQL Server, assigned to a worker, and processing begins.  If the reset bit is located the sp_reset_connection logic is invoked.
  2. When tracing, the RPC:Starting and logical Disconnect events are produced.
  3. The login is redone: checking permissions, making sure the password has not expired, that the database still exists and is online, that the user has permission in the database, and other validations take place.
  4. The client explicitly cancels (SQLCancel) or a query timeout is detected by the client drivers and an attention is submitted to the SQL Server.   The attention is read by the SQL Server, the starting time is captured, and the session is notified of the cancel request, STOP!  (Note:  This is often a point of confusion.  The overall query timeout applies to the reset login and the execution of the query in this scenario.)
  5. During all these checks the logic will also check to see if a query cancellation (attention) has arrived.  If so, the Redo Login processing is interrupted, the 18056 is reported, and processing is stopped.
  6. The attention event is always produced after the completed event.  (When looking at a trace, look for the attention event after the completed event to determine if the execution was cancelled.)  This allows the attention event to show the duration required to honor the attention.  For example, if SET_XACT_ABORT is enabled an attention will upgrade to a rollback of the transaction.   If it was a long running transaction the rollback processing could be significant.   Without SET_XACT_ABORT the attention interrupts processing as quickly as possible and leaves the transaction active.   The client is then responsible for the scope of the transaction.

image

The "If Cancelled" used by Redo Login is where the change occurs between SQL 2005 and SQL 2008.   The cancel was not checked as frequently in SQL 2005 so it was not honored until the command execution started.   SQL Server 2008 will honor the attention during the redo login processing. 

Here is an example that I received that shows the behavior.   Notice that the execution (rs.Open) is done asynchronously so control returns to the client as soon as the query is put on the wire to the SQL Server.   The cn.Cancel following the rs.Open will submit the attention for the request that was traveling to the SQL Server.   This will produce the same pattern as shown in the diagram above, interrupting the Redo Login.  If you were not using pooled connections the reset activity would not be taking place and the query itself would be interrupted.

dim cn
dim rs

set cn = CreateObject("ADODB.Connection")
set rs = CreateObject("ADODB.Recordset")

for i = 1 to 1000
    cn.Open "Provider=SQLNCLI10;Integrated Security=SSPI;Data Source=SQL2K8Server; initial catalog =whatever;"
    rs.ActiveConnection = cn
    rs.CursorLocation = 2
    ' 48 = adAsyncExecute + adAsyncFetch
    rs.Open "select * from whatever", cn, 0, 1, 48
    cn.Cancel
    cn.Close
next

Internally an attention is raised as a 3617 error and handled by the SQL Server error handlers to stop execution of the request.   You can see the 3617 errors in the sys.dm_os_ring_buffers.  You can watch them with the trace exception events as well.

<Record id="1715" type="RING_BUFFER_EXCEPTION" time="12558630"><Exception><Task address="0x11B4D1B88"></Task><Error>3617</Error><Severity>25</Severity><State>23</State><UserDefined>0</UserDefined></Exception><Stack ...
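A quick way to pull these records out, rather than reading the raw XML by hand, is to query the ring buffer DMV and filter on the error number.  This is just a sketch of one way to do it:

SELECT rb.timestamp,
       x.record.value('(Record/Exception/Error)[1]', 'int')    AS error_number,
       x.record.value('(Record/Exception/Severity)[1]', 'int') AS severity,
       x.record.value('(Record/Exception/State)[1]', 'int')    AS state
FROM sys.dm_os_ring_buffers AS rb
CROSS APPLY (SELECT CAST(rb.record AS xml) AS record) AS x
WHERE rb.ring_buffer_type = N'RING_BUFFER_EXCEPTION'
  AND x.record.value('(Record/Exception/Error)[1]', 'int') = 3617;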

Bob Dorr - Principal SQL Server Escalation Engineer

Tracking database recovery progress using information from DMV


You must be very familiar with the database recovery related messages printed to the SQL Server Error log. These come in very handy when troubleshooting issues that are related to long recovery. These messages provide information about the stage of the recovery process and approximate time for completion.

2010-12-29 12:02:10.43 spid25s     Starting up database 'testdb'.
2010-12-29 12:02:31.23 spid25s     Recovery of database 'testdb' (11) is 0% complete (approximately 1725 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2010-12-29 12:03:08.94 spid25s     Recovery of database 'testdb' (11) is 1% complete (approximately 1887 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
2010-12-29 12:03:08.95 spid25s     Recovery of database 'testdb' (11) is 1% complete (approximately 1887 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2010-12-29 12:04:57.97 spid25s     Recovery of database 'testdb' (11) is 43% complete (approximately 192 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2010-12-29 12:04:57.97 spid25s     580056 transactions rolled forward in database 'testdb' (11). This is an informational message only. No user action is required.
2010-12-29 12:04:57.97 spid25s     Recovery of database 'testdb' (11) is 43% complete (approximately 192 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2010-12-29 12:06:32.05 spid25s     Recovery of database 'testdb' (11) is 99% complete (approximately 2 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2010-12-29 12:06:35.09 spid25s     1 transactions rolled back in database 'testdb' (11). This is an informational message only. No user action is required.
2010-12-29 12:06:35.09 spid25s     Recovery is writing a checkpoint in database 'testdb' (11). This is an informational message only. No user action is required.
2010-12-29 12:06:35.44 spid25s     Recovery completed for database testdb (database ID 11) in 244 second(s) (analysis 37849 ms, redo 109038 ms, undo 97146 ms.) This is an informational message only. No user action is required.

Starting with SQL Server 2008, you do not need to repeatedly open/refresh the SQL Server error log or execute the stored procedure sp_readerrorlog to get up-to-date information about the progress of the database recovery. Most of the information is readily available in the dynamic management views [DMV]. The two DMVs that offer insights into the progress of database recovery are sys.dm_exec_requests and sys.dm_tran_database_transactions. The information presented in these DMVs varies depending upon the situation: recovery of databases during server startup, recovery of a database after an attach operation, or recovery of a database after a restore operation.

Here is a view of sys.dm_exec_requests showing recovery related information: [I repeatedly queried this DMV and loaded it into a temporary table for later analysis]

RollBack status information from requests dmv

The key information here is the command type [DB STARTUP] and the session_id, which indicates this is a system task performing the startup recovery. The percent_complete column shows the same value that the error log messages report about the progress within the current stage of recovery. You can use the information from the other columns [wait information] to understand if the recovery is taking much longer due to an I/O issue or other problems.

Since the redo and undo portions can actually happen on different tasks in fast recovery scenarios in Enterprise Edition, you could actually see two different session_ids for the same database recovery process.

In the case of attaching and restoring the database, this information will be reflected under the same session_id as the user session which executes those commands.
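If you just want a quick ad-hoc look at startup recovery rather than the full collection scripts at the end of this post, something along these lines works (a sketch only; the filter changes for attach or restore scenarios as noted in the script comments later):

SELECT session_id, database_id, command, percent_complete,
       estimated_completion_time / 1000 AS estimated_seconds_remaining,
       wait_type, wait_time, blocking_session_id
FROM sys.dm_exec_requests
WHERE command = N'DB STARTUP';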

The DMV sys.dm_tran_database_transactions exposes information that can be useful to track the progress of the undo stage of the recovery. Here is a snapshot that corresponds to the above mentioned progress information from sys.dm_exec_requests.

RollBack status information from transactions dmv

The key information here is how the database_transaction_log_bytes_reserved keeps coming down as undo progresses. Also in the case where there are several transactions to undo, you will be able to see their progress using the database_transaction_next_undo_lsn.
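A simple way to watch that value shrink is to query the DMV directly for the database being recovered.  This is only a sketch; 'testdb' is the database from the error log example above:

SELECT transaction_id,
       database_transaction_log_record_count,
       database_transaction_log_bytes_used,
       database_transaction_log_bytes_reserved,
       database_transaction_next_undo_lsn
FROM sys.dm_tran_database_transactions
WHERE database_id = DB_ID(N'testdb');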

Next time you encounter a really long recovery problem, try using some of these information points to understand if the recovery is progressing very slowly, if there is a lot of work to be done, or if it is completely stuck.

Suresh B. Kandoth

SQL Server Escalation Services

Other references:

http://blogs.msdn.com/b/psssql/archive/2008/01/23/how-it-works-what-is-restore-backup-doing.aspx

http://blogs.msdn.com/b/psssql/archive/2009/05/21/how-a-log-file-structure-can-affect-database-recovery-time.aspx

Here are some sample scripts I used to gather and report out the information I showed earlier:

-- create the tables to store the information you collect about recovery
USE <db_name>
GO
DROP TABLE [dbo].[tbl_recovery_tracking]
GO
DROP TABLE [dbo].[tbl_dm_tran_database_transactions]
GO
CREATE TABLE [dbo].[tbl_recovery_tracking](
      [runtime] [datetime] NOT NULL,
      [command] [nvarchar](256) NOT NULL,
      [session_id] [smallint] NOT NULL,
      [database_id] [smallint] NOT NULL,
      [total_elapsed_time] [int] NOT NULL,
      [percent_complete] [real] NOT NULL,
      [estimated_completion_time] [bigint] NOT NULL,
      [wait_resource] [nvarchar](256) NOT NULL,
      [wait_time] [int] NOT NULL,
      [wait_type] [nvarchar](60) NULL,
      [blocking_session_id] [smallint] NULL,
      [reads] [bigint] NOT NULL,
      [writes] [bigint] NOT NULL,
      [cpu_time] [int] NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[tbl_dm_tran_database_transactions](
      [runtime] [datetime] NOT NULL,
      [transaction_id] [bigint] NOT NULL,
      [database_id] [int] NOT NULL,
      [database_transaction_log_record_count] [bigint] NOT NULL,
      [database_transaction_log_bytes_used] [bigint] NOT NULL,
      [database_transaction_log_bytes_reserved] [bigint] NOT NULL,
      [database_transaction_next_undo_lsn] [numeric](25, 0) NULL
) ON [PRIMARY]
GO

-- collect the information in a loop
WHILE 1 = 1
BEGIN
      INSERT INTO [dbo].[tbl_recovery_tracking]
      SELECT GETDATE() as runtime, command,
      session_id, database_id, total_elapsed_time,
      percent_complete, estimated_completion_time,
      wait_resource, wait_time, wait_type, blocking_session_id,
      reads, writes, cpu_time
      FROM sys.dm_exec_requests
      WHERE command = 'DB STARTUP' -- may need to change this if troubleshooting recovery as part of attach database or restore

      INSERT INTO tbl_dm_tran_database_transactions
      SELECT GETDATE() as runtime,
      transaction_id, database_id,
      database_transaction_log_record_count, database_transaction_log_bytes_used,
      database_transaction_log_bytes_reserved, database_transaction_next_undo_lsn
      FROM sys.dm_tran_database_transactions

      WAITFOR DELAY '00:00:01'            -- change this capture interval
END
GO

-- after you collect information for some time, you can analyze the information to understand the progress of recovery
SELECT runtime, command,
      session_id, database_id, total_elapsed_time,
      percent_complete, estimated_completion_time,
      wait_resource, wait_time, wait_type, blocking_session_id,
      reads, writes, cpu_time
FROM [dbo].[tbl_recovery_tracking]
WHERE session_id = 25         -- change this
ORDER BY runtime
GO
SELECT runtime, transaction_id, database_id,
      database_transaction_log_record_count,
      database_transaction_log_bytes_used, database_transaction_log_bytes_reserved,
      database_transaction_next_undo_lsn
FROM tbl_dm_tran_database_transactions
WHERE database_id = 11 and transaction_id = 1452239         -- change this
ORDER BY runtime
GO

 

Discussion About SQL Server I/O


I received a request today for help on how SQL Server I/O behaves.   As I was writing up my e-mail response I thought it would also make an interesting blog post.

Sent: Friday, January 07, 2011 2:53 AM
To: Robert Dorr
Subject: Async I\O questions

 



Background
What we do know :
 
In Windows, the I/O APIs allow for sync requests via calls to an API such as WriteFile that will not return control to the calling code until the operation is complete.
Async I/O hands the request off to the operating system and associated drivers and returns control to the calling code.

[[rdorr]] Correct, this is the key to how SQL does the IO so we can hand off the IO and use other resources (CPU) while the IO is being handled by the hardware.

So, when we look at SQL Server, which mostly uses async I/O patterns, it exposes the pending (async) I/O requests in sys.dm_io_pending_io_requests. The column 'io_pending' provides insight into the I/O request
and who is responsible for it: if the value is TRUE it indicates that HasOverlappedIoCompleted in the Windows I/O API returned FALSE and the operating system or driver stack has yet to complete the I/O.

[[rdorr]] Correct and exactly why we exposed the column.   This lets us tell if it is a Non-SQL issue or if the IO has been completed and SQL is not processing it in a timely fashion.

Looking at io_pending_ms_ticks indicates how long the I/O has been pending. If the column reports FALSE for io_pending, it indicates that the I/O has been completed at the operating system level and
SQL Server is now responsible for handling the remainder of the request.

[[rdorr]] Correct.  We snap the start time when we post the IO (right before the read or write call) and the column is materialized as the result of HasOverlappedIoCompleted.   If you look at HasOverlappedIoCompleted, it is really a macro that checks the Internal member of the OVERLAPPED structure for != STATUS_PENDING (0x103, 259).   So in a dump we can already look at the Internal status on the OVERLAPPED for the pending status value.
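A commonly used query to see which database files the pending requests map to joins the pending IO DMV to the virtual file stats function; a sketch:

SELECT pir.io_pending,
       pir.io_pending_ms_ticks,
       vfs.database_id,
       vfs.file_id,
       vfs.io_stall
FROM sys.dm_io_pending_io_requests AS pir
JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
     ON vfs.file_handle = pir.io_handle;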
 
What we do NOT know and would like detailed feedback on :
 
When SQL Server hands off the async I/O request to the Windows I/O API,
 
A1) What is the order of actions to complete a write from the I/O API's side in a SAN environment?
A2) Let's consider an environment where we have delayed IO as a result of a substandard SAN or SAN configuration.
 
a) In a case where the Windows I/O API could not complete the I/O (write) due to a timeout or a missing I/O request below the driver / API stack, would the Windows I/O API try to reissue the I/O (write),
    OR would the Windows I/O API, through some callback mechanism in the SQLOS layer, notify the SQLOS of such a failure and then rely on the SQLOS to reissue the I/O write request?
           Provide as much detail as possible.

[[rdorr]] The environment does not matter at all to SQL Server.  It could be direct attached, SAN, or another IO subsystem.   SQL Server calls the write routine with the OVERLAPPED structure.   This is a kernel transition which builds an IRP and puts it in motion, returning control from the API to SQL Server.  Each of our IOs is placed on a pending list.  This list is associated with the local SQLOS scheduler the task is running on.  Each time a switch takes place on the scheduler (< 4ms) the I/O list is checked with HasOverlappedIoCompleted.   If the IO is no longer STATUS_PENDING the completion routine registered with SQL Server is fired.   This is not a completion routine set up with Windows; it is an internal function pointer associated with the IO request structure maintained by SQL Server.   The callback routine will check the sanity of the IO (error code, bytes transferred, for a read check the page id, etc.) and then release the EX latch so the page can be used by any requestor.

Not exactly sure what you mean by timeout of the IO.  SQL Server does not allow CancelIO, so the IO will stay pending until it completes or returns an error.   From a SQL Server standpoint there is no timeout, only success or failure.    If you are talking about HBA level timeouts, that is driver specific and the hardware vendor implements the details.   For example a dual HBA system can have a timeout value.  When the HBA can’t service the IO and the timeout is exceeded it will transfer the IO to the alternate HBA controller.   We have seen issues in the past with this pattern where the timeout is not set up properly.  It was supposed to be set to 5 seconds but instead was 45 seconds.  This meant the IO was pending for 45 seconds on an HBA that had lost communications with the SAN and the IO would not move to the second controller until that timeout expired.   The query timeouts from the application were set to 30 seconds, so when we went to flush a log record for a commit the SQL Server queries were getting cancelled at 30 seconds because of the underlying hardware setup.

I am not aware of any Windows APIs that retry the IO, and even if they did SQL Server would not be aware of it.  The OVERLAPPED structure is only updated by the IRP completion routine here.   When the IRP completes, one of the last things that occurs is that the kernel callback routine is fired.  This callback routine does the basic cleanup and sets the values in the OVERLAPPED structure such as Internal (status value) and InternalHigh (bytes transferred) so that a caller to GetOverlappedResult or HasOverlappedIoCompleted can obtain the proper state.   It then checks to see if the OVERLAPPED structure contains a valid event and if so will signal it, and finally, if the IO request is registered with a completion port, it will queue the IO to the completion port.    For disk IO SQL Server does not use the completion port.   SQL Server posts the IOs with the OVERLAPPED and an event.  Every IO on the same SQLOS scheduler uses the same event.  This allows a scheduler in an idle wait state to wake on the event and check the IO list right away.
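Since the pending list is per scheduler, you can also see the outstanding counts at the scheduler level.  A sketch of that view (hidden schedulers show up here as well):

SELECT scheduler_id, cpu_id, status, pending_disk_io_count, current_tasks_count
FROM sys.dm_os_schedulers;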

With that said, there are a couple of retry locations in SQL Server, but not like you might expect.   When you first attempt to post the IO (ReadFile/WriteFile) you can get error 1450 or 1452 returned, which is an out-of-resources error indicating that the IO could not be started (no IRP created).  In these cases SQL Server will sleep for 1 second and attempt to post the IO again.   In these cases the IO is not on the IO list and may not show up in the DMV because the IO is not pending.

For reads only: if the SQL Server completion routine detects damage (a failure of some sort, bad page header, checksum, …) we can retry the same IO up to 4 times before we consider it a failure.   We have found (SQL and Exchange) that if a read fails, retrying the read sometimes works properly.  If you do fail 4 times in a row for the same page it is usually damaged and we will log the error.   In either case SQL Server logs information in the error log about this condition.

A failed write will leave the buffer in the buffer pool, hashed, with the BUF->berrno set to a value other than 0.  This will essentially poison the page, and for data integrity SQL Server will no longer allow access to the page, nor will it write the page to disk.   Since a write is often offline from the query (checkpoint or lazy writer), the original query that dirtied the data is usually not aware of the situation.  However, if that write was for log records (the WAL protocol requires log records to be flushed before the commit is valid), not only is the query notified of the error but the database is often marked suspect.   SQL Server has some mechanisms to detect when a log write fails and will even try to take the database offline and bring it back online to rectify the situation.  If this fails the database is marked suspect.  If this succeeds SQL Server has cleared a database error condition and allowed runtime, but the DBA should be looking closely at the system and their backup strategies.

With all this said, there are some situations we have found with snapshot databases that result in OS error 1450 or 1452 that can’t be resolved on Windows 2003.   The error code was changed in Windows 2008 for the case where the sparse file limitation is reached and the 1450/1452 can’t be resolved, so SQL Server will stop the retry attempts and report a clear error message instead.

b) Taking into account the answer above, How many times would the component responsible retry/reissue the I/O (write) request  before it reports a failure up to the calling components ? Sybase = 10 times.

[[rdorr]] Any failed write results in a report to the error log.  As I stated, the only write failures that are generally retried are the 1450 and 1452 error conditions.    I went back to the code in SQL Server that handles acquiring the buffer latch (AcquireIoLatch and AcquireLatchActions) on SQL 2008 and it always checks the BUF->berrno value as I outlined.  There are some special conditions to allow page level restore on Enterprise and such, but for the most part once a write fails SQL Server won’t retry the write as we can’t trust the data.   Let me give you an example.

If the database is enabled for checksum, it enables a feature named constant page validation.   When a page is read into memory, or was written and is again clean and has a checksum, the lazy writer and other actions may validate the checksum the first time the page is prepared to become dirty.  If the checksum fails, we know the page was changed incorrectly while in memory, perhaps by bad RAM, a memory scribbler, or similar activity.   We would never want to flush this page, so the BUF->berrno is set to indicate the checksum failure, preventing any further use of the page for any action.
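Page checksums are a per-database setting; a minimal sketch of enabling them and confirming the option (the database name is hypothetical):

ALTER DATABASE MyDb SET PAGE_VERIFY CHECKSUM;

SELECT name, page_verify_option_desc
FROM sys.databases
WHERE name = N'MyDb';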

c) and most importantly, can we influence the wait time before an I/O request gets reissued?
d) Finally, what is the difference between SQL 2000 and SQL 2008 with respect to IO retries, and has the IO engine been changed significantly between these two versions, specifically with respect to retries and the layer where IO gets handled?

[[rdorr]] Read retry is the only major difference.  SQL 2000 did not retry a read that failed because of a page header problem, for example.    SQL 2008 can build a write request that is 2x the size of SQL 2000, but that is a performance matter.  Read ahead sizes have always been controlled by SKU.

      We think the answer is YES, so any answer would require some detailed discussion on this point, possibly with a follow-up. We will do some tests to validate our suspicions with evidence. (Shannon)

 
B1) Does IO to the ERRORLOG happen in an asynchronous way with queuing or measures to minimise SMP contention as well as Engine sync IO spin-wait?

[[rdorr]] The error log is written synchronously; it does not use the same IO path as database and log writes.

    We have trace flags switched on which log each login to the SQL Server in the ERRORLOG (details/...).
    I.e., if IO to the errorlog disk is slow, or happens from 8 engines, could that slow things down significantly?

[[rdorr]] The error log path is a different animal.   The path for handling a report to the log is really quite long.   It has checks for all the dump trigger behavior, the message has to be formatted each time, which involves FormatMessage and the default heap, the messages are reported to the event log with ReportMessage, this path gets longer on a cluster, and we write the information to the log in a serialized way.   Also, this requires the workers to go preemptive to avoid holding the OS scheduler, so it puts different pressure on the Windows scheduler than you might expect as well.

One way to remove the Event log from the picture as a test is to start SQL Server with the command line parameter of (-n) to avoid writing to the event log.

So I suppose it would be possible that a slow drive could appear to slow SQL Server down, but even if you could sustain that login rate, the other parts of the login path are generally much longer than the error log write.  For integrated connections you have to talk to Windows (usually a DC), and you have to validate login information such as database context, permissions, and such.   The longer part of the login would be outside the actual IO to the drive.

Any of this is really easy to review using the public symbols.   You can use the Windows debugger and set a breakpoint on CreateFileW.  When SQL Server starts up you can see how it opens the file (which flags), as documented for the Windows API on MSDN.

Here is SQL Server 2008 opening the error log:

00000000`05727b08 00000000`035831f1 kernel32!CreateFileWImplementation
00000000`05727b10 00000000`035d4a9f sqlservr!DiskCreateFileW+0xf1
00000000`05727b70 00000000`00cab289 sqlservr!DBCreateFileW+0x1bf
00000000`05727e70 00000000`0123560d sqlservr!CErrorReportingManager::InitErrorLog+0x4e9

       FileName = struct _UNICODE_STRING "C:\Program Files\Microsoft SQL Server\MSSQL10.SQL2008\MSSQL\Log\ERRORLOG"
       dwDesiredAccess = 0x40000000 = GENERIC_WRITE (0x40000000)
       dwFlagsAndAttributes = 0x00000080 <- FILE_ATTRIBUTE_NORMAL (0x80)

 

By default it allows system cache involvement to avoid some of the performance issues you might be suspecting, but you can force it to use FILE_FLAG_WRITE_THROUGH (-T3663).   One thing to watch here is registry checkpoints.  We have seen a few systems where, when the registry is checkpointed, the IO path slows its response, and if the system drive is shared with the error logs or even database files there are some limited conditions that can impact performance.

We open the error log with either FILE_ATTRIBUTE_NORMAL (0x00000080) or FILE_FLAG_WRITE_THROUGH (0x80000000) and we do not use FILE_FLAG_OVERLAPPED (0x40000000), so the IO is handled by Windows synchronously, NOT async.


Testing this is really easy.  You can produce messages at any rate you want with the following.

raiserror('Bob' ,25,1) with log

00000000`1606b878 00000000`7773328a kernel32!WriteFile
00000000`1606b880 00000000`04013055 kernel32!WriteFileImplementation+0x36  
00000000`1606b8c0 00000000`0173ffb6 sqlservr!DiskWriteFile+0xa5  
00000000`1606b900 00000000`01740b6e sqlservr!CErrorReportingManager::WriteToErrLog+0xc6  
00000000`1606b960 00000000`01746dbb sqlservr!CErrorReportingManager::SendErrorToErrLog+0xb7e  

[[rdorr]] The other thing I will caution you on is server-side trace here.  Tracing to a file has to do IO as well, and we guarantee no event loss, so the trace destination drive should be as good as the LDF drive.  I have seen lots of cases where customers set up a trace of logins, for example, but pointed the trace destination to a network share or something similar and caused a bottleneck.

Bob Dorr - Principal SQL Server Escalation Engineer

SQL Server 7.0: “She sure was a good ship”….


In December of 1997, I was asked by my manager whether I would like to visit Seattle to spend time with the SQL Server Development team as they built our next generation of the database engine, SQL Server 7.0. I had been with Microsoft for about four years and at that time was already considered a veteran on our support staff. I had supported SQL 1.1 and 4.2 for OS/2 and SQL Server 4.20, 4.21, 6.0, and 6.5 for Windows NT. So I took up the opportunity and in January of 1998 visited our Microsoft campus in Redmond in hopes of learning more about what this new version would look like.

Back then the SQL team was in Building 1 and literally the entire team (including dev, test, etc.) fit on 2 floors in this building (the SQL team now takes up almost 2 buildings of much larger size). My six weeks spent there in Redmond was an experience I’ll never forget. Not only did I get to witness the birth of our new SQL 7.0 product with people like Paul Flessner, David Campbell, Sameet Agarwal, Mike Zwilling, Peter Byrne, Billie Jo Murray, and Steve Lindell, but I also was able to participate and help shape some of the new work (such as feedback on DBCC commands and supportability features that are still in the product today. Steve, if you are reading this, no doubt you will not forget the fun of reviewing every DBCC CHECKDB message together).

SQL 7.0 represented such a major milestone for our product because it is still the foundation for many key concepts that exist in the engine today. It was truly the first version where Microsoft finally veered away from all of the original SYBASE concepts and structures used in the first SQL 4.2 version for Windows NT and built a new engine with new ideas, new algorithms, and new structures that you see today in SQL Server 2008 R2.

So it is with sadness that tomorrow marks the end of the product lifecycle for SQL Server 7.0. Official assisted support ends tomorrow, January 11, 2011. You might be reading this and thinking that you didn’t know we still supported SQL 7.0, but until tomorrow we officially do. Customers who have purchased special custom support agreements can still get help from CSS, but other than that we won’t take phone calls or online cases for SQL 7.0 anymore. As I see the support of SQL 7.0 end, it reminds me of the movie Apollo 13 where Bill Paxton (the actor who plays Fred Haise) remarks “She sure was a good ship” as he watches the Aquarius Lunar Module drift away in space, commenting on how that ship, intended to land on the moon, had saved all of their lives. SQL 7.0 was a great ship. One that changed the direction of Microsoft SQL Server from a small-time player in the RDBMS market to a major force.

In addition to the end of all assisted support for SQL 7.0, it is important for you also to not forget that mainstream support will be ending for SQL Server 2005 on April 11th of 2011. Before you panic, the end of “mainstream support” is not the complete end of support. But it does mean the end of cumulative updates for SQL Server 2005 and the end of hotfix support without a custom agreement. The SQL Release Services team is still working out the details of exactly when the last CU will be published for SQL Server 2005 (but it will be before the end of the mainstream support date). For more information on what “mainstream support” means I recommend you read this previous blog post:

http://blogs.msdn.com/b/psssql/archive/2010/02/17/mainstream-vs-extended-support-and-sql-server-2005-sp4-can-someone-explain-all-of-this.aspx

Bob Ward, Microsoft
