Monday 7 September 2009

Use of ranking functions results in an inefficient query plan

In SQL2005 (9.00.4207.00), if you use a ranking function within a view or CTE, then an inefficient query plan can be produced.

First off, in AdventureWorks, create this index:

Create Index idxLastName on Person.Contact(LastName)


and then consider these similar queries:


select ContactId,
       LastName,
       row_number() over(partition by LastName order by ContactId) as rown
from [AdventureWorks].[Person].[Contact]
where LastName = 'Smith'
go
create view vwContacts
as
select ContactId,
       LastName,
       rown = row_number() over(partition by LastName order by ContactId)
from [AdventureWorks].[Person].[Contact]
go
select *
from vwContacts
where LastName = 'Smith'
go

You should now see two vastly different query plans:


[Image: rankingqueryplan1 - the two execution plans compared]


Notice the Filter operator in “Query 2”?


[Image: the Filter operator's predicate]


Wow… so it did calculate the entire view and then filter on the results. At first I thought that I was being unfair, but when doing exactly the same on 2008 I get index seeks on both sides.


[Image: rankingqueryplan2 - index seeks in both plans on SQL Server 2008]


CTEs also suffer from the same problem, although they are easier to rewrite than a view to utilize a seek.


with cteLastNameRow(ContactId,LastName,Row_number)
as
(
    select ContactId,
           LastName,
           row_number() over(partition by LastName order by ContactId)
    from [AdventureWorks].[Person].[Contact]
)
select *
from cteLastNameRow
where LastName = 'Smith'
go


Another index scan.
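Because the filter is on the partitioning column, the predicate can be pushed inside the CTE without changing the row numbers. A minimal sketch of the rewrite (same objects as above), which should get the seek back on 2005:

with cteLastNameRow(ContactId,LastName,Row_number)
as
(
    select ContactId,
           LastName,
           row_number() over(partition by LastName order by ContactId)
    from [AdventureWorks].[Person].[Contact]
    where LastName = 'Smith' -- predicate moved inside the CTE, so idxLastName can be used for a seek
)
select *
from cteLastNameRow
go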


So, be careful when using ranking functions and, as ever, investigate the query execution plan on EVERY query.

Thursday 9 July 2009

A faster TSQL random length random string generator

The following function will return a random string of the specified characters, with a length between @StrLenLo and @StrLenHi. The only oddity with using this function is that if the parameters are not dependent upon data within a table, SQL Server will create a hash join, which will cause the same value to be returned for every row. This is the reason for the bizarre-looking 'case when Num>=0 then 8 else 8 end'.

First off, generate a 'numbers' table:

CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);
GO
DECLARE @i INT;
SELECT @i = 1;
WHILE @i <= 1000
BEGIN
    INSERT INTO dbo.Numbers(Num) VALUES (@i);
    SELECT @i = @i + 1;
END;
GO
UPDATE STATISTICS dbo.Numbers WITH FULLSCAN;
GO
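As an aside, the WHILE loop can be replaced with a single set-based insert; a sketch, assuming sys.all_columns contains at least 1,000 rows (true on any SQL Server 2005+ instance):

-- Set-based alternative to the loop above
INSERT INTO dbo.Numbers(Num)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns;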


Then, as SQL Server does not allow the use of newid() within functions, create a small view that will return a random integer.



drop View VwNewCheck
go
Create View VwNewCheck
with schemabinding
as
Select abs(checksum(NewId())) as New_Id
go
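As a quick sanity check, each query of the view should return a different value:

select New_Id from dbo.VwNewCheck
go
select New_Id from dbo.VwNewCheck
go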


Next up, the actual function:



Drop Function GetVariableLengthRandomCode
go

Create Function GetVariableLengthRandomCode(@StrLenLo integer, @StrLenHi integer, @CharsNeeded char(62))
returns table
with schemabinding
as
return
(
    -- Pick a random length between @StrLenLo and @StrLenHi inclusive
    with cteRandomLength(StrLen)
    as
    (
        Select @StrLenLo + VwNewCheck.new_id % ((@StrLenHi+1) - @StrLenLo)
        from dbo.VwNewCheck
    ),
    -- One random character per Numbers row, up to the chosen length.
    -- 'where num = num' is always true, but it correlates the subquery to
    -- the outer row so that a fresh newid() is used for each character.
    cteRandomChars(num, c)
    as
    (
        Select Num,
               substring(@CharsNeeded,
                         (Select VwNewCheck.new_id % len(@CharsNeeded) + 1 -- 1..len, so the last character can also appear
                          from dbo.VwNewCheck
                          where num = num), 1)
        from dbo.numbers
        where Num <= (Select StrLen from cteRandomLength)
    )
    -- Concatenate the characters into a single string
    select (select c as [text()]
            from cteRandomChars
            for xml path('')) as random
)
go




And you are good to go.



select * from GetVariableLengthRandomCode(8,16,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')



As mentioned above, if you execute



select  Random from numbers cross apply GetVariableLengthRandomCode(8,16,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')


Then the same value will be returned a thousand times, so fool the optimizer by



select  Random 
from numbers
cross apply GetVariableLengthRandomCode(case when Num>=0 then 8 else 8 end,16,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')


and you will get a thousand random strings between 8 and 16 characters long.

Tuesday 9 June 2009

Fixed Headers With SSRS

Within SSRS, or at least Report Builder 2, there are a few options that would imply that headers can be fixed.

[Image: propdlg - the row/column header options dialog in Report Builder 2]

Here's how to really fix them: leave all the above settings unchecked and save the RDL to a local file. Load it up in Notepad (or an XML editor if you are that way inclined).

Then find the header row's TablixMember element and add in this line:

<FixedData>true</FixedData>

Save the file, reload it in Report Builder 2 and (fingers crossed) hey presto, fixed headers.

Thursday 4 June 2009

Catch all queries and indexing

Amongst many others, Gail Shaw has blogged on the subject of catch-all queries. On many occasions I have needed to do something similar and found performance to be pretty dire on large tables. Typically I would use dynamic SQL to generate the query, but there are occasions where that is not possible. After a wee bit of poking around and trial and error, I think I am now straight in my mind about when SQL Server will seek and when it will scan. Consider this proc:

Drop procedure test
go
Create procedure test @ProductId integer,
                      @ReferenceOrderID integer
with recompile
as
select *
from [Production].[TransactionHistory]
where (@ProductId is null or @ProductId = ProductID)
and   (@ReferenceOrderID is null or @ReferenceOrderID = ReferenceOrderID)

This will always create an index scan; try these calls:

exec test NULL, 61197 
exec test 790 , 61197
exec test 790 , NULL
However, the procedure can be re-written to the logically similar:
drop proc test
go
create proc test @ProductId integer,
                 @ReferenceOrderID integer
as
select *
from [Production].[TransactionHistory]
where (@ProductId is not null and @ProductId = ProductID)
or    (@ReferenceOrderID is not null and @ReferenceOrderID = ReferenceOrderID)


Note no recompile is needed. Now re-execute the procedure. All being well, you should notice index seeks and a much lower IO cost. Before we celebrate too much, there are some problems. Firstly, if any one column that is referenced is not indexed then a scan will be reintroduced. Try:

drop proc test
go
create proc test @ProductId integer,
                 @ReferenceOrderID integer,
                 @Quantity integer = NULL
with recompile
as
select *
from [Production].[TransactionHistory]
where (@ProductId is not null and @ProductId = ProductID)
or    (@ReferenceOrderID is not null and @ReferenceOrderID = ReferenceOrderID)
or    (@Quantity is not null and @Quantity = Quantity)
Secondly, an OR condition is used, so it is now a union of transactions which use either the given product or the given ReferenceOrderId. So, in most cases I would recommend two queries: the first to do the index seeks and dump the results to a temp table, and the second to filter that based upon the non-indexed columns and to ensure that all the required conditions are met. The only assumption here is that the first pass has matched some rows. This may seem long-winded, but if you are unable to use dynamic SQL and table-scanning massive tables is out of the question, then your options are rather limited.
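A minimal sketch of that two-step approach, as it might look inside a procedure with the same three parameters as above (the temp table name is my own):

-- Pass 1: seek on the indexed columns only and dump the candidates
select *
into   #Candidates
from   [Production].[TransactionHistory]
where (@ProductId is not null and @ProductId = ProductID)
or    (@ReferenceOrderID is not null and @ReferenceOrderID = ReferenceOrderID)

-- Pass 2: re-apply ALL the conditions, including the non-indexed ones,
-- against the much smaller candidate set
select *
from  #Candidates
where (@ProductId is null or @ProductId = ProductID)
and   (@ReferenceOrderID is null or @ReferenceOrderID = ReferenceOrderID)
and   (@Quantity is null or @Quantity = Quantity)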

Monday 1 June 2009

A bit of self back slapping

Recently Adam Machanic issued a fresh SQL challenge based upon concatenating string data. The results are now in. Although I was eliminated in the first round (appeal lodged :)), just looking at the execution times mine were pretty damn quick. I feel particularly good as entries by Itzik Ben-Gan were slower than mine.

Friday 29 May 2009

Optimal Query Plans With Unbalanced Data Loads

SQL Server is great at producing and caching efficient query plans. However, these plans may not be the most efficient for the parameters passed into subsequent calls of a stored procedure.

In this article I am assuming that you have read both of Erland Sommarskog’s articles The Curse and Blessings of Dynamic SQL and Dynamic Search Conditions in T-SQL.

First off, in AdventureWorks we need to create a few indexes to help prove the point.

Create Index IdxLastName on person.Contact(LastName)
Create Index IdxContactId on sales.SalesOrderHeader(ContactID)
Create Index IdxOrderDate on sales.SalesOrderHeader(OrderDate)

Then if you execute:


Select * from sales.SalesOrderHeader  SOH ,
person.Contact CON
where SOH.ContactID = CON.ContactId
and Con.LastName = 'Smith'
and SOH.OrderDate between '2002-09-01' and '2002-09-02'

and


Select * from sales.SalesOrderHeader  SOH ,
person.Contact CON
where SOH.ContactID = CON.ContactId
and Con.LastName = 'Smith'
and SOH.OrderDate between '2002-09-01' and '2005-09-02'

You will notice the stark, and unsurprising, difference in query plans. In the first query it is quicker to find all those orders made on the 1st or 2nd of September 2002 and filter that list for those made by Smiths. In the second it has found the orders made by Smiths and filtered those orders for those made between 01 September 2002 and 02 September 2005.


Naturally, if we did:


declare @LastName nvarchar(50),
@OrderDateLo smalldateTime,
@OrderDateHi smalldatetime

select
@LastName = 'Smith'
select @OrderDateLo = '2002-09-01'
select @OrderDateHi = '2002-09-02'
Select * from sales.SalesOrderHeader SOH ,
person.Contact CON
where SOH.ContactID = CON.ContactId
and Con.LastName = @LastName
and SOH.OrderDate between @OrderDateLo and @OrderDateHi

followed by


declare @LastName nvarchar(50),
@OrderDateLo smalldateTime,
@OrderDateHi smalldatetime

select
@LastName = 'Smith'
select @OrderDateLo = '2002-09-01'
select @OrderDateHi = '2005-09-02'
Select * from sales.SalesOrderHeader SOH ,
person.Contact CON
where SOH.ContactID = CON.ContactId
and Con.LastName = @LastName
and SOH.OrderDate between @OrderDateLo and @OrderDateHi

Then SQL Server will use the same (non-optimal in the second case) cached query plan.


This could have quite an impact on your system. We can, although not completely, resolve this inefficiency with one or two little tricks. When deciding whether to re-use a cached plan or create a new one, SQL Server compares the entire SQL statement, comments and all. This means that:


Select * from person.Contact CON Where Con.LastName = 'Smith' /* 1 */

and


Select * from person.Contact CON Where Con.LastName = 'Smith' /* 2 */

will have different cached plans. Verify this yourself by checking sys.dm_exec_cached_plans. We can use this to our advantage by generating comments based upon the input data to create different query plans.
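To see both entries in the cache, a quick sketch joining the plan cache to the statement text via the documented DMVs:

select st.text, cp.usecounts, cp.objtype
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_sql_text(cp.plan_handle) st
where st.text like '%person.Contact CON%'
and   st.text not like '%dm_exec_cached_plans%' -- exclude this query itself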


Using a simplified search_orders procedure, where we are only interested in searching on last name and an order date range, such as this:


Create Procedure search_orders @LastName    nvarchar(50)  = NULL,
                               @OrderDateLo smalldatetime = NULL,
                               @OrderDateHi smalldatetime = NULL
as
Declare @Sql nvarchar(max)

Select @Sql = 'Select * from sales.SalesOrderHeader SOH , person.Contact CON where SOH.ContactID = CON.ContactId '
if (@LastName    is not null) Select @Sql = @Sql + ' and CON.LastName = @LastName '
if (@OrderDateLo is not null) Select @Sql = @Sql + ' and SOH.OrderDate >= @OrderDateLo '
if (@OrderDateHi is not null) Select @Sql = @Sql + ' and SOH.OrderDate <= @OrderDateHi '

-- Bucket the width of the date range and embed it as a comment,
-- so each bucket gets its own cached plan.
declare @DaysDiff integer
Select @DaysDiff = datediff(dd, coalesce(@OrderDateLo, '2001-07-01 00:00:00.000'),
                                coalesce(@OrderDateHi, getdate()))
Select @Sql = @Sql + case when @DaysDiff <= 2  then ' /* DR:1 */'
                          when @DaysDiff <= 7  then ' /* DR:2 */'
                          when @DaysDiff <= 30 then ' /* DR:3 */'
                          else ' /* DR:4 */' end

exec sp_executesql @stmt = @Sql,
     @params = N'@LastName nvarchar(50),@OrderDateLo smalldatetime,@OrderDateHi smalldatetime',
     @LastName = @LastName,
     @OrderDateLo = @OrderDateLo,
     @OrderDateHi = @OrderDateHi

Note how a comment is added to the dynamic SQL string depending on how many days are to be searched.


If we execute search_orders with varying parameters, then for any given combination of supplied criteria we will end up with at most four query plans, one per date-range bucket. However, they will be optimized(ish) to account for possible date range sizes. Again, check sys.dm_exec_cached_plans to confirm this for yourself.


Taking this one step further, if you are querying multiple tables for multiple wide or narrow ranges, you could use a ratio of the counts of resolved rows in each table as the comment. The con is that you would obviously have the extra processing of counting the rows, but at least you would not end up using an appalling query plan.
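A rough sketch of that idea, as it might sit inside search_orders above (the thresholds and the RATIO tag are my own invention):

-- Count what each filter would resolve, then bucket the ratio into the comment
declare @ContactRows integer, @OrderRows integer
Select @ContactRows = count(*) from person.Contact
                      where LastName = @LastName
Select @OrderRows   = count(*) from sales.SalesOrderHeader
                      where OrderDate between @OrderDateLo and @OrderDateHi

Select @Sql = @Sql + case when @ContactRows = 0 or @OrderRows = 0 then ' /* RATIO:0 */'
                          when @OrderRows / @ContactRows >= 100   then ' /* RATIO:1 */'
                          when @OrderRows / @ContactRows >= 1     then ' /* RATIO:2 */'
                          else ' /* RATIO:3 */' end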

Friday 20 March 2009

SQL and contiguous data ranges.

As the saying goes, "work smarter, not harder"; this is definitely the case for SQL. If you have a poorly performing query, then you are not only affecting that one user but diverting precious machine resources needlessly. This is doubly true if it's part of the 20% of code executed 80% of the time. Here's one smart solution to a common problem, and after the umpteenth newsgroup question which could be solved by this, I thought I'd take a little time out to dissect the theory and give a few good examples. Steve Kass of Drew University seems to be the man to credit for this idea, although I can't find any specific details.

The problem: given a list of numbers, we need to output the covered ranges. The key is simply to generate a number to group your contiguous ranges by. The value of this number is completely irrelevant; it just has to be the same for each item within a range, and different between ranges.

So let's start simply: if we have a range of numbers (1,2,3,5,6,7,8), what we need returned is:
1 - 3
5 - 8

Let's generate some test data:



Drop table Contig
go
create table Contig(Id integer)
go
insert into contig values(1)
insert into contig values(2)
insert into contig values(3)
insert into contig values(5)
insert into contig values(6)
insert into contig values(7)
insert into contig values(8)



If we then execute




Select Id,RowNum = row_number() over (order by Id desc)
from Contig


we get:

Id          RowNum
----------- --------------------
8           1
7           2
6           3
5           4
3           5
2           6
1           7


What we are going to do is generate our 'grouping number' by adding Id and RowNum. So now execute:



with cteContigOrder(Id,RowNum)
as
(
Select Id,RowNum = row_number() over (order by Id desc) from Contig
)
Select Id,RowNum,Grouping = Id+RowNum from cteContigOrder


which returns:

Id          RowNum               Grouping
----------- -------------------- --------------------
8           1                    9
7           2                    9
6           3                    9
5           4                    9
3           5                    8
2           6                    8
1           7                    8

Looking at this, it's easy to see how Ids 1 through 3 have a grouping of 8 and Ids 5 through 8 have a grouping of 9.

So to generate our ranges we need to do this:


;with cteContigOrder(Id,RowNum)
as
(
Select Id,RowNum = row_number() over (order by Id desc) from Contig
)
Select Min(Id),Max(Id) from cteContigOrder
group by Id+RowNum order by 1



which gives us:

----------- -----------
1           3
5           8


exactly the ranges we wanted.
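Incidentally, the same grouping constant can be produced by subtracting an ascending row number instead of adding a descending one; Id minus its row number is likewise fixed within each island. A sketch of the equivalent query:

;with cteContigOrder(Id,RowNum)
as
(
    Select Id,RowNum = row_number() over (order by Id) from Contig
)
Select Min(Id),Max(Id) from cteContigOrder
group by Id-RowNum order by 1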

Using this, it is very easy to apply the technique to dates.



Drop Table ContigDates
go
Create Table ContigDates
(
DateCol smalldatetime
)
go
insert into ContigDates(DateCol) values('01jan2009')
insert into ContigDates(DateCol) values('02jan2009')
insert into ContigDates(DateCol) values('03jan2009')
insert into ContigDates(DateCol) values('01feb2009')
insert into ContigDates(DateCol) values('02feb2009')
insert into ContigDates(DateCol) values('03feb2009')
insert into ContigDates(DateCol) values('10feb2009')
go
with cteDateList(DateCol,Grouping)
as
(
Select DateCol,DateCol + row_number() over (order by datecol desc)
from ContigDates
)
Select Min(DateCol),Max(DateCol)
from cteDateList
group by Grouping
order by 1
go
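This returns three ranges: 01 Jan 2009 to 03 Jan 2009, 01 Feb 2009 to 03 Feb 2009, and 10 Feb 2009 on its own.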


If you have a table which itself holds ranges of dates, I find it simplest to 'explode' the ranges and then re-summarize. Here I have used a Calendar table to find all the days within the ranges.
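The example assumes a Calendar table with one row per day in a column named Dte; if you do not have one, a minimal sketch that builds one covering 2009:

Drop Table Calendar
go
Create Table Calendar(Dte smalldatetime not null primary key)
go
declare @d smalldatetime
select @d = '01jan2009'
while @d <= '31dec2009'
begin
    insert into Calendar(Dte) values(@d)
    select @d = dateadd(dd,1,@d)
end
go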



Drop Table ContigDateRanges
go
Create Table ContigDateRanges
(
DateCollo smalldatetime,
DateColHi smalldatetime
)
go
insert into ContigDateRanges(Datecollo,Datecolhi)
values('01jan2009','03jan2009')
insert into ContigDateRanges(Datecollo,Datecolhi)
values('04jan2009','10jan2009')
insert into ContigDateRanges(Datecollo,Datecolhi)
values('01feb2009','01feb2009')
insert into ContigDateRanges(Datecollo,Datecolhi)
values('02feb2009','05feb2009')
go
with cteDateList(DateCol,Grouping)
as
(
Select Calendar.Dte,
Calendar.Dte + row_number() over (order by Calendar.Dte desc)
from ContigDateRanges,
Calendar
where Calendar.Dte between DateColLo and DateColHi
)
Select Min(DateCol),Max(DateCol)
from cteDateList
group by Grouping
order by 1

I hope that this gives you at least one more option to improve your system and to remove a few more evil cursors.