
Archive for November, 2010

Adding Subtotals using GROUPING SETS

November 7, 2010

I recently had to convert an existing report built against SQL Server 2000 to our new SQL Server 2008 dimensional data warehouse. The output of the report procedure was a single set of aggregated totals at different levels of the date dimension. After reviewing the code for some time, which consisted of several pages of stream-of-consciousness SQL and the creation of a multitude of temp tables, I concluded that there just had to be a better way. That’s when I stumbled upon a little-documented enhancement in SQL Server 2008 called GROUPING SETS.

Using the AdventureWorksDW sample database as a source, GROUPING SETS lets us produce the output below with a single SQL statement.

[Image: SubTotalOutput — aggregated sales totals at each level of the date dimension]

How is this done? Let’s first consider the case where we want to generate an aggregate by month. Using a GROUP BY statement, this result set is produced by the following:

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName

This statement simply produces a single vector aggregate with CalendarYear, CalendarQuarter, MonthNumberOfYear, and EnglishMonthName as coordinates. I can rewrite the same statement using the new GROUPING SETS syntax by passing those same coordinates in as a single grouping set.

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            )
        )

The two statements produce exactly the same output. However, GROUPING SETS lets me specify another vector aggregate, say just Year and Quarter, like below:

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            ),
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter
            )
      )

This statement produces aggregates at two different levels of the time dimension: Month and Quarter.

[Image: MonthAndQuarterAggregate — output containing both Month-level and Quarter-level aggregates]

To add yearly totals and grand totals, two more grouping sets are added, (ordDate.CalendarYear) and the empty set (), so that the final statement looks like the following:

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            ),          -- Month summary
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter
            ),          -- Quarter summary
            (
                  ordDate.CalendarYear
            ),          -- Year summary
            ()          -- Grand totals
      )

Quite beautiful.
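As an aside, the rolled-up columns come back as NULL in the subtotal rows, so a downstream consumer can’t always tell a quarter subtotal apart from a detail row that genuinely contains a NULL. The GROUPING() function can label each row. Here is a minimal sketch that reuses the same query and adds three indicator columns:

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount,
        GROUPING(ordDate.EnglishMonthName) AS IsMonthRolledUp,   -- 1 on quarter, year, and grand total rows
        GROUPING(ordDate.CalendarQuarter)  AS IsQuarterRolledUp, -- 1 on year and grand total rows
        GROUPING(ordDate.CalendarYear)     AS IsYearRolledUp     -- 1 only on the grand total row
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (ordDate.CalendarYear, ordDate.CalendarQuarter,
             ordDate.MonthNumberOfYear, ordDate.EnglishMonthName),
            (ordDate.CalendarYear, ordDate.CalendarQuarter),
            (ordDate.CalendarYear),
            ()
        );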

Modifying a Huge Table and SQL Server Partition Sliding

November 2, 2010

It’s been quite some time since I last posted. Since then, I’ve moved my blog to WordPress, as Live Spaces will be no more. It’s not that I haven’t had ideas; I just haven’t been inspired to sit down in front of a computer after staring at a monitor for hours all day long. But today is different, I guess.

I make a point of scripting out all objects in my data warehouse and saving the object definitions in VSS. For table objects, one can’t just drop and recreate the object; otherwise all the existing data gets dropped as well. So for table objects, I use the following formula, sketched in code after the list. (I use a template that I’ll post on my SkyDrive account if anyone is interested.)

  • Drop all indexes on the table,
  • Drop all constraints, including the primary key constraint,
  • Create a table with the modified structure, with a _NEW suffix appended to the name (i.e. TableName_NEW),
  • Migrate the data from the existing table to the new table (i.e. INSERT INTO TableName_NEW (…) SELECT … FROM TableName),
  • Check the row counts in both tables and, if they are equal, drop the old table and rename the new one (DROP TABLE TableName; EXEC sp_rename 'TableName_NEW', 'TableName'),
  • Recreate constraints and indexes.
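To make the recipe concrete, here is a minimal sketch of the migrate-and-rename steps, using a hypothetical dbo.MyTable that is gaining a new nullable NewDimKey column. The real template also scripts out the indexes and constraints.

-- Minimal sketch of the migrate-and-rename recipe (hypothetical table and columns).
CREATE TABLE dbo.MyTable_NEW (
    MyTableKey  int         NOT NULL,
    ExistingCol varchar(50) NOT NULL,
    NewDimKey   int         NULL        -- the newly added dimension reference
);

INSERT INTO dbo.MyTable_NEW (MyTableKey, ExistingCol)
SELECT MyTableKey, ExistingCol
FROM   dbo.MyTable;

IF (SELECT COUNT(*) FROM dbo.MyTable) = (SELECT COUNT(*) FROM dbo.MyTable_NEW) BEGIN
    DROP TABLE dbo.MyTable;
    EXEC sp_rename 'dbo.MyTable_NEW', 'MyTable';
    -- Recreate the primary key, other constraints, and indexes here.
END;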

This approach had worked well until recently. In this particular situation, I needed to add another dimension reference to a fact table containing several million rows. The outline above works well except for the step that copies data from the existing table to the new one. As written above, that step is fully logged, and after several hours it would eventually fill the transaction log (I’m guessing, as I didn’t actually have the patience to let it finish). A slightly modified approach was needed. Following an approach used by SSIS and taking advantage of partitioning, instead of copying all the data at once, the following script was used:

DECLARE @sSQL varchar(8000);
DECLARE @iPartitionNo int;
DECLARE @iLeftBoundary int;
DECLARE @iRightBoundary int;
DECLARE @sFileGroupName varchar(10);

SET @iLeftBoundary = 0;

DECLARE cPartitions INSENSITIVE CURSOR FOR
       SELECT  CAST(prv.boundary_id AS int) AS PartitionNo,
               CAST(prv.value AS int) AS LeftPartitionBoundary,
               CAST(ds.name AS varchar(10)) AS FileGroupName
       FROM    sys.partition_range_values prv
       INNER   JOIN sys.partition_functions pfunc
               ON      prv.function_id = pfunc.function_id
       INNER   JOIN sys.partition_schemes psch
               ON      prv.function_id = psch.function_id
       INNER   JOIN sys.destination_data_spaces dsp
               ON      psch.data_space_id = dsp.partition_scheme_id
               AND     prv.boundary_id = dsp.destination_id
       INNER   JOIN sys.data_spaces ds
               ON      dsp.data_space_id = ds.data_space_id
       WHERE   pfunc.[name] = 'MyPartitionFunction';

OPEN cPartitions;
FETCH cPartitions INTO @iPartitionNo, @iRightBoundary, @sFileGroupName;

 

WHILE (@@FETCH_STATUS = 0) BEGIN
       IF EXISTS (
              SELECT  *
              FROM    sys.tables tbl
              INNER   JOIN sys.schemas sch
                      ON      tbl.schema_id = sch.schema_id
              WHERE   tbl.name = 'MyTableTemp'
                AND   sch.name = 'dbo'
       ) BEGIN
              DROP TABLE dbo.MyTableTemp;
       END;

       SET @sSQL = '
              SELECT  FieldList
              INTO    dbo.MyTableTemp
              FROM    dbo.MyTable
              WHERE   PartitioningField >= ' + CAST(@iLeftBoundary AS varchar) + '
                AND   PartitioningField <= ' + CAST(@iRightBoundary AS varchar);
       EXEC(@sSQL);

       SET @sSQL = '
              ALTER TABLE dbo.MyTableTemp ADD CONSTRAINT MyTableTemp_PK
              PRIMARY KEY (PartitioningField) ON [' + @sFileGroupName + ']';
       EXEC(@sSQL);

       SET @sSQL = '
              ALTER TABLE dbo.MyTableTemp ADD CONSTRAINT MyTableTemp_CK
              CHECK (PartitioningField >= ' + CAST(@iLeftBoundary AS varchar) + '
                AND  PartitioningField <= ' + CAST(@iRightBoundary AS varchar) + ')';
       EXEC(@sSQL);

       SET @sSQL = '
              ALTER TABLE dbo.MyTableTemp SWITCH TO dbo.MyTable_NEW
              PARTITION ' + CAST(@iPartitionNo AS varchar);
       EXEC(@sSQL);

       SET @iLeftBoundary = @iRightBoundary + 1;
       FETCH cPartitions INTO @iPartitionNo, @iRightBoundary, @sFileGroupName;
END;

CLOSE cPartitions;
DEALLOCATE cPartitions;

 

DECLARE @OldRowCount int;
DECLARE @NewRowCount int;

SELECT @OldRowCount = COUNT(*)
FROM   dbo.MyTable;

SELECT @NewRowCount = COUNT(*)
FROM   dbo.MyTable_NEW;

SELECT @OldRowCount, @NewRowCount;

IF (@OldRowCount = @NewRowCount) BEGIN
       DROP TABLE dbo.MyTable;
       EXEC sp_rename 'dbo.MyTable_NEW', 'MyTable';
END ELSE BEGIN
       DROP TABLE dbo.MyTable_NEW;
END;

Those who’ve worked with me must be floored that I’m posting T-SQL code using a cursor. I normally break into hives when I see a cursor being used. But in this case, short of migrating to SSIS, it was the best I could do.
 
The SELECT… INTO dbo.MyTableTemp operation is minimally logged rather than fully logged, so instead of taking minutes to run, it takes seconds. On my lightweight development server this step took about 30 seconds to copy over 10 million rows with over 30 columns (all numeric, of course; it is a fact table, after all). Once the temp table is created, in order for it to be slid into the _NEW table, the source table and the destination partition have to be located on the same filegroup, and the source has to carry a constraint guaranteeing its rows fall within the destination partition’s boundaries. By adding a primary key constraint on the target partition’s filegroup and attaching the matching check constraint, both conditions are satisfied.
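If you want to double-check that the staging table actually landed on the expected filegroup before attempting the switch, a quick query against the catalog views does the trick (a sketch, assuming the dbo.MyTableTemp name from the script above):

-- Sanity check: which filegroup holds dbo.MyTableTemp's heap or clustered index?
-- Compare the result to @sFileGroupName before running ALTER TABLE ... SWITCH.
SELECT fg.name AS StagingFileGroup
FROM   sys.indexes i
INNER  JOIN sys.filegroups fg
       ON   fg.data_space_id = i.data_space_id
WHERE  i.object_id = OBJECT_ID('dbo.MyTableTemp')
  AND  i.index_id IN (0, 1);    -- 0 = heap, 1 = clustered index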
 
If anyone finds themselves in the same predicament, I hope they find this approach useful.