Archive for the ‘Data Warehousing’ Category

Role Playing Dimensions or Role Dimension?

Recently on the SSAS forums, a question was posted titled Dimension 1 or Dimension 2. The poster eventually answered their own question. I did not particularly like their proposed solution and started to reply, but then realized that an alternate solution would make a better topic for a blog post than a forum response. So here is that response. The question being asked is really a general dimensional modeling problem rather than the specific business problem described.

In a nutshell, the question was the following: given that a Country can participate in a business process in two different roles, should Country be modeled as a single dimension (i.e. Country) or as two role-playing dimensions (i.e. Primary Country and Referer Country)? Additionally, their users needed to query across both roles independently, answering questions that include contributions from both the UK in the Primary role OR France in the Referer role. The OR is the important criterion here; without it, a role-playing dimension would suffice.

The solution the poster proposed was heavily geared toward a technical workaround for the reporting requirement rather than guided by the composition of the business domain. I have to confess I'm not even sure I could explain how their approach solved the problem, but it did involve creating a dimension directly from the fact table, which almost always leads to a very unscalable solution.

What the OR reporting criterion suggests is that the business domain contains a single Country dimension that participates in the Billing process in both the Primary and Referer roles. Since a Billing event can be related to more than one Country, the solution is going to involve modeling a many-to-many (M2M) relationship. Using a simple matrixification approach to the M2M relationship, which reduces the cardinality of the bridging dimension and the number of records in the bridging fact table, the separate primary and referer country references are concatenated into a single column in the fact table, something like what's shown below:

[Figure: Matrixification, the primary and referer country references concatenated into a single CountryRoleMatrix column on the fact table]
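To make the idea concrete, here is a minimal sketch of that concatenation; the staging table and column names are my assumptions, not the poster's actual schema.

-- Collapse the two country references on each staged billing row into a single
-- CountryRoleMatrix value that will key into the bridge dimension.
SELECT  b.BillingID,
        b.BillingAmount,
        CAST(b.PrimaryCountryKey AS varchar(10)) + '|' +
        CAST(b.RefererCountryKey AS varchar(10)) AS CountryRoleMatrix
FROM    stg.Billing AS b;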

The dimensional model then would take that concatenated column and create the following structure,

  • DimCountryRoleMatrix – a bridge dimension with one row for each unique value of CountryRoleMatrix, having a regular relationship with both the Billing and CountryRoleMatrix measure groups,
  • FactCountryRoleMatrix – a bridge fact table that contains two rows for each row in the DimCountryRoleMatrix table, one for the country's primary role and another for the country's referer role,
  • DimCountry – one row for each country. The corresponding SSAS dimension would be related to the Billing measure group with a M2M relationship through the CountryRoleMatrix measure group, and
  • DimRole – contains only two rows, one for Primary and the other for Referer. Like Country, the corresponding SSAS dimension would also be related to the Billing measure group with a M2M relationship through the CountryRoleMatrix measure group.

[Figure: DimensionalModel, the bridge dimension, bridge fact, and Country/Role dimensions and their relationships]
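A minimal T-SQL sketch of the supporting relational tables might look like the following; the names and data types are assumptions for illustration, not the original design.

CREATE TABLE dbo.DimCountry (
    CountryKey   int IDENTITY(1,1) PRIMARY KEY,
    CountryName  varchar(50) NOT NULL
);

CREATE TABLE dbo.DimRole (
    RoleKey   int IDENTITY(1,1) PRIMARY KEY,
    RoleName  varchar(20) NOT NULL        -- 'Primary' or 'Referer'
);

-- Bridge dimension: one row per distinct CountryRoleMatrix value
CREATE TABLE dbo.DimCountryRoleMatrix (
    CountryRoleMatrixKey  int IDENTITY(1,1) PRIMARY KEY,
    CountryRoleMatrix     varchar(21) NOT NULL UNIQUE
);

-- Bridge fact: two rows per bridge-dimension row, one for each role played
CREATE TABLE dbo.FactCountryRoleMatrix (
    CountryRoleMatrixKey  int NOT NULL
        REFERENCES dbo.DimCountryRoleMatrix (CountryRoleMatrixKey),
    CountryKey            int NOT NULL REFERENCES dbo.DimCountry (CountryKey),
    RoleKey               int NOT NULL REFERENCES dbo.DimRole (RoleKey)
);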

The original question, how to query for the UK in the Primary role or France in the Referer role, is now answered in MDX by simply using the following set:

{
( [Country].[UK], [Role].[Primary] ),
( [Country].[France], [Role].[Referer] )
}

 

Categories: SSAS, Data Warehousing

Modeling Several Boolean Attributes Using a Junk Dimension

October 26, 2012

One of the discussions that I recently had involved modeling a business process associated with several Boolean attributes. The current model addressed each of these attributes as a separate dimension. While there is nothing necessarily wrong with this approach, when reviewing a model, I always try to think the way Ralph Kimball and colleagues at the Kimball Group would think. If you’re not modeling the way that they would model that process, you probably should re-evaluate and redesign.

So how would the Kimball Group model this scenario? The answer is most likely with a Junk dimension, since a Junk dimension consolidates several low-cardinality attributes into a single useful dimension. Since all of these attributes have only two possible values, either Yes or No, this scenario seems like a very likely candidate for a Junk dimension!

There are several advantages to modeling this scenario with a Junk dimension. Below are three of them (and you're always free to disagree):

  1. The fact table contains a single key column rather than a whole bunch of columns referring to these flags. More on that later.
  2. Greater flexibility on defining the labels of each attribute, especially when utilized with a view layer abstraction between the source and the user presentation layer, and
  3. Easier to enhance the cube should additional Boolean attributes be identified later in the project lifecycle.

Defining the Junk Dimension

While in most cases the key for a dimension is a meaningless number assigned by an Identity or Sequence, in the case of a Junk dimension consolidating several Boolean attributes a better approach is to calculate this key. For instance, if I've consolidated four Boolean attributes, I can calculate the key by multiplying the value of each Boolean by a corresponding power of 2, either 2^0, 2^1, 2^2, or 2^3 depending on the identity of the attribute, and summing the results. A T-SQL statement that defines a table consolidating four Boolean attributes is given below.

WITH cteFlags AS (
    SELECT    0 AS Value
    UNION ALL
    SELECT    1
)
SELECT
    (
        Power(2, 3.0) * Flag3.Value      -- ID = 8*Flag3 + 4*Flag2 + 2*Flag1 + Flag0
     +  Power(2, 2.0) * Flag2.Value
     +  Power(2, 1.0) * Flag1.Value
     +  Flag0.Value
    ) AS ID,
    Flag0.Value AS Flag0,
    Flag1.Value AS Flag1,
    Flag2.Value AS Flag2,
    Flag3.Value AS Flag3
FROM
    cteFlags Flag0
CROSS JOIN
    cteFlags Flag1
CROSS JOIN
    cteFlags Flag2
CROSS JOIN
    cteFlags Flag3
ORDER BY
    ID

The ETL to load fact tables is now slightly more complex. Rather than each dimension being represented as separate columns, a single column is used. But since the surrogate key for the Junk dimension is not incrementally assigned but coded based on business domain values, this value can be calculated rather than being looked up based on the individual column values.
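As a sketch, the fact load can compute the key inline; the staging table and flag column names below are hypothetical, and the flags are assumed to be stored as 0/1 values.

-- The junk key is just the weighted sum of the source flags (2^3 .. 2^0),
-- so no lookup against the junk dimension is needed during the load.
SELECT  src.SourceTransactionID,
        src.Amount,
        8 * src.Flag3 + 4 * src.Flag2 + 2 * src.Flag1 + src.Flag0 AS JunkDimensionKey
FROM    stg.SourceTransactions AS src;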

Greater Flexibility

One of the problems with including each Boolean attribute as a separate dimension is that, when modeled as a role-playing dimension in a multidimensional database, you'll end up with a whole bunch of Yes and No or True and False entities. When querying this model, it's not too difficult to get completely lost in the meaning of each of those values. Using a Junk dimension, one can easily manipulate the labels or names applied to each of those entities. So the label for the True value of the Base Fee Schedule Overridden? attribute can be changed from Yes to Base Fee Schedule Not Used or some other description more in line with the business domain.
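A minimal sketch of that view layer, assuming the junk dimension was materialized as a table named dbo.DimJunkFlags (a hypothetical name), could be:

-- The view swaps the raw 0/1 values for business-friendly labels without
-- touching the underlying junk dimension table.
CREATE VIEW dbo.vwDimJunkFlags
AS
SELECT  ID AS JunkFlagsKey,
        CASE Flag3 WHEN 1 THEN 'Base Fee Schedule Not Used'
                   ELSE 'Base Fee Schedule Used'
        END AS BaseFeeScheduleOverridden,
        CASE Flag2 WHEN 1 THEN 'Yes' ELSE 'No' END AS Flag2Description,
        CASE Flag1 WHEN 1 THEN 'Yes' ELSE 'No' END AS Flag1Description,
        CASE Flag0 WHEN 1 THEN 'Yes' ELSE 'No' END AS Flag0Description
FROM    dbo.DimJunkFlags;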

Easier to Enhance

Another advantage is the ease with which additional Boolean attributes can be added to an existing data mart. The values for these attributes are likely to be False for the vast majority of cases, as these attributes are generally used to identify exceptions. Therefore, by simply adding another column to your Junk dimension with False as its default value, you've already accurately included the additional attribute for the vast majority of fact records without even touching the fact table. How Agile is that?
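A minimal sketch of that change, again assuming the hypothetical dbo.DimJunkFlags table:

-- Add the new flag with a default of 0 (False); every existing dimension row,
-- and therefore every existing fact reference, is already correct.
ALTER TABLE dbo.DimJunkFlags
ADD Flag4 bit NOT NULL
    CONSTRAINT DF_DimJunkFlags_Flag4 DEFAULT (0);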

To complete inclusion of the exceptions, two steps need to be executed. First, new rows need to be added to the Junk dimension. These can simply be generated by combining the true value of the additional attribute with each of the existing Junk dimension combinations. To illustrate using the previous case, the query defining the new rows for the Junk dimension created above would simply be the following:

WITH cteFlags AS (
    SELECT    0 AS Value
    UNION ALL
    SELECT    1
)
SELECT
    (
        Power(2, 4.0) * New_Attribute.Value
     +  Power(2, 3.0) * Flag3.Value
     +  Power(2, 2.0) * Flag2.Value
     +  Power(2, 1.0) * Flag1.Value
     +  Flag0.Value
    ) AS ID,
    Flag0.Value AS Flag0,
    Flag1.Value AS Flag1,
    Flag2.Value AS Flag2,
    Flag3.Value AS Flag3,
    New_Attribute.Value AS Flag4
FROM
    cteFlags Flag0
CROSS JOIN
    cteFlags Flag1
CROSS JOIN
    cteFlags Flag2
CROSS JOIN
    cteFlags Flag3
CROSS JOIN
    cteFlags New_Attribute
WHERE
    New_Attribute.Value = 1      -- only the new, true-valued combinations
ORDER BY
    ID

With the new dimension records added, the second step is to update the fact records identified as "exceptions". Again, rather than completely reloading the fact table to include the additional attribute, the Junk dimension key value for these exceptions can be updated in place. One way to complete this task is to load a temporary staging table that holds the identifying fact fields for all records with a true New_Attribute value. By joining to this table and applying the update JunkDimensionKey = JunkDimensionKey + 2^4 (i.e. adding 16) to the corresponding records, you've now successfully included a new Boolean attribute in the model. If there are many records where the New_Attribute value is true and your fact table is very large, your best approach would likely be to use a technique that I described in a previous blog post nearly two years ago.
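As a hedged sketch of that update, with hypothetical fact and staging table names:

-- stg.NewAttributeExceptions is assumed to hold the identifying business keys
-- of the fact rows where the new attribute is true.
UPDATE  f
SET     f.JunkDimensionKey = f.JunkDimensionKey + 16    -- + 2^4
FROM    dbo.MyFactTable f
INNER   JOIN stg.NewAttributeExceptions x
        ON      f.SourceTransactionID = x.SourceTransactionID;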

Categories: Data Warehousing

Adding Subtotals using GROUPING SETS

November 7, 2010

I recently had to convert an existing report constructed against SQL Server 2000 to our new SQL Server 2008 dimensional data warehouse. The output of the report procedure was a single set of aggregated totals at different levels of the date dimension. After reviewing the code for some time, which consisted of several pages of stream-of-consciousness SQL and the creation of a multitude of temp tables, I arrived at the conclusion that there just had to be a better way. That's when I stumbled upon a little-documented enhancement in SQL Server 2008 called GROUPING SETS.

Using the AdventureWorksDW sample database as a source, the use of GROUPING SETS lets us produce output like below using a single SQL statement.

[Figure: SubTotalOutput, sales aggregated at the month, quarter, year, and grand-total levels]

How is this done? Let's first consider the case where we want to generate an aggregate by month. Using a standard GROUP BY clause, this result set is produced by the following:

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName

 

This statement simply produces a single vector aggregate with CalendarYear, CalendarQuarter, MonthNumberOfYear, and EnglishMonthName as coordinates. I can rewrite this same statement using the new GROUPING SETS syntax by passing the same coordinates in as a single grouping set.
 

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            )
        )

The two statements produce exactly the same output. However, the use of GROUPING SETS lets me specify another vector aggregate, say just Year and Quarter, as shown below.

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            ),
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter
            )
        )

This statement produces aggregates at two different levels, Month and Quarter, of the time dimension.

[Figure: MonthAndQuarterAggregate, the result set containing both month-level and quarter-level rows]

To add yearly totals and grand totals, two new grouping sets, (ordDate.CalendarYear) and the empty set (), are added to the GROUPING SETS clause so that the final statement looks like the following.

SELECT  ordDate.CalendarYear,
        ordDate.CalendarQuarter,
        ordDate.MonthNumberOfYear,
        ordDate.EnglishMonthName,
        SUM(fact.SalesAmount) AS SalesAmount
FROM    dbo.FactInternetSales fact
INNER   JOIN dbo.DimTime ordDate
        ON   fact.OrderDateKey = ordDate.TimeKey
GROUP BY
        GROUPING SETS (
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter,
                  ordDate.MonthNumberOfYear,
                  ordDate.EnglishMonthName
            ),          -- Month summary
            (
                  ordDate.CalendarYear,
                  ordDate.CalendarQuarter
            ),          -- Quarter summary
            (
                  ordDate.CalendarYear
            ),          -- Year summary
            ()          -- Grand totals
        )

Quite beautiful.

Modifying a Huge Table and SQL Server Partition Sliding

November 2, 2010

It’s been quite some time since I last posted. Since last time, I’ve moved my blog to WordPress as Live Spaces will be no more. Not that I’ve not had ideas. Just not been inspired to sit down and look at a computer after staring at a monitor for hours all day long. But today is different I guess.

I make a point of scripting out all objects in my data warehouse and saving the object definitions in VSS. For table objects, one can't just drop the object and recreate it, otherwise all the existing data gets dropped too. So for table objects, I use the following formula (I use a template that I'll post on my SkyDrive account if anyone is interested):

  • Drop all indexes on the table,
  • Drop all constraints including the primary key constraint,
  • Create a table with the modified structure with a _NEW suffix appended to the end (i.e. TableName_NEW)
  • Migrate the data from the existing table to the new table (i.e. INSERT INTO TableName_NEW (...) SELECT ... FROM TableName),
  • Check rowcounts in both tables and, if equal, drop the old table and rename the new table (DROP TABLE TableName; EXEC sp_rename 'TableName_NEW', 'TableName'), and
  • Recreate constraints and indexes.
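Condensed into a single hedged sketch (the table structure is a placeholder, not the actual template), the pattern looks roughly like this:

-- Rebuild pattern: create the modified table, copy the data, verify, swap.
CREATE TABLE dbo.TableName_NEW (
    KeyColumn   int NOT NULL,
    SomeColumn  int NOT NULL,
    NewColumn   int NULL            -- the structural change being introduced
);

INSERT INTO dbo.TableName_NEW (KeyColumn, SomeColumn)
SELECT  KeyColumn, SomeColumn
FROM    dbo.TableName;

IF  (SELECT COUNT(*) FROM dbo.TableName_NEW) =
    (SELECT COUNT(*) FROM dbo.TableName)
BEGIN
    DROP TABLE dbo.TableName;
    EXEC sp_rename 'dbo.TableName_NEW', 'TableName';
END;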

This approach has worked well until recently. In this particular situation, I needed to add another dimension reference to a fact table containing several million rows. The outline above works well except for the step that copies data from the existing table to the new one. As written above, this step is fully logged and, after several hours, eventually fills the transaction log (I'm guessing, as I didn't actually have the patience to let it finish). A slightly modified approach was needed. Following an approach used by SSIS and taking advantage of partitioning, instead of copying all the data at once, the following script was used:

DECLARE @sSQL varchar(8000);
DECLARE @iPartitionNo int;
DECLARE @iLeftBoundary int;
DECLARE @iRightBoundary int;
DECLARE @sFileGroupName varchar(10);

SET @iLeftBoundary = 0;

-- One row per partition boundary: its number, its boundary value, and its filegroup
DECLARE cPartitions INSENSITIVE CURSOR FOR
       SELECT  CAST(prv.boundary_id AS int) AS PartitionNo,
               CAST(prv.value AS int) AS LeftPartitionBoundary,
               CAST(ds.name AS varchar(10)) AS FileGroupName
       FROM    sys.partition_range_values prv
       INNER   JOIN sys.partition_functions pfunc
               ON      prv.function_id = pfunc.function_id
       INNER   JOIN sys.partition_schemes psch
               ON      prv.function_id = psch.function_id
       INNER   JOIN sys.destination_data_spaces dsp
               ON      psch.data_space_id = dsp.partition_scheme_id
               AND     prv.boundary_id = dsp.destination_id
       INNER   JOIN sys.data_spaces ds
               ON      dsp.data_space_id = ds.data_space_id
       WHERE   pfunc.[name] = 'MyPartitionFunction';

OPEN cPartitions;
FETCH cPartitions INTO @iPartitionNo, @iRightBoundary, @sFileGroupName;

WHILE (@@FETCH_STATUS = 0) BEGIN
       -- Drop the staging table left over from the previous iteration, if any
       IF EXISTS (
              SELECT  *
              FROM    sys.tables tbl
              INNER   JOIN sys.schemas sch
                      ON      tbl.schema_id = sch.schema_id
              WHERE   tbl.name = 'MyTableTemp'
                AND   sch.name = 'dbo'
       ) BEGIN
              DROP TABLE dbo.MyTableTemp;
       END;

       -- Copy one partition's rows into the staging table (minimally logged);
       -- FieldList stands in for the explicit column list
       SET @sSQL =
              'SELECT  FieldList
               INTO    dbo.MyTableTemp
               FROM    dbo.MyTable
               WHERE   PartitioningField >= ' + CAST(@iLeftBoundary AS varchar) +
              '  AND   PartitioningField <= ' + CAST(@iRightBoundary AS varchar);
       EXEC(@sSQL);

       -- The staging table must sit on the same filegroup as the target partition
       SET @sSQL =
              'ALTER TABLE dbo.MyTableTemp ADD CONSTRAINT MyTableTemp_PK
               PRIMARY KEY (PartitioningField) ON [' + @sFileGroupName + ']';
       EXEC(@sSQL);

       -- ...and must carry a check constraint matching the partition boundaries
       SET @sSQL =
              'ALTER TABLE dbo.MyTableTemp ADD CONSTRAINT MyTableTemp_CK
               CHECK (PartitioningField >= ' + CAST(@iLeftBoundary AS varchar) +
              '  AND  PartitioningField <= ' + CAST(@iRightBoundary AS varchar) + ')';
       EXEC(@sSQL);

       -- Slide the staged rows into the corresponding partition of the new table
       SET @sSQL =
              'ALTER TABLE dbo.MyTableTemp SWITCH TO dbo.MyTable_NEW
               PARTITION ' + CAST(@iPartitionNo AS varchar);
       EXEC(@sSQL);

       SET @iLeftBoundary = @iRightBoundary + 1;
       FETCH cPartitions INTO @iPartitionNo, @iRightBoundary, @sFileGroupName;
END;

CLOSE cPartitions;
DEALLOCATE cPartitions;

DECLARE @OldRowCount int;
DECLARE @NewRowCount int;

SELECT @OldRowCount = COUNT(*)
FROM   dbo.MyTable;

SELECT @NewRowCount = COUNT(*)
FROM   dbo.MyTable_NEW;

SELECT @OldRowCount, @NewRowCount;

IF (@OldRowCount = @NewRowCount) BEGIN
       DROP TABLE dbo.MyTable;
       EXEC sp_rename 'dbo.MyTable_NEW', 'MyTable';
END ELSE BEGIN
       DROP TABLE dbo.MyTable_NEW;
END;

Those who’ve worked with me must be floored that I’m posting TSQL code using a cursor. I normally break into hives when I see a cursor being used. But in this case, short of migrating to SSIS, it was the best I could do.
 
The SELECT ... INTO dbo.MyTableTemp operation is minimally logged, so instead of taking minutes to run, it takes seconds. This step on my lightweight development server took about 30 seconds to copy over 10 million rows with over 30 columns (all numeric, of course; it is a fact table after all). Once the temp table is created, in order for it to be slid into the _NEW table, the source and destination partitions have to be located on the same filegroup and the staging table has to carry a check constraint matching the target partition's boundaries. By creating the primary key constraint for the temp table on the same filegroup as the target partition and adding the matching check constraint, both conditions are satisfied.
 
If anyone finds themselves in the same predicament, I hope they find this approach useful.

Do I really need a data warehouse?

August 16, 2010

I recently received a link to a BI Requirements for IT white paper from Tableau Software and SQL Server Magazine. Though I strongly agree with most of its points, the paper seems to suggest that building a data warehouse is simply unnecessary. Beyond the fact that the operational systems that run the business and the decision support systems used to analyze it should not share the same physical resources, there is a fundamental mathematical reason data is stored redundantly in a well-architected IT data infrastructure.

Sorry to sound like a pure academic here, but it comes with being in school way too long. And I do mean way, way, way too long. Like the chemical physics of my academic training, data warehousing is essentially applied elementary linear algebra. So as a simple example, consider the description of a point in space as shown in the figure. The most familiar representation is a Cartesian, or xyz, coordinate system. In this case, a point is defined by specifying the distance along the x-axis, the distance along the y-axis, and the height along the z-axis. In a real-world implementation, the x and y coordinates would represent the distances along two walls of a room and the z coordinate the distance from the floor.

[Figure: a point in space described in both Cartesian (x, y, z) and spherical (r, Θ, Φ) coordinates]

However, a Cartesian coordinate system is not the only possible representation. Spherical coordinates could also be used. In this case, the point is defined by specifying the distance of a radial vector from the origin, r, and two angles, Θ and Φ. In both Cartesian and spherical coordinates, all three axes are orthogonal, or mutually independent. Additionally, a well-defined transformation can be applied to convert one representation to the other.
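For reference, the familiar mapping from spherical to Cartesian coordinates is:

x = r sin Θ cos Φ
y = r sin Θ sin Φ
z = r cos Θ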

Is one representation better than another? The simple answer is it depends. If the physical problem being addressed is solving an equation of motion subjected to a linear force, Cartesian coordinates would be more appropriate. Likewise, if the problem being addressed is the description of a Kepler orbit, spherical coordinates are more appropriate.

So what does this have to do with business systems? In an oversimplified description, there are two principal data models used in business applications. A normalized model is applied when there should be no redundancy in storage and is most appropriate for capturing business transactions. It's generally a poor model for analyzing a business due to the complexity of the model and the amount of physical resources needed to join data from disparate sources. Therefore, rather than use the same data model for analyzing a business, a dimensional model as described by Ralph Kimball is more appropriate. And, just as a transformation can be applied to convert from a Cartesian to a spherical coordinate system, a transformation (SQL Server Integration Services if using the Microsoft BI tool stack) can be applied to convert between the normalized model and a dimensional model.

I know this is long and boring. However, to answer the question posed in the title, drawing an analogy with a simple physical problem, a data warehouse is absolutely essential.

Categories: Data Warehousing