Modeling Several Boolean Attributes Using a Junk Dimension
One of the discussions that I recently had involved modeling a business process associated with several Boolean attributes. The current model addressed each of these attributes as a separate dimension. While there is nothing necessarily wrong with this approach, when reviewing a model, I always try to think the way Ralph Kimball and colleagues at the Kimball Group would think. If you’re not modeling the way that they would model that process, you probably should re-evaluate and redesign.
So how would the Kimball Group model this scenario? The answer is most likely with a Junk dimension since a Junk dimension consolidates several low cardinality attributes into a single useful dimension. Since all of these attributes have only two possible values, either Yes or No, seems like a very likely candidate for a Junk dimension!
There’s several advantages to modeling this scenario with a Junk dimension. Below are three of them (and you’re always free to disagree)
- Fact table contains a single field rather than a whole bunch of columns to refer to these flags. More on that later.
- Greater flexibility on defining the labels of each attribute, especially when utilized with a view layer abstraction between the source and the user presentation layer, and
- Easier to enhance the cube should additional Boolean attributes be identified later in the project lifecycle.
Defining the Junk Dimension
While in most cases, the key for a dimension is a meaningless number and assigned by an Identity or Sequence, in the case of a Junk dimension consolidating several Boolean attributes a better approach is to calculate this key. For instance, if I’ve consolidated four Boolean attributes, I can calculate this key by multiplying the value of the Boolean by a corresponding power of 2, either 20, 21, 22, or 23 depending on the identity of the attribute. A T-SQL statement that defines a table consolidating four Boolean attributes is given below.
WITH cteFlags AS ( SELECT 0 AS Value UNION ALL SELECT 1 ) SELECT ( Power(2, 3.0) * Flag3.Value + Power(2, 2.0) * Flag2.Value + Power(2, 1.0) * Flag1.Value + Flag0.Value ) AS ID, Flag0.Value, Flag1.Value, Flag2.Value, Flag3.Value FROM cteFlags Flag0 CROSS JOIN cteFlags Flag1 CROSS JOIN cteFlags Flag2 CROSS JOIN cteFlags Flag3 ORDER BY ID
The ETL to load fact tables is now slightly more complex. Rather than each dimension being represented as separate columns, a single column is used. But since the surrogate key for the Junk dimension is not incrementally assigned but coded based on business domain values, this value can be calculated rather than being looked up based on the individual column values.
One of the problems with inclusion of each Boolean attribute as a separate dimension is that when modeled as a role-playing dimension in a multidimensional database, you’ll end up with a whole bunch of Yes and No or True and False entities. When querying this model, it’s not too difficult to get completely lost in the meaning of each of those values. Using a Junk dimension, one can easily manipulate the labels or names applied to each of those entities. So the True flag applied to the Base Fee Schedule Overridden? attribute can be changed from Yes to Base Fee Schedule Not Used or some other description more in line with the business domain.
Easier to Enhance
Another advantage is the ease in which additional Boolean attributes can be added to an existing data mart. Most of the values for these attributes are likely to be False for the vast majority of cases as these attributes are generally used to identify exceptions. Therefore, by simply adding another column to your Junk dimension with False as its value you’ve already accurately included the additional attribute for the vast majority of fact records without even touching the fact table. How Agile is that?
To complete inclusion of the exceptions, there are two steps that need to be executed. First, new rows need to be added to the Junk dimension. These can be simply added by appending the true values for the additional attribute to the existing Junk dimension records. To illustrate using the previous case, the query defining new rows to the Junk dimension created above would simply be the following:
WITH cteFlags AS ( SELECT 0 AS Value UNION ALL SELECT 1 ) SELECT ( Power(2, 4.0) * New_Attribute.Value + Power(2, 3.0) * Flag3.Value + Power(2, 2.0) * Flag2.Value + Power(2, 1.0) * Flag1.Value + Flag0.Value ) AS ID, Flag0.Value, Flag1.Value, Flag2.Value, Flag3.Value, Flag4.Value FROM cteFlags Flag0 CROSS JOIN cteFlags Flag1 CROSS JOIN cteFlags Flag2 CROSS JOIN cteFlags Flag3 CROSS JOIN cteFlags New_Attribute WHERE New_Attribute.Value = 1 ORDER BY ID
With the new dimension records added, the second step is to update the fact records identified as “exceptions”. Again, as opposed to a complete reloading of the fact table to include the additional attribute, the Junk dimension key value for these exceptions can be updated accordingly. One way to complete this task is to load a temporary staging table that has the identifying fact fields for all records with a true New_Attribute value. By joining to this table and applying the following update conditions to the corresponding records, JunkDimensionKey = JunkDimensionKey + 24, you’ve now successfully included a new Boolean attribute into the model. If there are many records where the New_Attribute value is true and your fact table is very large, your best approach would likely be to use a technique that I described in a previous blog post nearly two years ago.