Part 4: Data Vault for beginners

This is the final article of the four articles written to easily understand the Data Vault architecture. Part four describes on how the information mart is derived and then loaded from the Data Vault.

Information mart ER diagram designed through Dimensional Modelling techniques

Dimension load

Descriptive data used to measure the facts are basically loaded into the dimensions. In this mart, the dimensions are designed based on the Slowly Changing Dimension type 2 model, which means the dimensions stores the current as well as the historical changes of the descriptive data.

Dimensions of the mart are loaded using the Hub and satellites under the same hub in the data Vault model.

Structure of the dimension

Dimension holds a primary key value which will be linked with the fact table.

With SCD type 1, Customer hash key in the hub can be used as a primary key in the dimension. Any new updates for the dimension can easily be identified and updated using the load date fields from the satellite.

But for SCD type 2, as customer hash key is referenced multiple types in the current and historical records, a unique key combination should be made as a primary key. In the data vault satellite, Customer Hash key and Load date forms the composite primary key, which are unique. While loading the dimension a new hash key is derived from the Customer Hash key and Load date to load the primary key value in the dimension.

Example SCD type 2 dimension structure is below.

Script that has been used to create the dimension

 CREATE TABLE [dbo].[DimCustomers](
               [CustomerKey] [varbinary](32) NOT NULL,
               [CustomerHashKey] [varbinary](32) NOT NULL,
               [LoadDate] [date] NOT NULL,
               [LoadEndDate] [date] NULL,
               [CustomerID] [nchar](5) NOT NULL,
               [CompanyName] [nvarchar](40) NOT NULL,
               [ContactName] [nvarchar](30) NOT NULL,
               [ContactTitle] [nvarchar](30) NOT NULL,
               [Address] [nvarchar](60) NOT NULL,
               [City] [nvarchar](15) NOT NULL,
               [Region] [nvarchar](15) NOT NULL,
               [PostalCode] [nvarchar](10) NOT NULL,
               [Country] [nvarchar](15) NOT NULL,
               [Phone] [nvarchar](24) NOT NULL,
               [Fax] [nvarchar](24) NOT NULL,
               [RecordSource] [varchar](100) NOT NULL,
               [CurrentRecord] [bit] NOT NULL,
  CONSTRAINT [PK_DimCustomers] PRIMARY KEY CLUSTERED 
 (
               [CustomerKey] ASC
 )ON [PRIMARY],
  CONSTRAINT [UK_DimCustomers] UNIQUE NONCLUSTERED 
 (
               [CustomerHashKey] ASC,
               [LoadDate] ASC
 )ON [PRIMARY]
 ) ON [PRIMARY]
 GO 

Script that has been used to load the dimension (SCD type 2)

A two-step load process is required to load the SCD type 2 dimension from Data Vault.

Step 1 – Inserts the latest Satellite records that are not in the dimension (both current and old).

 INSERT INTO [OrdersInfoMart].[dbo].[DimCustomers]
 (     [CustomerKey]
       ,[CustomerHashKey]
       ,[LoadDate]
       ,[LoadEndDate]
       ,[CustomerID]
       ,[CompanyName]
       ,[ContactName]
       ,[ContactTitle]
       ,[Address]
       ,[City]
       ,[Region]
       ,[PostalCode]
       ,[Country]
       ,[Phone]
       ,[Fax]
       ,[RecordSource]
       ,[CurrentRecord]
 )
 SELECT 
        HASHBYTES('MD5', Hub.[CustomerID] +  '^' + CONVERT(varchar, Sat.[LoadDate], 121)) AS [CustomerKey]
       ,Hub.[CustomerHashKey]
       ,Sat.[LoadDate]
       ,ISNULL(Sat.[LoadEndDate],'39991231') AS [LoadEndDate]
       ,Hub.[CustomerID]
       ,Sat.[CompanyName]
       ,ISNULL(Sat.[ContactName],'')  AS [ContactName]
       ,ISNULL(Sat.[ContactTitle],'') AS [ContactTitle]
       ,ISNULL(Sat.[Address],'')      AS [Address]
       ,ISNULL(Sat.[City],'')         AS [City]
       ,ISNULL(Sat.[Region],'')       AS [Region]        
       ,ISNULL(Sat.[PostalCode],'')   AS [PostalCode]
       ,ISNULL(Sat.[Country],'')      AS [Country]      
       ,ISNULL(Sat.[Phone],'')        AS [Phone]
       ,ISNULL(Sat.[Fax],'')          AS [Fax] 
       ,Sat.[RecordSource]
       ,CASE WHEN ISNULL(Sat.[LoadEndDate],'39991231') = '39991231' 
        THEN 1 ELSE 0 END AS [CurrentRecord]
 FROM [DataVault].[raw].[HubCustomers] Hub 
INNER JOIN  [DataVault].[raw].[SatCustomers] Sat
   ON Sat.[CustomerHashKey] = Hub.[CustomerHashKey]
 WHERE 
NOT EXISTS (   
             SELECT 1
             FROM [OrdersInfoMart].[dbo].[DimCustomers] Dim  
             WHERE Dim.[CustomerHashKey] = Sat.[CustomerHashKey]
                AND Dim.[LoadDate]       = Sat.[LoadDate]
           )
 ; 

Step 2 – Dimension’s ‘Current’ records are end-dated and updated as ‘Non-Current’ when the Data Vault records were modified/end-dated in Satellite.

 UPDATE Dim
 SET Dim.[LoadEndDate] = Sat.[LoadEndDate],
               Dim.[CurrentRecord] = 0 
 FROM [OrdersInfoMart].[dbo].[DimCustomers] Dim        
 INNER JOIN  [DataVault].[raw].[SatCustomers] Sat 
    ON Sat.[CustomerHashKey]   = Dim.[CustomerHashKey]
    AND Sat.[LoadDate]         = Dim.[LoadDate]                   
    AND ISNULL(Sat.[LoadEndDate],'39991231') <> '39991231'
 WHERE Dim.[CurrentRecord] = 1
 ; 

Fact load

Fact tables primarily store the measurable fields than can be aggregated and the foreign keys for the dimension tables.

Facts of the mart are mainly loaded using the Links and Hubs that are connected in the Data Vault for SCD type 1 dimensions. However, for SCD type 2 dimensions, either the dimensions that were already loaded or the Satellites used to load the dimensions can also be used to load the foreign keys in the Fact table.

Structure of the fact

Example fact structure is below.

Script that has been used to create the fact

 CREATE TABLE [dbo].[FactOrders](
               [OrderKey] [varbinary](32) NOT NULL,
               [CustomerKey] [varbinary](32) NOT NULL,
               [EmployeeKey] [varbinary](32) NOT NULL,
               [ShipperKey] [varbinary](32) NOT NULL,
               [ProductKey] [varbinary](32) NOT NULL,
               [OrderDate] [datetime] NOT NULL,
               [RequiredDate] [datetime] NOT NULL,
               [ShippedDate] [datetime] NOT NULL,
               [Freight] [money] NOT NULL,
               [UnitPrice] [money] NOT NULL,
               [Quantity] [smallint] NOT NULL,
               [Discount] [real] NOT NULL,
               [LoadDate] [date] NOT NULL,
               [RecordSource] [varchar](100) NOT NULL,
  CONSTRAINT [PK_FactOrders] PRIMARY KEY CLUSTERED 
 (
               [OrderKey] ASC,
               [CustomerKey] ASC,
               [EmployeeKey] ASC,
               [ShipperKey] ASC,
               [ProductKey] ASC
 ) ON [PRIMARY]
 ) ON [PRIMARY]
 GO
  
 ALTER TABLE [dbo].[FactOrders]  WITH CHECK ADD  CONSTRAINT [FK_FactOrders_DimCustomers] FOREIGN KEY([CustomerKey])
 REFERENCES [dbo].[DimCustomers] ([CustomerKey])
 GO
  
 ALTER TABLE [dbo].[FactOrders]  WITH CHECK ADD  CONSTRAINT [FK_FactOrders_DimEmployees] FOREIGN KEY([EmployeeKey])
 REFERENCES [dbo].[DimEmployees] ([EmployeeKey])
 GO
  
 ALTER TABLE [dbo].[FactOrders]  WITH CHECK ADD  CONSTRAINT [FK_FactOrders_DimProducts] FOREIGN KEY([ProductKey])
 REFERENCES [dbo].[DimProducts] ([ProductKey])
 GO
 
 ALTER TABLE [dbo].[FactOrders]  WITH CHECK ADD  CONSTRAINT [FK_FactOrders_DimShippers] FOREIGN KEY([ShipperKey])
 REFERENCES [dbo].[DimShippers] ([ShipperKey])
 GO 

Script that has been used to load the fact

Before loading the fact, it is a mandatory step that all the dimensions are loaded. Therefore, while loading the Fact table, the referenced dimension keys exists and does not cause a load error.

As said above, when SCD type 2 dimensions are involved, Fact tables can be loaded in two different ways. Best one applicable for your application should be used based on the performance and after considering other factors.

First approach is using the existing dimensions that were already loaded are joined with Hubs and links to get the foreign key information to be loaded into the Fact table. Example below.

 INSERT INTO [OrdersInfoMart].[dbo].[FactOrders]
 (
        [OrderKey]
       ,[CustomerKey]
       ,[EmployeeKey]
       ,[ShipperKey]
       ,[ProductKey]
       ,[OrderDate]
       ,[RequiredDate]
       ,[ShippedDate]
       ,[Freight]
       ,[UnitPrice]
       ,[Quantity]
       ,[Discount]
       ,[LoadDate]
       ,[RecordSource]
 )
 SELECT LO.[OrderHashKey] AS [OrderKey]
       ,DC.[CustomerKey]
       ,DE.[EmployeeKey]
       ,DS.[ShipperKey]
       ,DP.[ProductKey]
       ,SO.[OrderDate]
       ,SO.[RequiredDate]
       ,ISNULL(SO.[ShippedDate],'19000101') AS [ShippedDate]
       ,SO.[Freight]
       ,SOD.[UnitPrice]
       ,SOD.[Quantity]
       ,SOD.[Discount]
       ,SOD.[LoadDate]
       ,SOD.[RecordSource]
 FROM [DataVault].[raw].[LinkOrders] LO 
 INNER JOIN [DataVault].[raw].[SatOrders] SO
   ON SO.[OrderHashKey] = LO.[OrderHashKey]
 INNER JOIN [DataVault].[raw].[LinkOrderDetails] LOD
   ON LOD.[OrderHashKey] = LO.[OrderHashKey]
 INNER JOIN [DataVault].[raw].[SatOrderDetails] SOD
    ON LOD.[OrderDetailHashKey] = SOD.[OrderDetailHashKey]
 INNER JOIN [OrdersInfoMart].[dbo].[DimCustomers] DC
    ON DC.[CustomerHashKey] = LO.[CustomerHashKey]
    AND LO.[LoadDate] BETWEEN DC.[LoadDate] AND DC.[LoadEndDate]
 INNER JOIN [OrdersInfoMart].[dbo].[DimEmployees] DE
    ON DE.[EmployeeHashKey] = LO.[EmployeeHashKey]
    AND LO.[LoadDate] BETWEEN DE.[LoadDate] AND DE.[LoadEndDate]
 INNER JOIN [OrdersInfoMart].[dbo].[DimShippers] DS
    ON DS.[ShipperHashKey] = LO.[ShipperHashKey]
    AND LO.[LoadDate] BETWEEN DS.[LoadDate] AND DS.[LoadEndDate]
 INNER JOIN [OrdersInfoMart].[dbo].[DimProducts] DP
    ON DP.[ProductHashKey] = LOD.[ProductHashKey]
    AND LOD.[LoadDate] BETWEEN DP.[LoadDate] AND DP.[LoadEndDate]
 WHERE NOT EXISTS        
 (             
     SELECT 1
     FROM [OrdersInfoMart].[dbo].[FactOrders] F      
     WHERE F.[OrderKey]    = LO.[OrderHashKey]
       AND F.[CustomerKey] = DC.[CustomerKey]
       AND F.[EmployeeKey] = DE.[EmployeeKey]
       AND F.[ShipperKey]  = DS.[ShipperKey]
       AND F.[ProductKey]  = DP.[ProductKey]
 )
 ;

Second approach is by linking the Hubs, Links and Satellites to get the foreign key information to be loaded into the Fact table. Example below.

 ;WITH Stg_Orders AS
 (
 SELECT 
        LO.[OrderHashKey] AS [OrderKey]
        ,HASHBYTES('MD5', HC.[CustomerID] +  '^' + CONVERT(varchar, SC.[LoadDate], 121)) AS [CustomerKey]
       ,HASHBYTES('MD5', HE.[EmployeeID] +  '^' + CONVERT(varchar, SE.[LoadDate], 121)) AS [EmployeeKey]
       ,HASHBYTES('MD5', CAST(HS.[ShipperID] AS VARCHAR) +  '^' + CONVERT(varchar, SS.[LoadDate], 121)) AS [ShipperKey]
       ,HASHBYTES('MD5', CAST(HP.[ProductID] AS VARCHAR) +  '^' + CONVERT(varchar, SP.[LoadDate], 121)) AS [ProductKey]
       ,SO.[OrderDate]
       ,SO.[RequiredDate]
       ,ISNULL(SO.[ShippedDate],'19000101') AS [ShippedDate]
       ,SO.[Freight]
       ,SOD.[UnitPrice]
       ,SOD.[Quantity]
       ,SOD.[Discount]
       ,SOD.[LoadDate]
       ,SOD.[RecordSource]
 FROM [DataVault].[raw].[LinkOrders] LO 
 INNER JOIN [DataVault].[raw].[SatOrders] SO
    ON SO.[OrderHashKey] = LO.[OrderHashKey]
 INNER JOIN [DataVault].[raw].[LinkOrderDetails] LOD
    ON LOD.[OrderHashKey] = LO.[OrderHashKey]
 INNER JOIN [DataVault].[raw].[SatOrderDetails] SOD
    ON LOD.[OrderDetailHashKey] = SOD.[OrderDetailHashKey]
 INNER JOIN [DataVault].[raw].[HubCustomers] HC
    ON HC.[CustomerHashKey] = LO.[CustomerHashKey]
 INNER JOIN [DataVault].[raw].[SatCustomers] SC
    ON SC.[CustomerHashKey] = HC.[CustomerHashKey]
    AND LO.[LoadDate] BETWEEN SC.[LoadDate] 
                      AND ISNULL(SC.[LoadEndDate],'39991231')
 INNER JOIN [DataVault].[raw].[HubEmployees] HE
    ON HE.[EmployeeHashKey] = LO.[EmployeeHashKey]
 INNER JOIN [DataVault].[raw].[SatEmployees] SE
    ON SE.[EmployeeHashKey] = HE.[EmployeeHashKey]
    AND LO.[LoadDate] BETWEEN SE.[LoadDate] 
                      AND ISNULL(SE.[LoadEndDate],'39991231')
 INNER JOIN [DataVault].[raw].[HubShippers] HS
    ON HS.[ShipperHashKey] = LO.[ShipperHashKey]
 INNER JOIN [DataVault].[raw].[SatShippers] SS
    ON SS.[ShipperHashKey] = HS.[ShipperHashKey]
    AND LO.[LoadDate] BETWEEN SS.[LoadDate] 
                      AND ISNULL(SS.[LoadEndDate],'39991231')
 INNER JOIN [DataVault].[raw].[HubProducts] HP
    ON HP.[ProductHashKey] = LOD.[ProductHashKey]
 INNER JOIN [DataVault].[raw].[SatProducts] SP
    ON SP.[ProductHashKey] = HP.[ProductHashKey]
    AND LOD.[LoadDate] BETWEEN SP.[LoadDate] 
                       AND ISNULL(SP.[LoadEndDate],'39991231')
 )
 INSERT INTO [OrdersInfoMart].[dbo].[FactOrders]
 (              
  [OrderKey]
 ,[CustomerKey]
 ,[EmployeeKey]
 ,[ShipperKey]
 ,[ProductKey]
 ,[OrderDate]
 ,[RequiredDate]
 ,[ShippedDate]
 ,[Freight]
 ,[UnitPrice]
 ,[Quantity]
 ,[Discount]
 ,[LoadDate]
 ,[RecordSource]
 )
 SELECT 
  [OrderKey]
 ,[CustomerKey]
 ,[EmployeeKey]
 ,[ShipperKey]
 ,[ProductKey]
 ,[OrderDate]
 ,[RequiredDate]
 ,[ShippedDate]
 ,[Freight]
 ,[UnitPrice]
 ,[Quantity]
 ,[Discount]
 ,[LoadDate]
 ,[RecordSource]
 FROM Stg_Orders Ord
 WHERE 
NOT EXISTS (             
             SELECT 1
             FROM [OrdersInfoMart].[dbo].[FactOrders] F            
             WHERE F.[OrderKey]      = Ord.[OrderKey]
               AND F.[CustomerKey]   = Ord.[CustomerKey]
               AND F.[EmployeeKey]   = Ord.[EmployeeKey]
               AND F.[ShipperKey]    = Ord.[ShipperKey]
               AND F.[ProductKey]    = Ord.[ProductKey]
           )
 ;

This article concludes the four part series on the Data Vault architecture. Once again, please note that the OLTP, Staging, Vault and InfoMart databases developed for this series are all on the Microsoft SQL server, and the integration between different databases is done through SQL queries.

Leave a comment