SQL Server 2012 Columnstore Index In Depth Part 1: Make a bigger AdventureworksDW database for Testing

April 12, 2012 at 4:00 AMSteven Wang

The new xVelocity memory optimized columnstore index feature in SQL server 2012 are designed to deliver blazing-fast performance for large data warehouse quries. But I believe that for many people who are interested in this fantastic new feature might have a problem to find a big sample data warehouse database to try the new columnstore features, like batch-mode processing, segment elimination, the size of memory grant etc.

The new sample data warehouse database: AdventureworksDW2012, which you can download from msftdbprodsamples.codeplex.com is not big enough for us to experience many features of columnstore index. We need to make it bigger to explore the columnstore index features.

Once the AdventureworksDW2012 Data file is downloaded, use the below script to attach the database:

USE [master]
GO
Declare @Filepath nvarchar(128)
Declare @SQL nvarchar(max)

Set	@Filepath = 'H:\Applications\database\SQL 2012\SampleDBs\adventureworks\' 
				--Change the path to your file path
				--Where your downloaded AdventureWorksDW2012_Data.mdf is located

Set @Filepath = IIF(Right(@Filepath,1) = '\', @Filepath, @Filepath + '\') + 'AdventureWorksDW2012_Data.mdf'

Set	@SQL = 'CREATE DATABASE [AdventureWorksDW2012] ON ( FILENAME = ''' + @Filepath + ''') FOR ATTACH'

Exec (@SQL)
GO

 

THe below diagram shows the table size of the AdventureworksDW2012 sample database: 

The biggest table in AdventureworksDW2012 is FactProductInventory with 776286 rows. The table is too narrow to be used for columnstore index testing. and the row size below 1 million is also too small.

A good candidate for creating the nonclustered columnstore index on will be FactResellerSales. But this table has only got 60,855 rows. In order to get a good performance comparison between the B-Tree index and columnstore index and other columnstore features , a table with about 100 millions of rows will be ideal. However, in most case, we might don't have such capacity in our laptop or home computer. I decide to pump up this table to over 10 millions of rows as this size will be suitable for most of users and it is also good enough for us to experience the most of columnstore indexes.

First of all, we don't want to simply duplicate the existing rows as the repeating data in the table will skew the view of the query performance we get from the columnstore index. By analyzing the FActResellserSales table, the easiest way is to bigger the DimReseller table as a particualar salesordernumber in the table is bound to a reseller. I will repeat the each salesorder line (salesorder number + salesorderlinenumber is the primary key of FactSalestable) 180 times with random picked reseller and randomized order quantity to pump up the table to 10,953,900 rows.

As originally the DimReseller has only 701 rows, I will first make a bigger DimReseller table. The below script will make a DimResellerBig table which is 50 times bigger than DimReseller table:

USE [AdventureworksDW2012]
GO
--Create a bigger table for DimReseller
SELECT 
		Cast(Row_Number() Over(Order by (Select 0)) as int) As [ResellerKey]
		, A.[GeographyKey]
		, Cast('AW' + Format(Row_Number() Over(Order by (Select 0)), '0000000000') as Nvarchar(15)) As [ResellerAlternateKey]
		, A.[Phone]
		, A.[BusinessType]
		, Cast(A.[ResellerName] +  Format(Row_Number() Over(Order by (Select 0)), ' 0000000000') As nvarchar(50)) As ResellerName
		, A.[NumberEmployees]
		, A.[OrderFrequency]
		, A.[OrderMonth]
		, A.[FirstOrderYear]
		, A.[LastOrderYear]
		, A.[ProductLine]
		, A.[AddressLine1]
		, A.[AddressLine2]
		, A.[AnnualSales]
		, A.[BankName]
		, A.[MinPaymentType]
		, A.[MinPaymentAmount]
		, A.[AnnualRevenue]
		, A.[YearOpened]
Into	DimResellerBig
FROM	[dbo].[DimReseller] A
			Cross Join
		Master..spt_values B
Where	B.Type = 'P'
			And
		B.Number Between 1 and 50
GO


--Add indexes and constraints for DimResellerBig
Alter Table dbo.DimResellerBig Alter Column [ResellerKey] Int Not Null;

ALTER TABLE [dbo].[DimResellerBig] ADD  CONSTRAINT [PK_DimResellerBIG_ResellerKey] PRIMARY KEY CLUSTERED 
(
	[ResellerKey] ASC
);

ALTER TABLE [dbo].[DimResellerBig] ADD  CONSTRAINT [AK_DimResellerBig_ResellerAlternateKey] UNIQUE NONCLUSTERED 
(
	[ResellerAlternateKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_DimResellerBig_GeographyKey] ON [dbo].[DimResellerBig]
(
	[GeographyKey] ASC
);

GO

 

Once I have the DimResellerBig table, I can use the below script to make my big FactResellerSalesBig table:

USE [AdventureWorksDW2012]
GO
--The script is used to make a bigger FactResellerSales table, called FactResellerSalesBig.
--When a RowMultiplier 180 is used the FactResellerSalesBig table will be 10,953,900 rows
--The query takes 25 minutes to run in my laptop with a 4-core processor, a 7200 rpm hard disk and 8 GB RAM


Declare		@OrderCountTable Table (OrderDateKey INT, OrderCount Int)
Declare		@OrderDateKey Int = 2012
			, @OrderCount int
			, @RowMultiplier int = 180	--180 times more rows than the original FactResellerSales Table
										--you can make it bigger or smaller by change this number

Insert Into	@OrderCountTable
SELECT		Distinct [OrderDateKey]
			, Count(Distinct SalesOrderNumber) As OrderCount
FROM		[dbo].[FactResellerSales]
Group by	[OrderDateKey];


If Object_ID('[dbo].[FactResellerSalesBig]') Is Not Null
		Drop Table [dbo].[FactResellerSalesBig];

Select	Top(0) *
Into	FactResellerSalesBig
From	[dbo].[FactResellerSales];

While Exists (Select * From @OrderCountTable)
	Begin
		Select	Top(1) 
				@OrderDateKey = OrderDateKey
				,  @OrderCount =  OrderCount
		From	@OrderCountTable;

		Insert into FactResellerSalesBig with(Tablock)
		Select
				R.[ProductKey]
				, R.[OrderDateKey]
				, R.[DueDateKey]
				, R.[ShipDateKey]
				, Y.[ResellerKey]
				, R.[EmployeeKey]
				, R.[PromotionKey]
				, R.[CurrencyKey]
				, Y.[SalesTerritoryKey]
				, Cast(R.[SalesOrderNumber] + Format(Y.RowNum, '000') AS nvarchar(20)) As SalesOrderNumber
				, R.[SalesOrderLineNumber]
				, R.[RevisionNumber]
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) As OrderQuantity
				, R.[UnitPrice]
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) * R.[UnitPrice] As ExtendedAmount
				, R.[UnitPriceDiscountPct]
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) * R.[UnitPrice] * R.[UnitPriceDiscountPct] As DiscountAmount
				, R.[ProductStandardCost]
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) *  R.[ProductStandardCost] As TotalProductCost
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) * R.[UnitPrice] * (1 - R.[UnitPriceDiscountPct]) As SalesAmount
				, Ceiling(R.[OrderQuantity] * Y.QuantityMultiplier) * R.[UnitPrice] * (1 - R.[UnitPriceDiscountPct]) * 0.08 As TaxAmt
				, R.[Freight]
				, Cast(R.[CarrierTrackingNumber] +  Format(Y.RowNum, '-000') As nvarchar(25)) As CarrierTrackingNumber
				, Cast(R.[CustomerPONumber] + Format(Y.RowNum, '000') AS nvarchar(25)) As CustomerPONumber
				, R.[OrderDate]
				, R.[DueDate]
				, R.[ShipDate]
		From
			(
				SELECT 
						[ProductKey]
						,[OrderDateKey]
						,[DueDateKey]
						,[ShipDateKey]
						,[ResellerKey]
						,[EmployeeKey]
						,[PromotionKey]
						,[CurrencyKey]
						,[SalesTerritoryKey]
						,[SalesOrderNumber]
						,[SalesOrderLineNumber]
						,[RevisionNumber]
						,[OrderQuantity]
						,[UnitPrice]
						,[ExtendedAmount]
						,[UnitPriceDiscountPct]
						,[DiscountAmount]
						,[ProductStandardCost]
						,[TotalProductCost]
						,[SalesAmount]
						,[TaxAmt]
						,[Freight]
						,[CarrierTrackingNumber]
						,[CustomerPONumber]
						,[OrderDate]
						,[DueDate]
						,[ShipDate]
						, Dense_Rank() Over(Partition by [OrderDateKey] Order by SalesOrderNumber) As OrderNumber
				FROM	[dbo].[FactResellerSales] 
				Where	OrderDateKey =  @OrderDateKey
			) R
					Cross Apply
			(

				SELECT	TOP (@RowMultiplier) 
						A.[ResellerKey]
						, B.SalesTerritoryKey
						, Row_Number() Over(Order by Checksum(newid())) As RowNum
						, RAND(CHECKSUM(NEWID())) * 2 As QuantityMultiplier
				FROM	[DimResellerBig] A
							Inner join
						[dbo].[DimGeography] B
							on A.[GeographyKey] = B.GeographyKey
							Cross Join
						Master..spt_values C
				Where	C.Type = 'P'
							And
						C.Number Between 1 and @OrderCount
						and R.OrderNumber = C.number		
			) Y

		Print 'The records for the order date: ' + Cast(@OrderDateKey as nvarchar(8)) + ' has multiplied ' + Cast(@RowMultiplier as nvarchar(6)) + ' times';

		Delete Top(1) From	@OrderCountTable;

	End

Go

Create Clustered Index IX_FactResellerSalesBig_1 On [dbo].[FactResellerSalesBig] ([OrderDateKey] ASC, [ResellerKey] Asc);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_CurrencyKey] ON [dbo].[FactResellerSalesBig]
(
	[CurrencyKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_DueDateKey] ON [dbo].[FactResellerSalesBig]
(
	[DueDateKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_EmployeeKey] ON [dbo].[FactResellerSalesBig]
(
	[EmployeeKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_ProductKey] ON [dbo].[FactResellerSalesBig]
(
	[ProductKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_PromotionKey] ON [dbo].[FactResellerSalesBig]
(
	[PromotionKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_ShipDateKey] ON [dbo].[FactResellerSalesBig]
(
	[ShipDateKey] ASC
);

CREATE NONCLUSTERED INDEX [IX_FactResellerSalesBig_CarrierTrackingNumber] ON [dbo].[FactResellerSalesBig]
(
	[CarrierTrackingNumber] ASC
);


Create NonClustered Columnstore Index IX_FactResellerSalesBig_NCI On [dbo].[FactResellerSalesBig]
(
[ProductKey]
,[OrderDateKey]
,[DueDateKey]
,[ShipDateKey]
,[ResellerKey]
,[EmployeeKey]
,[PromotionKey]
,[CurrencyKey]
,[SalesTerritoryKey]
,[SalesOrderNumber]
,[SalesOrderLineNumber]
,[RevisionNumber]
,[OrderQuantity]
,[UnitPrice]
,[ExtendedAmount]
,[UnitPriceDiscountPct]
,[DiscountAmount]
,[ProductStandardCost]
,[TotalProductCost]
,[SalesAmount]
,[TaxAmt]
,[Freight]
,[CarrierTrackingNumber]
,[CustomerPONumber]
,[OrderDate]
,[DueDate]
,[ShipDate]
)

GO

 (Note: If you have problem to copy and paste the script into the SSMS, copy it and paste to Office Word first and then paste to SSMS.)

Now we have a bigger yet still meaningful by randomizing the reseller and quantity. The table size after 2 bigger tables are added in as below:

 

As you can see, now we have a big FactResellerSalesbig table with 10,953,900 rows and 2.6 GB data size. From next few of blogs, I'm going to dig into the columnstore index to see what make it different.

Link:

SQL Server 2012 Columnstore Index In Depth Part 2: Columnstore Index Size Calculation