SQL Server 2012 Columnstore Index In Depth Part 3: the Myth of an Extra Column in sys.column_store_segments

April 22, 2012 at 4:47 AMSteven Wang

There 2 new system catalog views added into SQL server 2012 for columnstore indexes:

1) sys.column_store_segments

2) sys.column_store_dictionaries

(Note: there was another view: sys.column_store_index_stats before the RTM version, but it is no long available in the final release. Not sure why this is dropped.)

These 2 system views are very useful to find out more about an columnstore index in your database, like segments, datasize etc.

For both system catalog views, the BOL states that 'Contains a row for each column in a columnstore index', which means each column in a columnstore index will have a row in these system viewss. While this statement is true, the reverse is not always. There are some times that you might notice that there is an extra column in  sys.column_store_segments view. What is this extra column? let's go to find it out.

As the original FactResellerSales table is a small table, we can use this table to perform the test for this purpose. I will create a nonclustered columnstore index directly on this table for all the columns (27 columns)

Use [AdventureWorksDW2012]
Go

If Exists (	Select *
			From sys.indexes
			Where	type = 6
					And Object_id = Object_id('dbo.FactResellerSales')
			)
	Begin
		Declare @IndexName sysname
				, @SQL nvarchar(max)

		Select	@IndexName = Name
		From	sys.indexes
		Where	type = 6
				And Object_id = Object_id('dbo.FactResellerSales')
		Set @SQL = 'DROP INDEX ' + quotename(@IndexName) + ' ON [dbo].[FactResellerSales];'
		
		Exec (@SQL);

		--DROP INDEX [idx_FactResellerSales_CI] ON [dbo].[FactResellerSales];
	End
--columns

CREATE NONCLUSTERED COLUMNSTORE INDEX [idx_FactResellerSales_CI] 
ON [dbo].[FactResellerSales]
(
	[ProductKey],
	[OrderDateKey],
	[DueDateKey],
	[ShipDateKey],
	[ResellerKey],
	[EmployeeKey],
	[PromotionKey],
	[CurrencyKey],
	[SalesTerritoryKey],
	[SalesOrderNumber],
	[SalesOrderLineNumber],
	[RevisionNumber],
	[OrderQuantity],
	[UnitPrice],
	[ExtendedAmount],
	[UnitPriceDiscountPct],
	[DiscountAmount],
	[ProductStandardCost],
	[TotalProductCost],
	[SalesAmount],
	[TaxAmt],
	[Freight],
	[CarrierTrackingNumber],
	[CustomerPONumber],
	[OrderDate],
	[DueDate],
	[ShipDate]
);
GO

Select	S.*
From	sys.column_store_segments S
			Inner Join
		sys.Partitions P
			On	S.hobt_id = P.hobt_id
			Inner Join
		sys.indexes I
			On	i.object_id = P.object_id
				And I.index_id = P.index_id
Where	I.Object_id = Object_Id('FactResellerSales')
		And I.Type = 6

From the results we can see that there is exactly 27 columns in the sys.column_store_segments table. So far so good. now let's create an exctaly same structure table from FactResellerSales without any B-Tree indexes, and create a noclustered columnstore index on it:

Use [AdventureWorksDW2012]
Go
Select	*
Into	FactResellerSales_Heap
From	[dbo].[FactResellerSales];

CREATE NONCLUSTERED COLUMNSTORE INDEX [idx_FactResellerSales_Heap_CI] 
ON [dbo].[FactResellerSales_Heap]
(
	[ProductKey],
	[OrderDateKey],
	[DueDateKey],
	[ShipDateKey],
	[ResellerKey],
	[EmployeeKey],
	[PromotionKey],
	[CurrencyKey],
	[SalesTerritoryKey],
	[SalesOrderNumber],
	[SalesOrderLineNumber],
	[RevisionNumber],
	[OrderQuantity],
	[UnitPrice],
	[ExtendedAmount],
	[UnitPriceDiscountPct],
	[DiscountAmount],
	[ProductStandardCost],
	[TotalProductCost],
	[SalesAmount],
	[TaxAmt],
	[Freight],
	[CarrierTrackingNumber],
	[CustomerPONumber],
	[OrderDate],
	[DueDate],
	[ShipDate]
);
GO

Select	S.*
From	sys.column_store_segments S
			Inner Join
		sys.Partitions P
			On	S.hobt_id = P.hobt_id
			Inner Join
		sys.indexes I
			On	i.object_id = P.object_id
				And I.index_id = P.index_id
Where	I.Object_id = Object_Id('FactResellerSales_Heap')
		And I.Type = 6

This query returns back 28 rows and the maximun column_id value of sys.column_store_segments is 28. This 28th column in the sys.column_store_segments table is an extra column than the base table. Before we try to conclude what this column is, I'm going to do another test.  I will create an exctaly same structure table from FactResellerSales and create an non-unique clustered index on the new created table, and then create a noclustered columnstore index on it:

Use [AdventureWorksDW2012]
Go
Select	*
Into	FactResellerSales_Clustered_nonunique
From	[dbo].[FactResellerSales];

Create clustered index idx_FactResellerSales_Clustered_nonunique_1 on  [dbo].[FactResellerSales_Clustered_nonunique] ([OrderDateKey]);

CREATE NONCLUSTERED COLUMNSTORE INDEX [idx_FactResellerSales_Clustered_nonunique_CI] 
ON [dbo].[FactResellerSales_Clustered_nonunique]
(
	[ProductKey],
	[OrderDateKey],
	[DueDateKey],
	[ShipDateKey],
	[ResellerKey],
	[EmployeeKey],
	[PromotionKey],
	[CurrencyKey],
	[SalesTerritoryKey],
	[SalesOrderNumber],
	[SalesOrderLineNumber],
	[RevisionNumber],
	[OrderQuantity],
	[UnitPrice],
	[ExtendedAmount],
	[UnitPriceDiscountPct],
	[DiscountAmount],
	[ProductStandardCost],
	[TotalProductCost],
	[SalesAmount],
	[TaxAmt],
	[Freight],
	[CarrierTrackingNumber],
	[CustomerPONumber],
	[OrderDate],
	[DueDate],
	[ShipDate]
);
GO

Select	S.*
From	sys.column_store_segments S
			Inner Join
		sys.Partitions P
			On	S.hobt_id = P.hobt_id
			Inner Join
		sys.indexes I
			On	i.object_id = P.object_id
				And I.index_id = P.index_id
Where	I.Object_id = Object_Id('FactResellerSales_Clustered_nonunique')
		And I.Type = 6

Again the result returns 28 rows and the 28th column in the sys.column_store_segments table is an extra column than the base table.

Now we have a clear picture:

So why the nonclustered columnstore index on a heap table will have an extra column in sys.column_store_segments?

It seems that the nonclucstered columnstore indexes like a B-Tree nonclustered index need a RID (Row ID) to refer back to the base heap table.

For a table with a none unique colustered index, a 4 byte integer uniquifier is added into the table. We can examin this in three different ways.

If you use the old sysindexes table there is a keycnt column will tell you exactly how many fields are used to form a key for the index.

Select	'FactResellerSales_Clustered_Nonunique' As Table_Name
		, Keycnt as clustered_index_keys
From	sysindexes A, sys.indexes B
Where	A.id = Object_id ('FactResellerSales_Clustered_Nonunique')
		And A.id = b.object_id
		And A.indid = B.index_id
		And	B.Type = 6

This returns 2. as we only used ([OrderDateKey]) to create the idx_FactResellerSales_Clustered_nonunique_1 clustered index, which indicate that an unique identifier is added by system. The sysindexes is an SQL server 2000 system table, which is included as a view for backward compatibility only. Although it is still available now, it is better to avoid use this and this view merely tells the count of keys with no more further information.

There is a new undocumented system view: sys.system_internals_partition_columns, which has much detailed information about columns in a table partition. From this view, I can clearly get the aditional uniquifier column is added for this nonunique clustered index. Run the script below:

Select	Object_Name(B.object_id) As Table_Name
		, partition_column_id
		, Col_Name(B.object_id, partition_column_id) As Column_Name
		, C.Name AS Data_Type
		, Key_ordinal
		, Count(Key_Ordinal) Over() As Key_Count --Total number of keys for the clustered index
		, is_uniqueifier
From	sys.system_internals_partition_columns A
			Inner Join
		sys.partitions B
			On	A.partition_id = B.partition_id
			Inner Join
		sys.types C
			On	A.system_type_id = C.system_type_id
Where	B.object_id = object_id('FactResellerSales_Clustered_nonunique')
		And B.index_id = 1		--Clustered index
		And key_ordinal != 0	--Only interested in the Clustered index key columns

 

From the above result, we can see that a uniquifier is added and the data type is Integer and there is no column name associated with it.

We also can use the DBCC Page to delve into the physical data page to see the uniquifier is added in the page level.

Use [AdventureWorksDW2012]
Go

Select	P.database_id
		, P.allocated_page_file_id
		, P.allocated_page_page_id
		, P.allocation_unit_type_desc
From	sys.indexes I
			Cross Apply
		sys.dm_db_database_page_allocations(DB_ID(), I.object_id, I.index_id, Null, 'limited') P
Where 
		I.object_id = Object_id ('FactResellerSales_clustered_nonunique')
		And P.is_iam_page = 0
		And P.is_allocated = 1
		And P.is_mixed_page_allocation = 0
		And I.Type = 1  --Type 0 is heap
Go

DBCC traceon(3604, -1)
Go

DBCC page (
			7,				--Database ID
			1,				--Page File ID
			1033784,		--Page ID, check with your own pageid
			3				--Commnad output style
			)

The highlighted page output tells that a 4 byte integer uniquifier is added.

 

This finishes this post. Thanks for reading.

Previous Related posts:

1. SQL Server 2012 Columnstore Index In Depth Part 1: Make a bigger AdventureworksDW database for Testing

2. SQL Server 2012 Columnstore Index In Depth Part 2: Columnstore Index Physical Data and Size Calculation   

 

 

SQL Server 2012 Columnstore Index In Depth Part 2: Columnstore Index Physical Data and Size Calculation

April 20, 2012 at 3:41 AMSteven Wang

I my previous blog SQL Server 2012 Columnstore Index In Depth Part 1: Make a bigger AdventureworksDW database for Testing, I have created a bigger FactResellerSalesBig table in AdventureworksDW2012 database and a nonclustered columnstore index IX_FactResellerSalesBig_NCI  is created. Now we can check how a columnstore index is stored and calculate the size of the nonclustered columnstore index as a whole or even an individual column within the columnstore index.

1. How does a Columnstore Index Store Data

A columnstore index is broken into one or multiple segments for each partition (when a table is not partitioned it has only one partition), which consist of multiple physical data pages. A full segment contains 1,048,576 rows. Based on the number of threads used to create a columnstore index, the ending segments group (normally the number of threads used) for a column in each partition might have different row number which is smaller than 1,048,576. If a base table only has less than 1,048,576 rows then only 1 segment per column will be created.

A columnstore index segment is internally stored as a BLOB data type. This can be verified by using DBCC Page command.

Before using DBCC Page command, we need to know the page ids allocated to the columnstore index we are going to examine. As we are using SQL server 2012, we now have 2 ways to get the page ids:

a) use DBCC IND

b) use the new added Dynamic management view in SQL server 2012: sys.dm_db_database_page_allocations

Although they are both undocumented, to use sys.dm_db_database_page_allocations has many advantages over DBCC IND. I am not going to discuss details about this DM view. Kalen Delaney has an article on SQLMag: SQL server's Undocumented Changes, which has a good explanation on how to use this DM view.

For this post, based on the table FactResellerSalesBig I created (refer to my last post), I will use the below query to get the page informations:

 

Use [AdventureWorksDW2012]
Go

Select	P.database_id
		, P.allocated_page_file_id
		, P.allocated_page_page_id
		, P.allocation_unit_type_desc
From	sys.indexes I
			Cross Apply
		sys.dm_db_database_page_allocations(DB_ID(), I.object_id, I.index_id, Null, 'limited') P
Where 
		I.object_id = Object_id ('FactResellerSalesBig')
		And P.is_iam_page = 0
		And P.is_allocated = 1
		And P.is_mixed_page_allocation = 0
		And I.Type = 6  --Type 6 is 'NONCLUSTERED COLUMNSTORE'
Go

 From the demo database AdventureworksDW2012 in my computer, I got the below results:

 From the result, we already can know that the data type used to store the columnstore index is BLOB data type. But we can further delve into the physical page level to see how data is stored.

DBCC page (
			7,				--Database ID
			1,				--Page File ID
			270800,			--Page ID
			3				--Commnad output style
			)

 

The highlighted information from the DBCC Page output has testified that the BLOB data is used to store columnstore indexes. 

 

2. Calculate the Size of A Nonclustered Columnstore Index

Based on the Columnstoreindex book online, We can calculate the size of the nonclustered columnstore index IX_FactResellerSalesBig_NCI as below:

SELECT SUM(on_disk_size_MB) AS TotalSizeInMB
FROM
(
   (SELECT SUM(css.on_disk_size)/(1024.0*1024.0) on_disk_size_MB
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
        ON i.object_id = p.object_id 
    JOIN sys.column_store_segments AS css
        ON css.hobt_id = p.hobt_id
    WHERE i.object_id = object_id('FactResellerSalesBig') 
    AND i.type_desc = 'NONCLUSTERED COLUMNSTORE') 
  UNION ALL
   (SELECT SUM(csd.on_disk_size)/(1024.0*1024.0) on_disk_size_MB
    FROM sys.indexes AS i
    JOIN sys.partitions AS p
        ON i.object_id = p.object_id 
    JOIN sys.column_store_dictionaries AS csd
        ON csd.hobt_id = p.hobt_id
    WHERE i.object_id = object_id('FactResellerSalesBig') 
    AND i.type_desc = 'NONCLUSTERED COLUMNSTORE') 
) AS SegmentsPlusDictionary

The result of this query is: 242.34 MB. However, from the database data storage point of view, this calculation does not truly reflect the size of the index data pages in the database. The data size from both systerm catalog views are the BLOB data size which doesn't account for the metadata and page headers and other overheads for a page to hold the blob data.

Thanks to Eric Hanson, the Principal Progran Manger at Microsoft, he cleared my doubt and confirmed that my calculation below is accurate.

SELECT	SUM(s.used_page_count) / 128.0 on_disk_size_MB
FROM	sys.indexes AS i
		JOIN sys.dm_db_partition_stats AS S
			ON	i.object_id = S.object_id 
				and I.index_id = S.index_id
WHERE	i.object_id = object_id('FactResellerSalesBig') 
		AND i.type_desc = 'NONCLUSTERED COLUMNSTORE'

The result of this query is: 250.37 MB

As the size of base table FactResellerSalesBig table is 2626 MB, we can get that the columnstore index compression ratio is around 10 : 1.

3. Calculate the Individual Column Size of A Nonclustered Columnstore Index

As columnstore indexes store column data separately, we can calculate each compressed column data size for a columnstore index. The column size can be caculated by using the below script. This script was orginally written by Eric Hanson, the Principal Program Manager of Microsoft. I have made a modification to take care of the situation when not all the columns are included in a columnstore index.

-- size per column 
with segment_size_by_column as ( 
	SELECT 
			p.object_id as table_id, 
			css.column_id, 
			SUM (css.on_disk_size)/1024/1024.0 AS segment_size_mb 
	FROM	sys.partitions AS p 
				JOIN 
			sys.column_store_segments AS css 
				ON p.hobt_id = css.hobt_id 
	GROUP BY p.object_id, css.column_id 
), 
dictionary_size_by_column as ( 
	SELECT 
			p.object_id as table_id, 
			csd.column_id, 
			SUM (csd.on_disk_size)/1024/1024.0 AS dictionary_size_mb 
	FROM	sys.partitions AS p 
				JOIN 
			sys.column_store_dictionaries AS csd 
				ON p.hobt_id = csd.hobt_id 
	GROUP BY p.object_id, csd.column_id 
) 

--As in the real world, not all the columns in a table will be or can be included in a nonclustered columnstore index, we need to join
--to the sys.index_columns to get the correct column id.
Select 
		Object_Name(s.table_id) as table_name, 
		C.column_id,	
		col_name(S.table_id, C.column_id) as column_name, 
		s.segment_size_mb, 
		d.dictionary_size_mb, 
		s.segment_size_mb + isnull(d.dictionary_size_mb, 0) total_size_mb 
from	segment_size_by_column s 
			join 
		sys.indexes I -- Join to sys.Indexes system table
			ON I.object_id = s.table_id 
			join 
		sys.index_columns c --Join to sys.Index_columns
			ON	c.object_id = s.table_id 
				And I.index_id = C.index_id
				and c.index_column_Id = s.column_id --Need to join to the index_column_id with the column_id from segment_size_by_column
			left outer join 
		dictionary_size_by_column d 
			on	s.table_id = d.table_id 
				and s.column_id = d.column_id 
Where	I.type_desc =  'NONCLUSTERED COLUMNSTORE'
order by total_size_mb desc

  (Note: If you have problem to copy and paste the script into the SSMS, copy it and paste to Office Word first and then paste to SSMS.)

 

Thanks for reading.

 

 

xVelocity Memory Optimized Columnstore indexes: A Gem of SQL Server 2012

April 18, 2012 at 5:06 AMSteven Wang

I was presenting at Auckland SQL User Group for the titled session.

Abstract:

The columnstar indexes, formerly known as project "Apollo", are new future of SQL server 2012. Columnstore indexes deliver the blazing-fast query performance, particularly for the star-joining queries. It is not surprised to notice a query performance increase up to 100 times faster. In this session, Steven will take you through the nitty-gritty and internals of columnstore indexes; he will also talk though the columnstore index do's and don'ts, limitations, tips and tricks, and the possible impact on the future BI solution considerations.

Thanks for all attendants!

I have attached the slides deck and SQL code in this blog.

 

ColumnstoreIndex_SQLUserGroup.pptx (1.46 mb)

SQL_Scripts.zip (15.09 kb)