SQL Server Columnstore Index Performance Summary (8): Dictionaries in Columnstore

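Continuing from the previous post, SQL Server Columnstore Index Performance Summary (7): Loading Data into the Columnstore Index Delta Store. Dictionaries have come up several times already, so this post gives a quick introduction to what they are, which should make columnstore easier to understand.
This post does not go very deep, though, because the feature applies only to SQL Server and not to SQL DB or SQL DW (now called Azure Synapse Analytics).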

Environment

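  This post continues to use ContosoRetailDW for the demo. First, though, the script below checks which tables are good candidates for a clustered columnstore index (more than 1 million rows):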

-- Return tables that are candidates for a clustered columnstore index in a data warehouse environment
SELECT object_schema_name(t.object_id) AS 'Schema'
	,object_name(t.object_id) AS 'Table'
	,sum(p.rows) AS 'Row Count'
	,(
		SELECT count(*)
		FROM sys.columns AS col
		WHERE t.object_id = col.object_id
		) AS 'Cols Count'
	,(
		SELECT sum(col.max_length)
		FROM sys.columns AS col
		JOIN sys.types AS tp ON col.user_type_id = tp.user_type_id -- one type row per column; system_type_id can match several types and inflate the sum
		WHERE t.object_id = col.object_id
		) AS 'Cols Max Length'
	,(
		SELECT count(*)
		FROM sys.columns AS col
		JOIN sys.types AS tp ON col.user_type_id = tp.user_type_id -- avoids counting a column once per type that shares its system_type_id
		WHERE t.object_id = col.object_id
			AND (
				UPPER(tp.name) IN (
					'TEXT'
					,'NTEXT'
					,'TIMESTAMP'
					,'HIERARCHYID'
					,'SQL_VARIANT'
					,'XML'
					,'GEOGRAPHY'
					,'GEOMETRY'
					)
				OR (
					UPPER(tp.name) IN (
						'VARCHAR'
						,'NVARCHAR'
						)
					AND (
						col.max_length = 8000
						OR col.max_length = - 1
						)
					)
				)
		) AS 'Unsupported Columns'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type = 'PK'
			AND parent_object_id = t.object_id
		) AS 'Primary Key'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type = 'F'
			AND parent_object_id = t.object_id
		) AS 'Foreign Keys'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type IN (
				'UQ'
				,'D'
				,'C'
				)
			AND parent_object_id = t.object_id
		) AS 'Constraints'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type IN (
				'TA'
				,'TR'
				)
			AND parent_object_id = t.object_id
		) AS 'Triggers'
	,t.is_tracked_by_cdc AS 'CDC'
	,t.is_memory_optimized AS 'Hekaton'
	,t.is_replicated AS 'Replication'
	,CASE WHEN t.filestream_data_space_id IS NULL THEN 0 ELSE 1 END AS 'FileStream' -- 1 if the table has a FILESTREAM data space
	,t.is_filetable AS 'FileTable'
FROM sys.tables t
INNER JOIN sys.partitions AS p ON t.object_id = p.object_id
WHERE p.data_compression IN (
		0
		,1
		,2
		) -- None, Row, Page
	AND p.index_id IN (
		0
		,1
		)
	AND (
		SELECT count(*)
		FROM sys.indexes ind
		WHERE t.object_id = ind.object_id
			AND ind.type IN (
				5
				,6
				)
		) = 0
GROUP BY t.object_id
	,t.is_tracked_by_cdc
	,t.is_memory_optimized
	,t.is_filetable
	,t.is_replicated
	,t.filestream_data_space_id
HAVING sum(p.rows) > 1000000
ORDER BY sum(p.rows) DESC

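  A word of caution: this script is neither universal nor an absolute standard, so treat its output as a reference only. The focus here is on large tables. Five tables meet the criteria: FactOnlineSales, FactInventory, FactSalesQuota, FactSales, and FactStrategyPlan.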

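  These tables carry quite a few primary and foreign keys. To keep them from getting in the way, drop them first; if you are worried about the effect on later work, back up the database before doing so: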

-- Drop primary keys:
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [PK_FactOnlineSales_SalesKey]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [PK_FactStrategyPlan_StrategyPlanKey]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [PK_FactSales_SalesKey]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [PK_FactInventory_InventoryKey]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [PK_FactSalesQuota_SalesQuotaKey]

-- Drop foreign keys:
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCurrency]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCustomer]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimDate]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimProduct]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimPromotion]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimStore]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimAccount]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimCurrency]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimDate]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimEntity]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimProductCategory]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimScenario]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimChannel]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimCurrency]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimDate]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimProduct]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimPromotion]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimStore]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimCurrency]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimDate]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimProduct]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimStore]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimChannel]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimCurrency]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimDate]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimProduct]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimScenario]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimStore]

  Next, create a clustered columnstore index (CCI from here on) on each table:

CREATE CLUSTERED COLUMNSTORE INDEX CCI ON dbo.FactOnlineSales WITH (DATA_COMPRESSION = COLUMNSTORE);
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON dbo.FactStrategyPlan WITH (DATA_COMPRESSION = COLUMNSTORE);
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON dbo.FactSales WITH (DATA_COMPRESSION = COLUMNSTORE);
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON dbo.FactInventory WITH (DATA_COMPRESSION = COLUMNSTORE);
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON dbo.FactSalesQuota WITH (DATA_COMPRESSION = COLUMNSTORE);
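
  As a quick sanity check after creating the indexes, the rowgroups of each CCI can be listed. This is a minimal sketch, assuming SQL Server 2014 or later, where the sys.column_store_row_groups catalog view is available; the output column aliases are my own.

```sql
-- Sketch: summarize the rowgroups of every columnstore index in the current database.
-- Assumes SQL Server 2014+ (sys.column_store_row_groups).
SELECT OBJECT_NAME(rg.object_id) AS table_name
	,rg.state_description -- e.g. COMPRESSED or OPEN
	,COUNT(*) AS row_groups
	,SUM(rg.total_rows) AS total_rows
	,SUM(rg.size_in_bytes) AS size_in_bytes
FROM sys.column_store_row_groups AS rg
GROUP BY rg.object_id
	,rg.state_description
ORDER BY table_name
	,rg.state_description;
```

  Each of the five fact tables should appear in the output once its index has been created.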

  The following script queries the dictionary information:

-- Summarize columnstore dictionaries per table and dictionary_id (sizes are in bytes)
SELECT object_name(part.object_id) AS 'Table'
	,dict.dictionary_id
	,count(*) AS 'Number of Dictionaries'
	,sum(dict.entry_count) AS 'Entry Count'
	,min(dict.on_disk_size) AS 'Min Size'
	,max(dict.on_disk_size) AS 'Max Size'
	,avg(dict.on_disk_size) AS 'Avg Size'
FROM sys.column_store_dictionaries AS dict
INNER JOIN sys.partitions AS part ON dict.hobt_id = part.hobt_id
GROUP BY part.object_id
	,dict.dictionary_id
ORDER BY object_name(part.object_id)
	,dict.dictionary_id;

  In the output of this query, the number of dictionaries roughly decreases as dictionary_id increases.

Introduction and Summary

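  In a columnstore, some columns need an additional dictionary, character columns for example, which is used to encode their values into a form that can be stored compactly. Dictionaries come in two kinds, global and local, and are associated with column segments. The global dictionary (dictionary_id = 0) is shared by all segments of a column, while any other dictionary_id denotes a local dictionary.

Dictionaries encode only some data types, not all of them, so not every column in a columnstore has a dictionary.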
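
  To see which columns actually have dictionaries, the dictionary view can be joined back to the column metadata. The query below is a sketch rather than an authoritative mapping: it assumes that column_id in sys.column_store_dictionaries lines up with sys.columns.column_id, which should hold for a CCI on a table that has had no columns dropped. The aliases are my own, and the type codes in the comment follow the documentation of that view.

```sql
-- Sketch: list each dictionary of dbo.FactOnlineSales with its column and data type.
-- Assumption: dict.column_id matches sys.columns.column_id (no dropped columns).
SELECT col.name AS column_name
	,tp.name AS data_type
	,dict.dictionary_id
	,CASE WHEN dict.dictionary_id = 0 THEN 'Global' ELSE 'Local' END AS dictionary_scope
	,dict.type AS dictionary_type -- 1 = int values, 3 = string values, 4 = float values
	,dict.entry_count
	,dict.on_disk_size -- bytes
FROM sys.column_store_dictionaries AS dict
INNER JOIN sys.partitions AS part ON dict.hobt_id = part.hobt_id
INNER JOIN sys.columns AS col ON col.object_id = part.object_id
	AND col.column_id = dict.column_id
INNER JOIN sys.types AS tp ON col.user_type_id = tp.user_type_id
WHERE part.object_id = OBJECT_ID('dbo.FactOnlineSales')
ORDER BY col.column_id
	,dict.dictionary_id;
```

  Columns of the table that never show up in this result have no dictionary at all.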

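  Also, the entry count does not simply rise or fall as dictionary_id grows; it rises first and then falls. The sizes and the number of dictionaries differ widely as well.
  Dictionaries are mainly used for character types. A dictionary stores the list of distinct values in a column, pairing each data value with a dictionary value, and the columnstore then stores the dictionary value instead of the original data. Take a color column: red, yellow, blue, and green are stored as 1, 2, 3, and 4 in the dictionary and the columnstore, which reduces storage further (this example is only illustrative; colors are not actually stored by this rule).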
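
  The color example can be emulated in plain T-SQL to make the idea concrete. This is purely a conceptual sketch: the table variables @Source, @Dictionary, and @Encoded are hypothetical names, and the engine's real dictionary format is internal and does not look like this.

```sql
-- Conceptual sketch only: emulate dictionary encoding with table variables.
-- Run as a single batch; all names here are hypothetical.
DECLARE @Source TABLE (RowId INT IDENTITY(1, 1) PRIMARY KEY, Color NVARCHAR(10) NOT NULL);
INSERT INTO @Source (Color)
VALUES (N'Red'), (N'Yellow'), (N'Blue'), (N'Green'), (N'Red'), (N'Blue'), (N'Red');

-- The "dictionary": one entry per distinct value, numbered in order of first appearance.
DECLARE @Dictionary TABLE (DictId INT PRIMARY KEY, Color NVARCHAR(10) NOT NULL UNIQUE);
INSERT INTO @Dictionary (DictId, Color)
SELECT ROW_NUMBER() OVER (ORDER BY MIN(RowId)), Color
FROM @Source
GROUP BY Color;

-- The "encoded column": store the small integer id instead of the string.
DECLARE @Encoded TABLE (RowId INT PRIMARY KEY, DictId INT NOT NULL);
INSERT INTO @Encoded (RowId, DictId)
SELECT s.RowId, d.DictId
FROM @Source AS s
INNER JOIN @Dictionary AS d ON d.Color = s.Color;

-- Decoding needs one extra lookup against the dictionary.
SELECT e.RowId, e.DictId, d.Color
FROM @Encoded AS e
INNER JOIN @Dictionary AS d ON d.DictId = e.DictId
ORDER BY e.RowId;
```

  The repeated strings are stored once in the dictionary, and the encoded column holds only integers; the final join is the extra lookup step that reading the data back requires.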
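  In most cases the values of a column go into its global dictionary, but some are stored in local dictionaries instead.
  At this point the trade-off is easy to see: storage shrinks, but a "lookup or mapping table" has been introduced, so every operation involves one extra step. There is no denying that this approach greatly reduces storage; whether the query gains outweigh the lookup overhead has to be judged case by case. In general, about all we can do is rebuild or reorganize the index.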
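
  For reference, rebuilding and reorganizing look like this on one of the demo tables. It is a minimal sketch using the index name CCI created earlier; how much the dictionaries change afterwards depends on the data and the SQL Server version, so it is worth re-running the dictionary query before and after to compare.

```sql
-- Rebuild: recreates the columnstore index, recompressing all of its data.
ALTER INDEX CCI ON dbo.FactOnlineSales REBUILD;

-- Reorganize: a lighter, online operation that compresses closed delta store rowgroups.
ALTER INDEX CCI ON dbo.FactOnlineSales REORGANIZE;
```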

  Next post: SQL Server Columnstore Index Performance Summary (9): Memory Required to Rebuild and Reorganize Clustered Columnstore Indexes

Reposted from blog.csdn.net/DBA_Huangzj/article/details/104975804