Advanced Data Cleaning Techniques in SQL Server -- NZ BI User Group

November 9, 2012 at 10:07 PM by Steven Wang

I will present at the NZ BI User Group on the topic Advanced Data Cleaning Techniques in SQL Server, on 20 November at Microsoft Auckland.

Abstract:

In the real world, data sources are too often full of dirty data due to data entry errors and differences in data representation among data sources. Poor data quality has a very negative impact on business intelligence and decision support systems, which makes data cleaning a critically important phase of any business intelligence project.

In this session, some practical data cleaning techniques will be discussed, including but not limited to the list below:

  • Data Profiling
  • SQL Server Window Functions
  • Script Transformation with Regular Expression
  • Fuzzy Grouping
  • Fuzzy Lookup
  • DQS Cleansing (Brief Introduction)
This is a 300 to 400 level session, and some fuzzy logic internals will also be touched on.
 
The presentation files are attached below:
 
 

NZBI_User_Group_Data_Cleaning.zip (3.87 mb)

 

 

Posted in: Data Cleaning | DQS | SSIS

Tags:

Intelligently Backup SQL Server Analysis Service Databases

May 3, 2012 at 9:50 PM by Steven Wang

There are normally two main approaches to backing up SSAS databases:

1. Back up SSAS databases at a scheduled time

2. Back up SSAS databases whenever a processing operation has been performed.

Although there is nothing wrong with these two approaches, neither is ideal. Both are hard to manage, especially when there are many databases under the same SSAS instance. A better way is to back up SSAS databases intelligently, based on data change. A job can be set up to check the SSAS databases for changes at a scheduled interval; if a database is detected to have changed, either in data or in schema, it is backed up, and if nothing has changed a new backup is not needed. In this post, I will use an SSIS package to manage the SSAS database backups intelligently. My solution is based on the logic below:

1. Only modified SSAS databases (data or schema) will be backed up.

2. A newly added SSAS database will always be backed up.

3. SSAS backups can be traced. The backup history will be kept.

4. Backup file retention can be configured (how many backup files need to be kept).

To start with, I need two SSAS maintenance tables to manage the backup solution. I use the script below for my tables. (Some columns in this script are not specific to the backup purpose, but as they are related to the SSIS package, they are kept.)

CREATE TABLE [dbo].[Database_Backup_Admin]
(
	[Database_Name] [nvarchar](128) NOT NULL,
	[Data_Modified] [datetime] NULL,
	[Backup_Date] [datetime] NULL,
	[DNS_Instance_Name] [nvarchar](128) NOT NULL,
	[Instance_Type] [nvarchar](50) NOT NULL,
	[Backup_file_to_Keep] [tinyint] NULL,
	[Is_Current] [bit] NULL,
	[Is_Backuped_Latest] [bit] NULL,
	[CacheWarm_Date] [datetime] NULL,
	[Is_CacheWarm_Current] [bit] NULL,
	[Is_CacheWarm_On] [bit] NULL,
	CONSTRAINT [pk_Database_BackUP_Admin] PRIMARY KEY CLUSTERED 
	(
		[Database_Name] ASC,
		[DNS_Instance_Name] ASC,
		[Instance_Type] ASC
	)
);

CREATE TABLE [dbo].[Database_Backup_History]
(
	[Database_Name] [nvarchar](128) NOT NULL,
	[Backup_Date] [datetime] NOT NULL,
	[DNS_Instance_Name] [nvarchar](128) NOT NULL,
	[Instance_Type] [nvarchar](50) NOT NULL,
	[Backup_file_Name] [nvarchar](128) NULL,
	[Backup_file_Size] [int] NULL,
	[Backup_start_Time] [datetime] NULL,
	[Backup_End_Time] [datetime] NULL,
	[Backup_Duration_Second] [int] NULL,
	CONSTRAINT [pk_Database_Backup_History] PRIMARY KEY CLUSTERED 
	(
		[Database_Name] ASC,
		[Backup_Date] ASC,
		[DNS_Instance_Name] ASC,
		[Instance_Type] ASC
	)
);

 

Once the tables are created, I created an SSIS package with the layout shown below:

From the package task flow, the first two steps check whether all the databases in our Database_Backup_Admin table are up to date. If a database used to be on the server but has now been removed, the Is_Current column for that database is set to 0. If a new database has been added to the server, it is loaded into the table. A Slowly Changing Dimension data transformation is used for this check, and the DBSchema_Catalogs DMV is used to list the SSAS databases on the server instance. The data flow task for the database check is shown below:
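To see what this DMV returns, you can run a simple query against the SSAS instance (a minimal sketch; it needs to be executed through an Analysis Services connection, not a relational one):

SELECT	[CATALOG_NAME]		-- SSAS database name
	,[DATE_MODIFIED]	-- last modification time (UTC)
FROM	$SYSTEM.DBSCHEMA_CATALOGS;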

SSAS database data changes can be verified through the MDSchema_Cubes DMV. The last_data_update column of MDSchema_Cubes carries a timestamp of the last cube data change. The package updates the Data_Modified column of the Database_Backup_Admin table based on the last_data_update column of MDSchema_Cubes.
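A minimal sketch of that DMV query (again executed against the SSAS instance; presumably the latest last_data_update per database is what ends up in Data_Modified):

SELECT	[CATALOG_NAME]		-- SSAS database name
	,[CUBE_NAME]
	,[LAST_DATA_UPDATE]	-- last data change of the cube (UTC)
FROM	$SYSTEM.MDSCHEMA_CUBES;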

After the Data_Modified column is updated, a comparison between Data_Modified and Backup_Date is performed. If Backup_Date is older than Data_Modified, Is_Backuped_Latest is marked as False, which indicates that a new backup is needed.
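A minimal T-SQL sketch of that comparison against the maintenance table created earlier (how the package implements it inside SSIS may differ):

-- Flag databases that have changed since the last backup
-- (or have never been backed up) as needing a new backup.
UPDATE	dbo.Database_Backup_Admin
SET	Is_Backuped_Latest = 0
WHERE	Is_Current = 1
	AND (Backup_Date IS NULL OR Backup_Date < Data_Modified);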

The last Foreach Loop container is used to back up the SSAS databases by using the XMLA Backup command. Logic is also included to remove older backup files, based on the configured number of backup files to keep. The script used to remove the old backup files:

Imports System
Imports System.Collections	' needed for the IComparer implementation below
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.IO
Imports System.Windows.Forms

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
	Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

	Enum ScriptResults
		Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
		Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
	End Enum
	
    Public Class CompareFileLastWriteTime
        Implements IComparer
        Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer Implements IComparer.Compare
            'this "Compare" code is borrowed from: http://geekswithblogs.net/ntsmith/archive/2006/08/17/88250.aspx
            Dim File1 As FileInfo
            Dim File2 As FileInfo

            File1 = DirectCast(x, FileInfo)
            File2 = DirectCast(y, FileInfo)

            Compare = DateTime.Compare(File2.LastWriteTime, File1.LastWriteTime)
        End Function
    End Class

	Public Sub Main()
		'
		' Read the package variables that drive the backup retention logic

        Dim BackupDir As String = Dts.Variables("BackupDir").Value.ToString
        Dim DatabaseName As String = Dts.Variables("DatabaseName").Value.ToString
        Dim FilesToKeep As Integer = CInt(Dts.Variables("FilesToKeep").Value)
        Dim NewBackupFile = New FileInfo(BackupDir & Dts.Variables("BackupFileName").Value.ToString)
        Dim directory = New DirectoryInfo(BackupDir)
        Dim AllBackups = directory.GetFiles(DatabaseName & "_*")
        Dim filename As String = ""
        Dim oldesttime As Date = Now
        Dim FilesCount As Integer = 0

        Array.Sort(AllBackups, New CompareFileLastWriteTime)
   
        Dts.Variables("BackupEndTime").Value = Now

        ' Walk the backups from newest to oldest and delete any beyond the configured retention count
        For Each entry In AllBackups
            If oldesttime > entry.LastWriteTime Then
                oldesttime = entry.LastWriteTime
                filename = entry.FullName
                FilesCount += 1
                If FilesCount > FilesToKeep AndAlso filename <> "" Then
                    File.Delete(filename)
                    FilesCount -= 1
                End If
            End If
        Next

        Dts.Variables("BackupFileSize").Value = CInt(NewBackupFile.Length / 1024)

		Dts.TaskResult = ScriptResults.Success
	End Sub

End Class

 

The package has the below variables:

Of these variables, the following need to be configured:

1. InstanceName

2. DatabaseName

3. BackupDir (the OLAP backup directory; please note that you need to replace each single \ with \\ if you set the value through a SQL Agent job)

4. TimeZone (as the SSAS DMVs record times in UTC, you need to configure this TimeZone offset for your local time; for instance, New Zealand is 12; see the sketch after this list)

5. BackupFilesToKeep (How many backup files need to be kept)
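As mentioned for the TimeZone variable, the DMV timestamps are in UTC; a minimal T-SQL sketch of applying the offset (the variable values here are illustrative):

DECLARE @TimeZone int = 12;					-- e.g. New Zealand
DECLARE @Last_Data_Update datetime = '2012-05-03 08:00:00';	-- UTC value read from MDSchema_Cubes

-- Shift the UTC timestamp to local time before comparing it with Backup_Date.
SELECT DATEADD(HOUR, @TimeZone, @Last_Data_Update) AS Local_Data_Modified;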

The BackupXMLA variable is an expression, which will be changed dynamically for each loop:

"<Backup xmlns=\"http://schemas.microsoft.com/analysisservices/2003/engine\">
  <Object>
    <DatabaseID>" + @[User::DatabaseName]  + "</DatabaseID>
  </Object>
  <File>" + @[User::BackupFileName]  + "</File>
</Backup>"

 This is the command to do the SSAS database backup.

 I have attached the package here.

Intelligent_OLAPDB_Backup_BICOE.dtsx (289.11 kb)

 

A SQL Agent job can be set up to perform the backup; remember to set the variables below to your own values:

 

Automatically Transfer and Synchronize SQL Server logins, SIDs, Passwords and Permissions

April 5, 2012 at 3:54 AM by Steven Wang

Subject: transfer and synchronize SQL Server logins, SIDs, SQL login passwords and permissions from one server instance to another by using an SSIS package.

This technique applies to: SQL Server migration, and primary/secondary server login synchronization, including SIDs, passwords and permissions transfer. It is applicable to SQL Server 2008 and later.

  

1. Real world Issues

Too often, when a principal server fails over to a mirror server, the SQL Server logins are found not to be fully in sync with the principal. The same applies to log-shipping servers, replicated sites and other DR scenarios, such as backing up and restoring databases from one server to another.

As I mentioned in my last post, A simple Way to Migrate SQL Logins from One Instance to Another, it is not convenient to transfer server logins from one server to another.

2. Login Synchronization Consideration

It would be nice to have an automated process or script to transfer logins from one server to another, including SIDs, passwords and permissions.

Typically, to transfer SQL Server logins from one server to another automatically, the following needs to be considered:

A) Logins (either Windows or SQL) whose SIDs exist in the destination server but not in the source: these logins need to be dropped from the destination.

B) SQL logins whose passwords differ between the source and destination servers: these logins need to be altered in the destination.

C) Logins in the destination server whose names differ from the source (for example, after a rename in the source): these need to be altered to match the source names.

D) Logins in the source server but not in the destination server: these logins need to be created in the destination.

E) Logins with a different Is_Disabled value between source and destination: the destination logins need to be altered.

F) Logins’ server role assignments need to be synchronized from source to destination.

G) Login permissions need to be synchronized from source to destination.

H) There are other items to consider, such as default language, but those mentioned above are the most important ones.

I have transformed these requirements into the SSIS tasks as below:

A) Drop logins in destination where SIDs are different

B) Drop logins in destination where passwords are different (Drop is simpler than alter.)

C) Drop logins in destination where names are different (Drop is simpler than alter)

D) Check and Sync Logins

E) Alter logins in destination where Is_Disabled are different

F) Check and Sync role members

G) Check and sync permissions

The SSIS package lay out as below:

 3. SSIS Package design

The package contains 7 containers, and each container contains 3 tasks: a Data Flow task, a ForEach Loop task, and an Execute SQL task.

The general idea is to use the Data Flow task to generate the SQL script and store the query text in a Recordset destination, which is assigned to an object variable, Query_Dataset; the ForEach task uses an ADO enumerator to loop through the query script records in the Query_Dataset variable and assign each one to a string variable, SQL_Query; the Execute SQL task then runs the string query variable to apply the change.

The variables have been setup as below:

 A) Drop logins in destination where SIDs are different Task Design

The data flow task: Get login with SIDs diff task has the below design:

This uses a Merge Join to find any SIDs that exist in the destination but not in the source. The Conditional Split is used to split out any NULL SID value coming from the login source in the Merge Join component.

The Derived Column component is used to generate the DDL query, for example: “Drop Login MyLoginName;”

The query is stored in the Recordset destination, where the Query_Dataset variable is configured.
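For reference, the same check can be expressed directly in T-SQL; a minimal sketch, assuming a linked server named [SourceServer] pointing at the source instance (the package uses the Merge Join approach described above instead):

-- Generate DROP LOGIN statements for logins whose SID exists in the
-- destination but not in the source.
SELECT	'Drop Login ' + QUOTENAME(D.name) + ';' AS Drop_Login_Query
FROM	sys.server_principals AS D
WHERE	D.type NOT IN ('R', 'C', 'K')
	AND D.name NOT LIKE '##%##'
	AND D.sid != 0x01
	AND NOT EXISTS
	(
		SELECT	1
		FROM	[SourceServer].master.sys.server_principals AS S
		WHERE	S.sid = D.sid
	);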

 

B) Drop logins in destination where passwords are different task design

The data flow task: Get Login with Password diff task design is depicted as below:

C) Drop logins in destination where names are different Task Design

A login with the same SID in the source and destination servers could have a different name if the login name was altered in the source server. Therefore, these logins need to be synced to the destination. To simplify the process, I simply drop those logins in the destination and re-import them from the source later.

The data flow task: Get login with Name diff is depicted as below:

 

D) Check and Sync Logins task design

In this task I will use the SQL script I have created in my last post to generate the Create Login DDL queries.

The query will script out the source login SIDs and also passwords if they are SQL logins.

The data flow task: get server login diff task is depicted as below:

 

The script will script out both Windows logins and SQL logins, but will leave out the types R = server role, C = login mapped to a certificate, and K = login mapped to an asymmetric key. In SQL Server 2008, server roles are fixed, and the C and K types are more complicated; I just want to keep things simple here.

The script in Login source is as below:

Select Upper(A.name) As Name,
Case When A.type IN ('G', 'U')
Then
'Create Login ' + QUOTENAME(A.name) 
+ ' From Windows With DEFAULT_DATABASE=[Master]' --+ quotename(A.default_database_Name)
+ ', DEFAULT_LANGUAGE=' + quotename(A.default_language_Name)
When A.type IN ('S')
Then
'Create Login ' + QUOTENAME(A.name) 
+ ' With Password=' + CONVERT(varchar(max), B.password_hash, 1) + ' hashed'
+ ', DEFAULT_DATABASE=[Master]' --+ quotename(B.default_database_Name)
+ ', DEFAULT_LANGUAGE=' + quotename(B.default_language_Name)
+ ', CHECK_POLICY=' + Case B.is_policy_checked when 0 then 'OFF' When 1 Then 'On' End 
+ ', CHECK_EXPIRATION=' + Case B.is_expiration_checked when 0 then 'OFF' When 1 Then 'On' End
+ ', SID=' + CONVERT(varchar(max), A.SID, 1)
End As SRVLogin
From sys.server_principals A
Left Outer Join
sys.sql_logins B
On A.name = B.Name
Where A.type  Not In ('R', 'C', 'K')
And A.name Not like '##%##'
And A.sid != 0x01

 E) Alter logins in destination where Is_Disabled are different

This task is used to keep the is_disabled value synced between source and destination. The data flow task: get login with IsDisable diff is depicted as below:
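A minimal sketch of the DDL that this task generates for a login whose is_disabled value differs (the exact derived column expression in the package may differ):

-- Build ALTER LOGIN ... ENABLE/DISABLE statements from the source setting.
SELECT	'Alter Login ' + QUOTENAME(A.name)
	+ CASE A.is_disabled WHEN 1 THEN ' Disable;' ELSE ' Enable;' END AS SRVLoginAlter
FROM	sys.server_principals A
WHERE	A.type NOT IN ('R', 'C', 'K')
	AND A.name NOT LIKE '##%##'
	AND A.sid != 0x01;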

F) Check and Sync role members task design

This task is used to sync the server role members between source and destination. I get each role name and its members from both the source and destination servers. A Full Join component is used to find which role members for a server role exist in the source but not in the destination, and vice versa. For logins that are in a server role in the source but not in the destination, an sp_addsrvrolemember script will be created. In contrast, for logins that are in a server role in the destination but not in the source, an sp_dropsrvrolemember script will be created.
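The source side of that comparison can be scripted with a query like the following minimal sketch (sys.server_role_members joined back to sys.server_principals; in the package the final EXEC statements are presumably built in a derived column, as in the other tasks):

-- Script sp_addsrvrolemember calls for every role membership in the source.
SELECT	'Exec sp_addsrvrolemember ' + QUOTENAME(M.name, '''')
	+ ', ' + QUOTENAME(R.name, '''') + ';' AS Add_Role_Member_Query
FROM	sys.server_role_members RM
	INNER JOIN sys.server_principals R ON RM.role_principal_id = R.principal_id
	INNER JOIN sys.server_principals M ON RM.member_principal_id = M.principal_id
WHERE	M.type NOT IN ('R', 'C', 'K')
	AND M.name NOT LIKE '##%##';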

Both scripts are combined and saved into the Query_Dataset variable. The data flow task Check role Member is depicted below:

 

G) Check and sync permissions task design

This task is used to sync the login permissions from source to destination. This query is used to script out the login permissions:

Select distinct
A.name
, B.Permission_Name
, B.state
, B.state_desc
From sys.server_principals A
Inner Join
sys.server_permissions B
On 
A.principal_id = B.grantee_principal_id
Where A.type  Not In ('R', 'C', 'K')
And A.name Not like '##%##'
And A.sid != 0x01

A full join is used to find which permissions are applied in the source for a login but not in the destination, and vice versa.

Permissions that are applied in the source for a login but not in the destination go down the Permission_Applied route, and the expression below is used to create the permission grant script:

(DT_WSTR,2000)([state_desc source] + " " + [Permission_Name source] + " TO [" + [name source] + ([state source] == "W" ? "] WITH GRANT OPTION;" : "];"))

Permissions that are applied in the destination for a login but not in the source go down the Permission_Removed route, and the expression below is used to create the permission revoke script:

(DT_WSTR,2000)("REVOKE " + [Permission_Name dest] + " TO [" + [name dest] + "];")
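For illustration, a permission difference for a hypothetical login would produce statements along these lines (the login and permission names are made up):

GRANT VIEW SERVER STATE TO [MyAppLogin];
REVOKE ALTER ANY LOGIN TO [MyAppLogin];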

The data flow task: Check Permissions task is depicted as below:

 

4. SSIS Package Automation

A SQL Agent Integration Services job can be used to automate the login synchronization process. The procedure is to create a job that runs the package.

For example, a job name can be: Apply Login Changes from Source to Destination.

Remember to set the source and destination server variable values to your own server names, as below:

 

Once the job is set up you can schedule the job to run automatically in your preferred time intervals.

You can also create a server-level DDL trigger to automatically run the job whenever a DDL command is performed on logins. An example of such a trigger:

Create Trigger trigger_ApplyLoginsToDestination
On All Server
After Create_Login, Alter_Login, Drop_Login
As
Exec msdb..sp_start_job 'Apply Login Changes from Source to Destination' --change to your own job name
Print 'You have synced your login changes from source to destination server.'

You can download this package from the attachment in this post.

Thanks for reading.

 

SQLServer_Login_Sync_Blog.zip (47.20 kb)

The Use of Denali 'With Result Sets' Feature in SSIS Data Flow Task

October 15, 2011 at 4:58 AM by Steven Wang

The new T-SQL feature WITH RESULT SETS in SQL Denali can be very useful when using a stored procedure as the data source of an SSIS data flow task.

Before SQL Denali, there were some limitations when using a stored procedure as an SSIS data flow task data source, such as:

1. If you want the data source to use different column names, you have to either change the stored procedure or use a Derived Column component; neither is convenient.
2. The same applies to data type changes: if you want to change a column's data type, you again have to either change the stored procedure or use a Derived Column component.
3. If your stored procedure ends with a dynamic SQL execution, such as Exec (@SQL) or sp_executesql @SQL, SSIS is not able to resolve the returned columns. A common workaround is to define a table variable inside the stored procedure, use the dynamic SQL execution to insert the data into the table variable, and then select the data from it, as sketched below. This is very awkward, and the performance is bad when the data set is big.
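A minimal sketch of that old workaround (the procedure and column definitions are made up for illustration):

-- Pre-Denali workaround: capture the dynamic SQL output in a table variable
-- so that SSIS can discover the column metadata.
CREATE PROC dbo.usp_Old_Workaround_Demo
AS
BEGIN
	DECLARE @SQL varchar(max) = 'SELECT GeographyKey, City FROM dbo.DimGeography';
	DECLARE @Result TABLE (GeographyKey int, City nvarchar(30));

	INSERT INTO @Result (GeographyKey, City)
	EXEC (@SQL);

	SELECT GeographyKey, City FROM @Result;
END
GO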

 
The new T-SQL feature 'With Result Sets' solves this problem.

The syntax of WITH RESULT SETS is simple, as shown below. (For more information on the T-SQL EXECUTE statement, see BOL; the link is here.)
 
Exec Stored_Proc_Name @Variables
With Result Sets
(
(
Col1 Data_type,
Col2 Data_type,
.....
List All Columns here
)
);
 
Here is a simple example demonstrating this functionality as an SSIS data flow task data source, solving problem 3 stated above.

1. Create a simple SP
USE [AdventureWorksDWDenali]  --Denali sample DW database, free download
GO
CREATE Proc [dbo].[usp_UseWithResultSet_Dynamic_SQL_Denali]
As
Begin
Declare @SQL varchar(Max)

Set @SQL = '
SELECT [GeographyKey]
,[City]
,[StateProvinceName]
,[EnglishCountryRegionName]
FROM [dbo].[DimGeography]
'
Exec (@SQL);

End
GO

2. Create an SSIS data flow task and use exec [dbo].[usp_UseWithResultSet_Dynamic_SQL_Denali] as the data source (SQL Command). When you click Preview, you get an error like the one below:

 

3. By using WITH RESULT SETS, you can rename the columns and change the data types (I changed the columns from nvarchar to varchar), and best of all, there is no error when using this SP as the data source. See below:
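The SQL Command for the OLE DB source would look something like the following sketch (the renamed columns and varchar lengths are illustrative choices):

EXEC [dbo].[usp_UseWithResultSet_Dynamic_SQL_Denali]
WITH RESULT SETS
(
	(
		GeographyKey	int,
		City		varchar(30),
		StateProvince	varchar(50),
		CountryRegion	varchar(50)
	)
);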

 

4. Terminate with a simple Union All Component and run. It is successful:

Posted in: SQL Server 2012 | SSIS | T-SQL

Tags: