One thing every data warehouse needs is the ability to handle unknown data.  This is done by creating a record in each dimension table to represent the unknowns.  Then, when we load a fact table, we map any values we cannot find a match for to this unknown record.

One of the things I do is populate the Unknown record during the ETL so I know it is always there.  What I dislike is writing the SQL INSERT statement for every dimension.  Writing the statement once isn’t a huge burden, but dimension tables typically evolve, and updating these statements over and over becomes tedious.

Recently I have been working on a project with a lot of dynamic SQL, and it got me thinking that it was time to construct the INSERT statement for the Unknown record on the fly.

I created an SSIS Script Task in C# and used SMO to access the table I was loading.  I then gathered up the table’s columns along with their corresponding data types.  As I iterated over the columns I populated a prefix string holding the column list and a suffix string holding the values to be inserted.  A helper method looks up each column’s data type and returns the desired default value.

The result is the script below:

// In addition to the default Script Task namespaces you will need:
using Microsoft.SqlServer.Management.Smo;

public void Main()
{
    Boolean bProcessLive = Convert.ToBoolean(Dts.Variables["etlProcessLive"].Value);
    int ErrorNumber = 9000;
    try
    {
        string sSQLServer = this.Dts.Connections["OLEDB_Connection"].Properties["ServerName"].GetValue(Dts.Connections["OLEDB_Connection"]).ToString();
        if (bProcessLive == false)
            MessageBox.Show("Server Name: " + sSQLServer, "SQL Objects");
        string sSQLDatabase = this.Dts.Connections["OLEDB_Connection"].Properties["InitialCatalog"].GetValue(Dts.Connections["OLEDB_Connection"]).ToString();
        if (bProcessLive == false)
            MessageBox.Show("Database Name: " + sSQLDatabase, "SQL Objects");

        // Connect to the instance of SQL Server. 
        Server oSQLServer = new Server(sSQLServer);

        // Reference the database.  
        Database db = oSQLServer.Databases[sSQLDatabase];

        // Reference the table
        String sDestTable = Dts.Variables["etlDestinationTable"].Value.ToString();
        // Bug in SMO doesn't return schema data. ConnectID:696411
        // The following line won't work
        // Table table = db.Tables["dimDate","dbo"];

        // Iterate through the tables in the database to find the destination table.
        foreach (Table tb in db.Tables)
        {
            Boolean bResult = tb.ToString() == sDestTable.ToString();
            if (bProcessLive == false)
                MessageBox.Show("Enumerated Table: " + tb.ToString() + " Destination Table: " + sDestTable + " Result: " + bResult.ToString(), "SQL Objects");
            // Check if the table is our destination table and that it is not a system table
            if (bResult == true && tb.IsSystemObject == false)
            {
                string sql;
                if (bProcessLive == false)
                    MessageBox.Show("Scripting columns for table: " + tb.Name, "SQL Objects");
                // Count the columns on the table
                int colCount = tb.Columns.Count;
                if (bProcessLive == false)
                    MessageBox.Show("Column Count: " + colCount.ToString(), "Column Count");

                string InsertColumns = "";
                string ValuesColumns = "";

                // Script out each column
                while (colCount > 0)
                {
                    colCount--;
                    MessageBox.Show("Column: " + tb.Columns[colCount].Name + "; Data Type: " + tb.Columns[colCount].DataType, "Column Information");

                    // Create Columns portion of Insert statement
                    if (InsertColumns == "")
                        InsertColumns = "[" + tb.Columns[colCount].Name + "]";
                    else
                        InsertColumns = InsertColumns + ",[" + tb.Columns[colCount].Name + "]";

                    // Create Values portion of Insert statement
                    if (ValuesColumns == "")
                        ValuesColumns = "'" + GetValue(tb.Columns[colCount].DataType.ToString(), 0, 0) + "'";
                    else
                        ValuesColumns = ValuesColumns + ",'" + GetValue(tb.Columns[colCount].DataType.ToString(), 0, 0) + "'";

                }
                sql = "INSERT INTO " + sDestTable + " (" + InsertColumns + " ) VALUES (" + ValuesColumns + ")";
                MessageBox.Show(sql, "SQL Statement");
                db.ExecuteNonQuery(sql);
            }
        }
    }
    catch (Exception ex)
    {
        if (bProcessLive == false)
            MessageBox.Show(ex.Message, "Script Task Failure");
        Dts.Events.FireError(ErrorNumber, "Script Task Failure", ex.Message, ex.HelpLink, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
        return;   // Don't fall through and overwrite the failure result below.
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}

private string GetValue(string datatype, int length, int precision)
{
    string colvalue;
    switch (datatype)
    {
        case "bit":
        case "tinyint":
            colvalue = "0";
            break;
        case "smallint":
        case "int":
        case "bigint":
            colvalue = "-1";
            break;
        case "varchar":
        case "nvarchar":
            colvalue = "Unknown";
            break;
        case "date":
            colvalue = "1/1/1900";
            break;
        case "time":
            colvalue = "00:00:00.0000000";
            break;
        case "datetime":
            colvalue = "1900-01-01 00:00:00.000";
            break;
        case "datetime2":
            colvalue = "1900-01-01 00:00:00.0000000";
            break;
        default:
            colvalue = "-1";
            break;
    }

    return colvalue;
}
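As an illustration, suppose a hypothetical dimCustomer table with an int CustomerKey and an nvarchar CustomerName (both names are assumptions for this example).  Because the loop walks the column collection in reverse, the script would build and execute a statement something like:

INSERT INTO dimCustomer ([CustomerName],[CustomerKey]) VALUES ('Unknown','-1')

Note that every value is quoted, including the numeric ones; SQL Server implicitly converts '-1' to the numeric column types during the insert.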

Banding an Aggregate in SSAS

Posted: April 25, 2012 in SSAS

I recently came across something I thought couldn’t be done in SSAS: linking the result of an aggregate to a dimension.  I assumed this was something you had to do in SQL or in a BISM model.  After some digging around the internet I cobbled together the following solution:

  1. Create a flat dimension that describes the banding you need.
  2. Add that dimension to the cube, but do not link it to your fact table.
  3. Create a SCOPE statement in the cube’s MDX calculation script to bind the aggregated results of a particular set to each banding attribute.

SCOPE([Banding].[Banding].[Low]);
    THIS = AGGREGATE(
        FILTER(
            EXISTING [Dimension A].[Hierarchy A].[Attribute A].MEMBERS,
            [Measures].[Measure A] <= 10
        ),
        [Measures].CURRENTMEMBER
    );
END SCOPE;

SCOPE([Banding].[Banding].[Medium]);
    THIS = AGGREGATE(
        FILTER(
            EXISTING [Dimension A].[Hierarchy A].[Attribute A].MEMBERS,
            [Measures].[Measure A] > 10 AND [Measures].[Measure A] < 50
        ),
        [Measures].CURRENTMEMBER
    );
END SCOPE;

SCOPE([Banding].[Banding].[High]);
    THIS = AGGREGATE(
        FILTER(
            EXISTING [Dimension A].[Hierarchy A].[Attribute A].MEMBERS,
            [Measures].[Measure A] >= 50
        ),
        [Measures].CURRENTMEMBER
    );
END SCOPE;

There are an infinite number of ways to load a table with data; however, I believe most people would agree that a table without indexes will load faster than one with indexes.  Microsoft discusses this in The Data Loading Performance Guide, which says: “Before you load data into an empty table, it is always a good idea to drop or disable all indexes on the table. After the load is done, re-create or re-enable the indexes.”

In SSIS there are a couple of ways to do this.  The easiest is to use an Execute SQL Task to drop the index(es) before your load and a second task after the load to re-create them.  While this is certainly a good option, it causes a problem in production: a code change is required every time an existing index changes or a new index is created.  To get around this, you need to drop all indexes on the table and re-create them after the load.  The question becomes: how do you know what indexes exist to drop, and after you drop them, how do you re-create them?  Finally, you might ask, why don’t we just disable the indexes?

We drop and re-create the indexes rather than disabling them because disabling a clustered index makes the table inaccessible and prevents inserts (Guidelines for Disabling Indexes and Constraints).

To drop and re-create the indexes we can use SMO (SQL Server Management Objects) to obtain the list of indexes on a particular table, along with the SQL scripts to create them.  Using an object variable in SSIS, we can store those scripts for re-creation after the data load.  To do this, your package would look something like this:

[Package layout: Script Task (script and drop indexes) → Data Flow Task → Script Task (re-create indexes)]

In this example there is a completion constraint between the Data Flow Task and the second Script Task that re-creates the indexes, because regardless of whether the Data Flow ends in success or failure we want to put the indexes back.

When you build your Script Task you will need the SMO libraries to access SQL Server, plus the generic collections library to hold the list of index-creation scripts.

using Microsoft.SqlServer.Management.Smo;
using Microsoft.SqlServer.Management.Sdk.Sfc;
using System.Collections.Generic;

The code below will script out the indexes and drop them.

public void Main()
{
    Boolean bProcessLive = Convert.ToBoolean(Dts.Variables["etlProcessLive"].Value);
    int ErrorNumber = 9000;
    try
    {
        string sSQLServer = this.Dts.Connections["OLEDB_Connection"].Properties["ServerName"].GetValue(Dts.Connections["OLEDB_Connection"]).ToString();
        if (bProcessLive == false)
            MessageBox.Show("Server Name: " + sSQLServer, "SQL Objects");
        string sSQLDatabase = this.Dts.Connections["OLEDB_Connection"].Properties["InitialCatalog"].GetValue(Dts.Connections["OLEDB_Connection"]).ToString();
        if (bProcessLive == false)
            MessageBox.Show("Database Name: " + sSQLDatabase, "SQL Objects");

        // Connect to the instance of SQL Server.
        Server oSQLServer = new Server(sSQLServer);

        // Reference the database.
        Database db = oSQLServer.Databases[sSQLDatabase];

        // Reference the table
        String sDestTable = Dts.Variables["etlDestinationTable"].Value.ToString();
        // Bug in SMO doesn't return schema data. ConnectID:696411
        // The following line won't work
        // Table table = db.Tables["dimCustomer","dbo"];

        // Create a list to store SQL code in.
        List<string> objSQL = new List<string>();

        // Define a Scripter object and set the required scripting options.
        Scripter scrp = new Scripter(oSQLServer);
        scrp.Options.ScriptDrops = false;
        scrp.Options.WithDependencies = false;
        scrp.Options.Indexes = true;   // Include indexes
        scrp.Options.DriAllConstraints = true;   // Include referential constraints

        // Iterate through the tables in the database to find the destination table.
        foreach (Table tb in db.Tables)
        {
            Boolean bResult = tb.ToString() == sDestTable.ToString();
            if (bProcessLive == false)
                MessageBox.Show("Enumerated Table: " + tb.ToString() + " Destination Table: " + sDestTable + " Result: " + bResult.ToString(), "SQL Objects");
            // Check if the table is our destination table and that it is not a system table
            if (bResult == true && tb.IsSystemObject == false)
            {
                if (bProcessLive == false)
                    MessageBox.Show("Scripting indexes for table: " + tb.Name, "SQL Objects");
                // Count the indexes on the table
                int ixCount = tb.Indexes.Count;
                if (bProcessLive == false)
                    MessageBox.Show("Index Count: " + ixCount.ToString(), "Index Count");

                // Script out each index, working backward so drops don't shift the collection.
                while (ixCount > 0)
                {
                    ixCount--;
                    System.Collections.Specialized.StringCollection ic = scrp.Script(new Urn[] { tb.Indexes[ixCount].Urn });
                    foreach (string st in ic)
                    {
                        if (bProcessLive == false)
                            MessageBox.Show(st, "Index SQL");
                        objSQL.Add(st);
                    }
                    // Drop the index
                    tb.Indexes[ixCount].Drop();
                }
            }
        }
        Dts.Variables["User::etlSQLObjects"].Value = objSQL;
    }
    catch (Exception ex)
    {
        if (bProcessLive == false)
            MessageBox.Show(ex.Message, "Script Task Failure");
        Dts.Events.FireError(ErrorNumber, "Script Task Failure", ex.Message, ex.HelpLink, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
        return;   // Don't fall through and overwrite the failure result below.
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}

The script below will re-create the indexes:

// In addition to the default Script Task namespaces you will need:
using System.Data.SqlClient;
using System.Collections.Generic;

public void Main()
{
    Boolean bProcessLive = Convert.ToBoolean(Dts.Variables["etlProcessLive"].Value);
    int ErrorNumber = 9000;

    // Acquire the ADO.NET connection from the package's connection manager.
    SqlConnection sqlConn = (SqlConnection)Dts.Connections["ADO_Connection"].AcquireConnection(Dts.Transaction);
    if (bProcessLive == false)
        MessageBox.Show(sqlConn.ConnectionString, "ADO.net Connection");
    if (bProcessLive == false)
        MessageBox.Show(sqlConn.ConnectionTimeout.ToString(), "Connection Timeout");

    try
    {
        // Retrieve the list of index-creation scripts stored by the first Script Task.
        object objSQL = Dts.Variables["User::etlSQLObjects"].Value;
        List<string> list = (List<string>)objSQL;

        // Loop through each script to re-create each index
        foreach (string SQL in list)
        {
            if (bProcessLive == false)
                MessageBox.Show(SQL, "SQL Statement");
            // Create the index
            SqlCommand sqlCmd = new SqlCommand(SQL, sqlConn);
            sqlCmd.CommandTimeout = sqlConn.ConnectionTimeout;
            sqlCmd.ExecuteNonQuery();
        }
    }
    catch (Exception ex)
    {
        if (bProcessLive == false)
            MessageBox.Show(ex.Message, "Script Task Failure");
        Dts.Events.FireError(ErrorNumber, "Script Task Failure", ex.Message, ex.HelpLink, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
        return;   // The finally block below still closes the connection.
    }
    finally
    {
        if (sqlConn.State == ConnectionState.Open)
        {
            sqlConn.Close();
            sqlConn.Dispose();
        }
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}

I was recently working on a Visio drawing that had shapes connected to SQL Server data.  When I published my Web Drawing to Visio Services in SharePoint I received the following error message:


Visio Services was unable to refresh external data connections in this Web Drawing because of changes in the structure of an underlying data source. One or more columns in this data source have changed data type after this Web Drawing was last saved. To resolve this issue, restore the data types of the affected columns to their previous values or update the data connection information in this Web drawing using Visio.

After further investigation I traced the problem to a column in my SQL table with a data type of DATETIME2.  If I removed the column from the Data Connection configuration, the problem went away.  It appears that certain data types are not supported by Visio Services.

Recently I needed to combine the values of several columns in a Script Component to do some work on the result.  I was planning on reusing this script again and again, so I didn’t want to hard-code each column from the input row.  After some digging I found helpful information in John Welch’s blog entry “XML Destination Script Component”.  The post and most of the comments were pre-2008 and therefore written in VB, so I adapted it to C#.

In the sample below I iterate over the collection of properties on the row and examine each property’s name.  Each column in the input/output buffer ends up with two properties: one named for the column and one suffixed with “_IsNull”.  Since we don’t care about the “_IsNull” properties, we skip them.  I also created an output column on the Script Component called “ConcatenatedString”, which I skip as well.  Then I convert each remaining input column to a string and concatenate them together.

// PropertyInfo requires this namespace:
using System.Reflection;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string s = "";
    foreach (PropertyInfo inputColumn in Row.GetType().GetProperties())
    {
        // Skip the generated *_IsNull properties and our own output column.
        if (!inputColumn.Name.EndsWith("IsNull") && !inputColumn.Name.Equals("ConcatenatedString"))
        {
            if (inputColumn.GetValue(Row, null) != null)
                s = s + inputColumn.GetValue(Row, null).ToString();
        }
    }

    Row.ConcatenatedString = s;
}
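One design note: if the concatenated string will feed a comparison or a hash, consider placing a delimiter between values so that, for example, (“ab”, “c”) and (“a”, “bc”) don’t produce the same string.  A minimal variation of the inner loop body (the pipe delimiter is an arbitrary choice):

        if (!inputColumn.Name.EndsWith("IsNull") && !inputColumn.Name.Equals("ConcatenatedString"))
        {
            // Append the value (empty for NULL) plus a delimiter so adjacent columns can't run together.
            object v = inputColumn.GetValue(Row, null);
            s = s + (v == null ? "" : v.ToString()) + "|";
        }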

Date Time Conversion in SSIS

Posted: October 5, 2011 in SSIS

I have seen many comments on how to handle time conversion in SQL, and lots of them involve maintaining tables to handle the time shift that needs to occur.  Many of these examples shift time by a fixed offset per time zone, which is not always accurate.  Let’s look at the following example:

  • 3/13/2011 3:30 AM EST – Let’s assume that in your organization the center of the universe is Central Standard Time.  If we were to just shift the date by one hour to CST we would end up with a timestamp of 3/13/2011 2:30 AM CST.  One problem: that timestamp doesn’t exist, since it falls into the Daylight Saving Time black hole (clocks jumped from 2:00 AM straight to 3:00 AM that morning).

So how do we do this without things getting overly complicated?  Fortunately there is such a feature built into Windows; all we need to do is tap into it.

All this takes is three lines of code in a Script Component.

TimeZoneInfo ESTtz = TimeZoneInfo.FindSystemTimeZoneById("Eastern Standard Time");
TimeZoneInfo CSTtz = TimeZoneInfo.FindSystemTimeZoneById("Central Standard Time");
Row.PostTimeCST = TimeZoneInfo.ConvertTime(Row.PostTime, ESTtz, CSTtz);

We obtain the source and destination time zones, then use the ConvertTime method to convert the row’s date from one time zone to the other.
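To see that behavior with the example from above, here is a minimal console sketch (the time zone IDs are the standard Windows ones; the variable names are my own):

using System;

class TimeZoneDemo
{
    static void Main()
    {
        TimeZoneInfo ESTtz = TimeZoneInfo.FindSystemTimeZoneById("Eastern Standard Time");
        TimeZoneInfo CSTtz = TimeZoneInfo.FindSystemTimeZoneById("Central Standard Time");

        // 3/13/2011 3:30 AM Eastern -- just after the spring-forward transition.
        DateTime postTime = new DateTime(2011, 3, 13, 3, 30, 0);

        // ConvertTime applies each zone's own DST rules rather than a fixed offset,
        // so the result lands on a time that actually exists (1:30 AM, still CST),
        // not the nonexistent 2:30 AM.
        Console.WriteLine(TimeZoneInfo.ConvertTime(postTime, ESTtz, CSTtz));
    }
}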

To find a complete list of Time Zones on your machine look in the following registry location:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Time Zones
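Alternatively, if you’d rather not dig through the registry, the same list can be enumerated from .NET; a quick sketch:

using System;

class ListTimeZones
{
    static void Main()
    {
        // Enumerates the same entries that live under the registry key above.
        foreach (TimeZoneInfo tz in TimeZoneInfo.GetSystemTimeZones())
            Console.WriteLine("{0} -- {1}", tz.Id, tz.DisplayName);
    }
}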


A better way to load a date dimension

Posted: September 13, 2011 in SSIS

Thanks to everyone who attended my session at Carolina Code Camp.

In our session we discussed two ways a date dimension can be loaded. The traditional way is typically through the use of a SQL script or Stored Procedure. It generally will look something like this:

DECLARE @StartDate DATETIME;
DECLARE @EndDate DATETIME;
DECLARE @Date DATETIME;

SET @StartDate = '1/1/2008';
SET @EndDate = '12/31/2012';
SET @Date = @StartDate;

WHILE @Date <= @EndDate
    BEGIN
        INSERT INTO
            [dbo].[dimDate]
            (
                [DateKey]
                ,[Date]
                ,[DateValue]
                ,[Month]
                ,[MonthValue]
                ,[Quarter]
                ,[QuarterValue]
                ,[Year]
                ,[YearValue]
            )
            VALUES
            (
                DATEPART(yyyy, @Date) * 10000 + DATEPART(mm, @Date) * 100 + DATEPART(dd, @Date)
                ,CONVERT(VARCHAR(4), DATEPART(mm, @Date)) + '/' + DATENAME(dd, @Date) + '/' + DATENAME(yyyy, @Date)
                ,@Date
                ,DATENAME(mm, @Date) + ', ' + DATENAME(yyyy, @Date)
                ,LEFT(DATENAME(mm, @Date), 3) + '-' + RIGHT(DATEPART(yyyy, @Date), 2)
                ,'Quarter ' + DATENAME(qq, @Date) + ', ' + DATENAME(yyyy, @Date)
                ,DATEPART(qq, @Date)
                ,DATENAME(yyyy, @Date)
                ,DATEPART(yyyy, @Date)
            );
        SET @Date = DATEADD(DAY, 1, @Date);
    END

This is generally an OK choice for populating a date dimension, until the dimension starts needing additional elements such as:

  • Multilingual translations
  • Holiday information
  • Financial calendar specifications

That SQL code, designed around a single insert, quickly starts to become an inferior design.

The SQL CAT team tells us in its article Top 10 SQL Server Integration Services Best Practices:

  • SSIS is an in-memory pipeline so ensure that all transformations occur in memory
  • Minimize Logging Operations

To conform with good ETL practice, we should load our date dimension once and never issue an update.  So how do we get multilingual support, holidays, and financial calendar information into the date dimension without issuing an update?  With SSIS, using the Script Component and Merge Join transformations:

  1. Create a new SSIS package.
  2. Add a Data Flow.
  3. In the Data Flow add a Script Component and set the script usage to Source.
  4. Open the Script Component properties and edit the Inputs and Outputs. Add the following output columns:
  • DateKey | DataType: four-byte signed integer [DT_I4]
  • DateValue | DataType: date [DT_DATE]
  • DateName_zh-Hans | DataType: Unicode string [DT_WSTR] | Length: 16

Edit the Script Component using the C# script language.

In the script component we can derive date information in as many languages as Windows supports. Suppose we want to generate a date in Simplified Chinese. To do that we need to add the System.Globalization namespace in order to use the localization classes.

using System.Globalization;

Then add the following code to the CreateNewOutputRows method:

DateTime dteStartDate = Convert.ToDateTime("12-31-2008");

Output0Buffer.AddRow();
Output0Buffer.DateNamezhHans = dteStartDate.ToString("d", CultureInfo.CreateSpecificCulture("zh-Hans"));

We can construct the rest of the columns necessary to support a record in the date dimension with similar code.
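For instance, the DateKey and DateValue outputs defined earlier could be populated alongside additional culture-specific names, continuing the CreateNewOutputRows snippet above (the French column is a hypothetical extra output, not one defined in the steps above):

// DateKey uses the same yyyymmdd formula as the SQL script shown earlier.
Output0Buffer.DateKey = dteStartDate.Year * 10000 + dteStartDate.Month * 100 + dteStartDate.Day;
Output0Buffer.DateValue = dteStartDate;
// Hypothetical extra output column holding the French date name.
Output0Buffer.DateNamefrFR = dteStartDate.ToString("d", CultureInfo.CreateSpecificCulture("fr-FR"));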

Holidays and financial calendars generally come from some pre-defined data source, which we can add to the pipeline with a Merge Join.  Just remember to sort your data sets.

By generating a base date record out of the Script Component and using the Merge Join to add Holiday and Financial information we successfully take full advantage of SSIS/ETL best practices.

The solution file offers a more complete example and can be accessed here:


PASS is having open voting for all the session submissions until May 20th.
Please consider voting for the two session abstracts I submitted using the PASS Session Preference tool.

1. A BI Developer’s Guide to Dimensional Dating: Date Dimension Best Practices

The date dimension is one of the most important dimensions in your data warehouse and often the most overlooked. In this session we will review the best practices and common mistakes for modeling a multilingual date dimension with multiple calendar hierarchies. From there we will learn how to load the date dimension in a single SSIS Data Flow operation, and then move to the cube, where we will leverage named sets to determine common business time horizons such as the current day, current month, last month, year to date, year over year, rolling months, and so on.

By the end of this session you will have an end-to-end understanding of how to leverage dates to satisfy end-user usability requirements while maintaining operational efficiency. A sample solution will be available that includes an SSIS package and cube project.

2. De-dupe customer lists without Master Data Management

Customer data is frequently duplicated across many enterprises, within the same system or across multiple systems. In this session we will use good design practices to de-dupe customer lists without the business overhead of maintaining a Master Data Management system. We will use SSIS to validate and geocode the customers’ billing and shipping addresses against the Bing Maps Geocode web service. We will consolidate customer records into a single master record for use in the Customer dimension, plus a cross-reference table for use in ETL to assign the proper surrogate key. From there we will break the addresses apart into their individual elements for use in a Fuzzy Lookup, and measure the distance between two geocoded locations to make logical assumptions about which addresses should be grouped together to form a common record.

By the end of this session you will understand a methodology for removing duplicates and aligning like records across lists of data, how to use the Fuzzy Lookup, and how to validate address information against Bing Maps. A sample solution will be available that includes an SSIS package.