Beware Job Recruiting SCAM

I’ve had multiple reports of a scam where persons have used my identity to entice IT professionals into applying for positions they have no business recruiting for. The scammers later ask for a PayPal payment for their services. The authorities have been notified of this fraudulent activity.

I urge everyone to be vigilant with regard to social engineering scams like this. Beware of unsolicited emails from persons you don’t know personally. It is easy for unscrupulous individuals to glean much information from publicly available sources (job descriptions, professional profiles, etc.) and tailor a message to a particular job seeker.

SQL Server File Naming Standards

Attention to detail in naming SQL Server physical files and logical file names will make a DBA’s life easier. This is especially important when using RESTORE or attach. I suggest one establish and follow a naming standard for physical and logical database file names.

SQL Server does not enforce any particular naming standard for files.  SQL Server is perfectly happy with a data file named “readme.txt” or a log file named “word.doc”.  Of course, such an inappropriate name and extension will lead to confusion so the best practice is to name files appropriately.  An appropriate SQL Server database file name is essentially one that is self-documenting; one should be able to determine the exact purpose of a file simply by examining the name.  I recommend a naming standard that includes the associated database name, filegroup name, and file type.

Physical File Names
I propose adopting a physical file naming convention of “<database-name>_<filegroup-name>_<uniqueifier>.<file-type>” where:

  • <database-name> is the name of the associated database
  • <filegroup-name> is the name of the filegroup containing the file, or the literal “Log” for log files
  • <uniqueifier> is an integer to ensure the file name is unique
  • <file-type> is the standard SQL Server extension for the file type (“mdf”, “ndf”, or “ldf”) as documented in the SQL Server Books Online Database Files and Filegroups topic

Personally, I use an underscore to separate the name components but a dash will also suffice.  Ideally, the separator character should never be used in database or filegroup names to avoid ambiguity.  I recommend one avoid using special characters in database and filegroup names (e.g. use proper case) and ensure database and filegroup names conform to the rules for regular identifiers as outlined in the Books Online Database Identifiers topic.

There is some wiggle room as to how strictly one adheres to this naming convention.  One could omit the filegroup name and uniqueifier components for the primary data file (mdf) because this file is implicitly in the PRIMARY filegroup and there can be only one primary data file in the database.  Similarly, the uniqueifier could be omitted for the first or only file within a filegroup.  That being said, stricter adherence to this naming convention provides better consistency and makes naming files more of a no-brainer.

Note that the uniqueifier in the name need not imply a sequential value with no gaps.  For example, consider a database named MyDatabase with filegroup DataFG containing 3 files named ‘MyDatabase_DataFG_1.ndf’, ‘MyDatabase_DataFG_2.ndf’, and ‘MyDatabase_DataFG_3.ndf’.  If the second file is removed, there is no requirement to rename file ‘MyDatabase_DataFG_3.ndf’ to ‘MyDatabase_DataFG_2.ndf’.  However, if one is anal about such things, there is no harm in doing so other than the unavailability of the database during the maintenance.

The physical file naming convention described above will guarantee physical file names are unique within a SQL Server instance and facilitate relocation to different drives/folders without naming conflicts.  On a server with multiple instances, I suggest placing files in separate folders for each instance.  This practice will better organize database files while avoiding file name conflicts when a database with the same name exists on different instances.

Logical File Names
Although I commonly see database names embedded within logical file names, I suggest one avoid that practice.  The scope of a logical file name is the context database so including the database name is redundant.  More importantly, the original logical file name is retained when a database is subsequently restored or attached with a different name so the name becomes out-of-sync with the actual database name unless one explicitly changes the names afterward to match the new database name.  This extra administrative work is often overlooked and can be avoided entirely by not including the database name in the logical file name when the initial database is created or altered.

I suggest one use the same naming convention for the logical file name as the physical file name but without the database name.  The logical name will therefore match the last part of the physical name (<filegroup-name>_<uniqueifier>).

Unfortunately, SQL Server likes to include the database name in logical names of the primary data file and log file during initial creation.  Executing a minimal create database T-SQL statement like “CREATE DATABASE Foo;” will result in the files below created in the default data and log folder for the instance.  This also applies to the logical names suggested by the SSMS GUI, although one can specify different names as desired.

Logical Name    Physical Name
Foo             D:\SqlDataFiles\Foo.mdf
Foo_log         L:\SqlLogFiles\Foo_log.ldf
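
If a database has already been created with the default names, the logical names can be changed afterward without touching the physical files.  Below is a minimal sketch, assuming the Foo database above and the PRIMARY_1 and Log_1 logical names this article proposes; substitute the names from your own standard.

```sql
--Rename the default logical file names of the Foo database.
--NAME is the current logical name; NEWNAME is the desired one.
ALTER DATABASE Foo
    MODIFY FILE (NAME = N'Foo', NEWNAME = N'PRIMARY_1');

ALTER DATABASE Foo
    MODIFY FILE (NAME = N'Foo_log', NEWNAME = N'Log_1');
```

Renaming the physical files is more involved because the files must also be renamed at the operating system level while the database is offline, so it is not shown here.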

I recommend using the expanded form of CREATE DATABASE so that you have complete control over names, locations, and sizes.  Consider creating a custom SSMS template or code snippet to facilitate creating databases with the proper names.

CREATE DATABASE ExampleDatabase
ON PRIMARY
    (NAME = PRIMARY_1,
    FILENAME = N'D:\SqlDataFiles\ExampleDatabase_PRIMARY_1.mdf',
    SIZE = 10MB,
    MAXSIZE = 50MB,
    FILEGROWTH = 10%)
LOG ON
    ( NAME = Log_1,
    FILENAME = N'L:\SqlLogFiles\ExampleDatabase_Log_1.ldf',
    SIZE = 10MB,
    MAXSIZE = 50MB,
    FILEGROWTH = 10MB);

Examples

Below are examples of the naming conventions detailed in this article.

Simple database with only a primary data and log file:

Logical Name    Physical Name
PRIMARY_1       D:\SqlDataFiles\ExampleDatabase_PRIMARY_1.mdf
Log_1           L:\SqlLogFiles\ExampleDatabase_Log_1.ldf

Database with 2 files in PRIMARY filegroup, 2 secondary filegroups containing 3 files each, and 2 log files:

Logical Name    Physical Name
PRIMARY_1       D:\SqlDataFiles\ExampleDatabase_PRIMARY_1.mdf
PRIMARY_2       E:\SqlDataFiles\ExampleDatabase_PRIMARY_2.ndf
DataFG_1        F:\SqlDataFiles\ExampleDatabase_DataFG_1.ndf
DataFG_2        G:\SqlDataFiles\ExampleDatabase_DataFG_2.ndf
DataFG_3        H:\SqlDataFiles\ExampleDatabase_DataFG_3.ndf
IndexFG_1       I:\SqlDataFiles\ExampleDatabase_IndexFG_1.ndf
IndexFG_2       J:\SqlDataFiles\ExampleDatabase_IndexFG_2.ndf
IndexFG_3       K:\SqlDataFiles\ExampleDatabase_IndexFG_3.ndf
Log_1           L:\SqlLogFiles\ExampleDatabase_Log_1.ldf
Log_2           M:\SqlLogFiles\ExampleDatabase_Log_2.ldf
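
To review how closely an existing database follows the logical naming convention, one could query the catalog views.  The sketch below is only an illustration: it runs in the context of the database being checked, treats log files as belonging to a pseudo-filegroup named “Log” as this article does, and the FollowsConvention column name is my own.  The string comparison follows the database collation, so case sensitivity may vary.

```sql
--List each file of the current database with its filegroup and flag
--logical names that do not start with "<filegroup-name>_".
SELECT
      df.name AS LogicalName
    , df.physical_name AS PhysicalName
    , df.type_desc AS FileType
    , COALESCE(fg.name, N'Log') AS FilegroupName
    , CASE
          WHEN LEFT(df.name, LEN(COALESCE(fg.name, N'Log')) + 1)
               = COALESCE(fg.name, N'Log') + N'_'
          THEN N'Yes'
          ELSE N'No'
      END AS FollowsConvention
FROM sys.database_files AS df
LEFT JOIN sys.filegroups AS fg --log files have no filegroup row
    ON fg.data_space_id = df.data_space_id
ORDER BY df.file_id;
```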

Table Partitioning Best Practices

SQL Server table partitioning has a number of gotchas without proper planning.  This article demonstrates those that commonly cause grief and recommends best practices to avoid them.

Implications of the Partition Function Range Specification

One needs a good understanding of how the RANGE LEFT/RIGHT specification affects partition setup and management.  The RANGE specification determines:

  • The partition created by SPLIT
  • The partition removed with MERGE
  • The permanent partition that can never be removed from partition schemes

Below are nuances of the RANGE specification that commonly surprise people.

The Partition Created by SPLIT

New partitions are created by splitting a partition function.  A partition function SPLIT splits an existing partition into 2 separate ones, changing all of the underlying partition schemes, tables, and indexes.  Below are the actions performed when a RANGE LEFT or RANGE RIGHT partition function is SPLIT; the important differences are the side on which the new partition is created and which rows are moved into it:

The actions performed by a SPLIT of a RANGE LEFT partition function:

  • Identify existing partition to be split, which is the one that contains the new boundary (or the last partition if no existing boundaries are higher than the one being added)
  • Add the new boundary to the partition function, maintaining boundary order and incrementing subsequent partition numbers
  • Create a new partition to the left of the existing one on the NEXT USED filegroup of each partition scheme that uses the function
  • For each table/index using the affected partition scheme(s), move rows from the existing split partition that are less than or equal to the new boundary into the newly created partition on the left

The actions performed by a SPLIT of a RANGE RIGHT partition function:

  • Identify existing partition to be split, which is the one that contains the new boundary (or the first partition if no existing boundaries are less than the one being added)
  • Add the new boundary to the partition function, maintaining boundary order and incrementing subsequent partition numbers
  • Create a new partition to the right of the existing one on the NEXT USED filegroup for each partition scheme that uses the function
  • For each table/index using the affected partition scheme(s), move rows from the existing split partition that are greater than or equal to the new boundary into the newly created partition on the right

I generally suggest one use RANGE RIGHT because it is more natural and helps avoid common pitfalls when adding incremental partition boundaries.  RANGE LEFT behavior is not intuitive and trips up many DBAs, but I will discuss it for completeness.  Consider this example of a RANGE LEFT partition function on a datetime column, where the initial setup is for 2 years of data (2013 and 2014):

CREATE DATABASE PartitioningDemo ON PRIMARY
        ( NAME = N'Primary', FILENAME = N'C:\SqlDataFiles\PartitioningDemo.mdf' , SIZE = 100MB , MAXSIZE = UNLIMITED, FILEGROWTH = 10MB ),
    FILEGROUP FG_2013
        ( NAME = N'FG_2013_1', FILENAME = N'C:\SqlDataFiles\PartitioningDemo_FG_2013_1.ndf' , SIZE = 10MB , MAXSIZE = UNLIMITED, FILEGROWTH = 10MB ),
    FILEGROUP FG_2014
        ( NAME = N'FG_2014_1', FILENAME = N'C:\SqlDataFiles\PartitioningDemo_FG_2014_1.ndf' , SIZE = 10MB , MAXSIZE = UNLIMITED, FILEGROWTH = 10MB )
    LOG ON
        ( NAME = N'Log_1', FILENAME = N'C:\SqlLogFiles\PartitioningDemo_Log_1.ldf' , SIZE = 100MB , MAXSIZE = 10GB , FILEGROWTH = 10MB);
GO

USE PartitioningDemo;
GO

CREATE PARTITION FUNCTION PF_DateTimeLeft(datetime) AS
    RANGE LEFT FOR VALUES(
        N'2013-12-31T23:59:59.997'
    );

CREATE PARTITION SCHEME PS_DateTimeLeft AS
    PARTITION PF_DateTimeLeft TO(
        FG_2013
        , FG_2014
    );

CREATE TABLE dbo.PartitionedTableLeft(
    DateTimeColumn datetime
    ) ON PS_DateTimeLeft(DateTimeColumn);

INSERT INTO dbo.PartitionedTableLeft VALUES
      (N'2013-01-01T01:02:03.340')
    , (N'2013-02-03T04:05:06.780')
    , (N'2014-01-01T01:02:03.340')
    , (N'2014-02-03T04:05:06.780')
    , (N'2014-03-04T05:06:07.890');
GO

This initial setup results in 2 partitions and data properly mapped to the 2 yearly filegroups of the scheme.  Now, we need to prepare for year 2015 so we add a new filegroup, set the NEXT USED filegroup, SPLIT the function for year 2015, and insert data for 2015:

ALTER DATABASE PartitioningDemo
    ADD FILEGROUP FG_2015;

ALTER DATABASE PartitioningDemo
    ADD FILE
        ( NAME = N'FG_2015_1', FILENAME = N'C:\SqlDataFiles\PartitioningDemo_FG_2015_1.ndf' , SIZE = 10MB , MAXSIZE = UNLIMITED, FILEGROWTH = 10MB )
    TO FILEGROUP FG_2015;

ALTER PARTITION SCHEME PS_DateTimeLeft
    NEXT USED FG_2015;

ALTER PARTITION FUNCTION PF_DateTimeLeft()
    SPLIT RANGE(N'2014-12-31T23:59:59.997');

INSERT INTO dbo.PartitionedTableLeft VALUES
    (N'2015-01-01T01:02:03.346');
GO

This SPLIT results in adding the new boundary to the function as expected:

CREATE PARTITION FUNCTION PF_DateTimeLeft(datetime) AS
    RANGE LEFT FOR VALUES(
          N'2013-12-31T23:59:59.997'
        , N'2014-12-31T23:59:59.997'
);

However, the partition scheme is not as desired.  The resultant partition scheme is:

CREATE PARTITION SCHEME PS_DateTimeLeft AS
    PARTITION PF_DateTimeLeft TO(
          FG_2013
        , FG_2015
        , FG_2014
    );

As you can see, the filegroup for 2015 was inserted into the scheme before the 2014 one, resulting in 2014 data in the FG_2015 filegroup and 2015 data in the FG_2014 filegroup.  Furthermore, not only is the filegroup mapping now wrong, data movement was required to move the entire year of 2014 data into the new partition.  No big deal here since only 3 rows were moved by this demo script but in a production table, this movement could be a show stopper.  Logging during SPLIT/MERGE data movement is about 4 times that of normal DML, which is especially costly when working with large tables containing millions or billions of rows, as are commonly used with table partitioning.

Remember that the new partition created by a SPLIT is the one that includes the specified boundary, which is to the left of the existing split partition with a RANGE LEFT function (the 2014 partition here).  Data are moved from the existing split partition into the newly created partition according to the boundaries of the new partition (greater than ‘2013-12-31T23:59:59.997’ and less than or equal to ‘2014-12-31T23:59:59.997’).
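
To see the mapping for yourself rather than scripting out the scheme, the catalog views can be queried.  The sketch below is one possible query, not the only one; it assumes the dbo.PartitionedTableLeft heap from the demo script, and the column aliases are my own.  Note that the rows column of sys.partitions is an approximate count.

```sql
--Show the filegroup, upper boundary, and row count of each partition.
--For RANGE LEFT the upper boundary is inclusive; for RANGE RIGHT it is exclusive.
SELECT
      p.partition_number AS PartitionNumber
    , fg.name AS FilegroupName
    , prv.value AS UpperBoundary --NULL for the last partition
    , p.rows AS ApproximateRows
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
    AND i.index_id = p.index_id
JOIN sys.partition_schemes AS ps
    ON ps.data_space_id = i.data_space_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = ps.data_space_id
    AND dds.destination_id = p.partition_number
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
LEFT JOIN sys.partition_range_values AS prv
    ON prv.function_id = ps.function_id
    AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID(N'dbo.PartitionedTableLeft')
    AND i.index_id IN (0, 1) --heap or clustered index only
ORDER BY p.partition_number;
```

After the SPLIT above, one would expect this to list FG_2015 for partition 2 and FG_2014 for partition 3, matching the scheme definition shown earlier.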

Again, I recommend using a RANGE RIGHT function to avoid this non-intuitive behavior.  Below is the equivalent RANGE RIGHT script for yearly partitions, which results in the desired filegroup mappings as well as no data movement.  Also, note that the datetime boundaries are exact date specifications with RANGE RIGHT, which is also more intuitive when working with temporal datetime, datetime2, and datetimeoffset data types that include a time component.

CREATE PARTITION FUNCTION PF_DateTimeRight(datetime) AS
    RANGE RIGHT FOR VALUES(
        N'2014-01-01T00:00:00.000'
    );

CREATE PARTITION SCHEME PS_DateTimeRight AS
    PARTITION PF_DateTimeRight TO(
          FG_2013
        , FG_2014
    );

CREATE TABLE dbo.PartitionedTableRight(
    DateTimeColumn datetime
    ) ON PS_DateTimeRight(DateTimeColumn);

INSERT INTO dbo.PartitionedTableRight VALUES
      (N'2013-01-01T01:02:03.340')
    , (N'2013-02-03T04:05:06.780')
    , (N'2014-01-01T01:02:03.340')
    , (N'2014-02-03T04:05:06.780')
    , (N'2014-03-04T05:06:07.890');
GO

ALTER PARTITION SCHEME PS_DateTimeRight
    NEXT USED FG_2015;

ALTER PARTITION FUNCTION PF_DateTimeRight()
    SPLIT RANGE(N'2015-01-01T00:00:00.000');

INSERT INTO dbo.PartitionedTableRight VALUES
    (N'2015-01-01T01:02:03.346');

The resultant partition function and scheme definitions are as desired with this RANGE RIGHT partition function:

CREATE PARTITION FUNCTION PF_DateTimeRight (datetime) AS
    RANGE RIGHT FOR VALUES(
          N'2014-01-01T00:00:00.000'
        , N'2015-01-01T00:00:00.000'
    );

CREATE PARTITION SCHEME PS_DateTimeRight AS
    PARTITION PF_DateTimeRight TO(
          FG_2013
        , FG_2014
        , FG_2015
    );
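
A quick way to confirm that rows landed in the intended partitions is the $PARTITION function.  The query below is a small sketch against the dbo.PartitionedTableRight demo table; the column aliases are my own.

```sql
--Count rows per partition and show the range of values each one holds.
SELECT
      $PARTITION.PF_DateTimeRight(DateTimeColumn) AS PartitionNumber
    , COUNT(*) AS NumberOfRows
    , MIN(DateTimeColumn) AS MinValue
    , MAX(DateTimeColumn) AS MaxValue
FROM dbo.PartitionedTableRight
GROUP BY $PARTITION.PF_DateTimeRight(DateTimeColumn)
ORDER BY PartitionNumber;
```

With the six demo rows inserted above, partitions 1, 2, and 3 should contain 2, 3, and 1 rows respectively (years 2013, 2014, and 2015).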

The Partition Removed by MERGE

When a partition is removed with MERGE, the dropped partition is the one that includes the specified boundary.  If the dropped partition is not empty, all data will be moved into the adjacent remaining partition.  As with SPLIT, costly data movement during partition maintenance should be avoided, so it is best to plan such that only empty partitions are removed.  A MERGE should typically be done after a purge/archive of data using SWITCH.

Below is a summary of a RANGE LEFT partition function MERGE:

  • Identify existing partition to be removed, which is the one that includes (to the left of) the specified existing boundary
  • For each table/index using the affected partition scheme(s), move rows from this partition into the adjacent right partition
  • Remove boundary from the partition function, maintaining boundary order and decrementing subsequent partition numbers
  • Remove the filegroup from each partition scheme that uses the function, unless the filegroup is mapped to another partition or is the NEXT USED filegroup

Below is a summary of a RANGE RIGHT partition function MERGE:

  • Identify existing partition to be removed, which is the one that includes (to the right of) the specified existing boundary
  • For each table/index using the affected partition scheme(s), move rows from this partition into the adjacent left partition
  • Remove boundary from the partition function, maintaining boundary order and decrementing subsequent partition numbers
  • Remove the filegroup from each partition scheme that uses the function, unless the filegroup is mapped to another partition or is the NEXT USED filegroup

I suggest explicit partition boundaries for expected data ranges to facilitate using both MERGE and SPLIT, and an additional one for the permanent partition (which I’ll detail shortly).  This practice helps ensure data are both logically and physically aligned, providing more natural partition management.

The Permanent Partition

You might not be aware that each partition scheme has a permanent partition that can never be removed.  This is the first partition of a RANGE RIGHT function and the last partition of a RANGE LEFT one.  Be mindful of this permanent partition when creating a partition scheme that involves multiple filegroups: the filegroup that holds the permanent partition is fixed when the scheme is created and can never be removed from the scheme.

My recommendation is that one create explicit partition boundaries for all expected data ranges plus a lower and upper boundary for data outside the expected range, and map these partitions to appropriately named filegroups.  This practice will make the purpose of each partition/filegroup clear and help avoid accidentally placing data on the wrong filegroup.

Consider mapping partitions containing data outside the expected range to a dummy filegroup with no underlying files.  This will guarantee data integrity much like a check constraint because data outside the allowable range cannot be inserted.  If you must accommodate errant data rather than rejecting it outright, instead map these partitions to a generalized filegroup like DEFAULT or one designated specifically for that purpose.

I suggest specifying a NULL value for the first boundary of a RANGE RIGHT partition function.  This NULL boundary serves as the upper boundary of the permanent first partition as well as the lower boundary for the second partition containing data outside the expected range.  No rows are less than NULL so the first partition will always be empty.  It is therefore safe to map the first partition to the previously mentioned dummy filegroup even if you need to house data outside the expected range.  That being said, there is no harm in mapping the first partition to another filegroup other than lack of clarity.

For the last boundary of a RANGE RIGHT function, I suggest specifying the lowest value outside the expected range and also mapping the partition to either the dummy filegroup, or one designated to contain unexpected data.  The boundaries between the first boundary (NULL) and this one designate partitions for expected data.

Summary

In summary, I recommend a RANGE RIGHT function with the following setup:

  • First boundary value NULL
  • Subsequent boundary values for expected data partitions
  • A final boundary value greater than the expected range
  • Map first, second, and last partitions to either a dummy filegroup or one designated for unexpected data
  • Map remaining expected data partitions to appropriately named filegroups

A similar RANGE LEFT function can be set up as follows.  I’m including this only for completeness as the RANGE RIGHT setup above is a best practice, in my humble opinion.

  • First boundary for data less than the expected range
  • Subsequent boundaries for expected data partitions
  • A final boundary value of the maximum allowable value for the partitioning data type (which is another kludge that bolsters the case for RANGE RIGHT)
  • Map first, second-to-last, and last partitions to either a dummy filegroup or one designated for unexpected data
  • Map remaining expected data partitions to appropriately named filegroups
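
For illustration only, the RANGE LEFT setup listed above might look like the sketch below for 2 years of datetime data.  It assumes the PartitioningDemo database with filegroups FG_2013 and FG_2014 plus a filegroup named FG_NoData that contains no files; the function and scheme names are hypothetical.

```sql
--RANGE LEFT boundaries are inclusive upper bounds, hence the .997 values.
CREATE PARTITION FUNCTION PF_DateTimeLeftWithExplicitBoundaries(datetime) AS
    RANGE LEFT FOR VALUES(
          N'2012-12-31T23:59:59.997' --prior to year 2013 (outside expected range)
        , N'2013-12-31T23:59:59.997' --year 2013 (within expected range)
        , N'2014-12-31T23:59:59.997' --year 2014 (within expected range)
        , N'9999-12-31T23:59:59.997' --after year 2014 (outside expected range), maximum datetime value
    );

--4 boundaries yield 5 partitions.
CREATE PARTITION SCHEME PS_DateTimeLeftWithExplicitBoundaries AS
    PARTITION PF_DateTimeLeftWithExplicitBoundaries TO(
          FG_NoData --first partition: prior to year 2013 (outside expected range)
        , FG_2013 --year 2013 (within expected range)
        , FG_2014 --year 2014 (within expected range)
        , FG_NoData --second-to-last partition: after year 2014 (outside expected range)
        , FG_NoData --last partition: permanent and always empty because no value exceeds the maximum
    );
```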

Below is an example script of applying these techniques with a RANGE RIGHT function, including adding an incremental partition for a new year.  Partitions for data outside the expected data range are mapped to the FG_NoData filegroup (which contains no files) so any attempt to insert data outside the expected range will fail.

--*********************************
--*** initial setup for 2 years ***
--*********************************

--add a dummy filegroup with no files
ALTER DATABASE PartitioningDemo
    ADD FILEGROUP FG_NoData;
GO

CREATE PARTITION FUNCTION PF_DateTimeRightWithExplictBoundaries(datetime) AS
    RANGE RIGHT FOR VALUES(
          NULL --prior to year 2013 (outside expected range)
        , N'2013-01-01T00:00:00.000' --year 2013 (within expected range)
        , N'2014-01-01T00:00:00.000' --year 2014 (within expected range)
        , N'2015-01-01T00:00:00.000' --after year 2014 (outside expected range)
    );

CREATE PARTITION SCHEME PS_DateTimeRightWithExplictBoundaries AS
    PARTITION PF_DateTimeRightWithExplictBoundaries TO(
          FG_NoData --first partition is the permanent partition and cannot be removed
        , FG_NoData --second partition contains data prior to year 2013 and is outside expected range
        , FG_2013 --year 2013 (within expected range)
        , FG_2014 --year 2014 (within expected range)
        , FG_NoData --after year 2014 (outside expected range)
    );

CREATE TABLE dbo.PartitionedTableRightWithExplictBoundaries(
        DateTimeColumn datetime
    ) ON PS_DateTimeRightWithExplictBoundaries(DateTimeColumn);

INSERT INTO dbo.PartitionedTableRightWithExplictBoundaries VALUES
      (N'2013-01-01T01:02:03.346')
    , (N'2013-02-03T04:05:06.780')
    , (N'2014-01-01T01:02:03.346')
    , (N'2014-02-03T04:05:06.780')
    , (N'2014-03-04T05:06:07.890');
GO

--**********************
--*** add a new year ***
--**********************

--specify FG_NoData filegroup to scheme as next used
ALTER PARTITION SCHEME PS_DateTimeRightWithExplictBoundaries
    NEXT USED FG_NoData;

--add 2016 boundary as outside expected range, mapped to FG_NoData
ALTER PARTITION FUNCTION PF_DateTimeRightWithExplictBoundaries()
    SPLIT RANGE(N'2016-01-01T00:00:00.000');

--remove 2015 boundary, currently mapped to FG_NoData
ALTER PARTITION FUNCTION PF_DateTimeRightWithExplictBoundaries()
    MERGE RANGE(N'2015-01-01T00:00:00.000');

--specify FG_2015 filegroup to scheme as next used
ALTER PARTITION SCHEME PS_DateTimeRightWithExplictBoundaries
    NEXT USED FG_2015;

--re-add 2015 boundary as within expected range, now mapped to FG_2015
ALTER PARTITION FUNCTION PF_DateTimeRightWithExplictBoundaries()
    SPLIT RANGE(N'2015-01-01T00:00:00.000');

INSERT INTO dbo.PartitionedTableRightWithExplictBoundaries VALUES
    (N'2015-01-01T01:02:03.346');
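
The same setup also makes it straightforward to purge the oldest year with SWITCH followed by MERGE, as recommended in the MERGE discussion earlier.  The following is a sketch under the assumption that the script above has been run as-is; the staging table name is hypothetical, and a real table with indexes or constraints would need a staging table that matches those as well.

```sql
--staging table with the same structure, on the same filegroup as the 2013 partition
CREATE TABLE dbo.PartitionedTableStaging_2013(
        DateTimeColumn datetime
    ) ON FG_2013;

--determine the number of the partition that holds year 2013
DECLARE @PartitionNumber int =
    $PARTITION.PF_DateTimeRightWithExplictBoundaries(CONVERT(datetime, N'2013-01-01T00:00:00.000'));

--metadata-only move of the 2013 rows into the staging table
ALTER TABLE dbo.PartitionedTableRightWithExplictBoundaries
    SWITCH PARTITION @PartitionNumber TO dbo.PartitionedTableStaging_2013;

--archive the staging rows here if needed, then discard them
DROP TABLE dbo.PartitionedTableStaging_2013;

--remove the 2013 boundary; the partition to its right is now empty so no rows are moved
ALTER PARTITION FUNCTION PF_DateTimeRightWithExplictBoundaries()
    MERGE RANGE(N'2013-01-01T00:00:00.000');
```

After the MERGE, the range prior to 2014 belongs to the second partition on FG_NoData, so inserts of 2013 data would be rejected just like other data outside the expected range.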

 

Always Use Semicolon Statement Terminators

ANSI-standard semicolon statement terminators are often omitted in T-SQL queries and many developers are unaware that this syntax is deprecated.  Omitting statement terminators is a dangerous practice because, even if the batch compiles, you may get unexpected results.  Consider the insidious examples below pointed out by SQL Server MVP Erland Sommarskog:

BEGIN TRY
	BEGIN TRAN
	SELECT 1/0 AS CauseAnException
	COMMIT
END TRY
BEGIN CATCH
	SELECT ERROR_MESSAGE()
	THROW
END CATCH

Few of us will catch (no pun intended) the bug in the above script.  What results do you expect after running the above script under SQL Server 2012 or later versions?  Rather than leave this as an exercise for the reader, I’ll spoil the fun and mention that no run-time error is raised at all.  Instead, the THROW statement is interpreted as a column alias for the ERROR_MESSAGE() column.  This sort of coding error is especially nasty because catch blocks are rarely unit tested and this catch block coding mistake hides the run-time error entirely without raising an exception.

Similarly, the absence of statement terminators in the script below causes another problem.  Can you spot it?

BEGIN TRY
	BEGIN TRAN
	SELECT 1/0 AS CauseAnException
	COMMIT
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRAN
    THROW
END CATCH

At least an error is raised in this case, albeit not the one you might expect.  The resultant error is “Cannot roll back THROW. No transaction or savepoint of that name was found”.  This coding bug obfuscates the preceding divide by zero error and prevents the THROW statement from being executed.
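
For comparison, here is a sketch of the same block with every statement terminated.  With the semicolons in place, ROLLBACK TRAN and THROW are parsed as separate statements, so the transaction is rolled back and the original divide by zero error is re-raised to the caller.

```sql
BEGIN TRY
    BEGIN TRAN;
    SELECT 1/0 AS CauseAnException;
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRAN; --terminate statement and IF
    THROW; --re-raise the original error
END CATCH;
```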

Semicolons Will Become Mandatory
Microsoft announced with the SQL Server 2008 release that semicolon statement terminators will become mandatory in a future version, so terminating statements with anything other than a semicolon (i.e. whitespace) is currently deprecated.  This deprecation announcement means that you should always use semicolon terminators in new development.  I honestly don’t expect SQL Server to strictly enforce mandatory semicolons in the near future but it is still a best practice to use semicolon statement terminators to avoid issues like those mentioned earlier as well as to facilitate code maintainability.  I suggest specifying statement terminators in all new development and perhaps adding terminators to existing code as you perform maintenance.

Transact-SQL does not currently enforce the ANSI semicolon statement terminator requirement.  Instead, semicolon statement terminators are optional and any whitespace (spaces, tabs, newline) may be used instead.  The exception to this rule is that many of the statements introduced in SQL Server 2005 and later require the preceding statement to be properly terminated in order for the batch to compile.

Below are some guidelines I suggest on when to, and when not to, use semicolon statement terminators.

Suggested Guidelines
The Transact-SQL parser is quite lax, allowing any whitespace (e.g. space, tab, newline) to be used in place of a semicolon.  This laxness results in ambiguity, as the examples at the beginning of this article demonstrate.  Similarly, statement terminators may not only be omitted, they may also be used inappropriately.  I strongly suggest you adhere to the T-SQL syntax documented in the Books Online even if the parser allows otherwise.  This practice will help future-proof your code since relying on undocumented behavior is inherently risky.

Don’t precede a statement with a semicolon
Remember that the purpose of semicolons is to terminate SQL statements, not begin them.  A common mistake I see is throwing a semicolon in front of statements in order to get a batch of statements to compile, especially with newer statements like WITH (CTE expression) that require previous statement termination.  Although the T-SQL parser currently ignores extraneous and misplaced semicolons, I suggest they be specified in the appropriate place according to the statement syntax documented in the SQL Server Books Online.
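
A small sketch of the difference, using a throwaway variable and CTE of my own invention:

```sql
--avoid: leading semicolon added only to make the batch compile
--DECLARE @x int = 1
--;WITH a AS (SELECT @x AS a) SELECT a FROM a

--prefer: terminate each statement where it ends
DECLARE @x int = 1;
WITH a AS (SELECT @x AS a)
SELECT a FROM a;
```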

Specify semicolons at the end of each stand-alone SQL statement
Not only will this conform to the ANSI standard, your intent will be clearer and the code easier to read.

Terminate control-of-flow statement blocks at the end of the control-of-flow scope
Control-of-flow statements are not covered by the ANSI SQL standard because these are proprietary SQL extensions.  The SQL Server Books Online is sketchy on the subject and many of the examples (as of this writing) are inconsistent and do not always include statement terminators.  Furthermore, control-of-flow statement blocks are confusing due to the many variations, nesting, and optional BEGIN/END specifications.

Below are examples illustrating what I believe to be proper use of statement terminators and control-of-flow block terminators, using IF statements in SQL 2008 and later versions.  The same concepts apply to other control-of-flow constructs like WHILE and TRY/CATCH blocks.  I should add that this batch example will not compile under SQL 2005 because an explicit BEGIN/END block is required to execute a common table expression conditionally in that version.  T-SQL parser enhancements eliminated that requirement in SQL 2008 and later.

IF 1 = 1
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement and IF

IF 1 = 1
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
ELSE
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement and IF

IF 1 = 1
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END; --terminate IF

IF 1 = 1
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END
ELSE
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END; --terminate IF

IF 1 = 1
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
ELSE
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END; --terminate IF

IF 1 = 1
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END
ELSE
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement and IF

IF 1 = 1
  BEGIN
    WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
	IF 1 = 1
		WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
	ELSE
	BEGIN
		WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
	END; --terminate inner nested IF
	WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement
  END
ELSE
  WITH a AS (SELECT 1 a) SELECT a FROM a; --terminate statement and outer IF

Summary
Consistent use of semicolons helps avoid bugs in code that might otherwise go undetected.  Code with statement terminators can also be modified more easily without introducing compile errors and is easier to maintain because the end of each statement is readily apparent to subsequent developers.  Importantly, you’ll be better positioned for future SQL Server versions by consistently using semicolon statement terminators.

Help Stop SQL Injection Madness

SQL injection continues to be the most exploited security vulnerability on the internet. A SQL injection vulnerability can put companies out of business and expose innocent people to sensitive data disclosure and damage, all due to developer ignorance or carelessness. Sadly, although there is much information available on SQL injection exploits and mitigation, SQL injection exploits continue because developers use flawed database access techniques. This article details the best, and simplest, method to prevent SQL injection in SQL Server: parameterized SQL statements.

Parameterized SQL queries and stored procedure calls (with CommandType.StoredProcedure and no dynamic SQL within proc code) prevent unintended SQL from being executed. I’ll discuss exactly why shortly. Parameterized SQL also provides many benefits besides security. Notably:

  • allows quotes in data without programmatic escaping
  • avoids the need to format date strings according to DATEFORMAT session settings
  • handles decimal separators automatically
  • improves performance due to plan reuse
  • provides cleaner code

Most articles on SQL injection focus on validating user input rather than parameterized queries.  Input validation certainly has benefits related to data integrity and user-friendly validation messages but it is at best a secondary defense against injection compared to parameterized queries.  Input validation using a blacklist of prohibited keywords and characters is especially weak.

SQL Injection Overview
I am puzzled by the number of examples in articles and forum posts that show building SQL statements with literals using a string concatenation technique.  I suspect the main reason for this practice is that developers write code that generates the exact same ad-hoc SQL statement they would write using a query tool like SQL Server Management Studio or Visual Studio.  This leads to the nasty habit of not using parameters.  For example, a developer might develop and test a query like this in SSMS:

SELECT FirstName, LastName, EmailAddress
FROM dbo.Customer
WHERE CustomerID = 12345;

After testing the query, the C# developer includes the SQL statement in the application code but, instead of a hard-coded value for CustomerID, the desired value is concatenated with the remainder of the SQL statement as a literal.

var command = new SqlCommand(
             "SELECT FirstName, LastName, EmailAddress" +
            " FROM dbo.Customer " +
            " WHERE CustomerID = " + textBoxCustomerID.Text + ";"
            ,connection);

The query runs perfectly well and returns the expected result in testing.  It is not until after production deployment that a script kiddie or hacker finds the SQL injection vulnerability.  A malicious person can easily manipulate this SQL statement in ways unintended by the developer.  For example, a hacker could enter CustomerID value “1 OR 1 = 1” in the text box to harvest the names and email addresses of all customers in the database.  This vulnerability can also be exploited to execute additional SQL statements in the same batch, which is often used to update data with malicious HTML script that is subsequently returned and executed by unsuspecting client browsers.  There are many ways to exploit a SQL injection vulnerability that result in sensitive data disclosure and/or database manipulation.  I won’t detail those here but rather recommend a very simple defense: use a parameterized statement instead of literals.

Mitigating SQL Injection
Below is a functionally identical parameterized query.

SELECT FirstName, LastName, EmailAddress
FROM dbo.Customer
WHERE CustomerID = @CustomerID;

This query can be incorporated into the C# application code with the parameter and corresponding value added to the parameterized SQL command as a strongly-typed parameter:

var command = new SqlCommand(
             "SELECT FirstName, LastName, EmailAddress" +
            " FROM dbo.Customer " +
            " WHERE CustomerID = @CustomerID;"
            ,connection);
command.Parameters.Add("@CustomerID", SqlDbType.Int).Value = int.Parse(textBoxCustomerID.Text);

As you can see, the effort needed to do the same job with a parameterized command is about the same as the string concatenation method, yet the security benefits are immeasurable.

Why Parameters Prevent SQL Injection
Parameterized SQL is more secure for a couple of reasons.  In the case of an ad-hoc query, the SQL statement with parameter markers is hard-coded in the application (or generated via an ORM framework) and passed to SQL Server as-is.  Parameter values are passed separately rather than inside the SQL statement itself.  This ensures the SQL statement cannot be modified by the values provided.

Similarly, a stored procedure call (using CommandType.StoredProcedure) sends only the stored procedure name to SQL Server with separately passed parameter values.  Using stored procedures also allows one to more strictly adhere to the security principle of least privilege.  By granting only stored procedure execute permissions, permissions on indirectly referenced objects are not needed as long as the ownership chain is unbroken.  This practice limits ad-hoc access to underlying data should an attacker find a back door into the database under the application security context.  Personally, I’m a staunch advocate of stored procedures for this and other reasons too.  Stored procedures provide a well-defined database interface, allowing queries and database schema to be refactored without breaking changes.

I should add that it is perfectly acceptable to build a SQL statement dynamically in either application code or stored procedures, but only if parameterized and actually needed for the task at hand.  A common use case is a dynamic search query that is constructed based on user criteria. Instead of coding a separate query for each possible search permutation, one can add parameterized WHERE clauses as needed.  Under no circumstances should a SQL statement string be built with concatenation of untrusted values.
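As a minimal sketch of the dynamic search use case, the T-SQL procedure below appends only the WHERE predicates that are needed and passes the values through sp_executesql parameters. The procedure name, the search columns chosen, and the nvarchar(50) lengths are assumptions for illustration; only the dbo.Customer table and its column names come from the earlier examples.

```sql
-- Hypothetical search proc; data types and lengths are assumed, not taken from a real schema
CREATE PROC dbo.usp_SearchCustomers
    @LastName nvarchar(50) = NULL
    ,@EmailAddress nvarchar(50) = NULL
AS
SET NOCOUNT ON;
DECLARE @Sql nvarchar(MAX) = N'SELECT FirstName, LastName, EmailAddress
FROM dbo.Customer
WHERE 1 = 1';

-- append only the predicates needed; values are never concatenated into the string
IF @LastName IS NOT NULL
    SET @Sql += N'
    AND LastName = @LastName';
IF @EmailAddress IS NOT NULL
    SET @Sql += N'
    AND EmailAddress = @EmailAddress';
SET @Sql += N';';

-- parameter values are passed separately from the statement text
EXEC sys.sp_executesql
    @Sql
    ,N'@LastName nvarchar(50), @EmailAddress nvarchar(50)'
    ,@LastName = @LastName
    ,@EmailAddress = @EmailAddress;
GO

-- example call: search by last name only
EXEC dbo.usp_SearchCustomers @LastName = N'Smith';
```

sp_executesql accepts declared parameters that the statement text does not reference, so the same parameter list can be passed regardless of which predicates were appended.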

Additional Precautions
Use Windows Authentication for SQL Server database access.  This practice eliminates the need to store security credentials in application configuration files.  If SQL authentication must be used, protect the credentials with encryption.  This is easily done in .NET by encrypting the connectionStrings section of the app.config file.  See http://msdn.microsoft.com/en-us/library/ms254494.aspx.

Adhere to the principle of least privilege.  Never use a privileged account, such as a sysadmin role member, for routine application access.  Grant permissions only on database objects which are used directly by applications.  Use stored procedures so that one can leverage ownership chaining, eliminating the need to grant permissions on tables directly.
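The script below is a small sketch of that permission model, assuming the table and procedure are in the same schema with the same owner so the ownership chain is unbroken. All object and role names here are hypothetical.

```sql
-- Hypothetical objects for illustration only
CREATE TABLE dbo.CustomerDemo(
    CustomerID int NOT NULL
        CONSTRAINT PK_CustomerDemo PRIMARY KEY
    ,LastName nvarchar(50) NOT NULL
);
GO

CREATE PROC dbo.usp_GetCustomerDemo
    @CustomerID int
AS
SELECT LastName
FROM dbo.CustomerDemo
WHERE CustomerID = @CustomerID;
GO

-- role for the application account; note that no table permissions are granted
CREATE ROLE AppRoleDemo;
GRANT EXECUTE ON OBJECT::dbo.usp_GetCustomerDemo TO AppRoleDemo;
GO
```

A member of this role can execute the procedure but cannot query dbo.CustomerDemo directly.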

Specify strongly-typed parameters.  The parameter data type can be inferred (e.g. using SqlCommand.Parameters.AddWithValue method) but a best practice is to explicitly specify a parameter data type that matches the underlying column data type and maximum length.  These practices will minimize the size of the SQL Server procedure cache and avoid implicit conversions that can negatively affect performance.

Maximizing Performance with Table-Valued Parameters

I’ve seen TVPs improve performance by orders of magnitude when used appropriately.  This feature isn’t used as often as it should be so I’ll show how to implement a TVP using a C# application example and discuss TVP considerations.

Why Table-Valued Parameters Improve Performance
TVPs allow one to pass multiple rows or values at once to a query or stored procedure.  By doing so, the query can leverage set-based operations to greatly improve performance compared to executing the query/proc many times.  TVPs also reduce the number of network round trips, so less time is lost to network latency.  Prior to TVPs, one had to employ workarounds like passing a string containing a delimited list of records (or XML) and parsing on the SQL side.  Those techniques are at best a kludge.  Instead, TVPs are the right tool for the job in SQL Server 2008 and later.

TVPs (and table variables) are temp tables.  TVPs and table variables are persisted in tempdb much the same way as regular temp tables (# prefix).  The big differences are that TVPs and table variables are limited in scope to the current batch and do not have statistics.  Also, table parameters/variables allow only indexes declared via primary key or unique constraints; non-unique indexes are not supported.

When a TVP is passed to SQL Server from a client application, the data is bulk-inserted into tempdb where it is made available to the SQL statement or stored procedure using the declared TVP.  This bulk insert is handled by the client API transparently to the application and allows a large number of rows to be passed to SQL Server efficiently.  In T-SQL, one can declare and load a table variable, where it can be used directly in a SQL statement or passed as a TVP to a stored procedure or query (invoked with sp_executesql).

The exact threshold where TVPs outperform single-row operations is much lower than you might expect.  Your mileage may vary but, at least in my experience, it’s only a few rows.  The performance tests I ran here show that the threshold where a TVP outperforms individual requests with a simple list of customers is only 4 rows.  That said, I would not recommend TVPs for single-row operations due to the start-up overhead but it’s a good choice when the most common use case is multiple rows.

How to Use TVPs
In order to pass a TVP to a SQL statement or stored procedure, one must first create a user-defined table type in SQL Server.  The table type defines the schema of the table SQL Server expects for the TVP.  Listing 1 shows the CREATE DDL for the table type I use in the AdventureWorks2012 database to pass a list of customers to the stored procedure in Listing 2 that returns all orders for these customers.  Note that the stored procedure TVP must be declared as READONLY.

Listing 1: Table type DDL

CREATE TYPE dbo.CustomerList AS TABLE(
    CustomerID int NOT NULL
        PRIMARY KEY
);

Listing 2: Stored procedure to return orders for multiple customers

CREATE PROC dbo.usp_select_orders_for_multiple_customers
    @CustomerList dbo.CustomerList READONLY
AS
SELECT soh.CustomerID, soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
JOIN @CustomerList AS cl ON
    cl.CustomerID = soh.CustomerID;
RETURN @@ERROR;
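The table type and procedure above can also be exercised directly from T-SQL, which is handy for testing before any application code is written. This is a sketch that assumes the objects from Listings 1 and 2 exist in AdventureWorks2012. The three customer IDs are arbitrary sample values.

```sql
-- load a table variable of the dbo.CustomerList type from Listing 1
DECLARE @CustomerList dbo.CustomerList;
INSERT INTO @CustomerList (CustomerID)
    VALUES (11000), (11001), (11002);

-- pass the table variable as a TVP to the proc from Listing 2
EXEC dbo.usp_select_orders_for_multiple_customers
    @CustomerList = @CustomerList;

-- pass the same table variable to a parameterized query via sp_executesql
EXEC sys.sp_executesql
    N'SELECT soh.CustomerID, soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
JOIN @CustomerList AS cl ON
    cl.CustomerID = soh.CustomerID;'
    ,N'@CustomerList dbo.CustomerList READONLY'
    ,@CustomerList = @CustomerList;
```

Note that the TVP must be declared READONLY in the sp_executesql parameter definition, just as in the stored procedure.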

TVPs are passed from a .NET application using parameter data type SqlDbType.Structured.  The actual parameter value can be an object of type DataTable, DbDataReader or IEnumerable<SqlDataRecord>.  A DataTable is appropriate when the required TVP data is already in a DataTable object needed for other purposes.  A DbDataReader is a good choice when the source data is the result of a query (note that any DbDataReader may be used, not just a SqlDataReader).   For other cases, I recommend an IEnumerable<SqlDataRecord>.  It is very easy to implement IEnumerable<SqlDataRecord>, especially if you already have an enumerable object like an array or collection containing the values you need to pass.  See the code at the end of this article for an example class that implements IEnumerable<SqlDataRecord> for a TVP.

A benefit with DbDataReader and IEnumerable<SqlDataRecord> is that the source data can be streamed to SQL Server without first loading all the TVP data into memory in either the client application or in SQL Server.  This is a big consideration when a lot of data needs to be passed because you are not constrained by memory on either the client or SQL Server side.

Listing 3 shows the stored procedure I use to compare performance with the TVP method.  This proc takes only a single @CustomerID parameter so it must be called once for each customer.

Listing 3: Stored procedure to return orders for a single customer

CREATE PROC dbo.usp_select_orders_for_single_customer
    @CustomerID int
AS
SELECT soh.CustomerID, soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE
    soh.CustomerID = @CustomerID;
RETURN @@ERROR;

The complete C# console application I used for the performance tests is in Listing 4.  The code demonstrates the 3 ways to pass a TVP from C# and logs timings for each invocation.  As you can see in Figure 1, the TVP outperformed individual calls once more than 3 customers were passed.  Note that I had to use a logarithmic y-axis scale because of the huge performance difference.  For example, 10,000 individual calls averaged about 1.5 seconds whereas passing 10,000 customers via a TVP took less than 100 milliseconds.

Figure 1:  TVP versus individual call performance
TVP performance graph

Listing 4: Passing TVPs in C#

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using Microsoft.SqlServer.Server;
using System.IO;

namespace TVPTest
{
    class Program
    {

        private static string connectionString = @"Data Source=MyServer;Initial Catalog=AdventureWorks2012;Integrated Security=SSPI";
        private static StreamWriter logFile = new StreamWriter(@"TvpTest.log", false);

        static void Main(string[] args)
        {

            // write column headers to log file
            logFile.WriteLine("{0},{1},{2}"
                , "Test Name"
                , "Customers"
                , "Durations");

            // run each performance test 100 times
            for (int i = 0; i < 100; ++i)
            {
                runPerformanceTest(1);
                runPerformanceTest(2);
                runPerformanceTest(3);
                runPerformanceTest(4);
                runPerformanceTest(5);
                runPerformanceTest(10);
                runPerformanceTest(25);
                runPerformanceTest(50);
                runPerformanceTest(100);
                runPerformanceTest(500);
                runPerformanceTest(1000);
                runPerformanceTest(2500);
                runPerformanceTest(5000);
                runPerformanceTest(10000);
                logFile.Flush();
            }

            logFile.Close();

            Console.WriteLine("Press any key to close.");
            Console.ReadKey();
        }

        private static void runPerformanceTest(int customerCount)
        {

            Console.WriteLine("{0} customers:", customerCount);

            DataTable orderDataTable;
            Stopwatch sw;
            DateTime duration;

            //individual call baseline
            sw = Stopwatch.StartNew();
            orderDataTable = getOrdersForCustomersIndividually(customerCount);
            sw.Stop();
            duration = new DateTime(sw.Elapsed.Ticks);
            Console.WriteLine("\tIndividual call: {0} orders returned, duration is {1}"
                , orderDataTable.Rows.Count, duration.ToString("HH:mm:ss.ffffff"));
            logFile.WriteLine("{0},{1},{2}"
                , "Individual call"
                , customerCount
                , duration.ToString("HH:mm:ss.ffffff"));

            //TVP from DataTable
            sw = Stopwatch.StartNew();
            orderDataTable = getOrdersForCustomersUsingTvp(getCustomerListDataTable(customerCount));
            sw.Stop();
            duration = new DateTime(sw.Elapsed.Ticks);
            Console.WriteLine("\tTVP DataTable: {0} orders returned, duration is {1}"
                , orderDataTable.Rows.Count, duration.ToString("HH:mm:ss.ffffff"));
            logFile.WriteLine("{0},{1},{2}"
                , "TVP DataTable"
                , customerCount
                , duration.ToString("HH:mm:ss.ffffff"));

            //TVP from DataReader
            sw = Stopwatch.StartNew();
            orderDataTable = getOrdersForCustomersUsingTvp(getCustomerListDataReader(customerCount));
            sw.Stop();
            duration = new DateTime(sw.Elapsed.Ticks);
            Console.WriteLine("\tTVP DataReader: {0} orders returned, duration is {1}"
                , orderDataTable.Rows.Count, duration.ToString("HH:mm:ss.ffffff"));
            logFile.WriteLine("{0},{1},{2}"
                , "TVP DataReader"
                , customerCount
                , duration.ToString("HH:mm:ss.ffffff"));

            //TVP from IEnumerable
            sw = Stopwatch.StartNew();
            orderDataTable = getOrdersForCustomersUsingTvp(getCustomerListSqlDataRecords(customerCount));
            sw.Stop();
            duration = new DateTime(sw.Elapsed.Ticks);
            Console.WriteLine("\tTVP IEnumerable: {0} orders returned, duration is {1}"
                , orderDataTable.Rows.Count, duration.ToString("HH:mm:ss.ffffff"));
            logFile.WriteLine("{0},{1},{2}"
                , "TVP IEnumerable"
                , customerCount
                , duration.ToString("HH:mm:ss.ffffff"));

            Console.WriteLine();

        }

        /// <summary>
        /// Call proc individually for each customer
        /// </summary>
        /// <param name="customerCount">Number of customers to return</param>
        /// <returns>DataTable of customer orders</returns>
        private static DataTable getOrdersForCustomersIndividually(int customerCount)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(@"dbo.usp_select_orders_for_single_customer", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                var customerIdParameter = command.Parameters.Add("@CustomerID", System.Data.SqlDbType.Int);
                var orderListDataAdapter = new SqlDataAdapter(command);
                var orderDataTable = new DataTable();
                for (int customerId = 11000; customerId < 11000 + customerCount; ++customerId)
                {
                    customerIdParameter.Value = customerId;
                    orderListDataAdapter.Fill(orderDataTable);
                }
                return orderDataTable;
            }
        }

        /// <summary>
        /// Call proc with TVP of CustomersIDs
        /// </summary>
        /// <param name="customerList">List of customers</param>
        /// <returns>DataTable of customer orders</returns>
        private static DataTable getOrdersForCustomersUsingTvp(Object customerList)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(@"dbo.usp_select_orders_for_multiple_customers", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                command.Parameters.Add("@CustomerList", System.Data.SqlDbType.Structured).Value = customerList;
                var orderListDataAdapter = new SqlDataAdapter(command);
                var orderDataTable = new DataTable();
                orderListDataAdapter.Fill(orderDataTable);
                return orderDataTable;
            }
        }

        /// <summary>
        /// Create a DataTable containing list of customers for TVP
        /// </summary>
        /// <param name="customerCount">Number of customers to return</param>
        /// <returns>DataTable with single CustomerID column</returns>
        private static DataTable getCustomerListDataTable(int customerCount)
        {
            var customerListDataTable = new DataTable();
            customerListDataTable.Columns.Add("CustomerID", typeof(int));

            for (int customerId = 11000; customerId < 11000 + customerCount; ++customerId)
            {
                var row = customerListDataTable.NewRow();
                customerListDataTable.Rows.Add(row);
                row[0] = customerId;
            }
            return customerListDataTable;
        }

        /// <summary>
        /// Create a DataReader containing list of customers for TVP
        /// </summary>
        /// <param name="customerCount">Number of customers to return</param>
        /// <returns>DataReader with single CustomerID column</returns>
        private static IDataReader getCustomerListDataReader(int customerCount)
        {
            var connection = new SqlConnection(connectionString);
            var command = new SqlCommand(@"SELECT CustomerID FROM Sales.Customer WHERE CustomerID BETWEEN 11000 AND 11000 + @CustomerCount - 1;", connection);
            command.Parameters.Add("@CustomerCount", System.Data.SqlDbType.Int).Value = customerCount;
            connection.Open();
            command.CommandType = CommandType.Text;
            return command.ExecuteReader();

        }

        /// <summary>
        /// Create an IEnumerable of SqlDataRecord containing list of customers for TVP
        /// </summary>
        /// <param name="customerCount">Number of customers to return</param>
        /// <returns>IEnumerable of SqlDataRecord with single CustomerID column</returns>
        private static IEnumerable<SqlDataRecord> getCustomerListSqlDataRecords(int customerCount)
        {
            var customerList = new List<int>();
            for (int customerId = 11000; customerId < 11000 + customerCount; ++customerId)
            {
                customerList.Add(customerId);
            }
            return new CustomerListSqlDataRecords(customerList);
        }

        /// <summary>
        /// Implement IEnumerable of SqlDataRecord to return a SqlDataRecord for each CustomerID
        /// in the provided IEnumerable of int
        /// </summary>
        private class CustomerListSqlDataRecords : IEnumerable<SqlDataRecord>
        {
            private SqlMetaData metaData = new SqlMetaData("CustomerID", SqlDbType.Int);

            private IEnumerable<int> customerList;

            public CustomerListSqlDataRecords(IEnumerable<int> customerList)
            {
                this.customerList = customerList;
            }

            public IEnumerator<SqlDataRecord> GetEnumerator()
            {
                foreach (var customerId in customerList)
                {
                    var record = new SqlDataRecord(metaData);
                    record.SetInt32(0, customerId);
                    yield return record;
                }
            }

            // non-generic enumerator delegates to the generic implementation
            System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
        }
    }
}

Improving Uniqueidentifier Performance

A common anti-pattern I run into is the random primary key, commonly a GUID. This design is insidious because the performance implications of random access aren’t immediately obvious and are exacerbated when the primary key index is clustered. It is often only after the table grows to a larger size that the performance problems become apparent. Symptoms include slowly degrading performance over time, with increased blocking and deadlocking as a side effect.

Figure 1 shows the performance profile of inserts with a random GUID (SQL Server uniqueidentifier data type) clustered primary key. The red line indicates the rate of batch requests per second (inserts) while the blue line shows the total number of rows in the table, scaled such that the top of the graph represents 3M rows. Only about 700,000 rows could be inserted during this 15-minute single-threaded random key insert test, even though the insert rate was fast initially.

Figure 1: Random key insert performance
Random insert performance graph

Incremental Primary Keys

As you might guess, the cure for the random primary key anti-pattern is an incremental key pattern. With a uniqueidentifier data type, a sequential value can be assigned by SQL Server using the NEWSEQUENTIALID function (in a default constraint expression) or in application code using the UuidCreateSequential Win32 API call along with some byte swapping (code example below). Alternatively, one can use an integral data type (int, bigint, etc.) along with a value generated by an IDENTITY property or a SEQUENCE object. The advantage of an integral type is the reduced space requirements compared to a 16-byte uniqueidentifier. The advantage of a uniqueidentifier is that it can easily be generated in application code before database persistence without a database round trip, which is desirable for distributed applications and when keys of related tables are assigned in application code before writing to the database.
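As a minimal sketch of the server-side option, the table below assigns its key with NEWSEQUENTIALID in a default constraint and returns the generated values with an OUTPUT clause. The table, column, and constraint names are hypothetical and are not the ones used in my performance tests.

```sql
-- Hypothetical demo table; NEWSEQUENTIALID may only be used in a DEFAULT constraint
CREATE TABLE dbo.SequentialKeyDemo(
    DemoKey uniqueidentifier NOT NULL
        CONSTRAINT DF_SequentialKeyDemo_DemoKey DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_SequentialKeyDemo PRIMARY KEY CLUSTERED
    ,DemoData varchar(100) NOT NULL
);
GO

-- the key column is omitted so the default supplies an increasing value for each row
INSERT INTO dbo.SequentialKeyDemo (DemoData)
    OUTPUT inserted.DemoKey, inserted.DemoData
    VALUES ('first'), ('second'), ('third');
```

The trade-off with this approach is that the key is not known to the application until after the insert, so the OUTPUT clause (or a subsequent query) is needed when the value must be used for related rows.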

Figure 2 shows the same test using a sequential key value. Over 2.2M rows were inserted in 15 minutes. As you can see, significant performance improvement is achieved with this trivial application change.

Figure 2: Incremental key insert performance
Random insert performance graph

Listing 1 shows the T-SQL code I used for these performance tests and Listing 2 contains the C# code (with the random GUID commented out). I generated the uniqueidentifier value via application code in the tests but performance with NEWID() is comparable to the first test and NEWSEQUENTIALID() is similar to the second test.

Listing 1: T-SQL scripts for test table and stored procedure

CREATE TABLE dbo.TestTable(
	TestKey uniqueidentifier NOT NULL
		CONSTRAINT PK_TestTable PRIMARY KEY CLUSTERED
	,TestData char(8000) NOT NULL
);
GO

CREATE PROC dbo.InsertTestTable
	@TestKey uniqueidentifier
	,@TestData char(8000)
AS
SET NOCOUNT ON;
DECLARE @TotalRows int;

--insert row
INSERT INTO dbo.TestTable (TestKey, TestData) 
	VALUES(@TestKey, @TestData);

--update pmon counter for rowcount
SELECT @TotalRows = rows 
FROM sys.partitions 
WHERE object_id = OBJECT_ID(N'TestTable') AND index_id = 1;
EXEC sys.sp_user_counter1 @TotalRows; --for pmon row count

RETURN @@ERROR;
GO

Listing 2: C# insert test console application

using System;
using System.Data;
using System.Data.SqlClient;
using System.Runtime.InteropServices;

namespace UniqueIdentifierPerformanceTest
{
    class Program
    {
        [DllImport("rpcrt4.dll", CharSet = CharSet.Auto)]
        public static extern int UuidCreateSequential(ref Guid guid);

        static string connectionString = @"Data Source=MyServer;Initial Catalog=MyDatabase;Integrated Security=SSPI";
        static int rowsToInsert = 10000000;
        static SqlConnection connection;
        static SqlCommand command;

        static void Main(string[] args)
        {

            int rowsInserted = 0;

            using (connection = new SqlConnection(connectionString))
            {
                using (command = new SqlCommand("dbo.InsertTestTable", connection))
                {
                    command.Connection = connection;
                    command.CommandType = CommandType.StoredProcedure;
                    command.Parameters.Add("@TestKey", SqlDbType.UniqueIdentifier);
                    command.Parameters.Add("@TestData", SqlDbType.Char, 8000);

                    connection.Open();

                    while (rowsInserted < rowsToInsert)
                    {
                        //random guid
                        //command.Parameters["@TestKey"].Value = Guid.NewGuid();

                        //sequential guid
                        command.Parameters["@TestKey"].Value = NewSequentialGuid();
                        command.Parameters["@TestData"].Value = "Test";
                        command.ExecuteNonQuery();
                        ++rowsInserted;

                        //display progress every 1000 rows
                        if (rowsInserted % 1000 == 0)
                        {
                            Console.WriteLine(string.Format(
                                "{0} of {1} rows inserted"
                                , rowsInserted.ToString("#,##0")
                                , rowsToInsert.ToString("#,##0")));
                        }
                    }
                }

                connection.Close();
            }

        }

        /// <summary>
        /// call UuidCreateSequential and swap bytes for SQL Server format
        /// </summary>
        /// <returns>sequential guid for SQL Server</returns>
        private static Guid NewSequentialGuid()
        {
            const int S_OK = 0;
            const int RPC_S_UUID_LOCAL_ONLY = 1824;

            Guid oldGuid = Guid.Empty;

            int result = UuidCreateSequential(ref oldGuid);
            if (result != S_OK && result != RPC_S_UUID_LOCAL_ONLY)
            {
                throw new ExternalException("UuidCreateSequential call failed", result);
            }

            byte[] oldGuidBytes = oldGuid.ToByteArray();
            byte[] newGuidBytes = new byte[16];
            oldGuidBytes.CopyTo(newGuidBytes, 0);

            // swap low timestamp bytes (0-3)
            newGuidBytes[0] = oldGuidBytes[3];
            newGuidBytes[1] = oldGuidBytes[2];
            newGuidBytes[2] = oldGuidBytes[1];
            newGuidBytes[3] = oldGuidBytes[0];

            // swap middle timestamp bytes (4-5)
            newGuidBytes[4] = oldGuidBytes[5];
            newGuidBytes[5] = oldGuidBytes[4];

            // swap high timestamp bytes (6-7)
            newGuidBytes[6] = oldGuidBytes[7];
            newGuidBytes[7] = oldGuidBytes[6];

            //remaining 8 bytes are unchanged (8-15) 

            return new Guid(newGuidBytes);

        }

    }
}

Why Random Keys Are Bad

I think it’s important for one to understand why random keys have such a negative impact on performance against large tables. DBAs often cite fragmentation and page splits as the primary causes of poor performance with random keys. Although it is true random inserts do cause fragmentation and splits, the primary cause of bad performance with random keys is poor temporal reference locality (http://en.wikipedia.org/wiki/Locality_of_reference), which I’ll detail shortly. Note that there were no real page splits in these insert performance tests because the nearly 8K row size allowed only one row per page. Although significant extent fragmentation occurred, this didn’t impact these single-row requests; extent fragmentation is mostly an issue with sequential scans against spinning media. So neither splits nor fragmentation explain the poor performance of the random inserts.

Temporal reference locality basically means that once data is used (e.g. inserted or touched in any way), it is likely to be used again in the near future. This is why SQL Server uses an LRU-2 algorithm to manage the buffer cache; data most recently touched will remain in memory while older, less often referenced data are aged out. The impact of random key values on temporal locality (i.e. buffer efficiency) is huge. Consider that inserts are basically rewrites of existing pages. When a new row is inserted into a table, SQL Server first reads the page where the row belongs (by key value if the table has a clustered index) and then either adds the row to the existing data page or allocates a new one if there’s not enough space available in the existing page for the new row. With a random key value, the new key value is unlikely to be adjacent to the last one inserted (which is probably still in memory) so the needed page often must be read from storage.

All things being equal, single-row performance will be roughly the same with both sequential and random keys as long as data are memory resident. This is why the random and sequential key insert tests show the same good performance initially. But once the table size exceeded the size of the buffer pool, the random key test showed a precipitous drop in throughput and steady degradation thereafter. In short, random keys diminish temporal reference locality because there is no correlation between time (most recently accessed data) and the key value.

Why Incremental Keys Are Good

An incremental key value naturally improves temporal reference locality; the next key value is adjacent to the last one inserted and is likely still in memory. An incremental key provides excellent insert performance regardless of table size as the insert performance test shows. Also, applications typically use recently inserted data more often than older data. This allows the same amount of work to be done with much less physical I/O than a random key value.

Random Notes about GUIDs

According to the Globally unique identifier Wiki (http://en.wikipedia.org/wiki/Globally_unique_identifier), the random 122 bits of a GUID can generate 2^122 unique values. That’s an incomprehensibly large 5.3 x 10^36 (or 5,300,000,000,000,000,000,000,000,000,000,000,000) unique values.

The value returned by NEWSEQUENTIALID and UuidCreateSequential is guaranteed to be unique on a given computer. Furthermore, it is globally unique if the computer has a network card because the MAC address is used as part of the GUID generation algorithm.

Remediating a Table Partitioning Mess

Partitioning often involves exceptionally large tables, which are quite unforgiving when it comes to remediating a mess due to improper partition design or maintenance.  In this post, I’ll discuss the common design mistake of using a RANGE LEFT partition function containing exact date boundaries with a date type that may include a time component (datetime, datetime2 or datetimeoffset) and show how to change it to RANGE RIGHT with minimal data movement.

Background
The most common use case for table partitioning is an incremental value.  Partitioning on a date, datetime, datetime2 or datetimeoffset value that occurs naturally in the data allows incremental load using SPLIT and SWITCH operations and purge/archive using SWITCH and MERGE.  SWITCH is nearly instantaneous regardless of the amount of underlying data involved because only meta-data changes are needed to move partitions from one table to another.  SPLIT and MERGE are also very fast meta-data only operations as long as no data movement is required.  For this reason, it is critical to set up boundaries correctly and plan to perform SPLIT and MERGE against empty partitions whenever possible.

Note that RANGE LEFT or RANGE RIGHT specification determines:

  • The partition that contains exact boundary value matches
  • The partition that is created by SPLIT
  • The partition that is removed by MERGE
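To make the SPLIT, SWITCH and MERGE operations concrete, here is a minimal sliding window sketch. All object names are hypothetical, and I place every partition on [PRIMARY] only to keep the example self-contained. The first and last partitions are deliberately kept empty so that SPLIT and MERGE touch only empty partitions.

```sql
--hypothetical objects; [PRIMARY] is used for every partition to keep the sketch self-contained
CREATE PARTITION FUNCTION PF_SlidingDemo(date)
       AS RANGE RIGHT FOR VALUES('20130101','20130201','20130301');
CREATE PARTITION SCHEME PS_SlidingDemo
       AS PARTITION PF_SlidingDemo ALL TO ([PRIMARY]);
CREATE TABLE dbo.SlidingDemo(
       PartitioningColumn date NOT NULL
       ,DataColumn int NOT NULL
       ) ON PS_SlidingDemo(PartitioningColumn);
--non-partitioned table on the same filegroup to receive the purged month
CREATE TABLE dbo.SlidingDemoPurge(
       PartitioningColumn date NOT NULL
       ,DataColumn int NOT NULL
       ) ON [PRIMARY];
--January and February rows; partitions 1 (< 20130101) and 4 (>= 20130301) stay empty
INSERT INTO dbo.SlidingDemo VALUES('20130115',1),('20130215',2);

--prepare for a future month: SPLIT the empty last partition
ALTER PARTITION SCHEME PS_SlidingDemo NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_SlidingDemo() SPLIT RANGE('20130401');

--purge the oldest month: SWITCH January out, then MERGE the two empty partitions
ALTER TABLE dbo.SlidingDemo SWITCH PARTITION 2 TO dbo.SlidingDemoPurge;
ALTER PARTITION FUNCTION PF_SlidingDemo() MERGE RANGE('20130101');
DROP TABLE dbo.SlidingDemoPurge;
```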

A common design mistake is using exact date boundaries in a RANGE LEFT function with datetime, datetime2 or datetimeoffset data type.  This error can result in rows in the wrong partition because these data types include a time component and the implications are not obvious, especially to partitioning newbies.  Rows that exactly match the date boundary (i.e. midnight on boundary date) are inserted into the lower partition instead of the higher one as desired.  For example, consider a RANGE LEFT function in a monthly sliding window scenario with boundaries of ‘2013-01-01T00:00:00’ and ‘2013-02-01T00:00:00’:

CREATE PARTITION FUNCTION PF_Monthly(datetime2(0))
    AS RANGE LEFT FOR VALUES('2013-01-01T00:00:00', '2013-02-01T00:00:00');

CREATE PARTITION SCHEME PS_Monthly AS
    PARTITION PF_Monthly TO (December2012, January2013, February2013);

This setup looks reasonable at first glance but consider the actual partitions when a table is created on the partition scheme:

  • Partition 1 (December2012 filegroup):  <= ‘2013-01-01T00:00:00’
  • Partition 2 (January2013 filegroup):  > ‘2013-01-01T00:00:00’ AND <= ‘2013-02-01T00:00:00’
  • Partition 3 (February2013 filegroup):  > ‘2013-02-01T00:00:00’

A row with the January datetime of ‘2013-01-01T00:00:00’ will end up in the first partition (December2012 filegroup).  Sliding window maintenance, which purges the oldest month (December2012), will include the ‘2013-01-01T00:00:00’ row(s) inadvertently.  Similarly, a row with value ‘2013-02-01T00:00:00’ will end up in the second partition (January2013 filegroup) and require both the second and third partitions to be accessed when data from all of February are queried.
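One way to see this behavior without creating a table is the $PARTITION function. The sketch below uses two hypothetical functions (PF_DemoLeft and PF_DemoRight) with the same boundaries as above so that it does not collide with PF_Monthly.

```sql
--hypothetical functions with identical boundaries; only LEFT versus RIGHT differs
CREATE PARTITION FUNCTION PF_DemoLeft(datetime2(0))
       AS RANGE LEFT FOR VALUES('2013-01-01T00:00:00', '2013-02-01T00:00:00');
CREATE PARTITION FUNCTION PF_DemoRight(datetime2(0))
       AS RANGE RIGHT FOR VALUES('2013-01-01T00:00:00', '2013-02-01T00:00:00');
GO
--show the partition number each test value maps to under both functions
SELECT
       TestValue
       ,$PARTITION.PF_DemoLeft(TestValue) AS RangeLeftPartition
       ,$PARTITION.PF_DemoRight(TestValue) AS RangeRightPartition
FROM (VALUES
       (CAST('2012-12-31T23:59:59' AS datetime2(0)))
       ,(CAST('2013-01-01T00:00:00' AS datetime2(0)))
       ,(CAST('2013-01-01T00:00:01' AS datetime2(0)))
       ,(CAST('2013-02-01T00:00:00' AS datetime2(0)))
       ) AS TestValues(TestValue)
ORDER BY TestValue;
GO
DROP PARTITION FUNCTION PF_DemoLeft;
DROP PARTITION FUNCTION PF_DemoRight;
```

The midnight values map to partitions 1 and 2 under RANGE LEFT but to partitions 2 and 3 under RANGE RIGHT, which is the placement one normally wants for monthly data.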

Remediation
There are two ways to address this problem.  One is to change the boundary values of the existing RANGE LEFT function to the maximum inclusive value instead of an exact date. For example, instead of ‘2013-02-01T00:00:00’ as the upper boundary of the January 2013 partition, change the boundary value to the maximum January value according to the partition function data type as follows:

  • *datetime: ‘2013-01-31T23:59:59.997’
  • datetime2(0): ‘2013-01-31T23:59:59’
  • datetime2(1): ‘2013-01-31T23:59:59.9’
  • datetime2(2): ‘2013-01-31T23:59:59.99’
  • datetime2(3): ‘2013-01-31T23:59:59.999’
  • datetime2(4): ‘2013-01-31T23:59:59.9999’
  • datetime2(5): ‘2013-01-31T23:59:59.99999’
  • datetime2(6): ‘2013-01-31T23:59:59.999999’
  • datetime2(7): ‘2013-01-31T23:59:59.9999999’
  • **datetimeoffset(0): ‘2013-01-31T23:59:59-05:00’
  • **datetimeoffset(1): ‘2013-01-31T23:59:59.9-05:00’
  • **datetimeoffset(2): ‘2013-01-31T23:59:59.99-05:00’
  • **datetimeoffset(3): ‘2013-01-31T23:59:59.999-05:00’
  • **datetimeoffset(4): ‘2013-01-31T23:59:59.9999-05:00’
  • **datetimeoffset(5): ‘2013-01-31T23:59:59.99999-05:00’
  • **datetimeoffset(6): ‘2013-01-31T23:59:59.999999-05:00’
  • **datetimeoffset(7): ‘2013-01-31T23:59:59.9999999-05:00’

*Importantly, one needs to specify 997 as the milliseconds value for datetime because datetime values are rounded to increments of .000, .003 or .007 seconds, so a milliseconds value of 999 will be rounded up to the next exact date (e.g. ‘2013-02-01T00:00:00.000’).

**The actual offset for datetimeoffset partitioning depends on your environment but is typically the same for each boundary.
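The datetime rounding behavior is easy to verify. The following sketch casts three literals that differ only in milliseconds; the column aliases are my own.

```sql
--datetime is rounded to increments of .000, .003 or .007 seconds
SELECT
       CAST('2013-01-31T23:59:59.997' AS datetime) AS Ms997 --stays 2013-01-31 23:59:59.997
       ,CAST('2013-01-31T23:59:59.998' AS datetime) AS Ms998 --rounds down to 2013-01-31 23:59:59.997
       ,CAST('2013-01-31T23:59:59.999' AS datetime) AS Ms999; --rounds up to 2013-02-01 00:00:00.000
```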

Rather than changing the boundary specification, I suggest instead changing the function from LEFT to RIGHT and continuing to use exact date boundaries.  The inclusive date boundary is more intuitive (in my opinion) and allows the same date boundaries to be used regardless of data type.  The same amount of work is required for a mass change of existing boundaries as is required to change from RANGE LEFT to RANGE RIGHT.

The script below converts a table partitioned using a RANGE LEFT function with exact date boundaries to one partitioned RANGE RIGHT as follows:

  1. create a staging table exactly like the source table and partitioned using the same scheme
  2. create a non-partitioned holding table with the same schema as the source but without constraints and indexes
  3. create a new RANGE RIGHT function with the same exact date boundaries as the existing RANGE LEFT function
  4. create a new partition scheme with the same filegroups as the source table partition scheme but based on the new RANGE RIGHT function
  5. create a new table exactly like the source but partitioned using the new RANGE RIGHT scheme
  6. for each existing partition
    1. switch source partition to staging table
    2. move rows with exact boundary matches to holding table
    3. create check constraint on staging table matching the target right boundaries
    4. switch staging partition to new table
    5. drop check constraint
  7. after all partitions are switched to the new table, insert exact boundary matching rows to new table
  8. drop the old table and staging tables
  9. rename the new table, constraints and indexes to the old names

Since this script uses SWITCH to accomplish the bulk of the work, the table must be aligned; data and indexes must be partitioned using the same (or functionally identical) partition function.  The SWITCH operations are nearly instantaneous and the only data movement required is moving rows that exactly match the boundary values to/from the holding table, which shouldn’t take long unless you have a lot of matching rows.  The long pole in the tent from a performance perspective is likely creating the needed check constraint on the staging table because all rows must be scanned.  However, this method is still considerably faster and requires less space than repartitioning the table by recreating indexes using the new partition scheme.
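Before running a conversion like this, it may be worth confirming the table is aligned. The catalog query below is a rough sketch that lists the data space and partition function of each index on dbo.OldTable; it assumes that table name and does not verify the partitioning column itself.

```sql
--list the data space and partition function for the heap/clustered index and each non-clustered index
SELECT
       i.index_id
       ,i.name AS IndexName
       ,ds.name AS DataSpaceName
       ,ds.type_desc AS DataSpaceType
       ,pf.name AS PartitionFunctionName
FROM sys.indexes AS i
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
LEFT JOIN sys.partition_schemes AS ps ON ps.data_space_id = i.data_space_id
LEFT JOIN sys.partition_functions AS pf ON pf.function_id = ps.function_id
WHERE i.object_id = OBJECT_ID(N'dbo.OldTable')
ORDER BY i.index_id;
```

Any row showing a different partition function, or a DataSpaceType other than PARTITION_SCHEME, indicates a non-aligned index that would need to be addressed first.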

Here’s the setup script for the problem table and sample data:

--existing RANGE LEFT function with exact date boundaries
CREATE PARTITION FUNCTION PF_Left(datetime)
       AS RANGE LEFT FOR VALUES('20130101','20130201');
--existing scheme based on RANGE LEFT
CREATE PARTITION SCHEME PS_Left
       AS PARTITION PF_Left ALL TO ([DEFAULT]);
--existing table, constraint and index partitioned with RANGE LEFT scheme
CREATE TABLE dbo.OldTable(
       PartitioningColumn datetime NOT NULL
       ,DataColumn int NOT NULL
       ) ON PS_Left(PartitioningColumn);
ALTER TABLE dbo.OldTable
	ADD CONSTRAINT PK_OldTable PRIMARY KEY CLUSTERED (PartitioningColumn, DataColumn) 
	ON PS_Left(PartitioningColumn);
CREATE INDEX idx_OldTable_DataColumn ON dbo.OldTable(DataColumn) 
	ON PS_Left(PartitioningColumn);
--sample data
INSERT INTO dbo.OldTable VALUES
       ('20121231',1)
       ,('20130101',1)
       ,('20130102',2)
       ,('20130103',5)
       ,('20130131',2)
       ,('20130201',4)
       ,('20130202',3);
GO

And the remediation script to change to RANGE RIGHT:

--new RANGE RIGHT function with same boundaries as existing RANGE LEFT function
CREATE PARTITION FUNCTION PF_Right(datetime)
       AS RANGE RIGHT FOR VALUES('20130101','20130201');

--new scheme based on RANGE RIGHT with same filegroups as RANGE LEFT scheme
CREATE PARTITION SCHEME PS_Right
       AS PARTITION PF_Right ALL TO ([DEFAULT]);

--new staging table exactly like existing source table
CREATE TABLE dbo.OldTableStaging(
       PartitioningColumn datetime NOT NULL
	   ,DataColumn int NOT NULL
       ) ON PS_Left(PartitioningColumn);
ALTER TABLE dbo.OldTableStaging
	ADD CONSTRAINT PK_OldTableStaging PRIMARY KEY CLUSTERED (PartitioningColumn, DataColumn) 
	ON PS_Left(PartitioningColumn);
CREATE INDEX idx_OldTableStaging_DataColumn ON dbo.OldTableStaging(DataColumn) 
	ON PS_Left(PartitioningColumn);

--new holding table for exact boundary matches (no constraints or indexes needed)
CREATE TABLE dbo.OldTableStagingExactBoundaryMatches(
       PartitioningColumn datetime NOT NULL
	   ,DataColumn int NOT NULL
       ) ON [DEFAULT];

--new table exactly like existing table but partitioned with RANGE RIGHT scheme
CREATE TABLE dbo.NewTable(
       PartitioningColumn datetime NOT NULL
	   ,DataColumn int NOT NULL
       ) ON PS_Right(PartitioningColumn);
ALTER TABLE dbo.NewTable
	ADD CONSTRAINT PK_NewTable PRIMARY KEY CLUSTERED (PartitioningColumn, DataColumn) 
	ON PS_Right(PartitioningColumn);
CREATE INDEX idx_NewTable_DataColumn ON dbo.NewTable(DataColumn) 
	ON PS_Right(PartitioningColumn);
GO

--script to convert RANGE LEFT to RANGE RIGHT
SET NOCOUNT ON;
DECLARE 
       @LowerBoundary datetime
       ,@UpperBoundary datetime
       ,@PartitionNumber int
       ,@Sql nvarchar(MAX)
       ,@PartitionFunctionName sysname = N'PF_Right';

DECLARE PartitionBoundaries CURSOR LOCAL FAST_FORWARD FOR
WITH 
       PartitionNumbers AS (
              SELECT 1 AS PartitionNumber
              UNION ALL
              SELECT boundary.boundary_id + 1
              FROM sys.partition_functions AS pf
              JOIN sys.partition_range_values AS boundary ON pf.function_id = boundary.function_id
              WHERE pf.name = @PartitionFunctionName
              )
       ,PartitionBoundaries AS (
              SELECT
                     (SELECT CAST(boundary.value AS datetime)
                     FROM sys.partition_functions AS pf
                     JOIN sys.partition_range_values AS boundary ON pf.function_id = boundary.function_id
                     WHERE
                           pf.name = @PartitionFunctionName
                           AND boundary.boundary_id = PartitionNumbers.PartitionNumber - 1
                     ) AS LowerBoundary
                     ,(SELECT CAST(boundary.value AS datetime)
                     FROM sys.partition_functions AS pf
                     JOIN sys.partition_range_values AS boundary ON pf.function_id = boundary.function_id
                     WHERE
                           pf.name = @PartitionFunctionName
                           AND boundary.boundary_id = PartitionNumbers.PartitionNumber
                     ) AS UpperBoundary
                     ,PartitionNumbers.PartitionNumber
              FROM PartitionNumbers
              )
SELECT LowerBoundary, UpperBoundary, PartitionNumber
FROM PartitionBoundaries
ORDER BY PartitionNumber;

OPEN PartitionBoundaries;
WHILE 1 = 1
BEGIN
       --get next partition boundary
       FETCH NEXT FROM PartitionBoundaries INTO @LowerBoundary, @UpperBoundary, @PartitionNumber;
       IF @@FETCH_STATUS = -1 BREAK;

       --move partition from partitioned left table to staging
       RAISERROR('Switching out partition %d', 0, 1, @PartitionNumber) WITH NOWAIT;
       ALTER TABLE dbo.OldTable
              SWITCH PARTITION @PartitionNumber TO dbo.OldTableStaging PARTITION @PartitionNumber;

       --move exact boundary matches to holding table
       DELETE FROM dbo.OldTableStaging WITH(TABLOCKX)
       OUTPUT deleted.* INTO dbo.OldTableStagingExactBoundaryMatches
       WHERE PartitioningColumn = @UpperBoundary;
       RAISERROR('Moved %d exact boundary matches to holding table', 0, 1, @@ROWCOUNT) WITH NOWAIT;

       --create check constraint to match RANGE RIGHT boundary
       SET @Sql = N'ALTER TABLE dbo.OldTableStaging
              ADD CONSTRAINT CK_TestStaging CHECK (';
       IF @LowerBoundary IS NOT NULL
       BEGIN
              SET @Sql = @Sql + N'PartitioningColumn >= ''' + CONVERT(char(8), @LowerBoundary,112) + N'''';
              IF @UpperBoundary IS NOT NULL
              BEGIN
                     SET @Sql = @Sql + N' AND ';
              END;
       END;
       IF @UpperBoundary IS NOT NULL
       BEGIN
              SET @Sql = @Sql + N'PartitioningColumn < ''' + CONVERT(char(8), @UpperBoundary,112) + N'''';
       END;
       SET @Sql = @Sql + ');';
       RAISERROR('Creating check constraint: %s', 0, 1, @Sql) WITH NOWAIT;
       EXEC sp_executesql @Sql;

       --move staging table partition to partitioned right table
       RAISERROR('Switching in partition %d', 0, 1, @PartitionNumber) WITH NOWAIT;
       ALTER TABLE dbo.OldTableStaging
              SWITCH PARTITION @PartitionNumber TO dbo.NewTable PARTITION @PartitionNumber;

       --remove staging table constraint
       RAISERROR('Dropping check constraint', 0, 1, @PartitionNumber) WITH NOWAIT;
       ALTER TABLE dbo.OldTableStaging
              DROP CONSTRAINT CK_TestStaging;

END;
CLOSE PartitionBoundaries;
DEALLOCATE PartitionBoundaries;

--insert exact boundary matches
INSERT INTO dbo.NewTable WITH (TABLOCKX) (PartitioningColumn, DataColumn) 
       SELECT PartitioningColumn, DataColumn FROM dbo.OldTableStagingExactBoundaryMatches WITH (TABLOCKX);
RAISERROR('Inserted %d exact boundary matches from holding table', 0, 1, @@ROWCOUNT) WITH NOWAIT;

IF NOT EXISTS(SELECT * FROM OldTable)
BEGIN
	--drop old table
	DROP TABLE dbo.OldTable;
	--rename table, constraints and indexes to same names as original table
	EXEC sp_rename 'dbo.NewTable.idx_NewTable_DataColumn', 'idx_OldTable_DataColumn', 'INDEX';
	EXEC sp_rename 'dbo.PK_NewTable', 'PK_OldTable';
	EXEC sp_rename 'dbo.NewTable', 'OldTable';
END;
--drop staging tables
DROP TABLE dbo.OldTableStaging;
DROP TABLE dbo.OldTableStagingExactBoundaryMatches;
GO
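After the remediation script completes, a query along these lines can confirm row placement. It assumes the script above ran to completion so that dbo.OldTable is now partitioned with PF_Right.

```sql
--show the partition each row now resides in under the RANGE RIGHT function
SELECT
       $PARTITION.PF_Right(PartitioningColumn) AS PartitionNumber
       ,PartitioningColumn
       ,DataColumn
FROM dbo.OldTable
ORDER BY PartitioningColumn, DataColumn;
```

With the sample data, the 20130101 row should appear in partition 2 along with the other January rows, and the 20130201 row in partition 3.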

Service Broker External Activator Example

The Service Broker External Activator (SBEA) allows one to launch any command-line program asynchronously using T-SQL.  I haven’t found many complete end-to-end examples on how to create an SBEA application so this article will demonstrate with a common use case: launching an SSIS package on demand with the DTEXEC command-line utility.  The package can be run on any machine with SBEA and SSIS installed, which allows you to run the package remotely on a different server if you so desire.

Service Broker External Activator Overview

The latest version of the tool, officially named Microsoft® SQL Server® Service Broker External Activator for Microsoft® SQL Server® 2012, is a free download (http://www.microsoft.com/en-us/download/details.aspx?id=29065).  Service Broker External Activator is also included in the SQL Server 2008 and SQL Server 2008 R2 Feature Packs.

SBEA runs as a Windows service and leverages the SQL Server Service Broker event notification infrastructure to run a command-line application upon receipt of an event notification message.  When an event notification message is received, it launches the command-line program associated with the queue identified in the notification event message.  There is some learning curve if you are not already familiar with Service Broker and event notification concepts but it is a worthwhile investment, not just for SBEA, but for any application that needs asynchronous or message-based processing.

SBEA is an alternative to other methods that can launch an external program via T-SQL, such as xp_cmdshell, a SQL Agent Job (sp_start_job) or a custom SQLCLR proc.  The xp_cmdshell extended stored procedure allows any ad-hoc command to be run on the database server, limited only by SQL Server service or proxy account permissions.  You need to be mindful of the security implications and take precautions to ensure it can be used only in the way intended.  Also, xp_cmdshell runs the launched program synchronously so the calling application must wait for the application to finish.  You could take special measures to emulate asynchronous behavior with xp_cmdshell but that’s a kludge at best.

A SQL Agent job allows you to run a command asynchronously but the limitation is that only one instance of a job can run at a time.  In contrast, SBEA allows one to run multiple application instances concurrently and does so automatically.  SBEA allows a configurable maximum number of application instances so that you can scale automatically to meet peak demands.

SQL Server 2012 introduces the SSISDB catalog, which provides stored procedures to run SSIS packages asynchronously via T-SQL.  However, packages must be stored in the SSISDB database in order to use this feature.  SBEA is an alternative if you are on a version earlier than SQL Server 2012, need to run packages not stored in the SSIS catalog, or need to run command-line programs besides DTEXEC.EXE.
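For comparison, launching a catalog-deployed package looks roughly like the sketch below. The folder, project and package names are placeholders, and it assumes the package has already been deployed to the SSISDB catalog.

```sql
--folder, project and package names below are placeholders
DECLARE @execution_id bigint;
EXEC SSISDB.catalog.create_execution
       @folder_name = N'MyFolder'
       ,@project_name = N'MyProject'
       ,@package_name = N'Package.dtsx'
       ,@use32bitruntime = 0
       ,@reference_id = NULL
       ,@execution_id = @execution_id OUTPUT;
--start_execution returns without waiting by default, so the package runs asynchronously
EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;
```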

SBEA Database Objects

As with all Service Broker applications, the minimal database objects needed are an initiator service and queue as well as a target service and queue.  These are used to pass messages to and from the target application (SSIS package in this case).  Additionally, the SBEA needs a notification service, notification queue and event notification so that it is notified when messages are waiting in the target queue for processing.

Stored procedures are a great way to encapsulate Service Broker functionality and implementation details so I use procs here as the application database interface.  Stored procs also simplify security since the only database permissions necessary are execute permissions on the procs.  Within the procs, I also log to a table to facilitate monitoring progress, troubleshooting and auditing.  The complete T-SQL script is included at the end of this article.

Below is a summary of the database objects I created for this example:

Service Broker Objects

  • BatchProcessInitiatorService:  SB service that requests package execution (requesting application)
  • BatchProcessInitiatorQueue:  SB queue for success or failure result messages (returned by package after execution)
  • BatchProcessTargetService:  SB service that executes the request (SSIS package)
  • BatchProcessTargetQueue:  SB queue for execution requests, with message containing parameters needed by SSIS package (or an empty message if no parameters are needed)
  • BatchProcessNotificationService:  SB service that is the target of the event notifications (SBEA service)
  • BatchProcessNotificationQueue:  SB queue for event notification messages
  • BatchProcessTargetNotification:  SB event notification that monitors the BatchProcessTargetQueue

Stored Procedures

  • dbo.usp_LaunchBatchProcess (called by invoking application): begins conversation to target service and sends message with parameters needed by package
  • dbo.usp_GetBatchProcessParameters (called by SSIS package at start): receives next message from target queue and logs to table
  • dbo.usp_CompleteBatchProcess (called by SSIS package at completion): returns result message to initiator using END CONVERSATION (or END CONVERSATION WITH ERROR)
  • dbo.usp_LogBatchProcessResult (initiator queue activated proc): receives result messages from initiator queue, logs results to table and ends the conversation

Table

  • dbo.BatchProcessLog: Each received message is logged to this table

SBEA Configuration File

Service Broker External Activator is configured using the xml configuration file (EAService.config) in the application Config folder (e.g. C:\Program Files\Service Broker\External Activator\Config\).  This configuration file specifies:

  • The name of the notification service
  • Connection string used to connect to the database
  • A list of one or more Service Broker services along with the command-line application and arguments to launch for each

Below is the example I use for this sample.  As a side note, SBEA currently supports only one notification service even though the parent element name (NotificationServiceList) implies one can have many.  However, I haven’t personally run into a case where this is an issue since, in a distributed environment, one can designate a single database for the SBEA messaging infrastructure.

<?xml version="1.0" encoding="utf-8"?>
<Activator xmlns="http://schemas.microsoft.com/sqlserver/2008/10/servicebroker/externalactivator"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://schemas.microsoft.com/sqlserver/2008/10/servicebroker/externalactivator EAServiceConfig.xsd">
  <NotificationServiceList>
    <NotificationService name="BatchProcessNotificationService" id="100" enabled="true">
      <Description>My test notification service</Description>
      <ConnectionString>
        <!-- All connection string parameters except User Id and Password should be specified here -->
        <Unencrypted>server=myserver;database=SBEA_Example;Application Name=External Activator;Integrated Security=true;</Unencrypted>
      </ConnectionString>
    </NotificationService>
  </NotificationServiceList>
  <ApplicationServiceList>
    <ApplicationService name="BatchProcessTargetService" enabled="true">
      <OnNotification>
        <ServerName>MyServer</ServerName>
        <DatabaseName>SBEA_Example</DatabaseName>
        <SchemaName>dbo</SchemaName>
        <QueueName>BatchProcessTargetQueue</QueueName>
      </OnNotification>
      <LaunchInfo>
        <ImagePath>C:\Program Files\Microsoft SQL Server\110\DTS\Binn\DTEXEC.EXE</ImagePath>
        <CmdLineArgs>/F C:\Packages\Package.dtsx</CmdLineArgs>
        <WorkDir>c:\temp</WorkDir>
      </LaunchInfo>
      <Concurrency min="1" max="1" />
    </ApplicationService>
  </ApplicationServiceList>
  <LogSettings>
    <LogFilter>
      <TraceFlag>All Levels</TraceFlag>
      <TraceFlag>All Modules</TraceFlag>
      <TraceFlag>All Entities</TraceFlag>
      <TraceFlag>Verbose</TraceFlag>
      <!--
      http://schemas.microsoft.com/sqlserver/2008/10/servicebroker/externalactivator/EAServiceConfig.xsd
      <xs:enumeration value="Error"/>
      <xs:enumeration value="Warning"/>
      <xs:enumeration value="Information"/>
      <xs:enumeration value="Verbose"/>
      <xs:enumeration value="All Levels"/>
      <xs:enumeration value="Application Monitor"/>
      <xs:enumeration value="Configuration Manager"/>
      <xs:enumeration value="Global"/>
      <xs:enumeration value="Recovery Log"/>
      <xs:enumeration value="Imported Methods"/>
      <xs:enumeration value="Notification Service"/>
      <xs:enumeration value="Service Broker"/>
      <xs:enumeration value="All Modules"/>
      <xs:enumeration value="Asynchronous Updates"/>
      <xs:enumeration value="Threads"/>
      <xs:enumeration value="All Entities"/>
      -->
    </LogFilter>
  </LogSettings>
</Activator>

Service Broker External Activator Processing

When the SBEA service starts, it validates the application service list against Service Broker meta-data to ensure the specified services and queues exist and are properly associated.  Then SBEA starts monitoring the notification queue associated with the specified notification service.  When a queue notification event message is received, it matches the server, database, schema and queue contained in the message payload against the application services listed.  When matched, it launches the command-line application associated with that service.  A warning message is logged if a queue notification message is received that doesn’t have a corresponding application.

SBEA logs trace messages to a file named EATrace.log in the Log folder (C:\Program Files\Service Broker\External Activator\Log).  Logging levels are configurable.  I suggest you start with verbose logging initially (as in the sample config) and adjust to less logging once your application is stable.  I included the config schema reference and various logging options as comments in the sample config for your convenience.  Be aware that trace messages are appended to the file so you’ll want to periodically delete (or archive) the file.

SBEA Application Design and Message Flow

There are many different patterns one can generally employ with Service Broker bi-directional message processing.  Here, I use a single-message-per-conversation pattern to pass parameters needed by the package as well-formed XML.  The target (SSIS package) does its work and ends its side of the conversation upon completion, indicating success or failure on the END CONVERSATION statement, which generates an EndDialog or Error message that is returned via the initiator queue.

Service Broker automatically executes the initiator queue activated proc when a message is waiting on the queue.  Acting on behalf of the initiating application, which might not be running due to asynchronous execution of the package, the activated proc ends the other side of the conversation upon receipt of the EndDialog or Error message.  Both sides of the conversation are ended, a best practice with this single-message-per-conversation pattern.

The following sequence describes the interaction and flow of messages between components:

  1. Initiating application executes dbo.usp_LaunchBatchProcess to launch console application.  Proc executes BEGIN DIALOG and SEND to insert the parameter message into the target queue.
  2. SBEA executes RECEIVE of the event notification message, indicating a message is waiting in the target queue
  3. SBEA launches the console application associated with the event notification
  4. The console application executes dbo.usp_GetBatchProcessParameters to get parameter values.  Proc executes RECEIVE of message from target queue and logs the message.
  5. The console application executes dbo.usp_CompleteBatchProcess at completion.  Proc executes END CONVERSATION to signal successful completion (or END CONVERSATION WITH ERROR to indicate an error occurred).
  6. Service Broker launches activated stored procedure dbo.usp_LogBatchProcessResult in response to the EndDialog message.  Proc executes RECEIVE from initiator queue and logs the result to a table.

The diagram below shows this message flow.

[Diagram: SBEA application message flow]

Example Script

USE master;
GO

---------------------------------------------------
--- create database with Service Broker enabled ---
---------------------------------------------------
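--drop the example database only if it already exists
IF DB_ID(N'SBEA_Example') IS NOT NULL
BEGIN
ALTER DATABASE SBEA_Example
SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE SBEA_Example;
END;
GO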

CREATE DATABASE SBEA_Example;
ALTER AUTHORIZATION ON DATABASE::SBEA_Example TO sa;
GO

ALTER DATABASE SBEA_Example
SET ENABLE_BROKER;
GO

-------------------------------
--- create database objects ---
-------------------------------
USE SBEA_Example;
GO

--log table
CREATE TABLE dbo.BatchProcessLog(
ConversationHandle uniqueidentifier NOT NULL
,MessageTypeName sysname NOT NULL
,MessageBody varbinary(MAX) NULL
,LogTime datetime2(3) NOT NULL
CONSTRAINT DF_ServiceBrokerLog_LogTime
DEFAULT (SYSDATETIME())
);
CREATE CLUSTERED INDEX cdx_BatchProcessLog ON dbo.BatchProcessLog(LogTime);
GO

CREATE PROC dbo.usp_LogBatchProcessResult
---------------------------------------------
--initiator queue activated proc to process messages
---------------------------------------------
AS
DECLARE
@conversation_handle uniqueidentifier
,@message_type_name sysname
,@message_body varbinary(MAX);
WHILE 1 = 1
BEGIN
WAITFOR (
RECEIVE TOP (1)
@conversation_handle = conversation_handle
,@message_type_name = message_type_name
,@message_body = message_body
FROM dbo.BatchProcessInitiatorQueue
), TIMEOUT 1000;
IF @@ROWCOUNT = 0
BEGIN
--exit when no more messages
RETURN;
END;

--log message
INSERT INTO dbo.BatchProcessLog(
ConversationHandle
,MessageTypeName
,MessageBody
)
VALUES(
@conversation_handle
,@message_type_name
,@message_body
);
END CONVERSATION @conversation_handle;
END;
GO

CREATE PROC dbo.usp_LaunchBatchProcess
@Parameter1 int
---------------------------------------------
--called by application to trigger batch process
--Sample Usage:
--
-- EXEC dbo.usp_LaunchBatchProcess @Parameter1 = 1;
---------------------------------------------
AS
DECLARE
@conversation_handle uniqueidentifier
,@message_body varbinary(MAX);

BEGIN TRY

BEGIN TRAN;

BEGIN DIALOG CONVERSATION @conversation_handle
FROM SERVICE BatchProcessInitiatorService
TO SERVICE 'BatchProcessTargetService'
ON CONTRACT [DEFAULT]
WITH
ENCRYPTION = OFF,
LIFETIME = 6000;

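--build the XML parameter message expected by dbo.usp_GetBatchProcessParameters
SET @message_body = CAST(
N'<Parameters><Parameter1>'
+ CAST(@Parameter1 AS nvarchar(11))
+ N'</Parameter1></Parameters>' AS varbinary(MAX));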

SEND ON CONVERSATION @conversation_handle (@message_body);

COMMIT;
END TRY
BEGIN CATCH
--roll back the open transaction before re-raising the error
IF @@TRANCOUNT > 0 ROLLBACK;
THROW;
END CATCH;

RETURN @@ERROR;
GO

CREATE PROC dbo.usp_GetBatchProcessParameters
--------------------------------------
--called by batch package at start ---
--------------------------------------
AS
DECLARE
@conversation_handle uniqueidentifier
,@message_body xml
,@message_type_name sysname
,@parameter1 int;

BEGIN TRY

BEGIN TRAN;

RECEIVE TOP(1)
@conversation_handle = conversation_handle
,@message_type_name = message_type_name
,@message_body = message_body
FROM dbo.BatchProcessTargetQueue;

IF @@ROWCOUNT = 0
BEGIN
RAISERROR ('No messages received from dbo.BatchProcessTargetQueue', 16, 1);
RETURN 1;
END;

INSERT INTO dbo.BatchProcessLog(
ConversationHandle
,MessageTypeName
,MessageBody
)
VALUES(
@conversation_handle
,@message_type_name
,CAST(@message_body AS varbinary(MAX))
);

SET @parameter1 = @message_body.query('/Parameters/Parameter1').value('.', 'int');

COMMIT;

SELECT
@conversation_handle AS ConversationHandle
,@parameter1 AS Parameter1;

END TRY
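BEGIN CATCH
--roll back the open transaction (returns the message to the queue) before re-raising the error
IF @@TRANCOUNT > 0 ROLLBACK;
THROW;
END CATCH;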

RETURN @@ERROR;
GO

CREATE PROC dbo.usp_CompleteBatchProcess
@ConversationHandle uniqueidentifier
,@ErrorMessage nvarchar(3000) = NULL
------------------------------------------
-- called by SSIS package at completion
-- Sample Usage:

-- normal completion:
-- EXEC dbo.usp_CompleteBatchProcess
-- @ConversationHandle = '00000000-0000-0000-0000-000000000000';

-- completed with error:
-- EXEC dbo.usp_CompleteBatchProcess
-- @ConversationHandle = '00000000-0000-0000-0000-000000000000'
-- ,@ErrorMessage = 'an error occurred';
------------------------------------------
AS

IF @ErrorMessage IS NULL
BEGIN
END CONVERSATION @ConversationHandle;
END
ELSE
BEGIN
END CONVERSATION @ConversationHandle
WITH ERROR = 1
DESCRIPTION = @ErrorMessage;
END;

RETURN @@ERROR;
GO

--initiator queue with activated proc to process batch completed messages
CREATE QUEUE dbo.BatchProcessInitiatorQueue
WITH STATUS = ON,
ACTIVATION (
PROCEDURE_NAME = dbo.usp_LogBatchProcessResult,
MAX_QUEUE_READERS = 1,
EXECUTE AS SELF );
GO

--initiator service that triggers batch process
CREATE SERVICE BatchProcessInitiatorService
ON QUEUE dbo.BatchProcessInitiatorQueue
([DEFAULT]);
GO

--queue for event notifications
CREATE QUEUE dbo.BatchProcessNotificationQueue;
GO

--service for event notifications
CREATE SERVICE BatchProcessNotificationService
ON QUEUE dbo.BatchProcessNotificationQueue
(
[http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]
);
GO

--target queue for batch process parameters
CREATE QUEUE dbo.BatchProcessTargetQueue;
GO

--target service for batch process parameters
CREATE SERVICE BatchProcessTargetService
ON QUEUE dbo.BatchProcessTargetQueue
([DEFAULT]);
GO

--event notification for target queue
CREATE EVENT NOTIFICATION BatchProcessTargetNotification
ON QUEUE dbo.BatchProcessTargetQueue
FOR QUEUE_ACTIVATION
TO SERVICE 'BatchProcessNotificationService' , 'current database';
GO
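Once the objects are created, a quick smoke test might look like the sketch below. It assumes the script above ran successfully; the queries only peek at the queues and log table, so they are safe to run whether or not the SBEA service is running.

```sql
USE SBEA_Example;
GO
--queue a request; SBEA (if running and configured) launches the command-line application
EXEC dbo.usp_LaunchBatchProcess @Parameter1 = 1;
GO
--peek at pending requests without removing them (SELECT against a queue is read-only)
SELECT conversation_handle, message_type_name, CAST(message_body AS nvarchar(MAX)) AS MessageBody
FROM dbo.BatchProcessTargetQueue;
--count event notification messages not yet received by SBEA
SELECT COUNT(*) AS PendingNotifications
FROM dbo.BatchProcessNotificationQueue;
--review messages logged by the procs
SELECT ConversationHandle, MessageTypeName, LogTime
FROM dbo.BatchProcessLog
ORDER BY LogTime;
```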

Summary

The Service Broker External Activator is a powerful way to launch command-line programs asynchronously.  There is a learning curve, which can be steep if you know nothing about the SQL Server Service Broker feature.  But I hope this doesn’t discourage you from exploring and using SBEA.  It may be the right tool for the job in cases where you need asynchronous batch processing, avoiding custom programming or kludge design.  It’s always a good thing to have another tool in your toolbox.

Secrets of Foreign Key Index Binding

You might be surprised to learn that foreign keys bind to physical indexes when they are created. Furthermore, a foreign key does not necessarily bind to the primary key index of the referenced table; SQL Server allows a foreign key to refer to any column(s) that are guaranteed to be unique as enforced by a primary key constraint, unique constraint or unique index.

In this post, I’ll discuss the undocumented rules SQL Server uses to bind foreign key constraints to referenced table indexes so that you can achieve performance goals and protect yourself against unexpected errors in DDL modification scripts.

Background

Typically, one references the primary key in foreign key relationships. I’ve seen a foreign key (deliberately) reference columns other than the primary key only a couple of times in my career. The foreign key referenced an alternate key with a unique constraint in those cases. Why one would create such a relationship is an exercise for the reader. I’ll focus on the primary key here, although the same considerations apply to foreign keys referencing alternate keys.

As I mentioned earlier, SQL Server binds a foreign key to a physical unique index. This binding has performance implications because it determines the index SQL Server uses to enforce referential integrity as child table rows are inserted or updated. Also, SQL Server will not allow the index bound to a foreign key to be dropped since that could allow duplicate rows in the parent table and thus break the unique side of the relationship. This must be considered when developing scripts that drop unique indexes (including primary key and unique constraints) that may be bound to foreign keys.

A foreign key referencing the primary key will always be bound to the primary key index when that is the only unique index on the referenced column(s). However, you might have additional unique indexes on the primary key column(s) for performance reasons. For example, consider the case of a clustered primary key. Performance of a frequently executed query may be improved with a covering non-clustered index:

--create parent table
CREATE TABLE dbo.ParentTable(
       ParentTableID int NOT NULL IDENTITY
              CONSTRAINT PK_ParentTable PRIMARY KEY CLUSTERED
       ,Column1 int NOT NULL
       ,Column2 varchar(100) NOT NULL
       );
GO

--create a non-clustered covering index
CREATE UNIQUE NONCLUSTERED INDEX idx_ParentTable_ParentTableID
       ON dbo.ParentTable(ParentTableID) INCLUDE(Column1);
GO

INSERT INTO dbo.ParentTable VALUES(1, 'some data');
INSERT INTO dbo.ParentTable VALUES(2, 'some data');
INSERT INTO dbo.ParentTable VALUES(3, 'some data');
GO

--create child table
CREATE TABLE dbo.ChildTable(
       ChildTableID int NOT NULL IDENTITY
              CONSTRAINT PK_ChildTable PRIMARY KEY CLUSTERED
       ,ParentTableID int NOT NULL
              CONSTRAINT FK_ChildTable_ParentTable
                     FOREIGN KEY REFERENCES dbo.ParentTable(ParentTableID)
       );
GO

INSERT INTO dbo.ChildTable VALUES(1);
INSERT INTO dbo.ChildTable VALUES(1);
INSERT INTO dbo.ChildTable VALUES(1);
INSERT INTO dbo.ChildTable VALUES(1);
INSERT INTO dbo.ChildTable VALUES(2);
INSERT INTO dbo.ChildTable VALUES(2);
INSERT INTO dbo.ChildTable VALUES(2);
INSERT INTO dbo.ChildTable VALUES(2);
INSERT INTO dbo.ChildTable VALUES(3);
INSERT INTO dbo.ChildTable VALUES(3);
INSERT INTO dbo.ChildTable VALUES(3);
INSERT INTO dbo.ChildTable VALUES(3);
GO

UPDATE STATISTICS dbo.ParentTable;
UPDATE STATISTICS dbo.ChildTable;
GO

--show the foreign key index binding
SELECT
    fki.name
FROM sys.foreign_keys AS f
JOIN sys.indexes AS fki ON
      fki.object_id = f.referenced_object_id
      AND fki.index_id = f.key_index_id
WHERE
      f.object_id = OBJECT_ID(N'dbo.FK_ChildTable_ParentTable');
GO

--this query uses the covering index instead of clustered PK index
SELECT p.ParentTableID, p.Column1
FROM dbo.ParentTable AS p
WHERE p.ParentTableID IN(1,2,3);
GO

The SELECT query in the above script uses the covering idx_ParentTable_ParentTableID index. While this is good for performance, it introduces ambiguity regarding index binding to the foreign key. Again, any primary key constraint, unique constraint or unique index on the referenced column(s) may be referenced by a foreign key. With two candidate unique indexes (PK_ParentTable and idx_ParentTable_ParentTableID), you have little control over which index is bound to the foreign key.

SQL Server chooses the index binding based on rules that vary by version, so you will get a different binding depending on your version of SQL Server. SQL Server 2005 chooses the clustered index when possible and, if no suitable clustered index exists, the first (lowest index_id) unique non-clustered index on the referenced column(s) is used. The sample script above binds the foreign key to the PK_ParentTable index under SQL Server 2005 because it is the clustered index, not because it is the primary key.

In later versions (SQL 2008, SQL 2008R2 and SQL 2012), the foreign key is bound to the unique non-clustered index on the referenced column(s) with the lowest index_id when possible. Only when no suitable unique non-clustered index exists is the unique clustered index chosen. So the foreign key in the above script is bound to idx_ParentTable_ParentTableID in SQL 2008 and later versions instead of the primary key index as one might expect.
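To see which rule applied on your own instance, a variation of the earlier binding query can report the index type alongside the name. This is only a sketch against the sample tables above; the column aliases are my own choice.

```sql
--show the name and type of the index bound to each foreign key
--that references dbo.ParentTable, along with the server version
SELECT
     f.name AS foreign_key_name
    ,i.name AS bound_index_name
    ,i.index_id
    ,i.type_desc
    ,i.is_primary_key
    ,SERVERPROPERTY('ProductVersion') AS product_version
FROM sys.foreign_keys AS f
JOIN sys.indexes AS i ON
      i.object_id = f.referenced_object_id
      AND i.index_id = f.key_index_id
WHERE
      f.referenced_object_id = OBJECT_ID(N'dbo.ParentTable');
GO
```

A type_desc of NONCLUSTERED with is_primary_key of 0 indicates the foreign key was bound to the covering index instead of the primary key index.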

Why Foreign Key Index Binding is Important

There are two reasons why it is important to control the index bound to a foreign key. One is performance. As I mentioned earlier, the index bound to the foreign key constraint is used at execution time to enforce the constraint as child table rows are inserted or the foreign key column(s) updated. If the parent table is large and not queried often but rows are inserted into the child table heavily, a unique non-clustered index that “covers” the referential integrity check may be more desirable than the clustered index. This can improve buffer efficiency and page life expectancy compared to using a clustered index (e.g. primary key). My assumption is that this is why SQL Server 2008 and later versions prefer the unique non-clustered index over the clustered index for constraint enforcement.

Another reason one should control the index bound to the foreign key is to facilitate index changes. If you try to drop an index bound to a foreign key, you’ll get an error like “An explicit DROP INDEX is not allowed on index ‘dbo.ParentTable.idx_ParentTable_ParentTableID’. It is being used for FOREIGN KEY constraint enforcement.” You’ll need to drop the foreign key first and recreate it after dropping the index.
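A minimal sketch of that drop-and-recreate sequence, using the sample objects from the script above, might look like this. It assumes FK_ChildTable_ParentTable is the only foreign key bound to the index; WITH CHECK is my addition so that existing child rows are validated when the constraint is recreated.

```sql
--drop the foreign key bound to the index
ALTER TABLE dbo.ChildTable
       DROP CONSTRAINT FK_ChildTable_ParentTable;
GO

--the index can now be dropped
DROP INDEX idx_ParentTable_ParentTableID ON dbo.ParentTable;
GO

--recreate the foreign key, validating existing child rows
ALTER TABLE dbo.ChildTable WITH CHECK
       ADD CONSTRAINT FK_ChildTable_ParentTable
              FOREIGN KEY (ParentTableID)
              REFERENCES dbo.ParentTable(ParentTableID);
GO
```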

Since one can’t specify the bound foreign key index declaratively, the only guaranteed way to control the binding is to create the foreign key when only the desired unique index exists and create additional indexes afterward. This isn’t to say you can’t rely on the rules described earlier but you need to be aware that such rules vary depending on the SQL Server version and could change in the future.
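As a sketch of that ordering against the sample tables, the script below removes the foreign key and the additional unique index if they exist, creates the foreign key while PK_ParentTable is the only unique index, and only then recreates the covering index. The existence checks are my addition so the script can be rerun. The final query should return PK_ParentTable.

```sql
--remove the foreign key and the additional unique index, if present
IF OBJECT_ID(N'dbo.FK_ChildTable_ParentTable', 'F') IS NOT NULL
       ALTER TABLE dbo.ChildTable
              DROP CONSTRAINT FK_ChildTable_ParentTable;
IF EXISTS(SELECT * FROM sys.indexes
          WHERE object_id = OBJECT_ID(N'dbo.ParentTable')
                AND name = N'idx_ParentTable_ParentTableID')
       DROP INDEX idx_ParentTable_ParentTableID ON dbo.ParentTable;
GO

--create the foreign key while PK_ParentTable is the only unique index
ALTER TABLE dbo.ChildTable WITH CHECK
       ADD CONSTRAINT FK_ChildTable_ParentTable
              FOREIGN KEY (ParentTableID)
              REFERENCES dbo.ParentTable(ParentTableID);
GO

--recreate the covering index after the foreign key is bound
CREATE UNIQUE NONCLUSTERED INDEX idx_ParentTable_ParentTableID
       ON dbo.ParentTable(ParentTableID) INCLUDE(Column1);
GO

--verify the foreign key index binding
SELECT
    fki.name
FROM sys.foreign_keys AS f
JOIN sys.indexes AS fki ON
      fki.object_id = f.referenced_object_id
      AND fki.index_id = f.key_index_id
WHERE
      f.object_id = OBJECT_ID(N'dbo.FK_ChildTable_ParentTable');
GO
```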