How to load your data faster and safer using Change Tracking in SQL Server
Antonio Abalos Castillo

Agenda
- Why faster data loads?
- What is Change Tracking?
- Design overview
- Demo/implementation
- Extra hints

Why faster data loads?
- Corporations load and replicate data in a variety of ways, and many of these approaches:
  - Become unreliable or miss data over time
  - Use unsupported ways to identify the increment of data
  - Are difficult to maintain
  - Are not optimal at identifying the updated data
  - Need extra programming effort
  - Do not follow standards

Why faster data loads?
Benefits of using this approach
- No programming overhead at the source
  - Avoids timestamps, row GUIDs or any other programming artifact
  - Change Tracking is transparent to applications
  - Maintenance cost is 0
- Very low performance impact on the source database
- Multiple target systems can get data from the same source DB using this approach
- We get just the latest version of each row, relative to our last synchronized state
  - All intermediate row states are skipped
  - Running the delta more often decreases the execution time
- MERGE is the fastest data loading method (SCD remains a bad example)
- Minimally logged operations will help performance (maybe more than you think)

What is Change Tracking?
- Change tracking is a lightweight solution that provides an efficient change tracking mechanism for applications
- Available since SQL Server 2008
- Available in all editions of SQL Server, including Express
- Lightweight: the incremental performance overhead that is associated with using change tracking on a table is similar to the overhead incurred when an index is created for a table and needs to be maintained
https://technet.microsoft.com/en-us/library/hh710064(v=sql.110).aspx
https://msdn.microsoft.com/en-us/library/bb933875(v=sql.110).aspx

What is Change Tracking?
- Each insert/update/delete in each tracked table is recorded with:
  - The primary key columns of the table
  - [optional] The columns that were updated
- Changes are accumulated and reported by SQL Server relative to the last version we retrieved
https://msdn.microsoft.com/en-us/library/hh710064(v=sql.110).aspx

What is Change Tracking?
Enable Change Tracking
- Database level
    ALTER DATABASE AdventureWorks2012
    SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
- For each tracked table
    ALTER TABLE dbo.SalesOrderDetail
    ENABLE CHANGE_TRACKING
    WITH (TRACK_COLUMNS_UPDATED = ON);

What is Change Tracking?
Get changes from Change Tracking
- Get current version
    DECLARE @ver BIGINT = CHANGE_TRACKING_CURRENT_VERSION();
- Get minimum valid version
    DECLARE @mvv BIGINT = CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Sales'));

What is Change Tracking?
Get changes from Change Tracking
- Get changes for one table
    DECLARE @last_ver BIGINT = 82;
    SELECT CT.SalesID, CT.SYS_CHANGE_OPERATION, CT.SYS_CHANGE_COLUMNS
    FROM CHANGETABLE(CHANGES dbo.Sales, @last_ver) AS CT;
https://technet.microsoft.com/en-us/library/cc280358(v=sql.105).aspx
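
CHANGETABLE returns only the key and the change metadata, so a delta query usually joins back to the tracked table to pick up the current column values. The following is a minimal sketch, not the presenter's implementation. It assumes dbo.Sales has the primary key SalesID and a column named Amount; Amount is a hypothetical name that does not appear in the slides.

```sql
-- Sketch only: dbo.Sales(SalesID, Amount) is an assumed schema; Amount is hypothetical.
DECLARE @last_ver BIGINT = 82;  -- version saved after the previous load

SELECT
    CT.SalesID,
    CT.SYS_CHANGE_OPERATION,           -- 'I' = insert, 'U' = update, 'D' = delete
    S.Amount,                          -- deleted keys have no source row, so this is NULL for them
    -- For update rows: 1 when Amount is among the updated columns
    -- (requires TRACK_COLUMNS_UPDATED = ON on the table)
    CHANGE_TRACKING_IS_COLUMN_IN_MASK(
        COLUMNPROPERTY(OBJECT_ID('dbo.Sales'), 'Amount', 'ColumnId'),
        CT.SYS_CHANGE_COLUMNS) AS AmountChanged
FROM CHANGETABLE(CHANGES dbo.Sales, @last_ver) AS CT
LEFT JOIN dbo.Sales AS S
    ON S.SalesID = CT.SalesID;         -- LEFT JOIN keeps deleted keys in the result
```

The LEFT JOIN is the important design choice: an INNER JOIN would silently drop the deleted keys, and the target would never learn about them.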

Design overview
- Source
  - Change Tracking enabled
  - Isolation aware
- ETL
  - Minimally logged operations
  - Automatic delta/full load detection
- Target
  - Staging area
  - MERGE delta over target data
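
One way to sketch the automatic delta/full load detection is to compare the version saved by the previous load with the minimum valid version of the table. This is an illustration under assumptions, not the code shown in the demo. The variable @last_ver stands for a value that the ETL process is assumed to persist between runs, and it is NULL on the first run.

```sql
-- Sketch only: @last_ver is assumed to be persisted by the ETL process between runs.
DECLARE @last_ver BIGINT = NULL;  -- hypothetical: NULL means no previous load
DECLARE @min_ver  BIGINT = CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Sales'));
DECLARE @cur_ver  BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

IF @last_ver IS NULL OR @min_ver IS NULL OR @last_ver < @min_ver
    -- First run, tracking not enabled on the table, or cleanup already removed
    -- changes newer than our last version: the delta is unreliable, reload everything.
    SELECT 'FULL' AS LoadType, @cur_ver AS NextVersion;
ELSE
    -- The saved version is still valid: CHANGETABLE(CHANGES ...) can be trusted.
    SELECT 'DELTA' AS LoadType, @cur_ver AS NextVersion;
```

NextVersion is what the load would save once it succeeds, so that the next run starts from there.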

Design overview
Requirements:
- SQL Server source database
- Change Tracking enabled
- Integration Services
- MERGE statements (SQL 2008+)
Other data sources:
- Change Data Capture (Oracle)
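
The MERGE step that applies the staged delta to the target could look roughly like the sketch below. The tables stg.SalesDelta and dbo.SalesTarget and the columns Amount and ChangeOperation are hypothetical names chosen for illustration. ChangeOperation is assumed to hold the SYS_CHANGE_OPERATION value captured at the source.

```sql
-- Sketch only: stg.SalesDelta(SalesID, Amount, ChangeOperation) and
-- dbo.SalesTarget(SalesID, Amount) are hypothetical tables.
MERGE dbo.SalesTarget AS T
USING stg.SalesDelta AS S
    ON T.SalesID = S.SalesID
WHEN MATCHED AND S.ChangeOperation = 'D' THEN
    DELETE                                   -- row was deleted at the source
WHEN MATCHED THEN
    UPDATE SET T.Amount = S.Amount           -- row exists: overwrite with the latest values
WHEN NOT MATCHED BY TARGET AND S.ChangeOperation <> 'D' THEN
    INSERT (SalesID, Amount)
    VALUES (S.SalesID, S.Amount);            -- new row; deletes never seen by the target are ignored
```

Because Change Tracking reports only the net change per key, the staging table holds at most one row per SalesID, which is what MERGE requires of its source.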

Demo
Demo scenario (Windows Azure VNET)
- Server A: SQL Server, source database, Change Tracking
- Server B: SQL Server, target database, logging
- SSIS data flow

Extra hints - Best practices
- Transaction isolation strategy
  - Enable SNAPSHOT isolation in the source database
  - Or create a source snapshot database
  - Index maintenance jobs can break big transactions at the source
- Watch out for complex data flows that may need to be broken down into simpler ones
  - Ideally each target table is a one-to-one copy of the source table, but this is not always possible
  - How do we deal with deleted rows? (joining tables)
  - Do we need to track changes in columns?
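
The SNAPSHOT isolation option can be sketched as follows: the next version number and the changes are read inside one snapshot transaction, so both reflect the same point in time. The database and table names are taken from the earlier slides, and the value of @last_ver is a placeholder.

```sql
-- Sketch only: names follow the earlier slides; adjust them to your source database.
ALTER DATABASE AdventureWorks2012 SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    DECLARE @last_ver BIGINT = 82;  -- placeholder: version saved by the previous load

    -- Version to save for the next run; consistent with the changes read below
    DECLARE @next_ver BIGINT = CHANGE_TRACKING_CURRENT_VERSION();

    SELECT CT.SalesID, CT.SYS_CHANGE_OPERATION
    FROM CHANGETABLE(CHANGES dbo.Sales, @last_ver) AS CT;
COMMIT TRANSACTION;
```

Without the snapshot transaction, rows changed between the two statements could be read now and then reported again by the next delta.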

Extra hints - Trick list
- Use trace flag 610 (carefully)
- Use the table lock (TABLOCK) option in the destination
- Use the ORDER hint in the destination
- Increase the Data Flow Task (DFT) buffer memory (DefaultBufferSize)
- Increase the DFT number of rows per buffer (DefaultBufferMaxRows)
- Run parallel tasks
The Data Loading Performance Guide
https://msdn.microsoft.com/en-us/library/dd425070.aspx
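
In SSIS the table lock is a destination option, but the same idea can be shown in plain T-SQL with a table hint. A hedged sketch follows; stg.SalesDelta and src.SalesChanges are hypothetical object names. Whether the insert is actually minimally logged depends on the recovery model and on the structure of the target table.

```sql
-- Sketch only: stg.SalesDelta (an empty staging heap) and src.SalesChanges are hypothetical.
-- With TABLOCK on a heap, under the SIMPLE or BULK_LOGGED recovery model,
-- INSERT ... SELECT can qualify for minimal logging.
INSERT INTO stg.SalesDelta WITH (TABLOCK) (SalesID, Amount, ChangeOperation)
SELECT SalesID, Amount, ChangeOperation
FROM src.SalesChanges;
```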

Extra hints - Other tricks
- Keep the databases in the SIMPLE recovery model
- Set the PAGE_VERIFY option (torn page detection) to NONE
- Create a DATA filegroup and set it as DEFAULT
- Create as many files as CPUs in each filegroup (depends on storage)
- Place the log file and the data files on different disks
- Consider using heaps for fast-load processes
- Consider using partitioned tables for regular tables
- Increase parallelism
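
Several of these settings are plain ALTER DATABASE statements. The sketch below uses a hypothetical target database named TargetDW and a hypothetical file path; both are placeholders, not values from the presentation.

```sql
-- Sketch only: TargetDW, the logical file name and the path are hypothetical placeholders.
ALTER DATABASE TargetDW SET RECOVERY SIMPLE;
ALTER DATABASE TargetDW SET PAGE_VERIFY NONE;   -- trades corruption detection for speed

-- Dedicated filegroup for user data, made the default for new objects
ALTER DATABASE TargetDW ADD FILEGROUP [DATA];
ALTER DATABASE TargetDW
    ADD FILE (NAME = N'TargetDW_data1',
              FILENAME = N'D:\Data\TargetDW_data1.ndf',
              SIZE = 1GB)
    TO FILEGROUP [DATA];
ALTER DATABASE TargetDW MODIFY FILEGROUP [DATA] DEFAULT;
```

Disabling page verification removes a safety net, so it is reasonable only for a staging or target database that can be rebuilt from the source.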

Extra hints - Security considerations
- Catalog views
  - sys.change_tracking_databases
  - sys.change_tracking_tables
- Permissions
  - SELECT permission on at least the primary key columns of the change-tracked table being queried
  - VIEW CHANGE TRACKING permission on the table for which changes are being obtained
https://msdn.microsoft.com/en-us/library/hh710064(v=sql.110).aspx
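
As an illustration, the catalog views can be queried to confirm what is being tracked, and the two permissions can be granted to a dedicated ETL account. The principal name etl_reader is hypothetical, and dbo.Sales with its key SalesID follows the earlier slides.

```sql
-- Which databases have Change Tracking enabled, and with what retention?
SELECT DB_NAME(database_id) AS database_name,
       is_auto_cleanup_on,
       retention_period,
       retention_period_units_desc
FROM sys.change_tracking_databases;

-- Which tables in the current database are tracked?
SELECT OBJECT_SCHEMA_NAME(object_id) AS schema_name,
       OBJECT_NAME(object_id)        AS table_name,
       is_track_columns_updated_on,
       min_valid_version
FROM sys.change_tracking_tables;

-- Sketch only: etl_reader is a hypothetical database user for the load process.
GRANT SELECT ON dbo.Sales (SalesID) TO etl_reader;       -- primary key column(s) at minimum
GRANT VIEW CHANGE TRACKING ON dbo.Sales TO etl_reader;
```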

Extra hints - Change Tracking vs. Change Data Capture

Feature                       Change data capture (CDC)   Change tracking (CT)
Tracked changes
  DML changes                 Yes                         Yes
Tracked information
  Historical data             Yes                         No
  Whether column was changed  Yes                         Yes
  DML type                    Yes                         Yes

- CDC: collects historical values, and therefore much more data than CT
- CT: tells you neither how many updates were made to a row nor which values were updated
https://technet.microsoft.com/en-us/library/cc280519(v=sql.105).aspx
https://msdn.microsoft.com/en-us/library/bb933994.aspx

Other references
- Brent Ozar’s guide to Change Tracking
  https://www.brentozar.com/archive/2014/06/performance-tuning-sql-server-change-tracking/
- Good guide for a data load implementation using Change Tracking
  https://www.timmitchell.net/post/2016/01/20/using-sql-server-change-tracking-for-incremental-loads