2010 Microsoft BI Conference


1 2010 Microsoft BI Conference

2 Performance Design Patterns
SESSION CODE: BIE13-INT
Matt Masson, Developer, SQL Server Integration Services, Microsoft Corporation
Need a fast data integration solution, but don't have the time or budget for heavy performance tuning? Come learn how to maximize your ROI by applying trusted design patterns to your Integration Services packages. We talk about how to set performance expectations, and how to put together a simple framework to record benchmarks for your ETL process. We go over the basics of smart package design, and then look at a number of design patterns for common data warehousing problems, such as Slowly Changing Dimension processing, Range Lookups, and Change Detection.

3 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

4 Data Flow Review
Buffers
Synchronous & Asynchronous components

5 General Performance Tips
Good package design is key
Use the appropriate tools
Use the Database Engine when you can
Take a scientific approach to performance tuning: Measure, Hypothesize, Modify, Re-Measure
Most projects don’t need extensive tuning
Use the right tools
Previous talk about general performance optimizations

6 Creating Benchmarks

7 Sample Report
Package Benchmark
Step             Time          KB / Sec  Rows / Sec
Baseline
DimCustomer      00:02:37:987       350     1,202.0
DimNation        00:00:00:263       121        95.0
DimPart          00:00:19:777      2557    10,113.0
DimPartSupplier  00:01:51:643      2121     7,165.0
DimRegion        00:00:00:417        76        12.0
DimSupplier      00:00:02:207      1218     4,533.0
Lineitem         00:13:56:847      1656     7,171.0
Orders           00:01:32:843      3697    16,156.0
                 00:18:15:190     1,475     5,805.9
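A minimal sketch of how throughput figures like the ones in the sample report could be derived, assuming durations are recorded as hh:mm:ss:ms strings and that row and byte counts are logged per step. The function and class names and the sample counts are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


def parse_duration(text: str) -> float:
    """Convert an 'hh:mm:ss:ms' string (as in the sample report) to seconds."""
    hours, minutes, seconds, millis = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


@dataclass
class Benchmark:
    step: str
    duration: str   # 'hh:mm:ss:ms'
    rows: int       # rows moved by the step (hypothetical input)
    kilobytes: int  # data volume in KB (hypothetical input)

    @property
    def rows_per_sec(self) -> float:
        return self.rows / parse_duration(self.duration)

    @property
    def kb_per_sec(self) -> float:
        return self.kilobytes / parse_duration(self.duration)


# Hypothetical measurement, not a figure from the report.
sample = Benchmark("DimExample", "00:00:10:000", rows=50_000, kilobytes=12_000)
print(f"{sample.step}: {sample.rows_per_sec:.1f} rows/sec, {sample.kb_per_sec:.1f} KB/sec")
```

Recording the same numbers after every change gives the Measure and Re-Measure steps something concrete to compare.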

8 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

9 Parallel Processing
Most important pattern today
Scale-out and Scale-up
SSIS was built with parallelism in mind
PATTERN

10 Just Add Hardware
Divide your ETL task into distinct Operations
Perform Operations in parallel
DimRegion DimNation DimCustomer DimPart DimSupplier DimPartSupplier
Divide your ETL work
Determine dependencies
Doing a full historic load where the surrogate keys are created outside of the ETL
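A rough sketch of "determine dependencies, then run independent operations in parallel", written in Python rather than as SSIS precedence constraints. The dependency map between the six dimension loads is an assumption made for illustration (the slide only lists the operation names), and `load` is a placeholder for the real work.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

# Assumed dependencies: operation -> operations it must wait for.
DEPENDENCIES = {
    "DimRegion": set(),
    "DimNation": {"DimRegion"},
    "DimCustomer": {"DimNation"},
    "DimPart": set(),
    "DimSupplier": {"DimNation"},
    "DimPartSupplier": {"DimPart", "DimSupplier"},
}


def load(operation: str) -> str:
    """Placeholder for the real ETL work of one operation."""
    return operation


def run_in_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Run every operation whose prerequisites are done, one wave at a time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Operations in the same wave do not depend on each other,
            # so they can safely run in parallel.
            finished = list(pool.map(load, ready))
            sorter.done(*finished)
            waves.append(finished)
    return waves


waves = run_in_waves(DEPENDENCIES)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Under the assumed dependencies, the six loads collapse into four waves, and the widest waves show where extra hardware can actually be used.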

11 Parallel Control Flow Tasks
Unconnected tasks are run in parallel
Different constraint types
AND vs OR

12 Here we have a fairly complicated data flow with a single source that branches into a couple of different directions. There’s only one real destination at the bottom here, but there are two other end points as well: an Export Column transform and an OLE DB Command. All of the green circles indicate synchronous transforms, and the red octagons mark places where buffers are created or where they end. This shows how the data will flow from the source and split on the Multicasts and Conditional Splits.
One thing I want to point out is that Multicast and Conditional Split are actually synchronous transforms: they don’t make copies of the buffers when they split the streams. SSIS uses some internal memory magic to create virtual buffers that map to the same memory. So even if you multicast a flow 20 times, you won’t see your memory usage increasing at all.
When the data flow is separated into execution trees, it looks like this. The source is in its own execution tree, and the rest start and end on the red octagons. In 2005, we would have had at most 5 threads active for this entire data flow. In some cases, this would be fine. For example, we never have more than one thread on a source component, and the execution tree at the bottom here with no green dots in it would be fine with one thread as well.
What benefits most from the new threading are execution trees that look like the one on the right, with the long series of synchronous transforms. In 2005 we’d move one buffer at a time through this entire series, so the next buffer wouldn’t start being processed until the first one finished. In 2008, multiple threads are available on each execution tree. So now one thread might move a buffer through the first three transforms on the right, and then a second thread could pick it up and move it through the rest, while the first thread grabs the next buffer.
In 2005, people would actually put in a Union All with a single input, which is asynchronous, just to split up the execution trees. Thankfully this type of tweaking is no longer needed in 2008.

13 Parallel Packages
Break ETL logic up into multiple packages
Downside to this approach? Need to design your parallelism upfront
MaxConcurrentExecutables
Customer lab – 128 threads
EPT (Execute Package Task) isn’t async

14 Work Pile Pattern
Create a (priority) queue for your packages
Run multiple package instances in parallel
Diagram: Work Pile (P1, P2, P3, P4, P5 ... Pn), Scheduler, Work Horses (DTExec (1), DTExec (2) ... DTExec (n)), Shared Resources
Scheduler logic
Key – distinct operations
Use staging to split up the data flow
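The Work Pile Pattern can be sketched as a shared priority queue drained by a fixed number of workers. This is a simplified, single-machine illustration using threads. The priorities, the package list, and the placeholder where a real work horse would launch a DTExec instance are all assumptions.

```python
import queue
import threading

# Hypothetical (priority, package) pairs; a lower number means "run sooner".
PACKAGES = [(2, "P3"), (1, "P1"), (3, "P5"), (1, "P2"), (2, "P4")]


def run_work_pile(packages, worker_count=3):
    """Drain a shared priority queue with a fixed number of workers."""
    pile = queue.PriorityQueue()
    for item in packages:
        pile.put(item)

    completed = []
    lock = threading.Lock()

    def work_horse(worker_id):
        while True:
            try:
                _priority, package = pile.get_nowait()
            except queue.Empty:
                return  # the pile is empty, so this worker is finished
            # Placeholder: a real work horse would launch the package here.
            with lock:
                completed.append((worker_id, package))

    workers = [
        threading.Thread(target=work_horse, args=(worker_id,))
        for worker_id in range(1, worker_count + 1)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return completed


results = run_work_pile(PACKAGES)
print(sorted(package for _worker, package in results))
```

Because workers pull from the pile instead of being assigned work up front, adding a worker (or a machine) needs no change to the scheduling logic.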

15 Large Telco Customer
Server Farm: Extract, Transform, Load
Loading 6-7 TB / day
64 machines in the farm, upgrading to 128
SAN bottleneck – 250 GB/hour
80 packages, 3 developers
Call data comes into towers
Processed by Unix switches
Flat files transferred to server farm
Each file goes through phases
Devs could do live debug

16 Review Best way to improve performance
Divide your solution into distinct operations
Use built-in parallelism where you can
MaxConcurrentExecutables
EngineThreads
Consider using the Work Pile Pattern for large scale-out solutions

17 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

18 Using the Database Engine
Choose the right tool for the job
PATTERN

19 SSIS is Awesome, but the SQL Engine is Awesomer
SSIS is an in-memory ETL tool
The database engine is optimized for sorting and joining
Stage data and avoid OLE DB Command when possible
Source Staging Destination
In-memory ETL
No pre-built indexes
Problems with OLE DB Command
General staging pattern
Batch Destination component

20 Use the SQL Engine
Use ORDER BY clause
Set IsSorted and SortKeyPosition
Use GROUP BY clause
COUNT() and other SQL Functions
Perform JOINs directly in the source query
Avoid SORT – Fully blocking
Parallelism example – used local staging just to do sort and join
Aggregate also fully blocking
MERGE JOIN can be unavoidable
Example – designing with a tool that generates SQL

21 MERGE
Traditional Upsert
Upsert using MERGE
Old upsert pattern in SQL 2005 using lookup
MERGE destination component
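To make the staging idea concrete, here is a small sketch of a set-based upsert from a staging table: one UPDATE for existing rows and one INSERT for new rows, instead of one statement per row. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere. The table names and data are hypothetical, and in SQL Server 2008 a single MERGE statement can cover both steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimRegion (RegionKey INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Staging   (RegionKey INTEGER PRIMARY KEY, Name TEXT);
    INSERT INTO DimRegion VALUES (1, 'Africa'), (2, 'Asia');
    INSERT INTO Staging   VALUES (2, 'Asia-Pacific'), (3, 'Europe');
""")

# Step 1: update every destination row that has a match in staging.
conn.execute("""
    UPDATE DimRegion
    SET Name = (SELECT s.Name FROM Staging AS s
                WHERE s.RegionKey = DimRegion.RegionKey)
    WHERE RegionKey IN (SELECT RegionKey FROM Staging)
""")

# Step 2: insert every staged row that has no match in the destination.
conn.execute("""
    INSERT INTO DimRegion (RegionKey, Name)
    SELECT s.RegionKey, s.Name
    FROM Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM DimRegion AS d
                      WHERE d.RegionKey = s.RegionKey)
""")
conn.commit()

rows = conn.execute(
    "SELECT RegionKey, Name FROM DimRegion ORDER BY RegionKey"
).fetchall()
print(rows)
```

Both statements operate on whole sets, which is the kind of work the database engine is optimized for.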

22 Review Take advantage of the database engine when you can
Staging is helpful
Batch Destination Transform makes things easier
Avoid Sort, Aggregate, and Joins (from the same source)
Use MERGE
MERGE Destination available on CodePlex

23 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

24 Change Data Capture
PATTERN

25 Mo’ Data, Mo’ Problems
Reduce the amount of data you’re processing
Filter out unchanged rows as close to the source as possible
Factors that affect the solution
Is my source a relational database?
Can I change my source schema?
Many patterns just move the work around
Reducing data size will give immediate performance gain

26 Traditional Approaches
Audit Columns
  Modified flag
  Checksum
Log Scraping
  Extract operations from Transaction Log
Timed Extracts
  LastModifiedDate column
Database Diff
  Comparing database snapshots
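As an illustration of the checksum approach, the sketch below hashes each source row and keeps only rows that are new or whose hash differs from the one stored during the previous extract. The column names, the choice of SHA-256, and the way checksums are stored are assumptions made for the example.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Hash all column values in a stable (sorted by column name) order."""
    payload = "|".join(f"{column}={row[column]}" for column in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_rows(source_rows, stored_checksums, key="CustomerKey"):
    """Return rows that are new or whose checksum differs from the stored one."""
    return [
        row for row in source_rows
        if stored_checksums.get(row[key]) != row_checksum(row)
    ]


# Hypothetical previous extract, used to build the stored checksums.
previous = [
    {"CustomerKey": 1, "Name": "Ann", "City": "Oslo"},
    {"CustomerKey": 2, "Name": "Bob", "City": "Rome"},
]
stored = {row["CustomerKey"]: row_checksum(row) for row in previous}

# Hypothetical current extract: key 1 unchanged, key 2 changed, key 3 new.
current = [
    {"CustomerKey": 1, "Name": "Ann", "City": "Oslo"},
    {"CustomerKey": 2, "Name": "Bob", "City": "Paris"},
    {"CustomerKey": 3, "Name": "Cy", "City": "Lima"},
]
changed = changed_rows(current, stored)
print([row["CustomerKey"] for row in changed])
```

Note that this still reads every source row. It reduces downstream work but, as the previous slide warns, partly just moves the work around.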

27 SQL Server Change Data Capture
API is a set of stored procedures
Enable it on a table without changing the schema
Captures change history in separate tables
Consume full delta, or individual changes
Consume windows, or all changes
CDC can adapt to new or removed columns

28 Using CDC from SSIS
Access CDC table or use CDC function in your choice of format
Conditional split on operation type
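The "conditional split on operation type" step can be sketched as routing each change row by its `__$operation` value, which SQL Server CDC change tables use to mark deletes (1), inserts (2), and the before (3) and after (4) images of an update. The sample rows and the output names are hypothetical.

```python
from collections import defaultdict

# __$operation codes used by SQL Server CDC change tables.
OPERATIONS = {
    1: "delete",
    2: "insert",
    3: "update_before",
    4: "update_after",
}


def split_by_operation(change_rows):
    """Route each change row to an output named after its operation type."""
    outputs = defaultdict(list)
    for row in change_rows:
        outputs[OPERATIONS[row["__$operation"]]].append(row)
    return dict(outputs)


# Hypothetical change rows, as they might be read from a change table.
changes = [
    {"__$operation": 2, "CustomerKey": 7, "City": "Oslo"},
    {"__$operation": 3, "CustomerKey": 5, "City": "Rome"},
    {"__$operation": 4, "CustomerKey": 5, "City": "Paris"},
    {"__$operation": 1, "CustomerKey": 2, "City": "Lima"},
]
outputs = split_by_operation(changes)
print({name: len(rows) for name, rows in outputs.items()})
```

Each output can then feed a different destination, for example inserts into a fast-load destination and updates into a staging table.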

29 Review
Less work means less processing time Alternatives Checksum Change Tracking Trigger based, synchronous processing Attunity CDC Solutions

30 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

31 Slowly Changing Dimensions
Story – Insurance company
PATTERN

32 Slowly Changing What Now?
SCD handling is key for most data warehouses
SSIS has built-in support for SCD
Easy to get up and running
Performance and scale problems in certain scenarios
Third party components
Build it yourself using existing components

33 Slowly Changing Dimension Wizard
Performance optimizations
Set RetainSameConnection to True
Enable Fast Load
Batch SCD 1 changes

34 Kimball Method SCD Component

35 Handling SCD with MERGE
Stage: Store data in a temporary table
Optimize: Provides additional performance benefits
Execute: Run the MERGE SQL statement in two steps

36 Handling SCD with Merge – Type 1
MERGE INTO [DimProduct] AS FACT
USING [Staging] AS SRC
ON ( FACT.ProductAlternateKey = SRC.ProductAlternateKey )
WHEN MATCHED AND FACT.EndDate IS NULL -- update the current record
THEN UPDATE SET
    FACT.[ArabicDescription] = SRC.ArabicDescription
    ,FACT.[ChineseDescription] = SRC.ChineseDescription
    ,FACT.[EnglishDescription] = SRC.EnglishDescription
    ,FACT.[FrenchDescription] = SRC.FrenchDescription
    ,FACT.[GermanDescription] = SRC.GermanDescription
    ,FACT.[HebrewDescription] = SRC.HebrewDescription
    ,FACT.[JapaneseDescription] = SRC.JapaneseDescription
    ,FACT.[ThaiDescription] = SRC.ThaiDescription
    ,FACT.[TurkishDescription] = SRC.TurkishDescription
    ,FACT.[ReorderPoint] = SRC.ReorderPoint
    ,FACT.[SafetyStockLevel] = SRC.SafetyStockLevel
;

37 Handling SCD with Merge – Type 2
INSERT INTO [DimProduct] ([ProductAlternateKey], [ListPrice], [EnglishDescription], [StartDate])
SELECT [ProductAlternateKey], [ListPrice], [EnglishDescription], [StartDate]
FROM (
    MERGE INTO [DimProduct] AS FACT
    USING [Staging] AS SRC
    ON ( FACT.ProductAlternateKey = SRC.ProductAlternateKey )
    WHEN NOT MATCHED THEN
        INSERT VALUES (
            SRC.ProductAlternateKey
            ,SRC.ListPrice
            ,SRC.EnglishDescription
            ,GETDATE() -- StartDate
            ,NULL -- EndDate
        )
    WHEN MATCHED AND FACT.EndDate IS NULL THEN
        UPDATE SET FACT.EndDate = GETDATE() -- close the current record
    OUTPUT $Action Action_Out
        ,SRC.ProductAlternateKey
        ,SRC.ListPrice -- output so the outer SELECT can read it
        ,SRC.EnglishDescription -- output so the outer SELECT can read it
        ,GETDATE() StartDate
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';
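The logic of the Type 2 statement can be easier to follow outside SQL. The sketch below mirrors it in memory: a staged row with no current match is inserted, and a staged row that matches a current record closes that record and adds a new current version. The sample data and the function name are hypothetical, and the function mutates the dimension list in place.

```python
from datetime import date


def apply_type2(dimension, staging, today):
    """Expire the current row for each staged key and append a new version."""
    current = {
        row["ProductAlternateKey"]: row
        for row in dimension
        if row["EndDate"] is None
    }
    for src in staging:
        existing = current.get(src["ProductAlternateKey"])
        if existing is not None:
            existing["EndDate"] = today  # close the old version
        dimension.append({
            "ProductAlternateKey": src["ProductAlternateKey"],
            "ListPrice": src["ListPrice"],
            "StartDate": today,
            "EndDate": None,  # the new row becomes the current version
        })
    return dimension


# Hypothetical data: one existing product and two staged rows.
today = date(2010, 6, 1)
dimension = [
    {"ProductAlternateKey": "LJ-0192-L", "ListPrice": 48.0,
     "StartDate": date(2009, 1, 1), "EndDate": None},
]
staging = [
    {"ProductAlternateKey": "LJ-0192-L", "ListPrice": 50.0},
    {"ProductAlternateKey": "NEW-0001", "ListPrice": 35.0},
]
apply_type2(dimension, staging, today)
current_rows = [row for row in dimension if row["EndDate"] is None]
print(len(dimension), len(current_rows))
```

Like the SQL version, this treats every staged row as a change, so unchanged rows should be filtered out before staging.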

38 2010 Microsoft BI Conference

39 Review
SCD Wizard
  Use performance optimizations
  Replace OLE DB Command
  Enable Fast Load or stage inserts
  Reuse connection
  Not the only way to do SCD processing in SSIS
Try out the third party components
  Kimball Method SCD (CodePlex)
  Table Difference (CozyRoc)
Use MERGE
  Bad rows will fail the entire batch

40 Slowly Changing Dimension Processing
Which Should I Choose?
Comparison table. Columns: SCD Wizard, Custom Component, MERGE. Rows: Ease of Creation, Maintenance, Large Dimensions, Small Change Set, Overall Performance, Error Recovery. Ratings: Good, Okay, Bad.

41 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

42 Late Arriving Facts Range lookups PATTERN

43 Ranges Need to lookup a key for a range, or point in time
Frequent requirement for Historical Loads
DimProduct FactOrders
Range lookups are common in data warehouse situations where you’re building a historical fact table and using Type 2 dimensions. As a quick refresher, a Type 2 dimension is one where you add a new row to represent updated information, instead of updating the old row directly. You typically keep track of the current row using Start Date and End Date columns. This lets you maintain all of the historical information.
In the example here, we have the product dimension from Adventure Works. We see three rows for the same product, identified by the ProductAlternateKey. The StandardCost and ListPrice have changed between the entries, and we see we have Start and End Date columns to indicate when these values were valid. The NULL in the End Date column means that it’s the current value.
Now, if we’re building a fact table using values from the Orders table in our OLTP system, we’ll want to make sure we reference the ProductKey value that was valid at the time of the order. So if our order took place on , we’d need to find the range it falls into, and then find the matching product key. Pretty straightforward, but it’s not really what the Lookup Transform was designed for. Lookup was designed more for one-to-one mappings.
Of course, it can be done in SSIS, otherwise I probably wouldn’t be talking about it here today. I’ll show you three different approaches here, with varying degrees of complexity.

44 Lookup Transform – Customize the Query
select [ProductKey], [ProductAlternateKey], [StartDate], [EndDate]
from [DimProduct]
where [ProductAlternateKey] = ?
  and [StartDate] <= ?
  and ( [EndDate] is null or [EndDate] > ? )

45 Merge Join and Conditional Split
Diagram: order rows (OrderDate, ProductNumber LJ-0192-L) are joined to the dimension rows (OrderDate, Product Key, StartDate, EndDate; LJ-0192-L with Product Keys 232, 233, 234, the last with a NULL EndDate), and the split keeps one row (OrderDate, Product Key; LJ-0192-L, 232).
One of the testers on the SSIS team suggested this next approach, and it took me a little while to wrap my head around it. This method doesn’t actually use the Lookup Transform at all. Instead it uses a Merge Join and a Conditional Split.
We set up the Merge Join to do an Inner Join with the dimension table. This will give us more rows going out than we had coming in. Since we’re joining on the natural key, we’ll end up with a row for every version of that value in the dimension. Remember that in the previous example I showed, the product had three entries in the Product dimension. This would give us three separate rows every time we try to look up that product key.
We use the Conditional Split to do the actual range lookup. In the Merge Join, we pull in three new columns: the ProductKey we want, the StartDate, and the EndDate. We set the expression in the Conditional Split to only take the rows where the Order Date falls between the Start and End Dates. The rest of the rows can be discarded. A little more complicated than the previous example, but it works.
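The two steps described above can be sketched in a few lines: an inner join on the natural key that fans each order out to every version of the product, followed by a filter that keeps only the version whose date range contains the order date. The product keys 232 to 234 come from the slide. The dates are hypothetical, and the range test follows the convention of the customized Lookup query (start inclusive, end exclusive, NULL end meaning current).

```python
from datetime import date

# Product keys are from the slide; the date ranges are hypothetical.
DIM_PRODUCT = [
    {"ProductKey": 232, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2001, 7, 1), "EndDate": date(2002, 7, 1)},
    {"ProductKey": 233, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2002, 7, 1), "EndDate": date(2003, 7, 1)},
    {"ProductKey": 234, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2003, 7, 1), "EndDate": None},
]


def range_lookup_by_join(orders, dim_product):
    # "Merge Join": inner join on the natural key, one row per version.
    joined = [
        {**order,
         "ProductKey": version["ProductKey"],
         "StartDate": version["StartDate"],
         "EndDate": version["EndDate"]}
        for order in orders
        for version in dim_product
        if version["ProductAlternateKey"] == order["ProductNumber"]
    ]
    # "Conditional Split": keep the version valid on the order date.
    return [
        row for row in joined
        if row["StartDate"] <= row["OrderDate"]
        and (row["EndDate"] is None or row["OrderDate"] < row["EndDate"])
    ]


orders = [{"OrderDate": date(2001, 12, 15), "ProductNumber": "LJ-0192-L"}]
matched = range_lookup_by_join(orders, DIM_PRODUCT)
print([row["ProductKey"] for row in matched])
```

The join produces three rows for the single order and the filter discards two of them, which is why this approach reads the entire dimension table.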

46 Custom Range Lookup Component
Build your own lookup cache Replicate Full and Partial cache modes Use weak references to reduce memory consumption
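One way a custom range lookup cache could work is sketched below: versions are grouped by natural key and sorted by start date, so each lookup is a binary search instead of a database query. This corresponds roughly to a full cache. The partial cache mode and the weak references mentioned on the slide are not shown, and the class name and sample dates are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import date


class RangeLookupCache:
    """In-memory cache of dimension versions, searchable by key and date."""

    def __init__(self, dimension_rows):
        grouped = defaultdict(list)
        for row in dimension_rows:
            grouped[row["ProductAlternateKey"]].append(
                (row["StartDate"], row["EndDate"], row["ProductKey"])
            )
        self._versions = {}
        self._starts = {}
        for key, versions in grouped.items():
            versions.sort(key=lambda version: version[0])  # by StartDate
            self._versions[key] = versions
            self._starts[key] = [version[0] for version in versions]

    def find(self, natural_key, when):
        """Return the ProductKey valid on the given date, or None."""
        starts = self._starts.get(natural_key)
        if not starts:
            return None
        # Last version whose StartDate is on or before the given date.
        index = bisect.bisect_right(starts, when) - 1
        if index < 0:
            return None
        _start, end, product_key = self._versions[natural_key][index]
        if end is not None and when >= end:
            return None
        return product_key


# Hypothetical date ranges for the three versions shown earlier.
cache = RangeLookupCache([
    {"ProductKey": 232, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2001, 7, 1), "EndDate": date(2002, 7, 1)},
    {"ProductKey": 233, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2002, 7, 1), "EndDate": date(2003, 7, 1)},
    {"ProductKey": 234, "ProductAlternateKey": "LJ-0192-L",
     "StartDate": date(2003, 7, 1), "EndDate": None},
])
print(cache.find("LJ-0192-L", date(2002, 12, 1)))
```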

47 2010 Microsoft BI Conference
Machine: Dual core 1.8 GHz laptop, 3 GB RAM
Data: AdventureWorks and AdventureWorksDW, ~120,000 order rows, ~600 reference table rows

48 Review
Lookup – max queries, one per change row
Custom – max queries, one per dimension row
Merge – entire dimension table will be read
Range Lookup Processing (table columns: Effort, Maintenance, Performance, Comments)
Lookup: Ok for small number of rows
Merge Join & Conditional Split: Best for equal data set size. Good middle ground
Custom Script: Best for overall performance

49 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

50 Lookup Cache
PATTERN

51 Save the Cache, Save the World
Cache Mode is the most important Lookup Transform property
SQL 2008 introduced the Cache Connection Manager
Allows shared and persistent caches
SQL 2008 also introduced a Miss Cache
Off by default

52 Lookup Cache Modes
Full: Cache created pre-data flow. Must fit into memory
Partial: Cache created on demand. Cache size configurable. Misses can be cached as well
None: No rows are cached. Use with volatile data

53 Cascading Lookup Pattern
Full cache grabs most common values Partial cache grabs the rest
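A simplified sketch of the cascading idea: a preloaded full cache answers the most common keys, and anything else falls through to a partial cache that queries the reference table on demand and remembers both hits and misses. The class, the sample reference data, and the query counter are hypothetical. In SSIS this would be two Lookup transforms chained through the no-match output.

```python
class CascadingLookup:
    """Full cache first, then an on-demand partial cache with a miss cache."""

    def __init__(self, full_cache, fetch_row):
        self.full_cache = dict(full_cache)  # preloaded common values
        self.partial_cache = {}             # filled on demand
        self.miss_cache = set()             # keys known to have no match
        self.fetch_row = fetch_row
        self.database_queries = 0

    def lookup(self, key):
        if key in self.full_cache:
            return self.full_cache[key]
        if key in self.partial_cache:
            return self.partial_cache[key]
        if key in self.miss_cache:
            return None
        # Not cached anywhere: go to the reference table once.
        self.database_queries += 1
        value = self.fetch_row(key)
        if value is None:
            self.miss_cache.add(key)
        else:
            self.partial_cache[key] = value
        return value


# Hypothetical reference table standing in for the database.
REFERENCE_TABLE = {"C1": 101, "C2": 102, "C3": 103}


def fetch_from_database(key):
    return REFERENCE_TABLE.get(key)


lookup = CascadingLookup(full_cache={"C1": 101}, fetch_row=fetch_from_database)
results = [lookup.lookup(key) for key in ["C1", "C2", "C2", "C9", "C9"]]
print(results, lookup.database_queries)
```

Five lookups cost two queries here: one for the first "C2" and one for the first "C9", whose miss is then cached.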

54 Using Lookup to Create a Surrogate Key

55 Cache Connection Manager

56 Persistent Cache Usage Reducing database and memory usage
Customers

57 Persistent Cache Usage Cache most common values
Customers SELECT TOP CustomerId … Get the rest

58 Review Use the right Lookup Cache mode
Cache Connection Manager can give you performance benefits Not a silver bullet Limited by disk speed

59 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

60 Null Value Substitution
PATTERN

61 Lookup, No Match, Union, Repeat
Replacing a null value or failed lookup with a default value
Natural pattern is to use No Match output
Merge back to main stream using Union All

62 Replace with a constant NULL value
Traditional Approach Region not found Replace with a constant NULL value Repeat

63 Saving it Until the End
Ignore missing matches
Handle all subs in a single transform
Only one buffer is created
Use miss cache
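The "save it until the end" approach can be sketched as two stages: every lookup leaves a null behind when it fails, and a single final step replaces all nulls with their defaults. The column names, reference data, and the default value of -1 are assumptions made for illustration.

```python
# Hypothetical defaults and reference data.
DEFAULTS = {"RegionKey": -1, "NationKey": -1}
REGIONS = {"EUROPE": 3}
NATIONS = {"FRANCE": 6}


def lookup_ignoring_misses(row):
    """Run all lookups; a failed lookup simply leaves None in the column."""
    enriched = dict(row)
    enriched["RegionKey"] = REGIONS.get(row["Region"])
    enriched["NationKey"] = NATIONS.get(row["Nation"])
    return enriched


def substitute_defaults(row, defaults=DEFAULTS):
    """Replace every None that has a configured default, in one pass."""
    return {
        column: defaults.get(column, value) if value is None else value
        for column, value in row.items()
    }


rows = [
    {"Customer": "A", "Region": "EUROPE", "Nation": "FRANCE"},
    {"Customer": "B", "Region": "ATLANTIS", "Nation": "FRANCE"},
]
cleaned = [substitute_defaults(lookup_ignoring_misses(row)) for row in rows]
print([(row["RegionKey"], row["NationKey"]) for row in cleaned])
```

Compared with the lookup, no match, union, repeat layout, the rows stay in one stream and the substitution logic lives in one place.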

64 Topics Parallel Processing Using the Database Engine
Change Data Capture Slowly Changing Dimensions Late Arriving Facts Lookup Cache Null Value Substitution General Performance Tips

65 2010 Microsoft BI Conference
Required Slide Resources Learning Sessions On-Demand & Community Microsoft Certification & Training Resources Resources for IT Professionals Resources for Developers

66 2010 Microsoft BI Conference
Required Slide Complete an evaluation on CommNet and enter to win!

67 7/8/2019 7:47 PM © 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

