SSIS: Exploring Scalability, Performance and Deployment
  Vinod Kumar & Srinivas Sampath, MVP – SQL Server
Presentation Scope
  - A high-level view
  - Design considerations
  - How to measure performance
  - Performance implications of architecture
  - Manageability aspects of SSIS
  - Deployment tips
  - Out of scope: prescriptive guidance for specific situations
Agenda
  - Buffers and Memory
  - OVAL Concept
  - Detailed Component-Specific Notes
  - Manageability Features
  - Deployment Considerations
Introduction
SSIS Life Cycle Tools
  - Design the SSIS package
    - Business Intelligence Development Studio (Visual Studio)
    - Migration wizard for pre-SQL Server 2005 packages
    - Version control integration (VSS)
  - Deployment/Execution
    - Deployment utility to copy packages
    - Command-line execution (dtexec.exe and dtexecui.exe)
    - Flexible configuration options
  - Supportability
    - Rich per-package logging
    - SQL Server Management Studio for monitoring running packages and organizing stored packages
    - Checkpoint restartability
SSIS Tools
  [Diagram. Components shown: SSIS packages, BI Studio, SSIS Service, Management Studio, Import/Export Wizard, Deployment Installer, file set, dtexec.exe, dtexecui.exe, dtutil.exe. Action labels: execution, view running, import\export, deploy.]
Deep dive into Performance
Buffers and Memory
  - Buffers are based on design-time metadata
  - The width of a row determines the size of the buffer
  - Smaller rows = more rows in memory = greater efficiency
  - Memory copies are expensive!
  - A buffer might have placeholder columns filled by downstream components
  - Pointer magic where possible
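The "smaller rows = more rows in memory" point can be made concrete with a little arithmetic. The sketch below is a simplified model, not the engine's actual sizing logic: it assumes a buffer is capped both by a byte size and by a row count, and it uses 10 MB and 10,000 rows as the caps (values taken here as the usual defaults of the data flow's DefaultBufferSize and DefaultBufferMaxRows properties; treat them as assumptions and check your own package).

```python
# Simplified model of how row width limits the rows held in one buffer.
# Assumed caps (not stated on the slide): 10 MB per buffer, 10,000 rows per buffer.
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024   # bytes
DEFAULT_BUFFER_MAX_ROWS = 10_000         # rows


def rows_per_buffer(row_width_bytes,
                    buffer_size=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Rows that fit in one buffer under both the byte cap and the row cap."""
    if row_width_bytes <= 0:
        raise ValueError("row width must be positive")
    return max(1, min(max_rows, buffer_size // row_width_bytes))


def buffers_needed(total_rows, row_width_bytes):
    """Number of buffers needed to move total_rows through the data flow."""
    per_buffer = rows_per_buffer(row_width_bytes)
    return -(-total_rows // per_buffer)  # ceiling division


for width in (100, 1_000, 5_000):
    print(width, rows_per_buffer(width), buffers_needed(1_000_000, width))
```

In this model a wide row hits the byte cap before the row cap, so the same million rows need several times as many buffers. Dropping or narrowing columns moves the row back under the row cap.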
Component Types
  - Row based (synchronous outputs)
    - Logically works at a row level
    - Buffer reused
    - Examples: Data Convert, Derived Column
  - Partially blocking (asynchronous outputs)
    - May logically work at a row level
    - Data copied to new buffers
    - Examples: Merge, Merge Join, Union All
  - Blocking (asynchronous outputs)
    - Needs all input buffers before producing any output rows
    - Data copied to new buffers
    - Examples: Aggregate, Sort
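The three component types can be imitated with Python generators. This is an analogy only, not SSIS code: dictionaries stand in for buffer rows, and the hypothetical helper `tracked` records how many input rows a component has pulled by the time it emits its first output row.

```python
def tracked(rows, pulled):
    """Source wrapper that records each row id as it is pulled downstream."""
    for row in rows:
        pulled.append(row["id"])
        yield row


def derived_column(rows):
    """Row based: adds a column to the same row object and passes it on."""
    for row in rows:
        row["total"] = row["qty"] * row["price"]
        yield row  # same object, no copy (the buffer is reused)


def union_all(*inputs):
    """Partially blocking: emits rows as they arrive, but as copies."""
    for rows in inputs:
        for row in rows:
            yield dict(row)  # copied into a "new buffer"


def sort_rows(rows, key):
    """Blocking: must consume every input row before emitting any output."""
    buffered = [dict(row) for row in rows]
    buffered.sort(key=lambda row: row[key])
    yield from buffered


rows = [{"id": 3, "qty": 1, "price": 2.0},
        {"id": 1, "qty": 2, "price": 5.0},
        {"id": 2, "qty": 4, "price": 1.5}]

for component in (derived_column, union_all, lambda r: sort_rows(r, "id")):
    pulled = []
    first = next(component(tracked(rows, pulled)))
    print("rows pulled before first output:", len(pulled))
```

The first two components produce output after pulling a single row. The sort has to pull all three first, which is why blocking components dominate memory use.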
CPU Utilization
  - Execution tree
    - Starts from a source or an async output
    - Ends at a destination or an input that has no sync outputs
  - Each execution tree can get a worker thread
  - MaxEngineThreads to control parallelism
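The execution tree rule above can be sketched as a small graph walk. This is a rough model under stated assumptions, not how the engine is implemented: every component is tagged with a hypothetical kind (`source`, `sync`, `async` or `destination`), and a tree is grown from each source or asynchronous component until it reaches a destination or the input of another asynchronous component.

```python
def execution_trees(components, edges):
    """Group components into execution trees.

    components: dict mapping name -> "source" | "sync" | "async" | "destination"
    edges: list of (upstream, downstream) pairs
    """
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)

    trees = []
    for start, kind in components.items():
        if kind not in ("source", "async"):
            continue  # only sources and async outputs start a tree
        members = [start]
        stack = list(reversed(downstream.get(start, [])))
        while stack:
            node = stack.pop()
            members.append(node)
            # An async component ends this tree at its input;
            # sync components pass the same buffers further down.
            if components[node] != "async":
                stack.extend(reversed(downstream.get(node, [])))
        trees.append(members)
    return trees


components = {
    "Flat File Source": "source",
    "Derived Column": "sync",
    "Sort": "async",
    "SQL Server Destination": "destination",
}
edges = [
    ("Flat File Source", "Derived Column"),
    ("Derived Column", "Sort"),
    ("Sort", "SQL Server Destination"),
]

for tree in execution_trees(components, edges):
    print(" -> ".join(tree))
```

The Sort appears in both trees: its input closes the first and its output opens the second. That is the point at which a second worker thread can be used.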
Performance Strategy
  Use OVAL to identify the factors affecting data integration performance:
  - Operations: What logic should be applied to the data?
  - Volume: How much data must be processed?
  - Application: Which app is best suited to these operations on this volume of data? For example, use SQL Server or SSIS for sorting data?
  - Location: Where should the app run? For example, on a shared server, or on a standalone machine?
An OVAL Example: Loading a Text File
  - Simple scenario... interesting performance considerations!
  - Text file on Server 1
  - SQL Server on Server 2
Operations
  - Understand all operations performed
    1. Open a transaction on SQL Server
    2. Read data from the text file
    3. Load data into the SSIS data flow
    4. Load the data into SQL Server
    5. Commit the transaction
  - Beware of hidden operations: data conversion in either step 3 or 4
Operations - Sharpen
  - File source
    - Avoid unnecessary data type conversions
    - ‘FastParse’ in Flat File Source
    - Unnecessary operations, e.g., converting from text to datetime, then from datetime to date
  - Reduce database operations
    - Database logging
    - Commit size
    - Fast Load
    - Table lock
Volume
  - Reduce where possible
    - Don’t push unneeded columns
    - Conditional Split for filtering rows
  - Do not parse or convert columns unnecessarily
    - In a fixed-width format you can combine adjacent unneeded columns into one
    - Leave unneeded columns as strings
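The fixed-width tip is easy to picture outside SSIS. The sketch below assumes a hypothetical four-column layout (the column names and widths are invented for illustration), merges adjacent unneeded columns into one filler slice, and leaves that filler unparsed.

```python
def compact_layout(columns):
    """Merge adjacent unneeded columns into a single unnamed filler column.

    columns: list of (name, width, needed) tuples
    returns: list of (name or None, width) tuples
    """
    compact = []
    for name, width, needed in columns:
        if not needed and compact and compact[-1][0] is None:
            # Extend the previous filler instead of adding another column.
            compact[-1] = (None, compact[-1][1] + width)
        else:
            compact.append((name if needed else None, width))
    return compact


def parse_line(line, layout):
    """Slice one fixed-width line, skipping filler columns entirely."""
    row, pos = {}, 0
    for name, width in layout:
        if name is not None:
            row[name] = line[pos:pos + width].strip()
        pos += width
    return row


# Hypothetical layout: only "id" and "amount" are needed downstream.
columns = [("id", 5, True), ("fax", 12, False),
           ("notes", 20, False), ("amount", 8, True)]
layout = compact_layout(columns)

line = "00042" + "555-0100".ljust(12) + "n/a".ljust(20) + "  199.50"
print(layout)
print(parse_line(line, layout))
```

The two unneeded columns collapse into a single 32-character slice that is never split, trimmed or converted, which is the saving the slide describes.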
Volume - Sharpen
  - Use appropriate data types
    - An integer in the range takes 2 bytes as an integer, 3 bytes as a string, but 4 bytes as a real
    - Suggest Types in the flat file connection manager UI
  - Use parallelism
    - If loading multiple files, can they be loaded in parallel?
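The byte counts on this slide can be reproduced with Python's `struct` module. The value 123 is a hypothetical three-digit example chosen for illustration; the slide does not say which range it has in mind.

```python
import struct

value = 123  # hypothetical three-digit value, not taken from the slide

as_int16 = struct.pack("<h", value)         # 16-bit signed integer
as_text = str(value).encode("ascii")        # one byte per digit
as_real = struct.pack("<f", float(value))   # 32-bit float

for label, encoded in (("integer", as_int16),
                       ("string", as_text),
                       ("real", as_real)):
    print(label, len(encoded), "bytes")
```

Multiplied across millions of rows, the choice of type changes how many rows fit in each buffer.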
Application
  - Is SSIS right for this?
    - The overhead of starting up an SSIS package may offset any performance gain over BCP for small data sets
    - Is BCP good enough?
    - Is the greater manageability and control of SSIS needed?
  - Bulk Insert Task vs. Data Flow
Location
  - Consider the following configuration:
    - Text file on Server 1
    - SQL Server on Server 2
  - Where should SSIS run? (Licensing issues aside)
Location Considerations
  - SSIS on Server 1
    - Competes with apps for resources
    - Will data conversion on Server 1 reduce or increase the volume of data transferred across the network?
    - Cannot use the fast SSIS SQL Server Destination
  - SSIS on Server 2
    - Competes with SQL Server for resources
    - Will pulling the text over the network before conversion be expensive?
    - Also consider transferring the file unparsed to Server 2 and reading it locally from there
    - Can use the fast SSIS SQL Server Destination
Measuring Performance
  - OVAL does not provide prescriptive guidance: too many variables
  - Improve performance by applying OVAL and measuring
    - SSIS logging
    - Performance counters
    - SQL Server Profiler, for extract queries, lookups and loading
Parallelism
  - Focus on the critical path
  - Utilize available resources
  - Memory constrained
  - Reader and CPU constrained
  - Let it rip!
  - Optimize the slowest
Moving Ahead
Manageability Features
  - Logging and Log Providers
  - Checkpoint Restartability
  - Precedence Constraints
  - Configurations
  - SSIS Service
Logging and Log Providers
  - Log entries are a blend of status and result messages
  - Can select which ‘details’ to log per control flow object within each package (e.g. OnError, OnWarning, OnPreExecute)
  - Can select which fields to log (e.g. computer, operator, ExecutionID...)
  - Can define multiple log providers (SQL, text file, Windows Event...) per package
Checkpointing
  [Diagram. Timeline labels: Package Loads, Checkpoint File Created, Write Checkpoint, Package Completes, Checkpoint File Deleted. Tasks shown: Data Flow Task, Send Mail Task.]
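The checkpoint life cycle in the diagram (file created when the package loads, written as tasks finish, deleted when the package completes) can be sketched in a few lines. This is an illustration of the restart idea only: `run_package`, the JSON file format and the flaky mail task are all hypothetical, and SSIS stores its own checkpoint format.

```python
import json
import os
import tempfile


def run_package(tasks, checkpoint_path):
    """Run tasks in order, skipping any recorded in an existing checkpoint."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)      # restart: resume after last success
    else:
        completed = []
        with open(checkpoint_path, "w") as f:
            json.dump(completed, f)       # checkpoint file created on load

    executed = []
    for name, action in tasks:
        if name in completed:
            continue
        action()                          # an exception leaves the file behind
        completed.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(completed, f)       # write checkpoint after each task

    os.remove(checkpoint_path)            # package completed: file deleted
    return executed


attempts = {"mail": 0}


def data_flow():
    pass  # stands in for the Data Flow Task


def send_mail():
    attempts["mail"] += 1
    if attempts["mail"] == 1:
        raise RuntimeError("mail server unavailable")  # fails on first run only


tasks = [("Data Flow Task", data_flow), ("Send Mail Task", send_mail)]
demo_path = os.path.join(tempfile.mkdtemp(), "package.chk")

first_run_left_checkpoint = False
try:
    run_package(tasks, demo_path)
except RuntimeError:
    first_run_left_checkpoint = os.path.exists(demo_path)

second_run = run_package(tasks, demo_path)
print(first_run_left_checkpoint, second_run)
```

On the second run only the Send Mail Task executes, so the expensive data flow is not repeated after a late failure.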
Configurations
  - ‘Feed’ changes into a package and alter execution without editing the package directly (e.g. the file name to load)
  - The ‘feed’ can be sourced from a SQL table, an XML file, a registry key, an OS environment variable, or a parent package
  - You can apply one or more configuration sets per package, from a mix of sources
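As a rough picture of how several configuration sets feed one package, the sketch below applies them in order to a dictionary of package properties. The property names and values are invented, plain dictionaries stand in for the XML file and environment variable sources, and two behaviors are assumptions of this sketch: later sets overwrite earlier ones, and entries that match no property are skipped.

```python
def apply_configurations(package_properties, configuration_sets):
    """Return a copy of the package properties with each set applied in order."""
    configured = dict(package_properties)
    for config in configuration_sets:
        for property_path, value in config.items():
            if property_path in configured:   # skip entries with no target
                configured[property_path] = value
    return configured


# Hypothetical package as handed off by the developer.
package = {"SourceFile": "dev_input.txt",
           "Database": "DevDB",
           "MailServer": "dev-mail"}

# Hypothetical configuration sets for the test environment.
xml_file_config = {"Database": "TestDB", "MailServer": "test-mail"}
environment_config = {"SourceFile": "test_input.txt", "Unknown": "ignored"}

test_package = apply_configurations(package,
                                    [xml_file_config, environment_config])
print(test_package)
```

The package definition itself is untouched; only the loaded copy changes. That is what allows one package file to move from development to test to production.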
Configuration Scenario
  - Multiple configurations: Dev, Test, Production
  - Databases: Dev DB, Test DB, Prod DB
  - Machines where packages are being designed / tested / executed
  - Configuration updates the package on load with DB locations (and mail server, file share locations...)
  - Package handoff
Precedence Constraints
  - Direct flow from object to object: basically, ‘when do I move on’
  - Success, Failure, Completion, or one of those plus an expression (condition)
  - Diagram labels: Data Flow Task, Send Mail Task; Success, Completion, Failure, Success & expression
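The "when do I move on" rule can be written as a small predicate. This is a sketch of the logic described on the slide, not the SSIS object model: the function name, the string values and the `row_count` variable are hypothetical.

```python
def constraint_satisfied(result, constraint, expression=None, variables=None):
    """Decide whether the next task may run.

    result:      "Success" or "Failure" of the preceding task
    constraint:  "Success", "Failure" or "Completion"
    expression:  optional callable taking the package variables
    """
    value_ok = constraint == "Completion" or constraint == result
    if expression is None:
        return value_ok
    return value_ok and bool(expression(variables or {}))


def has_rows(variables):
    """Hypothetical condition: only continue when rows were loaded."""
    return variables.get("row_count", 0) > 0


print(constraint_satisfied("Success", "Success"))
print(constraint_satisfied("Failure", "Completion"))
print(constraint_satisfied("Success", "Success", has_rows, {"row_count": 0}))
```

Combining the result with an expression lets a Send Mail Task fire only when the data flow both succeeded and actually loaded something.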
Manageability Demo
Deployment Flow
  Tools to organize and ‘copy’ packages and supporting files
  - Steps shown in the diagram
    - Design package
    - Add configurations
    - Add miscellaneous files
    - Set project deployment properties
    - Build
    - Copy/move deployment folder\files
    - Choose destination (SQL, file system)
    - Modify protection level
    - Choose location of supporting files
    - Change configurations
    - Execute
    - Create desired agent jobs
  - Tools and actors shown in the diagram: BI Studio, User, Installation Wizard, SQL Agent
SQL Management Studio
  - Utilizes the SSIS service
  - Allows monitoring of currently executing packages
  - Maintains the stored package structure
  - Ad hoc package execution
Deployment Demo
Some More Tips
  - Lookup
  - Aggregate
  - Sort
  - Swapping
Performance of Lookups
  - The reference set
    - Restrict to only those columns you actually use
    - Restrict rows with WHERE if possible
  - The lookup cache: caching can improve performance
    - Full cache: when the reference set will fit comfortably in memory
    - Partial: builds a cache as the input records are matched; useful for duplicate keys in the input, such as SKUs
    - None: the reference set doesn’t fit in memory and a partial cache has no advantage
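The trade-off between the three cache modes shows up clearly if you count round trips to the reference table. The class below is a toy model, not the Lookup transformation: a dictionary stands in for the reference table, and `queries` counts how often it is read (one bulk read for the full cache, one read per miss otherwise).

```python
class Lookup:
    """Toy lookup with "full", "partial" or "none" caching."""

    def __init__(self, reference, mode):
        self.reference = reference   # stands in for the reference table
        self.mode = mode
        self.queries = 0
        self.cache = {}
        if mode == "full":
            self.cache = dict(reference)   # load everything up front
            self.queries = 1               # counted as one bulk read

    def _query(self, key):
        self.queries += 1
        return self.reference.get(key)

    def match(self, key):
        if self.mode == "full":
            return self.cache.get(key)
        if self.mode == "partial":
            if key not in self.cache:      # query only on a cache miss
                self.cache[key] = self._query(key)
            return self.cache[key]
        return self._query(key)            # "none": query for every row


reference = {"A": "Apples", "B": "Bananas", "C": "Cherries"}
input_skus = ["A", "B", "A", "A", "C", "B"]   # duplicate keys, as with SKUs

query_counts = {}
for mode in ("full", "partial", "none"):
    lookup = Lookup(reference, mode)
    matches = [lookup.match(sku) for sku in input_skus]
    query_counts[mode] = lookup.queries

print(query_counts)
```

With repeated SKUs in the input, the partial cache queries once per distinct key instead of once per row, which is the case the slide singles out.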
Performance of Aggregate
  - The majority of the work happens in the ProcessInput call, which runs on the thread of the previous execution tree!
  - Memory requirements depend on how ‘deep’ the aggregations are
  - Can reuse buckets if one aggregation can be derived from another
  - Use when memory is limited, single threaded operation
Performance of Sort
  - ProcessInput hangs on to the incoming data
  - PrimeOutput does the sort and is the expensive part
  - Sort needs all data to be in memory
  - Sort can have unpredictable CPU requirements
  - Merging is single threaded
  - The stock Sort component will be good enough for most users
  - Third party (“fastest sort in the world”) available if you really need it
Swapping Buffers
  - Happens when physical memory is not available
  - Each buffer gets written out to one file
  - Multiple paths can be specified for swapping buffers: BufferTempStoragePath property on the pipeline
  - Do everything in your power to avoid swapping; otherwise, performance is really unpredictable
  - Options: 64-bit, out-of-process execution, serializing operations
SSIS: Summary
  - Fast!
    - Data flows process large volumes of data efficiently, even through complex operations
    - Exceptional price / performance on multi-core
  - Feature rich
    - Many pre-built adapters and transformations reduce hand coding
    - Extensible object model enables specialized custom or scripted components
    - Highly productive visual environment speeds development and debugging
    - Integral part of a complete BI stack (IS-AS-RS)
  - Beyond ETL
    - Enables integration of XML, RSS and Web Services data
    - Data cleansing features enable “difficult” data to be handled during loading
    - Data and text mining allow “smart” handling of data for imputation of incomplete data, conditional processing of potential problems, or smart escalation of issues such as fraud detection
Your Feedback is Important! Please Fill Out the feedback form
Questions !!!
Links & Resources
  - Vinod Kumar, MVP - SQL Server, Intel Technology India Pvt. Ltd.
  - SQL Server Integration Services public site: ehouse/SSIS/default.aspx
  - SQL Server Business Intelligence public site: on/bi/default.asp
  - SSIS MVPs community site
  - Newsgroups: microsoft.private.sqlserver2005.dts
  - Srinivas Sampath, MVP - SQL Server, www32.brinkster.com/srisamp, SCT Software Solutions
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.