Advanced PL/SQL and Oracle ETL Doug Cosman Senior Oracle DBA SageLogix, Inc. Open World 2003
Agenda Overview of Oracle 9i ETL Provides Fast Transformations Using Only the 9i DB Advanced PL/SQL Features Necessary for Understanding Oracle 9i ETL PL/SQL Performance Techniques for Data Warehouse Environments
What Is ETL? Extract Pull the Data From the Source Transform Convert the Input Format to the Target Format Encode any Values Load Insert the Transformed Data to the Target Tables
Oracle 9i ETL Extract Oracle 9i External Tables Transform PL/SQL Pipelined Table Functions Oracle Warehouse Builder Can also be Used to Build Pipelined Table Functions Maps Source Data Layout and Target Schema and Builds PL/SQL and SQL Code Load Direct Path Inserts
Performance Factors SQL Execution Time Efficiency of Execution Plan Hardware Resource Waits Code Logic Execution Time Speed of Host Language Variable Binding Time to Bind Values Back to Host Language
PL/SQL Binding Types of Binds IN-Binds Bind Values From Host Language to SQL Engine OUT-Binds Values are Returned from SQL Objects to Host Variables Bind Options Single Row Binds Bulk Binds
Single Row Binds Cursor FOR-LOOP DECLARE CURSOR cust_cur (p_customer_id NUMBER) IS SELECT * FROM f_sales_detail WHERE customer_id = p_customer_id; v_customer_idNUMBER := 1234; BEGIN FOR rec IN cust_cur (v_customer_id) LOOP INSERT INTO sales_hist (customer_id, detail_id, process_date) VALUES (v_customer_id, rec.sales_id, sysdate); END LOOP; END;
Context Switching PL/SQL Engine DB SQL Engine OUT-BIND IN-BIND
Single Row Binds The Most Expensive Operation by Far is the Binding Single Row Binding is SLOW for Large Result Sets
Bulk Binding PL/SQL Bulk Bind Support added in 8i IN-Binds An Array of Values is Passed to the SQL Engine OUT-Binds SQL Engine Populates a PL/SQL Bind-Array Context Switch Once per Batch Instead of Once per Row Performance Increase of Up to 15 Times
Bulk Operators BULK COLLECT Specifies that Bulk Fetches Should be Used Be Careful to Handle Last Batch LIMIT Defines the Batch Size for Bulk Collections FORALL Bulk DML Operator Not a Looping Construct like a Cursor-For-Loop PL/SQL Table is Referenced in the Statement
DECLARE TYPE sales_t IS TABLE OF f_sales_detail.sales_id%TYPE INDEX BY BINARY_INTEGER; sales_idssales_t; v_customer_idNUMBER := 1234; max_rowsCONSTANT NUMBER := 10000; CURSOR sales(p_customer_id NUMBER) IS SELECT sales_id FROM f_sales_detail WHERE customer_id = p_customer_id; BEGIN OPEN sales(v_customer_id); LOOP EXIT WHEN sales%NOTFOUND; FETCH sales BULK COLLECT INTO sales_ids LIMIT max_rows; FORALL i IN 1..sales_ids.COUNT INSERT INTO sales_hist (customer_id, detail_id, process_date) VALUES (v_customer_id, sales_ids(i), sysdate); END LOOP; CLOSE sales; END;
Native Compilation Allows PL/SQL to be Executed as a Compiled C Program Requires Native C Compiler on Host Enabling Set init.ora PLSQL_* Parameters Compile as Native Code PL/SQL is First Compiled Down to P-Code C Source Code is Generated from the P-Code Native Compiler is Invoked Creating a ‘C’ Shared Object Library Subsequent Calls to PL/SQL Object are Run by the ‘C’ Library
Native Compilation Performance Language Execution Speed is About Five Times Faster when not Interacting with the Database In Typical Code Interacting with Larger Data Volumes Execution Speed is Very Similar to Interpreted Code Remember that Variable Binding can be a Bigger Factor than Code Execution Speed Mixing Native and Interpreted PL/SQL Oracle Recommends an All or None Approach for Production Including User-Defined and Supplied Packages
Collection Types Associative Arrays (PL/SQL Tables) PL/SQL Type Only Nested Tables Shared Type Varrays Shared Type
Associative Arrays PL/SQL Type Only Not a SQL Type Easy to Use Automatic Element Allocation No Need to Initialize Two Kinds in 9i Release 2 INDEX BY BINARY_INTEGER INDEX BY VARCHAR2 Similar to: Java Hashtables Perl and Awk Associative Arrays
Associative Arrays DECLARE TYPE hash_table_t IS TABLE OF NUMBER INDEX BY VARCHAR2(30); _map hash_table_t; CURSOR users IS SELECT username, user_id FROM dba_users; BEGIN FOR user IN users LOOP _map(user.username) := user.user_id; END LOOP; END;
Multi-Dimensional Arrays New in 9i Release 1 Implemented as Collection of Collections DECLARE TYPE element IS TABLE OF NUMBER INDEX BY BINARY_INTEGER; TYPE twoDimensional IS TABLE OF element INDEX BY BINARY_INTEGER; twoD twoDimensional; BEGIN twoD(1)(1) := 123; twoD(1)(2) := 456; END;
Nested Tables No Maximum Size Harder to Use than Associative Arrays Need to be Initialized Code Must Explicitly Allocate New Elements Shared Type with SQL Two Options for Type Definition Local PL/SQL Definition Global SQL Type Declared in the Database Allows Variables to be Shared Between Both Environments
Nested Tables PL/SQL Scoped Type DECLARE TYPE nest_tab_t IS TABLE OF NUMBER; nt nest_tab_t := nest_tab_t(); BEGIN FOR i IN LOOP nt.EXTEND; nt(i) := i; END LOOP; END;
Nested Tables Globally Defined in SQL CREATE OR REPLACE TYPE _demo_obj_t AS OBJECT ( _idNUMBER, demo_codeNUMBER, valueVARCHAR2(30) ); / CREATE OR REPLACE TYPE _demo_nt_t AS TABLE OF _demo_obj_t; /
Nested Tables SQL-Defined Nested Tables PL/SQL Variables can be Manipulated by the SQL Engine Local PL/SQL Variables Can Be: Sorted Aggregated Used for Dynamic In-Lists Joined With SQL Tables Joined with Other PL/SQL Nested Tables
Table Functions Nested Tables Enable Table Functions SELECT * FROM TABLE( CAST(eml_dmo_nt AS _demo_nt_t) ) TABLE Operator Tells Oracle to Treat the Variable as a SQL Table CAST Operator Explicitly Tells Oracle the Data Type to be Used to Handle the Operation
Table Function Example DECLARE eml_dmo_nt _demo_nt_t := _demo_nt_t(); BEGIN -- Some logic that populates the nested table … eml_dmo_nt.EXTEND(3); eml_dmo_nt(1) := _demo_obj_t(45, 3, '23'); eml_dmo_nt(2) := _demo_obj_t(22, 3, '41'); eml_dmo_nt(3) := _demo_obj_t(18, 7, 'over_100k'); -- Process the data in assending order of id. FOR r IN (SELECT * FROM TABLE(CAST(eml_dmo_nt AS _demo_nt_t)) ORDER BY 1) LOOP dbms_output.put_line(r. _id || ' ' || r.demo_id); END LOOP; END;
Returning Result Sets Returning Collections Directly Return the Data Structure Itself Returning Reference Cursors Returns and Open Cursor to an Application Doesn’t Return Data from PL/SQL Directly Calling a Table Function from the SQL Context Convert Function Return Value into a Cursor
Returning Collections Return a Collection Type Explicitly Best Suited for PL/SQL Calling Programs FUNCTION get_ _demo(p_ _id NUMBER) RETURN _demo_nt_t IS eml_dmo _demo_nt_t; BEGIN SELECT _demo_obj_t( _id, demo_id, value) BULK COLLECT INTO eml_dmo FROM _demographic WHERE _id = p_ _id; -- Apply some business logic on the nested table here. RETURN eml_dmo; END;
Table Functions Can be Used in a SQL Context Too A Table Function Takes a Collection Type as an Argument A Function that Returns a Collection Works Too Allows us to Pass Out PL/SQL Collections as a Cursor to any Host Language SELECT * FROM TABLE( CAST( get_ _demo(45) AS _demo_nt_t));
Table Functions Data is Buffered in the Local Variable During Function Execution Cursor Returns Rows after Function Completes Private Memory Issues if the Result Set is Large Need a Way to Stream Results 9i Pipelined Table Functions Provides a Streaming Interface Rows are Returned as they are Produced Rows are Actually Buffered in Small Batches Remember Bulk Binding Issue? Can be Run in Parallel PIPELINED Keyword PIPE ROW Operator
Pipelined Table Function FUNCTION get_ _demo RETURN _demo_nt_t PIPELINED IS CURSOR _demo_cur IS SELECT _demo_obj_t( _id, demo_id, value) FROM _demographic; eml_dmo_nt _demo_nt_t; BEGIN OPEN _demo_cur; LOOP EXIT WHEN _demo_cur%NOTFOUND; FETCH _demo_cur BULK COLLECT INTO eml_dmo_nt LIMIT 1000; FOR i IN 1..eml_dmo_nt.COUNT LOOP /* Apply some business logic on the object here, and return a row. */ PIPE ROW (eml_dmo_nt(i)); END LOOP; END LOOP; RETURN; END;
External Tables One Last Piece of Background Information Oracle 9i External Tables Provides a Way for Oracle to Read Directly from Flat Files on the Database Server File can be Queried as if it is a Real Database Table Can Sort, Aggregate, Filter Rows, etc. External File Can be Queried in Parallel Only Table Definition is Stored in the Database Data is ‘External’ Table Definition is Similar to SQL*Loader Control File
External Tables CREATE TABLE ext_tab ( VARCHAR2(50), age NUMBER, income VARCHAR2(20)) ORGANIZATION EXTERNAL ( TYPE oracle_loader DEFAULT DIRECTORY data_dir ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE LOGFILE data_dir: 'ext_tab.log' BADFILE data_dir: 'ext_tab.bad' FIELDS TERMINATED BY ',' MISSING FIELD VALUES ARE NULL ( CHAR(50), age INTEGER EXTERNAL(2), income CHAR(20) ) ) LOCATION ('ext_tab.dat') ) REJECT LIMIT UNLIMITED;
ETL Example AGEINCOME _IDDEMO_CODEVALUE over_100k Normalize, Encode and Pivot Input Record
PACKAGE BODY etl IS TYPE hash_table_t IS TABLE OF NUMBER INDEX BY VARCHAR2(30); _map hash_table_t; FUNCTION transform (new_data SYS_REFCURSOR) RETURN _demo_nt_t PIPELINED PARALLEL_ENABLE ( PARTITION new_data BY ANY ) IS TYPE ext_tab_array IS TABLE OF ext_tab%ROWTYPE INDEX BY BINARY_INTEGER; indata ext_tab_array; _demo_obj _demo_obj_t := _demo_obj_t(null,null,null); demo_map hash_table_t; BEGIN LOOP EXIT WHEN new_data%NOTFOUND; FETCH new_data BULK COLLECT INTO indata LIMIT 1000; FOR i IN 1..indata.COUNT LOOP _demo_obj. _id := _map(indata(i). ); _demo_obj.demo_code := 3; _demo_obj.value := indata(i).age; PIPE ROW ( _demo_obj); _demo_obj.demo_code := 7; _demo_obj.value := indata(i).income; PIPE ROW ( _demo_obj); END LOOP; RETURN; END; BEGIN FOR IN (SELECT _id, FROM ) LOOP _map( . ) := . _id; END LOOP; END;
Oracle 9i ETL Transformation is Just a Simple INSERT as SELECT Elegant Solution to Parallel, Transactional Co-processing INSERT /*+ append nologging */ INTO _demographic (SELECT /*+ parallel( a, 4 ) */ * FROM TABLE( CAST( etl.transform( CURSOR(SELECT * FROM ext_tab )) AS _demo_nt_t)) a);
Parallel Co-processing DB PL/SQL PQ Slave Input File INSERT Extract Transform Load
Performance Issues Speed is Respectable but There is a Performance Bottleneck with the Table Function Mechanism Possibly an Issue Binding Data Back from the SQL Engine Throughput is about Three Times Slower than Coding with BULK COLLECT and FORALL Operators However These Don’t Support Parallel Operations Oracle Expects to have it Fixed in Next Release
ETL Alternatives The Multi-Table INSERT Statement New in 9i Each Sub-Query Input Row Can be INSERT’ed to a Different Table … or the Same Table Multiple Times Faster than Using PL/SQL It’s Always Faster to do Something in Pure SQL than Using Any Host Language Binding is Avoided
Multi-Table Insert INSERT /*+ append nologging */ ALL INTO _demographic ( _id, demo_id, value) VALUES ( _id, 3, age) INTO _demographic ( _id, demo_id, value) VALUES ( _id, 7, income) (SELECT /*+ ordered index( b ) */ b. _id, a.income, a.age FROM ext_tab a, b WHERE a. = b. );
SQL-Only Processing DB PQ Slave Input File INSERT Extract Transform Load
Performance Solutions Minimize SQL Execution Time Exploiting Caching to Eliminate Some SQL Look-ups and Joins Direct Path Inserts Code Logic Execution Time Replacing Interpreted PL/SQL with Native Compilation Eliminating Host Language Using Multi-Table INSERTS Variable Binding Replace Single Row Binds with Bulk Binds
Conclusion Oracle 9i ETL is a High Performance ETL Solution Especially Once the Table Function Issue is Resolved Already Included in the Cost of the RDBMS