4 Copyright © 2005, Oracle. All rights reserved. Extraction, Transformation, and Loading (ETL) Extraction and Transportation.


4-2 Objectives
After completing this lesson, you should be able to do the following:
Describe the core ETL framework inside the database and its integration advantage
Explain data warehousing extraction methods
Identify transportation methods:
– Flat file
– Distributed operations
– Transportable tablespaces
Describe the transformation flow

4-3 Overview
Lesson 4: Extraction/Transportation
Lesson 5: Loading
Lesson 6: Transformation

4-4 What Is ETL?
ETL is an acronym for Extraction, Transformation, and Loading.
The following happen during the ETL process:
– The desired data is identified and extracted from many different sources.
– Some transformations may take place during this extraction process.
– After extraction, the data must be transported to a target system or an intermediate system for further processing.
– Depending on the method of transportation, some transformations can be done simultaneously.
ETL refers to a broad process.

4-6 Extraction Methods
Extraction can be thought of in two parts:
– Extraction
– Transportation
There are two extraction methods:
– Logical
– Physical
The logical method you choose influences how the data is physically extracted.
Some criteria for choosing a combination:
– Business needs
– Location of the source and target systems
– Availability of the source system
– Time required to extract data

4-7 Logical Extraction Methods
There are two kinds of logical extraction:
Full extraction
– All data is pulled
– Less information to track
– More time required to pull the data
Incremental extraction
– A subset of data is pulled
– Must track what data needs to be pulled
– Less time required to pull the data
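The trade-off between the two logical methods can be sketched in a few lines. This is a minimal illustration, not Oracle's mechanism: it uses Python's built-in sqlite3 module, and the `orders` table, its `updated_at` column, and the watermark-based tracking are all assumptions made for the example.

```python
import sqlite3


def full_extract(conn):
    """Full extraction: pull every row; nothing has to be tracked between runs."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()


def incremental_extract(conn, last_extracted_at):
    """Incremental extraction: pull only rows changed after the tracked watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_extracted_at,),
    ).fetchall()
    # The watermark is the extra state an incremental extraction must maintain.
    new_watermark = max((row[2] for row in rows), default=last_extracted_at)
    return rows, new_watermark


# Hypothetical source table with a last-modified timestamp (ISO date strings).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2005-01-01"), (2, 20.0, "2005-01-02"), (3, 30.0, "2005-01-03")],
)

all_rows = full_extract(conn)
changed_rows, watermark = incremental_extract(conn, "2005-01-01")
```

The full extraction needs no bookkeeping but reads the whole table. The incremental one reads less but relies on the source reliably maintaining a change indicator such as `updated_at`.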

4-9 Physical Extraction Methods
There are two types of physical extraction.
Online extraction:
– Pulls data from the source system
Offline extraction:
– Pulls data from a staging area
– Staging areas include flat files, dump files, and transportable tablespaces.

4-10 Offline Extraction
Staging areas:
Flat files
– Requires data in a predefined, generic format
Dump files
– Must be in an Oracle-specific format
Redo and archive logs
– Data located in special dump files
Transportable tablespaces
– Powerful, fast method for moving large volumes of data

4-11 Implementing Methods of Extraction
Extracting to a file:
– Spooling from SQL*Plus
– Using OCI or Pro*C to dump to a file
– Using Data Pump to export to an Oracle dump file
– Using external tables
Extracting through distributed operations
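As a rough analogue of extracting to a flat file, the sketch below writes a query result as delimited text. It is not one of the Oracle tools listed above: it uses Python's sqlite3 and csv modules, and the function name `extract_to_flat_file` and the `products` table are hypothetical.

```python
import csv
import io
import sqlite3


def extract_to_flat_file(conn, query, out):
    """Write the result of `query` to `out` as comma-separated text with a header row."""
    cursor = conn.execute(query)
    writer = csv.writer(out)
    # A header row makes the generic format self-describing for the loader.
    writer.writerow([column[0] for column in cursor.description])
    rows_written = 0
    for row in cursor:
        writer.writerow(row)
        rows_written += 1
    return rows_written


# Hypothetical source table standing in for the real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (prod_id INTEGER, prod_name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)", [(1, "Tee"), (2, "Trousers")]
)

# An in-memory buffer is used here; a real extraction would open a file instead.
buffer = io.StringIO()
rows_written = extract_to_flat_file(
    conn, "SELECT prod_id, prod_name FROM products ORDER BY prod_id", buffer
)
```

Because the output is plain delimited text, any target system can load it, which is the main appeal of flat files as a staging format.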

4-13 Incremental Extraction Using CDC
Change Data Capture (CDC) can capture and publish committed change data in either of the following modes:
Synchronous
– Triggers on the source database allow change data to be captured immediately.
– Change data is captured as part of the transaction modifying the source table.
Asynchronous
– Change data is captured from the redo logs after a SQL statement that performs DML is committed.
– Asynchronous Change Data Capture is built on Oracle Streams.
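The synchronous mode can be approximated outside Oracle with ordinary triggers. The sketch below is only an analogy using Python's sqlite3 module: the `products_ct` change table, the trigger names, and the operation codes are illustrative assumptions, not the objects that Oracle CDC creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_list_price REAL);

    -- Hypothetical change table: one row per captured change.
    CREATE TABLE products_ct (operation TEXT, prod_id INTEGER, prod_list_price REAL);

    -- Triggers run inside the transaction that modifies the source table.
    CREATE TRIGGER products_ai AFTER INSERT ON products
    BEGIN
        INSERT INTO products_ct VALUES ('I', NEW.prod_id, NEW.prod_list_price);
    END;

    CREATE TRIGGER products_au AFTER UPDATE ON products
    BEGIN
        INSERT INTO products_ct VALUES ('UO', OLD.prod_id, OLD.prod_list_price);
        INSERT INTO products_ct VALUES ('UN', NEW.prod_id, NEW.prod_list_price);
    END;
    """
)

# A committed insert is captured.
conn.execute("INSERT INTO products VALUES (1, 9.99)")
conn.commit()

# A rolled-back update leaves no trace, because capture is part of the transaction.
conn.execute("UPDATE products SET prod_list_price = 12.50 WHERE prod_id = 1")
conn.rollback()

changes = conn.execute(
    "SELECT operation, prod_id, prod_list_price FROM products_ct ORDER BY rowid"
).fetchall()
```

Only the committed insert appears in `changes`. The cost of this design is that every DML statement on the source table also pays for the trigger work, which is the usual argument for the asynchronous, log-based mode.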

4-14 Publish and Subscribe Model
The publisher performs the following tasks:
Identifies source tables from which the data warehouse is interested in capturing change data
Uses the DBMS_CDC_PUBLISH package to:
– Set up the capture of data from the source tables
– Determine and advance the change sets
– Publish the change data
Allows controlled access to subscribers using the SQL GRANT and REVOKE statements

4-16 Publish and Subscribe Model
The subscriber uses the DBMS_CDC_SUBSCRIBE package to:
Subscribe to source tables
Extend the window and create the change view
Prepare the subscriber views
View data stored in change tables
Purge the subscriber view
Remove the subscriber views
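The window idea behind extending and purging can be modeled with a small in-memory sketch. This is a conceptual model under stated assumptions, not the DBMS_CDC_SUBSCRIBE implementation: the `ChangeTable` and `Subscription` classes and their sequence-number bookkeeping are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeTable:
    """Published change rows as (sequence, payload), kept in commit order."""

    rows: list = field(default_factory=list)

    def publish(self, payload):
        self.rows.append((len(self.rows) + 1, payload))


@dataclass
class Subscription:
    """A subscriber's window over a change table: visible rows have low < seq <= high."""

    change_table: ChangeTable
    low: int = 0
    high: int = 0

    def extend_window(self):
        # Move the high boundary up to the newest published change.
        if self.change_table.rows:
            self.high = self.change_table.rows[-1][0]

    def view(self):
        # The subscriber view shows only changes inside the current window.
        return [p for seq, p in self.change_table.rows if self.low < seq <= self.high]

    def purge_window(self):
        # Signal that everything seen so far is no longer needed.
        self.low = self.high


ct = ChangeTable()
sub = Subscription(ct)
ct.publish("a")
ct.publish("b")

before_extend = sub.view()      # nothing visible until the window is extended
sub.extend_window()
first_batch = sub.view()

ct.publish("c")                 # published later, outside the current window
stable_batch = sub.view()

sub.purge_window()
sub.extend_window()
second_batch = sub.view()
```

The window gives each subscriber a stable, repeatable view of a batch of changes while new changes keep arriving, and purging tells the publisher which rows are no longer needed by that subscriber.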

4-18 Synchronous CDC
[Diagram. Labels: source database; source database transactions; source tables; trigger execution; SYNC_SOURCE change source; change set; change tables; subscriber views.]

4-19 Asynchronous CDC
Asynchronous CDC:
Captures change data from redo log files after changes have been committed to the source database
Has modes that depend on the level of supplemental logging used on the source database
Uses Oracle Streams to capture change data from redo log files
Has three source modes:
– Asynchronous AutoLog mode
– Asynchronous HotLog mode
– Asynchronous Distributed HotLog mode

4-20 Asynchronous AutoLog Mode
[Diagram. Labels: source database (source database transactions, source tables, LGWR, online redo logs, LOG_ARCHIVE_DEST_2); staging database (RFS, standby redo logs, Streams capture, Distributed AutoLog change source, Distributed AutoLog change set, change set, change tables, subscriber views).]

4-22 Asynchronous HotLog Configuration
[Diagram. Labels: source database; source database transactions; source tables; LGWR; online redo logs; Streams local capture; HOTLOG_SOURCE change source; change set; change tables; subscriber views.]

4-23 Asynchronous Distributed HotLog Mode
[Diagram. Labels: source database (source database transactions, source tables, LGWR, online redo logs, Distributed HotLog change source); DBlink; Streams propagation; DBlink; staging database (Distributed HotLog change set, change set, change tables, subscriber views).]

4-24 Preparing to Publish Change Data
1. Gather requirements from the subscribers.
2. Determine which source database contains the relevant source tables.
3. Choose the capture mode:
– Synchronous
– Asynchronous HotLog
– Asynchronous Distributed HotLog
– Asynchronous AutoLog
4. Ensure that the source and staging databases have appropriate database initialization parameters set.
5. Set up database links between the source database and the staging database.

4-25 Creating a Publisher User
The staging database publisher must be granted the following privileges and roles:
– EXECUTE_CATALOG_ROLE role
– SELECT_CATALOG_ROLE role
– CREATE TABLE and CREATE SESSION privileges
– EXECUTE on the DBMS_CDC_PUBLISH package
Create a default tablespace for the publisher.

4-27 Synchronous Publishing
1. Create a change set.
2. Create a change table.
3. Grant access to subscribers.
    BEGIN
      DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
        change_set_name    => 'CHICAGO_DAILY',
        description        => 'Change set for sales history info',
        change_source_name => 'SYNC_SOURCE');
    END;

    GRANT SELECT ON cdcpub.products_ct TO subscriber1;

4-29 Asynchronous Distributed HotLog Publishing
Prepare the source and staging databases:
1. Configure Oracle Net so that the source database can communicate with the staging database.
2. Set initialization parameters on the source database.
3. Set initialization parameters on the staging database.
    compatible =
    global_names = true
    job_queue_processes = <current value> + 2
    open_links = 4
    parallel_max_servers = <current value> + 3
    processes = <current value> + 4
    sessions = <current value> + 1
    streams_pool_size = <current value> + 20 MB
    undo_retention = 3600

4-30 Asynchronous Distributed HotLog Publishing
Prepare the staging database:
Set the database initialization parameters on the staging database.
    compatible =
    global_names = true
    java_pool_size =
    open_links = 4
    job_queue_processes = 2
    parallel_max_servers = <current value> + 2
    processes = <current value> + 3
    sessions = <current value> + 1
    streams_pool_size = <current value> + 11 MB
    undo_retention = 3600

4-31 Asynchronous Distributed HotLog Publishing
Alter the source database:
1. Place the database into FORCE LOGGING mode to protect against unlogged direct writes.
2. Enable supplemental logging.
3. Create an unconditional log group on all columns to be captured in the source table.
    ALTER DATABASE FORCE LOGGING;

    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

    ALTER TABLE SH.PRODUCTS
      ADD SUPPLEMENTAL LOG GROUP log_group_products
      (PROD_ID, PROD_NAME, PROD_LIST_PRICE) ALWAYS;

4-32 Asynchronous Distributed HotLog Publishing
Publisher privileges on source and staging databases:
1. Create and grant privileges to the source database publisher.
2. Create and grant privileges to the staging database publisher.
    CREATE USER source_cdcpub IDENTIFIED BY source_cdcpub
      QUOTA UNLIMITED ON SYSTEM
      QUOTA UNLIMITED ON SYSAUX;
    GRANT CREATE SESSION TO source_cdcpub;
    GRANT DBA TO source_cdcpub;
    GRANT CREATE DATABASE LINK TO source_cdcpub;
    GRANT EXECUTE ON DBMS_CDC_PUBLISH TO source_cdcpub;
    GRANT EXECUTE_CATALOG_ROLE TO source_cdcpub;
    GRANT SELECT_CATALOG_ROLE TO source_cdcpub;
    EXECUTE DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE(GRANTEE => 'source_cdcpub');

4-34 Asynchronous Distributed HotLog Publishing
Create source and staging database links:
1. Create the source database link.
2. Create the staging database link.
    CREATE DATABASE LINK staging_db
      CONNECT TO staging_cdcpub IDENTIFIED BY staging_cdcpub
      USING 'staging_db';

    CREATE DATABASE LINK source_db
      CONNECT TO source_cdcpub IDENTIFIED BY source_cdcpub
      USING 'source_db';

4-35 Asynchronous Distributed HotLog Publishing
Create change sources and change sets:
1. Create the change sources.
2. Create the change sets.
    BEGIN
      DBMS_CDC_PUBLISH.CREATE_HOTLOG_CHANGE_SOURCE(
        change_source_name => 'CHICAGO',
        description        => 'test source',
        source_database    => 'source_db');
    END;

    BEGIN
      DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
        change_set_name    => 'CHICAGO_DAILY',
        description        => 'change set for product info',
        change_source_name => 'CHICAGO',
        stop_on_ddl        => 'y');
    END;

4-36 Asynchronous Distributed HotLog Publishing
Create the change tables on the staging database:
    BEGIN
      DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
        owner             => 'staging_cdcpub',
        change_table_name => 'products_ct',
        change_set_name   => 'CHICAGO_DAILY',
        source_schema     => 'SH',
        source_table      => 'PRODUCTS',
        column_type_list  => 'PROD_ID NUMBER(6), PROD_NAME VARCHAR2(50), PROD_LIST_PRICE NUMBER(8,2), JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
        capture_values    => 'both',
        rs_id             => 'y',
        row_id            => 'n',
        ...
        options_string    => 'TABLESPACE TS_CHICAGO_DAILY');
    END;

4-38 Asynchronous Distributed HotLog Publishing
Enable the change source and change set:
1. Enable the change source.
2. Enable the change set.
3. Grant access to subscribers.
    BEGIN
      DBMS_CDC_PUBLISH.ALTER_HOTLOG_CHANGE_SOURCE(
        change_source_name => 'CHICAGO',
        enable_source      => 'Y');
    END;

    BEGIN
      DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
        change_set_name => 'CHICAGO_DAILY',
        enable_capture  => 'y');
    END;

4-39 Subscribing to Change Data
1. Find the source tables for which the subscriber has access privileges.
2. Find the change set names and columns for which the subscriber has access privileges.
    SQL> SELECT * FROM ALL_SOURCE_TABLES;

    SOURCE_SCHEMA_NAME  SOURCE_TABLE_NAME
    SH                  PRODUCTS

    SQL> SELECT UNIQUE CHANGE_SET_NAME, COLUMN_NAME, PUB_ID
         FROM ALL_PUBLISHED_COLUMNS
         WHERE SOURCE_SCHEMA_NAME = 'SH'
         AND SOURCE_TABLE_NAME = 'PRODUCTS';

    CHANGE_SET_NAME  COLUMN_NAME      PUB_ID
    CHICAGO_DAILY    PROD_ID
    CHICAGO_DAILY    PROD_LIST_PRICE
    CHICAGO_DAILY    PROD_NAME        41494

4-40 Subscribing to Change Data
3. Create a subscription.
4. Subscribe to a source table and columns.
    BEGIN
      DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
        change_set_name   => 'CHICAGO_DAILY',
        description       => 'Change data for PRODUCTS',
        subscription_name => 'SALES_SUB');
    END;

    BEGIN
      DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
        subscription_name => 'SALES_SUB',
        source_schema     => 'SH',
        source_table      => 'PRODUCTS',
        column_list       => 'PROD_ID, PROD_NAME, PROD_LIST_PRICE',
        subscriber_view   => 'SALES_VIEW');
    END;

4-41 Subscribing to Change Data
5. Activate the subscription.
6. Get the next set of change data.
    BEGIN
      DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(
        subscription_name => 'SALES_SUB');
    END;

    BEGIN
      DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(
        subscription_name => 'SALES_SUB');
    END;

4-42 Subscribing to Change Data
7. Query the subscriber views.
8. Indicate that the change data is no longer needed.
9. End the subscription.
    SELECT PROD_ID, PROD_NAME, PROD_LIST_PRICE FROM SALES_VIEW;

    PROD_ID  PROD_NAME                         PROD_LIST_PRICE
             And 2 Crosscourt Tee Kids
             And 2 Crosscourt Tee Kids
             Gurfield& Murks Pleated Trousers
             Gurfield& Murks Pleated Trousers

    BEGIN
      DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(
        subscription_name => 'SALES_SUB');
    END;

4-43 Asynchronous Distributed HotLog Source Database Initialization Parameters
For all Oracle Database 10g releases:
    Parameter             Value
    COMPATIBLE
    GLOBAL_NAMES          TRUE
    JOB_QUEUE_PROCESSES   Maximum number of DBMS_JOB jobs that can run simultaneously plus 2
    OPEN_LINKS            Should be equal to the number of Distributed HotLog change sources planned
    PARALLEL_MAX_SERVERS  The current value + (3 times the number of change sources planned)
    PROCESSES             The current value + (4 times the number of change sources planned)
    SESSIONS              The current value + (the number of change sources planned)
    UNDO_RETENTION        3600

4-44 Asynchronous Distributed HotLog Source Database Initialization Parameters
For Oracle 9.2 databases:
    Parameter                       Value
    COMPATIBLE
    GLOBAL_NAMES                    TRUE
    JOB_QUEUE_PROCESSES             Maximum number of DBMS_JOB jobs that can run simultaneously plus 2
    LOG_PARALLELISM                 1
    LOGMNR_MAX_PERSISTENT_SESSIONS  The number of change sources planned
    OPEN_LINKS                      The number of Distributed HotLog change sources planned
    PARALLEL_MAX_SERVERS            The current value + (3 times the number of change sources planned)
    PROCESSES                       The current value + (the number of change sources planned)

4-45 Asynchronous Distributed HotLog Staging Database Initialization Parameters
For Oracle Database 10g Release 2:
    Parameter             Value
    COMPATIBLE
    GLOBAL_NAMES          TRUE
    JAVA_POOL_SIZE
    OPEN_LINKS            Equal to the number of Distributed HotLog change sources planned, but no less than 4
    PARALLEL_MAX_SERVERS  The current value + (2 times the number of change sources planned)
    PROCESSES             The current value + (3 times the number of change sources planned)
    SESSIONS              The current value + (the number of change sources planned)
    STREAMS_POOL_SIZE     The current value + ((the number of change sources planned) * (11MB))
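The sizing rules in the 10g source table and the 10g Release 2 staging table are simple arithmetic, so they can be checked with a small helper. This is a hypothetical calculator, not an Oracle utility: the function names are invented, 1 MB is assumed to mean 1,048,576 bytes, and parameters whose values are not stated in the tables (COMPATIBLE, JAVA_POOL_SIZE) or depend on job counts (JOB_QUEUE_PROCESSES) are left out.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def source_params_10g(current, change_sources):
    """Source database values for Distributed HotLog, following the 10g source table."""
    return {
        "GLOBAL_NAMES": "TRUE",
        "OPEN_LINKS": change_sources,
        "PARALLEL_MAX_SERVERS": current["PARALLEL_MAX_SERVERS"] + 3 * change_sources,
        "PROCESSES": current["PROCESSES"] + 4 * change_sources,
        "SESSIONS": current["SESSIONS"] + change_sources,
        "UNDO_RETENTION": 3600,
    }


def staging_params_10gr2(current, change_sources):
    """Staging database values for Distributed HotLog, following the 10g Release 2 table."""
    return {
        "GLOBAL_NAMES": "TRUE",
        # Equal to the number of change sources planned, but no less than 4.
        "OPEN_LINKS": max(4, change_sources),
        "PARALLEL_MAX_SERVERS": current["PARALLEL_MAX_SERVERS"] + 2 * change_sources,
        "PROCESSES": current["PROCESSES"] + 3 * change_sources,
        "SESSIONS": current["SESSIONS"] + change_sources,
        "STREAMS_POOL_SIZE": current["STREAMS_POOL_SIZE"] + change_sources * 11 * MB,
    }


# Hypothetical current settings and a plan for two change sources.
current = {
    "PARALLEL_MAX_SERVERS": 20,
    "PROCESSES": 150,
    "SESSIONS": 170,
    "STREAMS_POOL_SIZE": 48 * MB,
}
source = source_params_10g(current, change_sources=2)
staging = staging_params_10gr2(current, change_sources=2)
```

With two planned change sources, the source database needs 8 more processes and the staging database needs 22 MB more Streams pool than its current setting.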

4-46 Data Dictionary Views Supporting CDC
CHANGE_SOURCES lists existing change sources.
CHANGE_SETS lists existing change sets.
CHANGE_PROPAGATIONS describes the Streams propagation associated with a given Distributed HotLog change source on the source database.
CHANGE_TABLES lists existing change tables.
DBA_SOURCE_TABLES lists published source tables.
DBA_PUBLISHED_COLUMNS lists published source table columns.
DBA_SUBSCRIPTIONS lists all registered subscriptions.
DBA_SUBSCRIBED_TABLES lists published tables to which subscribers have subscribed.
DBA_SUBSCRIBED_COLUMNS lists the columns of tables to which subscribers have subscribed.

4-47 Transportation in a Data Warehouse
Three basic choices in transportation:
Transportation using flat files
Transportation through distributed operations
Transportation using transportable tablespaces

4-48 Transportable Tablespaces
This is the fastest method for moving large volumes of data.
Source and target databases can have different block sizes.
The method is especially useful for transporting data from an OLTP system to a data warehouse.
Before Oracle Database 10g, source and target databases needed to use the same operating system.

4-50 Transportable Tablespaces: Example
1. Place the data into its own tablespace.
2. Export the metadata.
3. Copy the data and export file to the target system.
4. Import the metadata.
5. Insert the new data into the fact table or employ the partition exchange feature.
    CREATE TABLE temp_jan_sales NOLOGGING TABLESPACE ts_temp_sales AS
      SELECT * FROM sales
      WHERE time_id BETWEEN '31-DEC-1999' AND '01-FEB-2000';

    EXPDP DIRECTORY=DW_DUMP_DIR DUMPFILE=jan.dmp TRANSPORT_TABLESPACES=ts_temp_sales

    IMPDP DIRECTORY=DM_DUMP_DIR DUMPFILE=jan.dmp TRANSPORT_DATAFILES='/db/tempjan.f'

4-52 Summary
In this lesson, you should have learned how to:
Describe the core ETL framework inside the database and its integration advantage
Explain data warehousing extraction methods
Identify transportation methods:
– Flat file
– Distributed operations
– Transportable tablespaces
Describe the transformation flow

4-53 Practice 4: Overview
This practice covers the following topics:
Loading data from a flat file by using SQL*Loader
Configuring synchronous Change Data Capture
Loading data from a transportable tablespace by using Data Pump
