Published by Allison Booth. Modified over 9 years ago.
1
Data Warehouse - A data warehouse is a relational or multidimensional database designed for query and analysis rather than transaction processing. It usually contains historical data derived from transaction data. It separates the analysis workload from the transaction workload and lets a business consolidate data from several sources. In addition to the database itself, a data warehouse environment often includes an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Other related definitions/terms:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike a true enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of a data warehouse that supports a particular region, business unit or business function.
Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected to dimension tables. This makes the data easy for users to access, since the database can be visualized as a cube of several dimensions. A data warehouse lets users slice and dice that cube along each of its dimensions.
2
Data Warehouse Architecture – Fig. 1
3
Data Warehouse and Data Marts – Fig. 2
4
What is Star Schema? A star schema is a relational database schema for representing multi-dimensional data. It is the simplest form of data warehouse schema and contains one or more dimension tables and fact tables. It is called a star schema because the relationship diagram between dimensions and fact tables resembles a star, with one fact table connected to multiple dimensions. The center of the star is a large fact table that points towards the dimension tables. The advantages of a star schema are easy slicing of data, better query performance and a structure that is easy to understand.
Important aspects of Star Schema & Snowflake Schema
In a star schema, every dimension has a primary key.
In a star schema, a dimension table has no parent table, whereas in a snowflake schema a dimension table has one or more parent tables.
In a star schema, hierarchies for the dimensions are stored in the dimension table itself, whereas in a snowflake schema hierarchies are broken out into separate tables.
These hierarchies help to drill down the data from the topmost level to the lowermost level.
Hierarchy - A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
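The Month, Quarter and Year hierarchy described above can be sketched in a few lines of Python. This is only an illustration: the sales figures and the helper names `quarter_of` and `roll_up` are made up for the example and do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical monthly sales figures: (year, month) -> sales dollars.
monthly_sales = {
    (2004, 1): 100.0, (2004, 2): 120.0, (2004, 3): 80.0,
    (2004, 4): 90.0, (2005, 1): 150.0,
}

def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1

def roll_up(values, key_fn):
    """Aggregate values to the coarser hierarchy level chosen by key_fn."""
    totals = defaultdict(float)
    for key, amount in values.items():
        totals[key_fn(key)] += amount
    return dict(totals)

# Month level -> Quarter level, then Quarter level -> Year level.
quarterly_sales = roll_up(monthly_sales, lambda k: (k[0], quarter_of(k[1])))
yearly_sales = roll_up(quarterly_sales, lambda k: k[0])
```

Drilling down is the reverse walk: start from `yearly_sales`, then inspect the quarters and months that make up a given year.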
5
Level - A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
Fact Table - A table in a star schema that contains facts and is connected to dimensions. The central table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables). A fact table usually contains facts at the same level of aggregation.
In the example in figure 3, the sales fact table is connected to the dimensions location, product, time and organization. Data can be sliced across every dimension and can also be aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across the dimensions independently or in combination, for example:
Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee
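As a rough illustration of calculating "Sales Dollar" across one, two or three dimensions, the sketch below builds a tiny star schema in an in-memory SQLite database. The table names, column names and rows are assumptions made for the example, and the organization dimension is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE time_dim     (time_id INTEGER PRIMARY KEY, year INTEGER);
-- Fact table: foreign keys to each dimension plus one fact column.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim(product_id),
    location_id  INTEGER REFERENCES location_dim(location_id),
    time_id      INTEGER REFERENCES time_dim(time_id),
    sales_dollar REAL,
    PRIMARY KEY (product_id, location_id, time_id)
);
INSERT INTO product_dim  VALUES (1, 'Product1'), (2, 'Product2');
INSERT INTO location_dim VALUES (1, 'Detroit'), (2, 'Chicago');
INSERT INTO time_dim     VALUES (1, 2004), (2, 2005);
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0), (1, 2, 1, 50.0), (1, 1, 2, 70.0), (2, 1, 1, 30.0);
""")

def sales_by(*columns):
    """Sum sales_dollar grouped by the given dimension columns."""
    cols = ", ".join(columns)
    query = f"""
        SELECT {cols}, SUM(f.sales_dollar)
        FROM sales_fact f
        JOIN product_dim p  ON p.product_id = f.product_id
        JOIN location_dim l ON l.location_id = f.location_id
        JOIN time_dim t     ON t.time_id = f.time_id
        GROUP BY {cols} ORDER BY {cols}
    """
    return conn.execute(query).fetchall()

by_product = sales_by("p.product_name")
by_product_city = sales_by("p.product_name", "l.city")
by_product_city_year = sales_by("p.product_name", "l.city", "t.year")
```

Each call adds one more dimension to the grouping, which mirrors the progression from "per product" to "per product in a location" to "per product in a year within a location".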
6
In the star schema example figure, "Sales Dollar" is a fact (measure) and it can be added across several dimensions. Fact tables store different types of measures: additive, non-additive and semi-additive.
Measure Types
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not others.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables). In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
Steps in designing a Fact Table
Identify a business process for analysis (like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension (branch name, region name).
Determine the lowest level of summary in a fact table (sales dollar).
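The three measure types are easier to tell apart with numbers. The sketch below uses made-up account snapshots and sales rows, not data from the presentation. It assumes deposits as an additive measure, account balance as a semi-additive one and profit margin as a non-additive one.

```python
# Hypothetical daily account snapshots: (day, account, balance, deposits).
snapshots = [
    ("Mon", "A", 100.0, 100.0),
    ("Mon", "B", 50.0, 50.0),
    ("Tue", "A", 130.0, 30.0),
    ("Tue", "B", 50.0, 0.0),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(deposit for _, _, _, deposit in snapshots)

# Semi-additive: balances can be summed across accounts for one day,
# but summing them across days would count the same money twice.
tuesday_balance = sum(bal for day, _, bal, _ in snapshots if day == "Tue")

# Non-additive: a ratio such as profit margin cannot be summed at all;
# it is recomputed from its additive components instead.
sales_and_profit = [(200.0, 50.0), (100.0, 10.0)]  # (sales, profit) per row
total_sales = sum(s for s, _ in sales_and_profit)
total_profit = sum(p for _, p in sales_and_profit)
overall_margin = total_profit / total_sales
```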
7
Star Schema – Fig. 3
8
A snowflake schema is a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables. In the star schema example we had 4 dimensions (location, product, time, organization) and a fact table (sales). In the snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In OLAP, the snowflake schema approach increases the number of joins, which slows down data retrieval.
9
Snowflake Schema
10
Fact less Fact table
11
Dimension Table - A dimension table describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.
Location Dimension - In relational data modeling, for normalization purposes, the country lookup, state lookup, county lookup, and city lookup are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called LOCATION DIMENSION for performance and data slicing requirements. The location dimension helps to compare the sales in one region with those in another. We may see a good sales profit in one region and a loss in another. If it is a loss, the reason may be a new competitor in that area, a failure of our marketing strategy, and so on. Example of Location Dimension: Figure 4
Slowly Changing Dimensions - Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product's price changes over time, people change their names, and country and state names may change. These are a few examples of Slowly Changing Dimensions, since changes happen to them over a period of time.
12
Slowly Changing Dimensions are often categorized into three types, namely Type 1, Type 2 and Type 3. The following section deals with how to capture and handle these changes over time. In the year 2004, the price of Product1 was $150, and over time Product1's price changes from $150 to $350. With this information, let us explain the three types of Slowly Changing Dimensions.
Type 1: Overwriting the old values. If the price of the product changes to $250 in the year 2005, the old values of the columns "Year" and "Product Price" are updated and replaced with the new values. In Type 1, there is no way to find out the old price of "Product1" in the year 2004.
Type 2: Creating an additional record. In Type 2, the old values are not replaced; instead, a new row containing the new values is added to the product table. So at any point in time, the old values and the new values can both be retrieved and easily compared. This is very useful for reporting purposes. The problem with the above-mentioned data structure is that "Product ID" cannot store duplicate values of "Product1", since "Product ID" is the primary key. Also, the current data structure does not clearly specify the effective date and expiry date of Product1, such as when the change to its price happened. So it would be better to change the current data structure to overcome the primary key violation.
13
In the changed Product table's data structure, "Product ID" and "Effective DateTime" together form a composite primary key, so the primary key constraint is no longer violated. The new columns "Effective DateTime" and "Expiry DateTime" record when each version of the product took effect and when it expired, which adds clarity and widens the usefulness of this table. The Type 2 approach may need additional space in the database, since an additional row has to be stored for every changed record. Since dimensions are usually not that big in the real world, the additional space is negligible.
Type 3: Creating new fields. In Type 3, only the latest change to the values can be seen. The example mentioned below illustrates how to add new columns to keep track of the change, so that both the current price and the previous price of Product1 are visible. The problem with the Type 3 approach is that if the product price changes repeatedly over the years, the complete history is not stored; only the latest change is. For example, if Product1's price changes to $350 in the year 2006, we can no longer see the 2004 price, since the old values will have been overwritten with the 2005 product information.
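The three approaches can be compared side by side in a minimal Python sketch. The row layouts, the column names and the `HIGH_DATE` marker for a still-current row are assumptions for illustration; the slide tables themselves were images and are not reproduced here. The prices follow the Product1 example (150, then 250 in 2005, then 350 in 2006).

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # assumed marker for "still current"

def scd_type1(row, new_price, year):
    """Type 1: overwrite in place; the old price is lost."""
    row["price"], row["year"] = new_price, year

def scd_type2(rows, product_id, new_price, effective):
    """Type 2: expire the current row and append a new one."""
    for row in rows:
        if row["product_id"] == product_id and row["expiry"] == HIGH_DATE:
            row["expiry"] = effective
    rows.append({"product_id": product_id, "price": new_price,
                 "effective": effective, "expiry": HIGH_DATE})

def scd_type3(row, new_price, year):
    """Type 3: keep only the current and the immediately previous value."""
    row["old_price"], row["old_year"] = row["current_price"], row["current_year"]
    row["current_price"], row["current_year"] = new_price, year

# Type 1: the 2004 price of 150 is gone after the update.
t1 = {"product_id": "Product1", "year": 2004, "price": 150}
scd_type1(t1, 250, 2005)

# Type 2: both versions survive, keyed by (product_id, effective).
t2 = [{"product_id": "Product1", "price": 150,
       "effective": datetime(2004, 1, 1), "expiry": HIGH_DATE}]
scd_type2(t2, "Product1", 250, datetime(2005, 1, 1))

# Type 3: after two changes only 350 and 250 remain; 150 is lost.
t3 = {"product_id": "Product1", "current_year": 2004, "current_price": 150,
      "old_year": None, "old_price": None}
scd_type3(t3, 250, 2005)
scd_type3(t3, 350, 2006)
```

Type 2 is the only variant here that can answer "what was the price on a given date", which is why it needs the effective and expiry columns.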
14
Location Dimension – Fig. 4
15
ETL – Extraction Transformation and Loading
“Which does what you want the way you want” - ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database.
What are ETL Tools? ETL tools are meant to extract, transform and load data into a data warehouse for decision making. Before ETL tools evolved, the ETL process described above was done manually, using SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding and more work hours. On top of that, maintaining the code posed a great challenge for the programmers. ETL tools remove these difficulties: compared to the old method, they are very powerful and offer advantages in every stage of the ETL process, from extraction, data cleansing, data profiling and transformation to debugging and loading into the data warehouse. A number of ETL tools are available in the market to process data according to business and technical requirements. Some of them are listed below.
16
Popular ETL Tools
Informatica
DataStage
Ab Initio
Data Junction
Oracle Warehouse Builder
Microsoft SQL Server
17
The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system. Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.
Glossary of ETL (Reference:
Source System - A database, application, file, or other storage facility from which the data in a data warehouse is derived.
Mapping - The definition of the relationship and data flow between source and target objects.
Metadata - Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging Area - A place where data is processed before entering the warehouse.
Cleansing - The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Transformation - The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.
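The three steps (mapping, cleansing in a staging area, then transforming and loading) can be sketched as plain Python functions. Everything here is hypothetical: the source field names, the `MAPPING` table and the in-memory `warehouse` list stand in for real source systems and targets.

```python
# Step 1: mapping between hypothetical source fields and target columns.
MAPPING = {"cust_nm": "customer_name", "amt": "sales_dollar"}

source_rows = [
    {"cust_nm": "  alice ", "amt": "100.50"},
    {"cust_nm": "BOB", "amt": "n/a"},  # anomaly: non-numeric amount
    {"cust_nm": "Alice", "amt": "20"},
]

def cleanse(rows):
    """Step 2: resolve inconsistencies in the staging area; drop bad rows."""
    staged = []
    for row in rows:
        try:
            amount = float(row["amt"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        staged.append({"cust_nm": row["cust_nm"].strip().title(), "amt": amount})
    return staged

def transform(rows):
    """Step 3a: rename columns per MAPPING and aggregate by customer."""
    totals = {}
    for row in rows:
        mapped = {MAPPING[field]: value for field, value in row.items()}
        name = mapped["customer_name"]
        totals[name] = totals.get(name, 0.0) + mapped["sales_dollar"]
    return [{"customer_name": n, "sales_dollar": t} for n, t in totals.items()]

def load(rows, target):
    """Step 3b: write the transformed rows to the target."""
    target.extend(rows)

warehouse = []
load(transform(cleanse(source_rows)), warehouse)
```

Cleansing normalizes the two spellings of the same customer and discards the row with the unparseable amount before any aggregation happens.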
19
In Informatica, transformations help to transform the source data according to the requirements of the target system and ensure the quality of the data being loaded into the target. Transformations are of two types: Active and Passive.
Active Transformation - An active transformation can change the number of rows that pass through it from source to target, i.e. it eliminates rows that do not meet the condition in the transformation.
Passive Transformation - A passive transformation does not change the number of rows that pass through it, i.e. it passes all rows through the transformation.
Transformations can be Connected or UnConnected.
Connected Transformation - A connected transformation is connected to other transformations or directly to the target table in the mapping.
UnConnected Transformation - An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.
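The active/passive and connected/unconnected distinctions can be mimicked with ordinary Python generators and functions. This is an analogy rather than Informatica code, and the tax rates and order rows are invented for the example.

```python
TAX_RATES = {"MI": 0.06, "CA": 0.0725}  # hypothetical reference data

def tax_rate(state):
    """Unconnected style: called from inside another transformation,
    returns a single value to the caller."""
    return TAX_RATES.get(state, 0.0)

def add_tax(rows):
    """Passive: exactly one output row per input row."""
    for row in rows:
        yield {**row, "tax": row["amount"] * tax_rate(row["state"])}

def keep_large(rows, minimum):
    """Active: rows that fail the condition are dropped."""
    for row in rows:
        if row["amount"] >= minimum:
            yield row

orders = [
    {"state": "MI", "amount": 100.0},
    {"state": "CA", "amount": 20.0},
    {"state": "NY", "amount": 500.0},
]
# Connected style: the output of one transformation feeds the next.
taxed = list(add_tax(orders))
large = list(keep_large(taxed, minimum=50.0))
```

`add_tax` returns as many rows as it receives, while `keep_large` may return fewer, which is the row-count property that separates passive from active.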
20
Following are the list of Transformations available in Informatica:
Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Procedure Transformation
21
Aggregator Transformation - Aggregator transformation is an Active and Connected transformation. It is useful for performing calculations such as averages and sums, mainly on multiple rows or groups. For example, it can calculate the total of daily sales or the average of monthly or yearly sales. Aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX and SUM can be used in an Aggregator transformation.
Expression Transformation - Expression transformation is a Passive and Connected transformation. It can be used to calculate values in a single row before writing to the target. For example, it can calculate the discount on each product, concatenate first and last names, or convert a date to a string field.
Filter Transformation - Filter transformation is an Active and Connected transformation. It can be used to filter out rows in a mapping that do not meet the condition. For example, it can find all the employees who work in Department 10 or the products that fall in the rate category between $500 and $1000.
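What an Aggregator does to a group of rows can be shown with a small Python sketch. The daily sales rows are made up, and the `aggregate` helper is only an analogy for the SUM, AVG and COUNT aggregate functions named above.

```python
from collections import defaultdict

# Hypothetical input rows: (sale date, sales amount).
daily_sales = [
    ("2004-01-01", 100.0),
    ("2004-01-01", 50.0),
    ("2004-01-02", 30.0),
]

def aggregate(rows):
    """Group rows by date and compute SUM, AVG and COUNT per group."""
    groups = defaultdict(list)
    for day, amount in rows:
        groups[day].append(amount)
    return {
        day: {"SUM": sum(vals), "AVG": sum(vals) / len(vals), "COUNT": len(vals)}
        for day, vals in groups.items()
    }

totals = aggregate(daily_sales)
```

Three input rows become two output rows (one per date), which is why the Aggregator counts as an active transformation.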
22
Joiner Transformation - Joiner transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location. For example, it can join a flat file and a relational source, two flat files, or a relational source and an XML source. In order to join two sources, there must be at least one matching port. While joining two sources, one source must be specified as the master and the other as the detail.
The Joiner transformation supports the following types of joins: Normal, Master Outer, Detail Outer and Full Outer.
Normal join discards all the rows of data from the master and detail sources that do not match, based on the condition.
Master outer join discards all the unmatched rows from the master source and keeps all the rows from the detail source and the matching rows from the master source.
Detail outer join keeps all rows of data from the master source and the matching rows from the detail source. It discards the unmatched rows from the detail source.
Full outer join keeps all rows of data from both the master and detail sources.
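The four join types are easy to mix up, so here is a small Python sketch that follows the definitions above. The `join` function, its `join_type` strings and the sample rows are hypothetical and are not part of any Informatica API.

```python
def join(master, detail, key, join_type="normal"):
    """Join two lists of dicts on `key` using the four join types."""
    detail_keys = {d[key] for d in detail}
    out = []
    for d in detail:
        matches = [m for m in master if m[key] == d[key]]
        for m in matches:
            out.append({**m, **d})
        # Unmatched detail rows survive master outer and full outer joins.
        if not matches and join_type in ("master_outer", "full_outer"):
            out.append(dict(d))
    # Unmatched master rows survive detail outer and full outer joins.
    if join_type in ("detail_outer", "full_outer"):
        out.extend(dict(m) for m in master if m[key] not in detail_keys)
    return out

master = [{"id": 1, "m": "a"}, {"id": 2, "m": "b"}]
detail = [{"id": 2, "d": "x"}, {"id": 3, "d": "y"}]

normal = join(master, detail, "id", "normal")
master_outer = join(master, detail, "id", "master_outer")
detail_outer = join(master, detail, "id", "detail_outer")
full_outer = join(master, detail, "id", "full_outer")
```

With this data, only `id` 2 matches, so the normal join yields one row, each one-sided outer join yields two, and the full outer join yields three.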
23
Lookup Transformation - Lookup transformation is Passive and can be either Connected or UnConnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported from either source or target tables. For example, suppose we want to retrieve all the sales of the product with ID 10, and the sales data resides in another table. Instead of using the sales table as one more source, use a Lookup transformation to look up the data for the product with ID 10 in the sales table.
Difference between Connected and UnConnected Lookup Transformation:
Connected lookup receives input values directly from the mapping pipeline, whereas UnConnected lookup receives values from the :LKP expression in another transformation.
Connected lookup returns multiple columns from the same row, whereas UnConnected lookup has one return port and returns one column from each row.
Connected lookup supports user-defined default values, whereas UnConnected lookup does not support user-defined default values.
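The connected versus unconnected difference can be approximated in Python as "a pipeline step that adds several columns" versus "a function called from an expression that returns one value". The `SALES` table, its columns and the default values below are assumptions made for the illustration.

```python
# Hypothetical lookup table keyed by product ID.
SALES = {10: {"units": 42, "sales_dollar": 399.0}}

def connected_lookup(rows):
    """Connected style: sits in the pipeline, returns multiple columns
    and falls back to user-defined default values when nothing matches."""
    defaults = {"units": 0, "sales_dollar": 0.0}
    for row in rows:
        yield {**row, **SALES.get(row["product_id"], defaults)}

def unconnected_lookup(product_id):
    """Unconnected style: called from another transformation's expression,
    returns a single value through its one return port."""
    match = SALES.get(product_id)
    return match["sales_dollar"] if match else None

pipeline_rows = list(connected_lookup([{"product_id": 10}, {"product_id": 11}]))
single_value = unconnected_lookup(10)
missing_value = unconnected_lookup(11)
```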
24
Normalizer Transformation - Normalizer transformation is an Active and Connected transformation. It is used mainly with COBOL sources, where data is most often stored in de-normalized format. The Normalizer transformation can also be used to create multiple rows from a single row of data.
Rank Transformation - Rank transformation is an Active and Connected transformation. It is used to select the top or bottom rank of data. For example, it can select the top 10 regions by sales volume or the 10 lowest priced products.
Router Transformation - Router is an Active and Connected transformation. It is similar to the Filter transformation. The only difference is that the Filter transformation drops the data that does not meet the condition, whereas the Router has an option to capture the data that does not meet the condition. It is useful for testing multiple conditions. It has input, output and default groups. For example, if we want to split data by State=Michigan, State=California, State=New York and all other states, the Router makes it easy to route each set of rows to a different table.
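The State example for the Router can be sketched in Python as a function that sends each row to every group whose condition it meets and to a default group otherwise. The `route` helper and the sample rows are illustrative assumptions, not Informatica code.

```python
def route(rows, conditions):
    """Send each row to every group whose condition it meets;
    rows that meet no condition go to the DEFAULT group."""
    groups = {name: [] for name in conditions}
    groups["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, predicate in conditions.items():
            if predicate(row):
                groups[name].append(row)
                matched = True
        if not matched:
            groups["DEFAULT"].append(row)
    return groups

rows = [
    {"state": "Michigan", "sales": 10},
    {"state": "California", "sales": 20},
    {"state": "Ohio", "sales": 5},
]
groups = route(rows, {
    "MI": lambda r: r["state"] == "Michigan",
    "CA": lambda r: r["state"] == "California",
    "NY": lambda r: r["state"] == "New York",
})
```

A plain filter would have discarded the Ohio row; here it is captured in the default group and can be written to its own table.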
25
Sequence Generator Transformation - Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values, cycle through a sequential range of numbers, or replace missing keys. It has two output ports, NEXTVAL and CURRVAL, for connecting to other transformations; you cannot add ports to this transformation. The NEXTVAL port generates a sequence of numbers when it is connected to a transformation or target. CURRVAL is the NEXTVAL value plus one, or NEXTVAL plus the Increment By value.
Stored Procedure Transformation - Stored Procedure transformation is a Passive transformation that can be Connected or UnConnected. It is useful for automating time-consuming tasks and is also used for error handling, dropping and recreating indexes, determining the space in a database, performing a specialized calculation, and so on. The stored procedure must exist in the database before a Stored Procedure transformation is created, and it can exist in a source, target, or any database with a valid connection to the Informatica Server. A stored procedure is an executable script with SQL statements, control statements, user-defined variables and conditional statements.
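A rough Python model of the NEXTVAL and CURRVAL behaviour described above is sketched below. The class, its parameter names (`start`, `increment_by`, `end_value`, `cycle`) and the choice to raise an error when a non-cycling sequence runs out are assumptions for illustration, not the actual transformation properties.

```python
class SequenceGenerator:
    """Toy model of a sequence with NEXTVAL and CURRVAL outputs."""

    def __init__(self, start=1, increment_by=1, end_value=None, cycle=False):
        self.start = start
        self.increment_by = increment_by
        self.end_value = end_value
        self.cycle = cycle
        self._next = start

    def nextval(self):
        """Return the next number in the sequence and advance."""
        if self.end_value is not None and self._next > self.end_value:
            if not self.cycle:
                raise OverflowError("sequence exhausted")
            self._next = self.start  # cycle back to the start value
        value = self._next
        self._next += self.increment_by
        return value

    def next_row(self):
        """Return both ports: CURRVAL is NEXTVAL plus the increment."""
        value = self.nextval()
        return {"NEXTVAL": value, "CURRVAL": value + self.increment_by}

keys = SequenceGenerator(start=1, increment_by=1, end_value=3, cycle=True)
generated = [keys.nextval() for _ in range(5)]
ports = SequenceGenerator(start=10, increment_by=5).next_row()
```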
26
Sorter Transformation - Sorter transformation is an Active and Connected transformation. It sorts data in either ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to output only distinct rows.
Source Qualifier Transformation - Source Qualifier transformation is an Active and Connected transformation. When a relational or flat file source definition is added to a mapping, it must be connected to a Source Qualifier transformation. The Source Qualifier performs tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.
Update Strategy Transformation - Update Strategy transformation is an Active and Connected transformation. It is used to update data in the target table, either to maintain a history of the data or to keep only recent changes. You can specify how source rows are treated in the target table: insert, update, delete or data driven.
XML Source Qualifier Transformation - XML Source Qualifier is a Passive and Connected transformation. It is used only with an XML source definition. It represents the data elements that the Informatica Server reads when it executes a session with XML sources.
27
Advanced External Procedure Transformation - Advanced External Procedure transformation is an Active and Connected transformation. It operates in conjunction with procedures that are created outside of the Designer interface to extend PowerCenter/PowerMart functionality. It is useful for creating external transformation applications, such as sorting and aggregation, which require all input rows to be processed before emitting any output rows.
External Procedure Transformation - External Procedure transformation is a Passive transformation that can be Connected or UnConnected. Sometimes the standard transformations, such as the Expression transformation, may not provide the functionality that you want. In such cases, an External Procedure is useful for developing complex functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the necessary Expression transformations in a mapping.
Differences between Advanced External Procedure and External Procedure Transformations:
External Procedure returns a single value, whereas Advanced External Procedure returns multiple values.
External Procedure supports COM and Informatica procedures, whereas Advanced External Procedure supports only Informatica procedures.
28
Custom Transformation
Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. You can create a Custom transformation and bind it to a procedure that you develop using the functions described in Custom Transformation Functions. You can use the Custom transformation to create transformation applications, such as sorting and aggregation, which require all input rows to be processed before outputting any output rows. To support this process, the input and output functions occur separately in Custom transformations compared to External Procedure transformations. The PowerCenter Server passes the input data to the procedure using an input function. The output function is a separate function that you must enter in the procedure code to pass output data to the PowerCenter Server. In contrast, in the External Procedure transformation, an external procedure function does both input and output, and its parameters consist of all the ports of the transformation. You can also use the Custom transformation to create a transformation that requires multiple input groups, multiple output groups, or both. A group is the representation of a row of data entering or leaving a transformation. For example, you might create a Custom transformation with one input group and multiple output groups that parses XML data. Or, you can create a Custom transformation with two input groups and one output group that merges two streams of input data into one stream of output data.
29
External Procedure transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. Although the standard transformations provide you with a wide range of options, there are occasions when you might want to extend the functionality provided with PowerCenter. For example, the range of standard transformations, such as Expression and Filter transformations, may not provide the exact functionality you need. If you are an experienced programmer, you may want to develop complex functions within a dynamic link library (DLL) or UNIX shared library, instead of creating the necessary Expression transformations in a mapping. To obtain this kind of extensibility, you can use the Transformation Exchange (TX) dynamic invocation interface built into PowerCenter. Using TX, you can create an Informatica External Procedure transformation and bind it to an external procedure that you have developed. You can bind External Procedure transformations to two kinds of external procedures: COM external procedures (available on Windows only) Informatica external procedures (available on Windows, AIX, HP-UX, Linux, and Solaris) To use TX, you must be an experienced C, C++, or Visual Basic programmer. You can use multi-threaded code in external procedures.
30
Transaction Control Transformation Overview
Transformation type: Active, Connected
PowerCenter allows you to control commit and rollback transactions based on a set of rows that pass through a Transaction Control transformation. A transaction is the set of rows bound by commit or rollback rows. You can define a transaction based on a varying number of input rows. You might want to define transactions based on a group of rows ordered on a common key, such as employee ID or order entry date.
In PowerCenter, you define transaction control at two levels:
Within a mapping. Within a mapping, you use the Transaction Control transformation to define a transaction. You define transactions using an expression in a Transaction Control transformation. Based on the return value of the expression, you can choose to commit, roll back, or continue without any transaction changes.
Within a session. When you configure a session, you configure it for user-defined commit. You can choose to commit or roll back a transaction if the PowerCenter Server fails to transform or write any row to the target.
When you run the session, the PowerCenter Server evaluates the expression for each row that enters the transformation. When it evaluates a commit row, it commits all rows in the transaction to the target or targets. When the PowerCenter Server evaluates a rollback row, it rolls back all rows in the transaction from the target or targets.
Note: You can also use the transformation scope in other transformation properties to define transactions. For more information, see “Understanding Commit Points” in the Workflow Administration Guide.
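The idea of defining transactions on a group of rows ordered on a common key, such as employee ID, can be sketched in Python. The action names `COMMIT_BEFORE`, `ROLLBACK_BEFORE` and `CONTINUE`, the `decide` callback and the choice to commit the last open transaction at end of data are simplifying assumptions for this example, not the PowerCenter expression syntax.

```python
def run_transactions(rows, decide):
    """Evaluate `decide` for each row and group rows into committed
    transactions. Returns the list of committed batches."""
    committed, pending, previous = [], [], None
    for row in rows:
        action = decide(previous, row)
        if action == "COMMIT_BEFORE" and pending:
            committed.append(pending)  # commit rows seen before this one
            pending = []
        elif action == "ROLLBACK_BEFORE":
            pending = []               # discard rows seen before this one
        pending.append(row)
        previous = row
    if pending:
        committed.append(pending)      # assumed: commit at end of data
    return committed

def commit_on_new_employee(previous, row):
    """Start a new transaction whenever the employee ID changes."""
    if previous is not None and previous["employee_id"] != row["employee_id"]:
        return "COMMIT_BEFORE"
    return "CONTINUE"

rows = [{"employee_id": i} for i in (1, 1, 2, 3, 3)]
batches = run_transactions(rows, commit_on_new_employee)
batch_ids = [[r["employee_id"] for r in batch] for batch in batches]
```

Because the input is ordered by employee ID, each committed batch holds exactly one employee's rows, so the number of rows per transaction varies with the data.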
31
Union Transformation Overview
Transformation type: Active, Connected
The Union transformation is a multiple input group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION ALL statement, the Union transformation does not remove duplicate rows. You can connect heterogeneous sources to a Union transformation. The Union transformation merges sources with matching ports and outputs the data from one output group with the same ports as the input groups. The Union transformation is developed using the Custom transformation.
Union Transformation Rules and Guidelines
Consider the following rules and guidelines when you work with a Union transformation:
You can create multiple input groups, but only one output group.
All input groups and the output group must have matching ports. The precision, datatype, and scale must be identical across all groups.
The Union transformation does not remove duplicate rows. To remove duplicate rows, you must add another transformation such as a Router or Filter transformation.
You cannot use a Sequence Generator or Update Strategy transformation upstream from a Union transformation.
The Union transformation does not generate transactions.
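The UNION ALL behaviour and the matching-ports rule can be illustrated with a short Python sketch. The `union` helper and the sample rows are hypothetical; it checks only port names, not precision, datatype or scale.

```python
from itertools import chain

def union(*input_groups):
    """Merge several input groups into one output group, keeping
    duplicates (UNION ALL semantics) and requiring matching ports."""
    rows = list(chain.from_iterable(input_groups))
    if rows:
        ports = set(rows[0])
        if any(set(row) != ports for row in rows):
            raise ValueError("all input groups must have matching ports")
    return rows

flat_file_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
database_rows = [{"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
merged = union(flat_file_rows, database_rows)
```

The row with `id` 2 appears twice in `merged`; removing it would take a separate step downstream, as the guidelines above note.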