Data Warehousing
On-Line Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP)
–Traditional relational representation
Multidimensional OLAP (MOLAP)
–Cube structure
OLAP Operations
–Cube slicing: produce a 2-D view of the data
–Drill-down: go from summary to more detailed views
–Roll-up: the opposite direction of drill-down
–Reaggregation: rearrange the order of dimensions
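The OLAP operations can be sketched on a tiny in-memory cube. This is only an illustration: the cube contents, the dimension names, and the function names are hypothetical and do not come from any particular OLAP tool.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, period, store) -> units sold.
DIMS = ("product", "period", "store")
CUBE = {
    ("Shoes", "Q1", "Store1"): 10,
    ("Shoes", "Q2", "Store1"): 15,
    ("Shoes", "Q1", "Store2"): 5,
    ("Hats", "Q1", "Store1"): 7,
    ("Hats", "Q2", "Store2"): 3,
}


def slice_cube(cube, dim, value):
    """Cube slicing: fix one dimension to a value, leaving a 2-D view."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def roll_up(cube, keep):
    """Roll-up: summarize by the dimensions in `keep`, summing the rest away.

    Passing the dimensions in a different order also rearranges the keys,
    which is one way to picture reaggregation.
    """
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)


def drill_down(cube, dim, value):
    """Drill-down: return the detail cells behind one summary value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


print(slice_cube(CUBE, "store", "Store1"))
print(roll_up(CUBE, ["product"]))
print(drill_down(CUBE, "product", "Hats"))
```

A roll-up to product followed by a drill-down on one product mirrors the summary-then-detail navigation described for the graphical tools.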
Slicing a data cube
Example of drill-down
–Summary report
–Drill-down with color added
Starting with summary data, users can obtain details for particular cells
Excel’s Pivot Table
Data/Pivot Table
–Drilldown, rollup, reaggregation
Access Pivot Form Drill Down
Data Warehouse
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
–Subject-oriented: organized around dimensions for data analysis, e.g. customers, employees, locations, products, time periods
–Integrated: consistent naming conventions, formats, and encoding structures across data from multiple sources
–Time-variant: can study trends and changes
–Non-updatable: read-only, periodically refreshed
Data Mart
–A data warehouse that is limited in scope
Need for Data Warehousing
–Integrated, company-wide view of high-quality information (from disparate databases)
–Separation of operational and informational systems and data (for improved performance)
Data Warehousing Processes
–Extract, Transform, Load (ETL)
–One company-wide warehouse
–Periodic extraction: data in the warehouse is not completely current
The ETL Process
Extract
–Incremental extract: capturing changes that have occurred since the last static extract
Transform
–Scrub (data cleansing)
Load and Index
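The three ETL steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the source rows, the `modified` change marker, and the function names are all hypothetical, and a real warehouse load would write to indexed tables rather than a dictionary.

```python
from datetime import datetime

# Hypothetical source rows; `modified` stands in for whatever change
# marker the operational system keeps.
SOURCE_ROWS = [
    {"oid": 1, "city": " chicago ", "modified": datetime(2024, 1, 5)},
    {"oid": 2, "city": "Los Angeles", "modified": datetime(2024, 2, 10)},
    {"oid": 3, "city": "SAN FRANCISCO", "modified": datetime(2024, 3, 1)},
]


def incremental_extract(rows, last_extract):
    """Extract: keep only rows changed since the previous extract."""
    return [r for r in rows if r["modified"] > last_extract]


def scrub(row):
    """Transform: a simple cleansing step that trims and upper-cases the city."""
    cleaned = dict(row)  # copy so the source row is left untouched
    cleaned["city"] = row["city"].strip().upper()
    return cleaned


def load(warehouse, rows):
    """Load and index: store rows keyed by order id for direct lookup."""
    for r in rows:
        warehouse[r["oid"]] = r
    return warehouse


warehouse = {}
changed = incremental_extract(SOURCE_ROWS, datetime(2024, 2, 1))
load(warehouse, [scrub(r) for r in changed])
print(sorted(warehouse))
```

Only the rows modified after the last extract date reach the warehouse, which is why warehouse data lags the operational data between refreshes.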
Data Warehouse Design
Star Schema
–Also called “dimensional model”
Fact table
–Contains detailed business data
–Example: an item sold in an order
Dimension tables
–A dimension is any category or subject of the business used in analyzing data, such as customers, employees, locations, products, or time periods.
–Dimension tables contain descriptions of the subjects of the business.
–Example: a sold item relates to many business subjects, such as salesperson, customer, product, and time period.
Star schema example
–Fact table provides statistics for sales broken down by product, period and store dimensions
–Dimension tables contain descriptions about the subjects of the business
Star schema with sample data
Example: Order Processing System
Entity-relationship diagram:
–Customer (CID, Cname, City, Rating)
–Order (OID, ODate, SalesPerson)
–Product (PID, Pname, Price)
–Customer Has Order: 1 to M
–Order to Product: M to M, with Qty
Star Schema
Fact table:
–Sales data
Analysis dimensions:
–Store location
–Customer rating
–Time period
–Product (with product category)
Star Schema
–FactTable: LocationCode, PeriodCode, Rating, PID, Qty, Amount
–Location Dimension: LocationCode, State, City
–CustomerRating Dimension: Rating, Description
–Product Dimension: PID, Pname, CategoryID
–Product Category: CategoryID, Description
–Period Dimension: PeriodCode, Year, Quarter
Can group by State, City
(Snowflake model)
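The schema above can be tried out in an in-memory SQLite database. The sketch below uses the table and column names from the slide, except that the Product Category table is written `ProductCategory`; the column types and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (LocationCode TEXT PRIMARY KEY, State TEXT, City TEXT);
CREATE TABLE Period (PeriodCode TEXT PRIMARY KEY, Year INTEGER, Quarter INTEGER);
CREATE TABLE CustomerRating (Rating TEXT PRIMARY KEY, Description TEXT);
CREATE TABLE ProductCategory (CategoryID TEXT PRIMARY KEY, Description TEXT);
CREATE TABLE Product (
    PID TEXT PRIMARY KEY,
    Pname TEXT,
    CategoryID TEXT REFERENCES ProductCategory(CategoryID)
);
CREATE TABLE FactTable (
    LocationCode TEXT REFERENCES Location(LocationCode),
    PeriodCode TEXT REFERENCES Period(PeriodCode),
    Rating TEXT REFERENCES CustomerRating(Rating),
    PID TEXT REFERENCES Product(PID),
    Qty INTEGER,
    Amount REAL
);
-- Hypothetical sample rows, not taken from the slides.
INSERT INTO Location VALUES
    ('L1', 'IL', 'Chicago'), ('L2', 'CA', 'Los Angeles'), ('L3', 'CA', 'San Francisco');
INSERT INTO Period VALUES ('2024Q1', 2024, 1);
INSERT INTO CustomerRating VALUES ('A', 'Excellent');
INSERT INTO ProductCategory VALUES ('C1', 'Footwear');
INSERT INTO Product VALUES ('P1', 'Shoes', 'C1');
INSERT INTO FactTable VALUES
    ('L1', '2024Q1', 'A', 'P1', 2, 100.0),
    ('L2', '2024Q1', 'A', 'P1', 1, 50.0),
    ('L3', '2024Q1', 'A', 'P1', 3, 150.0);
""")

# Group facts by an attribute of a dimension table (State).
sales_by_state = conn.execute("""
    SELECT l.State, SUM(f.Qty), SUM(f.Amount)
    FROM FactTable f
    JOIN Location l ON l.LocationCode = f.LocationCode
    GROUP BY l.State
    ORDER BY l.State
""").fetchall()

# Snowflake step: reach the category through the Product dimension.
sales_by_category = conn.execute("""
    SELECT c.Description, SUM(f.Qty)
    FROM FactTable f
    JOIN Product p ON p.PID = f.PID
    JOIN ProductCategory c ON c.CategoryID = p.CategoryID
    GROUP BY c.Description
""").fetchall()

print(sales_by_state)
print(sales_by_category)
```

The second query needs two joins because the category description lives one table further out from the fact table, which is what distinguishes the snowflake variant from a plain star.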
From SalesDB to MyDataWarehouse
Extract data from SalesDB:
–Create query to get the data
–Download to MyDataWarehouse: File/Import/Save as Table
Data scrubbing (cleansing) and transformation:
–Transform City to Location
–Transform ODate to Period
Load data to FactTable
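The two transformations can be sketched as small functions. Everything specific here is an assumption for illustration: the location codes, the year-quarter format of the period code, and the rule that Amount is Qty times Price are not stated in the slides.

```python
from datetime import date

# Hypothetical lookup from city name to the LocationCode used in the
# Location dimension.
LOCATION_CODES = {"CHICAGO": "L1", "LOS ANGELES": "L2", "SAN FRANCISCO": "L3"}


def city_to_location(city):
    """Scrub the city name, then map it to a location code."""
    return LOCATION_CODES[city.strip().upper()]


def odate_to_period(odate):
    """Map an order date to a year-quarter period code such as '2024Q2'."""
    quarter = (odate.month - 1) // 3 + 1
    return f"{odate.year}Q{quarter}"


def to_fact_row(row):
    """Build one FactTable row from one extracted SalesDB row."""
    return {
        "LocationCode": city_to_location(row["City"]),
        "PeriodCode": odate_to_period(row["ODate"]),
        "Rating": row["Rating"],
        "PID": row["PID"],
        "Qty": row["Qty"],
        "Amount": row["Qty"] * row["Price"],  # assumed definition of Amount
    }


extracted = {"City": " Chicago", "ODate": date(2024, 5, 17), "Rating": "A",
             "PID": "P1", "Qty": 3, "Price": 20.0}
print(to_fact_row(extracted))
```

Replacing City and ODate with codes is what lets the fact table join to the Location and Period dimensions.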
SQL GROUPING SETS
    SELECT CITY, RATING, COUNT(CID) FROM CUSTOMERS
    GROUP BY GROUPING SETS(CITY, RATING, (CITY, RATING), ())
    ORDER BY CITY;
Note: Compute the subtotals for every member in the GROUPING SETS. () indicates that an overall total is desired.
Results
CITY           RATING  COUNT(CID)
CHICAGO        A       1
CHICAGO        B       2
CHICAGO                3
LOS ANGELES    A       1
LOS ANGELES    C       1
LOS ANGELES            2
SAN FRANCISCO  A       2
SAN FRANCISCO  B       1
SAN FRANCISCO          3
               A       4
                       8
               B       3
               C       1
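One way to see what GROUPING SETS computes is to write each grouping set as its own query and stack the results. SQLite, used here through Python's sqlite3 module, does not provide GROUPING SETS, so the sketch below swaps in an equivalent UNION ALL of four GROUP BY queries. The customer rows are hypothetical, chosen only so that the counts match the results above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (CID TEXT, CITY TEXT, RATING TEXT)")
# Hypothetical rows chosen to reproduce the counts in the results table.
conn.executemany("INSERT INTO CUSTOMERS VALUES (?, ?, ?)", [
    ("C1", "CHICAGO", "A"), ("C2", "CHICAGO", "B"), ("C3", "CHICAGO", "B"),
    ("C4", "LOS ANGELES", "A"), ("C5", "LOS ANGELES", "C"),
    ("C6", "SAN FRANCISCO", "A"), ("C7", "SAN FRANCISCO", "A"),
    ("C8", "SAN FRANCISCO", "B"),
])

# One SELECT per grouping set: CITY, RATING, (CITY, RATING), and ().
# NULL marks a column that the grouping set does not include.
rows = conn.execute("""
    SELECT CITY, NULL AS RATING, COUNT(CID) FROM CUSTOMERS GROUP BY CITY
    UNION ALL
    SELECT NULL, RATING, COUNT(CID) FROM CUSTOMERS GROUP BY RATING
    UNION ALL
    SELECT CITY, RATING, COUNT(CID) FROM CUSTOMERS GROUP BY CITY, RATING
    UNION ALL
    SELECT NULL, NULL, COUNT(CID) FROM CUSTOMERS
""").fetchall()

for row in rows:
    print(row)
```

The blank cells in the results table correspond to the NULL values here: a blank RATING marks a per-city subtotal, a blank CITY a per-rating subtotal, and both blank the overall total.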
SQL CUBE
Perform aggregations for all possible combinations of columns indicated.
    SELECT CITY, RATING, COUNT(CID) FROM CUSTOMERS
    GROUP BY CUBE(CITY, RATING)
    ORDER BY CITY, RATING;
Results
CITY           RATING  COUNT(CID)
CHICAGO        A       1
CHICAGO        B       2
CHICAGO                3
LOS ANGELES    A       1
LOS ANGELES    C       1
LOS ANGELES            2
SAN FRANCISCO  A       2
SAN FRANCISCO  B       1
SAN FRANCISCO          3
               A       4
               B       3
               C       1
                       8
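"All possible combinations of columns" means every subset of the listed columns, so n columns yield 2^n grouping sets. The sketch below imitates that in plain Python rather than SQL; the `cube` function and the customer rows are hypothetical, with the rows chosen to match the counts above.

```python
from collections import Counter
from itertools import combinations

COLUMNS = ("CITY", "RATING")
# Hypothetical rows chosen to reproduce the counts in the results table.
CUSTOMERS = [
    dict(zip(("CID", "CITY", "RATING"), r)) for r in [
        ("C1", "CHICAGO", "A"), ("C2", "CHICAGO", "B"), ("C3", "CHICAGO", "B"),
        ("C4", "LOS ANGELES", "A"), ("C5", "LOS ANGELES", "C"),
        ("C6", "SAN FRANCISCO", "A"), ("C7", "SAN FRANCISCO", "A"),
        ("C8", "SAN FRANCISCO", "B"),
    ]
]


def cube(rows, columns):
    """Count rows for every subset of `columns`.

    None marks a column that the current subset aggregates away.
    """
    counts = Counter()
    for size in range(len(columns) + 1):
        for kept in combinations(columns, size):
            for row in rows:
                key = tuple(row[c] if c in kept else None for c in columns)
                counts[key] += 1
    return dict(counts)


result = cube(CUSTOMERS, COLUMNS)
for key in sorted(result, key=lambda k: tuple((v is None, v or "") for v in k)):
    print(key, result[key])
```

With two columns the subsets are (), (CITY), (RATING), and (CITY, RATING), which is why CUBE(CITY, RATING) returns the same rows as the four-member GROUPING SETS query.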
SQL ROLLUP
The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are performed for each of the columns except the far-right column.
    SELECT CITY, RATING, COUNT(CID) FROM CUSTOMERS
    GROUP BY ROLLUP(CITY, RATING)
    ORDER BY CITY, RATING;
Results
CITY           RATING  COUNT(CID)
CHICAGO        A       1
CHICAGO        B       2
CHICAGO                3
LOS ANGELES    A       1
LOS ANGELES    C       1
LOS ANGELES            2
SAN FRANCISCO  A       2
SAN FRANCISCO  B       1
SAN FRANCISCO          3
                       8
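ROLLUP can be pictured as grouping by each leading prefix of the column list: (CITY, RATING), then (CITY), then the empty list for the grand total. The plain-Python sketch below follows that reading; the `rollup` function and the customer rows are hypothetical, with the rows chosen to match the counts above.

```python
from collections import Counter

COLUMNS = ("CITY", "RATING")
# Hypothetical rows chosen to reproduce the counts in the results table.
CUSTOMERS = [
    dict(zip(("CID", "CITY", "RATING"), r)) for r in [
        ("C1", "CHICAGO", "A"), ("C2", "CHICAGO", "B"), ("C3", "CHICAGO", "B"),
        ("C4", "LOS ANGELES", "A"), ("C5", "LOS ANGELES", "C"),
        ("C6", "SAN FRANCISCO", "A"), ("C7", "SAN FRANCISCO", "A"),
        ("C8", "SAN FRANCISCO", "B"),
    ]
]


def rollup(rows, columns):
    """Count rows for each leading prefix of `columns`, longest first.

    None marks a column dropped from the current prefix.
    """
    counts = Counter()
    for depth in range(len(columns), -1, -1):
        for row in rows:
            key = tuple(
                row[c] if i < depth else None for i, c in enumerate(columns)
            )
            counts[key] += 1
    return dict(counts)


result = rollup(CUSTOMERS, COLUMNS)
for key, count in result.items():
    print(key, count)
```

Because RATING is never grouped on its own, the per-rating subtotals that CUBE produces are absent here, matching the shorter results table.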