CS 157B: Database Management Systems II March 20 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron Mak
Unofficial Field Trip
Computer History Museum in Mt. View
- Experience a fully restored IBM 1401 mainframe computer from the early 1960s in operation.
  General info:
  My summer seminar:
  Restoration: http://ed-thelen.org/1401Project/1401RestorationPage.html
  Private demos at 11:45 and at 2:00.
- See a life-size working model of Charles Babbage’s Difference Engine in operation, a hand-cranked mechanical computer designed in the early 1800s.
  Public demo at 1:00.
- Saturday, March 23. Meet in the museum lobby at 11:15 AM.

Extra Credit!
- There will be extra credit if you participate in the unofficial field trip to the Computer History Museum.
- Up to 10 points added to your midterm score.
- To be decided: a quiz (via Desire2Learn) or an essay.

Extract, Transform, and Load (ETL)

Extract, Transform, and Load (ETL)
- You want only high-quality data in your data warehouse.
- What is high-quality data?
  correct
  unambiguous
  consistent
  complete
- The transform phase of ETL produces high-quality data.
  Cleaning the data.
  Conforming data from multiple sources.

Extract, Transform, and Load (ETL)
- In the real world, data is often dirty.
- Therefore, the ETL process must clean the source data when the data is being copied into the data warehouse.
- Cleaning operations
  Remove or correct corrupted data.
  Remove or correct invalid or inconsistent data.
    unexpected null values
    missing data
    values out of range
    misspellings
    referential integrity violations
    business rule violations

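The cleaning operations above can be sketched in a few lines of Python. This is only an illustration, not the ETL tool used in the course: the field names (`product_id`, `quantity`, `city`), the set of valid product IDs, and the corrections table are all hypothetical, and the rule "quantity must be positive" is an assumed business rule.

```python
from typing import Optional

# Hypothetical reference data; a real warehouse would look these up
# in the product dimension and in a curated corrections table.
VALID_PRODUCT_IDS = {101, 102, 103}
CITY_CORRECTIONS = {"san hose": "San Jose"}
REQUIRED_FIELDS = ("product_id", "quantity", "city")


def clean_row(row: dict) -> Optional[dict]:
    """Return a cleaned copy of row, or None if the row must be removed."""
    # Unexpected nulls and missing data: every required field must be present.
    for field in REQUIRED_FIELDS:
        if row.get(field) is None:
            return None
    # Values out of range (assumed business rule: quantity is positive).
    if row["quantity"] <= 0:
        return None
    # Referential integrity: the product must exist in the reference set.
    if row["product_id"] not in VALID_PRODUCT_IDS:
        return None
    # Misspellings: correct known variants instead of removing the row.
    cleaned = dict(row)
    city = row["city"].strip()
    cleaned["city"] = CITY_CORRECTIONS.get(city.lower(), city)
    return cleaned


source_rows = [
    {"product_id": 101, "quantity": 2, "city": "San Hose"},      # misspelled
    {"product_id": 102, "quantity": None, "city": "Toronto"},    # null value
    {"product_id": 999, "quantity": 1, "city": "Vancouver"},     # unknown product
    {"product_id": 103, "quantity": -5, "city": "Toronto"},      # out of range
]
cleaned_rows = [r for r in map(clean_row, source_rows) if r is not None]
print(cleaned_rows)
```

Only the first row survives, with its city name corrected. Whether a bad row is removed or corrected is a policy decision; this sketch removes rows it cannot repair.
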
Extract, Transform, and Load (ETL)
- Data from multiple sources may need to be conformed to be usable together in the data warehouse.
- Type conversion
  Example: Convert a user ID in a data source from a string to a long integer to match with the user ID in other data sources.
- Format conversion
  Example: Dates and times, names
- Align field and attribute names
  Examples: customer_name vs. name_of_client
            store vs. retail_outlet

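As a rough sketch of conforming, the following Python function applies the three steps above to records from two hypothetical sources. The field names `user_id` and `sale_date`, the two date formats, and the choice of `customer_name` and `store` as the warehouse names are assumptions made for illustration.

```python
from datetime import date, datetime

# Assumed mapping from source-specific names to the warehouse names.
FIELD_ALIASES = {"name_of_client": "customer_name", "retail_outlet": "store"}
# Assumed date formats used by the two sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text: str) -> date:
    """Format conversion: accept any known source format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def conform(record: dict) -> dict:
    # Align field names.
    out = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
    # Type conversion: user IDs become integers regardless of source type.
    out["user_id"] = int(out["user_id"])
    # Format conversion: dates become date objects.
    out["sale_date"] = parse_date(out["sale_date"])
    return out


source_a = {"user_id": "42", "customer_name": "Ada", "store": "S1",
            "sale_date": "2013-03-20"}
source_b = {"user_id": 42, "name_of_client": "Ada", "retail_outlet": "S1",
            "sale_date": "03/20/2013"}

print(conform(source_a) == conform(source_b))
```

After conforming, the two records compare equal, so they can be matched on `user_id` and used together.
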
ETL: Semantic Mappings
- Unit conversions
  Example: feet vs. yards, miles vs. kilometers
- Structural mappings
  Example: federal, state, city, district vs. kingdom, region, parish
- Temporal mappings
  Example: One data source has a measure taken once an hour, another data source has the same measure taken daily.
- Spatial mappings
  Example: street addresses vs. GIS coordinates (latitude + longitude) vs. political boundaries (cities, districts, counties, etc.)

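Two of these mappings are simple enough to sketch in Python: a unit conversion, and a temporal mapping that brings an hourly measure to the daily grain of another source. The sample readings are made up, and averaging is an assumption; depending on the measure, a sum, minimum, or last value might be the right way to combine hourly values.

```python
from collections import defaultdict
from datetime import datetime

KM_PER_MILE = 1.609344  # exact by definition of the international mile


def miles_to_km(miles: float) -> float:
    """Unit conversion: miles to kilometers."""
    return miles * KM_PER_MILE


def hourly_to_daily_mean(readings):
    """Temporal mapping: average (timestamp, value) pairs per calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp.date()].append(value)
    return {day: sum(values) / len(values) for day, values in by_day.items()}


# Hypothetical hourly readings from one source.
hourly = [
    (datetime(2013, 3, 20, 0), 10.0),
    (datetime(2013, 3, 20, 1), 14.0),
    (datetime(2013, 3, 21, 0), 20.0),
]
daily = hourly_to_daily_mean(hourly)
print(daily)
print(miles_to_km(10))
```

Going the other way, from daily values to hourly ones, is not possible without extra assumptions, which is one reason the finest available grain is usually kept.
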
ETL: Semantic Mappings
- Spatio-temporal mappings
  Locations in space-time
- And even more complex mappings
  May require the use of ontologies.
    shared vocabularies
    knowledge structures
    models of reality
    etc.

Dimensional Modeling
- Fact tables
  Contain values that are measures, usually numeric.
  Example: the number of sales
- Dimension tables
  Contain the context for the measures.
  Examples: time, location, product
- Dimensions are usually grouped and hierarchical.
  Example: western locations, eastern locations
  Example: yearly, quarterly, monthly, weekly, daily, hourly
- Often denormalized for query performance.
  Many queries, few updates.

Dimensional Modeling
Design criteria
- What are the facts?
  What are we measuring?
  Example: number of sales
- What is the grain, or granularity, of the facts?
  Determined by the dimensions.
  All measurements in a fact table must be at the same grain.
  Example: sales figures collected at the point of sale
- What are the dimensions?
  What context do we need to provide for the measures in the fact table?
  Examples: stores, dates, products

Dimensional Modeling
Implementation
- Star schema
  Measures: number of units sold
  Dimensions: date, store, product

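A star schema with this measure and these three dimensions might look like the following, built in an in-memory SQLite database from Python. The table names, column names, and sample rows are hypothetical; the slide names only the measure (units sold) and the dimensions (date, store, product).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: the context for the measures (denormalized).
    CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, full_date TEXT,
                              quarter TEXT, year INTEGER);
    CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT,
                              country TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT,
                              category TEXT);
    -- Fact table: one row per date, store, and product (the grain).
    CREATE TABLE sales_fact (
        date_key    INTEGER REFERENCES date_dim(date_key),
        store_key   INTEGER REFERENCES store_dim(store_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        units_sold  INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)",
                 [(1, "2013-01-15", "Q1", 2013), (2, "2013-04-10", "Q2", 2013)])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(1, "Toronto", "Canada"), (2, "Vancouver", "Canada")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Laptop", "computer"), (2, "TV", "home entertainment")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 5), (1, 2, 2, 3), (2, 1, 2, 4), (2, 2, 1, 7)])

# A typical query joins the fact table to a dimension and aggregates.
units_by_city = conn.execute("""
    SELECT s.city, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN store_dim AS s ON f.store_key = s.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(units_by_city)
```

The fact table sits at the center and holds only foreign keys and measures; every descriptive attribute lives in a dimension table, one join away.
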
Online Analytical Processing (OLAP)
- A common type of business analysis.
- Also used to analyze scientific data.
- Visualize data in a multidimensional manner.
- Analytical processes that involve manipulating data along different dimensions.
- The OLAP cube.
- “What happened recently, and why?”

Online Analytical Processing (OLAP)
OLAP operations
- slice and dice
- drill up, drill down
- drill across, drill through
- pivot

Online Analytical Processing (OLAP)
Slice
- View or manipulate the data along a subset of the dimensions.
- Consider only data from the first quarter.

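A slice can be sketched with a tiny cube held in a Python dictionary. The cube layout (keys are `(quarter, city, item)` tuples, values are units sold) and the numbers are made up for illustration; an OLAP server would do this against the warehouse.

```python
# Hypothetical cube: (quarter, city, item) -> units sold.
cube = {
    ("Q1", "Toronto", "computer"): 10,
    ("Q1", "Vancouver", "computer"): 8,
    ("Q2", "Toronto", "computer"): 12,
    ("Q2", "Vancouver", "security"): 5,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


# Consider only data from the first quarter (axis 0 is the time dimension).
q1 = slice_cube(cube, 0, "Q1")
print(q1)
```

The result has one fewer dimension than the cube: only city and item remain.
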
Online Analytical Processing (OLAP)
Dice
- View or manipulate the data within subsets of the ranges of the dimensions.
- Consider only data from Q1 and Q2 from only Toronto and Vancouver for only computers and home entertainment.

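The dice described above can be sketched as a filter that keeps a subset of values on every dimension. The cube contents below are invented; only the selection (Q1 and Q2, Toronto and Vancouver, computers and home entertainment) comes from the slide.

```python
# Hypothetical cube: (quarter, city, item) -> units sold.
cube = {
    ("Q1", "Toronto", "computer"): 10,
    ("Q2", "Vancouver", "home entertainment"): 6,
    ("Q3", "Toronto", "computer"): 9,
    ("Q1", "Chicago", "computer"): 7,
    ("Q2", "Toronto", "phone"): 4,
}


def dice(cube, quarters, cities, items):
    """Keep only cells whose value on every dimension is in the given subset."""
    return {
        key: measure
        for key, measure in cube.items()
        if key[0] in quarters and key[1] in cities and key[2] in items
    }


sub_cube = dice(
    cube,
    quarters={"Q1", "Q2"},
    cities={"Toronto", "Vancouver"},
    items={"computer", "home entertainment"},
)
print(sub_cube)
```

Unlike a slice, a dice keeps all three dimensions; the result is a smaller cube.
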
Online Analytical Processing (OLAP)
Drill down
- View or manipulate a dimension at a lower level of detail.
- Drill down on the time dimension from quarters to months.

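Drilling down from quarters to months can be sketched as follows, assuming the facts are stored at the monthly grain and that a month-to-quarter hierarchy is available. The figures are made up.

```python
from collections import defaultdict

# Assumed time hierarchy and hypothetical monthly measures.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
monthly_sales = {"Jan": 3, "Feb": 4, "Mar": 5, "Apr": 6}


def by_quarter(monthly):
    """The summary view: one total per quarter."""
    totals = defaultdict(int)
    for month, units in monthly.items():
        totals[MONTH_TO_QUARTER[month]] += units
    return dict(totals)


def drill_down(monthly, quarter):
    """The detailed view: the months that make up one quarter."""
    return {
        month: units
        for month, units in monthly.items()
        if MONTH_TO_QUARTER[month] == quarter
    }


print(by_quarter(monthly_sales))
print(drill_down(monthly_sales, "Q1"))
```

Drill down is only possible because the stored grain (months) is finer than the level being viewed (quarters).
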
Online Analytical Processing (OLAP)
Drill up
- “Roll up” (aggregate) data to a higher level along a dimension.
- Sum up the cities by country.

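Rolling up cities to countries can be sketched as an aggregation over the location hierarchy. The city-to-country mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Assumed location hierarchy and hypothetical per-city measures.
CITY_TO_COUNTRY = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}
sales_by_city = {"Toronto": 9, "Vancouver": 10, "Chicago": 7, "New York": 12}


def roll_up(measures, parent_of):
    """Aggregate (sum) each measure into its parent level."""
    totals = defaultdict(int)
    for member, value in measures.items():
        totals[parent_of[member]] += value
    return dict(totals)


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
print(sales_by_country)
```

Summation fits an additive measure such as units sold; other measures may need a different aggregate.
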
Online Analytical Processing (OLAP)
- Drill across
  Integrate data from more than one fact table.
- Drill through
  Access the database tables that underlie the OLAP cube.

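Drill across can be sketched by lining up two fact tables on the dimensions they share. The second fact table (returns) and all the numbers are hypothetical; the sketch assumes both tables use the same `(quarter, city)` keys at the same grain.

```python
# Two hypothetical fact tables keyed by the same dimensions: (quarter, city).
sales_fact = {("Q1", "Toronto"): 10, ("Q1", "Vancouver"): 8}
returns_fact = {("Q1", "Toronto"): 1}


def drill_across(sales, returns):
    """Combine measures from both fact tables on their shared dimension keys."""
    keys = sorted(set(sales) | set(returns))
    return {
        key: {
            "units_sold": sales.get(key, 0),
            "units_returned": returns.get(key, 0),
        }
        for key in keys
    }


combined = drill_across(sales_fact, returns_fact)
print(combined)
```

This only works when the dimensions have been conformed during ETL, so that a key such as `("Q1", "Toronto")` means the same thing in both tables.
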
Online Analytical Processing (OLAP)
Pivot
- Rotate the axes (dimensions) to present a different view.

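For a two-dimensional view, a pivot amounts to swapping rows and columns. The sketch below assumes a nested dictionary as the view (rows are cities, columns are quarters) with made-up values.

```python
# Hypothetical view: rows are cities, columns are quarters.
table = {
    "Toronto": {"Q1": 10, "Q2": 12},
    "Vancouver": {"Q1": 8, "Q2": 5},
}


def pivot(table):
    """Rotate the view so that columns become rows and rows become columns."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


by_quarter_view = pivot(table)
print(by_quarter_view)
```

The data is unchanged; pivoting twice returns the original view.
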
OLAP Summary

DW Summary
- Plus: dashboards and scorecards

Cognos
- Business intelligence (BI) tool from IBM.
  Queries and reports
  Dashboards and scorecards
  OLAP
  Data mining
    predictive analysis
- Cognos Business Intelligence 10 is available in the IBM Academic Cloud along with a sample data warehouse.
  I will create student accounts.
- Online tutorials
- Cognos demo