Data Warehousing
Definition
Data Warehouse: A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  Subject-oriented: e.g. customers, patients, students, products
  Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
  Time-variant: Contains a time dimension so that it may be used to study trends and changes
  Non-updatable: Read-only, periodically refreshed
Data Mart: A data warehouse that is limited in scope
Need for Data Warehousing
  Integrated, company-wide view of high-quality information (from disparate databases)
  Separation of operational and informational (decision support) systems and data (for improved performance)
Data Warehouse Architectures
  Generic Two-Level Architecture
  Independent Data Mart
  All involve some form of extraction, transformation and loading (ETL)
Figure 11-2: Generic two-level data warehousing architecture
  One company-wide warehouse
  Periodic extraction: data is not completely current in the warehouse
Figure 11-3: Independent data mart data warehousing architecture
  Data marts: Mini-warehouses, limited in scope
  Separate ETL for each independent data mart
  Data access complexity due to multiple data marts
The ETL Process
ETL = Extract, transform, and load
  Capture/Extract
  Scrub (data cleansing)
  Transform: Convert data from the format of the source to the format of the data warehouse
  Load and Index
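The four steps can be sketched as a small pipeline. This is only an illustrative Python sketch: the source rows, the field names, and the cleansing rules (drop empty names and exact duplicates) are assumptions and do not come from the slides.

```python
# Hypothetical source rows from an operational system (not from the slides).
source_rows = [
    {"cust_id": "001", "name": " alice ", "state": "ca"},
    {"cust_id": "002", "name": "Bob", "state": "NY"},
    {"cust_id": "002", "name": "Bob", "state": "NY"},   # duplicate record
    {"cust_id": "003", "name": "", "state": "TX"},      # missing name
]

def extract(rows):
    """Capture/Extract: take a snapshot of the source data."""
    return [dict(r) for r in rows]

def scrub(rows):
    """Scrub: drop records with missing names and exact duplicates."""
    seen, clean = set(), []
    for r in rows:
        key = (r["cust_id"], r["name"], r["state"])
        if r["name"].strip() and key not in seen:
            seen.add(key)
            clean.append(r)
    return clean

def transform(rows):
    """Transform: convert the source format to the (assumed) warehouse format."""
    return [
        {"customer_key": int(r["cust_id"]),
         "customer_name": r["name"].strip().title(),
         "state_code": r["state"].upper()}
        for r in rows
    ]

def load_and_index(rows):
    """Load and Index: store the rows and build a lookup index on the key."""
    table = list(rows)
    index = {r["customer_key"]: pos for pos, r in enumerate(table)}
    return table, index

warehouse, key_index = load_and_index(transform(scrub(extract(source_rows))))
print(warehouse)
```

Keeping each step as its own function mirrors the slide's ordering; a real ETL tool would typically log rejected records rather than silently dropping them.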
Figure 11-10: Steps in data reconciliation
  Load/Index: place transformed data into the warehouse and create indexes
  Refresh mode: bulk rewriting of target data at periodic intervals
  Update mode: only changes in source data are written to the data warehouse
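The difference between the two load modes can be illustrated with a short Python sketch. The dictionaries below are hypothetical stand-ins for a source table and a warehouse table keyed by primary key; the sketch ignores deletions in the source, which a real update-mode load would also need to handle.

```python
def refresh(target, source):
    """Refresh mode: bulk-rewrite the whole target from the source."""
    target.clear()
    target.update(source)
    return len(source)          # every source row is written

def update(target, source):
    """Update mode: write only rows that are new or changed in the source."""
    written = 0
    for key, row in source.items():
        if target.get(key) != row:
            target[key] = row
            written += 1
    return written

# Hypothetical data keyed by a primary key.
warehouse = {1: "Alice, CA", 2: "Bob, NY"}
source = {1: "Alice, CA", 2: "Bob, TX", 3: "Carol, WA"}

wh_refresh = dict(warehouse)
wh_update = dict(warehouse)
rows_refresh = refresh(wh_refresh, source)   # rewrites all 3 rows
rows_update = update(wh_update, source)      # writes the changed row and the new row
print(rows_refresh, rows_update)
```

Both modes end with the same warehouse contents here; they differ in how many rows had to be written to get there.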
Index
  Bitmap index
  Join index
Figure 6-8: Bitmap index organization
  Bitmap saves on space requirements
  Rows: possible values of the attribute
  Columns: table rows
  Each bit indicates whether the attribute of a table row has that value
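A minimal Python sketch of the idea, assuming a hypothetical six-row table with a single `rating` attribute. Each distinct value gets one bitmap with one bit per table row; Python integers are used as bit strings purely for convenience.

```python
# Hypothetical table: one attribute value per row (list position = row ID).
ratings = ["A", "B", "A", "C", "B", "A"]

def build_bitmap_index(values):
    """Return {attribute value: bitmap}, one bit per table row.

    Bit i of a bitmap is 1 when row i holds that value.
    """
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_for(bitmap):
    """Decode a bitmap back into a list of row IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

bitmap_index = build_bitmap_index(ratings)
# Rows where rating is A or C: OR the two bitmaps together.
a_or_c = rows_for(bitmap_index["A"] | bitmap_index["C"])
print(a_or_c)
```

Combining predicates becomes a bitwise OR or AND over the bitmaps, which is why this kind of index suits attributes with few distinct values.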
Figure 6-9: Join indexes speed up join operations
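One way to picture a join index is as a precomputed list of matching row-ID pairs. The Python sketch below assumes two hypothetical tables, `stores` and `sales`, joined on `store_id`; the names and data are illustrative only.

```python
# Hypothetical tables as lists of rows; list position acts as the row ID.
stores = [{"store_id": 10, "city": "Dayton"},
          {"store_id": 20, "city": "Austin"}]
sales = [{"store_id": 20, "amount": 5},
         {"store_id": 10, "amount": 7},
         {"store_id": 20, "amount": 3}]

def build_join_index(left, right, key):
    """Precompute (left row ID, right row ID) pairs whose keys match."""
    right_ids = {}
    for rid, row in enumerate(right):
        right_ids.setdefault(row[key], []).append(rid)
    return [(lid, rid)
            for lid, row in enumerate(left)
            for rid in right_ids.get(row[key], [])]

join_index = build_join_index(stores, sales, "store_id")

# At query time the join just follows the stored row-ID pairs.
joined = [(stores[l]["city"], sales[r]["amount"]) for l, r in join_index]
print(joined)
```

The matching work is done once when the index is built, so later joins avoid comparing key values again.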
Star Schema for Data Warehouse
Also called “dimensional model”
Objectives
  Ease of use for decision support applications
  Fast response to predefined user queries
  Customized data for particular target audiences
Dimension: Any category used in analyzing data, such as time, geography, and product line
Figure 11-13: Components of a star schema
  Fact tables contain factual or quantitative data
  Dimension tables contain descriptions about the subjects of the business
  1:N relationship between dimension tables and fact tables
  Dimension tables are denormalized to maximize performance
  Excellent for ad-hoc queries, but bad for online transaction processing
Figure 11-14 Star schema example Fact table provides statistics for sales broken down by product, period and store dimensions
Figure 11-15 Star schema with sample data
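A star schema like the one described can be tried out with Python's built-in `sqlite3` module. The table names, column names, and sample rows below are hypothetical and are not the data shown in the figure; the sketch only illustrates one fact table joined to product, period, and store dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one foreign key per dimension plus quantitative measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO store   VALUES (1, 'Dayton'), (2, 'Austin');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0),
    (2, 1, 2, 3, 90.0),   (1, 1, 2, 4, 40.0);
""")

# Typical star-schema query: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
    SELECT p.description, s.city, SUM(f.units_sold)
    FROM sales f
    JOIN product p ON p.product_key = f.product_key
    JOIN store   s ON s.store_key   = f.store_key
    GROUP BY p.description, s.city
    ORDER BY p.description, s.city
""").fetchall()
print(rows)
```

Each dimension row can match many fact rows, which is the 1:N relationship between dimension tables and the fact table.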
On-Line Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
  Relational OLAP (ROLAP): Traditional relational representation
  Multidimensional OLAP (MOLAP): Cube structure
OLAP Operations
  Cube slicing: come up with a 2-D view of the data
  Drill-down: going from summary to more detailed views
Figure 11-23 Slicing a data cube
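Slicing can be sketched in a few lines of Python. The cube here is a hypothetical dictionary keyed by (product, period, store); fixing one dimension leaves a two-dimensional view over the remaining two.

```python
# Hypothetical cube: (product, period, store) -> units sold.
cube = {
    ("Widget", "Q1", "Dayton"): 10, ("Widget", "Q2", "Dayton"): 5,
    ("Widget", "Q1", "Austin"): 4,  ("Gadget", "Q1", "Austin"): 3,
}

def slice_cube(cube, axis, value):
    """Fix one dimension at `value`; return a 2-D view keyed by the other two."""
    return {key[:axis] + key[axis + 1:]: measure
            for key, measure in cube.items() if key[axis] == value}

# Slice on product = "Widget": the result is a period-by-store table.
widget_slice = slice_cube(cube, 0, "Widget")
print(widget_slice)
```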
Figure 11-24: Example of drill-down
  Summary report
  Drill-down with color added
  Starting with summary data, users can obtain details for particular cells
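As a rough Python sketch of drill-down, assume hypothetical detail rows of (brand, package size, sales). The summary report totals sales per brand, and drilling into one brand exposes the package-size breakdown behind that summary cell. The brand names and figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (brand, package_size, sales).
detail = [("BrandA", "2-pack", 75), ("BrandA", "3-pack", 100),
          ("BrandB", "2-pack", 60)]

def summarize(rows):
    """Summary report: total sales per brand."""
    totals = defaultdict(int)
    for brand, _size, amount in rows:
        totals[brand] += amount
    return dict(totals)

def drill_down(rows, brand):
    """Detail behind one summary cell: sales per package size for a brand."""
    totals = defaultdict(int)
    for b, size, amount in rows:
        if b == brand:
            totals[size] += amount
    return dict(totals)

summary = summarize(detail)
brand_a_detail = drill_down(detail, "BrandA")
print(summary, brand_a_detail)
```

The detail values for a brand always add back up to that brand's summary cell.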
Data Mining and Visualization
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
Goals:
  Explain observed events or conditions
  Confirm hypotheses
  Explore data for new or unexpected relationships
Techniques
  Statistical regression
  Decision tree induction
  Clustering and signal processing
  Affinity
  Sequence association
  Case-based reasoning
  Rule discovery
  Neural nets
  Fractals
Data visualization: representing data in graphical/multimedia formats for analysis
Pivot Table
  Excel: Drill Down, Roll Up
  Access: CrossTab query
SQL GROUPING SETS
  SELECT CITY, RATING, COUNT(CID)
  FROM HCUSTOMERS
  GROUP BY GROUPING SETS(CITY, RATING, (CITY, RATING), ())
  ORDER BY CITY;
Note: () indicates that an overall total is desired.
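To see what rows such a query produces, the grouping can be emulated in plain Python. The `hcustomers` rows below are hypothetical sample data; only the table and column names come from the query. Columns left out of a grouping set are reported as `None`, mirroring the NULLs a SQL engine returns for them.

```python
from collections import Counter

# Hypothetical HCUSTOMERS rows: (CID, CITY, RATING).
hcustomers = [(1, "Chicago", "A"), (2, "Chicago", "B"),
              (3, "Chicago", "A"), (4, "Denver", "A")]
COLUMNS = {"CITY": 1, "RATING": 2}   # column name -> position in a row

def grouping_sets(rows, sets):
    """Count rows per group, once for each grouping set."""
    result = []
    for cols in sets:
        counts = Counter(
            tuple(row[COLUMNS[c]] if c in cols else None for c in COLUMNS)
            for row in rows)
        result.extend((city, rating, n) for (city, rating), n in counts.items())
    return result

report = grouping_sets(
    hcustomers, [("CITY",), ("RATING",), ("CITY", "RATING"), ()])
for line in report:
    print(line)
```

The empty grouping set `()` yields the single overall-total row, here `(None, None, 4)`.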
SQL CUBE
Performs aggregations for all possible combinations of the columns indicated.
  SELECT CITY, RATING, COUNT(CID)
  FROM HCUSTOMERS
  GROUP BY CUBE(CITY, RATING)
  ORDER BY CITY, RATING;
SQL ROLLUP
The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are performed for each of the columns except the far-right column.
  SELECT CITY, RATING, COUNT(CID)
  FROM HCUSTOMERS
  GROUP BY ROLLUP(CITY, RATING)
  ORDER BY CITY, RATING;
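CUBE and ROLLUP are both shorthand for a list of grouping sets. The Python sketch below expands a column list into those sets; the helper names `cube_sets` and `rollup_sets` are made up for illustration.

```python
from itertools import combinations

def cube_sets(columns):
    """CUBE: every subset of the columns, largest first."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def rollup_sets(columns):
    """ROLLUP: successive prefixes, dropping the far-right column each time."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

print(cube_sets(["CITY", "RATING"]))
print(rollup_sets(["CITY", "RATING"]))
```

For two columns, CUBE yields four grouping sets and ROLLUP yields three: ROLLUP omits the RATING-only subtotal, which is why the far-right column gets no subtotal of its own.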