Download presentation
Presentation is loading. Please wait.
Published byAugustus Hensley Modified over 9 years ago
1
Chapter 18: Data Analysis and Mining Kat Powell
2
Chapter 18: Data Analysis and Mining ➔ Decision Support Systems ➔ Data Analysis and OLAP ➔ Data Warehousing ➔ Data Mining
3
Decision Support Systems ➔ Make business decisions, often based on data collected by on-line transaction-processing systems ➔ What items to stock? ➔ To whom to send advertisements? ➔ Examples of data used for making decisions ➔ Retail sales transaction details ➔ Customer profiles (income, age, gender, etc.)
4
Decision-Support Systems Overview ➔ Data analysis tasks are simplified by specialized tools and SQL extensions ➔ Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases. ➔ A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site. ✔ Important for large businesses that generate data from multiple divisions, possibly at multiple sites
5
Data Analysis and OLAP ➔ OnLine Analytical Processing (OLAP) ➔ Online interactive tools for analysis and summary of data that present it to humans in understandable format ➔ Multidimensional data – data modeled as dimension attributes and measure attributes in table summary format – Measure attributes measure some value and can be aggregated upon – Dimension attributes define the dimensions on which measure attributes (or aggregates thereof) are viewed
6
Cross Tabulation (aka cross-tab) of sales by item-name and color ➔ item_name, color, and size are the dimension attributes of the sales relation ➔ the number values in each cell represent measure attributes ➔ Extra column and row store the cell totals for each column and row Summary of sales
7
Relational Representation of Cross-tabs ➔ Cross-tabs also can be represented as relations ➔ The value 'all' is used to represent an aggregate, that is, the set of all values for an attribute ➔ The SQL:1999 standard actually uses 'null' values in place of 'all' despite confusion with regular 'null' values
8
Data Cube (Sales relation) ➔ Multidimensional generalization of a cross-tab ➔ Can have n dimensions; we see 3 here ➔ Cross-tabs can be used as views on a data cube ➔ All cells contain values...even the invisible (inner) cells
9
Online Analytical Processing ➔ Pivoting: changing the dimensions used in a cross-tab EX: An analyst may select a cross-tab on item_name and size OR on color and size ➔ Slicing: creating a cross-tab for fixed values only ● Sometimes called dicing, particularly when values for multiple dimensions are fixed. EX: An analyst may wish to see cross-tab on item_name and color for a fixed value of size (for instance...large), instead of the sum across all sizes.
10
Online Analytical Processing – OLAP systems permit users to view data at any desired level of granularity – Rollup: moving from finer-granularity data to a coarser granularity EX: Starting from the data cube on the sales table, got the cross-tab by rolling up on the attribute 'size' – Drill down: The opposite operation - that of moving from coarser-granularity data to finer- granularity data EX: usually generated from the original data like drilling for fine, rare jewels
11
Data Warehousing ➔ Data sources often store only current data, not historical data ➔ Corporate decision making requires a unified view of all organizational data, including historical data ➔ A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site ➔ Greatly simplifies querying, permits study of historical trends ➔ Shifts decision support query load away from transaction processing systems
12
Data Warehousing
13
Warehouse Design Issues ➔W➔When and how to gather data ➔S➔Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g. at night) ➔D➔Destination driven architecture: warehouse periodically requests new information from data sources ➔K➔Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive ➔U➔Usually OK to have slightly out-of-date data at warehouse –D–D ata/updates are periodically downloaded form online transaction processing (OLTP) systems.
14
More Warehouse Design Issues ➔ What schema to use ➔ Schema integration – Data cleansing ● E.g. correct mistakes in addresses (misspellings, zip code errors, etc.) ● Merge address lists from different sources and purge duplicates – How to propagate updates ● Warehouse schema may be a (materialized) view of schema from data sources
15
Data Mining ➔ The process of semi-automatically analyzing large databases to find useful patterns ➔ Prediction based on past history ➔ Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age,..) and past history ➔ Predict if a pattern of phone calling card usage is likely to be fraudulent
16
Data Mining (Cont.) – Some other examples of prediction mechanisms: ● Classification – Given a new item whose class is unknown, predict to which class it belongs ● Regression formulae – Given a set of mappings for an unknown function, predict the function result for a new parameter value
17
Data Mining (Cont.) – Descriptive Patterns ● Associations – Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too. ● Associations may be used as a first step in detecting causation – E.g. association between exposure to chemical X and cancer ● Clusters – E.g. typhoid cases were clustered in an area surrounding a contaminated well – remains important in detecting epidemics
18
Other Types of Mining ➔ Text mining: application of data mining to textual documents ➔ cluster Web pages to find related pages ➔ cluster pages a user has visited to organize their visit history ➔ classify Web pages automatically into a Web directory
19
Other Types of Mining ➔ Data visualization systems help users examine large volumes of data and detect patterns visually ➔ Can visually encode large amounts of information on a single screen ➔ Humans are very good a detecting visual patterns
20
The END
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.