Data Analysis. Overview Traditional database systems are tuned to many, small, simple queries. Some applications use fewer, more time-consuming, analytic.

Slides:



Advertisements
Similar presentations
Data Warehousing and Data Mining J. G. Zheng May 20 th 2008 MIS Chapter 3.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Data Mining of Very Large Data
By: Mr Hashem Alaidaros MIS 211 Lecture 4 Title: Data Base Management System.
Chapter 18: Data Analysis and Mining Kat Powell. Chapter 18: Data Analysis and Mining ➔ Decision Support Systems ➔ Data Analysis and OLAP ➔ Data Warehousing.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Jennifer Widom On-Line Analytical Processing (OLAP) Introduction.
Data Mining, Frequent-Itemset Mining
2/10/05Salman Azhar: Database Systems1 On-Line Analytical Processing Salman Azhar Warehousing Data Cubes Data Mining These slides use some figures, definitions,
1 Data Warehousing. 2 Data Warehouse A data warehouse is a huge database that stores historical data Example: Store information about all sales of products.
OLAP. Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, analytic queries.
SLIDE 1IS 257 – Fall 2011 Data Mining and the Weka Toolkit University of California, Berkeley School of Information IS 257: Database Management.
Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
1 On-Line Application Processing Warehousing Data Cubes Data Mining.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
Lab3 CPIT 440 Data Mining and Warehouse.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
SLIDE 1IS 257 – Fall 2010 Data Mining and the Weka Toolkit University of California, Berkeley School of Information IS 257: Database Management.
On-Line Application Processing Warehousing Data Cubes Data Mining 1.
Online Analytical Processing (OLAP) Hweichao Lu CS157B-02 Spring 2007.
Garcia-Molina, Ullman, and Widom Chapter 10 Part 2.
Exercises Product ( pname, price, category, maker) Purchase (buyer, seller, store, product) Company (cname, stock price, country) Person( per-name, phone.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Business Intelligence. Topics Chart Online Analytical Process, OLAP – Excel’s Pivot table – Data visualization with dashboard Data warehousing Data Mining.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
On-Line Analytic Processing Chetan Meshram Class Id:221.
OnLine Analytical Processing (OLAP)
Cube Intro. Decision Making Effective decision making Goal: Choice that moves an organization closer to an agreed-on set of goals in a timely manner Goal:
20.5 Data Cubes Instructor : Dr. T.Y. Lin Chandrika Satyavolu 222.
1 On-Line Application Processing Warehousing Data Cubes Data Mining.
DIMENSIONAL MODELLING. Overview Clearly understand how the requirements definition determines data design Introduce dimensional modeling and contrast.
1 Data Warehouses BUAD/American University Data Warehouses.
BI Terminologies.
BUSINESS ANALYTICS AND DATA VISUALIZATION
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Fox MIS Spring 2011 Data Warehouse Week 8 Introduction of Data Warehouse Multidimensional Analysis: OLAP.
UNIT-II Principles of dimensional modeling
1 On-Line Analytic Processing Warehousing Data Cubes.
On-Line Application Processing Warehousing Data Cubes (Data Mining) (slides borrowed from Stanford)
CMPE 226 Database Systems October 21 Class Meeting Department of Computer Engineering San Jose State University Fall 2015 Instructor: Ron Mak
ADVANCED TOPICS IN RELATIONAL DATABASES Spring 2011 Instructor: Hassan Khosravi.
Business Intelligence. Topics Chart Online Analytical Process, OLAP – Excel’s Pivot table – Data visualization with dashboard Scenario Management Data.
OLAP On Line Analytic Processing. OLTP On Line Transaction Processing –support for ‘real-time’ processing of orders, bookings, sales –typically access.
INFORMATION INTEGRATION Sandeep Singh Balouria CS-257 ID- 101.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Aggregation SELECT Sum(price) FROM Product WHERE manufacturer=“Toyota” SQL supports several aggregation operations: SUM, MIN, MAX, AVG, COUNT Except COUNT,
1 Introduction to Database Systems, CS420 SQL Views and Indexes.
I am Xinyuan Niu I am here because I love to give presentations. Data Warehousing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
1 Database Systems, 8 th Edition Star Schema Data modeling technique –Maps multidimensional decision support data into relational database Creates.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ.
Data Warehouses and OLAP 1.  Review Questions ◦ Question 1: OLAP ◦ Question 2: Data Warehouses ◦ Question 3: Various Terms and Definitions ◦ Question.
Databases 2 On-Line Application Processing: Warehousing, Data Cubes, Data Mining.
CMPE 226 Database Systems April 12 Class Meeting Department of Computer Engineering San Jose State University Spring 2016 Instructor: Ron Mak
On-Line Application Processing
Data warehouse.
Chapter 11 Information Integration
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data Warehouse.
On-Line Analytic Processing
Data warehouse and OLAP
On-Line Analytic Processing
On-Line Analytical Processing (OLAP)
CMPE 226 Database Systems April 11 Class Meeting
On-Line Application Processing
Presentation transcript:

Data Analysis

Overview Traditional database systems are tuned to many, small, simple queries. Some applications use fewer, more time-consuming, analytic queries.

OLTP Most database operations involve On-Line Transaction Processing (OTLP). –Short, simple, frequent queries and/or modifications, each involving a small number of tuples. Examples: –Answering queries from a Web interface, –sales at cash registers, –selling airline tickets.

OLAP On-Line Analytical Processing (OLAP, or “analytic”) queries are, typically: –Few, but complex queries --- may run for hours. –Queries don't depend on having an absolutely up-to-date database. Example –Partition car sales by years, and look at only the last two years (2007 and 2008). Aggregate price.

Common Architecture Databases at store branches handle OLTP. Local store databases copied to a central data warehouse overnight. Analysts use the data warehouse for OLAP.

Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1.Fact table : a very large accumulation of facts such as sales. 2.Dimension tables : smaller, generally static information about the entities involved in the facts.

Example: Star Schema Suppose we want to record in a warehouse information about every car sale, e.g.: –serial number, –dealer who sold the car, –date of the sale, and –price charged. The fact table is thus: Sales(serialNo, dealer, day, price)

Example -- Continued The dimension tables include information about the autos, dealers, and days “dimensions”: Autos(serialNo, model, color) Dealers(name, city, province) Days(day, week, month, year) Days dimension table probably not stored.

Visualization – Star Schema Dimension Table (Day)Dimension Table (etc.) Dimension Table (Dealer)Dimension Table (Autos) Fact Table - Sales Dimension Attrs. Dependent Attrs., e.g. price

Data Cubes Keys of dimension tables are the dimensions of a hypercube. Example: for the Sales data, the three dimensions are serialNo, date, and dealer. Dependent attributes (e.g., price) appear at the points of the cube.

Visualization -- Data Cubes price dealer car date

Slicing and Dicing Slice –focus on particular partitions along (one or more) dimension i.e., focus on a particular slice of the cube WHERE clause in SQL Dice –partition the cube into smaller subcubes and aggregated the points in each GROUP BY clause in SQL

Slicing and Dicing in SQL SELECT grouping-attributes and aggregations FROM fact table joined with (zero or more) dimension tables WHERE certain attributes are compared with constants /* slicing */ GROUP BY grouping-attributes /* dicing */

Slicing and Dicing Example Suppose a particular car model, say ‘Gobi’, isn't selling as well as anticipated. How to analyze? Maybe it’s the color… Slice for ‘Gobi.’ Dice for color. SELECT color, SUM(price) FROM Sales NATURAL JOIN Autos WHERE model = 'Gobi' GROUP BY color;

Slicing and Dicing Example Suppose the previous query doesn't tell us much; each color produces about the same revenue. Since the query does not dice for time, we only see the total over all time for each color. –We may thus issue a revised query that also partitions time by month. SELECT color, month, SUM(price) FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days WHERE model = 'Gobi' GROUP BY color, month;

Slicing and Dicing Example We might discover that red Gobis haven’t sold well recently. Does this problem exists at all dealers, or only some dealers have had low sales of red Gobis? Let’s dice on dealer dimension. SELECT dealer, month, SUM(price) FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days WHERE model = 'Gobi' AND color = 'red‘ GROUP BY month, dealer;

Slicing and Dicing Example At this point, we find that the sales per month for red Gobis are so small that we cannot observe any trends easily. We decide it was a mistake to dice by month. A better partition by years, and look at only the last two years (2007 and 2008). SELECT dealer, year, SUM(price) FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days WHERE model = 'Gobi' AND color = 'red' AND (year = 2007 OR year = 2008) GROUP BY year, dealer;

Drill-Down and Roll-Up Previous examples illustrate two common patterns in sequences of queries that slice-and-dice the data cube. Drill-down is the process of partitioning more finely and/or focusing on specific values in certain dimensions. –Each of the example steps except the last is an instance of drill-down. Roll-up is the process of partitioning more coarsely. –The last step, where we grouped by years instead of months to eliminate the effect of randomness in the data, is an example of roll-up.

Marginals --- Data Cube w/Aggregation price dealer car date SUM over all Days

Cube operator CUBE(F) is an augmented table for fact table F –tuples (or points) added in CUBE(F) have a value, denoted * to each dimension –* represents aggregation along that dimension

Example Sales(serialNo, date, dealer, price) Sales(model, color, date, dealer, val, cnt) We replace the serialNo by model and color to which the serial number connects. Also, we will have as independent attributes, val for sum of prices and cnt for number of cars sold. serialNo dimension not well suited for the cube as summing the price over all dates, or over all dealers, but keeping the serial number fixed has no effect.

Example: CUBE(Sale) ('Gobi', 'red', ' ', 'Friendly Fred', 45000, 2) –On May 21, 2001, dealer Friendly Fred sold two red Gobis for a total of $45,000. –This tuple is in both Sales and CUBE(Sales). ('Gobi', *, ' ', 'Friendly Fred', , 7) –On May 21, 2001, Friendly Fred sold seven Gobis of all colors, for a total price of $152,000. –Note that this tuple is in CUBE(Sales) but not in Sales.

Example: CUBE(Sale) ('Gobi', *, ' ', *, , 100) –On May 21, 2001, there were 100 Gobis sold by all the dealers, and the total price of those Gobis was $2,348,000. ('Gobi', *, *, *, , 58000) –Over all time, dealers, and colors, 58,000 Gobis have been sold for a total price of $1,339,800,000. (*, *, *, *, , ) –Total sales of all models in all colors, over all time at all dealers is 198,000 cars for a total price of $3,521,727,000.

CUBE Helps Answering Queries SELECT color, AVG(price) FROM Sales WHERE model = 'Gobi' GROUP BY color; is answered by looking for all tuples of CUBE(Sales) with the form ('Gobi', c, *, *, v, n) where c is any specific color. v and n will be the sum and number of sales of Gobis in that color. The average price, is v/n. The answer to query is the set of (c; v/n) pairs obtained from all ('Gobi', c, *, *, v, n) tuples.

CUBE in SQL SELECT model, color, date, dealer, SUM(val) AS v, SUM(cnt) AS n FROM Sales GROUP BY model, color, date, dealer WITH CUBE; Suppose we materialize this into SalesCube. Then the previous query is rewritten into: SELECT color, SUM(v)/SUM(n) FROM SalesCube WHERE model = 'Gobi' AND date IS NULL AND dealer IS NULL;

Data Mining Data mining is a popular term for summarization of big data sets in useful ways. Examples: 1.Clustering Web pages by topic. 2.Finding characteristics of fraudulent credit-card use. 3.Finding products that are usually bought together.

Market-Basket Data An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store. Summary of basket data is frequent itemsets = sets of items that often appear together in baskets.

Example: Market Baskets If people often buy hamburger and ketchup together, the store can: 1.Put hamburger and ketchup near each other and put potato chips between. 2.Run a sale on hamburger and raise the price of ketchup.

Finding Frequent Pairs The simplest case is when we only want to find “frequent pairs” of items. Assume data is in a relation Baskets(basket, item). The support threshold s is the minimum number of baskets in which a pair appears before we are interested.

Frequent Pairs in SQL SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.basket = b2.basket AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; Look for two Basket tuples with the same basket and different items. First item must precede second, so we don’t count the same pair twice. Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times.

A-Priori Trick – (1) Straightforward implementation involves a join of a huge Baskets table with itself. The A-Priori algorithm speeds the query by recognizing that a pair of items {i, j } cannot have support s unless both {i } and {j } do.

A-Priori Trick – (2) Use a materialized view to hold only information about frequent items. INSERT INTO Baskets1(basket, item) SELECT * FROM Baskets WHERE item IN ( SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= s ); Items that appear in at least s baskets.

A-Priori Algorithm 1.Materialize the view Baskets1. 2.Run the obvious query, but on Baskets1 instead of Baskets. Notes Computing Baskets1 is cheap, since it doesn’t involve a join. Baskets1 probably has many fewer tuples than Baskets. –Running time shrinks with the square of the number of tuples involved in the join.

Course Plug Fall 2009: CSC 485E / CSC 571: Advanced Databases Spring 2010: SENG 474: Data Mining