Scalable Management of On-line E-commerce Interactions Krithi Ramamritham July 2000.

2 Road Map
Introduction & Motivation
– The Context: Online User Interaction
– The Problem
– State of the Art
– Our Approach
The DataIndex

3 The Context: Online User Interaction
Bob, the eShopper

4 Online User Interaction (2)
Bob likes to read
– He regularly visits Papyrus.com, an online bookseller
Papyrus.com tracks Bob’s clicks as he navigates through the site

5 Online User Interaction (3)
Let’s say Bob enters the Papyrus site at the home page, and then navigates the following link path:
– Fiction → Thriller → Legal Thriller

6 Online User Interaction (4)
Let us now assume that people who follow the Fiction → Thriller → Legal Thriller path are known to be interested in fusion jazz
– This type of knowledge is widely mined by sites these days

7 Online User Interaction (5)
Wouldn’t it be cool to be able to present Bob, right at that moment, with an e-coupon from Tower Records with a special discount for titles in the fusion jazz category?

8 The Problem
This reduces to performing the following tasks:
– Tracking users’ movements on the site
– Accessing a vast knowledge base that correlates specific behavior with stored knowledge
– Generating responses
… in subsecond time frames

9 State of the Art
Academic research does not appear to have considered this problem directly
Two broad classes of commercial software technologies attempt to control user experience of web visits:
– Customer Relationship Management (CRM)
– Personalization
CRM attempts to perform campaign management through offline, delayed interaction (e.g., Siebel, E.piphany)
Personalization software attempts to perform real-time interaction, primarily using recommendation technology (e.g., NetPerceptions, Engage)

10 Basic Principles
Both CRM and Personalization software rely on the same underlying principles to control user experience.
– Accumulate data about historical behavior, usually in commercial database systems
– Correlate new behavior with historical data
– Generate response

11 Limitations of Current Technology
Existing technologies either:
– Look at vast amounts of history and provide offline response, or
– Look at very aggregated information and provide static, online response
Ideally, detailed stored historical data will be correlated with current online behavior to provide dynamic, real-time response
Existing database systems cannot effectively support such interactive response rates from large persistent databases, especially under heavy loads
There is clear evidence of this problem (e.g., Amazon’s decision to discontinue using NetPerceptions due to scaling problems)

12 Our Approach
We have developed a solution approach to this problem, currently commercialized by a venture-backed startup
The 3 key components of our approach closely mirror the general methodology described earlier:
– Observe specific instances of on-line behavior
– Correlate this specific behavior with the vast amounts of site history accumulated over time
– React accordingly

13 Our Solution
[Architecture diagram; recoverable labels: FastPump, Knowledge, Rules]

14 The Yellow Circle
A critical element of the overall architecture
A data warehouse
The underlying technology is based on a suite of structures: DataIndexes

15 Road Map
Introduction & Motivation
The DataIndex
– DataIndex Structures
– Query Processing Algorithms
– Comparative Analysis Results
– Implementation Results
– Demonstration
– Conclusions & Future Work

16 The DataIndex
A novel paradigm for data storage and retrieval
Both a storage and an access structure
– Indexing comes for “free”
Based on, and extends, the notions of vertical partitioning and transposed files
Two kinds presented here:
– Basic DataIndex (BDI)
– Join DataIndex (JDI)

17 Related Work
Variant Indexes [O’Neil & Quass, 97]
– B+-tree (all systems): RID-list for each search-key value
– Bitmapped (almost all systems): bit-vectors (usually compressed) instead of RID-lists
– Projection (Sybase IQ): mirror copy of column
– Bit-sliced (Sybase IQ): “bit-level” projection index
Join Indexes [Valduriez et al., 86]
– Bitmapped-Join Indexes [O’Neil & Graefe ‘95] (Informix)
Limitations:
– These structures are maintained in addition to the base table.
– Query response times are unacceptable in interactive contexts.

18 The Basic DataIndex (BDI)
Projection-like index with the matching column removed from the table
Can have multiple columns (e.g., no TPC-D query asks for ExtPrice or Discount alone)
For this presentation, we assume single-column BDIs

Base Table:
CustKey  Qty  Discount  ExtPrice
CK1      Q1   D1        E1
CK2      Q2   D2        E2
CK3      Q3   D3        E3
CK4      Q4   D4        E4

Stored as three BDIs:
BDI        BDI    BDI
CustKey    Qty    Discount  ExtPrice
CK1        Q1     D1        E1
CK2        Q2     D2        E2
CK3        Q3     D3        E3
CK4        Q4     D4        E4
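The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_bdis` and `reassemble` are hypothetical, BDIs are modeled as in-memory lists rather than disk files, and the column grouping mirrors the slide's example (one multi-column BDI for Discount and ExtPrice).

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
BDIs = Dict[Tuple[str, ...], List[Tuple[str, ...]]]


def build_bdis(rows: Sequence[Row], column_groups: Sequence[Sequence[str]]) -> BDIs:
    """Split row-oriented records into one BDI per column group.

    Row order is preserved, so position i in every BDI belongs to record i.
    """
    return {
        tuple(group): [tuple(row[col] for col in group) for row in rows]
        for group in column_groups
    }


def reassemble(bdis: BDIs, position: int) -> Row:
    """Rebuild one logical record by reading the same position in each BDI."""
    record: Row = {}
    for columns, values in bdis.items():
        record.update(zip(columns, values[position]))
    return record


# The base table from the slide (values are the slide's placeholders).
base_table = [
    {"CustKey": f"CK{i}", "Qty": f"Q{i}", "Discount": f"D{i}", "ExtPrice": f"E{i}"}
    for i in range(1, 5)
]

# No separate base table is kept: the BDIs are the only copy of the data.
bdis = build_bdis(base_table, [["CustKey"], ["Qty"], ["Discount", "ExtPrice"]])
third_record = reassemble(bdis, 2)
```

Because the BDIs replace the base table instead of supplementing it, a query that touches only Qty reads only the Qty BDI.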

19 The Join DataIndex (JDI)
– JDI is “BDI” of RIDs to foreign table.
– Joins can be processed efficiently
[Diagram: a base fact table (Tax, Discount, ExtPrice) and a base dimension table (CustKey, Name, Address) shown alongside their BDIs and a JDI column of RIDs (RID1, RID2, RID3)]
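A minimal sketch of how a JDI turns a join into positional lookups, assuming that a RID can be modeled as a zero-based position in the dimension table's BDIs. The RID values in `sales_customer_jdi` and the helper name `join_fact_to_dimension` are invented for illustration; the slide does not give them.

```python
from typing import List, Sequence, Tuple

# Dimension table CUSTOMER stored as BDIs; list position plays the role of the RID.
customer_name = ["N1", "N2", "N3"]
customer_address = ["A1", "A2", "A3"]

# One fact-table BDI.
sales_tax = ["T1", "T2", "T3", "T4"]

# JDI: for each fact row, the RID of the matching CUSTOMER row.
# These RID values are hypothetical.
sales_customer_jdi = [0, 2, 1, 0]


def join_fact_to_dimension(
    fact_column: Sequence[str], jdi: Sequence[int], dim_column: Sequence[str]
) -> List[Tuple[str, str]]:
    """Join by following RIDs: no key comparison, hashing, or sorting is needed."""
    return [(fact_value, dim_column[rid]) for fact_value, rid in zip(fact_column, jdi)]


tax_with_name = join_fact_to_dimension(sales_tax, sales_customer_jdi, customer_name)
```

Each output row costs one array access into the dimension BDI, which is the sense in which joins are processed efficiently.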

20 Maintaining Logical Records
Order of records is conserved in DataIndexes
A simple arithmetic mapping is used to associate fields of a record
Records in each vertical partition can easily be mapped to blocks and vice-versa
RID: (Block ID, Slot Number within that Block)
– (Block ID, Slot Number) to Position: (formula not recovered)
– Position to (Block ID, Slot Number): (formula not recovered)
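The slide's own formulas did not survive in the transcript. One plausible arithmetic mapping, assuming fixed-width entries, zero-based numbering, and entries that never span a block boundary, is sketched below. The function names are hypothetical, and this is not necessarily the mapping the authors used.

```python
from typing import Tuple


def slots_per_block(block_size: int, entry_width: int) -> int:
    """Number of fixed-width entries that fit in one block (no spanning)."""
    return block_size // entry_width


def position_to_rid(position: int, slots: int) -> Tuple[int, int]:
    """Map an ordinal record position to (block id, slot number)."""
    return divmod(position, slots)


def rid_to_position(block_id: int, slot: int, slots: int) -> int:
    """Map (block id, slot number) back to the ordinal record position."""
    return block_id * slots + slot


# Example: a BDI of 4-byte values stored in 8,192-byte blocks.
slots = slots_per_block(8192, 4)
rid = position_to_rid(5000, slots)
position = rid_to_position(*rid, slots)
```

Since each BDI has its own entry width, the same ordinal position maps to a different (block, slot) pair in each vertical partition, yet still identifies the same logical record.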

21 Query Processing with DataIndexes
Two common classes of queries in data warehousing:
– Range queries
– Star join queries
Example range query:
    SELECT CustKey FROM SALES WHERE Qty > 10
Steps:
– Apply restrictions to form rowset(s)
– Load display BDI(s) into memory
– Display values
[Diagram: SALES table with the Qty BDI, the resulting rowset, and the CustKey BDI (CK1 to CK4)]
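The three steps can be sketched as follows, assuming a rowset is a bit vector with one entry per record position. The Qty values are made up, since the slide's diagram does not show them, and `range_query` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def range_query(
    restriction_bdi: Sequence[int],
    predicate: Callable[[int], bool],
    display_bdi: Sequence[str],
) -> List[str]:
    """Evaluate SELECT display FROM table WHERE predicate(restriction)."""
    # Step 1: apply the restriction to form a rowset (one bit per position).
    rowset = [predicate(value) for value in restriction_bdi]
    # Steps 2-3: read the display BDI and emit values whose bit is set.
    return [value for value, bit in zip(display_bdi, rowset) if bit]


qty_bdi = [5, 12, 10, 40]  # hypothetical quantities
custkey_bdi = ["CK1", "CK2", "CK3", "CK4"]

result = range_query(qty_bdi, lambda qty: qty > 10, custkey_bdi)
```

Only the Qty and CustKey BDIs are read; the other columns of SALES are never touched.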

22 Star Join Queries
A fact table is joined with a set of dimension tables:
    SELECT Column-list
    FROM FactTable, DimensionTables
    WHERE SelectionPredicates AND JoinPredicates
JoinPredicates: “Fact.Attr1 = Dimension.Attr2”
General technique used to evaluate:
1. Apply SelectionPredicates on individual tables.
2. Perform the join on the restricted set of rows or rowsets.

23 Evaluating Star Joins Using DataIndexes
We propose 2 efficient algorithms:
1. Star Join with Large memory (SJL)
2. Star Join with Small memory (SJS)
   » Has negligible memory requirements
   » Less efficient than SJL

24 The SJL Algorithm
Input:
– set of dimension tables participating in join
– set of dimension table display columns
– set of fact table display columns
– set of rowsets, one for each dimension table and one for fact table (R_F)
Steps:
– Load all dimension display column BDIs into memory
– Scan R_F
    For each JDI
      If bit not set in corresponding element of dimension rowset
        Read next row of R_F
      Else create output:
        Use JDI to access dimension display columns
        Use ordinal position to access fact table display columns
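The steps above can be sketched as an in-memory simulation. Assumptions: rowsets are lists of booleans, JDIs hold zero-based positions into the dimension tables, BDIs are plain lists, and all sample values are invented. The function name `star_join_large` and its parameter layout are hypothetical, not the authors' interface.

```python
from typing import Any, Dict, List, Sequence


def star_join_large(
    fact_rowset: Sequence[bool],
    jdis: Dict[str, Sequence[int]],
    dim_rowsets: Dict[str, Sequence[bool]],
    dim_display: Dict[str, Dict[str, Sequence[Any]]],
    fact_display: Dict[str, Sequence[Any]],
) -> List[Dict[str, Any]]:
    """One pass over R_F; dimension display BDIs are assumed to be in memory."""
    output = []
    for pos, selected in enumerate(fact_rowset):
        if not selected:
            continue
        # For each JDI, the referenced dimension row must be in its rowset.
        if not all(dim_rowsets[dim][jdi[pos]] for dim, jdi in jdis.items()):
            continue  # read next row of R_F
        row: Dict[str, Any] = {}
        for dim, columns in dim_display.items():
            for name, bdi in columns.items():
                row[name] = bdi[jdis[dim][pos]]  # JDI gives the dimension position
        for name, bdi in fact_display.items():
            row[name] = bdi[pos]  # ordinal position gives the fact value
        output.append(row)
    return output


rows = star_join_large(
    fact_rowset=[True, True, False, True, True],
    jdis={"PART": [0, 1, 0, 2, 1], "SUPPLIER": [1, 0, 1, 1, 0]},
    dim_rowsets={"PART": [True, False, True], "SUPPLIER": [False, True]},
    dim_display={
        "PART": {"Mfgr": ["M1", "M2", "M3"]},
        "SUPPLIER": {"AcctBal": [100.0, 200.0]},
    },
    fact_display={
        "Quantity": [1, 2, 3, 4, 5],
        "ExtPrice": [10.0, 20.0, 30.0, 40.0, 50.0],
    },
)
```

The loop visits each fact position once, which matches the single-scan behavior claimed for SJL; memory use depends only on the dimension display BDIs passed in.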

25 Example Star Schema
Based on TPC-D (Scale Factor: 1)
4 Dimension Tables
– PART
– SUPPLIER
– CUSTOMER
– TIME
1 Fact Table
– SALES

Columns (width in bytes); values lost in the transcript are marked with “…”:
PART: PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others …; rows: …,000
CUSTOMER: CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment …; rows: …,000
SUPPLIER: SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment …; rows: …,000
TIME: TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day …; rows: …,557
SALES: PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment …; rows: …,000,000

26 SJL Algorithm Example
Sample Query:
    SELECT Mfgr, AcctBal, Quantity, ExtPrice
    FROM SALES S, PART P, SUPPLIER U
    WHERE S.PartKey=P.PartKey AND S.SuppKey=U.SuppKey
      AND Size<100 AND RetFlag=1 AND Nation='United States'
D = {PART, SUPPLIER}
C_D = {Mfgr, AcctBal}
C_F = {Quantity, ExtPrice}
Step 0: Perform all selections on single tables (Size<100 AND RetFlag=1 …)
Create corresponding rowsets R = {R_PART, R_SUPP, R_SALES}
[Diagram: PART (PartKey, Name, Mfgr, Size) with rowset R_PART for Size<100; SALES (JDI on PartKey, JDI on SuppKey, RetFlag, Quantity, ExtPrice) with rowset R_SALES for RetFlag=1; SUPPLIER (SuppKey, AcctBal, Nation) with rowset R_SUPP for Nation=“US”]

27 SJL Algorithm Example (2)
“Step” 1: (1-2) Load Mfgr & AcctBal BDIs into memory.
“Step” 2a: (3-6) Scan R_SALES. For each record:
– Check PartKey JDI against R_PART
– Check SuppKey JDI against R_SUPP
[Diagram: R_PART, R_SALES, and R_SUPP rowsets with the in-memory Mfgr and AcctBal BDIs and the JDIs on PartKey and SuppKey]

28 SJL Algorithm Example (3)
“Step” 2b: (7-8) Access in-memory BDIs for each matching record
“Step” 2c: (9-10) Access fact table BDIs from disk for each matching record
“Step” 3: Output each record
[Diagram: R_PART, R_SALES, and R_SUPP rowsets; Mfgr, AcctBal, Quantity, and ExtPrice BDIs feeding the output (Quantity, ExtPrice, Mfgr, AcctBal)]

29 About SJL
Advantages:
– Accesses each fact table block only once, one block at a time.
– Accesses each dimension table block only once.
– Accesses only relevant columns (and JDIs).
– Memory requirements depend only on the size of the displayed dimension BDIs and are independent of fact table size
– Time complexity: O(|F|)
Disadvantage:
– May still require significant amounts of memory in some cases (extremely large dimension tables).
We thus propose SJS

30 The SJS Algorithm
Input: same as SJL
4 Phases:
1. R_F restriction: Restricts R_F to rows appearing in join result.
– Scan R_F
    For each JDI
      If bit not set in corresponding element of dimension rowset
        Clear bit in R_F
2. JDI restriction: Restricts JDIs to rows appearing in join result.
– Scan R_F
    For each JDI
      If bit set in corresponding R_F row
        Write JDI element to restricted JDI (JDI_R) on disk

31 The SJS Algorithm (2)
3. Output BDI Creation: Creates output BDI for dimension display columns.
– For each dimension display column BDI
    Load a portion of BDI into memory (as much as can fit)
    Scan JDI_R
    Write matching entries to output BDI (in JDI_R order)
    Repeat until entire BDI processed
4. Final Output Merge: Merges dimension and fact table display columns.
– Scan R_F
    Use ordinal position to access dimension display columns from output BDI
    Use ordinal position to access fact table display columns
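The four phases can be sketched as an in-memory simulation. Assumptions: "disk" structures are plain Python lists, rowsets are lists of booleans, JDIs hold zero-based dimension positions, and the `chunk` parameter (how many BDI entries fit in memory at once) is a hypothetical stand-in for the memory limit. All sample data is invented.

```python
from typing import Any, Dict, List, Sequence


def star_join_small(
    fact_rowset: Sequence[bool],
    jdis: Dict[str, Sequence[int]],
    dim_rowsets: Dict[str, Sequence[bool]],
    dim_display: Dict[str, Dict[str, Sequence[Any]]],
    fact_display: Dict[str, Sequence[Any]],
    chunk: int = 2,
) -> List[Dict[str, Any]]:
    # Phase 1: restrict R_F to rows that survive every dimension rowset.
    r_f = [
        selected and all(dim_rowsets[d][jdi[pos]] for d, jdi in jdis.items())
        for pos, selected in enumerate(fact_rowset)
    ]
    # Phase 2: restricted JDIs (JDI_R) keep only entries for surviving rows.
    jdi_r = {d: [rid for rid, keep in zip(jdi, r_f) if keep] for d, jdi in jdis.items()}
    # Phase 3: build output BDIs in JDI_R order, one portion of a BDI at a time.
    out_bdis: Dict[str, List[Any]] = {}
    for d, columns in dim_display.items():
        for name, bdi in columns.items():
            out: List[Any] = [None] * len(jdi_r[d])
            for start in range(0, len(bdi), chunk):
                portion = bdi[start:start + chunk]  # the part held in memory
                for i, rid in enumerate(jdi_r[d]):  # one JDI_R scan per portion
                    if start <= rid < start + len(portion):
                        out[i] = portion[rid - start]
            out_bdis[name] = out
    # Phase 4: merge output BDIs with fact display columns by ordinal position.
    result, out_pos = [], 0
    for pos, keep in enumerate(r_f):
        if keep:
            row = {name: out[out_pos] for name, out in out_bdis.items()}
            row.update({name: bdi[pos] for name, bdi in fact_display.items()})
            result.append(row)
            out_pos += 1
    return result


sjs_rows = star_join_small(
    fact_rowset=[True, True, False, True, True],
    jdis={"PART": [0, 1, 0, 2, 1], "SUPPLIER": [1, 0, 1, 1, 0]},
    dim_rowsets={"PART": [True, False, True], "SUPPLIER": [False, True]},
    dim_display={"PART": {"Mfgr": ["M1", "M2", "M3"]}, "SUPPLIER": {"AcctBal": [100.0, 200.0]}},
    fact_display={"Quantity": [1, 2, 3, 4, 5], "ExtPrice": [10.0, 20.0, 30.0, 40.0, 50.0]},
)
```

Phase 3 is where memory is traded for time: each BDI portion is loaded once, but JDI_R is rescanned for every portion.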

32 About SJS
Used when dimensional BDIs do not fit in memory
JDI scanned multiple times, but (large) BDI scanned only once.
Time complexity: O(|D| |F|) (|D| = size of BDI)
– Smaller than the O(|F|^2) that can occur with hashing.
– Most often affects only one or a few columns.

33 Comparative Analysis
Analysis of star-join query cost for bitmapped-join index (BJI) and DataIndex (SJL & SJS) approaches
Comparison of star-join performance for:
– best case performance of BJI
– worst case performance of SJL & SJS
Metric: number of disk accesses
Query:
    SELECT U.Name, S.ExtPrice
    FROM SALES S, TIME T, CUSTOMER C, SUPPLIER U
    WHERE T.Year BETWEEN 1996 AND 1998
      AND U.Nation='United States' AND C.Nation='United States'
      AND S.ShipDate=T.TimeKey AND S.CustKey=C.CustKey AND S.SuppKey=U.SuppKey

34 Selected Baseline Parameter Settings
Selectivity on fact table is 1%
Selectivity on each dimension table is 5%
Number of distinct search key values in a range selection is 2%
Compression level is 20%
Size of warehouse varies from approximately 86 MB to approximately 860 GB
Size of:
– Data Block = 8,192 bytes
– RID = 6 bytes
– Pointer to data block = 4 bytes
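Since the metric is the number of disk accesses, block-level arithmetic on these sizes is the building block of any cost estimate. The sketch below only shows how many 6-byte RIDs fit in an 8,192-byte block and how many blocks a JDI would occupy. It assumes entries never span blocks, uses a made-up fact-table size of 1,000,000 rows, and is not the authors' cost model.

```python
import math

BLOCK_SIZE = 8192  # bytes, from the baseline settings
RID_SIZE = 6       # bytes, from the baseline settings


def entries_per_block(entry_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Fixed-width entries per block, assuming no entry spans two blocks."""
    return block_size // entry_size


def blocks_needed(num_entries: int, entry_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks (and hence sequential block reads) needed to scan a column."""
    return math.ceil(num_entries / entries_per_block(entry_size, block_size))


rids_per_block = entries_per_block(RID_SIZE)
# Hypothetical fact table of 1,000,000 rows: blocks occupied by one JDI.
jdi_blocks = blocks_needed(1_000_000, RID_SIZE)
```

Under these assumptions one JDI over a million fact rows fits in a few hundred blocks, which is why scanning a JDI is cheap compared with scanning full fact-table records.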

35 Baseline Performance
[Chart: Query Evaluation Cost, N, versus Scale Factor]

36 Memory Requirements for SJL & BJI
[Chart: Memory Requirements (MB) versus Scale Factor]

37 DataIndex Implementation
We have implemented the DataIndex strategy:
– Written in C++
– Platforms supported: Solaris, Linux, HP-UX, DEC, NT
Performance evaluation on NT platform:
– Comparison with Oracle, Red Brick, and DB2 in terms of query processing, storage, and loading costs
– Minimal indexing scheme used for commercial systems
– Platform: Windows NT, 300 MHz Pentium, 64 MB RAM
Much larger tests run on various platforms

38 Schema Used in Analysis
Columns (width in bytes) and record width:
CUSTOMER: CustKey 4, Name 10, Address 30, Age 4, Phone 4, Total 4; record width 56
TIME: TimeKey 4, Year 4, Month 4, Week 4, Day 4; record width 20
PURCHASE: PurchKey 4, ProdKey 4, CustKey 4, TimeKey 4, Quantity 4, Price 4, Amount 8, Discount 4, Tax 4; record width 40
PRODUCT: ProdKey 4, Name 10, Color 1, Weight 8; record width 23

Table      # of Records (three configurations)
PURCHASE   5M     14M    22M
CUSTOMER   10K    20K    40K
PRODUCT    100K   200K   400K
TIME       2.5K   5K     10K

39 Query Processing Tests
Queries:
– Find products having high sales volumes.
– Find elderly customers who purchased large quantities of a given range of products and the month of purchase.
– Find elderly customers who purchased large quantities of a given range of products.
– List the total quantity purchased by customer, product, and month.
Query Characteristics:
– 4-way join, 2 restrictions
– 2-way join, 2 restrictions
– 4-way join, 3 restrictions, aggregation with 3 GROUP BY columns

40 Query Performance: 2-Way Join
[Chart: Response Time (seconds) versus Raw Data Size (GB)]

41 Query Performance: 4-Way Join, Aggregation
[Chart: Response Time (seconds) versus Raw Data Size (GB)]

42 Storage Requirements
[Chart: Indexed Data Size (GB) versus Raw Data Size (GB)]

43 Loading Times
[Chart: Load Time (seconds) versus Raw Data Size (GB)]

44 Demonstration

45 Other Advantages of DataIndexes
Compression
– Small range of values yields high compressibility
– Algorithms exist for scanning compressed data
Bulk Update (Warehouse Loads)
– No need to update indexes
Buffer Utilization
– Columns that are accessed frequently may be pinned in memory
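To illustrate the compression point, here is a sketch using run-length encoding. RLE is an assumption chosen for simplicity; the slides do not say which compression scheme DataIndexes use. The example shows a low-cardinality column compressing into a few runs and a predicate being evaluated once per run, without decompressing the column first.

```python
from itertools import groupby
from typing import Callable, List, Sequence, Tuple

Run = Tuple[int, int]  # (value, run length)


def rle_encode(column: Sequence[int]) -> List[Run]:
    """Collapse consecutive equal values into (value, length) runs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def scan_rle(runs: Sequence[Run], predicate: Callable[[int], bool]) -> List[int]:
    """Return matching record positions, testing the predicate once per run."""
    positions: List[int] = []
    pos = 0
    for value, length in runs:
        if predicate(value):
            positions.extend(range(pos, pos + length))
        pos += length
    return positions


ret_flag = [0, 0, 0, 1, 1, 0, 0, 1]  # hypothetical single-column BDI
runs = rle_encode(ret_flag)
matches = scan_rle(runs, lambda flag: flag == 1)
```

Because a BDI holds values of a single column, equal values tend to sit next to each other more often than in row-oriented storage, which is what makes this kind of encoding effective.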

46 Conclusions
New, high-performance storage and indexing strategy
Implementation results support our analytical findings
Empirical analysis shows that the DataIndex strategy outperforms existing strategies for range and star join queries in many practical cases
[Diagram; recoverable labels: FastPump, Knowledge, Rules]

47 Related & Future Work
Related Work
– “Curio: A Novel Solution for Efficient Storage and Indexing in Data Warehouses”, Proceedings of VLDB 1999
– “A Case for Parallelism in Data Warehousing and OLAP”, Proceedings of DWDOT
Future Work
– Analytical study of other query processing components
    Aggregations & Group-by’s
    Multi-Star Queries

48 Other Work
Past Work
– Indexing block-compressed data (Proceedings of WITS 1999, TKDE submission).
– Data modeling for data warehouses/OLAP (Decision Support Systems, 1999).
Ongoing Work
– Efficient electronic catalog integration
– Designing efficient micropayment schemes
– Designing structures that allow fast insertion into data warehouses