Scalable Management of On-line E-commerce Interactions Krithi Ramamritham July 2000.

2 Road Map
Introduction & Motivation
–The Context: Online User Interaction
–The Problem
–State of the Art
–Our Approach
The DataIndex

3 The Context: Online User Interaction
Bob, the eShopper

4 Online User Interaction (2)
Bob likes to read
–He regularly visits Papyrus.com, an online bookseller
Papyrus.com tracks Bob’s clicks as he navigates through the site

5 Online User Interaction (3)
Let’s say Bob enters the Papyrus site at the home page, and then navigates the following link path:
–Fiction → Thriller → Legal Thriller

6 Online User Interaction (4)
Let us now assume that people who follow the Fiction → Thriller → Legal Thriller path are known to be interested in fusion jazz
–This type of knowledge is widely mined by sites these days

7 Online User Interaction (5)
Wouldn’t it be cool to be able to present Bob, right at that moment, with an e-coupon from Tower Records with a special discount for titles in the fusion jazz category?

8 The Problem
This reduces to performing the following tasks:
–Tracking users’ movements on the site
–Accessing a vast knowledge base that correlates specific behavior with stored knowledge
–Generating responses
… in subsecond time frames

9 State of the Art
Academic research does not appear to have considered this problem directly
Two broad classes of commercial software technologies attempt to control the user experience of web visits:
–Customer Relationship Management (CRM)
–Personalization
CRM attempts to perform campaign management through offline, delayed interaction (e.g., Siebel, E.piphany)
Personalization software attempts to perform real-time interaction, primarily using recommendation technology (e.g., NetPerceptions, Engage)

10 Basic Principles
Both CRM and Personalization software rely on the same underlying principles to control user experience.
–Accumulate data about historical behavior, usually in commercial database systems
–Correlate new behavior with historical data
–Generate response

11 Limitations of Current Technology
Existing technologies either:
–Look at vast amounts of history and provide offline response, or
–Look at very aggregated information and provide static, online response
Ideally, detailed stored historical data will be correlated with current online behavior to provide dynamic, real-time response
Existing database systems cannot effectively support such interactive response rates from large persistent databases, especially under heavy loads
There is clear evidence of this problem (e.g., Amazon’s decision to discontinue using NetPerceptions due to scaling problems)

12 Our Approach
We have developed a solution approach to this problem, currently commercialized by a venture-backed startup
The 3 key components of our approach closely mirror the general methodology described earlier:
–Observe specific instances of on-line behavior
–Correlate this specific behavior with the vast amounts of site history accumulated over time
–React accordingly

13 Our Solution
[Architecture diagram; recoverable labels: FastPump, Knowledge, Rules]

14 The Yellow Circle
A critical element of the overall architecture
A data warehouse
The underlying technology is based on a suite of structures: DataIndexes

15 Road Map
Introduction & Motivation
The DataIndex
–DataIndex Structures
–Query Processing Algorithms
–Comparative Analysis Results
–Implementation Results
–Demonstration
–Conclusions & Future Work

16 The DataIndex
A novel paradigm for data storage and retrieval
Both a storage and an access structure
–Indexing comes for “free”
Based on, and extending, the notions of vertical partitioning and transposed files
Two kinds presented here:
–Basic DataIndex (BDI)
–Join DataIndex (JDI)

17 Related Work
Variant Indexes [O’Neil & Quass, 97]
–B+-tree (all systems): RID-list for each search-key value
–Bitmapped (almost all systems): Bit-vectors (usually compressed) instead of RID-lists
–Projection (Sybase IQ): Mirror copy of column
–Bit-sliced (Sybase IQ): “bit-level” projection index
Join Indexes [Valduriez et al., 86]
–Bitmapped-Join Indexes [O’Neil & Graefe, 95] (Informix)
Limitations:
–These structures are maintained in addition to the base table.
–Query response times are unacceptable in interactive contexts.

18 The Basic DataIndex (BDI)
Projection-like index with the matching column removed from the table
Can have multiple columns (e.g., no TPC-D query asks for ExtPrice or Discount alone)
For this presentation, we assume single-column BDIs
Base Table:
CustKey  Qty  Discount  ExtPrice
CK1      Q1   D1        E1
CK2      Q2   D2        E2
CK3      Q3   D3        E3
CK4      Q4   D4        E4
BDI (CustKey): CK1, CK2, CK3, CK4
BDI (Qty): Q1, Q2, Q3, Q4
BDI (Discount, ExtPrice): (D1, E1), (D2, E2), (D3, E3), (D4, E4)
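
The layout on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in-memory lists stand in for disk-resident BDIs, and the names `to_bdis` and `record_at` are hypothetical, not part of the talk.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
ColumnGroup = Tuple[str, ...]


def to_bdis(rows: List[Row], groups: Sequence[ColumnGroup]) -> Dict[ColumnGroup, List[Tuple[str, ...]]]:
    """Vertically partition `rows` into one BDI per column group.

    Row order is preserved, so entry i of every BDI belongs to logical record i.
    """
    return {g: [tuple(r[c] for c in g) for r in rows] for g in groups}


def record_at(bdis: Dict[ColumnGroup, List[Tuple[str, ...]]], position: int) -> Row:
    """Reassemble one logical record by reading the same position in every BDI."""
    rec: Row = {}
    for group, entries in bdis.items():
        rec.update(zip(group, entries[position]))
    return rec


# The four-row base table from the slide.
base_table = [
    {"CustKey": f"CK{i}", "Qty": f"Q{i}", "Discount": f"D{i}", "ExtPrice": f"E{i}"}
    for i in range(1, 5)
]

# Two single-column BDIs and one two-column BDI, as drawn on the slide.
bdis = to_bdis(base_table, [("CustKey",), ("Qty",), ("Discount", "ExtPrice")])
```

Because the base table is replaced by its BDIs rather than indexed in addition to them, `record_at` is the only way back to a full row in this sketch.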

19 The Join DataIndex (JDI)
–JDI is a “BDI” of RIDs to the foreign table
–Joins can be processed efficiently
[Figure: a base fact table (Tax, Discount, ExtPrice) and a base dimension table (CustKey, Name, Address), each stored as BDIs; a JDI of RIDs (RID1, RID2, RID3) links fact rows to dimension rows]
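
A minimal sketch of the idea, assuming RIDs can be simplified to zero-based ordinal positions in the dimension table. The function name `build_jdi` and the sample foreign-key values are made up for illustration.

```python
from typing import Dict, Hashable, List


def build_jdi(fact_foreign_keys: List[Hashable], dim_keys: List[Hashable]) -> List[int]:
    """Replace each foreign-key value with the position of the matching dimension row."""
    position_of: Dict[Hashable, int] = {key: pos for pos, key in enumerate(dim_keys)}
    return [position_of[key] for key in fact_foreign_keys]


# Dimension table stored as BDIs (three rows, as on the slide).
dim_custkey = ["CK1", "CK2", "CK3"]
dim_name = ["N1", "N2", "N3"]

# Fact table columns (four rows); the foreign-key values are illustrative.
fact_custkey = ["CK2", "CK1", "CK3", "CK2"]
fact_tax = ["T1", "T2", "T3", "T4"]

jdi = build_jdi(fact_custkey, dim_custkey)

# The join needs no key lookup at query time: each JDI entry points straight
# at the dimension row, and the fact columns line up by position.
joined = [(fact_tax[i], dim_name[rid]) for i, rid in enumerate(jdi)]
```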

20 Maintaining Logical Records
RID: (Block ID, Slot Number within that Block)
Order of records is conserved in DataIndexes
A simple arithmetic mapping is used to associate fields of a record
Records in each vertical partition can easily be mapped to blocks and vice versa:
–(Block ID, Slot Number) to Position
–Position to (Block ID, Slot Number)
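
The slide's own formulas did not survive in this transcript. One plausible form of such a mapping is sketched below, assuming fixed-width entries that never span blocks and zero-based block IDs, slots, and positions. The function names are hypothetical.

```python
from typing import Tuple


def slots_per_block(block_size: int, entry_size: int) -> int:
    """How many fixed-width entries fit in one block (entries do not span blocks)."""
    return block_size // entry_size


def rid_to_position(block_id: int, slot: int, slots: int) -> int:
    """(Block ID, Slot Number) -> ordinal position of the record."""
    return block_id * slots + slot


def position_to_rid(position: int, slots: int) -> Tuple[int, int]:
    """Ordinal position -> (Block ID, Slot Number)."""
    return divmod(position, slots)


# Example: 4-byte entries in 8,192-byte blocks give 2,048 slots per block.
slots = slots_per_block(8192, 4)
rid = position_to_rid(5000, slots)
position = rid_to_position(rid[0], rid[1], slots)
```

Since each vertical partition has its own entry width, the same position generally maps to a different (block, slot) pair in each BDI, which is why the position rather than the RID ties the fields of a record together in this sketch.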

21 Query Processing with DataIndexes
Two common classes of queries in data warehousing:
–Range queries
–Star join queries
Example range query: SELECT CustKey FROM SALES WHERE Qty>10
SALES Table (stored as BDIs) and resulting rowset:
CustKey  Qty  Rowset
CK1      25   1
CK2      5    0
CK3      7    0
CK4      15   1
Steps:
–Apply restrictions to form rowset(s)
–Load display BDI(s) into memory
–Display values
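
The range query on this slide can be traced with a short sketch. It assumes Python lists stand in for the BDIs and a list of 0/1 values stands in for the rowset bit-vector; the function names are illustrative.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def form_rowset(restriction_bdi: List[T], predicate: Callable[[T], bool]) -> List[int]:
    """Scan the restricted column and set one bit per qualifying row."""
    return [1 if predicate(value) else 0 for value in restriction_bdi]


def display_values(display_bdi: List[T], rowset: List[int]) -> List[T]:
    """Read the display column at the positions whose bit is set."""
    return [value for value, bit in zip(display_bdi, rowset) if bit]


# SALES stored as two single-column BDIs, with the values from the slide.
custkey_bdi = ["CK1", "CK2", "CK3", "CK4"]
qty_bdi = [25, 5, 7, 15]

# SELECT CustKey FROM SALES WHERE Qty > 10
rowset = form_rowset(qty_bdi, lambda qty: qty > 10)
result = display_values(custkey_bdi, rowset)
```

Only the Qty and CustKey columns are touched; the other columns of SALES are never read.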

22 Star Join Queries
A fact table is joined with a set of dimension tables:
SELECT Column-list
FROM FactTable, DimensionTables
WHERE SelectionPredicates AND JoinPredicates
JoinPredicates: “Fact.Attr1 = Dimension.Attr2”
General Technique Used to Evaluate:
1. Apply SelectionPredicates on individual tables.
2. Perform Join on restricted set of rows or rowsets.

23 Evaluating Star Joins Using DataIndexes
We propose 2 efficient algorithms:
1. Star Join with Large memory (SJL)
2. Star Join with Small memory (SJS)
»Has negligible memory requirements
»Less efficient than SJL

24 The SJL Algorithm
Input:
–set of dimension tables participating in join
–set of dimension table display columns
–set of fact table display columns
–set of rowsets, one for each dimension table and one for fact table (R_F)
Steps:
–Load all dimension display column BDIs into memory
–Scan R_F
    For each JDI
        If bit not set in corresponding element of dimension rowset
            Read next row of R_F
    Else create output:
        Use JDI to access dimension display columns
        Use ordinal position to access fact table display columns
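
The steps above can be sketched as follows. This is a simplified reading of the slide rather than the authors' implementation: Python lists stand in for BDIs, JDIs, and rowsets, RIDs are zero-based positions, and the function name `star_join_large` and all sample data are made up.

```python
from typing import Any, Dict, List


def star_join_large(
    fact_rowset: List[int],                    # R_F: one bit per fact row
    jdis: Dict[str, List[int]],                # dimension name -> JDI (one position per fact row)
    dim_rowsets: Dict[str, List[int]],         # dimension name -> one bit per dimension row
    dim_display: Dict[str, Dict[str, list]],   # dimension name -> {column: in-memory BDI}
    fact_display: Dict[str, list],             # {column: fact table BDI}
) -> List[Dict[str, Any]]:
    output = []
    for i, bit in enumerate(fact_rowset):      # single scan of R_F
        if not bit:
            continue
        # Skip this fact row if any referenced dimension row failed its selection.
        if any(not dim_rowsets[d][jdi[i]] for d, jdi in jdis.items()):
            continue
        row: Dict[str, Any] = {}
        for d, columns in dim_display.items():
            for name, bdi in columns.items():
                row[name] = bdi[jdis[d][i]]    # JDI entry locates the dimension value
        for name, bdi in fact_display.items():
            row[name] = bdi[i]                 # ordinal position locates the fact value
        output.append(row)
    return output


# Tiny illustrative instance: 3 PART rows, 2 SUPPLIER rows, 5 SALES rows.
result = star_join_large(
    fact_rowset=[1, 1, 1, 0, 1],
    jdis={"PART": [0, 1, 2, 0, 2], "SUPPLIER": [1, 1, 0, 1, 1]},
    dim_rowsets={"PART": [1, 0, 1], "SUPPLIER": [0, 1]},
    dim_display={"PART": {"Mfgr": ["M1", "M2", "M3"]}, "SUPPLIER": {"AcctBal": [100.0, 250.0]}},
    fact_display={"Quantity": [10, 20, 30, 40, 50], "ExtPrice": [1.0, 2.0, 3.0, 4.0, 5.0]},
)
```

In this instance only fact rows 0 and 4 survive all three rowsets, so the result has two rows.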

25 Example Star Schema
Based on TPC-D (Scale Factor: 1)
4 Dimension Tables
–PART
–SUPPLIER
–CUSTOMER
–TIME
1 Fact Table
–SALES
Column widths in bytes, followed by row width and number of rows:
PART: PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others... 41; row width 164; 200,000 rows
CUSTOMER: CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment 117; row width 269; 150,000 rows
SUPPLIER: SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment 101; row width 243; 10,000 rows
TIME: TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day 4; row width 28; 2,557 rows
SALES: PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment 44; row width 137; 6,000,000 rows

26 SJL Algorithm Example
Sample Query:
SELECT Mfgr, AcctBal, Quantity, ExtPrice
FROM SALES S, PART P, SUPPLIER U
WHERE S.PartKey=P.PartKey AND S.SuppKey=U.SuppKey AND Size<100 AND RetFlag=1 AND Nation='United States'
D = {PART, SUPPLIER}
C_D = {Mfgr, AcctBal}
C_F = {Quantity, ExtPrice}
Step 0: Perform all selections on single tables (Size<100 AND RetFlag=1 …)
Create corresponding rowsets R = {R_PART, R_SUPP, R_SALES}
[Figure: PART (PartKey, Name, Mfgr, Size) restricted by Size<100 to give R_PART; SUPPLIER (SuppKey, AcctBal, Nation) restricted by Nation=“US” to give R_SUPP; SALES (JDI on PartKey, JDI on SuppKey, RetFlag, Quantity, ExtPrice) restricted by RetFlag=1 to give R_SALES]

27 SJL Algorithm Example (2)
“Step” 1: (1-2) Load Mfgr & AcctBal BDIs into memory.
“Step” 2a: (3-6) Scan R_SALES. For each record:
–Check PartKey JDI against R_PART
–Check SuppKey JDI against R_SUPP
[Figure: R_PART, R_SALES, and R_SUPP shown with the in-memory Mfgr and AcctBal BDIs and the JDIs on PartKey and SuppKey]

28 SJL Algorithm Example (3)
“Step” 2b: (7-8) Access in-memory BDIs for each matching record
“Step” 2c: (9-10) Access fact table BDIs from disk for each matching record
“Step” 3: Output each record
[Figure: R_PART, R_SALES, and R_SUPP with the Mfgr, AcctBal, Quantity, and ExtPrice BDIs feeding output records of (Quantity, ExtPrice, Mfgr, AcctBal)]

29 About SJL
Advantages:
–Accesses each fact table block only once, one block at a time.
–Accesses each dimension table block only once.
–Accesses only relevant columns (and JDIs).
–Memory requirements depend only on the size of the displayed dimension BDIs and are independent of fact table size.
–Time complexity: O(|F|)
Disadvantage:
–May still require significant amounts of memory in some cases (extremely large dimension tables).
We thus propose SJS

30 The SJS Algorithm
Input: same as SJL
4 Phases:
1. R_F restriction: Restricts R_F to rows appearing in join result.
–Scan R_F
    For each JDI
        If bit not set in corresponding element of dimension rowset
            Clear bit in R_F
2. JDI restriction: Restricts JDIs to rows appearing in join result.
–Scan R_F
    For each JDI
        If bit set in corresponding R_F row
            Write JDI element to restricted JDI (JDI_R) on disk

31 The SJS Algorithm (2)
3. Output BDI Creation: Creates output BDI for dimension display columns.
–For each dimension display column BDI
    Load a portion of BDI into memory (as much as can fit)
    Scan JDI_R
        Write matching entries to output BDI (in JDI_R order)
    Repeat until entire BDI processed
4. Final Output Merge: Merges dimension and fact table display columns.
–Scan R_F
    Use ordinal position to access dimension display columns from output BDI
    Use ordinal position to access fact table display columns
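
The four phases can be sketched as below. This is a simplified reading under stated assumptions: lists stand in for on-disk structures, RIDs are zero-based positions, and the `chunk` parameter models how many BDI entries fit in memory at once. The function name `star_join_small` and the sample data are made up.

```python
from typing import Any, Dict, List


def star_join_small(
    fact_rowset: List[int],
    jdis: Dict[str, List[int]],
    dim_rowsets: Dict[str, List[int]],
    dim_display: Dict[str, Dict[str, list]],
    fact_display: Dict[str, list],
    chunk: int = 2,
) -> List[Dict[str, Any]]:
    # Phase 1: R_F restriction. Clear bits of fact rows whose dimension rows fail.
    rf = list(fact_rowset)
    for i, bit in enumerate(rf):
        if bit and any(not dim_rowsets[d][jdi[i]] for d, jdi in jdis.items()):
            rf[i] = 0

    # Phase 2: JDI restriction. Keep only JDI entries of surviving fact rows.
    jdi_r = {d: [jdi[i] for i, bit in enumerate(rf) if bit] for d, jdi in jdis.items()}

    # Phase 3: output BDI creation. Load each dimension BDI a portion at a time
    # and rescan JDI_R for entries that fall inside the loaded portion.
    out_bdis: Dict[str, list] = {}
    for d, columns in dim_display.items():
        for name, bdi in columns.items():
            out = [None] * len(jdi_r[d])
            for start in range(0, len(bdi), chunk):
                portion = bdi[start:start + chunk]
                for k, pos in enumerate(jdi_r[d]):
                    if start <= pos < start + len(portion):
                        out[k] = portion[pos - start]  # kept in JDI_R order
            out_bdis[name] = out

    # Phase 4: final output merge. The k-th surviving fact row pairs with
    # the k-th entry of every output BDI.
    result = []
    k = 0
    for i, bit in enumerate(rf):
        if not bit:
            continue
        row = {name: out[k] for name, out in out_bdis.items()}
        row.update({name: bdi[i] for name, bdi in fact_display.items()})
        result.append(row)
        k += 1
    return result


result = star_join_small(
    fact_rowset=[1, 1, 1, 0, 1],
    jdis={"PART": [0, 1, 2, 0, 2], "SUPPLIER": [1, 1, 0, 1, 1]},
    dim_rowsets={"PART": [1, 0, 1], "SUPPLIER": [0, 1]},
    dim_display={"PART": {"Mfgr": ["M1", "M2", "M3"]}, "SUPPLIER": {"AcctBal": [100.0, 250.0]}},
    fact_display={"Quantity": [10, 20, 30, 40, 50], "ExtPrice": [1.0, 2.0, 3.0, 4.0, 5.0]},
)
```

The nested loop in phase 3 is where the repeated JDI scans come from: the restricted JDI is read once per loaded portion, while each BDI entry is read only once.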

32 About SJS
Used when dimensional BDIs do not fit in memory
JDI scanned multiple times, but (large) BDI scanned only once.
Time complexity: O(|D| |F|) (|D| = size of BDI)
–Smaller than the O(|F|²) that can occur with hashing.
–Most often affects only one or a few columns.

33 Comparative Analysis
Analysis of star-join query cost for bitmapped-join index (BJI) and DataIndex (SJL & SJS) approaches
Comparison of star-join performance for:
–best case performance of BJI
–worst case performance of SJL & SJS
Metric: number of disk accesses
Query:
SELECT U.Name, S.ExtPrice
FROM SALES S, TIME T, CUSTOMER C, SUPPLIER U
WHERE T.Year BETWEEN 1996 AND 1998
AND U.Nation='United States'
AND C.Nation='United States'
AND S.ShipDate=T.TimeKey
AND S.CustKey=C.CustKey
AND S.SuppKey=U.SuppKey

34 Selected Baseline Parameter Settings
Selectivity on fact table is 1%
Selectivity on each dimension table is 5%
Number of distinct search key values in a range selection is 2%
Compression level is 20%
Size of warehouse varies from 86 MB to 860 GB
Size of:
–Data Block = 8,192 bytes
–RID = 6 bytes
–Pointer to data block = 4 bytes
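
As a rough illustration of how the block and RID sizes above feed a disk-access count, the sketch below estimates how many 8,192-byte blocks a single column occupies. The talk's actual cost formulas are not in this transcript, so the dense, uncompressed packing model, the function name `blocks_needed`, the 6,000,000-row fact table, and the 4-byte column are all assumptions.

```python
import math

BLOCK_SIZE = 8192  # bytes per data block (from the parameter settings)
RID_SIZE = 6       # bytes per RID (from the parameter settings)


def blocks_needed(num_rows: int, entry_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks occupied by a column of fixed-width entries that do not span blocks."""
    entries_per_block = block_size // entry_size
    return math.ceil(num_rows / entries_per_block)


fact_rows = 6_000_000  # assumed fact table size

# A JDI holds one RID per fact row; a hypothetical 4-byte column holds one value per row.
jdi_blocks = blocks_needed(fact_rows, RID_SIZE)
key_blocks = blocks_needed(fact_rows, 4)
```

Under these assumptions a full scan of one JDI reads about 4,400 blocks, regardless of how wide the rest of the fact row is.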

35 Baseline Performance
[Figure: Query Evaluation Cost, N, plotted against Scale Factor]

36 Memory Requirements for SJL & BJI
[Figure: Memory Requirements (MB) plotted against Scale Factor]

37 DataIndex Implementation
We have implemented the DataIndex strategy:
–Written in C++
–Platforms supported: Solaris, Linux, HP-UX, DEC, NT
Performance evaluation on NT platform:
–Comparison with Oracle, Red Brick, and DB2 in terms of query processing, storage, and loading costs
–Minimal indexing scheme used for commercial systems
–Platform: Windows NT, 300 MHz Pentium, 64 MB RAM
Much larger tests run on various platforms

38 Schema Used in Analysis
Column widths in bytes, followed by row width:
CUSTOMER: CustKey 4, Name 10, Address 30, Age 4, Phone 4, Total 4; row width 56
TIME: TimeKey 4, Year 4, Month 4, Week 4, Day 4; row width 20
PURCHASE: PurchKey 4, ProdKey 4, CustKey 4, TimeKey 4, Quantity 4, Price 4, Amount 8, Discount 4, Tax 4; row width 40
PRODUCT: ProdKey 4, Name 10, Color 1, Weight 8; row width 23
Table     # of Records
PURCHASE  5M    14M   22M
CUSTOMER  10K   20K   40K
PRODUCT   100K  200K  400K
TIME      2.5K  5K    10K

39 Query Processing Tests
–Find products having high sales volumes.
–Find elderly customers who purchased large quantities of a given range of products and the month of purchase.
–Find elderly customers who purchased large quantities of a given range of products.
–List the total quantity purchased by customer, product, and month.
Query Characteristics
–4-way join, 2 restrictions
–2-way join, 2 restrictions
–4-way join, 3 restrictions, aggregation with 3 GROUP BY columns

40 Query Performance: 2-Way Join
[Figure: Response Time (seconds) plotted against Raw Data Size (GB)]

41 Query Performance: 4-Way Join, Aggregation
[Figure: Response Time (seconds) plotted against Raw Data Size (GB)]

42 Storage Requirements
[Figure: Indexed Data Size (GB) plotted against Raw Data Size (GB)]

43 Loading Times
[Figure: Load Time (seconds) plotted against Raw Data Size (GB)]

44 Demonstration

45 Other Advantages of DataIndexes
Compression
–Small range of values yields high compressibility
–Algorithms exist for scanning compressed data
Bulk Update (Warehouse Loads)
–No need to update indexes
Buffer Utilization
–Columns that are accessed frequently may be pinned in memory

46 Conclusions
New, high-performance storage and indexing strategy
Implementation results support our analytical findings
Empirical analysis shows that the DataIndex strategy outperforms existing strategies for range and star join queries in many practical cases
[Architecture diagram; recoverable labels: FastPump, Knowledge, Rules]

47 Related & Future Work
Related Work
–“Curio: A Novel Solution for Efficient Storage and Indexing in Data Warehouses”, Proceedings of VLDB 1999.
–“A Case for Parallelism in Data Warehousing and OLAP”, Proceedings of DWDOT 1998.
Future Work
–Analytical Study of Other Query Processing Components
»Aggregations & Group-by’s
»Multi-Star Queries

48 Other Work
Past Work
–Indexing block-compressed data (Proceedings of WITS 1999, TKDE submission).
–Data modeling for data warehouses/OLAP (Decision Support Systems, 1999).
Ongoing Work
–Efficient electronic catalog integration
–Designing efficient micropayment schemes
–Designing structures that allow fast insertion into data warehouses
