UPI: A Primary Index for Uncertain Databases (VLDB 10) Hideaki Kimura (BrownU) Samuel Madden (MIT) Stanley B. Zdonik (BrownU) Speaker: Yinuo Zhang Supervisor: Dr. Reynold Cheng
Outline: Introduction; Uncertain Primary Index (UPI); Secondary Index on UPI; Experiments; Conclusion and Future Work
Introduction
Table Author:
Name | Institution^p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
A Possible World:
Name | Institution
Alice | Brown
Bob | MIT
Query answering over Possible World Semantics: SELECT * FROM Author WHERE Institution=MIT, Threshold: confidence QT
The probability of such world: 90%*80% * 100%*95% * 20% = 13.7%
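The arithmetic on this slide can be reproduced with a short sketch. It assumes that tuples are independent and that a tuple missing from the world contributes (1 - existence); the names `AUTHORS` and `world_probability` are hypothetical and used only for illustration.

```python
# Sample data from the slide: name -> (existence, {institution: probability}).
AUTHORS = {
    "Alice": (0.90, {"Brown": 0.80, "MIT": 0.20}),
    "Bob": (1.00, {"MIT": 0.95, "UCB": 0.05}),
    "Carol": (0.80, {"Brown": 0.60, "U. Tokyo": 0.40}),
}


def world_probability(world, authors=AUTHORS):
    """Probability of one possible world.

    `world` maps each author present in the world to the chosen institution;
    authors left out of `world` are treated as non-existent (assumption:
    tuples are independent of each other).
    """
    p = 1.0
    for name, (existence, alternatives) in authors.items():
        if name in world:
            p *= existence * alternatives[world[name]]
        else:
            p *= 1.0 - existence
    return p


# The world shown on the slide: Alice at Brown, Bob at MIT, Carol absent.
p = world_probability({"Alice": "Brown", "Bob": "MIT"})
print(f"{p:.1%}")
```

The three factors are 90%*80% for Alice at Brown, 100%*95% for Bob at MIT, and 20% for Carol being absent.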
Ex) DBLP with Uncertain Affiliation
DBLP: 1.3M Papers and 0.7M Authors
Complemented Author Affiliation: Name = David DeWitt, Inst. = ?
Google API, q=“David DeWitt”:
Rank | URL
1 | Wisc.edu/…
2 | Microsoft.com/…
3 | Columbia.edu/…
Zipfian Distribution:
Name | Institution^p | Country^p
David DeWitt | Wisconsin: 40%, MS: 20%, Columbia: 13%, … | US: 100%
Introduction
Achieving an efficient implementation using possible world semantics is difficult.
Probabilistic Inverted Index (PII) [Singh07]: a secondary index
Institution | Pointer
Brown | [Alice] 0.3, [Carol] 0.2, [Bob] 0.4
MIT | [Bob] 0.3, [Alice] 0.8
Pointers lead into the Heap: Disk Seeking
Introduction
Goal: Build A Primary Index Over Uncertain Attributes
Primary Index: Seq. Read
Challenges on Building PI over Uncertain Data
Name | Institution^p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
Cluster on most probable possible value?
Cluster | Tuples
Brown | Alice, Carol
MIT | Bob
SELECT … WHERE Inst.=MIT: Alice?
Replicate tuples into inverted index?
Cluster | Tuples
Brown | Alice, Carol, …
MIT | Alice, Bob, …
Too Large for Long-tail distribution (e.g., 100 values with 0.1%)
UPI: Heap + Cutoff Index
Name | Institution^p | Existence
Alice | Brown: 80%, MIT: 20% | 90%
Bob | MIT: 95%, UCB: 5% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | 80%
Heap: Sorted by (Inst., Prob)
Institution | Tuple
Brown (72%) | Alice
Brown (48%) | Carol
MIT (95%) | Bob
MIT (18%) | Alice
UCB (5%) | Bob (cut off: kept only in the Cutoff Index)
U. Tokyo (32%) | Carol
Cutoff Index: Sorted by (Inst., Prob)
Institution | TupleID | Pointer
UCB (5%) | Bob | MIT
Cutoff: Entries with Less than C probability (Cutoff Threshold)
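The split between heap and cutoff index can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: it assumes that the confidence of an entry is existence times alternative probability (which matches the percentages on the slide), that the most probable alternative always stays in the heap, and that a cutoff entry points at that alternative. `build_upi` and its tuple layouts are hypothetical names.

```python
# Sample data from the slide: name -> (existence, {institution: probability}).
AUTHORS = {
    "Alice": (0.90, {"Brown": 0.80, "MIT": 0.20}),
    "Bob": (1.00, {"MIT": 0.95, "UCB": 0.05}),
    "Carol": (0.80, {"Brown": 0.60, "U. Tokyo": 0.40}),
}


def build_upi(authors, cutoff):
    """Return (heap, cutoff_index), both sorted by (institution, probability desc).

    heap entries:         (institution, confidence, name)
    cutoff_index entries: (institution, confidence, name, pointer)
    """
    heap, cutoff_index = [], []
    for name, (existence, alternatives) in authors.items():
        entries = sorted(
            ((inst, existence * p) for inst, p in alternatives.items()),
            key=lambda e: -e[1],
        )
        # Assumption: the most probable alternative is never cut off,
        # so every tuple has at least one copy in the heap.
        first_inst = entries[0][0]
        for i, (inst, conf) in enumerate(entries):
            if i == 0 or conf >= cutoff:
                heap.append((inst, conf, name))
            else:
                # Below the cutoff threshold C: keep only a pointer to the
                # heap cluster that holds the tuple.
                cutoff_index.append((inst, conf, name, first_inst))
    heap.sort(key=lambda e: (e[0], -e[1]))
    cutoff_index.sort(key=lambda e: (e[0], -e[1]))
    return heap, cutoff_index


heap, cutoff_index = build_upi(AUTHORS, cutoff=0.10)
for inst, conf, name in heap:
    print(f"heap:   {inst} ({conf:.0%}) {name}")
for inst, conf, name, pointer in cutoff_index:
    print(f"cutoff: {inst} ({conf:.0%}) {name} -> {pointer}")
```

With C = 10%, only Bob's UCB (5%) alternative falls below the threshold, so it is the single cutoff entry and points at the MIT cluster.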
Answering Queries with UPI
Probabilistic Threshold Query (PTQ): SELECT * FROM Author WHERE Inst.=UCB With: Probability ≥ QT (Query Threshold)
Heap (C=10%):
Institution | Tuple
MIT (95%) | Bob
…
UCB (90%) | Dan
UCB (20%) | Emily
Cutoff Index:
Institution | TupleID | Pointer
UCB (5%) | Bob | MIT
If QT<C (e.g., QT=5%), read the Heap and also follow Cutoff pointers (Seek)
If QT≥C (e.g., QT=20%), Sequentially Read the Heap only
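The two cases can be sketched over in-memory lists that mirror the slide. This is an illustration under stated assumptions, not the actual storage layer: the list scan stands in for a sequential read of the heap run for the queried value, each cutoff pointer followed is counted as one seek, and the names `HEAP`, `CUTOFF_INDEX` and `ptq` are hypothetical.

```python
# Entries from the slide. Heap: (institution, confidence, name);
# cutoff index: (institution, confidence, name, pointer into the heap).
HEAP = [
    ("MIT", 0.95, "Bob"),
    ("UCB", 0.90, "Dan"),
    ("UCB", 0.20, "Emily"),
]
CUTOFF_INDEX = [("UCB", 0.05, "Bob", "MIT")]
C = 0.10  # cutoff threshold


def ptq(value, qt, heap=HEAP, cutoff_index=CUTOFF_INDEX, cutoff=C):
    """Return (results, seeks) for WHERE Inst.=value with probability >= qt."""
    # Sequential part: entries for `value` are contiguous in the heap,
    # so a real index would read them in one pass.
    results = [
        (name, conf) for inst, conf, name in heap if inst == value and conf >= qt
    ]
    seeks = 0
    if qt < cutoff:
        # Only a threshold below C can match cut-off entries; each match
        # costs a seek to the heap location named by the pointer.
        for inst, conf, name, _pointer in cutoff_index:
            if inst == value and conf >= qt:
                results.append((name, conf))
                seeks += 1
    return results, seeks


print(ptq("UCB", 0.05))  # QT < C: heap entries plus one cutoff pointer
print(ptq("UCB", 0.20))  # QT >= C: heap entries only, no seeks
```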
Choosing Cutoff Threshold
Cutoff Threshold C: a smaller C is Faster, but Larger; a larger C is Slower, but Smaller
SELECT * FROM Author WHERE Institution=Ishikawa U, Threshold: confidence QT (QT is given at runtime)
Determining C Based on Value/Probability Histograms
Histograms (Inst.):
Value | #Keys
… |
Br* | 30,000
Bs* | 31,000
Bt* | 30,500
Prob. | #Keys
… |
10%-15% | 15,000
15%-25% | 28,000
25%-40% | 33,000
Tolerable average query runtime = Cost_fullscan * Selectivity + Cost_seek * #Pointers
Available Disk Capacity -> UPI Size
[Plots: #Pointers vs. C; UPI Size vs. C]
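The runtime formula on this slide can be evaluated directly. In the sketch below, only the shape of the formula comes from the slide; the constants (a 60 s full scan, 10 ms per seek, 0.1% selectivity) are hypothetical placeholders, and `estimated_query_cost` is an illustrative name. The pointer counts 300 and 37,000 are the ones used in the deck's cost-model slides.

```python
def estimated_query_cost(cost_fullscan, selectivity, cost_seek, num_pointers):
    """Linear cost model from the slide:
    Cost_fullscan * Selectivity + Cost_seek * #Pointers (all costs in seconds).
    """
    return cost_fullscan * selectivity + cost_seek * num_pointers


# Hypothetical constants, chosen only to make the arithmetic concrete.
COST_FULLSCAN = 60.0  # seconds to scan the whole heap
COST_SEEK = 0.010     # seconds per random seek
SELECTIVITY = 0.001   # fraction of the heap read sequentially

selective = estimated_query_cost(COST_FULLSCAN, SELECTIVITY, COST_SEEK, 300)
non_selective = estimated_query_cost(COST_FULLSCAN, SELECTIVITY, COST_SEEK, 37_000)
print(f"#Pointers=300:    {selective:.2f} s")
print(f"#Pointers=37000:  {non_selective:.2f} s")
```

Under these placeholder constants the seek term dominates as soon as #Pointers is large, which is why the choice of C matters. The deck's next slide replaces the purely linear seek term with a saturating (logistic) model.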
#Pointers and Query Cost (replace Ishikawa U with Stanford) Saturation Cost Model Logistic function
Secondary Index on UPI
SELECT * FROM Author WHERE Country=US
Name | Institution^p | Country^p | Existence
Alice | Brown: 80%, MIT: 20% | US: 100% | 90%
Bob | MIT: 95%, UCB: 5% | US: 100% | 100%
Carol | Brown: 60%, U. Tokyo: 40% | US: 60%, Japan: 40% | 80%
Heap:
Institution | Tuple
Brown (72%) | Alice
…
MIT (95%) | Bob
MIT (18%) | Alice
Secondary Index on (Country):
Country | TupleID | Pointers
US (100%) | Bob | MIT
US (90%) | Alice | Brown, MIT
Store Multiple Pointers -> Tailored Access
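One way to picture why multiple pointers help: when a tuple has several copies in the heap, the query can pick the copy that lies in a cluster it has to visit anyway. The sketch below uses a simple greedy popularity heuristic to make that choice. It is an assumption made for illustration and not necessarily the tailoring algorithm of the paper; `tailor` and `ENTRIES` are hypothetical names.

```python
from collections import Counter

# Secondary index entries for Country=US from the slide:
# tuple id -> heap clusters that hold a copy of the tuple.
ENTRIES = {
    "Bob": ["MIT"],
    "Alice": ["Brown", "MIT"],
}


def tailor(entries):
    """Pick one pointer per tuple, preferring clusters shared with other tuples.

    Greedy heuristic (assumption): a cluster referenced by many matching
    tuples is likely to be read anyway, so routing more tuples to it
    reduces the number of distinct heap regions visited.
    """
    popularity = Counter(c for pointers in entries.values() for c in pointers)
    return {
        tuple_id: max(pointers, key=lambda c: (popularity[c], c))
        for tuple_id, pointers in entries.items()
    }


chosen = tailor(ENTRIES)
print(chosen)
print("heap clusters visited:", sorted(set(chosen.values())))
```

Bob can only be read from the MIT cluster, so Alice is also read from MIT rather than Brown, and the query touches one heap region instead of two.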
Experiments
Environment: C++ & BerkeleyDB 4.7 on Fedora Core 11; Quad-Core, 4GB RAM, 10k RPM SATA HDD
Dataset: DBLP w/ Uncertain Affiliation; 700k authors and 1.3M publications; SwetoDblp, Google API (up to ten institutions per author)
Compared With PII
Query Runtime: PII vs. UPI
Q1: SELECT * FROM Author WHERE Institution=x
PII, Elapsed: Read PII 5 [ms], Sort Pointer 30 [µs], Read Heap 5,200 [ms]
UPI, Elapsed: Read UPI 47 [ms]
UPI Causes Far Fewer Disk Seeks
Secondary Index Access
Q2: SELECT Journal, COUNT(*) FROM Publication WHERE Country=x GROUP BY Journal
Elapsed: Read PII 110 [ms], Read UPI 3,200 [ms]
Elapsed (with Tailor step): Read PII 110 [ms], Tailor 33 [ms], Read UPI 500 [ms]
Conclusion and Future Work
UPI: Heap + Cutoff Index
Tailored Secondary Index Access
Fractured UPI (not presented here)
Applying to other types of queries, e.g., Top-k Query: UPI as Tuple Access Layer
Thanks!
Fractured UPI
[Figure labels: SELECT, INSERT, DELETE; Insert Buffer (On RAM); Dump; New Fracture; Main Fracture; Fracture 1; Heap File; Cutoff Index; Delete Set; 2ndary Index; Query Independently]
Fractured UPI
Structure | Insert 10% | Delete 1%
Unclustered Heap | 8 sec | 75 sec
UPI | 650 sec | 212 sec
Fractured UPI | 4 sec | 0.03 sec
[Figure annotations: Fragmentation; More Fractures]
Cutoff Index Cost Model (1): Selective Case (Q1, #Pointers=300) [Plot: Real Runtime vs. Estimated Runtime]
Cutoff Index Cost Model (2): Non-Selective Case (Q1, #Pointers=37000) [Plot: Real Runtime vs. Estimated Runtime]