Data Integration
Aggregate Query Answering under Uncertain Schema Mappings
Avigdor Gal, Maria Vanina Martinez, Gerardo I. Simari, V.S. Subrahmanian
Presented by Stephen Lynn
Overview
  Aggregate Queries
  Probabilistic Schema Mapping
  Goals/Objectives
  Aggregate Processing (3 proposals)
  By-Table Algorithm
  By-Tuple Algorithm
  Evaluation
  Analysis
Aggregate Queries
  COUNT, MIN, MAX, SUM, AVG
  Example table columns: ID, Price, Quantity
  Simple PTIME algorithms to compute
Probabilistic Schema Mappings
By-Table vs By-Tuple
  By-Tuple: consider all possible mappings for each tuple
  By-Table: a single mapping applies to the entire table
  P(date→postedDate) = 0.7
  P(date→reducedDate) = 0.3
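The difference between the two semantics can be sketched by enumerating the possible worlds each one allows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the by-tuple probability of a world is assumed to be the product of independent per-tuple mapping choices. Only the two probabilities come from the slide.

```python
from itertools import product

# Mapping probabilities from the slide; the dictionary layout is illustrative.
MAPPINGS = {"postedDate": 0.7, "reducedDate": 0.3}


def by_table_worlds(n_tuples, mappings=MAPPINGS):
    """By-table: one mapping is chosen once and applied to every tuple."""
    return [((target,) * n_tuples, prob) for target, prob in mappings.items()]


def by_tuple_worlds(n_tuples, mappings=MAPPINGS):
    """By-tuple: each tuple picks its mapping independently (assumption)."""
    worlds = []
    for choice in product(mappings, repeat=n_tuples):
        prob = 1.0
        for target in choice:
            prob *= mappings[target]
        worlds.append((choice, prob))
    return worlds


table_worlds = by_table_worlds(3)
tuple_worlds = by_tuple_worlds(3)
# With 2 mappings and 3 tuples: 2 by-table worlds, 2**3 = 8 by-tuple worlds.
print(len(table_worlds), len(tuple_worlds))
```

The by-table world count grows with the number of mappings only, while the by-tuple count grows exponentially in the number of tuples, which is why naive enumeration is not a practical by-tuple algorithm.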
Goals/Objectives
  Impact Analysis of Probabilistic Schemas on Aggregate Queries
  Aggregate Query Algorithms
  Time Complexity Analysis
  Evaluation
Aggregation Methods
  Range
  Distribution
  Expected Value
Method Relationships
  Distribution
    Most time consuming
    Most information
  Range
    Computed directly from distribution
  Expected Value
    Computed directly from distribution
    More efficient ways to compute
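The claim that range and expected value follow directly from the distribution can be shown in a few lines. This is a sketch under assumptions: the distribution is represented as a dictionary from aggregate value to probability, and the numbers in `dist` are made up for illustration.

```python
def range_of(distribution):
    """Smallest and largest aggregate values that have non-zero probability."""
    support = [value for value, prob in distribution.items() if prob > 0]
    return min(support), max(support)


def expected_value(distribution):
    """Probability-weighted average of the possible aggregate values."""
    return sum(value * prob for value, prob in distribution.items())


# Hypothetical distribution of an aggregate result: value -> probability.
dist = {2: 0.7, 5: 0.3}
print(range_of(dist))
print(expected_value(dist))
```

Going the other way is not possible: a range or an expected value alone does not determine the distribution, which is why the distribution carries the most information.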
By-Table Algorithm
  All PTIME computable
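A minimal sketch of why by-table evaluation stays polynomial: the aggregate is evaluated once per mapping and the results are combined by probability. The rows, the numeric day values, and the helper names below are hypothetical; only the two mapping probabilities come from the earlier slide.

```python
from collections import defaultdict


def by_table_aggregate(rows, mappings, aggregate):
    """Evaluate `aggregate` once per mapping and combine the results.

    mappings: {source_attribute: probability}
    aggregate: callable(rows, attribute) -> number
    """
    distribution = defaultdict(float)
    for attr, prob in mappings.items():
        distribution[aggregate(rows, attr)] += prob
    values = list(distribution)
    return {
        "distribution": dict(distribution),
        "range": (min(values), max(values)),
        "expected": sum(v * p for v, p in distribution.items()),
    }


# Hypothetical source rows; dates are simplified to day numbers.
rows = [
    {"postedDate": 3, "reducedDate": 9},
    {"postedDate": 5, "reducedDate": 6},
    {"postedDate": 8, "reducedDate": 12},
]
mappings = {"postedDate": 0.7, "reducedDate": 0.3}


def count_before_day_7(rows, attr):
    """COUNT(*) WHERE date < 7, with `date` mapped to `attr`."""
    return sum(1 for row in rows if row[attr] < 7)


result = by_table_aggregate(rows, mappings, count_before_day_7)
print(result)
```

With m mappings this runs the underlying aggregate m times, so the cost is m times the cost of one ordinary aggregate query.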
By-Tuple Algorithm (COUNT)
  O(n * m)
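One way to reach an O(n * m) bound for by-tuple COUNT is a single pass over the tuples that checks each tuple against every mapping. The sketch below assumes n is the number of tuples and m the number of mappings (the slide does not define them), that every mapping has non-zero probability, and that tuples choose mappings independently. It is an illustration of the idea, not necessarily the algorithm from the paper, and the data is hypothetical.

```python
def by_tuple_count(rows, mappings, condition):
    """Range and expected value of COUNT under by-tuple semantics.

    mappings: {source_attribute: probability}, all probabilities > 0
    condition: callable(value) -> bool, the selection predicate
    """
    lower = upper = 0
    expected = 0.0
    for row in rows:                          # n tuples
        p_satisfied = 0.0
        n_satisfied = 0
        for attr, prob in mappings.items():   # m mappings
            if condition(row[attr]):
                p_satisfied += prob
                n_satisfied += 1
        if n_satisfied == len(mappings):
            lower += 1      # counted no matter which mapping is chosen
        if n_satisfied > 0:
            upper += 1      # counted under at least one mapping
        expected += p_satisfied  # linearity of expectation
    return (lower, upper), expected


# Hypothetical source rows; dates are simplified to day numbers.
rows = [
    {"postedDate": 3, "reducedDate": 9},
    {"postedDate": 5, "reducedDate": 6},
    {"postedDate": 8, "reducedDate": 12},
]
mappings = {"postedDate": 0.7, "reducedDate": 0.3}

count_range, count_expected = by_tuple_count(rows, mappings, lambda day: day < 7)
print(count_range, count_expected)
```

The full distribution of COUNT needs more work than this pass, since it depends on how the per-tuple probabilities combine across tuples.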
Example: By-Tuple (COUNT)
Time Complexity
Evaluation
  Empirical Evaluation
    Real-world dataset (eBay)
    Synthetic dataset
  Evaluate Time Complexity
    Vary number of tuples
    Vary attribute mappings
Evaluation Results
Analysis
  Strengths
    Effect of probabilistic schemas on aggregates
    Nice PTIME algorithms
  Weaknesses
    Evaluation was obvious
    By-Table results biased by database optimizations
  Future Work
    Improve algorithms
    Extend to sub-queries
    Heuristics