Managing Uncertain Data. Anish Das Sarma, Stanford University. May 19, 2015.

Presentation transcript:

Managing Uncertain Data. Anish Das Sarma, Stanford University. May 19, 2015.

What is Uncertain Data?
(Certain) Data | Uncertain Data
Temperature is F | Sensor reported 75 ±0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (80%) or a Sparrow (20%)
It will rain in Stanford tomorrow | There is a 60% chance of rain in Stanford tomorrow
Yahoo stock will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John’s age is 23 | John’s age is in [20,30]

Why Does It Arise? The same certain/uncertain examples are shown again, annotated with the sources of uncertainty: precision of devices, lack of information, uncertainty about the future, and anonymization.

Applications: Information Extraction. Extracted table with columns Restaurant and Zip; example row: Hard Rock Cafe.

Applications: Information Integration. Source schemas (name, hPhone, oPhone, hAddr, oAddr) and (name, phone, address) feed a combined view.

Applications: Deduplication. Names "John Doe" and "J. Doe": same person? 80% match.

Applications: Scientific & Medical Experiments. Example result: "Probably not cancer".

How Do Database Management Systems (DBMS) Handle Uncertainty? They don’t.

What Do (Most) Applications Do? Clean: turn the data into data that DBMSs can handle. Drawbacks: (1) loss of information; (2) errors compound insidiously.
Observer | Bird-1 (uncertain) | Bird-1 (cleaned)
Mary | Finch: 80%, Sparrow: 20% | Finch
Susan | Dove: 70%, Sparrow: 30% | Dove
Jane | Hummingbird: 65%, Sparrow: 35% | Hummingbird
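The cleaning step on this slide can be sketched in a few lines of Python. The sketch below keeps only the most likely species per observer, then contrasts that with combining the three reports. The combination rule is an assumption made for illustration (the slide does not state it): all three observers are taken to describe the same bird, independently of each other.

```python
# Per-observer distributions copied from the slide.
sightings = {
    "Mary":  {"Finch": 0.80, "Sparrow": 0.20},
    "Susan": {"Dove": 0.70, "Sparrow": 0.30},
    "Jane":  {"Hummingbird": 0.65, "Sparrow": 0.35},
}

def clean(dist):
    """'Cleaning': keep only the most likely alternative."""
    return max(dist, key=dist.get)

cleaned = {observer: clean(dist) for observer, dist in sightings.items()}

# Assumption (not on the slide): the reports are independent and describe the
# same bird, so a species' joint score is the product of the probabilities
# each observer assigned to it (0 if an observer ruled it out).
species = {s for dist in sightings.values() for s in dist}
joint = {s: 1.0 for s in species}
for dist in sightings.values():
    for s in species:
        joint[s] *= dist.get(s, 0.0)
consistent = {s for s, score in joint.items() if score > 0}

print(cleaned)     # three different species, one per observer
print(consistent)  # the only species every observer allows
```

Under that assumption, cleaning discards Sparrow from every row even though it is the only species all three reports agree is possible, which is one way to read "errors compound insidiously".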

Outline of the Talk
Part 1: Managing Uncertainty in a DBMS (theory → systems)
Part 2: Handling Uncertainty in Data Integration (systems → theory)
Other Research (trailer)
Future Plans

Part 1: Managing Uncertain Data. Primarily in the context of the Trio project: 1) Data 2) Uncertainty 3) Lineage. Today’s focus: how lineage helps.

Uncertain Data. Examples: Sensor reported 75 ±0.5 F. Bob works for either Yahoo or Microsoft. Mary sighted either a Finch (80%) or a Sparrow (20%). There is a 60% chance of rain in Stanford tomorrow. An uncertain database represents a set of possible instances (or possible worlds). Our work: finite sets of possible instances.

Representing Uncertain Data. 20+ years of work (mostly theoretical). There appears to be a fundamental trade-off between expressiveness and intuitiveness. We spent some time exploring the space of models for uncertainty.

Hierarchy of Models [ICDE 06]. Legend: R = relations, A = or-sets, ? = maybe-tuples, 2 = 2-clauses, prop = full propositional logic, sets = tuple-sets. One end of the hierarchy is expressive but complex; the other is intuitive but inexpressive. Next: 1. Consider a model M 2. Isolate inexpressiveness 3. Solve problem with lineage.

Running Example: Crime-Solver
Saw (witness, color, car) // may be uncertain
Drives (person, color, car) // may be uncertain
Suspects (person) = π person (Saw ⋈ Drives)

Simple Model M. 1. Alternatives: uncertainty about value. 2. ‘?’ (Maybe) annotations.
Saw (witness, color, car): Amy: (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda)
Three possible instances.

Simple Model M. 1. Alternatives. 2. ‘?’ (Maybe): uncertainty about presence.
Saw (witness, color, car): Amy: (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda); Betty: (blue, Acura) ?
Six possible instances.
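The count of six possible instances can be checked by enumeration. The following is a minimal sketch, assuming a hypothetical encoding in which each tuple is a list of alternatives plus a flag for the ‘?’ annotation; the function name `possible_instances` is illustrative and not part of Trio.

```python
from itertools import product

# Each entry: (list of alternatives, has '?' annotation). Values follow the slide.
saw = [
    ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "orange", "Mazda")], False),
    ([("Betty", "blue", "Acura")], True),  # '?': the tuple may be absent
]

def possible_instances(relation):
    """Return the set of possible instances of a relation in model M."""
    choices = []
    for alternatives, maybe in relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "tuple absent"
        choices.append(options)
    # One instance per way of picking exactly one option for every tuple.
    return {
        frozenset(t for t in combo if t is not None)
        for combo in product(*choices)
    }

instances = possible_instances(saw)
print(len(instances))
```

Three alternatives for Amy's tuple times two options (present or absent) for Betty's tuple give the six instances named on the slide.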

Review: Relational Queries. A query Q maps a database D to a result S.
Saw (witness, color, car): (Amy, red, Honda); (Betty, blue, Acura)
Q = π witness (σ color=red (Saw)) gives W (witness): Amy

Queries on Uncertain Data. An uncertain database D represents possible instances I1, I2, …, In. Applying Q to each instance yields J1, J2, …, Jm, which a result D′ should represent. A direct implementation computes D′ from D without enumerating instances.
Closure: the up-arrow (a representation D′ of the result instances) always exists.
Completeness: all sets of possible instances can be represented.

Model M is Not Closed
Saw (witness, car): Cathy: Honda ∥ Mazda
Drives (person, car): (Jimmy, Toyota) ∥ (Jimmy, Mazda); (Billy, Honda) ∥ (Frank, Honda); (Hank, Honda)
Suspects = π person (Saw ⋈ Drives): Jimmy ?; Billy ∥ Frank ?; Hank ?
This Suspects relation CANNOT correctly capture the possible instances in the result.
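To see why model M fails here, one can enumerate the possible instances of Saw and Drives, run the query on each, and compare the outcome with what the annotated Suspects relation would allow. This is an illustrative sketch; the list encodings and variable names are assumptions, and only the data values come from the slide.

```python
from itertools import product

saw = [["Honda", "Mazda"]]  # Cathy's sighting: one tuple, two alternatives
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def suspects(seen_car, drives_instance):
    """pi_person(Saw join Drives) on one certain instance."""
    return frozenset(person for person, car in drives_instance if car == seen_car)

# Query result on every combination of alternatives.
true_instances = {
    suspects(seen, world)
    for (seen,) in product(*saw)
    for world in product(*drives)
}

# What the slide's Suspects relation means in model M:
# three tuples, each with an independent '?' (None = absent).
m_choices = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
m_instances = {
    frozenset(p for p in combo if p is not None)
    for combo in product(*m_choices)
}

print(sorted(map(sorted, true_instances)))
print(len(m_instances))
```

The query has only four possible results, while the M representation admits twelve, including {Jimmy, Hank}, which can never occur because Jimmy qualifies only when Cathy saw a Mazda and Hank only when she saw a Honda.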

Lineage to the Rescue. Model M + Lineage = Completeness.

Example with Lineage
Saw (witness, car): 11: Cathy: Honda ∥ Mazda
Drives (person, car): 21: (Jimmy, Toyota) ∥ (Jimmy, Mazda); 22: (Billy, Honda) ∥ (Frank, Honda); 23: (Hank, Honda)
Suspects = π person (Saw ⋈ Drives): 31: Jimmy ?; 32: Billy ∥ Frank ?; 33: Hank ?

Example with Lineage
Saw (witness, car): 11: Cathy: Honda ∥ Mazda
Drives (person, car): 21: (Jimmy, Toyota) ∥ (Jimmy, Mazda); 22: (Billy, Honda) ∥ (Frank, Honda); 23: (Hank, Honda)
Suspects = π person (Saw ⋈ Drives): 31: Jimmy ?; 32: Billy ∥ Frank ?; 33: Hank ?
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23
Correctly captures possible instances in the result.
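The lineage formulas can be evaluated mechanically: pick one alternative for every base tuple, then keep exactly those Suspects alternatives whose lineage is satisfied. The sketch below assumes a hypothetical dictionary encoding; writing λ(33) as (11,1) ∧ (23,1) relies on tuple 23 having a single alternative.

```python
from itertools import product

# Number of alternatives per base tuple (IDs as on the slide).
base = {11: 2, 21: 2, 22: 2, 23: 1}

# Lineage: each result alternative is a conjunction of base alternatives,
# written as (tuple ID, alternative number).
lineage = {
    (31, "Jimmy"): [(11, 2), (21, 2)],
    (32, "Billy"): [(11, 1), (22, 1)],
    (32, "Frank"): [(11, 1), (22, 2)],
    (33, "Hank"):  [(11, 1), (23, 1)],
}

def instances_with_lineage(base, lineage):
    """Possible result instances: one per choice of base alternatives."""
    ids = sorted(base)
    result = set()
    for picks in product(*(range(1, base[i] + 1) for i in ids)):
        world = dict(zip(ids, picks))
        present = frozenset(
            person
            for (_, person), conjunction in lineage.items()
            if all(world[i] == alt for i, alt in conjunction)
        )
        result.add(present)
    return result

suspect_instances = instances_with_lineage(base, lineage)
print(sorted(map(sorted, suspect_instances)))
```

With lineage attached, the representation yields only the four instances that the query can actually produce.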

Trio’s Data Model: Uncertainty-Lineage Databases (ULDBs). 1. Alternatives 2. ‘?’ (Maybe) annotations 3. Confidence values (next) 4. Lineage. Theorem: ULDBs are closed and complete [VLDB 06]. Formally studied properties like minimization, equivalence, approximation, and membership [VLDB 06, VLDB J. 08].

Confidence Values in Trio. Confidence values are supplied with base data, with a default probabilistic interpretation. Problem: compute confidence values on result data [ICDE 08]. 5-minute DBClip: search “confidence computation” on YouTube.

Problem Description
Saw (witness, car): 11: (Amy, Honda) : 0.5; 12: (Betty, Acura) : 0.6
Drives (person, car): 21: (Jimmy, Honda) : 0.9; 22: (Billy, Honda) : 0.8; 23: (Hank, Acura) : 1.0
Cars = π car (Saw ⋈ Drives): 41: Honda : ?; 42: Acura : ?

Operator-by-Operator
Saw ⋈ Drives: 31: (Amy, Jimmy, Honda) : 0.5*0.9 = 0.45; 32: (Amy, Billy, Honda) : 0.4; 33: (Betty, Hank, Acura)
π car: 41: Honda : 0.45 + 0.4 - (0.45*0.4) = 0.67. Wrong!!

Operator-by-Operator. The value 0.67 treats tuples 31 and 32 as independent, but they are not independent: both derive from Saw tuple 11.
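The gap between the two numbers can be reproduced directly. In this sketch the base confidences 0.5, 0.9, and 0.8 are read off the arithmetic shown on the slides, and the base tuples are assumed to be independent of each other; the exact value is obtained by brute-force enumeration rather than by Trio's actual implementation.

```python
from itertools import product

# Base confidences (0.5, 0.9, 0.8 inferred from the slide's arithmetic).
p_amy_honda = 0.5    # Saw tuple 11
p_jimmy_honda = 0.9  # Drives tuple 21
p_billy_honda = 0.8  # Drives tuple 22

# Operator-by-operator: join first, then combine the join results
# as if they were independent events.
p31 = p_amy_honda * p_jimmy_honda  # (Amy, Jimmy, Honda)
p32 = p_amy_honda * p_billy_honda  # (Amy, Billy, Honda)
naive = p31 + p32 - p31 * p32

# Exact: sum the weight of every presence/absence assignment of the three
# base tuples in which Honda appears, i.e. 11 and (21 or 22).
exact = 0.0
for amy, jimmy, billy in product([True, False], repeat=3):
    weight = (
        (p_amy_honda if amy else 1 - p_amy_honda)
        * (p_jimmy_honda if jimmy else 1 - p_jimmy_honda)
        * (p_billy_honda if billy else 1 - p_billy_honda)
    )
    if amy and (jimmy or billy):
        exact += weight

print(round(naive, 2), round(exact, 2))
```

The naive figure counts Amy's sighting twice, once through each join result, which is why it overshoots.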

Database Query Processing 101. A query Q has many query execution plans; pick and execute the best plan, using statistics and indexes.

Operator-by-Operator Confidence Computation. The set of usable query plans for Q can be much smaller, or empty.

Decouple Data and Confidence Computation. 1. Compute data 2. Use lineage to compute confidences (on demand). Theorem: Arbitrary improvement [ICDE 08].

Our Approach
Saw (witness, car): 11: (Amy, Honda) : 0.5; 12: (Betty, Acura) : 0.6
Drives (person, car): 21: (Jimmy, Honda) : 0.9; 22: (Billy, Honda) : 0.8; 23: (Hank, Acura) : 1.0
Cars: 41: Honda : 0.49; 42: Acura : 0.6
λ(41) = 11 ∧ (21 ∨ 22); λ(42) = 12 ∧ 23
The confidence of 41 is the probability of 11 ∧ (21 ∨ 22), which is 0.49; the confidence of 42 is 0.6 * 1.0 = 0.6. Correct!!

Algorithm. For a result tuple t in R whose lineage reaches base tuples, e.g. λ(t) = f(t4, t5, t6, t7): 1. Expand lineage to base data 2. Get confidence of base data 3. Evaluate the probability of λ(t). Optimizations: detecting independence, memoization, batch computation.
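The three steps can be sketched as follows. This is an illustration under stated assumptions, not Trio's implementation: lineage formulas are encoded as nested tuples, base tuples are assumed independent, and step 3 uses exhaustive enumeration (exponential in the number of base tuples) instead of the optimizations listed on the slide. Only the expansion step is memoized, via `lru_cache`. The intermediate tuples 31 and 32 and the confidence values reuse the Saw/Drives example.

```python
from functools import lru_cache
from itertools import product

# Hypothetical encoding: a formula is a tuple ID or ("and" | "or", child, ...).
lineage = {  # derived tuple -> formula over other tuples
    41: ("or", 31, 32),
    31: ("and", 11, 21),
    32: ("and", 11, 22),
}
confidence = {11: 0.5, 21: 0.9, 22: 0.8}  # base tuples only

@lru_cache(maxsize=None)
def expand(t):
    """Step 1: rewrite a tuple's lineage so it mentions base tuples only."""
    if t not in lineage:
        return t
    op, *children = lineage[t]
    return (op, *(expand(c) for c in children))

def base_ids(formula):
    """Collect the base tuple IDs that occur in an expanded formula."""
    if not isinstance(formula, tuple):
        return {formula}
    return set().union(*(base_ids(c) for c in formula[1:]))

def holds(formula, world):
    """Evaluate an expanded formula under a truth assignment."""
    if not isinstance(formula, tuple):
        return world[formula]
    combine = all if formula[0] == "and" else any
    return combine(holds(c, world) for c in formula[1:])

def probability(t):
    """Steps 2 and 3: weight every truth assignment of the base tuples."""
    formula = expand(t)
    ids = sorted(base_ids(formula))
    total = 0.0
    for values in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, values))
        weight = 1.0
        for i in ids:
            weight *= confidence[i] if world[i] else 1 - confidence[i]
        if holds(formula, world):
            total += weight
    return total

print(expand(41))
print(round(probability(41), 2))
```

Because the probability is computed from base tuples only, the shared dependence of 31 and 32 on tuple 11 is handled correctly without any special case.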

Some Other Trio Work
Modifications and Versioning [TR 08]: stored derived relations; modifications → versions
Indexes and Statistics [MUD 08]: specialized indexes, histograms
Functional Dependencies & Schema Design [TR 07]: definitions, sound and complete axiomatization of FDs; lossless decomposition; FD testing, finding, and inference

Related Work (sample). Modeling Uncertainty: plenty, covered in textbooks. Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

Part 2: Data Integration. Reboot! (or, wake up!)

Traditional Data Integration: Setup
Sources D1, D2, D3, D4, D5, with schemas such as Bib(title, authors, conf, year) and Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
1. Mediated Schema: Publication(title, author, conf, year)
2. Schema Mappings, e.g. SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid
3. Query Answering, e.g. "Who authored the most SIGMOD papers in the 90’s?" Answer: Mike Carey
Significant up-front effort.
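To make the schema mapping concrete, the sketch below runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The three source tables follow the slide; the single row of sample data is invented for illustration, and the query is written without a comma before FROM so that it parses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Author(aid INTEGER, name TEXT);
    CREATE TABLE Paper(pid INTEGER, title TEXT, year INTEGER);
    CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);
    -- Invented sample data, not from the slides.
    INSERT INTO Author VALUES (1, 'A. Author');
    INSERT INTO Paper VALUES (10, 'A Sample Paper', 1995);
    INSERT INTO AuthoredBy VALUES (1, 10);
""")

# The mapping populates Publication(title, author, conf, year);
# this source has no conference attribute, so conf is NULL.
mapping = """
    SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid
"""
rows = conn.execute(mapping).fetchall()
print(rows)
```

Each source needs a hand-written mapping of this kind, which is the up-front effort the slide refers to.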

“Pay-As-You-Go” Data Integration. 1. Automated best-effort integration from the outset 2. Further improve the system over time with feedback. How advanced a starting point can we provide?

Uncertainty to the Rescue. Automatic integration → make guesses → model probabilities. Specifically: probabilistic schema mappings and a probabilistic mediated schema. >90% accuracy in automatically integrating data sources for several domains [SIGMOD 08].

Next: 1. Probabilistic mediated schemas 2. Probabilistic schema mappings 3. Experimental results

Mediated Schema
S1(name, , phone-num, address); S2(person-name, phone, mailing-addr)
Med-S (name, , phone, addr) with clusters {name, person-name}, {phone-num, phone}, {address, mailing-addr}, { }
A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.

Example
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Q: SELECT name, hPhone, oPhone FROM Med ?

Example
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Q: SELECT name, phone, address FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Probabilistic Mediated Schema
S1(name, hPhone, oPhone, hAddr, oAddr); S2(name, phone, address)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}), Pr=0.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}), Pr=0.5
A probabilistic mediated schema (p-med-schema) is a set M = {(M_1, Pr(M_1)), …, (M_k, Pr(M_k))} where each M_i is a mediated schema; i ≠ j ⇒ M_i ≠ M_j; Pr(M_i) ∈ (0,1]; Σ Pr(M_i) = 1.
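The definition translates into a small validity check. In this sketch a mediated schema is encoded as a list of attribute sets, and `is_p_med_schema` is a hypothetical helper name; the three conditions it tests are the ones in the definition (distinct schemas, probabilities in (0,1], probabilities summing to 1).

```python
from math import isclose

def is_p_med_schema(schemas):
    """schemas: list of (mediated_schema, probability) pairs, where a
    mediated schema is an iterable of attribute clusters (sets)."""
    normalized = [frozenset(frozenset(c) for c in m) for m, _ in schemas]
    probs = [p for _, p in schemas]
    distinct = len(set(normalized)) == len(normalized)   # i != j => M_i != M_j
    in_range = all(0 < p <= 1 for p in probs)            # Pr(M_i) in (0, 1]
    return distinct and in_range and isclose(sum(probs), 1.0)

med3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"}]
med4 = [{"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"}]

print(is_p_med_schema([(med3, 0.5), (med4, 0.5)]))
print(is_p_med_schema([(med3, 0.5), (med3, 0.5)]))
```

Clusters are converted to frozensets so that two schemas listing the same clusters in a different order count as the same schema.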

P-Mappings
PM1: four candidate mappings between Med3 (name, hPP, oP, hAA, oA) and S1(name, hP, oP, hA, oA), with Pr = .64, .16, .16, .04
PM2: four candidate mappings between Med4 (name, oPP, hP, oAA, hA) and S1(name, hP, oP, hA, oA), with Pr = .64, .16, .16, .04

Expressive Power of P-Med-Schema & P-Mapping
Theorem 1. For one-to-many mappings: (p-med-schema + p-mappings) = (mediated schema + p-mappings) > (p-med-schema + mappings)
Theorem 2. When restricted to one-to-one mappings: (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mappings)

Next: Creating p-med-schemas (briefly). Creating p-mappings (briefly). Experimental Results.

P-med-schema Creation. 1. Certain/uncertain edges. Figure: a graph over the attributes of sources S1 and S2 (name, address, -address, pname, home-address).

P-med-schema Creation. 2. Clustering. Figure: four alternative clusterings of the same attribute graph.

P-med-schema Creation. 3. Assign probabilities. Figure: the four clusterings with probabilities, including Pr=1/6 and Pr=1/3.

P-mapping Creation
S=(num, pname, home-addr, office-addr); T=(name, mailing-addr)
Goal: find a p-mapping that is consistent with a set of weighted correspondences.
Theorem: A consistent p-mapping exists if and only if, for every source or target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
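The condition in the theorem is easy to test. The sketch below checks it for weighted correspondences between S and T; the function name and the example weights are hypothetical, since the slide gives only the schemas. It tests whether a consistent p-mapping exists and does not construct one.

```python
from collections import defaultdict

def has_consistent_p_mapping(correspondences, tolerance=1e-9):
    """correspondences: iterable of (source_attr, target_attr, weight).
    True iff every attribute's total correspondence weight is at most 1."""
    totals = defaultdict(float)
    for source_attr, target_attr, weight in correspondences:
        totals[("S", source_attr)] += weight  # weights touching a source attribute
        totals[("T", target_attr)] += weight  # weights touching a target attribute
    return all(total <= 1 + tolerance for total in totals.values())

# Hypothetical weights for S=(num, pname, home-addr, office-addr), T=(name, mailing-addr).
ok = [
    ("pname", "name", 1.0),
    ("home-addr", "mailing-addr", 0.6),
    ("office-addr", "mailing-addr", 0.4),
]
bad = [
    ("pname", "name", 1.0),
    ("home-addr", "mailing-addr", 0.9),
    ("office-addr", "mailing-addr", 0.9),
]

print(has_consistent_p_mapping(ok), has_consistent_p_mapping(bad))
```

In the second set, mailing-addr carries a total weight of 1.8, so by the theorem no consistent p-mapping exists.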

Experiments. Data: tables extracted from HTML tables on the web.
Domain | #Sources | Search Keywords
Movie | 161 | movie, year
Car | 817 | make, model
People | 49 | job/title, organization/company/employer
Course | 647 | course/class, instructor/teacher/lecturer, subject/department/title
Bib | 649 | author, title, year, journal/conference

Experiments. Gold standard: manual. Approximate gold standard: semi-automatic. Precision, recall, and F-measure for several SQL queries with varying attributes and selectivities.
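For reference, the three measures can be computed from a query's returned answers and the gold-standard answers as below. This is the standard definition, with F-measure taken as the harmonic mean of precision and recall; whether the paper weights precision and recall equally is an assumption here, and the example answer sets are invented.

```python
def precision_recall_f(returned, gold):
    """Precision, recall and F-measure of returned answers against a gold standard."""
    returned, gold = set(returned), set(gold)
    hits = len(returned & gold)                      # correct answers returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure

# Invented example: 4 answers returned, 3 in the gold standard, 2 in common.
precision, recall, f_measure = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(precision, recall, f_measure)
```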

Quality of Query Answering. Table with columns Domain, Precision, Recall, F-measure. Rows against the golden standard: People, Course. Rows against the approximate golden standard: Movie, Car, People, Course (1, 1, 1), Bib.

Comparison with Other Approaches. Keyword search obtained low precision and low recall. Querying the sources directly or considering only the highest-probability mapping obtained low recall. We obtained the highest F-measure in all domains.

Comparison with Other Mediated-Schema Generation Methods. Using a p-med-schema obtained the highest F-measure in all domains.

System Setup Time (one domain)

Brief Related Work. Approximate schema mappings: [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]. Automatic generation of mediated schemas: [He et al. 2003]. More (see paper).

Finally… Other Research: Data Integration (2); Deduplication (2); Quality Estimation of Sensor/RFID Streams [IQIS 06]. Future Plans.

Data Integration
Problem: Foundations for integration of uncertain data. Solution [TR 08]: define open- and closed-containment for uncertain data; algorithms and complexity of consistency checking and finding maximally-correct query answers.
Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism). Solution [TR 08]: algorithms and complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering.

Deduplication
[SIGMOD 07]: Leveraging real-world constraints for deduplication. Tractable optimal solution and experiments over DBLP and ACM publication data.
[WWW 07]: Detecting near-duplicate web pages for crawling. Efficient indexing scheme supporting crawling speeds over web-scale data.

Future Work: Short & Medium-Term
1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
2. More applications of uncertain data
3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …

Future Work: Long-Term
1. Applying uncertainty to other data management problems: query optimization? cloud computing?
2. Improve quality of data through conflict resolution and feedback
3. Web-data management: handling huge amounts of data that is conflicting, uncertain, redundant, dependent, …

Thanks! Anish Das Sarma (or search “Anish Das Sarma”)