Managing Uncertain Data
Anish Das Sarma
Stanford University
May 19, 2015

What is Uncertain Data?
(Certain) Data | Uncertain Data
Temperature is F | Sensor reported 75 ±0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (80%) or a Sparrow (20%)
It will rain in Stanford tomorrow | There is a 60% chance of rain in Stanford tomorrow
Yahoo stocks will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John’s age is 23 | John’s age is in [20,30]

Why Does It Arise?
- Precision of devices
- Lack of information
- Uncertainty about the future
- Anonymization

Applications: Information Extraction
Restaurant | Zip
Hard Rock Cafe |

Applications: Information Integration
(name, hPhone, oPhone, hAddr, oAddr)
(name, phone, address)
Combined View

Applications: Deduplication
Name: John Doe; J. Doe
? 80% match

Applications: Scientific & Medical Experiments
Probably not cancer

How Do Database Management Systems (DBMS) Handle Uncertainty?
They don’t

What Do (Most) Applications Do?
Clean: turn into data that DBMSs can handle
(1) Loss of information
(2) Errors compound insidiously
Observer | Bird-1
Mary | Finch: 80%, Sparrow: 20%
Susan | Dove: 70%, Sparrow: 30%
Jane | Hummingbird: 65%, Sparrow: 35%
After cleaning, Bird-1: Finch, Dove, Hummingbird

Outline of The Talk
Part 1: Managing Uncertainty in a DBMS (theory, systems)
Part 2: Handling Uncertainty in Data Integration (systems, theory)
Other Research (trailer)
Future Plans

Part 1: Managing Uncertain Data
Primarily in the context of the Trio project
1) Data
2) Uncertainty
3) Lineage
Today’s focus: how lineage helps

Uncertain Data
- Sensor reported 75 ±0.5 F
- Bob works for either Yahoo or Microsoft
- Mary sighted either a Finch (80%) or a Sparrow (20%)
- There is a 60% chance of rain in Stanford tomorrow
An uncertain database represents a set of possible instances (or possible worlds).
Our work: finite sets of possible instances

Representing Uncertain Data
- 20+ years of work (mostly theoretical)
- There appears to be a fundamental trade-off between expressiveness and intuitiveness
- We spent some time exploring the space of models for uncertainty

Hierarchy of Models [ICDE 06]
Model constructs: R (relations), A (or-sets), ? (maybe-tuples), 2 (2-clauses), prop (full propositional logic), sets (tuple-sets)
Trade-off across the hierarchy: (+ Expressive, - Complex) versus (+ Intuitive, - Inexpressive)
Next
1. Consider a model M
2. Isolate inexpressiveness
3. Solve problem with lineage

Running Example: Crime-Solver
Saw (witness, color, car) // may be uncertain
Drives (person, color, car) // may be uncertain
Suspects (person) = π person (Saw ⋈ Drives)

Simple Model M
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
Saw (witness, color, car)
Amy: red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible instances

Simple Model M
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
Saw (witness, color, car)
Amy: red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty: blue, Acura ?
Six possible instances

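The count of six possible instances can be checked by brute-force enumeration. The following is a minimal sketch, assuming a hypothetical encoding in which each tuple is a list of alternatives plus a maybe flag; the names `saw` and `possible_instances` are illustrative and are not part of Trio.

```python
from itertools import product

# Hypothetical encoding of model M: each tuple is (alternatives, maybe).
# 'maybe' corresponds to the '?' annotation (the tuple may be absent).
saw = [
    ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "orange", "Mazda")], False),
    ([("Betty", "blue", "Acura")], True),
]

def possible_instances(relation):
    """Enumerate every possible instance of a relation in model M."""
    per_tuple_choices = []
    for alternatives, maybe in relation:
        choices = list(alternatives)
        if maybe:
            choices.append(None)  # None stands for "tuple absent"
        per_tuple_choices.append(choices)
    instances = []
    for combo in product(*per_tuple_choices):
        # One instance = the set of tuples actually present in this combination
        instances.append(frozenset(t for t in combo if t is not None))
    return instances

instances = possible_instances(saw)
# 3 alternatives for Amy, times (present or absent) for Betty
print(len(instances))
```

Dropping the Betty tuple (`saw[:1]`) gives the three instances of the previous slide.
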
Review: Relational Queries
Saw (witness, color, car)
Amy, red, Honda
Betty, blue, Acura
Q: π witness (σ color=red (Saw))
W (witness)
Amy

Queries on Uncertain Data
- D represents possible instances I1, I2, …, In
- Q on each instance yields J1, J2, …, Jm
- D′ is a representation of the instances J1, J2, …, Jm
- Direct implementation: compute D′ from D
Closure: the up-arrow (a representation D′ of the result instances) always exists
Completeness: all sets of possible instances can be represented

Model M is Not Closed
Saw (witness, car)
Cathy: Honda ∥ Mazda
Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda
Suspects = π person (Saw ⋈ Drives)
Jimmy ?
Billy ∥ Frank ?
Hank ?
Does not (in fact, CANNOT) correctly capture possible instances in the result

Lineage to the Rescue
Model M + Lineage = Completeness

Example with Lineage
ID | Saw (witness, car)
11 | Cathy: Honda ∥ Mazda
ID | Drives (person, car)
21 | Jimmy, Toyota ∥ Jimmy, Mazda
22 | Billy, Honda ∥ Frank, Honda
23 | Hank, Honda
Suspects = π person (Saw ⋈ Drives)
ID | Suspects
31 | Jimmy ?
32 | Billy ∥ Frank ?
33 | Hank ?
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23
Correctly captures possible instances in the result

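To see what the lineage buys, one can enumerate the possible worlds of the base data and keep a Suspects alternative only when its lineage holds. This is a rough sketch under assumed encodings: the dictionaries and the function name are hypothetical, and tuple 23 is written as `(23, 1)` because it has a single alternative.

```python
from itertools import product

# Base data from the slide: tuple ID -> list of alternatives.
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

# Lineage from the slide: result alternative -> conjunction of base alternatives.
# (tuple_id, k) means "alternative k (1-based) of tuple_id was chosen".
lineage = {
    "Jimmy": [(11, 2), (21, 2)],
    "Billy": [(11, 1), (22, 1)],
    "Frank": [(11, 1), (22, 2)],
    "Hank": [(11, 1), (23, 1)],
}

def suspects_instances():
    """Possible instances of Suspects, obtained by evaluating lineage per world."""
    base = {**saw, **drives}
    ids = sorted(base)
    results = set()
    # A world picks one alternative index for every base tuple.
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        world = dict(zip(ids, picks))
        present = frozenset(
            person for person, conj in lineage.items()
            if all(world[tid] == alt for tid, alt in conj)
        )
        results.add(present)
    return results

for instance in sorted(suspects_instances(), key=sorted):
    print(sorted(instance))
```

Only four instances come out: the empty relation, {Jimmy}, {Billy, Hank} and {Frank, Hank}. Reading the three ‘?’ annotations as independent, as model M alone would, also admits instances such as {Jimmy, Hank}, which no world of the base data produces.
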
Trio’s Data Model
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values (next)
4. Lineage
Uncertainty-Lineage Databases (ULDBs)
Theorem: ULDBs are closed and complete [VLDB 06]
Formally studied properties like minimization, equivalence, approximation and membership [VLDB 06, VLDB J. 08]

Confidence Values in Trio
- Confidence values supplied with base data
- Default probabilistic interpretation
Problem: Compute confidence values on result data [ICDE 08]
5-minute DBClip: search “confidence computation” on YouTube.

Problem Description
ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6
ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0
Cars = π car (Saw ⋈ Drives)
ID | Cars
41 | Honda : ?
42 | Acura : ?

Operator-by-Operator
Saw ⋈ Drives
31 | (Amy, Jimmy, Honda) : 0.5*0.9 = 0.45
32 | (Amy, Billy, Honda) : 0.4
33 | (Betty, Hank, Acura) : 0.6
π car
41 | Honda : 0.45 + 0.4 - (0.45*0.4) = 0.67 Wrong!!
42 | Acura : 0.6
Tuples 31 and 32 are not independent! Both depend on tuple 11.

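The gap between the two numbers can be reproduced directly. The sketch below assumes that base tuples are independent and uses base confidences 0.5, 0.9 and 0.8, which are inferred from the arithmetic shown on the slides; the dictionary keys and function names are hypothetical.

```python
from itertools import product

# Base confidences; 0.5, 0.9 and 0.8 are inferred from the slides' arithmetic.
base = {"Amy-Honda": 0.5, "Jimmy-Honda": 0.9, "Billy-Honda": 0.8}

def operator_by_operator():
    """Treat the two join results as independent (the flawed approach)."""
    p31 = base["Amy-Honda"] * base["Jimmy-Honda"]  # (Amy, Jimmy, Honda)
    p32 = base["Amy-Honda"] * base["Billy-Honda"]  # (Amy, Billy, Honda)
    return p31 + p32 - p31 * p32

def exact_by_enumeration():
    """Sum the probability of every world in which Honda appears in Cars."""
    names = list(base)
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        prob = 1.0
        for name, present in world.items():
            prob *= base[name] if present else 1.0 - base[name]
        # Honda is in Cars when Amy saw it and at least one driver tuple exists
        if world["Amy-Honda"] and (world["Jimmy-Honda"] or world["Billy-Honda"]):
            total += prob
    return total

print(round(operator_by_operator(), 2), round(exact_by_enumeration(), 2))  # prints 0.67 0.49
```

The overestimate arises because (Amy, Honda) is counted twice, once through each join result.
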
Database Query Processing 101
Q → Query Execution Plans (statistics, indexes) → Pick and execute best plan

Operator-by-Operator Confidence Computation
Q → Query Plans
The set of query plans can be much smaller or empty

Decouple Data and Confidence Computation
Q → Query Plans
1. Compute data
2. Use lineage to compute confidences (on demand)
Theorem: Arbitrary improvement over operator-by-operator computation is possible. [ICDE 08]

Our Approach
ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6
ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0
ID | Cars
41 | Honda : ?
42 | Acura : ?
λ(41) = 11 ∧ (21 ∨ 22)
λ(42) = 12 ∧ 23
41 | Honda : 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49
42 | Acura : 0.6
Correct!!

Algorithm
Result tuple t in R with lineage λ(t) = f(t4, t5, t6, t7) over base tuples
1. Expand lineage to base data
2. Get confidence of base data
3. Evaluate the probability of λ(t)
Techniques: detecting independence, memoization, batch computation

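The three steps can be sketched on the Cars example. This is an illustration under stated assumptions, not Trio's implementation: lineage is encoded as nested `("and" | "or", ...)` tuples over tuple IDs (a hypothetical format), base tuples are assumed independent, the confidences 0.5, 0.9 and 0.8 are inferred from the slides' arithmetic, and step 3 uses exhaustive enumeration where a real system would exploit independence.

```python
from functools import lru_cache
from itertools import product

# Confidences of base tuples (0.5, 0.9, 0.8 inferred from the slides).
confidence = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}

# Lineage of derived tuples; a formula is ("and"|"or", operand, ...) or a tuple ID.
lineage = {
    31: ("and", 11, 21),
    32: ("and", 11, 22),
    33: ("and", 12, 23),
    41: ("or", 31, 32),
    42: 33,
}

@lru_cache(maxsize=None)
def expand(formula):
    """Step 1: rewrite a formula so it mentions base tuples only (memoized)."""
    if isinstance(formula, int):
        return expand(lineage[formula]) if formula in lineage else formula
    op, *operands = formula
    return (op, *(expand(f) for f in operands))

def base_ids(formula):
    """Collect the base tuple IDs mentioned in an expanded formula."""
    if isinstance(formula, int):
        return {formula}
    return set().union(*(base_ids(f) for f in formula[1:]))

def holds(formula, world):
    """Evaluate an expanded formula in a world (tuple ID -> present?)."""
    if isinstance(formula, int):
        return world[formula]
    op, *operands = formula
    values = [holds(f, world) for f in operands]
    return all(values) if op == "and" else any(values)

def confidence_of(tuple_id):
    formula = expand(tuple_id)
    ids = sorted(base_ids(formula))
    total = 0.0
    for bits in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, bits))
        prob = 1.0
        for i in ids:  # Step 2: look up base confidences
            prob *= confidence[i] if world[i] else 1.0 - confidence[i]
        if holds(formula, world):  # Step 3: add up worlds where lineage is true
            total += prob
    return total

print(round(confidence_of(41), 2), round(confidence_of(42), 2))
```

Because confidences are computed from lineage after the data, any query plan may be used to produce the data itself.
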
Some Other Trio Work
Modifications and Versioning [TR 08]
- Stored derived relations
- Modifications, versions
Indexes and Statistics [MUD 08]
- Specialized indexes, histograms
Functional Dependencies & Schema Design [TR 07]
- Definitions, sound and complete axiomatization of FDs
- Lossless decomposition
- FD testing, finding, and inference

Related Work (sample)
Modeling Uncertainty: Plenty, covered in textbooks
Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

Part 2: Data Integration
Reboot! (or, wake up!)

Traditional Data Integration: Setup
Sources: D1, D2, D3, D4, D5
Bib(title, authors, conf, year)
Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
1. Mediated Schema
Publication(title, author, conf, year)
2. Schema Mappings
SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid=B.aid AND P.pid=B.pid
3. Query Answering
Who authored the most SIGMOD papers in the 90’s? Mike Carey
Significant up-front effort

“Pay-As-You-Go” Data Integration
1. Automated best-effort integration from the outset
2. Further improve the system over time with feedback
How advanced a starting point can we provide?

Uncertainty to the Rescue
- Automatic integration
- Make guesses
- Model probabilities
Specifically
- Probabilistic schema mappings
- Probabilistic mediated schema
>90% accuracy in automatically integrating data sources for several domains [SIGMOD 08]

Next
1. Probabilistic mediated schemas
2. Probabilistic schema mappings
3. Experimental results

Mediated Schema
S1(name, , phone-num, address)
S2(person-name, phone, mailing-addr)
Med-S(name, , phone, addr)
Clusters: {name, person-name}, {phone-num, phone}, {address, mailing-addr}, { }
A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr}) ?
Q: SELECT name, hPhone, oPhone FROM Med

Example
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
Q: SELECT name, phone, address FROM Med

Probabilistic Mediated Schema
S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr}) Pr=0.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr}) Pr=0.5
A Probabilistic Mediated Schema (p-med-schema) is a set M = {(M1, Pr(M1)), …, (Mk, Pr(Mk))} where
- each Mi is a med-schema, and i ≠ j ⇒ Mi ≠ Mj
- Pr(Mi) ∈ (0,1], and Σ Pr(Mi) = 1

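The conditions in the definition are easy to check mechanically. A small sketch, assuming a hypothetical representation in which a mediated schema is a list of attribute sets; `normalize` and `is_p_med_schema` are illustrative names.

```python
from math import isclose

def normalize(schema):
    """Represent a mediated schema as a set of attribute clusters."""
    return frozenset(frozenset(cluster) for cluster in schema)

def is_p_med_schema(pairs):
    """Check the conditions of the p-med-schema definition.

    pairs: list of (mediated_schema, probability).
    """
    schemas = [normalize(s) for s, _ in pairs]
    probs = [p for _, p in pairs]
    distinct = len(set(schemas)) == len(schemas)  # i != j implies Mi != Mj
    in_range = all(0 < p <= 1 for p in probs)     # Pr(Mi) in (0, 1]
    sums_to_one = isclose(sum(probs), 1.0)        # probabilities sum to 1
    return distinct and in_range and sums_to_one

med3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"}]
med4 = [{"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"}]
print(is_p_med_schema([(med3, 0.5), (med4, 0.5)]))
```

Listing the same clustering twice, or probabilities that do not sum to 1, makes the check fail.
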
P-Mappings
PM1: S1(name, hP, oP, hA, oA) to Med3 (name, hPP, oP, hAA, oA), four possible mappings with Pr = .64, .16, .16, .04
PM2: S1(name, hP, oP, hA, oA) to Med4 (name, oPP, hP, oAA, hA), four possible mappings with Pr = .64, .16, .16, .04

Expressive Power of P-Med-Schema & P-Mapping
Theorem 1. For one-to-many mappings:
(p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings)
Theorem 2. When restricted to one-to-one mappings:
(p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mapping)

Next
- Creating p-med-schemas (briefly)
- Creating p-mappings (briefly)
- Experimental Results

P-med-schema Creation
Attributes of S1 and S2: name, address, -address, pname, home-address
1. Certain/uncertain edges
2. Clustering
3. Assign probabilities (e.g., Pr=1/6, Pr=1/3)

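The three steps could look roughly like the sketch below. Several parts are assumptions, not taken from the slides: clusterings are generated as connected components for each subset of uncertain edges that is kept, a clustering counts as consistent with a source when no cluster contains two attributes of that source, and probabilities are proportional to the number of consistent sources. The attribute `office-address`, the edges and the source schemas are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def clusters(attributes, edges):
    """Connected components of the attribute graph (one cluster per component)."""
    parent = {a: a for a in attributes}

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for a in attributes:
        groups.setdefault(find(a), set()).add(a)
    return frozenset(frozenset(g) for g in groups.values())

def candidate_schemas(attributes, certain, uncertain):
    """Step 2: one clustering per subset of uncertain edges that is kept."""
    found = set()
    for k in range(len(uncertain) + 1):
        for kept in combinations(uncertain, k):
            found.add(clusters(attributes, list(certain) + list(kept)))
    return found

def consistent(schema, source):
    """Assumed criterion: no cluster merges two attributes of the same source."""
    return all(len(cluster & source) <= 1 for cluster in schema)

def assign_probabilities(schemas, sources):
    """Step 3 (assumed rule): weight each schema by its number of consistent sources."""
    counts = {m: sum(consistent(m, s) for s in sources) for m in schemas}
    total = sum(counts.values())
    return {m: Fraction(c, total) for m, c in counts.items() if c > 0}

# Hypothetical input: step 1 is assumed to have produced these edges.
attributes = ["name", "pname", "address", "home-address", "office-address"]
certain = [("name", "pname")]
uncertain = [("address", "home-address"), ("address", "office-address")]
sources = [frozenset({"name", "address"}),
           frozenset({"pname", "home-address", "office-address"})]

schemas = candidate_schemas(attributes, certain, uncertain)
for schema, pr in assign_probabilities(schemas, sources).items():
    print(sorted(sorted(c) for c in schema), pr)
```

With two uncertain edges there are four candidate clusterings; the one that merges both address edges conflicts with the second source and so receives the lowest weight.
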
P-mapping Creation
S = (num, pname, home-addr, office-addr)
T = (name, mailing-addr)
Goal: find a p-mapping that is consistent with a set of weighted correspondences
Theorem: A consistent p-mapping exists if and only if, for every source or target attribute a, the sum of the weights of all correspondences that involve a is at most 1.

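The condition in the theorem is a simple per-attribute sum. A minimal sketch, assuming correspondences are given as (source attribute, target attribute, weight) triples; the weights in the example are made up for illustration.

```python
from collections import defaultdict

def admits_consistent_p_mapping(correspondences, tol=1e-9):
    """Check the theorem's condition on weighted correspondences.

    correspondences: iterable of (source_attr, target_attr, weight).
    Source and target attributes are tracked separately, so equal
    names on the two sides do not interfere.
    """
    source_sum = defaultdict(float)
    target_sum = defaultdict(float)
    for s, t, w in correspondences:
        source_sum[s] += w
        target_sum[t] += w
    totals = list(source_sum.values()) + list(target_sum.values())
    return all(total <= 1 + tol for total in totals)

# Hypothetical weights over S = (num, pname, home-addr, office-addr), T = (name, mailing-addr)
ok = [("pname", "name", 0.9),
      ("home-addr", "mailing-addr", 0.6),
      ("office-addr", "mailing-addr", 0.3)]
bad = [("pname", "name", 0.9),
       ("home-addr", "mailing-addr", 0.6),
       ("office-addr", "mailing-addr", 0.5)]  # mailing-addr sums to 1.1
print(admits_consistent_p_mapping(ok), admits_consistent_p_mapping(bad))
```
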
Experiments
Data: tables extracted from HTML tables on the web
Domain | #Sources | Search Keywords
Movie | 161 | movie, year
Car | 817 | make, model
People | 49 | job/title, organization/company/employer
Course | 647 | course/class, instructor/teacher/lecturer, subject/department/title
Bib | 649 | author, title, year, journal/conference

Experiments
- Gold standard: manual
- Approximate standard: semi-automatic
- Precision, recall, F-measure for several SQL queries, varying attributes and selectivities

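For reference, the three measures can be computed per query as below. This sketch assumes the usual definitions, with F-measure taken as the harmonic mean of precision and recall; the slides do not spell out the exact variant used, and the answer sets in the example are invented.

```python
def precision_recall_f(returned, gold):
    """Precision, recall and F-measure of returned answers against a gold standard."""
    returned, gold = set(returned), set(gold)
    correct = len(returned & gold)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Invented example: 4 answers returned, 3 of them among 6 gold answers
returned = {"a1", "a2", "a3", "x"}
gold = {"a1", "a2", "a3", "a4", "a5", "a6"}
print(precision_recall_f(returned, gold))
```
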
Quality of Query Answering
Domain | Precision | Recall | F-measure
Golden Standard
People | | |
Course | | |
Approximate Golden Standard
Movie | | |
Car | | |
People | | |
Course | 1 | 1 | 1
Bib | | |

Comparison with Other Approaches
- Keyword search obtained low precision and low recall.
- Querying the sources directly or considering only the highest-probability mapping obtained low recall.
- We obtained the highest F-measure in all domains.

Comparison with Other Mediated-Schema Generation Methods
Using a p-med-schema obtained the highest F-measure in all domains.

System Setup Time (one domain)

Brief Related Work
- Approximate schema mappings [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]
- Automatic generation of mediated schemas [He et al. 2003]
- More (see paper)

Finally…
Other Research
- Data Integration (2)
- Deduplication (2)
- Quality Estimation of Sensor/RFID Streams [IQIS 06]
Future Plans

Data Integration
Problem: Foundations for integration of uncertain data
Solution [TR 08]:
- Define open- and closed-containment for uncertain data
- Algorithms, complexity of consistency checking and finding maximally-correct query answers
Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism)
Solution [TR 08]: Algorithms, complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering

Deduplication
[SIGMOD 07]
- Leveraging real-world constraints for deduplication
- Tractable optimal solution and experiments over DBLP and ACM publication data
[WWW 07]
- Detecting near-duplicate web-pages for crawling
- Efficient indexing scheme supporting crawling speeds over web-scale data

Future Work: Short & Medium-Term
1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
2. More applications of uncertain data
3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …

Future Work: Long-term
1. Applying uncertainty to other data management problems: query optimization? cloud computing?
2. Improve quality of data through conflict resolution and feedback
3. Web-data management: handling huge amounts of data that is conflicting, uncertain, redundant, dependent, …

Thanks!
Anish Das Sarma (or search “Anish Das Sarma”)