Published by Myron Hampton. Modified over 9 years ago.
1
Managing Uncertain Data
Anish Das Sarma
Stanford University
May 19, 2015
2
What is Uncertain Data?

(Certain) Data | Uncertain Data
Temperature is 74.634589 F | Sensor reported 75 ± 0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (80%) or a Sparrow (20%)
It will rain in Stanford tomorrow | There is a 60% chance of rain in Stanford tomorrow
Yahoo stocks will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John's age is 23 | John's age is in [20,30]
3
Why Does It Arise?

Sources of uncertainty behind the examples on the previous slide:
- Precision of devices
- Lack of information
- Uncertainty about the future
- Anonymization
4
Applications: Information Extraction

Restaurant | Zip
Hard Rock Cafe | 94111, 94133, 94109
5
Applications: Information Integration

Source schemas:
- (name, hPhone, oPhone, hAddr, oAddr)
- (name, phone, address)

Combined View
6
Applications: Deduplication

Name
John Doe
J. Doe

? 80% match
7
Applications: Scientific & Medical Experiments

Probably not cancer
8
How Do Database Management Systems (DBMS) Handle Uncertainty?

They don't
9
What Do (Most) Applications Do?

Clean: turn into data that DBMSs can handle

Observer | Bird-1
Mary | Finch: 80%, Sparrow: 20%
Susan | Dove: 70%, Sparrow: 30%
Jane | Hummingbird: 65%, Sparrow: 35%

After cleaning:

Bird-1
Finch
Dove
Hummingbird

(1) Loss of information
(2) Errors compound insidiously
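The two costs of cleaning can be made concrete with a short calculation. The sketch below keeps each observer's most likely bird, as cleaning would, and assumes the three sightings are independent; that independence is an assumption for illustration and is not stated on the slide.

```python
# Confidence of each observer's most likely sighting, taken from the slide.
# Assumption (not stated on the slide): the three sightings are independent.
top_choice = {
    "Mary": ("Finch", 0.80),
    "Susan": ("Dove", 0.70),
    "Jane": ("Hummingbird", 0.65),
}

# Probability that every value kept by cleaning is the true one.
p_all_correct = 1.0
for _bird, p in top_choice.values():
    p_all_correct *= p

# Each observer's only other alternative is a Sparrow, so "no Sparrow at all"
# holds exactly when all three top choices are right.
p_some_sparrow = 1.0 - p_all_correct

print(f"P(cleaned table fully correct)  = {p_all_correct:.3f}")
print(f"P(at least one Sparrow sighted) = {p_some_sparrow:.3f}")
```

Under that assumption the cleaned table is entirely correct with probability 0.8 × 0.7 × 0.65 = 0.364, while the Sparrow, which no longer appears anywhere, was sighted by at least one observer with probability 0.636.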
10
Outline of The Talk

Part 1: Managing Uncertainty in a DBMS (theory, systems)
Part 2: Handling Uncertainty in Data Integration (systems, theory)
Other Research (trailer)
Future Plans
11
Part 1: Managing Uncertain Data

Primarily in the context of the Trio project
1) Data
2) Uncertainty
3) Lineage

Today's focus: how lineage helps
12
Uncertain Data

- Sensor reported 75 ± 0.5 F
- Bob works for either Yahoo or Microsoft
- Mary sighted either a Finch (80%) or a Sparrow (20%)
- There is a 60% chance of rain in Stanford tomorrow

An uncertain database represents a set of possible instances (or, possible worlds)

Our work: finite sets of possible instances
13
Representing Uncertain Data

- 20+ years of work (mostly theoretical)
- There appears to be a fundamental trade-off between expressiveness and intuitiveness
- We spent some time exploring the space of models for uncertainty
14
Hierarchy of Models [ICDE 06]

Figure legend:
- R: relations
- A: or-sets
- ?: maybe-tuples
- 2: 2-clauses
- prop: full propositional logic
- sets: tuple-sets

One end of the hierarchy: + Expressive, - Complex
Other end of the hierarchy: + Intuitive, - Inexpressive

Next
1. Consider a model M
2. Isolate inexpressiveness
3. Solve problem with lineage
15
Running Example: Crime-Solver

Saw (witness, color, car) // may be uncertain
Drives (person, color, car) // may be uncertain
Suspects (person) = π person (Saw ⋈ Drives)
16
Simple Model M

1. Alternatives: uncertainty about value
2. '?' (Maybe) Annotations

Saw (witness, color, car)
  Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances
17
Simple Model M

1. Alternatives
2. '?' (Maybe): uncertainty about presence

Saw (witness, color, car)
  Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda
  Betty | blue, Acura   ?

Six possible instances
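The count of six follows from three alternatives for Amy's tuple times two states (present or absent) for Betty's '?' tuple. A minimal sketch that enumerates them, using a hypothetical encoding of model M as (alternatives, maybe) pairs:

```python
from itertools import product

# An uncertain tuple is (alternatives, maybe): exactly one alternative is
# chosen, and a '?' tuple may also be absent. The encoding is illustrative.
saw = [
    ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "orange", "Mazda")], False),
    ([("Betty", "blue", "Acura")], True),
]

def possible_instances(relation):
    """Return every possible instance as a frozenset of ordinary tuples."""
    per_tuple = []
    for alternatives, maybe in relation:
        choices = list(alternatives)
        if maybe:
            choices.append(None)  # None stands for "tuple absent"
        per_tuple.append(choices)
    return {
        frozenset(t for t in combo if t is not None)
        for combo in product(*per_tuple)
    }

instances = possible_instances(saw)
print(len(instances))  # 3 alternatives x 2 presence states
```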
18
Review: Relational Queries

Input: Saw (witness, color, car)
  Amy, red, Honda
  Betty, blue, Acura

Query: π witness (σ color=red)

Result: W (witness)
  Amy
19
Queries on Uncertain Data

Diagram:
- D has possible instances I1, I2, …, In
- Q on each instance yields J1, J2, …, Jm
- D′ is a representation of the instances J1, J2, …, Jm (the up-arrow)
- Direct implementation: compute D′ from D without enumerating instances

Closure: up-arrow always exists
Completeness: all sets of possible instances can be represented
20
Model M is Not Closed

Saw (witness, car)
  Cathy | Honda ∥ Mazda

Drives (person, car)
  Jimmy, Toyota ∥ Jimmy, Mazda
  Billy, Honda ∥ Frank, Honda
  Hank, Honda

Suspects = π person (Saw ⋈ Drives)
  Jimmy   ?
  Billy ∥ Frank   ?
  Hank   ?

Model M CANNOT correctly capture the possible instances in the result
21
Lineage to the Rescue

Model M + Lineage = Completeness
22
Example with Lineage

ID | Saw (witness, car)
11 | Cathy, Honda ∥ Mazda

ID | Drives (person, car)
21 | Jimmy, Toyota ∥ Jimmy, Mazda
22 | Billy, Honda ∥ Frank, Honda
23 | Hank, Honda

Suspects = π person (Saw ⋈ Drives)

ID | Suspects
31 | Jimmy   ?
32 | Billy ∥ Frank   ?
33 | Hank   ?
23
Example with Lineage

ID | Saw (witness, car)
11 | Cathy, Honda ∥ Mazda

ID | Drives (person, car)
21 | Jimmy, Toyota ∥ Jimmy, Mazda
22 | Billy, Honda ∥ Frank, Honda
23 | Hank, Honda

Suspects = π person (Saw ⋈ Drives)

ID | Suspects
31 | Jimmy   ?
32 | Billy ∥ Frank   ?
33 | Hank   ?

Lineage:
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

Correctly captures possible instances in the result
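One way to check that the lineage formulas capture the right instances is to enumerate every possible world of the base data and compare two computations of Suspects: running the query in each world, and keeping a result alternative exactly when its lineage holds. The sketch below does that under one stated assumption: base tuples carry no '?' annotation. The dictionary encoding and function names are illustrative, not Trio's API.

```python
from itertools import product

# Base data from the slide: tuple ID -> list of alternatives.
# Assumption: base tuples carry no '?' annotation.
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

# Lineage from the slide: (id, k) means "tuple id takes its k-th alternative".
# Tuple 23 has a single alternative, so plain "23" is written as (23, 1).
lineage = {
    "Jimmy": [(11, 2), (21, 2)],
    "Billy": [(11, 1), (22, 1)],
    "Frank": [(11, 1), (22, 2)],
    "Hank": [(11, 1), (23, 1)],
}

base = {**saw, **drives}
ids = sorted(base)

def worlds():
    """Yield one dict {tuple id: chosen alternative number} per possible world."""
    ranges = [range(1, len(base[i]) + 1) for i in ids]
    for choice in product(*ranges):
        yield dict(zip(ids, choice))

def suspects_by_query(world):
    """Evaluate π person (Saw ⋈ Drives) directly in one world."""
    seen_cars = {saw[i][world[i] - 1][1] for i in saw}
    chosen = [drives[i][world[i] - 1] for i in drives]
    return frozenset(person for person, car in chosen if car in seen_cars)

def suspects_by_lineage(world):
    """Keep a result alternative exactly when its lineage holds in the world."""
    return frozenset(
        person
        for person, needs in lineage.items()
        if all(world[i] == k for i, k in needs)
    )

by_query = {suspects_by_query(w) for w in worlds()}
by_lineage = {suspects_by_lineage(w) for w in worlds()}
print(by_query == by_lineage)
```

Both computations give the same four instances: {}, {Jimmy}, {Billy, Hank} and {Frank, Hank}. An instance such as {Jimmy, Hank}, which three independent '?' tuples would allow, never occurs.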
24
Trio's Data Model

1. Alternatives
2. '?' (Maybe) Annotations
3. Confidence values (next)
4. Lineage

Uncertainty-Lineage Databases (ULDBs)

Theorem: ULDBs are closed and complete [VLDB 06]

Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]
25
Confidence Values in Trio

Confidence values supplied with base data
– Default probabilistic interpretation

Problem: Compute confidence values on result data [ICDE 08]

5-minute DBClip
– Search "confidence computation" on YouTube.
26
Problem Description

ID | Saw (witness, car)
11 | (Amy, Honda) : 0.5
12 | (Betty, Acura) : 0.6

ID | Drives (person, car)
21 | (Jimmy, Honda) : 0.9
22 | (Billy, Honda) : 0.8
23 | (Hank, Acura) : 1.0

Cars = π car (Saw ⋈ Drives)

ID | Cars
41 | Honda : ?
42 | Acura : ?
27
Operator-by-Operator

Saw and Drives as on the Problem Description slide.

Saw ⋈ Drives:
ID | Tuple | Confidence
31 | (Amy, Jimmy, Honda) | 0.5 * 0.9 = 0.45
32 | (Amy, Billy, Honda) | 0.4
33 | (Betty, Hank, Acura) | 0.6

π car:
ID | Cars | Confidence
41 | Honda | 0.45 + 0.4 - (0.45 * 0.4) = 0.67   Wrong!!
42 | Acura |
28
Operator-by-Operator

Saw ⋈ Drives:
ID | Tuple | Confidence
31 | (Amy, Jimmy, Honda) | 0.45
32 | (Amy, Billy, Honda) | 0.4
33 | (Betty, Hank, Acura) | 0.6

The formula 0.45 + 0.4 - (0.45 * 0.4) treats tuples 31 and 32 as independent, but both derive from Saw tuple 11.

Not independent!
29
Database Query Processing 101

Q → Query Execution Plans → Pick and execute best plan (using statistics, indexes)
30
Operator-by-Operator Confidence Computation

Q → Query Plans

The set of usable plans can be much smaller or empty
31
Decouple Data and Confidence Computation

Q → Query Plans

1. Compute data
2. Use lineage to compute confidences (on demand)

Theorem: Arbitrary improvement. [ICDE 08]
32
Our Approach

Saw and Drives as on the Problem Description slide; Cars = π car (Saw ⋈ Drives).

ID | Cars | Lineage | Confidence
41 | Honda | λ(41) = 11 ∧ (21 ∨ 22) | 0.5 * (0.9 + 0.8 - 0.9 * 0.8) = 0.49
42 | Acura | λ(42) = 12 ∧ 23 | 0.6

Correct!!
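The difference between the two computations can be reproduced in a few lines. The sketch below evaluates each lineage formula by summing the probabilities of the base-tuple truth assignments that satisfy it, assuming base tuples are independent (the default probabilistic interpretation mentioned earlier). The brute-force enumeration is only an illustration and is not Trio's algorithm.

```python
from itertools import product

# Base confidences from the slide, keyed by tuple ID.
conf = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}

def probability(formula, variables):
    """P(formula) by enumerating truth assignments of independent base tuples."""
    total = 0.0
    for values in product([True, False], repeat=len(variables)):
        world = dict(zip(variables, values))
        if formula(world):
            weight = 1.0
            for v in variables:
                weight *= conf[v] if world[v] else 1.0 - conf[v]
            total += weight
    return total

# Lineage from the slide: λ(41) = 11 ∧ (21 ∨ 22), λ(42) = 12 ∧ 23.
honda = probability(lambda w: w[11] and (w[21] or w[22]), [11, 21, 22])
acura = probability(lambda w: w[12] and w[23], [12, 23])

# Operator-by-operator result for Honda, which wrongly treats the two join
# tuples (both derived from Saw tuple 11) as independent.
naive_honda = 0.45 + 0.4 - 0.45 * 0.4

print(f"lineage-based: Honda {honda:.2f}, Acura {acura:.2f}")
print(f"operator-by-operator: Honda {naive_honda:.2f}")
```

Enumeration is exponential in the number of base tuples in the lineage, which is one reason the optimizations listed on the Algorithm slide (detecting independence, memoization, batch computation) matter.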
33
Algorithm

Figure: lineage of a result tuple t in R, through t1 and t2, down to base tuples t4, t5, t6, t7, so that λ(t) = f(t4, t5, t6, t7). Confidence values shown in the figure: 0.7, 0.9, 1.0, 0.4, 0.823.

1. Expand lineage to base data
2. Get confidence of base data
3. Evaluate the probability λ(t)

Optimizations:
- Detecting independence
- Memoization
- Batch computation
34
Some Other Trio Work

Modifications and Versioning [TR 08]
- Stored derived relations
- Modifications, versions

Indexes and Statistics [MUD 08]
- Specialized indexes, histograms

Functional Dependencies & Schema Design [TR 07]
- Definitions, sound and complete axiomatization of FDs
- Lossless decomposition
- FD testing, finding, and inference
35
Related Work (sample)

Modeling Uncertainty: Plenty, covered in textbooks

Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?
36
Part 2: Data Integration

Reboot! (or, wake up!)
37
Traditional Data Integration: Setup

Sources: D1, D2, D3, D4, D5
Example source schemas:
  Bib(title, authors, conf, year)
  Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)

1. Mediated Schema
  Publication(title, author, conf, year)

2. Schema Mappings
  SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
  FROM Author AS A, Paper AS P, AuthoredBy AS B
  WHERE A.aid = B.aid AND P.pid = B.pid

3. Query Answering
  Who authored the most SIGMOD papers in the 90's? Mike Carey

Significant up-front effort
38
"Pay-As-You-Go" Data Integration

1. Automated best-effort integration from the outset
2. Further improve the system over time with feedback

How advanced a starting point can we provide?
39
Uncertainty to the Rescue

Automatic integration
- Make guesses
- Model probabilities

Specifically
– Probabilistic schema mappings
– Probabilistic mediated schema

>90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]
40
Next

1. Probabilistic mediated schemas
2. Probabilistic schema mappings
3. Experimental results
41
Mediated Schema

S1(name, email, phone-num, address)
S2(person-name, phone, mailing-addr)

Med-S (name, email, phone, addr), with clusters:
  {name, person-name}
  {phone-num, phone}
  {address, mailing-addr}
  {email}

A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.
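The definition can be read as a simple check: clusters must be non-empty, pairwise disjoint, and drawn from the source attributes. A minimal sketch, assuming a mediated schema is stored as a list of Python sets (an illustrative encoding; `is_mediated_schema` is a hypothetical helper name):

```python
# Source schemas and the mediated schema from the slide, as attribute sets.
s1 = {"name", "email", "phone-num", "address"}
s2 = {"person-name", "phone", "mailing-addr"}
med_s = [
    {"name", "person-name"},
    {"phone-num", "phone"},
    {"address", "mailing-addr"},
    {"email"},
]

def is_mediated_schema(clusters, sources):
    """True if clusters are non-empty, disjoint, and use only source attributes."""
    all_attrs = set().union(*sources)
    seen = set()
    for cluster in clusters:
        if not cluster or not cluster <= all_attrs or cluster & seen:
            return False
        seen |= cluster
    return True

print(is_mediated_schema(med_s, [s1, s2]))
```

Because the definition asks only for a clustering of a subset of the attributes, the check does not require every source attribute to appear in some cluster.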
42
Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})

Q: SELECT name, hPhone, oPhone FROM Med   ?
43
Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med
44
Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})

Q: SELECT name, phone, address FROM Med
45
Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})

Q: SELECT name, phone, address FROM Med
46
Example

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})

Q: SELECT name, phone, address FROM Med
48
Probabilistic Mediated Schema

S1(name, hPhone, oPhone, hAddr, oAddr)
S2(name, phone, address)

Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})   Pr = 0.5
Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})   Pr = 0.5

A Probabilistic Mediated Schema (p-med-schema) is a set
M = {(M_1, Pr(M_1)), …, (M_k, Pr(M_k))} where
- each M_i is a med-schema, and i ≠ j ⇒ M_i ≠ M_j
- Pr(M_i) ∈ (0,1], and Σ_i Pr(M_i) = 1
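The three conditions in the definition (distinct schemas, probabilities in (0,1], probabilities summing to 1) translate directly into a validity check. A minimal sketch, assuming each mediated schema is a list of attribute sets paired with its probability; the encoding and the name `is_p_med_schema` are illustrative:

```python
import math

med3 = [{"name"}, {"phone", "hPhone"}, {"oPhone"}, {"address", "hAddr"}, {"oAddr"}]
med4 = [{"name"}, {"phone", "oPhone"}, {"hPhone"}, {"address", "oAddr"}, {"hAddr"}]

def is_p_med_schema(pairs):
    """Check the p-med-schema conditions on a list of (schema, probability) pairs."""
    # Normalize each schema to a set of frozensets so cluster order is ignored.
    schemas = [frozenset(frozenset(c) for c in schema) for schema, _ in pairs]
    distinct = len(set(schemas)) == len(schemas)
    in_range = all(0.0 < p <= 1.0 for _, p in pairs)
    sums_to_one = math.isclose(sum(p for _, p in pairs), 1.0)
    return distinct and in_range and sums_to_one

print(is_p_med_schema([(med3, 0.5), (med4, 0.5)]))
```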
49
P-Mappings

Each p-mapping lists candidate mappings between S1(name, hP, oP, hA, oA) and one mediated schema, each with a probability. The arrows that distinguish the individual mappings are part of the slide figure.

PM1, for Med3 (name, hPP, oP, hAA, oA): four mappings with Pr = .64, .16, .16, .04
PM2, for Med4 (name, oPP, hP, oAA, hA): four mappings with Pr = .64, .16, .16, .04
50
Expressive Power of P-Med-Schema & P-Mapping

Theorem 1. For one-to-many mappings:
(p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings)

Theorem 2. When restricted to one-to-one mappings:
(p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mapping)
51
Next

- Creating p-med-schemas (briefly)
- Creating p-mappings (briefly)
- Experimental Results
52
P-med-schema Creation

1. Certain/uncertain edges

Figure: graph over the attributes of S1 and S2 (name, address, email-address, pname, home-address) with edge weights 1, .6, .2.
53
P-med-schema Creation

2. Clustering

Figure: four candidate clusterings of the S1 and S2 attributes (name, address, email-address, pname, home-address).
54
P-med-schema Creation

3. Assign probabilities

Figure: the four candidate clusterings, with probability labels including Pr = 1/6 and Pr = 1/3.
55
P-mapping Creation

S = (num, pname, home-addr, office-addr)
T = (name, mailing-addr)

Figure: weighted correspondences between S and T attributes, with weights 0.8, 0.9, 0.2.

Goal: find a p-mapping that is consistent with a set of weighted correspondences

Theorem: A consistent p-mapping exists if and only if, for every source/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
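The condition in the theorem is a per-attribute sum, so it can be checked in one pass over the correspondences. In the sketch below, which attribute pairs carry the weights 0.8, 0.9 and 0.2 is an assumption made for illustration, since the figure's edges are not recoverable from the text; `has_consistent_p_mapping` is likewise a hypothetical helper name.

```python
from collections import defaultdict

def has_consistent_p_mapping(correspondences, tol=1e-9):
    """Apply the theorem: every attribute's correspondence weights sum to <= 1."""
    total = defaultdict(float)
    for (source_attr, target_attr), weight in correspondences.items():
        total[("S", source_attr)] += weight
        total[("T", target_attr)] += weight
    return all(t <= 1.0 + tol for t in total.values())

# ASSUMED assignment of the slide's weights to attribute pairs (hypothetical).
assumed = {
    ("pname", "name"): 0.8,
    ("home-addr", "mailing-addr"): 0.9,
    ("office-addr", "mailing-addr"): 0.2,
}

# Same correspondences with the last weight lowered so each sum is at most 1.
adjusted = {**assumed, ("office-addr", "mailing-addr"): 0.1}

print(has_consistent_p_mapping(assumed))
print(has_consistent_p_mapping(adjusted))
```

Under the assumed assignment, mailing-addr is involved in correspondences weighing 0.9 + 0.2 = 1.1, so no consistent p-mapping exists until the weights are adjusted.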
56
Experiments

Data: tables extracted from HTML tables on the web

Domain | #Sources | Search Keywords
Movie | 161 | movie, year
Car | 817 | make, model
People | 49 | job/title, organization/company/employer
Course | 647 | course/class, instructor/teacher/lecturer, subject/department/title
Bib | 649 | author, title, year, journal/conference
57
Experiments

- Gold standard: manual
- Approximate standard: semi-automatic
- Precision, recall, F-measure for several SQL queries with varying attributes, selectivities
58
Quality of Query Answering

Domain | Precision | Recall | F-measure

Golden Standard
People | 1 | .849 | .918
Course | 1 | .852 | .92

Approximate Golden Standard
Movie | .95 | 1 | .924
Car | 1 | .917 | .957
People | .958 | .984 | .971
Course | 1 | 1 | 1
Bib | 1 | .955 | .977
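For readers who want to relate the three columns, the sketch below uses the standard definition of F-measure as the harmonic mean of precision and recall. The slides do not state the formula, so this is an assumption; if the reported figures are averages over several queries, a row need not reproduce exactly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from two rows of the table above.
people_gold = f_measure(1.0, 0.849)
car_approx = f_measure(1.0, 0.917)

print(f"People (golden standard): {people_gold:.3f}")
print(f"Car (approximate golden standard): {car_approx:.3f}")
```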
59
Comparison with Other Approaches

- Keyword search obtained low precision and low recall.
- Querying the sources directly or considering only the highest probability mapping obtained low recall.
- We obtained highest F-measure in all domains.
60
Comparison with Other Mediated-Schema Generation Methods

Using p-med-schema obtained highest F-measure in all domains.
61
System Setup Time (one domain)

Figure: system setup time for one domain.
62
Brief Related Work

- Approximate schema mappings: [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]
- Automatic generation of mediated schemas: [He et al. 2003]
- More (see paper)
63
Finally…

Other Research
– Data Integration (2)
– Deduplication (2)
– Quality Estimation of Sensor/RFID Streams [IQIS 06]

Future Plans
64
Data Integration

Problem: Foundations for integration of uncertain data
Solution [TR 08]:
- Define open- and closed-containment for uncertain data
- Algorithms, complexity of consistency checking and finding maximally-correct query answers

Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism)
Solution [TR 08]:
- Algorithms, complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering
65
Deduplication

[SIGMOD 07]
- Leveraging real-world constraints for deduplication
- Tractable optimal solution and experiments over DBLP and ACM publication data

[WWW 07]
- Detecting near-duplicate web-pages for crawling
- Efficient indexing scheme supporting crawling speeds over web-scale data
66
Future Work

Short & Medium-Term
1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
2. More applications of uncertain data
3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …
67
Future Work

Long-term
1. Applying uncertainty to other data management problems: query optimization? cloud computing?
2. Improve quality of data through conflict resolution and feedback
3. Web-data management: handling huge amounts of data that are conflicting, uncertain, redundant, dependent, …
68
Thanks!

Anish Das Sarma
anish@cs.stanford.edu
http://i.stanford.edu/~anishds
(or search "Anish Das Sarma")