Managing Uncertain Data
Anish Das Sarma
Stanford University
May 19, 2015

What is Uncertain Data?
  (Certain) Data                            Uncertain Data
  Temperature is 74.634589 F                Sensor reported 75 ± 0.5 F
  Bob works for Yahoo                       Bob works for either Yahoo or Microsoft
  Mary sighted a Finch                      Mary sighted either a Finch (80%) or a Sparrow (20%)
  It will rain in Stanford tomorrow         There is a 60% chance of rain in Stanford tomorrow
  Yahoo stocks will be at 100 in a month    Yahoo stock will be between 60 and 120 in a month
  John’s age is 23                          John’s age is in [20,30]

Why Does It Arise?
  The uncertain examples above arise from:
  - Precision of devices
  - Lack of information
  - Uncertainty about the future
  - Anonymization

Applications: Information Extraction
  Restaurant        Zip
  Hard Rock Cafe    94111, 94133, 94109

Applications: Information Integration
  (name, hPhone, oPhone, hAddr, oAddr)
  (name, phone, address)
  Combined View

Applications: Deduplication
  Name
  John Doe
  J. Doe
  ? 80% match

Applications: Scientific & Medical Experiments
  Probably not cancer

How Do Database Management Systems (DBMS) Handle Uncertainty?
  They don’t ☹

What Do (Most) Applications Do?
  Clean: turn into data that DBMSs can handle
  Observer   Bird-1 (uncertain)                 Bird-1 (cleaned)
  Mary       Finch: 80%, Sparrow: 20%           Finch
  Susan      Dove: 70%, Sparrow: 30%            Dove
  Jane       Hummingbird: 65%, Sparrow: 35%     Hummingbird
  (1) Loss of information
  (2) Errors compound insidiously
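A small sketch of why cleaning loses information and compounds errors, using the sighting table above. Treating the three observers as independent is an assumption made only for this illustration; the function and variable names are hypothetical.

```python
from math import prod

# Sighting data from the slide: each observer's alternatives with probabilities.
sightings = {
    "Mary": {"Finch": 0.80, "Sparrow": 0.20},
    "Susan": {"Dove": 0.70, "Sparrow": 0.30},
    "Jane": {"Hummingbird": 0.65, "Sparrow": 0.35},
}


def clean(table):
    """Keep only the most likely alternative per observer (the 'clean' step)."""
    return {observer: max(alts, key=alts.get) for observer, alts in table.items()}


cleaned = clean(sightings)

# Assumption: observers are independent of one another.
# Probability that every cleaned value is the true one (about 0.364).
p_all_correct = prod(sightings[obs][bird] for obs, bird in cleaned.items())

# Probability that at least one observer actually saw a Sparrow (about 0.636),
# even though Sparrow no longer appears anywhere in the cleaned table.
p_some_sparrow = 1 - prod(1 - alts["Sparrow"] for alts in sightings.values())
```

Under that assumption the cleaned table is entirely correct only about a third of the time, and the most probable shared explanation (a Sparrow) has disappeared.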

Outline of The Talk
  Part 1: Managing Uncertainty in a DBMS (theory → systems)
  Part 2: Handling Uncertainty in Data Integration (systems → theory)
  Other Research (trailer)
  Future Plans

Part 1: Managing Uncertain Data
  Primarily in the context of the Trio project
  1) Data
  2) Uncertainty
  3) Lineage
  Today’s focus: how lineage helps

Uncertain Data
  Examples of uncertain data:
  - Sensor reported 75 ± 0.5 F
  - Bob works for either Yahoo or Microsoft
  - Mary sighted either a Finch (80%) or a Sparrow (20%)
  - There is a 60% chance of rain in Stanford tomorrow
  An uncertain database represents a set of possible instances (or, possible worlds)
  Our work: finite sets of possible instances

Representing Uncertain Data
  - 20+ years of work (mostly theoretical)
  - Appears to be a fundamental trade-off between expressiveness & intuitiveness
  - We spent some time exploring the space of models for uncertainty

Hierarchy of Models [ICDE 06]
  Models in the hierarchy figure:
    R      relations
    A      or-sets
    ?      maybe-tuples
    2      2-clauses
    prop   full propositional logic
    sets   tuple-sets
  One end of the hierarchy: + Expressive, - Complex
  Other end of the hierarchy: + Intuitive, - Inexpressive
  Next
  1. Consider a model M
  2. Isolate inexpressiveness
  3. Solve problem with lineage

Running Example: Crime-Solver
  Saw (witness, color, car)      // may be uncertain
  Drives (person, color, car)    // may be uncertain
  Suspects (person) = π_person (Saw ⋈ Drives)

Simple Model M
  1. Alternatives: uncertainty about value
  2. ‘?’ (Maybe) Annotations
  Saw (witness, color, car)
    Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
  Three possible instances

Simple Model M
  1. Alternatives
  2. ‘?’ (Maybe): uncertainty about presence
  Saw (witness, color, car)
    Amy     red, Honda ∥ red, Toyota ∥ orange, Mazda
    Betty   blue, Acura                                  ?
  Six possible instances
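The instance counts on these two slides can be checked by enumerating choices: one alternative per tuple, plus an "absent" option for every tuple marked ‘?’. The sketch below is only an illustration; the representation of a relation as (alternatives, maybe) pairs and the name possible_instances are assumptions, not Trio's implementation.

```python
from itertools import product


def possible_instances(relation):
    """Enumerate possible instances of a relation in model M.

    relation: list of (alternatives, maybe) pairs, where `alternatives` is a
    list of tuples and `maybe` says whether the tuple carries a '?' annotation.
    """
    choices = []
    for alternatives, maybe in relation:
        options = list(alternatives)
        if maybe:
            options.append(None)  # None stands for "tuple is absent"
        choices.append(options)
    return [
        frozenset(t for t in combo if t is not None)
        for combo in product(*choices)
    ]


amy = ([("Amy", "red", "Honda"), ("Amy", "red", "Toyota"), ("Amy", "orange", "Mazda")], False)
betty = ([("Betty", "blue", "Acura")], True)

three = possible_instances([amy])          # alternatives only
six = possible_instances([amy, betty])     # alternatives plus one maybe-tuple
```

Three alternatives for Amy give three instances; Betty's ‘?’ doubles that to six.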

Review: Relational Queries
  Saw (witness, color, car)
    Amy, red, Honda
    Betty, blue, Acura
  Q = π_witness (σ_color=red (Saw))
  W (witness)
    Amy

Queries on Uncertain Data
  [Diagram]
    D  → (possible instances) → I_1, I_2, …, I_n
    I_1, I_2, …, I_n  → (Q on each instance) → J_1, J_2, …, J_m
    J_1, J_2, …, J_m  → (rep. of instances, the up-arrow) → D′
    D  → (direct implementation) → D′
  Closure: up-arrow always exists
  Completeness: all sets of possible instances can be represented

Model M is Not Closed
  Saw (witness, car)
    Cathy   Honda ∥ Mazda
  Drives (person, car)
    Jimmy, Toyota ∥ Jimmy, Mazda
    Billy, Honda ∥ Frank, Honda
    Hank, Honda
  Suspects = π_person (Saw ⋈ Drives)
    Jimmy            ?
    Billy ∥ Frank    ?
    Hank             ?
  Does not (and CANNOT) correctly capture possible instances in the result

Lineage to the Rescue
  Model M + Lineage = Completeness

Example with Lineage
  ID   Saw (witness, car)
  11   Cathy   Honda ∥ Mazda
  ID   Drives (person, car)
  21   Jimmy, Toyota ∥ Jimmy, Mazda
  22   Billy, Honda ∥ Frank, Honda
  23   Hank, Honda
  Suspects = π_person (Saw ⋈ Drives)
  ID   Suspects
  31   Jimmy            ?
  32   Billy ∥ Frank    ?
  33   Hank             ?
  Lineage:
    λ(31) = (11,2) ∧ (21,2)
    λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
    λ(33) = (11,1) ∧ 23
  Correctly captures possible instances in the result
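To see what the lineage buys, one can enumerate every possible world of Saw and Drives, evaluate the query directly in each, and compare with the result obtained by keeping a Suspects alternative exactly when its lineage is satisfied. This is a brute-force sketch for the slide's example only; the data layout and function names are assumptions made for illustration.

```python
from itertools import product

# Base x-tuples: id -> list of alternatives (numbered from 1, as on the slide).
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}
# Lineage from the slide: result alternative -> required (base id, alternative).
lineage = {
    "Jimmy": {(11, 2), (21, 2)},
    "Billy": {(11, 1), (22, 1)},
    "Frank": {(11, 1), (22, 2)},
    "Hank": {(11, 1), (23, 1)},
}

base = {**saw, **drives}
ids = sorted(base)


def worlds():
    """Yield each possible world as a dict: base id -> chosen alternative number."""
    for combo in product(*(range(1, len(base[i]) + 1) for i in ids)):
        yield dict(zip(ids, combo))


def suspects_by_query(world):
    """Evaluate π_person(Saw ⋈ Drives) directly on one possible world."""
    chosen = {i: base[i][world[i] - 1] for i in ids}
    seen_cars = {chosen[i][1] for i in saw}
    return frozenset(chosen[i][0] for i in drives if chosen[i][1] in seen_cars)


def suspects_by_lineage(world):
    """Keep a result alternative exactly when its lineage holds in this world."""
    return frozenset(
        person
        for person, required in lineage.items()
        if all(world[i] == alt for i, alt in required)
    )


from_query = {suspects_by_query(w) for w in worlds()}
from_lineage = {suspects_by_lineage(w) for w in worlds()}
```

Both routes yield the same four result instances. Without lineage, three independent ‘?’ tuples would describe twelve instances, including {Jimmy, Hank}, which cannot occur because Jimmy requires the Mazda sighting and Hank the Honda sighting.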

Trio’s Data Model
  1. Alternatives
  2. ‘?’ (Maybe) Annotations
  3. Confidence values (next)
  4. Lineage
  Together: Uncertainty-Lineage Databases (ULDBs)
  Theorem: ULDBs are closed and complete [VLDB 06]
  Formally studied properties like minimization, equivalence, approximation and membership. [VLDB 06, VLDB J. 08]

Confidence Values in Trio
  Confidence values supplied with base data
  - Default probabilistic interpretation
  Problem: Compute confidence values on result data [ICDE 08]
  5-minute DBClip: search “confidence computation” on YouTube.

Problem Description
  ID   Saw (witness, car)
  11   (Amy, Honda) : 0.5
  12   (Betty, Acura) : 0.6
  ID   Drives (person, car)
  21   (Jimmy, Honda) : 0.9
  22   (Billy, Honda) : 0.8
  23   (Hank, Acura) : 1.0
  Cars = π_car (Saw ⋈ Drives)
  ID   Cars
  41   Honda : ?
  42   Acura : ?

Operator-by-Operator
  (Saw, Drives and Cars as in the Problem Description)
  Saw ⋈ Drives:
  31   (Amy, Jimmy, Honda) : 0.5 * 0.9 = 0.45
  32   (Amy, Billy, Honda) : 0.5 * 0.8 = 0.4
  33   (Betty, Hank, Acura) : 0.6 * 1.0 = 0.6
  π_car:
  41   Honda : 0.45 + 0.4 - (0.45 * 0.4) = 0.67   Wrong!!
  42   Acura : 0.6
  Not independent! Tuples 31 and 32 both derive from Saw tuple 11, so their confidences cannot be combined as if they were independent events.
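The gap between the operator-by-operator number and the true confidence can be reproduced by summing over all possible worlds. The sketch below assumes each base tuple is present independently with its stated confidence; function names are hypothetical and the enumeration is exponential, so it is for checking the slide's arithmetic only.

```python
from itertools import product

# Base data from the slide: id -> (name, car, confidence).
saw = {11: ("Amy", "Honda", 0.5), 12: ("Betty", "Acura", 0.6)}
drives = {
    21: ("Jimmy", "Honda", 0.9),
    22: ("Billy", "Honda", 0.8),
    23: ("Hank", "Acura", 1.0),
}


def exact_confidence(car):
    """Sum the probability of every world in which car is in π_car(Saw ⋈ Drives)."""
    tuples = {**saw, **drives}
    ids = sorted(tuples)
    total = 0.0
    for present in product([True, False], repeat=len(ids)):
        world = dict(zip(ids, present))
        weight = 1.0
        for i in ids:
            p = tuples[i][2]
            weight *= p if world[i] else 1.0 - p
        seen = any(world[i] and saw[i][1] == car for i in saw)
        driven = any(world[i] and drives[i][1] == car for i in drives)
        if seen and driven:
            total += weight
    return total


def operator_by_operator(car):
    """Naive plan: multiply at the join, then combine join results as if independent."""
    joined = [
        s[2] * d[2]
        for s in saw.values()
        for d in drives.values()
        if s[1] == d[1] == car
    ]
    miss = 1.0
    for p in joined:
        miss *= 1.0 - p
    return 1.0 - miss


naive_honda = operator_by_operator("Honda")   # about 0.67, as on the slide
exact_honda = exact_confidence("Honda")       # about 0.49
```

For Acura both routes agree at 0.6, because only one join tuple contributes and no base tuple is shared.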

Database Query Processing 101
  Q → query execution plans → pick and execute best plan
  Plan choice uses statistics and indexes

Operator-by-Operator Confidence Computation
  Q → query plans
  The set of usable plans can be much smaller or empty

Decouple Data and Confidence Computation
  Q → query plans
  1. Compute data
  2. Use lineage to compute confidences (on demand)
  Theorem: Arbitrary improvement. [ICDE 08]

Our Approach
  (Saw, Drives and Cars as in the Problem Description)
  λ(41) = 11 ∧ (21 ∨ 22)
  λ(42) = 12 ∧ 23
  41   Honda : 0.5 * (0.9 + 0.8 - 0.9*0.8) = 0.49   Correct!!
  42   Acura : 0.6

Algorithm
  [Figure: lineage of a result tuple t in R over tuples t1, t2, t4, t5, t6, t7, with λ(t) = f(t4, t5, t6, t7); values shown in the figure: 0.7, 0.9, 1.0, 0.4, 0.823, 0.4]
  1. Expand lineage to base data
  2. Get confidence of base data
  3. Evaluate the probability of λ(t)
  Techniques: detecting independence, memoization, batch computation
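The three steps can be sketched on the Cars example. The code below is an illustration, not Trio's algorithm: lineage is stored as nested ("and" / "or") tuples, expansion is memoized with functools.lru_cache, and the probability is evaluated by Shannon expansion over base tuples that are assumed independent. The intermediate ids 31 to 33 and all function names are assumptions for this sketch.

```python
from functools import lru_cache

# Lineage of derived tuples; leaves are tuple ids.
lineage = {
    41: ("or", 31, 32),
    42: ("or", 33),
    31: ("and", 11, 21),
    32: ("and", 11, 22),
    33: ("and", 12, 23),
}
# Step 2 input: confidences of base tuples.
confidence = {11: 0.5, 12: 0.6, 21: 0.9, 22: 0.8, 23: 1.0}


@lru_cache(maxsize=None)
def expand(node):
    """Step 1: rewrite lineage until only base-tuple ids remain (memoized)."""
    if node in confidence:
        return node
    op, *args = lineage[node]
    return (op, *(expand(a) for a in args))


def variables(formula):
    """Collect the base-tuple ids mentioned in an expanded formula."""
    if isinstance(formula, int):
        return {formula}
    return set().union(*(variables(a) for a in formula[1:]))


def evaluate(formula, assignment):
    """Truth value of the formula under a full assignment of base tuples."""
    if isinstance(formula, int):
        return assignment[formula]
    results = [evaluate(a, assignment) for a in formula[1:]]
    return all(results) if formula[0] == "and" else any(results)


def probability(formula, assignment=None):
    """Step 3: Shannon expansion, P(f) = p*P(f | x) + (1-p)*P(f | not x)."""
    assignment = assignment or {}
    unassigned = sorted(variables(formula) - set(assignment))
    if not unassigned:
        return 1.0 if evaluate(formula, assignment) else 0.0
    x = unassigned[0]
    p = confidence[x]
    return (p * probability(formula, {**assignment, x: True})
            + (1 - p) * probability(formula, {**assignment, x: False}))


honda = probability(expand(41))   # about 0.49
acura = probability(expand(42))   # about 0.6
```

This version is exponential in the number of base tuples in the lineage; detecting independent sub-formulas, as the slide suggests, is what would let a real system avoid that cost.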

Some Other Trio Work
  Modifications and Versioning [TR 08]
  - Stored derived relations
  - Modifications → versions
  Indexes and Statistics [MUD 08]
  - Specialized indexes, histograms
  Functional Dependencies & Schema Design [TR 07]
  - Definitions, sound and complete axiomatization of FDs
  - Lossless decomposition
  - FD testing, finding, and inference

Related Work (sample)
  Modeling Uncertainty: Plenty, covered in textbooks
  Systems: Avatar, BayesStore, MayBMS, MYSTIQ, ORION, PrDB, ProbView, Trio, others?

Part 2: Data Integration
  Reboot! (or, wake up!)

Traditional Data Integration: Setup
  Data sources: D1, D2, D3, D4, D5
  Source schemas:
    Bib(title, authors, conf, year)
    Author(aid, name), Paper(pid, title, year), AuthoredBy(aid, pid)
  1. Mediated Schema
    Publication(title, author, conf, year)
  2. Schema Mappings
    SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid
  3. Query Answering
    Who authored the most SIGMOD papers in the 90’s? Mike Carey
  Significant up-front effort
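As a concrete illustration of what a schema mapping does, the sketch below runs the slide's mapping query against an in-memory SQLite database using Python's standard sqlite3 module. The sample rows (author names, paper title, year) are hypothetical, and the ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Author(aid INTEGER, name TEXT);
    CREATE TABLE Paper(pid INTEGER, title TEXT, year INTEGER);
    CREATE TABLE AuthoredBy(aid INTEGER, pid INTEGER);
    -- Hypothetical sample rows, not taken from the talk.
    INSERT INTO Author VALUES (1, 'A. Author'), (2, 'B. Writer');
    INSERT INTO Paper VALUES (10, 'Example Paper', 1995);
    INSERT INTO AuthoredBy VALUES (1, 10), (2, 10);
    """
)

# The mapping from the slide: populate Publication(title, author, conf, year)
# from the Author / Paper / AuthoredBy source. This source has no conference
# attribute, so conf is NULL.
mapping = """
    SELECT P.title AS title, A.name AS author, NULL AS conf, P.year AS year
    FROM Author AS A, Paper AS P, AuthoredBy AS B
    WHERE A.aid = B.aid AND P.pid = B.pid
    ORDER BY author
"""
publication = conn.execute(mapping).fetchall()
conn.close()
```

Each source needs its own hand-written mapping of this kind, which is where the up-front effort comes from.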

“Pay-As-You-Go” Data Integration
  1. Automated best-effort integration from the outset
  2. Further improve the system over time with feedback
  How advanced a starting point can we provide?

Uncertainty to the Rescue
  Automatic integration → Make guesses → Model probabilities
  Specifically
  - Probabilistic schema mappings
  - Probabilistic mediated-schema
  >90% accuracy in automatically integrating 50-800 data sources for several domains [SIGMOD 08]

Next
  1. Probabilistic mediated schemas
  2. Probabilistic schema mappings
  3. Experimental results

Mediated Schema
  S1(name, email, phone-num, address)
  S2(person-name, phone, mailing-addr)
  Med-S (name, email, phone, addr)
    name:  {name, person-name}
    phone: {phone-num, phone}
    addr:  {address, mailing-addr}
    email: {email}
  A mediated schema is a clustering of a subset of the set of all attributes appearing in source schemas.

Example
  S1(name, hPhone, oPhone, hAddr, oAddr)
  S2(name, phone, address)
  Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
  Q: SELECT name, hPhone, oPhone FROM Med
  ?

Example
  S1(name, hPhone, oPhone, hAddr, oAddr)
  S2(name, phone, address)
  Mediated schemas considered, one added per step:
  Med1 ({name}, {phone, hPhone, oPhone}, {address, hAddr, oAddr})
  Med2 ({name}, {phone, hPhone}, {oPhone}, {address, oAddr}, {hAddr})
  Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})
  Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})
  Med5 ({name}, {phone}, {hPhone}, {oPhone}, {address}, {hAddr}, {oAddr})
  Q: SELECT name, phone, address FROM Med

Probabilistic Mediated Schema
  S1(name, hPhone, oPhone, hAddr, oAddr)
  S2(name, phone, address)
  Med3 ({name}, {phone, hPhone}, {oPhone}, {address, hAddr}, {oAddr})   Pr = 0.5
  Med4 ({name}, {phone, oPhone}, {hPhone}, {address, oAddr}, {hAddr})   Pr = 0.5
  A Probabilistic Mediated Schema (p-med-schema) is a set
    M = {(M_1, Pr(M_1)), …, (M_k, Pr(M_k))}
  where
  - M_i is a med-schema; i ≠ j ⇒ M_i ≠ M_j
  - Pr(M_i) ∈ (0,1]; Σ Pr(M_i) = 1
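The definition translates almost directly into a validity check. In the sketch below a med-schema is modeled as a set of frozenset clusters; this representation and the names cluster, is_clustering and is_p_med_schema are assumptions made for illustration.

```python
from math import isclose


def cluster(*attributes):
    """A cluster of source attributes treated as one mediated attribute."""
    return frozenset(attributes)


def is_clustering(schema):
    """Clusters of a med-schema must not share attributes."""
    seen = set()
    for c in schema:
        if seen & c:
            return False
        seen |= c
    return True


def is_p_med_schema(pairs):
    """Check the definition: distinct med-schemas, Pr in (0,1], Pr sums to 1."""
    schemas = [frozenset(schema) for schema, _ in pairs]
    probs = [pr for _, pr in pairs]
    return (
        all(is_clustering(s) for s in schemas)
        and len(set(schemas)) == len(schemas)      # i != j implies M_i != M_j
        and all(0 < pr <= 1 for pr in probs)       # Pr(M_i) in (0, 1]
        and isclose(sum(probs), 1.0)               # probabilities sum to 1
    )


med3 = {cluster("name"), cluster("phone", "hPhone"), cluster("oPhone"),
        cluster("address", "hAddr"), cluster("oAddr")}
med4 = {cluster("name"), cluster("phone", "oPhone"), cluster("hPhone"),
        cluster("address", "oAddr"), cluster("hAddr")}

valid = is_p_med_schema([(med3, 0.5), (med4, 0.5)])
duplicate = is_p_med_schema([(med3, 0.5), (med3, 0.5)])
bad_sum = is_p_med_schema([(med3, 0.7), (med4, 0.5)])
```

The slide's example passes; repeating a schema or using probabilities that do not sum to 1 fails.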

P-Mappings
  PM1: p-mapping between S1(name, hP, oP, hA, oA) and Med3 (name, hPP, oP, hAA, oA)
    Four possible mappings, with Pr = .64, .16, .16, .04
  PM2: p-mapping between S1(name, hP, oP, hA, oA) and Med4 (name, oPP, hP, oAA, hA)
    Four possible mappings, with Pr = .64, .16, .16, .04
  [Figure: the attribute correspondences of each possible mapping are drawn as edges in the original slide]

Expressive Power of P-Med-Schema & P-Mapping
  Theorem 1. For one-to-many mappings:
    (p-med-schema + p-mappings) = (mediated schema + p-mapping) > (p-med-schema + mappings)
  Theorem 2. When restricted to one-to-one mappings:
    (p-med-schema + p-mappings) = (p-med-schema + mappings) > (mediated schema + p-mapping)

Next
  - Creating p-med-schemas (briefly)
  - Creating p-mappings (briefly)
  - Experimental Results

P-med-schema Creation
  [Figure: graph over the attributes of S1 and S2 (name, address, email-address, pname, home-address), with edge labels 1, .6, .2]
  1. Certain/uncertain edges
  2. Clustering
  3. Assign probabilities (values shown in the figure: Pr = 1/6, Pr = 1/3)

P-mapping Creation
  S = (num, pname, home-addr, office-addr)
  T = (name, mailing-addr)
  Weighted correspondences between S and T attributes: 0.8, 0.9, 0.2
  Goal: find a p-mapping that is consistent with a set of weighted correspondences
  Theorem: A consistent p-mapping exists if and only if, for every source/target attribute a, the sum of the weights of all correspondences that involve a is at most 1.
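The condition in the theorem is easy to test mechanically. In the sketch below, the assignment of the three weights to particular attribute pairs is an assumption (the figure's edges did not survive in this transcript), and has_consistent_p_mapping is a hypothetical name.

```python
from collections import defaultdict


def has_consistent_p_mapping(correspondences):
    """Theorem's test: per attribute, the correspondence weights sum to at most 1.

    correspondences: dict mapping (source_attr, target_attr) -> weight.
    """
    totals = defaultdict(float)
    for (source_attr, target_attr), weight in correspondences.items():
        totals[("S", source_attr)] += weight
        totals[("T", target_attr)] += weight
    # Small tolerance guards against floating-point rounding.
    return all(total <= 1.0 + 1e-9 for total in totals.values())


# Assumed edge assignment for the slide's weights 0.8, 0.9, 0.2.
slide_weights = {
    ("pname", "name"): 0.8,
    ("home-addr", "mailing-addr"): 0.9,
    ("office-addr", "mailing-addr"): 0.2,
}
# Hypothetical variant in which mailing-addr's weights sum to exactly 1.
adjusted_weights = {**slide_weights, ("office-addr", "mailing-addr"): 0.1}

slide_ok = has_consistent_p_mapping(slide_weights)        # 0.9 + 0.2 > 1
adjusted_ok = has_consistent_p_mapping(adjusted_weights)  # 0.9 + 0.1 <= 1
```

Under this assumed assignment, mailing-addr carries total weight 1.1, so no consistent p-mapping exists until the weights are adjusted.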

Experiments
  Data: tables extracted from HTML tables on the web
  Domain   #Sources   Search Keywords
  Movie    161        movie, year
  Car      817        make, model
  People   49         job/title, organization/company/employer
  Course   647        course/class, instructor/teacher/lecturer, subject/department/title
  Bib      649        author, title, year, journal/conference

Experiments
  Gold standard: manual
  Approximate standard: semi-automatic
  Precision, recall, F-measure for several SQL queries varying attributes, selectivities
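For reference, the three metrics can be computed from the set of answers a query returns and the set of answers in the standard. This is the usual textbook definition, given here as a sketch; the function name and the toy answer sets are hypothetical and are not the experiment's data.

```python
def precision_recall_f(returned, standard):
    """Precision, recall and F-measure of returned answers against a standard."""
    returned, standard = set(returned), set(standard)
    correct = len(returned & standard)
    precision = correct / len(returned) if returned else 1.0
    recall = correct / len(standard) if standard else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: every returned answer is right, but one correct answer is missed.
p, r, f = precision_recall_f({"a", "b", "c"}, {"a", "b", "c", "d"})
```

Here precision is 1, recall is 0.75, and the F-measure is their harmonic mean, 6/7.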

Quality of Query Answering
  Domain   Precision   Recall   F-measure
  Golden Standard
  People   1           .849     .918
  Course   1           .852     .92
  Approximate Golden Standard
  Movie    .95         1        .924
  Car      1           .917     .957
  People   .958        .984     .971
  Course   1           1        1
  Bib      1           .955     .977

Comparison with Other Approaches
  - Keyword search obtained low precision and low recall.
  - Querying the sources directly or considering only the highest probability mapping obtained low recall.
  - We obtained highest F-measure in all domains.

Comparison with Other Mediated-Schema Generation Methods
  Using p-med-schema obtained highest F-measure in all domains.

System Setup Time (one domain)
  [Figure not included in this transcript]

Brief Related Work
  Approximate schema mappings: [Magnani et al. 2007], [Gal 2007], [Dong et al. 2007]
  Automatic generation of mediated schemas: [He et al. 2003]
  More (see paper)

Finally…
  Other Research
  - Data Integration (2)
  - Deduplication (2)
  - Quality Estimation of Sensor/RFID Streams [IQIS 06]
  Future Plans

Data Integration
  Problem: Foundations for integration of uncertain data
  Solution [TR 08]:
  - Define open- and closed-containment for uncertain data
  - Algorithms, complexity of consistency checking and finding maximally-correct query answers
  Problem: Dependencies in web-data integration (e.g., deep-web, plagiarism)
  Solution [TR 08]:
  - Algorithms, complexity of fundamental problems: coverage estimation, cost minimization and coverage maximization, and source ordering

Deduplication
  [SIGMOD 07]
  - Leveraging real-world constraints for deduplication
  - Tractable optimal solution and experiments over DBLP and ACM publication data
  [WWW 07]
  - Detecting near-duplicate web-pages for crawling
  - Efficient indexing scheme supporting crawling speeds over web-scale data

Future Work
  Short & Medium-Term
  1. View management over uncertain databases: materialized view updates, versioning, partial materialization, …
  2. More applications of uncertain data
  3. More on lineage: internal/external lineage, approximate lineage, uncertain lineage, …

Future Work
  Long-term
  1. Applying uncertainty to other data management problems: query optimization? cloud computing?
  2. Improve quality of data through conflict resolution and feedback
  3. Web-data management: handling huge amounts of data that is conflicting, uncertain, redundant, dependent, …

Thanks!
  Anish Das Sarma
  anish@cs.stanford.edu
  http://i.stanford.edu/~anishds
  (or search “Anish Das Sarma”)
