Published by Randolf Kelley. Modified over 9 years ago.
1
Lecture 1: Introduction
Faculty of Computer Science, Technion – Israel Institute of Technology
Spring 2015
2
Assumed Background
Databases
– Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)
Algorithms and complexity
– Asymptotic running time, PTIME, NP, completeness, reduction
Basic probability theory
– Probability space, event, random variable, conditional probability
3
Attendance Requirement
4 mandatory assignments, no exam
– Theoretical (20%), programmatic (30%), theoretical (20%), programmatic (30%)
To get a grade, students must submit all assignments and attend lectures – 5 misses is unacceptable
– Exception: students who miss 3-5 lectures can get a grade by taking an easy exam on the course material
   The exam must be passed and counts for 10% of the grade (other grades normalized accordingly)
4
Lecture 1: Introduction
UNCERTAINTY IN DATABASES
5
Some Modern Database Content
[Diagram labels: Integration; Business DBs; Social Media; Text Analytics / NLP; Sensing Data; OCR / Image; Web Pages; Financial Reports; Med Reports; Knowledge Bases; Gov Reports; Signal / Image Processing]
6
Knowledge Bases
Microsoft Probase, Google Knowledge Graph, Google Knowledge Vault, Stanford DeepDive, MPI YAGO, CMU NELL, Freebase, ...

Instance  Concept   Probability
Israel    country   0.4
          Person    0.2
          location  0.35

[Diagram labels: Concept, Instance, Attribute, Value, Relationship]
7
Relating to Big Data
Missing information
Conflicting information
Probabilistic information
8
Popular Topics in DB Research
VLDB 2014 Ten Year Best Paper
– Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases
PODS 2014 Keynote
– Leonid Libkin: Incomplete data: what went wrong, and how to fix it
SIGMOD/PODS 2014 Workshop on Big Uncertain Data
– Kimelfeld (DB) and Kersting (AI)
ICDT 2013 Test-of-Time Award
– Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering
9
What’s in the Course?
Principled, application-independent paradigms for managing uncertainty in data
– Incomplete / inconsistent / probabilistic databases
Two key aspects of every paradigm:
– Representation
   How do we represent what we know, what is missing, and how confident we are?
– Query evaluation
   What does query answering mean in the presence of uncertainty?
   What is the computational complexity involved?
10
Lecture 1: Introduction
INCOMPLETE DATABASES
11
Missing Information
Problem: pieces of data are missing, but we need to keep whatever partial knowledge we have

Registrations
student  course
Ahuva    PL

Courses
course  lecturer
PL      Eran

A source tells us that Alon is a student of Keren
– How can we represent it in our DB?

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

⊥ = NULL
12
SQL’s NULL
NULL is SQL’s special “missing value”
Queries are written exactly as over complete tables, but SQL assigns a special behavior to logic over NULL
– “Three-valued logic”: true, false, unknown
Alas, there are some issues...
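A minimal sketch of the three-valued logic mentioned above, assuming Python's None stands in for SQL's unknown. The helper names (not3, and3, or3, eq3) are hypothetical and only illustrate the truth tables; they are not part of any SQL engine.

```python
from typing import Optional

# True, False, or None (None plays the role of SQL's "unknown").
Truth = Optional[bool]

def not3(a: Truth) -> Truth:
    # NOT unknown is still unknown.
    return None if a is None else (not a)

def and3(a: Truth, b: Truth) -> Truth:
    # A single false decides the conjunction, even next to unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a: Truth, b: Truth) -> Truth:
    # A single true decides the disjunction, even next to unknown.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def eq3(x, y) -> Truth:
    # Any comparison that involves NULL evaluates to unknown.
    if x is None or y is None:
        return None
    return x == y

course = None  # Alon's missing course
tautology = or3(eq3(course, "PL"), not3(eq3(course, "PL")))
print(tautology)  # None: "course='PL' OR course!='PL'" is unknown, not true
```

A WHERE clause keeps a row only when its condition is true, so a row whose condition evaluates to unknown is dropped.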
13
Try It Yourself (psql)

    CREATE TABLE Registrations(
      student varchar(40),
      course varchar(40));
    INSERT INTO Registrations VALUES
      ('Ahuva','PL'), ('Alon',NULL);

    CREATE TABLE Courses(
      course varchar(40),
      lecturer varchar(40));
    INSERT INTO Courses VALUES
      ('PL','Eran'), (NULL,'Keren');

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

    SELECT student, lecturer
    FROM Registrations R, Courses C
    WHERE R.course = C.course;

student  lecturer
Ahuva    Eran

Of course, we've lost our initial association (join)...
14
Try More Yourself (psql)

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

    SELECT student FROM Registrations;
Result: Ahuva, Alon

    SELECT student FROM Registrations WHERE course='PL';
Result: Ahuva

    SELECT student FROM Registrations WHERE course!='PL';
Result: (empty)

    SELECT student FROM Registrations WHERE course='PL' OR course!='PL';
Result: Ahuva (Alon??)

Inconsistent logic... real problem!
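The same experiment can be reproduced without a PostgreSQL server. This is a sketch that assumes SQLite (through Python's built-in sqlite3 module) in place of psql; comparisons with NULL evaluate to unknown there as well. The helper name students is hypothetical.

```python
import sqlite3

# In-memory database, so nothing is left behind after the run.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registrations(student varchar(40), course varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
""")

def students(where: str) -> list:
    # Hypothetical helper: run the slide's query with the given WHERE condition.
    sql = "SELECT student FROM Registrations WHERE " + where + " ORDER BY student"
    return [row[0] for row in conn.execute(sql)]

eq = students("course = 'PL'")
neq = students("course != 'PL'")
either = students("course = 'PL' OR course != 'PL'")
null_aware = students("course = 'PL' OR course != 'PL' OR course IS NULL")

print(eq, neq, either, null_aware)
```

Only the last condition, which tests for NULL explicitly with IS NULL, brings Alon back.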
15
Labeled Nulls in “Naive” Tables
Just like nulls, but each null has a name
– We do not know what the value is, but we do know that two nulls with the same name are the same

Registrations
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Courses
course  lecturer
PL      Eran
⊥1      Keren
⊥2      Shaul

Join result:
student  course  lecturer
Ahuva    PL      Eran
Alon     ⊥1      Keren
Ahuva    ⊥2      Shaul
???      ???
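A minimal sketch of joining naive tables, assuming a labeled null is modeled as a small Python object that equals only a null with the same name. The class Null and the function naive_join are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:
    # A labeled null: equal to another Null with the same name, and to nothing else.
    name: str

N1, N2 = Null("1"), Null("2")

registrations = [("Ahuva", "PL"), ("Alon", N1), ("Ahuva", N2)]
courses = [("PL", "Eran"), (N1, "Keren"), (N2, "Shaul")]

def naive_join(regs, crs):
    # Treat nulls as ordinary constants: join on plain equality of the course value.
    return [(s, c, l) for (s, c) in regs for (c2, l) in crs if c == c2]

result = naive_join(registrations, courses)
for row in result:
    print(row)
```

Unlike the SQL NULL join, Alon is now associated with Keren, because ⊥1 matches ⊥1. Rows that would arise if, say, ⊥1 happened to equal PL are not produced.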
16
Possible Worlds

Registrations (incomplete)
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Closed-World Assumption: a possible world replaces every null with a value
World: (Ahuva, PL), (Alon, PL), (Ahuva, DB)
World: (Ahuva, PL), (Alon, DB), (Ahuva, DB)
...

Open-World Assumption: a possible world may also contain additional tuples
World: (Ahuva, PL), (Alon, PL), (Ahuva, DB), (Anna, AI)
World: (Ahuva, PL), (Alon, DB), (Ahuva, DB), (Ahuva, AI), (Avi, ML)
...
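A sketch of enumerating possible worlds under the closed-world assumption. It assumes a small finite domain of course names (PL and DB), which is an assumption of this example; in general the domain can be infinite. The function names are hypothetical.

```python
from itertools import product

# Labeled nulls are written as strings that start with "⊥".
table = [("Ahuva", "PL"), ("Alon", "⊥1"), ("Ahuva", "⊥2")]
domain = ["PL", "DB"]  # assumed finite domain, for illustration only

def nulls_of(tbl):
    # Collect the distinct null names that occur in the table.
    return sorted({v for row in tbl for v in row if v.startswith("⊥")})

def possible_worlds(tbl, dom):
    # Closed world: every valuation of the nulls yields one complete table.
    names = nulls_of(tbl)
    worlds = set()
    for values in product(dom, repeat=len(names)):
        valuation = dict(zip(names, values))
        world = frozenset(tuple(valuation.get(v, v) for v in row) for row in tbl)
        worlds.add(world)
    return worlds

worlds = possible_worlds(table, domain)
for world in sorted(worlds, key=sorted):
    print(sorted(world))
```

With two nulls and two domain values there are four valuations, and here they produce four distinct worlds. Under the open-world assumption each of these could additionally be extended with arbitrary tuples, so the set of worlds is no longer finite.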
17
Semantics of Query Answering
[Diagram labels: Incomplete DB; Possible Worlds]
19
Semantics of Query Answering
[Diagram labels: Incomplete DB; Possible Worlds]
– Certain answers (“weak”)
– Represent as an incomplete relation (“strong”)
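A sketch of certain answers: evaluate the query in every possible world and keep only the answers common to all of them. It assumes the closed-world assumption and a finite domain (PL and DB), both assumptions of this example. The query (students registered to PL) and all function names are hypothetical.

```python
from itertools import product

table = [("Ahuva", "PL"), ("Alon", "⊥1"), ("Ahuva", "⊥2")]
null_names = ["⊥1", "⊥2"]
domain = ["PL", "DB"]  # assumed finite domain, for illustration only

def worlds():
    # One complete table per assignment of domain values to the nulls.
    for values in product(domain, repeat=len(null_names)):
        valuation = dict(zip(null_names, values))
        yield {tuple(valuation.get(v, v) for v in row) for row in table}

def query(world):
    # Students registered to the course PL.
    return {student for (student, course) in world if course == "PL"}

answers = [query(w) for w in worlds()]
certain = set.intersection(*answers)   # true in every world
possible = set.union(*answers)         # true in at least one world

print(sorted(certain), sorted(possible))
```

Ahuva is a certain answer because her PL registration appears in every world. Alon is only a possible answer, since he qualifies just in the worlds where ⊥1 is PL.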
20
Application: Data Exchange
[Diagram labels: Mapping; Users; Associations; Messages; Global Schema]
21
The Clio Project
IBM + U. Toronto – tool for data exchange
Commercialized in IBM DB2
22
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance:
TaughtBy
student  lecturer
Ahuva    Shaul
Alon     Keren

Target instance: ??
We don’t have z! So there are 2 options:
1) Abort
2) Do our best to maximize usability
23
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance:
TaughtBy
student  lecturer
Ahuva    Shaul
Alon     Keren

Solution:
Registrations
student  course
Ahuva    ⊥1
Alon     ⊥2

Courses
course  lecturer
⊥1      Shaul
⊥2      Keren
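A sketch of how such a solution can be materialized for this one assertion: every source tuple (x, y) gets a fresh labeled null for the unknown z. This is a hand-written illustration for the single rule on the slide, not a general data exchange engine; the function name materialize is hypothetical.

```python
from itertools import count

def materialize(stud_lecturer):
    # For StudLecturer(x, y), emit Registrations(x, z) and Courses(z, y),
    # where z is a fresh labeled null invented for this source tuple.
    fresh = count(1)
    registrations, courses = [], []
    for student, lecturer in stud_lecturer:
        z = f"⊥{next(fresh)}"
        registrations.append((student, z))
        courses.append((z, lecturer))
    return registrations, courses

source = [("Ahuva", "Shaul"), ("Alon", "Keren")]
registrations, courses = materialize(source)

print(registrations)
print(courses)
```

Using a distinct null per source tuple avoids claiming that Ahuva and Alon take the same course, which the source does not tell us.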
24
Problems Studied in Data Exchange
Materialization
– Many solutions exist; what makes one solution “better” than another? Is there a “best” solution? How can we find it?
Target query answering
– Given a source instance and a query over the target, evaluate the query (semantics / complexity)
Manipulating schema mappings
– Composition and inversion of mappings
25
Lecture 1: Introduction
INCONSISTENT DATABASES
26
Inconsistency
An inconsistent database contains inconsistent (or impossible) information
– Two students have the same ID
– A student gets credit for the same course twice
– A student takes a course that is not listed in the course database
– A student has a grade for this course but a grade is missing for an assignment
Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ
27
Query Answering
Database D

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Courses
course  lecturer
PL      Eran
DC      Keren

Integrity Constraints Σ
Functional Dependency: student, course → grade

    SELECT student
    FROM Grades G, Courses C
    WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer = 'Eran';

Ahuva: grade 90 satisfies the condition
Alon: grade 86 satisfies the condition, but the conflicting grade 81 does not
28
Query Answering
Database D

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Courses
course  lecturer
PL      Eran
DC      Keren

Integrity Constraints Σ
Functional Dependency: student, course → grade

    SELECT student
    FROM Grades G, Courses C
    WHERE G.grade >= 87 AND G.course = C.course AND C.lecturer = 'Eran';

Ahuva: grade 90 satisfies the condition
Alon: neither grade (86 or 81) satisfies the condition
29
Query Answering
Database D

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Courses
course  lecturer
PL      Eran
DC      Keren

Integrity Constraints Σ
Functional Dependency: student, course → grade

    SELECT student
    FROM Grades G, Courses C
    WHERE G.grade >= 80 AND G.course = C.course AND C.lecturer = 'Eran';

Ahuva: grade 90 satisfies the condition
Alon: both conflicting grades (86 and 81) satisfy the condition
30
Minimal Repairs
[Arenas, Bertossi, Chomicki 99]:
DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”

Inconsistent database D
Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Repair D'1
Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86

Repair D'2
Grades
student  course  grade
Ahuva    PL      90
Alon     PL      81
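A sketch of enumerating repairs for this example. It assumes that “minimal way” means deleting as few tuples as possible (subset repairs), which is one common reading and an assumption here. For the dependency student, course → grade that amounts to keeping exactly one grade per (student, course) pair. The function name repairs is hypothetical.

```python
from collections import defaultdict
from itertools import product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]

def repairs(tuples):
    # Group tuples that agree on the left-hand side (student, course).
    groups = defaultdict(list)
    for student, course, grade in tuples:
        groups[(student, course)].append((student, course, grade))
    # A subset repair keeps exactly one tuple from every group.
    return [set(choice) for choice in product(*groups.values())]

all_repairs = repairs(grades)
for repair in all_repairs:
    print(sorted(repair))
```

The number of repairs is the product of the group sizes, so it can grow exponentially with the number of conflicting groups.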
31
Semantics of Query Answering
[Diagram labels: Inconsistent DB; Repairs (consistent DBs)]
33
Semantics of Query Answering
[Diagram labels: Inconsistent DB; Repairs (consistent DBs); Consistent Answers]
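A sketch of consistent answers on the Grades example: run the query on every repair and keep the answers that appear in all of them. It assumes subset repairs (one grade kept per student and course), which is an assumption of this example, and it enumerates repairs by brute force. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]
courses = [("PL", "Eran"), ("DC", "Keren")]

def repairs(tuples):
    # Keep exactly one tuple per (student, course), as required by the FD.
    groups = defaultdict(list)
    for t in tuples:
        groups[(t[0], t[1])].append(t)
    return [set(choice) for choice in product(*groups.values())]

def query(grade_rows, threshold):
    # Students with a grade of at least `threshold` in a course taught by Eran.
    return {s for (s, c, g) in grade_rows for (c2, l) in courses
            if g >= threshold and c == c2 and l == "Eran"}

def consistent_answers(threshold):
    # An answer is consistent if it is returned on every repair.
    return set.intersection(*(query(r, threshold) for r in repairs(grades)))

for threshold in (85, 87, 80):
    print(threshold, sorted(consistent_answers(threshold)))
```

With threshold 85 Alon is returned on only one of the two repairs, so he is not a consistent answer. With threshold 80 he is returned on both.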
34
Algorithms / Complexity
Very recent result by Koutris & Wijsen: For consistent query answering with key constraints, Select-Project-Join (SPJ) queries without repeated relations can be classified into three categories:
1. Rewriting
2. Graph algorithm
3. coNP-complete (exponential time under standard complexity assumptions)
[Diagram labels: Inconsistent DB; ignore inconsistency]
35
Incorporating Preferences

Courses
course  lecturer
DB      Keren
DC      Keren
DC      Eran

Functional dependencies:
course → lecturer
lecturer → course

What if we trust tuple 2 more than tuple 1?

Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): 209-246 (2012)
36
Lecture 1: Introduction
PROBABILISTIC DATABASES
37
How to accommodate the probabilistic nature of data at the database & query level?

Student  University
Ahuva    Technion
Alon     Technion
Alon     HaifaU

Employee  Employer  Role
Ahuva     Intel     Eng
Ahuva     Intel     PM
Ahuva     Intel     VP
Alon      Yahoo!    Eng
Alon      Google    Eng
Alon      Intel     PM

Find the students that are employed as engineers
How many students work at Intel?
Is any PM a Technion student?
38
How to accommodate the probabilistic nature of data at the database & query level?

Student  University  Pr
Ahuva    Technion    1.0
Alon     Technion    0.7
Alon     HaifaU      0.3

Employee  Employer  Role  Pr
Ahuva     Intel     Eng   0.7
Ahuva     Intel     PM    0.2
Ahuva     Intel     VP    0.1
Alon      Yahoo!    Eng   0.4
Alon      Google    Eng   0.4
Alon      Intel     PM    0.1

Find the students that are employed as engineers
– Ahuva (0.7), Alon (0.8)
How many students work at Intel?
– Expectation = 1 + 0.1
Is any PM a Technion student?
– Yes w/ prob 1-((1-0.2)*(1-0.7*0.1)) = 0.256
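A sketch that recomputes the three answers above. It assumes that the alternatives listed for one person are mutually exclusive and that different persons (and the two tables) are independent; this reading is an assumption that happens to reproduce the slide's numbers. The probability 0.4 for Alon at Google is not legible in the table and is inferred from the stated answer Alon (0.8). All names are hypothetical.

```python
import math

# Alternatives of one person are mutually exclusive; persons are independent.
university = {
    "Ahuva": {"Technion": 1.0},
    "Alon": {"Technion": 0.7, "HaifaU": 0.3},
}
employment = {
    "Ahuva": {("Intel", "Eng"): 0.7, ("Intel", "PM"): 0.2, ("Intel", "VP"): 0.1},
    # 0.4 for Google is inferred from the answer "Alon (0.8)", an assumption.
    "Alon": {("Yahoo!", "Eng"): 0.4, ("Google", "Eng"): 0.4, ("Intel", "PM"): 0.1},
}

def pr_role(person, role):
    # Exclusive alternatives: probabilities simply add up.
    return sum(p for (_, r), p in employment[person].items() if r == role)

def pr_employer(person, employer):
    return sum(p for (e, _), p in employment[person].items() if e == employer)

# Q1: probability that each student is employed as an engineer.
engineers = {person: pr_role(person, "Eng") for person in employment}

# Q2: expected number of students at Intel (linearity of expectation).
expected_at_intel = sum(pr_employer(person, "Intel") for person in employment)

# Q3: probability that at least one person is both a PM and a Technion student.
pr_none = math.prod(
    1 - university[person].get("Technion", 0.0) * pr_role(person, "PM")
    for person in employment
)
pr_any_technion_pm = 1 - pr_none

print(engineers, round(expected_at_intel, 3), round(pr_any_technion_pm, 3))
```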
39
Semantics
[Diagram: a probabilistic DB defines a space of ordinary DBs with probabilities p1, p2, p3, p4, ..., pn]
40
Semantics of Query Answering
[Diagram: a probabilistic database defines a space of ordinary DBs with probabilities p1, p2, p3, p4, ..., pn]
41
Semantics of Query Answering
[Diagram: a probabilistic database defines a space of ordinary DBs with probabilities p1, p2, p3, p4, ..., pn; a second row repeats the probabilities p1, p2, p3, p4, ..., pn]
42
Semantics of Query Answering
[Diagram: a probabilistic database defines a space of ordinary DBs with probabilities p1, p2, p3, p4, ..., pn; a second row repeats the probabilities p1, p2, p3, p4, ..., pn]
– Rep of the probability space
– Mapping tuple → marginal probability
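A sketch of the possible-worlds semantics for a tiny probabilistic table. It assumes a tuple-independent model, where every tuple is present independently with its own probability; the three tuples and their probabilities are hypothetical. The probability of a Boolean query is the total probability of the worlds in which it holds.

```python
from itertools import product

# Hypothetical tuple-independent table: (tuple, probability of being present).
ptuples = [(("Ahuva", "PM"), 0.2), (("Alon", "PM"), 0.07), (("Alon", "Eng"), 0.8)]

def worlds(pts):
    # Every subset of the tuples is a world; its probability is a product.
    for bits in product([True, False], repeat=len(pts)):
        world = frozenset(t for (t, _), keep in zip(pts, bits) if keep)
        prob = 1.0
        for (_, p), keep in zip(pts, bits):
            prob *= p if keep else 1 - p
        yield world, prob

def probability(pts, holds):
    # Sum the probabilities of the worlds that satisfy the query.
    return sum(p for world, p in worlds(pts) if holds(world))

def some_pm(world):
    # Boolean query: is anyone a PM in this world?
    return any(role == "PM" for (_, role) in world)

total = sum(p for _, p in worlds(ptuples))
pr_some_pm = probability(ptuples, some_pm)

print(round(total, 6), round(pr_some_pm, 6))
```

The enumeration visits 2^n worlds for n tuples, which is why the direct definition does not scale and dedicated algorithms are needed.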
43
Algorithms for Query Answering
Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
– Queries that can be solved in polynomial time
   By repeated decomposition into simpler queries
– Queries for which answering is #P-hard
   Hence, cannot be computed in polynomial time under standard complexity assumptions
Heuristic via BDDs [Olteanu+]
Guaranteed approximation via sampling
– Additive approx. p ± ε is simple
– Multiplicative approx. (1 ± ε)p requires more work
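A sketch of the additive approximation by sampling. It assumes plain Monte Carlo estimation with the Hoeffding bound to choose the number of samples, which is a standard technique but not necessarily the one meant on the slide. The sampled query (two independent events with probabilities 0.2 and 0.07) and all function names are hypothetical.

```python
import math
import random

def samples_needed(eps: float, delta: float) -> int:
    # Hoeffding: n >= ln(2/delta) / (2 * eps^2) samples guarantee
    # |estimate - p| <= eps with probability at least 1 - delta.
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

def estimate(sample_answer, n: int, rng: random.Random) -> float:
    # Fraction of sampled worlds in which the Boolean query is true.
    return sum(sample_answer(rng) for _ in range(n)) / n

def sample_some_pm(rng: random.Random) -> bool:
    # Draw one random world and evaluate the query on it.
    ahuva_pm = rng.random() < 0.2
    alon_pm = rng.random() < 0.07
    return ahuva_pm or alon_pm

rng = random.Random(0)  # fixed seed so the run is repeatable
n = samples_needed(eps=0.05, delta=0.01)
p_hat = estimate(sample_some_pm, n, rng)

print(n, p_hat)
```

The sample size depends only on ε and δ, not on p. That is why the additive guarantee is simple, while a multiplicative guarantee needs more samples (or a different technique) when p is very small.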
44
Probabilistic XML
[Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation
45
Lecture 1: Introduction
PLANNED SCHEDULE
46
#   Date           Topic                   Due
1   24/03          Intro
2   31/03          DB Essentials
    07/04          Passover
3   12/04* (comp)  Incompleteness
4   14/04          Data Exchange
5   21/04          Inconsistent DBs        Assignment 1 due
6   28/04          Consistent Q Answering
7   05/05          Consistent Q Answering
8   12/05          Pref. Repairs + Misc    Assignment 2 due
9   19/05          Probabilistic DB
10  26/05          Query Inference
    02/06          No Lecture              Assignment 3 due
11  09/06          Query Inference
    16/06          Guest Lecture
12  23/06          Extras                  Assignment 4 due