Probabilistic Models for Relational Data Seminar Data Mining (SS 2005) Prof. Dr. Thomas Hofmann Dipl. Inform. Steffen Hartmann Xin Dong 05,07,2005.


1 Probabilistic Models for Relational Data Seminar Data Mining (SS 2005) Prof. Dr. Thomas Hofmann Dipl. Inform. Steffen Hartmann Xin Dong 05,07,2005

2 History/Introduction
From “flat” data to relational data.
Plate models and probabilistic relational models (PRMs): graphically quite different, but similar in the probabilistic relationships they express.
Probabilistic entity-relationship (PER) model: an extension of the ER model that enhances expressiveness and makes relationships first-class objects, so that relational data is easy to model.
Directed acyclic probabilistic entity-relationship (DAPER) model: more similar, more expressive, through the use of restricted relationships, self relationships, and probabilistic relationships.

3 The Basic Ideas --- ER Model
The entity-relationship (ER) model is a commonly used abstract representation of database structure.
Creating an ER model is the first step in the process of building a relational database.
Features of the anticipated data and how they interrelate are encoded in the model.
The model is used to create a relational schema for the database, which in turn is used to build the database itself.
An ER model is a representation of a database structure, not of a particular database that contains data.

4 The Basic Ideas --- ER Model
Definitions
entity --- a thing or object that is or may be stored in a database
relationship --- a specific interaction among entities
attribute --- a variable describing some property of an entity or relationship

5 The Basic Ideas --- ER Model
Example 1 A university database maintains records on students and their IQs, courses and their difficulty, and the courses taken by students and the grades they receive.
Distinguish between the ER diagram and the ER model:
ER diagram --- the graph only
ER model --- ER diagram + mechanism
Skeleton and instance for an ER model:
skeleton --- collection of corresponding entity and relationship sets
instance --- skeleton + assignment of a value to every attribute
An instance of an ER model is an actual database.

6 Figure: the ER model for the university database.
(a) ER model: entity classes Student (attribute class IQ) and Course (attribute class Diff), and relationship class Takes (attribute class Grade).
(b) An example skeleton for the entity and relationship classes. Entity set Student: john, mary. Entity set Course: cs107, stat10. Relationship set Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10).
(c) The attributes defined by the application of the ER model to the skeleton: john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john,cs107).G, T(mary,cs107).G, T(mary,stat10).G.

7 Figure: skeleton versus instance.
Skeleton for a set of entity and relationship classes: Student = {john, mary}; Course = {cs107, stat10}; Takes (Student, Course) = {(john, cs107), (mary, cs107), (mary, stat10)}.
Instance for an ER model (the skeleton plus a value for every attribute):
Student   IQ
john      120
mary      125
Course    Diff
cs107     A
stat10    B
Takes (Student, Course)   Grade
(john, cs107)             3.0
(mary, cs107)             2.0
(mary, stat10)            1.0
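The distinction between skeleton and instance can be sketched in a few lines of Python. This is only an illustrative encoding, not part of the original slides: the dictionary layout and the function name `ground_attributes` are assumptions, and the pairing of values with rows assumes the values on the slide are listed in row order.

```python
# Hypothetical encoding of the university example (layout is an assumption).
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}
attribute_classes = {"Student": "IQ", "Course": "Diff", "Takes": "Grade"}


def ground_attributes(skeleton, attribute_classes):
    """Create one attribute per member of each entity or relationship set."""
    attrs = []
    for cls, members in skeleton.items():
        for member in members:
            attrs.append((cls, member, attribute_classes[cls]))
    return attrs


attributes = ground_attributes(skeleton, attribute_classes)

# An instance is the skeleton plus a value for every attribute.
# Values are taken from the slide, assuming they are listed in row order.
values = [120, 125, "A", "B", 3.0, 2.0, 1.0]
instance = dict(zip(attributes, values))

for key, value in instance.items():
    print(key, "=", value)
```

The skeleton alone fixes which attributes exist; only the instance says what their values are.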

8 The Basic Ideas --- DAPER Model
The directed acyclic probabilistic entity-relationship (DAPER) model is an ER model with directed (solid) arcs and local distribution classes.
arc class --- represents probabilistic dependencies among corresponding attributes
local distribution class --- defines local distributions for attributes
DAPER diagram --- the graph only
DAPER model --- the diagram + the local distribution classes + the mechanism by which a DAPER model defines a directed acyclic graphical (DAG) model given a skeleton

9 The Basic Ideas --- DAPER Model
Example 2 In the university database (Example 1), a student’s grade in a course depends both on the student’s IQ and on the difficulty of the course.
Components: arc class, constraint, local distribution class.
local distribution class --- a specification from which the local distributions for the attributes corresponding to an attribute class can be constructed when a DAPER model is expanded to a DAG model
The local distribution class for Takes.Grade, p(Takes.Grade | Student.IQ, Course.Diff), is a specification from which the local distributions for Takes(s, c).Grade, for all students s and courses c, can be constructed.

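10 Figure: expanding the DAPER model for the university database.
(a) DAPER model: the ER model of Example 1 with an arc class from Course.Diff to Takes.Grade under the constraint Course[Diff] = Course[Grade], and an arc class from Student.IQ to Takes.Grade under the constraint Student[IQ] = Student[Grade].
(b) An example skeleton for the entity and relationship classes. Student: john, mary. Course: cs107, stat10. Takes (Student, Course): (john, cs107), (mary, cs107), (mary, stat10).
(c) Directed acyclic graphical (DAG) model defined by the application of the DAPER model to the ER skeleton, over the attributes john.IQ, mary.IQ, cs107.Diff, stat10.Diff, T(john,cs107).G, T(mary,cs107).G, T(mary,stat10).G.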
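The expansion in panel (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ground_graph` and the string encoding of attributes are hypothetical. Each arc class is applied to every pair of candidate attributes, and the constraint keeps only the pairs that refer to the same student or the same course.

```python
from itertools import product

# Skeleton from the figure.
students = ["john", "mary"]
courses = ["cs107", "stat10"]
takes = [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")]


def ground_graph(students, courses, takes):
    """Expand the two arc classes into arcs of the ground graph."""
    edges = set()
    # Arc class Student.IQ -> Takes.Grade, constraint student[IQ] = student[Grade].
    for s_iq, (s, c) in product(students, takes):
        if s_iq == s:
            edges.add((f"{s_iq}.IQ", f"Takes({s},{c}).Grade"))
    # Arc class Course.Diff -> Takes.Grade, constraint course[Diff] = course[Grade].
    for c_diff, (s, c) in product(courses, takes):
        if c_diff == c:
            edges.add((f"{c_diff}.Diff", f"Takes({s},{c}).Grade"))
    return edges


edges = ground_graph(students, courses, takes)
for parent, child in sorted(edges):
    print(parent, "->", child)
```

Without the constraints, every IQ attribute would become a parent of every grade attribute; the constraints are what keep each grade tied to its own student and course.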

11 The Basic Ideas --- Plate Model
Plate models were developed as a language for compactly representing graphical models in which there are repeated measurements.
There is no formal definition of a plate model, so one is provided here. This definition enhances the expressivity of such models while retaining their essence.
Plate and DAPER models are equivalent.

12 Figure: plate model depicting the structure of a university database. The plates Course (attribute class Diff) and Student (attribute class IQ) intersect at Takes (attribute class Grade); the arc constraints are Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].
The invertible mapping from a DAPER model to a plate model:
An entity class is drawn as a large rectangle, called a plate. The plate is labeled with the entity-class name.
Plates are allowed to intersect or overlap. A relationship class is drawn at the named intersection of the plates.
Attribute classes of an entity class are drawn as ovals inside the rectangle corresponding to the entity, but outside any intersection.
Attribute classes associated with a relationship class are drawn in the intersection corresponding to the relationship class.
Arc classes and constraints are drawn just as they are in DAPER models.
In addition, local distribution classes are specified just as they are in DAPER models (not shown in the graph).

13 The Basic Ideas --- PRMs
Probabilistic relational models (PRMs) were developed explicitly for the purpose of representing relational data.
A PRM extends the relational model, another commonly used representation for the structure of a database.
Directed PRMs are equivalent to DAPER models and plate models.

14 Figure: PRM depicting the structure of a university database. Tables Course (attribute Diff), Student (attribute IQ), and Takes (attributes Course, Student, Grade); the arc constraints are Course[Diff] = Course[Grade] and Student[IQ] = Student[Grade].
The invertible mapping from a DAPER model to a directed PRM:
The ER-model component of the DAPER model is mapped to a relational model in a standard way. Both entity and relationship classes are represented as tables. Attribute classes for entity and relationship classes are represented as attributes or columns in the corresponding tables of the relational model.
The probabilistic components of the DAPER model are mapped to those of the directed PRM. Arc classes and constraints are drawn just as they are in the DAPER model.
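The standard mapping of the ER component to tables can be sketched with Python's built-in `sqlite3` module. The column names, key choices, and the assignment of values to rows are assumptions made for illustration; the slides give only the class and attribute names.

```python
import sqlite3

# In-memory database: one table per entity class and per relationship class.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Student (name TEXT PRIMARY KEY, IQ INTEGER);
    CREATE TABLE Course  (name TEXT PRIMARY KEY, Diff TEXT);
    CREATE TABLE Takes (
        student TEXT REFERENCES Student(name),
        course  TEXT REFERENCES Course(name),
        Grade   REAL,
        PRIMARY KEY (student, course)
    );
    """
)
conn.executemany("INSERT INTO Student VALUES (?, ?)", [("john", 120), ("mary", 125)])
conn.executemany("INSERT INTO Course VALUES (?, ?)", [("cs107", "A"), ("stat10", "B")])
conn.executemany(
    "INSERT INTO Takes VALUES (?, ?, ?)",
    [("john", "cs107", 3.0), ("mary", "cs107", 2.0), ("mary", "stat10", 1.0)],
)

# The join lines up each Grade with the IQ and Diff attributes it depends on.
rows = conn.execute(
    """
    SELECT t.student, t.course, s.IQ, c.Diff
    FROM Takes t
    JOIN Student s ON s.name = t.student
    JOIN Course  c ON c.name = t.course
    ORDER BY t.student, t.course
    """
).fetchall()
print(rows)
```

The foreign keys in Takes play the role of the relationship's references to its entities, which is what the arc constraints follow when parents of a Grade attribute are looked up.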

15 Probabilistic Entity-Relationship Models
Fundamentals
ground graph --- the structure of the DAG model created by the expansion of a DAPER model given a skeleton
drawing of arcs --- an important part of this expansion mechanism, through which important conditional independence relations can be expressed

16 Probabilistic Entity-Relationship Models Example 3 A database contains diseases and symptoms for a given patient. Every disease is a potential cause of every symptom. Example 4 Extending Example 3, suppose a physician has identified the possible causes of each symptom.

17 Figure: DAPER models for Examples 3 and 4, with entity classes Disease and Symptom, each with attribute class Present.
(a) A DAPER model for a complete bipartite graph between symptoms and diseases.
(b) The ground graph (a DAG model structure) generated by the application of this DAPER model to any given skeleton is a full bipartite graph.
(c) A DAPER model for an incomplete bipartite graph between symptoms and diseases, with relationship class Causes and arc constraint Causes(d, s).
(d) A possible skeleton. Causes (Disease, Symptom): (d1, s1), (d1, s2), (d1, s3), (d2, s2), (d3, s3).
(e) A DAG model resulting from the expansion of the DAPER model to the skeleton, over the attributes d1.Present, d2.Present, d3.Present, s1.Present, s2.Present, s3.Present.
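The difference between panels (b) and (e) comes down to the arc constraint, which the following sketch makes explicit. The function name `ground_arcs` and the representation of a constraint as a Python predicate are assumptions for illustration; the skeleton is the one shown in panel (d).

```python
from itertools import product

diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2", "s3"]
# Relationship set Causes from panel (d).
causes = {("d1", "s1"), ("d1", "s2"), ("d1", "s3"), ("d2", "s2"), ("d3", "s3")}


def ground_arcs(diseases, symptoms, constraint):
    """Draw an arc d.Present -> s.Present for each pair satisfying the constraint."""
    return {
        (f"{d}.Present", f"{s}.Present")
        for d, s in product(diseases, symptoms)
        if constraint(d, s)
    }


# Example 3: every disease is a potential cause of every symptom.
full = ground_arcs(diseases, symptoms, lambda d, s: True)
# Example 4: arcs only where the physician has recorded Causes(d, s).
restricted = ground_arcs(diseases, symptoms, lambda d, s: (d, s) in causes)

print(len(full), len(restricted))
```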

18 Probabilistic Entity-Relationship Models Example 5 Extending Example 3 in a different way, suppose the physician has identified both primary (major) and secondary (minor) causes of disease. Example 6 Extending Example 3 in a different way, suppose that both diseases and symptoms have category labels — labels drawn from the same set of categories. The possible causes of a symptom are diseases that have at least one category in common with that symptom.

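19 Figure: DAPER models with different arc constraints between Disease.Present and Symptom.Present.
(a) A DAPER model (in Example 4) with relationship class Causes and constraint Causes(d, s).
(b) A DAPER model with a disjunctive constraint: relationship classes 1°Causes and 2°Causes, with constraint 1°Causes(d, s) ∨ 2°Causes(d, s).
(c) A constraint containing the existence quantifier: relationship classes R1 and R2 connect Disease and Symptom to the entity class Category.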
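The existence-quantifier constraint of Example 6 can be sketched as a set-intersection test: an arc is drawn when some category is shared by the disease and the symptom. The category names and assignments below are made up for illustration, as is the function name `category_arcs`.

```python
# Hypothetical category labels (not from the slides).
disease_categories = {"d1": {"viral"}, "d2": {"bacterial", "chronic"}}
symptom_categories = {"s1": {"viral", "chronic"}, "s2": {"bacterial"}}


def category_arcs(disease_categories, symptom_categories):
    """Arc d.Present -> s.Present iff there exists a category shared by d and s."""
    return {
        (f"{d}.Present", f"{s}.Present")
        for d, d_cats in disease_categories.items()
        for s, s_cats in symptom_categories.items()
        if d_cats & s_cats  # non-empty intersection means a common category exists
    }


arcs = category_arcs(disease_categories, symptom_categories)
for parent, child in sorted(arcs):
    print(parent, "->", child)
```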

20 Probabilistic Entity-Relationship Models
Restricted Relationships
A relationship class R in an ER (or PER) model is restricted when some skeletons for the entity and relationship classes of the ER model are prohibited.
Graphical notation has been developed for common restrictions.
Restricted relationships are an extremely useful tool for modeling with PER models.

21 Probabilistic Entity-Relationship Models Example 7 A binary outcome O is measured on patients in multiple hospitals. Each patient is treated in exactly one hospital. It is believed that outcomes in any given hospital h are i.i.d. given binomial parameter h.θ; and that these binomial parameters are themselves i.i.d. across hospitals given hyper parameters α.

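22 Figure: DAPER models for the hospital example (Example 7), with entity classes Hospital and Patient and relationship class In, labeled In(h, p).
(a) A DAPER model.
(b) The ground graph for a skeleton containing m hospitals and n_i patients in hospital i applied to the DAPER model, with hospital nodes h1, ..., hm and patient nodes p11, ..., p1n1, ..., pm1, ..., pmnm.
(c) A DAPER model equivalent to the one in (a).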
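The independence assumptions stated in Example 7 imply the following factorization of the ground graph's joint distribution. The indexing is an assumption introduced here for readability: h_i is hospital i, p_{ij} is patient j in hospital i, and Pr denotes probability to avoid a clash with the patient symbol p.

```latex
\Pr(\theta, O \mid \alpha)
  = \prod_{i=1}^{m} \Bigl[ \Pr(h_i.\theta \mid \alpha)
    \prod_{j=1}^{n_i} \Pr(p_{ij}.O \mid h_i.\theta) \Bigr]
```

The outer product reflects that the parameters h_i.θ are i.i.d. across hospitals given α, and the inner product reflects that outcomes within hospital i are i.i.d. given h_i.θ.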

23 Probabilistic Entity-Relationship Models Self Relationships Self relationships are relationships that relate like entities (and perhaps other entities as well). A self-relationship class is one that contains self relationships.

24 Probabilistic Entity-Relationship Models Example 9 In the university-database example (Example 2), a student’s grade in a course depends on whether an advisor of the student is a friend of a teacher of the course.

25 Figure: models for Example 9, with entity classes Course (attribute class Diff), Student (attribute class IQ), and Professor; relationship classes Takes (attribute class Grade), Teaches, Advises, and F (attribute class Friend, marked Full).
(a) ER model.
(b) DAPER model, with constraint labels c[D]=c[G], s[IQ]=s[G], Teaches(p, c), Advises(pf, s), and F(p, pf).
(c) DAPER model in which the Professor entity class has been copied: Professor (Advisor) and Professor (Teacher).
an ordinary attribute θ corresponding to this uncertain distribution.
There are two instances of the Professor entity class, named “Professor (Teacher)” and “Professor (Advisor).” Note that copying allows us to annotate the role that each copy of the entity class plays in the self-relationship class. Models drawn with this copy convention are sometimes more transparent.

26 F has one attribute class, F.Friend, where the attribute F(p, pf).Friend is true if professor pf is a friend of professor p. Note that F has the Full constraint so that we can model whether any one professor is a friend of another. Also note that F(p1, p2).Friend may be true while F(p2, p1).Friend is false.

27 The constraint on the arc class from F.Friend to Takes.Grade is Teaches(p, c) ∧ Advises(pf, s). Thus, in any ground graph generated from this model, there is an arc from attribute F(p, pf).Friend to attribute Takes(s, c).Grade whenever a teacher of the course is p and an advisor of the student is pf, which is precisely the additional dependence described in the example.
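The effect of the constraint Teaches(p, c) ∧ Advises(pf, s) on the ground graph can be sketched as follows. The professor names and the skeleton are made up for illustration, the function name `friend_arcs` is hypothetical, and treating Full as "every ordered pair of professors, including a professor paired with themselves" is an assumption.

```python
from itertools import product

# Hypothetical skeleton (not from the slides).
professors = ["ann", "bob"]
teaches = {("ann", "cs107")}   # Teaches(p, c)
advises = {("bob", "mary")}    # Advises(pf, s)
takes = {("mary", "cs107"), ("john", "cs107")}


def friend_arcs(professors, teaches, advises, takes):
    """Arcs F(p, pf).Friend -> Takes(s, c).Grade where Teaches(p, c) and Advises(pf, s)."""
    arcs = set()
    # F is Full, so every ordered pair of professors is a candidate.
    for p, pf in product(professors, repeat=2):
        for s, c in takes:
            if (p, c) in teaches and (pf, s) in advises:
                arcs.add((f"F({p},{pf}).Friend", f"Takes({s},{c}).Grade"))
    return arcs


arcs = friend_arcs(professors, teaches, advises, takes)
print(arcs)
```

In this skeleton john has no advisor, so his grade gains no Friend parent, while mary's grade in cs107 depends on whether her advisor is a friend of the course's teacher.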

28 Probabilistic Entity-Relationship Models
Probabilistic Relationships
Example 12 (Relationship existence) A database contains academic papers and citations for a subset of those papers. Using the citations we have, we model how the topics of two papers influence whether one paper cites the other.
Example 13 Modifying Example 12, we now know that the database was constructed such that it contains at most ten citations from the bibliography of any paper.

29 Figure: models for Examples 12 and 13, with entity class Paper (attribute class Topic) drawn as two copies, Paper (Citing) and Paper (Cited), and relationship class Cites (attribute class Exists).
(a) An ER model.
(b) A DAPER model for the situation where citations are uncertain, with Cites marked Full and constraint labels p[T]=pcg[E], p[T]=pcd[E], and Cites(pcg, pcd).
(c) A DAPER model for the situation where citations are limited to ten per paper, with the additional attribute class <=10 and the constraint pcg[E]=p[<=10].
We are uncertain about the citations of papers whose citations have not been recorded. To model this uncertainty, we use a DAPER model in which Cites is a Full relationship class with attribute class Cites.Exists, where Cites(pcg, pcd).Exists is true when paper pcg cites paper pcd. In addition, to model how the topics of two papers influence this existence, we add the attribute class Paper.Topic and the arc classes.

30 With respect to Figure (b), we have added a binary attribute class Paper.<=10. The double oval associated with this attribute class indicates that it expands to deterministic attributes in a ground graph. In particular, a ground-graph attribute p.<=10 will have parents Cites(p, pcd).Exists, for all pcd, and will be true exactly when ten or fewer of these parents are true. To encode the restriction, we set p.<=10 to true for every p when performing inference in the ground graph.
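The deterministic attribute p.<=10 is simply a function of its parents, which can be sketched as below. The function name `at_most_ten`, the paper identifiers, and the dictionary encoding of Cites(pcg, pcd).Exists are assumptions for illustration.

```python
def at_most_ten(citing, papers, cites_exists):
    """Value of the deterministic attribute citing.<=10.

    cites_exists maps (citing, cited) pairs to the truth value of
    Cites(citing, cited).Exists; missing pairs are treated as False.
    """
    count = sum(1 for cited in papers if cites_exists.get((citing, cited), False))
    return count <= 10


# Hypothetical data: twelve papers; p0 cites eleven of them, p1 cites ten.
papers = [f"p{i}" for i in range(12)]
cites_exists = {("p0", f"p{i}"): True for i in range(1, 12)}
cites_exists.update({("p1", f"p{i}"): True for i in range(2, 12)})

print(at_most_ten("p0", papers, cites_exists))
print(at_most_ten("p1", papers, cites_exists))
```

Fixing this attribute to true during inference rules out any configuration of the Exists attributes in which a paper has more than ten citations.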

31 Summary
The ER model by example.
Definitions for the DAPER model, the plate model, and the PRM.
DAPER models examined in detail: restricted relationships, self relationships, probabilistic relationships.
