CMSC 828G: Introduction to Statistical Relational Learning (SRL) & Link Analysis (LA) January 28, 2005.

2 Today’s Outline Brief Introduction to SRL Student Introductions Course Mechanics Slightly Longer Introduction to SRL SRL focus problem Exercise: Create your own SRL focus problem Discussion of SRL focus problems Survey Resources

3 Statistical Relational Learning Traditional machine learning and data mining approaches assume: –A random sample of homogeneous objects from a single relation Real world data sets: –Multi-relational, heterogeneous and semi-structured SRL: –A newly emerging research area at the intersection of research in graphical models, social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

4 SRL Approaches Combine logical/combinatorial structures with statistical/probabilistic models Families of Approaches –Entity-relation Models + Graphical Models (BNs/Markov Models) –First-Order Logic + Graphical Models –Functional Programming + Stochastic Execution

5 Sample Domains web data (web) bibliographic data (cite) epidemiological data (epi) communication data (comm) customer networks (cust) collaborative filtering problems (cf) trust networks (trust) biological data (bio)

6 Recent SRL Activities Dagstuhl Workshop on Probabilistic, Logical and Relational Learning - Towards a Synthesis (1/30/05-2/04/05) ICML 2004 workshop on Statistical Relational Learning and its Connections to Other Fields IJCAI 2003 workshop on Statistical Relational Learning AAAI 2000 workshop on Statistical Relational Learning Several related workshops: –KDD MRDM workshops Benjamin Taskar and I are working on an edited SRL collection, and ideally we will have access to draft chapters from this collection.

7 Other SRL Related Courses Tom Dietterich’s course at OSU David Page, Mark Craven and Jude Shavlik at UWisc Pedro Domingos’s course at UWash Eric Mjolsness’s course at UCI on Probabilistic Knowledge Representation v2.doc Stuart Russell’s course at Berkeley on Knowledge Representation and Reasoning Joydeep Ghosh’s course at UT Austin on Advanced Topics in Data Mining Michael Littman’s course at Rutgers on Learned Representations in AI David Jensen and Andrew McCallum’s course at UMass on Computational Social Network Analysis

8 Goals of this Course ***NEW*** area Understand Foundations –Tutorials on Graphical Models, Logic, ILP, etc. Understand existing work –Wade through and make sense of Alphabet Soup of approaches (PRMs, BLPs, SLPs, MLPs, RMNs, LBNs, etc.) Understand interesting theoretical issues –Collective classification, Open World assumptions, etc. Study interesting and practical applications of SRL Do a significant (publishable) project in this area.

9 Course Mechanics Course meets 10:00-12:45. –We will have a 15-minute break, typically 11:15-11:30 –Class will consist of: Tutorials Exercises Readings and Discussion Course URL – 828g/ Course Wiki –… stay tuned….

10 Course Expectations SRL Focus problem (15%) –Each student will develop an SRL focus problem (10%) due Feb. 11 Describe a domain Describe useful inference and learning tasks (Ideally) Collect data –Each student will ‘solve’ an SRL focus problem using at least two different SRL techniques (5%) Lead at least one class discussion (5%) –Each student will sign up to lead the discussion of one (or more, depending on class size) class discussion topic. Class Participation (15%) –Each week each student must turn in a short discussion of the readings by noon Thursday before class. The discussion leader should review the others’ responses, and use them to structure the class discussion. Class Project (50%) –Each student is expected to do a research project for the course. Feb. 18, Project Proposals Due Mar. 18, Project Progress Report #1 due Apr. 22, Project Progress Report #2 due May 6, Project Presentations May 12, Project Write-up Due Class Exercises (10%) –Throughout the course, there will be small class exercises Reviewer (5%) –Each student is expected to do 2 one-page reviews of submitted SRL Book Chapters (Student reviewers will be acknowledged in text)

11 Introductions Name Where you are originally from Research Interest/Advisor if you have one

12 SRL Intro Part II An Example: Probabilistic Relational Models

13 Bayesian Networks: Problem Bayesian nets use propositional representation Real world has objects, related to each other [Figure: template variables Intelligence, Difficulty, Grade; ground variables Intell_Jane, Diffic_CS101, Grade_Jane_CS101; Intell_George, Diffic_Geo101, Grade_George_Geo101; Intell_George, Diffic_CS101, Grade_George_CS101; grade values A, C] These “instances” are not independent

14 Probabilistic Relational Models Combine advantages of relational logic & BNs: –Natural domain modeling: objects, properties, relations –Generalization over a variety of situations –Compact, natural probability models Integrate uncertainty with relational model: –Properties of domain entities can depend on properties of related entities –Uncertainty over relational structure of domain

15 St. Nordaf University [Figure: objects Prof. Smith, Prof. Jones, George, Jane, Welcome to CS101, Welcome to Geo101; links Teaches, In-course, Registered; attributes Teaching-ability, Difficulty, Grade, Satisfac, Intelligence]

16 Relational Schema Specifies types of objects in domain, attributes of each type of object & types of relations between objects [Figure: classes Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Registration (Grade, Satisfaction); relations Teach, In, Take]
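One way to make the schema concrete is to write each class as a record type whose fields are either attributes or references to related objects. The sketch below is only an illustration: the Python class and field names are hypothetical, and which professor teaches which course is an assumption, since the slides do not say. The three registrations follow the Jane/George example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encoding of the schema: attributes are optional because
# their values may be unobserved; relations are references to objects.

@dataclass(frozen=True)
class Professor:
    name: str
    teaching_ability: Optional[str] = None   # attribute

@dataclass(frozen=True)
class Course:
    name: str
    instructor: Professor                    # "Teach" relation
    difficulty: Optional[str] = None         # attribute

@dataclass(frozen=True)
class Student:
    name: str
    intelligence: Optional[str] = None       # attribute

@dataclass(frozen=True)
class Registration:
    student: Student                         # "Take" relation
    course: Course                           # "In" relation
    grade: Optional[str] = None              # attribute
    satisfaction: Optional[str] = None       # attribute

smith = Professor("Prof. Smith")
jones = Professor("Prof. Jones")
cs101 = Course("CS101", instructor=smith)    # assumed teaching assignment
geo101 = Course("Geo101", instructor=jones)  # assumed teaching assignment
jane, george = Student("Jane"), Student("George")

registrations = [
    Registration(jane, cs101),
    Registration(george, geo101),
    Registration(george, cs101),
]

for reg in registrations:
    print(reg.student.name, "takes", reg.course.name,
          "taught by", reg.course.instructor.name)
```

Keeping Registration as its own class, rather than a plain student-to-course link, gives Grade and Satisfaction a place to live.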

17 Representing the Distribution Very large probability space for a given context –All possible assignments of all attributes of all objects Infinitely many potential contexts –Each associated with a very different set of worlds Need to represent infinite set of complex distributions
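A rough count makes the first point concrete. The notation below is an assumption introduced for illustration: sigma is a context (a set of objects and links), A(x) is the set of attributes of object x, and Val(x.A) is the set of values that attribute can take.

```latex
% Number of joint assignments (possible worlds) for one context \sigma
|\Omega_\sigma| \;=\; \prod_{x \in \sigma} \; \prod_{A \in \mathcal{A}(x)} \bigl|\mathrm{Val}(x.A)\bigr|

% Example: 2 students, 2 courses, 2 professors, 3 registrations give
% 2 + 2 + 2 + 3\cdot 2 = 12 attribute instances; if each is binary,
|\Omega_\sigma| = 2^{12} = 4096 .
```

The binary value sets are an illustrative assumption. Adding one more registration doubles the count twice over, which is why an explicit table over worlds is not an option.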

18 Probabilistic Relational Models Universals: Probabilistic patterns hold for all objects in class Locality: Represent direct probabilistic dependencies –Links define potential interactions [Figure: classes Student (Intelligence), Reg (Grade, Satisfaction), Course (Difficulty), Professor (Teaching-Ability); grade values A, B, C] [Koller & Pfeffer; Poole; Ngo & Haddawy]

19 PRM Semantics Instantiated PRM → BN –variables: attributes of all objects –dependencies: determined by links & PRM [Figure: Prof. Smith, Prof. Jones, Welcome to CS101, Welcome to Geo101, George, Jane; attributes Teaching-ability, Difficulty, Grade, Satisfac, Intelligence]
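The grounding step can be sketched in a few lines. The dependency template used here is an assumption made for illustration (Grade depends on the student's Intelligence and the course's Difficulty; Satisfaction depends on Grade and the professor's Teaching-ability), as is the teaching assignment; the function name `ground` is hypothetical. The registrations are the three from the Jane/George example.

```python
# Skeleton: which objects exist and how they are linked.
registrations = [("Jane", "CS101"), ("George", "Geo101"), ("George", "CS101")]
teaches = {"CS101": "Smith", "Geo101": "Jones"}   # assumed assignment

def ground(registrations, teaches):
    """Instantiate the template; return {ground variable: list of parents}."""
    parents = {}
    for student, course in registrations:
        prof = teaches[course]
        intel = f"Intelligence({student})"
        diff = f"Difficulty({course})"
        abil = f"TeachingAbility({prof})"
        grade = f"Grade({student},{course})"
        satis = f"Satisfaction({student},{course})"
        # Root variables: one per object, shared by every registration
        # that touches that object.
        for root in (intel, diff, abil):
            parents.setdefault(root, [])
        # The same class-level dependency is copied for every registration.
        parents[grade] = [intel, diff]
        parents[satis] = [grade, abil]
    return parents

bn = ground(registrations, teaches)
for var in sorted(bn):
    print(var, "<-", bn[var])
```

Because Intelligence(George) and Difficulty(CS101) are each shared by two Grade variables, the ground network couples the registrations instead of treating them as independent rows.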

20 The Web of Influence [Figure: Welcome to CS101 and Welcome to Geo101; grades A and C; values low / high and easy / hard]

21 Reasoning with a PRM Generic approach: –Instantiate PRM to produce ground BN –Use standard BN inference In most cases, resulting BN is too densely connected to allow exact inference Use approximate inference: belief propagation Improvement: Use domain structure — objects & relations — to guide computation –Kikuchi approximation where clusters = objects
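For a network as small as the Jane/George example, the generic approach can be run exactly by brute-force enumeration, which shows how evidence about one student changes beliefs about another. This is only a sketch: all probabilities below are made-up illustrative numbers, only Intelligence, Difficulty and Grade are modeled, and enumeration stands in for the approximate inference (belief propagation) that the slide says is needed on realistic ground networks.

```python
from itertools import product

# Illustrative (made-up) parameters shared by all objects of a class.
P_INTEL = {"high": 0.5, "low": 0.5}
P_DIFF = {"hard": 0.5, "easy": 0.5}
# P(Grade = "A" | intelligence, difficulty); otherwise the grade is "C".
P_A = {("high", "easy"): 0.9, ("high", "hard"): 0.6,
       ("low", "easy"): 0.5, ("low", "hard"): 0.1}

STUDENTS = ["Jane", "George"]
COURSES = ["CS101", "Geo101"]
REGS = [("Jane", "CS101"), ("George", "Geo101"), ("George", "CS101")]

def joint(intel, diff, grades):
    """Probability of one complete assignment in the ground BN."""
    p = 1.0
    for s in STUDENTS:
        p *= P_INTEL[intel[s]]
    for c in COURSES:
        p *= P_DIFF[diff[c]]
    for s, c in REGS:
        p_a = P_A[(intel[s], diff[c])]
        p *= p_a if grades[(s, c)] == "A" else 1.0 - p_a
    return p

def posterior_high(student, evidence):
    """P(Intelligence(student) = high | observed grades), by enumeration."""
    hidden = [r for r in REGS if r not in evidence]
    num = den = 0.0
    for iv in product(P_INTEL, repeat=len(STUDENTS)):
        intel = dict(zip(STUDENTS, iv))
        for dv in product(P_DIFF, repeat=len(COURSES)):
            diff = dict(zip(COURSES, dv))
            for gv in product("AC", repeat=len(hidden)):
                grades = dict(evidence)
                grades.update(zip(hidden, gv))
                p = joint(intel, diff, grades)
                den += p
                if intel[student] == "high":
                    num += p
    return num / den

p1 = posterior_high("George", {("George", "CS101"): "C"})
p2 = posterior_high("George", {("George", "CS101"): "C",
                               ("Jane", "CS101"): "C"})
print(round(p1, 3), round(p2, 3))
```

With these numbers, George's C lowers the belief that he is highly intelligent, and learning that Jane also got a C in the same course raises it again, because the shared Difficulty(CS101) variable now explains part of his grade.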

22 Data → Model → Objects [Figure: Learner, Database (Course, Student, Reg), Expert knowledge, Probabilistic Model, Data for New Situation, Prob. Inference] –What are the objects in the new situation? –How are they related to each other? [Friedman, Getoor, Koller & Pfeffer;

23 PRM Summary PRMs inherit key advantages of probabilistic graphical models: –Coherent probabilistic semantics –Exploit structure of local interactions Relational models inherently more expressive “Web of influence”: use multiple sources of information to reach conclusions Exploit both relational information and power of probabilistic reasoning

24 SRL & Link Mining General Issues

25 Linked Data Heterogeneous, multi-relational data represented as a graph or network –Nodes are objects May have different kinds of objects Objects have attributes Objects may have labels or classes –Edges are links May have different kinds of links Links may have attributes Links may be directed, are not required to be binary
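A minimal container for this kind of data might look as follows. The class name `LinkedData`, its methods, and the sample objects are all hypothetical; the point is only that nodes and links both carry a type and attributes, and that a link stores an ordered tuple of participants so it can be directed and non-binary.

```python
class LinkedData:
    """Typed nodes with attributes; typed links that may be n-ary."""

    def __init__(self):
        self.nodes = {}   # node id -> {"type": ..., other attributes}
        self.links = []   # (link type, participant ids, attributes)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_link(self, link_type, participants, **attrs):
        # Participant order encodes direction for directed links.
        self.links.append((link_type, tuple(participants), attrs))

    def links_of(self, node_id):
        """All links in which the given node participates."""
        return [link for link in self.links if node_id in link[1]]


g = LinkedData()
g.add_node("p1", "paper", topic="learning")   # label stored as an attribute
g.add_node("p2", "paper", topic=None)          # unlabeled object
g.add_node("a1", "author")
g.add_node("v1", "venue")
g.add_link("cites", ["p1", "p2"])                        # directed, binary
g.add_link("published", ["a1", "p1", "v1"], year=2004)   # ternary, with attribute

for link_type, participants, attrs in g.links_of("p1"):
    print(link_type, participants, attrs)
```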

26 Link Mining Tasks Link-based Object Classification Object Type Prediction Link Type Prediction Predicting Link Existence Link Cardinality Estimation Object Consolidation Group Detection Subgraph Discovery Metadata Mining

27 Link-based Object Classification Predicting the category of an object based on its attributes and its links and attributes of linked objects web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc. cite: Predict the topic of a paper, based on word occurrence, citations, co-citations epi: Predict disease type based on characteristics of the patients infected by the disease
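As a deliberately simple baseline for this task, unlabeled objects can repeatedly take the majority label of their linked neighbors. This sketch uses links only (no words, anchor text or other attributes) and is not any specific published algorithm; the function name and the toy citation data are hypothetical.

```python
from collections import Counter

def classify_by_links(labels, links, rounds=5):
    """labels: node -> known label or None; links: list of (u, v) pairs."""
    neighbors = {n: set() for n in labels}
    for u, v in links:
        neighbors[u].add(v)
        neighbors[v].add(u)
    current = dict(labels)
    for _ in range(rounds):
        updated = dict(current)
        for node in labels:
            if labels[node] is not None:
                continue                      # observed labels stay fixed
            votes = Counter(current[m] for m in neighbors[node]
                            if current[m] is not None)
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        if updated == current:
            break                             # labels have stabilized
        current = updated
    return current

papers = {"p1": "ML", "p2": "ML", "p3": None,
          "p4": "DB", "p5": None, "p6": None}
citations = [("p3", "p1"), ("p3", "p2"), ("p5", "p4"), ("p6", "p5")]
print(classify_by_links(papers, citations))
```

Paper p6 has no labeled neighbor at first and only receives a label after p5 has been classified, which is the basic reason these predictions are made jointly rather than one object at a time.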

28 Object Class Prediction Predicting the type of an object based on its attributes and its links and attributes of linked objects comm: Predict whether a communication contact is by email, phone call or mail. cite: Predict the venue type of a publication (conference, journal, workshop)

29 Link Type Classification Predicting type or purpose of link based on properties of the participating objects web: predict advertising link or navigational link; predict an advisor-advisee relationship epi: predicting whether contact is familial, co-worker or acquaintance

30 Predicting Link Existence Predicting whether a link exists between two objects web: predict whether there will be a link between two pages cite: predicting whether a paper will cite another paper epi: predicting who a patient’s contacts are
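One simple way to score candidate links is by neighborhood overlap: two objects that share many neighbors are more likely to be linked. The sketch below uses the Jaccard coefficient on an undirected toy graph; it is a standard structural heuristic chosen for illustration, not a method taken from these slides, and the graph is made up.

```python
def jaccard_scores(adj):
    """Score every unlinked pair by |N(u) & N(v)| / |N(u) | N(v)|."""
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in adj[u]:
                continue                     # link already exists
            union = adj[u] | adj[v]
            if union:
                scores[(u, v)] = len(adj[u] & adj[v]) / len(union)
    return scores

# Undirected toy graph: node -> set of neighbors.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": {"f"}, "f": {"e"}}

ranked = sorted(jaccard_scores(adj).items(), key=lambda kv: -kv[1])
for pair, score in ranked[:3]:
    print(pair, round(score, 2))
```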

31 Link Cardinality Estimation I Predicting the number of links to an object web: predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links cite: predicting the impact of a paper based on the number of citations epi: predicting the number of people that will be infected based on the infectiousness of a disease.
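The quantity being predicted here is a degree count. As a small sketch on made-up citation data, the observed in-link and out-link counts can be computed as below; a learned estimator would be trained to predict these numbers for objects whose links are not yet fully observed.

```python
from collections import Counter

# Directed links as (source, target); the data is illustrative only.
links = [("p1", "p3"), ("p2", "p3"), ("p3", "p4"),
         ("p1", "p4"), ("p2", "p4"), ("p1", "p2")]

in_links = Counter(target for _, target in links)    # authority-style count
out_links = Counter(source for source, _ in links)   # hub-style count

print("most cited:", in_links.most_common(1))
print("most citing:", out_links.most_common(1))
```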

32 Link Cardinality Estimation II Predicting the number of objects reached along a path from an object Important for estimating the number of objects that will be returned by a query web: predicting number of pages retrieved by crawling a site cite: predicting the number of citations of a particular author in a specific journal
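The citation example corresponds to counting the results of a path query: author, to papers written, to papers citing them, restricted by venue. The sketch below computes the exact count on made-up data (names, papers and venues are illustrative); the estimation task is to predict this number without running the query.

```python
wrote = {"alice": {"p1", "p2"}, "bob": {"p2", "p3"}}       # author -> papers
cites = {"p4": {"p1"}, "p5": {"p1", "p2"}, "p6": {"p3"}}   # citing -> cited
venue = {"p4": "JMLR", "p5": "JMLR", "p6": "KDD"}           # paper -> venue

def citations_in_venue(author, target_venue):
    """Count citations to the author's papers from papers in one venue."""
    return sum(1
               for citing, cited_papers in cites.items()
               if venue.get(citing) == target_venue
               for cited in cited_papers
               if cited in wrote[author])

print(citations_in_venue("alice", "JMLR"))
print(citations_in_venue("bob", "KDD"))
```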

33 Entity Resolution Predicting when two objects are the same, based on their attributes and their links aka: record linkage, duplicate elimination, identity uncertainty web: predict when two sites are mirrors of each other. cite: predicting when two citations are referring to the same paper. epi: predicting when two disease strains are the same bio: learning when two names refer to the same protein
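The role of links can be seen in a small sketch that scores a pair of author references by combining attribute evidence (do the names match?) with link evidence (do they share co-authors?). The scoring rule, the equal weights, the 0.7 threshold and the references themselves are all illustrative assumptions, not a method from these slides.

```python
from itertools import combinations

refs = {
    "r1": {"name": "J. Smith", "coauthors": {"A. Lee"}},
    "r2": {"name": "John Smith", "coauthors": {"A. Lee", "B. Chen"}},
    "r3": {"name": "J. Smith", "coauthors": {"C. Diaz"}},
}

def name_match(a, b):
    """1.0 if last names are equal and first initials agree, else 0.0."""
    first_a, last_a = a.split()[0], a.split()[-1]
    first_b, last_b = b.split()[0], b.split()[-1]
    return 1.0 if last_a == last_b and first_a[0] == first_b[0] else 0.0

def link_overlap(a, b):
    """Jaccard overlap of the two co-author sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score(x, y):
    rx, ry = refs[x], refs[y]
    return (0.5 * name_match(rx["name"], ry["name"])
            + 0.5 * link_overlap(rx["coauthors"], ry["coauthors"]))

same_entity = [(x, y) for x, y in combinations(sorted(refs), 2)
               if score(x, y) >= 0.7]
print(same_entity)
```

All three references are compatible by name alone; only the shared co-author separates the pair r1, r2 from r3.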

34 Group Detection Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure web – identifying communities cite – identifying research communities
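The crudest link-only version of group detection treats each connected component of the graph as a group. The sketch below does this with a breadth-first search over made-up co-authorship links; it ignores attribute values entirely, so it is a baseline for the link-structure half of the task rather than a full solution.

```python
from collections import deque

def connected_groups(edges):
    """Return the connected components of an undirected graph as sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups

coauthor = [("ann", "bo"), ("bo", "cy"), ("dee", "eli")]
print(connected_groups(coauthor))
```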

35 Subgraph Identification Find characteristic subgraphs Focus of graph-based data mining (Cook & Holder, Inokuchi, Washio & Motoda, Kuramochi & Karypis, Yan & Han) bio – protein structure discovery comm – legitimate vs. illegitimate groups chem – chemical substructure discovery

36 Metadata Mining Schema mapping, schema discovery, schema reformulation cite – matching between two bibliographic sources web - discovering schema from unstructured or semi-structured data bio – mapping between two medical ontologies

38 SRL General Issues Summary SRL Tasks –Link-based Object Classification –Object Type Prediction –Link Type Prediction –Predicting Link Existence –Link Cardinality Estimation –Object Consolidation –Group Detection –Subgraph Discovery –Metadata Mining SRL Challenges –Logical vs. Statistical dependencies –Feature construction –Instances vs. Classes –Collective Classification –Collective Consolidation –Effective Use of Labeled & Unlabeled Data –Link Prediction –Closed vs. Open World

39 SRL Focus Problem #1 Citation Analysis

40 Domain The first focus problem domain is bibliographic citation analysis. A large number of SRL researchers have worked with this domain. Some advantages of this domain are: –the availability of data (thanks largely to Andrew McCallum, William Cohen, Steve Lawrence and others), –the ease of understanding the domain, –our obvious inherent interest in the domain as academics, and –the potential high payoff and high visibility of SRL approaches if they can solve this problem. Within this domain, some of the objects are: –papers, authors, affiliations, venues and so on. Some of the links or relationships are: –citations, authorship, co-authorship and so on. An interesting aspect of the problem is that one must deal with identity uncertainty: objects can be referenced in many ways, and an important task is entity resolution: figuring out the underlying object domains and the mappings between references and objects.

41 SRL Tasks in FP #1 topic prediction: collective classification of the topics of papers. author attribution: predicting the author of a paper. An issue is whether we assume a closed or open world for the authors. Plagiarism detection. author-topic identification: discovering the topic areas for authors. This can be used, for example, to assign reviewers for papers. entity resolution: collective clustering of the references to objects to determine the set of authors, papers and venues. topic evolution: tracking change in topics over time. group detection: finding collaboration networks. citation counting/ranking: predicting number of citations or ranking based on predicted number of citations. hidden object invention: Analogous to hidden variable introduction, the introduction of a hidden object, such as an advisor, that relates two author instances. predicate invention: from co-author information, affiliation information and perhaps information such as position and room location, invent advisor predicate.

42 Data for FP #1 Many people have constructed data sets by crawling bibliography servers such as CiteSeer, ACM, DBLP and, soon one would imagine, Google Scholar. Several years ago Steve Lawrence made available a large collection of the CiteSeer data; it is available by contacting him. Several versions of the Cora data set are available here: data.html The recent 2003 KDD Cup challenge has data available from high energy physics,

43 Your Turn Come up with an SRL focus problem: –Define the schema, objects, links, etc. –Describe some SRL tasks in this domain –Think about where you could get the data

44 Survey

45 Next Time Graphical Models Review Led by Indrajit Bhattacharya Readings available for pickup and in library. (Due to draft nature, they are not available on the web)

