CIS750 – Seminar in Advanced Topics in Computer Science. Advanced topics in databases – Multimedia Databases. V. Megalooikonomou. Link mining (based on slides by Lise Getoor).

CIS750 – Seminar in Advanced Topics in Computer Science
Advanced topics in databases – Multimedia Databases
V. Megalooikonomou
Link mining (based on slides by Lise Getoor)

Link Mining
- Traditional machine learning and data mining approaches assume a random sample of homogeneous objects from a single relation.
- Real-world data sets are multi-relational, heterogeneous, and semi-structured.
- Link mining is a newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining.

Outline
- Link Mining Tasks
- Statistical Modeling Challenges
- Synthesis of issues raised at the IJCAI Workshop "Learning Statistical Models from Relational Data"

Linked Data
Heterogeneous, multi-relational data represented as a graph or network.
- Nodes are objects
  - May have different kinds of objects
  - Objects have attributes
  - Objects may have labels or classes
- Edges are links
  - May have different kinds of links
  - Links may have attributes
  - Links may be directed and are not required to be binary
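The description of linked data above maps naturally onto a small data structure. The following is a minimal sketch, not taken from the slides: the class names `Node`, `Link`, and `LinkedData`, and the example identifiers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    kind: str                                   # e.g. "paper", "author", "institution"
    attributes: dict = field(default_factory=dict)
    label: Optional[str] = None                 # objects may have a class label


@dataclass
class Link:
    kind: str                                   # e.g. "cites", "author-of"
    members: tuple                              # ordered node ids; may hold more than two
    directed: bool = True
    attributes: dict = field(default_factory=dict)


class LinkedData:
    """Heterogeneous graph: typed nodes plus typed, possibly non-binary links."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_link(self, link):
        # Refuse links that mention objects we have never seen.
        missing = [m for m in link.members if m not in self.nodes]
        if missing:
            raise KeyError(f"unknown node ids: {missing}")
        self.links.append(link)

    def links_of(self, node_id, kind=None):
        # All links that involve node_id, optionally restricted to one link kind.
        return [
            link for link in self.links
            if node_id in link.members and (kind is None or link.kind == kind)
        ]


# Hypothetical example data in the bibliographic domain.
graph = LinkedData()
graph.add_node(Node("P1", "paper", {"words": ["link", "mining"]}))
graph.add_node(Node("P2", "paper", label="theory"))
graph.add_node(Node("A1", "author"))
graph.add_link(Link("cites", ("P1", "P2")))
graph.add_link(Link("author-of", ("A1", "P1")))
```

Storing `members` as a tuple rather than a source/target pair is one way to allow links that are not binary; a dedicated hypergraph library would be another.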

Sample Domains
- web data (web)
- bibliographic data (cite)
- epidemiological data (epi)

Example: Linked Bibliographic Data
- Objects: Papers (P1 to P4), Authors (A1), Institutions (I1)
- Links: Citation, Co-Citation, Author-of, Author-affiliation
- Attributes: Categories

Link Mining Tasks
- Link-based Object Classification
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Object Identification
- Subgraph Discovery

Link-based Object Classification
Predicting the category of an object based on its attributes, its links, and the attributes of linked objects.
- web: predict the category of a web page, based on words that occur on the page, links between pages, anchor text, HTML tags, XML tags, etc.
- cite: predict the topic of a paper, based on word occurrence, citations, and co-citations
- epi: predict disease type based on characteristics of the people; predict a person's age based on the ages of people they have been in contact with and disease type

Link Type
Predicting the type or purpose of a link.
- web: predict whether a link is an advertising link or a navigational link; predict an advisor-advisee relationship
- cite: predict whether a co-author is also an advisor
- epi: predict whether a contact is familial, a co-worker, or an acquaintance

Predicting Link Existence
Predicting whether a link exists between two objects.
- web: predict whether there will be a link between two pages
- cite: predict whether a paper will cite another paper
- epi: predict who a patient's contacts are
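As a rough illustration of what a link-existence score can look like, the sketch below ranks unlinked pairs by how many neighbors they share. This common-neighbors heuristic is a standard baseline swapped in for illustration; it is not the method of the slides, and the function name and example graph are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(adjacency):
    """Score every currently unlinked pair by its number of shared neighbors.

    adjacency: dict mapping each node to the set of its (undirected) neighbors.
    Returns a dict {(u, v): score} with u < v in sorted order.
    """
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # link already exists, nothing to predict
        scores[(u, v)] = len(adjacency[u] & adjacency[v])
    return scores


# Hypothetical citation neighborhood, treated as undirected for simplicity.
adjacency = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3", "P4"},
    "P3": {"P1", "P2"},
    "P4": {"P2"},
}
scores = common_neighbor_scores(adjacency)
```

A higher score suggests a more plausible missing link; a statistical model would replace the raw count with a learned probability.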

Link Cardinality Estimation I
Predicting the number of links to an object.
- web: predict the authoritativeness of a page based on the number of in-links; identify hubs based on the number of out-links
- cite: predict the impact of a paper based on the number of citations
- epi: predict the infectiousness of a disease based on the number of people diagnosed
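The quantities these examples rely on are in-link and out-link counts. The sketch below only counts observed links, which is the starting point for any estimate; the function name and the tiny web graph are assumptions made for illustration.

```python
from collections import Counter


def link_counts(directed_links):
    """Count in-links and out-links per object from (source, target) pairs."""
    in_links = Counter()
    out_links = Counter()
    for source, target in directed_links:
        out_links[source] += 1
        in_links[target] += 1
    return in_links, out_links


# Hypothetical pages: "c" is linked to by many pages, "a" links to many pages.
web_links = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("a", "d")]
in_links, out_links = link_counts(web_links)

most_linked_to = in_links.most_common(1)[0][0]    # candidate authority
most_linking = out_links.most_common(1)[0][0]     # candidate hub
```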

Link Cardinality Estimation II
Predicting the number of objects reached along a path from an object. This is important for estimating the number of objects that will be returned by a query.
- web: predict the number of pages retrieved by crawling a site
- cite: predict the number of citations of a particular author in a specific journal
- epi: predict the number of elderly contacts for a particular patient
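To make "objects reached along a path" concrete, the sketch below follows a sequence of link kinds from a start object and counts the distinct objects at the end. It computes the exact count on a small graph rather than estimating it; the link kinds `author-of` and `cited-by` and all identifiers are hypothetical.

```python
def reach_along_path(links, start, path):
    """Return the set of objects reached from `start` by following `path`.

    links: iterable of (kind, source, target) triples.
    path:  sequence of link kinds to follow, in order.
    """
    frontier = {start}
    for kind in path:
        frontier = {
            target
            for (link_kind, source, target) in links
            if link_kind == kind and source in frontier
        }
    return frontier


# Hypothetical bibliographic links.
links = [
    ("author-of", "A1", "P1"),
    ("author-of", "A1", "P2"),
    ("author-of", "A2", "P3"),
    ("cited-by", "P1", "P3"),
    ("cited-by", "P1", "P4"),
    ("cited-by", "P2", "P4"),
]

# Papers that cite any paper written by author A1.
citing = reach_along_path(links, "A1", ["author-of", "cited-by"])
```

This counts distinct objects, so P4 counts once even though two paths reach it; a query that returns one row per path would count it twice.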

Object Identity
Predicting when two objects are the same, based on their attributes and their links. Also known as record linkage or duplicate elimination.
- web: predict when two sites are mirrors of each other
- cite: predict when two citations refer to the same paper
- epi: predict when two disease strains are the same
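One simple way to combine attribute evidence and link evidence is a weighted sum of two set-overlap scores. The sketch below uses Jaccard similarity over title words and over citing papers; the field names `title` and `cited_by`, the equal weighting, and the example records are all assumptions, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_object_score(x, y, attr_weight=0.5):
    """Blend attribute similarity (title words) with link similarity (citing papers)."""
    attr_sim = jaccard(x["title"].lower().split(), y["title"].lower().split())
    link_sim = jaccard(x["cited_by"], y["cited_by"])
    return attr_weight * attr_sim + (1 - attr_weight) * link_sim


# Hypothetical citation records.
c1 = {"title": "Link Mining A New Data Mining Challenge", "cited_by": {"P7", "P8"}}
c2 = {"title": "link mining a new data mining challenge", "cited_by": {"P7", "P8", "P9"}}
c3 = {"title": "Relational Probability Trees", "cited_by": {"P1"}}

likely_same = same_object_score(c1, c2)
likely_different = same_object_score(c1, c3)
```

In practice a threshold on the score, or a learned classifier over the two similarities, would decide whether the pair is merged.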

Link Mining Challenges
- Logical vs. statistical dependencies
- Feature construction
- Instances vs. classes
- Collective classification
- Effective use of labeled and unlabeled data
- Link prediction
These challenges are common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, and Stochastic Logic Programming, to name a few).

Logical vs. Statistical Dependence
Coherently handling two types of dependence structures:
- Link structure: the logical relationships between objects
- Probabilistic dependence: the statistical relationships between attributes
Challenge: statistical models that support rich logical relationships. Model search is complicated by the fact that attributes can depend on arbitrarily linked attributes, so the issue is how to search this huge space.

Model Search
[Figure: example graph with papers P1, P2, P3, author A1, institution I1, and a paper P whose value is unknown (?)]

Feature Construction
In many cases, objects are linked to a set of objects. To construct a single feature from this set of objects, we may use either:
- Aggregation
- Selection

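Aggregation
[Figure: the mode over linked papers is used as the feature. One example shows papers P1, P2, P3 with author A1 and institution I1; a second shows papers P4, P5, P6 with author A2 and institution I2. In both, the value for paper P is unknown (?).]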

Selection
[Figure: one of the linked papers P1, P2, P3 (author A1, institution I1) is selected to supply the feature for paper P, whose value is unknown (?).]
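The two feature-construction options can be contrasted on the same set of linked objects. In this sketch, aggregation takes the mode of an attribute, as in the Aggregation slide, while selection picks a single linked object by a criterion and reads its attribute. The criterion used here (most recent year), the function names, and the example records are assumptions; the slides do not say how the selected object is chosen.

```python
from collections import Counter


def aggregate_mode(linked, attribute):
    """Aggregation: the most common value of `attribute` among linked objects."""
    counts = Counter(obj[attribute] for obj in linked)
    best = max(counts.values())
    # Break ties alphabetically so the result is deterministic.
    return min(value for value, count in counts.items() if count == best)


def select_one(linked, attribute, key):
    """Selection: take `attribute` from the single linked object maximizing `key`."""
    chosen = max(linked, key=key)
    return chosen[attribute]


# Hypothetical papers linked to the paper being classified.
cited_papers = [
    {"id": "P1", "category": "theory", "year": 1998},
    {"id": "P2", "category": "theory", "year": 2000},
    {"id": "P3", "category": "ai", "year": 2002},
]

mode_feature = aggregate_mode(cited_papers, "category")
selected_feature = select_one(cited_papers, "category", key=lambda p: p["year"])
```

Both functions assume at least one linked object; an empty set would need a default value chosen by the modeler.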

Individuals vs. Classes
Does the model refer explicitly to individuals, or to classes or generic categories of individuals?
- On one hand, we'd like to be able to model that a connection to a particular individual may be highly predictive.
- On the other hand, we'd like our models to generalize to new situations, with different individuals.

Instance-based Dependencies
Papers that cite P3 are likely to be [category shown only in the figure].
[Figure: graph with paper P3, author A1, and institution I1]

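Class-based Dependencies
Papers that cite [category shown only in the figure] are likely to be [category shown only in the figure].
[Figure: graph with paper P3, author A1, and institution I1]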

Collective Classification
Using a link-based statistical model for classification. Two steps:
- Model construction
- Inference using the learned model

Model Selection & Estimation
Learn the model from a fully labeled training set.
[Figure: papers P1 to P10, each labeled with a category from the category set]

Collective Classification Algorithm
Step 1: Bootstrap using object attributes only.
[Figure: papers P1 to P5 with categories from the category set]

Collective Classification Algorithm
Step 2: Iteratively update the category of each object, based on the categories of its linked objects.
[Figure: papers P1 to P5, with the category of P4 being updated]
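The two steps above can be sketched as a short loop. This is a deliberately simplified version under stated assumptions: the attribute-only guesses are supplied as a dictionary rather than produced by a learned model, and the update rule is a plain vote over neighbor categories plus one vote for the object's own attribute-based guess. The slides do not specify the update rule, so the voting scheme, the tie-breaking, and all names here are hypothetical.

```python
from collections import Counter


def collective_classify(attribute_guess, neighbors, max_iters=10):
    """Iteratively relabel objects using the current labels of linked objects.

    attribute_guess: dict node -> category predicted from attributes alone.
    neighbors:       dict node -> list of linked nodes.
    """
    # Step 1: bootstrap every label from object attributes only.
    labels = dict(attribute_guess)

    # Step 2: repeatedly update each label from the linked objects' labels.
    for _ in range(max_iters):
        changed = False
        for node in sorted(labels):
            votes = Counter(labels[n] for n in neighbors.get(node, ()))
            votes[attribute_guess[node]] += 1       # own attributes get one vote
            top = max(votes.values())
            candidates = {label for label, count in votes.items() if count == top}
            # On a tie keep the current label; otherwise pick deterministically.
            if labels[node] in candidates:
                new_label = labels[node]
            else:
                new_label = min(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break                                    # labels are stable
    return labels


# Hypothetical example: P3 looks like "db" from its words but cites three "ai" papers.
attribute_guess = {"P1": "ai", "P2": "ai", "P3": "db", "P4": "ai", "P5": "db"}
neighbors = {
    "P1": ["P3"],
    "P2": ["P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3"],
    "P5": ["P3"],
}
final_labels = collective_classify(attribute_guess, neighbors)
```

Here P3 flips to "ai" because its links outvote its attributes, while P5 keeps "db" because its single link only ties with its own guess.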

Labeled & Unlabeled Data
In link-based domains, unlabeled data provide three sources of information:
- They help us infer the object attribute distribution.
- Links between unlabeled data allow us to make use of attributes of linked objects.
- Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences.

[Figure: graph of papers P1 to P15]

Link Prior Probability
- The prior probability of any particular link is typically extraordinarily low.
- For medium-sized data sets, we have had success with building explicit models of link existence.
- It may be more effective to model links at a higher level, which is required for large data sets.
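A quick calculation shows why the prior is so low: the number of possible directed links grows with the square of the number of objects, while the number of actual links usually grows much more slowly. The figures below (10,000 objects, 100,000 links) are illustrative assumptions, not numbers from the slides.

```python
def link_prior(num_objects, num_links):
    """Fraction of all possible directed links (no self-links) that actually exist."""
    if num_objects < 2:
        raise ValueError("need at least two objects to form a link")
    possible = num_objects * (num_objects - 1)
    return num_links / possible


# Illustrative numbers: 10,000 papers with about 10 citations each.
prior = link_prior(10_000, 100_000)
```

With these numbers only about one candidate pair in a thousand is an actual link, so a model that always predicts "no link" is already about 99.9% accurate, which is why link existence is hard to learn directly.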

Modeling Link Existence Explicitly
[Figure: model with Paper #1, #2, #3 (attributes Topic, Word1 ... WordN), Author #1, #2 (attributes Area, Inst), and an Exists variable for each ordered pair of papers: #1-#2, #1-#3, #2-#1, #2-#3, #3-#1, #3-#2]

Summary
- Link mining is an exciting new research area that poses new statistical modeling challenges.
- The link mining task should inform our choice of:
  - Link-based statistical model visualization

References
- Link Mining: A New Data Mining Challenge, L. Getoor. SIGKDD Explorations, volume 4, issue 2.
- Link-based Classification, Q. Lu and L. Getoor. International Conference on Machine Learning, August.
- Labeled and Unlabeled Data for Link-based Classification, Q. Lu and L. Getoor. ICML workshop on The Continuum from Labeled to Unlabeled Data, August.
- Link-based Classification for Text Classification and Mining, Q. Lu and L. Getoor. IJCAI workshop on Text Mining and Link Analysis.
- IJCAI Workshop: Learning Statistical Models from Relational Data.