LINK PREDICTION IN CO-AUTHORSHIP NETWORK Le Nhat Minh (A0074403N) Supervisor: Dongyuan Lu Aobo Tao Chen

Presentation transcript:

Introduction
- Co-authorship network: a network of collaborations among researchers, scientists, and academic writers.
- Link prediction: suggest future connections within the scope of the network.
Outline: Problem Background, Related Work, Method, Experiment, Feature Analysis, Result Analysis

Problem Background
What connects researchers?
Given an instance of a co-authorship network, a researcher is connected to another if they have collaborated on at least one paper.
[Figure: example graph on authors A, B, and C. Paper X (2001) is co-authored by A, B, and C; Paper Y (2004) is co-authored by B and C.]

[Figure: a snapshot of the co-authorship network]

Graph Description
Co-authorship graph: undirected graph G(V, E)
- Node or vertex (author): author ID, author name
- Link or edge (co-authorship): pair of author IDs; list of publication years, each followed by the paper title (e.g. 2004: "Introduction to …")
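The graph description above can be sketched in a few lines of Python. This is only an illustration: the input format (a list of `(year, title, author_ids)` tuples) and the function name `build_coauthorship_graph` are assumptions, not the data structure used in the project. The sample data mirrors the Paper X / Paper Y example from the Problem Background slide.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthorship_graph(papers):
    """Build an undirected co-authorship graph.

    papers: iterable of (year, title, author_ids) tuples (hypothetical format).
    Returns a dict mapping frozenset({a, b}) to a list of (year, title).
    """
    edges = defaultdict(list)
    for year, title, authors in papers:
        # Every unordered pair of co-authors on a paper shares an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[frozenset((a, b))].append((year, title))
    return dict(edges)


papers = [
    (2001, "Paper X", ["A", "B", "C"]),
    (2004, "Paper Y", ["B", "C"]),
]
graph = build_coauthorship_graph(papers)
```

Keying edges by `frozenset` makes the pair unordered, which matches the undirected graph G(V, E).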

Problem Background
How to predict a link? Based on these criteria:
- Co-authorship network topology
- Researcher's personal information
- Researcher's papers
Goal: boost link-prediction performance. A recommended link should be relevant to the authors' interests, or at least represent a feasible collaboration between the researchers.

Related Work
Three mainstream approaches to link prediction:
- Similarity-based estimation: Liben-Nowell, D., & Kleinberg, J., 2007
- Maximum likelihood estimation: Murata, T., & Moriyasu, S., 2008; Guimerà, R., & Sales-Pardo, M., 2009
- Supervised learning model: Pavlov, M., & Ichise, R., 2007; Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006

Summary
- Similarity-based estimation: does not perform very well.
- Maximum likelihood: depends on the particular network; works best for block-structured and hierarchical networks.
- Supervised learning model: performs better than similarity-based estimation.

Method
- Classifier model
- Feature set

Method
Compare performance with other related papers.
- Baseline 1: Link Prediction using Supervised Learning (2006), by Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M.
- Baseline 2: Finding Experts by Link Prediction in Co-authorship Networks (2007), by Pavlov, M., & Ichise, R.

Baseline 1
Source: Link Prediction using Supervised Learning (2006)
DBLP dataset, split into training and testing periods: 1,564,617 authors, 540,459 papers
Feature set:
- Shortest Distance
- Sum of Paper Count
- Sum of Neighbors Count
- Second Shortest Distance

Baseline 2
Source: Finding Experts by Link Prediction in Co-authorship Networks (2007)
Institute of Electronics, Information and Communication Engineers (IEICE) database, two data sets, each split into training and testing periods:
- 1,380 authors, 1,620 links
- 3,603 authors, 7,512 links
Feature set:
- Shortest Distance
- Common Neighbors
- Jaccard's coefficient
- Adamic/Adar
- Preferential attachment
- Katz
- PageRank (min)
- PageRank (max)
- SimRank

My feature set
- From baseline 1: Shortest Path, Total Number of Papers
- From baseline 2: PageRank, Adamic/Adar, Preferential Attachment
- My additional features: Common 3rd Neighbor (replaces Common Neighbor), Productivity, Affiliations (university / faculty), Keywords (extracted from titles), Address of the author's institute, Conference

Experiment setup
                Training ( )    Testing ( )
No. of nodes    104,265
No. of links    413,691         35,558
Training data
- With 104,265 nodes, there are ~5 x 10^9 possible links (huge!)
- Positive links: 413,691
- Negative links: 413,691 (chosen randomly from the remaining pairs)
Testing data
- Positive links: 35,558
- Negative links: 35,558
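The size of the candidate space and the balanced negative sampling can be illustrated with a short sketch. The function names and the rejection-sampling strategy are assumptions made for illustration; the slides only say that negative links are chosen randomly from the remaining pairs.

```python
import random


def count_candidate_pairs(num_nodes):
    """Number of unordered node pairs in an undirected graph."""
    return num_nodes * (num_nodes - 1) // 2


def sample_negative_links(nodes, positive_links, k, seed=0):
    """Draw k distinct unlinked pairs by rejection sampling (assumed strategy)."""
    nodes = list(nodes)
    positives = {frozenset(p) for p in positive_links}
    available = count_candidate_pairs(len(nodes)) - len(positives)
    if k > available:
        raise ValueError("not enough unlinked pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < k:
        pair = frozenset(rng.sample(nodes, 2))
        # Keep the pair only if it is not an existing (positive) link.
        if pair not in positives:
            negatives.add(pair)
    return negatives


total_pairs = count_candidate_pairs(104_265)  # 5,435,542,980, i.e. ~5 x 10^9
toy_negatives = sample_negative_links(range(6), [(0, 1), (1, 2)], k=5)
```

Rejection sampling is cheap here because positive links are a tiny fraction of all pairs (413,691 out of roughly 5.4 billion), so almost every draw is accepted.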

Extracting features
- Common 3rd Neighbor (replaces Common Neighbor)
- Productivity
- Affiliations (university / faculty)
- Keywords (extracted from titles)
- Address of the author's institute
- Conference

Common 3rd Neighbor
Instead of looking only at common neighbors, we look at a larger neighborhood around each author. The reason is simple: authors within the 1st neighborhood are often already connected to each other, so few new links can be predicted and recall is low.
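The slides do not give an exact definition of this feature. One possible reading, sketched below purely as an assumption, is to count the nodes that lie within three hops of both authors. The adjacency format (`dict` of node to neighbor set) and both function names are hypothetical.

```python
from collections import deque


def neighbors_within(adj, source, max_hops):
    """Nodes reachable from source in 1..max_hops steps (source excluded)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    seen.discard(source)
    return seen


def common_within(adj, x, y, max_hops=3):
    """Assumed feature: size of the shared max_hops-neighborhood of x and y."""
    shared = neighbors_within(adj, x, max_hops) & neighbors_within(adj, y, max_hops)
    return len(shared - {x, y})


# Toy chain A-B-C-D-E-F-G
chain = "ABCDEFG"
adj = {c: set() for c in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].add(b)
    adj[b].add(a)

direct = common_within(adj, "A", "G", max_hops=1)
third = common_within(adj, "A", "G", max_hops=3)
```

On the toy chain, A and G share no direct neighbor, but both reach D within three hops, so the wider neighborhood produces a non-zero signal where the plain common-neighbor count gives none.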

Affiliation and Address of the Institute
These two features capture the geographic location of the authors. The expectation is that they increase the hit rate (recall) of the prediction.

Keywords and Conference
Keywords are extracted from all the paper titles. They are expected to reflect the authors' interests and fields of research. Conference refers to the names of the conferences where those papers were published.
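How the keyword feature is computed is not stated in the slides. A minimal sketch, assuming lowercase word tokens, a small illustrative stopword list, and Jaccard overlap between the two authors' keyword sets, could look like this. The titles in the example are made up.

```python
import re

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "using", "with"}


def title_keywords(titles):
    """Collect lowercase word tokens from paper titles, dropping stopwords."""
    words = set()
    for title in titles:
        for token in re.findall(r"[a-z]+", title.lower()):
            if token not in STOPWORDS:
                words.add(token)
    return words


def keyword_overlap(titles_a, titles_b):
    """Jaccard overlap between two authors' keyword sets (assumed feature)."""
    ka, kb = title_keywords(titles_a), title_keywords(titles_b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


overlap = keyword_overlap(
    ["Link Prediction in Social Networks"],
    ["Link Analysis of Citation Networks"],
)
```

Here the two authors share "link" and "networks" out of six distinct keywords, giving an overlap of 1/3. The same set-overlap idea would apply to conference names.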

Productivity
Observe the "publication history" of an author. For example, at a particular node A:
i    T_i     n    m
0    2000    3    1
1    2004    4    2
2    2005    6    2
3    2006    7    3
n: number of shared papers
m: number of collaborators

Productivity
Productivity of the authors: observe the "history" of an author.
The "productivity" of node A: [formula not preserved in the transcript]
α: a constant that assigns the weight of each time period

Baseline implementation
- Baseline 1: Shortest Distance, Sum of Paper Count, Sum of Neighbors Count
- Baseline 2: Shortest Distance, Common Neighbors, Jaccard's coefficient, Adamic/Adar, Preferential attachment, Katz, PageRank (min), PageRank (max)
Both are implemented using my own data structure and classifiers.

Baseline 1 [figure not preserved in the transcript]

Baseline 2 [figure not preserved in the transcript]

Comparison with the baselines: Decision Tree [figure not preserved in the transcript]

Comparison with the baselines: SVM [figure not preserved in the transcript]

Keywords
Why does decision tree learning perform poorly on features such as Keywords?
[Figure: Keywords feature, training vs. testing]

Shortest Path
[Figure: Shortest Path feature, training vs. testing]

Research Plan
- Implement the baseline feature sets completely.
- Observe the data and analyze the "false positive" results.
- Consider choosing the random data for training and testing more carefully.
Thank you

References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3).
Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM'06: Workshop on Link Analysis, Counter-terrorism and Security.
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7).
Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290.
Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3).
Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52).
Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An evolutionary algorithm approach to link prediction in dynamic social networks. arXiv preprint.
Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion.

Backup slides

Potential applications
Recommend experts or groups of researchers to an individual researcher.

Similarity-Based Estimation
- Use metrics to estimate the proximity of each pair of researchers.
- Rank the pairs of researchers by those proximities.
- The top-ranked pairs are the most likely recommendations.
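The score-then-rank procedure can be sketched as follows. The adjacency format (`dict` of node to neighbor set), the function names, and the choice of the common-neighbor count as the example metric are all assumptions for illustration; any proximity metric could be passed in as `score`.

```python
from itertools import combinations


def common_neighbors(adj, x, y):
    """Example proximity metric: number of shared neighbors."""
    return len(adj[x] & adj[y])


def rank_candidate_pairs(adj, score, top_k=None):
    """Score every currently unlinked pair and return them best-first."""
    scored = []
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:  # only pairs without an existing link
            scored.append((score(adj, x, y), x, y))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_k]


adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
ranking = rank_candidate_pairs(adj, common_neighbors)
```

In this toy graph the pair (A, D) ranks first because A and D share two neighbors, B and C.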

Maximum Likelihood Estimation
- Predefine specific rules for the network.
- Requires prior knowledge of the network.
- The likelihood of any non-connected link is calculated according to those rules.

Stochastic Block Model
The reliability of an individual link between X and Y is: [formula not preserved in the transcript]

Supervised Learning Model
- Construct multi-dimensional feature vectors.
- Feed these vectors to classifiers to optimize a target function (model training).
- Link prediction becomes a binary classification problem.
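To make the binary-classification framing concrete, here is a minimal sketch that trains a one-level decision tree (a decision stump) on toy feature vectors. The stump is only a stand-in and not one of the classifiers used in the project, and the feature values and labels below are invented for illustration.

```python
def train_stump(X, y):
    """Fit a one-level decision tree on numeric features.

    Tries every (feature, threshold, polarity) and keeps the split with the
    fewest training errors. Returns a predict(row) function.
    """
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, 0):
                preds = [polarity if row[f] >= t else 1 - polarity for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    _, f, t, polarity = best
    return lambda row: polarity if row[f] >= t else 1 - polarity


# Each row is one candidate pair: [common neighbors, shortest path length].
# Label 1 means the pair later became a link, 0 means it did not (toy data).
X = [[3, 2], [2, 2], [0, 4], [1, 3], [4, 2], [0, 5]]
y = [1, 1, 0, 0, 1, 0]
predict = train_stump(X, y)
```

Each candidate pair becomes one row, each feature one column, and the label records whether the link appeared in the testing period. Any off-the-shelf classifier can then replace the stump.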

The number of links per year in the training set is greater than in the testing set: in the testing period, only "new" collaborations are considered. Any collaboration between researchers who already have a link is disregarded.
No. of nodes: 937
No. of links: 3,093 (training), 57 (testing)

Results with different classifiers
Columns: Classifier; Precision (positive predictive value) (%); Recall (hit rate) (%); F1 (harmonic mean) (%)
Classifiers compared: Decision Tree, SMO, Bagging, Naive Bayes, Multilayer Perceptron
[Table values not preserved in the transcript]

Proposed Feature
Reasons for proposing this feature:
- Keep track of each researcher's tendencies.
- Give a "bonus" to researchers who tend to collaborate with "new" colleagues rather than "old" ones.
- Give a high score to prolific researchers (based on the number of published papers).

Stochastic Block Model (Guimerà, R., & Sales-Pardo, M., 2009)

Related Work
Link prediction problems in social networks: Liben-Nowell, D., & Kleinberg, J., 2007; Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013
In a social network, interactions among users are very dynamic:
- New links are created within a few days.
- Existing links are deleted or replaced.
The two kinds of network present different features:
- Characteristics of an individual researcher: citations, affiliations, institutions, ...
- Characteristics of a person: marital status, age, workplace, ...

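Experimental Results
Measurement of performance (TP = true positives, FP = false positives, FN = false negatives):
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- Harmonic mean (F1) = 2 x Precision x Recall / (Precision + Recall)
New links to predict: 57
                     Predicted true link    Predicted false link
Actual true link     26                     31
Actual false link    5,588                  429,778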
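The three measures can be computed directly from the confusion matrix on this slide, using the standard definitions of precision, recall, and F1. The function name is an assumption. The counts are the ones reported above, read as 26 true positives, 31 false negatives, and 5,588 false positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the slide: 57 new links, 26 of them predicted correctly.
precision, recall, f1 = precision_recall_f1(tp=26, fp=5588, fn=31)
```

With these counts, precision is about 0.46%, recall about 45.6%, and F1 about 0.92%. The very large number of false positives is what drives precision down.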

Result Analysis
Possible reasons:
- Features
- Small data set (sampling problem)
- Instances of the negative links used for training

Similarity-Based Estimation
Network-structure-based measurement
Some conventions:

Similarity-Based Estimation
Common Neighbor: [figure: nodes X and Y]

Similarity-Based Estimation
Jaccard's coefficient: [figure: nodes X and Y]
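The formulas on these two slides are not preserved in the transcript. Under the standard definitions from the link-prediction literature, the common-neighbor score is the size of the intersection of the two neighbor sets, and Jaccard's coefficient divides that by the size of their union. The sketch below assumes an adjacency `dict` of node to neighbor set; the function names are illustrative.

```python
def common_neighbors(adj, x, y):
    """|N(x) & N(y)|: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    """|N(x) & N(y)| / |N(x) | N(y)|, or 0.0 when both have no neighbors."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


adj = {
    "X": {"A", "B", "C"},
    "Y": {"B", "C", "D"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"X", "Y"},
    "D": {"Y"},
}
cn = common_neighbors(adj, "X", "Y")
jc = jaccard(adj, "X", "Y")
```

In the toy graph X and Y share two of four distinct neighbors, so the common-neighbor score is 2 and Jaccard's coefficient is 0.5.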

Similarity-Based Estimation
Preferential Attachment: [figure: nodes X and Y]

Similarity-Based Estimation
Adamic/Adar: [figure: nodes X, Y, and Z]
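The formulas for preferential attachment and Adamic/Adar are likewise missing from the transcript. Under their standard definitions, preferential attachment multiplies the two degrees, and Adamic/Adar sums 1 / log(degree) over the shared neighbors, so rare shared neighbors count more. The adjacency format and function names below are assumptions.

```python
import math


def preferential_attachment(adj, x, y):
    """|N(x)| * |N(y)|: product of the two degrees."""
    return len(adj[x]) * len(adj[y])


def adamic_adar(adj, x, y):
    """Sum of 1 / log|N(z)| over shared neighbors z of x and y."""
    # A shared neighbor is linked to both x and y, so its degree is at
    # least 2 and the logarithm is strictly positive.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


adj = {
    "X": {"Z", "A"},
    "Y": {"Z"},
    "Z": {"X", "Y", "B"},
    "A": {"X"},
    "B": {"Z"},
}
pa = preferential_attachment(adj, "X", "Y")
aa = adamic_adar(adj, "X", "Y")
```

Here X has degree 2 and Y degree 1, so preferential attachment is 2. The single shared neighbor Z has degree 3, so Adamic/Adar is 1 / ln 3.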

Similarity-Based Estimation
Shortest Path: [figure: nodes X, Y, and Z]
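In an unweighted co-authorship graph, the shortest path length between two authors can be found with breadth-first search. The sketch below assumes an adjacency `dict` of node to neighbor set and returns `None` for unreachable pairs; both choices are illustrative.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Number of edges on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# X and Y are both linked to Z; W is isolated.
adj = {"X": {"Z"}, "Z": {"X", "Y"}, "Y": {"Z"}, "W": set()}
d_xy = shortest_path_length(adj, "X", "Y")
d_xw = shortest_path_length(adj, "X", "W")
```

A distance of 2 means the two authors have at least one co-author in common.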

Similarity-Based Estimation
PageRank: a random walk on the graph that assigns each node the probability of being reached. The proximity between a pair of nodes can be determined by the sum of the two nodes' PageRank values.
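A minimal power-iteration sketch of PageRank on an undirected graph, followed by the pairwise proximity described above (the sum of the two nodes' PageRank values). The damping factor of 0.85, the fixed iteration count, and the function names are assumptions; the slides do not specify them.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power iteration for PageRank on an adjacency dict (node -> neighbor set)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if adj[v]:
                # Split this node's rank evenly among its neighbors.
                share = damping * rank[v] / len(adj[v])
                for u in adj[v]:
                    new[u] += share
            else:
                # Isolated node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank


def pagerank_proximity(rank, x, y):
    """Proximity of a pair as the sum of the two nodes' PageRank values."""
    return rank[x] + rank[y]


# Star graph: C is linked to A, B, and D.
adj = {"C": {"A", "B", "D"}, "A": {"C"}, "B": {"C"}, "D": {"C"}}
rank = pagerank(adj)
score_ca = pagerank_proximity(rank, "C", "A")
score_ab = pagerank_proximity(rank, "A", "B")
```

In the star graph the hub C receives the highest rank, so any pair that includes C scores higher than a pair of leaves.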

Classify 3 kinds of features: