LINK PREDICTION IN CO-AUTHORSHIP NETWORK
Le Nhat Minh (A0074403N)
Supervisor: Dongyuan Lu

Presentation transcript:


Introduction
Link prediction: predict future connections within the scope of the network.
Co-authorship network: a network of collaborations among researchers, scientists, and academic writers.

Introduction
Potential applications: recommend experts or groups of researchers to an individual researcher.

Outline
  Problem Background
  Related Work
  Workflow
  Conclusion
  Result Analysis
  Research Plan

Problem Background
What connects researchers together?
Given an instance of a co-authorship network: a researcher is connected to another if they collaborated on at least one paper.
(Figure: example link between researchers X and Y, labeled with the years 2001 and 2004.)

Problem Background
How to predict a link? Based on these criteria:
  Co-authorship network topology
  Researchers’ personal information
  Researchers’ papers
Boost link prediction performance.
A recommended link should be genuinely relevant to the authors’ interests, or should at least make collaboration between the researchers feasible.

Related Work
Link prediction problems in social networks:
  Liben-Nowell, D., & Kleinberg, J., 2007
  Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013
In a social network, interactions among users are very dynamic:
  New links are created within a few days.
  Existing links are deleted or replaced.
The two networks present different features:
  Characteristics of an individual researcher: citations, affiliations, institutions, ...
  Characteristics of a person: marital status, age, workplace, ...

Three mainstream approaches for link prediction:
  Similarity based estimation: Liben-Nowell, D., & Kleinberg, J., 2007
  Maximum likelihood estimation: Murata, T., & Moriyasu, S., 2008; Guimerà, R., & Sales-Pardo, M., 2009
  Supervised learning model: Pavlov, M., & Ichise, R., 2007; Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006

Similarity Based Estimation
Use metrics to estimate the proximity of each pair of researchers.
Rank the pairs of researchers by those proximities.
The top-ranked pairs are the likely recommendations.

Similarity Based Estimation
Network-structure-based measurement. Some conventions:

Similarity Based Estimation
Common Neighbor:

Similarity Based Estimation
Jaccard’s coefficient:

Similarity Based Estimation
Preferential Attachment:

Similarity Based Estimation
Adamic/Adar:
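The four local measures named on these slides have widely used definitions in the link prediction literature (for example in Liben-Nowell & Kleinberg, 2007). The sketch below assumes those standard definitions rather than the exact formulas shown on the slides. The toy graph and the function names are hypothetical.

```python
import math

# Hypothetical toy co-authorship graph: author -> set of co-authors (undirected).
graph = {
    "X": {"A", "B", "Z"},
    "Y": {"B", "C", "Z"},
    "Z": {"X", "Y"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"Y"},
}

def common_neighbors(g, x, y):
    """Number of co-authors that x and y share."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared co-authors divided by all co-authors of either x or y."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def preferential_attachment(g, x, y):
    """Product of the two degrees."""
    return len(g[x]) * len(g[y])

def adamic_adar(g, x, y):
    """Sum over shared co-authors z of 1 / log(degree of z).

    A shared neighbor has degree >= 2 in a simple undirected graph,
    so the logarithm is positive; the guard only protects odd inputs.
    """
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

scores = {
    "common_neighbors": common_neighbors(graph, "X", "Y"),
    "jaccard": jaccard(graph, "X", "Y"),
    "preferential_attachment": preferential_attachment(graph, "X", "Y"),
    "adamic_adar": adamic_adar(graph, "X", "Y"),
}
print(scores)
```

Adamic/Adar differs from Common Neighbor only in its weighting: a shared co-author with few collaborators counts for more than one with many.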

Similarity Based Estimation
Shortest Path: the minimum number of edges connecting two nodes.
PageRank: a random walk on the graph that assigns each node the probability of being reached. The proximity between a pair of nodes can be determined by the sum of the two nodes’ PageRank scores.
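A minimal sketch of the two global measures, assuming an unweighted undirected graph. Breadth-first search gives the shortest path length, and PageRank is approximated by power iteration. The damping factor 0.85, the iteration count, and the toy graph are assumptions, not values from the slides.

```python
from collections import deque

# Hypothetical toy graph: a simple path A - B - C - D.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

def shortest_path_length(g, src, dst):
    """Minimum number of edges between src and dst, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in g[node]:
            if nb == dst:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def pagerank(g, damping=0.85, iterations=100):
    """Power-iteration PageRank; isolated nodes spread their rank evenly."""
    n = len(g)
    rank = {v: 1.0 / n for v in g}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in g}
        for v, nbs in g.items():
            if nbs:
                share = damping * rank[v] / len(nbs)
                for nb in nbs:
                    new[nb] += share
            else:
                for u in g:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

def pagerank_proximity(rank, x, y):
    """Pair score as described on the slide: the sum of the two PageRank values."""
    return rank[x] + rank[y]

rank = pagerank(graph)
print(shortest_path_length(graph, "A", "D"), pagerank_proximity(rank, "A", "D"))
```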

Maximum Likelihood Estimation
Predefine specific rules for the network.
Requires prior knowledge of the network.
The likelihood of any non-connected link is calculated according to those rules.

Supervised Learning Model
Construct multi-dimensional feature vectors.
Feed these vectors to classifiers to optimize a target function (model training).
Link prediction becomes a binary classification problem.
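To illustrate how link prediction turns into binary classification, the sketch below trains a decision stump, the simplest kind of classifier, on a handful of feature vectors. The feature values, labels, and function names are hypothetical, and the stump stands in for any classifier, not for the model used in this work.

```python
# Each row is a hypothetical feature vector for one pair of researchers:
# [common neighbors, preferential attachment]. Label 1 means "link", 0 "no link".
X = [[3, 12], [2, 9], [0, 4], [0, 6], [1, 2]]
y = [1, 1, 0, 0, 0]

def fit_stump(X, y):
    """Find the (feature, threshold) rule 'x[feature] > threshold -> 1'
    with the fewest training errors. Returns (feature, threshold, errors)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            errors = sum((row[f] > t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[2]:
                best = (f, t, errors)
    return best

def predict(stump, row):
    """Apply the learned threshold rule to one feature vector."""
    f, t, _ = stump
    return int(row[f] > t)

stump = fit_stump(X, y)
print(stump, predict(stump, [4, 10]), predict(stump, [0, 3]))
```

A real experiment would swap the stump for a stronger learner and evaluate on held-out pairs rather than on the training rows.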

Supervised Learning Model
Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using:
  Decision Tree
  SVM (linear kernel)
  K-nearest neighbor
  Multilayer Perceptron
  Naive Bayes
  Bagging
Combining many classifiers (Pavlov, M., & Ichise, R., 2007):
  Decision stump + AdaBoost
  Decision Tree + AdaBoost
  SMO + AdaBoost

Summary
  Similarity based estimation: does not perform well.
  Maximum likelihood: depends on the network.
  Supervised learning model: performs better than similarity based estimation.

Workflow
(Figure: workflow diagram; labels: Features, Classifier, Model.)

Graph Description
Co-authorship graph: undirected graph G(V, E)
Node or vertex (author):
  Author ID
  Author name
Link or edge (co-authorship):
  Pair of author IDs
  List of publication years, each followed by the paper title (e.g., 2004: ”Introduction to …”)
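One possible in-memory form of this graph, as a rough sketch: nodes are a mapping from author ID to name, and each undirected edge is keyed by a sorted pair of author IDs holding its list of (year, title) records. The paper records and names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: (publication year, paper title, author IDs on the paper).
papers = [
    (2001, "Paper one", [1, 2]),
    (2004, "Paper two", [2, 1, 3]),
]

# Nodes: author ID -> author name (names are placeholders).
authors = {1: "Author X", 2: "Author Y", 3: "Author Z"}

def build_edges(papers):
    """Map each unordered author pair to its list of (year, title) records."""
    edges = defaultdict(list)
    for year, title, author_ids in papers:
        # Sorting makes (1, 2) and (2, 1) the same undirected edge.
        for a, b in combinations(sorted(set(author_ids)), 2):
            edges[(a, b)].append((year, title))
    return dict(edges)

edges = build_edges(papers)
for pair, records in sorted(edges.items()):
    print(pair, records)
```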

Setting up data
The dataset is separated into 2 time spans: 2000 – 2010 and 2010 – 2013.
The first is for training, the latter is for testing.
Currently, there are 134,307 researchers in the network 2000 –
Authors who are not available in the testing period are cropped out, leaving 104,265 researchers.
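The split and the cropping step might look like the sketch below. Because the two spans on the slide both include 2010, the boundary handling here (training is any year before `split_year`, testing is that year or later) is an assumption, as are the toy data and the function names.

```python
# Hypothetical edges: sorted author-ID pair -> years in which the pair co-authored.
edge_years = {
    (1, 2): [2001, 2011],
    (2, 3): [2005],
    (1, 3): [2012],
    (4, 5): [2003],
}

def split_by_time(edge_years, split_year=2010):
    """Return (training links, testing links) as sets of author pairs."""
    train = {p for p, ys in edge_years.items() if any(y < split_year for y in ys)}
    test = {p for p, ys in edge_years.items() if any(y >= split_year for y in ys)}
    return train, test

def crop_to_active(train, test):
    """Keep only training links whose two authors both appear in the testing period."""
    active = {author for pair in test for author in pair}
    return {p for p in train if p[0] in active and p[1] in active}

train, test = split_by_time(edge_years)
train = crop_to_active(train, test)
# Only collaborations that did not exist during training count as new links.
new_links = test - train
print(sorted(train), sorted(new_links))
```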

Setting up data
Choose a subset of the 104,265 researchers; experiment on 937 researchers.

                        No. of nodes   No. of links
  Real network               104,265   413,691; 35,558
  Experiment network             937

Baseline Features
Extract features from the network structure.
Local similarity:
  Common Neighbor
  Adamic/Adar
  Preferential Attachment
  Jaccard’s coefficient
Global similarity:
  Shortest Path
  PageRank

Baseline Features
Feature for co-authorship networks: keyword matching (Cohen, S., & Ebel, L., 2013), a suggested metric that measures textual relevancy with a TF-IDF-based function.
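The slide does not give the exact TF-IDF-based function, so the sketch below substitutes a common choice: cosine similarity between TF-IDF vectors built from each author's keywords. The keyword lists, the weighting scheme, and the function names are all assumptions.

```python
import math
from collections import Counter

# Hypothetical keywords taken from each author's paper titles.
docs = {
    "X": ["graph", "mining", "link"],
    "Y": ["graph", "link", "prediction"],
    "Z": ["protein", "folding"],
}

def tfidf_vectors(docs):
    """Weight each term by (term frequency) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for author, words in docs.items():
        tf = Counter(words)
        vectors[author] = {
            term: (count / len(words)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(docs)
print(cosine(vectors["X"], vectors["Y"]), cosine(vectors["X"], vectors["Z"]))
```

Authors who share rare keywords score higher than authors who share only common ones, which is the usual reason to prefer TF-IDF over raw keyword overlap.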

Proposed Features
Productivity of the authors: observe the “history” of an author.
For example, at a particular node A:

  i   T_i    n   m
  0   2000   3   1
  1   2004   4   2
  2   2005   6   2
  3   2006   7   3

n: no. of shared papers
m: no. of collaborators

Proposed Features
Productivity of the authors: observe the “history” of an author.
The “productivity” of node A:
α: a constant that assigns the weight of each time period

Training set
Set up training data: with n nodes, there are n(n − 1)/2 possible links. Among those, separate two kinds of link:
  Positive link: a link that appears in the training years.
  Negative link: any remaining link that does not exist in the training years.
Note: avoid biased training by balancing the number of instances with true and false labels.
Classify all the non-existent links.
Compare with the testing data.
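A rough sketch of this setup, assuming that balancing means sampling as many negative pairs as there are positive ones. The sampling rule, the seed, and the toy data are hypothetical.

```python
import random
from itertools import combinations

def build_training_set(nodes, train_links, seed=0):
    """Return (balanced labelled pairs, all non-existent pairs)."""
    positives = {tuple(sorted(pair)) for pair in train_links}
    # An undirected graph on n nodes has n * (n - 1) / 2 candidate pairs;
    # for 937 nodes that is 937 * 936 / 2 = 438,516.
    all_pairs = list(combinations(sorted(nodes), 2))
    negatives = [pair for pair in all_pairs if pair not in positives]
    # Sample as many negatives as positives so neither label dominates.
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    labelled = [(pair, 1) for pair in sorted(positives)]
    labelled += [(pair, 0) for pair in sampled]
    return labelled, negatives

nodes = range(6)
train_links = [(0, 1), (2, 1), (3, 4)]
labelled, candidates = build_training_set(nodes, train_links)
print(len(labelled), len(candidates))
```

After training, every pair in `candidates` is classified, and the pairs predicted as links are compared with the links that appear in the testing period.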

Experimental Results
Measurement of performance:
  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
  Harmonic mean (F1) = 2 × Precision × Recall / (Precision + Recall)
New links to predict: 57 links

                       Predicted true link   Predicted false link
  Actual true link                      26                     31
  Actual false link                  5,588                429,778
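The three measures can be computed directly from the confusion matrix. The sketch below assumes the rows of the table are the actual labels and the columns are the predictions, which matches the 57 new links (26 + 31). The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Counts read from the table: 26 true links found, 31 missed,
# 5,588 non-links wrongly predicted as links.
precision, recall, f1 = precision_recall_f1(tp=26, fp=5588, fn=31)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts, recall is 26/57 (about 46%), while precision is 26/5,614 (under 0.5%), so the low harmonic mean comes almost entirely from the false positives.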

Result Analysis
Possible reasons:
  Features
  Small set of data (sampling problem)
  Instances of the negative links used for training

Research Plan
Use a weighted graph with parameters:
  No. of papers
  No. of neighbors
  No. of citations
Focus on features that specifically target the co-authorship network:
  Citations
  Institutions
Enlarge the experiment dataset.

Thank you

References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3).
Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security.
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7).
Pavlov, M., & Ichise, R. (2007). Finding Experts by Link Prediction in Co-authorship Networks. FEWS, 290.
Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3).
Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52).
Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks. arXiv preprint.
Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion.

The number of links per year in the training set is greater than in the testing set: in the testing period, only “new” collaborations are considered. Any collaboration between researchers who already have a link is disregarded.
  No. of nodes: 937
  No. of links: 3,093 (training); 57 (testing)

Results with different classifiers
Columns: Classifier; Precision (Positive Predictive Value) (%); Recall (Hit rate) (%); F1 (Harmonic mean) (%)
Classifiers compared: Decision Tree, SMO, Bagging, Naive Bayes, Multilayer Perceptron

Proposed Feature
The reasons for proposing this feature:
  Keep track of each researcher’s tendency.
  Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones.
  Also give a high score to prolific researchers (based on the number of published papers).

Stochastic Block Model
Guimerà, R., & Sales-Pardo, M., 2009

Stochastic Block Model
The reliability of an individual link is: