
1 LINK PREDICTION IN CO-AUTHORSHIP NETWORK
Le Nhat Minh (A0074403N)
Supervisor: Dongyuan Lu Aobo Tao Chen

2 Introduction
Co-authorship network: a network of collaborations among researchers, scientists, and academic writers.
Link prediction: propose likely future connections within the network.
Outline: Problem Background, Related Work, Method, Experiment, Feature Analysis, Result Analysis

3 Problem Background
What connects researchers? Given an instance of a co-authorship network, a researcher is connected to another if they have collaborated on at least one paper.
Example: Paper X (2001) is co-authored by A, B, and C; Paper Y (2004) is co-authored by B and C. A-B and A-C therefore share paper X, while B-C shares papers X and Y.

4 A snapshot of the co-authorship network (figure)

5 Graph Description
Co-authorship graph: undirected graph G(V, E).
Node or vertex (author): author ID, author name.
Link or edge (co-authorship): pair of author IDs, plus a list of publication years, each followed by the paper title (e.g., 2004: "Introduction to …").
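
The graph description above can be sketched in a few lines of Python. This is only an illustration, not the author's data structure: the function name `build_coauthor_graph` and the `(year, title, author_ids)` record layout are assumptions, and the sample data reuses Paper X and Paper Y from slide 3.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(papers):
    """Build an undirected co-authorship graph.

    papers: iterable of (year, title, author_ids) records (assumed layout).
    Returns (adjacency, edge_papers), where edge_papers maps a sorted
    author pair to the list of (year, title) they share.
    """
    adjacency = defaultdict(set)
    edge_papers = defaultdict(list)
    for year, title, authors in papers:
        # Every pair of co-authors on the same paper gets an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
            edge_papers[(a, b)].append((year, title))
    return adjacency, edge_papers


papers = [
    (2001, "Paper X", ["A", "B", "C"]),
    (2004, "Paper Y", ["B", "C"]),
]
adjacency, edge_papers = build_coauthor_graph(papers)
```

Keeping the shared papers on the edge preserves the publication years, which later slides rely on to split links into training and testing periods.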

6 Problem Background
How to predict a link? Based on these criteria: the co-authorship network topology, the researcher's personal information, and the researcher's papers.
Goal: boost link prediction performance. A recommended link should be relevant to the authors' interests, or at least be a feasible collaboration.

7 Related Work
Three mainstream approaches to link prediction:
Similarity-based estimation: Liben-Nowell, D., & Kleinberg, J., 2007
Maximum likelihood estimation: Murata, T., & Moriyasu, S., 2008; Guimerà, R., & Sales-Pardo, M., 2009
Supervised learning model: Pavlov, M., & Ichise, R., 2007; Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006

8 Summary
Similarity-based estimation: does not perform very well.
Maximum likelihood: depends on the particular network; works best for block-structured and hierarchical networks.
Supervised learning model: performs better than similarity-based estimation.

9 Method
Classifier model; feature set

10 Method
Compare performance with related papers.
Baseline 1: Link Prediction using Supervised Learning (2006) by Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M.
Baseline 2: Finding Experts by Link Prediction in Co-authorship Networks (2007) by Pavlov, M., & Ichise, R.

11 Baseline 1: Link Prediction using Supervised Learning (2006)
DBLP dataset: 1990-2000 training, 2000-2004 testing; 1,564,617 authors; 540,459 papers.
Feature set: shortest distance, sum of paper count, sum of neighbors count, second shortest distance.

12 Baseline 2: Finding Experts by Link Prediction in Co-authorship Networks (2007)
Institute of Electronics, Information and Communication Engineers (IEICE) database:
1993-1996 training, 1997-1999 testing: 1,380 authors, 1,620 links.
2000-2003 training, 2004-2006 testing: 3,603 authors, 7,512 links.
Feature set: shortest distance, common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment, Katz, PageRank (min), PageRank (max), SimRank.

13 My feature set
From baseline 1: shortest path, total number of papers.
From baseline 2: PageRank, Adamic/Adar, preferential attachment.
My additional features: common 3rd neighbor (replaces common neighbor), productivity, affiliations (university / faculty), keywords (extracted from titles), address of the author's institute, conference.

14 Experiment setup
Nodes: 104,265. Links: 413,691 in training (2000-2009), 35,558 in testing (2010-2013).
Training data: with 104,265 nodes there are ~5 × 10^9 possible links (huge!). Positive links: 413,691. Negative links: 413,691 (chosen randomly from the remaining unlinked pairs).
Testing data: positive links: 35,558; negative links: 35,558.
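
A balanced negative sample like the one in the setup above can be drawn by rejection sampling, since almost every random pair is unlinked. The sketch below is an assumption about how this might be done, not the author's code; `sample_negative_links` and its parameters are hypothetical names.

```python
import random


def sample_negative_links(nodes, positive_links, k, seed=0):
    """Draw k distinct unlinked node pairs uniformly at random.

    Assumes every endpoint in positive_links is also in nodes and that
    positive_links contains no self-loops.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    positives = {frozenset(pair) for pair in positive_links}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if k > max_negatives:
        raise ValueError("not enough unlinked pairs to sample from")
    negatives = set()
    while len(negatives) < k:
        pair = frozenset(rng.sample(nodes, 2))  # two distinct nodes
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


# Number of unordered node pairs for the 104,265 nodes on the slide.
possible_links = 104_265 * 104_264 // 2

demo_nodes = ["A", "B", "C", "D", "E"]
demo_positives = [("A", "B"), ("B", "C"), ("C", "D")]
demo_negatives = sample_negative_links(demo_nodes, demo_positives, k=3)
```

Fixing the seed keeps the sampled negatives reproducible between runs, which makes comparisons against the baselines easier to repeat.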

15 Extracting features
Common 3rd neighbor (replaces common neighbor), productivity, affiliations (university / faculty), keywords (extracted from titles), address of the author's institute, conference.

16 Common 3rd Neighbor
Instead of looking only at common neighbors, we look at a larger neighborhood around each author. The reason is simple: authors within the 1st neighborhood are often already connected to each other, so few new links can be predicted and recall is low.
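
The slide does not define the feature precisely. One possible reading, assumed here, is to count the authors that lie within three hops of both candidates. The helper names `k_hop_neighborhood` and `common_k_hop_neighbors` are hypothetical.

```python
def k_hop_neighborhood(adjacency, source, k=3):
    """Return all nodes reachable from source in at most k hops (excluding source)."""
    seen = {source}
    frontier = {source}
    for _ in range(k):
        # Expand one hop, keeping only nodes not visited before.
        frontier = {
            n for u in frontier for n in adjacency.get(u, ()) if n not in seen
        }
        seen |= frontier
    seen.discard(source)
    return seen


def common_k_hop_neighbors(adjacency, x, y, k=3):
    """Count nodes within k hops of both x and y (assumed interpretation)."""
    shared = k_hop_neighborhood(adjacency, x, k) & k_hop_neighborhood(adjacency, y, k)
    return len(shared - {x, y})


# A simple chain A-B-C-D-E-F: A and F share no direct neighbors,
# but their 3-hop neighborhoods overlap in C and D.
chain = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
```

With `k=1` this reduces to the ordinary common-neighbor count, which is why distant pairs score zero under the baseline feature.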

17 Affiliation and Address of the Institute
These two features capture the geographic location of the authors. The hope is that they increase the hit rate of the prediction.

18 Keywords and Conference
Keywords are extracted from all the paper titles; the hope is that they reflect the authors' interests and fields of research. Conference refers to the names of the conferences where those papers were published.
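
The slides do not say how keywords are extracted or compared. As a rough sketch under that caveat, titles can be tokenized, filtered against a small stopword list, and compared by set overlap. The stopword list, the three-letter minimum, and the function names below are all assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "with", "using", "by"}


def title_keywords(titles):
    """Collect lowercase keywords from a list of paper titles."""
    words = set()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS and len(word) > 2:
                words.add(word)
    return words


def keyword_overlap(titles_a, titles_b):
    """Jaccard overlap between the keyword sets of two authors."""
    ka, kb = title_keywords(titles_a), title_keywords(titles_b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


overlap = keyword_overlap(
    ["Link Prediction in Social Networks"],
    ["Supervised Link Prediction"],
)
```

The same overlap function could be applied to sets of conference names, since both features compare two authors by what their publication records share.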

19 Productivity
Observe the publication history of an author. For example, at a particular node A:
i=0, T0 = 2000: n=3, m=1
i=1, T1 = 2004: n=4, m=2
i=2, T2 = 2005: n=6, m=2
i=3, T3 = 2006: n=7, m=3
n: number of shared papers; m: number of collaborators.

20 Productivity
Productivity of the authors: observe the history of an author.
The productivity of node A: (formula shown on the slide; not preserved in this transcript)
α: a constant that assigns the weight of each time period.

21 Baseline implementation
Baseline 1: shortest distance, sum of paper count, sum of neighbors count.
Baseline 2: shortest distance, common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment, Katz, PageRank (min), PageRank (max).
Both are implemented using my data structure and classifiers.

22 Baseline 1

23 Baseline 2

24 My feature set
From baseline 1: shortest path, total number of papers.
From baseline 2: PageRank, Adamic/Adar, preferential attachment.
My additional features: common 3rd neighbor (replaces common neighbor), productivity, affiliations (university / faculty), keywords (extracted from titles), address of the author's institute, conference.

25 Compare with the baselines: Decision Tree

26 Compare with the baselines: SVM

27 Keywords
Why does decision tree learning give poor results from features such as keywords?
Figure panels: Training, Testing.

28 Shortest Path
Figure panels: Training, Testing.

29 Research Plan
Fully implement the baseline feature sets.
Observe the data and analyze the false-positive results.
Choose the random training and testing data more carefully.
Thank you

30 References
Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.
Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM'06: Workshop on Link Analysis, Counter-terrorism and Security.
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.
Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42-55.
Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257.
Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.
Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An evolutionary algorithm approach to link prediction in dynamic social networks. arXiv preprint arXiv:1304.6257.
Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion, 959-962.

31 Backup slides

32 Potential applications
Recommend experts or groups of researchers to an individual researcher.

33 Similarity-Based Estimation
Use metrics to estimate the proximity of each pair of researchers.
Rank the pairs of researchers by those proximities.
The top-ranked pairs are the likely recommendations.
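
The three steps above amount to scoring every unlinked pair and sorting. A minimal sketch, assuming the graph is a dict of neighbor sets and using the common-neighbor count as a stand-in proximity metric (the names `rank_candidate_pairs` and `common_neighbor_score` are hypothetical):

```python
from itertools import combinations


def common_neighbor_score(adjacency, x, y):
    """Proximity of x and y as the number of neighbors they share."""
    return len(adjacency[x] & adjacency[y])


def rank_candidate_pairs(adjacency, score, top_n=3):
    """Score every unlinked pair and return the top_n as (x, y, score)."""
    candidates = [
        (score(adjacency, x, y), x, y)
        for x, y in combinations(sorted(adjacency), 2)
        if y not in adjacency[x]  # only pairs without an existing link
    ]
    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(x, y, s) for s, x, y in candidates[:top_n]]


graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
recommendations = rank_candidate_pairs(graph, common_neighbor_score)
```

Scoring all pairs is quadratic in the number of authors, so on a graph of the size used in the experiments a real implementation would need to restrict the candidate set.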

34 Maximum Likelihood Estimation
Predefine specific rules for the network; this requires prior knowledge of the network.
The likelihood of any non-connected link is calculated according to those rules.

35 Stochastic Block Model
Figure: example network with nodes 1-7, X, and Y.
The reliability of an individual link is: (formula shown on the slide; not preserved in this transcript)

36 Supervised Learning Model
Construct multi-dimensional feature vectors.
Feed these vectors to classifiers to optimize a target function (model training).
Link prediction becomes a binary classification problem.

37 Links per year in the training set exceed links per year in the testing set: in the testing period, only new collaborations are considered. Any collaboration between researchers who already have a link is disregarded.
Nodes: 937. Links: 3,093 in 2000-2010; 57 in 2010-2013.
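
Keeping only new collaborations in the testing period is a set difference over unordered pairs. The function name `new_test_links` is an assumption for illustration:

```python
def new_test_links(train_links, test_links):
    """Return test-period links whose author pair has no training-period link."""
    already_linked = {frozenset(pair) for pair in train_links}
    return {frozenset(pair) for pair in test_links} - already_linked


train = [("A", "B"), ("B", "C")]
test = [("B", "A"), ("A", "C")]  # B-A repeats an existing collaboration
fresh = new_test_links(train, test)
```

Using `frozenset` makes the comparison independent of author order, so B-A in the testing period is recognized as the existing A-B link.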

38 Results with different classifiers
Classifier: Precision (positive predictive value) (%), Recall (hit rate) (%), F1 (harmonic mean) (%)
Decision Tree: 0.3, 24.6, 0.5
SMO: 0.5, 45.6, 0.9
Bagging: 0.4, 28.1, 0.7
Naive Bayes: 0.2, 77.2, 0.3
Multilayer Perceptron: 0.4, 47.3, 0.8

39 Proposed Feature
Reasons for proposing this feature:
Keep track of each researcher's tendencies.
Give a bonus to researchers who tend to collaborate with new colleagues rather than old ones.
Also give a high score to prolific researchers (based on the number of published papers).

40 Stochastic Block Model
Guimerà, R., & Sales-Pardo, M., 2009
Outline: Problem Background, Related Work, Workflow, Conclusion

41 Related Work
Link prediction problems in social networks: Liben-Nowell, D., & Kleinberg, J., 2007; Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013.
In a social network, interactions among users are very dynamic: new links are created within a few days, and existing links are deleted or replaced.
The two kinds of network present different features.
Characteristics of an individual researcher: citations, affiliations, institutions, ...
Characteristics of a person: marital status, age, workplace, …


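43 Experimental Results
Measurement of performance, with TP, FP, and FN the counts of true positives, false positives, and false negatives:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Harmonic mean (F1) = 2 × Precision × Recall / (Precision + Recall)
New links to predict: 57
Confusion matrix (rows: actual, columns: predicted):
Actual true link: 26 predicted true, 31 predicted false
Actual false link: 5,588 predicted true, 429,778 predicted false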
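
As a worked check, the confusion matrix on this slide (26 true links predicted correctly, 31 missed, 5,588 false links predicted as true) can be plugged into the standard precision, recall, and F1 definitions. The function name is an assumption; the counts come from the slide.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read from the confusion matrix on the slide.
precision, recall, f1 = precision_recall_f1(tp=26, fp=5_588, fn=31)
percentages = tuple(round(100 * value, 1) for value in (precision, recall, f1))
```

Rounded to one decimal place, these come out to 0.5%, 45.6%, and 0.9%, which match the SMO row of the classifier results. The low precision follows from the class imbalance: 57 true links against more than 435,000 unlinked pairs.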

44 Result Analysis
Possible reasons: the features; a small data set (sampling problem); the instances of negative links used for training.

45 Similarity-Based Estimation
Adamic/Adar (figure: nodes X, Y, Z)

46 Similarity-Based Estimation
Network-structure-based measurement. Some conventions:

47 Similarity-Based Estimation
Common Neighbor (figure: nodes X, Y)

48 Similarity-Based Estimation
Jaccard's coefficient (figure: nodes X, Y)

49 Similarity-Based Estimation
Preferential Attachment (figure: nodes X, Y)

50 Similarity-Based Estimation
Adamic/Adar (figure: nodes X, Y, Z)
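
The formulas on these slides did not survive in the transcript, so the sketch below uses the standard textbook definitions of the four neighborhood measures (as in Liben-Nowell & Kleinberg, 2007), which may differ in detail from what the slides showed. The graph is assumed to be a dict mapping each node to its set of neighbors.

```python
import math


def common_neighbors(adj, x, y):
    """Number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard_coefficient(adj, x, y):
    """Shared neighbors divided by all neighbors of either node."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def preferential_attachment(adj, x, y):
    """Product of the two node degrees."""
    return len(adj[x]) * len(adj[y])


def adamic_adar(adj, x, y):
    """Sum of 1 / log(degree) over shared neighbors.

    A shared neighbor z is linked to both x and y, so its degree is at
    least 2 and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


# X and Y share the single neighbor Z; W is linked only to X.
adj = {"X": {"Z", "W"}, "Y": {"Z"}, "Z": {"X", "Y"}, "W": {"X"}}
```

Adamic/Adar weights a shared neighbor less when that neighbor has many collaborators, so a common co-author with few links counts for more than a highly connected one.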

51 Similarity-Based Estimation
Shortest Path (figure: nodes X, Y, Z)
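
In an unweighted co-authorship graph, the shortest-path feature can be computed with breadth-first search. This is a generic sketch rather than the author's implementation; `shortest_distance` is a hypothetical name and the graph is assumed to be a dict of neighbor sets.

```python
from collections import deque


def shortest_distance(adj, source, target):
    """Number of edges on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# X and Y are not linked directly but share the co-author Z; Q is isolated.
graph = {"X": {"Z"}, "Z": {"X", "Y"}, "Y": {"Z"}, "Q": set()}
```

Here `shortest_distance(graph, "X", "Y")` is 2, the smallest value an unlinked pair can have. Unreachable pairs return `None`, which a feature extractor would have to map to some sentinel value.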

52 Similarity-Based Estimation
PageRank: a random walk on the graph assigns each node the probability that it is reached. The proximity between a pair of nodes can be determined by the sum of their PageRank scores.
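
A minimal power-iteration sketch of PageRank on an undirected graph, followed by the pair score as the sum of the two node scores. The damping factor 0.85 and the fixed iteration count are conventional choices assumed here, not values from the slides, and the graph dict is assumed to contain every node as a key.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration; scores sum to 1."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not adj[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in adj[u]:
                new_rank[v] += damping * rank[u] / len(adj[u])
        rank = new_rank
    return rank


def pagerank_pair_score(rank, x, y):
    """Pair proximity as the sum of the two PageRank scores."""
    return rank[x] + rank[y]


# Star graph: C collaborates with A and B, who do not collaborate with each other.
star = {"A": {"C"}, "B": {"C"}, "C": {"A", "B"}}
ranks = pagerank(star)
```

Baseline 2 lists PageRank (min) and PageRank (max) as separate features; swapping the sum for `min` or `max` of the two scores would give those variants.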

53 Classify 3 kinds of features:

