Query Independent Scholarly Article Ranking

Slides:

Advertisements

Similar presentations

A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.

Advertisements

Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

Modeling and Analysis of Random Walk Search Algorithms in P2P Networks Nabhendra Bisnik, Alhussein Abouzeid ECSE, Rensselaer Polytechnic Institute.

“The Rise and Rise of Citation Analysis”

1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.

Distributed PageRank Computation Based on Iterative Aggregation- Disaggregation Methods Yangbo Zhu, Shaozhi Ye and Xing Li Tsinghua University, Beijing,

Time-dependent Similarity Measure of Queries Using Historical Click- through Data Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan.

Efficient and Robust Computation of Resource Clusters in the Internet Efficient and Robust Computation of Resource Clusters in the Internet Chuang Liu,

Kyle Heath, Natasha Gelfand, Maks Ovsjanikov, Mridul Aanjaneya, Leo Guibas Image Webs Computing and Exploiting Connectivity in Image Collections.

Affinity Rank Yi Liu, Benyu Zhang, Zheng Chen MSRA.

Journal Status* Using the PageRank Algorithm to Rank Journals * J. Bollen, M. Rodriguez, H. Van de Sompel Scientometrics, Volume 69, n3, pp , 2006.

Making Pattern Queries Bounded in Big Graphs 11 Yang Cao 1,2 Wenfei Fan 1,2 Jinpeng Huai 2 Ruizhe Huang 1 1 University of Edinburgh 2 Beihang University.

PageRank for Product Image Search Kevin Jing (Googlc IncGVU, College of Computing, Georgia Institute of Technology) Shumeet Baluja (Google Inc.) WWW 2008.

1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.

2015/10/111 DBconnect: Mining Research Community on DBLP Data Osmar R. Zaïane, Jiyang Chen, Randy Goebel Web Mining and Social Network Analysis Workshop.

Chengjie Sun,Lei Lin, Yuan Chen, Bingquan Liu Harbin Institute of Technology School of Computer Science and Technology 1 19/11/ :09 PM.

1/52 Overlapping Community Search Graph Data Management Lab, School of Computer Science

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.

Zibin Zheng DR 2 : Dynamic Request Routing for Tolerating Latency Variability in Cloud Applications CLOUD 2013 Jieming Zhu, Zibin.

Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems.

Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova ， Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.

Graph Summaries for Subgraph Frequency Estimation 1 Angela Maduko, 2 Kemafor Anyanwu, 3 Amit Sheth, 4 Paul Schliekelman 1 LSDIS Lab, University of Georgia.

Scientific Paper Recommendation Emphasizing Each Researcher’s Most Recent Research Topic Kazunari Sugiyama 8 th January, 2010.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

Kijung Shin Jinhong Jung Lee Sael U Kang

Speaker : Yu-Hui Chen Authors : Dinuka A. Soysa, Denis Guangyin Chen, Oscar C. Au, and Amine Bermak From : 2013 IEEE Symposium on Computational Intelligence.

Online Evolutionary Collaborative Filtering RECSYS 2010 Intelligent Database Systems Lab. School of Computer Science & Engineering Seoul National University.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.

Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.

1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)

Biao Wang 1, Ge Chen 1, Luoyi Fu 1, Li Song 1, Xinbing Wang 1, Xue Liu 2 1 Shanghai Jiao Tong University 2 McGill University

Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang.

Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.

Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana Constraint-Based Entity Matching.

Prediction of Interconnect Net-Degree Distribution Based on Rent’s Rule Tao Wan and Malgorzata Chrzanowska- Jeske Department of Electrical and Computer.

Yu Wang1, Gao Cong2, Guojie Song1, Kunqing Xie1

MINING DEEP KNOWLEDGE FROM SCIENTIFIC NETWORKS

Recommendation in Scholarly Big Data

Finding Dense and Connected Subgraphs in Dual Networks

A Viewpoint-based Approach for Interaction Graph Analysis

RankSQL: Query Algebra and Optimization for Relational Top-k Queries

Greedy & Heuristic algorithms in Influence Maximization

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Sofus A. Macskassy Fetch Technologies

A paper on Join Synopses for Approximate Query Answering

Table 1. Advantages and Disadvantages of Traditional DM/ML Methods

Preface to the special issue on context-aware recommender systems

Parallel Density-based Hybrid Clustering

Temporal Indexing MVBT.

Yun-FuLiu Jing-MingGuo Che-HaoChang

Understanding and Exploiting Amazon EC2 Spot Instances

Chao Zhang1, Yu Zheng2, Xiuli Ma3, Jiawei Han1

Latent Space Model for Road Networks to Predict Time-Varying Traffic

An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.

Centrality in Social Networks

Hidden Markov Models Part 2: Algorithms

Weakly Learning to Match Experts in Online Community

Liang Zheng and Yuzhong Qu

CSc4730/6730 Scientific Visualization

Paraskevi Raftopoulou, Euripides G.M. Petrakis

MEgo2Vec: Embedding Matched Ego Networks for User Alignment Across Social Networks Jing Zhang+, Bo Chen+, Xianming Wang+, Fengmei Jin+, Hong Chen+, Cuiping.

Scaling up Link Prediction with Ensembles

Topological Signatures For Fast Mobility Analysis

2019/9/14 The Deep Learning Vision for Heterogeneous Network Traffic Control Proposal, Challenges, and Future Perspective Author: Nei Kato, Zubair Md.

Peng Cui Tsinghua University

Modeling Topic Diffusion in Scientific Collaboration Networks

Presentation transcript:

Query Independent Scholarly Article Ranking Shuai Ma, Chen Gong, Renjun Hu, Dongsheng Luo, Chunming Hu, Jinpeng Huai SKLSDE Lab, Beihang University, China Beijing Advanced Innovation Center for Big Data and Brain Computing

Query Independent Scholarly Article Ranking Goal: giving static ranking based on scholarly data only Applications Playing a key role in literature recommendation systems, especially in the cold start scenario For search engines, determining the ranking of results WSDM Cup 2016 http://www.wsdm-conference.org/2016/wsdm-cup.html

Challenges Heterogeneous, evolving & dynamic Multiple types of entities involve with different contributions Entities and their importance evolve with time Academic data is dynamic and continuously growing The Microsoft Academic Graph [Sinha et al. 2015] New Records per year of dblp Database Arnab Sinha, et al. An Overview of Microsoft Academic Service (MAS) and Applications. In WWW, 2015. https://dblp.uni-trier.de/statistics/newrecordsperyear.html

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Our Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Why Weighted PageRank? Traditional PageRank Weighted PageRank Assumption of equally propagating Articles are equally influenced by references Bias: favor older articles while underestimate new ones Not all citations are equal [Valenzuela et al. 2015] Different articles typically have different impacts Weighted PageRank Key: how to determine the weights (differentiate impacts) M. Valenzuela, V. Ha and O. Etzioni. Identifying Meaningful Citations. In AAAI Workshop, 2015.

Intuitions of Impacts of Articles Time decaying Most previous work simply decays exponentially [1-4] When to decay? [1] X. Li, B. Liu and P. Yu. Time sensitive ranking with application to publication search. In ICDM, 2008. [2] Y. Wang et al. Ranking scientific articles by exploiting citations, authors, journals and time information. In AAAI, 2013. [3] H. Sayyadi and L. Getoor. Future rank: Ranking scientific articles by predicting their future pagerank. In SDM, 2009. [4] D. Walker et al. Ranking scientific publications using a model of network traffic. Journal of Statistical Mechanics: Theory and Experiment, 2007.

Different Citation Patterns[Chakraborty et al. 2015] When to Decay Different patterns for different articles [Chakraborty et al. 2015] Categorized by when articles reach their citation peaks PeakInit, PeakMul, PeakLate, MonDec, MonIncr, Other Different Citation Patterns[Chakraborty et al. 2015] Decaying only after the peak time of each individual article Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, et al. On the categorization of scientific citation profiles in computer sciences. Commun. ACM 2015.

Our Time-Weighted PageRank Importance propagation based on time-weighted impacts Time-weighted impact 𝑇 𝑢 : time of paper 𝑢, 𝑃𝑒𝑎𝑘 𝑣 : peak time of paper 𝑣, 𝜎: decaying factor 𝑤 𝑢,𝑣 = 1, & 𝑒 𝜎( 𝑇 𝑢 − 𝑃𝑒𝑎𝑘 𝑣 ) , 𝑇 𝑢 < 𝑃𝑒𝑎𝑘 𝑣 𝑇 𝑢 ≥ 𝑃𝑒𝑎𝑘 𝑣 Decaying with time only after the peak time Each individual article has its own peak time Remarks Considering the temporal information and dynamic impacts Alleviating the bias through decayed time-weighted impacts

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Our Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Why Importance Assembling? Cold start case: ranking new articles No citations yet: only using citation information fails Venue and author information should be incorporated Observation Multiple types of entities involve with different contributions Assembling the different contributions of citation, venue and author components

Ranking with Importance Assembling Importance is defined as a combination of the prestige and popularity favoring those with recent citations 𝐼𝑚𝑝 𝑣 = 𝑃𝑟𝑠 𝑣 𝜆 𝑃𝑜𝑝 𝑣 1−𝜆 , λ: importance weighing factor favoring those with citations soon after publication Final ranking 𝑅 𝑣 =𝛼 𝑅 𝑐 𝑣 +𝛽 𝑅 𝑣 𝑣 +(1−𝛼−𝛽) 𝑅 𝑎 (𝑣) 𝛼 and 𝛽: aggregating parameters

Importance Computation Citation component 𝑃𝑟𝑠 𝑐 of article 𝑣 is its TWPageRank score on the citation graph 𝑃𝑜𝑝 𝑐 of article 𝑣 is the sum of its citation freshness 𝑃𝑜𝑝 𝑐 𝑣 = (𝑢,𝑣)∈𝐸 𝑒 𝜎( 𝑇 0 − 𝑇 𝑢 ) 𝑇 0 : current year, 𝑇 𝑢 : time of 𝑢, 𝜎: decaying factor Venue component Constructing a venue graph and computing in similar way Author component Using average prestige and popularity of his/her published articles

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Our Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Batch Algorithm batSARank Importance Popularity computation Can be done by scanning all citations once Prestige computation Traditionally computed by TWPageRank in an iterative manner and is the most expensive computation Adopting block-wise computation method batTWPR [Berkhin 2005] Treating each strong connected component (SCC) as a block Processing blocks one by one following topological orders The edges between blocks are only scanned once 𝐼𝑚𝑝 𝑣 = 𝑃𝑟𝑠 𝑣 𝜆 𝑃𝑜𝑝 𝑣 1−𝜆 𝑃𝑜𝑝 𝑐 𝑣 = (𝑢,𝑣)∈𝐸 𝑒 𝜎( 𝑇 0 − 𝑇 𝑢 ) P. Berkhin. Survey: A survey on pagerank computing. Internet Mathematics, vol. 2, no. 1, pp. 73–120, 2005.

Why Adopting Block-wise Method? Observation: citations obey a natural temporal order SCC edge ratios are small for citation and venue graphs Based on statistics of scholarly data, block-wise method is a good choice for TWPageRank Time complexity analysis Taking t=100 for example, algorithm batTWPR only needs to scan 4|E| edges on citation and venue graphs, but over 59|E| edges on Web graphs.

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Our Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Incremental Algorithm incSARank Observation on scholarly data Data only increases without decreasing Citation relationships obey a natural temporal order The original block-wise graph and topological order do NOT change The existing popularity simply needs to be scaled Data structure maintenance Only new SCCs and new topological order need to be computed Popularity computation Computing freshness of new citations Prestige computation Incremental TWPageRank algorithm incTWPR Partitioning graph 𝐺 into affected and unaffected areas Employing different updating strategies for different areas

Affected and Unaffected Area Analysis Nodes that are reachable from newly added nodes Nodes with outgoing edges having weight changes Nodes that are reachable from other affected nodes The rest of the original graph is unaffected area Unaffected Area Affected Area

Time Complexity Analysis Data structure maintenance Saving 𝑂( 𝑉 +|𝐸|) time (about 90%) Popularity computation Saving 𝑂(|𝐸|) time (about 90%) Prestige computation Saving 𝑂( 𝐸 𝐴 ∪ 𝐸 𝐴𝐵 ) time (about 30%) Cost: 𝑂(|𝑉|) space for affected/unaffected areas 𝑉 𝐸

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Our Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Experimental Settings Datasets: AAN [Liang et al. 16], DBLP [Tang et al. 08], MAG [Sinha et al. 15] Metric: pairwise accuracy PairAcc= # of agreed pairs # of all pairs Algorithms PRank [Brin et al. 98]: PageRank on the article citation graph; FRank [Sayyadi et al. 09]: using citation, temporal and other heterogeneous information; HRank [Liang et al. 16]: using both citation and heterogeneous information based on hyper networks; SARank: our method; R. Liang and X. Jiang, Scientific ranking over heterogeneous academic hypernetwork, in AAAI, 2016. J. Tang, J. Zhang, L. Yao, et al., Arnetminer: Extraction and mining of academic social networks, in KDD, 2008. A. Sinha, Z. Shen, Y. Song, et al., An overview of microsoft academic service (MAS) and applications, in WWW, 2015. S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks, 1998. H. Sayyadi and L. Getoor, Future rank: Ranking scientific articles by predicting their future pagerank, in SDM, 2009.

Experimental Settings Ground-truth: RECOM [Liang et al. 16], which assumes articles with more recommendations are more important PFCTN for article ranking in a concerned year (splitting year) Simply using citation numbers for fair evaluation Past and future citations contribute equally Articles in the same pairs must be in similar research fields and published in the same years Articles with more PF citations are more important current year start year splitting year total # of PF citations past future x years R. Liang and X. Jiang, Scientific ranking over heterogeneous academic hypernetwork, in AAAI, 2016.

Effectiveness with RECOM SARank consistently ranks better with RECOM Note: RECOM is originally given on AAN, and we extend it to DBLP and MAG through exact title matching.

Effectiveness with PFCTN +16.7%, +7.2%, +2.9% +23.6%, +8.3%, +3.2% +13.4%, +6.0%, +2.4% current year start year splitting year # of published years article published ranking data SARank consistently ranks better with PFCTN

Batch and incremental algorithms are more efficient Efficiency (2.5, 4.1) times faster (2.0, 3.0, 4.4, 245) times faster MAG MAG Batch and incremental algorithms are more efficient

Outline Ranking Model Ranking Computation Dynamic Ranking Computation Time Weighted PageRank Ranking with Importance Assembling Ranking Computation Dynamic Ranking Computation Experimental Study Summary

Summary Proposing a scholarly article ranking model SARank Time-Weighted PageRank algorithm Assembling the importance of articles, venues and authors Developing efficient ranking computation algorithms Block-wise computation for TWPageRank Incremental algorithm by affected/unaffected area division Experimentation study SARank consistently ranks better Batch and incremental algorithms are more efficient PFCTN, a new benchmark for article ranking

Thanks! Q&A

Components Computation Venue component Treating the venue in each year individually and its importance is the sum of importance in all individual years 𝑃𝑟𝑠 𝑣 of venue 𝑘 is its TWPageRank score on the venue graph 𝑃𝑜𝑝 𝑣 of venue 𝑘 is the average popularity of its articles

Components Computation Author component Compute the TWPagerank on the author citation graph is computationally expensive 𝑃𝑟𝑠 𝑎 of author 𝑢 is the average prestige of his/her articles 𝑃𝑜𝑝 𝑎 of author 𝑢 is the average popularity of his/her articles

Time decaying factor 𝜎 barely affects the result Impacts of Parameters Time decaying factor 𝜎 barely affects the result The PairAcc of combining prestige and popularity is generally better than using prestige or popularity alone

Impacts of Parameters 𝛼 and 𝛽 the PairAcc changes gently, and the optimal PairAcc is obtained with in a single region. SARank is very robust to parameters 𝛼 and 𝛽.

SARank vs. DRank(exponentially decay directly) TWPR generally ranks better than directly decaying