Mining Advisor-Advisee Relationships from Research Publication Networks, KDD 2010. Presenter: 徐晓旻.


INTRODUCTION: We conduct a systematic investigation of mining advisor-advisee relationships between authors in a research publication network. Recovering these relationships gives better insight into the research community and provides additional semantic information on the links.

INTRODUCTION (cont.): The left figure shows the input: a temporal collaboration network consisting of authors and papers. The middle figure shows the output of our analysis: an author network in which solid arrows indicate advising relationships. The right figure gives an example of the visualized chronological hierarchies.

PROBLEM FORMULATION: The input is a network G = (V = V_p ∪ V_a, E), where V_p = {p_1, ..., p_np} is the set of publications, with p_i published at time t_i; V_a = {a_1, ..., a_na} is the set of authors; and E is the set of edges. Each edge e_ij ∈ E associates paper p_i with author a_j, meaning that a_j is an author of p_i.

Transforming the original network: The original network can be transformed into a network containing only authors. Let G′ = (V′, E′, {py_ij}_{e_ij ∈ E′}, {pn_ij}_{e_ij ∈ E′}), where V′ = {a_0, ..., a_na} is the set of authors (including a virtual node a_0). Each edge e′_ij = (i, j) ∈ E′ connects authors a_i and a_j if they have published together. Two vectors are associated with each edge: the Pub_Year_vector py_ij and the Pub_Num_vector pn_ij.

Transforming the original network (cont.): Each author a_i is also associated with two vectors, py_i and pn_i, which record the author's publication years and the number of papers published in each of those years, respectively. The two vectors py_i and pn_i can be derived from py_ij and pn_ij.
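The transformation from papers to an author-only network can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_coauthor_network` and `to_vectors` and the input layout (a list of `(year, authors)` tuples) are assumptions made for the example, and the per-author vectors are counted directly from the papers rather than derived from the edge vectors.

```python
from collections import defaultdict
from itertools import combinations

def build_coauthor_network(papers):
    """papers: iterable of (year, [author, ...]) tuples (hypothetical input layout).

    Returns (edge_counts, author_counts) where
      edge_counts[(a, b)][year] = number of joint papers of a and b in that year (a < b)
      author_counts[a][year]    = number of papers of a in that year
    """
    edge_counts = defaultdict(lambda: defaultdict(int))
    author_counts = defaultdict(lambda: defaultdict(int))
    for year, authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            author_counts[a][year] += 1
        for a, b in combinations(unique, 2):
            edge_counts[(a, b)][year] += 1
    return edge_counts, author_counts

def to_vectors(counts):
    """Turn a {year: count} mapping into aligned (py, pn) vectors."""
    years = sorted(counts)
    return years, [counts[y] for y in years]

papers = [
    (2001, ["ada", "bob"]),
    (2001, ["ada", "bob", "cy"]),
    (2003, ["ada", "cy"]),
]
edges, authors = build_coauthor_network(papers)
py_ab, pn_ab = to_vectors(edges[("ada", "bob")])  # edge vectors py_ij, pn_ij
py_a, pn_a = to_vectors(authors["ada"])           # author vectors py_i, pn_i
```

Here the edge between "ada" and "bob" gets py = [2001] and pn = [2], while "ada" alone gets py = [2001, 2003] and pn = [2, 1].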

Why this problem is more complicated: (i) one could have multiple advisors, such as a master's advisor and PhD co-advisors; (ii) some mentors from industry behave similarly to academic advisors if judged only by the collaboration history; (iii) one's advisor could be missing from the data set.

Constructing the subgraph H′: Formally, we denote by r_ij the probability of a_j being the advisor of a_i. We construct a subgraph H′ ⊂ G′ by removing some edges from G′ and directing the remaining edges from advisee to potential advisor.

Constructing the subgraph H′ (cont.): A simple way to predict is to fetch the top k potential advisors of a_i and check whether a_j is one of them while r_ij > r_i0 or r_ij > θ, where θ is a threshold such as 0.5. We use P@(k, θ) to denote this method.
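The prediction rule above can be written in a few lines. This is a sketch under stated assumptions: the function name `predict_advisors` is hypothetical, the scores for one author a_i are given as a dictionary, and key 0 stands for the virtual node a_0 (no advisor in the data).

```python
def predict_advisors(scores, k, theta):
    """Apply the P@(k, theta) rule for a single author a_i.

    scores: dict mapping candidate advisor j -> ranking score r_ij;
            key 0 is the virtual node a_0 (assumed convention).
    Returns the candidates among the top k whose score beats r_i0 or theta.
    """
    r_i0 = scores.get(0, 0.0)
    ranked = sorted((j for j in scores if j != 0),
                    key=lambda j: scores[j], reverse=True)
    return [j for j in ranked[:k] if scores[j] > r_i0 or scores[j] > theta]

scores_i = {0: 0.3, "j1": 0.6, "j2": 0.1}
predicted = predict_advisors(scores_i, k=2, theta=0.5)
```

With these example scores, only "j1" is predicted: "j2" is in the top 2 but its score exceeds neither r_i0 = 0.3 nor the threshold 0.5.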

4. APPROACH: The main idea is to leverage a time-constrained probabilistic factor graph model to decompose the joint probability of the unknown advisors of all authors. By maximizing the joint probability of the factor graph, we can infer the relationships and compute a ranking score for each relation edge on the candidate graph.

4.1 Assumptions and Framework

Two-stage framework: In stage 1, we preprocess the heterogeneous collaboration network to generate the candidate graph H′. This includes the transformation from G to a homogeneous network G′, the construction of H′ from G′, and the estimation of the local likelihood on each edge of H′. In stage 2, these potential relations are further modeled with a probabilistic model. Local likelihoods and time constraints are combined in the global joint probability of all the hidden variables. The joint probability is maximized, and the ranking scores of all the potential relations are computed together. The construction of H is finished in this stage.

4.2 Stage 1: Preprocessing

Rules to detect advisors: The Kulczynski measure reflects the correlation of the two authors' publications. IR (the imbalance ratio) measures the imbalance between the occurrence of a_j given a_i and the occurrence of a_i given a_j.
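As a rough illustration of the two measures, the sketch below uses the textbook Kulczynski definition (the average of the two conditional probabilities) and a signed imbalance ratio that is positive when a_j has more papers than a_i. The slide does not show the formulas, so the exact form, the sign convention, and the function names are assumptions.

```python
def kulczynski(n_i, n_j, n_ij):
    """Average of P(a_j | a_i) and P(a_i | a_j).

    n_i, n_j: number of papers of a_i and a_j; n_ij: number of joint papers.
    """
    return 0.5 * (n_ij / n_i + n_ij / n_j)

def imbalance_ratio(n_i, n_j, n_ij):
    """Signed imbalance ratio (assumed convention): > 0 when a_j publishes more than a_i."""
    return (n_j - n_i) / (n_i + n_j - n_ij)

# A student with 4 papers, all co-authored with a professor who has 20 papers.
kulc = kulczynski(4, 20, 4)      # 0.5 * (4/4 + 4/20) = 0.6
ir = imbalance_ratio(4, 20, 4)   # (20 - 4) / (4 + 20 - 4) = 0.8
```

A high Kulczynski value together with a strongly positive imbalance ratio is the kind of pattern such rules look for: the pair collaborates closely, and a_j is clearly the more established author.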

Rule to detect advisor

When a pair of authors passes the test of the selected rules, we construct a directed edge from a_i to a_j in H′. We then estimate the starting time and ending time of the advising, as well as the local likelihood l_ij of a_j being a_i's advisor. The starting time st_ij is estimated as the time they started to collaborate.

The ending time ed_ij can be estimated as either the time point when the Kulczynski measure starts to decrease, or the year that yields the largest difference between the Kulczynski measure before and after it. Local likelihood of a_j being a_i's advisor: l_ij.

Stage 2: TPFG Model: We define the TPFG model. For each node a_i, there are three variables to decide: y_i, st_i, and ed_i. Defined here: the local feature function g(y_i, st_i, ed_i) and the joint probability of all the variables in the network.

Stage 2: TPFG Model (cont.): To find the most probable values of all the hidden variables, we need to maximize their joint probability. Exhaustive search is intractable.
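To see why exhaustive search does not scale, consider a brute-force maximizer over a toy candidate graph. This sketch rests on assumptions not spelled out on the slide: the joint score is taken as the product of local likelihoods, and the time constraint is taken to mean that an advisor's own advisee period must end before they start advising someone. The function name, the data layout, and the example numbers are all hypothetical.

```python
from itertools import product
from math import inf, prod

def brute_force_map(candidates):
    """candidates[i]: list of (j, l_ij, st_ij, ed_ij) options for author i.

    j == 0 is the virtual advisor a_0 (no advisor in the data).
    Returns (best assignment, best score, number of configurations tried).
    """
    authors = sorted(candidates)
    best_score, best_assign, n_configs = 0.0, None, 0
    for choice in product(*(candidates[i] for i in authors)):
        n_configs += 1
        picked = dict(zip(authors, choice))
        # Assumed time constraint: if j advises i, j's own advisee
        # period must have ended before st_ij.
        feasible = all(j == 0 or picked[j][3] < st for (j, _, st, _) in choice)
        if not feasible:
            continue
        score = prod(l for (_, l, _, _) in choice)
        if score > best_score:
            best_score = score
            best_assign = {i: picked[i][0] for i in authors}
    return best_assign, best_score, n_configs

candidates = {
    1: [(0, 1.0, -inf, -inf)],
    2: [(0, 0.2, -inf, -inf), (1, 0.8, 2000, 2004)],
    3: [(0, 0.1, -inf, -inf), (1, 0.4, 2003, 2006), (2, 0.5, 2002, 2005)],
}
best, score, tried = brute_force_map(candidates)
```

In this toy case author 3 locally prefers author 2 (likelihood 0.5), but author 2 is still being advised until 2004, so the best feasible assignment pairs both 2 and 3 with author 1. Only 1 × 2 × 3 = 6 configurations are enumerated here, yet the count is the product of the candidate-list sizes and grows exponentially with the number of authors.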

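Decomposition of variable dependencies: Eliminate the variables st_i and ed_i, then compute the likelihood that a_j is the advisor of a_i, together with the conditions that must be satisfied (given by the indicator function I).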

Decomposition of variables dependency

In this figure, the nodes related to f1(.) are y1 and all possible student nodes of node 1; from the figure, these are nodes 2 and 3.

4.4 Model Learning

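Sum-product: The sum-product algorithm inherits the message-passing mechanism, but by introducing a factor graph it decomposes the global probability density function into a product of several local probability density functions.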
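The benefit of the factorization can be seen on a toy chain x1 - f12 - x2 - f23 - x3 with binary variables. The factor tables below are made-up numbers for illustration only; the point is that the marginal of x2 computed by summing the full product equals the product of two locally computed messages.

```python
from itertools import product

VALS = (0, 1)
# Hypothetical local factors of a global function g(x1, x2, x3) = f12 * f23.
f12 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
f23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def marginal_brute(x2):
    """Sum the global function over every (x1, x3) combination."""
    return sum(f12[(x1, x2)] * f23[(x2, x3)] for x1, x3 in product(VALS, VALS))

def marginal_messages(x2):
    """Sum each factor locally, then multiply the two incoming messages."""
    msg_from_f12 = sum(f12[(x1, x2)] for x1 in VALS)
    msg_from_f23 = sum(f23[(x2, x3)] for x3 in VALS)
    return msg_from_f12 * msg_from_f23

unnormalized = [marginal_messages(v) for v in VALS]  # [12.0, 15.0]
```

Both functions give 12 for x2 = 0 and 15 for x2 = 1, but the message version never enumerates joint configurations, which is what makes the approach practical on larger graphs.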

single- sum-product algorithm

Sum-product algorithm: Note that g_i(x_i) is a function of x_i only, i.e., g_i(x_i) = μ_{x→g_i}(x_i); g_i(x_i) then follows from formula (5).

single- sum-product algorithm

New TPFG Inference Algorithm: The original sum-product algorithm runs into difficulty because it requires each node to wait for all but one of its messages to arrive. In TPFG, some nodes would therefore wait forever due to the existence of cycles. Instead, we arrange the message passing in a strict order determined by H′. Each node a_i has a descendant set Y_i^{-1} and an ascendant set Y_i.

Message passing, two-phase schema: In the first phase, messages are passed from advisees to possible advisors; in the second, messages are passed back from advisors to possible advisees. First phase: the message from f_i() to y_i is generated and sent only when all the messages from its descendants have arrived, and y_i immediately sends it to all its ascendants f_j(), j ∈ Y_i.

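Two-phase schema (cont.): Second phase: messages are passed along the reverse direction of each edge used in phase 1. Why compute r_ij when we already have l_ij? Because l_ij is only the local support for a_j being the advisor of a_i, whereas r_ij is by definition a global support: it takes the other dependencies in the graph into account, and it does so through this propagation model.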

Two-phase schema (cont.): After the two phases of message propagation, we can collect the two messages on any edge and obtain the marginal function.
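The ordering used by the two phases can be sketched as a topological schedule over H′. This only illustrates the scheduling, not the message contents, and it assumes H′ is acyclic; the function name `two_phase_schedule` and the dictionary layout are hypothetical.

```python
from collections import deque

def two_phase_schedule(ascendants):
    """ascendants[i]: set of potential advisors of author i (edges of H', advisee -> advisor).

    Returns (phase1_order, phase2_order). In phase 1 a node sends only after
    all of its descendants have sent; phase 2 replays the order in reverse.
    """
    nodes = set(ascendants) | {j for advs in ascendants.values() for j in advs}
    descendants = {n: set() for n in nodes}
    for i, advs in ascendants.items():
        for j in advs:
            descendants[j].add(i)

    waiting = {n: len(descendants[n]) for n in nodes}  # messages still expected
    ready = deque(sorted(n for n in nodes if waiting[n] == 0))
    phase1 = []
    while ready:
        n = ready.popleft()
        phase1.append(n)
        for j in sorted(ascendants.get(n, ())):
            waiting[j] -= 1
            if waiting[j] == 0:
                ready.append(j)
    if len(phase1) != len(nodes):
        raise ValueError("H' must be acyclic for this schedule")
    return phase1, phase1[::-1]

# Author 3 may be advised by 1 or 2; author 2 may be advised by 1.
phase1, phase2 = two_phase_schedule({3: {1, 2}, 2: {1}})
```

For this small graph the first phase runs 3, 2, 1 (advisees first) and the second phase runs 1, 2, 3, so no node ever waits on a message that has not been sent.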

Simplifying the message propagation: We eliminate the function nodes and the internal messages between a function node and a variable node. The improved message propagation is still separated into two phases. In the first phase, the messages sent_i, passed from each node to its ascendants, are generated in a similar order as before. In the second phase, the messages returned from the ascendants, recv_i, are stored in each node.

simplify the message propagation

5. EXPERIMENTAL RESULTS: Data set: the DBLP Computer Science Bibliography Database. To test the accuracy of the discovered advisor-advisee relationships, we adopt three data sets: one is manually labeled by looking into the home pages of the advisors, and the other two are crawled from the Mathematics Genealogy Project and the AI Genealogy Project.

Comparing TPFG with baseline methods: Evaluation aspects: two performance measurements, accuracy and scalability.

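5.2 Accuracy: Effect of rules in TPFG: From Figure 5(a) we can see that R2/R3 has the highest suitability on the tested data. ROC curve: the known advisor-advisee pairs in the test data are compared with the pairs computed by the algorithm. The computed pairs are sorted by ranking score in descending order; the horizontal axis is the top a% of the computed pairs, and the vertical axis is the size of the intersection between that top a% and the test-data pairs, divided by the size of the test data.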
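The curve described above can be computed as follows. This is a sketch of that description only, with a hypothetical function name and made-up pairs; strictly speaking it plots the fraction of known pairs recovered within the top a% of the ranking.

```python
def coverage_curve(ranked_pairs, true_pairs, percents):
    """ranked_pairs: (advisee, advisor) pairs sorted by ranking score, highest first.
    true_pairs: known advisor-advisee pairs from the test data.
    percents: integer percentages a for the horizontal axis.

    Returns a list of (a, fraction of true pairs found in the top a%).
    """
    truth = set(true_pairs)
    points = []
    for a in percents:
        top = ranked_pairs[: len(ranked_pairs) * a // 100]
        hits = len(truth.intersection(top))
        points.append((a, hits / len(truth)))
    return points

ranked = [("b", "a"), ("c", "a"), ("c", "b"), ("d", "b")]
known = [("b", "a"), ("d", "b")]
curve = coverage_curve(ranked, known, [25, 50, 100])
```

On this example the top 25% and top 50% each contain one of the two known pairs (0.5), and the full list contains both (1.0).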

Effect of network structure: From Figure 5(c) we see that, for closures with different depths, TPFG achieves better accuracy as the depth increases. TPFG is also compared with the exact maximal joint probability and with other approximate algorithms, JuncT and LBP.

Effect of training data: Support Vector Machines (SVMs) are accurate supervised learning approaches. Advisor mining is reduced to a classification problem. We combined the Kulczynski and IR measures with [...] as features. TPFG can achieve comparable or even better accuracy than a supervised method.

Effect of training data

5.3 Scalability Performance

5.4 Applications: Visualization of genealogy: the visualized hierarchies of the research community, based on the advising relationships, can help us gain better insight into the community.

5.4 Applications (cont.): Expert finding and Bole search: Bole search is a specific expert-finding task that aims to identify the best supervisors.
