1
Detecting Research Topics via the Correlation between Graphs and Texts
Yookyung Jo, Dept. of Computer Science, Cornell University
Carl Lagoze †, and C. Lee Giles ‡
† Dept. of Computer Science, Cornell University
‡ Information Sciences and Technology, The Pennsylvania State University
2
Acknowledgment
John E. Hopcroft
Thorsten Joachims
Simeon Warner
Isaac G. Councill
NSF IIS-0430906, 0227648, 0227888, and 0424671
3
Topic detection
Problem statement : how to detect topics in a linked corpus (e.g. Citeseer, arXiv, the Web …)
Our strategy :
–The correlation between the distribution of terms representing a topic and the distribution of citation links
4
Correlation between Terms and Links
Term α : representing a topic (e.g. “sensor network”, or “association rule”)
Term η : not representing a topic (e.g. “six months”, or “practical examples”)
[Figure: term citation graph for α; term citation graph for η]
5
Term citation graph for a term α
[Figure: graph whose nodes are each labeled α]
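The slide shows the term citation graph only as a picture. A minimal sketch of one way to build it, assuming the graph for a term is the part of the citation graph restricted to documents whose text contains that term (an assumption, since the slide text does not spell out the construction). The names `docs`, `citations`, and `term_citation_graph` are hypothetical.

```python
def term_citation_graph(docs, citations, term):
    """Return (nodes, edges) of the citation graph restricted to one term.

    docs      : dict mapping document id -> document text
    citations : iterable of (citing_id, cited_id) pairs
    term      : the term, e.g. a bi-gram such as "sensor network"

    Assumption: a document is a node if its lowercased text contains the
    term as a substring; an edge is kept only if both endpoints are nodes.
    """
    needle = term.lower()
    nodes = {doc_id for doc_id, text in docs.items() if needle in text.lower()}
    edges = {(src, dst) for src, dst in citations if src in nodes and dst in nodes}
    return nodes, edges


docs = {
    1: "Routing in a sensor network",
    2: "A sensor network survey",
    3: "Petri nets revisited",
}
citations = [(2, 1), (3, 1)]
nodes, edges = term_citation_graph(docs, citations, "sensor network")
```

Substring matching is the simplest choice here; a real pipeline would more likely match on tokenized bi-grams.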
7
Detecting a topic via a single term
Given a term A, binary decision of whether A represents a topic or not
–H1 : A represents a topic
–H0 : A does not represent a topic
–G_A : the term citation graph for A
–O(G_A) : link connectivity observation on G_A
Finally, a ranked list of terms
8
Loglikelihood of H1
Observation O(G_A) :
–For each node i in G_A, is it connected to other nodes in G_A by at least one link?
–This probability = p_c
Under H1
–p_c1 : estimation of p_c
–p_c1 set to a value close to 1 (e.g. p_c1 = 0.9)
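The formula on this slide did not survive extraction. As a rough sketch only, the observation can be turned into a log-likelihood by assuming each node is connected independently with probability p_c (a Bernoulli model, which is an assumption here and may differ from the authors' exact formula). The function names are hypothetical.

```python
import math


def connected_count(nodes, edges):
    """Count nodes that have at least one link to another node in the graph."""
    touched = set()
    for src, dst in edges:
        # Ignore self-links and links that leave the node set.
        if src != dst and src in nodes and dst in nodes:
            touched.add(src)
            touched.add(dst)
    return len(touched)


def log_likelihood(n, n_c, p_c):
    """Log-likelihood of seeing n_c connected nodes out of n.

    Assumes independent nodes, each connected with probability p_c,
    where 0 < p_c < 1.
    """
    return n_c * math.log(p_c) + (n - n_c) * math.log(1.0 - p_c)


# Under H1 the slide sets p_c1 close to 1, e.g. 0.9.
n_c = connected_count({1, 2, 3}, [(1, 2)])
ll_h1 = log_likelihood(3, n_c, 0.9)
```

With p_c1 near 1, a graph in which most nodes are connected gets a high H1 log-likelihood, and every isolated node is penalized heavily.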
9
Loglikelihood of H0
[Figure: G_A]
–p_c0 : estimation of p_c
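The slides end the single-term procedure with a ranked list of terms. One plausible way to get there, sketched under assumptions: score each term by the difference between its H1 and H0 log-likelihoods and sort by that score. The Bernoulli likelihood, the use of a log-likelihood ratio as the ranking score, and the p_c0 values in the example are all assumptions. The slide does not say how p_c0 is estimated, so it is passed in as an input.

```python
import math


def log_likelihood(n, n_c, p_c):
    """Bernoulli log-likelihood of n_c connected nodes out of n (assumed model)."""
    return n_c * math.log(p_c) + (n - n_c) * math.log(1.0 - p_c)


def topic_score(n, n_c, p_c1, p_c0):
    """Log-likelihood ratio of H1 (topic) against H0 (not a topic)."""
    return log_likelihood(n, n_c, p_c1) - log_likelihood(n, n_c, p_c0)


def rank_terms(stats, p_c1=0.9):
    """Sort terms from most to least topic-like.

    stats maps term -> (n, n_c, p_c0); p_c0 must lie strictly in (0, 1).
    """
    scores = {
        term: topic_score(n, n_c, p_c1, p_c0)
        for term, (n, n_c, p_c0) in stats.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up counts for illustration only; not taken from the evaluation.
stats = {
    "we show": (100, 10, 0.1),
    "sensor network": (100, 80, 0.1),
}
ranking = rank_terms(stats)
```

A term whose documents cite each other far more often than H0 predicts gets a large positive score, while a stop-phrase spread across unrelated documents ends up at the bottom.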
10
Evaluation
arXiv
–A Physics literature collection
–Year 1991-2006, 7 major arXiv areas
–214,546 papers, 2,165,170 citation links
–Abstract as document
–137,098 bi-gram terms after low-frequency prune
Citeseer
–A Computer Science related collection
–Year 1994-2004
–716,771 papers, 1,740,326 citation links
–Abstract + title as document
–631,839 bi-gram terms after low-frequency prune
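The term vocabulary above consists of bi-grams that survive a low-frequency prune. A minimal sketch of that preprocessing step, assuming a simple lowercase tokenizer and a document-frequency threshold; the slide gives neither, so `min_doc_freq` and the regular expression are hypothetical choices.

```python
import re
from collections import Counter

# Lowercase word tokens; inner hyphens are kept so that terms such as
# "real-time database" stay intact.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def bigrams(text):
    """Return the list of adjacent token pairs in a document."""
    tokens = TOKEN_RE.findall(text.lower())
    return list(zip(tokens, tokens[1:]))


def bigram_vocabulary(documents, min_doc_freq=2):
    """Keep bi-grams that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each bi-gram once per document.
        doc_freq.update(set(bigrams(text)))
    return {" ".join(bg) for bg, count in doc_freq.items() if count >= min_doc_freq}


documents = [
    "Sensor networks save energy",
    "Wireless sensor networks",
    "Petri nets",
]
vocab = bigram_vocabulary(documents)
```

In this toy corpus only "sensor networks" appears in two documents, so it is the only bi-gram kept.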
11
arXiv (physics) : topic terms at top ranks
rank   topic (term)
1      Black hole
2      Quantum hall
3      Black holes
4      Higgs boson
5      Renormalization group
6      Quantum gravity
7      Standard model
8      Heavy quark
9      Cosmological constant
10     Quantum dot
11     Chiral perturbation
12     Form factors
13     Lattice qcd
14     String theory
15     Hubbard model
n : number of nodes in G_A
n_c : number of nodes with at least one connection within G_A
|E| : number of edges in G_A
12
arXiv (Physics) : Term citation graphs for intermediate rank topic terms
[Figure: term citation graphs; labels: research communities, time]
13
arXiv (Physics) : terms at bottommost ranks
Bottom entries are stop-phrases
rank     term
137098   we show
137097   has been
137096   we find
137095   we present
137094   we study
137093   we have
137092   we also
137091   have been
137090   we discuss
137089   we consider
137088   does not
137087   our results
137086   we investigate
137085   into account
137084   we propose
14
Citeseer (CS): top rank terms
Top rank terms from two different time periods
rank   topic (term) up to 1999      topic (term) since 2000
1      logic programs               sensor networks
2      model checking               hoc networks
3      semidefinite programming     logic programs
4      inductive logic              image retrieval
5      petri nets                   support vector
6      genetic programming          congestion control
7      interior point               model checking
8      kolmogorov complexity        decision diagrams
9      automatic differentiation    wireless sensor
10     complementarity problems     ad hoc
11     congestion control           intrusion detection
12     complementarity problem      vector machines
13     conservation laws            mobile ad
14     linear logic                 binary decision
15     timed automata               sensor network
16     situation calculus           energy consumption
17     real-time database           content-based image
18     motion planning              semantic web
19     duration calculus            fading channels
20     volume rendering             xml data
21     chain monte                  source separation
22     association rules            timed automata
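Comparing two time periods requires splitting the corpus before ranking. A minimal sketch of one way to do that, assuming each paper is assigned to a period by its publication year and that only citations between two papers in the same period are kept. How the authors handled links that cross the boundary is not stated on the slide, so that rule is an assumption, and all names below are hypothetical.

```python
def split_corpus_by_year(doc_years, citations, first_late_year=2000):
    """Split papers and citation links into an early and a late period.

    doc_years : dict mapping document id -> publication year
    citations : iterable of (citing_id, cited_id) pairs
    Returns ((early_nodes, early_edges), (late_nodes, late_edges)).
    """
    citations = list(citations)
    early = {doc for doc, year in doc_years.items() if year < first_late_year}
    late = set(doc_years) - early

    def induced(nodes):
        # Keep a link only when both endpoints fall in the same period.
        return [(src, dst) for src, dst in citations if src in nodes and dst in nodes]

    return (early, induced(early)), (late, induced(late))


doc_years = {1: 1998, 2: 1999, 3: 2001, 4: 2003}
citations = [(2, 1), (3, 1), (4, 3)]
(early_nodes, early_edges), (late_nodes, late_edges) = split_corpus_by_year(
    doc_years, citations
)
```

Each half can then be ranked on its own, which is how a term could appear in one period's top list and not the other's.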
15
Citeseer: Topic time evolution
[Figure panels: “sensor networks”, “support vector”, “congestion control”, “logic programs”]
16
Citeseer: Topic time evolution
[Figure panels: “petri nets”, “association rules”, “genetic programming”, “semantic web”]
17
Algorithm Extension
To detect topics represented by a single term
–Algorithm
–Evaluation on arXiv, Citeseer
To detect topics defined by a set of terms
–Algorithm
–Evaluation on arXiv
18
Conclusion (poster session : #7)
Topic detection via the correlation between terms and links
Our algorithm (in its evaluation on arXiv, Citeseer)
–Effectively discovers topics represented by a single term or by a set of terms
–Identifies stop-phrases as a by-product
–Discovers topics in their natural scale
–Demonstrates its utility in trend analysis
–Shows the association between topic scale and specificity