Bibliometric Impact Measures Leveraging Topic Analysis Gideon Mann David Mimno Andrew McCallum Computer Science Department University of Massachusetts.

1 Bibliometric Impact Measures Leveraging Topic Analysis Gideon Mann David Mimno Andrew McCallum Computer Science Department University of Massachusetts Amherst

2 Goal: Measure the impact of papers and research subfields.
Important for:
- Researchers understanding their own field.
- Libraries deciding which journals to purchase.
- Personnel committees deciding on hiring, promotion, awards.

3 Typical Impact Measures
- Citation Count
- Garfield’s Journal Impact Factor

4 Why are topical divisions useful in bibliometrics?
Citation counts (Source: Journal Citation Reports, 2004)
Biochemistry and molecular biology:
  J. Biol. Chem           405017
  Cell                    136472
  Biochem.-US              96809
Mathematics:
  Lect. Notes Math          6926
  T. Am. Math. Soc          6469
  J. Math. Anal. Appl.      6004
Can you compare the tallest building in NY to the tallest building in Stamford, CT?

5 Why are topical divisions useful in bibliometrics?

6 Why not use Journal as a proxy for Topic? Journals not necessarily about one topic. Topics may not have their own journal. Open access publishing on the rise. 5% of the 200 most-cited papers in CiteSeer are tech reports! Spidered web documents often do not include venue information.

7 This Paper
- Discovering fine-grained, interpretable topics from text
- 8 impact measures leveraging topics
- Analysis on 1.5 million research papers and their citations
Talk Outline
- Topical N-Grams, a phrase-discovering enhancement to LDA
- A quick tour of 8 impact measures with examples
- Where did we get all this data from? An introduction to Rexa, a new little sibling of CiteSeer, Google Scholar, etc.

8 Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example: θ = 70% Iraq war, 30% US election; z = Iraq war; w = “bombing”
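
The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names, the four-word vocabulary, and the two toy topics are assumptions made up for the example, and the topic-word distributions are taken as given rather than learned.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)            # per-document distribution over topics
    topics, words = [], []
    for _ in range(length):
        z = sample_categorical(theta, rng)          # sample a topic for this word slot
        w = sample_categorical(topic_word[z], rng)  # sample a word from that topic
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

# Hypothetical toy setup: two topics over a four-word vocabulary.
rng = random.Random(0)
vocab = ["bombing", "troops", "ballot", "candidate"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # toy "Iraq war" topic
    [0.0, 0.0, 0.5, 0.5],  # toy "US election" topic
]
theta, topics, words = generate_document(
    topic_word, vocab, alpha=[0.7, 0.3], length=10, rng=rng
)
```

Each generated word comes from exactly one topic, while the document as a whole mixes topics in the proportions given by theta.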

9 Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
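
To make the "easy to implement" claim concrete, here is a sketch of a collapsed Gibbs sampler for LDA in plain Python. It assumes symmetric Dirichlet hyperparameters (alpha, beta) and documents given as lists of integer word ids. The function name and defaults are illustrative, and nothing here is claimed to match the implementation used in the talk.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total tokens assigned to topic k
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights, k=1)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word

# Hypothetical toy corpus over a vocabulary of four word ids.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assign, doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=20)
```

The whole sampler is count bookkeeping plus one weighted draw per token, which is why it scales to large corpora with modest engineering effort.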

10 Example topics induced from a large collection of text [Tenenbaum et al]
Topic: STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
Topic: MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
Topic: WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Topic: DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
Topic: FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
Topic: SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
Topic: BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
Topic: JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

12 Topics Modeling Multi-word Phrases
- Topics based only on unigrams are sometimes difficult to interpret.
- Topic discovery itself is confused because important meaning and distinctions are carried by phrases.

13 A Topic Comparison
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

14 Topical N-gram Model [Wang, McCallum 2005]
[Figure: graphical model with per-word topic variables z1 to z4, words w1 to w4, and phrase indicator variables y1 to y4, in plates over T topics, W words, and D documents]

15 Features of Topical N-Grams model
- Easily trained by Gibbs sampling
  - Can run efficiently on millions of words
- Topic-specific phrase discovery
  - “white house” has special meaning as a phrase in the politics topic,
  - ... but not in the real estate topic.

16 Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

17 Our Data for This JCDL Paper
- 1.6 million research papers
  - mostly in Computer Science
  - 400k of them with full text
- 14 fields of meta-data from
  - “headers” at top of papers
  - “citations” in References section
  automatically extracted with 99% accuracy.
- Reference resolution performed on 4 million citations.

18 Example Results on our Corpus
Step 1: Run LDA on 1.6 million papers. Use topic analysis to select a subset of AI: ML, NLP, robotics, vision, etc. (Sample LDA topics)
Step 2: Run Topical N-grams on the ~300k papers in the subset. (Sample Topical N-gram topics)

19 Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

20 Impact Measures Leveraging Topics
1. Topical Citation count
2. Topical Impact factor
3. Topical Diffusion
4. Topical Diversity
5. Topical Half-life
6. Topical Precedence
7. Topical H-factor
8. Topical Transfer

22 Topical Citation Count

24 Impact Factor
Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 Impact factors from JCR:
  Nature             32.182
  Cell               28.389
  JMLR                5.952
  Machine Learning    3.258
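
The definition above carries over directly when a topic is substituted for the journal. The sketch below assumes citations are stored as (citing, cited) pairs and that each paper has a publication year and a single group label (journal or topic). All names and the toy data are hypothetical.

```python
def impact_factor(citations, publication_year, group_of, target, year):
    """Citations made in `year` to the target group's papers from the two prior
    years, divided by the number of such papers.

    citations: iterable of (citing_paper, cited_paper) pairs.
    group_of: maps each paper to its journal or topic label.
    """
    window = {year - 2, year - 1}
    cited_pool = {
        paper
        for paper, group in group_of.items()
        if group == target and publication_year[paper] in window
    }
    if not cited_pool:
        return 0.0
    hits = sum(
        1
        for citing, cited in citations
        if publication_year[citing] == year and cited in cited_pool
    )
    return hits / len(cited_pool)

# Hypothetical toy data: five papers, two topics.
publication_year = {"a": 2002, "b": 2003, "c": 2004, "d": 2004, "e": 2001}
group_of = {"a": "RL", "b": "RL", "c": "NLP", "d": "RL", "e": "RL"}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("d", "e"), ("c", "e")]

rl_impact_2004 = impact_factor(citations, publication_year, group_of, "RL", 2004)
# Three 2004 citations reach the two RL papers from 2002-3, so the value is 1.5.
```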

25 Topical Impact Factor over time

27 Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100.
Problem: Relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
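
A small sketch makes the brittleness visible. It assumes the input is simply one label per received citation, naming the citing journal or topic. The function name is illustrative.

```python
def diffusion(citing_groups):
    """Number of distinct citing groups per 100 citations received.

    citing_groups: one journal or topic label per citation received.
    """
    if not citing_groups:
        return 0.0
    return 100.0 * len(set(citing_groups)) / len(citing_groups)

# Cited twice, by two different topics: the maximum possible score.
rarely_cited = diffusion(["Vision", "NLP"])

# Cited 100 times by the same two topics: a much lower score.
heavily_cited = diffusion(["Vision"] * 90 + ["NLP"] * 10)
```

Here `rarely_cited` is 100.0 and `heavily_cited` is 2.0, even though both are cited by exactly two topics, which is the problem the slide points out.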

28 Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
These are just the least cited topics! Better at capturing broad end of impact spectrum
[Figure columns: Diffusion, Diversity]

29 Broad Impact: Diversity, for papers
Topic Diversity: Entropy of the distribution of citing topics
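
Topic diversity as defined here is the Shannon entropy of the citing-topic distribution. The sketch below assumes one topic label per received citation and uses base-2 logarithms. The slides do not state the base, so that choice is an assumption.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical examples.
narrow = topic_diversity(["Vision"] * 90 + ["NLP"] * 10)    # citations concentrated in one topic
broad = topic_diversity(["Vision", "NLP", "IR", "RL"] * 25)  # spread evenly over four topics
```

Unlike diffusion, the score does not shrink as the citation count grows: 100 citations spread evenly over four topics score 2 bits, the same as 4 citations spread evenly over four topics.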

31 Topical Longevity: Cited Half Life
Two views:
- Given a paper, what is the median age of citations to that paper?
- What is the median age of citations from current literature?
Collaborative Filtering is young, fast moving. Maximum Entropy looks further back, but is still producing new work. Neural Networks literature is aging.
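
Both views reduce to a median over citation ages, where age is the citing paper's year minus the cited paper's year. The sketch below aggregates by topic rather than by single paper. The function names, data layout, and toy data are assumptions for illustration.

```python
from statistics import median

def cited_half_life(citations, publication_year, topic_of, topic):
    """Median age of the citations received by papers in `topic`."""
    ages = [
        publication_year[citing] - publication_year[cited]
        for citing, cited in citations
        if topic_of[cited] == topic
    ]
    return median(ages) if ages else None

def citing_half_life(citations, publication_year, topic_of, topic, year):
    """Median age of the references made by `topic` papers published in `year`."""
    ages = [
        publication_year[citing] - publication_year[cited]
        for citing, cited in citations
        if topic_of[citing] == topic and publication_year[citing] == year
    ]
    return median(ages) if ages else None

# Hypothetical toy data.
publication_year = {"p1": 1990, "p2": 1995, "p3": 2000, "p4": 2001, "p5": 2003}
topic_of = {"p1": "NN", "p2": "NN", "p3": "NN", "p4": "CF", "p5": "CF"}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p5", "p4"), ("p5", "p1")]

nn_cited = cited_half_life(citations, publication_year, topic_of, "NN")
cf_cited = cited_half_life(citations, publication_year, topic_of, "CF")
cf_citing = citing_half_life(citations, publication_year, topic_of, "CF", 2003)
```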

32 Topical Precedence (“Early-ness”)
Within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
- Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
- Spectrographic study of vowel reduction, B. Lindblom (1963)
- Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
- Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
- Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

33 Topical Precedence (“Early-ness”)
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
- On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
- Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
- Relevance feedback in information retrieval, Rocchio (1971)
- Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
- New experiments in relevance feedback, Ide (1971)
- Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
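
The precedence question is a filter followed by a sort. The sketch below assumes each paper in a topic is a (title, year, citation_count) tuple. The titles and counts are placeholders, not data from the corpus.

```python
def earliest_cited_papers(papers, n, limit=5):
    """Earliest papers in a topic that received more than n citations.

    papers: list of (title, year, citation_count) tuples for one topic.
    """
    qualified = [paper for paper in papers if paper[2] > n]
    return sorted(qualified, key=lambda paper: paper[1])[:limit]

# Hypothetical papers in a single topic.
papers = [
    ("Paper D", 1974, 90),
    ("Paper B", 1960, 3),
    ("Paper A", 1953, 40),
    ("Paper C", 1963, 25),
]
early = earliest_cited_papers(papers, n=10, limit=2)
```

With a threshold of 10 citations, "Paper B" is dropped despite its early date, and the result lists "Paper A" then "Paper C".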

35 H-factor
H = maximum number K for which you have K papers, each with at least K citations.
... for journals [Braun et al, 2005]
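
The definition translates into a short loop over citation counts sorted in descending order. The topical variant below simply groups papers by topic first. The function names and toy counts are illustrative assumptions.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def topical_h_factor(papers):
    """papers: iterable of (topic, citation_count); returns a dict of topic -> H."""
    by_topic = {}
    for topic, count in papers:
        by_topic.setdefault(topic, []).append(count)
    return {topic: h_factor(counts) for topic, counts in by_topic.items()}

# Hypothetical citation counts.
single = h_factor([10, 8, 5, 4, 3])
per_topic = topical_h_factor(
    [("Parsing", 25), ("Parsing", 8), ("Parsing", 5), ("Parsing", 3), ("Planning", 1)]
)
```

For the first list, four papers have at least four citations each but there are not five with at least five, so `single` is 4.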

36 Topical H-factor, Year 1990
  H   Topic (topic ID)
  12  Natural Language Parsing (16)
  12  Neural Networks (173)
  12  Speech Recognition (120)
  11  Hidden Markov Models (21)
  11  Genetic Algorithms (71)
  11  Optical Flow (48)
  10  Reinforcement Learning (83)
  10  Computer Vision (49)
  10  Mobile Robots (22)
   9  Word Sense Disambiguation (118)
   9  NLP (160)
   8  Planning (35)
   8  Markov Chain Monte Carlo (106)
   8  Maximum Likelihood Estimators (40)
   8  Genetic Algorithms (131)
   7  Genetic Programming (61)

37 Topical H-factor, Year 1995
  H   Topic (topic ID)
  18  Computer Vision (49)
  17  Speech Recognition (120)
  15  Decision Trees (146)
  15  Data Mining (176)
  14  Hidden Markov Models (21)
  14  Genetic Algorithms (71)
  13  Markov Chain Monte Carlo (106)
  13  IR And Queries (138)
  12  Word Sense Disambiguation (118)
  12  Web And VR (80)
  12  Natural Language Parsing (16)
  12  Bayesian Inference (110)
  12  Reinforcement Learning (83)
  12  Logic Programming (150)
  12  Mobile Robots (22)
  12  NLP (160)

38 Topical H-factor, Year 2001
  H   Topic (topic ID)
  15  Web Pages (129)
  15  Ontologies (186)
  13  SVMs (50)
  13  Computer Vision (49)
  13  Gene Expression (126)
  13  Data Mining (176)
  12  Dimensionality Reduction (29)
  12  Question Answering (111)
  12  Search Engines (132)
  11  Natural Language Parsing (16)
  11  Reinforcement Learning (83)
  11  Web Services (184)
  11  HCI (164)
  10  Hidden Markov Models (21)
  10  Word Sense Disambiguation (118)
  10  IR And Queries (138)

40 Topical Transfer
Transfer from Digital Libraries to other topics
  Other topic      Cit’s  Paper Title
  Web Pages        31     Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
  Computer Vision  14     On being ‘Undigital’ with digital cameras: extending the dynamic...
  Video            12     Lessons learned from the creation and deployment of a terabyte digital video
  Graphs           12     Trawling the Web for Emerging Cyber-Communities
  Web Pages        11     WebBase: a repository of Web pages

41 Topical Transfer
Citation counts from one topic to another. Map “producers and consumers”.
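
A producer/consumer map can be built by counting citations between topic pairs. The sketch below assumes one topic per paper and drops within-topic citations. Both are simplifying assumptions, and the names and toy data are hypothetical.

```python
from collections import Counter

def topical_transfer(citations, topic_of):
    """Count cross-topic citations as (producer_topic, consumer_topic) -> count.

    The cited paper's topic is the producer; the citing paper's topic is the consumer.
    """
    flow = Counter()
    for citing, cited in citations:
        producer, consumer = topic_of[cited], topic_of[citing]
        if producer != consumer:  # within-topic citations are not transfer
            flow[(producer, consumer)] += 1
    return flow

# Hypothetical toy data.
topic_of = {"a": "Digital Libraries", "b": "Web Pages", "c": "Web Pages",
            "d": "Computer Vision", "e": "Digital Libraries"}
citations = [("b", "a"), ("c", "a"), ("d", "a"), ("e", "a"), ("a", "b")]

flow = topical_transfer(citations, topic_of)
```

Sorting the counter's items by count for a fixed producer gives a ranking of consuming topics like the one shown for Digital Libraries.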

42 Where did the data come from? http://rexa.info

43 Rexa System Overview
- Spider Web (WWW) for PDFs: home-grown Java+MySQL (~1m PDF/day)
- Convert to text (with layout & format): enhanced ps2text (better word stitching, plus layout in XML)
- Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy)
- Reference resolution (of papers, authors & grants): discriminatively trained graph partitioning (competition-winning accuracy)
- Additional input: NSF grant DB
- Topic Analysis & other Data Mining
- Browsable Web Interface

44 IE from Research Papers [McCallum et al ‘99]
@article{ kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = "237-285",
  year = "1996",
}

45 (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
[Figure: finite state model and graphical model; FSM states (output seq) y t-1, y t, y t+1, y t+2, y t+3 over observations (input seq) x t-1, x t, x t+1, x t+2, x t+3]
Example: input seq “... said Jones a Microsoft VP …”; output seq “... OTHER PERSON OTHER ORG TITLE …”
Wide-spread interest, positive experimental results in many applications:
- Noun phrase, Named entity [HLT’03], [CoNLL’03]
- Protein structure prediction [ICML’04]
- IE from Bioinformatics text [Bioinformatics ‘04], …
- Asian word segmentation [COLING’04], [ACL’04]
- IE from Research papers [HLT’04]
- Object classification in images [CVPR ‘04]
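
The slide's equation did not survive in this transcript. For reference, the standard textbook form of a linear-chain CRF is given below. The symbols (feature functions f_k, weights λ_k, partition function Z) follow common usage and are an assumption about notation, not a recovery of the slide's exact formula.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Training chooses the weights λ_k to maximize the conditional log-likelihood of the labeled output sequences given their input sequences, which matches the description on the slide.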

46 IE from Research Papers
  Model                                Field-level F1
  Hidden Markov Models (HMMs)          75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)       89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)     93.9   [Peng, McCallum, 2004]
Error reduction: 40%
(Word-level accuracy is >99%)
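
The 40% figure is consistent with a relative error reduction from SVMs to CRFs, treating 100 minus F1 as the error. That reading is an assumption, since the slide does not spell out the comparison. The arithmetic can be checked as follows.

```python
def relative_error_reduction(baseline_f1, new_f1):
    """Fraction of the baseline's error (100 - F1) removed by the new model."""
    baseline_error = 100.0 - baseline_f1
    new_error = 100.0 - new_f1
    return (baseline_error - new_error) / baseline_error

# Field-level F1 from the slide: SVMs 89.7, CRFs 93.9.
reduction = relative_error_reduction(89.7, 93.9)
```

The error falls from 10.3 to 6.1 points, a reduction of roughly 41%.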

47 Previous Systems

49 Previous Systems
[Figure: entity Research Paper; relation Cites]

50 More Entities and Relations
[Figure: entities Research Paper, Person, University, Venue, Grant, Groups, Expertise; relation Cites]

63 access, access control, digital library, digital, digital libraries

72 Summary Demonstrated a new topic discovery method (Topical N-Grams) on 1.6m research papers. Presented 8 impact measures based on topics. Introduced Rexa, a showcase for our research on information extraction, coreference and data mining. http://rexa.info is publicly available now. Try it out! Feedback appreciated.

73 Neural Information Processing Conference Dataset
- 1740 papers
- 13649 unique words
- 2,301,375 words
- Volumes 0-12, spanning 1987 – 1999
Prepared by Sam Roweis.

74 Trends in 17 years of NIPS proceedings

75 Topic Distributions Conditioned on Time
[Figure: x-axis time; topic mass shown as vertical height]

76 Finding Topics in 1 million CS papers 200 topics & keywords automatically discovered.

77 Topic Correlations in PAM
5000 research paper abstracts, from across all CS. Numbers on edges are supertopics’ Dirichlet parameters.

