Bibliometric Impact Measures Leveraging Topic Analysis

Presentation on theme: "Bibliometric Impact Measures Leveraging Topic Analysis"— Presentation transcript:

1 Bibliometric Impact Measures Leveraging Topic Analysis
11/19/2018. Bibliometric Impact Measures Leveraging Topic Analysis. Gideon Mann, David Mimno, Andrew McCallum. Computer Science Department, University of Massachusetts Amherst.

2 Goal
Goal: Measure the impact of papers and research subfields. Important for: researchers understanding their own field; libraries deciding which journals to purchase; personnel committees deciding on hiring, promotion, and awards. I want to help people make good decisions by leveraging the knowledge on the Web and in other bodies of text. Sometimes document retrieval is enough for this, but sometimes you need to find patterns in structured data that span many pages. Let me give you some examples of what I mean.

3 Typical Impact Measures
Citation count; Garfield’s Journal Impact Factor.

4 Why are topical divisions useful in bibliometrics?
Citation counts. Source: Journal Citation Reports (2004).
Biochemistry and molecular biology: J. Biol. Chem, 405017; Cell, 136472; Biochem.-US, 96809.
Mathematics: Lect. Notes Math, 6926; T. Am. Math. Soc, 6469; J. Math. Anal. Appl., 6004.
Can you compare the tallest building in NY to the tallest building in Stamford, CT?

6 Why not use Journal as a proxy for Topic?
Journals are not necessarily about one topic. Topics may not have their own journal. Open access publishing is on the rise: 5% of the 200 most-cited papers in CiteSeer are tech reports! Spidered web documents often do not include venue information.

7 This Paper; Talk Outline
This paper: Discovering fine-grained, interpretable topics from text. 8 impact measures leveraging topics. Analysis on 1.5 million research papers and their citations, in contrast to some previous scientometric studies that have analyzed just 16 articles from one journal issue.
Talk outline: Topical N-Grams, a phrase-discovering enhancement to LDA. A quick tour of 8 impact measures with examples. An introduction to Rexa, a new sibling of CiteSeer, Google Scholar, etc. (Where did we get all this data from?)

8 Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative process, with a running example:
For each document: sample a distribution over topics, θ (example: 70% Iraq war, 30% US election).
For each word in the document: sample a topic, z (example: Iraq war); then sample a word from the topic, w (example: “bombing”).
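The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the two topics, their word lists, the uniform per-topic word distribution, and the Dirichlet parameter `alpha` are all made-up assumptions chosen to mirror the slide's example.

```python
import random

# Hypothetical topics and vocabularies, loosely mirroring the slide's example.
TOPICS = {
    "iraq_war": ["bombing", "troops", "baghdad", "military"],
    "us_election": ["vote", "campaign", "ballot", "candidate"],
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate (topic, word) pairs following the LDA generative story."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # Per document: sample a distribution over topics (theta).
    theta = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # Per word: sample a topic z from theta, then a word w from topic z.
        z = rng.choices(names, weights=theta, k=1)[0]
        w = rng.choice(TOPICS[z])  # assumption: uniform word distribution per topic
        words.append((z, w))
    return theta, words
```

Calling `generate_document(10)` returns the sampled topic proportions together with ten (topic, word) pairs; in a real model each topic would carry its own learned distribution over the whole vocabulary.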

9 Inference and Estimation
Gibbs sampling: easy to implement and reasonably fast.
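The slide only names the method, so the following is a sketch of the standard collapsed Gibbs sampler for LDA, not the authors' implementation. The hyperparameters `alpha` and `beta`, the iteration count, and the function name `gibbs_lda` are illustrative assumptions.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)
    assign = []
    # Random initialization of the topic assignment of every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # P(z = k | everything else) is proportional to
                # (n(d,k) + alpha) * (n(k,w) + beta) / (n(k) + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word
```

Each sweep touches every token once and only updates a few counters, which is why the method is simple to implement and scales to large corpora.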

10 Example topics induced from a large collection of text
Each line below is one induced topic, listing its top words [Tenenbaum et al]:
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

12 Topics Modeling Multi-word Phrases
Topics based only on unigrams are sometimes difficult to interpret. Topic discovery itself is confused when important meaning and distinctions are carried by phrases.

13 Topical N-gram Model [Wang, McCallum 2005]
[Figure: graphical model with per-word variables z1 ... z4, y1 ... y4, and w1 ... w4, and plates labeled D, W, and T.]

14 Features of Topical N-Grams model
Easily trained by Gibbs sampling; can run efficiently on millions of words. Topic-specific phrase discovery: “white house” has special meaning as a phrase in the politics topic, but not in the real estate topic.

15 A Topic Comparison
LDA: algorithms, algorithm, genetic, problems, efficient.
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function.

16 Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning.
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods.
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies.

17 Our Data for This Paper
1.6 million research papers, mostly in Computer Science; 400k of them with full text. 14 fields of metadata from the “headers” at the top of papers, and “citations” from the References section, automatically extracted with 99% accuracy. Reference resolution performed on 4 million citations.

18 Example Results on our Corpus
Step 1: Run LDA on 1.6 million papers. Use topic analysis to select the AI subset: ML, NLP, robotics, vision, etc.
Step 2: Run Topical N-grams on the ~300k papers in the subset.
[Figure: sample LDA topics and sample Topical N-gram topics.]

19 We can also now define several new indicators.
11/19/2018 Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
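As a rough sketch of what substituting topic for journal can look like, the code below assigns each paper to a single topic and then counts citations received per topic. The hard, highest-weight assignment and the names `assign_topics` and `topical_citation_count` are assumptions made for illustration; the slides do not say how papers are mapped to topics.

```python
from collections import Counter

def assign_topics(doc_topic_weights):
    """Map each document id to its highest-weight topic (hard assignment, an assumption)."""
    return {doc: max(w, key=w.get) for doc, w in doc_topic_weights.items()}

def topical_citation_count(citations, topic_of):
    """Count citations received by each topic; citations are (citing, cited) pairs."""
    return Counter(topic_of[cited] for _, cited in citations)
```

With this mapping in hand, any journal-level indicator that only needs "which domain does this paper belong to" can be recomputed per topic.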

20 Impact Measures Leveraging Topics
Topical citation count, topical impact factor, topical diffusion, topical diversity, topical half-life, topical precedence, topical H-factor, topical transfer.

22 Topical Citation Count

23 Impact Factor
Journal Impact Factor: citations from articles published in 2004 to articles in Cell published in , divided by the number of articles published in Cell in 2004.
Impact factors from JCR: Nature, 32.182; Cell, 28.389; JMLR, 5.952; Machine Learning, 3.258.
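A small sketch of the computation follows. The slide text is missing part of the year range, so the code assumes the conventional two-year window (citations in year Y to items published in Y-1 and Y-2, divided by the number of such items); the `window` parameter, the function name, and the input layout are illustrative assumptions. Passing the set of papers in a topic as `domain` gives a topical variant.

```python
def impact_factor(citations, pub_year, domain, year, window=2):
    """Impact factor of a domain (journal or topic) for a given year.

    citations: iterable of (citing_paper, cited_paper) pairs.
    pub_year:  dict mapping paper id to publication year.
    domain:    set of paper ids belonging to the journal or topic.
    """
    # Papers of the domain published in the `window` years before `year`.
    recent = {p for p in domain if year - window <= pub_year[p] < year}
    if not recent:
        return 0.0
    # Citations made by papers published in `year` to those recent papers.
    n_cites = sum(
        1 for citing, cited in citations
        if pub_year[citing] == year and cited in recent
    )
    return n_cites / len(recent)
```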

24 Topical Impact Factor over time

26 Broad Impact: Diffusion
Journal Diffusion: the number of journals citing Cell, divided by the total number of citations to Cell over a given time period, times 100. Problem: relatively brittle at low citation counts. If a topic or journal is cited twice, by two different topics or journals, it will have high diffusion.
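The definition and its brittleness can be checked with a few lines of Python. This is a minimal sketch; the function name and the input format (one citing-domain label per citation received) are assumptions.

```python
def diffusion(citing_domains):
    """Diffusion: distinct citing domains per citation received, times 100.

    citing_domains: one label (journal or topic) per citation received.
    """
    if not citing_domains:
        return 0.0
    return 100.0 * len(set(citing_domains)) / len(citing_domains)
```

A domain cited only twice, by two different domains, scores `diffusion(["A", "B"]) == 100.0`, while a domain cited 100 times by the same two domains scores 2.0, which illustrates why the measure favors rarely cited domains.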

27 Broad Impact: Diversity
Topic Diversity: entropy of the distribution of citing topics. Diffusion: these are just the least cited topics! Diversity: better at capturing the broad end of the impact spectrum.
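Topic diversity as defined here is a plain Shannon entropy, which can be sketched as follows. The natural logarithm and the input format (one citing-topic label per citation) are assumptions; the slide does not state a log base.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Shannon entropy (natural log) of the distribution of citing topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Unlike the ratio-based diffusion measure, entropy stays low when almost all citations come from one topic and grows only when citations are spread evenly over many topics.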

28 Broad Impact: Diversity, for papers
Topic Diversity: entropy of the distribution of citing topics.

30 Topical Longevity: Cited Half Life
Two views: (1) Given a paper, what is the median age of citations to that paper? (2) What is the median age of citations from current literature? Collaborative Filtering is young and fast moving. Maximum Entropy looks further back, but is still producing new work. The Neural Networks literature is aging.
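Both views reduce to a median over year differences. The sketch below assumes that the age of a citation is the citing year minus the cited paper's publication year; the function names are hypothetical.

```python
from statistics import median

def cited_half_life(cited_year, citing_years):
    """View 1: median age of the citations received by one paper."""
    return median(y - cited_year for y in citing_years)

def citing_half_life(current_year, cited_years):
    """View 2: median age of the references made by current literature."""
    return median(current_year - y for y in cited_years)
```

For a topic, the same functions can be applied to the pooled citations of all papers assigned to that topic.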

31 Topical Precedence
“Early-ness”: within a topic, what are the earliest papers that received more than n citations?
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953)
Spectrographic study of vowel reduction, B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974)
Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

32 Topical Precedence
“Early-ness”: within a topic, what are the earliest papers that received more than n citations?
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
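The precedence query is a filter followed by a sort. A minimal sketch, assuming each paper in a topic is available as a (title, year, citation_count) tuple; the function name and the cutoff `k` are illustrative.

```python
def topical_precedence(papers, n, k=5):
    """Return the k earliest papers in a topic with more than n citations.

    papers: list of (title, year, citation_count) tuples for one topic.
    """
    cited = [p for p in papers if p[2] > n]
    return sorted(cited, key=lambda p: p[1])[:k]
```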

34 H-factor
H = the maximum number K for which you have K papers, each with at least K citations. Also defined for journals [Braun et al, 2005].
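The definition translates directly into code. In this sketch the input is simply the list of citation counts for the papers of one author, journal, or topic; the function name is an assumption.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h
```

For a topical H-factor, `citation_counts` would hold the counts of the papers assigned to that topic, optionally restricted to a given year.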

35 Topical H-factor, Year 1990
Natural Language Parsing (16), Neural Networks (173), Speech Recognition (120), Hidden Markov Models (21), Genetic Algorithms (71), Optical Flow (48), Reinforcement Learning (83), Computer Vision (49), Mobile Robots (22), Word Sense Disambiguation (118), NLP (160), Planning (35), Markov Chain Monte Carlo (106), Maximum Likelihood Estimators (40), Genetic Algorithms (131), Genetic Programming (61)

36 Topical H-factor, Year 1995
Computer Vision (49), Speech Recognition (120), Decision Trees (146), Data Mining (176), Hidden Markov Models (21), Genetic Algorithms (71), Markov Chain Monte Carlo (106), IR And Queries (138), Word Sense Disambiguation (118), Web And VR (80), Natural Language Parsing (16), Bayesian Inference (110), Reinforcement Learning (83), Logic Programming (150), Mobile Robots (22), NLP (160)

37 Topical H-factor, Year 2001
Web Pages (129), Ontologies (186), SVMs (50), Computer Vision (49), Gene Expression (126), Data Mining (176), Dimensionality Reduction (29), Question Answering (111), Search Engines (132), Natural Language Parsing (16), Reinforcement Learning (83), Web Services (184), HCI (164), Hidden Markov Models (21), Word Sense Disambiguation (118), IR And Queries (138)

39 Topical Transfer
Transfer from Digital Libraries to other topics.
Other topic | Cit’s | Paper title
Web Pages | 31 | Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,
Computer Vision | 14 | On being ‘Undigital’ with digital cameras: extending the dynamic...
Video | 12 | Lessons learned from the creation and deployment of a terabyte digital video
Graphs | | Trawling the Web for Emerging Cyber-Communities
| 11 | WebBase: a repository of Web pages

40 Topical Transfer
Citation counts from one topic to another. Map “producers and consumers”.
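Counting citations from one topic to another amounts to building a sparse topic-by-topic matrix. The sketch below assumes each paper has already been assigned a single topic via a hypothetical `topic_of` mapping.

```python
from collections import Counter

def topical_transfer(citations, topic_of):
    """Count citations from each citing topic to each cited topic.

    citations: iterable of (citing_paper, cited_paper) pairs.
    topic_of:  dict mapping paper id to its topic label.
    Returns a Counter keyed by (citing_topic, cited_topic).
    """
    flow = Counter()
    for citing, cited in citations:
        flow[(topic_of[citing], topic_of[cited])] += 1
    return flow
```

Topics that appear mostly as the cited side of heavy flows act as "producers", and topics that appear mostly as the citing side act as "consumers".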

41 Where did the data come from?

42 Rexa System Overview
Inputs: WWW; NSF grant DB.
Pipeline, with the technology used at each stage:
Spider the Web for PDFs: home-grown Java+MySQL (~1m PDF/day).
Convert to text (with layout & format): enhanced ps2text (better word stitching, plus layout in XML).
Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy).
Reference resolution (of papers, authors & grants): discriminatively trained graph partitioning (competition-winning accuracy).
Topic analysis & other data mining.
Browsable Web interface.
Notes: We have built a system that begins with your Inbox as input, … when it finds evidence of new people on those pages, it recursively calls itself so that it can know about friends-of-friends-of-friends. More on the details of models used later…

43 IE from Research Papers
[McCallum et al ‘99]
@article{ kaelbling96reinforcement,
  author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore",
  title = "Reinforcement Learning: A Survey",
  journal = "Journal of Artificial Intelligence Research",
  volume = "4",
  pages = " ",
  year = "1996",
}
Some of you might know about Cora, a contemporary of CiteSeer. This is a system created by Kamal Nigam, Kristie Seymore, Jason Rennie and myself that mines the web for Computer Science research papers. You can view search results in terms of summaries from extracted data. Here you see the automatically extracted titles, authors, institutions and abstracts.

44 (Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
[Figure: finite state model and graphical model. Output sequence y(t-1) ... y(t+3) (FSM states), labeled OTHER, PERSON, OTHER, ORG, TITLE, …; input sequence x(t-1) ... x(t+3) (observations): said, Jones, a, Microsoft, VP, …]
Notes: A CRF is simply an undirected graphical model trained to maximize a conditional probability. First explorations with these models centered on finite state models, represented as linear-chain graphical models, with the joint probability distribution over the state sequence Y calculated as a normalized product over potentials on cliques of the graph. As is often traditional in NLP and other application areas, these potentials are defined to be a log-linear combination of weights on features of the clique values. The chief excitement from an application point of view is the ability to use rich and arbitrary features of the input without complicating inference. Yielded many good results, for example… Explosion of interest in many different conferences.
Wide-spread interest, positive experimental results in many applications:
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
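The equation on this slide did not survive in the transcript. For reference, the usual way to write a linear-chain CRF, consistent with the "normalized product over potentials" and "log-linear combination of weights on features" described in the notes, is given below. The symbols \lambda_k (weights) and f_k (feature functions) are the conventional notation and are assumed here, not recovered from the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\!\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because each feature f_k may inspect the whole input sequence x, rich and overlapping input features can be added without changing the chain structure over y that inference depends on.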

45 IE from Research Papers
Field-level F1:
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs) [Han, Giles, et al, 2003]: 89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]: 93.9 (error reduced by 40%)
(Word-level accuracy is >99%.)
HMMs have FSM sequence modeling, but are generative. SVMs are discriminative, but don’t do full inference on sequences.

46 Previous Systems
[Figure: a single entity type, Research Paper, with a Cites relation.]

49 More Entities and Relations
[Figure: entity-relationship diagram with labels Expertise, Cites, Grant, Research Paper, Person, Venue, University, Groups.]

68 Trends in 17 years of NIPS proceedings

69 Topic Distributions Conditioned on Time
[Figure: topic mass (shown as vertical height) plotted over time.]

70 Topic Correlations in PAM
5000 research paper abstracts, from across all of CS. Numbers on edges are the supertopics’ Dirichlet parameters.

