Bibliometric Impact Measures Leveraging Topic Analysis Gideon Mann David Mimno Andrew McCallum Computer Science Department University of Massachusetts.

Presentation transcript:

Bibliometric Impact Measures Leveraging Topic Analysis Gideon Mann David Mimno Andrew McCallum Computer Science Department University of Massachusetts Amherst

Goal: Measure the impact of papers and research subfields. Important for: Researchers understanding their own field. Libraries deciding which journals to purchase. Personnel committees deciding on hiring, promotion, awards.

Typical Impact Measures: Citation Count; Garfield’s Journal Impact Factor

Why are topical divisions useful in bibliometrics? Can you compare the tallest building in NY to the tallest building in Stamford, CT?
Citation counts (Source: Journal Citation Reports, 2004):
Biochemistry and molecular biology: J. Biol. Chem; Cell; Biochem.-US 96809
Mathematics: Lect. Notes Math 6926; T. Am. Math. Soc 6469; J. Math. Anal. Appl. 6004


Why not use Journal as a proxy for Topic? Journals not necessarily about one topic. Topics may not have their own journal. Open access publishing on the rise. 5% of the 200 most-cited papers in CiteSeer are tech reports! Spidered web documents often do not include venue information.

This Paper: Discovering fine-grained, interpretable topics from text. 8 impact measures leveraging topics. Analysis on 1.5 million research papers and their citations.
Talk Outline: Topical N-Grams, a phrase-discovering enhancement to LDA. A quick tour of 8 impact measures with examples. Where did we get all this data from? An introduction to Rexa, a new little sibling of CiteSeer, Google Scholar, etc.

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative Process: For each document: Sample a distribution over topics, θ. For each word in doc: Sample a topic, z. Sample a word from the topic, w.
Example: θ = 70% Iraq war, 30% US election; z = Iraq war; w = “bombing”
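The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two topics reuse the slide's "Iraq war" and "US election" labels, but every word other than "bombing", all probabilities, and the Dirichlet parameter `alpha` are made-up assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index i with probability probs[i]."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(topic_word, alpha, n_words, rng):
    """LDA generative story: theta per document, then (z, w) per word."""
    theta = sample_dirichlet(alpha, rng)      # distribution over topics
    doc = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)    # sample a topic
        words, probs = zip(*topic_word[z].items())
        w = words[sample_categorical(probs, rng)]  # sample a word from it
        doc.append((z, w))
    return theta, doc

# Hypothetical topics; only "bombing" appears on the slide.
TOPICS = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},   # "Iraq war"
    {"ballot": 0.4, "candidate": 0.4, "poll": 0.2},    # "US election"
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, doc)
```

Inference runs this story in reverse: given only the words, estimate the topic assignments and the per-document topic mixtures.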

Inference and Estimation. Gibbs Sampling: - Easy to implement - Reasonably fast
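To make "easy to implement" concrete, here is a sketch of the collapsed Gibbs sampler commonly used for LDA. The slide names only Gibbs sampling, so the collapsed variant, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions for illustration.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)
    assignments = []

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Add the new assignment back.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word

# Toy corpus over a 4-word vocabulary (word ids 0..3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
assignments, doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

Each update touches only three count tables, which is why the method is simple to write and scales reasonably well.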

Example topics induced from a large collection of text [Tenenbaum et al]
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE


Topics Modeling Multi-word Phrases: Topics based only on unigrams are sometimes difficult to interpret. Topic discovery itself is confused because important meaning and distinctions are carried by phrases.

A Topic Comparison
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function

Topical N-gram Model [Wang, McCallum 2005] (Figure: graphical model with topic variables z1 to z4, words w1 to w4, and variables y1 to y4, with plates labeled T, D and W.)

Features of Topical N-Grams model
Easily trained by Gibbs sampling
– Can run efficiently on millions of words
Topic-specific phrase discovery
– “white house” has special meaning as a phrase in the politics topic,
– ... but not in the real estate topic.

Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Our Data for This JCDL Paper
1.6 million research papers
– mostly in Computer Science
– 400k of them with full text
14 fields of meta-data from
– “headers” at top of papers
– “citations” in References section
automatically extracted with 99% accuracy.
Reference resolution performed on 4 million citations.

Example Results on our Corpus
Step 1: Run LDA on 1.6 million papers. Use topic analysis to select a subset of AI: ML, NLP, robotics, vision, etc. (Sample LDA topics)
Step 2: Run Topical N-grams on the ~300k papers in the subset. (Sample Topical N-gram topics)

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.

Impact Measures Leveraging Topics: 1. Topical Citation count 2. Topical Impact factor 3. Topical Diffusion 4. Topical Diversity 5. Topical Half-life 6. Topical Precedence 7. Topical H-factor 8. Topical Transfer


Topical Citation Count

Impact Factor. Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in , divided by the number of articles published in Cell in . Impact factors from JCR: Nature; Cell; JMLR 5.952; Machine Learning 3.258
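Substituting topic for journal, a topical impact factor can be sketched as below. The window of two preceding years follows the usual Journal Citation Reports convention and is exposed as a parameter because the years are missing from the slide text. Assigning each paper to a single topic, the data layout, and the paper ids are assumptions made for illustration.

```python
def topical_impact_factor(papers, citations, topic, year, window=2):
    """Citations made in `year` to the topic's papers from the preceding
    `window` years, divided by the number of those papers.

    papers:    dict paper_id -> (topic, publication_year)
    citations: iterable of (citing_id, cited_id) pairs
    """
    target_years = range(year - window, year)
    targets = {pid for pid, (t, y) in papers.items()
               if t == topic and y in target_years}
    if not targets:
        return 0.0
    cites = sum(1 for citing, cited in citations
                if cited in targets and papers[citing][1] == year)
    return cites / len(targets)

# Hypothetical toy data.
papers = {
    "a": ("RL", 2002), "b": ("RL", 2003), "c": ("RL", 2004),
    "d": ("Vision", 2004), "e": ("RL", 2001),
}
citations = [("c", "a"), ("c", "b"), ("d", "a"), ("c", "e"), ("b", "a")]

rl_2004 = topical_impact_factor(papers, citations, "RL", 2004)
print(rl_2004)  # 3 qualifying citations / 2 target papers
```

Only citations made in the census year count, so the 2003 citation from "b" to "a" and the citation to the 2001 paper "e" are both ignored.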

Topical Impact Factor over time


Broad Impact: Diffusion Journal Diffusion: # of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100 Problem: Relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
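The brittleness at low citation counts is easy to see in code. This sketch applies the diffusion definition above to topics instead of journals; the topic names and counts are invented for illustration.

```python
def diffusion(citing_topics):
    """Distinct citing topics per 100 citations.

    citing_topics: one entry per incoming citation, naming the citing topic.
    """
    if not citing_topics:
        return 0.0
    return 100.0 * len(set(citing_topics)) / len(citing_topics)

# Cited twice, by two different topics: maximal diffusion.
rarely_cited = diffusion(["Vision", "NLP"])

# Cited 100 times across three topics: low diffusion despite broad reach.
widely_cited = diffusion(["Vision"] * 40 + ["NLP"] * 40 + ["Robotics"] * 20)

print(rarely_cited, widely_cited)
```

Because the citation total sits in the denominator, a barely cited topic can outscore one that is heavily cited across several fields.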

Broad Impact: Diversity. Topic Diversity: Entropy of the distribution of citing topics. Diffusion: These are just the least cited topics! Diversity: Better at capturing broad end of impact spectrum.

Broad Impact: Diversity, for papers. Topic Diversity: Entropy of the distribution of citing topics
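Topic diversity as entropy can be sketched as follows. The slides do not state the logarithm base, so base 2 (bits) is an assumption, and the topic labels are placeholders.

```python
import math
from collections import Counter

def diversity(citing_topics):
    """Shannon entropy, in bits, of the distribution of citing topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probs)

two_topics = diversity(["A", "B"])                    # uniform over 2 topics
four_topics = diversity(["A", "B", "C", "D"] * 25)    # uniform over 4 topics
one_topic = diversity(["A"] * 50)                     # all from one topic

print(two_topics, four_topics, one_topic)
```

Unlike diffusion, the score depends only on how citations are spread across topics, so it does not reward a topic merely for being rarely cited.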


Topical Longevity: Cited Half Life Two views: Given a paper, what is the median age of citations to that paper? What is the median age of citations from current literature? Collaborative Filtering is young, fast moving. Maximum Entropy looks further back, but is still producing new work. Neural Networks literature is aging.
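Both views reduce to a median over citation ages, sketched below. The difference is which citations are collected: all citations to one paper or topic, or all citations made by the current literature. The years are invented for illustration.

```python
from statistics import median

def median_citation_age(year_pairs):
    """Median of (citing year - cited year) over a set of citations.

    year_pairs: list of (citing_year, cited_year) tuples.
    """
    return median([citing - cited for citing, cited in year_pairs])

# View 1: citations received by a hypothetical 1998 paper.
to_paper = [(1999, 1998), (2000, 1998), (2000, 1998),
            (2003, 1998), (2005, 1998)]
cited_half_life = median_citation_age(to_paper)

# View 2: references made by hypothetical 2005 papers in one topic.
from_current = [(2005, 2004), (2005, 2003), (2005, 1995), (2005, 2001)]
citing_half_life = median_citation_age(from_current)

print(cited_half_life, citing_half_life)
```

A small value in the second view suggests a young, fast-moving topic; a large one suggests a literature that looks further back.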

Topical Precedence Within a topic, what are the earliest papers that received more than n citations? “Early-ness” Speech Recognition: Some experiments on the recognition of speech, with one and two ears, E. Colin Cherry (1953) Spectrographic study of vowel reduction, B. Lindblom (1963) Automatic Lipreading to enhance speech recognition, Eric D. Petajan (1965) Effectiveness of linear prediction characteristics of the speech wave for..., B. Atal (1974) Automatic Recognition of Speakers from Their Voices, B. Atal (1976)

Topical Precedence Within a topic, what are the earliest papers that received more than n citations? “Early-ness” Information Retrieval: On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960) Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968) Relevance feedback in information retrieval, Rocchio (1971) Relevance feedback and the optimization of retrieval effectiveness, Salton (1971) New experiments in relevance feedback, Ide (1971) Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
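The precedence query above amounts to a filter followed by a sort. The sketch below assumes each paper carries a single topic label, a year, and a citation count; the record layout and values are hypothetical.

```python
def topical_precedence(papers, topic, n, top=5):
    """Earliest papers in `topic` that received more than n citations."""
    qualifying = [p for p in papers
                  if p["topic"] == topic and p["citations"] > n]
    return sorted(qualifying, key=lambda p: p["year"])[:top]

# Hypothetical records, not the corpus used in the talk.
papers = [
    {"id": "p1", "topic": "IR", "year": 1971, "citations": 40},
    {"id": "p2", "topic": "IR", "year": 1960, "citations": 25},
    {"id": "p3", "topic": "IR", "year": 1955, "citations": 3},
    {"id": "p4", "topic": "Speech", "year": 1953, "citations": 90},
    {"id": "p5", "topic": "IR", "year": 1968, "citations": 12},
]

earliest = topical_precedence(papers, "IR", n=10, top=2)
print([p["id"] for p in earliest])
```

The threshold n filters out early papers that never gathered attention, such as "p3" here.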


H-factor H = maximum number K for which you have K papers, each with at least K citations....for journals [Braun et al, 2005]
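The definition translates directly into code. For a topical H-factor, the input would be the citation counts of the papers assigned to one topic, which is an assumption about how topic membership is decided; the counts below are invented.

```python
def h_factor(citation_counts):
    """Largest K such that K papers each have at least K citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

topic_counts = [10, 8, 5, 4, 3]   # hypothetical citation counts
h = h_factor(topic_counts)
print(h)
```

Sorting in descending order means the loop can stop at the first paper whose count falls below its rank.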

Topical H-factor Natural Language Parsing (16) Neural Networks (173) Speech Recognition (120) Hidden Markov Models (21) Genetic Algorithms (71) Optical Flow (48) Reinforcement Learning (83) Computer Vision (49) Mobile Robots (22) Word Sense Disambiguation (118) NLP (160) 35 8 Planning (35) Markov Chain Monte Carlo (106) 40 8 Maximum Likelihood Estimators (40) Genetic Algorithms (131) 61 7 Genetic Programming (61) Year 1990

Topical H-factor Year Computer Vision (49) Speech Recognition (120) Decision Trees (146) Data Mining (176) Hidden Markov Models (21) Genetic Algorithms (71) Markov Chain Monte Carlo (106) IR And Queries (138) Word Sense Disambiguation (118) Web And VR (80) Natural Language Parsing (16) Bayesian Inference (110) Reinforcement Learning (83) Logic Programming (150) Mobile Robots (22) NLP (160)

Topical H-factor Year Web Pages (129) Ontologies (186) SVMs (50) Computer Vision (49) Gene Expression (126) Data Mining (176) Dimensionality Reduction (29) Question Answering (111) Search Engines (132) Natural Language Parsing (16) Reinforcement Learning (83) Web Services (184) HCI (164) Hidden Markov Models (21) Word Sense Disambiguation (118) IR And Queries (138)


Topical Transfer: Transfer from Digital Libraries to other topics
Other topic | Cit’s | Paper Title
Web Pages | 31 | Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,
Computer Vision | 14 | On being ‘Undigital’ with digital cameras: extending the dynamic...
Video | 12 | Lessons learned from the creation and deployment of a terabyte digital video
Graphs | 12 | Trawling the Web for Emerging Cyber-Communities
Web Pages | 11 | WebBase: a repository of Web pages

Topical Transfer Citation counts from one topic to another. Map “producers and consumers”
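A producer/consumer map is a table of citation counts keyed by (cited topic, citing topic). The sketch below assumes one topic per paper and skips within-topic citations; both choices, along with the paper ids, are illustrative assumptions.

```python
from collections import Counter

def topical_transfer(paper_topic, citations):
    """Count citations flowing from a producer topic to a consumer topic.

    paper_topic: dict paper_id -> topic
    citations:   iterable of (citing_id, cited_id) pairs
    Returns a Counter keyed by (producer_topic, consumer_topic).
    """
    transfer = Counter()
    for citing, cited in citations:
        producer, consumer = paper_topic[cited], paper_topic[citing]
        if producer != consumer:          # keep only cross-topic citations
            transfer[(producer, consumer)] += 1
    return transfer

# Hypothetical papers and citations.
paper_topic = {
    "d1": "Digital Libraries", "d2": "Digital Libraries",
    "w1": "Web Pages", "w2": "Web Pages", "v1": "Computer Vision",
}
citations = [("w1", "d1"), ("w2", "d1"), ("v1", "d2"),
             ("d2", "d1"), ("d1", "w1")]

transfer = topical_transfer(paper_topic, citations)
print(transfer.most_common())
```

Reading a row of this table shows which topics consume a producer's work; reading a column shows where a consumer draws its ideas from.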

Where did the data come from?

Rexa System Overview
– WWW
– Spider Web for PDFs: Home-grown Java+MySQL (~1m PDF/day)
– Convert to text (with layout & format): Enhanced ps2text (better word stitching, plus layout in XML)
– Extract metadata (title, authors, abstract, venue, citations; 14 fields in total): Conditional Random Fields (99% word accuracy)
– Reference resolution (of papers, authors & grants): Discriminatively trained graph partitioning (competition-winning accuracy); NSF grant DB
– Topic Analysis & other Data Mining
– Browsable Web Interface

IE from Research Papers [McCallum et al kaelbling96reinforcement, author = "Leslie Pack Kaelbling and Michael L. Littman and Andrew P. Moore", title = "Reinforcement Learning: A Survey", journal = "Journal of Artificial Intelligence Research", volume = "4", pages = " ", year = "1996",

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
(Figure: finite state model and graphical model views, with FSM states y t-1, y t, y t+1, y t+2, y t+3, ... over observations x t-1, x t, x t+1, x t+2, x t+3, ...)
input seq: … said Jones a Microsoft VP …
output seq: … OTHER PERSON OTHER ORG TITLE …
Wide-spread interest, positive experimental results in many applications: Noun phrase, Named entity [HLT’03], [CoNLL’03]. Protein structure prediction [ICML’04]. IE from Bioinformatics text [Bioinformatics ‘04],… Asian word segmentation [COLING’04], [ACL’04]. IE from Research papers [HLT’04]. Object classification in images [CVPR ‘04].
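The slide's equation did not survive in the transcript. For reference, the standard form of a linear-chain CRF from the literature is given below; the symbols (feature functions f_k, weights λ_k, normalizer Z) follow common usage and are not necessarily the notation on the original slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\!\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Training chooses the weights λ_k to maximize this conditional probability over labeled sequences, which is the objective the slide describes.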

IE from Research Papers, Field-level F1:
Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
Δ error: 40% (Word-level accuracy is >99%)
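One reading of the 40% figure, assuming it is the relative reduction in field-level error (100 minus F1) from the SVM result to the CRF result, works out as follows.

```latex
\frac{(100 - 89.7) - (100 - 93.9)}{100 - 89.7}
  = \frac{10.3 - 6.1}{10.3}
  = \frac{4.2}{10.3}
  \approx 0.41
```

Under that assumption the CRF removes roughly 40% of the errors left by the SVM baseline.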

Previous Systems: Research Paper, Cites

More Entities and Relations: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise

access, access control, digital library, digital, digital libraries

Summary Demonstrated a new topic discovery method (Topical N-Grams) on 1.6m research papers. Presented 8 impact measures based on topics. Introduced Rexa, a showcase for our research on information extraction, coreference and data mining. is publicly available now. Try it out! Feedback appreciated.

Neural Information Processing Conference Dataset 1740 Papers Unique words 2,301,375 Words Volumes 0-12 Spanning 1987 – Prepared by Sam Roweis.

Trends in 17 years of NIPS proceedings

Topic Distributions Conditioned on Time (Figure: horizontal axis is time; topic mass is shown as vertical height.)

Finding Topics in 1 million CS papers 200 topics & keywords automatically discovered.

Topic Correlations in PAM: 5000 research paper abstracts, from across all CS. Numbers on edges are supertopics’ Dirichlet parameters.