
1 N-gram Topic Models for Bibliometric Analysis
Gideon Mann, David Mimno, and Andrew McCallum
Can topic models provide better measurements of the impact of research literature?

2 Bibliometrics and Scientometrics
Typically analyzes patterns of citations in research literature
Derek de Solla Price: “Little Science, Big Science”
Eugene Garfield: Science Citation Index, Journal Citation Reports

3 Comparing apples to apples: top journals by citations
Biochemistry and molecular biology:
  J. Biol. Chem          405017
  Cell                   136472
  Biochem.-US             96809
Mathematics:
  Lect. Notes Math         6926
  T. Am. Math. Soc         6469
  J. Math. Anal. Appl.     6004
Source: Journal Citation Reports (2004)

4 What’s wrong with grouping by journal?
10 of the 200 most cited papers in CiteSeer are unpublished technical reports; 15% of the most cited papers are from conference proceedings.
Open-access publication is increasing, but venue information is often not available.
Hand-entered ISI citation data is noisy.
An article has only one venue, but journals cover many topics.

5 A topic model for N-grams
Determine whether the next word will be part of an n-gram based on the current word and the current hidden topic.
“White house” is a collocation in politics, but may not be one in real estate.
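
The decision described on this slide can be illustrated with a minimal sketch. It is not the authors' model: the real model infers these quantities from data, whereas here the function name `continues_ngram`, the lookup table `collocation_prob`, and the probabilities in `toy_probs` are all hypothetical and made up for illustration.

```python
import random

def continues_ngram(current_word, topic, collocation_prob, rng):
    """Decide whether the next word joins an n-gram with current_word.

    collocation_prob maps (topic, word) to a hypothetical probability that
    the following word continues a phrase; missing pairs default to 0.0.
    """
    p = collocation_prob.get((topic, current_word), 0.0)
    # rng.random() is in [0.0, 1.0), so p = 1.0 always continues
    # and p = 0.0 never does.
    return rng.random() < p

# Illustrative, made-up probabilities (not estimates from any corpus).
toy_probs = {
    ("politics", "white"): 1.0,
    ("real estate", "white"): 0.0,
}
```

With these toy values, "white" always starts a phrase under the politics topic and never does under the real estate topic, which mirrors the "white house" example.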

6 Sample n-gram topics
1. Digital Libraries (102): digital, electronic, library, metadata, access; “digital libraries”, “digital library”, “electronic commerce”, “dublin core”, “cultural heritage”
2. WWW (129): web, site, pages, page, www, sites; “world wide web”, “web pages”, “web sites”, “web site”, “world wide”
3. Ontologies (186): semantic, ontology, ontologies, rdf, semantics, meta; “semantic web”, “description logics”, “rdf schema”, “description logic”, “resource description framework”
4. Web services (184): web, services, service, xml, business; “web services”, “web service”, “markup language”, “xml documents”, “xml schema”

7 Assigning topics to documents
1. Build a 200-topic n-gram topic model on 300k documents.
2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”).
3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t.
Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
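
Step 3 above can be sketched as follows. This is a minimal illustration that assumes each document is already represented as a list of per-token topic ids produced by the topic model; the function name `assign_topics` and its parameters are hypothetical, not taken from the authors' code.

```python
from collections import Counter

def assign_topics(token_topics, excluded_topics=frozenset(),
                  min_fraction=0.10, min_tokens=2):
    """Return the set of topics a document is assigned to.

    token_topics: list with one topic id per token in the document.
    excluded_topics: stopword or methodological topics removed in step 2.
    A topic is kept when it covers more than min_fraction of the tokens
    and more than min_tokens tokens in absolute terms.
    """
    counts = Counter(token_topics)
    total = len(token_topics)
    assigned = set()
    for topic, n in counts.items():
        if topic in excluded_topics:
            continue
        if n > min_tokens and n > min_fraction * total:
            assigned.add(topic)
    return assigned
```

For a 20-token document with 5 tokens in topic 1, 2 tokens in topic 2, and 13 tokens in topic 3, `assign_topics` keeps topics 1 and 3; topic 2 reaches 10% but fails the "more than two tokens" condition.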

8 Impact Factor
Journal Impact Factor: citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.
2004 impact factors from JCR:
  Nature              32.182
  Cell                28.389
  JMLR                 5.952
  Machine Learning     3.258
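
The definition above can be written as a short computation. This is a sketch under assumed data layouts: citations as `(citing_year, cited_venue, cited_year)` tuples and articles as `(venue, pub_year)` tuples. Both layouts and the function name `impact_factor` are hypothetical. Because topics can stand in for journals, `venue` may equally be a topic id.

```python
def impact_factor(citations, articles, venue, year):
    """Impact factor of `venue` for `year`.

    citations: iterable of (citing_year, cited_venue, cited_year) tuples.
    articles:  iterable of (venue, pub_year) tuples.
    Counts citations made in `year` to items the venue published in the
    two preceding years, divided by the number of such items.
    """
    window = {year - 2, year - 1}
    cited = sum(
        1 for citing_year, cited_venue, cited_year in citations
        if citing_year == year and cited_venue == venue and cited_year in window
    )
    published = sum(
        1 for v, pub_year in articles if v == venue and pub_year in window
    )
    if published == 0:
        raise ValueError("venue published nothing in the two-year window")
    return cited / published
```

On toy data with two in-window articles that receive three citations in the target year, the function returns 1.5.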

9 Topic Impact Factor

10 Broad Impact: Diffusion
Journal Diffusion: number of journals citing Cell divided by the total number of citations to Cell over a given time period, times 100.
Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
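
The brittleness is easy to see in a small sketch. It assumes the citations to one journal or topic are given as a list with one entry per citation, naming the citing journal or topic; the function name `diffusion` is hypothetical.

```python
def diffusion(citing_sources):
    """Diffusion: distinct citing sources per citation, times 100.

    citing_sources: list with one entry per received citation, holding
    the journal (or topic) the citation came from.
    """
    if not citing_sources:
        raise ValueError("no citations in the chosen time period")
    return 100.0 * len(set(citing_sources)) / len(citing_sources)
```

Two citations from two different sources give `diffusion(["A", "B"]) == 100.0`, the maximum possible value, while ten citations split between the same two sources give 20.0.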

11 Broad Impact: Diversity
Topic Diversity: entropy of the distribution of citing topics.
Better than diffusion at capturing the broad end of the impact spectrum: the highest-diffusion topics are identical to the least frequently cited topics.

12 Broad Impact: Diversity
Topic Diversity: entropy of the distribution of citing topics.
Topic diversity can also be measured for papers:
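
Topic diversity as defined on these slides is the Shannon entropy of the citing-topic distribution. The sketch below assumes one topic id per incoming citation and uses base-2 logarithms; the base and the function name `topic_diversity` are assumptions, since the slides do not specify them. The same function applies to a single paper if the list holds the topics of the documents citing that paper.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics.

    citing_topics: list with one topic id per incoming citation.
    Returns 0 for an empty list or when all citations share one topic.
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    # p * log2(1/p) is the entropy contribution of a topic with share p.
    return sum(
        (n / total) * math.log2(total / n) for n in counts.values()
    )
```

Unlike diffusion, this value does not jump to its maximum at low counts: two citations from two topics yield 1 bit, while citations spread evenly over four topics yield 2 bits.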

13 Longevity: Cited Half Life
Two views:
Given a paper, what is the median age of citations to that paper?
What is the median age of citations from current literature?
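
Both views reduce to a median over citation ages. The sketch below assumes that "age" means the difference in publication years between the citing and cited documents; the function names `cited_half_life` and `citing_half_life` are hypothetical.

```python
from statistics import median

def cited_half_life(cited_year, citing_years):
    """Median age of the citations received by one paper.

    cited_year: publication year of the cited paper.
    citing_years: publication years of the documents citing it.
    """
    if not citing_years:
        raise ValueError("paper has no citations")
    return median(year - cited_year for year in citing_years)

def citing_half_life(citing_year, cited_years):
    """Median age of the references made by current literature.

    citing_year: publication year of the citing documents.
    cited_years: publication years of the works they cite.
    """
    if not cited_years:
        raise ValueError("no outgoing citations")
    return median(citing_year - year for year in cited_years)
```

For a paper from 2000 cited in 2001, 2002, and 2005, `cited_half_life` returns 2.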

14 History: Topical Precedence
Within a topic, what are the earliest papers that received more than n citations?
Information Retrieval (138):
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
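
The question on this slide amounts to a filter followed by a sort. The sketch below assumes each paper in a topic is a `(title, year, citation_count)` tuple; that layout, the function name `topical_precedence`, and the cutoff `k` on the number of papers returned are hypothetical.

```python
def topical_precedence(papers, min_citations, k=5):
    """Earliest papers in a topic with more than min_citations citations.

    papers: iterable of (title, year, citation_count) tuples for one topic.
    Returns at most k qualifying papers, oldest first.
    """
    qualifying = [p for p in papers if p[2] > min_citations]
    # sorted() is stable, so papers from the same year keep input order.
    return sorted(qualifying, key=lambda p: p[1])[:k]
```

With placeholder papers `[("A", 1990, 3), ("B", 1975, 50), ("C", 1960, 1), ("D", 1982, 12)]` and `min_citations=10`, the result is B (1975) followed by D (1982); the older paper C is dropped because it falls below the citation threshold.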

15 Sharing: Topical Transfer

