A Comparison of On-line Computer Science Citation Databases Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles


2 Motivation
- Autonomous databases have advantages over manually constructed ones
  - Easier maintenance
  - Lower cost
- Is it really an equivalent solution that is just cheaper?
- Does the automated acquisition introduce any bias?

3 Talk Overview
- Datasets
- Acquisition bias and models
- CS citation distribution
- Conclusions
- Future work

4 Datasets - DBLP
- DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors.
- Each entry is manually inserted by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.

5 Datasets - CiteSeer
- CiteSeer was created by Steve Lawrence and C. Lee Giles. It currently contains over 716,797 documents.
- In contrast to DBLP, each entry in CiteSeer is automatically entered from an analysis of documents found on the Web.

6 Datasets – Publication year
[Figure: publications per publication year, CiteSeer vs. DBLP]
- Declining CiteSeer maintenance
- Increased DBLP funding

7 Author bias
- CiteSeer papers have a higher average number of authors
- Both databases show growing team sizes

8 Author bias
- Crossover for low number of authors
- CiteSeer has a higher proportion of multi-author papers than DBLP (for number of authors <4)

9 Author bias
Hypothesis: "Papers with higher number of authors are more likely to be included in CiteSeer"
- Crawler suffers from acquisition bias due to
  - Submission
  - Crawling

10 Models - CiteSeer
- CiteSeer submission model
  - Probability of a document being submitted grows with its number of authors
  - Each author submits the publication with probability β
  - Probabilities are independent across coauthors

citeseer_s(i) = (1 - (1 - β)^i) * all(i)

where all(i) is the number of all publications with i authors.
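
A minimal sketch of the submission model, assuming each of the i coauthors independently submits the paper with probability β, so that the paper is included when at least one of them does. The function name and the value β = 0.2 are illustrative assumptions, not values from the talk.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Probability that at least one of num_authors coauthors submits the paper.

    Implements 1 - (1 - beta)^i from the submission model.
    """
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return 1.0 - (1.0 - beta) ** num_authors


beta = 0.2  # hypothetical per-author submission probability
for i in range(1, 6):
    # Inclusion probability rises with the number of authors.
    print(i, round(submission_inclusion_probability(i, beta), 4))
```

Multiplying this probability by all(i) gives the expected number of i-author papers in CiteSeer under the model.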

11 Models - CiteSeer
- CiteSeer crawler model
  - Probability of crawling a document grows with the number of its online copies
  - Probability of a document being online grows with its number of authors
  - Probabilities are independent between authors
  - Each author publishes the publication online with probability δ
  - Each online copy is found by the crawler with probability γ

citeseer_c(i) = (1 - (1 - γδ)^i) * all(i)

- Both models result in an equivalent type of bias
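
The crawler model's closed form can be checked with a small simulation. This is only a sketch under the assumption that each author independently puts a copy online with probability δ and that each online copy is independently found with probability γ; the function names and parameter values are hypothetical.

```python
import random


def crawler_inclusion_probability(num_authors: int, gamma: float, delta: float) -> float:
    """Closed form from the crawler model: 1 - (1 - gamma * delta)^i."""
    return 1.0 - (1.0 - gamma * delta) ** num_authors


def simulate_crawler(num_authors: int, gamma: float, delta: float,
                     trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the chance that at least one copy is crawled."""
    rng = random.Random(seed)
    included = 0
    for _ in range(trials):
        for _ in range(num_authors):
            online = rng.random() < delta            # author posts a copy
            found = online and rng.random() < gamma  # crawler finds that copy
            if found:
                included += 1
                break  # one crawled copy is enough for inclusion
    return included / trials


# Hypothetical parameters: gamma = 0.5, delta = 0.4, three authors.
print(crawler_inclusion_probability(3, 0.5, 0.4))
print(simulate_crawler(3, 0.5, 0.4))
```

Because γ and δ only ever appear as the product γδ, the crawler model has the same shape as the submission model with β = γδ, which is why both give the same type of bias.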

12 Coverage
- Can we estimate the coverage of DBLP?
- Can we estimate the coverage of CiteSeer?
- Can we estimate the size of the CS literature?
- We need a model of the DBLP acquisition method

13 Models - DBLP
- DBLP model
  - Publication included in DBLP with probability α
  - α is a parameter reflecting DBLP "coverage" of CS literature

dblp(i) = α * all(i)

14 Coverage
citeseer(i) = (1 - (1 - β)^i) * all(i)
dblp(i) = α * all(i)
r(i) = dblp(i) / citeseer(i)
r(i) = α / (1 - (1 - β)^i)
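
Since all(i) cancels in the ratio, α and β can in principle be estimated from the observed per-author-count ratios alone. The sketch below uses a plain grid-search least-squares fit on synthetic data; the fitting procedure, the function names and the data are assumptions for illustration and are not the method or data used in the talk.

```python
def ratio(num_authors: int, alpha: float, beta: float) -> float:
    """Model ratio r(i) = alpha / (1 - (1 - beta)^i)."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


def fit_alpha_beta(observed: dict, steps: int = 100) -> tuple:
    """Grid-search least squares over alpha, beta in (0, 1).

    observed maps number of authors i to the observed dblp(i) / citeseer(i).
    """
    best = None
    for a in range(1, steps):
        for b in range(1, steps):
            alpha, beta = a / steps, b / steps
            err = sum((ratio(i, alpha, beta) - r) ** 2 for i, r in observed.items())
            if best is None or err < best[0]:
                best = (err, alpha, beta)
    return best[1], best[2]


# Synthetic ratios generated from hypothetical alpha = 0.3, beta = 0.25.
synthetic = {i: ratio(i, 0.3, 0.25) for i in range(1, 9)}
print(fit_alpha_beta(synthetic))
```

As i grows, (1 - β)^i tends to 0 and r(i) approaches α, so the tail of the ratio curve is what pins down the DBLP coverage parameter.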

15 Results
r(i) = α / (1 - (1 - β)^i)
- α ≈ 0.3
- DBLP covers approx. 30% of CS literature
- CiteSeer covers approx. 40% of CS literature
- CS literature ≈ 2M publications
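
One plausible reading of how these figures relate, using the database sizes quoted on the dataset slides. The talk does not spell out the arithmetic, so the derivation below is an assumption.

```python
dblp_entries = 550_000    # DBLP size quoted on the DBLP dataset slide
citeseer_docs = 716_797   # CiteSeer size quoted on the CiteSeer dataset slide
alpha = 0.3               # fitted DBLP coverage from this slide

# If DBLP holds a fraction alpha of all CS papers, the total is dblp / alpha.
total_cs = dblp_entries / alpha
# CiteSeer coverage is then its size divided by that estimated total.
citeseer_coverage = citeseer_docs / total_cs

print(f"estimated CS literature: {total_cs:,.0f}")            # 1,833,333
print(f"estimated CiteSeer coverage: {citeseer_coverage:.0%}")  # 39%
```

Rounded, this matches the roughly 2M publications and roughly 40% CiteSeer coverage stated above.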

Citation distribution

17 Citation distribution
- Studied before
  - Follow a power law
  - Redner, Laherrere et al., Lehmann and others
  - Mostly physics community
- We use a subset of CiteSeer and DBLP papers that have citation information

18 Citation distribution
- Power law
- Sparse data for high number of citations

19 Citation distribution
Exponential binning
- Data aggregated in exponentially increasing 'bins'
- Equivalent to constant bins on a logarithmic scale
- Easier interpolation
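
A minimal sketch of exponential binning, assuming integer citation counts and bins of the form [base^k, base^(k+1)). Dividing each bin count by the bin width is a design choice made here so that bins of different widths stay comparable; the talk does not say how its bins were normalized, and the function name is hypothetical.

```python
from collections import Counter


def exponential_bins(citation_counts, base: int = 2):
    """Aggregate citation counts into bins [base^k, base^(k+1)).

    Returns a list of (lower, upper, papers_per_unit_width) tuples,
    ordered by increasing lower bound.
    """
    if base < 2:
        raise ValueError("base must be an integer >= 2")
    histogram = Counter()
    for count in citation_counts:
        if count < 1:
            continue  # zero-citation papers have no place on a log scale
        k = 0
        while base ** (k + 1) <= count:
            k += 1
        histogram[k] += 1
    result = []
    for k in sorted(histogram):
        lower, upper = base ** k, base ** (k + 1)
        result.append((lower, upper, histogram[k] / (upper - lower)))
    return result


print(exponential_bins([1, 2, 3, 4, 5, 6, 7, 8]))
```

With wide bins in the tail, the few highly cited papers are pooled together, which addresses the sparse data for high citation counts noted on the previous slide.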

20 Citation distribution
- Distribution of citations is more uneven in CS than in Physics
- Significant differences between DBLP and CiteSeer
[Table: slope by # citations (< and > ranges) for Lehmann, DBLP and CiteSeer]

21 Citation distribution
- CiteSeer contains fewer low-cited papers than DBLP
- No model yet
- Lawrence, "Online or invisible?" [12]

22 Conclusions - authors
- CiteSeer and DBLP have very different acquisition methods
- Significant bias against papers with a low number of authors (fewer than 4) in CiteSeer
- Single-author papers appear to be disadvantaged by the CiteSeer acquisition method
- Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias
  - Crawler model
  - Submission model

23 Conclusions - coverage
- A simple model of DBLP coverage predicts that DBLP covers approx. 30% of the entire Computer Science literature
- This gives a CiteSeer coverage of approx. 40% and a total of around 2M CS papers

24 Conclusions - citations
- CiteSeer and DBLP citation distributions are different
- Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics
- CiteSeer contains fewer low-cited papers

25 Future Work
- Repeat experiments on the most recent CiteSeer data
- Other methods to estimate Computer Science literature size and trends
  - Overlap of CiteSeer and DBLP
- Bias introduced by bibliography parsing
- Collaborative network analysis
- Connection to internet surveys?

Thank you

27 References
[1] arXiv e-print archive.
[2] Compuscience database, karlsruhe.de/COMP/quick.htm.
[3] CoRR.
[4] CS BibTeX database.
[5] DBLP.
[6] Scientific citation index.
[7] SPIRES high energy physics literature database.
[8] ScienceDirect digital library.
[9] P. Bailey, N. Craswell, and D. Hawking. Dark matter on the web. In Poster Proceedings of 9th International World Wide Web Conference. ACM Press.
[10] M. Batty. Citation geography: It's about location. The Scientist, 17(16).
[11] M. Batty. The geography of scientific citation. Environment and Planning A, 35:761–770.
[12] S. Lawrence. Online or invisible? Nature, 411(6837):521, 2001.