A Comparison of On-line Computer Science Citation Databases Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles

1 A Comparison of On-line Computer Science Citation Databases Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles v.petricek@cs.ucl.ac.uk http://www.cs.ucl.ac.uk/staff/V.Petricek

2 Motivation
• Autonomous databases have advantages over manually constructed ones
  - Easier maintenance
  - Lower cost
• Is it really an equivalent solution that is just cheaper?
• Does the automated acquisition introduce any bias?

3 Talk Overview
• Datasets
• Acquisition bias and models
• CS Citation Distribution
• Conclusions
• Future Work

4 Datasets - DBLP
• DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors.
• Each entry is manually inserted by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.

5 Datasets - CiteSeer
• CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains over 716,797 documents.
• In contrast, each entry in CiteSeer is automatically entered from an analysis of documents found on the Web.

6 Datasets – Publication year
[Figure: publications by publication year, CiteSeer vs. DBLP]
• Declining CiteSeer maintenance
• Increased DBLP funding

7 Author bias
• CiteSeer papers have a higher average number of authors
• Both databases show growing team sizes

8 Author bias
• Crossover for low number of authors
• CiteSeer has a higher proportion of multi-author papers than DBLP (for number of authors < 4)

9 Author bias
“Papers with a higher number of authors are more likely to be included in CiteSeer”
Hypothesis
• Crawler suffers from acquisition bias due to
  - Submission
  - Crawling

10 Models - CiteSeer
• CiteSeer submission model
• Probability of a document being submitted grows with number of authors
  - Publication submitted with probability β
  - Probabilities independent for coauthors
citeseer_s(i) = (1 - (1 - β)^i) * all(i)
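A minimal sketch of the submission model, assuming each of the i coauthors independently submits the paper with probability β, so the paper is included if at least one of them submits it. The function names and any numeric value of β passed in are illustrative and not taken from the talk.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Probability that at least one of num_authors coauthors submits the paper."""
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # (1 - beta)^i is the chance that none of the i coauthors submits.
    return 1.0 - (1.0 - beta) ** num_authors


def expected_citeseer_count(all_count: float, num_authors: int, beta: float) -> float:
    """citeseer_s(i) = (1 - (1 - beta)^i) * all(i) for papers with i authors."""
    return submission_inclusion_probability(num_authors, beta) * all_count
```

With any fixed β strictly between 0 and 1 the inclusion probability rises with the number of authors, which is the bias the model is meant to capture.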

11 Models - CiteSeer
• CiteSeer crawler model
  - Probability of crawling a document grows with number of its online copies
  - Probability of a document being online grows with number of authors
  - Probabilities independent between authors
  - Publication published online with probability δ
  - Publication found by crawler with probability γ
citeseer_c(i) = (1 - (1 - γδ)^i) * all(i)
• Both models result in equivalent type of bias
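The crawler model can be checked with a small simulation. This is a sketch under the slide's stated assumptions: each author independently puts a copy online with probability δ, and the crawler finds a given online copy with probability γ. The function names, trial count, and seed are hypothetical choices for illustration.

```python
import random


def crawler_inclusion_probability(num_authors: int, gamma: float, delta: float) -> float:
    """Closed form from the slide: 1 - (1 - gamma * delta)^i."""
    return 1.0 - (1.0 - gamma * delta) ** num_authors


def simulate_crawler(num_authors: int, gamma: float, delta: float,
                     trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the fraction of papers the crawler picks up."""
    rng = random.Random(seed)
    found = 0
    for _ in range(trials):
        for _ in range(num_authors):
            online = rng.random() < delta    # this author posted a copy
            crawled = rng.random() < gamma   # the crawler found that copy
            if online and crawled:
                found += 1
                break  # one discovered copy is enough for inclusion
    return found / trials
```

Because γ and δ only enter through their product, the crawler model has the same functional form as the submission model with β replaced by γδ, which is why both produce the same type of bias.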

12 Coverage
• Can we estimate the coverage of DBLP?
• Can we estimate the coverage of CiteSeer?
• Can we estimate the coverage of CS literature?
• We need a model of the DBLP acquisition method

13 Models - DBLP
• DBLP model
  - Publication included in DBLP with probability α
  - α is a parameter reflecting DBLP “coverage” of CS literature
dblp(i) = α * all(i)

14 Coverage
citeseer(i) = (1 - (1 - β)^i) * all(i)
dblp(i) = α * all(i)
r(i) = dblp(i) / citeseer(i)
r(i) = α / (1 - (1 - β)^i)
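Since all(i) cancels in the ratio, α and β can be estimated from the observed ratios r(i) alone. The talk does not say how the fit was done; the sketch below is one possible approach, assuming a least-squares fit with a grid search over β and a closed-form α for each candidate β. The function names and the grid are hypothetical.

```python
def ratio_model(num_authors: int, alpha: float, beta: float) -> float:
    """r(i) = alpha / (1 - (1 - beta)^i)."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


def fit_ratio(author_counts, observed_ratios, beta_grid=None):
    """Least-squares fit of (alpha, beta); assumed procedure, not the authors'."""
    if beta_grid is None:
        beta_grid = [k / 100 for k in range(1, 100)]  # 0.01 .. 0.99
    best = None
    for beta in beta_grid:
        # For fixed beta the model is linear in alpha: r(i) = alpha * g(i).
        g = [1.0 / (1.0 - (1.0 - beta) ** i) for i in author_counts]
        alpha = (sum(r * gi for r, gi in zip(observed_ratios, g))
                 / sum(gi * gi for gi in g))
        sse = sum((r - alpha * gi) ** 2 for r, gi in zip(observed_ratios, g))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta)
    _, alpha, beta = best
    return alpha, beta
```

Called with the per-author-count ratios dblp(i) / citeseer(i), `fit_ratio` returns the (α, β) pair with the smallest squared error on the grid.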

15 Results
r(i) = α / (1 - (1 - β)^i)
• α ≈ 0.3
• DBLP covers approx. 30% of CS literature
• CiteSeer covers approx. 40% of CS literature
• CS literature ≈ 2M publications
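The coverage figures follow from simple arithmetic. The sketch below assumes the inputs are the database sizes quoted on the dataset slides (over 550,000 DBLP references and 716,797 CiteSeer documents) and that every CiteSeer document counts as CS literature; the talk does not confirm either assumption, and the function name is hypothetical.

```python
def estimate_total_and_coverage(dblp_entries: float, citeseer_documents: float,
                                alpha: float):
    """Return (estimated size of CS literature, CiteSeer coverage fraction)."""
    # dblp = alpha * all  =>  all = dblp / alpha
    estimated_total = dblp_entries / alpha
    citeseer_coverage = citeseer_documents / estimated_total
    return estimated_total, citeseer_coverage


# Sizes from the dataset slides and alpha from the fit (assumed inputs).
total, coverage = estimate_total_and_coverage(550_000, 716_797, 0.3)
# total is about 1.83 million publications; coverage is about 0.39
```

Both values round to the figures on the slide: roughly 2M publications and roughly 40% CiteSeer coverage.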

16 Citation distribution

17 Citation distribution
• Studied before
• Follows a power law
• Redner, Laherrere et al., Lehmann and others
• Mostly physics community
• We use a subset of CiteSeer and DBLP papers that have citation information

18 Citation distribution
• Power law
• Sparse data for high number of citations

19 Citation distribution
Exponential binning
• Data aggregated in exponentially increasing ‘bins’
• Equivalent to constant bins on a logarithmic scale
• Easier interpolation
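A minimal sketch of exponential binning, assuming bin k covers the citation range [base^k, base^(k+1)) and that each bin's count is divided by its width so bins of different sizes are comparable. The base of 2, the width normalization, and the choice to skip uncited papers are assumptions for illustration; the talk does not give these details.

```python
def exponential_bins(citation_counts, base: float = 2.0):
    """Return a list of (lower, upper, density) for exponentially growing bins."""
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    counts = {}
    for c in citation_counts:
        if c < 1:
            continue  # zero citations cannot be placed on a logarithmic axis
        k = 0
        # Find k with base**k <= c < base**(k + 1) without floating-point logs.
        while base ** (k + 1) <= c:
            k += 1
        counts[k] = counts.get(k, 0) + 1
    bins = []
    for k in sorted(counts):
        lower, upper = base ** k, base ** (k + 1)
        bins.append((lower, upper, counts[k] / (upper - lower)))
    return bins
```

Wide bins at the high-citation end pool the sparse tail into fewer, better-populated points, which is what makes the tail of the distribution readable.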

20 Citation distribution
• Distribution of citations more uneven in CS than in Physics
• Significant differences between DBLP and CiteSeer
Slope of the citation distribution:
# citations   Lehmann   DBLP     CiteSeer
< 50          -1.29     -1.876   -1.504
> 50          -2.32     -3.509   -3.074
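Slopes like these are the exponents of a power law fitted separately below and above 50 citations. The talk does not state the estimation method; the sketch below assumes ordinary least squares on log-log axes, and the function names are hypothetical.

```python
import math


def loglog_slope(xs, ys) -> float:
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


def slopes_around_breakpoint(xs, ys, breakpoint: float = 50):
    """Fit one slope for x < breakpoint and another for x > breakpoint."""
    low = [(x, y) for x, y in zip(xs, ys) if x < breakpoint]
    high = [(x, y) for x, y in zip(xs, ys) if x > breakpoint]
    return loglog_slope(*zip(*low)), loglog_slope(*zip(*high))
```

Here `xs` would be representative citation counts per bin and `ys` the corresponding densities; a more negative slope means the share of papers falls off faster as citations grow.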

21 Citation distribution
• CiteSeer contains fewer low cited papers than DBLP
• No model yet
• Lawrence - “Online or invisible?”

22 Conclusions - authors
• CiteSeer and DBLP have very different acquisition methods
• Significant bias against papers with a low number of authors (fewer than 4) in CiteSeer
• Single-author papers appear to be disadvantaged with regard to the CiteSeer acquisition method
• Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias
  - Crawler model
  - Submission model

23 Conclusions - coverage
• Simple model of DBLP coverage predicts coverage of approx. 30% of the entire Computer Science literature.
• This gives us CiteSeer coverage of approx. 40% and a total number of CS papers around 2M

24 Conclusions - citations
• CiteSeer and DBLP citation distributions are different
• Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics.
• CiteSeer contains fewer low cited papers

25 Future Work
• Repeat experiments on most recent CiteSeer data
• Other methods to estimate Computer Science literature size and trends
  - Overlap of CiteSeer and DBLP
• Bias introduced by bibliography parsing
• Collaborative network analysis
• Connection to internet surveys?

26 Thank you

27 References
[1] arXiv e-print archive, http://arxiv.org/.
[2] CompuScience database, http://www.zblmath.fiz-karlsruhe.de/COMP/quick.htm.
[3] CoRR, http://xxx.lanl.gov/archive/cs/.
[4] CS BibTeX database, http://liinwww.ira.uka.de/bibliography/.
[5] DBLP, http://dblp.uni-trier.de/.
[6] Scientific citation index, http://www.isinet.com/products/citation/sci/.
[7] SPIRES high energy physics literature database, http://www.slac.stanford.edu/spires/hep/.
[8] ScienceDirect digital library, http://www.sciencedirect.com, 2003.
[9] P. Bailey, N. Craswell, and D. Hawking. Dark matter on the web. In Poster Proceedings of 9th International World Wide Web Conference. ACM Press, 2000.
[10] M. Batty. Citation geography: It’s about location. The Scientist, 17(16), 2003.
[11] M. Batty. The geography of scientific citation. Environment and Planning A, 35:761–770, 2003.
[12] S. Lawrence. Online or invisible? Nature, 411(6837):521, 2001.
