Download presentation
Presentation is loading. Please wait.
Published byJoshua Eustace Thompson Modified over 8 years ago
1
BioCaddie Pilot Project 3.2 Development of Citation and Data Access Metrics applied to RCSB Protein Data Bank and related Resources Chun-Nan Hsu Department of Biomedical Informatics UC San Diego 1
2
RCSB Protein Data Bank (PDB) PDB assigns each protein structure a PDB ID and their corresponding primary citations 2
3
Citing PDB publications 3 PDB original debut paper
4
Citing PDB and it’s entry 4 Cites PDB repository by URL Cites a PDB entry with the primary citation Cites a PDB entry by PDB ID
5
Questions to answer Does a new PDB publication by any of the wwPDB members attract more citations? Do PDB users refer to PDB URLs more often than citing PDB publications? How many use both? How does data usage statistics correlate to paper citations and URL mentions? Does the above results apply to other data repository, say, UniProt? Do PDB users mention to PDB IDs in their paper more often than citing the primary citation papers? How is the PDB entry statistically dependent to the data citation by paper or ID mention? What are the co-citation and co-mention patterns of the PDB entries? Are these two kinds of patterns consistent to each other? 5
6
New PDB publication does not attract more citations. 6 The paper citation result seems to match the well-documented Matthew effect in science, which states that the rich get richer and the poor get poorer in terms of citations.
7
Authors increasingly refer to PDB URLs more often than citing PDB publications. 7
8
How does data usage statistics correlate to paper citations and URL mentions? 8
9
Short Summary 1.Authors still prefer citing the original PDB debut paper to citing follow-up papers. 2.The number of authors citing PDB by URL mentioning is growing rapidly. 3.The impact of PDB URL mentioning, however, is still lower than that of PDB follow-up papers collectively. 4.PDB website access statistics and URL mentions are highly correlated. 5.Correlations between PDB data usage statistics and PDB paper citations are not as high, though PDB FTP access seems to correlate with paper citations in early years. 9
10
Does the above results apply to other data repository, say, UniProt? 10
11
UniProt Paper citations vs. URL mentions 11
12
IdentifierExampleMachine Readable Mentions (*) % PDB IDPDB ID: 1STPY14,8884.8 PDB DOIhttp://dx.doi.org/10.2210/pdb1stp/pdby1550.05 External Link Tag y32,10810 PDB File Name1stp.pdby8950.03 PDB URLhttp://www.rcsb.org/../structureId=1stpy, but URL may change 6570.2 Non-standard PDB ID PDB code: 1STP, PDB reference 1STP, PDB accession number 1STP, Many variations … y/n22,0817.1 PDB in ContextWe employed the following PDB coordinates: glycogen phosphorylase, 1gpy … y/n with NLP or ML16,7265.4 Free TextWe first placed S2 bound to human PI3KC; (3ene) into the reference coordinates … y/n with NLP or ML221,28772 (incl. many false positives!) * Preliminary data, includes duplicate mentions within same article Citing entries in PDB
13
PDB Identifier is not unique Currency: Each participant received Ksh 200 (1USD = +/-75Ksh) as Year: were abruptly interrupted in 1914 with the (example of an integer PDB ID!) Postal code: 385 Euston Road, London, NW1 3AUT, UK Room number: 110 Irving Street NW, Room 2A56, Washington Floating point number: 1E10, 1D10 Grant type: Parent study (NIH R01 NR04749; NIH 2R01 NR04749). Catalog number: selective detergent method kit (ultra HDL) cat no. 3K33 supplied by Chemical formulas: ellipsoid plot of Zn(H2O)2(C5H5N3O2)2 2NO3 at the 50% Chemical name: Glycolysis under anaerobic condition produces 2ATP per molecule Gene name: The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene Antibody: The primary detection antibody was unlabelled Mab 4B11 Technique: were subjected to 2D-gel electrophoresis (2DGE) Technique: when the recommendations of the NMR and 3DEM VTFs are Instrument: using an Olympus Inverted Microscope (Olympus 1X71, Tokyo, Japan) Instrument: were obtained with Hamamatsu C5810 color chilled 3CCD camera Software: involved in base-pairing as computed by the 3DNA program Software: domain definitions from SCOP, CATH, DALI, 3DEE, and MMDB are 13 Identifier needs a prefix to minimize ambiguities Tagging in text document will further disambiguate identifier
14
PDB users use PDB IDs in their paper much less often than citing the primary citation papers. 14 The growth of the depositions of new PDB entries
15
Highly cited PDB entries differ from highly mentioned entries 15
16
Frequency of PDB ID mentions only moderately statistically correlated to the frequency of primary paper citations. 16 The growth of Pearson correlation coefficient
17
The co-citation and co-mention networks of the PDB entries are quite different. 17 Doc-1 Doc-2 Doc-3 Doc-A Doc-B Co-citation / Co-mention degree is defined as the frequency with which two documents are cited / mentioned together by other documents. We say that the co-citation / co-mention degree of Doc-A and Doc-B is 3. Cites / Mentions
18
Co-citation and co-mention patterns of the categories of PDB entries appear similar. 18 Co-citation degree between top cited categories Co-mention degree between top cited categories
19
Key findings Authors mainly follow a traditional paper citation paradigm for data citation; URL and ID mentions are relatively few but start to pick up the pace Data citations by paper citations and ID mentions show similarity when considering data categories but they are dissimilar at the level of individual entries Inconsistent data citation practices make it relatively difficult to measure impacts of data consistently 19
20
Contributors Peter W. Rose (RCSB PDB) Yi-Hung Huang (National Taiwan University) Cathy W. Wu, Cecilia Neomi Arighi, Ruoyao Ding (UniProt, University of Delaware) 20
21
21 Thank you for your attention. Funding: The project is supported in part by Grant U24AI117966 National Institutes of Health Big Data to Knowledge (BD2K) Initiative to PWR and CNH and by Ministry of Science and Technology, Taiwan under Grant MOST 103-2911-I-002-001, and National Taiwan University-Intel Corporation NTU-ICRP-104R7501 and NTU-ICRP-104R7501-1 to YHH. PWR was in part supported by the RCSB PDB grant from the National Science Foundation NSF DBI-1338415; National Institute of General Medical Sciences (NIGMS); Office of Science, Department of Energy (DOE); National Library of Medicine (NLM); National Cancer Institute (NCI); National Institute of Neurological Disorders and Stroke (NINDS); and National Institute of Diabetes & Digestive & Kidney Diseases (NIDDK). Intel-NTU Connected Context Computing Center provided support in the form of a salary for author YHH, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of this authors is articulated in the “author contributions” section.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.