BioCaddie Pilot Project 3.2: Development of Citation and Data Access Metrics Applied to RCSB Protein Data Bank and Related Resources. Chun-Nan Hsu, Department of Biomedical Informatics, UC San Diego.


Presentation transcript:

BioCaddie Pilot Project 3.2: Development of Citation and Data Access Metrics Applied to RCSB Protein Data Bank and Related Resources
Chun-Nan Hsu, Department of Biomedical Informatics, UC San Diego

RCSB Protein Data Bank (PDB)
PDB assigns each protein structure a PDB ID and links it to its corresponding primary citation.

Citing PDB publications
PDB original debut paper

Citing PDB and its entries
Cites the PDB repository by URL
Cites a PDB entry with the primary citation
Cites a PDB entry by PDB ID

Questions to answer
1. Does a new PDB publication by any of the wwPDB members attract more citations?
2. Do PDB users refer to PDB URLs more often than they cite PDB publications? How many use both?
3. How do data usage statistics correlate with paper citations and URL mentions?
4. Do the above results apply to other data repositories, say, UniProt?
5. Do PDB users mention PDB IDs in their papers more often than they cite the primary citation papers?
6. For a PDB entry, how statistically dependent are data citations by paper and by ID mention?
7. What are the co-citation and co-mention patterns of the PDB entries? Are these two kinds of patterns consistent with each other?

A new PDB publication does not attract more citations.
The paper citation result seems to match the well-documented Matthew effect in science, which states that the rich get richer and the poor get poorer in terms of citations.

Authors increasingly refer to PDB URLs more often than they cite PDB publications.

How do data usage statistics correlate with paper citations and URL mentions?

Short Summary
1. Authors still prefer citing the original PDB debut paper to citing follow-up papers.
2. The number of authors citing PDB by URL mention is growing rapidly.
3. The impact of PDB URL mentions, however, is still lower than that of PDB follow-up papers collectively.
4. PDB website access statistics and URL mentions are highly correlated.
5. Correlations between PDB data usage statistics and PDB paper citations are not as high, though PDB FTP access seems to correlate with paper citations in early years.

Do the above results apply to other data repositories, say, UniProt?

UniProt: paper citations vs. URL mentions

Citing entries in PDB
Identifier | Example | Machine Readable | Mentions (*) | %
PDB ID | PDB ID: 1STP | y | 14, |
PDB DOI | /pdb1stp/pdb | y | |
External Link Tag | | y | 32,108 | 10
PDB File Name | 1stp.pdb | y | |
PDB URL | http:// | but URL may change | |
Non-standard PDB ID | PDB code: 1STP, PDB reference 1STP, PDB accession number 1STP, many variations … | y/n | 22, |
PDB in Context | We employed the following PDB coordinates: glycogen phosphorylase, 1gpy … | y/n with NLP or ML | 16, |
Free Text | We first placed S2 bound to human PI3KC; (3ene) into the reference coordinates … | y/n with NLP or ML | 221,287 | 72 (incl. many false positives!)
* Preliminary data; includes duplicate mentions within the same article.

PDB identifiers are not unique
Currency: Each participant received Ksh 200 (1USD = +/-75Ksh) as
Year: were abruptly interrupted in 1914 with the (example of an integer PDB ID!)
Postal code: 385 Euston Road, London, NW1 3AUT, UK
Room number: 110 Irving Street NW, Room 2A56, Washington
Floating point number: 1E10, 1D10
Grant type: Parent study (NIH R01 NR04749; NIH 2R01 NR04749).
Catalog number: selective detergent method kit (ultra HDL) cat no. 3K33 supplied by
Chemical formulas: ellipsoid plot of Zn(H2O)2(C5H5N3O2)2 2NO3 at the 50%
Chemical name: Glycolysis under anaerobic condition produces 2ATP per molecule
Gene name: The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene
Antibody: The primary detection antibody was unlabelled Mab 4B11
Technique: were subjected to 2D-gel electrophoresis (2DGE)
Technique: when the recommendations of the NMR and 3DEM VTFs are
Instrument: using an Olympus Inverted Microscope (Olympus 1X71, Tokyo, Japan)
Instrument: were obtained with Hamamatsu C5810 color chilled 3CCD camera
Software: involved in base-pairing as computed by the 3DNA program
Software: domain definitions from SCOP, CATH, DALI, 3DEE, and MMDB are
An identifier needs a prefix to minimize ambiguities.
Tagging in the text document will further disambiguate identifiers.
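The ambiguity can be illustrated with a minimal Python sketch. It assumes a PDB ID is one digit followed by three alphanumeric characters, and it treats a small set of prefixes (PDB ID, PDB code, PDB reference, PDB accession number) as the disambiguating context. The function names and the prefix list are hypothetical choices for illustration, not the method used in the project.

```python
import re

# Assumption: a PDB ID is a digit followed by three alphanumeric characters.
BARE_ID = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")

# Assumption: illustrative prefixes only; real text contains many more variations.
PREFIXED_ID = re.compile(
    r"\bPDB\s+(?:ID|code|reference|accession\s+number)\s*:?\s*"
    r"([0-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_bare_candidates(text):
    """Return every token that merely looks like a PDB ID (many false positives)."""
    return [m.group(0).upper() for m in BARE_ID.finditer(text)]


def find_prefixed_ids(text):
    """Return only the IDs that follow an explicit PDB prefix."""
    return [m.group(1).upper() for m in PREFIXED_ID.finditer(text)]


if __name__ == "__main__":
    sentence = "using an Olympus 1X71 microscope in 1914; see PDB ID: 1STP"
    print(find_bare_candidates(sentence))  # includes 1X71 and 1914 as false positives
    print(find_prefixed_ids(sentence))     # keeps only the prefixed ID
```

Matching on the bare pattern alone picks up instrument models and years, while requiring a prefix keeps only the explicit mention, at the cost of missing unprefixed IDs.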

PDB users mention PDB IDs in their papers much less often than they cite the primary citation papers.
The growth of depositions of new PDB entries

Highly cited PDB entries differ from highly mentioned entries.

The frequency of PDB ID mentions is only moderately correlated with the frequency of primary paper citations.
The growth of the Pearson correlation coefficient
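For reference, the Pearson correlation coefficient between two per-entry count series can be computed as in the sketch below. The function name and the sample counts are hypothetical and do not come from the project data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-entry counts: ID mentions vs. primary paper citations.
    id_mentions = [3, 0, 12, 5, 1]
    paper_citations = [40, 2, 35, 60, 8]
    print(round(pearson_r(id_mentions, paper_citations), 3))
```

A value near 1 would mean entries that are mentioned often by ID are also cited often through their primary paper; a moderate value means the two measures rank entries differently.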

The co-citation and co-mention networks of the PDB entries are quite different.
Co-citation / co-mention degree is defined as the frequency with which two documents are cited / mentioned together by other documents.
[Diagram: Doc-1, Doc-2, and Doc-3 each cite / mention both Doc-A and Doc-B.] We say that the co-citation / co-mention degree of Doc-A and Doc-B is 3.
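The definition of co-citation / co-mention degree can be sketched in Python as follows. The input format (a mapping from each citing document to the documents it cites or mentions) and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_citation_degrees(citing_to_cited):
    """Count, for every pair of documents, how many citing documents cite both.

    citing_to_cited maps a citing document to an iterable of the documents
    it cites (or mentions). Pairs are stored in sorted order.
    """
    degrees = Counter()
    for cited in citing_to_cited.values():
        # Deduplicate so repeated mentions in one document count once.
        for pair in combinations(sorted(set(cited)), 2):
            degrees[pair] += 1
    return degrees


if __name__ == "__main__":
    # The example from the slide: three documents each cite Doc-A and Doc-B.
    citations = {
        "Doc-1": ["Doc-A", "Doc-B"],
        "Doc-2": ["Doc-A", "Doc-B"],
        "Doc-3": ["Doc-A", "Doc-B"],
    }
    print(co_citation_degrees(citations)[("Doc-A", "Doc-B")])  # 3
```

The same function serves for co-mention degree when the mapping lists the PDB IDs mentioned in each article instead of the primary papers it cites.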

Co-citation and co-mention patterns of the categories of PDB entries appear similar.
[Figure panels: Co-citation degree between top cited categories; Co-mention degree between top cited categories]

Key findings
1. Authors mainly follow a traditional paper citation paradigm for data citation; URL and ID mentions are relatively few but are starting to pick up pace.
2. Data citations by paper citation and by ID mention are similar at the level of data categories but dissimilar at the level of individual entries.
3. Inconsistent data citation practices make it relatively difficult to measure the impact of data consistently.

Contributors
Peter W. Rose (RCSB PDB)
Yi-Hung Huang (National Taiwan University)
Cathy W. Wu, Cecilia Neomi Arighi, Ruoyao Ding (UniProt, University of Delaware)

Thank you for your attention.
Funding: The project is supported in part by Grant U24AI, National Institutes of Health Big Data to Knowledge (BD2K) Initiative, to PWR and CNH, and by the Ministry of Science and Technology, Taiwan, under Grant MOST I, and National Taiwan University-Intel Corporation NTU-ICRP-104R7501 and NTU-ICRP-104R to YHH. PWR was in part supported by the RCSB PDB grant from the National Science Foundation NSF DBI; National Institute of General Medical Sciences (NIGMS); Office of Science, Department of Energy (DOE); National Library of Medicine (NLM); National Cancer Institute (NCI); National Institute of Neurological Disorders and Stroke (NINDS); and National Institute of Diabetes & Digestive & Kidney Diseases (NIDDK). Intel-NTU Connected Context Computing Center provided support in the form of a salary for author YHH, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of this author is articulated in the “author contributions” section.