Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for Biotechnology Information (NCBI) National Library.

Presentation transcript:

Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for Biotechnology Information (NCBI) National Library of Medicine (NLM) National Institutes of Health (NIH) Sep. 4, 2014

U.S. National Center for Biotechnology Information

PubChem website

PubChem primary goal … to be an on-line resource providing comprehensive information on the biological activities of substances, where “substance” means any biologically testable entity: small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

PubChem data growth over ten years
[Chart panels: Contributors, Chemicals, Biological Assays, Bioactivity Results, Tested Chemicals, Protein Targets]
+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets
[M = millions, T = thousands, MLP = Molecular Libraries Program]

CAVEAT! All data has “errors”

Big data has “big errors”
Hypothetical: if your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy.
If you have one trillion facts (10^12), can you accept one million errors (10^6)?
Strategies to mitigate errors? Manual curation has its limits (accuracy, cost, time).
So … what do you do?
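The arithmetic behind this hypothetical can be checked directly. The following is a minimal sketch; the function name `expected_errors` is illustrative and not part of the presentation.

```python
from fractions import Fraction

def expected_errors(num_facts: int, error_rate: Fraction) -> Fraction:
    """Expected number of erroneous facts at a given average error rate."""
    return num_facts * error_rate

rate = Fraction(1, 1_000_000)             # 1 error per million facts
accuracy_percent = (1 - rate) * 100       # share of correct facts, in percent
errors = expected_errors(10**12, rate)    # one trillion facts

print(float(accuracy_percent))  # 99.9999
print(int(errors))              # 1000000
```

Exact fractions are used so that the result is not blurred by floating-point rounding.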

Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
2. Perform data normalization: improves utility by helping to refine identification
3. “Trust but verify”: cross compare authoritative and curated data
4. Consistency filtering: improves precision by removal of outliers
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
“There are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.” (Feb news briefing)
– Tautomers and resonance forms of the same chemical structure are prolific
– (+)-Iridodial, ring-closed and ring-open forms: defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina
– Salt-form drawing variations are common
– The chemical meaning of a substance may change with context

Error suppression strategies for scientific big data
2. Perform data normalization: improves utility by helping to refine identification
Verify chemical content
– Atoms defined/real
– Implicit hydrogen
– Functional group
– Atom valence sanity
Normalize representation
– Tautomer invariance
– Aromaticity detection
– Stereochemistry
– Explicit hydrogen
Calculate
– Coordinates
– Properties
– Descriptors
Detect components
– Isolate covalent units
– Neutralize (+/- proton)
– Reprocess
– Detect unique
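As a rough illustration of the "Detect components" step only, the sketch below splits a SMILES string into its dot-separated parts and drops exact duplicates. It is a simplification, not PubChem's actual pipeline: the function names are hypothetical, it assumes no ring-closure bond spans a dot, and it compares plain strings, so it only detects duplicates that are already written identically (a real system would canonicalize each component first).

```python
def covalent_units(smiles: str) -> list[str]:
    """Split a SMILES string into its disconnected (dot-separated) components."""
    return [part for part in smiles.split(".") if part]

def unique_units(smiles: str) -> list[str]:
    """Return components with exact-string duplicates removed, order preserved."""
    return list(dict.fromkeys(covalent_units(smiles)))

# An acetate ion with the sodium counter-ion written twice, for illustration.
print(unique_units("CC(=O)[O-].[Na+].[Na+]"))  # ['CC(=O)[O-]', '[Na+]']
```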

Error suppression strategies for scientific big data
3. “Trust but verify”: cross compare authoritative and curated data
Доверяй, но проверяй (doveryai, no proveryai): Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union
Or John Kerry’s more recent adaptation of the phrase when discussing Syria’s chemical weapons disposal: “Verify and verify”
[Table: cross concept count (%) between CTD, HDO, KEG, MED, NDF, and ORD; values not preserved]
Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD) using NLM Medical Subject Headings (MeSH) as the basis of comparison.
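A cross-comparison of this kind can be sketched as pairwise set overlap over a shared identifier. The data below is invented toy data (the identifiers `M1` to `M4` are placeholders, not real MeSH IDs), and `overlap_percent` is a hypothetical helper, not the method used to build the original table.

```python
from itertools import combinations

# Hypothetical toy data: each resource mapped to the shared identifiers it references.
resources = {
    "HDO": {"M1", "M2", "M3"},
    "MED": {"M2", "M3", "M4"},
    "CTD": {"M3", "M4"},
}

def overlap_percent(a: set[str], b: set[str]) -> float:
    """Percentage of concepts in `a` that also appear in `b`."""
    return 100.0 * len(a & b) / len(a) if a else 0.0

# Print the overlap for every pair of resources.
for x, y in combinations(sorted(resources), 2):
    print(x, y, round(overlap_percent(resources[x], resources[y]), 1))
```

Note that the measure is asymmetric: the share of CTD concepts found in MED need not equal the share of MED concepts found in CTD.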

Error suppression strategies for scientific big data
4. Consistency filtering: improves precision by removal of outliers
Keep consensus, remove the rest
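One simple way to "keep consensus, remove the rest" is a majority vote across sources. This is a minimal sketch under that assumption; the threshold, the function name `consensus`, and the example values are illustrative and not taken from the presentation.

```python
from collections import Counter
from typing import Optional

def consensus(values: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common value if more than `min_share` of sources agree, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > min_share else None

# Hypothetical: five sources report a molecular formula for the same record.
reports = ["C9H8O4", "C9H8O4", "C9H8O4", "C9H8O3", "C9H7O4"]
print(consensus(reports))  # C9H8O4
```

Returning `None` when no value clears the threshold leaves the record unresolved instead of picking an arbitrary winner.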

Error suppression strategies for scientific big data
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists
Prevent error proliferation at the data source, when possible
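The slide does not define the three lists, so the sketch below rests on an assumed reading: an “is” list holds the one authoritative association for a name, a “can be” list holds acceptable alternatives, and an “is not” list blocks associations known to be wrong. The function `classify`, the list contents, and the identifiers are all hypothetical.

```python
def classify(name: str, structure: str,
             is_list: dict[str, str],
             can_be: dict[str, set[str]],
             is_not: dict[str, set[str]]) -> str:
    """Classify an incoming name-to-structure claim against the three lists."""
    if structure in is_not.get(name, set()):
        return "reject"                 # known-bad association, stop it spreading
    if is_list.get(name) == structure:
        return "accept"                 # matches the authoritative association
    if structure in can_be.get(name, set()):
        return "accept-as-alternative"  # permitted, but not the preferred form
    return "review"                     # unknown claim, needs a closer look

# Placeholder identifiers S1..S3 stand in for chemical structures.
is_list = {"drug-x": "S1"}
can_be = {"drug-x": {"S2"}}
is_not = {"drug-x": {"S3"}}
print(classify("drug-x", "S3", is_list, can_be, is_not))  # reject
```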


Okay … now what? … you have cleaned up your data … but it is huge, unwieldy, unstructured. How can it be made more useful?

Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
2. Establish similarity schemes: enables identification of related records
3. Associate to concept hierarchies: improves navigation between related records
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details

Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
[Diagram: record types (Compound, Substance, Protein, Gene, Drug, Publication, Patent, Disease, Pathway) connected by labeled links (cites, inhibit, encode, ingredient, treat, associates, participates)]
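Crosslinks of this kind can be held as typed links and indexed in both directions, so that any record leads to its associated records. The sketch below is illustrative only: which link label joins which pair of record types is an assumption, and the record identifiers are placeholders.

```python
from collections import defaultdict

# Hypothetical typed links; the pairing of labels to record types is assumed.
links = [
    ("compound:C1", "inhibit", "protein:P1"),
    ("gene:G1", "encode", "protein:P1"),
    ("publication:PUB1", "cites", "compound:C1"),
]

def build_index(links):
    """Index links in both directions so any record can reach its associated records."""
    index = defaultdict(set)
    for source, label, target in links:
        index[source].add((label, target))
        index[target].add((label, source))
    return index

index = build_index(links)
print(sorted(index["protein:P1"]))  # [('encode', 'gene:G1'), ('inhibit', 'compound:C1')]
```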

Data organization strategies for scientific big data
2. Establish similarity schemes: enables identification of related records
Example: Vioxx
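The slide does not say which similarity scheme is meant. A common choice for chemical structures is the Tanimoto coefficient over fingerprint bits, shown here as a minimal sketch; the fingerprints, record names, and the 0.5 threshold are invented for illustration.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical fingerprints: each set lists the bit positions that are switched on.
query = {1, 4, 7, 9}
candidates = {"rec-A": {1, 4, 7, 8}, "rec-B": {2, 3}, "rec-C": {1, 4, 7, 9}}

# Keep the records whose similarity to the query reaches the threshold.
related = sorted(name for name, fp in candidates.items()
                 if tanimoto(query, fp) >= 0.5)
print(related)  # ['rec-A', 'rec-C']
```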

Data organization strategies for scientific big data
3. Associate to concept hierarchies: improves navigation between related records
Match to concept
Independent hierarchy = chemical, protein, gene, patent, publication, pathway, …
Organized records
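Matching records to a concept hierarchy lets a query at a broad concept collect records attached to narrower ones. The following sketch assumes a simple single-parent hierarchy; the concept names, record names, and helper functions are hypothetical.

```python
# Hypothetical child -> parent concept hierarchy (names chosen for illustration).
parent = {
    "COX-2 inhibitor": "anti-inflammatory agent",
    "anti-inflammatory agent": "drug",
}
# Hypothetical match of each record to one concept.
record_concept = {"rec-1": "COX-2 inhibitor", "rec-2": "anti-inflammatory agent"}

def ancestors(concept: str) -> list[str]:
    """Walk from a concept up to the root, returning every broader concept."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def records_under(concept: str) -> list[str]:
    """Records matched to `concept` or to any narrower concept."""
    return sorted(r for r, c in record_concept.items()
                  if c == concept or concept in ancestors(c))

print(records_under("anti-inflammatory agent"))  # ['rec-1', 'rec-2']
```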

Data organization strategies for scientific big data
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details
“subject-predicate-object”: “atorvastatin may treat hypercholesterolemia”
subject = atorvastatin, predicate = may treat, object = hypercholesterolemia
Provenance information: evidence citation (PMID); from whom? (data source)
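A subject-predicate-object statement with provenance can be modeled as a small record, and data reduction as collapsing repeated statements while keeping who said it and on what evidence. This is a sketch under those assumptions; the class, the function, the PMID placeholders, and the source names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    pmid: str      # evidence citation (placeholder values below)
    source: str    # data source that made the assertion

def reduce_statements(statements):
    """Collapse repeated triples into one entry that keeps all provenance."""
    merged = {}
    for s in statements:
        key = (s.subject, s.predicate, s.obj)
        merged.setdefault(key, set()).add((s.pmid, s.source))
    return merged

statements = [
    Statement("atorvastatin", "may treat", "hypercholesterolemia", "PMID-1", "source-A"),
    Statement("atorvastatin", "may treat", "hypercholesterolemia", "PMID-2", "source-B"),
]
merged = reduce_statements(statements)
print(len(merged))  # 1
```

The presentation layer can then show the single succinct statement and reveal the provenance set only on request.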


Concluding remarks
Scientific “big data” …
… contains an amazing amount of information
… provides opportunities to make discoveries
… benefits from strategies to massage it
PubChem is doing its part …
… making chemical substance data broadly accessible
… cross-integrating it to key scientific resources
… suppressing errors and their propagation
… organizing the data and making it available

PubChem Crew … Steve Bryant, Tiejun Chen, Gang Fu, Lewis Geer, Renata Geer, Asta Gindulyte, Volker Hahnke, Lianyi Han, Jane He, Siqian He, Sunghwan Kim, Ben Shoemaker, Paul Thiessen, Jiyao Wang, Yanli Wang, Bo Yu, Jian Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris

Any questions? If you think of one later, email me.