Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for Biotechnology Information (NCBI) National Library.


1 Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for Biotechnology Information (NCBI) National Library of Medicine (NLM) National Institutes of Health (NIH) Sep. 4, 2014

2 http://www.nlm.nih.gov/

3 U.S. National Center for Biotechnology Information https://www.ncbi.nlm.nih.gov/

4 https://pubchem.ncbi.nlm.nih.gov/ PubChem website

5 PubChem primary goal: to be an on-line resource providing comprehensive information on the biological activities of substances, where “substance” means any biologically testable entity: small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

6 PubChem data growth over ten years
Contributors: +280 substance contributors, +60 assay contributors
Chemicals: +150M substances, +50M compounds
Biological assays: +1.0M bioassays
Protein targets: +6.1T protein targets
Tested chemicals: +2.9M tested substances, +2.0M tested compounds
Bioactivity results: +225M bioactivity result sets
[M = millions, T = thousands, MLP = Molecular Libraries Program]

7 CAVEAT! All data has “errors”

8 Big data has “big errors”
Hypothetical:
If your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy
If you have one trillion facts (10^12), can you accept one million errors (10^6)?
Strategies to mitigate errors?
Manual curation has its limits (accuracy, cost, time)
So... what do you do?
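Working the hypothetical through: an error rate of 1 in 1,000,000 applied to 10^12 facts gives 10^6 expected errors, and corresponds to 99.9999% accuracy. The sketch below only restates that arithmetic; the function names are illustrative and not from the talk.

```python
from fractions import Fraction


def expected_errors(n_facts: int, error_rate: Fraction) -> Fraction:
    """Expected number of erroneous facts at a given per-fact error rate."""
    return n_facts * error_rate


def accuracy_percent(error_rate: Fraction) -> float:
    """Accuracy, in percent, implied by a per-fact error rate."""
    return float((1 - error_rate) * 100)


rate = Fraction(1, 1_000_000)          # 1 error per million facts
errors = expected_errors(10**12, rate)  # one trillion facts

print(errors)                  # 1000000
print(accuracy_percent(rate))  # 99.9999
```

Exact fractions are used so the result is not blurred by floating-point rounding.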

9 Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
2. Perform data normalization: improves utility by helping to refine identification
3. “Trust but verify”: cross compare authoritative and curated data
4. Consistency filtering: improves precision by removal of outliers
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

10 Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
“there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know” (Donald Rumsfeld, Feb. 2002 news briefing)
Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld
Tautomers and resonance forms of the same chemical structure are prolific
[Figure: (+)-Iridodial, ring-closed and ring-open forms. Defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina]
Salt-form drawing variations are common
Chemical meaning of a substance may change upon context

11 Error suppression strategies for scientific big data
2. Perform data normalization: improves utility by helping to refine identification
Verify chemical content
- Atoms defined/real
- Implicit hydrogen
- Functional group
- Atom valence sanity
Normalize representation
- Tautomer invariance
- Aromaticity detection
- Stereochemistry
- Explicit hydrogen
Calculate
- Coordinates
- Properties
- Descriptors
Detect components
- Isolate covalent units
- Neutralize (+/- proton)
- Reprocess
- Detect unique
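The "Detect components" step can be illustrated with a very small sketch. It assumes structures arrive as SMILES strings, where "." separates fragments that are not covalently bonded, and it treats two fragments as identical only if their strings match exactly. That is a simplification: a real pipeline would canonicalize each fragment with a cheminformatics toolkit first, and would also handle neutralization, which is not attempted here. The function names and example records are hypothetical.

```python
def covalent_units(smiles: str) -> list[str]:
    """Split a SMILES string into its disconnected components.

    Assumption: no ring-closure digits are shared across a '.', which
    would actually bond the two fragments together.
    """
    return [part for part in smiles.split(".") if part]


def unique_units(records: list[str]) -> set[str]:
    """Collect the distinct components seen across all records.

    Uniqueness is by exact string match, so inputs are assumed to be
    canonical SMILES already.
    """
    units: set[str] = set()
    for smiles in records:
        units.update(covalent_units(smiles))
    return units


# Illustrative salt forms: sodium acetate, potassium acetate, sodium chloride.
records = ["CC(=O)[O-].[Na+]", "CC(=O)[O-].[K+]", "[Na+].[Cl-]"]
print(sorted(unique_units(records)))
```

Three deposited records reduce to four distinct units, with the acetate anion and the sodium cation each recognized as shared.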

12 Error suppression strategies for scientific big data
3. “Trust but verify”: cross compare authoritative and curated data
Доверяй, но проверяй (doveryai, no proveryai): Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union
Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan
or John Kerry’s more recent adaptation of the phrase when discussing Syria’s chemical weapons disposal: “Verify and verify”
Image credit: http://en.wikipedia.org/wiki/John_Kerry
Cross concept count %
       CTD    HDO    KEG    MED    NDF    ORD
CTD  100.0   14.3   79.1   40.7   49.7   35.8
HDO   26.0  100.0   38.7   52.4   48.3   26.2
KEG   24.8    6.7  100.0   10.7    6.4   25.2
MED   97.2   68.9   81.6  100.0   93.8   79.6
NDF   30.4   16.3   12.5   24.0  100.0   10.8
ORD   31.9   12.8   71.6   29.7   15.7  100.0
Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD), using NLM Medical Subject Headings (MeSH) as the basis of comparison.
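A cross-reference overlap table of this kind could be computed roughly as follows, assuming each resource has been reduced to a set of shared identifiers. The slide does not say which resource serves as the denominator in each cell; this sketch uses the row resource, which is an assumption and easy to swap. The resource names and identifiers below are placeholders, not real MeSH terms.

```python
def overlap_percent(resources: dict[str, set[str]]) -> dict[str, dict[str, float]]:
    """Return table[a][b] = percent of a's concepts that also appear in b."""
    table: dict[str, dict[str, float]] = {}
    for a, ids_a in resources.items():
        table[a] = {}
        for b, ids_b in resources.items():
            shared = len(ids_a & ids_b)
            table[a][b] = round(100.0 * shared / len(ids_a), 1) if ids_a else 0.0
    return table


# Placeholder data: resource "A" maps to four concepts, "B" to two of them.
resources = {
    "A": {"m1", "m2", "m3", "m4"},
    "B": {"m2", "m3"},
}
table = overlap_percent(resources)
print(table["A"]["B"])  # 50.0
print(table["B"]["A"])  # 100.0
```

The result is asymmetric by construction, which matches the shape of the table on the slide: a small resource can be fully covered by a large one without the reverse being true.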

13 Error suppression strategies for scientific big data
4. Consistency filtering: improves precision by removal of outliers
Keep consensus, remove the rest
Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png
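"Keep consensus, remove the rest" can be sketched as a majority vote across sources. The strict-majority threshold, the function name, and the example claims are assumptions chosen for illustration; the talk does not specify how consensus is decided.

```python
from collections import Counter
from typing import Optional


def consensus(values: list[str], min_fraction: float = 0.5) -> Optional[str]:
    """Return the value asserted by more than `min_fraction` of sources.

    Returns None when no value reaches that share, so the record is
    dropped rather than guessed.
    """
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > min_fraction else None


# Illustrative claims: molecular formulas reported by different sources.
claims = {
    "aspirin": ["C9H8O4", "C9H8O4", "C9H8O4", "C9H8O3"],  # one outlier
    "caffeine": ["C8H10N4O2", "C8H10N4O3"],               # no majority
}
kept = {name: consensus(vals) for name, vals in claims.items()}
print(kept)
```

The trade-off is the one the slide names: precision improves because outliers are discarded, at the cost of losing records where the sources disagree evenly.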

14 Error suppression strategies for scientific big data
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists
Prevent error proliferation at the data source, when possible
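One possible reading of the “is”, “can be”, and “is not” lists is as three lookup sets consulted before an incoming name-to-structure assertion is accepted. The slide does not describe the mechanism, so the precedence order, the outcome labels, and the example pairs below are all assumptions.

```python
def triage(pair: tuple[str, str],
           is_list: set[tuple[str, str]],
           can_be_list: set[tuple[str, str]],
           is_not_list: set[tuple[str, str]]) -> str:
    """Classify a (name, structure) assertion against curated lists.

    Assumed precedence: an explicit "is not" entry always wins, so a
    known error cannot be reintroduced by a later submission.
    """
    if pair in is_not_list:
        return "reject"
    if pair in is_list:
        return "accept"
    if pair in can_be_list:
        return "accept-with-flag"
    return "review"


# Placeholder entries; the structure labels are not real identifiers.
is_list = {("aspirin", "structure-1")}
can_be_list = {("aspirin", "structure-1-salt")}
is_not_list = {("aspirin", "structure-2")}

for candidate in [("aspirin", "structure-1"), ("aspirin", "structure-2"),
                  ("aspirin", "structure-3")]:
    print(candidate, triage(candidate, is_list, can_be_list, is_not_list))
```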

15 Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
2. Perform data normalization: improves utility by helping to refine identification
3. “Trust but verify”: cross compare authoritative and curated data
4. Consistency filtering: improves precision by removal of outliers
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

16 Okay … now what? … you have cleaned up your data … but it is huge, unwieldy, unstructured How can it be made more useful?

17 Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
2. Establish similarity schemes: enables identification of related records
3. Associate to concept hierarchies: improves navigation between related records
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details

18 Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
[Figure: crosslink diagram. Record types: Compound, Substance, Protein, Gene, Drug, Publication, Patent, Disease, Pathway. Link labels: cites, inhibit, encode, ingredient, treat, associates, participates]
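Crosslinks like these can be held as labeled edges between record types and indexed for lookup. The transcript preserves the node and link names but not which link joins which pair, so the four edges below are assumed for illustration only.

```python
from collections import defaultdict

# Hypothetical edges: the pairing of relations to record types is a guess.
links = [
    ("Compound", "inhibit", "Protein"),
    ("Gene", "encode", "Protein"),
    ("Drug", "treat", "Disease"),
    ("Publication", "cites", "Compound"),
]


def build_neighbors(triples):
    """Index each record type by everything it is linked to, in either direction."""
    neighbors = defaultdict(set)
    for source, _relation, target in triples:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return neighbors


neighbors = build_neighbors(links)
print(sorted(neighbors["Protein"]))   # ['Compound', 'Gene']
print(sorted(neighbors["Compound"]))  # ['Protein', 'Publication']
```

Indexing in both directions is what lets a reader start from any record and reach its context, which is the stated purpose of crosslinking.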

19 Data organization strategies for scientific big data
2. Establish similarity schemes: enables identification of related records
[Figure: example using Vioxx]
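The slide does not name a similarity measure. As one common choice for comparing chemical structures, the sketch below uses the Tanimoto coefficient over sets of feature identifiers (shared features divided by total distinct features). The fingerprints, record names, and threshold are made up for illustration.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient: |a & b| / |a | b|, in the range 0.0 to 1.0."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def related(query: set[int], library: dict[str, set[int]],
            threshold: float) -> list[str]:
    """Names of library records at least `threshold` similar to the query."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query, fp) >= threshold)


# Placeholder fingerprints: each integer stands for one structural feature.
library = {
    "record-1": {1, 2, 3, 4},
    "record-2": {1, 2, 3, 5},
    "record-3": {7, 8, 9},
}
query = {1, 2, 3, 4}
print(related(query, library, threshold=0.5))  # ['record-1', 'record-2']
```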

20 Data organization strategies for scientific big data
3. Associate to concept hierarchies: improves navigation between related records
[Figure: records (chemical, protein, gene, patent, publication, pathway, …) are matched to concepts in an independent hierarchy, giving organized records]
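Matching records to an independent hierarchy can be sketched with a child-to-parent map plus a record-to-concept map. The three-level hierarchy and record assignments below are a toy example, not taken from any specific classification system.

```python
def ancestors(concept: str, parent: dict[str, str]) -> list[str]:
    """Walk from a concept up to the root, nearest ancestor first."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def records_under(concept: str, parent: dict[str, str],
                  record_concept: dict[str, str]) -> list[str]:
    """Records matched to `concept` or to any concept beneath it."""
    return sorted(
        record for record, matched in record_concept.items()
        if matched == concept or concept in ancestors(matched, parent)
    )


# Toy hierarchy (child -> parent) and record-to-concept matches.
parent = {
    "statins": "lipid-lowering agents",
    "lipid-lowering agents": "cardiovascular agents",
}
record_concept = {
    "atorvastatin": "statins",
    "simvastatin": "statins",
    "ezetimibe": "lipid-lowering agents",
}
print(records_under("lipid-lowering agents", parent, record_concept))
```

Because the hierarchy is kept separate from the records, the same records can be reorganized under a different hierarchy without being modified.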

21 Data organization strategies for scientific big data
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details
“subject-predicate-object”: “atorvastatin may treat hypercholesterolemia”
- subject: atorvastatin
- predicate: may treat
- object: hypercholesterolemia
Provenance information
- Evidence citation (PMID)
- From whom? (Data Source)
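Data reduction and succinctness can be combined in one small sketch: repeated subject-predicate-object assertions collapse to a single statement, while the provenance (data source and evidence citation) is kept behind it. The class name, field names, source names, and citation labels are placeholders, not real PMIDs or contributors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    source: str  # who asserted it (data source)
    pmid: str    # evidence citation; placeholder labels used below


def reduce_assertions(assertions):
    """Group identical statements, keeping the set of (source, pmid) pairs."""
    merged = {}
    for a in assertions:
        key = (a.subject, a.predicate, a.obj)
        merged.setdefault(key, set()).add((a.source, a.pmid))
    return merged


raw = [
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "source-1", "PMID-A"),
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "source-2", "PMID-B"),
    Assertion("atorvastatin", "may treat", "hypercholesterolemia", "source-1", "PMID-A"),
]
merged = reduce_assertions(raw)
for (subject, predicate, obj), provenance in merged.items():
    # Succinct view: one line, with details hidden behind a count.
    print(f"{subject} {predicate} {obj} ({len(provenance)} supporting records)")
```

Three raw rows reduce to one displayed statement backed by two distinct provenance records; the verbatim duplicate is absorbed without losing who said what.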

22 Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
2. Establish similarity schemes: enables identification of related records
3. Associate to concept hierarchies: improves navigation between related records
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details

23 Concluding remarks
Scientific “big data” …
… contains an amazing amount of information
… provides opportunities to make discoveries
… benefits from strategies to massage it
PubChem is doing its part …
… making chemical substance data broadly accessible
… cross-integrating it to key scientific resources
… suppressing errors and their propagation
… organizing the data and making it available
https://pubchem.ncbi.nlm.nih.gov

24 PubChem Crew …
Steve Bryant, Tiejun Chen, Gang Fu, Lewis Geer, Renata Geer, Asta Gindulyte, Volker Hahnke, Lianyi Han, Jane He, Siqian He, Sunghwan Kim, Ben Shoemaker, Paul Thiessen, Jiyao Wang, Yanli Wang, Bo Yu, Jian Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris

25 Any questions? If you think of one later, email me: bolton@ncbi.nlm.nih.gov

