Data challenges in the pharmaceutical industry

Slides:



Advertisements
Similar presentations
Obesity e-Lab Enabling obesity research using the Health Surveys for England: The Obesity e-Lab project Dexter Canoy The University of Manchester
Advertisements

The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Research Methods, Overview 1 There are hundreds of Scientific fields and disciplines, ranging from the Physical Sciences, to the Life Sciences, to the.
The Experience Factory May 2004 Leonardo Vaccaro.
University of Pittsburgh Department of Biomedical Informatics Healthcare institutions have established local clinical data research repositories to enable.
ONCOMINE: A Bioinformatics Infrastructure for Cancer Genomics
Bioinformatics Databases: Fundamental Concepts of Database Technology & Data Organization Kristen Anton Director of BioInformatics Dartmouth Medical School.
DI FC UL1 Gene Function Prediction by Mining Biomedical Literature Pooja Jain Master in Bioinformatics Supervisor - Mário Jorge Costa Gaspar.
Bioinformatics Databases: Fundamentals of Database Technology & Data Organization Kristen Chambers Director of Bioinformatics Dartmouth Medical School.
Metadata Management – Our Journey Thus Far
NCBI resources III: GEO and expression data analysis Yanbin Yin Fall
Algorithms in Computational Biology Tanya Berger-Wolf Compbio.cs.uic.edu/~tanya/teaching/CompBio January 13, 2006.
1/ Thomson Scientific/ 13 July 2015 Investigator Portal.
JumpStart the Regulatory Review: Applying the Right Tools at the Right Time to the Right Audience Lilliam Rosario, Ph.D. Director Office of Computational.
1 The Discovery Informatics Framework Pat Rougeau President and CEO MDL Information Systems, Inc. Delivering the Integration Promise American Chemical.
Scientific Data as Research Infrastructure: The Biomedical Sciences Strategies for Economic Sustainability of Publicly Funded Data Repositories Board on.
Bioinformatics and medicine: Are we meeting the challenge?
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
Data Analysis Summary. Elephant in the room General Comments General understanding that informatics is integral in medical sequencing and other –omics.
Galaxy for Bioinformatics Analysis An Introduction TCD Bioinformatics Support Team Fiona Roche, PhD Date: 31/08/15.
BIRN Advantages in Morphometry  Standards for Data Management / Curation File Formats, Database Interfaces, User Interfaces  Uniform Acquisition and.
Ontologies GO Workshop 3-6 August Ontologies  What are ontologies?  Why use ontologies?  Open Biological Ontologies (OBO), National Center for.
Data provenance in biomedical discovery Donald Dunbar Queen’s Medical Research Institute University of Edinburgh Workshop on Principles of Provenance in.
Clinical Collaboration Platform Overview ST Electronics (Training & Simulation Systems) 8 September 2009 Research Enablers  Consulting  Open Standards.
Proteomics databases for comparative studies: Transactional and Data Warehouse approaches Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall GeneProt,
Gene Expression Omnibus (GEO)
BBN Technologies Copyright 2009 Slide 1 The S*QL Plugin for Cytoscape Visual Analytics on the Web of Linked Data Rusty (Robert J.) Bobrow Jeff Berliner,
A collaborative tool for sequence annotation. Contact:
Bioinformatics and Computational Biology
Databases, Ontologies and Text mining Session Introduction Part 2 Carole Goble, University of Manchester, UK Dietrich Rebholz-Schuhmann, EBI, UK Philip.
Construction of Shanghai Life Science & Bio-technology Service Platform for Data Access and Sharing International Workshop on Strategies Presentation of.
1 PLEASING CLIENTS AT A MOLECULAR AND CELLULAR LEVEL AUGUST 7, 2015.
Clinical research data interoperbility Shared names meeting, Boston, Bosse Andersson (AstraZeneca R&D Lund) Kerstin Forsberg (AstraZeneca R&D.
ERP and Related Technologies
High throughput biology data management and data intensive computing drivers George Michaels.
Introduction to Oncomine Xiayu Stacy Huang. Oncomine is a cancer-specific microarray database and has a web-based data-mining platform aimed at facilitating.
ArrayExpress Ugis Sarkans EMBL - EBI
VIEWS b.ppt-1 Managing Intelligent Decision Support Networks in Biosurveillance PHIN 2008, Session G1, August 27, 2008 Mohammad Hashemian, MS, Zaruhi.
Database Principles: Fundamentals of Design, Implementation, and Management Chapter 1 The Database Approach.
Enhancements to Galaxy for delivering on NIH Commons
E- Patient Medical History System
Semantic Web - caBIG Abstract: 21st century biomedical research is driven by massive amounts of data: automated technologies generate hundreds of.
Databases, Ontologies and Text mining Session Introduction Part 2
Testbed for Medical Cyber-Physical Systems
Themes in Geosciences.
Biomedical Text Mining and Its Applications
An Artificial Intelligence Approach to Precision Oncology
KnowEnG: A SCALABLE KNOWLEDGE ENGINE FOR LARGE SCALE GENOMIC DATA
Bio68: Bioinformatics Databases
Development of the Amphibian Anatomical Ontology
Interrogation of cross talk between proteins and gene regulatory networks in breast cancer Chambers, Teressa Lee Hiren Karathia Sridhar Hannenhalli.
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Top Investment Pockets in Artificial Intelligence in Healthcare Market
Workshop Aims TAMU GO Workshop 17 May 2010.
Workshop Aims GO Workshop 3-6 August 2010.
Functional Annotation of the Horse Genome
Mangaldai College, Mangaldai
Using Spotfire for Proteomic Analysis
Gene Expression Omnibus (GEO)
C.U.SHAH COLLEGE OF ENG. & TECH.
Topic: Medicine of the future Reading: Harbron, Chris (2006)
Benjamin Wooden, Nicolas Goossens, Yujin Hoshida, Scott L. Friedman 
Elsevier’s New Biology Solution
Computer Science Issues In a Patient’s Perspective
Chapter 1 Database Systems
CHAPTER 7: Information Visualization
Network biology An introduction to STRING and Cytoscape
Data Mining.
The FaceBase Consortium
Presentation transcript:

Data challenges in the pharmaceutical industry Kejie Li PhD Computational Biologist, Computational Biology and Genomics DBOnto, London May 26, 2016

disease drug targets

Two Challenges Knowledge management Clinical registry data management

Biology view from different people view from biologist / chemist view from math / computer science +

Biological knowledge is the key for pharmaceutical industry

Knowledge Bases Biological information is out there in hundreds of biological databases Information is not knowledge Curation and knowledge management strategy are important Knowledge bases are the essential asset to have to help with pharma/biotech companies Internal drug discovery and development Competitive intelligence

Literature Based Knowledge Bases Curated from publications Noisy, information from all possible contexts (experiment condition, sample used differently etc.) Manual curation High confidence Low throughput Text mining Medium/low confidence High throughput

Data driven Knowledge Base GEO / Array Express contain molecular profiling data Company proprietary data (animal model or clinical trial) Context specific knowledge is more relevant to drug discovery

Represent Knowledge as Graphs (Networks) and Build Analytical Pipelines Literature Existing public sources OMICs (semi-) automatic extraction collect public networks/pathways data driven network Standardization & Sharing Internal tools Public tools Algorithm repository Network Repository Benchmarking & Pipelines Visualization

Knowledge Base Related Challenges Different ontologies used KB Integration is not straight forward for literature and data driven knowledge bases Sharing same public domain knowledge (without each company re-creating the wheel) is challenging

ADNI and AIBL Data ADNI and AIBL are two studies for understanding Alzheimer’s Disease progression They both collect comprehensive information for hundreds of individuals over several years Omics: whole genome sequencing, transcriptome, etc. Clinical data: cognitive tests, labs, etc. Imaging data: PET and MRI brain scans and derived metrics Biomarker data: e.g. protein expression in blood Demographics Data are available through a common archive (LONI)

ADNI and AIBL Data Integration Problem Problem Statement: Academic and Industry projects aim to find models/results that can be validated in multiple, independent datasets, which requires global schema construction and data harmonization (Data) Goal: Have a unified data model that can accommodate multiple studies for cross- study querying and analysis The challenges in achieving data goal are: Multiple centers collect data without upfront data organization strategy resulting in several disjoint datasets that are subsequently consolidated – in the case of ADNI there are > 200 text files with common person ID and visit number Inherent to all biomedical data, fields are messy (free text, data entry errors, unit mismatches) Different studies have different protocols, (slightly) different aims, in different regions, in different people, and therefore collect different data

ADNI-Specific Challenges ADNI actually contains 3 sub-studies ADNI 1, then ADNI GO, then ADNI 2 Same person may be enrolled in one or more sub-studies, concurrently or disjointed Variable names and values may not be harmonized across sub-studies Data redundancy To facilitate data analysis, ADNI statisticians provide a merged dataset called ADNIMERGE that contains commonly used variables – therefore, this is a direct copy of subset of variables from the text files When both common and non-ADNIMERGE variables are needed, users may unnecessarily join ADNIMERGE with data in other text files creating redundancies A smart query system that automatically selects the minimal number of text files traversed to return requested data would minimizes redundancies and logic-conflicts (from processed ADNIMERGE versus unprocessed text files) Data describing the same field (e.g. medication) may be partially controlled with details in free text Medication history is found in multiple files recorded either as a specific, and controlled, background question or as free text

Registry dataset challenges Identify the global schema to accommodate different studies Identify data redundancy to improve data completeness

THANK YOU!