Million Veteran Program: Industry Day
Genomic Data Processing and Storage
Saiju Pyarajan, PhD and Philip Tsao, PhD

Presentation transcript:

Million Veteran Program: Industry Day
Genomic Data Processing and Storage
Saiju Pyarajan, PhD and Philip Tsao, PhD

VETERANS HEALTH ADMINISTRATION
Current State - Genomics
– MVP study conduct facilitates standardization of:
  – Tissue collection & processing
  – DNA extraction & quality assurance
– Enterprise LIMS for sample tracking
– Electronic manifest for all sample transfers
– Genotyping at two vendors
  – Affymetrix Axiom platform with MVP Custom Chip
– Sequencing at two vendors
  – Illumina
  – Ion Torrent
– FY 15 targets
  – ~400,000 for genotyping
  – ~25,000 whole exome & 2,000 whole genome sequencing

Data QA/QC
– Tier 1 QA/QC being performed by vendors
[Figure: Vendor SNP Array Genotyping Process Flow]

[Figure: Genomic Data Ingestion Pipeline]

Current State – Whole Genome Sequencing Data
– Drives received from vendors
– VAPAHCS/Stanford WGS Pipeline
  – Encrypted keys to unlock drives
  – Analysis pipeline that can take raw sequence data and process to call variants (sequence and structural) at a rate of ~4/day
  – 1° analysis: QA/QC of sequence data (VQSR); check with SNP array data
  – 2° analysis: QA filtering, alignment/assembly and variant calling; quality metrics from VCF
  – 3° analysis: multi-sample processing; QA of variant calls; QA of population structure; annotation and filtering; association analysis; prediction algorithms
– Scale pipeline for MVP
  – Increase throughput
  – Enhance general utility for end users
  – AAA genomes are used to design, test and tune pipeline
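The slide does not say how the "check with SNP array data" step is carried out. One common form of such a cross-platform check is genotype concordance: the fraction of sites called on both platforms where the genotypes agree. The sketch below is only an illustration under that assumption; the function name and the dictionary layout (variant ID mapped to an allele pair, `None` for a no-call) are hypothetical and not taken from the MVP pipeline.

```python
def genotype_concordance(seq_calls, array_calls):
    """Fraction of shared, called sites where both platforms agree.

    Both arguments map a variant ID to an unordered genotype such as
    ("A", "G"); None marks a no-call. Returns (concordance, sites_compared),
    with concordance set to None when no site can be compared.
    Hypothetical data layout, for illustration only.
    """
    compared = 0
    matches = 0
    for variant_id, array_gt in array_calls.items():
        seq_gt = seq_calls.get(variant_id)
        if seq_gt is None or array_gt is None:
            continue  # skip sites missing or uncalled on either platform
        compared += 1
        # Sort alleles so that ("A", "G") and ("G", "A") count as equal.
        if sorted(seq_gt) == sorted(array_gt):
            matches += 1
    if compared == 0:
        return None, 0
    return matches / compared, compared
```

A low concordance for a sample would usually prompt a closer look at possible sample swaps or contamination; any cutoff for "low" would have to come from the project's own QC criteria.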

Current State – Whole Exome Sequencing Data
– Drives received from vendors
– VAPAHCS/Stanford WES Pipeline
  – Encrypted keys to unlock drives
  – Get checksum manifest of files on each drive
    – Evaluate data stability/corruption
  – Analysis of Coverage (from BAM)
    – Determine coverage of coding regions
  – Analysis of Mapping (from BAM)
    – Mapping quality indicative of potential contamination
  – Analysis of Variants
    – Variant quality
  – Exome vs. Genome
    – Some duplicate samples between vendors/platforms
– Scale pipeline for MVP
  – Increase throughput
  – Enhance general utility for end users
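The checksum-manifest step above can be sketched as follows. This is a minimal illustration, not the VAPAHCS/Stanford implementation: the slide does not name the hash algorithm or the manifest format, so SHA-256 and a plain mapping of relative path to expected digest are assumptions, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(drive_root, manifest):
    """Compare each listed file on the drive against its expected checksum.

    `manifest` maps a relative file path to an expected SHA-256 hex digest
    (an assumed layout). Returns a dict of path -> "ok" | "missing" | "corrupt".
    """
    root = Path(drive_root)
    report = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        elif sha256_of(target) != expected.lower():
            report[rel_path] = "corrupt"
        else:
            report[rel_path] = "ok"
    return report
```

Running the same check again after each copy or transfer gives a simple way to detect corruption introduced in transit as well as at the source.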

“Big Data” Challenges for MVP
– Data volume
  – Tiered storage by data types and use
– Data storage architecture
  – Efficiency, compression
  – Data query/metadata manager service
– Computing/analytical infrastructure
  – Bringing processing power to the data instead of moving data
– Data security, data access and data governance
  – Flexible for researchers yet secure
– Consistent data quality
  – Across vendors, batches and time

MVP Data - Looking ahead
– Capability
  – Tiered data storage for raw vs processed data computing
  – Data & system access controls based on MVP governance
  – Dynamic annotation of data and self-enriching knowledgebase
  – Integrative clinical-genomic query capability for cohort identification
  – Automated creation of customized extracts for individual studies without data duplication
– Scalability
  – Scalable hardware and file systems
  – Horizontal scaling for hardware and data changes with minimal downtime
– Flexibility
  – Extremely fast and efficient querying across high-throughput data (>million for each data type)
  – Data architecture to allow for maximum flexibility for scientific and technology changes (genomics, proteomics, other “omics”)
– Efficiency
  – Fast upload and storage of data
  – “Out-of-the-box” computing resource management
– Security
  – Granular tiered access controls
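One way to read "customized extracts for individual studies without data duplication" together with "granular tiered access controls" is to store each extract as a set of references (sample IDs and field names) that is resolved against a single shared store at read time. The sketch below illustrates that idea only; `ExtractSpec`, `read_extract`, and the in-memory dictionary store are hypothetical and do not describe an actual MVP system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractSpec:
    """A study extract stored as references, not copied records (hypothetical)."""
    study: str
    sample_ids: tuple
    fields: tuple


def read_extract(store, spec, granted_fields):
    """Resolve an extract against the shared store at read time.

    `store` maps sample ID -> record dict (an assumed in-memory stand-in for
    the real data store). `granted_fields` is the set of fields the requester
    is allowed to see; requesting any other field raises PermissionError.
    Returns a list of (sample_id, row) pairs restricted to the spec's fields.
    """
    denied = set(spec.fields) - set(granted_fields)
    if denied:
        raise PermissionError(f"fields not granted: {sorted(denied)}")
    rows = []
    for sample_id in spec.sample_ids:
        record = store[sample_id]
        rows.append((sample_id, {name: record[name] for name in spec.fields}))
    return rows
```

Because the spec holds only identifiers, two studies that overlap in samples share the same underlying records, and a correction to the store is visible to every extract on its next read.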