Robust Software Tools for Variant Identification and Functional Assessment (Boston College & University of Michigan) Gabor Marth, Goncalo Abecasis, PIs.



Informatics challenges for genomic analysis
Tool building
Facilitating analysis
Widening accessibility

Intentions of the RFA

Our approach
Complete toolbox including variant interpretation
Full pipelines for start-to-finish analysis
Easily accessible and well-documented methods
Cloud deployment (in addition to single machine/local compute cluster)
Open development model

Progress in first 6 months
Starting with two sets of tools and pipelines, geared toward high-quality local analysis, battle-tested on the 1000GP data and medical sequencing projects
The two groups follow a “divide and conquer” strategy to put critical pieces in place for making our algorithms available to the wider genomics community
Boston College
  – A universal tool/pipeline launcher application
  – Infrastructure for dissemination
  – Cloud access via Galaxy
University of Michigan
  – Integration of variant annotation/impact assessment
  – Pipeline/workflow control infrastructure
  – Adaptation for Amazon Cloud Services

FUNCTIONALITY & TOOLS

Scope

Include latest versions
Tools constantly evolving (as they must to remain relevant)
Our community toolbox to be updated with new tools as they become available
New algorithms for complex variant detection (FreeBayes)
Unaligned:
ref: TATAGAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
alt: TATAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
Aligned:
ref: TATAGAGAGAGAGAGAGAGC--GAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
alt: TATAGAGAGAGAGAGAG--CGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
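
The point of the aligned view is that one complex event can look like several unrelated SNPs when the two haplotypes are compared base by base. A minimal sketch of that contrast follows. It uses a shortened, hypothetical sequence pair in the spirit of the slide, and the function names are illustrative. It is not the FreeBayes algorithm, only a way to read indel events off a gapped pairwise alignment.

```python
def indel_events(ref_aln, alt_aln):
    """Return (kind, ref_pos, bases) for each run of gap columns.

    ref_pos is a 0-based offset into the ungapped reference.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned rows must have equal length")
    events = []
    ref_pos = 0
    run = None  # event being extended: [kind, start, bases]
    for r, a in zip(ref_aln, alt_aln):
        kind = "DEL" if a == "-" else "INS" if r == "-" else None
        # Close the open run when the gap ends or its kind changes.
        if run and (kind is None or run[0] != kind):
            events.append(tuple(run))
            run = None
        if kind:
            if run is None:
                run = [kind, ref_pos, ""]
            run[2] += r if kind == "DEL" else a
        if r != "-":
            ref_pos += 1
    if run:
        events.append(tuple(run))
    return events


def naive_mismatches(ref, alt):
    """Positions that differ when ungapped sequences are compared column by column."""
    return [i for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# Hypothetical toy pair: the alt haplotype loses "AG" before the C and gains "GA" after it.
REF_ALN = "AGAGC--GAG"
ALT_ALN = "AG--CGAGAG"

print(indel_events(REF_ALN, ALT_ALN))
print(naive_mismatches(REF_ALN.replace("-", ""), ALT_ALN.replace("-", "")))
```

On the toy pair the ungapped comparison reports two isolated mismatches, while the gapped alignment describes the same difference as one deletion plus one insertion around the C.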

Include tools when ready for prime time
MEI type | Sample | RetroSeq Total | RetroSeq Sensitivity | Tangram Total | Tangram Sensitivity | Tea Total | Tea Sensitivity
ALU | NA | | % | 1192 | 98% | 1127 | 92%
    | NA | | % | 1185 | 98% | 1078 | 92%
    | NA | | % | 1326 | 99% | 1038 | 89%
L1  | NA | | % | 190 | 81% | 286 | 81%
The BC mobile element insertion caller performs best in its class

EPACTS variant interpretation tools (Efficient and Parallelizable Association Container Toolbox)
Genetic analysis tool based on VCF
  o Fast and parallelizable access to large VCF files
  o Built-in widely used single variant and burden tests
  o R/C++ interface for extending to newer tests
  o Binary & quantitative phenotypes with covariates
  o Useful visualization tools of association results
Automated visualization
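
To make "single variant and burden tests" concrete, here is a minimal sketch of the two ideas for a binary phenotype. It is not EPACTS code and does not reproduce any specific EPACTS test. The function names and input layout are assumptions chosen for illustration, and covariates are ignored.

```python
def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    denom = row_case * row_ctrl * col_alt * col_ref
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no signal
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom


def burden_scores(genotypes, rare_sites):
    """Collapse rare variants into one score per sample.

    genotypes: {sample: {site: alt allele count}} (hypothetical layout)
    rare_sites: the sites in the gene or region being tested
    """
    return {
        sample: sum(calls.get(site, 0) for site in rare_sites)
        for sample, calls in genotypes.items()
    }


# Single variant: alt/ref allele counts in cases and controls.
print(allelic_chi_square(30, 70, 15, 85))

# Burden: count rare alt alleles per sample across two sites.
print(burden_scores({"s1": {"v1": 1, "v2": 2}, "s2": {"v3": 1}}, {"v1", "v2"}))
```

A single-variant test asks whether one site differs between groups. A burden test first collapses many rare sites into one per-sample score and then tests that score, which helps when each site alone is too rare to test.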

PIPELINES & WORKFLOW

The UM pipeline
BAM -> Genotype Likelihood (samtools)
Genotype Likelihood -> Unfiltered VCF (glfMultiples)
Unfiltered VCF -> Hard-filtered VCF (vcfCooker)
Hard-filtered VCF -> Filtered VCF (SVM)
Filtered VCF -> Filtered/Phased VCF (Beagle/Thunder; optional LD-aware step)
Filtered/Phased VCF -> EPACTS

UMAKE workflow system
Makefile-based approach
  – The Make utility is very good for representing dependencies
  – Picks up where it left off on failure
Flexible deployment
  – Local machine
  – Local cluster (Mosix)
  – Amazon Web Services Elastic Compute Cloud (EC2)
Default options
  – User configurable
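
The two properties credited to Make here, explicit dependencies and resuming after a failure, can be sketched in a few lines. This is not UMAKE itself. The target names are borrowed from the UM pipeline slide and their ordering is an assumption. A Python set stands in for the output files whose presence Make would check on disk.

```python
# Each target lists the targets it depends on (assumed ordering).
PIPELINE = {
    "genotype_likelihood": ["bam"],
    "unfiltered_vcf": ["genotype_likelihood"],
    "hard_filtered_vcf": ["unfiltered_vcf"],
    "svm_filtered_vcf": ["hard_filtered_vcf"],
    "phased_vcf": ["svm_filtered_vcf"],
}


def build(target, deps, done, run_step):
    """Build `target` after its dependencies, skipping anything already in `done`."""
    if target in done:
        return
    for dep in deps.get(target, []):
        build(dep, deps, done, run_step)
    run_step(target)  # if this raises, the target is not marked done
    done.add(target)


if __name__ == "__main__":
    done = {"bam"}  # the input BAM already exists
    failed_once = set()

    def step(target):
        # Simulate one transient failure in the hard-filtering step.
        if target == "hard_filtered_vcf" and target not in failed_once:
            failed_once.add(target)
            raise RuntimeError("simulated failure in " + target)
        print("running", target)

    try:
        build("phased_vcf", PIPELINE, done, step)
    except RuntimeError as err:
        print(err)
    # Second invocation: completed targets are skipped, work resumes at the failed one.
    build("phased_vcf", PIPELINE, done, step)
```

Because completed targets are recorded, the second call does not repeat the likelihood and calling steps. Make achieves the same effect by comparing file timestamps.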

Application of UMAKE to large-scale projects
Project | Depth / Region | N | #SNPs | %dbSNP (129) | Known Ts/Tv | Novel Ts/Tv
1000G 4x Genome 1, M
G >40x Exome 822598K
GoT2D 4x Genome ~2, M
ESP >80x Exome ~6, M
Sardinia 3x Genome M
Bipolar 10x Genome
Computational cost is ~1 week / 1000 samples in a 5 node mini-cluster
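
The Ts/Tv columns in this table are a standard callset quality summary: the ratio of transitions (A<->G, C<->T) to transversions (every other substitution). A minimal sketch of the calculation, assuming SNPs arrive as (ref, alt) base pairs; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for an iterable of (ref, alt) SNPs."""
    snps = [(r, a) for r, a in snps if r != a]  # ignore non-variant records
    ts = sum(1 for r, a in snps if is_transition(r, a))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))
```

Computing the ratio separately for known and novel sites, as the table does, helps flag false positives, because errors tend to pull the novel Ts/Tv down toward the value expected from random substitutions.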

ACCESSIBILITY

The Boston College tool hub (genome)

Simplified installation & use
Unified launcher application (gkno)
  – single tools (e.g. Mosaik)
  – tool “macros” (e.g. map)
  – pipelines (e.g. exome variant calling)
Download and installation
  – All tools pulled in a single step from GitHub
  – All tools installed
  – All tools tested

Easily configurable pipeline system
Part of our new unified launcher system (gkno)
Pipeline types (e.g. mapping, variant calling) and instances (exome, whole-genome)
User-configurable: tools can be swapped in and out, parameters configured via config files
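
One way to picture "tools can be swapped in and out, parameters configured via config files" is a pipeline description kept as data. The sketch below is hypothetical throughout. The JSON layout, the role names and the parameter names are placeholders for illustration. They are not gkno's actual configuration schema, and they are not real options of the named tools.

```python
import json

# Hypothetical pipeline instance; every key and parameter name is a placeholder.
CONFIG = json.loads("""
{
  "pipeline": "exome-variant-calling",
  "steps": [
    {"role": "aligner", "tool": "mosaik",    "params": {"threads": 4}},
    {"role": "caller",  "tool": "freebayes", "params": {"min_depth": 10}}
  ]
}
""")


def swap_tool(config, role, tool, params=None):
    """Return a copy of `config` in which the step filling `role` uses another tool."""
    new_config = json.loads(json.dumps(config))  # deep copy via JSON round trip
    for step in new_config["steps"]:
        if step["role"] == role:
            step["tool"] = tool
            step["params"] = dict(params or {})
            return new_config
    raise KeyError("no step with role " + role)


def describe(config):
    """One summary line per step; deliberately not a real command line."""
    lines = []
    for step in config["steps"]:
        opts = " ".join(f"{k}={v}" for k, v in sorted(step["params"].items()))
        lines.append(f"{step['role']}: {step['tool']} {opts}".rstrip())
    return lines


print(describe(CONFIG))
print(describe(swap_tool(CONFIG, "aligner", "bwa")))
```

Keeping the pipeline as data means that changing the aligner is an edit to one entry and leaves the launcher code untouched.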

Support
Documentation
Tutorials / Blog
User forum
Bug reports

DEPLOYMENT / CLOUD

Software deployment
All software is ready for running locally on a single machine
UMAKE adds cluster support
Cloud deployment
  – Simple Michigan pipelines ported to Amazon
  – Porting of all project software is under way

Cloud-based analysis – Galaxy

OPEN & COLLABORATIVE DEVELOPMENT MODEL

Integration
Our workflows leverage 3rd party tools for specific functionality
All our tools are open-source, available on GitHub (many clones, community contributed code)
Ensemble approach (multiple tools for critical tasks)

Ensemble approach
Multiple tools usually benefit analysis
Called in | # SNPs | %dbSNP | Ts/Tv Novel | Ts/Tv Known | Ts/Tv Total
Union | 907,
 of 5 | 766,
 of 5 | 696,
 of 5 | 601,
Intersection | 520,

Ensemble approach
Our pipelines will use multiple aligners (BWA, Mosaik) and variant callers (FreeBayes, glfMultiples), developed by BC/UM
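
The "called in" breakdown used for the ensemble comparison (union, n of 5, intersection) amounts to counting how many callsets contain each site. A minimal sketch follows. It assumes each callset is a set of (chrom, pos, ref, alt) tuples, which is an illustrative representation and not the project's actual data model.

```python
from collections import Counter


def called_in_at_least(callsets, k):
    """Sites present in at least `k` of the given callsets.

    k == 1 gives the union; k == len(callsets) gives the intersection.
    """
    counts = Counter(site for calls in callsets for site in set(calls))
    return {site for site, n in counts.items() if n >= k}


# Three hypothetical callsets from different callers.
callsets = [
    {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    {("1", 100, "A", "G")},
    {("1", 100, "A", "G"), ("1", 300, "G", "A")},
]

for k in range(1, len(callsets) + 1):
    print(k, sorted(called_in_at_least(callsets, k)))
```

Raising k trades sensitivity for specificity: the union keeps every candidate, while the intersection keeps only sites that every caller agrees on.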

In progress
Expanding pipelines to integrate all tools
Michigan tools -> gkno
BC tools -> Michigan cloud ready pipelines
Large data set analysis on the cloud
Integrate variant interpretation tools
Integrate SV tools as they become more robust
Integrate consensus analysis (SVM and MLP approaches to callset aggregation)
Minimal, functional pipeline -> Galaxy

Team
Boston College: Alistair Ward, Derek Barnett, Chase Miller, Wan-Ping Lee, Erik Garrison, Gabor Marth
University of Michigan: Mary-Kate Trost, Tom Blackwell, Hyun-Min Kang, Youna Hu, Adrian Tan, Xiaowei Zhan, Dajiang Liu, Goncalo Abecasis