Next Generation Sequencing and Bioinformatics Analysis Pipelines
Adam Ameur
National Genomics Infrastructure, SciLifeLab Uppsala

What is an analysis pipeline?
Basically just a number of steps to analyze data:
Raw data (FASTQ reads) → Intermediate result → Intermediate result → Final result
Pipelines can be simple or very complex…
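The chain of steps described above can be sketched in a few lines of Python. This is only an illustration of the concept: the step functions `drop_short_reads` and `count_bases` are hypothetical placeholders, not part of any real sequencing tool, and the "reads" are plain strings rather than FASTQ records.

```python
from typing import Any, Callable, List, Sequence


def run_pipeline(raw_data: Any, steps: Sequence[Callable[[Any], Any]]) -> List[Any]:
    """Run each step on the output of the previous one.

    Returns every result in order: raw data, intermediate results, final result.
    """
    results = [raw_data]
    for step in steps:
        results.append(step(results[-1]))
    return results


# Hypothetical toy steps, for illustration only.
def drop_short_reads(reads: List[str]) -> List[str]:
    """Keep reads of at least 4 bases."""
    return [r for r in reads if len(r) >= 4]


def count_bases(reads: List[str]) -> int:
    """Total number of sequenced bases."""
    return sum(len(r) for r in reads)


raw_reads = ["ACGT", "AC", "GGGTTA"]
stages = run_pipeline(raw_reads, [drop_short_reads, count_bases])
print(stages[-1])  # the final result of the pipeline
```

Keeping the intermediate results makes it easy to inspect or re-run a single step, which matters more as pipelines grow complex.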

Today's lecture
Sequencing instruments and 'standard' pipelines
– Ion Torrent / Pacific Biosciences
In-house bioinformatics pipelines, some examples
News and future plans

Ion Torrent - PGM/Proton
The Ion Torrent System
– 6 instruments available in Uppsala, early access users
– Two instrument models: PGM and Proton
– For small-scale (PGM) and large-scale (Proton) sequencing
– Rapid sequencing (run time ~2-4 hours)
– Measures changes in pH
– Sequencing on a chip
Pictured: Personal Genome Machine (PGM), Ion Proton

Ion Torrent output
Ion Torrent throughput: from ~10 Mb to >10 Gb, depending on the chip
Read lengths: 400 bp (PGM), 200 bp (Proton)
Output file format: FASTQ
What can we use Ion Torrent for?
– Anything, except perhaps very large genomes
– 2 human exomes (PI chip)
– 2 human transcriptomes
– 1 human genome = 6 PI chips
Chips shown: PI (Proton), 318 (PGM), 316 (PGM), 314 (PGM); NEW: P2 (Proton)

Ion Torrent analysis workflow
Torrent Server → .fastq, .bam, .fasta → Downstream analysis

Torrent Suite Software

Torrent Suite Software
Analysis plug-ins within the Torrent Suite Software
– Alignment: TMAP, specifically developed for Ion Torrent data
– Variant Caller: SNP/indel detection
– Assembler: MIRA
– Analysis of gene panels (human exomes, cancer panels, etc.): SNP/indel detection in amplicon-seq data
– Analysis of human transcriptomes
– …
Analyses are started automatically when a run is complete!

Pacific Biosciences
– Installed summer 2013
– Single-molecule sequencing
– Very long read lengths (up to 60 kb)
– Rapid sequencing
– Can detect base modifications (e.g. methylation)
– Relatively low throughput

PacBio output
PacBio throughput: ~1 Gb/SMRT cell
PacBio read lengths: 500 bp-60 kb
Output file format: FASTQ
What can we use PacBio for?
– Anything, except really large genomes
– ~1 bacterial genome
– ~1 bacterial transcriptome
– 1 human genome = 100 SMRT cells?
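The "1 human genome = N chips/cells" figures quoted for both platforms can be read as coverage arithmetic: total sequenced bases divided by genome size. The sketch below works this through; the 3.1 Gb human genome size is an assumption added here, and the Ion Proton yield is taken at the quoted lower bound of 10 Gb per PI chip.

```python
HUMAN_GENOME_BP = 3.1e9  # assumed approximate haploid human genome size


def implied_coverage(n_units: int, bases_per_unit: float,
                     genome_bp: float = HUMAN_GENOME_BP) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return n_units * bases_per_unit / genome_bp


ion_cov = implied_coverage(6, 10e9)      # 6 PI chips, taking 10 Gb per chip
pacbio_cov = implied_coverage(100, 1e9)  # 100 SMRT cells at ~1 Gb each
print(f"Ion Proton: ~{ion_cov:.0f}X, PacBio: ~{pacbio_cov:.0f}X")
```

Under these assumptions the slide figures correspond to roughly 19X (Ion Proton) and 32X (PacBio) average coverage of a human genome.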

PacBio analysis workflow
In-house PacBio cluster → .fastq, .bam, .fasta → Downstream analysis

SMRT analysis portal

SMRT analysis pipelines
– Mapping
– Variant calling
– Assembly
– Scaffolding
– Base modifications

What about Illumina? There are many different pipelines for Illumina…

In-house development of pipelines
The standard analysis pipelines are nice…
… but sometimes we need to develop our own
… or adapt the pipelines to our specific applications
Some examples of in-house developments:
I. Building a local variant database (WES/WGS)
II. Assembly of genomes using long reads
III. Clinical sequencing – Leukemia Diagnostics

Example I: Computational infrastructure for exome-seq data

Background: exome-seq
Main application of exome-seq
– Find disease-causing mutations in humans
Advantages
– Allows investigation of all protein-coding sequences
– Possible to detect both SNPs and small indels
– Low cost (compared to WGS)
– Possible to multiplex several exomes in one run
– Standardized workflow for data analysis
Disadvantage
– All genetic variants outside of exons are missed (~98% of the genome)

Exome-seq throughput
We are producing a lot of exome-seq data
– 4-6 exomes/day on Ion Proton
– In each exome we detect over 50,000 SNPs and about 2,000 small indels
=> Over 1 million variants/run! In plain text files

How to analyze this?
Traditional analysis: a lot of filtering!
Start: all identified SNPs
– Typical filters
  – Focus on rare SNPs (not present in dbSNP)
  – Remove false positives (by filtering against other exomes)
  – Effect on protein: non-synonymous, stop-gain, etc.
  – Heterozygous/homozygous
– This analysis can be automated (more or less)
Result: a few candidate causative SNP(s)!
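The traditional per-sample filtering described above might look like the following minimal sketch. The record layout (dictionaries with `chrom`, `pos`, `alt`, `in_dbsnp`, `effect`, `zygosity`) and the function name `filter_candidates` are assumptions made for illustration; real variant files carry many more fields.

```python
DAMAGING_EFFECTS = {"non-synonymous", "stop-gain"}


def filter_candidates(variants, seen_in_other_exomes, zygosity="homozygous"):
    """Apply the typical filters in sequence and return the surviving variants."""
    candidates = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["alt"])
        if v["in_dbsnp"]:                       # keep rare SNPs only
            continue
        if key in seen_in_other_exomes:         # likely false positive
            continue
        if v["effect"] not in DAMAGING_EFFECTS:  # must change the protein
            continue
        if v["zygosity"] != zygosity:           # inheritance model
            continue
        candidates.append(v)
    return candidates


# Toy data, invented for this example.
variants = [
    {"chrom": "chr1", "pos": 100, "alt": "T", "in_dbsnp": False,
     "effect": "stop-gain", "zygosity": "homozygous"},
    {"chrom": "chr2", "pos": 200, "alt": "A", "in_dbsnp": True,
     "effect": "non-synonymous", "zygosity": "homozygous"},
    {"chrom": "chr3", "pos": 300, "alt": "G", "in_dbsnp": False,
     "effect": "non-synonymous", "zygosity": "homozygous"},
    {"chrom": "chr4", "pos": 400, "alt": "C", "in_dbsnp": False,
     "effect": "synonymous", "zygosity": "homozygous"},
    {"chrom": "chr5", "pos": 500, "alt": "T", "in_dbsnp": False,
     "effect": "non-synonymous", "zygosity": "heterozygous"},
]
other_exomes = {("chr3", 300, "G")}

candidates = filter_candidates(variants, other_exomes)
print([(v["chrom"], v["pos"]) for v in candidates])
```

Each change of parameters means re-reading and re-filtering the files for one sample, which is the limitation a shared database addresses.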

Why is this not optimal?
Drawbacks
– Work on one sample at a time: difficult to compare between samples
– Takes time to re-run the analysis when using different parameters
– No standardized storage of detected SNPs/indels: difficult to handle 100s of samples
Better solution
– A database-oriented system, both for data storage and filtering analyses

Analysis: In-house variant database
CanvasDB: CANdidate Variant Analysis System and Data Base
Ameur et al., Database Journal, 2014

CanvasDB - Filtering

CanvasDB - Filtering speed Rapid variant filtering, also for large databases

A recent exome-seq project
Hearing loss: 2 affected brothers
– Likely a rare, recessive disease => shared homozygous SNPs/indels
Sequencing strategy
– TargetSeq exome capture
– One sample per PI chip
Figure legend: homozygous, heterozygous

Filtering analysis
CanvasDB filtering for a variant that is…
– rare: found in at most 1% of ~700 exomes
– shared: found in both brothers
– homozygous: in the brothers, but in no other samples
– deleterious: non-synonymous, frameshift, stop-gain, splicing, etc.
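The four criteria above translate naturally into a database query. The sketch below uses an in-memory SQLite database with a hypothetical two-table schema (`variants`, `calls`); it is not the actual CanvasDB schema or code, and the rarity threshold is expressed as an absolute number of other carriers because the toy database holds only a handful of samples.

```python
import sqlite3

QUERY = """
SELECT v.gene, v.effect
FROM variants AS v
WHERE v.effect IN ('non-synonymous', 'frameshift', 'stop-gain', 'splicing')
  -- shared and homozygous: both brothers carry it as 'hom'
  AND (SELECT COUNT(*) FROM calls AS c
       WHERE c.variant_id = v.variant_id
         AND c.sample IN (:b1, :b2) AND c.zygosity = 'hom') = 2
  -- homozygous in no other sample
  AND (SELECT COUNT(*) FROM calls AS c
       WHERE c.variant_id = v.variant_id
         AND c.sample NOT IN (:b1, :b2) AND c.zygosity = 'hom') = 0
  -- rare: few carriers among the other samples
  AND (SELECT COUNT(*) FROM calls AS c
       WHERE c.variant_id = v.variant_id
         AND c.sample NOT IN (:b1, :b2)) <= :max_other
ORDER BY v.gene
"""


def find_candidates(conn, brother1, brother2, max_other_carriers):
    params = {"b1": brother1, "b2": brother2, "max_other": max_other_carriers}
    return conn.execute(QUERY, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, gene TEXT, effect TEXT);
CREATE TABLE calls (sample TEXT, variant_id INTEGER, zygosity TEXT,
                    PRIMARY KEY (sample, variant_id));
""")
# Toy data with made-up gene names.
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", [
    (1, "GENE_A", "stop-gain"),
    (2, "GENE_B", "non-synonymous"),
    (3, "GENE_C", "synonymous"),
    (4, "GENE_D", "frameshift"),
    (5, "GENE_E", "non-synonymous"),
])
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    ("brother1", 1, "hom"), ("brother2", 1, "hom"), ("ctrl1", 1, "het"),
    ("brother1", 2, "hom"), ("brother2", 2, "het"),
    ("brother1", 3, "hom"), ("brother2", 3, "hom"),
    ("brother1", 4, "hom"), ("brother2", 4, "hom"), ("ctrl2", 4, "hom"),
    ("brother1", 5, "hom"), ("brother2", 5, "hom"),
    ("ctrl1", 5, "het"), ("ctrl2", 5, "het"),
])

hits = find_candidates(conn, "brother1", "brother2", max_other_carriers=1)
print(hits)
```

Because all samples live in one database, changing a threshold or comparing against other exomes is a new query rather than a new pass over text files.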

Filtering results
Homozygous candidates
– 2 SNPs: stop-gain in STRC, non-synonymous in PCNT
– 0 indels
Compound heterozygous candidates (lower priority)
– in 15 genes
=> Filtering is fast and gives a short candidate list!
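Compound heterozygous candidates are genes in which the affected individuals share two or more different heterozygous variants. A minimal sketch of that grouping step follows; the function name, the input layout, and the gene names are hypothetical, and without phasing information the code cannot confirm that the two variants sit on different copies of the gene.

```python
from collections import defaultdict


def compound_het_genes(het_calls_by_sample, samples):
    """Genes with >= 2 distinct heterozygous variants carried by every sample.

    het_calls_by_sample maps sample -> set of (gene, variant_id) tuples.
    """
    shared = set.intersection(*(het_calls_by_sample[s] for s in samples))
    per_gene = defaultdict(set)
    for gene, variant_id in shared:
        per_gene[gene].add(variant_id)
    return sorted(g for g, ids in per_gene.items() if len(ids) >= 2)


# Toy data, invented for this example.
het_calls = {
    "brother1": {("GENE_X", "v1"), ("GENE_X", "v2"), ("GENE_Y", "v3")},
    "brother2": {("GENE_X", "v1"), ("GENE_X", "v2"), ("GENE_Y", "v4")},
}
genes = compound_het_genes(het_calls, ["brother1", "brother2"])
print(genes)
```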

STRC - a candidate gene => Stop-gain in STRC is likely to cause hearing loss!

IGV visualization: Stop-gain in STRC (tracks: Brother #1, Brother #2, unrelated sample)

STRC, validation by Sanger
Sanger validation at the stop-gain site (Brother #1, Brother #2)
Does not seem to be homozygous…
– Explanation: STRC is difficult to sequence by Sanger because of a pseudogene with very high similarity
New validation showed the mutation is homozygous!!

CanvasDB – some success stories
Solved cases, exome-seq - Niklas Dahl/Joakim Klar
– Neuromuscular disorder: NMD11
– Arthrogryposis: SKD36
– Lipodystrophy: ACR1
– Achondroplasia: ACD2
– Ectodermal dysplasia: ED21
– Achondroplasia: ACD9
– Ectodermal dysplasia: ED1
– Erythroderma: AV1
– Ichthyosis: SD12
– Muscular dystrophy: DMD7
– Neuromuscular disorder: NMD8
– Welander myopathy (D): W
– Skeletal dysplasia: SKD21
– Visceral myopathy (D): D:5156
– Ataxia telangiectasia: MR67
– Exostosis: SKD13
– Alopecia: AP43
– Epidermolysis bullosa: SD14
– Hearing loss: D:9652

CanvasDB - Availability CanvasDB system now freely available on GitHub!

Next Step: Whole Genome Sequencing
New instruments at SciLifeLab for human WGS…
Capacity of HiSeq X Ten: 320 whole human genomes/week!!!
More work on pipelines and databases needed!!!

Example of large WGS project: Generation of a high-quality genetic variant database for the Swedish population
– Whole genome sequencing of ~1000 individuals
– Aim: to build a resource for Swedish research and health care
– PI: Ulf Gyllensten

Example II: Assembly of genomes using Pacific Biosciences

Genome assembly using NGS
Short-read de novo assembly by NGS
– Requires mate-pair sequences, ideally with different insert sizes
– Complicated analysis: assembly, scaffolding, finishing, maybe even some manual steps
=> Rather expensive and time consuming
Long reads really make a difference!!
– We can assemble genomes using PacBio data only!

HGAP de novo assembly
HGAP uses both long and shorter reads
– Long reads (seeds)
– Short reads
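One way to picture the split into seeds and shorter reads is to choose a read-length cutoff so that the longest reads alone cover the genome to some target depth. The sketch below illustrates that idea only; the cutoff rule, the 30X default, and the function names are assumptions for this example, not the actual HGAP implementation.

```python
def seed_length_cutoff(read_lengths, genome_size, seed_coverage=30):
    """Smallest read length such that all reads at least that long together
    give about `seed_coverage`-fold coverage of the genome.

    read_lengths must be a non-empty list of read lengths in bases.
    """
    target_bases = seed_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return min(read_lengths)  # not enough data: every read becomes a seed


def split_reads(read_lengths, cutoff):
    """Separate reads into long seeds and the shorter reads."""
    seeds = [n for n in read_lengths if n >= cutoff]
    shorter = [n for n in read_lengths if n < cutoff]
    return seeds, shorter


# Toy numbers chosen to keep the arithmetic easy to follow.
lengths = [10, 8, 6, 4, 2]
cutoff = seed_length_cutoff(lengths, genome_size=2, seed_coverage=9)
seeds, shorter = split_reads(lengths, cutoff)
print(cutoff, seeds, shorter)
```

In this toy case the two longest reads (10 + 8 = 18 bases) reach the 9X target on a 2-base "genome", so the cutoff is 8 and the remaining reads are treated as shorter reads.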

PacBio – Current throughput & read lengths
– >10 kb average read lengths! (run from April 2014)
– ~1 Gb of sequence from one PacBio SMRT cell

PacBio assembly analysis Simple -- just click a button!!

PacBio assembly, example result Example: Complete assembly of a bacterial genome

PacBio assembly – recent developments
Larger genomes can also be assembled with PacBio…

Next step: assembly of large genomes
We need to install such pipelines at UPPNEX!!
405,000 CPUh used on Google Cloud!
A computational challenge!!

Example III: Clinical sequencing for Leukemia Treatment

Chronic Myeloid Leukemia
BCR-ABL1 fusion protein – a CML drug target
The BCR-ABL1 fusion protein can acquire resistance mutations following drug treatment

BCR-ABL1 workflow – PacBio Sequencing
– From sample to results: < 1 week
– 1 sample/SMRT cell
Cavelier et al., BMC Cancer, 2015

BCR-ABL1 mutations at diagnosis
PacBio sequencing generates ~10,000X coverage!
Sample from time of diagnosis (figure: BCR and ABL1 regions of the fusion transcript)

BCR-ABL1 mutations in follow-up sample
Sample 6 months later (figure: BCR and ABL1 regions of the fusion transcript)
Mutations acquired in fusion transcript. Might require treatment with an alternative drug.

BCR-ABL1 dilution series results Mutations down to 1% detected!
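A quick calculation shows why very deep coverage helps when looking for mutations present in only 1% of transcripts. The sketch treats each read as an independent draw (a simple binomial model, which is an assumption added here and ignores sequencing errors) and compares a depth of 100X with the ~10,000X mentioned for the PacBio runs.

```python
import math


def expected_support(depth, allele_fraction):
    """Mean and standard deviation of the number of reads carrying a mutation
    under a binomial model with `depth` reads."""
    mean = depth * allele_fraction
    sd = math.sqrt(depth * allele_fraction * (1 - allele_fraction))
    return mean, sd


for depth in (100, 10_000):
    mean, sd = expected_support(depth, 0.01)
    print(f"{depth}X: {mean:.0f} +/- {sd:.1f} mutant reads expected")
```

At 100X a 1% mutation is expected in about one read, which is hard to tell apart from noise; at 10,000X it is expected in about 100 reads, with a spread of roughly 10.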

Summary of mutations in 5 CML patients

Mutations mapped to protein structure

BCR-ABL1 - Compound mutations

BCR-ABL1 - Multiple isoforms in one individual!

BCR-ABL1 – Isoforms and protein structure

Next step: A clinical diagnostics pipeline!
Step 1. Create CCS reads
Step 2. Run mutation analysis
Step 3. Upload to result server

Reporting system for mutation results
Collaboration with Wesley Schaal & Ola Spjuth, UPPNEX/Uppsala Univ

Ion Torrent – Ongoing developments
– Ion S5 XL system
– Ion Proton PII chip (EA)

PacBio – Ongoing developments
Fast-track clinical samples
– Preparing workflows for rapid sequencing
– Organ transplantation, diagnostics, outbreaks, ...
New chemistry and active loading of SMRT cells
– Improved quality, longer reads
– Increased throughput (2016?)
New analysis algorithms

Thank you!