Will 10x technology make us rethink genome assemblies?

Slides:



Advertisements
Similar presentations
The Past, Present, and Future of DNA Sequencing
Advertisements

V Improvements to 3kb Long Insert Size Paired-End Library Preparation Naomi Park, Lesley Shirley, Michael Quail, Harold Swerdlow Wellcome Trust Sanger.
Current methods for high-throughput resequencing of custom targets Adam Gordon Nickerson Lab, UW Genome Sciences WHI Genetics SIG call 3/26/14.
Next–generation DNA sequencing technologies – theory & practice
A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.
Proprietary Signal Generation and Imaging Photons Generated Reagent Flow PicoTiterPlate Wells Sequencing By Synthesis 1600K field of addressable wells.
Click to edit Master title style Irys data analysis January 10 th, 2014.
SOLiD Sequencing & Data
The 454 and Ion PGM at the Genomics Core Facility Dr. Deborah Grove, Director for Genetic Analysis Genomics Core Facility Huck Institutes of the Life Sciences.
Some new sequencing technologies. Molecular Inversion Probes.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
The SOLiD System: Next-Generation Sequencing Overview of the SOLiD System –  Scalable  Accurate Ultra High Throughput  Flexible  Mate Pairs.
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Genome Analysis Determine locus & sequence of all the organism’s genes More than 100 genomes have been analysed including humans in the Human Genome Project.
High Throughput Sequencing
Considerations for Analyzing Targeted NGS Data Exome Tim Hague, CTO.
11 © 2009 PerkinElmer © 2010 PerkinElmer November 20, 2012 DNA Services Overview.
Whole Exome Sequencing for Variant Discovery and Prioritisation
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences.
Sequencing Technologies and Applications at JGI
Detecting copy number variations using paired-end sequence data Nick Furlotte CS224 May 29, 2009.
De-novo Assembly Day 4.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Next Generation DNA Sequencing
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
De Novo Genome Assembly - Introduction Henrik Lantz - BILS/SciLife/Uppsala University.
How will new sequencing technologies enable the HMP? Elaine Mardis, Ph.D. Associate Professor of Genetics Co-Director, Genome Sequencing Center Washington.
Considerations for Analyzing Targeted NGS Data Exome Tim Hague, CTO.
SEQUENCING – THE BENCHTOPS. Roche 454 Junior Same technology as 454 FLX Read length: 400 bases Paired-end 100,000 reads 12 hours (instrument time) Output.
Human Genome.
Introduction to RNAseq
billion-piece genome puzzle
De Novo Genome Assembly - Introduction
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.
Library QA & QC Day 1, Video 3
From Reads to Results Exome-seq analysis at CCBR
An Overview of Applications for the MiSeq and HiSeq 2500 April 4, 2016 Kevin Shianna, Ph.D. Sequencing Specialist - Illumina, Inc. MGC USERS GROUP.
Short Read Sequencing Analysis Workshop
Detection of FLT3 Internal Tandem Duplication in Targeted, Short-Read-Length, Next- Generation Sequencing Data  David H. Spencer, Haley J. Abel, Christina.
RNA Quantitation from RNAseq Data
Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017
Cancer Genomics Core Lab
Sequencing technologies
DNA Sequencing -sayed Mohammad Amin Nourion -A’Kia Buford
Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite.
Metafast High-throughput tool for metagenome comparison
Denovo genome assembly of Moniliophthora roreri
Professors: Dr. Gribskov and Dr. Weil
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Research in Computational Molecular Biology , Vol (2008)
Teagasc/APC Sequencing Facility
Henrik Lantz - NBIS/SciLife/Uppsala University
Detection of FLT3 Internal Tandem Duplication in Targeted, Short-Read-Length, Next- Generation Sequencing Data  David H. Spencer, Haley J. Abel, Christina.
2nd (Next) Generation Sequencing
Technology Experimental Design Cost Estimation
Molecular Diagnosis of Autosomal Dominant Polycystic Kidney Disease Using Next- Generation Sequencing  Adrian Y. Tan, Alber Michaeel, Genyan Liu, Olivier.
Introduction to Sequencing
The Right Tool for the Job: Two Platforms for Targeted DNA Sequencing
Sequence the 3 billion base pairs of human
BF528 - Genomic Variation and SNP Analysis
BF nd (Next) Generation Sequencing
Canadian Bioinformatics Workshops
(Top) Construction of synthetic long read clouds with 10× Genomics technology. (Top) Construction of synthetic long read clouds with 10× Genomics technology.
Development of a Novel Next-Generation Sequencing Assay for Carrier Screening in Old Order Amish and Mennonite Populations of Pennsylvania  Erin L. Crowgey,
Presentation transcript:

Will 10x technology make us rethink genome assemblies?

What 10x is currently used to do Genome –– Genome Resequencing Call the full spectrum of variants (particularly long INDELS/CNV and structural variants) and unlock previously inaccessible regions from a single library at equivalent coverage as standard genome resequencing projects Exome – Subselect reads using capture techniques (Agilent) Enable phasing of genes and detection of structural and copy number variation Agilent SureSelect baits improve gene phasing by closing gaps, and recovering hard- to-map loci in the genome (future kits to include previously failed regions) Assembly – de Novo genome assembly Single Cell 3’ RNAseq High-throughput single cell RNA sequencing Scalable transcriptional profiling of 1,000s to 10,000s of individual cells

Genome and Exome Analysis Long-range analysis and phasing of SNVs, indels, and structure variants

In a nut shell

Laboratory Workflow Shear? 8-bp Sample Index 16-bp 10x Barcode 7-bp N-primer (?)

The Math 1 ng Input DNA = 300 genomes copies of the genome Calculations imply that about 50% of all possible fragments end up in a bead

@ Recommended Loading Each locus will have 150 molecules Each locus will have 30x read depth ~35 fragments per molecule @50Kb molecules = 0.2x/molecule

LPM – Linked Reads per molecule LPM = N/M N=Total Sequenced Reads M=Molecules into System N = (G x C)/L G=Genomic size (bases) C=Coverage [target 60x coverage for Supernova] L=Read Length [ 300, for 2x150bp reads] M = (G x NG)/S Ng=Number of Haplotype Genomes [for human 1ng @3.2Gb = 150 haplotype genomes] S=Fragment size of molecules (bases) [Avg Fragment size, ex. 50kb] LPM = C/L x S/NG = 60/300 x 50,000/150 = 67 linked reads per molecule (Human at 1ng input)

As the Chromium Controller utilizes a haplotype limiting dilution, genome size must be taken into account before starting an experiment. Report testing 1Gb to 3.2Gb genome Haplotype limiting dilution is the proportion of a genome within each bead. The higher the dilution the more likely two fragments within the same bead will overlap. In general, larger mammalian genomes, such as mouse should work fine in the Chromium System. The smaller the genome, the smaller the amount of DNA that must be loaded to maintain a haplotype limiting dilution 10X Genomics recommends loading 1-1.2 ng of HMW (avg 100kb) human DNA into the Chromium Controller

Smaller Genomes For smaller genomes, assuming that the same DNA mass was loaded and that the library was sequenced to the same read­depth, the number of Linked­Reads (read pairs) per molecule would drop proportionally, which would reduce the power of the data type. For example, for a genome whose size is 1/10th the size of the human genome (320 Mb), the mean number of Linked­Reads per molecule would be about 6, and the distance between Linked­Reads would be about 8 kb, making it hard to anchor barcodes to short initial contigs. Modifications to workflow, such as loading less DNA and/or increasing coverage would be potential solutions for smaller genomes, but are not described here.

Increased Mapability

Linked Reads

Capture – linked genes Enrich reads of interest instead of random selection Depending on size of capture, can pool more samples/lane

Assembly de novo assemblies

Sample Requirements - Supernova Genome size: Supernova has been tested on genomes in the size range 1.0-3.2 Gb. Other genome characteristics: Supernova has not been tested on genomes having repeat content far greater than human, nor on genomes having extreme GC content. Clonality: strongly recommend that DNA be obtained from an individual organism or clonal population. DNA size: Recommend that this value be at least 50 kb, and preferably 100 kb. DNA length is highly correlated with several assembly statistics, including contig length, phase block length and scaffold length.

Sequencing - Supernova Instrument Configuration Result Lanes HiSeq X Standard Excellent 2 HiSeq 2500 Rapid run 4 High Output Not tested HiSeq 4000 Useable, but observed contig length half as long as those from HiSeq X Miseq standard Read length: Supernova requires as input 2x150 base reads. Sequencing depth: Recommends sequencing to depth between 38x and 56x. For highly polymorphic organisms, we recommend 56x. Coverage higher than 56x may not improve results. Sequence twice as much as mapping application

System Requirements - Supernova 16-core (or greater) Intel or AMD processor 384 GB RAM 2 TB free disk space 64-bit CentOS/RedHat 5.2+ or Ubuntu 8+ Bcl2fastq 2.17 No other large processes running on the system ** Supernova should be run with at most 1.2 billion reads (single reads), and at 38-56x coverage of the genome.

Genome Stats Genome Size (Gb) DNA size(Kb) N50 contig(Kb) N50 scaffold(Mb) N50 phase block (Mb) NA12878 3.2 95.5 85.0 12.8 2.8 NA24385 111.3 90.0 10.4 3.9 HGP 138.8 104.9 19.4 4.6 Yoruban 126.9 100.5 16.1 11.4 Komodo dragon 1.8 85.4 95.3 10.2 0.4 Spotted owl 1.5 72.2 118.3 10.1 0.2 Hummingbird 1.0 86.2 87.6 12.5 Monk seal 2.6 92.3 93.8 14.8 0.6 Chili pepper 3.5 53.3 84.7 4.0 2.1

Genome Quality - Pac Bio vs 10x Qualitatively Pac Bio will produce >> N50 contig lengths (contiguous sequence) on the order of 10 – 50x larger contig N50 10x will will produce >> N50 scaffold length (ordered and arranged contigs with gaps) on the order of 2-5x larger scaffold N50 Costs Human Genome sequenced at ~60x coverage (recommended depth for both Pac Bio and 10x) $70,000 Pac bio $8,500 (4 lanes, HiSeq 2500 2x150, rapid mode) [~$5000 on the X platform] Input DNA 10x requires ~ 1ng input DNA relative to ~10ug of input DNA for Pac Bio Genome size for 10x 1gb-3gb (tested), no such limitation on Pac Bio

Supernova Assembler Basic idea, based on DISCOVAR assembler + Linked Read information Use Barcodes to first prefilter kmers Using kmers (k=48bp), remover all kmers present in only 1 - 10x barcode DISCOVAR de novo Assembly Lines (discovar contigs, but with read info) are ordered and oriented using 10x barcode information. Creates bubbles and megabubbles. Phasing then occurs by orientating each bubble on a line, placing one of its branches on ‘top’ and the other on the ‘bottom’. Iterative algorithm.

Supernova supernova mkfastq supernova run supernova mkoutput Supernova begins with bcl files, converting raw Illumina output to fastq files amendable to their assembly pipeline. Mkfastq has a significant number of assumptions on how the run was performed. Can use bcl2fastq directly to produce fastq files https://support.10xgenomics.com/single-cell/software/pipelines/latest/using/bcl2fastq-direct supernova run Basically no parameters, just runs on fastq files supernova mkoutput run, megabubbles, pseudohap, pseudohap2

Supernova assemblies >55 edges=10,20,40 left=1 right=4 ver=1.3 style=2 ACTTTAGACGGGGACCCTAGACTTACTTGAGAAAACGTTTTTACACTTACCA Field Sample Value Meaning edges 10,20,40 path of edges in the assembly that the sequence describes left 1 identifier of vertex at left end of the path right 4 identifier of vertex at right end of the path ver 1.3 Supernova output format version number style 2 output style identifier (see below)

150X Genome Equivalent per locus - decaploid