Download presentation
1
Whole Genome Assembly with iPlant
Fritz Sedlazeck Dec. 2, 2015 iPlant workshop
2
Outline Assembly theory Assembly challenges Assembly methods
Assembly by analogy Assembly challenges Coverage Repeats Noise Assembly methods Error correction Overlap graph De Brujin graph Assembly metrics & Recommendations An unfamiliar problem with familiar data
3
1. Shredded Book Reconstruction
Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, … It was the best worst of times, it was the age of wisdom, it was the age of foolishness, It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, … It was was the worst of times, the best of times, it wisdom, it was the age of foolishness, … It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, … It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, … It was the best worst of times, it was the age of wisdom, it was the age of foolishness, It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, … It was was the worst of times, the best of times, it wisdom, it was the age of foolishness, … It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … Fragment are very short compared to the length of the text ~5 words per fragment uniformly sampling the text Fragment boundaries from different spools occur at different positions Compare fragments to each other to find overlapping sequences How can he reconstruct the text? 5 copies x 138, 656 words / 5 words per fragment = 138k fragments The short fragments from every copy are mixed together Some fragments are identical
4
Greedy Reconstruction
It was the best of of times, it was the best of times, it was times, it was the worst was the best of times, the best of times, it it was the worst of was the worst of times, worst of times, it was times, it was the age it was the age of was the age of wisdom, the age of wisdom, it age of wisdom, it was of wisdom, it was the wisdom, it was the age was the age of foolishness, the worst of times, it Greedy Reconstruction It was the best of was the best of times, the best of times, it best of times, it was of times, it was the of times, it was the times, it was the worst times, it was the age The repeated sequence make the correct reconstruction ambiguous It was the best of times, it was the [worst/age]
5
2. Challenges for assemblies
Uncovered regions/ Coverage Repeating regions Noise in the data (e.g. sequencing errors) It was the best of was the best of times, the best of times, it best of times, it was of times, it was the of times, it was the times, it was the worst times, it was the age
6
Typical sequencing coverage
Imagine raindrops on a sidewalk We want to cover the entire sidewalk but each drop costs $1
7
1x sequencing
8
2x sequencing
9
4x sequencing
10
8x sequencing
11
Poisson Distribution The probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. Formulation comes from the limit of the binomial equation Resembles a normal distribution, but over the positive values, and with only a single parameter. Key property: The standard deviation is the square root of the mean.
12
2. Challenges for assemblies
Uncovered regions/ Coverage Repeating regions Noise in the data (e.g. sequencing errors) It was the best of was the best of times, the best of times, it best of times, it was of times, it was the of times, it was the times, it was the worst times, it was the age
13
Repetitive regions Repeat Type Definition / Example Prevalence
Low-complexity DNA / Microsatellites (b1b2…bk)N where 1 < k < 6 CACACACACACACACACACA 2% SINEs (Short Interspersed Nuclear Elements) Alu sequence (~280 bp) Mariner elements (~80 bp) 13% LINEs (Long Interspersed Nuclear Elements) ~500 – 5,000 bp 21% LTR (long terminal repeat) retrotransposons Ty1-copia, Ty3-gypsy, Pao-BEL (~100 – 5,000 bp) 8% Other DNA transposons 3% Gene families & segmental duplications 4% Over 50% of mammalian genomes are repetitive Large plant genomes tend to be even worse Wheat: 16 Gbp; Pine: 24 Gbp
14
Repetitive regions A R B C A R B C
15
Repetitive regions A R B C A R B R C R
16
Paired-end and Mate-pairs
Paired-end sequencing Read one end of the molecule, flip, and read the other end Generate pair of reads separated by up to 500bp with inward orientation 300bp Mate-pair sequencing Circularize long molecules (1-10kbp), shear into fragments, & sequence Mate failures create short paired-end reads 10kbp ~10kbp (outies) 10kbp circle 300bp (innies)
17
Scaffolding Initial contigs terminate at
Coverage gaps: especially extreme GC Conflicts: errors, repeat boundaries Use mate-pairs to resolve correct order through assembly graph Place sequence to satisfy the mate constraints Mates through repeat nodes are tangled Final scaffold may have internal gaps called sequencing gaps We know the order, orientation, and spacing, but just not the bases. Fill with Ns instead A C D R B A C D R B
18
Long Read Sequencing Technology
PacBio RS II CSHL/PacBio 10k 20k 30k 40k Moleculo (Voskoboynik et al. 2013) Oxford Nanopore CSHL/ONT 10k 20k 30k 40l Population structure in O. sativa: There are two major clades, Indica and Japonica, identified by chloroplast sequence differences/ There are 5 major subpopulations: indica and aus are in the Indica clade; tropical japonica, temperate japonica and aromatic (Basmati) are in the Japonica clade. We selected IR64 (indica) and BJ123 (aus) for denovo sequencing, and use Nipponbare (temperate japonica) as the control. The aus subpopulation is centered in the Bengali-speaking region that is now Bangladesh (aus is a Bangladeshi word meaning “autumn rice”)- it has nothing to do with Australia (!)
19
SMRT Sequencing Imaging of fluorescently phospholinked labeled nucleotides as they are incorporated by a polymerase anchored to a Zero-Mode Waveguide (ZMW). Intensity Time
20
Sample of 100k reads aligned with BLASR requiring >100bp alignment
SMRT Sequencing Data TTGTAAGCAGTTGAAAACTATGTGTGGATTTAGAATAAAGAACATGAAAG ||||||||||||||||||||||||| ||||||| |||||||||||| ||| TTGTAAGCAGTTGAAAACTATGTGT-GATTTAG-ATAAAGAACATGGAAG ATTATAAA-CAGTTGATCCATT-AGAAGA-AAACGCAAAAGGCGGCTAGG | |||||| ||||||||||||| |||| | |||||| |||||| |||||| A-TATAAATCAGTTGATCCATTAAGAA-AGAAACGC-AAAGGC-GCTAGG CAACCTTGAATGTAATCGCACTTGAAGAACAAGATTTTATTCCGCGCCCG | |||||| |||| || |||||||||||||||||||||||||||||||| C-ACCTTG-ATGT-AT--CACTTGAAGAACAAGATTTTATTCCGCGCCCG TAACGAATCAAGATTCTGAAAACACAT-ATAACAACCTCCAAAA-CACAA | ||||||| |||||||||||||| || || |||||||||| ||||| T-ACGAATC-AGATTCTGAAAACA-ATGAT----ACCTCCAAAAGCACAA -AGGAGGGGAAAGGGGGGAATATCT-ATAAAAGATTACAAATTAGA-TGA |||||| || |||||||| || |||||||||||||| || ||| GAGGAGG---AA-----GAATATCTGAT-AAAGATTACAAATT-GAGTGA ACT-AATTCACAATA-AATAACACTTTTA-ACAGAATTGAT-GGAA-GTT ||| ||||||||| | ||||||||||||| ||| ||||||| |||| ||| ACTAAATTCACAA-ATAATAACACTTTTAGACAAAATTGATGGGAAGGTT TCGGAGAGATCCAAAACAATGGGC-ATCGCCTTTGA-GTTAC-AATCAAA || ||||||||| ||||||| ||| |||| |||||| ||||| ||||||| TC-GAGAGATCC-AAACAAT-GGCGATCG-CTTTGACGTTACAAATCAAA ATCCAGTGGAAAATATAATTTATGCAATCCAGGAACTTATTCACAATTAG ||||||| ||||||||| |||||| ||||| |||||||||||||||||| ATCCAGT-GAAAATATA--TTATGC-ATCCA-GAACTTATTCACAATTAG Sample of 100k reads aligned with BLASR requiring >100bp alignment Match 83.7% Insertions 11.5% Deletions 3.4% Mismatch 1.4%
21
2. Challenges for assemblies
Uncovered regions/ Coverage Repeating regions Noise in the data (e.g. sequencing errors) It was the best of was the best of times, the best of times, it best of times, it was of times, it was the of times, it was the times, it was the worst times, it was the age
22
Consensus Accuracy and Coverage
Coverage can overcome random errors Dashed: error model from binomial sampling Solid: observed accuracy Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700
23
PacBio Assembly Algorithms
PBJelly PacBioToCA & ECTools HGAP/MHAP & Quiver Population structure in O. sativa: There are two major clades, Indica and Japonica, identified by chloroplast sequence differences/ There are 5 major subpopulations: indica and aus are in the Indica clade; tropical japonica, temperate japonica and aromatic (Basmati) are in the Japonica clade. We selected IR64 (indica) and BJ123 (aus) for denovo sequencing, and use Nipponbare (temperate japonica) as the control. The aus subpopulation is centered in the Bengali-speaking region that is now Bangladesh (aus is a Bangladeshi word meaning “autumn rice”)- it has nothing to do with Australia (!) Gap Filling and Assembly Upgrade English et al (2012) PLOS One. 7(11): e47768 Hybrid/PB-only Error Correction Koren, Schatz, et al (2012) Nature Biotechnology. 30:693–700 PB-only Correction & Polishing Chin et al (2013) Nature Methods. 10:563–569 < 5x PacBio Coverage > 50x
24
2. Ingredients for a good assembly
Coverage High coverage is required Oversample the genome to ensure every base is sequenced with long overlaps between reads Biased coverage will also fragment assembly Read Coverage Expected Contig Length Read Length Reads & mates must be longer than the repeats Short reads will have false overlaps forming hairball assembly graphs With long enough reads, assemble entire chromosomes into contigs Quality Errors obscure overlaps Reads are assembled by finding kmers shared in pair of reads High error rate requires very short seeds, increasing complexity and forming assembly hairballs Amount of oversampling depends of read length, genome complexity Current challenges in de novo plant genome sequencing and assembly Schatz MC, Witkowski, McCombie, WR (2012) Genome Biology. 12:243
25
3.Assembly methods Overlap consensus graph De Brujin graph
Typically used for long reads (+) Handles sequencing errors (-) High computational demand De Brujin graph Typically used for short reads (+) Low complexity (-) Sensitive to sequencing errors
26
Overlap consensus True sequence: Reads (7): Best overlap:
It was the best of times, it was the worst of times, it was the age of wisdom Reads (7): It was the best of times; best of times, it was the ; it was the worst of times; worst of times, it was the age ; it was the age the age of wisdom Best overlap: Target: It was the best of times Query1: best of times, it was the Target: It was the --best of times Query2: it was the worst of times Query3: worst of times, it was the age
27
De Bruijn Graph Construction
Dk = (V,E) V = All length-k subfragments (k < l) E = Directed edges between consecutive subfragments Nodes overlap by k-1 words Locally constructed graph reveals the global sequence structure Overlaps between sequences implicitly computed It was the best was the best of It was the best of Original Fragment Directed Edge de Bruijn, 1946 Idury and Waterman, 1995 Pevzner, Tang, Waterman, 2001
28
De Bruijn Graph Assembly
It was the best best of times, it was the best of the best of times, it was the worst was the worst of worst of times, it the worst of times, of times, it was times, it was the it was the age was the age of the age of foolishness After graph construction, try to simplify the graph as much as possible the age of wisdom, age of wisdom, it of wisdom, it was wisdom, it was the
29
De Bruijn Graph Assembly
It was the best of times, it it was the worst of times, it of times, it was the the age of foolishness it was the age of After graph construction, try to simplify the graph as much as possible the age of wisdom, it was the
30
Errors in the graph Clip Tips Pop Bubbles was the worst of times,
was the worst of tymes, was the worst of tymes, times, it was the age the worst of times, it tymes, it was the age the worst of tymes, tymes, was the worst of was the worst of it was the age the worst of times, times, worst of times, it Ecoli (Chaisson, 2009)
31
Applications in iPlant
Genome assembly short reads: Allpaths-LG Velvet Soap de novo Genome assembly long reads (upcoming): Celera Falcon
32
4. Comparing Assembler Assemblathon Genomic alignment
ALLPATHS and SOAPdenovo came out neck-and-neck followed closely behind by Celera Assembler, SGA, and ABySS Genomic alignment Using MUMmer Completeness of assembly Using CEGMA or BUSCO Length statistics N50, NG50
33
N50 size Def: 50% of the genome is in contigs as large as the N50 value Example: 1 Mbp genome N50 size = 30 kbp (300k+100k+45k+45k+30k = 520k >= 500kbp) A greater N50 is indicative of improvement in every dimension: Better resolution of genes and flanking regulatory regions Better resolution of transposons and other complex sequences Better resolution of chromosome organization Better sequence for all downstream analysis 50% 1000 300 100 45 45 30 20 15 15 10 . . . . .
34
Example: O. sativa pv Indica (IR64)
PacBio RS II sequencing at PacBio Over 118x coverage using P5-C3 long read sequencing Mean: 5918bp 49.7x over 10kbp 6.3x over 20kb Max: 54,288bp
35
O. sativa pv Indica (IR64) Genome size: ~370 Mb
Chromosome N50: ~29.7 Mbp Assembly Contig NG50 MiSeq Fragments 25x 456bp (3 runs 450 FLASH) 19 kbp “ALLPATHS-recipe” 50x 180 36x 2100 51x 4800 18 kbp HGAP + CA 10kbp 4.0 Mbp Nipponbare BAC-by-BAC Assembly 5.1 Mbp HGAP Read Lengths Max: 53,652bp 22.7x over 10kbp (discarded reads below 8500bp)
36
S5 Hybrid Sterility Locus
Sanger …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Illumina …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… PacBio …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC…
37
S5 Hybrid Sterility Locus
Sanger …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Illumina …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… PacBio …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… 100kbp
38
S5 Hybrid Sterility Locus
Sanger …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Illumina …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… PacBio …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… S5 Hybrid Sterility Locus
39
S5 Hybrid Sterility Locus
Sanger …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Illumina …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Pacbio …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Genome size: ~370 Mb Chromosome N50: ~29.7 Mbp S5 Hybrid Sterility Locus
40
S5 Hybrid Sterility Locus
Sanger …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Illumina …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Pacbio …ACCCTGATATTCTGAGTTACAAGGCATTCAGCTACTGCTTGCCCACTGACGAGACC… Genome size: ~370 Mb Chromosome N50: ~29.7 Mbp S5 Hybrid Sterility Locus 5.3Mbp Improvements from 20kbp to 4Mbp contig N50: Over 20 Megabases of additional sequence Extremely high sequence identity (>99.9%) Thousands of gaps filled, hundreds of mis-assemblies corrected Complete gene models, promoter regions for nearly every gene True representation of transposons and other complex features Opportunities for studying large scale chromosome evolution Largest contigs approach complete chromosome arms
41
What should we expect from an assembly?
Analysis of dozens of genomes from across the tree of life with real and simulated data Summary & Recommendations < 100 Mbp: 100x PB C3-P5 expect near perfect chromosome arms < 1GB: 100x PB C3-P5 high quality assembly: contig N50 over 1Mbp > 1GB: hybrid/gap filling expect contig N50 to be 100kbp – 1Mbp > 5GB: Error correction and assembly complexity of single molecule sequencing reads. Lee, H*, Gurtowski, J*, Yoo, S, Marcus, S, McCombie, WR, Schatz, MC
42
Assembly Summary Assembly quality depends on
Coverage: low coverage is mathematically hopeless Repeat composition: high repeat content is challenging Read length: longer reads help resolve repeats Error rate: errors reduce coverage, obscure true overlaps Assembly is a hierarchical, starting from individual reads, build high confidence contigs, incorporate the mates to build scaffolds Extensive error correction is the key to getting the best assembly possible from a given data set Watch out for collapsed repeats & other misassemblies Globally/Locally reassemble data from scratch with better parameters & stitch the 2 assemblies together
43
Thank you http://schatzlab.cshl.edu @sedlazeck
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.