Published by Scot Wells. Modified over 9 years ago.
1
Genome De Novo Assemblies and Applications in NGS Sequencing
Zemin Ning
The Wellcome Trust Sanger Institute
2
Outline of the Talk:
- My academic background
- Challenges in genome assemblies from pure Illumina reads
- The Phusion2 pipeline
- The Tasmanian devil genome project
- The Devil genome assembly
- Other assemblies: human, bamboo, miscanthus, etc.
3
Powder Simulation
4
Hair Dynamics
Genetics and Human Hair Structure: AFRICAN, CAUCASIAN, EAST ASIAN
5
Informatics Projects Involved
SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
- ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
- ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
- ssahaEST: EST or cDNA alignment
- ssaha_SV: structural variation (CNVs) detection
- ssaha_pileup: SNP/indel detection from next-gen data
Phusion & Phusion2
- Development and maintenance of the pipeline
- Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
TraceSearch
- Public sequence search facility for all the traces
Fuzzypath
- Short read assembler
6
Challenges in Whole Genome Assembly using Pure Illumina Reads
- Short read length: 2x36; 2x54; 2x75; 2x100
- Large genomes and huge datasets. For human: 100 Gb at 30x
- Repetitive/duplication structures: Alus, LINEs, SVAs. 30-40% in genomes such as human and mouse; 50-60% in rice and other plant genomes
- Tandem repeats: how many copies do they have?
  TATATATATATATATATATATATATATA
  GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
  GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
  AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
7
De Bruijn vs Read overlap Missing from de Bruijn contigs Missing sequences
8
Phusion2 Assembly Pipeline
Flow diagram labels: Solexa Reads (2x75 or 2x100); Data Process; Base Correction; Reads Group; Assembly (Fuzzypath, Velvet, Phrap); Contigs; Long Insert Reads; RP_Assemble; PRono; Supercontig
9
Kmer Word Hashing
Base Hash K = 12
Read: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash: ATGGCGTGCAGT, TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA, CGTGCAGTCCAT
Gap-Hash4x3: ATGGGCAGATGT, TGGCCAGTTGTT, GGCGAGTCGTTC, GCGTGTCCTTCG
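The two hashing schemes on this slide can be sketched in a few lines. The reading of "4x3" as three blocks of 4 bases separated by 3-base gaps is inferred from the example words on the slide, not stated there; the function names and parameters are hypothetical.

```python
def contiguous_kmers(seq, k=12):
    """All overlapping k-base words of seq (contiguous base hash)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, gap=3, blocks=3):
    """Gapped words: `blocks` runs of `block` bases, skipping `gap` bases
    between runs. Assumed meaning of "Gap-Hash4x3" (inferred from the slide)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for start in range(len(seq) - span + 1):
        pieces = [seq[start + b * step:start + b * step + block]
                  for b in range(blocks)]
        words.append("".join(pieces))
    return words


READ = "ATGGCGTGCAGTCCATGTTCGGATCA"  # example read from the slide
CONTIGUOUS = contiguous_kmers(READ)
GAPPED = gapped_kmers(READ)
```

Both schemes hash 12 bases per word, but a gapped word spans 18 bases of the read, so a single base error falls inside fewer of the words that cover it.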
10
Word use distribution for the mouse sequence data at ~7.5 fold
Plot labels: Useful Region; Poisson Curve; Real Data Curve
11
Sorted List of Each k-Mer and Its Read Indices

k-mer        Read
ACAGAAAAGC   10h06.p1c
ACAGAAAAGC   12a04.q1c
ACAGAAAAGC   13d01.p1c
ACAGAAAAGC   16d01.p1c
ACAGAAAAGC   26g04.p1c
ACAGAAAAGC   33h02.q1c
ACAGAAAAGC   37g12.p1c
ACAGAAAAGC   40d06.p1c
ACAGAAAAGG   16a02.p1c
ACAGAAAAGG   20a10.p1c
ACAGAAAAGG   22a03.p1c
ACAGAAAAGG   26e12.q1c
ACAGAAAAGG   30e12.q1c
ACAGAAAAGG   47a01.p1c

64-bit word layout labels: High bits, Low bits; field widths 64 - 2k and 2k
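A minimal sketch of packing a k-mer and a read index into one 64-bit word and sorting so that identical k-mers end up adjacent. The 2-bit base code, the placement of the k-mer in the low 2k bits with the read index in the high 64 - 2k bits (read off the flattened slide labels), and the integer read indices are all assumptions for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer into 2k bits, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def decode_kmer(value, k):
    """Inverse of encode_kmer."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def pack(kmer, read_index):
    """Read index in the high 64 - 2k bits, k-mer in the low 2k bits."""
    k = len(kmer)
    if not 0 <= read_index < (1 << (64 - 2 * k)):
        raise ValueError("read index does not fit in 64 - 2k bits")
    return (read_index << (2 * k)) | encode_kmer(kmer)


def unpack(word, k):
    mask = (1 << (2 * k)) - 1
    return decode_kmer(word & mask, k), word >> (2 * k)


def sort_by_kmer(words, k):
    """Order words by k-mer first, then by read index."""
    mask = (1 << (2 * k)) - 1
    return sorted(words, key=lambda w: (w & mask, w >> (2 * k)))


K = 10  # the k-mers listed on the slide are 10 bases long
ENTRIES = [("ACAGAAAAGG", 16), ("ACAGAAAAGC", 12), ("ACAGAAAAGC", 10)]
WORDS = [pack(kmer, index) for kmer, index in ENTRIES]
SORTED = [unpack(word, K) for word in sort_by_kmer(WORDS, K)]
```

After sorting, every read that contains a given k-mer sits in one contiguous run of the list, which is what makes counting shared words between reads cheap.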
12
Relation Matrix
R(i,j): number of kmer words shared between read i and read j (reads 1 … N; first six shown)

 i\j    1    2    3    4    5    6
  1     -   41    0    0    0    0
  2    41    -   37    0    0    0
  3     0   37    -    0   22    0
  4     0    0    0    -    0   27
  5     0    0   22    0    -    0
  6     0    0    0   27    0    -

Group 1: (1,2,3,5)
Group 2: (4,6)
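One way to get from the relation matrix to the read groups is to treat reads as nodes, link any two reads that share enough kmer words, and take connected components. The slide does not say how groups are formed, so the union-find approach, the `min_shared` threshold and the function name are assumptions; the nonzero entries are a reading of the slide's example matrix.

```python
def group_reads(shared, n_reads, min_shared=1):
    """Group reads 1..n_reads into connected components.

    shared maps (i, j) to the number of kmer words reads i and j share;
    pairs below min_shared are ignored (hypothetical threshold).
    """
    parent = list(range(n_reads + 1))  # index 0 unused

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for read in range(1, n_reads + 1):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values())


# Nonzero upper-triangle entries of the example matrix.
R = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}
GROUPS = group_reads(R, n_reads=6)
STRICT_GROUPS = group_reads(R, n_reads=6, min_shared=40)
```

With the default threshold this gives the two groups named on the slide; raising `min_shared` splits groups apart, which is the knob one would turn to keep repeats from merging unrelated reads.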
13
Paired Reads Separated by “NN”
14
Error Bases Correction
15
Mis-assembly errors: Contig Breaking
16
Read Pair Guided Local Assembler
Track read pairs to walk through repetitive regions
17
Tasmanian devil Opossum Wallaby Tasmanian devil
18
Tasmanian devil facial tumour disease (DFTD)
- Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
- Transmitted by biting
- Commonly metastasises
- First observed in 1996
- Primarily affects adults >1yr
- Death in 4-6 months
19
DFTD samples
Map of sampling sites: Forestier (33), Fentonbury (no host), Reedy Marsh, Railton, Mangalore, Frankford, Kempton (2), Mt William (2), Coles Bay, Upper Natone, West Pencil Pine (3), Trowunna (2), Narawntapu, Tarraleah, Bronte Park, Nugent (2), St Mary’s (2), Wisedale (?)
Labels on the map: 2006, 2007, 2008 and 14, 4, 13
Map legend: DFTD originated here c.1996; Area still DFTD free
20
DFTD samples for sequencing
Map of sampling sites: Reedy Marsh 2007; Mangalore 2007; Mt William 2007 or 2008; Coles Bay; Upper Natone 2007; Narawntapu 2007; Forestier 2007
Strain labels: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; “Evolved”
Map legend: DFTD originated here c.1996; Area still DFTD free
21
Sequencing T. Devil on Illumina: Strategy
- Tumour or normal genomic DNA
- Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
- Sequencing: 2x100bp reads (short insert); 2x50bp mate pairs
- Alignment using bwa, ssaha2: somatic mutations; germline variants
- De novo assembly
Sequencing performed at Illumina
22
Genome Assembly: T. Devil
Solexa reads:
- Number of read pairs: 528 Million
- Finished genome size: 3.5 Gb
- Read length: 2x100bp
- Estimated read coverage: ~30X
- Insert size: 410/50-600 bp
- Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
- Number of reads clustered: 458 Million
Assembly features (stats):

                               Contigs      Supercontigs
Total number of contigs        1,246,970    792,099
Total bases of contigs         3.22 Gb      3.62 Gb
N50 contig size                9,642        434,642
Largest contig                 96,919       4,150,712
Averaged contig size           2,578        4,564
Contig coverage on genome      ~92%         >99%
Ratio of placed PE reads       ~92%         ?
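The estimated read coverage follows directly from the read counts: total sequenced bases divided by genome size. A small sketch, assuming the slide's genome size means 3.5 billion bases; the function name is hypothetical.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage from paired-end reads: two reads per pair."""
    total_bases = read_pairs * 2 * read_length
    return total_bases / genome_size


# Tasmanian devil figures from the slide: 528 million pairs of 2x100bp reads
# over a genome taken as 3.5 billion bases.
DEVIL_COVERAGE = read_coverage(528_000_000, 100, 3.5e9)
```

This comes out slightly above 30, in line with the ~30X quoted on the slide.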
23
Monodelphis domestica ( Opossum ) Macropus eugenii (Wallaby) Sminthopsis macroura (Dunnart) Brown Bear Dog
24
Pipeline of Contig Gap Closure
25
Human Assembly: Yoruba NA18507
Solexa reads:
- Number of read pairs: 560 Million
- Finished genome size: 3.0 Gb
- Read length: 2x100bp
- Estimated read coverage: ~37X
- Insert size: 500/50-700 bp
- Number of reads clustered: 499 Million
Assembly features (contig stats):
- Total number of contigs: 1,142,077
- Total bases of contigs: 2.92 Gb
- N50 contig size: 12,875
- Largest contig: 140,463
- Averaged contig size: 2,561
- Contig coverage over the genome: ~94%
- Mis-assembly errors: ?
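The N50 figures in these stats use the usual definition: the contig length such that contigs of at least that length hold half or more of all assembled bases. A minimal sketch with toy contig lengths (not data from the talk); the function name is hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half
    of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty list of contig lengths")


TOY_CONTIGS = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, total 300
TOY_N50 = n50(TOY_CONTIGS)
TOY_MEAN = sum(TOY_CONTIGS) / len(TOY_CONTIGS)
```

Because N50 weights contigs by length, it sits well above the mean whenever an assembly has many short contigs, which is why the slides report both.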
26
Bamboo Genome Assembly (Tetraploid)
Solexa reads:
- Number of read pairs: 359 Million
- Finished genome size: 2.0 Gb
- Read length: 2x120bp
- Estimated read coverage: ~43X
- Insert size: 500/50-700 bp
- Number of reads clustered: 316 Million
Assembly features (contig stats):
- Total number of contigs: 733,465
- Total bases of contigs: 1.91 Gb
- N50 contig size: 8,163
- Largest contig: 117,250
- Averaged contig size: 2,592
- Contig coverage over the genome: ~92%
- Mis-assembly errors: ?
27
Genome Assembly: Miscanthus
Solexa reads:
- Number of read pairs: 502 Million
- Finished genome size: 2.0 Gb
- Read length: 2x76bp
- Estimated read coverage: ~35X
- Insert size: 410/50-600 bp
- Mate pair data: 5Kb
- Number of reads clustered: 438 Million
Assembly features (stats):

                               Contigs      Supercontigs
Total number of contigs        2,241,465    2,090,385
Total bases of contigs         1.64 Gb      1.92 Gb
N50 contig size                4,301        29,076
Largest contig                 71,161       730,290
Averaged contig size           732          919
Contig coverage on genome      ~85%         >95%
Ratio of placed PE reads       ~82%         ?
28
Melanoma cell line COLO-829
Paul Edwards, Departments of Pathology and Oncology, University of Cambridge
29
Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.
30
Homozygous/Heterozygous Indels
Figure labels: (a) Insertion: H1, Ref, B; (b) Deletion: H1, Ref, B1, B2
(a) Insertions: solid lines are reads whose alignment terminates at the breakpoint; the dashed line is a read whose alignment crosses the breakpoint.
(b) Deletion: the solid line is a read whose alignment terminates at the breakpoint; dashed lines are reads whose alignment crosses the breakpoint.
31
Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.
32
Acknowledgements: Elizabeth Murchison, Erin Pleasance, Mike Stratton, Kai Ye, Dirk Evers, Ole Schulz-Trieglaff, Qi Feng, Bin Han