Genome De Novo Assemblies and Applications in NGS Sequencing Zemin Ning The Wellcome Trust Sanger Institute.


Outline of the Talk:
- My academic background
- Challenges in genome assemblies from pure Illumina reads
- The Phusion2 pipeline
- The Tasmanian devil genome project
- The Devil genome assembly
- Other assemblies: human, bamboo, miscanthus, etc.

Powder Simulation

Hair Dynamics: Genetics and Human Hair Structure (image labels: African, Caucasian, East Asian)

Informatics Projects Involved
- SSAHA (Sequence Search and Alignment by the Hashing Algorithm)
  - ssaha2: alignment tool for Solexa, 454 and ABI capillary reads
  - ssahaSNP: SNP/indel detection, mainly for ABI capillary reads
  - ssahaEST: EST or cDNA alignment
  - ssaha_SV: structural variation (CNV) detection
  - ssaha_pileup: SNP/indel detection from next-gen data
- Phusion & Phusion2
  - Development and maintenance of the pipeline
  - Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
- TraceSearch
  - Public sequence search facility for all the traces
- Fuzzypath
  - Short read assembler

Challenges in Whole Genome Assembly using Pure Illumina Reads
- Short read length: 2x36; 2x54; 2x75; 2x100
- Large genomes and huge datasets. For human: 100 Gb at 30x
- Repetitive/duplicated structures (Alus, LINEs, SVAs): 30-40% of genomes such as human and mouse; 50-60% of rice and other plant genomes
- Tandem repeats: how many copies do they have?
  TATATATATATATATATATATATATATA
  GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
  GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
  AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
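The tandem-repeat question above can be illustrated with a minimal sketch. The flanking sequences and copy numbers below are made up for illustration and do not come from the slides. The point is that once a repeat array is longer than the word length, two alleles with different copy numbers contain exactly the same set of k-mers, so k-mer content alone cannot tell how many copies are present.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 12
# Hypothetical flanking sequences, chosen only for this illustration.
flank_left, flank_right = "ACGTTGCAAGCT", "GGATCCTTAGCA"

allele_a = flank_left + "TA" * 14 + flank_right  # 14 copies of the TA unit
allele_b = flank_left + "TA" * 20 + flank_right  # 20 copies of the TA unit

# Both repeat arrays are longer than k - 1 bases, so every k-mer seen in
# one allele is also seen in the other.
same_kmers = kmer_set(allele_a, k) == kmer_set(allele_b, k)
print(len(allele_a), len(allele_b), same_kmers)
```

Resolving the copy number needs extra evidence, such as reads or read pairs that span the whole array.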

De Bruijn vs Read Overlap (figure labels: missing from de Bruijn contigs; missing sequences)

Phusion2 Assembly Pipeline
Flowchart labels (order as extracted): Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; Fuzzypath; Velvet; Phrap; 2x75 or 2x100; Base Correction; RP_Assemble

Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash:
  ATGGCGTGCAGT
  TGGCGTGCAGTC
  GGCGTGCAGTCC
  GCGTGCAGTCCA
  CGTGCAGTCCAT
Gap-Hash4x3:
  ATGGGCAGATGT
  TGGCCAGTTGTT
  GGCGAGTCGTTC
  GCGTGTCCTTCG
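The two hashing schemes on this slide can be reproduced with a short sketch. The function names and the `block`, `gap` and `blocks` parameters are illustrative assumptions. Reading "4x3" as three blocks of four bases separated by three skipped bases is inferred from the example words on the slide, not from any documented definition.

```python
def contiguous_words(seq, k=12):
    """All k-base words taken from consecutive positions of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_words(seq, block=4, gap=3, blocks=3):
    """Words built from `blocks` runs of `block` bases, skipping `gap`
    bases between runs (assumed meaning of Gap-Hash4x3)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + b * step: start + b * step + block]
                 for b in range(blocks)]
        words.append("".join(parts))
    return words

seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_words(seq)[:5])
print(gapped_words(seq)[:4])
```

Under these assumptions both schemes yield 12-base words, but a gapped word spans 18 bases of the read.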

Word use distribution for the mouse sequence data at ~7.5 fold (figure labels: Useful Region; Poisson Curve; Real Data Curve)

Sorted List of Each k-Mer and Its Read Indices

k-mer        Read
ACAGAAAAGC   10h06.p1c
ACAGAAAAGC   12a04.q1c
ACAGAAAAGC   13d01.p1c
ACAGAAAAGC   16d01.p1c
ACAGAAAAGC   26g04.p1c
ACAGAAAAGC   33h02.q1c
ACAGAAAAGC   37g12.p1c
ACAGAAAAGC   40d06.p1c
ACAGAAAAGG   16a02.p1c
ACAGAAAAGG   20a10.p1c
ACAGAAAAGG   22a03.p1c
ACAGAAAAGG   26e12.q1c
ACAGAAAAGG   30e12.q1c
ACAGAAAAGG   47a01.p1c

Figure labels: High bits; Low bits; field widths 64 - 2k and 2k
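A sorted list like the one above can be held in plain 64-bit integers. The sketch below is a minimal illustration under stated assumptions. The 2-bit base code and the choice to put the k-mer in the upper 2k bits are assumptions, since the transcript does not make clear which field sits in the high bits. With this layout an ordinary integer sort brings identical k-mers together. The read indices are hypothetical stand-ins for read names.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base encoding
BASES = "ACGT"

def encode_kmer(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode_kmer(value, k):
    """Inverse of encode_kmer for a k-mer of length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))

def pack(kmer, read_index):
    """Assumed layout: k-mer in the upper 2k bits, read index below it."""
    index_bits = 64 - 2 * len(kmer)
    if not 0 <= read_index < (1 << index_bits):
        raise ValueError("read index does not fit in 64 - 2k bits")
    return (encode_kmer(kmer) << index_bits) | read_index

def unpack(word, k):
    """Split a packed word back into (k-mer, read index)."""
    index_bits = 64 - 2 * k
    return decode_kmer(word >> index_bits, k), word & ((1 << index_bits) - 1)

k = 10
entries = [("ACAGAAAAGG", 3), ("ACAGAAAAGC", 7), ("ACAGAAAAGC", 2)]
packed = sorted(pack(kmer, idx) for kmer, idx in entries)
decoded = [unpack(word, k) for word in packed]
print(decoded)
```

Because A, C, G, T map to 0, 1, 2, 3, integer order matches alphabetical order of the k-mers in this sketch.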

Relation Matrix (N x N, rows i and columns j)
R(i,j): number of kmer words shared between read i and read j
Group 1: (1,2,3,5)
Group 2: (4,6)
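A minimal sketch of how such a relation matrix could be filled and turned into read groups. The reads, the word length `k = 5`, the `min_shared` threshold and the use of union-find (single-linkage grouping) are all illustrative assumptions, not the Phusion2 implementation. The toy reads were chosen so that the result matches the grouping shown on the slide.

```python
from collections import defaultdict
from itertools import combinations

def relation_matrix(reads, k):
    """Sparse R(i,j): number of distinct k-mers shared by reads i and j (1-based)."""
    kmer_to_reads = defaultdict(set)
    for idx, read in enumerate(reads, start=1):
        for pos in range(len(read) - k + 1):
            kmer_to_reads[read[pos:pos + k]].add(idx)
    shared = defaultdict(int)
    for owners in kmer_to_reads.values():
        for i, j in combinations(sorted(owners), 2):
            shared[(i, j)] += 1
    return dict(shared)

def group_reads(n_reads, shared, min_shared):
    """Join reads linked by at least min_shared k-mers (union-find)."""
    parent = list(range(n_reads + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for read in range(1, n_reads + 1):
        groups[find(read)].append(read)
    return sorted(groups.values())

# Toy reads: 1, 2, 3 and 5 overlap one region; 4 and 6 overlap another.
reads = ["AACCGGTTAC", "CGGTTACGAT", "TTACGATCGA",
         "TGCATGGCCT", "CCGGTTACGA", "TGGCCTAAGT"]
R = relation_matrix(reads, k=5)
groups = group_reads(len(reads), R, min_shared=2)
print(groups)
```

Reads 1 and 3 share no k-mer directly but still land in the same group because both are linked through reads 2 and 5.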

Paired Reads Separated by “NN”

Error Bases Correction

Mis-assembly errors: Contig Breaking

Read Pair Guided Local Assembler: track read pairs to walk through repetitive regions

Image labels: Tasmanian devil; Opossum; Wallaby

Tasmanian devil facial tumour disease (DFTD)
- Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
- Transmitted by biting
- Commonly metastasises
- First observed in 1996
- Primarily affects adults >1yr
- Death in 4 – 6 months

DFTD samples (map labels): Forestier (33); Fentonbury (no host); Reedy Marsh; Railton; Mangalore; Frankford; Kempton (2); Mt William (2); Coles Bay; Upper Natone; West Pencil Pine (3); Trowunna (2); Narawntapu; Tarraleah; Bronte Park; Nugent (2); St Mary’s (2); Wisedale (?)
Map legend: DFTD originated here c.1996; Area still DFTD free

DFTD samples for sequencing (map labels): Reedy Marsh 2007; Mangalore 2007; Mt William 2007 or 2008; Coles Bay; Upper Natone 2007; Narawntapu 2007; “Evolved” Forestier 2007
Map legend: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; DFTD originated here c.1996; Area still DFTD free

Sequencing T. Devil on Illumina: Strategy
- Tumour or normal genomic DNA
- Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
- Sequencing: 2x100bp reads (short insert); 2x50bp mate pairs
- Alignment using bwa, ssaha2: somatic mutations; germline variants
- De novo assembly
- Sequencing performed at Illumina

Genome Assembly – T. Devil

Solexa reads:
- Number of read pairs: 528 Million
- Finished genome size: 3.5 Gb
- Read length: 2x100bp
- Estimated read coverage: ~30X
- Insert size: 410/ bp
- Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
- Number of reads clustered: 458 Million

Assembly features (stats):
                              Contigs      Supercontigs
Total number of contigs:      1,246,970    792,099
Total bases of contigs:       3.22 Gb      3.62 Gb
N50 contig size:              9,642        434,642
Largest contig:               96,919       4,150,712
Averaged contig size:         2,578        4,564
Contig coverage on genome:    ~92%         >99%
Ratio of placed PE reads:     ~92%         ?
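For readers unfamiliar with the N50 statistic reported above, here is a minimal sketch of the usual definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembled bases. The contig lengths in the example are made up and are not taken from any assembly in this talk.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths: total 300, half is 150, reached at the 70 bp contig.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))
```

The averaged contig size is simply total bases divided by the number of contigs, which is why it can be far smaller than N50 when an assembly has many short contigs.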

Species labels: Monodelphis domestica (Opossum); Macropus eugenii (Wallaby); Sminthopsis macroura (Dunnart); Brown Bear; Dog

Pipeline of Contig Gap Closure

Human Assembly - Yoruba NA18507

Solexa reads:
- Number of read pairs: 560 Million
- Finished genome size: 3.0 Gb
- Read length: 2x100bp
- Estimated read coverage: ~37X
- Insert size: 500/ bp
- Number of reads clustered: 499 Million

Assembly features (contig stats):
- Total number of contigs: 1,142,077
- Total bases of contigs: 2.92 Gb
- N50 contig size: 12,875
- Largest contig: 140,463
- Averaged contig size: 2,561
- Contig coverage over the genome: ~94%
- Mis-assembly errors: ?
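The estimated read coverage on the Tasmanian devil and human slides follows from simple arithmetic: total sequenced bases divided by genome size. The sketch below uses a hypothetical helper name and assumes every read has the full stated length, which ignores trimming and filtering.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage = (pairs * 2 reads per pair * bases per read) / genome size."""
    return read_pairs * 2 * read_length / genome_size

devil = read_coverage(528e6, 100, 3.5e9)  # Tasmanian devil slide figures
human = read_coverage(560e6, 100, 3.0e9)  # Yoruba NA18507 slide figures
print(round(devil, 1), round(human, 1))
```

Both values round to the ~30X and ~37X quoted on the slides.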

Bamboo Genome Assembly (Tetraploid)

Solexa reads:
- Number of read pairs: 359 Million
- Finished genome size: 2.0 Gb
- Read length: 2x120bp
- Estimated read coverage: ~43X
- Insert size: 500/ bp
- Number of reads clustered: 316 Million

Assembly features (contig stats):
- Total number of contigs: 733,465
- Total bases of contigs: 1.91 Gb
- N50 contig size: 8,163
- Largest contig: 117,250
- Averaged contig size: 2,592
- Contig coverage over the genome: ~92%
- Mis-assembly errors: ?

Genome Assembly – Miscanthus

Solexa reads:
- Number of read pairs: 502 Million
- Finished genome size: 2.0 Gb
- Read length: 2x76bp
- Estimated read coverage: ~35X
- Insert size: 410/ bp
- Mate pair data: 5Kb
- Number of reads clustered: 438 Million

Assembly features (stats):
                              Contigs      Supercontigs
Total number of contigs:      2,241,465    2,090,385
Total bases of contigs:       1.64 Gb      1.92 Gb
N50 contig size:              4,301        29,076
Largest contig:               71,161       730,290
Averaged contig size:
Contig coverage on genome:    ~85%         >95%
Ratio of placed PE reads:     ~82%         ?

Melanoma cell line COLO-829 Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.

Homozygous/Heterozygous Indels
Figure labels: (a) Insertion: H1, Ref, B; (b) Deletion: H1, Ref, B1, B2
(a) Insertions: solid lines show reads whose alignment terminates at the breakpoint; the dashed line shows reads whose alignment crosses the breakpoint.
(b) Deletion: the solid line shows a read whose alignment terminates at the breakpoint; dashed lines show reads whose alignment crosses the breakpoint.

Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.

Acknowledgements:
- Elizabeth Murchison
- Erin Pleasance
- Mike Stratton
- Kai Ye
- Dirk Evers
- Ole Schulz-Trieglaff
- Qi Feng
- Bin Han