Phusion2 and The Genome Assembly of Tasmanian Devil

Presentation transcript:

Phusion2 and The Genome Assembly of Tasmanian Devil
Zemin Ning
The Wellcome Trust Sanger Institute

Outline of the Talk:
  Challenges in genome assemblies from pure Illumina reads
  The Phusion2 pipeline
  The Tasmanian devil genome project
  The Devil genome assembly
  Other assemblies: human cancer, zebrafish, rice, etc.

Challenges in Whole Genome Assembly using Pure Illumina Reads
  Large genomes and huge datasets: for human, about 100 Gb of sequence at 30x coverage
  Repetitive and duplicated structures (Alus, LINEs, SVAs): 30-40% of genomes such as human and mouse; 50-60% of rice and other plant genomes
  Tandem repeats: how many copies do they have?
    TATATATATATATATATATATATATATA
    GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
    GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
    AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
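The data volume quoted on this slide follows from simple arithmetic: total sequenced bases equal genome size times fold coverage. The sketch below is only an illustration; the function names and the 3 Gb human genome size are assumptions, and with them the result is about 90 Gb, which the slide presumably rounds to 100 Gb.

```python
import math


def total_bases(genome_size_bp: float, coverage: float) -> float:
    """Bases that must be sequenced to reach the given fold coverage."""
    return genome_size_bp * coverage


def read_pairs_needed(genome_size_bp: float, coverage: float, read_length_bp: int) -> int:
    """Number of paired-end read pairs (two reads per pair) for that coverage."""
    bases = total_bases(genome_size_bp, coverage)
    return math.ceil(bases / (2 * read_length_bp))


if __name__ == "__main__":
    # Assumed human genome size of 3 Gb, sequenced to 30x with 2x100 bp reads.
    print(f"{total_bases(3.0e9, 30) / 1e9:.0f} Gb of sequence")
    print(f"{read_pairs_needed(3.0e9, 30, 100):,} read pairs")
```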

De Bruijn vs Read Overlap
[Figure: sequences missing from de Bruijn contigs]

Phusion2 Assembly Pipeline
[Flow diagram. Labels: Solexa Reads (2x75 or 2x100); Data Process; Base Correction; Reads Group; Assembly; Fuzzypath; Velvet; Phrap; Contigs; Long Insert Reads; PRono; Supercontig]

Repetitive Contig and Read Pairs
[Figure labels: Depth; Grouped Reads by Phusion]

Kmer Word Hashing

Sequence: ATGGCGTGCAGTCCATGTTCGGATCA

Contiguous Base Hash, K = 12:
  ATGGCGTGCAGT
  TGGCGTGCAGTC
  GGCGTGCAGTCC
  GCGTGCAGTCCA
  CGTGCAGTCCAT

Gap-Hash 4x3:
  ATGGGCAGATGT
  TGGCCAGTTGTT
  GGCGAGTCGTTC
  GCGTGTCCTTCG
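The two word types on this slide can be reproduced with a few lines of code. This is a minimal sketch, not the Phusion2 implementation: the function names are illustrative, and the gap of 3 skipped bases between the 4-base blocks is inferred from the example words on the slide rather than stated there.

```python
def contiguous_kmers(seq: str, k: int = 12) -> list[str]:
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq: str, block: int = 4, blocks: int = 3, gap: int = 3) -> list[str]:
    """Words built from `blocks` runs of `block` bases, skipping `gap` bases between runs.

    The gap length is an assumption inferred from the slide's example.
    """
    stride = block + gap
    span = blocks * block + (blocks - 1) * gap  # bases covered by one gapped word
    words = []
    for start in range(len(seq) - span + 1):
        pieces = [seq[start + b * stride: start + b * stride + block] for b in range(blocks)]
        words.append("".join(pieces))
    return words


if __name__ == "__main__":
    sequence = "ATGGCGTGCAGTCCATGTTCGGATCA"
    print(contiguous_kmers(sequence)[:5])
    print(gapped_kmers(sequence)[:4])
```

Both word types carry 12 bases, but a gapped word samples an 18-base window, so a single mismatch inside a skipped position does not change the word.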

Word use distribution for the mouse sequence data at ~7.5-fold coverage
[Plot: Poisson curve compared with the real data curve; the useful region is marked]
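The Poisson curve on this plot is the distribution of word occurrence counts one would expect if reads sampled a repeat-free genome uniformly. The sketch below evaluates that curve and picks a count window around its peak. It is an illustration only: using 7.5 as the Poisson mean and selecting the window with a probability cut-off are assumptions, not the definition of the useful region used in the talk.

```python
import math


def poisson_pmf(count: int, mean: float) -> float:
    """Probability that a word occurs `count` times when the expected count is `mean`."""
    # Computed in log space so large counts do not overflow.
    return math.exp(-mean + count * math.log(mean) - math.lgamma(count + 1))


def count_window(mean: float, min_probability: float, max_count: int = 200):
    """Smallest and largest occurrence counts whose Poisson probability reaches the cut-off.

    The cut-off is a hypothetical way to bound a window around the peak.
    """
    kept = [c for c in range(max_count + 1) if poisson_pmf(c, mean) >= min_probability]
    return (kept[0], kept[-1]) if kept else None


if __name__ == "__main__":
    for c in range(0, 16):
        print(c, f"{poisson_pmf(c, 7.5):.4f}")
    print(count_window(7.5, 0.01))
```

Words seen far less often than the window are likely sequencing errors, and words seen far more often are likely repeats, which is where the real data curve departs from the Poisson curve.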

Sorted List of Each k-Mer and Its Read Indices
  ACAGAAAAGC: 10h06.p1c 12a04.q1c 13d01.p1c 16d01.p1c 26g04.p1c 33h02.q1c 37g12.p1c 40d06.p1c
  ACAGAAAAGG: 16a02.p1c 20a10.p1c 22a03.p1c 26e12.q1c 30e12.q1c 47a01.p1c
[Diagram of a 64-bit word split into high bits and low bits, with field widths 64-2k and 2k]

Relation Matrix: R(i,j) – number of kmer words shared between read i and read j (N reads; N = 6 in this example)

  i \ j    1    2    3    4    5    6
    1      -   41    0    0    0    0
    2     41    -   37    0    0    0
    3      0   37    -    0   22    0
    4      0    0    0    -    0   27
    5      0    0   22    0    -    0
    6      0    0    0   27    0    -

  Group 1: (1,2,3,5)
  Group 2: (4,6)
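The grouping on this slide amounts to finding connected sets of reads in the relation matrix: two reads belong to the same group if a chain of non-zero R(i,j) entries links them. The sketch below counts shared words and groups reads with a union-find structure. It is an illustration rather than the Phusion2 code; the function names and the `min_shared` threshold parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_kmer_counts(read_kmers: dict[int, set[str]]) -> dict[tuple[int, int], int]:
    """R(i,j) for i < j: number of k-mer words present in both reads."""
    reads_by_kmer = defaultdict(list)
    for read_id, kmers in read_kmers.items():
        for kmer in kmers:
            reads_by_kmer[kmer].append(read_id)
    counts: dict[tuple[int, int], int] = defaultdict(int)
    for read_ids in reads_by_kmer.values():
        for i, j in combinations(sorted(read_ids), 2):
            counts[(i, j)] += 1
    return dict(counts)


def group_reads(read_ids, relations, min_shared: int = 1) -> list[list[int]]:
    """Group reads linked by at least `min_shared` shared words (union-find)."""
    parent = {r: r for r in read_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), shared in relations.items():
        if shared >= min_shared:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = defaultdict(list)
    for r in sorted(parent):
        groups[find(r)].append(r)
    return sorted(groups.values())


if __name__ == "__main__":
    # Non-zero entries of the matrix shown on the slide.
    slide_matrix = {(1, 2): 41, (2, 3): 37, (3, 5): 22, (4, 6): 27}
    print(group_reads(range(1, 7), slide_matrix))  # [[1, 2, 3, 5], [4, 6]]
```

Raising `min_shared` splits groups that are held together only by weak links, which is one way such a threshold could be used to keep repeats from merging unrelated reads.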

Relation Matrix: R(i,j) – Implementation
[Diagram: rows for reads 1 to N; each entry of R(i,j) holds a read index and the number of shared kmer words (< 63)]

Break contigs without read pair coverage
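One way to picture this step: map read pairs back to a contig, record the span each pair covers from the start of one read to the end of its mate, and cut the contig wherever no pair spans. The sketch below does exactly that on half-open intervals. It is a simplified illustration, not the method used in Phusion2; the function names and the interval representation are assumptions.

```python
def uncovered_regions(contig_length: int, pair_spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Half-open intervals of the contig that no read pair spans.

    Each span is (start, end) with 0 <= start < end <= contig_length.
    """
    gaps = []
    position = 0  # end of the region covered so far
    for start, end in sorted(pair_spans):
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < contig_length:
        gaps.append((position, contig_length))
    return gaps


def break_contig(contig_length: int, pair_spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Pieces of the contig that remain after cutting out unspanned regions."""
    pieces = []
    position = 0
    for gap_start, gap_end in uncovered_regions(contig_length, pair_spans):
        if gap_start > position:
            pieces.append((position, gap_start))
        position = gap_end
    if position < contig_length:
        pieces.append((position, contig_length))
    return pieces


if __name__ == "__main__":
    spans = [(0, 400), (300, 600), (650, 1000)]
    print(break_contig(1000, spans))  # [(0, 600), (650, 1000)]
```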

[Figure labels: Tasmanian devil; Opossum; Wallaby]

Tasmanian devil facial tumour disease (DFTD)
  Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
  Transmitted by biting
  Commonly metastasises
  First observed in 1996
  Primarily affects adults (>1 yr)
  Death in 4-6 months

DFTD samples
[Map. Legend: area still DFTD free; DFTD originated here c. 1996]
  Sites labelled on the map (numbers in parentheses as shown on the slide): Narawntapu, Mt William (2), Upper Natone, Wisedale (?), Frankford, Railton, West Pencil Pine (3), St Mary's (2), Reedy Marsh, Trowunna (2), Bronte Park, Coles Bay, Tarraleah, Kempton (2), Mangalore, Fentonbury (no host), Nugent (2), Forestier (33)
  Year labels: 2006, 2007, 2008
  Other numeric labels: 4, 14, 13

DFTD samples for sequencing
[Map. Legend: area still DFTD free; DFTD originated here c. 1996]
  Sites: Narawntapu 2007; Mt William 2007 or 2008; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007
  Strain labels: Strain 1, tetraploid; Strain 2; Strain 3; "Evolved"; Unknown strain

Sequencing T. Devil on Illumina: Strategy
  Tumour or normal genomic DNA
  Fragments of defined size: 0.5, 5, 7 kb
  Sequencing: 100 bp reads (short insert); 75 bp reads (long insert)
  Sequencing performed at Illumina
  Alignment using bwa, ssaha2
  De novo assembly
  Somatic mutations
  Germline variants

Paired Reads Separated by “NN”

Error Bases Correction

Genome Assembly – T. Devil

Solexa reads:
  Number of read pairs: 528 Million
  Finished genome size: 3.5 Gb
  Read length: 2x100 bp
  Estimated read coverage: ~30X
  Insert size: 410/50-600 bp
  Number of reads clustered: 458 Million

Assembly features - contig stats:
                               Phusion2      ABySS
  Total number of contigs:     1,420,262     7,796,722
  Total bases of contigs:      3.29 Gb       2.28 Gb
  N50 contig size:             7,618         2,013
  Largest contig:              76,418        31,045
  Averaged contig size:        2,314         292
  Contig coverage on genome:   ~94%          65%
  Mis-assembly errors:         ?             ?
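The contig statistics on this slide can all be computed from a list of contig lengths. A minimal sketch follows, using the common definition of N50 (the length L such that contigs of length at least L hold half or more of the assembled bases); the function and key names are illustrative and not taken from any assembler.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def contig_stats(lengths: list[int], genome_size_bp: float) -> dict:
    """Summary statistics of the kind listed on the slide."""
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bases": total,
        "n50": n50(lengths),
        "largest": max(lengths) if lengths else 0,
        "mean": total / len(lengths) if lengths else 0.0,
        "genome_fraction": total / genome_size_bp,
    }


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(contig_stats(toy_contigs, genome_size_bp=60))
```

As a consistency check on the slide itself, 3.29 Gb divided by 1,420,262 contigs gives a mean of about 2,316 bp, in line with the reported 2,314 once rounding of the total is allowed for, and 3.29 Gb over a 3.5 Gb genome gives the quoted ~94%.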

[Figure labels: Dog; Brown Bear; Macropus eugenii (Wallaby); Monodelphis domestica (Opossum); Sminthopsis macroura (Dunnart)]

[Figure labels: Tasmanian devil; Opossum; Wallaby]

Melanoma cell line COLO-829 Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

Human Cancer Genome Assembly – Normal Cell

Solexa reads:
  Number of read pairs: 557 Million
  Finished genome size: 3.0 Gb
  Read length: 2x75 bp
  Estimated read coverage: ~25X
  Insert size: 190/50-300 bp
  Number of reads clustered: 458 Million

Assembly features - contig stats:
  Total number of contigs: 1,020,346
  Total bases of contigs: 2.713 Gb
  N50 contig size: 8,344
  Largest contig: 107,613
  Averaged contig size: 2,659
  Contig coverage over the genome: ~90%
  Mis-assembly errors: ?

Genome Assembly – Tumour Cell

Solexa reads:
  Number of read pairs: 562 Million
  Finished genome size: 3.0 Gb
  Read length: 2x75 bp
  Estimated read coverage: ~25X
  Insert size: 190/50-300 bp
  Number of reads clustered: 449 Million

Assembly features - contig stats:
  Total number of contigs: 1,249,719
  Total bases of contigs: 2.690 Gb
  N50 contig size: 6,073
  Largest contig: 72,123
  Averaged contig size: 2,152
  Contig coverage over the genome: ~90%
  Mis-assembly errors: ?

Rice Genome Assembly
One of the most difficult genomes on Earth?

Solexa reads:
  Number of read pairs: 97.9 Million
  Finished genome size: 440 Mb
  Read length: 2x76 bp
  Estimated read coverage: ~33X
  Insert size: 500/50-600 bp
  Number of reads clustered: 81.2 Million

Assembly features - contig stats:
  Total number of contigs: 374,713
  Total bases of contigs: 365 Mb
  N50 contig size: 7,639
  Largest contig: 72,321
  Averaged contig size: 973
  Contig coverage over the genome: ~83%
  Mis-assembly errors: ?

Acknowledgements: Elizabeth Murchison, Erin Pleasance, Mike Stratton, Dirk Evers, Ole Schulz-Trieglaff, Qi Feng, Bin Han