Phusion2 and The Genome Assembly of Tasmanian Devil


1 Phusion2 and The Genome Assembly of Tasmanian Devil
Zemin Ning, The Wellcome Trust Sanger Institute

2 Outline of the Talk
Challenges in genome assemblies from pure Illumina reads
The Phusion2 pipeline
The Tasmanian devil genome project
The Devil genome assembly
Other assemblies: human cancer, zebrafish, rice, etc.

3 Challenges in Whole Genome Assembly using Pure Illumina Reads
Large genomes and huge datasets: for human, 100 Gb at 30x
Repetitive/duplicated structures (Alus, LINEs, SVAs): 30-40% of the genome in human and mouse; 50-60% in rice and other plant genomes
Tandem repeats: how many copies do they have?
TATATATATATATATATATATATATATA
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT

4 De Bruijn vs Read overlap
[Figure labels: missing sequences; missing from de Bruijn contigs]

5 Phusion2 Assembly Pipeline
[Flowchart labels: Solexa Reads (2x75 or 2x100); Data Process; Base Correction; Reads Group; Assembly; Fuzzypath; Velvet; Phrap; Contigs; Long Insert Reads; PRono; Supercontig]

6 Grouped Reads by Phusion
[Figure labels: repetitive contig and read pairs; depth]

7 Kmer Word Hashing
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA
Contiguous Base Hash, K = 12: ATGGCGTGCAGT TGGCGTGCAGTC GGCGTGCAGTCC GCGTGCAGTCCA CGTGCAGTCCAT
Gap-Hash 4x3: ATGGGCAGATGT TGGCCAGTTGTT GGCGAGTCGTTC GCGTGTCCTTCG
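The two hashing schemes on this slide can be reproduced with a short sketch. The reading of "4x3" as three blocks of four bases with three skipped bases between blocks is inferred from the example words on the slide, not stated by the author; the function names are illustrative only.

```python
def contiguous_kmers(seq, k=12):
    """Return every contiguous k-base word in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, blocks=3, gap=3):
    """Return words made of `blocks` runs of `block` bases,
    skipping `gap` bases between consecutive runs (assumed layout)."""
    step = block + gap                          # distance between run starts
    span = blocks * block + (blocks - 1) * gap  # bases covered by one word
    words = []
    for i in range(len(seq) - span + 1):
        runs = (seq[i + b * step:i + b * step + block] for b in range(blocks))
        words.append("".join(runs))
    return words


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_kmers(seq)[:5])
print(gapped_kmers(seq)[:4])
```

Both calls reproduce the words listed on the slide, which suggests the inferred gap layout matches the figure.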

8 Word use distribution for the mouse sequence data at ~7.5 fold
[Plot labels: useful region; Poisson curve; real data curve]
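A word use distribution of this kind can be tabulated as follows. This is a minimal sketch: treating the "useful region" as a window of occurrence counts between two cut-offs is an interpretation of the plot labels, and the `lo` and `hi` parameters are hypothetical, not values from the talk.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer word occurs across all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def word_use_histogram(reads, k):
    """Map occurrence count -> number of distinct words used that many times."""
    return Counter(kmer_counts(reads, k).values())


def poisson_pmf(n, lam):
    """Poisson probability of seeing a word n times when the mean use is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def useful_words(reads, k, lo, hi):
    """Words whose occurrence count lies in the assumed window [lo, hi]."""
    return {w for w, c in kmer_counts(reads, k).items() if lo <= c <= hi}


reads = ["ACGTACGT", "CGTACGTA"]
print(word_use_histogram(reads, 4))
print(sorted(useful_words(reads, 4, lo=3, hi=3)))
```

Comparing the histogram with `poisson_pmf` at the expected depth is one way to see where real data departs from the ideal curve, for example at very low counts and at very high counts.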

9 Sorted List of Each k-Mer and Its Read Indices
[Figure labels: high bits; low bits; 64 - 2k; 2k]
ACAGAAAAGC: 10h06.p1c 12a04.q1c 13d01.p1c 16d01.p1c 26g04.p1c 33h02.q1c 37g12.p1c 40d06.p1c
ACAGAAAAGG: 16a02.p1c 20a10.p1c 22a03.p1c 26e12.q1c 30e12.q1c 47a01.p1c
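The "2k" and "64 - 2k" labels suggest that each k-mer (2 bits per base) and a read index share one 64-bit word. The sketch below assumes that packing. Which field sits in the high bits cannot be recovered from the transcript, so it places the k-mer in the high bits, because a plain integer sort then groups identical k-mers together. Handling of ambiguous bases such as N is left out.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(kmer, read_index):
    """Pack a k-mer (2 bits/base, high bits) and a read index (low bits)
    into one 64-bit integer. The field order is an assumption."""
    k = len(kmer)
    index_bits = 64 - 2 * k
    if read_index >= (1 << index_bits):
        raise ValueError("read index does not fit in the remaining bits")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return (value << index_bits) | read_index


def unpack(word, k):
    """Recover (kmer, read_index) from a packed 64-bit integer."""
    index_bits = 64 - 2 * k
    value = word >> index_bits
    read_index = word & ((1 << index_bits) - 1)
    kmer = "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))
    return kmer, read_index


def sorted_kmer_list(reads, k):
    """Sorted packed words: equal k-mers end up adjacent, ordered by read index."""
    words = [pack(r[i:i + k], idx)
             for idx, r in enumerate(reads)
             for i in range(len(r) - k + 1)]
    words.sort()
    return words


for word in sorted_kmer_list(["ACGT", "CGTA", "ACGA"], 3):
    print(unpack(word, 3))
```

After sorting, the reads that contain a given k-mer can be read off as one contiguous run of the list.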

10 Relation Matrix: R(i,j) – number of kmer words shared between read i and read j
[Figure: relation matrix R(i,j) with reads indexed 1, 2, ..., N. Example groups: Group 1: (1,2,3,5); Group 2: (4,6)]
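The relation matrix and the resulting read groups can be illustrated with a small sketch. The toy sequences, the `min_shared` threshold and the use of union-find for grouping are assumptions made for illustration; the slide only defines R(i,j) and shows the two example groups.

```python
from collections import defaultdict
from itertools import combinations


def relation_matrix(reads, k):
    """Sparse R: R[(i, j)] for i < j is the number of distinct k-mer
    words shared between read i and read j."""
    owners = defaultdict(set)
    for idx, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            owners[read[p:p + k]].add(idx)
    R = defaultdict(int)
    for idxs in owners.values():
        for i, j in combinations(sorted(idxs), 2):
            R[(i, j)] += 1
    return R


def group_reads(n_reads, R, min_shared):
    """Put reads in the same group when linked by R(i,j) >= min_shared."""
    parent = list(range(n_reads))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), shared in R.items():
        if shared >= min_shared:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for r in range(n_reads):
        groups[find(r)].append(r)
    return sorted(groups.values())


# Toy data: two unrelated regions, reads listed in the slide's order 1..6.
a = "AACACCCAACCACA"
b = "GGTGTTTGGTTGTG"
reads = [a[0:8], a[2:10], a[4:12], b[0:8], a[6:14], b[3:11]]
R = relation_matrix(reads, k=4)
groups = group_reads(len(reads), R, min_shared=2)
print([[i + 1 for i in g] for g in groups])  # 1-based, as on the slide
```

With these toy reads the grouping matches the slide's example, Group 1: (1,2,3,5) and Group 2: (4,6).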

11 Relation Matrix: R(i,j) – Implementation
[Figure labels: read index 1, 2, ..., N; R(i,j); number of shared kmer words (< 63)]

12 Break contigs without read pair coverage
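One way to break contigs at positions that no read pair spans is sketched below. The input format (one half-open `(start, end)` interval per mapped pair, from the first base of one read to the last base of its mate) and the `min_pairs` parameter are assumptions; the slide gives only the title.

```python
def break_contig(contig_len, pair_spans, min_pairs=1):
    """Split the contig [0, contig_len) at every junction between
    adjacent bases that fewer than min_pairs read pairs span.

    pair_spans holds half-open (start, end) intervals, one per pair,
    with 0 <= start < end <= contig_len (assumed input format).
    Returns a list of half-open (start, end) segments.
    """
    # diff[p] records the change in spanning-pair count at junction p,
    # where junction p lies between base p-1 and base p.
    diff = [0] * (contig_len + 1)
    for start, end in pair_spans:
        if end - start >= 2:      # a pair spans junctions start+1 .. end-1
            diff[start + 1] += 1
            diff[end] -= 1

    segments, seg_start, cov = [], 0, 0
    for p in range(1, contig_len):
        cov += diff[p]
        if cov < min_pairs:       # unsupported junction: break here
            segments.append((seg_start, p))
            seg_start = p
    segments.append((seg_start, contig_len))
    return segments


print(break_contig(10, [(0, 6), (4, 10)]))  # overlapping spans
print(break_contig(10, [(0, 5), (5, 10)]))  # nothing spans junction 5
```

Note that this strict version also breaks at contig ends that no pair reaches; a real pipeline would presumably tolerate uncovered ends.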

13 Tasmanian devil
[Figure labels: Tasmanian devil; opossum; wallaby]

14 Tasmanian devil facial tumour disease (DFTD)
Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
Transmitted by biting
Commonly metastasises
First observed in 1996
Primarily affects adults (>1 yr)
Death in 4-6 months

15 DFTD samples
[Map legend: area still DFTD free; DFTD originated here c. 1996]
[Map labels: Narawntapu; Mt William (2); Upper Natone; Wisedale (?); Frankford; Railton; West Pencil Pine (3); St Mary's (2); Reedy Marsh; Trowunna (2); Bronte Park; Coles Bay; Tarraleah; Kempton (2); Mangalore; Fentonbury (no host); Nugent (2); Forestier (33). Year labels: 2006; 2007; 2008]

16 DFTD samples for sequencing
[Map legend: area still DFTD free; DFTD originated here c. 1996]
[Map labels: Narawntapu 2007; Mt William 2007 or 2008; Upper Natone 2007; Reedy Marsh 2007; Coles Bay; Mangalore 2007; Forestier 2007]
[Strain legend: Strain 1, tetraploid; Strain 2; Strain 3 "Evolved"; Unknown strain]

17 Sequencing T. Devil on Illumina: Strategy
Tumour or normal genomic DNA
Fragments of defined size: 0.5, 5, 7 kb
Sequencing: 100 bp reads (short insert), 75 bp reads (long insert)
Sequencing performed at Illumina
Alignment using bwa, ssaha2
De novo assembly
Somatic mutations
Germline variants

18 Paired Reads Separated by “NN”

19 Error Bases Correction

20 Genome Assembly – T. Devil
Solexa reads:
Number of read pairs: Million
Finished genome size: GB
Read length: 2x100bp
Estimated read coverage: ~30X
Insert size: / bp
Number of reads clustered: 458 Million
Assembly features (contig stats):
Total number of contigs: ,420,262 (Phusion2); 7,796,722 (ABySS)
Total bases of contigs: Gb (Phusion2); 2.28 Gb (ABySS)
N50 contig size: ,618 (Phusion2); 2,013 (ABySS)
Largest contig: 76,
Averaged contig size: ,
Contig coverage on genome: ~94 % (Phusion2); 65% (ABySS)
Mis-assembly errors: ? (Phusion2); ? (ABySS)
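For reference, the N50 contig size quoted in these statistics is conventionally the length L such that contigs of length L or longer hold at least half of all assembled bases. A minimal sketch of that standard definition follows; the talk does not say which tool computed its figures.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths: the largest
    length L such that contigs of length >= L together contain at
    least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")


def mean_length(lengths):
    """Average contig size: total bases divided by contig count."""
    return sum(lengths) / len(lengths)


contigs = [2, 3, 4, 5, 6]
print(n50(contigs), mean_length(contigs))
```

Because N50 is weighted by bases rather than by contig count, it can sit well above the average contig size when an assembly has many short contigs.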

21

22 [Figure labels: dog; brown bear; Macropus eugenii (wallaby); Monodelphis domestica (opossum); Sminthopsis macroura (dunnart)]

23 Tasmanian devil
[Figure labels: Tasmanian devil; opossum; wallaby]

24 Melanoma cell line COLO-829
Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

25 Human Cancer Genome Assembly – Normal Cell
Solexa reads:
Number of read pairs: Million
Finished genome size: GB
Read length: 2x75bp
Estimated read coverage: ~25X
Insert size: / bp
Number of reads clustered: 458 Million
Assembly features (contig stats):
Total number of contigs: ,020,346
Total bases of contigs: Gb
N50 contig size: ,344
Largest contig: ,613
Averaged contig size: ,659
Contig coverage over the genome: ~90 %
Mis-assembly errors: ?

26 Genome Assembly – Tumour Cell
Solexa reads:
Number of read pairs: Million
Finished genome size: GB
Read length: 2x75bp
Estimated read coverage: ~25X
Insert size: / bp
Number of reads clustered: 449 Million
Assembly features (contig stats):
Total number of contigs: ,249,719
Total bases of contigs: Gb
N50 contig size: ,073
Largest contig: 72,123
Averaged contig size: ,152
Contig coverage over the genome: ~90 %
Mis-assembly errors: ?

27 Rice Genome Assembly: One of the Most Difficult Genomes on Earth?
Solexa reads:
Number of read pairs: Million
Finished genome size: MB
Read length: 2x76bp
Estimated read coverage: ~33X
Insert size: / bp
Number of reads clustered: Million
Assembly features (contig stats):
Total number of contigs: ,713
Total bases of contigs: Mb
N50 contig size: ,639
Largest contig: 72,321
Averaged contig size:
Contig coverage over the genome: ~83 %
Mis-assembly errors: ?

28 Acknowledgements
Elizabeth Murchison, Erin Pleasance, Mike Stratton
Dirk Evers, Ole Schulz-Trieglaff, Qi Feng, Bin Han

