Published by Austin McDonald. Modified over 9 years ago.
1 Microbial Genome Assembly
Pamela Ferretti
Laboratory of Computational Metagenomics, Centre for Integrative Biology, University of Trento, Italy
2 Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
3 DNA packaging
6 Next Generation Sequencing
[Figure: example reads TCTTATTGTGACC, TAGGCTAGCTTAG, GCAATGCAGTAAC, TCCAGCTAGGTTC, CTGCAT and the sequence ACGTAGGCTAGCGTTAGCGA...]
7 Genome Assembly
1. GENOME SEQUENCING
2. PRELIMINARY ANALYSIS
3. ASSEMBLY
4. ADVANCED BIOINFORMATIC ANALYSIS
OVERLAPPING SEQUENCE ALIGNMENT
8 On the feasibility of sequence assembly
Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)
10 Genome assembly strategies
Greedy approach → SSAKE
De Bruijn graph (DBG) → Velvet, SOAPdenovo
Overlap Layout Consensus (OLC) → MIRA
Mixed approaches → MaSuRCA
11 Genome assembly strategies
DE BRUIJN GRAPH APPROACH (DBG)
Velvet, SOAPdenovo2
Nodes = overlapping subsequences of the reads, all of uniform length
Edges = k-mers (unique subsequences of length k within the reads)
EULERIAN PATH
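The node and edge definitions above can be sketched in a few lines of Python. This is a toy illustration, not how Velvet or SOAPdenovo2 are implemented: it assumes error-free reads, a single strand, and a graph that really does contain an Eulerian path. The function names (build_de_bruijn, eulerian_path, spell) and the three example reads are made up for this sketch.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return the graph as {(k-1)-mer: [successor (k-1)-mers]}.

    Every distinct k-mer found in the reads contributes exactly one edge,
    from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = {node: list(succ) for node, succ in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, succ in graph.items():
        balance[node] += len(succ)
        for nxt in succ:
            balance[nxt] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a walk over (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads, chosen only for illustration.
reads = ["ACGTAG", "GTAGGC", "AGGCTA"]
graph = build_de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)
```

Note that no read-against-read comparison happens anywhere: the overlaps are implicit in the shared (k-1)-mers, and each k-mer is kept once no matter how many reads contain it.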
12 Genome assembly strategies
OVERLAP LAYOUT CONSENSUS (OLC)
MIRA
Nodes = reads
Edges = overlaps between reads
1. OVERLAP
2. LAYOUT
3. CONSENSUS
HAMILTONIAN PATH
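The three OLC stages can be sketched as follows. This is a minimal illustration under strong assumptions (exact suffix-prefix matches, no sequencing errors, a handful of reads), not MIRA's algorithm. The layout step is a brute-force search over read orderings, and the consensus step simply merges reads along the chosen path rather than voting over a multiple alignment. All function names and the min_len threshold are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Stage 1 (overlap): edges {(a, b): overlap length} over ordered read pairs."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges


def layout(reads, edges):
    """Stage 2 (layout): Hamiltonian path with the largest total overlap.

    Tries every ordering, so it is only usable for a few reads.
    """
    best, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(p in edges for p in pairs):
            score = sum(edges[p] for p in pairs)
            if score > best_score:
                best, best_score = order, score
    return best


def consensus(order, edges):
    """Stage 3 (consensus), simplified: append the non-overlapping tail of each read."""
    seq = order[0]
    for a, b in zip(order, order[1:]):
        seq += b[edges[(a, b)]:]
    return seq


# Hypothetical reads, deliberately given out of order.
reads = ["GTAGGC", "ACGTAG", "AGGCTA"]
edges = overlap_graph(reads)
order = layout(reads, edges)
print(consensus(order, edges))
```

The all-against-all comparison in overlap_graph and the factorial search in layout are what make the overlap stage slow and the Hamiltonian-path formulation hard to scale; real assemblers rely on indexing and heuristics instead.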
13 Genome assembly strategies
14 Genome assembly strategies
ADVANTAGES
DBG: Very sensitive to repeats; each k-mer stored just once; Eulerian cycle; never explicitly computes pairwise overlaps
OLC: Modular algorithmic design; flexibility and robustness
DISADVANTAGES
DBG: Sensitive to sequencing errors (new k-mers); large memory requirements
OLC: Hamiltonian cycle; overlap stage is time-consuming; genome-size limitations
15 Genome assembly strategies
Greedy approach → SSAKE
De Bruijn graph (DBG) → Velvet, SOAPdenovo
Overlap Layout Consensus (OLC) → MIRA
Mixed approaches → MaSuRCA
16 Genome Assemblers
Average coverage; number of contigs; number of contigs > 1 Kb; N50 contig size; fraction of reads assembled; total consensus (in nt); number of scaffolds; N50 scaffold size
Ion Torrent PGM → MIRA 3.9
Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads.
18 Mycobacteria Assembly: Case Study
Responsible for many animal and human diseases
M. tuberculosis and M. leprae (TM)
M. fortuitum (NTM) outbreak (nail salon, 2002)
M. chelonae (NTM) outbreak (face lifts, 2004)
Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
Twenty mycobacterial strains from 20 different Mycobacteria species → MaSuRCA
Novel mycobacteria detection clinical tests
19 Raw data quality assessment and pre-processing
Fastq-mcf tool removes:
poor quality ends of reads
Ns, duplicates and sequencing adapters
reads that are too short
Reduction up to 73%
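To make the pre-processing steps concrete, here is a rough Python sketch of end trimming and length filtering on a single read. It is not fastq-mcf and does not reproduce its options or defaults: the thresholds (min_q, min_len), the Phred+33 quality encoding, and the function name trim_read are all assumptions for illustration, and adapter and duplicate removal are left out.

```python
def trim_read(seq, qual, min_q=20, min_len=30, phred_offset=33):
    """Trim low-quality or N bases from both ends of a read.

    seq  : base calls, e.g. "ACGT..."
    qual : quality string of the same length (Phred+33 assumed)
    Returns (trimmed_seq, trimmed_qual), or None if the read ends up
    shorter than min_len and should be discarded.
    """
    scores = [ord(c) - phred_offset for c in qual]
    start, end = 0, len(seq)

    # Walk in from the 5' end while bases are poor quality or N.
    while start < end and (scores[start] < min_q or seq[start] == "N"):
        start += 1
    # Walk in from the 3' end in the same way.
    while end > start and (scores[end - 1] < min_q or seq[end - 1] == "N"):
        end -= 1

    if end - start < min_len:
        return None
    return seq[start:end], qual[start:end]


# '!' encodes quality 0 and 'I' encodes quality 40 in Phred+33.
print(trim_read("NACGTACGTN", "!IIIIIIII!", min_len=5))
print(trim_read("ACGT", "IIII", min_len=5))
```

Applied to every read in a FASTQ file, the fraction of discarded bases would give a reduction figure comparable in spirit to the one quoted on the slide.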
20 Assembly parameters setting
K-mers: strings of a particular length k, which are shorter than entire reads
Best empirical k-mer length: 91 bases
High coverage
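As a small illustration of the k-mer definition, the sketch below extracts and counts the k-mers in a set of reads. A read of length L contains L - k + 1 k-mers, so a large k such as 91 leaves only a few k-mers per read; the 100-base read length used in the last line is an assumption for the example, not a figure from the slides.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k in the read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """How many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("ACGTAC", 3))
print(count_kmers(["ACGTAC", "GTACGT"], 3))
# A hypothetical 100-base read with k = 91 yields 100 - 91 + 1 k-mers.
print(len(kmers("A" * 100, 91)))
```

With high coverage, true k-mers are seen many times while k-mers created by sequencing errors tend to be rare, which is one reason coverage matters when choosing k.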
21 MaSuRCA results of Mycobacteria
Abnormal GC content
Genome size too high
22 Examples of environmental contamination
GC content based quality analysis
Staphylococcus epidermidis
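A GC-content check of this kind can be sketched as follows. Mycobacteria have GC-rich genomes (roughly 65%), while Staphylococcus epidermidis is GC-poor (roughly 32%), so contaminant contigs stand out. The expected value, the tolerance, the function names, and the toy contigs below are assumptions for illustration, not the thresholds used in the case study.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_contigs(contigs, expected_gc, tolerance=0.10):
    """Names of contigs whose GC fraction is far from the expected value."""
    return [
        name
        for name, seq in contigs.items()
        if abs(gc_content(seq) - expected_gc) > tolerance
    ]


# Hypothetical toy contigs: one GC-rich, one AT-rich.
contigs = {
    "contig_1": "GCGCGCGATA",
    "contig_2": "ATATATATGC",
}
# expected_gc=0.65 is an assumed value for a mycobacterial genome.
print(flag_contigs(contigs, expected_gc=0.65))
```

Contigs flagged this way are only candidates; confirming the source organism would need a sequence search against reference genomes.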
23 Thanks
http://gcat.davidson.edu/phast/#methods