Assembling Genomes. BCH364C/391L Systems Biology / Bioinformatics – Spring 2015. Edward Marcotte, Univ of Texas at Austin.



Image source: Jones NC, Pevzner PA, An Introduction to Bioinformatics Algorithms, MIT Press.

“shotgun” sequencing “mapping”

(Translating the cloning jargon)

Thinking about the basic shotgun concept: Start with a very large set of random sequencing reads. How might we match up the overlapping sequences? How can we assemble the overlapping reads to derive the genome?

Thinking about the basic shotgun concept: At a high level, the first genomes were sequenced by comparing pairs of reads to find overlapping reads, then building a graph (i.e., a network) to represent those relationships. The genome sequence is a “walk” across that graph.

The “Overlap-Layout-Consensus” method. Overlap: compare all pairs of reads (allowing some low level of mismatches). Layout: construct a graph describing the overlaps, simplify the graph, and find the simplest path through the graph. Consensus: reconcile errors among the reads along that path to find the consensus sequence. (Figure labels: read; sequence overlap.)
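The Overlap and Layout steps can be sketched in a few lines of Python. This is only a minimal illustration, not how any production assembler works: it requires exact suffix/prefix matches (the slide allows a low level of mismatches), and the function names, the `min_len` threshold and the three toy reads are all assumptions made up for this example.

```python
from itertools import permutations

def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`
    (0 if the best match is shorter than `min_len`)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Overlap step: compare all ordered pairs of reads.
    Returns {(a, b): overlap length}; nodes are reads, edges are overlaps."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap_length(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph

def merge(a, b, n):
    """Join two reads that overlap by n bases."""
    return a + b[n:]

# Hypothetical toy reads
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
graph = overlap_graph(reads)

# Layout for this toy case is a single unbranched path: read 0 -> read 1 -> read 2
contig = merge(merge(reads[0], reads[1], graph[(reads[0], reads[1])]),
               reads[2], graph[(reads[1], reads[2])])
print(contig)
```

Comparing all pairs is quadratic in the number of reads, which is one reason the later slides move on to other graph formulations.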

Building an overlap graph. Figure labels: 5’, 3’. (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2).)

Building an overlap graph. Figure labels: Reads; Overlap graph; 5’, 3’. (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2), more or less.)

Simplifying an overlap graph. 1. Remove all contained nodes and the edges going to them. (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2), more or less.)

Simplifying an overlap graph. 2. Transitive edge removal: given A – B – C and A – C, remove A – C. (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2), more or less.)

Simplifying an overlap graph. 3. If un-branched, calculate the consensus sequence. If branched, assemble the un-branched bits and then decide how they fit together. (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2), more or less.)

Simplifying an overlap graph. Result: a “contig” (assembled contiguous sequence). (Eugene W. Myers, Journal of Computational Biology, Summer 1995, 2(2), more or less.)
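The first two simplification steps (dropping contained reads, then removing transitive edges) can be sketched as below. This is a simplified illustration rather than Myers's actual procedure, which also takes overlap lengths into account; the function names and toy inputs are assumptions, and the transitive step as written is only meant for acyclic overlap graphs.

```python
def remove_contained(reads):
    """Step 1: drop reads that occur entirely inside another (different) read."""
    return [r for r in reads
            if not any(r != other and r in other for other in reads)]

def remove_transitive_edges(edges):
    """Step 2: given A->B, B->C and A->C, remove A->C.
    `edges` is a set of (source, target) pairs."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    reduced = set()
    for a, c in edges:
        # A->C is implied if some intermediate B has A->B and B->C
        implied = any(c in successors.get(b, ())
                      for b in successors[a] if b != c and b != a)
        if not implied:
            reduced.add((a, c))
    return reduced

# Hypothetical toy inputs
kept = remove_contained(["ACGATTAC", "GATT", "TTACAATA"])
reduced = remove_transitive_edges({("A", "B"), ("B", "C"), ("A", "C")})
print(kept)
print(sorted(reduced))
```

After these steps, each un-branched stretch of the graph corresponds to one contig.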

This basic strategy was used for most of the early genomes. Also useful: “mate pairs”, 2 reads (Read #1, Read #2) separated by a known distance because they come from a DNA fragment of known size. Contigs (Contig #1, Contig #2) can be ordered using these paired reads.
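One way to picture how paired reads link contigs: count, for every pair of contigs, how many mate pairs have one read in each. The sketch below assumes each read has already been placed in a contig; the names `read_to_contig`, `mate_pairs` and `contig_links` are hypothetical, and real scaffolders also use read orientation and the known fragment size, which are ignored here.

```python
from collections import Counter

def contig_links(read_to_contig, mate_pairs):
    """Count mate pairs whose two reads fall in different contigs.
    Returns a Counter keyed by (contig, contig) tuples in sorted order."""
    links = Counter()
    for read1, read2 in mate_pairs:
        c1 = read_to_contig.get(read1)
        c2 = read_to_contig.get(read2)
        if c1 is not None and c2 is not None and c1 != c2:
            links[tuple(sorted((c1, c2)))] += 1
    return links

# Hypothetical placements: which contig each read was assembled into
read_to_contig = {"r1": "contig1", "r2": "contig2",
                  "r3": "contig1", "r4": "contig2",
                  "r5": "contig1", "r6": "contig1"}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]

links = contig_links(read_to_contig, mate_pairs)
print(links)
```

Two independent mate pairs joining the same two contigs is stronger evidence that they are neighbours than a single pair would be.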

GigAssembler (used to assemble the public Human Genome Project sequence). Jim Kent, David Haussler.

Whole-genome assembly: big picture

GigAssembler – Preprocessing: 1. Decontamination and repeat masking. 2. Alignment of mRNAs, ESTs, BAC ends and paired reads against the initial sequence contigs (psLayout → BLAT). 3. Creation of an input directory (folder) structure.

RepBase + RepeatMasker

GigAssembler: Build merged sequence contigs (“rafts”)

Sequencing quality (Phred Score)

Base-calling Error Probability
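The slide's figure is not preserved here, but the standard relationship between a Phred quality score Q and the base-calling error probability P is Q = -10 log10(P). A minimal sketch of the conversion in both directions (the function names are made up for this example):

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

So Q20 corresponds to a 1-in-100 chance that the base call is wrong, and Q30 to 1 in 1,000.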


GigAssembler: Build sequenced clone contigs (“barges”)

GigAssembler: Build a “raft-ordering” graph

Add information from mRNAs, ESTs, paired plasmid reads and BAC end pairs: building a “bridge”. Different data types get different weights (mRNA ~ highest). Conflicts with the graph as constructed so far are rejected. Build a sequence path through each raft. Fill the gaps with N’s: 100 between rafts, 50,000 between bridged barges.

Finding the shortest path across the ordering graph using the Bellman-Ford algorithm

Find the shortest path to all nodes (A, B, C, D, E): take every edge and try to relax it (N – 1 times, where N is the number of nodes).

Node labels shown on the figure at successive steps (distance, with the predecessor node in parentheses):
START, Inf.
START, Inf., +7 (→ A), +6 (→ A)
START, +7 (→ A), +6 (→ A), +2 (→ B), +4 (→ D)
START, +7 (→ A), +2 (→ B), +4 (→ D), +2 (→ C)
START, +7 (→ A), +4 (→ D), +2 (→ C), -2 (→ B)

Final labels for nodes A, B, C, D, E: START, +7 (→ A), +4 (→ D), +2 (→ C), -2 (→ B). Answer: A-D-C-B-E
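The relaxation procedure on these slides can be sketched as follows. The edge weights are not recoverable from the transcript, so the edge list below is an assumption, chosen only so that the final distances and the answer path match the labels on the slides; the function names are also made up for this example.

```python
import math

def bellman_ford(nodes, edges, source):
    """Shortest distances from `source`, allowing negative edge weights.
    `edges` is a list of (from_node, to_node, weight) tuples."""
    dist = {n: math.inf for n in nodes}
    pred = {n: None for n in nodes}
    dist[source] = 0
    # Take every edge and try to relax it, N - 1 times
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    # One more pass: any further improvement means a negative-weight cycle
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, pred

def path_to(pred, target):
    """Follow predecessor links back from `target` to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]

nodes = ["A", "B", "C", "D", "E"]
# Assumed edge weights (not given in the transcript)
edges = [("A", "B", 6), ("A", "D", 7), ("B", "C", 5), ("B", "D", 8),
         ("B", "E", -4), ("C", "B", -2), ("D", "C", -3), ("D", "E", 9),
         ("E", "C", 7), ("E", "A", 2)]

dist, pred = bellman_ford(nodes, edges, "A")
print(dist)
print("-".join(path_to(pred, "E")))
```

With these assumed weights the run ends with D = +7 (via A), C = +4 (via D), B = +2 (via C) and E = -2 (via B), matching the final labels above.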

Modern assemblers now work a bit differently, using so-called de Bruijn graphs. Here’s what we saw before, in Overlap-Layout-Consensus: nodes are reads, edges are overlaps. (Nature Biotech 29(11), 2011)

Modern assemblers now work a bit differently, using so-called de Bruijn graphs. In a de Bruijn graph: (see figure; Nature Biotech 29(11), 2011)
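The figure for this slide is not preserved in the transcript. In the formulation commonly used for assembly, reads are chopped into k-mers; each distinct k-mer becomes an edge that links its first k-1 bases (one node) to its last k-1 bases (another node). A minimal sketch, in which the function name, the toy reads and the choice k = 3 are all assumptions for illustration:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.
    Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Hypothetical toy reads; both contain the k-mer GAT, which is stored only once
graph = de_bruijn_graph(["ACGAT", "GATTA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Note that no read-versus-read comparison is needed: overlaps are implicit in shared (k-1)-mers, and walking the edges AC → CG → GA → AT → TT → TA spells out ACGATTA.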

Once a reference genome is assembled, new sequencing data can ‘simply’ be mapped to the reference. (Figure labels: reads; reference genome.)

Mapping reads to assembled genomes Trapnell C, Salzberg SL, Nat. Biotech., 2009

Mapping strategies Trapnell C, Salzberg SL, Nat. Biotech., 2009

Burrows-Wheeler indexing (Trapnell C, Salzberg SL, Nat. Biotech., 2009)

Burrows-Wheeler transform indexing. BWT = ‘reversible block-sorting’. BWT is often used for file compression (as in bzip2); here it is used to make a fast ‘lookup’ index in a genome.
Input: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output (forward BWT): TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT (this sequence is more compressible)
Recovered input (reverse BWT): SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
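A naive forward transform can be written directly from the definition: list every rotation of the text, sort them, and read off the last column. This sketch appends an explicit end-of-file marker (`$`, an assumption for this example) so the transform can be reversed without extra bookkeeping; because of that marker it will not reproduce the exact output string shown for the SIX.MIXED example. Sorting all rotations is fine for toy inputs but far too slow and memory-hungry for a genome.

```python
def bwt(text, eof="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column.
    `eof` must not occur in `text` and must sort before every other character."""
    if eof in text:
        raise ValueError("text must not contain the end-of-file marker")
    text += eof
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))
print(bwt("banana"))
```

Identical characters tend to cluster in the output (for example the run `AAA` in the first result), which is what makes the transformed text more compressible.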


BWT is remarkable because it is reversible. Any ideas as to how you might reverse it?


Burrows-Wheeler transform indexing: Write the sequence as the last column… Sort it… Add the columns… Sort those…

Burrows-Wheeler transform indexing: Add the columns… Sort those… Add the columns… Sort those…

Burrows-Wheeler transform indexing: Add the columns… Sort those… The row with the "end of file" character at the end is the original text.
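The "add the columns, sort those" procedure translates almost word for word into code. The sketch below assumes the transformed text contains a unique end-of-file marker (`$` here, an assumption for this example) and uses the naive table-building approach from the slides, not the faster methods that real read mappers rely on.

```python
def inverse_bwt(transformed, eof="$"):
    """Reverse a Burrows-Wheeler transform by repeatedly prepending the
    transformed text as a new first column and re-sorting the rows."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Add the column...
        table = [transformed[i] + table[i] for i in range(len(transformed))]
        # ...sort those
        table.sort()
    # The row with the end-of-file character at the end is the original text
    return next(row for row in table if row.endswith(eof))

print(inverse_bwt("GC$AAAC"))
print(inverse_bwt("annb$aa"))
```

Each pass reconstructs one more column of the sorted rotation table, so after as many passes as there are characters every row is a complete rotation of the original text.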


Burrows-Wheeler indexing (Trapnell C, Salzberg SL, Nat. Biotech., 2009)