Fragment Assembly 蔡懷寬 We would like to know the Target DNA sequence.


Fragment Assembly 蔡懷寬

We would like to know the Target DNA sequence

Example

However, there are always errors: substitution error

However, there are always errors: insertion error

However, there are always errors: deletion error

However, there are always errors: chimeric error

Terrible, Right? Not Yet!!!

More Complicated: Unknown Orientation

More Complicated: Repeat Region

More Complicated: Coverage and Linkage

A new strategy for sequencing

Some basic models for Fragment Assembly
Shortest Common Superstring: the simplest model
Reconstruction: deals with errors and orientation
Multicontig: deals with errors, orientation and linkage

Example to illustrate these models
Three sequences: GTAC, TAATG, TGTAA

Shortest Common Superstring (SCS)
Fragments: GTAC, TAATG, TGTAA
Layout of the overlapping fragments:
TGTAA
  TAATG
      GTAC
Resulting superstring: TGTAATGTAC
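For three fragments the SCS can be checked exhaustively. The sketch below is only an illustration, not part of the original slides: it tries every ordering of the fragments and merges neighbours on their longest suffix/prefix overlap. The function names `overlap` and `shortest_common_superstring` are hypothetical, and the approach is only practical for a handful of fragments because the number of orderings grows factorially.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(fragments):
    """Exhaustive SCS search; suitable for very small inputs only."""
    unique = sorted(set(fragments))
    # A fragment contained in another fragment never changes the answer.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        merged = order[0]
        for frag in order[1:]:
            merged += frag[overlap(merged, frag):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["GTAC", "TAATG", "TGTAA"]))
```

For the slide's fragments the only ordering that reaches length 10 is TGTAA, TAATG, GTAC, which matches the layout above.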

Reconstruction
Fragments: GTAC, TAATG, TGTAA
Find each fragment in both orientations (the fragment and its reverse complement): GTAC, GTAC, TAATG, CATTA, TGTAA, TTACA
Then, find a string S, s.t.
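The list of both orientations can be reproduced with a few lines of code. This is a minimal sketch, assuming plain uppercase A/C/G/T input; the helper names `reverse_complement` and `both_orientations` are made up for illustration.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def both_orientations(fragments):
    """Pair each fragment with its reverse complement."""
    return [(f, reverse_complement(f)) for f in fragments]


for forward, reverse in both_orientations(["GTAC", "TAATG", "TGTAA"]):
    print(forward, reverse)
```

Note that GTAC is its own reverse complement, which is why it appears twice in the slide's list.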

Multicontig

An Example of Repeat

Example of the Importance of Linkage

An Algorithm for Finding the SCS

Consed, Phred & Phrap Overview
Developed at the University of Washington
Phil Green (phrap)
Brent Ewing (phred)
David Gordon (consed)

Consed, Phred & Phrap
UNIX (free to academic users).
DNA assembly package for high-throughput sequencing projects.
Consed: graphical interface extension that controls both Phred and Phrap.
Phred: base calling, vector trimming, end-of-sequence read trimming.
Phrap: assembler.
Phrap uses Phred's base-calling scores to determine the consensus sequences. Phrap examines all individual sequences at a given position, and uses the highest-scoring sequence (if it exists) to extend the consensus sequence.

More on Phrap
Phrap constructs the contig sequence as a mosaic of the highest-quality parts of the reads rather than as a statistically computed "consensus". This avoids both the complex algorithmic issues associated with multiple alignment methods, and problems that occur with these methods causing the consensus to be less accurate than individual reads at some positions.
The sequence produced by Phrap is quite accurate: less than 1 error per 10 kb in typical datasets.
Sequence quality at a given position is determined by the Phred base caller.

Consed Graphical User Interface

Trace Sequence Reads After Phred: Base Calling

Consed: Graphical Alignment Representation

Poor Trace Sequence Data and Corresponding Phred Basecalling

Phred Base Calling

Vector Trimming

Vector Trimming (Continued)
Trimming the vector sequence to yield only the insert DNA is an example of finding the longest prefix of S (raw sequence data) that has an exact match in T (vector multiple cloning site sequence).
Let S' = S$T, where '$' is a unique character that occurs in neither S nor T.
Using fundamental preprocessing to calculate all Z-boxes in S', we choose the largest Z-box that starts in the T portion of S' and trim its length from the 5' end of S.
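The Z-box idea can be sketched in a few lines. This is an illustrative implementation of the standard Z-algorithm applied to S$T as the slide describes, not Phred's actual trimming code; the names `z_array`, `vector_prefix_length` and `trim_vector`, and the example sequences, are assumptions made for the example.

```python
def z_array(s: str):
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def vector_prefix_length(read: str, vector: str, sep: str = "$") -> int:
    """Longest prefix of `read` that occurs somewhere in `vector`.

    Assumes `sep` appears in neither sequence, so no Z-box can cross it.
    """
    z = z_array(read + sep + vector)
    start_of_vector = len(read) + 1
    return max(z[start_of_vector:], default=0)


def trim_vector(read: str, vector: str) -> str:
    """Remove the matched vector prefix from the 5' end of the read."""
    return read[vector_prefix_length(read, vector):]


# Hypothetical read that begins with six bases of vector sequence.
print(trim_vector("GAATTCACGTACGT", "AAGCTTGAATTCGG"))
```

Because the separator cannot match any base, every Z value in the T portion is bounded by the length of S, so the maximum is exactly the length to trim.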

End of Sequence Cropping
The ends of sequencing reads commonly contain poor data. This is due to the difficulty of resolving larger fragments of about 1 kb (it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp).
Phred assigns a non-value of 'x' to this data by comparing peak separation and peak intensity to internal standards. If the standard threshold score is not reached, the data will not be used.

What is Phred?
Phred is a program that reads the base trace, makes base calls, and assigns quality values (qv) to the bases in the sequence. It then writes the base calls and qv to output files that will be used for Phrap assembly.
The qv are useful for consensus sequence construction. For example:
ATGCATTC        string1
   CGTTCATGC    string2
ATGC-TTCATGC    superstring
Here there is a mismatch between 'A' and 'G', shown as the dash in the superstring. The qv decides it: the base with the higher qv replaces the dash.
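The mismatch rule above can be illustrated with a small sketch. It assumes the two reads overlap and that the start position of the second read within the first (`offset`) is already known; the function name `merge_reads` and the quality values are invented for the example and do not come from the slides or from Phrap.

```python
def merge_reads(read1, qual1, read2, qual2, offset):
    """Merge two overlapping reads; at each column keep the base with the higher qv.

    `read2` is assumed to start at position `offset` of `read1`,
    with 0 <= offset <= len(read1) so that there is no uncovered column.
    """
    length = max(len(read1), offset + len(read2))
    merged = []
    for pos in range(length):
        candidates = []  # (quality value, base) pairs covering this column
        if pos < len(read1):
            candidates.append((qual1[pos], read1[pos]))
        if 0 <= pos - offset < len(read2):
            candidates.append((qual2[pos - offset], read2[pos - offset]))
        merged.append(max(candidates)[1])
    return "".join(merged)


string1, string2 = "ATGCATTC", "CGTTCATGC"
# Hypothetical quality values: the 'A' at position 4 of string1 is a weak call.
q1 = [30, 30, 30, 30, 10, 30, 30, 30]
q2 = [30] * len(string2)
print(merge_reads(string1, q1, string2, q2, offset=3))
```

With these made-up values the 'G' from string2 wins the mismatched column; raising the qv of the 'A' above 30 would make 'A' win instead.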

Why Phred?
The output sequence might contain errors.
Vector contamination might occur.
The dye-terminator reaction might not occur.
Fragment migration might be abnormal in gel electrophoresis.
The signal strength of the peak corresponding to a base might be weak or variable.

How does Phred calculate qv?
From the base trace, Phred knows the number of peaks and the actual peak locations.
Phred predicts the peak locations.
Phred reads the actual peak locations from the base trace.
Phred matches the actual locations with the predicted locations using dynamic programming.
The qv is related to the base-call error probability (ep) by the formula qv = -10 * log_10(ep). For example, ep = 0.001 gives qv = 30.
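The qv formula is easy to check numerically. The snippet below is a small sketch of the conversion in both directions; the function names are made up for illustration.

```python
import math


def quality_from_error(ep: float) -> float:
    """qv = -10 * log10(ep), for an error probability 0 < ep <= 1."""
    return -10 * math.log10(ep)


def error_from_quality(qv: float) -> float:
    """Inverse of the formula above: ep = 10 ** (-qv / 10)."""
    return 10 ** (-qv / 10)


for ep in (0.1, 0.01, 0.001):
    print(ep, round(quality_from_error(ep), 1))
```

Each increase of 10 in qv corresponds to a tenfold drop in the error probability.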

Phred Code
BEGIN
Row 0 holds predicted values
Column 0 holds actual values
for i = 1 to n do
    for j = 1 to n do
        if D(0,j) = D(i,0) then
            D(i,j) = 0
        else if |D(0,j) - D(i,0)| >= 1 then
            D(i,j) = min[D(i-1,j) + 1, D(i,j-1) + 1]
        else
            D(i,j) = |D(0,j) - D(i,0)|
END

Example 1
0    1 (A)    2 (G)    3 (C)    4 (A)    5 (T)

Output from Example 1
Quality value rank from 0 to is given by dark gray is given by a shade lighter is given by white (bright shade).
Sequence: A G C A T
Error Probability
Quality value

Example 2
0    1 (A)    2 (G)    3 (C)    4 (A)    5 (T)

Output from Example 2
The last base is removed. A base is added in the second place.
Output:
Sequence: A c G C A
Quality value: the added base has a quality value of zero.

Phrap Fragment Assembly

Sequence Reconstruction Algorithm
In the shotgun approach to sequencing, small fragments of DNA are reassembled back into the original sequence. This is an example of the Shortest Common Superstring (SCS) problem, where we are given fragments and wish to find the shortest sequence containing all of them.
A superstring of the set P is a single string that contains every string in P as a substring.
For example, for
F1 = GCGC
F2 = CGCC
F3 = GGCG
the SCS is GGCGCC.

Greedy Algorithm for the Shortest Superstring Problem
The shortest superstring problem can be cast as a Hamiltonian path problem on the graph of fragment overlaps and is closely related to the Traveling Salesman problem. It is NP-hard (its decision version is NP-complete).
A greedy algorithm exists that repeatedly merges fragments, starting with the pair with the most overlap.
Let T be the set of all fragments and let S be an empty set.
do {
    For the pair (s,t) in T with maximum overlap [s = t is allowed] {
        If s is different from t, merge s and t.
        If s = t, remove s from T and add s to S.
    }
} while ( T is not empty );
Output the concatenation of the elements of S.
This greedy algorithm runs in polynomial time, but it ignores the biological complications: the unknown orientation of each fragment, errors in the data, and insertions and deletions.
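The greedy idea can be sketched as follows. This is a simplified variant, not a transcription of the pseudocode above: it keeps a single working list and merges the best-overlapping pair of distinct fragments until one string remains, without the separate set S for self-overlaps. The function names are hypothetical, and like the slide's version it ignores orientation and sequencing errors.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    unique = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


print(greedy_superstring(["GCGC", "CGCC", "GGCG"]))
print(greedy_superstring(["GTAC", "TAATG", "TGTAA"]))
```

On these two small examples the greedy result happens to equal the true SCS; in general the greedy algorithm is only an approximation and can return a longer superstring.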

Phrap Preprocessing Steps
1. Read in sequence and quality data, trim off low-quality ends of reads, construct read complements.
2. Find pairs of reads with matching words. Eliminate exact duplicate reads. Perform Smith-Waterman pairwise alignments on pairs with matching words.
3. Find vector matches and mark them so that they are not used in assembly.
4. Find and combine near-duplicate reads.
5. Dissolve matching read pairs that do not have "solid" matching segments or self-matches.
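Step 2's "pairs of reads with matching words" can be illustrated with a simple word index. This is a sketch of the general idea only, not Phrap's implementation: it ignores reverse complements and quality values, the name `candidate_pairs` is hypothetical, and the word size of 7 is used here purely as an example parameter.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, word_size=7):
    """Return index pairs (i, j), i < j, of reads sharing at least one exact word."""
    index = defaultdict(set)  # word -> set of read indices containing it
    for read_id, read in enumerate(reads):
        for start in range(len(read) - word_size + 1):
            index[read[start:start + word_size]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGTACGTAA", "TTACGTACGG", "GGGGGGGGGG"]
print(candidate_pairs(reads))
```

Only the pairs returned here would then go on to the more expensive Smith-Waterman alignment, which is the point of filtering on shared words first.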

Smith-Waterman Scoring
SW_{i,j} = \max\{ SW_{i-1,j-1} + s(a_i, b_j); SW_{i-k,j} + g_j; SW_{i,j-k} + g_i; 0 \}
SW_{i,j} is the score of the partial alignment of sequence a ending at residue i and sequence b ending at residue j.
The score is taken as the maximum of the 4 terms:
SW_{i-1,j-1} + s(a_i, b_j): extends the alignment by one residue in each sequence
SW_{i-k,j} + g_j: extends to j in sequence b and inserts a single matching gap in sequence a
SW_{i,j-k} + g_i: extends to i in sequence a and inserts a single matching gap in sequence b
0: ends the alignment if the score falls below zero

Smith-Waterman Algorithm
Assigns a score to each pair of bases.
Uses similarity scores only.
Uses positive scores for related residues.
Uses negative scores for substitutions and gaps.
Initializes the edges of the matrix with zeros.
As the scores are summed in the matrix, any score below zero is recorded as zero.
Begins the traceback at the maximum value found anywhere in the matrix.
Continues until the score falls to zero.
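The properties listed above can be seen in a compact implementation. The sketch below is a simplification under stated assumptions: gaps are opened one residue at a time with a constant penalty (the k = 1 case of the recurrence), and the scores of +3 for a match, -3 for a mismatch and -2 per gap are illustrative values, not Phrap's defaults. The function name `smith_waterman` is chosen for the example.

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=-2):
    """Local alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # edges stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # never let a score drop below zero
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,    # residue of a aligned to a gap
                score[i][j - 1] + gap,    # residue of b aligned to a gap
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the maximum cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TGTTACGG", "GGTTGACTA"))
```

Because every cell is clamped at zero, the alignment can start and end anywhere in either sequence, which is what makes the method local rather than global.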

Phrap Iterative Steps
6. Use pairwise matches to identify confirmed parts of reads; use these to compute revised quality values.
7. Compute LLR (log-likelihood ratio) scores for each match. The LLR score is a measure of overlap length and quality. High-quality discrepancies that might indicate different copies of a repeat lead to low LLR scores.

Phrap Steps (Continued)
8. Find the best alignment for each matching pair of reads that have more than one significant alignment in a given region (highest LLR scores among several overlapping).
9. Construct contig layouts, using consistent pairwise matches in decreasing score order (greedy algorithm).
10. Construct the contig sequence as a mosaic of the highest-quality parts of the reads.
11. Align reads to the contig; tabulate inconsistencies and possible sites of misassembly. Adjust LLR scores of the contig sequence.

Accessory Overlap Slides

What is an Overlap? These are overlaps These are not overlaps

Calculating an Overlap
Word size (* 7 *): the shortest non-gapped local pairwise alignment allowed.
Stringency (* 0.80 *): what fraction of words must match?
Minimum overlap length (* 14 *)
Values marked * ... * are user-defined variables or Phrap default values.

Overlap Sequence 1 Sequence

Overlap Plot Sequence 1 Sequence