Multiple Sequence Composition Alignment

Presentation transcript:

Multiple Sequence Composition Alignment Name: Yip Chi Kin Date: 21-12-2006

Studied Papers ․[B03] Composition Alignment ․[S98] Divide-and-conquer Alignment ․[M99] DIALIGN Algorithm ․[SMS03] DCA + Segment-based

Main Aspects ․Dynamic Programming ․Composition Alignment ․Meta-code MSA ․Simultaneous MSA Pairwise Library (Global & Local) Consistency & Ungapped Divide-and-conquer Segment-based (Optimal scores)

Dynamic Programming [Figure: dot matrix, edit graph, and DP matrix for sequences over C, T, G, A. Diagonal edges are matches scored s(ai,bi); horizontal and vertical edges are insertions and deletions, each with penalty -d.]
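The edge labels on this slide correspond to the usual global-alignment recurrence. As a sketch, with F(i,j) introduced here as an assumed name for the best score of aligning the first i residues of a with the first j residues of b, and a linear gap penalty d:

```latex
F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d

F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{(diagonal edge: } a_i \text{ aligned with } b_j\text{)} \\
F(i-1,\,j) - d             & \text{(} a_i \text{ aligned with a gap)} \\
F(i,\,j-1) - d             & \text{(} b_j \text{ aligned with a gap)}
\end{cases}
```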

Needleman-Wunsch Algorithm: Global Alignment [Figure: scoring matrix for the two sequences with negative gap-penalty border values; the individual cell values are not recoverable from the transcript.] GA Results: G C A T C - / - C T T C T
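A minimal sketch of global alignment in Python may help here. The slide's scoring values are not recoverable, so `match=1`, `mismatch=-1` and `gap=-2` are assumptions, and the function name `needleman_wunsch` is illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # diagonal: residue vs residue
                          F[i - 1][j] + gap,    # residue of a vs gap
                          F[i][j - 1] + gap)    # residue of b vs gap
    # Traceback: start in the lower right corner and work backwards.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

When several alignments share the optimal score, this traceback prefers the diagonal move, so it returns only one of them.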

Smith-Waterman Algorithm: Local Alignment [Figure: scoring matrix with axis residues - T T T A C A G G C A G and - G A C T; the individual cell values are not recoverable from the transcript.] GA Results (as transcribed): - G A A C – G G T - - T T T A C A G G C A G
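Local alignment differs from the global case in two ways: scores are clamped at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below illustrates this; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the example inputs are assumptions, not the slide's.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                     # restart: local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `smith_waterman("TTTACAGG", "ACAG")` finds the exact substring `ACAG` inside the longer sequence.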

MSA Methods ․Consistency-based ․Exact method ․Progressive method ․Iterative method ․Stochastic method ․Hidden Markov method

MSA Concepts: Consistency-based method [Figure: pairwise sequence alignments (PSAs) of short sequences over C, T, G, A, shown in the trace formulation and in the latter formulation.]

MSA Results [Figure: aligned regions of the MSA over C, T, G, A, with pairwise matches labelled unrealized, realized, or consistent.]

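Divide-and-conquer [Figure: each sequence S1, S2, S3 is divided at a cut position C1, C2, C3 into a prefix and a suffix; the division is repeated on the parts, the resulting short segments are aligned optimally, and the partial alignments are concatenated.]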

DP Distance
Sequence: GTTCATGCCAGGTGTAAATC
Columns: - C T A T A C; rows: - G T A T C
Wopt (prefix):
 0  2  4  6  8 10 12
 2  1  3  5  7  9 11
 4  3  1  3  5  7  9
 6  5  3  1  3  5  7
 8  7  5  3  1  3  5
10  8  7  5  3  2  3
Wopt (suffix):
 3  4  3  4  6  8 10
 4  2  3  2  4  6  8
 6  4  2  2  2  4  6
 8  6  4  2  1  2  4
10  8  6  4  2  0  2
12 10  8  6  4  2  0
Additional cost: CS1,S2[C1,C2] = Wopt (prefix) + Wopt (suffix) – Wopt (total)

Additional-cost
Columns: - C T A T A C; rows: - G T A T C
 0  3  4  7 11 15 19
 3  0  3  4  8 12 16
 7  4  0  2  4  8 12
11  8  4  0  1  4  8
15 12  8  4  0  0  4
19 15 12  8  4  1  0
Cost of Diagonal: CS1,S2[1,1] = 0, CS1,S2[2,2] = 0, CS1,S2[3,3] = 0, CS1,S2[4,4] = 0, CS1,S2[5,4] = 0, CS1,S2[6,5] = 0
CS1,S2[2,2] = Wopt [CT,GT] + Wopt [ATAC,ATC] – Wopt [CTATAC,GTATC] = 1 + 2 – 3 = 0
CS1,S2[4,3] = Wopt [CTAT,GTA] + Wopt [AC,TC] – Wopt [CTATAC,GTATC] = 3 + 1 – 3 = 1
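The additional-cost matrix can be computed from one forward and one backward DP pass. The sketch below assumes S1 = CTATAC and S2 = GTATC with match cost 0, mismatch cost 1 and gap cost 2; these values are inferred from the matrices on the slides rather than stated there, and the function names are illustrative.

```python
def distance_table(a, b, mismatch=1, gap=2):
    """D[i][j] = minimal alignment cost between a[:i] and b[:j] (a match costs 0)."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * gap
    for j in range(1, len(b) + 1):
        D[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D


def additional_cost(s1, s2):
    """C[c1][c2] = Wopt(prefixes) + Wopt(suffixes) - Wopt(total) for cut (c1, c2)."""
    n, m = len(s1), len(s2)
    prefix = distance_table(s1, s2)
    # Aligning the reversed strings yields the optimal costs of all suffix pairs.
    suffix = distance_table(s1[::-1], s2[::-1])
    total = prefix[n][m]
    return [[prefix[i][j] + suffix[n - i][m - j] - total for j in range(m + 1)]
            for i in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])  # cuts [2,2] and [4,3] from the slide
```

A cut with additional cost 0 lies on some optimal alignment path, which is why the divide step looks for cut positions that minimise this value.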

Space & Time: a ‘chain’ of boxes along the diagonal is searched in order to reduce searching time, instead of full sequence searching.

DIALIGN [Figure: diagonals between the sequences I A F E D C G S P W T Y and I A V L F E D C G S P W T Y. Panels: non-consistent (simultaneous), non-consistent (cross over), consistent diagonals.] GA Results (as transcribed): I A V L F E D C G S P W T Y y I A - V L F E d c G s p w T

Weighting
Diagonal weights: w(D) = – log P(lD, SD), where lD is the length of diagonal D and SD is the sum of the similarity values along the same diagonal.
Overlap weighting. Sequences: Y I A V L F A Y D D / L A C V I F G S / S W D D V M F Y A E
w(D1) = 1.9, w(D2) = 1.7, w(D3) = 1.5, w(D4) = 2.6, w(D5) = 0.2
Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7
Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
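The comparison of the two candidate sets is plain arithmetic over the diagonal weights. The following sketch takes the weights and the two sets from the slide as given; it does not check consistency itself, and the names `weights`, `candidates` and `score` are illustrative.

```python
# Diagonal weights as listed on the slide.
weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# The two sets of diagonals compared on the slide (assumed consistent).
candidates = [("D1", "D4", "D5"), ("D1", "D2", "D3", "D5")]


def score(diagonals):
    """Score of a set of diagonals = sum of their weights."""
    return sum(weights[d] for d in diagonals)


best = max(candidates, key=score)
for c in candidates:
    print(c, round(score(c), 1))
print("best:", best)
```

The set with the single heaviest diagonal (D4) is not the best one here: the four lighter diagonals together score higher.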

Consistency check [Figure: sequences S1 (positions 1 to 18), S2 (positions 1 to 18) and S3 (positions 1 to 16) with fragments f1, f2, f3. Panels: overlap weights; fragments checking; transitivity frontier [1,9].]

Greedy Approach [Figure: greedy strategy on sequences S1, S2, S3 with motifs M1 (1), M1 (2), M2, M3. Panels: tandem duplications; consistency conflicts.]
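A greedy strategy of this kind can be sketched for the simplified case of two sequences: take diagonals in order of decreasing weight and keep each one only if it is consistent with those already kept. This is a simplification for illustration, not the multi-sequence procedure from the slides; the `Diagonal` type, its fields and the example values are hypothetical.

```python
from typing import NamedTuple


class Diagonal(NamedTuple):
    i: int         # start position in the first sequence
    j: int         # start position in the second sequence
    length: int    # number of aligned residue pairs
    weight: float  # diagonal weight w(D)


def compatible(f, g):
    """Two diagonals are consistent if one ends before the other starts in both sequences."""
    f_before_g = f.i + f.length <= g.i and f.j + f.length <= g.j
    g_before_f = g.i + g.length <= f.i and g.j + g.length <= f.j
    return f_before_g or g_before_f


def greedy_select(diagonals):
    """Accept diagonals by decreasing weight, skipping any that conflict."""
    chosen = []
    for d in sorted(diagonals, key=lambda d: d.weight, reverse=True):
        if all(compatible(d, c) for c in chosen):
            chosen.append(d)
    return sorted(chosen, key=lambda d: d.i)


example = [
    Diagonal(0, 0, 3, 2.6),
    Diagonal(2, 5, 2, 1.7),  # overlaps the first diagonal in sequence 1
    Diagonal(4, 1, 2, 1.5),  # crosses over the first diagonal
    Diagonal(5, 6, 2, 0.2),
]
print(greedy_select(example))
```

Because the heaviest diagonal is accepted first, a greedy pass can lock in a choice that excludes a better-scoring combination of lighter diagonals.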

Composition Alignment [Figure: composition matches compared with single character matches over the alphabet A, C, G, T, -; composition matching (CM) of prefix lengths between Sequence #1 and Sequence #2, with the matching prefix length marked.]

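Match Length: the sequences 111010001 and 001101110 are replaced by the segments 1110100 | 01 and 0011011 | 10, with a match of length 7 between 1110100 and 0011011. [Figure: composition matching over prefix lengths 1, 4, 9, 15; the remaining values are not recoverable from the transcript.]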
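As a small illustration, a composition match can be tested by comparing character counts. The sketch assumes the usual definition, in which two equal-length strings match when they contain the same characters in the same amounts; `composition_match` is an illustrative name.

```python
from collections import Counter


def composition_match(x, y):
    """True if x and y have equal length and identical character counts."""
    return len(x) == len(y) and Counter(x) == Counter(y)


s1, s2 = "111010001", "001101110"

# Length-7 prefixes from the slide: both contain four 1s and three 0s.
print(composition_match(s1[:7], s2[:7]))
# Remaining two characters: "01" and "10".
print(composition_match(s1[7:], s2[7:]))
# A single-character prefix does not match ("1" vs "0").
print(composition_match(s1[:1], s2[:1]))
```

Unlike a character-by-character match, this test ignores the order of the characters inside the matched segments.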

Composition Matching [Figure: CM of prefix length (total = 9) between Sequence #1 and Sequence #2, with panels labelled CM = 2, CM = 2, CM = -1, CM = 1 and CM = 0.]

Meta-Code: code about code [Figure: flow diagram with the elements input code, original code, mismatch code, code reservoir, code for testing, control rule, and meta-code.]

Reservoir Codes [Figure: flow diagram. A code from S1 (e.g. code ‘A’) is stored in Reservoir S1 and a code from S2 (e.g. code ‘T’) is stored in Reservoir S2. If both codes are found in Reservoir S1 and Reservoir S2, these two codes are deleted. Example codes: ‘A’, ‘G’, ‘AG’, ‘C’, ‘CT’. Example reservoir code: AGRCT.]

Meta-Code Rule: looping for creating meta-code. If the CM length is valid, reservoir code = r and position = p (values of r and p). Copy the codes from S1 and S2, set p = p – 1, and output the meta-code (e.g. AMT). If reservoir code = r, then stop the looping.

CM (Lengths & Codes): Composition Matching of S1 and S2 in prefix length [Figure: table of meta-codes and lengths over T, A, C, G for S1 and S2. Reservoir codes in S1: A, G. Reservoir codes in S2: T, C. Meta-codes shown: R, ART, AGRCT, GRC, GARTC, AGRTT, AGRCC, ARC.]

CM of Metacode [Figure: composition matching of the meta-codes AGRTT, GARTC, ARC, ART and AGRCT over prefix lengths 2, 6, 10, 12, with one invalid length marked.]

Composition MSA [Figure: composition matching and meta-code MSA of S1 (T A C G) and S2 over T, C, G, A. Meta-codes shown: TMG, GMT, CMT, GMC, AMG, TMA. Output: new S2.]

Fixed Segment
Code catalogue: A = Currency / Cards; B = Stock / Structured P.; C = Unit Trusts / Bonds; D = Insurance / Finance; E = Mortgages / Loans
․Semi-global alignment ․Least overlap problem ․Simple segmentation ․Composition alignment ․Weekly behaviour
Segment Length LS = 5
[Figure: code sequences (e.g. A C B E D) for Branch bank #1, Branch bank #2 and Branch bank #3 over Week #1, Week #2, …; time granularities 1t1 to 1t5, 2t1 to 2t5, 3t1, 3t2.]

Family Classifications [Figure: meta-codes and PSAs of Branch bank #1, Branch bank #2 and Branch bank #3 over the codes C, B, A, D. Panels: composition alignment with its family group; fixed-segment composition MSA with its family group.]

Further Problems: Meta-Code Composition MSA ․Fixed-segment length ․Prior sequence choice ․Speed-up PSAs ․Nos. of Segments/Codes

Conclusions ․Fixed-segment Composition ․Meta-code Approach (Least Overlap Problems) ․Meta-code Approach (Easier Transform Applications) ․Widespread use of MSA (Simultaneous Multiple Sequences)