
Project Phase II Report
- Due on 10/20; send it to me by email.
- Write on top of the Phase I report.
- 5-20 pages
- Free style in writing (use 11pt font or larger)
- Methods
  - Overview (high-level description)
  - Source of data
  - Algorithm (pseudocode)
  - Prove or argue why the algorithm will work
  - Analyze the computational complexity of the algorithm
  - Plan of implementation

Multiple Sequence Alignment
Dong Xu
Computer Science Department
109 Engineering Building West

As we proceed … Warning: Muddy Road Ahead!!!

Outline
- Background (what, why)
- Scoring function
- Dynamic programming
- Star alignment
- Progressive alignment
- Profile and PSI-BLAST

Introduction
- The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences.
- No sequence ordering is required.

An Example of Multiple Alignment
  VTISCTGSESNIGAG-NHVKWYQQLPG
  VTISCTGTESNIGS--ITVNWYQQLPG
  LRLSCSSSDFIFSS--YAMYWVRQAPG
  LSLTCTVSETSFDD--YYSTWVRQPPG
  PEVTCVVVDVSHEDPQVKFNWYVDG--
  ATLVCLISDFYPGA--VTVAWKADS--
  AALGCLVKDYFPEP--VTVSWNSG---
  VSLTCLVKEFYPSD--IAVEWWSNG--

Why Multiple Alignment (1)
- Natural extension of pairwise sequence alignment
- "Pairwise alignment whispers… multiple alignment shouts out loud" (Hubbard et al., 1996)
- Much more sensitive in detecting sequence relationships and patterns

Why Multiple Alignment (2)
- Gives hints about the function and evolutionary history of a set of sequences
- Foundation for phylogenetic tree construction and protein family classification
- Useful for protein structure prediction…

Idea of Scoring
- Requirements for a good measure of alignment quality:
  - Additive function
  - Function must be independent of the order of its arguments
  - Should reward the presence of many equal or strongly related symbols in the same column, and penalize unrelated symbols and spaces

Distance from Consensus
- Consensus sequence: a single sequence that gives the most common amino acid/base at each position

  DDGAV-EAL
  DGG---EAL
  EGGILVEAL
  D-GILVQAV
  EGGAVVQAL

  Consensus: D G G A/I V/L V E A L
- Distance from consensus: the total number of characters in the alignment that differ from the consensus character of their column
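The consensus and the distance from it can be computed column by column. The sketch below is one possible reading of the slide, under two assumptions that the slide does not state: the consensus is chosen among non-gap characters only, and a gap counts as a character that differs from the consensus. The function name `consensus_and_distance` is made up for this example.

```python
from collections import Counter

def consensus_and_distance(rows, gap="-"):
    """Return (consensus symbol per column, distance from consensus).

    Assumptions: the consensus ignores gaps, ties are reported as "X/Y",
    and every character (gaps included) that is not the consensus counts
    as one unit of distance.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    consensus, distance = [], 0
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch != gap)
        best = max(counts.values(), default=0)
        winners = sorted(ch for ch, n in counts.items() if n == best)
        consensus.append("/".join(winners) if winners else gap)
        # Everything in the column except the `best` agreeing characters differs.
        distance += len(column) - best
    return consensus, distance

rows = ["DDGAV-EAL", "DGG---EAL", "EGGILVEAL", "D-GILVQAV", "EGGAVVQAL"]
consensus, distance = consensus_and_distance(rows)
print(" ".join(consensus))
print(distance)
```

In a tied column the distance is the same whichever tied residue is taken as the consensus, so the ties do not make the count ambiguous.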

Sum-of-Pairs (SP) Measure
- The Sum of Pairs (SP) method is as follows. Given (1) a set of N aligned sequences, each of length L, in the form of an L x N alignment matrix M, and (2) a substitution matrix (PAM or BLOSUM) that gives a cost c(x, y) for aligning two characters x and y:
- The SP score of the i-th column $m_i$ of M is $SP(m_i) = \sum_{j<k} c(m_i^j, m_i^k)$, where $m_i^j$ is the character of sequence j in column i.
- The SP score of M is $\sum_i SP(m_i)$.
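A minimal sketch of the SP score follows. The cost function here is a toy stand-in (match 1, mismatch 0, residue against gap -1, gap against gap 0) and not PAM or BLOSUM; the names `cost`, `sp_column`, and `sp_score` are chosen for this example only.

```python
from itertools import combinations

def cost(x, y, gap="-"):
    """Toy substitution cost c(x, y); an assumption, not PAM/BLOSUM."""
    if x == gap and y == gap:
        return 0
    if x == gap or y == gap:
        return -1
    return 1 if x == y else 0

def sp_column(column):
    """SP(m_i): sum of c over all unordered pairs of characters in one column."""
    return sum(cost(x, y) for x, y in combinations(column, 2))

def sp_score(rows):
    """SP score of the whole alignment: sum of the column scores."""
    return sum(sp_column(column) for column in zip(*rows))

rows = ["AC-", "ACG", "A-G"]
column_scores = [sp_column(column) for column in zip(*rows)]
total = sp_score(rows)
print(column_scores, total)
```

With a real substitution matrix, only `cost` would change; the pair enumeration stays the same.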

Feature of SP Measure
- Theorem: let $\alpha$ be a multiple alignment of the sequences $s_1, \ldots, s_k$, and let $\alpha_{ij}$ denote the pairwise alignment of $s_i$ and $s_j$ induced by $\alpha$. Then $\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$.
  - This holds only if s(-,-) = 0.
  - The reason is that a column in which both sequences have a space (-) is ignored in a pairwise alignment.
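The theorem can be checked numerically on a small alignment. The sketch below uses a toy scoring function (an assumption, as are all function names) and compares the column-wise SP score with the sum of induced pairwise scores, once with s(-,-) = 0 and once with s(-,-) = -1.

```python
from itertools import combinations

GAP = "-"

def s(x, y, gap_gap=0):
    """Toy score: match 1, mismatch 0, residue/gap -1, gap/gap = gap_gap."""
    if x == GAP and y == GAP:
        return gap_gap
    if x == GAP or y == GAP:
        return -1
    return 1 if x == y else 0

def sp_score(rows, gap_gap=0):
    """Column-wise SP score of a multiple alignment."""
    return sum(s(x, y, gap_gap)
               for column in zip(*rows)
               for x, y in combinations(column, 2))

def induced_pairwise_score(a, b):
    """Score of the induced pairwise alignment: gap/gap columns are dropped."""
    return sum(s(x, y) for x, y in zip(a, b) if not (x == GAP and y == GAP))

def sum_of_induced(rows):
    return sum(induced_pairwise_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-", "A-G", "A-G"]   # rows 2 and 3 share a gap/gap column
print(sp_score(rows, gap_gap=0), sum_of_induced(rows))   # equal
print(sp_score(rows, gap_gap=-1), sum_of_induced(rows))  # differ
```

The two quantities agree only in the first case, because the induced pairwise alignment never sees the shared gap column.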

Profile Distance
- In the minimum entropy approach, the score of a multiple alignment is defined as the sum of the entropies of its columns (the entropy of a column depends on the frequency of each letter x in that column).
- Information content
- Profile score
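As an illustration of the minimum entropy idea, the sketch below scores an alignment by summing the Shannon entropy (base 2) of each column. Treating the gap as just another symbol is an assumption made here; the slide does not say how gaps are handled.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def min_entropy_score(rows):
    """Sum of column entropies; lower means more conserved columns."""
    return sum(column_entropy(column) for column in zip(*rows))

rows = ["AC", "AC", "AT", "AT"]
score = min_entropy_score(rows)
print(score)
```

A fully conserved column contributes 0 bits, and a column split evenly between two letters contributes 1 bit.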

Dynamic Programming
- Multiple sequence alignment can be done with dynamic programming, using $(n+1)^k$ memory cells, where n is the length of each of the k sequences to be aligned.
[Figure: three-dimensional dynamic programming lattice with axes Sequence 1, Sequence 2, Sequence 3]

Complexity Analysis of Dynamic Programming
- Running time:
  - The table has $(n+1)^k$ entries.
  - For each entry we need to find the maximum of $2^k - 1$ elements.
  - Finding the SP score corresponding to each element means adding $O(k^2)$ numbers.
  - Total: $O(k^2 2^k n^k)$, which is exponential in the number of sequences k (finding the optimal SP alignment is NP-hard).
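To get a feel for these numbers, the following sketch simply evaluates the three factors for given n and k. The function name `dp_work` is made up, and the product is a rough operation count rather than a measured running time.

```python
from math import comb

def dp_work(n, k):
    """Return (table cells, predecessors per cell, pairs per column, product)."""
    cells = (n + 1) ** k          # one cell per k-tuple of prefix lengths
    predecessors = 2 ** k - 1     # each sequence either advances or takes a gap
    pairs = comb(k, 2)            # pairwise terms in one SP column score
    return cells, predecessors, pairs, cells * predecessors * pairs

for k in (2, 3, 5, 10):
    print(k, dp_work(100, k))
```

For n = 100 the table already has about a million cells at k = 3 and about 10^20 at k = 10, which is why the heuristic methods on the following slides matter.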

Center Star Method
- The center star method is an approximation algorithm that applies when the score can be decomposed into pairwise distances.
- It aims at minimizing the sum of pairwise distances.
- It works well when the distance satisfies the triangle inequality: $d(s,t) \le d(s,c) + d(c,t)$ for all sequences s, t, and c.

Star Alignments
- Select a sequence $s_c$ as the center of the star.
- For each sequence $s_i$ among $s_1, \ldots, s_k$ with $i \ne c$, perform a Needleman-Wunsch global alignment of $s_i$ against $s_c$.
- Aggregate the alignments with the principle "once a gap, always a gap."

Star Alignments Example
Sequences (s2 is the center of the star):
  s1: MPE
  s2: MKE
  s3: MSKE
  s4: SKE

Pairwise alignments against the center:
  MPE     MSKE    SKE
  MKE     -MKE    MKE

Merging the pairwise alignments one at a time:
  MPE     -MPE    -MPE
  MKE     -MKE    -MKE
          MSKE    MSKE
                  -SKE

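Choosing a Center
- Try them all and pick the one with the best score.
- Calculate all $O(k^2)$ pairwise alignments, and pick the sequence $s_c$ that maximizes the sum of its alignment scores against the other sequences, $\sum_{i \ne c} \text{score}(s_c, s_i)$.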
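The star alignment steps and the choice of center can be sketched together as follows. This is an illustrative sketch, not the course's reference solution: the scoring (match 1, mismatch 0, linear gap -1) and the traceback tie-breaking are assumptions, and all function names are made up. Because of those choices the gap placement can differ from the slide's example while still being a valid star alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal move
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

def add_to_star(rows, center_aln, seq_aln):
    """Merge one pairwise alignment into the star; rows[0] is the gapped center.
    Gaps already present are never removed ("once a gap, always a gap")."""
    cur, new_rows, new_seq, p, q = rows[0], [[] for _ in rows], [], 0, 0
    while p < len(cur) or q < len(center_aln):
        if (p < len(cur) and q < len(center_aln)
                and (cur[p] == "-") == (center_aln[q] == "-")):
            for out, row in zip(new_rows, rows):   # same column in both
                out.append(row[p])
            new_seq.append(seq_aln[q])
            p, q = p + 1, q + 1
        elif p < len(cur) and cur[p] == "-":       # old gap column: pad new row
            for out, row in zip(new_rows, rows):
                out.append(row[p])
            new_seq.append("-")
            p += 1
        else:                                      # new gap in center: pad old rows
            for out in new_rows:
                out.append("-")
            new_seq.append(seq_aln[q])
            q += 1
    return ["".join(r) for r in new_rows] + ["".join(new_seq)]

def center_star(seqs):
    """Return (index of center, aligned rows in the input order)."""
    k = len(seqs)
    totals = [sum(needleman_wunsch(seqs[c], seqs[i])[2] for i in range(k) if i != c)
              for c in range(k)]
    c = max(range(k), key=totals.__getitem__)
    rows, order = [seqs[c]], [c]
    for i in range(k):
        if i != c:
            center_aln, seq_aln, _ = needleman_wunsch(seqs[c], seqs[i])
            rows = add_to_star(rows, center_aln, seq_aln)
            order.append(i)
    aligned = [None] * k
    for idx, row in zip(order, rows):
        aligned[idx] = row
    return c, aligned

seqs = ["MPE", "MKE", "MSKE", "SKE"]
center, aligned = center_star(seqs)
print(center)
print("\n".join(aligned))
```

On the four example sequences this sketch also picks s2 (MKE) as the center. The pairwise alignments are recomputed in `center_star` for brevity; caching them from the center selection step would avoid the repeated work.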

Performance
- If C is the SP distance of the center star alignment, then C/2 is a lower bound on the SP distance of any multiple alignment of the same sequences.
- C is therefore at most twice the optimum, and this bound on how bad the result can be makes the center star method an approximation algorithm for the NP-hard problem.
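The factor of two can be derived in a few lines. The following is a sketch of the standard argument, under assumptions not spelled out on the slide: $D(s,t)$ is the optimal pairwise distance, $d_M(s_i,s_j)$ is the distance induced by a multiple alignment $M$, the character-level cost satisfies the triangle inequality, and the center $s_c$ minimizes $\sum_i D(s_c, s_i)$. $M_c$ denotes the center star alignment and $M^*$ an optimal one.

```latex
\begin{align*}
2\,SP(M_c) &= \sum_{i \ne j} d_{M_c}(s_i, s_j)
  \le \sum_{i \ne j} \bigl[ D(s_i, s_c) + D(s_c, s_j) \bigr]
  = 2(k-1) \sum_{i} D(s_c, s_i), \\
2\,SP(M^*) &= \sum_{i \ne j} d_{M^*}(s_i, s_j)
  \ge \sum_{i \ne j} D(s_i, s_j)
  = \sum_{i} \sum_{j} D(s_i, s_j)
  \ge k \sum_{j} D(s_c, s_j), \\
\frac{SP(M_c)}{SP(M^*)} &\le \frac{2(k-1)}{k} < 2 .
\end{align*}
```

The first line uses the fact that the star alignment preserves each optimal pairwise alignment with the center; the second uses the fact that no induced pairwise alignment can beat the optimal pairwise distance.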

Complexity Analysis
- Assuming k sequences of length n:
  - $O(n^2)$ to calculate one global alignment
  - $O(k^2)$ global alignments to calculate
  - $O(k^2 n^2)$ overall cost

Progressive Alignment
- Devised by Feng and Doolittle.
- Essentially a heuristic method, not guaranteed to find the "optimal" alignment.
- Multiple alignment is achieved by successive application of pairwise methods.

Basic Algorithm
- Compare all sequences pairwise.
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment (guide tree).
- Build the alignment step by step according to the guide tree: first align the most similar pair of sequences, then add another sequence or another pairwise alignment.
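The clustering step can be sketched with simple average-linkage (UPGMA-style) clustering on a pairwise distance table. This is a stand-in chosen for brevity and is not neighbor-joining; the sequence names and distances below are invented for illustration.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Average-linkage clustering; returns the guide tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name, [name]) for name in names]   # (subtree, member names)

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Pick the two closest clusters and join them into one subtree.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.1   # hypothetical distances
dist[frozenset(("C", "D"))] = 0.2
tree = guide_tree(names, dist)
print(tree)
```

The nesting of the returned tuples gives the order of alignment: the innermost pairs are aligned first, and the resulting alignments are then aligned to each other.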

Steps in Progressive Multiple Alignment
- Compare sequences pairwise
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment

Building the Guide Tree

Using Weight
[Figure: distance matrix and guide tree for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, and Myg_Whale; the only matrix entry that survives is .17, between Hbb_Human and Hbb_Horse]
- Quick pairwise alignment: calculate the distance matrix
- Neighbor-joining tree (guide tree)

Alignment (1)
- Build the multiple alignment by first aligning the most similar pair, then the next most similar pair, and so on.

Alignment (2)

Scoring
  Alignment 1:  PQRRZW
                YQRKZX
  Alignment 2:  YZTUOP
                TZZ_FO
For the fourth column (R, K against U, _):
  Total Score = [w(R, U) + w(R, _) + w(K, U) + w(K, _)] / 4
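The column score used when aligning two alignments is the average of w over all cross pairs. The sketch below assumes the symbol lost in the transcript is the gap `_` of TZZ_FO, and uses a toy weight function (match 1, mismatch 0, anything against a gap -1) that is not taken from the slides.

```python
def w(x, y, gap="_"):
    """Toy weight function; an assumption for illustration only."""
    if x == gap or y == gap:
        return -1
    return 1 if x == y else 0

def column_pair_score(col_a, col_b, weight=w):
    """Average of weight(x, y) over every x in col_a and every y in col_b."""
    return sum(weight(x, y) for x in col_a for y in col_b) / (len(col_a) * len(col_b))

aln1 = ["PQRRZW", "YQRKZX"]
aln2 = ["YZTUOP", "TZZ_FO"]
col = 3  # fourth column, zero-based
score = column_pair_score([r[col] for r in aln1], [r[col] for r in aln2])
print(score)
```

With two rows on each side there are four cross pairs, which matches the division by 4 on the slide.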

CLUSTAL
- Most successful implementation of progressive alignment (Des Higgins)
- CLUSTAL: gives equal weight to all sequences
- CLUSTALW: has the ability to give different weights to the sequences
- CLUSTALX: provides a GUI to CLUSTAL
- align/multi-align.html

Problems with Progressive Alignments
- Greedy nature: no guarantee of a global optimum
- Depends on pairwise alignments
- If sequences are very distantly related, errors are much more likely
- Care must be taken in choosing scoring matrices and penalties
- Other approaches use Bayesian methods such as hidden Markov models

Concept of Profile
[Figure: aligned sequences Seq1, Seq3, Seq4, Seq2 stacked to form a profile]
- Information about the degree of conservation of sequence positions is included.

Position-specific Score Matrix (PSSM)
- For a protein of length L, the scoring matrix is L x 20: PSSM(i,j), the "profile", gives a specific score for each of the 20 amino acids at each position in the sequence.
- For a highly conserved residue at a particular position, a high positive score is assigned, and the other residues are assigned strongly negative scores.
- For a weakly conserved position, a value close to zero is assigned to all the amino acid types.

Building a Profile
- First, get a multiple sequence alignment using a substitution matrix, $S_{jk}$.
- Second, count the number of occurrences of amino acid k at position i, $C_{ik}$.
- (1) Average-score method: $W_{ij} = \sum_k C_{ik} S_{jk} / N$, where N is the number of aligned sequences.
- (2) Log-odds-ratio formula: $W_{ij} = \log(q_{ij}/p_j)$, where $q_{ij} = C_{ij}/N$ and $p_j$ is the background probability of residue j.

Calculating Profiles (1)
- Gribskov et al., Proc. Natl. Acad. Sci. USA 84 (1987)

  ACGCTAFKI
  GCGCTAFKI
  ACGCTAFKL
  GCGCTGFKI
  GCGCTLFKI
  ASGCTAFKL
  ACACTAFKL

- $C_{1A} = 4$, $C_{1G} = 3$
- $W_{1A} = (4 \times S_{AA} + 3 \times S_{AG}) / 7 = (4 \times 4 + 3 \times 0) / 7 \approx 2.3$
- $W_{ij} = \sum_k C_{ik} S_{jk} / N$
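The worked example can be reproduced in a few lines. The substitution values used below ($S_{AA} = 4$, $S_{AG} = 0$) are assumptions chosen to be consistent with the result 2.3 on the slide, and the function name `average_score` is made up.

```python
from collections import Counter

alignment = [
    "ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
    "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL",
]

# Assumed substitution scores, only the two entries this example needs.
S = {("A", "A"): 4, ("A", "G"): 0}

def average_score(column, j, S):
    """W_ij = sum_k C_ik * S_jk / N for one alignment column."""
    counts = Counter(column)                      # C_ik for this position
    return sum(c * S[(j, k)] for k, c in counts.items()) / len(column)

first_column = [row[0] for row in alignment]
w_1A = average_score(first_column, "A", S)
print(Counter(first_column), round(w_1A, 1))
```

Filling in a full 20 x 20 substitution matrix for `S` would give the complete profile row for position 1.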

Calculating Profiles (2)
- $W_{ij} = \log(q_{ij}/p_j)$, where $q_{ij} = C_{ij}/N$ and $p_j$ is the background probability of residue j.
- For small N, the estimate $q_{ij} = C_{ij}/N$ is not good.
- A large set of too closely related sequences carries little more information than a single member.
- Absence of Leu does not mean no Leu at this position when Ile is abundant!
- Pseudocount frequency, $g_{ij}$

Frequency Matrix
- Effective frequency, $f_{ij}$: the frequency matrix element $f_{ij}$ is the probability of amino acid j at position i.
- Pseudocount frequency: $g_{ij} = p_j \sum_k q_{ik} \exp(S_{kj})$, where $p_j$ is the background frequency.
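A small sketch of how pseudocounts keep the log-odds score finite for unobserved residues follows. The pseudocount formula is the one on the slide; the way observed and pseudocount frequencies are mixed into $f_{ij}$, here $(\alpha q_{ij} + \beta g_{ij}) / (\alpha + \beta)$ with hypothetical weights `alpha` and `beta`, is an assumption, since the slide's own expression for $f_{ij}$ did not survive. The two-letter alphabet and its scores are toy values.

```python
from math import exp, log

def pseudocount_frequencies(q_i, p, S):
    """g_ij = p_j * sum_k q_ik * exp(S_kj) for one position i."""
    return {j: p[j] * sum(q_i[k] * exp(S[(k, j)]) for k in q_i) for j in p}

def log_odds(q_i, p, S, alpha=1.0, beta=1.0):
    """W_ij = log(f_ij / p_j) with f_ij a weighted mix of q_ij and g_ij (assumed)."""
    g_i = pseudocount_frequencies(q_i, p, S)
    f_i = {j: (alpha * q_i[j] + beta * g_i[j]) / (alpha + beta) for j in p}
    return {j: log(f_i[j] / p[j]) for j in p}

# Toy two-letter alphabet; scores chosen so that each g_i sums to 1.
p = {"A": 0.5, "B": 0.5}
S = {("A", "A"): log(1.5), ("B", "B"): log(1.5),
     ("A", "B"): log(0.5), ("B", "A"): log(0.5)}
q_i = {"A": 1.0, "B": 0.0}       # only A was observed at this position

g_i = pseudocount_frequencies(q_i, p, S)
W_i = log_odds(q_i, p, S)
print(g_i, W_i)
```

Although B was never observed, its score is a finite negative number instead of $\log 0$, which is the point of the Leu/Ile remark on the previous slide.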

Frequency matrix, example
[Figure: frequency matrix with rows i (position) = 1,…,L and columns j (amino acid type) = 1,…,20]

Profile Alignment (1)
[Figure: a sequence aligned against a profile whose columns are labelled ACD……VWY]

Profile Alignment (2)
- Alignment of alignments: sequence-profile alignment and profile-profile alignment.
- Penalize gaps in conserved regions more heavily than gaps in more variable regions.
- Dynamic programming (same idea as in pairwise sequence alignment): optimal alignment in time $O(a^2 l^2)$, where a is the alphabet size and l is the sequence length.
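Sequence-profile alignment uses the same recurrence as pairwise global alignment, with the substitution score replaced by a lookup in the profile column. The sketch below is a simplification: it uses one uniform gap penalty rather than heavier penalties in conserved regions, and the toy PSSM, the default score for missing entries, and the function name are all assumptions.

```python
def align_to_profile(seq, pssm, gap=-2, missing=-1):
    """Best global alignment score of seq against a PSSM.

    pssm is a list of dicts, one per profile position, mapping residue -> score;
    residues absent from a column get the `missing` score (an assumption).
    """
    n, m = len(seq), len(pssm)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pssm[j - 1].get(seq[i - 1], missing),  # residue on column
                dp[i - 1][j] + gap,   # residue against a gap in the profile
                dp[i][j - 1] + gap,   # profile column skipped
            )
    return dp[n][m]

pssm = [{"A": 3, "C": -1}, {"C": 3, "A": -1}, {"D": 3}]
print(align_to_profile("ACD", pssm), align_to_profile("AD", pssm))
```

Making `gap` a per-column list would be one way to penalize gaps more heavily at conserved positions.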

PSI-BLAST
- PSI-BLAST (Position-Specific Iterated BLAST) is an automatic profile-like search.
- The program first performs a gapped BLAST search of the database. The information from the significant alignments is then used to construct a "position-specific" score matrix. This matrix replaces the query sequence in the next round of database searching.
- The program may be iterated until no new significant alignments are found.

Search (Building Profile) with PSI-BLAST
- Query sequence is the "master" for the multiple alignment.
- Profile length = query sequence length.
- Find family members with low E-value (e.g. <0.01).
- Exclude sequences with >98% identity.
- Build frequency matrix.
- Build PSSM.
- Build alignment using PSSM.
- Add new sequences.
- Iterate…
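The iteration itself is a simple loop around a search step and a profile-building step. The sketch below shows only that control flow; `search` and `build_pssm` are hypothetical callables supplied by the caller, and the toy stand-ins at the bottom do not call BLAST or model its statistics.

```python
def iterate_profile_search(query, search, build_pssm, e_cutoff=0.01, max_rounds=5):
    """Repeat search -> rebuild profile until no new significant hits appear.

    search(model) must return (hit_id, e_value) pairs; build_pssm(query, hits)
    must return the model for the next round. Both are hypothetical interfaces.
    """
    model, hits = query, set()
    for _ in range(max_rounds):
        found = {hit for hit, e_value in search(model) if e_value < e_cutoff}
        if found <= hits:          # nothing new: converged
            break
        hits |= found
        model = build_pssm(query, hits)
    return hits

# Toy stand-ins: the first round searches with the raw query, later rounds
# with a "profile", which here is just the frozen set of hits.
def toy_search(model):
    if isinstance(model, str):
        return [("a", 0.001), ("b", 0.5)]
    return [("a", 0.001), ("b", 0.005), ("c", 0.2)]

def toy_build_pssm(query, hits):
    return frozenset(hits)

result = iterate_profile_search("MKE", toy_search, toy_build_pssm)
print(sorted(result))
```

In the toy run, hit "b" becomes significant only once the profile replaces the query, which mirrors the gain in sensitivity that the iteration is meant to provide.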

Significance of PSI-BLAST
- PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
- Much more sensitive than BLAST
- E-values can be misleading!

Reading Assignments
- Suggested reading:
  - Chapter 4 in "Current Topics in Computational Molecular Biology", edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press.
- Optional reading:
  - Chapter 7 in Pavel Pevzner, "Computational Molecular Biology - An Algorithmic Approach". MIT Press, 2000.

Project Assignment
- Develop a program that implements the Center Star algorithm.
  1. Modify your code from the first assignment for global alignment.
  2. Use edit distance (match 1; otherwise 0) with gap penalty -1 - k (k is the gap size).