BIOINFORMATICS Ayesha M. Khan Spring 2013 1 Lec-6.

Presentation transcript:


Some statistics of local sequence comparison (BLAST)
- Once BLAST has found a sequence in the database that is similar to the query, it is helpful to have some idea of whether the alignment is “good” and portrays a possible biological relationship, or whether the observed similarity is attributable to chance alone.
- BLAST uses statistical theory to produce a bit score and an expect value (E-value) for each alignment pair (query to hit).

BLAST Results: Scores and Values
Max score = highest alignment score (bit score) between the query sequence and the database sequence segment.
Total score = sum of the alignment scores of all segments from the same database sequence that match the query sequence (calculated over all segments). This score differs from the max score if several parts of the database sequence match different parts of the query sequence.
Query coverage = percent of the query length that is included in the aligned segments. This coverage is calculated over all segments.
E-value = number of alignments expected by chance with a particular score or better.

Some details: Bit score
- The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.
- In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.
- Key element: the substitution matrix

Bit score (contd.)
- The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn, megaBLAST and discontiguous megaBLAST (programs that perform nucleotide–nucleotide comparisons and hence do not use protein-specific matrices).
- Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.

Some details: E-value
The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An E-value of 0.05 means that 0.05 alignments with this score or better are expected by chance alone in a search of this size; for such small values this is approximately a 5 in 100 (1 in 20) probability of seeing the hit by chance.
E = K m n e^{-λS}
m, n: size of the search space (n is the length of the query sequence, m is the length of the database)
K: scale parameter for the size of the search space
λ: scale parameter for the scoring method
S: raw alignment score; the bit score S' = (λS - ln K) / ln 2 is its normalized form, so that E = m n 2^{-S'}
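The relationship between score and E-value can be sketched in a few lines of Python. In the standard Karlin-Altschul formulation, the S in E = K m n e^{-λS} is the raw alignment score, and the bit score is its normalized form S' = (λS - ln K) / ln 2. The function names and the numeric values of λ, K, m and n below are illustrative assumptions, not values from the lecture.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S), with S the raw score."""
    return k * m * n * math.exp(-lam * raw_score)

def e_value_from_bits(bits, m, n):
    """Equivalent form using the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

def p_value(e):
    """Probability of at least one chance hit, given E expected chance hits."""
    return 1.0 - math.exp(-e)

if __name__ == "__main__":
    # Hypothetical parameters, chosen only to illustrate the arithmetic.
    lam, k = 0.267, 0.041
    m, n = 5_000_000, 250      # database length, query length
    raw = 100
    bits = bit_score(raw, lam, k)
    print(f"bit score: {bits:.1f}")
    print(f"E-value:   {e_value_from_raw(raw, m, n, lam, k):.3g}")
    print(f"P(E=0.05): {p_value(0.05):.4f}")
```

Both E-value functions return the same number, which is why bit scores can be compared across scoring systems: λ and K have already been folded in.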

Difference between BLOSUM and PAM matrices
- BLOSUM comes from alignments of shorter sequences: blocks of sequences that match each other at some defined level of similarity. The BLOSUM method thereby incorporates much more data into its matrices, and is therefore presumably more accurate.
- PAM is derived from global alignments of closely related proteins.
- BLOSUM matrices tend to be more sensitive to distant relationships than PAM.
- BLOSUM tends to give higher scores to substitutions involving hydrophilic amino acids and lower scores to substitutions involving hydrophobic amino acids than PAM.
- Substitutions of rare amino acids are more tolerated by BLOSUM.
- General rules:
  -Use higher PAM or lower BLOSUM matrices for more divergent sequences
  -Use lower PAM or higher BLOSUM matrices for more closely related sequences

Concept of Gaps in Alignment
- Sequences may have diverged from a common ancestor through various types of mutations:
  Substitutions
  Insertions
  Deletions
- The latter two will result in gaps in alignments.

Gap Penalty
- Gap penalties are used during sequence alignment to penalize gaps.
- The gap extension penalty is usually much smaller than the gap opening penalty, so that, for instance, 10 insertions of one nucleotide each are penalized more heavily than one insertion of 10 nucleotides.
- That is, opening a new gap is less probable than extending an existing gap over more than one nucleotide: a single mutation event (inserting or deleting several nucleotides at once) is more probable than multiple independent mutation events.

Gap Penalty
- Linear gap penalties
  Simplest type of gap penalty
  The overall penalty for one large gap is the same as for many small gaps of the same total length
  w_k = c L, where L is the gap length and c is the penalty per gap position
- Affine gap penalties
  Have a gap opening penalty, c, and a gap extension penalty, e
  w_k = c + (L - 1) e
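The two penalty schemes can be compared directly with a short sketch. The function names and the example values c = 10 and e = 1 are assumptions made for illustration; only the formulas w_k = c L and w_k = c + (L - 1) e come from the slide.

```python
def linear_gap_penalty(length, c):
    """Linear scheme: every gap position costs c, so w = c * L."""
    return c * length

def affine_gap_penalty(length, c, e):
    """Affine scheme: opening costs c, each further position costs e."""
    if length <= 0:
        return 0
    return c + (length - 1) * e

if __name__ == "__main__":
    c, e = 10, 1  # hypothetical opening and extension penalties

    # Ten separate gaps of length 1 versus one gap of length 10.
    many_small_linear = 10 * linear_gap_penalty(1, c)
    one_large_linear = linear_gap_penalty(10, c)
    many_small_affine = 10 * affine_gap_penalty(1, c, e)
    one_large_affine = affine_gap_penalty(10, c, e)

    print("linear:", many_small_linear, one_large_linear)
    print("affine:", many_small_affine, one_large_affine)
```

Under the linear scheme the two scenarios cost the same; under the affine scheme the single long gap is much cheaper, which matches the idea that one mutation event is more probable than ten independent ones.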

BLAST & FASTA: heuristic methods
- BLAST and FASTA use heuristic methods that attempt to approximate the optimal local similarity shared by two sequences.
- They use word, or k-tuple, methods.
- They align two sequences very quickly, by first searching for identical short stretches of sequence (called words, or k-tuples) and then joining these words into an alignment by the dynamic programming method.

BLAST…
- The BLAST programs are used to find high-scoring local alignments between a query sequence and a target database.
- The BLAST algorithm is based on the fact that true match alignments are very likely to contain short stretches of identities, or very high-scoring matches, somewhere within them.
- So BLAST initially looks for such short stretches and uses them as ‘seeds’ from which it extends outwards in search of a good longer alignment.

Main stages of BLAST
1. Remove (filter) low-complexity regions from the query Q
2. Harvest k-tuples (triples) from Q
3. Expand each triple into ~50 high-scoring words
4. Seed a set of possible alignments
5. Generate high-scoring pairs (HSPs) from the seeds
6. Test the significance of matches from the HSPs
7. Report the alignments found from the HSPs
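The seed-and-extend idea behind stages 2, 4 and 5 can be sketched as follows. This is a deliberately simplified illustration and not BLAST itself: it matches only identical words (no expansion into high-scoring neighbour words), skips low-complexity filtering and significance testing, and extends without gaps. All function names, the match/mismatch scores and the drop-off threshold `x_drop` are assumptions.

```python
from collections import defaultdict

def harvest_words(query, k):
    """Map each k-letter word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    index = harvest_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=3):
    """Ungapped extension of a seed in both directions.

    Extension stops when the running score falls more than x_drop below
    the best score seen. Returns (score, q_start, q_end, s_start).
    """
    best = cur = k * match
    right = 0
    qi, sj = i + k, j + k
    while qi < len(query) and sj < len(subject):
        cur += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if cur > best:
            best, right = cur, qi - (i + k)
        elif best - cur > x_drop:
            break
    cur = best
    left = 0
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        cur += match if query[qi] == subject[sj] else mismatch
        if cur > best:
            best, left = cur, i - qi
        elif best - cur > x_drop:
            break
        qi -= 1
        sj -= 1
    return best, i - left, i + k + right, j - left

def best_hsp(query, subject, k):
    """Extend every seed and keep the highest-scoring segment pair."""
    hsps = [extend_seed(query, subject, i, j, k)
            for i, j in find_seeds(query, subject, k)]
    return max(hsps, default=None)

if __name__ == "__main__":
    print(best_hsp("ACGTACGT", "TTACGTACGTTT", 4))
```

Indexing the query words once and scanning the subject against that index is what makes the seeding step fast compared with full dynamic programming.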

Main stages of BLAST (contd.)

Multiple Sequence Alignment
Why do we need to carry out multiple sequence alignments?
- To make connections between more than two family members
- To reveal conserved family characteristics
An MSA is a 2D table: rows represent individual sequences and columns the residue positions.
Absolute position: property of the sequence
Relative position: property of the alignment

Example:

MSA: computational complexity
- Pairwise alignment by dynamic programming takes time of order O(m1 m2), where O denotes the order of the time taken by the algorithm and m1 and m2 are the sequence lengths.
- When more sequences are considered, the time complexity becomes O(m1 m2 m3 … ml), where ml is the length of the last sequence in the comparison set.
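To get a feel for how quickly this product grows, the size of the full dynamic programming table can be computed directly. The helper name `dp_cells` and the choice of sequences of length 100 are assumptions for illustration; each dimension has one extra row or column for the empty prefix.

```python
from math import prod

def dp_cells(lengths):
    """Number of cells in a full dynamic programming table.

    One dimension per sequence, each of size (length + 1) to include
    the empty prefix.
    """
    return prod(length + 1 for length in lengths)

if __name__ == "__main__":
    for count in range(2, 7):
        cells = dp_cells([100] * count)
        print(f"{count} sequences of length 100: {cells:,} cells")
```

Every additional sequence multiplies the table size by roughly its length, which is why exact simultaneous alignment is practical only for a handful of short sequences.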

Simultaneous methods vs progressive methods
Simultaneous methods:
- Align all the sequences in a given set at once
- Extend the 2D matrix to three or more dimensions
- The number of dimensions reflects the number of sequences to be aligned
- Work best on small sets of short sequences
Progressive methods:
- Align pairs of sequences or build sequence clusters
- Use heuristics to reach an alignment in a timely and cost-efficient manner

MSA models
- There are several models for assessing the score of a given multiple sequence alignment. The most popular ones are sum-of-pairs (SP), tree alignment, and consensus alignment.
Note: which of the above models are progressive alignments and which are based on dynamic programming? (You should be able to answer after a few slides.)

Sum-of-pairs (SP)
- Recall that the standard computational formulation of the pairwise problem is to identify the alignment that maximizes protein sequence similarity, which is typically defined as the sum of substitution matrix scores for each aligned pair of residues, minus some penalties for gaps.
- The mathematically (though not necessarily biologically) exact solution can be found in a fraction of a second for a pair of proteins. This approach is generalized to the multiple sequence case by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, i.e. the sum-of-pairs, or SP, score.

Sum-of-pairs (SP)...
- The SP score for the complete alignment M is the sum of the scores for each column (m_i) in the alignment: SP(M) = Σ_i SP(m_i)
Example: We wish to align the following three DNA sequences:
S1 = TGCG
S2 = AGCTG
S3 = AGCG
We wish to use the SP method to score the following alignments of these three sequences:
Alignment #1    Alignment #2
T-GC-G          TGC-G
-AGCTG          AGCTG
-AGC-G          AGC-G

Sum-of-pairs (SP)...
We will use the following simplified DNA substitution matrix:
s(x,y) = 1 when x = y [match]
s(x,y) = -1 when x ≠ y [mismatch]
s(x,-) = -2 [gap]
s(-,y) = -2 [gap]
s(-,-) = 0, to prevent double counting of gaps
We will construct the following matrices M for each alignment:

Sum-of-pairs (SP)...
The SP score for each alignment is calculated by summing the individual scores for each column in the matrix. Using the simplified substitution matrix, the sum-of-pairs method gives Alignment #1 a score of -2 and Alignment #2 a score of 4, and so ranks the second alignment as the higher-scoring alignment.
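The SP calculation for the worked example can be reproduced with a short sketch. The scoring values are the simplified substitution matrix from the slides; the function names are assumptions chosen for this illustration.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Simplified DNA scoring: s(-,-) = 0, s(x,-) = s(-,y) = gap."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def column_score(column):
    """Sum of scores over all pairs of symbols in one alignment column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def sp_score(alignment):
    """Sum-of-pairs score: sum of the column scores of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(column_score(column) for column in zip(*alignment))

if __name__ == "__main__":
    alignment_1 = ["T-GC-G", "-AGCTG", "-AGC-G"]
    alignment_2 = ["TGC-G", "AGCTG", "AGC-G"]
    print("Alignment #1:", sp_score(alignment_1))
    print("Alignment #2:", sp_score(alignment_2))
```

Alignment #1 scores -2 (columns: -4, -3, 3, 3, -4, 3) and Alignment #2 scores 4 (columns: -1, 3, 3, -4, 3), so the second alignment is preferred.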

Consensus alignment
Consider a group of sequences. First, all are compared to each other, pairwise, using normal dynamic programming. This establishes an order for the set, from most to least similar. Subgroups are clustered together similarly. Then take the two most similar sequences and align them using normal dynamic programming. Now create a consensus of the two and align that consensus to the third sequence using standard dynamic programming. Now create a consensus of the first three sequences and align that to the fourth most similar. This process continues until it has worked its way through all sequences and/or sets of clusters.
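The "create a consensus" step can be illustrated with a simple majority rule over alignment columns. This is only a sketch under stated assumptions: the function name is hypothetical, gaps are counted like any other symbol, and ties go to the symbol that appears first in the column. Real programs typically keep a richer profile rather than a single consensus string.

```python
from collections import Counter

def consensus(aligned):
    """Majority-rule consensus of equally long aligned sequences.

    Gaps ('-') compete like any other symbol; ties are resolved in
    favour of the symbol seen first in the column.
    """
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        result.append(symbol)
    return "".join(result)

if __name__ == "__main__":
    print(consensus(["TGC-G", "AGCTG", "AGC-G"]))
```

The resulting string would then be aligned to the next most similar sequence, and the consensus recomputed after each step.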

Tree alignment

Progressive alignment
It is a heuristic method!
Up until about 1987, multiple alignments would typically be constructed manually, although a few computer methods did exist. Around that time, algorithms based on the idea of progressive alignment appeared. In this approach, a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then the next most similar one to that pair, and so on.
- The rule “once a gap, always a gap” was implemented, on the grounds that the positions and lengths of gaps introduced between more similar pairs of sequences should not be affected by more distantly related ones.

Progressive alignment: CLUSTALW
The three basic steps in the CLUSTAL W approach are shared by all progressive alignment algorithms:
A. Calculate a matrix of pairwise distances based on pairwise alignments between the sequences
B. Use the result of A to build a guide tree, which is an inferred phylogeny for the sequences
C. Use the tree from B to guide the progressive alignment of the sequences

Progressive alignment: CLUSTALW
- The basic idea is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order of the guide tree. We proceed from the tips of the rooted tree towards the root.
- At each stage a full dynamic programming algorithm is used, with a residue scoring matrix (e.g., a PAM or a BLOSUM matrix) and gap opening and extension penalties. Each step consists of aligning two existing alignments.
- Scores at a position are averages of all pairwise scores for residues in the two sets of sequences, using matrices with only positive values.
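The averaging of pairwise scores between a column of one alignment and a column of another can be sketched as below. The function names and the toy scoring function (1 for identical residues, 0 otherwise, including gaps) are assumptions for illustration; an actual run would use a rescaled PAM or BLOSUM matrix, and CLUSTAL W additionally applies sequence weights that are omitted here.

```python
def toy_score(x, y):
    """Hypothetical non-negative scoring: 1 for identical residues, else 0."""
    if x == "-" or y == "-":
        return 0
    return 1 if x == y else 0

def average_column_score(column_a, column_b, score=toy_score):
    """Average of all pairwise scores between two alignment columns.

    column_a holds the residues at one position of the first alignment,
    column_b the residues at one position of the second alignment.
    """
    total = sum(score(x, y) for x in column_a for y in column_b)
    return total / (len(column_a) * len(column_b))

if __name__ == "__main__":
    # One column from a two-sequence alignment against one column
    # from another two-sequence alignment.
    print(average_column_score("AA", "AT"))
```

This column-to-column score takes the place of the single residue-to-residue score in the dynamic programming recurrence when two existing alignments are aligned to each other.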

Pairwise progressive dynamic programming: liabilities
(1) Dependence on the initial pairwise sequence alignments and on the order of alignment. Ordering the sequences from most similar to least similar usually makes biological sense and works very well.
(2) Dependence on substitution matrices and gap penalties.

Common usage of MSA
- Detecting similarities between sequences (closely or distantly related)
- Detecting conserved regions/motifs in the sequences
- Detecting structural homologies: patterns of hydrophobicity/hydrophilicity, gaps, etc.
- Thus assisting the improved prediction of secondary and tertiary structures, loops and variable regions
- Predicting features of aligned sequences, such as conserved positions, which may have structural or functional importance
- Making patterns or profiles that can be further used to predict new sequences falling in a given family
- Computing a consensus sequence
- Inferring evolutionary trees or linkage (phylogenetic analysis, etc.)
- Deriving profiles or hidden Markov models (HMMs) that can be used to remove distant sequences (outliers) from protein families

Applicability of MSA
- Very useful in the development of PCR primers and hybridization probes;
- Great for producing annotated, publication-quality graphics and illustrations;
- Invaluable in structure/function studies through homology inference;
- Recognizable structural conservation between true homologues extends way beyond statistically significant sequence similarity.

Applicability of MSA (contd.)
- Essential for building “profiles” for remote homology similarity searching; and
- Required for molecular evolutionary phylogenetic inference programs.

- For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
- Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.
“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science” (Albert Einstein)