Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.

Presentation transcript:

Sequence Alignment Techniques

In this presentation:
Part 1 – Searching for Sequence Similarity
Part 2 – Multiple Sequence Alignment

Part 1 – Searching for Sequence Similarity

Sequence similarity searches
Sequence similarity searches of databases enable us to extract sequences that are similar to a query sequence. Information about these extracted sequences can be used to predict the structure or function of the query sequence. Prediction using similarity is a powerful and ubiquitous idea in bioinformatics. The underlying reason for this is molecular evolution.

Sequence alignment
Any pair of DNA sequences will show some degree of similarity. Sequence alignment is the first step in quantifying this similarity in order to distinguish between chance similarity and real biological relationships. Alignments show the differences between sequences as changes (mutations), insertions or deletions (indels or gaps), and can be interpreted in evolutionary terms.

Alignment algorithms
Dynamic programming algorithms can calculate the best alignment of two sequences. Well-known variants are:
– the Smith-Waterman algorithm (local alignments)
– the Needleman-Wunsch algorithm (global alignments)
Local alignments are useful when sequences are not related over their full lengths, e.g., proteins sharing only certain domains or DNA sequences related only in exons.
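The two dynamic programming variants named above differ only in initialization and in whether scores are allowed to drop below zero. The following is a minimal sketch that returns the best score only (no traceback). The function name `align_score` and the match, mismatch and gap values are illustrative assumptions, not values from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a linear gap penalty.

    local=False follows the Needleman-Wunsch (global) recurrence,
    local=True follows the Smith-Waterman (local) recurrence.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local alignment: a negative prefix is discarded
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

With `local=True`, two sequences that share only one region (for example a single domain) still obtain a high score from that region alone, which is the situation described on the slide.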

Alignment scores and gap penalties
A simple alignment score measures the number or proportion of identically matching residues. Gap penalties are subtracted from such scores to ensure that alignment algorithms produce biologically sensible alignments without many gaps. Gap penalties may be constant (independent of the length of the gap), proportional (proportional to the length of the gap) or affine (containing gap opening and gap extension contributions). Gap penalties can be varied according to the desired application.
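The three gap penalty schemes can be written as one small function. This is a sketch under stated assumptions: the name `gap_penalty`, the default values, and the affine convention used here (opening cost for the first gap position, extension cost for each further position) are choices made for illustration; other conventions charge the extension cost for every position.

```python
def gap_penalty(length, scheme="affine", opening=10.0, extension=1.0):
    """Penalty for a single gap of the given length (illustrative values)."""
    if length <= 0:
        return 0.0
    if scheme == "constant":
        # Independent of gap length
        return opening
    if scheme == "proportional":
        # Grows linearly with gap length
        return extension * length
    if scheme == "affine":
        # Opening cost once, extension cost for each additional position
        return opening + extension * (length - 1)
    raise ValueError(f"unknown scheme: {scheme}")
```

For a gap of length 3 with these defaults, the constant scheme gives 10.0, the proportional scheme 3.0 and the affine scheme 12.0.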

Similarity and homology
Similarity may exist between any sequences. Sequences are homologous only if they have evolved from a common ancestor. Homologous sequences often have similar biological functions (orthologs), but the mechanism of gene duplication allows homologous sequences to evolve different functions (paralogs).

Similarity search in databases
Sequences similar to a query can be found in a database by aligning it to each database sequence in turn and returning the highest scoring (most similar) sequences. This can be achieved by dynamic programming algorithms, but in practice faster approximate methods are often used.

Statistical scores
The p value of a similarity score is the probability of obtaining a score at least as high in a chance similarity between two unrelated sequences of similar composition. Low p values indicate matches that are likely to have real biological significance. The related E value is the expected frequency of chance occurrences scoring at least as high as the identified similarity. A low p value for a similarity between two sequences can translate into a high E value for a search of a large database.
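The last point on this slide can be made concrete with a little arithmetic. The sketch below assumes that every database comparison has the same pairwise p value, so the expected number of chance hits is the p value times the number of comparisons, and it assumes chance hits are approximately Poisson distributed when converting an E value back into a probability. The function names are hypothetical.

```python
import math

def expected_chance_hits(p_value, n_comparisons):
    """E value: expected number of chance hits over n_comparisons,
    assuming each comparison has the same pairwise p value."""
    return p_value * n_comparisons

def prob_at_least_one_chance_hit(e_value):
    """Probability of one or more chance hits, assuming the number of
    chance hits is approximately Poisson distributed with mean e_value."""
    return -math.expm1(-e_value)  # equals 1 - exp(-e_value)
```

Under these assumptions, a pairwise p value of 0.00001 looks convincing on its own, but in a search against 1,000,000 sequences it corresponds to about 10 expected chance hits.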

Sensitivity and specificity
These measures quantify the success of a database search strategy. Sensitivity measures the proportion of real biological sequence relationships in the database that were detected as hits in the search. Specificity is the proportion of the hits corresponding to real biological relationships. Changing E and p value thresholds results in a trade-off between these complementary measures of success.
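Both measures follow directly from the set of hits and the set of true relatives. The sketch below uses the definitions exactly as given on this slide; note that "specificity" in this sense is what other texts often call precision. The function name and the example identifiers are hypothetical.

```python
def search_quality(hits, true_relatives):
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = detected true relatives / all true relatives in the database
    specificity = detected true relatives / all reported hits
    """
    hits, true_relatives = set(hits), set(true_relatives)
    true_positives = hits & true_relatives
    sensitivity = len(true_positives) / len(true_relatives) if true_relatives else 0.0
    specificity = len(true_positives) / len(hits) if hits else 0.0
    return sensitivity, specificity


# Hypothetical search: 4 hits reported, 5 true relatives in the database
sens, spec = search_quality({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e"})
```

Loosening the E value threshold usually adds hits, which can raise sensitivity while lowering specificity; this is the trade-off the slide refers to.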

Maximizing amino acid identities
Protein sequences can be aligned to maximize amino acid identities, but this will not reveal distant evolutionary relationships.

Evolution
Protein-coding sequences evolve slowly compared with most other parts of the genome, because of the need to maintain protein structure and function. An exception to this is the fast evolution that might occur in the redundant copy of a recently duplicated gene.

Allowed changes
Changes in protein sequences during evolution tend to involve substitutions between amino acids with similar properties, because these tend to maintain the structural stability of the protein.

Substitution score matrices
These matrices give scores for all possible amino acid substitutions during evolution. Higher scores indicate more likely substitutions. Example matrices are BLOSUM62 and PAM250. The name PAM is derived from "accepted point mutation"; for PAM250, the evolutionary distance of the matrix is 250 amino acid changes per 100 residues. Dynamic programming algorithms for sequence alignment can operate using scores from these matrices.
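The statement that higher scores indicate more likely substitutions can be written as a formula. The slides do not give one; the following is a sketch of the log-odds form that matrices of this kind are generally built on. The symbols are assumptions introduced here: q_ab is the frequency with which residues a and b are observed aligned in related sequences, p_a and p_b are their background frequencies, λ is a scaling constant, and g_k is the penalty for the k-th gap.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b},
\qquad
S = \sum_{i \in \text{aligned columns}} s(a_i, b_i) \;-\; \sum_{k \in \text{gaps}} g_k
```

A pair that occurs in related sequences more often than expected by chance (q_ab > p_a p_b) receives a positive score, a pair that occurs less often receives a negative one, and the total alignment score S is the sum over aligned columns minus the gap penalties.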

Significance of score matrices
Substitution score matrices allow detection of distant evolutionary relationships between protein sequences. It is possible to detect much more distant relationships by comparing protein sequences than by comparing nucleic acid sequences.

Part of the sequence of human Huntington’s disease protein (Huntingtin) showing low complexity regions (underlined) associated with compositional bias towards glutamine (Q) and proline (P):
MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA QPLLPQPQPP PPPPPPPPGP AVAEEPLHRP KKELSATKKD RVNHCLTICE NIVAQSVRNS PEFQKLLGIA MELFLLCSDD AESDVRMVAD ECLNKVIKAL MSDNLPRLQL ELYKEIKKNG APRSLRAALW RFAELAHLVR PQKCRPYLVN LLPCLTRTSK RPEESVQETL AAAVPKIMAS

A dot plot of the human pleckstrin sequence against itself, produced with Erik Sonnhammer’s ‘dotter’ program. The sequence is plotted from N- to C-terminus along the horizontal and vertical axes between residues 1 and approximately [value missing].
Axes: PLEK_HUMAN (horizontal) vs. PLEK_HUMAN (vertical)

The PAM250 matrix and alignment of sequences
Total alignment scores for two matrices should not be compared, but note that the PAM matrix is able to detect a much better alignment in the second halves of these sequences than the identity matrix. With the introduction of a single gap, sensible alignments of hydrophobic amino acids, and alignment of K with R (both basic), D with E (both acidic) and F with Y (both aromatic) can be seen.
[Figure: lower-triangular PAM250 matrix with rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W; most matrix entries and the per-position scores are not recoverable from the transcript.]
Sequence 1: MIIVKP-VVLKGDFG
Sequence 2: MILLKPAIIIRAEY-
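The point of this figure can be checked by classifying each column of the alignment. The sketch below does not use PAM250 values; it only labels columns as identical, conservative, other or gap, using the residue groups named in the caption. Two things are assumptions: the column layout of the two sequences (a gap after the sixth residue of sequence 1 and at the end of sequence 2), and the membership of the hydrophobic group (I, L, V, M), which the caption does not spell out.

```python
from collections import Counter

# Assumed column layout of the alignment in the figure
SEQ1 = "MIIVKP-VVLKGDFG"
SEQ2 = "MILLKPAIIIRAEY-"

# Groups from the caption; hydrophobic membership is an assumption
GROUPS = [set("ILVM"), set("KR"), set("DE"), set("FY")]

def classify(a, b):
    """Label one alignment column."""
    if a == "-" or b == "-":
        return "gap"
    if a == b:
        return "identical"
    if any(a in group and b in group for group in GROUPS):
        return "conservative"
    return "other"

def column_classes(s1, s2):
    """Classify every column of a pairwise alignment."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    return [classify(a, b) for a, b in zip(s1, s2)]

classes = column_classes(SEQ1, SEQ2)
summary = Counter(classes)
```

Under these assumptions, every column after the internal gap is either conservative, other or a gap, with no identities at all, which is why an identity matrix finds nothing to score in the second half while a substitution matrix does.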

Figure 3. Display of the DNA unit. DNA can be described at several levels of detail. At the most detailed level, DNA can be characterized by the 5' and 3' termini at both external and internal positions; at the most abstract level, the substrate DNA can be one of 16 common structures. The goal is to provide methods for specifying the properties of DNA in as many ways as is natural for a scientist.

Figure 7. An initial experimental environment. The temperature is 37 degrees Celsius and the pH value is 7.4. No DNA polymerase I activity is possible

Part 2 – Multiple Sequence Alignment

Non-specific sequence similarity
Certain types of sequence similarity are less likely than others to indicate an evolutionary relationship. Examples are similarity between regions of low compositional complexity, short period repeats, and protein sequences coding for generic structures like coiled coils.

Similarity search filters
Regions of the non-specific sequence types can degrade the results of similarity searches and are often filtered out of query sequences prior to searching. The programs SEG and DUST can be used to detect and filter low complexity sequences, XNU can filter short period repeats, and COILS can detect the presence of potential coiled coil structures.
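The idea of filtering low complexity regions can be illustrated with a sliding-window entropy filter. This is a simplified stand-in for the purpose of illustration and is not the algorithm used by SEG or DUST; the window size, the entropy threshold and the function names are all assumptions.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask="X"):
    """Replace every window whose entropy is below threshold with mask characters.

    Entropy is always measured on the original sequence, so earlier
    masking does not influence later windows.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask * window
    return "".join(masked)


# Hypothetical query with a poly-glutamine run between two diverse flanks
query = "MKVLAETGHW" + "Q" * 15 + "DERNTPSCYF"
filtered = mask_low_complexity(query)
```

In this sketch the mask extends a few residues into the flanks, because windows that are mostly glutamine still fall below the threshold. A masked query of this kind can then be searched without the repeat region attracting unrelated glutamine-rich hits.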

Database types for searches
Database and query sequences can be protein or nucleic acid sequences, and different query strategies are required for different types and combinations. In general, searches are more sensitive using strategies where protein-coding nucleic acid database and/or query sequences are first translated to protein sequences.
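Translating a nucleic acid sequence before searching usually means translating it in all six reading frames, since the coding strand and frame are not known in advance. The following is a minimal sketch using the standard genetic code; the function names are hypothetical, and codons containing anything other than A, C, G or T are translated as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Translations for frames +1..+3 and -1..-3, keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein sequences can then be compared with a protein database or query, which is the strategy the slide describes as more sensitive than comparing nucleic acid sequences directly.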

Iterative database searches
PSI-BLAST is an iterative search method that improves on the detection rate of BLAST and FASTA. Each iteration discovers intermediate sequences that are used in a sequence profile to discover more distant relatives of the query sequence in subsequent iterations. The main problems with PSI-BLAST are that unrelated sequences can pollute the iterative search, and that the domain structure of proteins can cause difficulties. PSI-BLAST often detects up to twice as many evolutionary relationships as BLAST.

Multiple sequence alignment
Multiple alignment illustrates relationships between two or more sequences. When the sequences involved are diverse, the conserved residues are often key residues associated with maintenance of structural stability or biological function. Multiple alignments can reveal many clues about protein structure and function.

Multiple alignment
Part of an (artificial) multiple alignment of a family consisting of 7 sequences, which subdivide into 3 subfamilies. The bars on the left indicate subfamilies; the dotted boxes highlight conservation patterns.

Progressive sequence alignment
Most commonly used software uses the method of progressive alignment. This is a fast method, but frozen-in errors mean that it does not always work perfectly. Biological knowledge can provide information about likely alignments, and where automatically produced alignments turn out to be imperfect, software for manual alignment editing is required.

Protein families
Assigning sequences to protein families is a very valuable way of predicting protein function. Many ways have been developed to represent protein family information (consensus sequences, conserved residues, residue patterns, sequence profiles, etc.), and these have been stored in secondary protein family databases.

Consensus sequences
These condense the information from a multiple alignment into a single sequence. Their main shortcoming is the inability to represent any probabilistic information apart from the most common residue at a particular position. Derivation of a consensus sequence illustrates that any protein family representation is subject to bias if the set of sequences from which it was derived is biased.
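Deriving a consensus sequence is a per-column majority vote, which also makes the bias problem easy to see. The sketch below is an illustration with hypothetical sequences; gap characters are counted like any other symbol, and ties are resolved in favour of the residue seen first in the column.

```python
from collections import Counter

def consensus(alignment):
    """Most common symbol in each column of an equal-length alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    result = []
    for column in zip(*alignment):
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


# Hypothetical family: three copies from one subfamily, one from another
biased_family = ["MKVLA", "MKVLA", "MKVLA", "MRIIA"]
biased_consensus = consensus(biased_family)
```

Here the consensus simply reproduces the over-represented subfamily, and the information that R and I occur at positions 2 to 4 in the other subfamily is lost entirely.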

PRINTS and BLOCKS
These represent protein families as multiply aligned ungapped segments (motifs) derived from the most highly conserved regions of sequences. By representing more of the sequence, they have the potential to be more sensitive than short PROSITE patterns. Because a sequence can match only a subset of the motifs associated with a particular family, they can detect splice variants and sequence fragments and can represent subfamilies. WWW-based search engines for the databases are available.

Protein domain families
Many proteins are built up from domains in a modular architecture. The study of protein families is best pursued as a study of protein domain families. ProDom is a database of protein domain sequences created by automatic means from the protein sequence databases.

Resources for domain families
Pfam and SMART can be used for protein domain family analysis. The integrated resource InterPro unites PROSITE, PRINTS, Pfam, ProDom and SMART.

Visualization of similarities
Dot plots are a very good way to visualize sequence similarity and find repeats.
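A dot plot is simple enough to sketch in a few lines. The version below is a text-only illustration and not how the dotter program works internally: it marks a position whenever the windows starting at that position in the two sequences are identical. The function name and the `window` parameter are assumptions.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Text dot plot: seq_a along the horizontal axis, seq_b along the vertical.

    A '*' marks positions where the windows of the given length starting
    in seq_a and seq_b are identical; '.' marks all other positions.
    """
    rows = []
    for i in range(len(seq_b) - window + 1):
        row = ""
        for j in range(len(seq_a) - window + 1):
            row += "*" if seq_b[i:i + window] == seq_a[j:j + window] else "."
        rows.append(row)
    return rows


# Hypothetical repeat-containing sequence plotted against itself
plot = dot_plot("ABAB", "ABAB")
```

When a sequence is plotted against itself, the main diagonal is always filled, and additional parallel diagonals indicate internal repeats. Increasing `window` suppresses isolated chance matches.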