Introduction to sequence alignment Mike Hallett (David Walsh)


WEEK 2: Introduction to sequence alignment. Mike Hallett (David Walsh). BIOL510: Bioinformatics. Alignment of dog genome to the human genome. Like to start with some general comments before we begin.

Outline: pairwise alignment. The importance of pairwise alignment. The important steps in comparing two sequences (Sections 4.1-4.5). Performing pairwise alignments using BLAST (Sections 4.6-4.7): will cover in lab class on Tuesday.

4.1 Principles of sequence alignment. Sequences (DNA and protein) vary as a result of evolutionary processes acting at the molecular level: point mutations (nucleotides or amino acids); insertions and deletions (length variation); fusion of two genes into a single gene. Evolution in gene sequences can effectively mask any underlying sequence similarity. (p. 73)

Similarity and Homology. Similarity is a quantitative measure of how related two sequences are, usually based on a pairwise alignment of the two sequences. By aligning sequences we can count the number of residues that line up and express the result as percent identity. High degrees of sequence similarity may imply a common evolutionary history or a possible commonality in biological function. Homology refers specifically to similarity in sequence or structure due to descent from a common ancestor. The concept of homology implies an evolutionary relationship.

Definition: homology. Homology: similarity attributed to descent from a common ancestor. Two kinds are shown: morphological homology and molecular homology. Molecular homology example:
  fly        GAKKVIISAP SAD.APM..F
  human      GAKRVIISAP SAD.APM..F
  plant      GAKKVIISAP SAD.APM..F
  bacterium  GAKKVVMTGP SKDNTPM..F
  yeast      GAKKVVITAP SS.TAPM..F
  archaeon   GADKVLISAP PKGDEPVKQL

Definitions: identity, similarity, conservation. Identity: the extent to which two (nucleotide or amino acid) sequences are invariant. Similarity: the extent to which nucleotide or protein sequences are related; it is based upon identity plus conservation. Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Pairwise sequence alignment is the most fundamental operation of bioinformatics. It is fundamental to characterizing genome sequences: to identify genes within a genome; to identify related proteins and predict protein structure and function; to construct phylogenetic trees and compare evolutionary relationships. (p. 72)

Definition: pairwise alignment. Pairwise alignment: the process of lining up two sequences to achieve maximal levels of similarity.
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

4.5 Types of alignment. Global alignments: an alignment that covers the full length of a gene or protein sequence; used for aligning closely related sequences that are similar over their whole length. Local alignments: an alignment that only covers a certain region (e.g. a domain) of a gene or protein sequence; used for aligning proteins that are only partly related (e.g. multidomain proteins) and for identifying conserved regions in very divergent sequences.

General approach to pairwise alignment Choose two sequences Select an alignment algorithm that generates a score Score reflects degree of similarity Allow gaps (insertions, deletions) Estimate probability that the alignment occurred by chance Many possible alignments

4.1 Principles of sequence alignment. When sequences are derived from a common ancestor, we want to align bases/amino acids derived from the same ancestral position. Examples: b-corticotropin (sheep) and corticotropin A (pig): ala gly glu asp asp glu asp gly ala glu asp glu; oxytocin (CYIQNCPLG) and vasopressin (CYFQNCPRG) (neuromodulators, peptide hormones).

4.1 Principles of sequence alignment. When sequences are derived from a common ancestor, we want to align bases/amino acids derived from the same ancestral position:
  T H I S S E Q E N C E
  T H A T S E Q E N C E
Two amino acid point mutations (two mismatches, nine identical matches). (p. 73)

4.1 Principles of sequence alignment. Often the sequences we wish to align differ in length, obscuring the similarity that exists:
  T H I S I S A S E Q E N C E
  T H A T S E Q E N C E
How many amino acid point mutations? 8 point mutations? (p. 73)

4.1 Principles of sequence alignment. Insertion/deletion mutations result in gaps in an alignment:
  T H I S I S A - S E Q E N C E
  T H - - - - A T S E Q E N C E
How many amino acid point mutations? 0 point mutations, but two indel mutations! The best pairwise alignment is not obvious, hence we have algorithms for testing different alignments quantitatively. (p. 74)

Matches do not have to be identical. Certain amino acids resemble each other in their physical and chemical characteristics, and can substitute functionally for each other:
  T H I S I S A S E Q E N C E
  T H A T - - - S E Q E N C E
Substituted pairs: isoleucine - alanine, serine - threonine.

[Figure: amino acids grouped as charged, polar uncharged, and hydrophobic]

Pairwise alignment: protein versus DNA sequences. Synonymous mutations alter DNA but not amino acid sequences. Nonsynonymous mutations alter the amino acid sequence. Protein sequences offer a longer “look-back” time. DNA sequences can be translated into protein, and then used in pairwise alignments.

Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
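The point about third-position changes can be checked with a small sketch. The table below is only an excerpt of the standard genetic code (a few DNA codons), and `classify_substitution` is a hypothetical helper name used for illustration.

```python
# Excerpt of the standard genetic code (DNA codons). Only a few codons are
# listed for illustration; this is not the full 64-codon table.
CODON_TABLE = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAT": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
}

def classify_substitution(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'nonsynonymous'. Hypothetical helper for illustration."""
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    return "synonymous" if aa_before == aa_after else "nonsynonymous"

# Third-position change that keeps alanine.
print(classify_substitution("GCT", "GCA"))
# Third-position change that turns aspartate into glutamate.
print(classify_substitution("GAT", "GAA"))
```

The second call shows why the slide says "often" rather than "always": some third-position changes do alter the amino acid.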

DNA alignments. Many times, DNA alignments are appropriate (or necessary): to identify promoters and regulatory elements; to identify gene sequences; to study noncoding regions of DNA; to study DNA polymorphisms (SNPs).
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

4.2 Scoring alignments. How do we objectively determine which is the best possible alignment for a pair of sequences? Generate all possible alignments (not possible: 10^75 possibilities for an alignment of 100 positions!). Calculate a score for each alignment. Optimal alignment: the alignment with the best score. Suboptimal alignments: alignments with slightly poorer scores.

4.2 Scoring alignments: percent identity. The simplest way to quantify similarity is to count the number of base/amino acid matches and divide by the length of the alignment:
  T H I S I S A - S E Q E N C E
  T H - - - - A T S E Q E N C E
(10 matches / 15 positions) * 100 = 66.7% identity
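The percent identity calculation can be written as a short sketch. It assumes the two rows are already aligned (same length, with "-" marking gaps); `percent_identity` is a hypothetical name.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity of two aligned rows: identical, non-gap columns
    divided by the total number of alignment columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)

# The gapped example from the slide: 10 matches over 15 columns.
print(round(percent_identity("THISISA-SEQENCE", "TH----ATSEQENCE"), 1))
```

Dividing by the full alignment length (gap columns included) is one common choice; dividing by the shorter sequence length is another and gives a different number.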

4.2 Scoring alignments: dot plots Dot-plots are a simple way to visualize pairwise sequence similarity Fig 4.1
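A minimal text-mode dot plot, as a sketch: one sequence runs down the rows, the other across the columns, and a mark is placed wherever the residues are identical. The function name `dot_plot` is an assumption, and no window filtering is applied.

```python
def dot_plot(seq_a, seq_b):
    """Return one string per residue of seq_a; '*' marks identical
    residue pairs and '.' marks everything else."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]

# Runs of '*' along a diagonal correspond to stretches of identical residues.
for row in dot_plot("THISSEQENCE", "THATSEQENCE"):
    print(row)
```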

Matches do not have to be identical. Do all amino acid substitutions occur with the same probability? No!
  T H I S I S A S E Q E N C E
  T H A T - - - S E Q E N C E
serine - threonine: highly conservative; isoleucine - alanine: poorly conservative.

Substitution Matrix. A substitution matrix contains the likelihood that a particular pair of amino acids will occupy the same position due to descent from a common ancestor (i.e. homology). For proteins it is a 20 x 20 substitution matrix.

The BLOSUM62 substitution matrix (Fig 4.4): +1 for Ser to Thr, +5 for Arg to Arg, -2 for Arg to Asp.

Scoring a pairwise alignment using the BLOSUM62 matrix:
  T  H  I  S  S  E  Q  E  N  C  E
  T  H  A  T  S  E  Q  E  N  C  E
  5  8 -1  1  4  5  5  5  6  9  5
The overall alignment score (S) = 52
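The column-by-column sum can be reproduced with a short sketch. The dictionary holds only the BLOSUM62 entries needed for this example (values as shown on the slide), not the full 20 x 20 matrix; the names `BLOSUM62_EXCERPT`, `pair_score` and `ungapped_score` are assumptions.

```python
# Excerpt of BLOSUM62: only the residue pairs used in this example.
# Each unordered pair is stored once, with residues in alphabetical order.
BLOSUM62_EXCERPT = {
    ("T", "T"): 5, ("H", "H"): 8, ("A", "I"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("N", "N"): 6,
    ("C", "C"): 9,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]

def ungapped_score(row_a, row_b):
    """Sum the substitution scores over all columns of an ungapped alignment."""
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))

print(ungapped_score("THISSEQENCE", "THATSEQENCE"))  # 52
```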

Generation of substitution scoring matrices. Based on the observed amino acid substitution frequencies in alignments of homologous protein sequences (real data are used to model the evolutionary processes). PAM substitution matrices are calculated from global protein alignments; BLOSUM substitution matrices are calculated from local protein alignments.

PAM matrices: point-accepted mutations. Dayhoff (1960s) calculated substitution probabilities from alignments of highly similar protein families. All the PAM data come from closely related proteins (>85% amino acid identity). PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.
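The extrapolation step can be sketched as repeated matrix multiplication: the PAM250 mutation probability matrix is PAM1 multiplied by itself 250 times. The two-residue matrix below is a toy stand-in (real PAM1 is 20 x 20 with residue-specific rates), and `mat_mul` / `mat_power` are hypothetical helpers.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(matrix, exponent):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result

# Toy two-residue "PAM1": each residue has a 1% chance of changing per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)

# Probability that a residue is unchanged / changed after 250 steps.
print([round(p, 3) for p in toy_pam250[0]])
```

Because changes can hit the same position repeatedly or revert, 250 steps do not mean every position differs, which is why PAM250 still carries usable signal. The scoring matrices used in practice are log-odds values derived from such probability matrices.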

A PAM250 scoring matrix assigns scores that are forgiving of mismatches (such as +17 for W to W or -5 for W to T)…

…compared to a scoring matrix such as PAM10, which is strict and does not tolerate mismatches (such as +13 for W to W or -19 for W to T).

BLOSUM Matrices. BLOSUM matrices are based on local alignments. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM stands for blocks substitution matrix. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

BLOSUM Matrices. BLOSUM62 is a matrix calculated from comparisons of sequences with no more than 62% identity. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Selecting an appropriate scoring matrix. [Figure: scale from more conserved (rat versus mouse globin) to less conserved (rat versus bacterial globin)]

4.4 Inserting Gaps. Homologous sequences often differ in length as a result of insertions and deletions (indels). Aligning indels involves inserting gaps into the alignment. Gap penalty: each time a gap is introduced, a gap penalty is subtracted from the score. A gap opening penalty is usually high; a gap extension penalty is usually low.

Scoring a pairwise alignment using the BLOSUM62 matrix and gap penalty:
  T  H  I  S  I  S  A  S  E  Q  E  N  C  E
  T  H  A  T  -  -  -  S  E  Q  E  N  C  E
  5  8 -1  1           4  5  5  5  6  9  5
Gap opening penalty = -11. Gap extension penalty = -1. The overall score (S) = 52 + (-11 - 2) = 39
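A small sketch of the gap arithmetic, assuming the convention implied by the slide's "(-11 - 2)": the first gap position costs the opening penalty and each additional position costs the extension penalty. Some tools instead charge opening plus extension for every position, which would give a different total. `affine_gap_cost` is a hypothetical name.

```python
def affine_gap_cost(gap_length, gap_open=-11, gap_extend=-1):
    """Cost of one run of gaps: the opening penalty for the first position
    plus the extension penalty for each additional position (assumed
    convention; other tools define the two penalties differently)."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend

substitution_total = 52         # sum of the BLOSUM62 column scores on the slide
gap_cost = affine_gap_cost(3)   # the three-position gap in T H A T - - - S ...
total_score = substitution_total + gap_cost
print(gap_cost, total_score)
```

Making extension much cheaper than opening favours a few long gaps over many short ones, which matches how indels tend to occur.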

4.4 Inserting Gaps. [Figure: alignment with a high gap penalty compared with alignment with a low gap penalty] (p. 86)

Next in the course... Ch 5. We have learned how to score an alignment, but how do you generate the alignment in the first place? Here are two approaches: dynamic programming algorithms and heuristic search algorithms.

Sequence alignments continued. David Walsh (david.walsh@concordia.ca). BIOL510: Bioinformatics. Alignment of dog genome to the human genome. Like to start with some general comments before we begin. Rasko et al. Nucleic Acids Res. 2004; 32(3): 977–988

Outline: sequence alignments (Ch 5) Dynamic Programming Algorithms (Ch 5.2) Global alignment: Needleman-Wunsch Local alignments: Smith-Waterman Heuristic Search Algorithms (Ch 5.3)  BLAST Alignment Score Significance (Ch 5.4) WE WILL COVER THIS ON TUESDAY DURING THE LAB

Scoring an alignment using the BLOSUM62 substitution matrix and gap penalty:
  T  H  I  S  I  S  A  S  E  Q  E  N  C  E
  T  H  A  T  -  -  -  S  E  Q  E  N  C  E
  5  8 -1  1           4  5  5  5  6  9  5
Gap opening penalty = -11. Gap extension penalty = -1. The overall score (S) = 52 + (-11 - 2) = 39

Dynamic Programming Algorithms. For any given pair of sequences, if gaps are allowed there is a large number of possible alignments. Dynamic programming algorithms can explore the full range of alignments using a variety of different constraints, by dividing the problem of alignment into many smaller parts. Needleman and Wunsch published the original program in 1970, and there have been many modifications and improvements since.

Global alignment versus local alignment. Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other. Local alignment finds optimally matching regions within two sequences (“subsequences”). Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences. Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Needleman-Wunsch: dynamic programming. N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

4.2 Scoring alignments: dot plots. Dot-plots are a simple way to visualize pairwise sequence similarity, but they are also the starting point for generating optimal alignments. (Fig 4.1)

Three steps to global alignment with the Needleman-Wunsch algorithm: [1] set up a matrix of two sequences; [2] score the matrix; [3] identify the optimal alignment(s).

Global alignment with the algorithm of Needleman and Wunsch (1970). Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn. Find the optimal subpaths, and add them up to achieve the best score. This involves: adding gaps when needed; allowing for conservative substitutions; choosing a scoring system (simple or complicated). N-W is guaranteed to find optimal alignment(s).

Four possible outcomes in aligning two sequences: [1] identity (stay along a diagonal); [2] mismatch (stay along a diagonal); [3] gap in one sequence (move vertically!); [4] gap in the other sequence (move horizontally!).
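The three steps and the diagonal/vertical/horizontal moves can be put together in a compact sketch of Needleman-Wunsch. To stay short it assumes a simple match/mismatch score and a single linear gap penalty instead of BLOSUM62 with affine gaps; the function name and default scores are illustrative choices, not values from the course.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (simplified scoring).
    Returns (score, aligned_a, aligned_b) for one optimal alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # [1] Set up the matrix: first row/column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # [2] Score the matrix: each cell takes the best of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal: (mis)match
                              score[i - 1][j] + gap,      # vertical: gap in seq_b
                              score[i][j - 1] + gap)      # horizontal: gap in seq_a

    # [3] Traceback from the bottom-right corner to recover one alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(seq_a[i - 1]); aligned_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))

print(needleman_wunsch("ACGT", "AGT"))
```

When several alignments share the best score, this traceback returns only one of them (it prefers the diagonal move); collecting all co-optimal alignments would require following every tied branch.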

The initial stage of dynamic programming (Figures 5.8 and 5.10): BLOSUM62 substitution matrix, gap extension penalty (E) = -8. Value shown in the matrix: -16.

The initial stage of dynamic programming: filling in the matrix (Figures 5.8 and 5.10). Thr aligned with Ile: score = -1. BLOSUM62 substitution matrix, gap extension penalty (E) = -8.

From filling in the matrix to the final stage of dynamic programming: traceback (Figure 5.9). Score = -4. BLOSUM62 substitution matrix, gap extension penalty (E) = -8.

From filling in the matrix to the final stage of dynamic programming: traceback (Figure 5.11). Score = 7. BLOSUM62 substitution matrix, gap extension penalty (E) = -4.

Local alignment: the Smith-Waterman (SW) algorithm. Remember: two protein sequences may not exhibit homology along their full length. SW is a modification of the Needleman-Wunsch algorithm. Instead of looking at each sequence in its entirety, the method compares segments of all possible lengths and chooses the segments that optimize the similarity measure. (pp. 88, 136)

Local alignment algorithm: optimal subsequence alignments scoring less than zero (<0) are rejected, so negative cells in the matrix are reset to zero (Figure 5.15). Score = 12. BLOSUM62 substitution matrix, gap extension penalty (E) = -8 (!).
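The "reject anything below zero" rule is the key change from global alignment, and a short sketch makes it concrete. As with the global example, this assumes simple match/mismatch scores and a linear gap penalty rather than BLOSUM62; the function name and defaults are illustrative, and only the best local score is returned (no traceback).

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (simplified
    scoring). Cells are never allowed to drop below zero."""
    best = 0
    previous = [0] * (len(seq_b) + 1)   # first row is all zeros, not gap penalties
    for a in seq_a:
        current = [0]                   # first column is zero as well
        for j, b in enumerate(seq_b, start=1):
            diag = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            cell = max(0, diag, up, left)   # zero restarts the alignment here
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

# The shared ACGT core is found despite unrelated flanking residues.
print(smith_waterman_score("TTTACGTGGG", "CCCACGTAAA"))  # 8
```

To recover the aligned segments themselves, one would keep the full matrix and trace back from the highest-scoring cell until a zero is reached.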

Outline: sequence alignments (Ch 5) Dynamic Programming Algorithms (Ch 5.2) Global alignment: Needleman-Wunsch Local alignments: Smith-Waterman Heuristic Search Algorithms (Ch 5.3)  BLAST Alignment Score Significance (Ch 5.4) Will cover this topic during the lab