Introduction to sequence alignment Mike Hallett (David Walsh)

WEEK 2
BIOL510: Bioinformatics
Alignment of dog genome to the human genome
I would like to start with some general comments before we begin.

Outline: pairwise alignment
• The importance of pairwise alignment
• The important steps in comparing two sequences (Sections )
• Performing pairwise alignments using BLAST (Section ): will be covered in the lab class on Tuesday

4.1 Principles of sequence alignment
Sequences (DNA and protein) vary as a result of evolutionary processes acting at the molecular level:
• Point mutations: nucleotides or amino acids
• Insertions and deletions (length variation)
• Fusion of two genes into a single gene
Evolution in gene sequences can effectively mask any underlying sequence similarity. (P. 73)

Similarity and Homology
Similarity is a quantitative measure of how related two sequences are:
• Usually based on pairwise alignment of two sequences
• By aligning sequences we can count the number of residues that line up and express the result as percent identity
• High degrees of sequence similarity may imply a common evolutionary history or a possible commonality in biological function
Homology refers specifically to similarity in sequence or structure due to descent from a common ancestor:
• The concept of homology implies an evolutionary relationship

Definition: homology
Homology: similarity attributed to descent from a common ancestor.
• Morphological homology
• Molecular homology

fly        GAKKVIISAP SAD.APM..F
human      GAKRVIISAP SAD.APM..F
plant      GAKKVIISAP SAD.APM..F
bacterium  GAKKVVMTGP SKDNTPM..F
yeast      GAKKVVITAP SS.TAPM..F
archaeon   GADKVLISAP PKGDEPVKQL

Definitions: identity, similarity, conservation
• Identity: the extent to which two (nucleotide or amino acid) sequences are invariant.
• Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
• Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is fundamental to characterizing genome sequences
• To identify genes within a genome
• To identify related proteins, predict protein structure and function
• To construct phylogenetic trees and compare evolutionary relationships
(P. 72)

Definition: pairwise alignment
Pairwise alignment: the process of lining up two sequences to achieve maximal levels of similarity.

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

4.5 Types of alignment
Global alignments: an alignment that covers the full length of a gene or protein sequence
→ for aligning closely related sequences that are similar over their whole length
Local alignments: an alignment that covers only a certain region (e.g. a domain) of a gene or protein sequence
→ for aligning proteins that are only partly related (e.g. multidomain proteins)
→ for identifying conserved regions in very divergent sequences

General approach to pairwise alignment
• Choose two sequences
• Select an alignment algorithm that generates a score
• Score reflects degree of similarity
• Allow gaps (insertions, deletions)
• Estimate probability that the alignment occurred by chance
• Many possible alignments

When sequences are derived from a common ancestor, we want to align bases/amino acids derived from the same ancestral position

β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu

Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG

Figure labels: (Neuromodulators), (Peptide hormones)

When sequences are derived from a common ancestor, we want to align bases/amino acids derived from the same ancestral position

T H I S S E Q E N C E
T H A T S E Q E N C E

Two amino acid point mutations (figure legend: identical matches, mismatches). (P. 73)

Often sequences we wish to align will differ in length, obscuring the similarity that exists:

T H I S I S A S E Q E N C E
T H A T S E Q E N C E

How many amino acid point mutations?
→ 8 point mutations?
(Figure legend: identical matches, mismatches.) (P. 73)

Insertion/deletion mutations result in gaps in an alignment

T H I S I S A - S E Q E N C E
T H - - - - A T S E Q E N C E

How many amino acid point mutations?
→ 0 point mutations?
→ but two indel mutations!
(Figure legend: identical matches, mismatches.)
The best pairwise alignment is not obvious, hence we have algorithms for testing different alignments quantitatively. (P. 74)

Matches do not have to be identical
Certain amino acids resemble each other in their physical and chemical characteristics, and can substitute functionally for each other.

T H I S I S A S E Q E N C E
T H A T S E Q E N C E

• isoleucine - alanine
• serine - threonine

[Figure: amino acids grouped as charged, polar uncharged, and hydrophobic]

Pairwise alignment: protein versus DNA sequences
• Synonymous mutations alter DNA but not amino acid sequences
• Nonsynonymous mutations alter amino acid sequence
• Protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments

Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
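
Codon degeneracy can be checked directly against the standard genetic code. The sketch below is illustrative only: the names CODON_TABLE, translate and third_position_synonyms are hypothetical helpers introduced here, and the table assumes the standard (nuclear) genetic code with codons ordered T, C, A, G.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids listed in codon order TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into a protein string ('*' marks a stop codon)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def third_position_synonyms(codon):
    """Codons that differ from `codon` only at the third base yet encode the same amino acid."""
    codon = codon.upper()
    return [
        codon[:2] + base
        for base in BASES
        if base != codon[2] and CODON_TABLE[codon[:2] + base] == CODON_TABLE[codon]
    ]


# Alanine: any third-position change is synonymous.
print(third_position_synonyms("GCT"))
# Methionine: every third-position change is nonsynonymous.
print(third_position_synonyms("ATG"))
```

Translating both DNA sequences with translate before aligning them is one way to carry out the "translate, then align as protein" strategy mentioned on the previous slide.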

DNA alignments
Many times, DNA alignments are appropriate (or necessary):
-- to identify promoters and regulatory elements
-- to identify gene sequences
-- to study noncoding regions of DNA
-- to study DNA polymorphisms (SNPs)

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

4.2 Scoring alignments
How do we objectively determine which is the best possible alignment for a pair of sequences?
• Generate all possible alignments (not feasible: about 10^75 possibilities for an alignment of 100 positions!)
• Calculate a score for each alignment
Optimal alignment: the alignment with the best score
Suboptimal alignments: alignments with slightly poorer scores
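
To get a feel for why exhaustive enumeration is hopeless, the number of possible alignments can be counted with a short recurrence. This is a sketch under one common counting convention (each column is a residue pair, a gap in one sequence, or a gap in the other, which gives the Delannoy numbers); the function name count_alignments is a hypothetical helper, and other conventions give somewhat different totals.

```python
def count_alignments(m, n):
    """Count the gapped alignments of two sequences of lengths m and n."""
    # counts[j] = number of alignments of the first i residues of one sequence
    # with the first j residues of the other (i advances in the outer loop).
    counts = [1] * (n + 1)      # i = 0: the only option is a run of gaps
    for _ in range(m):
        new = [1] * (n + 1)     # j = 0: the only option is a run of gaps
        for j in range(1, n + 1):
            # The last column is a residue pair, a gap in one sequence,
            # or a gap in the other sequence.
            new[j] = counts[j - 1] + counts[j] + new[j - 1]
        counts = new
    return counts[n]


for length in (1, 2, 3, 10, 100):
    total = count_alignments(length, length)
    print(length, "residues each:", len(str(total)), "digit count")
```

Under this convention, two sequences of 100 residues each already give a count on the order of 10^75, which is why alignment algorithms avoid enumerating alignments explicitly.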

4.2 Scoring alignments: Percent identity
The simplest way to quantify similarity is to count the number of base/amino acid matches and divide by the length of the alignment.

T H I S I S A - S E Q E N C E
T H - - - - A T S E Q E N C E

(10 matches / 15 positions) * 100 ≈ 67% identity
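
The calculation above can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are supplied already aligned as equal-length strings with "-" marking gaps (the gap placement used below is an assumption consistent with 10 matches over 15 positions); percent_identity is a hypothetical helper name. Note that some tools divide by the shorter sequence length or by ungapped columns instead, so reported values can differ.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over all alignment columns, gaps included in the length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only when both residues are identical and not gaps.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    return 100.0 * matches / len(aligned_a)


print(round(percent_identity("THISISA-SEQENCE", "TH----ATSEQENCE"), 1))  # 66.7
```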

4.2 Scoring alignments: dot plots
Dot-plots are a simple way to visualize pairwise sequence similarity. (Fig 4.1)
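
A dot plot is easy to produce as plain text: put one sequence along the top, the other down the side, and mark every cell where the two residues are identical. The sketch below assumes the simplest form (single-residue matches, no window or threshold filtering); dot_plot is a hypothetical helper name.

```python
def dot_plot(seq_x, seq_y):
    """Return a text dot plot: one column per residue of seq_x, one row per residue of seq_y."""
    rows = ["  " + seq_x]                      # header row with seq_x along the top
    for y in seq_y:
        # '*' where the residues are identical, '.' elsewhere
        rows.append(y + " " + "".join("*" if x == y else "." for x in seq_x))
    return "\n".join(rows)


print(dot_plot("THISSEQENCE", "THATSEQENCE"))
```

Similar regions show up as runs of marks along a diagonal; insertions and deletions shift the run onto a neighbouring diagonal.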

Do all amino acid substitutions occur with the same probability?
No!

T H I S I S A S E Q E N C E
T H A T S E Q E N C E

• serine - threonine: highly conservative
• isoleucine - alanine: poorly conservative

Substitution Matrix
A substitution matrix contains, for each pair of amino acids, a score reflecting the likelihood that the pair will occupy the same position due to descent from a common ancestor (i.e. homology).
→ 20 x 20 substitution matrix

The BLOSUM62 substitution matrix (Fig 4.4)
• +1 for Ser to Thr
• +5 for Arg to Arg
• -2 for Arg to Asp

Scoring a pairwise alignment using the BLOSUM62 matrix

 T  H  I  S  S  E  Q  E  N  C  E
 T  H  A  T  S  E  Q  E  N  C  E
 5  8 -1  1  4  5  5  5  6  9  5

The overall alignment score (S) = 52
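
The score above is just the sum of one matrix lookup per column. A minimal sketch, assuming only the handful of BLOSUM62 entries that appear in this example (BLOSUM62_SUBSET, pair_score and ungapped_score are hypothetical names; a real analysis would load the full 20 x 20 matrix):

```python
# Only the BLOSUM62 entries needed for this example, as listed on the slide.
BLOSUM62_SUBSET = {
    ("T", "T"): 5, ("H", "H"): 8, ("I", "A"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("N", "N"): 6,
    ("C", "C"): 9,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq_a, seq_b):
    """Sum the substitution scores column by column for an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment requires equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


print(ungapped_score("THISSEQENCE", "THATSEQENCE"))  # 52
```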

Generation of substitution scoring matrices
• Based on the observed amino acid substitution frequencies in alignments of homologous protein sequences
• Use real data to model the evolutionary processes
• PAM substitution matrices are calculated from global protein alignments
• BLOSUM substitution matrices are calculated from local protein alignments

Point-accepted mutations
PAM matrices: point-accepted mutations
• Dayhoff (1960s) calculated substitution probabilities from alignments of highly similar protein families
• All the PAM data come from closely related proteins (>85% amino acid identity)
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence
• Other PAM matrices are extrapolated from PAM1
• For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids (a single position can change more than once)
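
The extrapolation from PAM1 can be written compactly. The following is a hedged sketch of the usual Dayhoff-style construction, not a formula taken from the slides; the symbols are labels chosen here. M_1 is the PAM1 mutation probability matrix, M_n(i,j) is the probability that residue j has been replaced by residue i after n PAM units, f_i is the background frequency of residue i, and s_n(i,j) is the resulting log-odds score.

```latex
M_{n} = (M_{1})^{n}
\qquad\text{and}\qquad
s_{n}(i,j) = 10\,\log_{10}\frac{M_{n}(i,j)}{f_{i}}
```

Setting n = 250 gives the matrix underlying PAM250. Small n yields strict matrices suited to close relatives; large n yields more forgiving matrices suited to distant ones.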

A PAM250 scoring matrix assigns scores that are forgiving of mismatches…
(such as +17 for W to W or -5 for W to T)

…compared to a scoring matrix such as PAM10, which is strict and does not tolerate mismatches
(such as +13 for W to W or -19 for W to T)

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
• The BLOCKS database contains thousands of groups of multiple sequence alignments.
• BLOSUM stands for blocks substitution matrix.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

BLOSUM Matrices
• BLOSUM62 is a matrix calculated from comparisons of sequences with no more than 62% identity.
• BLOSUM62 is the default matrix in BLAST 2.0.
• Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.
• A search for distant relatives may be more sensitive with a different matrix.

Selecting an appropriate scoring matrix
[Figure: scale running from more conserved (rat versus mouse globin) to less conserved (rat versus bacterial globin)]

4.4 Inserting Gaps
• Homologous sequences are often different in length as a result of insertions and deletions (indels)
• The alignment of indels involves inserting gaps into the alignment
• Gap penalty: each time a gap is introduced, a gap penalty is subtracted from the score
-- A gap opening penalty is usually high
-- A gap extension penalty is usually low

Scoring a pairwise alignment using the BLOSUM62 matrix and gap penalty

T H I S I S A S E Q E N C E
T H A T S E Q E N C E
5 8 -1 1 4 5 5 5 6 9 5

Gap opening penalty = -11
Gap extension penalty = -1
The overall score (S) = 52 + ( ) = 40
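
Gap penalties combine with substitution scores as follows. This is a sketch under a stated assumption: the first position of each gap costs the opening penalty and every further position of the same gap costs the extension penalty (some programs, BLAST included, instead charge the extension for every gapped position on top of the opening cost). The function name score_alignment and the small matrix are hypothetical, and the total printed here need not equal the slide's total, which depends on the alignment and convention used in the figure.

```python
def score_alignment(aligned_a, aligned_b, substitution,
                    gap_open=-11, gap_extend=-1, gap="-"):
    """Score a gapped alignment with an affine gap penalty (open once, then extend)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap_a = in_gap_b = False        # currently inside a gap run in a / in b?
    for a, b in zip(aligned_a, aligned_b):
        if a == gap:
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == gap:
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            # Symmetric matrix: accept the pair in either order.
            key = (a, b) if (a, b) in substitution else (b, a)
            score += substitution[key]
            in_gap_a = in_gap_b = False
    return score


# BLOSUM62 diagonal entries needed for this example only.
subset = {("T", "T"): 5, ("H", "H"): 8, ("A", "A"): 4, ("S", "S"): 4,
          ("E", "E"): 5, ("Q", "Q"): 5, ("N", "N"): 6, ("C", "C"): 9}
print(score_alignment("THISISA-SEQENCE", "TH----ATSEQENCE", subset))
```

Because opening a gap is expensive and extending it is cheap, one long gap scores better than several short ones covering the same number of positions.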

4.4 Inserting Gaps
[Figure: an alignment with a high gap penalty compared with an alignment with a low gap penalty] (Page 86)

Next in the course... Ch 5.
We have learned how to score an alignment, but how do you generate the alignment in the first place? Here are two approaches:
• Dynamic Programming Algorithms
• Heuristic Search Algorithms

Sequence alignments continued
David Walsh
BIOL510: Bioinformatics
Alignment of dog genome to the human genome
I would like to start with some general comments before we begin.
Rasko et al. Nucleic Acids Res. 2004; 32(3): 977–988

Outline: sequence alignments (Ch 5)
• Dynamic Programming Algorithms (Ch 5.2)
-- Global alignment: Needleman-Wunsch
-- Local alignments: Smith-Waterman
• Heuristic Search Algorithms (Ch 5.3)
-- BLAST
• Alignment Score Significance (Ch 5.4): we will cover this on Tuesday during the lab

Dynamic Programming Algorithms
For any given pair of sequences, if gaps are allowed there is a large number of possible alignments.
Dynamic programming algorithms can explore the full range of alignments under a variety of different constraints by dividing the problem of alignment into many smaller parts.
Needleman and Wunsch published the original algorithm in 1970, and there have been many modifications and improvements since.

Global alignment versus local alignment
• Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other.
• Local alignment finds optimally matching regions within two sequences (“subsequences”).
• Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.
• Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.
It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

4.2 Scoring alignments: dot plots
Dot-plots are a simple way to visualize pairwise sequence similarity, but they are also the starting point for generating optimal alignments. (Fig 4.1)

Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix of two sequences
[2] score the matrix
[3] identify the optimal alignment(s)

Global alignment with the algorithm of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix along x- and y-axes.
• If they are identical, a path along a diagonal can be drawn.
• Find the optimal subpaths, and add them up to achieve the best score. This involves
-- adding gaps when needed
-- allowing for conservative substitutions
-- choosing a scoring system (simple or complicated)
N-W is guaranteed to find optimal alignment(s).

Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
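
The three steps and the four outcomes can be put together in a short Needleman-Wunsch sketch. It is illustrative rather than definitive: it assumes a linear gap penalty (every gapped position costs the same), and the default scoring function (+5 for identity, -4 for mismatch) is a placeholder standing in for BLOSUM62. The name needleman_wunsch and its parameters are hypothetical.

```python
def needleman_wunsch(seq_a, seq_b, score=None, gap=-8):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    if score is None:
        score = lambda a, b: 5 if a == b else -4   # placeholder, not BLOSUM62
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # [1] Set up the matrix: first row and column hold cumulative gap penalties.
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    # [2] Score the matrix: each cell keeps the best of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # diagonal: identity or mismatch
                F[i - 1][j] + gap,                                    # vertical: gap in seq_b
                F[i][j - 1] + gap,                                    # horizontal: gap in seq_a
            )

    # [3] Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("THISISASEQENCE", "THATSEQENCE")
print(total)
print(top)
print(bottom)
```

With gap = -8 the boundary cells take the values -8, -16, -24 and so on. When several alignments tie for the best score, this traceback returns only one of them.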

The initial stage of dynamic programming
[Figures 5.8 and 5.10: setting up the matrix with the BLOSUM62 substitution matrix and a gap extension penalty (E) of -8; cell value shown: -16.]

The initial stage of dynamic programming: filling in the matrix
[Figures 5.8 and 5.10: filling in a cell with the BLOSUM62 substitution matrix and a gap extension penalty (E) of -8; Thr → Ile, score = -1.]

The final stage of dynamic programming: traceback
[Figure 5.9: traceback with gap extension penalty (E) = -8 and the BLOSUM62 substitution matrix; score = -4.]
[Figure 5.11: traceback with gap extension penalty (E) = -4 and the BLOSUM62 substitution matrix; score = 7.]

Local alignment: the Smith-Waterman (SW) algorithm
• Remember: two protein sequences may not exhibit homology along their full length.
• SW is a modification of the Needleman-Wunsch algorithm.
• Instead of looking at each sequence in its entirety, the method compares segments of all possible lengths and chooses the segments that optimize the similarity measure.
(Page 88, Page 136)

Local alignment algorithm: optimal subsequence alignments less than zero (<0) are rejected
[Figure 5.15: local alignment with gap extension penalty (E) = -8 (!) and the BLOSUM62 substitution matrix; score = 12.]
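
The rejection of negative scores is what turns the global algorithm into a local one. A minimal Smith-Waterman sketch, under the same assumptions as a basic Needleman-Wunsch implementation (linear gap penalty; a placeholder +5/-4 scoring function rather than BLOSUM62; smith_waterman is a hypothetical name):

```python
def smith_waterman(seq_a, seq_b, score=None, gap=-8):
    """Local alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    if score is None:
        score = lambda a, b: 5 if a == b else -4   # placeholder, not BLOSUM62
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]          # first row and column stay at 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                                    # reject negative subalignments
                H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # diagonal
                H[i - 1][j] + gap,                                    # gap in seq_b
                H[i][j - 1] + gap,                                    # gap in seq_a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("XXXHELLOYYY", "ZZHELLOWW"))
```

Only the best-scoring region is reported; the unrelated flanking residues are left out of the alignment instead of being forced to pair up.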
