Biology 162 Computational Genetics Todd Vision Fall Aug 2004

Biology 162 Computational Genetics Todd Vision Fall 2004 31 Aug 2004
Pairwise alignment Biology 162 Computational Genetics Todd Vision Fall 2004 31 Aug 2004

Preview General concepts in alignment How to read a dotplot
Scoring matrices and gap penalties Basic dynamic programming algorithms Needleman-Wunsch Smith-Waterman Using more realistic gap penalties

Homology Two sequences are homologous if they are descended from a common ancestral sequence. Homology is either all or nothing Only similarity can be quantitative An alignment is a model of positional homology between nucleotide or amino acid residues

Some applications of sequence alignment
Sequence/genome assembly Locating exons within genomic sequences Functional annotation by homology search against a database Identification of conserved signatures/motifs/domains Molecular evolution and phylogenetics Structural homology modeling

Alignments classified by
Span Global, encompassing full-length sequences Local, restricted to conserved segments Number of sequences Pairwise, involving only two sequences Multiple, involving more than two Algorithm Optimality guarantee Heuristic

Amino acids versus DNA DNA sequences give much worse alignments than amino acid sequences Fewer letters Less realistic scoring matrices Some applications can align codons How large would that scoring matrix be? If that’s not possible Use aa alignment to guide DNA alignment of coding sequences Use conceptual translations (6 potential coding frames) for database searches

Dotplots: phage l cI vs. P22 c2 repressor
B A Window size Stringency

Internal repeats: human LDL protein
Window size = 23 bp Stringency = 7 bp

Inversions

Hanging ends y y x x Overlap Nesting x y

Percent amino acid identity
Twilight Zone 100 90 80 70 60 50 40 30 20 10 Percent amino acid identity

Scoring an alignment Possible relationships at a position
Match (identity) Mismatch (substitution) Gap (insertion/deletion, or indel) A scoring matrix is used for matches and mismatches Typically binary for nucleic acids PAM, BLOSUM, & others for amino acids Gap penalties must be “tacked on” The alignment score is the sum of the scores at each position in the alignment, including gaps

LOD scores Let pab be the expected frequency of aligned residue pair a and b among all aligned residues Let qa be the frequency of individual amino acid a

Point Accepted Mutation (PAM) matrices
Trained on alignments of closely related proteins PAM1 implies 1 substitution per 100 amino acids PAM250 = (PAM1)250 Training set strongly biased toward globular proteins (more suitable matrices are available for more specific protein classes)

Choosing the right PAM matrix
Low PAM values discriminate among amino acids more dramatically As the exponent increases, values within a row converge on amino acid frequencies Choice of matrix typically depends on observed % identity Classic chicken and egg problem PAM250 corresponds to 20% identity Assumes substitution rate is equal among sites (Poisson model), which we know to be false

BLOSUM scoring matrices
Trained on ungapped alignments (blocks) of divergent sequences to capture ‘long-term’ substitution patterns Named BLOSUMx, where x (from 0 to 100) is the minimum percent identity of the sequences in the alignment. (The smaller the value of x, the more divergent the sequences). Note that numbers have opposite meanings for PAM and BLOSUM! BLOSUM62 is in wide use (eg it is the default in BLAST)

BLOSUM62 BLOSUM62 A R N D C Q E G H I L K M F P S T W Y V B Z X *
* BLOSUM62

Gap penalties Naïve score Affine scores
Each gap position receives independent penalty of d Affine scores Score depends on length of contiguous gap Gap opening penalty d Gap extension penalty e

Dynamic programming A problem solving technique that employs recursion to solve a larger problem by solving a nested set of similar subproblems

Application to pairwise alignment
Imagine We know the score for the first i-1 and j-1 residues of sequences x and y i-1 and j-1 are aligned in the optimal alignment There are three possibilities for the next position in the alignment A gap in sequence x A gap in sequence y A match or mismatch between i and j The maximum scoring alignment among these has to be in the optimal global alignment

Overview of algorithm We can use this fact to recursively fill out a matrix containing the score F(i,j) of the optimal alignment for every pair of residues i and j We also store a pointer to one of three previously filled out cells in the matrix, forming a path graph The optimal global alignment must be a path within the path graph It can be found by performing a traceback from the final cell in the recursion

Path Graph Diagonal moves represent matches and mismatches
Horizontal and vertical moves represent gaps (indels)

Needleman-Wunsch recursion
xi, yj match yj aligned to gap xi aligned to gap Let s(C,C)=1, s(C,G)=s(G,C)=-2, and d=-1 y ?C ?- A G A C 3 2 ?- ?G A G A C 3 2 A C 3 3 x A C 3 x: AC y: AC

Needleman-Wunsch c g t g c g t c t g t g a Initialize F(0,0) = 0
Use pointers to remember path Match=+1, Mismatch=-1, Gap=-2 arbitrary order of precedence:  , ,  Needleman-Wunsch c g t g c g t c t g t g a  -2  -4  -6  -8  -10  -12  -14 -2  -4  -6  -8  -10  -12  -14

Needleman-Wunsch: completed path graph
c g t g c g t c t g t g a -2 -4 -6 -8 -10 -12 -14 -2 +1 -1 -3 -5 -7 -9 -11 -4 -1  0 -8 -6 -3 -1  0 -8 -5 -2 -10 -7 +2 -12 -9  0 -14 -11 -3

Needleman-Wunsch: traceback
c g t g c g t c t g t g a -2 -4 -6 -8 -10 -12 -14 -2 +1 -1 -3 -5 -7 -9 -11 -4 -1  0 -8 -6 -3 -1  0 -8 -5 -2 -10 -7 +2 -12 -9  0 -14 -11 -3 cgtgcg-t | || | | c-tgtgat optimal global alignment:

Complexity of algorithm
For sequences of length m and n We consider 3 options at each cell We store mn scores and pointers We trace back no more than m+n steps 3mn +m+n in time, 3mn in memory O(mn) in both time and memory If m=n, O(n2)

Smith-Waterman algorithm
Local pairwise alignment Cells with negative scores are set to zero Traceback from highest scoring cell Stop when 0 is encountered Also O(nm)

Smith-Waterman recursion
xi, yj match yj aligned to gap xi aligned to gap score ≤ 0

Smith-Waterman algorithm
c g t g c g t c t g t g a +1 +2 +3 +1 +1 cgtgcgt ||| ctgtgat optimal local alignment:

Guaranteeing a local alignment
Use of SW algorithm alone does not guarantee “local” behavior Sensitive to the scoring function (should be negative for random sequences) Use of LOD scores help ensure this Gap penalties must also be chosen carefully If it is cheaper to insert a gap than to tolerate a mismatch, then gaps will be inserted where no alignment is possible

More realistic gap penalties
General gap function g(g) Requires O(n3) operations

Affine gap penalties Gap score: g(g)=-d-(g-1)e Can be done in O(mn)
We need to keep track of three scores (and pointers) at each cell

A theme with variations
Overlapping or nested sequences Do not penalize hanging ends Repeated sequences Asymmetric algorithm can find multiple local alignments of x in y or y in x The basic idea admits many variations

Things to keep in mind about pairwise alignments
There may be multiple optima Optimality is only guaranteed with respect to the scoring function – the alignment may still be biologically wrong! O(mn) is still too big when n is the size of a major sequence database

Summary Dotplots are an excellent visual tool to decide whether and what kind of alignment is appropriate PAM and BLOSUM series matrices provide empirical LOD values for scoring alignments Different flavors of alignment are produced by variants on a basic dynamic programming algorithm Needleman-Wunsch for global alignments Smith-Waterman for local alignments Affine gap penalties balance biological realism with computational feasibility

Reading assignment Nicholas et al. (2002) Strategies for multiple sequence alignment. Biotechniques 32, (handout)

Biology 162 Computational Genetics Todd Vision Fall Aug 2004

Similar presentations

Presentation on theme: "Biology 162 Computational Genetics Todd Vision Fall Aug 2004"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Biology 162 Computational Genetics Todd Vision Fall Aug 2004

Similar presentations

Presentation on theme: "Biology 162 Computational Genetics Todd Vision Fall Aug 2004"— Presentation transcript:

Similar presentations

About project

Feedback