Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment

Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment

Bioinformatics “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky ( )) “Nothing in bioinformatics makes sense except in the light of Biology”

Divergent evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø)
ACCD or ACCD Pairwise Alignment AB─D A─BD mutation deletion

Divergent evolution Ancestral sequence: ABCD ACCD (B C) ABD (C ø)
ACCD or ACCD Pairwise Alignment AB─D A─BD mutation deletion true alignment

A bit on divergent evolution
Ancestral sequence G C A C One substitution - one visible Two substitutions - one visible Sequence 1 Sequence 2 (c) G (d) G 1: ACCTGTAATC 2: ACGTGCGATC * ** D = 3/10 (fraction different sites (nucleotides)) G A A A Back mutation - not visible Two substitutions - none visible G

Convergent evolution Often with shorter motifs (e.g. active sites)
Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds Sequences and associated structures remain different, but (functional) motif can become identical Classical example: serine proteinase and chymotrypsin

Serine proteinase (subtilisin) and chymotrypsin
Different evolutionary origins, there are Similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base. The geometric orientations of the catalytic residues are similar between families, despite different protein folds. The linear arrangements of the catalytic residues reflect different family relationships. For example the catalytic triad in the chymotrypsin clan (SA) is ordered HDS, but is ordered DHS in the subtilisin clan (SB) and SDH in the carboxypeptidase clan (SC).

A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

Searching for similarities
What is the function of the new gene? The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques): – Find a set of similar protein sequences to the unknown sequence – Identify similarities and differences – For long proteins: identify domains

Evolutionary and functional relationships
Reconstruct evolutionary relation: Based on sequence -Identity (simplest method) -Similarity Homology (common ancestry: the ultimate goal) Other (e.g., 3D structure) Functional relation: Sequence Structure Function

Searching for similarities
Common ancestry is more interesting: Makes it more likely that genes share the same function Homology: sharing a common ancestor – a binary property (yes/no) – it’s a nice tool: When (an unknown) gene X is homologous to (a known) gene G it means that we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.

How to go from DNA to protein sequence
A piece of double stranded DNA: 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’ DNA direction is from 5’ to 3’

How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture): 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’

Evolution and three-dimensional protein structure information
Isocitrate dehydrogenase: The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution) Dean, A. M. and G. B. Golding: Pacific Symposium on Bioinformatics 2000

Bioinformatics tool tool Algorithm Data Biological Interpretation
(model)

Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming MDAGSTVILCFVG Evolution M D A S T I L C G Amino Acid Exchange Matrix Search matrix MDAGSTVILCFVG- Gap penalties (open,extension) MDAAST-ILC--GS

How to determine similarity
Frequent evolutionary events at the DNA level: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion We will restrict ourselves to these events

A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

Dynamic programming Scoring alignments
– Substitution (or match/mismatch) • DNA • proteins – Gap penalty • Linear: gp(k)=ak • Affine: gp(k)=b+ak • Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores of all alignment columns

Sa,b = gp(k) = gapinit + kgapextension affine gap penalties

DNA: define a score for match/mismatch of letters
Simple: Used in genome alignments: A C G T 1 -1 A C G T 91 -114 -31 -123 100 -125

T D W V T A L K T D W L - - I K 2020 10 1 Affine gap penalties (open, extension) Amino Acid Exchange Matrix Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px + +s(L,I)+s(K,K)

Amino acid exchange matrices
2020 How do we get one? And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure

amino acid exchange matrix (log odds)
Q E G H I L K M F P S T W Y V B Z A R N D C Q E G H I L K M F P S T W Y V B Z PAM250 matrix amino acid exchange matrix (log odds) Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation

Amino acid exchange matrices
Amino acids are not equal: 1. Some are easily substituted because they have similar: • physico-chemical properties • structure 2. Some mutations between amino acids occur more often due to similar codons The two above observations give us ways to define substitution matrices

Pair-wise alignment T D W V T A L K T D W L - - I K
Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! n = ~ n (n!) n 2 sequences of a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!

Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment

Similar presentations

Presentation on theme: "Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment

Similar presentations

Presentation on theme: "Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment"— Presentation transcript:

Similar presentations

About project

Feedback