Published by Jodie Gordon. Modified over 8 years ago.
1
Function preserves sequences
Christophe Roos - MediCel ltd, christophe.roos@medicel.fi
Similarity is a tool in understanding the information in a sequence.
Mutations change sequences.
2
Spring 2002 Christophe Roos - 4/6 Sequence comparison
Sequence comparison
Function by analogy: if sequences are conserved, their function is probably also conserved.
Functional domains: if some parts of the sequences are more conserved than others, there must be an underlying biological reason for it.
Establishing relationships and differences in function: by quantifying sequence relationships, it is possible to estimate the function of novel genes.
Establishing relationships between species.
Why, how?
- Compare two sequences of similar length
- Compare two sequences of very different length
- Compare several sequences
- Allow gaps or not?
- Scoring: yes-no or good-intermediate-bad?
- The best, or all above a threshold?
3
Sequence comparison: metrics
- The scoring matrix
- The score for a match
- The penalty for a mismatch
- The penalty for the insertion of a gap (gap-open)
- The penalty for elongating a gap (gap-length)
- Local or global similarities?
Example alignment (a gap at position 3, a mismatch at position 8, matches everywhere else):
GA-CGGATTAG
GATCGGAATAG
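The metrics above can be combined into a single alignment score. The following is a minimal sketch, not the lecturer's code: the function name `score_alignment` is made up for illustration, the match, mismatch and gap-open values are borrowed from the later matrix slide (+5, -4, -7), and the gap-extension value of -1 is an assumption.

```python
def score_alignment(top, bottom, match=5, mismatch=-4, gap_open=-7, gap_extend=-1):
    """Score two aligned strings of equal length; '-' marks a gap.

    gap_extend=-1 is an assumed value, the slides do not give one.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap column pays gap_open, following ones pay gap_extend.
            # Simplification: a gap that switches strands counts as one run.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# The alignment from the slide: 9 matches, 1 mismatch, 1 gap.
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9*5 - 4 - 7 = 34
```

Separating the gap-open and gap-length penalties makes one long gap cheaper than several short ones, which is usually the behaviour one wants for insertions and deletions.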
4
Scoring matrices
When a residue pair occurs in an alignment, its score weights how meaningful that occurrence is.
The matrix is a table of values that describe the probability of a residue pair occurring in an alignment.
The probabilities are derived from samples of alignments known to be valid.
They can then be used to evaluate the similarity of sequences of unknown function to sequences of known function.
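In practice, matrices such as Blosum and PAM are stored as log-odds scores: the observed frequency of a pair in trusted alignments is compared with the frequency expected by chance. The sketch below illustrates that calculation on an invented two-letter alphabet; the frequencies, the function name `log_odds` and the scale factor are all assumptions for illustration, not values from any real matrix.

```python
import math


def log_odds(pair_freq, background, scale=2.0):
    """Return {(a, b): score} with score = scale * log2(observed / expected)."""
    scores = {}
    for (a, b), observed in pair_freq.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # unordered pair: (a, b) and (b, a) both count
        scores[(a, b)] = round(scale * math.log2(observed / expected))
    return scores


# Invented toy data: pairs seen in "known valid" alignments of a 2-letter alphabet.
pair_freq = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.2}
background = {"A": 0.5, "B": 0.5}

scores = log_odds(pair_freq, background)
print(scores)  # identical pairs score above 0, the mixed pair below 0
```

A positive score means the pair is seen more often than chance would predict, a negative score means it is seen less often.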
5
Scoring matrices
For DNA, they are usually binary: two bases either match or they do not.
For proteins, they reflect the chemical nature and frequencies of the amino acids, and cover a larger range of values.
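A binary DNA matrix is simple enough to build directly. This is a small sketch under assumptions: the helper names `identity_matrix` and `ungapped_score` are hypothetical, and the +5/-4 values are taken from the matrix-walking slide rather than from any standard.

```python
def identity_matrix(alphabet="ACGT", match=5, mismatch=-4):
    """Build a binary scoring matrix: one value for identity, one for anything else."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def ungapped_score(seq1, seq2, matrix):
    """Sum the matrix values position by position (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


dna = identity_matrix()
print(ungapped_score("GATTACA", "GACTATA", dna))  # 5 matches, 2 mismatches -> 17
```

A protein matrix would be used in exactly the same way, only with a 20 x 20 table whose values differ from pair to pair.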
6
Commonly used matrices for proteins
Blosum matrices are derived from the Blocks database, which contains ungapped alignments from families of related proteins. The number indicates the similarity threshold level: Blosum62, Blosum45.
PAM matrices are scaled according to a model of evolutionary distance, built from alignments of closely related sequences. One PAM unit (PAM-1) corresponds to a 1% change over all positions.
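Larger PAM distances are obtained by applying the PAM-1 mutation probabilities repeatedly, that is, by raising the PAM-1 matrix to a power. The sketch below shows this on an invented three-letter alphabet; the probabilities and helper names are assumptions for illustration and are not real PAM values.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1": each residue stays unchanged with probability 0.99 (1% change).
pam1 = [[0.990, 0.005, 0.005],
        [0.005, 0.990, 0.005],
        [0.005, 0.005, 0.990]]

pam250 = mat_pow(pam1, 250)
print([round(p, 3) for p in pam250[0]])  # probabilities after 250 PAM units
```

After many PAM units the diagonal drops well below 0.99 because positions can change more than once, which is why PAM-250 does not mean that 250% of the positions differ.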
7
Walking through an alignment matrix
Start with a gap (-) against itself and score it 0.
Fill in one row at a time.
At each position, compute the scores that result from each of the three choices: move one step in each sequence (diagonal), or skip one position horizontally or vertically (gap).
Choose the best of the three values and save it.
Score +5 for a match, -4 for a mismatch and -7 for a gap.
If the best value is negative, write 0.
Trace back along the highest-scoring path.
Example: 10-4=6 (diagonal, mismatch), 10-7=3 (gap, horizontal or vertical).
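The cell-by-cell rule can be sketched as follows. This is an illustrative implementation, not the lecturer's: the function name `fill_matrix` and the `floor_at_zero` switch are assumptions. With the switch off, the borders accumulate gap penalties (global alignment); with it on, negative cells are replaced by 0.

```python
def fill_matrix(a, b, match=5, mismatch=-4, gap=-7, floor_at_zero=False):
    """Fill the dynamic-programming matrix row by row and return it."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]  # m[0][0]: gap against gap, score 0
    if not floor_at_zero:
        for i in range(1, rows):
            m[i][0] = i * gap  # leading gaps in b
        for j in range(1, cols):
            m[0][j] = j * gap  # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical = m[i - 1][j] + gap
            horizontal = m[i][j - 1] + gap
            best = max(diagonal, vertical, horizontal)
            m[i][j] = max(best, 0) if floor_at_zero else best
    return m


for row in fill_matrix("GA", "GA"):
    print(row)
```

The slide's example corresponds to one cell whose three neighbours all hold 10 and whose residues mismatch: the diagonal gives 10-4=6, either gap gives 10-7=3, so 6 is saved.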
8
Global alignments and local ones
Aligning 2 sequences along their whole length is done by stepping through the matrix from top left to bottom right. The best-scoring path can be traced through the matrix, resulting in an optimal alignment. The Needleman-Wunsch algorithm belongs to this class.
Sequences are often modular, so similarities may be only local, and global alignments will then fail. Smith-Waterman is a dynamic programming algorithm that performs local alignment of 2 sequences. If the cumulative score up to some point in the sequence is negative, the path can be abandoned. An alignment can also end anywhere in the matrix.
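A compact sketch of local alignment with traceback is given below. It uses a single linear gap penalty and the +5/-4/-7 scores from the previous slide, which is a simplification of what production tools do; the function name and the test sequences are invented for illustration.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-7):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative cumulative scores are abandoned: the cell restarts at 0.
            m[i][j] = max(0, diagonal, m[i - 1][j] + gap, m[i][j - 1] + gap)
            if m[i][j] > best:
                best, best_pos = m[i][j], (i, j)

    # Trace back from the best cell (anywhere in the matrix) until a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and m[i][j] > 0:
        diagonal = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if m[i][j] == diagonal:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif m[i][j] == m[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Two invented sequences that share only one module, GATTACA.
print(smith_waterman("TTTTGATTACACCCC", "GGGGATTACAGGG"))
```

The two sequences differ everywhere except in the shared module, so a global alignment would be dominated by mismatches while the local one reports only the conserved stretch.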
9
Example: local alignment of 2 sequences
Web page: enter two sequences and search for local alignments. Two are found.
10
Iterate: compare one against many
By iterating pairwise comparisons, one can compare one sequence against a database of many sequences.
Algorithms such as Smith-Waterman are too slow for this (they are optimised for quality, not speed).
Multistep algorithms have been developed for this task:
- Fasta: (i) search for short words of length k (k-tuples; k is usually 2 for proteins and 6 for DNA) shared by the two sequences. (ii) Score the 10 ungapped alignments with the most identical k-tuples. (iii) Try to merge them into a gapped alignment without reducing the score below a threshold.
- Blast: (i) create a list of short words that score high enough when compared to the query. (ii) Search for these words in a precomputed table of all words and their positions in the database. (iii) Extend the hits into ungapped or even gapped local alignments.
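The word-lookup step shared by both heuristics can be sketched as below. This is a deliberately simplified illustration using exact word matches only: it does not reproduce either program (Blast, for example, also accepts similar words that score above a threshold). The function names and the toy database are assumptions.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Precompute a table: word of length k -> list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def diagonal_hits(query, index, k):
    """Count shared words per (sequence, diagonal); diagonal = db position - query position."""
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits[(name, dpos - qpos)] += 1
    return hits


# Invented toy database; real searches index millions of sequences.
database = {"seq1": "TTGATTACATT", "seq2": "CCCCCCCC"}
index = build_word_index(database, k=3)
hits = diagonal_hits("GATTACA", index, k=3)
best = max(hits, key=hits.get)
print(best, hits[best])  # the diagonal with the most shared words
```

Many word hits on the same diagonal indicate an ungapped region of similarity, which is the candidate that the later, slower steps extend and rescore.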
11
Example: BLAST one against SwissProt
Web page: enter the sequence and search for local alignments. Several are found and listed both graphically and as text. Note the modularity of the query: two domains are apparent.
12
Iterate (2): compare many against many
Multiple sequence alignments
Example: the eyeless gene is also called PAX6 and can be found in several species: birds, mammals, reptiles, fish, invertebrates.
13
Multiple sequence alignments
CLUSTAL W (1.81) multiple sequence alignment

PAX6_CHICK ------------------------------------------------------------
PAX6_HUMAN ------------------------------------MQNS----------------HSGV 8
PAX6_MOUSE ------------------------------------MQNS----------------HSGV 8
PAX6_COTJA ------------------------------------MQNS----------------HSGV 8
PAX6_BRARE -----------------MPQKEYYNRATWESGVASMMQNS----------------HSGV 27
PAX6_ORYLA -----------------MPQKEYHNQATWESGVASMMQNS----------------HSGV 27
PAX6_XENLA ------------------------------------MQNS----------------HSGV 8
PAX6_DROME MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV 60

PAX6_CHICK ------------------------------------------------------------
PAX6_HUMAN NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_MOUSE NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_COTJA NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_BRARE NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_ORYLA NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_XENLA NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_DROME NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 120

PAX6_CHICK ------------------------------------------------------------
PAX6_HUMAN RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_MOUSE RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_COTJA RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_BRARE RAIGGSKPRVATPEVVGKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 147
PAX6_ORYLA RAIGGSKPRVATPEVVAKIAQYKRECPSIFAWEIRDRLLSEGICTNDNIPSVSSINRVLR 147
PAX6_XENLA RAIGGSKPRVATPEVVNKIAHYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_DROME RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR 180

PAX6_CHICK ------------------------------------------------------------
PAX6_HUMAN NLASEKQQMGA------------------------------------------------- 139
PAX6_MOUSE NLASEKQQMGA------------------------------------------------- 139
PAX6_COTJA NLASEKQQMGA------------------------------------------------- 139
PAX6_BRARE NLASEKQQMGA------------------------------------------------- 158
PAX6_ORYLA NLASEKQQMGA------------------------------------------------- 158
PAX6_XENLA NLASDKQQMGS------------------------------------------------- 139
PAX6_DROME NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL 240

First all sequence pairs are aligned and scored, then in a second round a multiple sequence alignment is built up.
In this case (PAX6 proteins from vertebrates and fruit fly), two domains are more conserved than the rest of the sequence. Only the first domain is shown here.
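One simple way to make "more conserved" measurable is to compute, for each alignment column, the fraction of sequences that share the most common residue. The sketch below does this for the first 24 columns of the third alignment block, using five of the rows shown above; the function name and the choice of excerpt are assumptions for illustration, and Clustal itself reports conservation differently.

```python
from collections import Counter


def column_conservation(rows):
    """Per column: fraction of sequences that carry the most common residue."""
    return [Counter(column).most_common(1)[0][1] / len(rows)
            for column in zip(*rows)]


# Excerpt: first 24 columns of the third block, five of the eight sequences.
excerpt = {
    "PAX6_HUMAN": "RAIGGSKPRVATPEVVSKIAQYKR",
    "PAX6_BRARE": "RAIGGSKPRVATPEVVGKIAQYKR",
    "PAX6_ORYLA": "RAIGGSKPRVATPEVVAKIAQYKR",
    "PAX6_XENLA": "RAIGGSKPRVATPEVVNKIAHYKR",
    "PAX6_DROME": "RAIGGSKPRVATAEVVSKISQYKR",
}

conservation = column_conservation(list(excerpt.values()))
# 1-based positions (within the excerpt) where not all sequences agree.
variable = [i + 1 for i, frac in enumerate(conservation) if frac < 1.0]
print(variable)
```

In this excerpt 20 of the 24 columns are identical in all five sequences, including the fruit fly, which is the kind of signal that points to a functional domain.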
14
Multiple sequences in phylogeny
Once a multiple sequence alignment is done, it can be used for finding
- Domains (previous slide)
- Relationship (evolutionary distance)
The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all the 'present-day' sequences used. Then a path (tree) connecting all sequences is computed. Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
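Counting the mutations needed on a given tree can be sketched with Fitch's small-parsimony method. The tree topology below is a hypothetical one chosen for illustration, not a result from the lecture, and the sequences are the same 24-column excerpt of the PAX6 alignment; a real phylogeny program would also search over many candidate trees.

```python
def fitch(tree, column):
    """Return (candidate ancestral residues, minimum changes) for one column.

    tree   -- a leaf name (str) or a pair (left subtree, right subtree)
    column -- dict mapping leaf name to the residue seen in this column
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:  # the subtrees can agree: no extra mutation needed
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))


alignment = {
    "PAX6_HUMAN": "RAIGGSKPRVATPEVVSKIAQYKR",
    "PAX6_BRARE": "RAIGGSKPRVATPEVVGKIAQYKR",
    "PAX6_ORYLA": "RAIGGSKPRVATPEVVAKIAQYKR",
    "PAX6_XENLA": "RAIGGSKPRVATPEVVNKIAHYKR",
    "PAX6_DROME": "RAIGGSKPRVATAEVVSKISQYKR",
}
# Hypothetical topology: tetrapods together, fish together, fruit fly outside.
tree = ((("PAX6_HUMAN", "PAX6_XENLA"), ("PAX6_BRARE", "PAX6_ORYLA")), "PAX6_DROME")

print(parsimony_score(tree, alignment))  # minimum number of changes on this tree
```

Under the parsimony criterion, the tree that yields the lowest total is preferred; maximum likelihood replaces this count with a probabilistic model of change.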