Published by Oliver Carter. Modified over 9 years ago.
1
Multiple Sequence Alignment
School of B&I, TCD
May 2010
2
MSA
A central technique in bioinformatics
– homology searching
– multiple sequence alignment
– phylogenetic trees
3
An example
“All you have to do” is rewrite your sequences so that similar features end up in the same columns.
4
Evolutionary relationship
“Similar features” ideally means homologous: sharing a common ancestor.
ClustalW and T-Coffee mimic the process of evolution
– by weighting similar residues by how conserved they are in evolution: important AAs rarely mutate; less important AAs change easily, even randomly
– by inserting judicious gaps
5
Applications
Discover conserved patterns/motifs
– a step towards describing a protein domain
– MSA can add a distant relative to your protein family
Define DNA regulatory elements
Predict secondary structure, and help with 3-D structure prediction
A step towards phylogenetic trees
PCR analysis/primer design
– find the most and least degenerate regions of your sequence
6
So why difficult?
Trivial 2-sequence alignment: 3 possibilities. Where to put the gap?

FGDERTHHS
FGD--DHRS

FGDERTHHS
FGDD--HRS

FGDERTHHS
FGD-D-HRS

As the length and number of sequences increase, the number of possible alignments becomes astronomical.
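To put a number on "astronomical", here is a minimal sketch that counts every possible global alignment of two sequences. It assumes an alignment is a series of columns, each being a residue pair, a gap in the first sequence, or a gap in the second, and it counts adjacent insertions and deletions in different orders as distinct alignments. The function name `count_alignments` is made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of lengths m and n."""
    if m == 0 or n == 0:
        # One sequence is used up: the rest of the other can only face gaps.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two residues
        + count_alignments(m - 1, n)    # last column: residue over a gap
        + count_alignments(m, n - 1)    # last column: gap over a residue
    )


for length in (1, 2, 3, 5, 10):
    print(length, count_alignments(length, length))
```

Even for two sequences, the count grows very quickly with length, which is why exhaustive search is not an option for real data.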
7
Some data
Cat ATGAAACGTCGGATCTAA
Dog ATGAATCGACCCATCTAA
Mus ATGGCGTGGCTTGGCATGTGA
Rat ATGGCATGTCGTGGCATGTAG
Protocol step 1
Align each pair of seqs: C-D, C-M, C-R etc.
Get a score for each alignment
And make a …
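Step 1 (align each pair, get a score) can be sketched with a basic Needleman-Wunsch global alignment. This is an illustration, not the scoring ClustalW actually uses: the match/mismatch/gap values are placeholder assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score for a and b (linear gap penalty)."""
    # prev[j] = best score for aligning the empty prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing but gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


seqs = {
    "Cat": "ATGAAACGTCGGATCTAA",
    "Dog": "ATGAATCGACCCATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
for x, y in combinations(seqs, 2):
    print(x, y, global_alignment_score(seqs[x], seqs[y]))
```

The six pairwise scores are what fills the matrix on the next slide; different scoring values will give different numbers.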
8
Similarity matrix
    Cat Dog Mus Rat
Cat ID  14  10  10
Dog     ID  10  10
Mus         ID  16
Rat             ID
Number of identical residues
– Which pair of sequences is most similar?
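Two entries of the matrix can be checked directly, because Cat/Dog and Mus/Rat are the same length and line up without gaps. A minimal sketch (the helper name `count_identities` is made up here); the cross-species pairs differ in length, so they need a gapped alignment first and are not computed.

```python
def count_identities(a, b):
    """Count positions where two equal-length sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (align them first)")
    return sum(x == y for x, y in zip(a, b))


cat = "ATGAAACGTCGGATCTAA"
dog = "ATGAATCGACCCATCTAA"
mus = "ATGGCGTGGCTTGGCATGTGA"
rat = "ATGGCATGTCGTGGCATGTAG"

print(count_identities(cat, dog))  # 14
print(count_identities(mus, rat))  # 16
```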
9
Progressive alignment
Align the two most similar sequences, inserting any gaps.
– Mus/Rat: lock these sequences together (call it “RODent”)
Return to the similarity matrix to find the next most similar sequences or sequence cluster.
– Dog/Cat: align and lock (call it “CARnivore”)
– if the next step requires a gap, the gap is inserted in both carnivore sequences
Align next most … (now it's iterative)
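The order in which sequences get locked together can be sketched as a greedy clustering over the similarity matrix. This uses simple average linkage as a stand-in (an assumption; ClustalW builds its guide tree differently), and `join_order` is a hypothetical name.

```python
from itertools import combinations


def join_order(names, similarity):
    """Return the order in which clusters merge, most similar first."""
    clusters = [(name,) for name in names]

    def avg_similarity(c1, c2):
        # Average the pairwise scores between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    order = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: avg_similarity(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        order.append((c1, c2))
    return order


# Values taken from the similarity matrix slide.
similarity = {
    frozenset(("Cat", "Dog")): 14, frozenset(("Cat", "Mus")): 10,
    frozenset(("Cat", "Rat")): 10, frozenset(("Dog", "Mus")): 10,
    frozenset(("Dog", "Rat")): 10, frozenset(("Mus", "Rat")): 16,
}
for step in join_order(["Cat", "Dog", "Mus", "Rat"], similarity):
    print(step)
```

With these numbers the merges come out as Mus+Rat, then Cat+Dog, then the two clusters together, matching the RODent/CARnivore walk-through.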
10
An alignment
Cat ATGAAACGTCGG---ATCTAA
Dog ATGAATCGACCC---ATCTAA
Mus ATGGCGTGGCTTGGCATGTGA
Rat ATGGCATGTCGTGGCATGTAG
    ***    * *     ** *
Good: always a two-“sequence” problem
– So computationally possible
Bad: can't rewrite or decouple (part of) the dog/cat alignment in the light of later info. Locked in a (suboptimal?) trough.
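The row of stars under the alignment marks columns where all four sequences agree. A minimal sketch of how such a line can be derived (the function name `identity_line` is made up for illustration):

```python
def identity_line(rows):
    """Mark columns where every row has the same residue and none has a gap."""
    return "".join(
        "*" if len(set(column)) == 1 and "-" not in column else " "
        for column in zip(*rows)
    )


rows = [
    "ATGAAACGTCGG---ATCTAA",  # Cat
    "ATGAATCGACCC---ATCTAA",  # Dog
    "ATGGCGTGGCTTGGCATGTGA",  # Mus
    "ATGGCATGTCGTGGCATGTAG",  # Rat
]
for row in rows:
    print(row)
print(identity_line(rows))
```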
11
Choosing the right seqs
Use MSA to inform you!
Always use AA/protein if possible
– can copy gaps back to DNA later
Start with 6-15 sequences
Eliminate very different (<30% id) seqs
Eliminate identical sequences
Watch out for partial sequences
… or sequences that need many gaps to align
Check for repeats with Dotlet, LALIGN
12
Less is more
Large alignments
– take a lot of CPU and time
– are hard to do well
– are difficult to display
– are difficult to use: in trees, for example
– may include marginal seqs that wreck the whole alignment
So start small and add/eliminate seqs until you have a clear, informative picture
13
Level of variation is important
Choose the sequence family with the best rate of evolution for your taxonomic group
– Histones evolve very slowly (compare kingdoms)
– Transferrins evolve fast (compare classes, orders)
Closely related sequences may have identical protein (but variable DNA)
Distantly related sequences: no DNA signal (“saturated”)
14
Comparing related sequences
(D = observed proportion of differing sites; d = actual substitutions per site)
Case 1: human vs chimp
Seq1 A C G T A A A A G C
     |   | | | | | | | |
Seq2 A A G T A A A A G C
How many changes? D=0.1 d=?
Case 2: aardvark vs human
Seq1 A C G T A A A A G C
     | |         |
Seq2 A C A C G G A T A G
How many changes? D=0.7 d=?
Need to compensate for multiple hits.
G 100 mya G
G  90 mya G
G  70 mya C
A  50 mya C
C  30 mya C
C  10 mya G
A     now G
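The slide leaves "d=?" open and does not say which correction to use. One common choice is the Jukes-Cantor model, d = -3/4 ln(1 - 4D/3), which assumes all substitutions are equally likely. The sketch below applies it to the two cases; treat the model choice as an assumption, and the function names as illustrative.

```python
import math


def p_distance(a, b):
    """Observed proportion of differing sites (the slide's D)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Estimated substitutions per site (d) under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("distance is undefined at or above 0.75 (saturated)")
    return -0.75 * math.log(1 - 4 * p / 3)


seq1 = "ACGTAAAAGC"
case1 = p_distance(seq1, "AAGTAAAAGC")  # human vs chimp, D = 0.1
case2 = p_distance(seq1, "ACACGGATAG")  # aardvark vs human, D = 0.7
print(case1, round(jukes_cantor(case1), 3))  # d is about 0.107
print(case2, round(jukes_cantor(case2), 3))  # d is about 2.031
```

For the close pair the correction barely matters; for the distant pair the estimated d is nearly three times the observed D, which is the "multiple hits" effect.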
15
Multiple substitution
[Figure: lineages descending from an ancestral G, contrasting what really happened with the differences we can see (1 seen, 0 seen, 1 seen)]
Greater distance: multiple substitution more likely
16
EBI: loads of options
17
T-Coffee
Minimal input parameters and STILL does a better job than ClustalW
18
Output: EBI ClustalW
Pairwise distance etc.
Alignment
Guide tree
What you submitted
Jalview alignment editor
19
An alignment fragment
ACT_CANAL  -MDGEEVAALIIDNGSGMCKA
ACT_CANDU  -MDGEEVAALVIDNGSGMCKA
ACT_PICAN  -MDGEDVAALVIDNGSGMCKA
ACT_PICPA  -MDGEDVAALVIDNGSGMCKA
ACT_KLULA  -MDS-EVAALVIDNGSGMCKA
ACT_YEAST  -MDS-EVAALVIDNGSGMCKA
ACT_YARLI  -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO --MDDEIQAVVIDNGSGMCKA
                :  *:::**.******
* All AA in column identical
: AA similar size & hydrophobicity
. AA similar size or hydrophobicity
ClustalW format
20
The alignment, so what next?
Look at it very closely
Hand edit if necessary (probably)
Eliminate problem sequences and redo?
Use the display option best suited to the next step
– PHYLIP format for trees
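In practice you would pick PHYLIP as an output option in the alignment tool, but the format itself is simple enough to sketch. The example below writes strict sequential PHYLIP: a header with the number of sequences and the alignment length, then each name padded or truncated to 10 characters followed by its sequence. The function name `to_phylip` is hypothetical.

```python
def to_phylip(alignment):
    """Render a {name: aligned_sequence} dict as strict sequential PHYLIP."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    lines = [f" {len(names)} {length}"]
    for name in names:
        # Strict PHYLIP reserves exactly 10 characters for the name.
        lines.append(f"{name[:10]:<10}{alignment[name]}")
    return "\n".join(lines)


alignment = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}
print(to_phylip(alignment))
```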
21
Parameter changes
Substitution matrix: PAM, Gonnet, BLOSUM
– ClustalW chooses which matrix within the family: PAM30 for closely related pairs; PAM120; PAM250 for more distant
– Difficult alignment: a matrix change may help
Gap penalties (open and extend) have optimal values for each family: find them by trial and error.
– ClustalW puts gaps (which are often external loops) near previous gaps (longer loop)
MSA does the grunt work. YOU do the fine tuning.
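To see why separate "open" and "extend" penalties favour one long gap over several short ones, here is a small sketch of an affine gap cost. The convention used (the opening penalty covers the first gap position, each further position costs one extension) and the values 10 and 0.5 are assumptions for illustration only; tools differ in both.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of the given length under an affine penalty."""
    if length <= 0:
        return 0.0
    # Opening covers the first position; every extra position is an extension.
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_cost(3)
three_short_gaps = 3 * affine_gap_cost(1)
print(one_long_gap, three_short_gaps)  # 11.0 30.0
```

Raising the open penalty makes the aligner more reluctant to start new gaps; lowering the extend penalty makes existing gaps cheap to lengthen, which suits families with variable loops.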
22
Alignment display: WebLogo
Always remember: sequence represents a 3-D structure
23
Patterns to recognise
(more reliable in MSA than in single seq)
Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
Runs of hydrophobic residues
– Interior/buried β-sheet
Residues with 3.5 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
Gaps/indels
– Probably surface, not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
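The helix example WNNWFNNFNNWNNNF can be inspected by listing where the hydrophobic residues fall and how far apart they are. A minimal sketch; the set of residues treated as hydrophobic is an assumption (classifications vary), and the function names are made up.

```python
# Assumed hydrophobic set for illustration; other classifications exist.
HYDROPHOBIC = set("AVILMFWY")


def hydrophobic_positions(seq):
    """Zero-based positions of hydrophobic residues in seq."""
    return [i for i, aa in enumerate(seq) if aa in HYDROPHOBIC]


def spacings(positions):
    """Distances between consecutive positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


positions = hydrophobic_positions("WNNWFNNFNNWNNNF")
print(positions)            # [0, 3, 4, 7, 10, 14]
print(spacings(positions))  # [3, 1, 3, 3, 4]
```

Hydrophobic residues recurring every 3 or 4 positions end up on the same face of a helix, whereas a strict alternation (spacing of 2) would point towards the surface β-sheet pattern.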
24
Conserved residues
W, F, Y: large hydrophobic, internal/core
– conserved WFY: best signal for domains
G, P: turns; can mark the end of an α-helix or β-sheet
C: conserved with reliable spacing suggests C-C disulphide bridges (e.g. defensins)
H, S: often catalytic sites in proteases (and other enzymes)
K, R, D, E: charged; ligand binding or salt bridge
L: very common AA but not conserved
– except in the leucine zipper: L234567L234567L234567L
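The leucine zipper spacing (an L at every seventh position) is easy to search for with a regular expression. This is a rough sketch: requiring four leucines follows the slide's L234567L234567L234567L pattern, not a validated motif definition, and the names are hypothetical.

```python
import re

# An L, then three repeats of "any six residues followed by L".
LEUCINE_HEPTAD = re.compile(r"L(?:.{6}L){3}")


def has_leucine_repeat(seq):
    """True if seq contains four leucines spaced exactly seven residues apart."""
    return LEUCINE_HEPTAD.search(seq) is not None


print(has_leucine_repeat("L234567L234567L234567L"))  # True
print(has_leucine_repeat("LLLL"))                    # False
```

Because L is so common, a hit in a single sequence means little; the pattern is more convincing when the same columns carry L across the whole alignment.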
25
Finish with an alignment: defensins
3 pairs of C residues: 3 disulphide bridges