Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2002.10 Multiple alignments, PATTERNS, PSI-BLAST.


Overview. Multiple alignments: how-to, goal, problems, use. Patterns: PROSITE database, syntax, use. PSI-BLAST: BLAST, matrices, use. [Profiles/HMMs] …

What is a multiple sequence alignment? What can it do for me? How can I produce one of these? How can I use it? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

How can I use a multiple alignment? Extrapolation: SwissProt sequences aligned with an unknown sequence. Homology? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM unknown AKDDRIRYDNEMKSWEEQMAE * :.*. :

How can I use a multiple alignment? Extrapolation. Prosite patterns: SwissProt, unknown sequence. Match? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

How can I use a multiple alignment? Extrapolation. Prosite patterns. Prosite profiles: more sensitive, more specific. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : [Figure: profile table with residue labels A F D E F G H Q I V L W and the annotations "L?" and "K>R".]

How can I use a multiple alignment? Phylogeny: evolution, paralogy/orthology. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : [Figure labels: chite, wheat, trybr, mouse.]

How can I use a multiple alignment? Phylogeny. Structure prediction. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : PHD for secondary structure prediction: 75% accurate. Threading: improving, but not yet as good.

How can I use a multiple alignment? Phylogeny. Structure prediction. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Caution! Automatic multiple sequence alignment methods are not always perfect…

The problem: why is it difficult to compute a multiple sequence alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * Computation: what is the good alignment? Biology: what is a good alignment?

The problem: why is it difficult to compute a multiple sequence alignment? CIRCULAR PROBLEM: good sequences <-> good alignment.

What do I need to know to make a good multiple alignment? How do sequences evolve? How does the computer align the sequences? How can I choose my sequences? What is the best program? How can I use my alignment?

An alignment is a story. ADKPKRPLSAYMLWLN -> (mutations + selection) -> ADKPRRPLS-YMLWLN and ADKPKRPKPRLSAYMLWLN. Aligned: ADKPRRP---LS-YMLWLN / ADKPKRPKPRLSAYMLWLN. Events: insertion, deletion, mutation.
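The events named on this slide can be read off an alignment column by column. A minimal sketch, assuming the two aligned rows shown on the slide; the function name `classify_columns` and the category labels are illustrative and not part of the original deck. Whether a gap column is called an insertion or a deletion depends on which sequence is taken as the ancestor, so the sketch only reports which row holds the gap.

```python
def classify_columns(row_a: str, row_b: str) -> dict:
    """Count identical, substituted, and gapped columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mutation": 0, "gap_in_a": 0, "gap_in_b": 0}
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of gaps only carries no information
        if x == "-":
            counts["gap_in_a"] += 1
        elif y == "-":
            counts["gap_in_b"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mutation"] += 1
    return counts


# The two aligned rows from the slide.
ROW_A = "ADKPRRP---LS-YMLWLN"
ROW_B = "ADKPKRPKPRLSAYMLWLN"
print(classify_columns(ROW_A, ROW_B))
```

The single substituted column is the K/R exchange, and the gap columns fall in the first row at the positions of KPR and A in the second row.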

Homology. Same sequences -> same origin? -> same function? -> same 3D fold? [Figure: % sequence identity plotted against length, with marks at 30% identity and length 100; regions labeled "Same 3D Fold" and "Twilight Zone".]
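The identity axis of this figure can be computed directly from two aligned rows. A minimal sketch, assuming identity is counted over columns where both sequences carry a residue (other definitions exist and give different numbers). The helper `above_twilight_zone` is a hypothetical rule of thumb built from the 30% and 100-residue marks on the slide, not a published threshold function.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


def above_twilight_zone(identity_pct: float, aligned_length: int,
                        min_identity: float = 30.0, min_length: int = 100) -> bool:
    """Rule of thumb (assumed here): trust the 3D-fold inference only above both marks."""
    return identity_pct >= min_identity and aligned_length >= min_length


# First alignment block of two sequences from the slides.
CHITE = "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD"
MOUSE = "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP"
identity = percent_identity(CHITE, MOUSE)
print(f"identity: {identity:.1f}%")
print("above twilight zone:", above_twilight_zone(identity, len(CHITE)))
```

With a block this short the length criterion fails regardless of the identity value, which is the point of plotting identity against length rather than using identity alone.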

Convergent evolution. AFGP with (ThrAlaAla)n similar to trypsinogen. AFGP with (ThrAlaAla)n NOT similar to trypsinogen. [Figure labels: N, S.] Chen et al, 97, PNAS, 94,

Residues and mutations. All residues are equal, but some more than others… [Figure: amino acids (P G S C L I T V A W Y F Q H K R E D N M) grouped by property: aliphatic, aromatic, hydrophobic, polar, small.] Accurate matrices are data driven rather than knowledge driven.

Substitution matrices. Different flavors: PAM: 250, 350; BLOSUM: 45, 62; …

What is the best substitution matrix? Mutation rates depend on families. Choosing the right matrix may be tricky. Gonnet250 > BLOSUM62 > PAM250. Depends on the family, the program used and its tuning. [Table with columns Family, S, N and rows Histone, Insulin, Interleukin I, Globin, Apolipoprot. AI, Interferon G; values lost in extraction except the fused entry "Histone36.40".] Rates in substitutions/site/billion years as measured on mouse vs. human (0.08 billion years).

Insertions and deletions? [Figure: indel cost plotted against indel length L.] Affine gap penalty: Cost = GOP + GEP*L, where GOP is the gap opening penalty and GEP the gap extension penalty.
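The formula on this slide can be applied to every gap run in an aligned row. A minimal sketch, assuming the slide's convention Cost = GOP + GEP*L (some programs charge GEP for L-1 positions instead); the default values 10.0 and 0.5 are placeholders chosen for illustration, not parameters taken from the deck.

```python
import re


def affine_gap_cost(length: int, gop: float = 10.0, gep: float = 0.5) -> float:
    """Cost of one gap of the given length: GOP + GEP * L (0 for no gap)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gop + gep * length


def total_gap_cost(aligned_row: str, gop: float = 10.0, gep: float = 0.5) -> float:
    """Sum the affine cost over every maximal run of '-' in an aligned row."""
    runs = re.findall(r"-+", aligned_row)
    return sum(affine_gap_cost(len(run), gop, gep) for run in runs)


# Two gaps (lengths 3 and 1) cost more than one gap of length 4,
# because the opening penalty is paid twice.
print(total_gap_cost("ADKPRRP---LS-YMLWLN"))  # gaps of length 3 and 1
print(affine_gap_cost(4))                     # a single gap of length 4
```

Because the opening penalty dominates, an aligner using this cost prefers a few long indels over many short ones.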

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! 2 Globins => 1 sec

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! Globins => 2 min

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! Globins => 5 hours

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! Globins => 3 weeks

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! Globins => 9 years

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! 7 Globins => 1000 years

How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required! Globins => years
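The running times on these slides grow so fast because exact dynamic programming over N sequences fills an N-dimensional table. A rough sketch of that growth, assuming every sequence has the same length and that each cell looks at 2^N - 1 neighbouring cells; the length of 150 residues is an assumption picked to be globin-like, and the printed figures are operation counts, not the timings quoted on the slides.

```python
def dp_cells(n_sequences: int, length: int) -> int:
    """Number of cells in the full dynamic programming table: (L + 1) ** N."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences: int) -> int:
    """Each cell is reached from 2 ** N - 1 neighbours (every non-empty gap/residue choice)."""
    return 2 ** n_sequences - 1


ASSUMED_LENGTH = 150  # hypothetical sequence length, roughly the size of a globin

for n in range(2, 8):
    work = dp_cells(n, ASSUMED_LENGTH) * predecessors_per_cell(n)
    print(f"{n} sequences: about {work:.2e} cell updates")
```

Each added sequence multiplies the work by a factor on the order of the sequence length, which is why heuristics are required beyond a handful of sequences.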

Existing methods. 1. Carrillo and Lipman: MSA, DCA; few, small, closely related sequences. 2. Segment-based: DIALIGN, MACAW; may align too few residues; do well when they can run. 3. Iterative: HMMs, HMMER, SAM; slow, sometimes inaccurate; good profile generators. 4. Progressive: ClustalW, Pileup, Multalign…; fast and sensitive.

Progressive alignment. Feng and Doolittle, 1980; Taylor 1981. Dynamic programming using a substitution matrix.
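Progressive methods repeat a pairwise dynamic programming step along the guide tree. The pairwise step alone can be sketched as a global (Needleman and Wunsch style) alignment. This is a simplified illustration and not the procedure of any particular program: a match/mismatch score stands in for the substitution matrix the slide mentions, the gap penalty is linear rather than affine, and all parameter values are arbitrary assumptions.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Toy stand-in for a substitution matrix lookup.
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell takes the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


aligned_a, aligned_b, best = needleman_wunsch("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN")
print(aligned_a)
print(aligned_b)
print("score:", best)
```

A progressive aligner would apply the same recurrence to pairs of already aligned groups, scoring columns against columns instead of residues against residues.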

Progressive alignment. Feng and Doolittle, 1980; Taylor. Depends on the ORDER of the sequences (tree). Depends on the CHOICE of the sequences. Depends on the PARAMETERS: substitution matrix; penalties (GOP, GEP); sequence weight; tree-making algorithm.

Selecting sequences from a BLAST output

A common mistake: sequences too closely related. Identical sequences bring no information. Multiple sequence alignments thrive on diversity. PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE PRVA_RAT SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE :**::*.*******:***:* :****************..::******:*********** PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES PRVA_RAT DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES :*** ******.******.**** *:************.:******:**
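One way to avoid this mistake is to discard near-duplicates before aligning. A minimal sketch of a greedy filter, assuming the candidate sequences are already roughly aligned (for example, taken from a first-pass alignment); the function names, the toy sequences, and the 90% cut-off are hypothetical choices for illustration, not values given in the slides.

```python
def identity_fraction(row_a: str, row_b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def drop_near_duplicates(aligned: dict, max_identity: float = 0.9) -> dict:
    """Keep a sequence only if it is at most max_identity identical to every kept one."""
    kept = {}
    for name, row in aligned.items():
        if all(identity_fraction(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


# Toy input: seq_b duplicates seq_a, seq_c is more divergent.
toy = {
    "seq_a": "ACDEFGHIKL",
    "seq_b": "ACDEFGHIKL",
    "seq_c": "ACDQFGWIKV",
}
print(list(drop_near_duplicates(toy)))
```

The result depends on input order, since the first member of each redundant group is the one retained; sorting the input by annotation quality beforehand is one possible design choice.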

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Respect information! PRVA_MACFU SMTDLLN----AEDIKKA PRVA_HUMAN SMTDLLN----AEDIKKA PRVA_GERSP SMTDLLS----AEDIKKA PRVA_MOUSE SMTDVLS----AEDIKKA PRVA_RAT SMTDLLS----AEDIKKA PRVA_RABIT AMTELLN----AEDIKKA TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :*..*:::: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :.. *.*..:*: *: * *. :::..:*:::**:.*:*: :** : PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES- PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES- PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES- PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE *:... ::.: : *: ***:.**:*. :** :: -This alignment is not informative about the relation between TPCC MOUSE and the rest of the sequences. -A better spread of the sequences is needed

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Selecting diverse sequences PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE : *:.:..*.:*. * ** *: * : * :* * **:** PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA- PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ- PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA- PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA- PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA :**.*:.*.* *: ** ::.* **** **::** ** -A REASONABLE model now exists. -Going further:remote homologues.

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Aligning remote homologues PRVA_MACFU SMTDLLNA----EDIKKA PRVA_ESOLU AKDLLKA----DDIKKA PRVB_CYPCA AFAGVLND----ADIAAA PRVB_BOACO AFAGILSD----ADIAAG PRV1_SALSA MACAHLCKE----ADIKTA PRVB_LATCH AVAKLLAA----ADVTAA PRVB_RANES SITDIVSE----KDIDAA TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :..:... *: * : * :* :.*:*: :**. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA- PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-- PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-- PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE ::.. :: : ::.* :.** *. :** ::

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Going further… PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI. :... ::. : * :* :.* *. : *. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE- TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA :. :: : :: * :..* :. :** ::

What makes a good alignment… The more divergent the sequences, the better. The fewer indels, the better. Nice ungapped blocks separated by indels. Different classes of residues within a block: completely conserved; size and hydropathy conserved; size or hydropathy conserved. The ultimate evaluation is a matter of personal judgment and knowledge.
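The residue classes listed on this slide are what the marker line under an alignment summarizes. A minimal sketch that marks each column as fully conserved ('*'), conserved within one property group (':'), or neither (' '). The groups in `GROUPS` are a simplified, hypothetical classification chosen for illustration; they are not the groups used by ClustalW or any other program, and the sketch does not distinguish "size and hydropathy" from "size or hydropathy".

```python
# Hypothetical property groups, for illustration only.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG"]


def conservation_line(rows: list) -> str:
    """Return one marker per column: '*' identical, ':' same group, ' ' otherwise."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")  # gapped columns are never marked as conserved
        elif len(residues) == 1:
            marks.append("*")
        elif any(residues <= set(group) for group in GROUPS):
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


block = ["AKLD-", "AKIES", "ARVDS"]
for row in block:
    print(row)
print(conservation_line(block))
```

Long runs of '*' and ':' without gaps correspond to the "nice ungapped blocks" the slide asks for; the final judgment still rests on biological knowledge, as the slide notes.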

Avoiding pitfalls

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Keep a biological perspective chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL- wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS * ***.:: ::... : *... : *. *: * chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G- wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA trybr RKVYEEMAEKDKERY----K--RE-M mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE----- : : * :.* : DIFFERENT PARAMETERS

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Do not overtune!!! DO NOT PLAY WITH PARAMETERS! IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF! chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. *.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

Choosing the right method. [Table: PROBLEM against PROGRAM (ClustalW, MSA, DIALIGN II) and METHOD; cell contents lost in extraction.] Source: BAliBASE, Thompson et al, NAR, 1999

Conclusion. The best alignment method: your brain, the right data. The best evaluation method: your eyes, experimental information (SwissProt). What can I conclude? Homology -> information extrapolation. How can I go further? Patterns, profiles, HMMs …