An Introduction to Multiple Sequence Alignments Cédric Notredame.



chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

Mangel M., Samaniego F.J., Abraham Wald’s Work on Aircraft Survivability, J. American Statistical Association 79 (1984)

Our Scope
-How Can I Use My Alignment?
-How Does The Computer Align The Sequences?
-How Can I Assemble a Multiple Alignment?
-What Are the Difficulties?

Outline
-Why Do We Need Multiple Sequence Alignment?
-The Progressive Alignment Algorithm
-A Possible Strategy…
-Potential Difficulties

Pre-requisite
-How Do Sequences Evolve?
-How Can We COMPARE Sequences?
-How Can We ALIGN Sequences?

Why Do We Need Multiple Sequence Alignment ?

Sometimes Two Sequences Are Not Enough… The man with TWO watches NEVER knows the time

What Is A Multiple Sequence Alignment?
-Structural Criterion: residues are arranged so that those playing a similar role end up in the same column.
-Evolutionary Criterion: residues are arranged so that those having the same ancestor end up in the same column.

Phylogenetic Relation / Functional Relation

How Can I Use A Multiple Sequence Alignment?
Extrapolation Beyond The Twilight Zone
SwissProt vs. Unknown Sequence: Homology?
Less Than 30% identity BUT Conserved where it MATTERS

chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
        ***. :::.:... :.. *. *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM
unknown AKDDRIRYDNEMKSWEEQMAE
        * :.*. :

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Prosite Patterns
P-K-R-[PA]-x(1)-[ST]…
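As an illustration of how such a pattern is used, the sketch below translates the visible prefix of the slide's pattern (the part before the "…", which stays elided) into a regular expression and searches the four ungapped sequences from the first alignment block. The converter `prosite_to_regex` is a hypothetical helper written for this example and covers only the basic syntax elements (residues, `[..]`, `{..}`, `x`, repeat counts); it is not the PROSITE scanning software.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as x(1) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("["):
            regex = core                      # [PA]: one of the listed residues
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # {PA}: any residue except these
        else:
            regex = re.escape(core)           # a single required residue
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)

# First alignment block from the slides; gaps are removed before searching.
ROWS = {
    "chite": "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",
    "wheat": "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",
    "mouse": "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",
}

regex = prosite_to_regex("P-K-R-[PA]-x(1)-[ST]")
hits = {}
for name, row in ROWS.items():
    match = re.search(regex, row.replace("-", ""))
    hits[name] = match.group() if match else None

for name, hit in hits.items():
    print(name, hit)
```

Every sequence in the alignment matches, which is expected because the pattern was read off these same columns; the interesting test is whether an uncharacterised database sequence matches as well.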

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Prosite Patterns
SwissProt: does an Uncharacterised sequence Match the Signature?

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Prosite Patterns, Profiles And HMMs
-More Sensitive
-More Specific

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

[Figure labels: L?, K>R; residues A F D E F G H Q I V L W]

A PROSITE PROFILE A Substitution Cost For Every Amino Acid, At Every Position
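To make "a substitution cost for every amino acid, at every position" concrete, here is a minimal sketch that turns an ungapped alignment block into a table of per-column log-odds scores. It is a deliberately simplified stand-in, not the PROSITE profile construction method: the uniform background frequency and the pseudocount of 1 are assumptions made for brevity, and `build_profile` and `score` are hypothetical names. The block is the nine ungapped columns starting at the K/A column that precedes the conserved PKR in the slides' alignment.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(block, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*block):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def score(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# chite, wheat, trybr, mouse: ungapped columns taken from the alignment.
BLOCK = ["KPKRPLSAY", "KPKRAPSAF", "APKRAMTSF", "KPKRPRSAY"]
profile = build_profile(BLOCK)

print(round(score(profile, "KPKRPLSAY"), 2))   # a family member
print(round(score(profile, "GGGGGGGGG"), 2))   # an unrelated segment
```

Unlike a pattern, the table gives a graded score: a residue that was never observed in a column is penalised rather than forbidden.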

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Motifs/Patterns, Profiles, Phylogeny
Phylogeny (chite, wheat, trybr, mouse):
-Evolution
-Paralogy/Orthology

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Motifs/Patterns, Phylogeny, Profiles, Struc. Prediction
Column Constraint ↔ Evolution Constraint ↔ Structure Constraint

How Can I Use A Multiple Sequence Alignment?
Extrapolation: Motifs/Patterns, Phylogeny, Profiles, Struc. Prediction
-PsiPred or PHD for secondary structure prediction: 75% accurate.
-Threading: improving, but not yet as good.

How Can I Use A Multiple Sequence Alignment?
Automatic Multiple Sequence Alignment methods are not always perfect…
You know better… With your big BRAIN

Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CROSSROAD PROBLEM
-BIOLOGY: What is A Good Alignment?
-COMPUTATION: What is THE Good Alignment?

The Biological Problem: Same as the Pairwise Alignment Problem
-We do NOT know how Sequences Evolve.
-We do NOT understand the Relation Between Structures and Sequences.
-We would NOT recognize the Correct Alignment if we had it IN FRONT of our eyes…

The Biological Problem. The Charlie Chaplin Paradox

The Biological Problem: How to Evaluate an Alignment
-Substitution Matrix (Blosum)
-Gap Penalties
-A nice set of Sequences
-An Evaluation Function

Sums of Pairs: a column containing A, A, A, C, C has Cost=6
-Easy to compute
-Over-estimation of the Substitutions
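The Cost=6 figure can be reproduced with a few lines, assuming the column holds three A and two C and that every mismatching pair costs 1 while matching pairs cost 0 (a unit-cost reading chosen here because it agrees with the slide; a real evaluation would use a substitution matrix). The function names are hypothetical.

```python
from itertools import combinations

def sp_column_cost(column, mismatch=1):
    """Sum-of-pairs cost of one column: every differing pair adds `mismatch`."""
    return sum(mismatch for a, b in combinations(column, 2) if a != b)

def sp_alignment_cost(rows, mismatch=1):
    """Sum-of-pairs cost of a gap-free alignment given as equal-length rows."""
    return sum(sp_column_cost(column, mismatch) for column in zip(*rows))

print(sp_column_cost("AAACC"))   # 3 x 2 = 6 mismatching pairs
```

The over-estimation is visible here: if the two C sequences are sisters in the tree, a single substitution explains the column, yet the pairwise sum charges for it six times.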

The COMPUTATIONAL Problem: Producing the Alignment
-Substitution Matrix (Blosum)
-An Evaluation Function
-Gap Penalties
-A nice set of Sequences
-An Alignment Algorithm: GLOBAL Alignment
Will It Work?

HOW CAN I ALIGN MANY SEQUENCES
2 Globins => 1 min
3 Globins => 2 hours
4 Globins => 10 days
5 Globins => 3 years
6 Globins => 300 years
7 Globins => […] years
8 Globins => 3 million years
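The explosion comes from the size of the dynamic-programming lattice: aligning N sequences exactly means filling an N-dimensional table with one axis per sequence, and each cell looks back at up to 2^N - 1 neighbouring cells. The sketch below only counts cells and moves; it makes no claim about the running times quoted on the slides, and the sequence length of 150 residues is an assumed round figure for a globin.

```python
def lattice_cells(n_sequences, length):
    """Cells in the exact DP table: one axis of (length + 1) per sequence."""
    return (length + 1) ** n_sequences

def moves_per_cell(n_sequences):
    """Predecessors of a cell: each sequence either advances or takes a gap,
    excluding the case where none advances."""
    return 2 ** n_sequences - 1

for n in range(2, 9):
    print(n, lattice_cells(n, 150), moves_per_cell(n))
```

Each extra sequence multiplies the table size by about the sequence length, which is why exact methods stop being practical after a handful of sequences.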

The Progressive Multiple Alignment Algorithm (Clustal W)

Making An Alignment
-Any Exact Method would be TOO SLOW.
-We will use a Heuristic Algorithm.
-The Progressive Alignment Algorithm is the most popular:
 -Fast
 -ClustalW
 -Greedy Heuristic (No Guarantee)

Progressive Alignment (Feng and Doolittle, 1988; Taylor 1989): Clustering
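The clustering step decides in which order sequences are merged. As a rough illustration, the sketch below derives a merge order by average-linkage (UPGMA-style) clustering of pairwise distances. This is a stand-in chosen for brevity, not necessarily the tree-building method of any particular program, and the distance values are hypothetical.

```python
from itertools import combinations

def upgma_merge_order(names, dist):
    """Return the list of (cluster, cluster) merges, closest pair first."""
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    def average(c1, c2):
        # Average-linkage: mean distance over all cross-cluster pairs.
        return sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Hypothetical distances between four sequences A, B, C and D.
DIST = {("A", "B"): 0.1, ("C", "D"): 0.2,
        ("A", "C"): 0.6, ("A", "D"): 0.6, ("B", "C"): 0.6, ("B", "D"): 0.6}

for left, right in upgma_merge_order(["A", "B", "C", "D"], DIST):
    print(left, "+", right)
```

The merge order is then followed from the leaves to the root: the closest sequences are aligned first, and each later step aligns a sequence or an alignment to an existing alignment.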

Progressive Alignment: Dynamic Programming Using A Substitution Matrix
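The dynamic-programming step repeated at each node of the tree can be sketched for the simplest case, two plain sequences. The version below is a standard global (Needleman-Wunsch) alignment with a match/mismatch score and a linear gap penalty. These are simplifying assumptions: a real progressive aligner scores with a substitution matrix, uses affine gaps, and aligns whole alignments to each other.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, row_a, row_b = needleman_wunsch("FAT CAT", "FAST CAT")
print(best)
print(row_a)
print(row_b)
```

Once two sequences have been aligned this way, the gaps introduced are frozen; later steps can only add more, which is the source of the greedy behaviour.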

-Depends on the ORDER of the sequences (Tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS:
 Substitution Matrix
 Penalties (Gop, Gep)
 Sequence Weight
 Tree-making Algorithm
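Of these parameters, the gap penalties are the easiest to illustrate. The sketch below charges each run of gaps an opening penalty (Gop) plus an extension penalty (Gep) per gap position. Conventions differ between programs (some charge the extension only from the second position onward), so the formula `gop + gep * length` is an assumption, and `gap_penalty` is a hypothetical helper.

```python
import re

def gap_penalty(row, gop=-1.0, gep=0.0):
    """Total affine gap penalty of one aligned row (gaps written as '-')."""
    total = 0.0
    for run in re.findall(r"-+", row):
        total += gop + gep * len(run)   # one opening, then per-position cost
    return total

# With Gop=-1 and Gep=0, a long gap costs the same as a short one.
print(gap_penalty("THE ---- FA-T CAT", gop=-1, gep=0))
# With a non-zero extension penalty, length starts to matter.
print(gap_penalty("THE ---- FA-T CAT", gop=-1, gep=-0.5))
```

Raising Gop relative to Gep favours a few long gaps over many short ones, which is usually the biologically preferred outcome.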

Progressive Alignment: When Does It Work?
-Works Well When the Phylogeny is Dense
-No Outlier Sequence
Image: River Crossing

Progressive Alignment: When Doesn’t It Work?

CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CORRECT (Score=24)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST ---- CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

Input sequences:
GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT
GARFIELD THE VERY FAST CAT
THE FAT CAT

First pair aligned:
GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT ---

Second pair aligned:
GARFIELD THE VERY FAST CAT
-------- THE ---- FA-T CAT

The two pairwise alignments aligned to each other:
GARFIELD THE LAST FA-T CAT
GARFIELD THE FAST CA-T ---
GARFIELD THE VERY FAST CAT
-------- THE ---- FA-T CAT

Building the Right Multiple Sequence Alignment.

Recognizing The Right Sequences When you Meet Them…

Gathering Sequences: BLAST

Common Mistake: Sequences Too Closely Related

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

-IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE SEQUENCE ALIGNMENT
-MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…
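One simple way to avoid this mistake is to measure pairwise identity and keep only one representative of each group of near-identical sequences. The sketch below does this greedily on the first 20 columns of three of the parvalbumins above; the 90% cut-off and the function names are hypothetical choices for illustration, not a recommendation from the slides.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def drop_redundant(aligned, max_identity=0.9):
    """Greedily keep a sequence only if it is not too similar to one already kept."""
    kept = {}
    for name, seq in aligned.items():
        if all(identity(seq, other) <= max_identity for other in kept.values()):
            kept[name] = seq
    return kept

# First 20 aligned columns of three sequences from the slide.
ALIGNED = {
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_RABIT": "AMTELLNAEDIKKAIGAFAA",
}

for name in drop_redundant(ALIGNED):
    print(name)
```

The result depends on input order, since the first member of each redundant group is the one retained; sorting by annotation quality beforehand is one way to control which representative survives.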

Sequence Weighting Within ClustalW

Selecting Diverse Sequences (Opus II)

Respect Information!
This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.

PRVA_MACFU SMTDLLN----AEDIKKA
PRVA_HUMAN SMTDLLN----AEDIKKA
PRVA_GERSP SMTDLLS----AEDIKKA
PRVA_MOUSE SMTDVLS----AEDIKKA
PRVA_RAT   SMTDLLS----AEDIKKA
PRVA_RABIT AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : :*..*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

-A better Spread of the Sequences is needed

Selecting Diverse Sequences (Opus II)

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
           : *:.:..*.:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
           :**.*:.*.* *: ** ::.* **** **::** **

-A REASONABLE Model Now Exists.
-Going Further: Remote Homologues.

Aligning Remote Homologues

PRVA_MACFU SMTDLLNA----EDIKKA
PRVA_ESOLU AKDLLKA----DDIKKA
PRVB_CYPCA AFAGVLND----ADIAAA
PRVB_BOACO AFAGILSD----ADIAAG
PRV1_SALSA MACAHLCKE----ADIKTA
PRVB_LATCH AVAKLLAA----ADVTAA
PRVB_RANES SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
           :..:... *: * : * :* :.*:*: :**.

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
           ::.. :: : ::.* :.** *. :** ::

Some Guidelines …

Do Not Use Too Many Sequences…

Reading Your Alignment

Going Further…

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE  SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI
           . :... ::. : * :* :.* *. : *.

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE  LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA
           :. :: : :: * :..* :. :** ::

WHAT MAKES A GOOD ALIGNMENT…
-THE MORE DIVERGENT THE SEQUENCES, THE BETTER
-THE FEWER INDELS, THE BETTER
-NICE UNGAPPED BLOCKS SEPARATED BY INDELS
-DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:
 Completely Conserved
 Conserved For Size and Hydropathy
 Conserved For Size or Hydropathy
-THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT AND KNOWLEDGE.

Potential Difficulties

DO NOT OVERTUNE!!!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

DO NOT PLAY WITH PARAMETERS. IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :*:.:... :.. *. *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

TUNING or NOT TUNING!!!
-MOST METHODS ARE TUNED TO WORK WELL ON AVERAGE.
-PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE THEORY (e.g. Substitution Matrices).
-A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. it changes little).
-TUNE IF YOU WANT TO CONVINCE YOURSELF.
-PARAMETERS TO TUNE USUALLY INCLUDE: GOP/GEP, MATRIX, SENSITIVITY Vs SPEED

[Figure axes: GOP, GEP]

Substitution Matrices (Etzold et al. 1993)
Gonnet  61.7 %
Blosum  […] %
Pam     […] %

KEEP A BIOLOGICAL PERSPECTIVE

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *

DIFFERENT PARAMETERS: WRONG ALIGNMENT !!!

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
      * ***.:: ::... : *... : *. *: *

REPEATS
-THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE SAME NUMBER OF REPEATS.
-IT IS THEN BETTER TO EXTRACT THE REPEATS MANUALLY AND ALIGN THEM.
-INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER.
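The idea behind spotting repeats on a dot plot can be sketched in a few lines: compare a sequence with itself and look for matches away from the main diagonal. The code below is a crude exact-word version written for illustration only. It is not Dotter, which scores windows with a substitution matrix, and the toy sequence and word length are hypothetical.

```python
def self_dotplot_offsets(seq, word=3):
    """Count exact word matches of a sequence against itself, per diagonal.

    The key is the distance between the two matching positions (the main
    diagonal, distance 0, is skipped); the value is the number of dots.
    """
    positions = {}
    for i in range(len(seq) - word + 1):
        positions.setdefault(seq[i:i + word], []).append(i)
    offsets = {}
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                distance = starts[b] - starts[a]
                offsets[distance] = offsets.get(distance, 0) + 1
    return offsets

# Toy sequence made of the same six-residue unit twice.
print(self_dotplot_offsets("ACDEFGACDEFG"))
```

A diagonal that collects many dots marks the repeat period; the repeat units can then be cut out at that spacing and aligned to each other separately.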

Naming Your Sequences The Right Way

What Are The Available Methods ???

Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the hyperspace
-Only a few small, closely related sequences.
-Does well when it can run.
-Memory and CPU hungry

Simultaneous Alignments: DCA
-Only a few small, closely related sequences, but less limited than MSA
-Does well when it can run.
-Memory and CPU hungry, but less than MSA

Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.

Muscle

Progressive Iterative Methods
-HMMs, HMMER, SAM, MUSCLE
-Slow, Sometimes Inaccurate
-Good Profile Generators

Progressive MUSCLE phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

MAFFT: Fast Fourier Transform

Prank

SATCHMO

Mixing Heterogeneous Data With T-Coffee
Local Alignment, Global Alignment, Multiple Sequence Alignment, Multiple Alignment, Structural, Specialist

Mixing Sequences and Structures with T-Coffee
Struct Vs Struct, Seq Vs Struct, Seq Vs Seq
Superpose, Thread, Local, Global
Evaluation on Homstrad

What is The Best Method ?

A better Question… What is the Best Alignment ? What is the best bit of my alignment ?

What is the Local Quality of my Alignment?

Choosing the right method

Situation → Solution

Priority → Solution
Method Priority: Trees, Profile, 2D-Pred, 3D-Pred, Func-Pred
Accuracy, Speed

Purpose → Solution

Conclusion

Multiple Alignment
-The BEST alignment Method: Your Brain and The Right Data
-Beware of repeated elements
-The Best Evaluation Procedure: Experimental Data (SwissProt)
-Choosing The Sequences Well is Important

Multiple Alignment
Know Your Problem: What do you want to do with your MSA?

Addresses
MAFFT   Progressive/Iterative     www.biophys.kyoto-u.jp/katoh
POA     Progressive/Simultaneous  www.bioinformatics.ucla.edu/poa
MUSCLE  Progressive/Iterative     www.drive5.com/muscle