Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Multiple alignments, PATTERNS, PSI-BLAST.

Presentation transcript:

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Multiple alignments, PATTERNS, PSI-BLAST

Overview
- Multiple alignments: how-to, goal, problems, use
- Patterns: PROSITE database, syntax, use
- PSI-BLAST: BLAST, matrices, use
- [ Profiles/HMMs ] …

What is a multiple sequence alignment?
- What can it do for me?
- How can I produce one of these?
- How can I use it?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

How can I use a multiple alignment?
Extrapolation: SwissProt, Unknown Sequence. Homology?
chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
        ***. :::.:... :.. *. *: *
chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM
unknown AKDDRIRYDNEMKSWEEQMAE
        * :.*. :

How can I use a multiple alignment?
Extrapolation: SwissProt, Unknown Sequence. Match?
Prosite Patterns
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

How can I use a multiple alignment?
Extrapolation, Prosite Patterns
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
L? K>R
Prosite Profiles
-More Sensitive
-More Specific
[Figure: profile matrix with one row per residue type]

How can I use a multiple alignment?
Phylogeny
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
[Figure: tree with leaves chite, wheat, trybr, mouse]
-Evolution
-Paralogy/Orthology

How can I use a multiple alignment?
Phylogeny, Struc. Prediction
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
-PHD for secondary structure prediction: 75% accurate.
-Threading: is improving but is not yet as good.

How can I use a multiple alignment?
Phylogeny, Struc. Prediction
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
Caution! Automatic Multiple Sequence Alignment methods are not always perfect…

The problem: why is it difficult to compute a multiple sequence alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
-Computation: What is the good alignment?
-Biology: What is a good alignment?

The problem: why is it difficult to compute a multiple sequence alignment?
CIRCULAR PROBLEM: Good Sequences <-> Good Alignment

What do I need to know to make a good multiple alignment?
- How do sequences evolve?
- How does the computer align the sequences?
- How can I choose my sequences?
- What is the best program?
- How can I use my alignment?

An alignment is a story
ADKPKRPLSAYMLWLN
Mutations + Selection (Insertion, Deletion, Mutation)
ADKPRRPLS-YMLWLN
ADKPKRPKPRLSAYMLWLN
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Homology
Same sequences -> same origin? -> same function? -> same 3D fold?
[Figure: %Sequence Identity versus Length, with tick marks at 30% and 100; regions labelled "Same 3D Fold" and "Twilight Zone"]
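The percentage of sequence identity on the Homology slide can be computed directly from two rows of an alignment. The sketch below is one possible convention, not the one used for the slide's figure: it assumes that only columns where both rows hold a residue are compared, and the function name `percent_identity` is made up for this illustration. The example rows are the two aligned sequences from the "An alignment is a story" slide.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: columns containing a gap in either row are skipped,
    so the denominator is the number of residue-residue columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # gapped column: not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Rows taken from the "An alignment is a story" slide.
story_identity = percent_identity("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, which gives lower values for gapped alignments, so a threshold such as 30% only means something once the convention is stated.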

Convergent evolution
- AFGP with (ThrAlaAla)n, similar to trypsinogen
- AFGP with (ThrAlaAla)n, NOT similar to trypsinogen
[Figure labels: N, S]
Chen et al, 97, PNAS, 94,

Residues and mutations
All residues are equal, but some more than others…
[Figure: Venn diagram grouping the amino acids into Aliphatic, Aromatic, Hydrophobic, Polar and Small classes]
Accurate matrices are data driven rather than knowledge driven

Substitution matrices
Different flavors:
- PAM: 250, 350
- BLOSUM: 45, 62
- …

What is the best substitution matrix?
- Mutation rates depend on families
- Choosing the right matrix may be tricky
- Gonnet250 > BLOSUM62 > PAM250
- Depends on the family, the program used and its tuning
Family S N: Histone36.40 Insulin Interleukin I Globin Apolipoprot. AI Interferon G
Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (0.08 Billion years)

Insertions and deletions?
[Figure: indel cost plotted against gap length L]
Affine Gap Penalty: Cost = GOP + GEP*L (GOP: gap opening penalty, GEP: gap extension penalty, L: gap length)
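The affine gap formula Cost = GOP + GEP*L can be written out as a small function to see why it favours one long indel over several short ones. This is only a sketch: the function name and the numeric values (GOP = 10, GEP = 0.5) are illustrative assumptions, not values taken from the slides, and treating a gap of length 0 as free is also an assumption.

```python
def affine_gap_cost(length: int, gop: float, gep: float) -> float:
    """Cost of one gap of `length` residues under Cost = GOP + GEP * L."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # assumption: no gap, no cost
    return gop + gep * length


# Illustrative penalties (assumed, not from the slides).
GOP, GEP = 10.0, 0.5

one_long_gap = affine_gap_cost(4, GOP, GEP)                          # a single 4-residue indel
four_short_gaps = sum(affine_gap_cost(1, GOP, GEP) for _ in range(4))  # four 1-residue indels
```

With these numbers a single gap of four residues costs 12.0, while four separate one-residue gaps cost 42.0, because the opening penalty is paid once per gap.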

How to align many sequences?
- Exact algorithms (Needleman & Wunsch, Smith & Waterman) take too much computing time
- -> heuristic required!
2 Globins => 1 sec
3 Globins => 2 min
4 Globins => 5 hours
5 Globins => 3 weeks
Globins => 9 years
7 Globins => 1000 years
8 Globins => years
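The blow-up in running time comes from the size of the dynamic programming lattice: extending Needleman & Wunsch to N sequences of length L means filling (L+1)^N cells, and each cell looks at 2^N - 1 predecessor cells. The sketch below only counts that work; it does not reproduce the timings quoted on the slide, and the sequence length of 150 residues is an assumed round figure for a globin.

```python
def lattice_cells(n_sequences: int, length: int) -> int:
    """Number of cells in the N-dimensional dynamic programming lattice."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences: int) -> int:
    """Each cell is reached from every non-empty subset of sequences advancing by one."""
    return 2 ** n_sequences - 1


def relative_work(n_sequences: int, length: int) -> int:
    """Cell updates needed, up to a constant factor."""
    return lattice_cells(n_sequences, length) * predecessors_per_cell(n_sequences)


GLOBIN_LENGTH = 150  # assumption, for illustration only

# Work for 2..5 sequences relative to the pairwise case.
growth = {
    n: relative_work(n, GLOBIN_LENGTH) / relative_work(2, GLOBIN_LENGTH)
    for n in range(2, 6)
}
```

Each added sequence multiplies the work by roughly 2(L+1), which is why exact methods stop being practical after a handful of sequences.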

Existing methods
1-Carrillo and Lipman: MSA, DCA.
  -Few small, closely related sequences.
2-Segment based: DIALIGN, MACAW.
  -May align too few residues
  -Do well when they can run.
3-Iterative: HMMs, HMMER, SAM.
  -Slow, sometimes inaccurate
  -Good profile generators
4-Progressive: ClustalW, Pileup, Multalign…
  -Fast and sensitive

Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
Dynamic Programming Using A Substitution Matrix

Progressive alignment
Feng and Doolittle, 1980; Taylor
-Depends on the ORDER of the sequences (Tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS: Substitution Matrix. Penalties (Gop, Gep). Sequence Weight. Tree making Algorithm.

Selecting sequences from a BLAST output

A common mistake
- Sequences too closely related
- Identical sequences bring no information
- Multiple sequence alignments thrive on diversity
PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********
PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Respect information! PRVA_MACFU SMTDLLN----AEDIKKA PRVA_HUMAN SMTDLLN----AEDIKKA PRVA_GERSP SMTDLLS----AEDIKKA PRVA_MOUSE SMTDVLS----AEDIKKA PRVA_RAT SMTDLLS----AEDIKKA PRVA_RABIT AMTELLN----AEDIKKA TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :*..*:::: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :.. *.*..:*: *: * *. :::..:*:::**:.*:*: :** : PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES- PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES- PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES- PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE *:... ::.: : *: ***:.**:*. :** :: -This alignment is not informative about the relation between TPCC MOUSE and the rest of the sequences. -A better spread of the sequences is needed

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Selecting diverse sequences PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE : *:.:..*.:*. * ** *: * : * :* * **:** PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA- PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ- PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA- PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA- PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA :**.*:.*.* *: ** ::.* **** **::** ** -A REASONABLE model now exists. -Going further:remote homologues.

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Aligning remote homologues PRVA_MACFU SMTDLLNA----EDIKKA PRVA_ESOLU AKDLLKA----DDIKKA PRVB_CYPCA AFAGVLND----ADIAAA PRVB_BOACO AFAGILSD----ADIAAG PRV1_SALSA MACAHLCKE----ADIKTA PRVB_LATCH AVAKLLAA----ADVTAA PRVB_RANES SITDIVSE----KDIDAA TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :..:... *: * : * :* :.*:*: :**. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA- PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-- PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-- PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE ::.. :: : ::.* :.** *. :** ::

Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF Going further… PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI. :... ::. : * :* :.* *. : *. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE- TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA :. :: : :: * :..* :. :** ::

What makes a good alignment…
- The more divergent the sequences, the better
- The fewer indels, the better
- Nice ungapped blocks separated with indels
- Different classes of residues within a block:
  - Completely conserved
  - Size and hydropathy conserved
  - Size or hydropathy conserved
- The ultimate evaluation is a matter of personal judgment and knowledge
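The classes of conserved residues listed above correspond to the marks printed under the alignments in these slides (`*`, `:` and `.`). The sketch below shows one way such a line could be produced. It rests on two assumptions: the residue groups are the ClustalW-style "strong" and "weak" groups as recalled here, and mapping them onto the slide's "size and hydropathy" and "size or hydropathy" classes is an interpretation, not something the slides state.

```python
# Assumed ClustalW-style conservation groups (quoted from memory).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # any gap: column is not scored
    if len(residues) == 1:
        return "*"  # completely conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall inside one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall inside one weak group
    return " "


def conservation_line(rows) -> str:
    """Build the annotation line for a list of equal-length aligned rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    return "".join(column_symbol(column) for column in zip(*rows))


# Columns KPKRP / KPKRA / APKRA / KPKRP from the chite, wheat, trybr, mouse rows.
block = ["KPKRP", "KPKRA", "APKRA", "KPKRP"]
marks = conservation_line(block)
```

For this block the function yields a blank, three stars and a dot, which agrees with the `***.` that opens the annotation line on the slides.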

Avoiding pitfalls

Keep a biological perspective
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
DIFFERENT PARAMETERS
chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
      * ***.:: ::... : *... : *. *: *
chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr RKVYEEMAEKDKERY----K--RE-M
mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----
      : : * :.* :

Do not overtune!!!
DO NOT PLAY WITH PARAMETERS!
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. *.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :

Choosing the right method
[Table with columns PROBLEM, PROGRAM, METHOD; programs listed: ClustalW, MSA, DIALIGN II]
Source: BaliBase, Thompson et al, NAR, 1999

Conclusion
- The best alignment method: your brain, the right data
- The best evaluation method: your eyes, experimental information (SwissProt)
- What can I conclude? Homology -> information extrapolation
- How can I go further? Patterns, Profiles, HMMs …

The database

History
Founded by Amos Bairoch
1988: First release in the PC/Gene software
1990: Synchronisation with Swiss-Prot
1994: Integration of « profiles »
1999: PROSITE joins InterPro
November 2001: Current release 16.50

Database content
Official Release
~1400 Patterns        PSxxxxx    PATTERN
~100  Profiles        PSxxxxx    MATRIX
4     Rules           PSxxxxx    RULE
~1100 Documentations  PDOCxxxxx
Pre-Release
~250  Profiles        PSxxxxx    MATRIX
~150  Documentations  QDOCxxxxx

Pattern « philosophy »
- Target: definition of sites with biological information: catalytic, metal binding, S-S bridge, cofactor binding, prosthetic group, PTM
- Easy to understand and to design, example: Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]

Pattern syntax
Regular expression (REGEXP) language:
- Positions are separated by a dash « - »
- Amino acids are represented by their single-letter code
- « x » represents any amino acid
- [] group of amino acids acceptable for a position
- {} group of amino acids not acceptable for a position
- () multiple or range, e.g., A(1,3) means 1 to 3 A
- < anchor at beginning of sequence
- > anchor at end of sequence
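Because the pattern syntax is a restricted regular expression language, each element maps onto standard regex notation. The converter below is a minimal sketch covering only the constructs listed on the slide; the function name `prosite_to_regex` is made up for this illustration, and rarer constructs of the real PROSITE syntax (for example anchors inside square brackets) are deliberately not handled.

```python
import re

# One pattern element: a residue specification plus an optional repeat.
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"   # x, single letter, [allowed], {forbidden}
    r"(?:\((\d+(?:,\d+)?)\))?"           # optional (n) or (n,m)
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # entries end the PA line with a period
    pieces = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchor at beginning of sequence
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchor at end of sequence
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        if residue == "x":
            residue = "."                  # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # excluded residues
        quantifier = "{" + repeat + "}" if repeat else ""
        pieces.append(prefix + residue + quantifier + suffix)
    return "".join(pieces)


# The example pattern from the "Pattern philosophy" slide.
example_regex = prosite_to_regex("Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]")
```

The resulting string can be passed to `re.search` to scan a protein sequence held as a plain uppercase string. The dedicated tools listed later in the deck remain the better choice for real searches.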

Profile « philosophy »
- Aim: identification of domains and not protein families
- Gene discovery vs automatic annotation
- Importance of score and calibration
- Possible manual tuning (by a well trained expert… ;-)
  -> allowed by the profile syntax
  -> no direct link to multiple alignment

Database content: PATTERN
ID   UCH_2_1; PATTERN.
AC   PS00972;
DT   JUN-1994 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 signature 1.
PA   G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q.
NR   /RELEASE=38,80000;
NR   /TOTAL=41(41); /POSITIVE=41(41); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=2; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=7,active_site(?);
DR   Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T; P55824, FAF_DROME, T;
DR   P70398, FAF_MOUSE, T; P54578, TGT_HUMAN, T; P40826, TGT_RABIT, T;
DR   P25037, UBP1_YEAST, T; O42726, UBP2_KLULA, T; Q01476, UBP2_YEAST, T;
(…)
DR   P38187, UBPD_YEAST, T; Q24574, UBPE_DROME, T; Q14694, UBPE_HUMAN, T;
DR   P52479, UBPE_MOUSE, T; P38237, UBPE_YEAST, T; P50101, UBPF_YEAST, T;
DR   Q02863, UBPG_YEAST, T; P43593, UBPH_YEAST, T; Q61068, UBPW_MOUSE, T;
DR   P34547, UBPX_CAEEL, T; Q09931, UBPY_CAEEL, T;
DR   P53874, UBPA_YEAST, N; Q17361, UBPT_CAEEL, N;
DO   PDOC00750;
//

Database content: Profile
ID   UCH_2_3; MATRIX.
AC   PS50235;
DT   SEP-2000 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=193; TOPOLOGY=LINEAR;
MA   /DISJOINT: DEFINITION=PROTECT; N1=10; N2=185;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.3922; R2= ; TEXT='NScore';
MA   /CUT_OFF: LEVEL=0; SCORE=910; N_SCORE=9.0; MODE=1;
MA   /CUT_OFF: LEVEL=-1; SCORE=610; N_SCORE=6.5; MODE=1;
MA   /DEFAULT: B1=-100; E1=-100; MI=-105; MD=-105; IM=-105; DM=-105; I=-20; D=-20;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='T'; M=0,-14,2,-19,-16,-9,-21,-18,-6,-10,-5,-5,-12,-21,-15,-6,0,9,6,-29,-11,-16;
(…)
MA   /M: SY='D'; M=-11,12,-27,17,6,-21,-9,-4,-21,-4,-18,-14,5,-12,0,-6,-3,-8,-19,-26,-11,2;
MA   /I: E1=0;
NR   /RELEASE=38,80000;
NR   /TOTAL=47(47); /POSITIVE=47(47); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
DR   Q01988, UBPB_CANFA, T; Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T;
DR   P55824, FAF_DROME, T; P70398, FAF_MOUSE, T; P53010, PAN2_YEAST, T;
(…)
DR   Q09798, YAA4_SCHPO, T; P43589, YFH5_YEAST, T;
DO   PDOC00750;
//

Database content: documentation
{PDOC00750}
{PS00972; UCH_2_1}
{PS00973; UCH_2_2}
{PS50235; UCH_2_3}
{BEGIN}
**********************************************************************
* Ubiquitin carboxyl-terminal hydrolases family 2 signatures/profile *
**********************************************************************
Ubiquitin carboxyl-terminal hydrolases (EC ) (UCH) (deubiquitinating enzymes) [1,2] are thiol proteases that recognize and hydrolyze the peptide bond at the C-terminal glycine of ubiquitin. These enzymes are involved in the processing of poly-ubiquitin precursors as well as that of ubiquinated proteins. There are two distinct families of UCH. The second class consist of large proteins (800 to 2000 residues) and is currently represented by:
- Yeast UBP1, UBP2, UBP3, UBP4 (or DOA4/SSV7), UBP5, UBP7, UBP9, UBP10, UBP11, UBP12, UBP13, UBP14, UBP15 and UBP16.
- Human tre-2.
- Human isopeptidase T.
- Human isopeptidase T-3.
(…)

Database content: documentation
also probably implicated in the catalytic mechanism. We have developed signature pattern for both conserved regions. We also developed a profile including the two regions covered by the patterns.
-Consensus pattern: G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q [C is the putative active site residue]
-Sequences known to belong to this class detected by the pattern: ALL, except for two sequences.
(…)
-Note: these proteins belong to family C19 in the classification of peptidases [3,E1].
-Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so.
-Last update: September 2000 / Patterns and text revised; profile added.
[ 1] Jentsch S., Seufert W., Hauser H.-P. Biochim. Biophys. Acta 1089: (1991).
[ 2] D'andrea A., Pellman D. Crit. Rev. Biochem. Mol. Biol. 33: (1998).
[ 3] Rawlings N.D., Barrett A.J. Meth. Enzymol. 244: (1994).
[E1]

Tools
- EMBOSS: fuzzpro, fuzztran, fuzznuc, patmatdb, patmatmotifs
- FINDPATTERN, SCANPROSITE...
- PFSCAN & PFRAMESCAN
- Pftools 2.2 (pfmake, pfw, pfscan, pfsearch): Fortran source code (open source); binaries (solaris, linux, hpux, irix, win32, macosX)
- GeneMatcher (

PSI-BLAST
- What is it?
- Derived from NCBI-BLAST2.0
- Position Specific Iterative BLAST
- Difference with BLAST
- PSSM / checkpoint
- Advantage / Disadvantage

PSI-BLAST
Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed automatically from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search (replacing the normal matrix, e.g. BLOSUM62) and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.
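The idea of position-specific scores can be illustrated with a very small log-odds calculation. The sketch below is not the PSI-BLAST procedure: it assumes a uniform background frequency for the 20 amino acids, a flat pseudocount and no sequence weighting, whereas PSI-BLAST uses more elaborate weighting and pseudocount schemes. All names and default values here are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_rows, pseudocount: float = 1.0):
    """Return one {residue: log2-odds score} dictionary per alignment column."""
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("aligned rows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gaps and unknown symbols are ignored when counting.
        counts = Counter(row[col] for row in aligned_rows if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment: str) -> float:
    """Score an ungapped segment of the same length as the PSSM."""
    if len(segment) != len(pssm):
        raise ValueError("segment and PSSM must have the same length")
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Five columns taken from the chite, wheat, trybr and mouse rows of the deck's alignment.
rows = ["KPKRP", "KPKRA", "APKRA", "KPKRP"]
pssm = build_pssm(rows)
```

In this toy matrix the fully conserved proline column gives P a higher score than the mixed P/A column does, and residues never seen in a column get slightly negative scores, which is the behaviour the slide describes.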

BLAST algorithm

Differences with BLAST
- The two E-values
- Automatically or manually selecting the matches
- The substitution matrix
- The iteration

PSI-BLAST E-values
Two different E value settings need to be specified in the PSI-BLAST program. The first of these (upper) sets the threshold for the initial BLAST search. The default value is 10 as in the standard BLAST program. The second E value (lower) is the threshold value for inclusion in the position specific matrix used for PSI-BLAST iterations. The default setting is The E values specified allow the user to see (and selectively, based on prior knowledge, include) all of the BLAST hits up to E=10; but to automatically include only those hits exceeding a relatively rigorous E value threshold of
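The role of the two thresholds can be sketched as a simple filter over a hit list. Everything in this example is hypothetical: the `Hit` type, the hit identifiers, their E-values and in particular the inclusion threshold of 0.001, which is an arbitrary illustrative number and not the default referred to on the slide.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    subject_id: str
    evalue: float


def select_hits(
    hits: List[Hit],
    report_threshold: float,
    inclusion_threshold: float,
) -> Tuple[List[Hit], List[Hit]]:
    """Split hits into those shown to the user and those fed into the PSSM."""
    reported = [hit for hit in hits if hit.evalue <= report_threshold]
    included = [hit for hit in reported if hit.evalue <= inclusion_threshold]
    return reported, included


# Hypothetical search result.
hits = [Hit("seq_a", 1e-30), Hit("seq_b", 5e-4), Hit("seq_c", 0.2), Hit("seq_d", 50.0)]

reported, included = select_hits(
    hits,
    report_threshold=10.0,      # upper E-value: what is listed
    inclusion_threshold=1e-3,   # lower E-value: illustrative value only
)
```

Hits that are reported but not included (here `seq_c`) are the ones a user may still add by hand when biological knowledge supports it.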

BLAST PSSM or weight matrix
- A substitution matrix for an alphabet of size A is of size A x A
- A PSSM for an alphabet of size A is of size A x N, where N is the length of the query
[Figure: a substitution matrix with rows and columns A R N … Y V, next to a PSSM with rows A R N … Y V and one column per residue of the query M I S E C U E N C I A]

BLAST Iteration

PHI-BLAST: a link with PATTERNS
PHI-BLAST means Pattern-Hit Initiated BLAST. PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence. PHI-BLAST searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences. Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different. PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.

The good and the bad
Advantages
- Fast
- User friendly interface
- Local bias statistics
- Single software
Disadvantages
- Could be confusing
- No position specific gap penalty
- Fixed query length
- Complex PSSM/checkpoint for reuse
- Difficult scan vs search

How to « PSI-BLAST » efficiently?
- Choose your query sequence carefully
- Limit the size to the domain, but maximize
- Check matches: include or exclude based on biological knowledge
- Do not overfit!!
- Try reverse experiment to certify