1 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Multiple alignments, PATTERNS, PSI-BLAST

2 Overview
Multiple alignments: how-to, goal, problems, use
Patterns: PROSITE database, syntax, use
PSI-BLAST: BLAST, matrices, use
[ Profiles/HMMs ] …

3 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 What is a multiple sequence alignment? What can it do for me? How can I produce one of these? How can I use it? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

4 How can I use a multiple alignment?
chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
        ***. :::.:... :.. *. *: *
chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
unknown AKDDRIRYDNEMKSWEEQMAE
        * :.*. :
Extrapolation: SwissProt, unknown sequence, homology?

5 How can I use a multiple alignment? SwissProt, unknown sequence: match?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
Extrapolation: Prosite Patterns

6 How can I use a multiple alignment? Extrapolation: Prosite Patterns
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
Alignment annotations: L? K>R
Prosite Profiles: more sensitive, more specific (figure: profile matrix with rows labelled by amino acid)

7 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 How can I use a multiple alignment? Phylogeny chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite wheat trybr mouse -Evolution -Paralogy/Orthology

8 How can I use a multiple alignment? Phylogeny
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
      * :.*. :
Structure prediction: PHD for secondary structure prediction is 75% accurate. Threading is improving but is not yet as good.

9 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 How can I use a multiple alignment? Phylogeny Struc. Prediction chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : Caution! Automatic Multiple Sequence Alignment methods are not always perfect…

10 The problem: why is it difficult to compute a multiple sequence alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :::.:... :.. *. *: *
Computation: what is the good alignment?
Biology: what is a good alignment?

11 The problem: why is it difficult to compute a multiple sequence alignment? CIRCULAR PROBLEM: good sequences <-> good alignment

12 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 What do I need to know to make a good multiple alignment? How do sequences evolve? How does the computer align the sequences? How can I choose my sequences? What is the best program? How can I use my alignment?

13 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 An alignment is a story ADKPKRPLSAYMLWLN ADKPRRPLS-YMLWLN ADKPKRPKPRLSAYMLWLN Mutations + Selection ADKPRRP---LS-YMLWLN ADKPKRPKPRLSAYMLWLN Insertion Deletion Mutation

14 Homology: same sequences -> same origin? -> same function? -> same 3D fold? (Figure: % sequence identity plotted against length, with a threshold near 30% separating the "Same 3D Fold" region from the "Twilight Zone".)
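
The 30% figure refers to percent sequence identity between two aligned sequences. As a minimal sketch, assuming identity is counted over the columns where neither row has a gap (other denominators are in use, and the function name is only illustrative), it can be computed like this for two rows from the "An alignment is a story" slide:

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity over aligned columns where neither row has a gap.

    Assumption: gap columns are excluded from the denominator. Other
    conventions (e.g. dividing by the shorter sequence length) exist.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns where both sequences carry a residue.
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    ancestor = "ADKPKRPLSAYMLWLN"
    descendant = "ADKPRRPLS-YMLWLN"
    print(f"{percent_identity(ancestor, descendant):.1f}% identity")
```

A pair like this one sits far above the twilight zone; the harder cases are pairs whose identity drops toward the 30% threshold.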

15 Convergent evolution. AFGP with (ThrAlaAla)n similar to trypsinogen; AFGP with (ThrAlaAla)n NOT similar to trypsinogen (figure labels: N, S). Chen et al., 1997, PNAS 94:3811-16

16 Residues and mutations: all residues are equal, but some more than others… (Figure: Venn diagram grouping the amino acids as aliphatic, aromatic, hydrophobic, polar and small.) Accurate matrices are data driven rather than knowledge driven

17 Substitution matrices come in different flavors: PAM (250, 350), BLOSUM (45, 62), …

18 What is the best substitution matrix?
Mutation rates depend on families
Choosing the right matrix may be tricky
Gonnet250 > BLOSUM62 > PAM250
Depends on the family, the program used and its tuning

Family            S     N
Histone 3         6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
α-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)

19 Insertions and deletions? An indel of length L is charged an affine gap penalty: Cost = GOP + GEP*L, where GOP is the gap opening penalty and GEP the gap extension penalty. (Figure: cost plotted against indel length L.)
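
A small sketch of the affine gap formula Cost = GOP + GEP*L follows. The default values for `gop` and `gep` are illustrative assumptions, not the settings of any particular program.

```python
def affine_gap_cost(length: int, gop: float = 10.0, gep: float = 0.5) -> float:
    """Cost of one indel of `length` residues: GOP + GEP * L.

    gop and gep defaults are hypothetical values chosen for illustration.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gop + gep * length


if __name__ == "__main__":
    one_long_gap = affine_gap_cost(4)
    two_short_gaps = 2 * affine_gap_cost(2)
    # With an opening penalty, one long indel is cheaper than several short ones.
    print(one_long_gap, two_short_gaps)
```

The design point is that opening a gap is expensive while extending it is cheap, so the aligner prefers a few long indels over many scattered ones.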

20 How to align many sequences? Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computing time consuming:
2 globins => 1 sec
3 globins => 2 min
4 globins => 5 hours
5 globins => 3 weeks
6 globins => 9 years
7 globins => 1000 years
8 globins => 150 000 years
-> heuristic required!
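
For two sequences the exact dynamic programming approach is cheap; the cost explodes because the table gains one dimension per added sequence. The sketch below computes a global (Needleman & Wunsch style) alignment score for a pair. It is a simplification under stated assumptions: a flat match/mismatch score instead of a substitution matrix, and a linear rather than affine gap penalty. The function name and parameter values are illustrative.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal global alignment score of a and b.

    Assumptions: flat match/mismatch scoring and a linear gap penalty.
    Only two rows of the DP table are kept, since just the score is needed.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # residue of a aligned to a gap
            left = curr[j - 1] + gap  # residue of b aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN"))
```

The pairwise table has len(a) x len(b) cells. Aligning N sequences exactly needs a table with one axis per sequence, which is why the timings above grow so steeply and a heuristic is required.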

21 Existing methods
1- Carrillo and Lipman: MSA, DCA. Few, small, closely related sequences.
2- Segment based: DIALIGN, MACAW. May align too few residues. Do well when they can run.
3- Iterative: HMMs, HMMER, SAM. Slow, sometimes inaccurate. Good profile generators.
4- Progressive: ClustalW, Pileup, Multalign… Fast and sensitive.

22 Progressive alignment (Feng and Doolittle, 1980; Taylor 1981): dynamic programming using a substitution matrix

23 Progressive alignment (Feng and Doolittle, 1980; Taylor 1981)
-Depends on the ORDER of the sequences (tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS: substitution matrix; penalties (GOP, GEP); sequence weight; tree making algorithm.

24 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Selecting sequences from a BLAST output

25 A common mistake: sequences too closely related. Identical sequences bring no information. Multiple sequence alignments thrive on diversity.
PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********
PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

26 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Respect information! PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA PRVA_RAT ------------------------------------------SMTDLLS----AEDIKKA PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :*..*:::: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :.. *.*..:*: *: * *. :::..:*:::**:.*:*: :** : PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES- PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES- PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES- PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE *:... ::.: : *: ***:.**:*. :** :: -This alignment is not informative about the relation between TPCC MOUSE and the rest of the sequences. -A better spread of the sequences is needed

27 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Selecting diverse sequences PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE : *:.:..*.:*. * ** *: * : * :* * **:** PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA- PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ- PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA- PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA- PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA :**.*:.*.* *: ** ::.* **** **::** ** -A REASONABLE model now exists. -Going further:remote homologues.

28 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Aligning remote homologues PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM : :: PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM :..:... *: * : * :* :.*:*: :**. 
PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES- PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA- PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-- PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-- PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE ::.. :: : ::.* :.** *. :** ::

29 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Going further… PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI. :... ::. : * :* :.* *. : *. PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-- PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-- PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--- TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ- TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE- TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA :. :: : :: * :..* :. :** ::

30 What makes a good alignment…
The more divergent the sequences, the better
The fewer indels, the better
Nice ungapped blocks separated with indels
Different classes of residues within a block: completely conserved; size and hydropathy conserved; size or hydropathy conserved
The ultimate evaluation is a matter of personal judgment and knowledge
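
The first residue class, "completely conserved", is what the `*` symbols under the alignments mark. A minimal sketch of that marking is given here. It only handles full conservation; the `:` and `.` classes (size and/or hydropathy conserved) would need residue group tables that are not given in the slides. The function name is illustrative.

```python
def conserved_columns(rows: list[str]) -> str:
    """Return a marker line with '*' under every fully conserved column.

    A column counts as conserved when all rows carry the same residue
    and none of them has a gap ('-').
    """
    if not rows:
        return ""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    alignment = [
        "ADKPRRP---LS-YMLWLN",
        "ADKPKRPKPRLSAYMLWLN",
    ]
    for row in alignment:
        print(row)
    print(conserved_columns(alignment))
```

Runs of `*` with no gaps in between correspond to the "nice ungapped blocks" mentioned above.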

31 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Avoiding pitfalls

32 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Keep a biological perspective chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL- wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS * ***.:: ::... : *... : *. *: * chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G- wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA trybr RKVYEEMAEKDKERY----K--RE-M------- mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE----- : : * :.* : DIFFERENT PARAMETERS

33 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Do not overtune!!! DO NOT PLAY WITH PARAMETERS! IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF! chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. :::.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. : chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. *.:... :.. *. *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * :.*. :

34 Choosing the right method: the best program (ClustalW, MSA, DIALIGN II) depends on the problem and the method. Source: BaliBase, Thompson et al., NAR, 1999

35 Conclusion
The best alignment method: your brain; the right data
The best evaluation method: your eyes; experimental information (SwissProt)
What can I conclude? Homology -> information extrapolation
How can I go further? Patterns, profiles, HMMs, …

36 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 The database

37 History
Founded by Amos Bairoch
1988: first release in the PC/Gene software
1990: synchronisation with Swiss-Prot
1994: integration of « profiles »
1999: PROSITE joins InterPro
November 2001: current release 16.50

38 Database content
Official release
  ~1400 patterns         PSxxxxx    PATTERN
  ~100  profiles         PSxxxxx    MATRIX
  4     rules            PSxxxxx    RULE
  ~1100 documentations   PDOCxxxxx
Pre-release
  ~250  profiles         PSxxxxx    MATRIX
  ~150  documentations   QDOCxxxxx

39 Pattern « philosophy »
Target: definition of sites with biological information (catalytic, metal binding, S-S bridge, cofactor binding, prosthetic group, PTM)
Easy to understand and to design, for example: Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]

40 Pattern syntax
Regular expression (REGEXP) language:
Each position is separated by a dash « - »
Amino acids are represented by their single letter code
« x » represents any amino acid
[] group of amino acids acceptable for a position
{} group of amino acids not acceptable for a position
() multiple or range, e.g. A(1,3) means 1 to 3 A
< anchor at beginning of sequence
> anchor at end of sequence
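
Because the pattern language is a restricted regular expression, the syntax rules listed on this slide can be translated mechanically. The sketch below converts a pattern into a Python `re` expression. It covers only the elements named on the slide, and the function name `prosite_to_regex` is an illustrative assumption rather than part of any PROSITE tool.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(
    r"(?P<core>[A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # database entries end with a period
    prefix = suffix = ""
    if pattern.startswith("<"):            # anchor at beginning of sequence
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):              # anchor at end of sequence
        suffix, pattern = "$", pattern[:-1]
    pieces = []
    for element in pattern.split("-"):
        found = _ELEMENT.fullmatch(element)
        if found is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = found.group("core")
        if core == "x":                    # any amino acid
            core = "."
        elif core.startswith("{"):         # residues not accepted here
            core = "[^" + core[1:-1] + "]"
        lo, hi = found.group("lo"), found.group("hi")
        if lo is not None:                 # (n) or (n,m) repetition
            core += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(core)
    return prefix + "".join(pieces) + suffix


if __name__ == "__main__":
    example = "Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]"
    regex = prosite_to_regex(example)
    print(regex)
    print(bool(re.search(regex, "QAAANSCGAAALIHSLA")))
```

The test sequence in the usage lines is made up for the demonstration; real scans would run the compiled expression over database sequences.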

41 Profile « philosophy »
Aim: identification of domains and not protein families
Gene discovery vs automatic annotation
Importance of score and calibration
Possible manual tuning (by a well trained expert… ;-)
-> allowed by the profile syntax
-> no direct link to multiple alignment

42 Database content: PATTERN
ID   UCH_2_1; PATTERN.
AC   PS00972;
DT   JUN-1994 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 signature 1.
PA   G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q.
NR   /RELEASE=38,80000;
NR   /TOTAL=41(41); /POSITIVE=41(41); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=2; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=7,active_site(?);
DR   Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T; P55824, FAF_DROME, T;
DR   P70398, FAF_MOUSE, T; P54578, TGT_HUMAN, T; P40826, TGT_RABIT, T;
DR   P25037, UBP1_YEAST, T; O42726, UBP2_KLULA, T; Q01476, UBP2_YEAST, T;
(…)
DR   P38187, UBPD_YEAST, T; Q24574, UBPE_DROME, T; Q14694, UBPE_HUMAN, T;
DR   P52479, UBPE_MOUSE, T; P38237, UBPE_YEAST, T; P50101, UBPF_YEAST, T;
DR   Q02863, UBPG_YEAST, T; P43593, UBPH_YEAST, T; Q61068, UBPW_MOUSE, T;
DR   P34547, UBPX_CAEEL, T; Q09931, UBPY_CAEEL, T;
DR   P53874, UBPA_YEAST, N; Q17361, UBPT_CAEEL, N;
DO   PDOC00750;
//

43 Database content: Profile
ID   UCH_2_3; MATRIX.
AC   PS50235;
DT   SEP-2000 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=193; TOPOLOGY=LINEAR;
MA   /DISJOINT: DEFINITION=PROTECT; N1=10; N2=185;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.3922; R2=.00836191; TEXT='NScore';
MA   /CUT_OFF: LEVEL=0; SCORE=910; N_SCORE=9.0; MODE=1;
MA   /CUT_OFF: LEVEL=-1; SCORE=610; N_SCORE=6.5; MODE=1;
MA   /DEFAULT: B1=-100; E1=-100; MI=-105; MD=-105; IM=-105; DM=-105; I=-20; D=-20;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='T'; M=0,-14,2,-19,-16,-9,-21,-18,-6,-10,-5,-5,-12,-21,-15,-6,0,9,6,-29,-11,-16;
(…)
MA   /M: SY='D'; M=-11,12,-27,17,6,-21,-9,-4,-21,-4,-18,-14,5,-12,0,-6,-3,-8,-19,-26,-11,2;
MA   /I: E1=0;
NR   /RELEASE=38,80000;
NR   /TOTAL=47(47); /POSITIVE=47(47); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
DR   Q01988, UBPB_CANFA, T; Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T;
DR   P55824, FAF_DROME, T; P70398, FAF_MOUSE, T; P53010, PAN2_YEAST, T;
(…)
DR   Q09798, YAA4_SCHPO, T; P43589, YFH5_YEAST, T;
DO   PDOC00750;
//

44 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Database content: documentation {PDOC00750} {PS00972; UCH_2_1} {PS00973; UCH_2_2} {PS50235; UCH_2_3} {BEGIN} ********************************************************************** * Ubiquitin carboxyl-terminal hydrolases family 2 signatures/profile * ********************************************************************** Ubiquitin carboxyl-terminal hydrolases (EC 3.1.2.15) (UCH) (deubiquitinating enzymes) [1,2] are thiol proteases that recognize and hydrolyze the peptide bond at the C-terminal glycine of ubiquitin. These enzymes are involved in the processing of poly-ubiquitin precursors as well as that of ubiquinated proteins. There are two distinct families of UCH. The second class consist of large proteins (800 to 2000 residues) and is currently represented by: - Yeast UBP1, UBP2, UBP3, UBP4 (or DOA4/SSV7), UBP5, UBP7, UBP9, UBP10, UBP11, UBP12, UBP13, UBP14, UBP15 and UBP16. - Human tre-2. - Human isopeptidase T. - Human isopeptidase T-3. (…)

45 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Database content: documentation also probably implicated in the catalytic mechanism. We have developed signature pattern for both conserved regions. We also developed a profile including the two regions covered by the patterns. -Consensus pattern: G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]- [SACV]-x-[LIVMS]-Q [C is the putative active site residue] -Sequences known to belong to this class detected by the pattern: ALL, except for two sequences. (…) -Note: these proteins belong to family C19 in the classification of peptidases [3,E1]. -Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so. -Last update: September 2000 / Patterns and text revised; profile added. [ 1] Jentsch S., Seufert W., Hauser H.-P. Biochim. Biophys. Acta 1089:127-139(1991). [ 2] D'andrea A., Pellman D. Crit. Rev. Biochem. Mol. Biol. 33:337-352(1998). [ 3] Rawlings N.D., Barrett A.J. Meth. Enzymol. 244:461-486(1994). [E1] http://www.expasy.ch/cgi-bin/lists?peptidas.txt

46 Tools
EMBOSS: fuzzpro, fuzztran, fuzznuc, patmatdb, patmatmotifs
FINDPATTERN, SCANPROSITE... http://www.expasy.org/tools/#pattern
PFSCAN & PFRAMESCAN: http://www.isrec.isb-sib.ch/software
Pftools 2.2 (pfmake, pfw, pfscan, pfsearch): Fortran source code (open source); binaries (solaris, linux, hpux, irix, win32, macosX)
GeneMatcher (http://www.paracel.com)

47 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 PSI-BLAST What is it? Derived from NCBI-BLAST2.0 Position Specific Iterative BLAST Difference with BLAST PSSM / checkpoint Advantage / Disadvantage

48 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 PSI-BLAST Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search (replacing the normal matrix, e.g. BLOSUM62) and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity.
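
To make "calculating position-specific scores for each position in the alignment" concrete, here is a deliberately simplified sketch. It scores each residue at each column as a log-odds ratio against a uniform background with a fixed pseudocount. These are assumptions made for illustration: the actual PSI-BLAST procedure also weights sequences and uses more elaborate pseudocounts and background frequencies, and the name `build_pssm` is hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(rows: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build a toy PSSM: one dict of residue -> log2-odds score per column.

    Assumptions: uniform background frequency (1/20), the same pseudocount
    added to every residue, gaps and unknown symbols ignored.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*rows):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        scores = {}
        for res in ALPHABET:
            frequency = (counts[res] + pseudocount) / total
            scores[res] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


if __name__ == "__main__":
    hits = ["KPKR", "KPKR", "APKR", "KPRR"]  # made-up aligned hit fragments
    for position, scores in enumerate(build_pssm(hits), start=1):
        best = max(scores, key=scores.get)
        print(position, best, round(scores[best], 2))
```

A column where every hit agrees gets a high score for that residue, while a mixed column gives lower scores, closer to zero, which matches the behaviour described in the slide text.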

49 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 BLAST algorithm

50 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 Differences with BLAST The two E-values Automatically or manually selecting the matches The substitution matrix The iteration

51 PSI-BLAST E-values: Two different E-value settings need to be specified in the PSI-BLAST program. The first of these (upper) sets the threshold for the initial BLAST search. The default value is 10, as in the standard BLAST program. The second E-value (lower) is the threshold for inclusion in the position specific matrix used for PSI-BLAST iterations. The default setting is 0.001. Together, these settings allow the user to see (and selectively, based on prior knowledge, include) all of the BLAST hits up to E=10, but to automatically include only those hits that pass the relatively rigorous threshold of 0.001, that is, hits whose E-value falls below it.
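
The two thresholds can be pictured as two filters applied to the same hit list. In the sketch below the `Hit` type, the hit names and the use of "less than or equal" at the boundary are all illustrative assumptions; only the default values 10 and 0.001 come from the slide.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str      # hypothetical database identifier
    evalue: float  # E-value reported for the hit


def split_hits(hits: list[Hit],
               report_threshold: float = 10.0,
               inclusion_threshold: float = 0.001) -> tuple[list[Hit], list[Hit]]:
    """Return (reported, included) hits.

    reported: shown to the user (E-value up to the upper threshold).
    included: used automatically to build the PSSM (lower threshold).
    """
    reported = [hit for hit in hits if hit.evalue <= report_threshold]
    included = [hit for hit in reported if hit.evalue <= inclusion_threshold]
    return reported, included


if __name__ == "__main__":
    hits = [Hit("seq_a", 1e-30), Hit("seq_b", 0.0005), Hit("seq_c", 0.2), Hit("seq_d", 50.0)]
    reported, included = split_hits(hits)
    print([hit.name for hit in reported])
    print([hit.name for hit in included])
```

Hits that are reported but not included are the ones a user may still add by hand on the basis of biological knowledge.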

52 BLAST PSSM or weight matrix
A substitution matrix for an alphabet of size A is of size AxA:

     A   R   N  ..   Y   V
A    4  -1  -2  ..  -2   0
R   -1   5   0  ..  -2  -3
N   -2   0   6  ..  -2  -3
.
Y   -2  -2  -2  ..   7  -1
V    0  -3  -3  ..  -1   4

A PSSM for an alphabet of size A is of size AxN, where N is the length of the query:

     M   I   S   E   C   U   E   N   C   I   A
A    0   2   1   0   0   0   0  -1   0  -1   3
R   -1  -1   0   0  -1   0   0   0  -1  -1  -1
N   -1  -1   0   0  -1   0   0   5  -1   0  -1
.
Y   -1  -1  -1  -1  -1   0  -1  -1  -1  -1  -1
V   -1  -2  -1  -1  -1   0  -1  -1  -1   3  -1
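
The difference in shape (AxA versus AxN) shows up directly in how a score is looked up. In this sketch the substitution matrix values for A, R and N are the ones shown on the slide, while the three-column PSSM values are made up for illustration. Both functions score an ungapped comparison only, and their names are hypothetical.

```python
# Substitution matrix: indexed by (query residue, subject residue).
# Values for A, R, N as on the slide; stored symmetrically.
SUBSTITUTION = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2,
    ("R", "A"): -1, ("R", "R"): 5, ("R", "N"): 0,
    ("N", "A"): -2, ("N", "R"): 0, ("N", "N"): 6,
}

# PSSM: one column per query position, indexed by subject residue only.
# These numbers are hypothetical and not taken from the slide.
TOY_PSSM = [
    {"A": 3, "R": -1, "N": -1},
    {"A": -1, "R": 5, "N": 0},
    {"A": -2, "R": 0, "N": 6},
]


def score_with_matrix(query: str, subject: str) -> int:
    """Ungapped score: the same AxA table is used at every position."""
    return sum(SUBSTITUTION[(q, s)] for q, s in zip(query, subject))


def score_with_pssm(subject: str, pssm: list[dict[str, int]]) -> int:
    """Ungapped score: each query position has its own column of scores."""
    return sum(column[s] for column, s in zip(pssm, subject))


if __name__ == "__main__":
    print(score_with_matrix("ARN", "RRN"))
    print(score_with_pssm("RRN", TOY_PSSM))
```

With the PSSM the query residues no longer appear in the lookup: the query has been replaced by its position-specific columns, which is what lets the same residue score differently at different positions.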

53 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 BLAST Iteration

54 Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11 PHI-BLAST: a link with PATTERNS PHI-BLAST means Pattern-Hit Initiated BLAST PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence. PHI-BLAST searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences. Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different. PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.

55 The good and the bad
Advantages: fast; user friendly interface; local bias statistics; single software
Disadvantages: could be confusing; no position specific gap penalty; fixed query length; complex PSSM/checkpoint for reuse; difficult scan vs search

56 How to « PSI-BLAST » efficiently?
Choose your query sequence carefully
Limit the size to the domain, but maximize
Check matches: include or exclude based on biological knowledge
Do not overfit!!
Try reverse experiment to certify

