Medical Natural Sciences Year 2: Introduction to Bioinformatics Lecture 9: Multiple sequence alignment (III) Centre for Integrative Bioinformatics VU.



Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)
Victor Simossis, Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands

Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)
– Modern state-of-the-art methods use multiple sequence alignments
– Methods like PHD, Profs, SSPro, etc., predict for the top sequence in the alignment by cutting out positions with gaps in the top sequence
– What if two helices that are ‘out of phase’ are pasted together? Or a strand and a helix?
– Approach: correct by permuting alignments and taking a consensus prediction

Secondary structure periodicity patterns
[Figure: hydrophobic/hydrophilic periodicity patterns for a buried β-strand, an edge β-strand and an α-helix]

Symmetry-derived secondary structure prediction using MA (SymSSP)
[Figure: rows of predicted secondary structure strings (E = strand, H = helix, ? = undetermined) obtained from the permuted alignments]

Optimal segmentation of predicted secondary structures
Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment. The predictions are recorded by secondary structure type and region position in a single matrix.
[Figure: the predictions 1->1, 1->2, 1->3 and 1->4 are recorded per region as H, E, C and ? scores]

Optimal segmentation of predicted secondary structures by Dynamic Programming
The recorded values are used in a weighted function, according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score.
Restrictions (ws = window size):
– H only if ws >= 4
– E only if ws >= 2
[Figure: matrix of sequence position against window size; each cell holds a max score, offset and label (example entry: 5 H 26), derived from the H, E, C and ? scores per region. The segmentation score is the total score of each path.]
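The segmentation step can be sketched as a small dynamic program. This is a minimal illustration, not the SymSSP implementation: it assumes each position already carries one score per label (H, E, C), and that a segment's score is simply the sum of its per-position scores, which stands in for the weighted, window-specific function on the slide. The function name `segment` and the `MIN_LEN` table are hypothetical; only the length restrictions (H >= 4, E >= 2) come from the slide.

```python
# Minimum segment length per label; H and E limits are from the slide,
# the limit of 1 for coil (C) is an assumption.
MIN_LEN = {"H": 4, "E": 2, "C": 1}


def segment(scores):
    """Return (label string, total score) of the best segmentation.

    scores: list with one dict per position, e.g. {"H": 3, "E": 0, "C": 1}.
    """
    n = len(scores)
    # Prefix sums per label so that any segment score is an O(1) lookup.
    prefix = {lab: [0.0] for lab in MIN_LEN}
    for pos in scores:
        for lab in MIN_LEN:
            prefix[lab].append(prefix[lab][-1] + pos[lab])

    best = [float("-inf")] * (n + 1)   # best[e] = best score for positions [0, e)
    best[0] = 0.0
    back = [None] * (n + 1)            # back[e] = (start, label) of the last segment
    for end in range(1, n + 1):
        for lab, min_len in MIN_LEN.items():
            # Only segments that satisfy the minimum length are tried.
            for start in range(0, end - min_len + 1):
                cand = best[start] + prefix[lab][end] - prefix[lab][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, lab)

    # Trace back from the end to recover the labelled segments.
    pieces = []
    end = n
    while end > 0:
        start, lab = back[end]
        pieces.append(lab * (end - start))
        end = start
    return "".join(reversed(pieces)), best[n]


# Two helix-like positions followed by four coil-like positions.
example = [{"H": 3, "E": 0, "C": 1}] * 2 + [{"H": 0, "E": 0, "C": 1}] * 4
print(segment(example))
```

In the example, a two-residue helix is not allowed, so the helix is either extended to four positions or dropped; the length restriction therefore changes the optimal path rather than merely filtering the result.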

Example of an optimally segmented secondary structure prediction library for sequence 3chy
3chy GYVV-----KPFTAATLEEKLNKIFEKLGM
3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????
3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????
3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????
3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????
3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????
3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????
3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh
3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????
3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????
3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????
3chy <- 3chy hhhhhhhhhhhhhh
Consensus EEEE----- HHHHHHHHHHHHH
Consensus-DSSP ****.....****xx***************
PHD HHHHHHHHHHHHHH
PHD-DSSP xxxx.....******************x**
DSSP EEEE.....SS HHHHHHHHHHHHHHHT
LumpDSSP EEEE..... HHHHHHHHHHHHHHH......

Symmetry-derived secondary structure prediction (SymSSP)
– Tried over 120 different consensus weighting schemes (global, regional, positional)
– Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better
– 60% of the alignments are improved, 20% are not affected and 20% are made worse
– Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)

Integrating secondary structure prediction and multiple sequence alignment
– Low-key example shown of fairly homogeneous data (strings of letters in both cases)
– But already difficult to do, and methods are not easily tunable
– How to scale up to knowledge-integrating and inference engines?

Strategies for multiple sequence alignment
– Profile pre-processing
– Secondary structure-induced alignment
– Globalised local alignment
– Matrix extension
Objective: try to avoid (early) errors

Globalised local alignment
Aim: fill each cell of the DP search matrix with the score of the highest-scoring local alignment going through that cell
Problem: forward calculation + traceback for each local alignment is too slow
Solution: double dynamic programming
1. Local DP in forward and reverse direction (no traceback) + matrix summation
2. Global DP over the matrix from step 1 + traceback

Globalised local alignment: double dynamic programming
1. Local (SW) alignment, forward + reverse, using exchange matrix M and gap penalties P_o, P_e
2. Global (NW) alignment over the summed matrix (no M or P_o, P_e)
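The two steps of double dynamic programming can be sketched as follows. This is an illustrative simplification under stated assumptions: a single linear gap penalty replaces the separate open and extension penalties (P_o, P_e), a toy match/mismatch function replaces BLOSUM62, and the function names `local_matrix` and `globalised_local` are hypothetical.

```python
def local_matrix(a, b, score, gap):
    """Smith-Waterman forward pass: best local score ending at each cell."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)
    return F


def globalised_local(a, b, score, gap):
    """Return the summed local matrix and the globally aligned residue pairs."""
    n, m = len(a), len(b)
    fwd = local_matrix(a, b, score, gap)
    # The reverse pass is a forward pass over the reversed sequences.
    rev = local_matrix(a[::-1], b[::-1], score, gap)

    # Step 1: sum forward and reverse local scores for every residue pair.
    # No traceback is needed. Note that the plain sum counts the exchange
    # value of the cell itself twice.
    S = [[fwd[i + 1][j + 1] + rev[n - i][m - j] for j in range(m)]
         for i in range(n)]

    # Step 2: global DP over S, without exchange matrix or gap penalties.
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(G[i - 1][j - 1] + S[i - 1][j - 1],
                          G[i - 1][j],
                          G[i][j - 1])

    # Single traceback over the global matrix.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if G[i][j] == G[i - 1][j - 1] + S[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif G[i][j] == G[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return S, pairs[::-1]


def toy_score(x, y):
    """Toy exchange values (an assumption): +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


S, pairs = globalised_local("ACGT", "ACGT", toy_score, gap=2)
print(pairs)
```

Only two local passes and one global pass are needed, whatever the number of local alignments hidden in the matrix, which is the point of avoiding a separate traceback per local alignment.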

M = BLOSUM62, P_o = 0, P_e = 0

M = BLOSUM62, P_o = 12, P_e = 1

M = BLOSUM62, P_o = 60, P_e = 5

Strategies for multiple sequence alignment
– Profile pre-processing
– Secondary structure-induced alignment
– Globalised local alignment
– Matrix extension
Objective: try to avoid (early) errors

Integrating alignment methods and alignment information with T-Coffee
– Integrating different pair-wise alignment techniques (NW, SW, ...)
– Combining different multiple alignment methods (consensus multiple alignment)
– Combining sequence alignment methods with structural alignment techniques
– Plug in user knowledge

Matrix extension: T-Coffee
Tree-based Consistency Objective Function For alignmEnt Evaluation
Cédric Notredame, Des Higgins, Jaap Heringa. J. Mol. Biol., 302 (2000)

Using different sources of alignment information
[Figure: alignments from Clustal, Dialign and Lalign, structure alignments and manual alignments feed into T-Coffee]

Progressive multiple alignment
[Figure: pairwise scores (Score 1-2, Score 1-3, ..., Score 4-5) fill a 5×5 similarity matrix, from which the guide tree and then the multiple alignment are built]

Default T-COFFEE
– Uses information from all sequences for each pair-wise alignment
– Reconciles global and local alignment information

T-Coffee matrix extension

Search matrix extension

T-Coffee
Combine different alignment techniques by adding scores:
$$W(A(x), B(y)) = \sum S(A(x), B(y))$$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is the sequence identity percentage of the associated alignment
Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:
$$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\bigl(W(A(x), I(z)),\, W(I(z), B(y))\bigr)$$
– Summation over all third sequences I other than A or B
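The extension formula for W' can be illustrated with a few lines of code. This is a sketch under assumptions, not T-Coffee's own code: residues are represented as hypothetical `(sequence_name, position)` tuples, the primary library is a plain dictionary of pair weights W, and the weights in the example are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Compute W' from W.

    primary: dict mapping (residue1, residue2) -> weight W, where each
    residue is a (sequence_name, position) tuple from a different sequence.
    Returns a dict mapping frozenset({residue1, residue2}) -> W'.
    """
    # Symmetric neighbour table: weights[r][s] = W(r, s).
    weights = defaultdict(dict)
    for (r1, r2), w in primary.items():
        weights[r1][r2] = weights[r1].get(r2, 0) + w
        weights[r2][r1] = weights[r2].get(r1, 0) + w

    # Start from the direct weights W.
    extended = defaultdict(int)
    for (r1, r2), w in primary.items():
        extended[frozenset((r1, r2))] += w

    # Every residue I(z) acts as an intermediate: each pair of its aligned
    # partners from two different sequences gains min(W(A,I), W(I,B)).
    for partners in weights.values():
        for r1, r2 in combinations(partners, 2):
            if r1[0] != r2[0]:
                extended[frozenset((r1, r2))] += min(partners[r1], partners[r2])
    return dict(extended)


# Three sequences A, B, C with one residue each (illustrative weights).
primary = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 1)): 100,
}
extended = extend_library(primary)
print(extended[frozenset({("A", 1), ("B", 1)})])  # 88 + min(77, 100)
```

Taking the minimum means a path through a third sequence can never contribute more than its weakest link, so a pair is only reinforced when both intermediate alignments support it.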

T-Coffee Direct alignment Other sequences

T-Coffee library system
Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35

T-Coffee progressive alignment
Input sequences: MDAGSTVILCFVG and MDAASTILCGS
[Figure labels: amino acid exchange matrix, gap penalties (open, extension), search matrix]
Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS

Kinase nucleotide binding sites

Comparing T-coffee with other methods

but.....
T-COFFEE (V1.23) multiple sequence alignment Flavodoxin-cheY
1fx1 ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----
FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----
FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----
FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK
4fxn MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE ESEFEPF-IEEIS-TKISGKK-----
FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE DSVVEPF-FTDLA-PKLKGKK-----
FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN ISWEMKKW-IDESSEFNLEGKL
2fcr -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----
FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----
FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL QSDWEGL-YSELDDVDFNGKL-----
FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----
FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA QCDWDDF-FPTLEEIDFNGKL
3chy ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE LLKTIRADGAMSALPVLMV
:... :. ::

1fx1 VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG LRIDGDPRAA--RDDIVGWAHDVRGAI
FLAV_DESVH VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG LRIDGDPRAA--RDDIVGWAHDVRGAI
FLAV_DESGI VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS LKIDGEPDSA----EVLDWAREVLARV
FLAV_DESSA VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS LKIDGDPE----RDEIVSWGSGIADKI
FLAV_DESDE VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG LKMEGDASND--PEAVASFAEDVLKQL
4fxn VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP LIVQNEPD--EAEQDCIEFGKKIANI
FLAV_MEGEL VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA IV--NEMP--DNAPECKELGEAAAKA
FLAV_CLOAB GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF
2fcr VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV
FLAV_ENTAG VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL
FLAV_ANASP VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL
FLAV_AZOVI VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----
FLAV_ECOLI VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA
3chy TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM

Evaluating multiple alignments
Conflicting standards of truth
–evolution
–structure
–function
With orphan sequences no additional information is available
Benchmarks depend on reference alignments
Quality issue of available reference alignment databases
Different ways to quantify agreement with a reference alignment (sum-of-pairs, column score)
“Charlie Chaplin” problem

Evaluating multiple alignments As a standard of truth, often a reference alignment based on structural superpositioning is taken

Evaluation measures
A query alignment is compared with a reference alignment using two measures: the column score and the sum-of-pairs score.
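The two reference-based measures can be sketched as below. The slide does not spell out the definitions, so this follows the commonly used benchmark convention as an assumption: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. Function names are hypothetical, and alignments are lists of equal-length strings with '-' for gaps.

```python
from itertools import combinations


def residue_columns(alignment):
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        columns.append(tuple(column))
    return columns


def aligned_pairs(columns):
    """Set of (seq1, residue1, seq2, residue2) pairs that share a column."""
    pairs = set()
    for column in columns:
        for s1, s2 in combinations(range(len(column)), 2):
            if column[s1] is not None and column[s2] is not None:
                pairs.add((s1, column[s1], s2, column[s2]))
    return pairs


def sp_score(query, reference):
    """Fraction of reference residue pairs that are also aligned in the query."""
    q = aligned_pairs(residue_columns(query))
    r = aligned_pairs(residue_columns(reference))
    return len(q & r) / len(r)


def column_score(query, reference):
    """Fraction of reference columns reproduced exactly in the query."""
    q = set(residue_columns(query))
    r = residue_columns(reference)
    return sum(column in q for column in r) / len(r)


reference = ["ACGT", "A-GT"]
query = ["ACGT", "AG-T"]
print(sp_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: one misplaced residue spoils a whole column, whereas the sum-of-pairs score still credits the pairs in that column that remain correct.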

Scoring a multiple alignment
Sum-of-Pairs score of a query alignment:
– For each alignment position: take the sum of all pairs (add a.a. exchange values)
– As an option, subtract gap penalties
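As a rough sketch of this scoring scheme, the function below sums exchange values over all residue pairs in every column. Several details are assumptions made for illustration: the name `sum_of_pairs`, a toy exchange function in place of a real amino acid exchange matrix, and a flat penalty charged once per residue-gap pair when the optional gap penalty is used.

```python
from itertools import combinations


def sum_of_pairs(alignment, exchange, gap_penalty=0):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    total = 0
    for column in zip(*alignment):          # walk the alignment column by column
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue                    # gap-gap pairs carry no information
            if a == "-" or b == "-":
                total -= gap_penalty        # optional penalty per residue-gap pair
            else:
                total += exchange(a, b)     # amino acid exchange value
    return total


def identity(a, b):
    """Toy exchange values (an assumption): 1 for identical residues, else 0."""
    return 1 if a == b else 0


alignment = ["AC", "AG", "-C"]
print(sum_of_pairs(alignment, identity))
print(sum_of_pairs(alignment, identity, gap_penalty=1))
```

Because every pair of sequences contributes to every column, the cost grows quadratically with the number of sequences, which is worth keeping in mind when scoring large alignments.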

Evaluating multiple alignments
[Figure: plot with axes labelled SP and BAliBASE alignment size (nseq * len)]

Summary
Weighting schemes simulating simultaneous multiple alignment
–Profile pre-processing (global/local)
–Matrix extension (well balanced scheme)
Smoothing alignment signals
–globalised local alignment
Using additional information
–secondary structure driven alignment
Schemes strike a balance between speed and sensitivity

References
Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem. 23.
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302.
Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem. 26(5).

Where to find this….