Protein Sequence Alignment Multiple Sequence Alignment


Part 3: Protein Sequence Alignment, Multiple Sequence Alignment

Table 3.1. Web sites for alignment of sequence pairs

Name of site | Web address | Reference
Bayes block aligner | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher | http://searchlauncher.bcm.tmc.edu/ | -
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView: shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | -
BLAT: fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer: predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 | http://globin.cse.psu.edu | Florea et al. (1998)

Protein Sequence Alignment

Protein Pairwise Sequence Alignment
- The alignment tools are similar to the DNA alignment tools: BLASTP, FASTA
- Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores:
  - Score s(i,j) > 0 if amino acids i and j have similar properties
  - Score s(i,j) ≤ 0 otherwise
- How should we score s(i,j)?

The 20 Amino Acids

Chemical Similarities Between Amino Acids
- Acids & amides: DENQ (Asp, Glu, Asn, Gln)
- Basic: HKR (His, Lys, Arg)
- Aromatic: FYW (Phe, Tyr, Trp)
- Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
- Hydrophobic: ILMV (Ile, Leu, Met, Val)
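
One way to make the grouping concrete is a toy similarity function built directly from the five groups above. This is only a sketch: the score values (2, 1, -1) and the names `GROUPS`, `GROUP_OF` and `similarity` are illustrative assumptions, not part of the slides, and real alignments would normally use an empirical matrix instead.

```python
# Chemical groups exactly as listed on the slide.
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each one-letter amino acid code to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def similarity(a, b, identical=2, same_group=1, different=-1):
    """Toy s(i,j); the three score values are assumed for illustration."""
    if a == b:
        return identical
    if GROUP_OF[a] == GROUP_OF[b]:
        return same_group
    return different


print(similarity("D", "E"))  # both in the acids & amides group
print(similarity("D", "K"))  # acid versus basic residue
```

The function is positive for residues in the same group and negative otherwise, which matches the sign convention for s(i,j) described earlier in the deck.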

Amino Acid Substitution Matrices
- For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
- Matrices represent biological processes:
  - Mutation causes changes in sequence
  - Evolution tends to conserve protein function
  - Similar function requires similar amino acids
- Could base the matrix on amino acid properties
- In practice: based on empirical data

[Figure: identity and similarity]

Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
[Figure: example alignment column; in this column E and D are found in 8 of 10 sequences]

Amino Acid Matrices
- Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i)
- Entry (i,i) is greater than any entry (i,j), j ≠ i
- Entry (i,j): the score of aligning amino acid i against amino acid j

PAM: Point Accepted Mutations
- Developed by Margaret Dayhoff, 1978
- Analyzed very similar protein sequences:
  - Proteins are evolutionarily close
  - Alignment is easy
  - Point mutations: mainly substitutions
  - Accepted mutations: accepted by natural selection
- Used global alignment
- Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
- Found that common substitutions involved chemically similar amino acids

PAM 250
- Similar amino acids are close to each other
- Regions define conserved substitutions

Selecting a PAM Matrix
- Low PAM numbers: short sequences, strong local similarities
- High PAM numbers: long sequences, weak similarities
  - PAM120 recommended for general use (40% identity)
  - PAM60 for close relations (60% identity)
  - PAM250 for distant relations (20% identity)
- If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended
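
The rule of thumb above can be sketched as a small helper that measures percent identity on an existing pairwise alignment and picks the nearest matrix. The function names and the "nearest identity" rule are assumptions made for illustration; only the three (identity, matrix) pairs come from the slide.

```python
def percent_identity(a, b):
    """Percent of aligned, gap-free columns in which a and b are identical."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# (approximate percent identity, matrix) pairs as given on the slide.
PAM_BY_IDENTITY = [(60, "PAM60"), (40, "PAM120"), (20, "PAM250")]


def suggest_pam(identity):
    """Return the matrix whose nominal identity is closest (assumed rule)."""
    return min(PAM_BY_IDENTITY, key=lambda entry: abs(entry[0] - identity))[1]


identity = percent_identity("AC-GT", "ACTGA")
print(identity, suggest_pam(identity))
```

As the slide notes, when the level of similarity is uncertain it is safer to try several matrices than to trust a single suggestion.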

BLOSUM: Blocks Substitution Matrix
- Steven and Jorja G. Henikoff (1992)
- Based on the BLOCKS database (www.blocks.fhcrc.org):
  - Families of proteins with identical function
  - Highly conserved protein domains
- Ungapped local alignment to identify motifs:
  - Each motif is a block of local alignment
  - Counts amino acids observed in the same column
  - Symmetrical model of substitution
[Figure: example blocks of ungapped aligned sequence segments]
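
The column-counting idea can be sketched as follows: count every residue pair seen in the same column of a block, then compare the observed pair frequency with the frequency expected from the residue frequencies alone. This is a simplified illustration under stated assumptions: the toy block and the function name `blosum_like_scores` are invented here, and the clustering of similar sequences used for the real BLOSUM matrices is omitted.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Log-odds scores from an ungapped block (equal-length strings)."""
    # Count unordered residue pairs within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2

    # Score = 2 * log2(observed / expected), rounded to an integer.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = scores[(b, a)] = round(2 * log2(freq / expected))
    return scores


# Hypothetical toy block; pairs never seen together get no entry at all.
block = ["DEK", "DEK", "DEK", "DEK", "DEK", "EDK"]
scores = blosum_like_scores(block)
print(scores[("D", "D")], scores[("D", "E")], scores[("K", "K")])
```

Because each pair is stored under both orderings, the resulting table is symmetric, in line with the symmetrical model of substitution.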

BLOSUM Matrices
- Different BLOSUMn matrices are calculated independently from BLOCKS
- BLOSUMn is based on sequences that are at most n percent identical

Selecting a BLOSUM Matrix
- For BLOSUMn, higher n is suitable for sequences that are more similar
  - BLOSUM62 recommended for general use
  - BLOSUM80 for close relations
  - BLOSUM45 for distant relations

Multiple Sequence Alignment

Multiple Alignment
- Like pairwise alignment:
  - n input sequences instead of 2
  - Add indels to make the sequences the same length
  - Local and global alignments
- Score columns in the alignment independently
- Seek an alignment that maximizes the score

Alignment Example

Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5, Score = 8

With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5, Score = 13.25

Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
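
The two totals can be reproduced with a few lines of code. The sketch below assumes, as the slide's numbers imply, that a column scores the count of its most common residue divided by the number of sequences, that gaps never count as matches, and that a residue seen only once scores 0. For the ungapped case only the columns covered by all four sequences are scored. The function names are illustrative.

```python
from collections import Counter


def column_score(column, n_sequences):
    """Fraction of sequences sharing the most common residue (0 if unshared)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return top / n_sequences if top >= 2 else 0.0


def alignment_score(rows):
    # zip stops at the shortest row, so ragged input is scored only
    # over the columns that every sequence covers.
    return sum(column_score(col, len(rows)) for col in zip(*rows))


ungapped = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
            "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
gapped = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
          "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]

print(alignment_score(ungapped))  # 8.0
print(alignment_score(gapped))    # 13.25
```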

Dynamic Programming
- Pairwise A–B alignment table:
  - Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  - Complexity: length of A × length of B
- 3-way A–B–C alignment table:
  - Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  - Complexity: length A × length B × length C
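
For the pairwise case, the table can be filled in with a short global-alignment routine in the style of Needleman-Wunsch. This is a minimal sketch: the match (+2) and mismatch (-1) values are the DNA scores quoted earlier in the deck, while the linear gap penalty of -2 and the function name are assumptions.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score aligning the first i letters of a
    # with the first j letters of b.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


print(global_alignment_score("AC", "AGC"))
```

The two nested loops visit every cell once, which is where the length of A × length of B cost comes from.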

MSA Complexity
- n-way S1–S2–…–Sn-1–Sn alignment table:
  - Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, xn elements of Sn
  - Complexity: length S1 × … × length Sn
- Example: protein family alignment
  - 100 proteins, 1000 amino acids each
  - Complexity: 10^300 table cells
  - Calculation time: beyond the big bang!
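
The cell count in the example follows from multiplying the sequence lengths together: 1000 multiplied by itself 100 times is (10^3)^100 = 10^300. A quick check, using Python's arbitrary-precision integers (variable names are illustrative):

```python
n_sequences = 100
length = 1000

# One table dimension per sequence, each of size `length`.
cells = length ** n_sequences

# Number of decimal digits minus one gives the power of ten.
print(len(str(cells)) - 1)

# For comparison, a pairwise table for two such proteins.
print(length * length)
```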

Feasible Approach
- Based on pairwise alignment scores:
  - Build an n by n table of pairwise scores
- Align similar sequences first:
  - After alignment, consider the result as a single sequence
  - Continue aligning with further sequences

Sum of pairwise alignment scores
- For n sequences, there are n(n-1)/2 pairs

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
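
A sum-of-pairs score can be sketched by scoring every pair of rows in the multiple alignment and adding the results. The pairwise values used here (match +2, mismatch -1, gap -2, gap-against-gap ignored) are assumptions for illustration; the slides do not fix them for this step.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score two rows of a multiple alignment column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap against gap is ignored (assumption)
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows):
    """Add up the pairwise scores of all n(n-1)/2 row pairs."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
        "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]
print(len(list(combinations(rows, 2))))  # number of pairs for n = 4
print(sum_of_pairs(rows))
```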

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT

3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

ClustalW Algorithm
Progressive Sequence Alignment (Higgins and Sharp 1988)
- Compute pairwise alignments for all pairs of sequences
- Use the alignment scores to build a phylogenetic tree such that:
  - similar sequences are neighbors in the tree
  - distant sequences are distant from each other in the tree
- The sequences are progressively aligned according to the branching order in the guide tree
- http://www.ebi.ac.uk/clustalw/
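
The tree-building step can be sketched with a simple greedy clustering that repeatedly joins the two most similar clusters. Note the substitution: this uses average-linkage clustering as a stand-in and is not ClustalW's own tree-building method. The sequence names and pairwise scores below are hypothetical.

```python
from itertools import combinations


def guide_tree(names, score):
    """Branching order from pairwise scores (higher = more similar).

    `score` maps frozenset({a, b}) to the pairwise alignment score.
    """
    # Each cluster is (tree, leaves): a nested tuple plus its leaf names.
    clusters = [(name, [name]) for name in names]

    def similarity(c1, c2):
        # Average score over all leaf pairs between the two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: similarity(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical pairwise scores for four sequences.
names = ["S1", "S2", "S3", "S4"]
score = {frozenset(p): 2 for p in combinations(names, 2)}
score[frozenset(("S1", "S2"))] = 10
score[frozenset(("S3", "S4"))] = 9

print(guide_tree(names, score))
```

The nested tuple gives the order in which a progressive aligner would merge sequences: the closest pair first, then the next closest, and finally the two resulting groups.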

Progressive Sequence Alignment (protein sequences example)
Input sequences: NYLS, NKYLS, NFS, NFLS
NYLS + NKYLS: N K/- Y L S
NFS + NFLS: N F L/- S
Both groups combined: N K/- Y/F L/- S
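
The K/- and Y/F notation summarizes what each alignment column contains. A small helper can produce it from gapped rows. The gapped rows below are a reconstruction of the alignments implied by the example, and the function name `profile` is an assumption.

```python
def profile(rows):
    """Summarize each alignment column, e.g. 'K/-' for a K and a gap."""
    columns = []
    for column in zip(*rows):
        seen = []
        for c in column:
            if c not in seen:
                seen.append(c)
        # Stable sort: residues keep first-seen order, gaps go last.
        seen.sort(key=lambda c: c == "-")
        columns.append("/".join(seen))
    return " ".join(columns)


print(profile(["N-YLS", "NKYLS"]))
print(profile(["NF-S", "NFLS"]))
print(profile(["N-YLS", "NKYLS", "N-F-S", "N-FLS"]))
```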

Treating Gaps in ClustalW
- Penalty for opening a gap and an additional penalty for extending it
- Gaps found in the initial alignment remain fixed
- New gaps are introduced as more sequences are added (the penalty is decreased where a gap already exists)
- The gap penalty is decreased within stretches of hydrophilic residues
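
The open-plus-extend scheme is usually called an affine gap penalty. The sketch below shows the basic arithmetic only. The numeric values are illustrative assumptions rather than ClustalW's settings, and charging the extension penalty from the second gap position onward is one of several conventions. The position-specific adjustments listed above are not modeled.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine penalty: pay once to open, then per extra gap position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of length 3 versus three separate gaps of length 1.
print(gap_penalty(3))
print(3 * gap_penalty(1))
```

With these values a single long gap costs far less than several short ones, which is why affine penalties tend to group indels together.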

MSA Approaches
- Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
- Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign
- Statistical methods: hidden Markov models. SAM2K
- Genetic algorithm: SAGA