Homologues finding and Multiple Sequence Alignment Maya Schushan November 2010.


1 Homologues finding and Multiple Sequence Alignment Maya Schushan November 2010

2 Outline: introduction to alignments 1. Introduction 2. Applications 3. General Alignment Methodology 4. Pairwise Alignment: Smith-Waterman, Needleman-Wunsch 5. Multiple Sequence Alignment: ClustalW, MUSCLE, T-Coffee

3 Introduction: What Is An Alignment? A process of lining up 2 or more sequences to achieve the maximum level of identity, in order to find homologies. [Figure: alternative ways of lining up T C A T G against C A T T G]

4 Introduction Comparing 2 (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG

5 Basic Terms Introduction Identity: Sequences or Sub-sequences that are invariant. Homology: Relation of sequences which is a result of divergence from a common ancestor. Similarity: Sequences or Sub-sequences that are related. C A G C A T

6 Homologues: Orthology vs Paralogy Introduction Reproduced from NCBI education website

7 Introduction The Limits of Sequence Similarity

8 Outline 1. Introduction 2. Applications 3. General Alignment Methodology 4. Pairwise Alignment: Smith-Waterman, Needleman-Wunsch 5. Multiple Sequence Alignment: ClustalW, MUSCLE, T-Coffee

9 Applications: Why Sequence Alignment? 1. Predict characteristics of a protein:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG

10 Applications A model is generated according to a template structure of a homologous protein Why Sequence Alignment?

11 Applications: Why Sequence Alignment? 2. Learn about evolutionary relationships: two sequences from different organisms are similar → they may have a common ancestor. Needed for construction of phylogenetic trees.

12 3. Research of disease – Comparison of sequences between individuals can detect changes that are related to diseases Analysis of residues’ substitutions: mutation or polymorphism? Applications Why Sequence Alignment?

13 Applications: Why Sequence Alignment? 4. Find similar sequences in a database. The commonly used BLAST and FASTA search programs have to use a form of alignment to detect sequences similar to the sequence in hand. The methods employed have to be very fast, to make a search in a database containing millions of sequences feasible.

14 Examples for specific applications: Evolutionary conservation analysis (ConSeq/ConSurf) Motif and domain prediction (Prosite/InterPro/Pfam) Phylogenetic trees … ConSurf analysis of PDB entry 1hyt-hydrolase Applications Why Sequence Alignment?

15 Outline 1. Introduction 2. Applications 3. General Alignment Methodology 4. Pairwise Alignment: Smith-Waterman, Needleman-Wunsch 5. Multiple Sequence Alignment: ClustalW, MUSCLE, T-Coffee

16 Example: Aligning Two Globins Human Hemoglobin (HH): VLSPADKTNVKAAWGKVGAHAGYEG Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG General Alignment Methodology

17 Example: Aligning Two Globins (HH) VLSPADKTNVKAAWGKVGAHAGYEG (SWM) VLSEGEWQLVLHVWAKVEADVAGHG No Gaps: Percent identity: 36 Percent similarity: 40

18 General Alignment Methodology Example: Aligning Two Globins
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
With Gaps: Gaps: 2 Percent identity: 45.833 (instead of 36 without gaps) Percent similarity: 54.167 (instead of 40 without gaps)
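The percent identity figures on the two globin slides can be reproduced with a few lines of Python. This is a minimal sketch, assuming identity is counted over the columns in which neither sequence has a gap (that assumption matches both the 36 and the 45.833 quoted on the slides); the function name `percent_identity` is illustrative only.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which both sequences carry a residue.
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Human hemoglobin vs. sperm whale myoglobin, as written on the slides.
ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))  # 36.0 45.833
```

Percent similarity would additionally need a definition of which residue pairs count as similar, which the slides do not spell out, so it is left out here.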

19 General Alignment Methodology Sequence Modifications 1. Insertion: an insertion of a letter or several letters into the sequence. AAGA → AAGTA 2. Deletion: deleting a letter (or more) from the sequence. AAGA → AGA 3. Substitution: replacing a sequence letter by another. AAGA → AACA InDel: Insertions + Deletions

20 General Alignment Methodology Measuring An Alignment
S = ACTG, T = AGT. Three candidate alignments:
S' = AC_TG    S' = ACTG    S' = ACTG
T' = A_GT_    T' = AGT_    T' = _AGT
Good: identical characters (match). Bad: different characters (mismatch); gap (InDel). Each pair of characters gets a value, depending on its identity. The similarity score of the alignment is the sum of pair values.
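The "sum of pair values" idea can be illustrated by scoring the three candidate alignments of ACTG and AGT from this slide. The values used here (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; the slide does not give concrete numbers.

```python
def alignment_score(s: str, t: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, gap_char: str = "_") -> int:
    """Sum the per-column values of an already aligned pair of sequences."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(s, t):
        if x == gap_char or y == gap_char:
            total += gap        # InDel column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
scores = [alignment_score(s, t) for s, t in candidates]
print(scores)  # [-4, -1, -5]
```

Under these assumed values the second alignment scores best; a different choice of values can change the ranking.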

21 Alignment Scoring 1. Assume independent mutation model 2. Score at each position –Positive if the same/similar (e.g. –Negative if different or gap 3. Score of an alignment is sum of position score General Alignment Methodology

22 General Alignment Methodology Alignment Scoring Different scoring → different best alignments. Scoring systems implicitly represent a particular theory of evolution: some mismatches are more plausible than others (transition vs. transversion; Lys → Arg ≠ Lys → Cys); gap extension vs. gap opening.

23 General Alignment Methodology Alignment Scoring: Scoring Matrix A matrix n × n: n=4 for DNA, n=20 for proteins. Each matrix entry defines the score for observing the two letters in the alignment: positive if likely to change, negative otherwise. [Figure: example DNA matrix over T, C, G, A; surviving values: 1 on the diagonal and -5 off the diagonal]

24 General Alignment Methodology DNA scoring matrices Transitions: purine to purine or pyrimidine to pyrimidine (4 possibilities). Transversions: purine to pyrimidine or pyrimidine to purine (8 possibilities). By chance alone, transversions should occur twice as often as transitions. In practice, transitions are more frequent than transversions.

25 General Alignment Methodology DNA scoring matrices
Match: 2, Transition: -4, Transversion: -6
From/To    A    G    C    T
A          2
G         -4    2
C         -6   -6    2
T         -6   -6   -4    2
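The transition/transversion counts and the three-valued DNA matrix can be generated from the purine/pyrimidine classes. This is a sketch: the scores 2, -4 and -6 are taken from the slide as best they can be read, and the names `substitution_class` and `dna_matrix` are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Scores as read from the slide: match / transition / transversion.
CLASS_SCORES = {"match": 2, "transition": -4, "transversion": -6}


def substitution_class(x: str, y: str) -> str:
    """Classify a pair of nucleotides as match, transition or transversion."""
    if x == y:
        return "match"
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # stays within the same chemical class
    return "transversion"        # purine <-> pyrimidine


# Full 4 x 4 scoring matrix as a dictionary keyed by (from, to).
dna_matrix = {(x, y): CLASS_SCORES[substitution_class(x, y)]
              for x in "ACGT" for y in "ACGT"}

# Count the ordered substitutions of each kind.
changes = [substitution_class(x, y) for x in "ACGT" for y in "ACGT" if x != y]
print(changes.count("transition"), changes.count("transversion"))  # 4 8
print(dna_matrix[("A", "G")], dna_matrix[("A", "C")])              # -4 -6
```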

26 General Alignment Methodology Protein scoring matrices Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids. As for DNA, protein matrices define the probabilities of change between the different amino acids. Popular matrices are based on empirical data: PAM & BLOSUM. Example aligned sequences: TLYDK, TLYEK, TLYDK, TLYQK, TLYDK. In the fourth column E and D are found in 7 / 8

27 General Alignment Methodology Protein scoring matrices
Amino Acid                                              Category
Asp (D), Glu (E), Asn (N), Gln (Q)                      Acids and Amides
His (H), Lys (K), Arg (R)                               Basic
Phe (F), Tyr (Y), Trp (W)                               Aromatic
Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)    Hydrophilic
Ile (I), Leu (L), Met (M), Val (V)                      Hydrophobic

28 BLOSUM Matrices General Alignment Methodology Based on BLOCKS database: ~2000 blocks from 500 families of related proteins Blocks: short conserved patterns of 3-60 aa without gaps Different BLOSUMn matrices are calculated independently from BLOCKS BLOSUMn is based on sequences that shared at least n percent identity

29 General Alignment Methodology BLOSUM Matrices Low BLOSUM numbers for distant sequences. High BLOSUM numbers for similar sequences. Generally: BLOSUM62 for general use; BLOSUM80 for close relations; BLOSUM45 for distant relations.

30 General Alignment Methodology Types of Gap Penalties InDels (insertions or deletions) are rare in evolution, but once created they are easy to extend. Gap open: penalty for the first residue in a gap. Gap extension: penalty for each additional residue in a gap.
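The open/extension distinction amounts to an affine gap penalty. Below is a minimal sketch; the numeric values (10 for opening, 0.5 for extension) are assumptions for illustration and do not come from the slides.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Penalty for one gap: gap_open for its first residue,
    gap_extend for every additional residue."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_penalty(3)           # a single gap of 3 residues
three_short_gaps = 3 * affine_gap_penalty(1)   # three separate 1-residue gaps
print(one_long_gap, three_short_gaps)  # 11.0 30.0
```

With these values a single long gap is much cheaper than several scattered ones, which reflects the idea that creating an InDel is rare but extending it is easy.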

31 Motivation: Aligning cDNAs to Genomic DNA Conclusion: gap opening and extension should be ranked differently to properly align the sequences Genomic DNA cDNA query General Alignment Methodology Types of Gap Penalties

32 General Alignment Methodology Summary: Scoring and Alignment The final score of the alignment is the sum of the positive scores and the penalty scores: Alignment score = + Number of Identities + Number of Similarities (scoring matrix) - Number of Gap insertions - Number of Gap extensions (gap penalties)

33 Outline 1. Introduction 2. Applications 3. General Alignment Methodology 4. Pairwise Alignment: Smith-Waterman, Needleman-Wunsch 5. Multiple Sequence Alignment: ClustalW, MUSCLE, T-Coffee

34 Pairwise Alignment Local vs. Global
Global alignment: finds the best alignment across the whole of the two sequences.
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment: finds regions of similarity in parts of the sequences.
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

35 Global: Needleman & Wunsch (1970) Finds the best alignment over the entire length of the two sequences only. The Needleman-Wunsch algorithm is appropriate for finding the alignment of two sequences which are: (i) of similar length; (ii) similar across their entire lengths. Example:
SIMILARITY
PI-LLAR---
Needleman, S. B. and Wunsch, C. D., 1970

36 Pairwise Alignment Global: Needleman & Wunsch (1970)
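The dynamic programming behind global alignment can be sketched as follows. This is a simplified illustration, not the exact procedure of the 1970 paper: it assumes a linear gap penalty and the illustrative values match +1, mismatch -1, gap -1.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> tuple:
    """Global alignment with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


nw_score, nw_top, nw_bottom = needleman_wunsch("SIMILARITY", "PILLAR")
print(nw_score)
print(nw_top)
print(nw_bottom)
```

The alignment printed for SIMILARITY and PILLAR depends on the scoring values and on how ties are broken in the traceback, so it need not match the slide's example character for character.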

37 Pairwise Alignment Local: Smith & Waterman (1981) Makes an optimal alignment of the best segment of similarity. Suitable when comparing substantially different sequences that share only short patches of similarity. Use when one sequence is short and the other is very long. Can return a number of highly aligned segments. For example, the local alignment of SIMILARITY and PILLAR:
MILAR
ILLAR
Smith, T.F. and Waterman, M.S., 1981

38 Pairwise Alignment Local: Smith & Waterman (1981) Smith, T.F. and Waterman, M.S., 1981
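Local alignment differs from the global case in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes a linear gap penalty and the illustrative values match +2, mismatch -1, gap -2.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple:
    """Local alignment with a linear gap penalty.
    Returns (best_score, segment_of_a, segment_of_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


sw_score, sw_top, sw_bottom = smith_waterman("SIMILARITY", "PILLAR")
print(sw_score)
print(sw_top)
print(sw_bottom)
```

Which segment is reported for SIMILARITY and PILLAR depends on the scoring values: with a milder mismatch penalty the segment can extend to the slide's MILAR/ILLAR, while stricter values trim the leading mismatches.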

39 Pairwise Alignment User Input Pair of sequences. Local or global alignment. Scoring: gap penalties (opening/extension); scoring matrix.

40 Outline 1. Introduction 2. Applications 3. General Alignment Methodology 4. Pairwise Alignment: Smith-Waterman, Needleman-Wunsch 5. Multiple Sequence Alignment: ClustalW, MUSCLE, T-Coffee

41 Multiple Sequence Alignment Pairwise Vs. Multiple Sequence Alignment Alignments help to analyze sequence data: organize and visualize.
Pairwise: for 2 sequences
F G K - G K G
F G K F G K G
MSA: for more than 2 sequences
F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G

42 Multiple Sequence Alignment Rules For Choosing Sequences Very similar sequences carry little information. Very different sequences cause trouble… (<30% identical with more than half of the other sequences in the set). Choose sequences as distantly related as possible: sequences between 30-80% identical with more than half of the sequences in the set. The more sequences the better.

43 Multiple Sequence Alignment Similarity Score of MSA Each position gets a value, depending on its identity. The similarity score of the alignment is the sum of all position values. A popular way to compute position values: SP (Sum of Pairs), where each pair gets its score from the similarity matrix (PAM, BLOSUM). Goal: find the MSA with the maximum similarity score. Bad news: this problem is NP-hard.
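The Sum of Pairs score can be computed column by column over every pair of rows. In this sketch a simple match/mismatch/gap scheme (+1, -1, -2, and 0 for two gaps facing each other) stands in for a real PAM or BLOSUM lookup; those values and the toy alignment are assumptions.

```python
from itertools import combinations


def pair_value(x: str, y: str, match: int = 1, mismatch: int = -1,
               gap: int = -2, gap_char: str = "-") -> int:
    """Value of one pair of characters taken from the same column."""
    if x == gap_char and y == gap_char:
        return 0                 # two gaps facing each other are ignored
    if x == gap_char or y == gap_char:
        return gap
    return match if x == y else mismatch


def sum_of_pairs(msa: list) -> int:
    """SP score: sum of pair values over all row pairs in every column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of the alignment must have equal length")
    total = 0
    for column in zip(*msa):
        total += sum(pair_value(x, y) for x, y in combinations(column, 2))
    return total


toy_msa = ["ACG", "A-G", "ATG"]   # hypothetical three-row alignment
sp = sum_of_pairs(toy_msa)
print(sp)  # 1
```

Scoring a given alignment is cheap; the hard part is searching over all possible alignments for the one with the maximum score.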

44 Multiple Sequence Alignment More than a handful of MSA methods exist… [Figure: methods ranging from approximate and fast to accurate and slow]

45 Multiple Sequence Alignment ClustalW (1994): Introduction This heuristic approach works because it uses the biological meaning of MSA. It is based on the idea that the sequences we usually want to align are phylogenetically related: a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then the next most similar one to that pair. Rule "once a gap, always a gap": the gaps between more similar pairs of sequences should not be affected by more distantly related ones. Thompson, J.D. et al., 1994

46 Multiple Sequence Alignment ClustalW: Progressive Alignment
Sequences: Hbb_Human (1), Hbb_Horse (2), Hba_Human (3), Hba_Horse (4), Myg_Whale (5).
1. Quick pairwise alignment: calculate the distance matrix.
2. Build a guide tree using the NJ phylogenetic method.
3. Progressive alignment following the guide tree.
[Figure: lower-triangular distance matrix for the five sequences; values as extracted: - 17- 5960- 59 13- 77 75 -]
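Step 1 of the progressive scheme can be sketched as a pairwise distance matrix from which the closest pair is picked first. This is a simplification: the distance is the percentage of differing positions between equal-length toy sequences, whereas ClustalW derives its distances from quick pairwise alignments. The sequences and names below are hypothetical.

```python
from itertools import combinations


def identity_distance(a: str, b: str) -> float:
    """Percent of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch expects equal-length sequences")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * differences / len(a)


def distance_matrix(seqs: dict) -> dict:
    """Map every unordered pair of sequence names to its distance."""
    return {(p, q): identity_distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}


toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACCTTCGA"}
distances = distance_matrix(toy)
closest = min(distances, key=distances.get)   # pair to align first
print(closest, distances[closest])  # ('seq1', 'seq2') 12.5
```

The progressive step then aligns the closest pair first and adds the remaining sequences in the order given by the guide tree.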

47 A D C B Multiple Sequence Alignment ClustalW- Progressive Alignment ABCD A---- B1--- C78-- D1152-

48 Multiple Sequence Alignment ClustalW- Additional Features Sequence weighting : – Each sequence gets a weight derived from the guide tree – Close sequences are down-weighted – Distant sequences receive high weights – The weights are normalized so that the highest is 1 W(Hbb_Human) =.081 + ½*.226 + ¼*.061 + 1/5*.015 + 1/6*.062 = 0.221 w1 w2 w3 w4 w5 w6 w7
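The weight formula on this slide divides each branch length on the path from the sequence to the root by the number of sequences that share that branch. A short sketch of the arithmetic, using the branch lengths and divisors exactly as printed:

```python
# (branch length, number of sequences sharing that branch) for Hbb_Human,
# copied from the terms of the formula on the slide.
path_to_root = [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

# Each branch contributes its length split evenly among the sequences below it.
weight = sum(length / shared for length, shared in path_to_root)
print(round(weight, 4))  # 0.2226
```

Summing the printed terms gives about 0.2226, slightly above the 0.221 shown on the slide; the difference presumably comes from rounding in the displayed branch lengths.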

49 Multiple Sequence Alignment ClustalW: Problems Sequences that are similar only in some smaller regions → ClustalW tries to find global alignments, not local. A sequence that contains a large insertion compared to the rest → global, not local. A sequence that contains a repetitive element, while another sequence contains only one copy.

50 Multiple Sequence Alignment MUSCLE- Introduction The most recent popular MSA software Considered to be the most accurate MSA software available today The basic idea: iterative progressive alignment Edgar, R.C., 2004

51 Multiple Sequence Alignment MUSCLE Innovations Faster: faster distance estimation between the input sequences; faster construction of an evolutionary tree (UPGMA instead of NJ in ClustalW). More accurate: applying a new score function to the profile alignments; refinement of the initial results. Edgar, R.C., 2004
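UPGMA, the tree-building method named here, repeatedly merges the two closest clusters and averages their distances to everything else. The following is a textbook-style sketch, not MUSCLE's actual implementation; the function name `upgma`, the nested-tuple tree representation and the toy distances are assumptions.

```python
def upgma(names: list, dist: list) -> tuple:
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                 # closest pair of clusters
        (tree_i, size_i) = clusters.pop(i)
        (tree_j, size_j) = clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted average = mean distance over all leaf pairs.
            merged[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        # Drop every distance that involves the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


toy_names = ["A", "B", "C", "D"]
toy_dist = [[0, 2, 10, 10],
            [2, 0, 10, 10],
            [10, 10, 0, 4],
            [10, 10, 4, 0]]
guide_tree = upgma(toy_names, toy_dist)
print(guide_tree)  # (('A', 'B'), ('C', 'D'))
```

UPGMA needs only a minimum search and an average per merge, which is part of why it is cheaper than neighbor joining.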

52 Multiple Sequence Alignment MUSCLE Innovations: Refinement Step 1. An edge is chosen from the progressive alignment tree. 2. The tree is divided into two subtrees by deleting this edge. 3. The MSA from each subtree is computed by progressive alignment (MSA1 and MSA2). 4. The two MSAs are aligned, generating an entirely new MSA. 5. If the new MSA achieves a higher score than the previous (old) MSA → keep it.

53 Multiple Sequence Alignment MUSCLE- It’s Even More Complicated…

54 Multiple Sequence Alignment All Against All- SH2 domains T-coffee MUSCLE Edgar, R.C., 2004

55 Multiple Sequence Alignment All Against All- BaliBase 2005 Edgar, R.C., 2004 MUSCLE is superior in some cases….

56 Multiple Sequence Alignment All Against All: PREFAB T-Coffee in others… → Trial and error is the best approach. Edgar, R.C., 2004

