Homology and sequence alignment

Presentation transcript:

1 Homology and sequence alignment.

2 Homology: similarity between objects due to common ancestry. Example: Hund = Dog, Schwein = Pig (German and English words that share an origin; compare "hound" and "swine").

3 Sequence homology: similarity between sequences as a result of common ancestry.
VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

4 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

5 Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Alignment is required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

6 Sequence alignment: if two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree whose two leaves are the aligned sequences.
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

7 Perfect match: in the alignment of VLSPAV-WAKV with VLSEAVLWAKV, a perfect match (for example, V aligned with V) suggests that no change has occurred since the common ancestor (although this is not always the case).

8 A substitution: in the alignment of VLSPAV-WAKV with VLSEAVLWAKV, a substitution (P aligned with E) suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it occurred).

9 Indel, option 1: the ancestor (VLSEAVLWAKV) had the L, and it was lost in the lineage leading to VLSPAV-WAKV. In such a case, the event was a deletion.

10 Indel, option 2: the ancestor (VLSEAVWAKV) was shorter, and the L was inserted in the lineage leading to VLSEAVLWAKV. In such a case, the event was an insertion.

11 Indel: normally, given only the two sequences (VLSPAV-WAKV and VLSEAVLWAKV), we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

12 Global vs. local
Global alignment finds the best alignment across the entire length of the two sequences. It forces alignment even in regions that differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment finds regions of similarity in parts of the sequences and returns only the regions of good alignment:
ADLG    CDRYFQ
||||    |||| |
ADLG    CDRYYQ

13 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

14 Proteins are composed of domains. Human PTK2: Domain A, protein tyrosine kinase domain, Domain B.

15 In leukocytes, a different gene for tyrosine kinase is expressed. (Figure labels: Domain X, protein tyrosine kinase domain, Domain A.)

16 The sequence similarity between leukocyte TK and PTK2 is restricted to a single domain. (Figure labels: Domain X, protein tyrosine kinase domain, Domain B, protein tyrosine kinase domain, Domain A.)

17 Global alignment of PTK and LTK

18 Local alignment of PTK and LTK

19 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.
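
The difference between the two modes can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not the method used on the slides: the function name `alignment_score` is hypothetical, and the default scores (+1 match, -2 mismatch, -1 gap) are borrowed from the naive scoring system introduced later in the deck. Global mode follows the Needleman-Wunsch recurrence; local mode follows the Smith-Waterman recurrence, in which a cell may restart at 0 and the best cell anywhere in the table is reported.

```python
def alignment_score(a, b, match=1, mismatch=-2, gap=-1, local=False):
    """Best alignment score of a and b (score only, no traceback)."""
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may start here
                best = max(best, cell)   # and may end at any cell
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(s1, s2))
print("local: ", alignment_score(s1, s2, local=True))
```

The local score can never be lower than the global score, because the local mode is free to drop the poorly matching regions that the global mode is forced to align.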

20 How alignments are computed

21 Pairwise alignment
Sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

22 This alignment includes 10 perfect matches, 2 mismatches, and 4 indels (gaps):
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

23 Choosing an alignment for a pair of sequences
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Many different alignments are possible for 2 sequences:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
and
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?

24 Scoring system (naïve)
Perfect match: +1. Mismatch: -2. Indel (gap): -1.
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score → better alignment.

25 Alignment scoring (scoring of sequence similarity)
Assumes independence between positions: each position is considered separately.
Scores each position: positive if identical (match), negative if different (mismatch or gap).
Total score = sum of the position scores; it can be positive or negative.
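
As a quick check of the two scores above, here is a minimal sketch that sums the per-position scores of an alignment that is already given. The function name `score_alignment` is an illustrative choice; the score values are the naive ones from slide 24.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```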

26 Scoring system: in the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary. Different scoring systems → different alignments. We want a good scoring system…

27 DNA scoring matrices can take into account biological phenomena such as the difference between transitions and transversions.

28 Amino-acid scoring matrices take into account physico-chemical properties.

29 Scoring gaps (I): in advanced algorithms, two gaps of one amino acid each are scored differently from one gap of two amino acids. This is done by charging a penalty for each gap that is opened, with gap extension penalty < gap opening penalty.
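
A minimal sketch of such an opening/extension ("affine") gap penalty follows. The penalty values (-10 to open, -1 to extend) and the convention that the opening penalty covers the first gapped position are assumptions chosen for illustration; real programs differ in both.

```python
def gap_cost(length, gap_open=-10, gap_extend=-1):
    """Affine gap penalty: pay gap_open once, then gap_extend per extra position."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


two_single_gaps = 2 * gap_cost(1)   # two separate gaps of length 1
one_double_gap = gap_cost(2)        # one gap of length 2
print(two_single_gaps, one_double_gap)  # -20 -11
```

With these values, one longer gap is penalized less than several short ones, which matches the idea that a single indel event can remove or add several residues at once.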

30 Homology versus chance similarity: how to check whether the score is significant?
A. Take the two sequences and compute their alignment score.
B. Randomly shuffle one of the sequences and compute its score against the second sequence. Repeat 100,000 times.
If the score from A is in the top 5% of the scores from B, the similarity is significant.
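
The shuffling procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure behind the slide: `global_score` is a simple global-alignment scorer with the naive scores (+1, -2, -1), `shuffle_test` is a hypothetical helper name, and the number of shuffles defaults to 1,000 instead of 100,000 to keep the run short.

```python
import random


def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Global alignment score by dynamic programming (score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(a, b, n_shuffles=1000, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    observed = global_score(a, b)
    letters = list(a)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)            # same composition, random order
        if global_score("".join(letters), b) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles


score, fraction = shuffle_test("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(score, fraction)  # similarity is called significant if fraction <= 0.05
```

Shuffling keeps the amino-acid composition of the sequence while destroying its order, so the shuffled scores estimate what similarity arises from composition alone.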

31 How close? Rule of thumb: proteins are considered homologous if they are at least 25% identical (over a length > 100); DNA sequences are considered homologous if they are at least 70% identical.

32 Twilight zone: proteins with < 25% identity may or may not be homologous. (Note that 5% identity is expected purely by chance!)

33 Searching a sequence database
Idea: to find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database and detect the best-scoring significant homologs.
The same idea in short: use your sequence as a query to find homologous sequences in a sequence database.

34 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous

35 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?

36 Protein is better! Selection (and hence conservation) works (mostly) at the protein level:
CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
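
The example can be checked with a few lines of code. This is a small sketch: `CODONS` holds only the four codons that appear on the slide (taken from the standard genetic code), and `translate` and `identity` are illustrative helper names.

```python
# Only the four codons used in the slide's example (standard genetic code).
CODONS = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon into three-letter residue names."""
    return "-".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(translate("CTTTCA"), translate("TTGAGT"))   # Leu-Ser Leu-Ser
print(identity("CTTTCA", "TTGAGT"))               # 1 of 6 nucleotides identical
```

The two DNA sequences share only one of six positions, yet they encode the same dipeptide, which is why a protein-level comparison detects the relationship more easily.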

37 Query type
Nucleotides: a four-letter alphabet. Two random DNA sequences will, on average, have 25% identity.
Amino acids: a twenty-letter alphabet. Two random protein sequences will, on average, have 5% identity.
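
These two figures follow from a simple calculation, sketched below under the assumption that all letters are equally frequent and positions are independent (real sequences have skewed compositions, so the true chance identity is somewhat higher). The chance that two random letters agree is the sum of the squared letter frequencies.

```python
def expected_identity(freqs):
    """Probability that two independently drawn letters are the same."""
    return sum(p * p for p in freqs)


dna = expected_identity([1 / 4] * 4)        # uniform four-letter alphabet
protein = expected_identity([1 / 20] * 20)  # uniform twenty-letter alphabet
print(f"DNA: {dna:.0%}, protein: {protein:.0%}")  # DNA: 25%, protein: 5%
```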

38 Conclusion The amino-acid sequence is often preferable for homology search

39 How do we search a database? If each pairwise alignment takes 1/10 of a second and the database contains 10^7 sequences, one search will take 10^6 seconds, or about 11.6 days. At least 150,000 searches are performed per day, and GenBank holds more than 82,000,000 sequence records.
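
The running-time estimate is simple arithmetic, written out here with the slide's own figures (0.1 seconds per alignment, 10^7 database sequences):

```python
seconds_per_alignment = 0.1
database_size = 10**7

total_seconds = seconds_per_alignment * database_size   # 10^6 seconds
total_days = total_seconds / (60 * 60 * 24)             # 86,400 seconds per day
print(f"{total_seconds:.0f} s = {total_days:.1f} days")
```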

40 Conclusion: running the exact pairwise alignment algorithm between the query and every database entry is too slow.

41 Heuristic. Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) and takes much less time than the exact solution.

42 BLAST

43 BLAST: Basic Local Alignment Search Tool. A heuristic for searching a database for similar sequences.

44 DNA or protein: all types of searches are possible. The query can be DNA or protein, and the database can be DNA or protein.
blastn: nucleotide query vs. nucleotide database
blastp: protein query vs. protein database
blastx: translated nucleotide query vs. protein database
tblastn: protein query vs. translated nucleotide database
tblastx: translated nucleotide query vs. translated nucleotide database
Translated databases: TrEMBL, GenPept

45 BLAST: underlying hypothesis
The hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them.
The heuristic:
1. Discard irrelevant sequences.
2. Perform exact local alignment only with the remaining sequences.

46 How do we discard irrelevant sequences quickly? Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA). Save the words in a look-up table that can be searched quickly.
Example: WTDFGYPAILKGGTAC → WTD, TDF, DFG, FGY, GYP, …
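
A minimal sketch of such a word look-up table is shown below. It is not BLAST's actual data structure: the names `words` and `build_index` and the toy database entry `seq1` are hypothetical, and a plain dictionary stands in for the optimized table a real implementation would use.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


print(words("WTDFGYPAILKGGTAC")[:5])   # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']

index = build_index({"seq1": "WTDFGYPAILKGGTAC"})   # toy one-sequence database
print(index["GYP"])                                 # [('seq1', 4)]
```

Because the table is built once for the database, each query word can then be looked up directly instead of scanning every database sequence.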

47 BLAST: discarding sequences. When the user enters a query sequence, it is also divided into words. The database is then searched for consecutive neighboring words.

48 Neighbor words: neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level.
Example (word, score): GFB; GFC (20); GPC (11); WAC (5)
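
The idea of a neighborhood with a cutoff can be sketched as below. Note the assumptions: `toy_score` is a hypothetical stand-in (+5 for identical residues, -1 otherwise) and is NOT BLOSUM62, so the scores differ from those on the slide; `neighbor_words` is an illustrative name, and the exhaustive enumeration is only practical for short words.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids


def toy_score(x, y):
    """Hypothetical substitution score; a real search would use e.g. BLOSUM62."""
    return 5 if x == y else -1


def neighbor_words(word, threshold, score=toy_score):
    """All words of the same length scoring >= threshold against `word`."""
    found = []
    for candidate in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, candidate))
        if s >= threshold:
            found.append(("".join(candidate), s))
    # Highest-scoring neighbors first, ties in alphabetical order.
    return sorted(found, key=lambda item: (-item[1], item[0]))


neighbors = neighbor_words("GFC", threshold=9)
print(neighbors[0], len(neighbors))   # ('GFC', 15) 58
```

With this toy scheme and a cutoff of 9, the neighborhood is the word itself plus every word that differs from it at exactly one position; a real substitution matrix would instead favor chemically similar replacements.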

49 E-value: the number of alignments with a score ≥ Y that we expect to find by chance when searching with a random sequence against a random database. Theoretically, we could trust any result with an E-value ≤ 1. In practice, BLAST uses estimations: E-values of … and lower indicate significant homology; E-values between … and … should be checked (similar domains, maybe non-homologous); E-values between … and 1 do not indicate good homology.

Web servers for pairwise alignment

BLAST 2 sequences (bl2seq) at NCBI: produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment. BLAST does not use an exact algorithm but a heuristic.

Back to NCBI

BLAST – bl2seq

Bl2seq – query: blastn for nucleotide sequences, blastp for protein sequences

Bl2seq results

Figure legend: match, dissimilarity, gaps, similarity, low complexity

BLAST – programs. Query: DNA or protein. Database: DNA or protein.

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

BLAST scores:
Bit score: a score for the alignment according to the number of similarities, identities, etc.
Expect value (E-value): the number of alignments with a score at least this high that one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
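
The relationship between the two scores can be sketched with the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the query length, and n the total database length. This formula is not given on the slides and is added here as background; the function name `expect_value` is hypothetical, and real BLAST uses corrected ("effective") lengths, so its reported values will differ somewhat.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit of score halves the E-value;
# a database twice as large doubles it.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, query_len=300, db_len=10**9))
```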

Blastp – acquiring sequences

blastp – acquiring sequences

64 Multiple sequence alignment: similar to pairwise alignment, BUT n sequences are aligned instead of just 2.
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

65 MSA = Multiple Sequence Alignment. Each row represents an individual sequence. Each column represents the 'same' position.

66 Conserved positions: columns in which all the sequences contain the same amino acid or nucleotide. These positions are important for function or structure.
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG

67 Consensus sequence: a consensus sequence holds the most frequent character of the alignment at each column.
TGTTCTA
TGTTCAA
TCTTCAA
TGTTCAA
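
A minimal sketch of this definition, applied to the four sequences on the slide, is shown below. The function name `consensus` is illustrative, and ties (which do not occur in this example) are broken in favor of the character seen first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


alignment = ["TGTTCTA", "TGTTCAA", "TCTTCAA", "TGTTCAA"]
print(consensus(alignment))   # TGTTCAA
```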

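68 Profile = PSSM = Position-Specific Score Matrix: a table with one row per character (A, C, G, T) and one column per alignment position, built from the aligned sequences.
GTTCTA
GTTCAA
CTTCAA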
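
The simplest form of such a table holds the frequency of each character at each position, as in the sketch below. This is an assumption about what the slide's table contained, since its values did not survive; scoring matrices used in practice usually convert these frequencies to log-odds scores and add pseudocounts. The function name `profile` is illustrative.

```python
def profile(rows, alphabet="ACGT"):
    """Per-column character frequencies: one dict per alignment position."""
    n = len(rows)
    return [{ch: column.count(ch) / n for ch in alphabet} for column in zip(*rows)]


table = profile(["GTTCTA", "GTTCAA", "CTTCAA"])
for ch in "ACGT":
    # One row per character, one value per alignment column.
    print(ch, " ".join(f"{col[ch]:.2f}" for col in table))
```

Unlike a consensus sequence, the profile keeps the information that, for example, the first column is G in two of the three sequences and C in the third.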

69 Alignment methods: no practical method guarantees an optimal MSA, so all methods in use are heuristics:
Progressive/hierarchical alignment (Clustal)
Iterative alignment (MAFFT, MUSCLE)

70 Progressive alignment, first step: compute the pairwise alignments for all sequences against all (10 pairwise alignments for the five sequences A, B, C, D, E). The similarities are converted to distances and stored in a table:
     A   B   C   D   E
A
B    8
C   17  15
D    …
E    …

71 Second step: cluster the sequences to create a tree (the guide tree), which represents the order in which pairs of sequences are to be aligned. Similar sequences are neighbors in the tree; distant sequences are distant from each other in the tree. The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!

72 Third step: 1. Align the most similar (neighboring) pairs of sequences.

73 Third step: 2. Align pairs of pairs. (Figure labels: sequence, profile.)

74 Third step, main disadvantages: a sub-optimal tree topology, and misalignments resulting from globally aligning pairs of sequences.
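
The first two steps can be sketched in code: given a distance table, repeatedly join the two closest clusters to obtain the order in which sequences and profiles would be aligned. Several things here are assumptions: only the distances A-B = 8, A-C = 17, and B-C = 15 come from the slide, while all distances involving D and E are invented for illustration; average-linkage clustering is one simple way to build a guide tree and is not necessarily what a given program uses; and `guide_tree` is a hypothetical name.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Average-linkage clustering; returns the join order as nested tuples."""
    clusters = [(name, [name]) for name in names]   # (subtree, leaf names)

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


raw = {"AB": 8, "AC": 17, "BC": 15,                 # from the slide
       "AD": 25, "BD": 24, "CD": 20,                # hypothetical
       "AE": 30, "BE": 29, "CE": 28, "DE": 12}      # hypothetical
dist = {frozenset(pair): d for pair, d in raw.items()}

print(guide_tree("ABCDE", dist))   # (('D', 'E'), ('C', ('A', 'B')))
```

Reading the result from the inside out gives the alignment order: A with B first, then D with E, then C against the A-B profile, and finally the two resulting profiles against each other.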

75 Iterative alignment: pairwise distance table → guide tree → MSA, then back to the distance table. Iterate until the MSA does not change (convergence).

76 Case study: Using homology searching The human kinome

77 Kinases and phosphatases

78 Multi-tasking enzymes: signal transduction, metabolism, transcription, cell cycle, differentiation, function of the nervous and immune systems, and more.

79 How many kinases in the human genome?
1950s: discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase.
1970s: the advent of cloning and sequencing produced a speculation that the vertebrate genome encodes as many as 1,001 kinases.

How many kinases in the human genome? … : the human genome sequence becomes available, as well as the GenBank, Swiss-Prot, and dbEST databases. How can we find out how many kinases are out there?

81 The human kinome: in 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:
1. Search and cross-reference all these databases for all kinases.
2. Characterize all the kinases found.

82 ePKs and aPKs
Eukaryotic protein kinases (ePKs, the majority): sequence homology of the catalytic domain; additional regulatory domains are non-homologous.
Atypical protein kinases (aPKs): no sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs.

83 The search: several profiles were built, based on the catalytic domain of (a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain, and (b) each subfamily of known aPKs. HMM-profile searches and PSI-BLAST searches were then performed.

84 The results: 478 ePKs and 40 aPKs, for a total of 518 kinases in the human genome (about half of the prediction from the 1970s), or 1.7% of human genes.