1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT.

Presentation transcript:

1 ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC CTGGCTCACAAGTACCATTGA MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… Motivation

2 Lesson 2 Aligning sequences and searching databases

3 Homology and sequence alignment

4 Homology Homology = Similarity between objects due to a common ancestry

5 Sequence homology VLSPAVKWAKVGAHAAGHG ||| || |||| | |||| VLSEAVLWAKVEADVAGHG Similarity between sequences as a result of common ancestry.

6 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

7 Why align? VLSPAVKWAKV ||| || |||| VLSEAVLWAKV 1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure). 2. Required for evolutionary studies (e.g., tree reconstruction). 3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

8 Insertions, deletions, and substitutions

9 Evolutionary changes in sequences Three types of changes: 1. Substitution – a replacement of one (or more) sequence letter by another. 2. Insertion – an insertion of a letter or several letters into the sequence. 3. Deletion – deletion of a letter (or more) from the sequence. Insertion + Deletion → Indel

10 Sequence alignment If two sequences share a common ancestor – for example human and armadillo hemoglobin, we can represent their evolutionary relationship using a tree VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV

11 Perfect match VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case). VLSEAVLWAKV

12 A substitution VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAVLWAKV A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred). VLSEAVLWAKV VLSPAVLWAKV

13 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV * Option 1: The ancestor had L and it was lost here *. In such a case, the event was a deletion. VLSEAVLWAKV *

14 Indel VLSPAV-WAKV ||| || |||| VLSEAVLWAKV VLSPAV - WAKV VLSEAV WAKV * Option 2: The ancestor was shorter and the L was inserted here *. In such a case, the event was an insertion. VLSEAVLWAKV L *

15 Indel VLSPAV - WAKV Normally, given two sequences we cannot tell whether it was an insertion or a deletion, so we term the event as an indel. VLSEAVLWAKV Deletion?Insertion?

16 Indels in protein coding genes Indels in protein-coding genes are often 3 bp, 6 bp, 9 bp, etc. long, because an indel whose length is a multiple of 3 preserves the reading frame. Gene search: searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for open reading frames (ORFs).

17 Global and Local pairwise alignments

18 Global vs. Local Global alignment – finds the best alignment across the entire two sequences. Local alignment – finds regions of similarity in parts of the sequences. Global alignment forces alignment in regions which differ: ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ Local alignment returns only regions of good alignment: ADLG CDRYFQ |||| |||| | ADLG CDRYYQ

19 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

20 Proteins are comprised of domains Domain B Protein tyrosine kinase domain Domain A Human PTK2 :

21 Protein tyrosine kinase domain In leukocytes, a different gene for tyrosine kinase is expressed. Domain X Protein tyrosine kinase domain Domain A

22 Domain X Protein tyrosine kinase domain Domain B Protein tyrosine kinase domain Domain A Leukocyte TK PTK2 The sequence similarity is restricted to a single domain

23 Global alignment of PTK and LTK

24 Local alignment of PTK and LTK

25 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.

26 How are alignment scores computed?

27 Pairwise alignment AAGCTGAATTCGAA AGGCTCATTTCTGA One possible alignment: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA-

28 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2 mismatches, 4 indels (gaps), 10 perfect matches

29 Choosing an alignment for a pair of sequences AAGCTGAATTCGAA AGGCTCATTTCTGA Many different alignments are possible for 2 sequences: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Which alignment is better?

30 Scoring system (naïve) Perfect match: +1 Mismatch: -2 Indel (gap): -1 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- Score = (+1)×10 + (-2)×2 + (-1)×4 = 2 A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- Score = (+1)×9 + (-2)×2 + (-1)×6 = -1 Higher score → better alignment
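
The naïve scoring scheme on this slide can be checked mechanically. The sketch below is only an illustration: the function name `score_alignment` is made up for this example, and it assumes the two aligned rows have equal length and use `-` for a gap.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Score two equal-length aligned rows column by column (naive scheme)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # perfect match
        else:
            total += mismatch   # substitution
    return total

first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1, as on the slide
```

Under these scores the first alignment is preferred, which agrees with the slide's "higher score, better alignment" rule.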

31 Scoring systems

32 Scoring system In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary. Different scoring systems → different alignments. We want a good scoring system…

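33 Scoring matrix Representing the scoring system as a table or matrix n × n (n is the number of letters the alphabet contains: n=4 for nucleotides, n=20 for amino acids). The matrix is symmetric. In the example (rows and columns A, G, C, T), each diagonal entry is 2 and the off-diagonal entry shown (A/G) is -6.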

34 DNA scoring matrices Uniform substitutions between all nucleotides (From/To matrix over A, G, C, T): Match: 2 Mismatch: -6

35 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion
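
As a rough illustration of a matrix that distinguishes transitions from transversions, the sketch below builds a symmetric 4 × 4 table. The match score 2 and the score -6 are borrowed from the uniform matrix in the slides; the transition score -4 and the function name `build_dna_matrix` are assumptions made only for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def build_dna_matrix(match=2, transition=-4, transversion=-6):
    """Return a symmetric dict-of-dicts scoring matrix over A, C, G, T.

    transition=-4 is an assumed value chosen for illustration only.
    """
    matrix = {}
    for x in "ACGT":
        matrix[x] = {}
        for y in "ACGT":
            if x == y:
                matrix[x][y] = match
            elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                matrix[x][y] = transition    # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[x][y] = transversion  # purine<->pyrimidine
    return matrix

dna_matrix = build_dna_matrix()
print(dna_matrix["A"]["G"], dna_matrix["A"]["C"])
```

Penalizing transitions less than transversions reflects the observation that transitions are the more frequent kind of substitution.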

36 Amino-acid scoring matrices Take into account physico- chemical properties

37 Amino-acid substitution matrices Actual substitutions: –Based on empirical data –Commonly used by many bioinformatics programs –PAM & BLOSUM

38 Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other. M G Y D E M G Y E E M G Y D E M G Y Q E M G Y D E M G Y E E In the 4th column, D/E is found in 7/8 of the cases (compared with 5/8 for D/Q and E/Q).

39 BLOSUM: Blocks Substitution Matrix Based on the BLOCKS database –~2000 blocks from 500 families of related proteins –Families of proteins with identical function Blocks are short conserved patterns of 3-60 aa without gaps AABCDA----BBCDA DABCDA----BBCBB BBBCDA-AA-BCCAA AAACDA-A--CBCDB CCBADA---DBBDCC AAACAA----BBCCC

40 BLOSUM Each block represents a sequence alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

41 BLOSUM Matrices BLOSUMn is based on sequences that share at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45

42 Example : Blosum62 Derived from blocks where the sequences share at least 62% identity

43 Scoring gaps In advanced algorithms, two gaps of one amino acid each ( X-Y- ) are given a different score than one gap of two amino acids ( X--Y ). This is done by using different penalties for “opening” a gap and for extending a gap: Gap extension penalty < Gap opening penalty
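
A minimal sketch of this open/extend idea follows. The penalty values (-5 to open, -1 to extend), the convention that the first gap position is charged the opening penalty, and the name `gap_cost` are all assumptions chosen for illustration; real programs differ in the exact convention.

```python
def gap_cost(row, gap_open=-5, gap_extend=-1):
    """Sum affine gap penalties over the runs of '-' in one aligned row.

    The first position of each run costs gap_open; every further
    position in the same run costs gap_extend (assumed convention).
    """
    total = 0
    in_gap = False
    for ch in row:
        if ch == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return total

two_short_gaps = gap_cost("X-Y-")  # two separate gaps are opened
one_long_gap = gap_cost("X--Y")    # one gap is opened, then extended
print(two_short_gaps, one_long_gap)  # -10 -6
```

With these numbers the single two-residue gap is penalized less than two one-residue gaps, which is the behaviour the slide describes.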

44 Intermediate summary 1. Scoring system = substitution matrix + gap penalty. 2. Used for both global and local alignment. 3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.

45 Computational aspects

46 Many possible alignments AAGCTGAATTCGAA AGGCTCATTTCTGA AAGCT-GAATT-C-GAA A-GGCT-CATTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- AAG-CTGAATT-C-GAA AGGCT-CATTT-CTGA- It is not trivial (for most people) to figure out how to go over all possible pairwise alignments and find the one with the highest score.

47 Optimal alignment algorithms Needleman-Wunsch (global) [1970] Smith-Waterman (local) [1981] Both algorithms have complexity O(mn) (m – length of sequence 1, n – length of sequence 2). Informally: doubling the length of one sequence doubles the computation time; doubling both quadruples it. For proteins of length < 1000, computing the alignment takes much less than a second.

48 Dynamic programming Solving a problem with many overlapping sub-problems. Example: Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13, … F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2)

49 Dynamic programming F(1) = F(2) = 1; F(n) = F(n-1) + F(n-2) Naïvely solving F(7): F(7) = F(6) + F(5) = F(5) + F(4) + F(4) + F(3) = F(4) + F(3) + F(3) + F(2) + F(3) + F(2) + F(2) + F(1) = F(3) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1) = F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(1) + F(2) + F(2) + F(1) + F(2) + F(2) + F(1) = 13

50 Dynamic programming F(7) using Dynamic programming: F(3) = F(2) + F(1) = 1 + 1 = 2 F(4) = F(3) + F(2) = 2 + 1 = 3 F(5) = F(4) + F(3) = 3 + 2 = 5 F(6) = F(5) + F(4) = 5 + 3 = 8 F(7) = F(6) + F(5) = 8 + 5 = 13
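
The contrast between the naïve expansion and the table-filling approach can be sketched in a few lines. The function names and the call counter are illustrative additions, not part of the slides.

```python
def fib_naive(n, counter):
    """Recursive Fibonacci; counter[0] records how many calls were made."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

def fib_dp(n):
    """Fill a table from F(1), F(2) upward so each value is computed once."""
    table = [0, 1, 1]  # index 0 is unused padding
    for k in range(3, n + 1):
        table.append(table[k - 1] + table[k - 2])
    return table[n]

calls = [0]
naive_value = fib_naive(7, calls)
dp_value = fib_dp(7)
print(naive_value, dp_value, calls[0])  # 13 13 25
```

The naïve version makes 25 calls to reach F(7), while the table needs only five additions. Alignment algorithms exploit the same saving.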

51 Needleman Wunsch (1970) Finds the best alignment for the first i characters of seq1 with the first j of seq2.
Base case: $F(i, 0) = i \times \text{Gap penalty}$, $F(0, j) = j \times \text{Gap penalty}$
Recursion rule:
$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) \\ F(i-1,\,j) + \text{Gap penalty} \\ F(i,\,j-1) + \text{Gap penalty} \end{cases}$$

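52 Needleman Wunsch (1970)
Base case: $F(i, 0) = i \times \text{Gap penalty}$, $F(0, j) = j \times \text{Gap penalty}$
Recursion rule:
$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(x_i, y_j) \\ F(i-1,\,j) + \text{Gap penalty} \\ F(i,\,j-1) + \text{Gap penalty} \end{cases}$$
Cool alignment applet: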
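
The base case and recursion rule can be turned into a short program. This is a minimal sketch, not a reference implementation: it assumes the naïve scores from the earlier slides (match +1, mismatch -2, gap -1, with the gap penalty stored as a negative number and added), a linear gap penalty, and returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-2, gap=-1):
    """Global alignment by dynamic programming; returns (score, row1, row2)."""
    m, n = len(seq1), len(seq2)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # base case: F(i, 0) = i * gap
        F[i][0] = i * gap
    for j in range(1, n + 1):          # base case: F(0, j) = j * gap
        F[0][j] = j * gap
    for i in range(1, m + 1):          # recursion rule
        for j in range(1, n + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from F(m, n) to recover one optimal alignment.
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            row1.append(seq1[i - 1])   # gap in seq2
            row2.append("-")
            i -= 1
        else:
            row1.append("-")           # gap in seq1
            row2.append(seq2[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(row1)), "".join(reversed(row2))

score, aligned1, aligned2 = needleman_wunsch("VLSPAVWAKV", "VLSEAVLWAKV")
print(score)
print(aligned1)
print(aligned2)
```

For the hemoglobin fragments used earlier in the slides, the optimal score under these parameters is 6 (nine matches, one mismatch, one gap). The two nested loops over i and j are where the O(mn) running time comes from.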

53 Searching databases

54 Searching a sequence database Idea: In order to find homologous sequences to a sequence of interest, one should compute its pairwise alignment against all known sequences in a database, and detect the best scoring significant homologs. The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database.

55 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous

56 Protein or DNA search

57 Query sequence: DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable?

58 Protein is better! Selection (and hence conservation) works (mostly) at the protein level: CTTTCA = Leu-Ser TTGAGT = Leu-Ser
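
The two sequences on this slide make the point concrete: they translate to the same peptide yet barely resemble each other at the DNA level. The sketch below uses a deliberately tiny codon table containing only the four codons that appear in the example (taken from the standard genetic code); the helper names are illustrative.

```python
# Only the four codons needed for the slide's example (standard genetic code).
CODONS = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}

def translate(dna):
    """Translate a DNA string codon by codon using the small table above."""
    return "-".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))

def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    same = sum(x == y for x, y in zip(a, b))
    return same / len(a)

pep1 = translate("CTTTCA")
pep2 = translate("TTGAGT")
dna_identity = identity("CTTTCA", "TTGAGT")
print(pep1, pep2, dna_identity)
```

Both sequences give Leu-Ser, while only one of the six nucleotide positions is identical, so a DNA-level search could easily miss a relationship that is obvious at the protein level.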

59 Query type Nucleotides: 4 letter alphabet Amino acids: 20 letter alphabet Two random DNA sequences will, on average, have 25% identity Two random protein sequences will, on average, have 5% identity
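
The 25% and 5% figures follow from a short calculation, sketched here under the simplifying assumptions that all n letters are equally frequent, positions are independent, and the sequences are compared without gaps. Real nucleotide and amino-acid frequencies are not uniform, so these are approximations.

```latex
P(\text{two random letters are identical})
  = \sum_{a=1}^{n} p_a^{2}
  = n \cdot \left(\frac{1}{n}\right)^{2}
  = \frac{1}{n},
\qquad
\frac{1}{4} = 25\% \ \text{(nucleotides)},
\qquad
\frac{1}{20} = 5\% \ \text{(amino acids)}
```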

60 Conclusion The amino-acid sequence is often preferable for homology search

61 Computation time

62 How do we search a database? If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search. 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
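
The estimate can be reproduced with a couple of lines of arithmetic. The numbers are the ones quoted on the slide; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def search_days(num_sequences, seconds_per_alignment):
    """Days needed to align one query against every database sequence."""
    return num_sequences * seconds_per_alignment / SECONDS_PER_DAY

days = search_days(10**7, 0.1)
print(f"{days:.2f} days")  # about 11.57 days for a single query
```

Doing this exhaustively for every one of the daily queries is clearly impractical, which motivates the heuristic approach.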

63 Conclusion Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow

64 Heuristic Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (though the result is usually not too bad) but takes much less time than the exact algorithm

65 BLAST

66 BLAST BLAST - Basic Local Alignment Search Tool A heuristic for searching a database for similar sequences. The heuristic is based on restrictions of the similarity (such as using ungapped word matching instead of single-character matching).
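
The word-matching idea can be illustrated with a toy seed finder. This is only a sketch of the general principle, not BLAST's actual implementation: it looks for exact shared words of length k (k = 3 is an assumed value), whereas BLAST also scores similar words for proteins and then extends the seeds into longer alignments. All names are hypothetical.

```python
from collections import defaultdict

def index_words(sequence, k):
    """Map every length-k word to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-letter word is shared."""
    index = index_words(subject, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("ADLGAV", "XXADLGXX")
print(seeds)  # [(0, 2), (1, 3)]
```

Only sequences that share at least one seed with the query need to be examined further, so most of the database is skipped without running a full dynamic-programming alignment.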

67 DNA or Protein Query: DNA or Protein Database: DNA or Protein All types of searches are possible: blastn – nuc vs. nuc; blastp – prot vs. prot; blastx – translated query vs. protein database; tblastn – protein vs. translated nuc. DB; tblastx – translated query vs. translated database

68 E-value The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database Theoretically, we could trust any result with an E-value ≤ 1 In practice – BLAST uses estimations. E-values of and lower indicate a significant homology. E-values between and should be checked (similar domains, maybe non-homologous). E-values between and 1 do not indicate a good homology

69 Filtering low complexity Low complexity regions : e.g., Proline rich areas (in proteins), Alu repeats (in DNA) Regions of low complexity generate high scores of alignment, BUT – this does not indicate homology

70 Solution In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in query)
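
To show what masking looks like, here is a deliberately simplistic sketch that replaces long runs of a single repeated residue with X. It is not the filter BLAST uses (BLAST relies on dedicated low-complexity filters such as SEG for proteins and DUST for nucleotides); the run-length threshold of 5 and the function name are assumptions for illustration.

```python
import re

def mask_runs(sequence, min_run=5, mask_char="X"):
    """Replace runs of one repeated letter of length >= min_run with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)

masked = mask_runs("MKTPPPPPPAVL")
print(masked)  # MKTXXXXXXAVL
```

The masked positions keep the sequence length unchanged, so alignment coordinates still refer to the original query.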