1 Database searching

2 Purposes of similarity search
- Function prediction by homology (in silico annotation)
- Search for identified gene in other organisms
- Identifying regulatory elements
- Assisting in sequence assembly
Problems
- Similar sequences can have different functions
- Non-homologous sequences can have identical function
- Feature space ≠ sequence space

3 Main tools
- FASTA
- BLAST = Basic Local Alignment Search Tool
Procedure
1. Choose scoring matrix
2. Find best local alignments using scoring matrix
3. Determine statistical significance of result
- List in decreasing order of significance

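4 BLOSUM substitution matrix
- Log-odds scores: score = log2(proportion observed / proportion expected), multiplied by a scale factor and rounded to an integer
- BLOSUM62, for example, is in half-bit units: 2 log2(proportion observed / proportion expected)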
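The log-odds formula can be tried on a couple of numbers. This is a minimal sketch that assumes the half-bit scaling (factor 2, log base 2); the function name and the proportions are made up purely for illustration and do not come from any real BLOSUM table.

```python
import math

def log_odds_score(p_observed: float, p_expected: float) -> int:
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded."""
    return round(2 * math.log2(p_observed / p_expected))

# Hypothetical proportions, chosen only to show the sign of the score.
enriched = log_odds_score(0.04, 0.01)    # pair seen 4x more often than chance
depleted = log_odds_score(0.005, 0.01)   # pair seen half as often as chance
print(enriched, depleted)
```

Pairs that occur more often than expected by chance get positive scores, and pairs that occur less often get negative scores.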

5 FASTA
- Step 1: Find hot spots, i.e. pairs of words of length k that match exactly (hashing)
- Step 2: Locate the best "diagonal runs" (sequences of consecutive hot spots on a diagonal)
- Step 3: Combine sub-alignments from diagonal runs into a longer alignment

6 Exercise (hashing tables of FASTA)
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
Prepare a table of offset values (= matching diagonals)

7 Solution
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD

sequence 1: ACNGTSCHQE
             C   S  Q          << offset = 0
sequence 2: GCHCLSAGQD

sequence 1: ACNGTSCHQE---
               G  C            << offset = -3
sequence 2: ---GCHCLSAGQD

sequence 1: ACNGTSCHQE-----
                  CH           << offset = -5
sequence 2: -----GCHCLSAGQD
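The offset table from the exercise can be reproduced with a small hashing sketch. It assumes word length k = 1 and defines the offset as (position in sequence 2) minus (position in sequence 1), which matches the signs used in the solution; the function name `offset_table` is illustrative only.

```python
from collections import defaultdict

def offset_table(seq1: str, seq2: str, k: int = 1) -> dict[int, list[str]]:
    """Map each diagonal offset to the k-words that match exactly on it."""
    # Hash every k-word of seq1 to the positions where it starts.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)

    # Look up every k-word of seq2; offset = position in seq2 - position in seq1.
    table = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for i in positions.get(word, []):
            table[j - i].append(word)
    return dict(table)

table = offset_table("ACNGTSCHQE", "GCHCLSAGQD")
# Print the diagonals with the most hot spots first.
for offset, words in sorted(table.items(), key=lambda item: -len(item[1])):
    print(offset, words)
```

The diagonals with the most matching words (offsets 0, -3 and -5 here) are the candidates for the "diagonal runs" of step 2.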

8 FASTA (cont.)
- For each offset: rescan the 10 regions with the highest density of identities using the BLOSUM50 matrix
- Best score = init1
- Join diagonals (initn = sum of individual scores - gap penalties)
- Construct optimal local alignment (opt score)
- Assess significance (E-value)
- Perform Smith-Waterman on best matches

9 The main steps of gapped BLAST
1. Specify word length (3 for proteins, 11 for nucleotides)
2. Filtering for complexity
3. Make list of words to search for
4. Exact search
5. Join matches, and extend ungapped alignment
6. Calculate E-values
7. Join high-scoring pairs
8. Perform Smith-Waterman on best matches

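10 Filtering sequences
Replace sequence regions of low complexity K with X.
For a window of length L over an alphabet of N letters, where $n_i$ is the count of letter i:
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
Find K for sequence GGGG and for sequence ATCG.
GGGG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = 4$, $n_C = 0$, $n_T = 0$, $n_A = 0$; $\prod_i n_i! = 4! \cdot 0! \cdot 0! \cdot 0! = 24$; $K = \frac{1}{4} \log_4 (24/24) = 0$
ATCG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = 1$, $n_C = 1$, $n_T = 1$, $n_A = 1$; $\prod_i n_i! = 1! \cdot 1! \cdot 1! \cdot 1! = 1$; $K = \frac{1}{4} \log_4 (24/1) = 0.573$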
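The two worked values can be checked with a few lines of code. This is a minimal sketch of the complexity measure used on this slide, assuming a DNA alphabet of 4 letters by default; the function name is illustrative.

```python
import math
from collections import Counter

def complexity(seq: str, alphabet_size: int = 4) -> float:
    """K = (1/L) * log_N( L! / prod(n_i!) ) for a window of length L."""
    length = len(seq)
    # Letters that do not occur contribute 0! = 1, so they can be skipped.
    denominator = math.prod(math.factorial(n) for n in Counter(seq).values())
    return math.log(math.factorial(length) / denominator, alphabet_size) / length

print(round(complexity("GGGG"), 3))  # 0.0
print(round(complexity("ATCG"), 3))  # 0.573
```

A filter would slide such a window along the sequence and mask windows whose K falls below a chosen cutoff; the cutoff itself is not given in the slides.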

11 The BLAST algorithm
- Break the search sequence into words
- W = 3 for proteins, W = 11 for DNA
- Include in the search all words that score above a certain value (T) for any search word
Example: MCGPFILGTYC
Words: MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
Neighbourhood words:
  MCG: MCG, MCT, MCN, …
  CGP: CGP, MGP, CTP, …
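The word list and the neighbourhood expansion can be sketched as follows. The scoring function here is a deliberate simplification (an assumed +5 for identity and -1 otherwise), not a real substitution matrix, so the neighbourhood it produces differs from what BLAST would generate with BLOSUM62; all names are illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def query_words(query: str, w: int = 3) -> list[str]:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def toy_score(a: str, b: str) -> int:
    # Assumption: +5 for identity, -1 otherwise (stand-in for a real matrix).
    return 5 if a == b else -1

def neighborhood(word: str, threshold: int) -> list[str]:
    """All words whose score against `word` is at least the threshold T."""
    return [
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, candidate)) >= threshold
    ]

words = query_words("MCGPFILGTYC")
print(words)
print(len(neighborhood("MCG", threshold=9)))
```

With this toy scoring, a threshold of 9 admits the word itself plus every word that differs at exactly one position. Lowering the threshold enlarges the neighbourhood.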

12 The BLAST search algorithm

13 Searching the database
- Search for the words in the database
- Word locations can be precomputed and indexed
- Searching for a short string in a long string
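Precomputing word locations amounts to building a lookup table from each word to the places where it occurs. The sketch below uses a plain dictionary and two made-up database sequences; it illustrates the idea only and says nothing about how BLAST databases are actually stored.

```python
from collections import defaultdict

def build_word_index(database: dict[str, str], w: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map every word of length w to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return dict(index)

def find_hits(words: list[str], index: dict[str, list[tuple[str, int]]]) -> dict[str, list[tuple[str, int]]]:
    """Look up each query word; words absent from the database are dropped."""
    return {word: index[word] for word in words if word in index}

# Hypothetical two-sequence database, for illustration only.
database = {"seqA": "AAMCGPFAA", "seqB": "GTYCMCG"}
index = build_word_index(database)
hits = find_hits(["MCG", "GTY", "WWW"], index)
print(hits)
```

Building the index costs one pass over the database; after that each query word is found with a single dictionary lookup instead of a scan.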

14 Search Significance Scores
- A search will always return some hits.
- How can we determine how "unusual" a particular alignment score is?
- Assumptions

15 Assessing significance requires a distribution
- I have an apple of diameter 5”. Is that unusual?
[Figure: distribution of apple diameters; x-axis: Diameter (cm), y-axis: Frequency]

16 Is a match significant?
- Match scores for aligning my sequence with random sequences.
- Depends on:
  - Scoring system
  - Database
  - Sequence to search for
    - Length
    - Composition
- How do we determine the random sequences?
[Figure: distribution of match scores; x-axis: Match score, y-axis: Frequency]

17 Generating "random" sequences
- Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
  - Doesn't reflect nature
- Use sequences from a database
  - Might have genuine homology
  - We want unrelated sequences
- Random shuffling of sequences
  - Preserves composition
  - Removes true homology
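The shuffling option is simple to sketch: permuting the residues of a real sequence keeps its composition and destroys any meaningful order. The function name, the seed and the example sequence are illustrative assumptions.

```python
import random
from collections import Counter

def shuffled(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq (same composition, order destroyed)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(42)  # fixed seed so the example is repeatable
original = "ACNGTSCHQE"
decoys = [shuffled(original, rng) for _ in range(5)]
for decoy in decoys:
    print(decoy, Counter(decoy) == Counter(original))
```

Scoring a query against many such decoys gives an empirical background distribution of match scores.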

18 What distribution do we expect to see?
- The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
- Example: Throw n dice and compute the mean.
[Figure: distribution of means for n = 2 and n = 1000]
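The dice example can be simulated directly. This sketch repeats the experiment 500 times for n = 2 and n = 1000 (the trial count and seed are arbitrary choices) and compares the spread of the resulting means.

```python
import random
import statistics

def dice_means(n_dice: int, n_trials: int, rng: random.Random) -> list[float]:
    """Mean of n_dice fair dice, repeated n_trials times."""
    return [
        statistics.fmean(rng.randint(1, 6) for _ in range(n_dice))
        for _ in range(n_trials)
    ]

rng = random.Random(0)  # fixed seed so runs are repeatable
few = dice_means(2, 500, rng)
many = dice_means(1000, 500, rng)
print("n = 2:    mean", statistics.fmean(few), "stdev", statistics.stdev(few))
print("n = 1000: mean", statistics.fmean(many), "stdev", statistics.stdev(many))
```

Both sets of means centre on 3.5, but for n = 1000 they are packed tightly around it, and a histogram of them looks close to a bell curve.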

19 Determining significance of match
- The score of an ungapped alignment is $S = \sum_i s(x_i, y_i)$.
- The scores of individual sites are assumed to be independent.
- The distribution of a sum of many independent random variables approaches a normal distribution (central limit theorem).

20 Determining significance of match
- However, we don't select scores randomly. We take the maximum extension of the initial word (HSP).
- The distribution of the maximum score of a large number N of i.i.d. random variables is called the extreme value distribution.
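The effect of taking a maximum can be seen in a short simulation. As a stand-in for alignment scores, the sketch draws standard Gaussian values (an assumption made for simplicity, as are the sample sizes and seed) and records the largest value in each batch.

```python
import random
import statistics

def maxima(n_values: int, n_trials: int, rng: random.Random) -> list[float]:
    """Largest of n_values standard Gaussian draws, repeated n_trials times."""
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(n_values))
        for _ in range(n_trials)
    ]

rng = random.Random(0)  # fixed seed so runs are repeatable
best = maxima(200, 1000, rng)
print("mean of maxima:  ", statistics.fmean(best))
print("median of maxima:", statistics.median(best))
print("smallest maximum:", min(best))
```

Although the individual values are centred on zero, the maxima sit well above it. Their histogram is skewed, with a longer tail towards high values, which is the shape of an extreme value distribution rather than of a Gaussian.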

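21 Comparing distributions
Extreme value (standard density with location $\mu$ and scale $\beta$): $f(x) = \frac{1}{\beta} e^{-(x-\mu)/\beta} \exp\left(-e^{-(x-\mu)/\beta}\right)$
Gaussian (standard density with mean $\mu$ and standard deviation $\sigma$): $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$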

22 P-values
$P(S > x) = 1 - \exp\left(-K m' n' e^{-\lambda x}\right)$
- $P(S > x)$ is the probability of observing a score S greater than x.
- m' and n' are the effective query and database sequence lengths; K and $\lambda$ are substitution matrix parameters.

23 Determining significance of match
- E-value = expected number of sequences scoring above S in the given database: $E = K m' n' e^{-\lambda S}$
- Low E-values => significant matches
- When E < 0.01, P-values and E-values are nearly identical
- Bit score: the raw score normalised by the scoring-system parameters, $S' = (\lambda S - \ln K) / \ln 2$, so that scores obtained with different matrices can be compared
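The relationship between E-value, P-value and bit score can be made concrete with a small calculation based on the Karlin-Altschul formulas $E = K m' n' e^{-\lambda S}$, $P = 1 - e^{-E}$ and $S' = (\lambda S - \ln K)/\ln 2$. The values of K, lambda, the lengths and the score below are illustrative assumptions, not parameters reported by any particular search.

```python
import math

def e_value(score: float, m_eff: int, n_eff: int, K: float, lam: float) -> float:
    """Expected number of chance hits scoring at least `score`."""
    return K * m_eff * n_eff * math.exp(-lam * score)

def p_value(e: float) -> float:
    """P = 1 - exp(-E), written with expm1 to stay accurate for small E."""
    return -math.expm1(-e)

def bit_score(score: float, K: float, lam: float) -> float:
    """Raw score normalised by the scoring-system parameters."""
    return (lam * score - math.log(K)) / math.log(2)

# Assumed, illustrative parameter values.
K, lam = 0.04, 0.27
m_eff, n_eff = 250, 1_000_000
raw = 80

e = e_value(raw, m_eff, n_eff, K, lam)
p = p_value(e)
bits = bit_score(raw, K, lam)
print(e, p, bits)
```

For small E the two numbers agree to several decimal places, which is why they can be used interchangeably below about 0.01. The bit score satisfies E = m' n' 2^(-S'), so it can be interpreted without knowing K and lambda.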

24 Smith-Waterman local alignment
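For reference, here is a compact sketch of Smith-Waterman local alignment. It assumes a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2) instead of a substitution matrix such as BLOSUM50, and it returns only one best-scoring local alignment.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("ACNGTSCHQE", "GCHCLSAGQD"))
```

On the two exercise sequences the best local alignment under this scoring is the shared word CH, which lies on the offset -5 diagonal found by the hashing step.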

25 BLAST parameters
- Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.
- Raising the segment extension cutoff (X) returns longer extensions for each hit.
- Changing the E-value cutoff changes the threshold for reporting a hit.

26 BLAST flavours
Basic flavours
- BLASTP (proteins to protein database)
- BLASTN (nucleotides to nucleotide database)
- BLASTX (translated nucleotides to protein database)
- TBLASTN (protein to translated database)
- TBLASTX (translated nucleotides to translated database) - SLOW

27 Example
Cloned sequence from Lotus japonicus
Amino-acid level (BlastP):
LLANGNFVLRESGNKDQDGLVWQSFDFPTDTLLPQMKLGWDRKTGLNKI
LRSWKSPSDPSSGYYSYKLEFQGLPEYFLNNRDSPTHRSGPWDGIRFSGIPEK
Nucleotide level (BlastN):
cttctcgcta atggcaattt cgtgctaaga gagtctggca acaaagatca agatgggtta
gtgtggcaga gtttcgattt tcccactgac actttactcc cgcagatgaa actgggatgg
gatcgcaaaa cagggcttaa caaaatcctc agatcctgga aaagcccaag tgatccgtca
agtgggtatt actcgtataa actcgaattt caagggctcc ctgagtattt tttaaacaac
agagactcgc caactcaccg gagcggtccg tgggatggta tccgatttag tggtattcca

34 Matrix parameters

35 Gap parameters

36 Hits

37 Synteny between the rat, mouse and human genomes (Nature 2004)

38 Iterated searches
- Advanced family searches
- PSI-BLAST (Position Specific Iterated BLAST)

39 PSI-BLAST
Search with BLAST using the given query.
while (there are new significant hits)
    combine all significant hits into a profile
    search with BLAST using the profile
end
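The control flow of the loop can be written out in runnable form. This is only a sketch of the iteration: `search` and `build_profile` are hypothetical callables supplied by the caller, and the toy versions below work on a small made-up similarity graph, not on sequences or real profiles.

```python
from typing import Callable, Hashable

def psi_blast_loop(
    query: Hashable,
    search: Callable[[Hashable], set[str]],
    build_profile: Callable[[set[str]], Hashable],
    max_iterations: int = 5,
) -> set[str]:
    """Iterate search -> profile -> search until no new significant hits appear."""
    hits = set(search(query))
    for _ in range(max_iterations):
        profile = build_profile(hits)
        new_hits = set(search(profile)) - hits
        if not new_hits:
            break
        hits |= new_hits
    return hits

# Toy stand-ins (assumptions for illustration): each name "finds" its neighbours.
neighbours = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}

def toy_search(target) -> set[str]:
    members = [target] if isinstance(target, str) else target
    return set().union(*(neighbours[m] for m in members))

def toy_profile(hits: set[str]) -> frozenset:
    return frozenset(hits)

found = psi_blast_loop("q", toy_search, toy_profile)
print(sorted(found))
```

Each round can only add hits, never remove them. The cap on iterations is an added safeguard that is not part of the pseudocode.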

40 PSI-BLAST Greedy algorithm

