Database searching. Purposes of similarity search: function prediction by homology (in silico annotation)

Presentation transcript:

Database searching

Purposes of similarity search
- Function prediction by homology (in silico annotation)
- Search for an identified gene in other organisms
- Identifying regulatory elements
- Assisting in sequence assembly
Problems:
- Similar sequences can have different functions
- Non-homologous sequences can have identical function
- Feature space <> Sequence space

Some databases
- nr (GenBank nucleotide and protein)
- Month: monthly update
- swissprot (protein)
- EST
- pdb (proteins with 3D structures)
- Various genome databases (human, mouse, etc.)

Main tools
- FASTA
- BLAST = Basic Local Alignment Search Tool
Procedure:
1. Choose scoring matrix
2. Find best local alignments using scoring matrix
3. Determine statistical significance of result
   - List in decreasing order of significance

Blosum substitution matrix: log-odds scores, $s = 2\log_2\left(\frac{\text{proportion observed}}{\text{proportion expected}}\right)$
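As a rough illustration of the log-odds idea, the sketch below turns an observed and an expected pair frequency into a rounded score. It assumes the half-bit scaling 2·log2 used for BLOSUM62; the function name and the example frequencies are made up for illustration and are not taken from any real matrix.

```python
import math

def log_odds_score(observed, expected):
    """Return 2 * log2(observed / expected), rounded to the nearest integer."""
    return round(2 * math.log2(observed / expected))

# Hypothetical pair frequencies, chosen only to show the sign of the score.
print(log_odds_score(0.02, 0.01))   # pair seen more often than expected: positive score
print(log_odds_score(0.005, 0.01))  # pair seen less often than expected: negative score
```

A positive score therefore marks a substitution that occurs more often in related sequences than chance would predict.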

FASTA
Step 1: Find hot-spots, i.e. pairs of words of length k that exactly match (hashing)
Step 2: Locate best “diagonal runs” (sequences of consecutive hot-spots on a diagonal)
Step 3: Combine sub-alignments from diagonal runs into a longer alignment

Exercise (hashing tables of FASTA)
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
Prepare a table of offset values (= matching diagonals).

Solution
Offset = 0 (matches C, S, Q):
sequence 1: ACNGTSCHQE
sequence 2: GCHCLSAGQD
Offset = -3 (matches G, C):
sequence 1: ACNGTSCHQE---
sequence 2: ---GCHCLSAGQD
Offset = -5 (matches CH):
sequence 1: ACNGTSCHQE-----
sequence 2: -----GCHCLSAGQD
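The offset table of the exercise can be built with a small hash table. This is a minimal sketch with word length k = 1, not the FASTA implementation. The sign convention (position in sequence 2 minus position in sequence 1) is an assumption chosen to reproduce the offsets 0, -3 and -5 given in the solution.

```python
from collections import defaultdict

def offset_table(seq1, seq2, k=1):
    """Group exactly matching words of length k by diagonal (offset)."""
    # Hash every word of seq1 to the positions where it occurs.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)
    # Look up each word of seq2 and record the offset of every match.
    offsets = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for i in positions.get(word, []):
            offsets[j - i].append(word)
    return dict(offsets)

table = offset_table("ACNGTSCHQE", "GCHCLSAGQD")
for offset in sorted(table, key=lambda o: -len(table[o])):
    print(offset, table[offset])
```

Diagonals with many hot-spots, here offset 0, are the candidates for the "diagonal runs" of step 2.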

The main steps of gapped BLAST
1. Specify word length (3 for proteins, 11 for nucleotides)
2. Filtering for complexity
3. Make list of words to search for
4. Exact search
5. Join matches, and extend ungapped alignment
6. Calculate E-values
7. Join high-scoring pairs
8. Perform Smith-Waterman on best matches

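Filtering sequences
Replace sequence regions of low complexity K with X. For a window of length L over an alphabet of N letters, where letter i occurs $n_i$ times, the examples below use $K = \frac{1}{L}\log_N\left(\frac{L!}{\prod_i n_i!}\right)$.
Find K for sequence GGGG and for sequence ATCG.
GGGG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = 4$, $n_C = 0$, $n_T = 0$, $n_A = 0$; $\prod_i n_i! = 4! \cdot 0! \cdot 0! \cdot 0! = 24$; $K = \frac{1}{4}\log_4(24/24) = 0$
ATCG: $L! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$; $n_G = 1$, $n_C = 1$, $n_T = 1$, $n_A = 1$; $\prod_i n_i! = 1! \cdot 1! \cdot 1! \cdot 1! = 1$; $K = \frac{1}{4}\log_4(24/1) = 0.573$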
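The two worked values can be checked with a few lines of code. The sketch assumes the complexity measure K = (1/L) log_N(L! / Π n_i!) implied by the slide's arithmetic; the function name is illustrative.

```python
import math
from collections import Counter

def complexity(window, alphabet_size=4):
    """Complexity K of a sequence window: (1/L) * log_N(L! / prod(n_i!))."""
    length = len(window)
    denominator = 1
    for count in Counter(window).values():
        denominator *= math.factorial(count)
    # The ratio is a multinomial coefficient, so integer division is exact.
    arrangements = math.factorial(length) // denominator
    return math.log(arrangements, alphabet_size) / length

print(round(complexity("GGGG"), 3))
print(round(complexity("ATCG"), 3))
```

A filter would slide such a window along the query and mask windows whose K falls below a chosen cutoff.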

The BLAST algorithm
- Break the search sequence into words: W = 3 for proteins, W = 12 for DNA
- Include in the search all words that score above a certain value (T) for any search word
Example: search sequence MCGPFILGTYC
Words: MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
Neighborhood words for MCG: MCG, MCT, MCN, …
Neighborhood words for CGP: CGP, MGP, CTP, …
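The word list and the neighborhood expansion can be sketched as follows. The scoring function is a toy stand-in (4 for identity, -1 otherwise), not a real substitution matrix such as BLOSUM62, so the neighborhood it produces is only illustrative; the threshold of 7 is likewise an arbitrary choice.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def toy_score(a, b):
    """Toy scoring, NOT a real substitution matrix: identity 4, mismatch -1."""
    return 4 if a == b else -1

def neighborhood(word, threshold, score=toy_score):
    """All words whose score against `word` is at least `threshold` (T)."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits

print(words("MCGPFILGTYC"))
print(len(neighborhood("MCG", threshold=7)))
```

With this toy scoring, T = 7 admits the word itself plus every single-substitution variant; lowering T enlarges the list.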

The BLAST search algorithm

Searching the database
- Search for the words in the database
- Word locations can be precomputed and indexed
- Searching for a short string in a long string
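Precomputing word locations might look like the sketch below: the database is indexed once, and each query word is then a dictionary lookup. The two database sequences and their names are invented for the example.

```python
from collections import defaultdict

def build_word_index(database, w=3):
    """Map every word of length w to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - w + 1):
            index[sequence[pos:pos + w]].append((name, pos))
    return index

def find_seeds(query, index, w=3):
    """Return (query position, sequence name, database position) for each exact word hit."""
    seeds = []
    for qpos in range(len(query) - w + 1):
        for name, dpos in index.get(query[qpos:qpos + w], []):
            seeds.append((qpos, name, dpos))
    return seeds

# Made-up database, for illustration only.
database = {"seqA": "AAMCGPFAA", "seqB": "TTTILGTTT"}
index = build_word_index(database)
seeds = find_seeds("MCGPFILGTYC", index)
print(seeds)
```

Each seed is a starting point for the ungapped extension step.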

Search significance scores
- A search will always return some hits.
- How can we determine how “unusual” a particular alignment score is?
- Assumptions

Assessing significance requires a distribution
I have an apple of diameter 5”. Is that unusual?
[Figure: histogram of frequency against diameter (cm)]

Is a match significant?
- Match scores for aligning my sequence with random sequences.
- Depends on:
  - Scoring system
  - Database
  - Sequence to search for (length, composition)
- How do we determine the random sequences?
[Figure: histogram of frequency against match score]

Generating “random” sequences
- Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
  - Doesn’t reflect nature
- Use sequences from a database
  - Might have genuine homology
  - We want unrelated sequences
- Random shuffling of sequences
  - Preserves composition
  - Removes true homology
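The shuffling option is easy to sketch: permuting the letters of a sequence keeps its composition while destroying the order that carries homology. The fixed seed and the number of decoys are arbitrary choices for reproducibility, and the query is the first line of the Lotus japonicus example used later in the deck.

```python
import random
from collections import Counter

def shuffled(sequence, rng):
    """Return a random permutation of the sequence (same letters, new order)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(0)  # fixed seed so the run is repeatable
query = "LLANGNFVLRESGNKDQDGLVWQSFDFPTDTLLPQMKLGWDRKTGLNKI"
decoys = [shuffled(query, rng) for _ in range(5)]

for decoy in decoys:
    # Composition is preserved exactly in every decoy.
    print(decoy, Counter(decoy) == Counter(query))
```

Scoring the real query against many such decoys gives an empirical background distribution of match scores.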

What distribution do we expect to see?
- The mean of n random (i.i.d.) events tends towards a Gaussian distribution.
- Example: Throw n dice and compute the mean.
[Figure: distribution of means for n = 2 and n = 1000]
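The dice example can be simulated directly. This is a small sketch; the number of trials and the seed are arbitrary, and it only summarizes the distributions by mean and standard deviation rather than plotting them.

```python
import random
import statistics

def dice_means(n_dice, n_trials, rng):
    """Throw n_dice dice n_trials times and return the mean of each throw."""
    means = []
    for _ in range(n_trials):
        total = sum(rng.randint(1, 6) for _ in range(n_dice))
        means.append(total / n_dice)
    return means

rng = random.Random(1)
for n in (2, 1000):
    means = dice_means(n, n_trials=500, rng=rng)
    print(n, round(statistics.mean(means), 2), round(statistics.stdev(means), 3))
```

The means cluster ever more tightly around 3.5 as n grows, which is the narrowing bell shape the slide contrasts for n = 2 and n = 1000.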

Determining significance of match
- The score of an ungapped alignment is $S = \sum_i s(x_i, y_i)$.
- The scores of individual sites are independent.
- The distribution of the sum of many independent random variables tends towards a normal distribution (central limit theorem).

Determining significance of match However, we don't select scores randomly. We take the maximum extension of the initial word (HSP). The distribution of the maximum score of a large number N of i.i.d. random variables is called the extreme value distribution.

Comparing distributions: extreme value and Gaussian, each with parameters μ and σ
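The formulas on this slide are not preserved in the transcript. For reference, the standard textbook densities are given below; the notation is an assumption and may differ from what the slide showed. Note that in the extreme value (Gumbel) form μ is the location of the peak and σ a scale parameter, not the mean and standard deviation.

$$
f_{\text{Gaussian}}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

$$
f_{\text{extreme value}}(x) = \frac{1}{\sigma} \exp\left(-\frac{x-\mu}{\sigma} - e^{-(x-\mu)/\sigma}\right)
$$

The extreme value density is skewed, with a longer tail towards high scores than a Gaussian with the same peak.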

Determining P-values
- If we can estimate μ and σ, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.
- For sequence matches, a scoring system and database can be parameterized by two parameters, K and λ, related to μ and σ.
- It would be nice if we could compare hit significance without regard to the database and scoring system used!

P-values
P(S > x) is the probability of observing a score S greater than x. m’ and n’ are the effective query and database sequence lengths; K and λ are substitution matrix parameters.

Determining significance of match
- E-value = expected number of sequences scoring above S in the given database
- Low E-values => significant matches
- When E < 0.01, P-values and E-values are nearly identical
- Bit score: the alignment score rescaled using K and λ, so that it can be compared across scoring systems
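The relationship between E-values, P-values and bit scores can be sketched with the standard Karlin-Altschul formulas, E = K m' n' e^(-λS), P = 1 - e^(-E) and S' = (λS - ln K) / ln 2. These formulas are supplied here as an assumption, since the slide's own equations are not in the transcript. The values of K, λ and the lengths below are illustrative placeholders, not parameters read from any BLAST run.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance hits scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one chance hit scoring at least `score`."""
    return -math.expm1(-e_value(score, m, n, K, lam))  # 1 - exp(-E), computed stably

def bit_score(score, K, lam):
    """Raw score rescaled so that E = m * n * 2**(-bit_score)."""
    return (lam * score - math.log(K)) / math.log(2)

# Illustrative placeholder parameters (assumed, not from a real search).
K, lam = 0.134, 0.3176
m, n = 250, 100_000_000

for raw in (60, 80, 100):
    E = e_value(raw, m, n, K, lam)
    P = p_value(raw, m, n, K, lam)
    print(raw, round(bit_score(raw, K, lam), 1), f"E={E:.3g}", f"P={P:.3g}")
```

For small E the two quantities agree closely because 1 - e^(-E) ≈ E, which is why the slide treats them as interchangeable below 0.01.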

Smith-Waterman local alignment
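For orientation, here is a minimal sketch of the Smith-Waterman recurrence that returns only the best local alignment score. It assumes simple match/mismatch scores and a linear gap penalty; BLAST itself uses a substitution matrix with affine gap costs, so the numbers here are illustrative only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("ACNGTSCHQE", "GCHCLSAGQD"))
```

Keeping only two rows of the matrix is enough for the score; recovering the alignment itself would need the full matrix and a traceback.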

BLAST parameters
- Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.
- Raising the segment extension cutoff (X) returns longer extensions for each hit.
- Changing the E-value cutoff changes the threshold for reporting a hit.

BLAST flavours
Basic flavours:
- BLASTP (proteins to protein database)
- BLASTN (nucleotides to nucleotide database)
- BLASTX (translated nucleotides to protein database)
- TBLASTN (protein to translated database)
- TBLASTX (translated nucleotides to translated database) - SLOW

Example
Cloned sequence from Lotus japonicus
Amino-acid level (BlastP):
LLANGNFVLRESGNKDQDGLVWQSFDFPTDTLLPQMKLGWDRKTGLNKI
LRSWKSPSDPSSGYYSYKLEFQGLPEYFLNNRDSPTHRSGPWDGIRFSGIPEK
Nucleotide level (BlastN):
cttctcgcta atggcaattt cgtgctaaga gagtctggca acaaagatca agatgggtta gtgtggcaga gtttcgattt tcccactgac actttactcc cgcagatgaa actgggatgg gatcgcaaaa cagggcttaa caaaatcctc agatcctgga aaagcccaag tgatccgtca agtgggtatt actcgtataa actcgaattt caagggctcc ctgagtattt tttaaacaac agagactcgc caactcaccg gagcggtccg tgggatggta tccgatttag tggtattcca

Matrix parameters

Gap parameters

Hits

Synteny between the rat, mouse and human genomes (Nature 2004)

Iterated searches
Advanced family searches
- PSI-BLAST (Position Specific Iterated BLAST)

PSI-BLAST
Search with BLAST using the given query.
while (there are new significant hits)
    combine all significant hits into a profile
    search with BLAST using the profile
end
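The loop above can be written as a small driver function. This is only a skeleton: `search` and `build_profile` are hypothetical callables standing in for a real BLAST search and a real profile (PSSM) construction, and the toy versions below merely follow links in a made-up similarity graph. The iteration cap is also an added assumption.

```python
def psi_blast(query, search, build_profile, max_iterations=5):
    """Iterate search -> profile -> search until no new significant hits appear."""
    hits = set()
    new_hits = set(search(query))
    iterations = 0
    while new_hits and iterations < max_iterations:
        hits |= new_hits
        profile = build_profile(query, hits)    # combine all significant hits
        new_hits = set(search(profile)) - hits  # keep only hits not seen before
        iterations += 1
    return hits

# Toy stand-ins (hypothetical): each sequence "finds" its neighbours in this graph.
NEIGHBOURS = {"query": {"a"}, "a": {"b"}, "b": {"c"}, "c": set(), "d": set()}

def toy_search(profile):
    members = profile if isinstance(profile, frozenset) else frozenset([profile])
    found = set()
    for member in members:
        found |= NEIGHBOURS[member]
    return found

def toy_profile(query, hits):
    return frozenset(hits | {query})

print(sorted(psi_blast("query", toy_search, toy_profile)))
```

In the toy run the query alone reaches only "a", while the growing profile pulls in "b" and then "c". That is the point of iterating, and also why a single false hit can propagate in later rounds.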

PSI-BLAST Greedy algorithm