Database searching. Purposes of similarity search: function prediction by homology (in silico annotation)

Presentation transcript:

Database searching

Purposes of similarity search
- Function prediction by homology (in silico annotation)
- Search for an identified gene in other organisms
- Identifying regulatory elements
- Assisting in sequence assembly
Problems
- Similar sequences can have different functions
- Non-homologous sequences can have identical function
- Feature space ≠ sequence space

Some databases
- nr (GenBank nucleotide and protein)
- Month: monthly update
- swissprot (protein)
- pdb (proteins with 3D structures)
- Various genome databases (human, mouse etc.)

Main tools
- FASTA
- BLAST = Basic Local Alignment Search Tool
Procedure
1. Choose scoring matrix
2. Find best local alignments using scoring matrix
3. Determine statistical significance of result
   - List in decreasing order of significance

BLOSUM substitution matrix
- Log-odds scores: $s = 2\log_2(\text{proportion observed} / \text{proportion expected})$, rounded to the nearest integer
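A minimal sketch of the log-odds calculation, assuming base-2 logarithms and rounding to the nearest integer (half-bit units). The function name and the example proportions are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def log_odds_score(observed, expected):
    """Return 2 * log2(observed / expected), rounded to an integer score."""
    return round(2 * math.log2(observed / expected))

# A pair seen four times more often than expected by chance scores positive.
favoured = log_odds_score(observed=0.04, expected=0.01)
# A pair seen half as often as expected by chance scores negative.
disfavoured = log_odds_score(observed=0.005, expected=0.01)
print(favoured, disfavoured)
```

Positive scores mark substitutions that occur more often than chance predicts, negative scores mark those that occur less often.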

FASTA
- Step 1: Find hot-spots (i.e. pairs of words of length k that match exactly), using hashing
- Step 2: Locate the best "diagonal runs" (sequences of consecutive hot-spots on a diagonal)
- Step 3: Combine sub-alignments from diagonal runs into a longer alignment

Exercise (hashing tables of FASTA)
  sequence 1: ACNGTSCHQE
  sequence 2: GCHCLSAGQD
Prepare a table of offset values (= matching diagonals).

Solution
offset = 0 (matches C, S, Q)
  sequence 1: ACNGTSCHQE
  sequence 2: GCHCLSAGQD
offset = -3 (matches G, C)
  sequence 1: ACNGTSCHQE---
  sequence 2: ---GCHCLSAGQD
offset = -5 (matches C, H)
  sequence 1: ACNGTSCHQE-----
  sequence 2: -----GCHCLSAGQD
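The offset table from the exercise can be reproduced with a small hashing sketch. It assumes word length k = 1 and defines the offset as the position in sequence 2 minus the position in sequence 1, which matches the sign used in the solution. The function name is illustrative only.

```python
from collections import defaultdict

def offset_table(seq1, seq2, k=1):
    """Group exact k-word matches by diagonal (offset = pos in seq2 - pos in seq1)."""
    # Hash every k-word of seq1 to the positions where it starts.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)

    # Look up each k-word of seq2 and record the diagonal of every hit.
    diagonals = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for i in positions.get(word, []):
            diagonals[j - i].append(word)
    return dict(diagonals)

table = offset_table("ACNGTSCHQE", "GCHCLSAGQD")
# Print the diagonals with the most matches first.
for offset in sorted(table, key=lambda d: -len(table[d])):
    print(offset, table[offset])
```

The diagonals with the most hits (0, -3 and -5 here) are the candidates FASTA would examine further as diagonal runs.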

The main steps of gapped BLAST
1. Specify word length (3 for proteins, 11 for nucleotides)
2. Filtering for complexity
3. Make list of words to search for
4. Exact search
5. Join matches, and extend ungapped alignment
6. Calculate E-values
7. Join high-scoring pairs
8. Perform Smith-Waterman on best matches
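Steps 3 to 5 can be sketched as a seed-and-extend loop. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches only (protein BLAST also uses neighbourhood words above a score threshold), a simple match/mismatch score, and an X-drop rule to stop extension. All names and parameter values are hypothetical.

```python
def seed_and_extend(query, subject, w=3, x_drop=2, match=1, mismatch=-1):
    """Return ungapped hits as (score, query_start, subject_start, length)."""
    # Step 3: index every word of length w in the query.
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)

    def extend(i, j, step):
        """Walk away from (i, j); return (best score gain, residues kept)."""
        score = best = best_len = length = 0
        i += step
        j += step
        while 0 <= i < len(query) and 0 <= j < len(subject):
            score += match if query[i] == subject[j] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            i += step
            j += step
        return best, best_len

    hits = set()
    # Step 4: exact search for query words in the subject.
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            # Step 5: extend the seed without gaps in both directions.
            right, r_len = extend(i + w - 1, j + w - 1, +1)
            left, l_len = extend(i, j, -1)
            hits.add((w * match + left + right, i - l_len, j - l_len,
                      w + l_len + r_len))
    return sorted(hits, reverse=True)

hsps = seed_and_extend("GATTACAGG", "CCGATTACATT")
print(hsps[0])
```

Seeds that lie on the same diagonal extend to the same segment, so collecting the results in a set joins them into one high-scoring pair.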

Filtering sequences
- Remove sequence regions of low complexity K
- Remove repeated elements (RepeatMasker)
- Exercise: find K for the sequence GGGG and for the sequence ATCG
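The formula for K did not survive in the transcript. The sketch below assumes the compositional complexity measure often used for this kind of exercise, $K = \frac{1}{L}\log_N\left(L! / \prod_i n_i!\right)$, where L is the window length, N the alphabet size and n_i the count of each residue. That choice is an assumption, not something stated on the slide.

```python
from collections import Counter
from math import factorial, log

def complexity(seq, alphabet_size=4):
    """Compositional complexity K of a sequence window (assumed definition)."""
    # Number of distinct arrangements of the residues: L! / (n_1! * n_2! * ...)
    arrangements = factorial(len(seq))
    for count in Counter(seq).values():
        arrangements //= factorial(count)
    # Logarithm to base N, normalised by the window length L.
    return log(arrangements, alphabet_size) / len(seq)

print(complexity("GGGG"))  # homopolymer: only one arrangement
print(complexity("ATCG"))  # every residue different
```

Under this definition a homopolymer such as GGGG has K = 0, while ATCG gives roughly 0.57, so a threshold on K separates low-complexity windows from the rest.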

The BLAST search algorithm

Determining significance of match
- The score of an ungapped alignment is $S = \sum_i s(x_i, y_i)$.
- The scores of individual sites are assumed to be independent.
- The sum of many independent, identically distributed random variables is approximately normally distributed (central limit theorem).

Determining significance of match
- However, we don't select scores randomly. We take the maximum extension of the initial word (HSP).
- The distribution of the maximum of a large number N of i.i.d. random variables is called the extreme value distribution.
- Probability of observing a score S > x: $P(S > x) = 1 - \exp\left(-K m' n' e^{-\lambda x}\right)$
- m' and n' are the effective query and database lengths; K and $\lambda$ are parameters that depend on the substitution matrix.

Determining significance of match
- E-value = expected number of sequences scoring above S by chance in the given database
- Bit score: the raw alignment score S normalised by the substitution matrix parameters K and $\lambda$, $S' = (\lambda S - \ln K)/\ln 2$, so that scores obtained with different matrices can be compared
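The relationship between raw score, bit score, E-value and P-value can be sketched numerically using the standard Karlin-Altschul formulas. The parameter values below (lambda, K, the effective lengths and the raw score) are illustrative assumptions, not values taken from the slides.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(raw_score, m_eff, n_eff, lam, K):
    """Expected number of chance hits scoring at least raw_score."""
    return K * m_eff * n_eff * math.exp(-lam * raw_score)

def p_value(e):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e)

# Hypothetical inputs: a 250-residue query against 5 million residues.
lam, K = 0.318, 0.13
m_eff, n_eff = 250, 5_000_000
S = 80

E = e_value(S, m_eff, n_eff, lam, K)
print(f"bit score = {bit_score(S, lam, K):.1f}, E = {E:.3g}, P = {p_value(E):.3g}")
```

For small E the P-value and the E-value are almost equal, which is why BLAST reports only the E-value.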

Smith-Waterman local alignment
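A minimal sketch of the Smith-Waterman recurrence, returning only the best local alignment score. It assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps. The scoring values are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

Resetting negative cells to zero is what makes the alignment local: the shared ACGT core is scored and the unrelated flanks are ignored.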

Example – nodulation
Cloned sequence from Lotus japonicus
- Amino-acid level (BlastP):
  LLANGNFVLRESGNKDQDGLVWQSFDFPTDTLLPQMKLGWDRKTGLNKILRSWKSPSDPSSGYYSYKLEFQGLPEYFLNNRDSPTHRSGPWDGIRFSGIPEK
- Nucleotide level (BlastN):
  cttctcgcta atggcaattt cgtgctaaga gagtctggca acaaagatca agatgggtta
  gtgtggcaga gtttcgattt tcccactgac actttactcc cgcagatgaa actgggatgg
  gatcgcaaaa cagggcttaa caaaatcctc agatcctgga aaagcccaag tgatccgtca
  agtgggtatt actcgtataa actcgaattt caagggctcc ctgagtattt tttaaacaac
  agagactcgc caactcaccg gagcggtccg tgggatggta tccgatttag tggtattcca

BLAST flavours
Basic flavours
- BLASTP (proteins to protein database)
- BLASTN (nucleotides to nucleotide database)
- BLASTX (translated nucleotides to protein database)
- TBLASTN (protein to translated database)
- TBLASTX (translated nucleotides to translated database) - SLOW

Iterated searches
- Advanced family searches
- PSI-BLAST (Position Specific Iterated BLAST)

PSI-BLAST
Search with BLAST using the given query.
while (there are new significant hits)
    combine all significant hits into a profile
    search with BLAST using the profile
end
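The loop above can be written as a short runnable sketch. The `search` and `build_profile` callables stand in for a real BLAST search and profile construction. They, the iteration cap and the toy data are all hypothetical and only demonstrate the control flow.

```python
def psi_blast(query, search, build_profile, max_iterations=5):
    """Iterate search -> profile -> search until no new significant hits appear."""
    hits = set(search(query))
    for _ in range(max_iterations):
        profile = build_profile(query, hits)
        new_hits = set(search(profile)) - hits
        if not new_hits:
            break  # converged: the profile finds nothing new
        hits |= new_hits
    return hits

# Toy stand-ins: each sequence is "significantly similar" to its neighbours.
neighbours = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

def toy_search(profile):
    members = {profile} if isinstance(profile, str) else profile
    return set().union(*(neighbours[m] for m in members))

def toy_profile(query, hits):
    return frozenset(hits | {query})

family = psi_blast("a", toy_search, toy_profile)
print(sorted(family))
```

In the toy data "d" is not directly similar to the query "a", yet it is reached through the intermediate hits, which is the point of iterating with a profile.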

PSI-BLAST Greedy algorithm

Speeding up BLAST
TeraBLAST on DeCypher
- Hard-wiring the BLAST algorithm
- Up to 6 trillion BLASTN comparisons per second per DeCypher node.
- Up to 1 trillion amino acid comparisons per second per DeCypher node.
- A single multi-node equipment rack replaces over 1,000 CPUs of BLAST throughput.

TeraBLAST on DeCypher
Example: Human Genome Self-Similarity Search
- (query x target)/2 = $(3.4 \times 10^{9})^{2}/2 = 5.8 \times 10^{18}$ base pair comparisons
- Search time for CPU server, software BLASTN: $5.8 \times 10^{18} / (2 \times 10^{10})$ = 290 million seconds, or ~9.2 years
- Search time for smallest DeCypher using Tera-BLASTN: $5.8 \times 10^{18} / (2 \times 10^{12})$ = 2.9 million seconds, or ~33 days
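The arithmetic in this example can be checked with a few lines. The genome size and the two throughput figures are the ones used on the slide. The variable names are illustrative.

```python
GENOME_BP = 3.4e9       # human genome size used on the slide
CPU_RATE = 2e10         # comparisons per second, software BLASTN
DECYPHER_RATE = 2e12    # comparisons per second, smallest DeCypher

# Self-comparison: every base pair against every other, counted once.
comparisons = GENOME_BP ** 2 / 2

cpu_years = comparisons / CPU_RATE / (365.25 * 86400)
decypher_days = comparisons / DECYPHER_RATE / 86400

print(f"{comparisons:.1e} comparisons")
print(f"CPU server: {cpu_years:.1f} years")
print(f"DeCypher:   {decypher_days:.0f} days")
```

The hundredfold difference in throughput carries straight through to the search time, which is the comparison the slide is making.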

Other genome projects
- Synteny of mouse chr 16 with the human genome (Science 2002)