Database Searches Guoqing Lu Office: E115 Beadle Center Tel: (402) 472-4982 Website:


Motivation
Find which sequences in the database are related to your sequence: X is like A; A has function M; so X is likely to have function M.
Applications include:
–identifying orthologs and paralogs
–discovering new genes or proteins
–discovering variants of genes or proteins
–investigating expressed sequence tags (ESTs)
–exploring protein structure and function

Concerns in Database Searches
Sensitivity –The ability of a search method to find most of the members of the protein family represented by the query sequence
Selectivity –The ability of a search method to locate a protein family without making a false-positive classification of members of other families
Speed –Heuristic methods are not optimal, but they are practical

A Naïve Approach For each target sequence: –Align query to target sequence using e.g., dynamic programming –Report alignment if above cutoff Repeat for the next sequence
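The naïve approach can be sketched in a few lines of Python. This is only an illustration: the scoring values (match 5, mismatch -4, gap -7), the function names, and the toy database are assumptions made for the example, and the dynamic programming step is a plain Smith-Waterman score with a linear gap penalty.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-7):
    """Best local alignment score of a and b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def naive_search(query, database, cutoff):
    """Align the query to every target and report those at or above the cutoff."""
    hits = []
    for name, target in database.items():
        score = smith_waterman_score(query, target)
        if score >= cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical three-sequence "database"
database = {"seq_a": "MATPL", "seq_b": "GGGGG", "seq_c": "XXMATWLXX"}
print(naive_search("MATWL", database, cutoff=15))
```

Every target costs a full dynamic programming pass, which is why this approach is too slow for large databases and why heuristics such as FASTA and BLAST are used instead.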

FASTA
–Developed by Lipman & Pearson (1985)
–Searches for matching sequence patterns or words (k-tuples)
–Local alignment: tries to find paths of regional similarity, rather than trying to find the best alignment between 2 sequences
–Heuristic: not guaranteed to find the best alignment between 2 sequences; it may miss matches. It uses a strategy that is expected to find most matches, but sacrifices complete sensitivity in order to gain speed
–Has gone through a series of updates and enhancements leading to version 3, denoted FASTA3
–FTP: ftp.virginia.edu/pub/

FASTA Algorithm – Step 1 Identify regions shared by the two sequences with the highest density of identities –Word size (ktup): 4-6 for nucleotide searches –2 for protein –Merge hits along the diagonals
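A minimal sketch of the word lookup in Step 1, assuming a simple dictionary index of the query's k-tuples. The function name and the example sequences are hypothetical; real FASTA additionally scores and joins nearby hits on a diagonal, which is not shown here.

```python
from collections import defaultdict


def ktuple_diagonal_hits(query, target, ktup=2):
    """Count identical k-tuples on each diagonal (query offset minus target offset)."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Each shared word adds one hit to the diagonal it lies on.
    diagonals = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    return dict(diagonals)


print(ktuple_diagonal_hits("MATWL", "XXMATWLXX", ktup=2))
```

Diagonals that collect many hits mark the regions with the highest density of identities.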

FASTA Algorithm – Step 2 Re-calculate INIT1 using a scoring matrix, e.g., PAM250 Keep the top 10 scoring segments Each segment is a partial alignment without gaps

FASTA Algorithm – Step 3 Merge INIT1 regions that pass a threshold by allowing gaps between them The INITN score is the sum of the INIT1 scores minus the gap penalties

FASTA Algorithm – Step 4 Use dynamic programming to optimize the alignment in a narrow band that encompasses the top scoring segments The result is the OPT score

FASTA – Statistics FASTA calculates a z-score for each sequence pair by correcting the alignment score for the length of the database sequence (scores expected by chance grow with the logarithm of the sequence length) Using the distribution of the z-scores, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E value

Bit score: assuming a bit score of 30, you would have to score, on average, about 1 billion (2^30) independent segment pairs to find a score this good by chance

TFASTA Used to search a DNA database using a protein query sequence Find any DNA sequences that may code for a protein of interest TFASTA is very slow !!!

BLAST Basic Local Alignment Search Tool Developed as a way to perform a sequence similarity search by an algorithm that is faster than FASTA while being as sensitive (Altschul et al 1990, 1994, 1997)

BLAST Build the neighborhood word list (NWL) and search the database for neighborhood word hits (NWH) Example scores: P-P: 7 Q-Q: 5 G-G: 6 …
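The word-list step can be illustrated on a toy three-letter alphabet. The identity scores P-P: 7, Q-Q: 5, G-G: 6 come from the slide; the flat mismatch score of -2, the threshold, and the function names are assumptions made for this sketch (a real search would use a full substitution matrix over 20 amino acids).

```python
from itertools import product

IDENTITY = {"P": 7, "Q": 5, "G": 6}  # identity scores shown on the slide
MISMATCH = -2                        # assumed flat mismatch score (illustration only)


def score(a, b):
    return IDENTITY[a] if a == b else MISMATCH


def neighborhood_words(word, threshold, alphabet="PQG"):
    """All words of the same length that score at least `threshold` against `word`."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(word)):
        s = sum(score(x, y) for x, y in zip(word, candidate))
        if s >= threshold:
            neighbors.append(("".join(candidate), s))
    return sorted(neighbors, key=lambda t: (-t[1], t[0]))


print(neighborhood_words("PQG", threshold=11))
```

Raising the threshold shrinks the list (faster, less sensitive); lowering it has the opposite effect.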

BLAST Extend the Word Hits
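A hedged sketch of ungapped hit extension: starting from a word hit, the alignment is extended in both directions and stops once the running score falls more than a drop-off value below the best score seen. The drop-off value, the match/mismatch scores, and all function names are assumptions for illustration.

```python
def identity_score(a, b, match=5, mismatch=-4):
    """Assumed toy scoring: fixed match and mismatch values."""
    return match if a == b else mismatch


def _extend(query, target, qi, ti, step, score_fn, x_drop):
    """Walk in one direction; return (best gain, number of residues kept)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        gain += score_fn(query[qi], target[ti])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        ti += step
    return best_gain, best_len


def extend_hit(query, target, qpos, tpos, word_len, score_fn=identity_score, x_drop=10):
    """Extend a word hit at (qpos, tpos) without gaps; return (score, query segment)."""
    seed = sum(score_fn(query[qpos + k], target[tpos + k]) for k in range(word_len))
    right_gain, right_len = _extend(query, target, qpos + word_len, tpos + word_len,
                                    1, score_fn, x_drop)
    left_gain, left_len = _extend(query, target, qpos - 1, tpos - 1,
                                  -1, score_fn, x_drop)
    start, end = qpos - left_len, qpos + word_len + right_len
    return seed + left_gain + right_gain, query[start:end]


print(extend_hit("MATWL", "XXMATWLXX", qpos=1, tpos=3, word_len=2))
```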

How to Interpret a BLAST Search: Expect Value
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. The key equation describing an E value is: $E = K m n e^{-\lambda S}$

This equation is derived from a description of the extreme value distribution:
$E = K m n e^{-\lambda S}$
S = the score
E = the expect value = the number of HSPs expected to occur with a score of at least S
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistical parameters
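The E-value equation is easy to evaluate numerically. In the sketch below the parameter values (λ = 0.318, K = 0.134), the sequence lengths, and the function name are illustrative assumptions, not values taken from the slides.

```python
import math


def expect_value(score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


# Assumed example: query of 250 residues, database of 1,000,000 residues
e = expect_value(score=50, m=250, n=1_000_000, lam=0.318, K=0.134)
print(f"E = {e:.3g}")
```

Two properties follow directly from the formula: E grows linearly with the search space m·n, and it falls exponentially as the score S increases.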

From Raw Scores to Bit Scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores).
Bit scores are comparable between different searches because they are normalized for the scoring system used (the parameters λ and K):
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is:
$E = m n \, 2^{-S'}$
Bit scores allow you to compare results between different database searches, even using different scoring matrices.
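A short check, under assumed parameter values (λ = 0.318, K = 0.134; function names are hypothetical), that converting a raw score to a bit score and then to an E value gives the same result as the raw-score formula E = K m n e^(-λS).

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


lam, K = 0.318, 0.134          # assumed illustrative parameters
m, n, raw = 250, 1_000_000, 50
bits = bit_score(raw, lam, K)
print(bits, expect_from_bits(bits, m, n), K * m * n * math.exp(-lam * raw))
```

Because 2^30 is roughly 1.07 billion, a bit score of 30 corresponds to about one chance occurrence per billion independent segment pairs.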

How to Interpret BLAST: E values and p values
A p value is a different way of representing the significance of an alignment: $p = 1 - e^{-E}$
For small values, E and p are nearly equal: E = 0.1, 0.05 and 0.001 correspond to p of about 0.1, about 0.05 and about 0.001.
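The relation between E and p can be tabulated directly. The function name and the chosen E values below are just for illustration.

```python
import math


def p_value(e_value):
    """p = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


for e in (10, 1, 0.1, 0.05, 0.001):
    print(f"E = {e:<6} p = {p_value(e):.6f}")
```

For E well below 1 the two numbers are practically interchangeable; for large E the p value saturates near 1.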

Blastp – Compares an amino acid query sequence against a protein sequence database.
Blastn – Compares a nucleotide query sequence against a nucleotide sequence database.
Blastx – Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
Tblastn – Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
Tblastx – Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
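The translated searches rely on six-frame translation. A minimal sketch, assuming the standard genetic code and a DNA alphabet of A, C, G, T only (codons containing other symbols are written as X); the function names and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate consecutive codons; incomplete trailing bases are ignored."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


print(six_frame_translations("ATGGCCACCTGGCTGTAA"))
```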

PSI-Blast Position-Specific Iterative Blast The sequences extracted from a Blast2 search are aligned, and a statistical profile is derived from the multiple alignment The profile is then used as a query for the next search, and this loop is iterated a number of times that is controlled by the user Documentation at NCBI
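The idea of deriving a profile from a multiple alignment can be sketched as a simple log-odds matrix. This is a deliberately simplified illustration, not the PSI-BLAST procedure itself: it assumes a uniform background, a fixed pseudocount, and no sequence weighting, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in ALPHABET)  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({a: math.log2(((counts[a] + pseudocount) / total) / background)
                        for a in ALPHABET})
    return profile


def score_with_profile(profile, segment):
    """Score an ungapped segment of the same length against the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


profile = build_profile(["MATWL", "MATPL", "MGTWL"])
print(score_with_profile(profile, "MATWL"), score_with_profile(profile, "GGGGG"))
```

In an iterative search, the sequences found with such a profile would be added to the alignment and the profile rebuilt before the next round.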

PHI-Blast Pattern-Hit Initiated Blast The search space is restricted to the database sequences that match a motif, which is specified by the user and has also to be contained in the query sequence. The resulting alignment is anchored to this motif.
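The motif restriction can be illustrated with a regular expression. This sketch assumes the motif is given as a Python regex (not the pattern syntax of any particular program); the function name, motif, and sequences are made up for the example.

```python
import re


def phi_filter(query, database, motif_regex):
    """Keep only database sequences containing the motif; the query must contain it too.

    Returns {sequence name: start position of the first motif match}.
    """
    pattern = re.compile(motif_regex)
    if pattern.search(query) is None:
        raise ValueError("the motif must also occur in the query sequence")
    anchors = {}
    for name, sequence in database.items():
        match = pattern.search(sequence)
        if match is not None:
            anchors[name] = match.start()
    return anchors


database = {"seq_a": "XXCAAHXX", "seq_b": "GGGG"}
print(phi_filter("MCTTHL", database, r"C..H"))
```

The returned positions are where an alignment would be anchored; only those sequences would be passed on to the alignment step.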

Comparison of BLAST and FASTA Algorithms
–Both: heuristic
–BLAST: several alignments per database entry; FASTA: one alignment per database entry
–Word length: BLAST: 11, FASTA: 6; therefore, FASTA may be better for DNA sequences
–BLAST automatically filters low-complexity sequences
–Both provide: ranking of alignments, alignment scores, statistical significance of alignments, and the alignments themselves

Scoring
VLSPADKTNVKAAWGKVGA
|||      |   | || |
VLSEGEWQLVLHVWAKVEA
The alignment score = match scores - penalty for gap opening - penalty for gap extension
Do matches (V,V), (L,L), (S,S) have the same value???
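A small sketch of how such a score can be computed for a finished alignment. The gap convention is an assumption: the first position of a gap costs `gap_open` and each further position costs `gap_extend`. The scoring function is passed in, so identical pairs such as (V,V) and (S,S) can be given different values by a substitution matrix.

```python
def alignment_score(row_a, row_b, score_fn, gap_open=10, gap_extend=1):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap = None  # which row the previous gap position was in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            total -= gap_extend if previous_gap == current_gap else gap_open
            previous_gap = current_gap
        else:
            total += score_fn(x, y)
            previous_gap = None
    return total


def count_identity(x, y):
    """Toy scoring: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


print(alignment_score("VLSPADKTNVKAAWGKVGA", "VLSEGEWQLVLHVWAKVEA", count_identity))
```

With `count_identity` the result is simply the number of identical positions; swapping in matrix-based scores is what makes (V,V), (L,L) and (S,S) contribute differently.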

PAM Matrices
Point Accepted Mutation Matrices (Dayhoff 1978)
–List the likelihood of change from one base or amino acid to another in homologous sequences during evolution
–Amino acid PAM matrices are derived from families of closely related sequences
–Evolutionary distance of 1 PAM = probability of 1 point mutation per 100 residues
–Likelihood (odds) ratio for residues a and b: probability that a-b is a mutation / probability that a-b is chance
–PAM matrices contain log-odds figures: Val>0: likely mutation; Val=0: random mutation; Val<0: unlikely mutation
–250 PAM: similarity scores equivalent to 20% identity
–Low PAM: good for finding short, strong local similarities
–High PAM: long, weak similarities
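The sign convention of the log-odds figures can be checked with a one-line calculation. The scaling factor of 10 with a base-10 logarithm and the example probabilities are assumptions for this sketch.

```python
import math


def log_odds(p_mutation, p_chance, scale=10):
    """Scaled log10 of the odds ratio: >0 likely, 0 random, <0 unlikely."""
    return scale * math.log10(p_mutation / p_chance)


for p_mutation in (0.02, 0.01, 0.005):   # hypothetical probabilities
    print(p_mutation, round(log_odds(p_mutation, p_chance=0.01), 2))
```

Working with logarithms means the scores of individual aligned pairs can be added to score a whole alignment.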

BLOSUM Matrices
Blocks Substitution Matrices (Henikoff & Henikoff 1992)
–Built from aligned, ungapped conserved regions (blocks) of protein families
–Calculate the frequency with which any amino acid can appear at each position
–Compute the probability that any amino acid can substitute for any other
–Frequencies are obtained from protein blocks constructed regardless of evolutionary distance
–Blocks represent regions of conserved sequence similarity
–Conservation is due to functional constraints, thus the calculated frequencies reflect functional constraints
–A much larger data set was used than for the PAM matrices
–BLOSUM80 is roughly equivalent to PAM120 (and BLOSUM62 to PAM160)

Exercise 1 Search the following sequence against the Swiss-Prot database for similarity using the Fasta and NCBI-Blast2 programs at
> unknown protein
MAVACAVAVRPLVQVAVASAVSTAAPASSKPAVKLAASAVSAVALTTVSVSAGLLATTAVEDPRFHAADCQS
RSADASASCEDLQPSTSTCTSAVRDANRPTRRVRRSGSKAQRRGSTTLTASVPSMAAAVVLPPKIALRRRHR
LRLRAGHSATAAATDKTPREQPDKPAALPEDLLPADATSTSSTGKISSAAVCCGLLAHCSAAQLHAILCGLV
QAVASSSVKGNNRKLLLGSKLRKLLEGVGVAPANGKAYTAADVAALSGPKLERLRATLKSQPGLLLWFLLFT
APAKLQALQAALLPGGAGDRSFEEWRAAIDAVAGSGHEQLAAAQEVRGRQSACVEGSTAGNTATTATITTTN
NNPASHGGVYTALTGTEVTGKKPAALPEDLLPADATSTSSTGKISSAAVCCGLLAHCSAAQLHAILCGLVQA
VASSSVKGNNRKLLLGSKLRKLLEGVGVAPANGKAYTAADVAALSGPKLERLRATLKSQPGLLLWFLLFTAP
AKLQALQAALLPGGAGDRSFEEWRAAIDAVAGSGHEQLAAAQEVRGRQSACVEGSTAGNTATTATITTTNNN
PASHGGVYTALTGTEVTGKAAANKDLSRTRTTSHRNRCVSESGSTRNKSRSSSSRSSSTHSVEYAEPKAGCS
QPAATVPGCVPEIISAAIPPLAPLALHIRRAIVKELLEARPPGWNTFLYSWLQAAGLSEFLPANGTCRMYMA
DRKQLVLRVGAMREEQVDAFLTCMCKAHGHSTWLARYLHMLGPEVSQLLS

GCG Introduction

GCG entry/data format
Entry –database:accession-number –genbank:zmzein; gb: *
The format of records –NCBI –EMBL
Editing –seqed [filename] –reformat [filename] –reverse [filename]

Find Sequences - Lookup Lookup: homo sapiens, ldlr stringsearch

Retrieve sequences - fetch fetch gb_pl:zmzein

Sequence edit - seqed Ctrl+D : reformat [filename] reverse [filename] seqed zmzein.gb_pl

SeqLab Menu bar Currently loaded list file Mode selector Attributes List file contents

Menus - File

Menus - Edit

Menus - Functions

Menus - Options

Menus - Windows

Add sequence from databases

Two Modes: Main List and Editor

Mode change –place the cursor over the words "Main List" –hold down the mouse button, which shows a choice between Main List and Editor –while still holding the button, slide the cursor down over the word "Editor" –release the button Do not confuse the Editor Mode with the Edit menu at the top of the window!

Adding files to Main List / Editor Three kinds of files –files from your Unix directory –files in the sequence databanks –files you've retrieved from databanks on the net into your directory If you have a file in Fasta format (a single line with a ">" sign and the name of the sequence, followed by as many lines of sequence letters as needed), you can, in Editor Mode only, Import sequence from the File menu
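The Fasta format described above is simple enough to read with a few lines of Python. This is a minimal sketch with a hypothetical function name; it keeps the whole header line as the sequence name and ignores blank lines.

```python
def parse_fasta(text):
    """Return {name: sequence} from Fasta-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()      # header line: ">" followed by the name
            records[name] = []
        elif name is not None:
            records[name].append(line)   # sequence letters, possibly over many lines
    return {n: "".join(parts) for n, parts in records.items()}


example = ">unknown protein\nMAVACAVAVR\nPLVQVAVASA\n"
print(parse_fasta(example))
```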

Main List - view sequence Weight: defines the significance of the sequence in comparisons with other sequences Join: joins (concatenates) the sequence with the next sequence in the list that has an identical "Join" name; used by the Assemble and Translate programs

Editing sequence Editor –Cut, Copy, and Paste –Lock, Group, and unGroup