Database Searches
Guoqing Lu
Office: E115 Beadle Center
Tel: (402) 472-4982
Email: glu3@unl.edu
Website: http://biocore.unl.edu

Motivation
Find which sequences in the database are related to your sequence
–X is like A
–A has function M
–X is likely to have function M
Applications include:
–identifying orthologs and paralogs
–discovering new genes or proteins
–discovering variants of genes or proteins
–investigating expressed sequence tags (ESTs)
–exploring protein structure and function

Concerns in Database Searches
Sensitivity
–The ability of a search method to find most of the members of the protein family represented by the query sequence
Selectivity
–The ability of a search method to locate a protein family without making a false-positive classification of members of other families
Speed
–Heuristic methods are not optimal, but they are practical

A Naïve Approach
For each target sequence:
–Align the query to the target sequence using, e.g., dynamic programming
–Report the alignment if its score is above the cutoff
Repeat for the next sequence
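The loop on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the match/mismatch/gap values, and the toy database are assumptions, not taken from the slides, and the dynamic programming step is a plain Smith-Waterman local score with a linear gap penalty.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty, illustrative values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def naive_search(query, database, cutoff):
    """Align the query to every target and keep hits scoring at or above the cutoff."""
    hits = []
    for name, target in database.items():
        score = smith_waterman_score(query, target)
        if score >= cutoff:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: -hit[1])


toy_database = {"t1": "TTACGTTT", "t2": "GGGG"}  # hypothetical targets
print(naive_search("ACGT", toy_database, cutoff=6))
```

Every target costs a full dynamic programming pass, which is why the heuristic methods on the following slides exist.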

FASTA
Developed by Lipman & Pearson (1985)
–Searches for matching sequence patterns or words, called k-tuples
Local alignment
–Tries to find paths of regional similarity, rather than trying to find the best alignment between 2 sequences
Heuristic
–Not guaranteed to find the best alignment between 2 sequences; it may miss matches
–Uses a strategy that is expected to find most matches, but sacrifices complete sensitivity in order to gain speed
Has gone through a series of updates and enhancements leading to version 3, denoted FASTA3
FTP: ftp.virginia.edu/pub/

FASTA Algorithm – Step 1
Identify regions shared by two sequences with the highest density of identities
–Word (k-tuple) size of 4-6 for nucleotide searches
–Word (k-tuple) size of 2 for protein
–Merge hits along the diagonals
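A minimal sketch of the word lookup behind Step 1, assuming a diagonal is identified by (target position minus query position). The function name and example sequences are hypothetical, and real FASTA also joins nearby hits and scores the regions, which is omitted here.

```python
from collections import defaultdict


def ktuple_diagonal_hits(query, target, k=2):
    """Count shared k-tuples on each diagonal (target position - query position)."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the target once; each shared word adds one hit to its diagonal.
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = ktuple_diagonal_hits("MKVLA", "GGMKVLA", k=2)
best_diagonal = max(hits, key=hits.get)
print(hits, best_diagonal)
```

Diagonals with many hits mark the regions of high identity density that the later steps re-score.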

FASTA Algorithm – Step 2
Re-calculate INIT1 using a scoring matrix, e.g., PAM250
Keep the top 10 scoring segments
Each segment is a partial alignment without gaps

FASTA Algorithm – Step 3
Merge INIT1 regions that pass a threshold by allowing gaps between them
The INITN score is the sum of the INIT1 scores minus the gap penalties

FASTA Algorithm – Step 4
Use dynamic programming to optimize the alignment in a narrow band that encompasses the top-scoring segments
The result is the OPT score
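The idea of restricting dynamic programming to a band can be sketched as follows. This is an illustrative simplification, not FASTA's actual implementation: the function name, the scoring values, and the linear gap penalty are assumptions. Cells outside the band are treated as score 0, which is safe for a local alignment because an alignment may start anywhere.

```python
def banded_local_score(a, b, diagonal, band, match=2, mismatch=-1, gap=-2):
    """Local alignment score using only cells with |(j - i) - diagonal| <= band."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i + diagonal - band)
        hi = min(len(b), i + diagonal + band)
        for j in range(lo, hi + 1):
            diag = prev.get(j - 1, 0) + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev.get(j, 0) + gap
            left = curr.get(j - 1, 0) + gap
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


# Band of width 1 around diagonal 2, where the shared segment lies.
print(banded_local_score("MKVLA", "GGMKVLA", diagonal=2, band=1))
```

The work per query row is proportional to the band width rather than to the full target length, which is where the speed gain comes from.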

FASTA – Statistics
FASTA calculates a z-score for each sequence pair by normalizing the alignment score for its dependence on ln(length(db_sequence))
Using the distribution of the z-scores, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search
This is reported as the E value

Bit score example: with a bit score of 30, you would have to score, on average, about 1 billion (2^30) independent segment pairs to find a score this good by chance
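The "1 billion" figure follows from the standard relation E = mn 2^(-S'): with S' = 30, one chance hit is expected once the search space mn reaches 2^30. A quick check, with a hypothetical function name:

```python
def search_space_for_one_chance_hit(bit_score):
    """Search-space size m*n at which E = m*n * 2**(-bit_score) equals 1."""
    return 2 ** bit_score


print(search_space_for_one_chance_hit(30))  # 1073741824, i.e. about 1 billion
```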

TFASTA
Used to search a DNA database using a protein query sequence
Finds any DNA sequences that may code for a protein of interest
TFASTA is very slow!
http://www.ebi.ac.uk/services/index.html

BLAST
Basic Local Alignment Search Tool
Developed as a way to perform a sequence similarity search by an algorithm that is faster than FASTA while being as sensitive (Altschul et al. 1990, 1994, 1997)

BLAST Build NWL & Search Database for NWH P-P: 7 Q-Q: 5 G-G: 6 …

BLAST Extend the Word Hits
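A rough sketch of ungapped hit extension, assuming a simple match/mismatch score and an X-drop stopping rule: extension stops once the running score falls more than `x_drop` below the best score seen. All names and numeric values here are illustrative assumptions. Real BLAST scores protein extensions with a substitution matrix.

```python
def _best_extension(pairs, match, mismatch, x_drop):
    """Best cumulative score, and its length, along a run of residue pairs."""
    best = running = 0
    best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        running += match if a == b else mismatch
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has dropped too far below the best; stop extending
    return best, best_len


def extend_word_hit(query, target, q_pos, t_pos, word_len,
                    match=5, mismatch=-4, x_drop=10):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end) of the extended segment pair.
    """
    seed = sum(match if query[q_pos + i] == target[t_pos + i] else mismatch
               for i in range(word_len))
    right = zip(query[q_pos + word_len:], target[t_pos + word_len:])
    left = zip(reversed(query[:q_pos]), reversed(target[:t_pos]))
    r_score, r_len = _best_extension(right, match, mismatch, x_drop)
    l_score, l_len = _best_extension(left, match, mismatch, x_drop)
    return seed + l_score + r_score, q_pos - l_len, q_pos + word_len + r_len


print(extend_word_hit("GGACGTACGTCC", "TTACGTACGTAA", q_pos=4, t_pos=4, word_len=3))
```

Here the 3-letter seed grows into the 8-residue segment `query[2:10]`, which would then be evaluated as a high-scoring segment pair.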

How to Interpret a BLAST Search: Expect Value
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
The key equation describing an E value is:
$E = K m n \, e^{-\lambda S}$

This equation is derived from a description of the extreme value distribution
$E = K m n \, e^{-\lambda S}$
–S = the score
–E = the expect value = the number of HSPs expected to occur with a score of at least S
–m, n = the lengths of the two sequences
–$\lambda$, K = Karlin-Altschul statistical parameters
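The Karlin-Altschul relation E = K m n e^(-λS) is easy to evaluate directly. In the sketch below the function name is hypothetical and the λ and K values are placeholders chosen for illustration. In practice they depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math


def expect_value(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lam * S) for query length m and database length n."""
    return K * m * n * math.exp(-lam * raw_score)


# Placeholder parameters, for illustration only.
lam, K = 0.267, 0.041
for score in (40, 60, 80):
    print(score, expect_value(score, m=250, n=50_000_000, lam=lam, K=K))
```

Each increase in raw score lowers E exponentially, while doubling the database length n doubles E.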

From Raw Scores to Bit Scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores)
Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices; the size of the search space enters only through m and n when a bit score is converted to an E value
$S' = \frac{\lambda S - \ln K}{\ln 2}$, where S' is the bit score
The E value corresponding to a given bit score is:
$E = m n \, 2^{-S'}$
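A short numerical check that the bit-score route gives the same E value as the raw-score formula, using the standard Karlin-Altschul definition S' = (λS - ln K) / ln 2. Function names and parameter values are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


lam, K = 0.267, 0.041  # placeholder values, for illustration only
S, m, n = 75, 250, 50_000_000
via_bits = expect_from_bits(bit_score(S, lam, K), m, n)
direct = K * m * n * math.exp(-lam * S)
print(bit_score(S, lam, K), via_bits, direct)
```

Once λ and K are folded into S', the E value depends only on the bit score and the search-space size.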

How to Interpret BLAST: E values and p values
A p value is a different way of representing the significance of an alignment.
$p = 1 - e^{-E}$
E         p
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000
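The table values can be reproduced from p = 1 - e^(-E). The sketch below uses `math.expm1`, which keeps precision for small E. The function name is hypothetical.

```python
import math


def p_from_e(e_value):
    """p = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_from_e(e):.8f}")
```

For small E the two measures are nearly identical, which is why E values below about 0.05 can be read much like p values.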

Blastp – Compares an amino acid query sequence against a protein sequence database.
Blastn – Compares a nucleotide query sequence against a nucleotide sequence database.
Blastx – Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
Tblastn – Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
Tblastx – Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
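"Translated in all reading frames" means three frames on each strand. A minimal sketch, assuming the standard genetic code and writing stop codons as `*`. The function names are hypothetical, and codons with ambiguous bases are rendered as `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; trailing bases that do not fill a codon are dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Frames 1-3 of the forward strand, then frames 1-3 of the reverse complement."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:]) for strand in (dna, reverse) for frame in range(3)]


print(six_frame_translation("ATGGCCTAA"))
```

Blastx applies this kind of translation to the query, Tblastn to the database, and Tblastx to both.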

PSI-Blast
Position Specific Iterative Blast
The sequences extracted from a Blast2 search are aligned and a statistical profile is derived from the multiple alignment
The profile is then used as a query for the next search, and this loop is iterated a number of times that is controlled by the user
Documentation at NCBI

PHI-Blast
Pattern-Hit Initiated Blast
The search space is restricted to the database sequences that match a motif, which is specified by the user and must also be contained in the query sequence.
The resulting alignment is anchored to this motif.

Comparison of BLAST and FASTA Algorithms
Both: heuristic
BLAST: several alignments per database entry
FASTA: one alignment per database entry
Word length (DNA): BLAST: 11, FASTA: 6
–Therefore, FASTA may be better for DNA seqs
BLAST automatically handles low-complexity sequences
Both provide:
–Ranking of alignments
–Alignment scores
–Statistical significance of alignments
–Alignments

Scoring
VLSPADKTNVKAAWGKVGA
|||      |   | || |
VLSEGEWQLVLHVWAKVEA
The alignment score = Matches - penalty for gap opening - penalty for gap extension
Do matches (V,V), (L,L), (S,S) have the same value?
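The scoring rule on this slide can be sketched as below, assuming every identity is worth the same (+1), mismatches score 0, and gaps are written as `-`. The penalty values and function name are illustrative assumptions. Substitution matrices such as PAM and BLOSUM replace the flat +1 with residue-specific values, which is the point of the slide's closing question.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap_open=2, gap_extend=1):
    """Score two aligned rows: matches minus gap-opening and gap-extension penalties."""
    score = 0
    gap_row = None  # which row the current gap run is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            row = "a" if a == "-" else "b"
            # The first position of a gap run pays the opening penalty,
            # each further position pays the extension penalty.
            score -= gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(score_alignment("VLSPADKTNVKAAWGKVGA", "VLSEGEWQLVLHVWAKVEA"))  # identities only
print(score_alignment("AC--GT", "ACTTGT"))  # 4 matches, one gap of length 2
```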

PAM Matrices
Point Accepted Mutation matrices (Dayhoff 1978)
–List the likelihood of change from one base or amino acid to another in homologous sequences during evolution
Amino acid PAM matrices are derived from families of closely related sequences
Evolutionary distance of 1 PAM = 1 accepted point mutation per 100 residues
Likelihood (odds) ratio for residues a and b:
–Probability a-b is a mutation / probability a-b is chance
PAM matrices contain log-odds figures
–Val>0: likely mutation
–Val=0: random mutation
–Val<0: unlikely mutation
250 PAM: similarity scores equivalent to 20% identity
Low PAM: good for finding short, strong local similarities
High PAM: good for finding long, weak similarities
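The sign convention for log-odds values can be illustrated with one common formulation: the observed frequency of the aligned pair divided by the frequency expected if the two residues were paired by chance, scaled as 10 × log10 and rounded. The function name and all frequencies below are made up for illustration.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, scale=10):
    """Rounded scale * log10(observed pair frequency / frequency expected by chance)."""
    return round(scale * math.log10(pair_freq / (freq_a * freq_b)))


# Hypothetical residue frequencies of 0.1 each, so chance pairing is 0.01.
print(log_odds(0.02, 0.1, 0.1))   # pair seen twice as often as chance: positive
print(log_odds(0.01, 0.1, 0.1))   # pair seen as often as chance: zero
print(log_odds(0.005, 0.1, 0.1))  # pair seen half as often as chance: negative
```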

BLOSUM Matrices
Blocks Substitution Matrices (Henikoff & Henikoff 1992)
Blocks: aligned, ungapped conserved regions of a protein family
–Calculate the frequency with which any amino acid can appear at each position
–Compute the probability that any amino acid can substitute for any other
Frequencies obtained from protein blocks constructed regardless of evolutionary distance
Blocks represent regions of conserved sequence similarities
–Conservation is due to functional constraints, thus the calculated frequencies reflect functional constraints
Much larger data set used than for the PAM matrix
BLOSUM64 is roughly equivalent to PAM120

Exercise 1
Search the following sequence against the Swiss-Prot database for similarity using the Fasta and NCBI-Blast2 programs at http://www.ebi.ac.uk/services/index.html
> unknown protein
MAVACAVAVRPLVQVAVASAVSTAAPASSKPAVKLAASAVSAVALTTVSVSAGLLATTAVEDPRFHAADCQS
RSADASASCEDLQPSTSTCTSAVRDANRPTRRVRRSGSKAQRRGSTTLTASVPSMAAAVVLPPKIALRRRHR
LRLRAGHSATAAATDKTPREQPDKPAALPEDLLPADATSTSSTGKISSAAVCCGLLAHCSAAQLHAILCGLV
QAVASSSVKGNNRKLLLGSKLRKLLEGVGVAPANGKAYTAADVAALSGPKLERLRATLKSQPGLLLWFLLFT
APAKLQALQAALLPGGAGDRSFEEWRAAIDAVAGSGHEQLAAAQEVRGRQSACVEGSTAGNTATTATITTTN
NNPASHGGVYTALTGTEVTGKKPAALPEDLLPADATSTSSTGKISSAAVCCGLLAHCSAAQLHAILCGLVQA
VASSSVKGNNRKLLLGSKLRKLLEGVGVAPANGKAYTAADVAALSGPKLERLRATLKSQPGLLLWFLLFTAP
AKLQALQAALLPGGAGDRSFEEWRAAIDAVAGSGHEQLAAAQEVRGRQSACVEGSTAGNTATTATITTTNNN
PASHGGVYTALTGTEVTGKAAANKDLSRTRTTSHRNRCVSESGSTRNKSRSSSSRSSSTHSVEYAEPKAGCS
QPAATVPGCVPEIISAAIPPLAPLALHIRRAIVKELLEARPPGWNTFLYSWLQAAGLSEFLPANGTCRMYMA
DRKQLVLRVGAMREEQVDAFLTCMCKAHGHSTWLARYLHMLGPEVSQLLS

GCG Introduction

GCG entry/data format
Entry
–database:accession-number
–genbank:zmzein; gb:*
The format of records
–NCBI
–EMBL
Editing
–seqed [filename]
–reformat [filename]
–reverse [filename]

Find Sequences - Lookup
Lookup: homo sapiens, ldlr
stringsearch

Retrieve sequences - fetch
fetch gb_pl:zmzein

Sequence edit - seqed Ctrl+D : reformat [filename] reverse [filename] seqed zmzein.gb_pl

SeqLab
Screen elements: Menu bar, Currently loaded list file, Mode selector, Attributes, List file contents

Menus - File

Menus - Edit

Menus - Functions

Menus - Options

Menus - Windows

Add sequence from databases

Two Modes: Main List and Editor

Mode change
–Place the cursor over the words "Main List"
–Hold down the mouse button; this shows a choice between Main List and Editor
–Still holding the mouse button down, slide the cursor over the word "Editor"
–Release the mouse button
Do not confuse the Editor Mode with the Edit menu at the top of the window!

Adding files to Main List / Editor
Three kinds of files
–files from your Unix directory
–files in the sequence databanks
–files you've retrieved from databanks on the net into your directory
If you have a file in Fasta format (a single line with a ">" sign and the name of the sequence, followed by as many lines of sequence letters as needed), you can, in Editor Mode only, use Import sequence from the File menu
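The Fasta layout described above (one ">" header line followed by sequence lines) is simple to read programmatically. A minimal sketch with a hypothetical function name. It is unrelated to how SeqLab itself imports files.

```python
def parse_fasta(text):
    """Return {name: sequence} for Fasta-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after ">" is taken as the sequence name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = "> unknown protein\nMAVACAVAVR\nPLVQVAVASA\n"
print(parse_fasta(example))
```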

Main List - view sequence
Weight: defines the significance of the sequence in comparisons with other sequences
Join: join or concatenate with the next sequence in the list that has an identical "Join: name"
–Used by the Assemble and Translate programs

Editing sequence
Editor
–Cut, Copy, and Paste
–Lock, Group, and unGroup
