From Pairwise Alignment to Database Similarity Search.


Pairwise Alignment Summary
Global: best score for aligning the full-length sequences
–Dynamic programming algorithm: Needleman-Wunsch
–Table cells are allowed any score
Local: best score for aligning part of the sequences
–Dynamic programming algorithm: Smith-Waterman
–Table cells never score below zero
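The difference between the two tables can be sketched in a few lines. This is a minimal illustration, not a production aligner: the function name `alignment_score`, the scoring values (match +1, mismatch -1) and the simple linear gap cost of -2 are all assumptions chosen for the example.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Fill the dynamic programming table and return the best score.

    local=False: global alignment (Needleman-Wunsch style), cells may take any score.
    local=True:  local alignment (Smith-Waterman style), cells never drop below zero.
    Scoring values are illustrative assumptions; gaps use a simple linear cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps in the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # the only change needed for local alignment
            table[i][j] = cell
            best = max(best, cell)
    # Global: score of the bottom-right cell. Local: best cell anywhere in the table.
    return best if local else table[-1][-1]


print(alignment_score("ACGT", "AGT"))                      # global score
print(alignment_score("TTACGTTT", "GGACGGG", local=True))  # local score
```

The table has (m + 1) × (n + 1) cells, which is why the cost of the exact algorithm grows with the product of the two sequence lengths.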

Gap Scores
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Biologically, indels occur in groups, and we want our gap score to reflect this.

Gap Scores
Standard solution: affine gap model
–One-time cost for opening a gap
–Lower cost for extending the gap
–Changes required to the algorithm

Affine Gap Penalty
$w_x = g + r(x - 1)$
$w_x$: total gap penalty; $g$: gap open penalty; $r$: gap extend penalty; $x$: gap length
The gap penalty is chosen so that:
–Gaps are not excluded
–Gaps are not over-included
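As a small worked sketch of the formula, the snippet below compares the affine penalty with a linear one for the long gap in the human DNA/mRNA example. The penalty values (g = 10, r = 1) and the function names are hypothetical, chosen only to show the effect.

```python
def affine_gap_penalty(x, g, r):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x (0 if there is no gap)."""
    if x <= 0:
        return 0
    return g + r * (x - 1)


def linear_gap_penalty(x, per_position):
    """For comparison: every gap position costs the same amount."""
    return per_position * x


# The lowercase stretch of the human DNA sequence that is absent from the mRNA.
missing = "cgacgtcgatcgatacgactagctagc"
x = len(missing)

# Hypothetical penalties: opening costs 10, each extension costs 1.
print("gap length:", x)
print("affine penalty:", affine_gap_penalty(x, g=10, r=1))
print("linear penalty:", linear_gap_penalty(x, per_position=10))
```

With an affine model, one long gap is much cheaper than many short gaps of the same total length, which matches the observation that indels occur in groups.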

Complexity
Complexity is determined by the size of the table
–Aligning a sequence of length m against one of length n requires calculating $m \times n$ cells
Example:
–Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells
–Aligning an mRNA and a $10^7$ bp chromosome requires ~$10^{11}$ cells

Discover function
Sequences that are similar probably have the same function

[Figure: a new sequence of unknown function is compared against a sequence database; a similar sequence (≈) suggests a similar function.]

Searching Databases for similar sequences
Naïve solution: use an exact algorithm to compare each sequence in the database to the query.
Is this reasonable? How much time will it take to calculate?

Complexity for genomes
The human genome contains $3 \times 10^9$ base pairs
–Searching an mRNA against the human genome requires ~$10^{13}$ cells
–Even efficient exact algorithms will be extremely slow when performed millions of times.
–Running the computations in parallel is expensive.
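A back-of-the-envelope sketch of these numbers follows. The throughput of 10^9 table cells per second is purely an assumed figure for illustration, and `table_cells` and `estimated_seconds` are hypothetical helper names.

```python
def table_cells(m, n):
    """Number of dynamic programming cells for sequences of length m and n."""
    return m * n


def estimated_seconds(cells, cells_per_second=1e9):
    """Rough run time; the default rate is an assumption, not a measurement."""
    return cells / cells_per_second


mrna = 8_000
chromosome = 10**7
human_genome = 3 * 10**9

for name, length in [("mRNA", mrna), ("chromosome", chromosome), ("human genome", human_genome)]:
    cells = table_cells(mrna, length)
    hours = estimated_seconds(cells) / 3600
    print(f"mRNA vs {name}: {cells:.1e} cells, about {hours:.2g} hours at the assumed rate")
```

Even under this optimistic rate a single query against the genome takes hours, so repeating the exact search for millions of queries is impractical.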

So what can we do?

Searching databases
Solutions:
1. Use a heuristic (approximate) algorithm to discard most irrelevant sequences.
2. Perform the exact algorithm on the small group of remaining sequences.

Heuristic strategy
–Remove regions that are not useful for meaningful alignments
–Preprocess the database into a new data structure to enable fast access

What sequences to remove?
–Low-complexity sequences, e.g. AAAAAAAAAAA, ATATATATATATA
–Transposable elements (LINEs, SINEs)

Low Complexity Sequences
What's wrong with them? They produce artificial high-scoring alignments.
So what do we do? We apply low-complexity masking to the database and the query sequence.
Before masking: TCGATCGTATATATACGGGGGGTA
After masking:  TCGATCGNNNNNNNNCNNNNNNTA

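Low Complexity Sequences
Complexity is calculated as:
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
where $N = 4$ for DNA (4 bases), $L$ is the length of the sequence, and $n_i$ is the number of occurrences of residue $i$ in the sequence.
For the sequence GGGG: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 4$, $n_C = 0$, $n_A = 0$, $n_T = 0$; $\prod_i n_i! = 24 \times 1 \times 1 \times 1 = 24$; $K = \frac{1}{4} \log_4 (24/24) = 0$
For the sequence CTGA: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 1$, $n_C = 1$, $n_A = 1$, $n_T = 1$; $\prod_i n_i! = 1 \times 1 \times 1 \times 1 = 1$; $K = \frac{1}{4} \log_4 (24/1) \approx 0.573$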
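The complexity formula can be checked numerically with a short sketch. The function name `complexity` is hypothetical; the calculation follows the definition above, with the alphabet size N defaulting to 4 for DNA.

```python
import math
from collections import Counter


def complexity(sequence, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for a sequence of length L."""
    length = len(sequence)
    counts = Counter(sequence)
    # L! / prod(n_i!) is a multinomial coefficient, so integer division is exact.
    arrangements = math.factorial(length) // math.prod(
        math.factorial(n) for n in counts.values()
    )
    return math.log(arrangements, alphabet_size) / length


for seq in ["GGGG", "CTGA", "ATATATAT", "ACGTTGCA"]:
    print(seq, round(complexity(seq), 3))
```

A masking tool could slide a window along the sequence and replace windows whose K falls below a chosen cutoff with N; the cutoff and window size would be tuning parameters that are not specified here.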

Heuristic strategy
–Remove low-complexity regions that are not useful for meaningful alignments
–Preprocess the database into a new data structure to enable fast access

Heuristic (approximate solution)
Methods: FASTA and BLAST
FASTA (Lipman & Pearson 1985)
–First fast sequence searching algorithm for comparing a query sequence against a database
BLAST: Basic Local Alignment Search Tool (Altschul et al. 1990)
–Improvement on FASTA: search speed, ease of use, statistical rigor

FASTA and BLAST
Common idea: a good alignment contains subsequences of absolute identity:
–First, identify very short (almost) exact matches.
–Next, the best short hits from the first step are extended to longer regions of similarity.
–Finally, the best hits are optimized using the Smith-Waterman algorithm.

FastA (fast alignment)
Assumption: a good alignment probably matches some identical ‘words’
Example: aligning a query sequence to a database
Database record: ACTTGTAGATACAAAATGTG
Query sequence:  A-TTGTCG-TACAA-ATCTG

FastA
–Preprocess all the sequences in the database: find short words and organize them in dictionaries.
–Process the query sequence and prepare a dictionary.
–ATGGCTGCTCAAGT…. ATGGTGGCGGCT… …

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6.
[Figure: dot plot of words in seq1 against words in seq2]
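The word-matching step can be sketched with a dictionary lookup. This is a simplified illustration rather than FastA's actual implementation: `word_index` and `diagonal_hits` are hypothetical names, and counting hits per diagonal stands in for "regions with a high density of exact word matches".

```python
from collections import Counter, defaultdict


def word_index(sequence, k):
    """Dictionary mapping each word of length k to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count exact word matches per diagonal (subject position minus query position)."""
    index = word_index(subject, k)
    hits = Counter()
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hits[s_pos - q_pos] += 1
    return hits


# The database record and (ungapped) query from the FastA example, word length 6.
database_record = "ACTTGTAGATACAAAATGTG"
query = "ATTGTCGTACAAATCTG"
print(diagonal_hits(query, database_record, k=6))
```

Matches that fall on the same diagonal belong to the same ungapped region, so diagonals with many hits are the candidates for re-scoring.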

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.

FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.

The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap.

FastA final stage
–Apply an exact local alignment algorithm to the surviving records, computing the final alignment score.
–Calculate an alignment score (S)
–Evaluate the statistical significance

Assessing Alignment Significance
Determine the probability of the alignment occurring at random
[Figure: score distributions of random and related sequences; panels labeled "Ideal" and "No Good"]

Assessing Alignment Significance
–Z-score = deviation (in standard deviations) of the actual score from the mean of random scores: $Z = (x - \text{mean}) / \text{sd}$
–Opt: the number of optimized scores observed.
–E: the number of sequences expected in the score range.
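The Z-score definition translates directly into code. In this sketch the random scores are simply given as a list; how they are produced (for example by aligning shuffled sequences) is left open, and the use of the population standard deviation is an assumption of the example.

```python
import statistics


def z_score(actual_score, random_scores):
    """Z = (x - mean) / sd, with mean and sd taken from the random scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)  # population standard deviation (assumed)
    return (actual_score - mean) / sd


# Hypothetical scores from alignments against randomized sequences.
random_scores = [22, 25, 19, 24, 21, 23, 20, 26]
print(round(z_score(58, random_scores), 1))
```

A large Z-score means the observed score lies far outside the range produced by chance alignments.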

BLAST: Basic Local Alignment Search Tool
Developed to be as sensitive as FastA but much faster.
Also searches for short words.
–Protein: 3-letter words
–DNA: 11-letter words
–Words can be similar, not only identical

BLAST
“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

BLAST (Protein Sequence Example)
1. Search the database for matching word pairs (score > T)
Example: …FSGTWYA…
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
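Building the word list can be sketched as follows. The helper names are hypothetical, and `toy_score` (match +5, mismatch -1) is a deliberately simple stand-in for a real substitution matrix such as BLOSUM62, so the neighborhood it produces is larger and less selective than the one BLAST would actually use.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def query_words(sequence, w=3):
    """All overlapping words of length w in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def toy_score(a, b, match=5, mismatch=-1):
    """Toy scoring scheme (an assumption): a real search would use a substitution matrix."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, threshold):
    """All words whose score against `word` is at least `threshold` (T)."""
    candidates = ("".join(letters) for letters in product(AMINO_ACIDS, repeat=len(word)))
    return {candidate for candidate in candidates if toy_score(word, candidate) >= threshold}


words = query_words("FSGTWYA", w=3)
print(words)
neighbors = neighborhood("FSG", threshold=9)
print(len(neighbors), "FTG" in neighbors, "YSG" in neighbors)
```

Under the toy scores a threshold of 9 admits every word with at most one substitution; with a real matrix only substitutions between similar residues would pass.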

BLAST (Protein Sequence Example)
1. Search the database for matching word pairs (score > T)
2. Extend word pairs as much as possible, i.e., as long as the total score increases
Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEWASNINETEEN
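The extension step can be sketched as an ungapped walk in both directions from a word hit. This is an illustration under assumptions, not BLAST's code: the function name, the +1/-1 scores, and the `x_drop` tolerance (how far the running score may fall below the best score before extension stops) are all choices made for the example.

```python
def extend_hit(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit of length w without gaps.

    Returns (best_score, query_start, query_end, subject_start) of the
    highest-scoring segment pair found around the hit.
    """
    def pair_score(i, j):
        return match if query[i] == subject[j] else mismatch

    best = sum(pair_score(q_pos + k, s_pos + k) for k in range(w))

    # Extend to the right, remembering where the score was highest.
    running, right, k = best, 0, 0
    while q_pos + w + k < len(query) and s_pos + w + k < len(subject):
        running += pair_score(q_pos + w + k, s_pos + w + k)
        k += 1
        if running > best:
            best, right = running, k
        elif best - running > x_drop:
            break

    # Extend to the left, starting from the best right-extended score.
    running, left, k = best, 0, 0
    while q_pos - k - 1 >= 0 and s_pos - k - 1 >= 0:
        running += pair_score(q_pos - k - 1, s_pos - k - 1)
        k += 1
        if running > best:
            best, left = running, k
        elif best - running > x_drop:
            break

    return best, q_pos - left, q_pos + w + right, s_pos - left


query, subject = "TTACGTACGTT", "GGACGTACGGG"
score, q_start, q_end, s_start = extend_hit(query, subject, q_pos=4, s_pos=4, w=3)
print(score, query[q_start:q_end], subject[s_start:s_start + (q_end - q_start)])
```

Only the best-scoring segment is reported, so trailing mismatches explored during extension are not part of the resulting HSP.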

BLAST
3. Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected together into one alignment.

How to interpret a BLAST search: The score is a measure of the similarity of the query to the sequence shown. The E-value is a measure of the reliability of the score.

How to interpret a BLAST search:
The expect value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is:
$E = K m n \, e^{-\lambda S}$

BLAST E value:
–Increases linearly with the length of the query sequence
–Increases linearly with the length of the database
–Decreases exponentially with the score of the alignment
–K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
–m = length of query; n = length of database; S = score
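These three dependencies can be seen by evaluating E = K m n e^(-λS) directly. The default values K = 0.134 and λ = 0.318 below are illustrative assumptions (they are in the range quoted for ungapped protein scoring), and `e_value` is a hypothetical helper name.

```python
import math


def e_value(score, m, n, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S); K and lambda defaults are assumed values."""
    return K * m * n * math.exp(-lam * score)


m, n = 250, 10**7  # hypothetical query length and database length
print("baseline:        ", e_value(60, m, n))
print("query twice as long:", e_value(60, 2 * m, n))   # doubles
print("database 10x larger:", e_value(60, m, 10 * n))  # ten times larger
print("score 10 higher:    ", e_value(70, m, n))       # shrinks by exp(-10 * lambda)
```

The same raw score is therefore less convincing in a larger database, which is why E values from searches against different databases are not directly comparable.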

From raw scores to bit scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores).
Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes:
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is:
$E = m n \, 2^{-S'}$
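A short sketch confirms that the bit-score form of the E value agrees with the raw-score form, since 2^(-S') = K e^(-λS). The parameter values are illustrative assumptions and the function names are hypothetical.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_value_from_raw(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


K, lam = 0.134, 0.318  # assumed statistical parameters for one scoring system
m, n = 250, 10**7      # hypothetical query and database lengths
raw = 60

bits = bit_score(raw, K, lam)
print("bit score:", round(bits, 1))
print("E from bits:", e_value_from_bits(bits, m, n))
print("E from raw: ", e_value_from_raw(raw, m, n, K, lam))
```

Because K and λ are absorbed into S', two alignments with the same bit score have the same E value for a given m and n, whichever scoring matrix produced them.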

What is a good E-value? Rules of thumb
–E values of less than 0.00001 indicate that the sequences are almost always homologues.
–Greater E values can represent homologues as well.
–Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched.
–Sometimes a real match has an E value > 1.
–Sometimes a similar E value occurs for a short exact match and a long, less exact match.

Significance of Gapped Alignments
–Gapped alignments use the same statistics, but λ and K cannot be easily estimated
–Empirical estimates and gap scores are determined by looking at random alignments

BLAST
BLAST is a family of programs
Query: DNA or protein
Database: DNA or protein

Choose the BLAST program
Program   Input     Database   Comparisons (reading frames)
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36

Example: The lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D]
Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

BLAST search with PAEP as a query finds many other lipocalins

Assessing whether proteins are homologous
RBP4 and PAEP: low bit score, E value 0.49, 24% identity, but they are indeed homologous.
