From Pairwise Alignment to Database Similarity Search.


Pairwise Alignment Summary
Global: best score for aligning the full-length sequences. Dynamic programming algorithm: Needleman-Wunsch. Table cells are allowed any score.
Local: best score for aligning part of the sequences. Dynamic programming algorithm: Smith-Waterman. Table cells never score below zero.
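The difference between the two table rules can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function name `align_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global (Needleman-Wunsch): leading gaps cost something,
        # so the first row and column are allowed to go negative.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                # Local (Smith-Waterman): a cell never scores below zero.
                cell = max(cell, 0)
            table[i][j] = cell
            best = max(best, cell)
    # Global reads the bottom-right cell; local takes the best cell anywhere.
    return best if local else table[-1][-1]
```

With these assumed scores, two sequences that share only a short common core get a positive local score and a negative global score, which is why local alignment is the natural choice for database searches.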

Why do we care to align sequences? Sequences that are similar probably have the same function.

Discover the function of a new sequence: search the new sequence against a sequence database; a similar sequence (≈) suggests a similar function.

Searching databases for similar sequences. Naïve solution: use an exact algorithm to compare each sequence in the database to the query. Is this reasonable? How much time will it take to calculate?

Complexity for genomes. The human genome contains 3 × 10^9 base pairs.
- Searching an mRNA against the human genome requires ~10^13 dynamic programming table cells.
- Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
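The ~10^13 figure follows from multiplying the two sequence lengths, since an exact dynamic programming alignment fills one table cell per pair of positions. The short calculation below is a rough sketch: the mRNA length and the cells-per-second rate are assumptions made for illustration, not values from the slides.

```python
genome_length = 3 * 10**9   # base pairs in the human genome (from the slide)
mrna_length = 3_000         # assumed typical mRNA length, for illustration only

# One dynamic programming cell per (genome position, mRNA position) pair.
cells = genome_length * mrna_length

cells_per_second = 10**9    # hypothetical throughput of a single core
hours = cells / cells_per_second / 3600

print(f"{cells:.1e} cells, roughly {hours:.1f} hours for one query")
```

Under these assumptions a single query already takes hours, and a database search repeats this for every query.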

So what can we do?

Searching databases Solution: Use a heuristic (approximate) algorithm to discard most irrelevant sequences and perform the exact algorithm on the small group of remaining sequences.

Heuristic strategy: remove regions that are not useful for meaningful alignments; preprocess the database into a new data structure to enable fast access.

What sequences to remove? Low-complexity sequences (for example AAAAAAAAAAA or ATATATATATATA) and transposable elements (LINEs, SINEs).

Low Complexity Sequences. What's wrong with them? They produce artificially high-scoring alignments. So what do we do? We apply low-complexity masking to the database and the query sequence.
Before masking: TCGATCGTATATATACGGGGGGTA
After masking: TCGATCGNNNNNNNNCNNNNNNTA
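As a toy illustration of what masking does, the sketch below replaces simple repeats with N. It is not how production maskers work (those score windows by complexity): the regular expression, the repeat thresholds and the function name `mask_low_complexity` are all assumptions chosen so that the slide's example is reproduced.

```python
import re

# Toy rule (an assumption): one base repeated 5 or more times,
# or a 2-base unit repeated 3 or more times.
LOW_COMPLEXITY = re.compile(r"([ACGT])\1{4,}|([ACGT]{2})\2{2,}")

def mask_low_complexity(seq):
    """Replace every low-complexity run with the same number of N characters."""
    return LOW_COMPLEXITY.sub(lambda m: "N" * len(m.group(0)), seq)

print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

Masked positions keep their coordinates, so hits found in the remaining sequence still map back to the original.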

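Low Complexity Sequences. Complexity is calculated as:
$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$
where N = 4 for DNA (4 bases), L is the length of the sequence and $n_i$ is the number of occurrences of residue i in the sequence.
For the sequence GGGG: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 4$, $n_C = 0$, $n_A = 0$, $n_T = 0$; $\prod_i n_i! = 24 \times 1 \times 1 \times 1 = 24$; $K = \frac{1}{4} \log_4 (24/24) = 0$.
For the sequence CTGA: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 1$, $n_C = 1$, $n_A = 1$, $n_T = 1$; $\prod_i n_i! = 1 \times 1 \times 1 \times 1 = 1$; $K = \frac{1}{4} \log_4 (24/1) = 0.573$.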
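The complexity formula is easy to check numerically. The following is a minimal sketch; the function name `complexity` and its parameter names are assumptions, while the formula itself is the one given on the slide.

```python
from collections import Counter
from math import factorial, log

def complexity(seq, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for a sequence over N letters."""
    length = len(seq)
    denominator = 1
    for count in Counter(seq).values():
        denominator *= factorial(count)
    # L! / prod(n_i!) is a multinomial coefficient, so integer division is exact.
    arrangements = factorial(length) // denominator
    return log(arrangements, alphabet_size) / length

print(round(complexity("GGGG"), 3))
print(round(complexity("CTGA"), 3))
```

The two calls reproduce the worked examples: 0 for GGGG and about 0.573 for CTGA.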

Heuristic strategy: remove low-complexity regions that are not useful for meaningful alignments; preprocess the database into a new data structure to enable fast access.

Heuristic (approximate solution) methods: FASTA and BLAST.
- FASTA (Lipman & Pearson 1985): the first fast sequence searching algorithm for comparing a query sequence against a database.
- BLAST, Basic Local Alignment Search Tool (Altschul et al. 1990): an improvement on FASTA in search speed, ease of use and statistical rigor.

FASTA and BLAST. Common idea: a good alignment contains subsequences of absolute identity.
- First, identify very short (almost) exact matches.
- Next, the best short hits from the first step are extended to longer regions of similarity.
- Finally, the best hits are optimized using the Smith-Waterman algorithm.

FastA (fast alignment). Assumption: a good alignment probably matches some identical ‘words’. Example: aligning a query sequence to a database.
Database record: ACTTGTAGATACAAAATGTG
Query sequence:  A-TTGTCG-TACAA-ATCTG

FastA: preprocess all the sequences in the database by finding short words and organizing them in dictionaries. Then process the query sequence and prepare a dictionary for it as well. Example sequences: ATGGCTGCTCAAGT…, ATGGTGGCGGCT…
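A word dictionary of this kind can be sketched as a mapping from each word to the positions where it occurs. This is an illustrative sketch only; the name `build_word_index` and the choice of a Python dictionary are assumptions, not FastA's actual data structure.

```python
from collections import defaultdict

def build_word_index(seq, k=6):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

index = build_word_index("ACGTACGT", k=4)
print(dict(index))
```

Building the index takes one pass over the sequence, and looking up a word afterwards costs the same regardless of sequence length.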

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6. (Figure axes: words in seq1, words in seq2.)
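Regions with a high density of word matches show up as diagonals that collect many hits, because matching words at a constant offset between the two sequences share the same value of (target position - query position). The sketch below counts hits per diagonal; the function name `diagonal_hits` is an assumption, and the usage line uses a word length of 4 only because the example sequences are short.

```python
from collections import Counter, defaultdict

def diagonal_hits(query, target, k=6):
    """Count exact k-word matches on each diagonal (target_pos - query_pos)."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[j - i] += 1
    return hits

# Sequences from the FastA example slide, with the gap characters removed.
hits = diagonal_hits("ATTGTCGTACAAATCTG", "ACTTGTAGATACAAAATGTG", k=4)
print(hits.most_common())
```

The diagonal with the most hits marks the best candidate region for re-scoring.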

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.

FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.

The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap.
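The joined score is simple arithmetic. The sketch below assumes that joining consecutive regions introduces exactly one gap per join; the function name and the example numbers are hypothetical.

```python
def joined_score(region_scores, joining_penalty):
    """Sum of the initial region scores minus one joining penalty per gap."""
    gaps = max(len(region_scores) - 1, 0)  # one gap between each pair of neighbours
    return sum(region_scores) - joining_penalty * gaps

print(joined_score([40, 25, 30], joining_penalty=20))
```

Joining only pays off when the added region scores more than the penalty, which keeps weak distant regions from being chained together.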

BLAST: Basic Local Alignment Search Tool. Developed to be as sensitive as FastA but much faster. Also searches for short words.
- Protein: 3-letter words
- DNA: 11-letter words
- Words can be similar, not only identical

BLAST (protein sequence example)
1. Search the database for matching word pairs (score > T).
Example: …FSGTWYA… A list of words (w = 3), with similar words listed below each query word, is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
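The list of similar words can be generated by scoring every possible word against the query word and keeping those above the threshold T. The sketch below is a simplified illustration: `toy_score` (+5 for identical residues, -1 otherwise) is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, so the words it accepts differ from the ones on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    # Hypothetical scoring, not a real substitution matrix.
    return 5 if a == b else -1

def neighborhood(word, threshold, score=toy_score):
    """All words of the same length whose score against `word` exceeds the threshold."""
    return {
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, candidate)) > threshold
    }

words = neighborhood("FSG", threshold=8)
print(len(words), "YSG" in words, "FTG" in words)
```

With this toy scoring and T = 8, a word passes when at most one residue differs from the query word. A real matrix would instead admit conservative substitutions and reject unlikely ones.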

BLAST (protein sequence example)
1. Search the database for matching word pairs (score > T).
2. Extend word pairs as much as possible, i.e., as long as the total score increases.
Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEWASNINETEEN
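The extension step can be sketched as an ungapped walk in both directions from the seed, remembering the best total seen so far. This is a minimal sketch under stated assumptions: the name `extend_seed`, the `x_drop` cutoff (stop once the running score falls too far below the best) and the scoring function passed in are all illustrative choices rather than the slides' exact procedure.

```python
def extend_seed(query, target, q_pos, t_pos, word_len, score, x_drop=5):
    """Extend an ungapped seed hit in both directions.

    Returns (q_start, q_end, best_score); query[q_start:q_end] is the
    query side of the high-scoring segment pair.
    """
    seed = sum(score(query[q_pos + k], target[t_pos + k]) for k in range(word_len))

    def extend(step, q_i, t_i):
        best = running = 0
        best_len = length = 0
        while 0 <= q_i < len(query) and 0 <= t_i < len(target):
            running += score(query[q_i], target[t_i])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # the score has dropped too far; give up on this side
            q_i += step
            t_i += step
        return best, best_len

    right, right_len = extend(+1, q_pos + word_len, t_pos + word_len)
    left, left_len = extend(-1, q_pos - 1, t_pos - 1)
    return q_pos - left_len, q_pos + word_len + right_len, seed + left + right
```

Only the extension up to the best-scoring point is kept on each side, so trailing mismatches do not become part of the segment.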

BLAST
3. Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected together into one alignment.

How to interpret a BLAST search: The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

Assessing Alignment Significance. Determine the probability of an alignment occurring at random. For each score we can compute the probability of getting it by chance. (Figure: frequency versus score for random and related sequences, in two panels labeled "Ideal" and "No Good".)

How to interpret a BLAST search: for each BLAST score we get an E-value. The expect value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p (p-value).

BLAST E value: $E = K m n e^{-\lambda s}$
- Increases linearly with the length of the query sequence
- Increases linearly with the length of the database
- Decreases exponentially with the score of the alignment
K, λ: statistical parameters dependent upon the scoring system and background residue frequencies; m = length of query; n = length of database; s = score.
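These dependencies can be checked with the Karlin-Altschul relation E = K·m·n·e^(-λs), which is the standard form matching the parameters listed above. In the sketch below the function names are assumptions and the values passed for K and λ are placeholders for illustration, not parameters of any particular scoring system. The conversion p = 1 - e^(-E) is the usual link between an E value and a p-value.

```python
from math import exp

def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given the expected number."""
    return 1 - exp(-e)

# Placeholder parameters, chosen only for illustration.
base = e_value(score=60, m=250, n=10**7, K=0.1, lam=0.3)
print(base, p_value(base))
print(e_value(60, 500, 10**7, 0.1, 0.3) / base)   # query twice as long
print(e_value(70, 250, 10**7, 0.1, 0.3) / base)   # score 10 higher
```

Doubling the query length doubles E, while adding 10 to the score divides E by e^(10λ).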

From raw scores to bit scores. Bit scores S' are normalized and are comparable across different databases. The E value corresponding to a given bit score is: $E = m n \, 2^{-S'}$
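The bit score absorbs K and λ, using the standard normalization S' = (λS - ln K) / ln 2. Substituting this into E = m·n·2^(-S') gives back K·m·n·e^(-λS), so the two expressions for E agree. The sketch below checks that numerically; the function names and the parameter values are assumptions for illustration.

```python
from math import exp, log

def bit_score(raw_score, K, lam):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - log(K)) / log(2)

def e_from_bits(bits, m, n):
    """E value from a bit score: E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)

# Placeholder values, chosen only for illustration.
S, m, n, K, lam = 60, 250, 10**7, 0.1, 0.3
print(e_from_bits(bit_score(S, K, lam), m, n))
print(K * m * n * exp(-lam * S))
```

Because K and λ are folded into S', bit scores from searches with different scoring systems can be compared directly.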

What is a good E-value? (Rule of thumb) E values of less than […] show that sequences are almost always homologues. Greater E values can represent homologues as well. Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched. Sometimes a real match has an E value > 1. Sometimes a similar E value occurs for a short exact match and a long, less exact match.

Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Biologically, indels occur in groups; we want our gap score to reflect this.

Gap Scores. Standard solution: affine gap model
$w_x = g + r(x - 1)$
where $w_x$ is the total gap penalty, g the gap open penalty, r the gap extend penalty and x the gap length.
- Once-off cost for opening a gap
- Lower cost for extending the gap
- Changes required to the algorithm
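The effect of the affine model is easiest to see by comparing one long gap with many short ones. The sketch below is illustrative: the function name and the penalty values g = 11 and r = 1 are assumptions.

```python
def affine_gap_penalty(x, g=11, r=1):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x (0 if no gap)."""
    if x <= 0:
        return 0
    return g + r * (x - 1)

one_long_gap = affine_gap_penalty(27)          # a single 27-base gap
many_short_gaps = 27 * affine_gap_penalty(1)   # 27 separate 1-base gaps
print(one_long_gap, many_short_gaps)
```

A single contiguous gap is penalized far less than the same number of scattered gaps, which reflects the observation that indels occur in groups.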

Significance of Gapped Alignments. Gapped alignments use the same statistics, but λ and K cannot be easily estimated. Empirical estimations and gap scores are determined by looking at random alignments.

BLAST is a family of programs. Query: DNA or protein. Database: DNA or protein.

Choose the BLAST program
Program   Query (input)   Database   Comparisons
blastn    DNA             DNA        1
blastp    protein         protein    1
blastx    DNA             protein    6
tblastn   protein         DNA        6
tblastx   DNA             DNA        36
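The table can be read as a lookup from (query type, database type) to the applicable programs. The sketch below encodes exactly the five rows of the table; the dictionary and function names are hypothetical.

```python
# (query type, database type) -> programs, taken from the table above.
BLAST_PROGRAMS = {
    ("DNA", "DNA"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("DNA", "protein"): ["blastx"],
    ("protein", "DNA"): ["tblastn"],
}

def choose_programs(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    return BLAST_PROGRAMS[(query_type, database_type)]

print(choose_programs("DNA", "protein"))
```

For DNA against DNA there are two options: blastn compares the nucleotides directly, while tblastx compares translations in all reading frames of both sequences, hence the 36 comparisons.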

Example: the lipocalins (each dot is a protein): retinol-binding protein, odorant-binding protein, apolipoprotein D. Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc.

BLAST search with PAEP as a query finds many other lipocalins

Assessing whether proteins are homologous RBP4 and PAEP: Low bit score, E value 0.49, 24% identity but they are indeed homologous.