Database Similarity Search

Presentation transcript:

Database Similarity Search

Why do we care to align sequences? Sequences that are similar probably have the same function.

Discover the function of a new sequence: search the new sequence (?) against a sequence database; similar sequence ≈ similar function.


Searching databases for similar sequences
Naïve solution: use an exact algorithm to compare each sequence in the database to the query. Is this reasonable? How much time will it take to calculate?

Complexity for genomes
- The human genome contains 3 × 10^9 base pairs.
- Searching an mRNA against the human genome requires ~10^12 cells in the alignment matrix.
- Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
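A rough back-of-the-envelope version of this estimate, as a sketch only: the genome size comes from the slide, while the mRNA length and the cell-update rate are assumed round numbers chosen for illustration.

```python
# Size of the exact (dynamic programming) alignment matrix for one query.
genome_length = 3 * 10**9      # base pairs, from the slide
mrna_length = 1_000            # ASSUMED typical mRNA length, in nucleotides
cells = genome_length * mrna_length

# ASSUMED throughput of one machine: 10^9 matrix cells per second.
cells_per_second = 10**9
seconds_per_query = cells / cells_per_second

print(f"cells per query: {cells:.1e}")                       # order of 10^12
print(f"minutes per query: {seconds_per_query / 60:.0f}")
```

Under these assumptions a single query already takes tens of minutes, which is why repeating it for millions of queries is impractical.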

So what can we do?

Searching databases Solution: Use a heuristic (approximate) algorithm

Heuristic strategy
- Reduce the search space: remove regions that are not useful for meaningful alignments
- Perform efficient search strategies: preprocess the database into a new data structure to enable fast access


What sequences to remove?
- Low complexity sequences (JUNK???): AAAAAAAAAAA, ATATATATATATA
- Transposable elements
- 53% of the genome is repetitive DNA

Low Complexity Sequences
What's wrong with them?
* Not informative
* Produce artificially high-scoring alignments
So what do we do? We apply low-complexity masking to the database and the query sequence:
TCGATCGTATATATACGGGGGGTA
is masked to
TCGATCGNNNNNNNNCNNNNNNTA
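The masking step can be sketched in a few lines. This is a toy rule, not the filter used by real BLAST installations: it assumes that "low complexity" means a single base repeated at least 6 times, or a two-base unit repeated at least 3 times. The function name and the thresholds are hypothetical.

```python
import re

# Toy low-complexity rule (an assumption for illustration):
#   - one base repeated >= 6 times, or
#   - a two-base unit repeated >= 3 times
LOW_COMPLEXITY = re.compile(r"([ACGT])\1{5,}|([ACGT]{2})\2{2,}")

def mask_low_complexity(seq, mask_char="N"):
    """Replace every low-complexity run with mask_char, keeping the length."""
    return LOW_COMPLEXITY.sub(lambda m: mask_char * len(m.group(0)), seq)

print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
# prints TCGATCGNNNNNNNNCNNNNNNTA
```

Masked positions keep their coordinates, so alignments found elsewhere in the sequence still map back to the original.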

Heuristic strategy
- Remove low-complexity regions that are not useful for meaningful alignments
- Perform efficient search strategies: preprocess the database into a new data structure to enable fast access

BLAST: Basic Local Alignment Search Tool (Altschul et al. 1990)
General idea: a good alignment contains subsequences of high identity (local alignment):
ACGCCCGGGAGCGC
CTGGGCGTATAGCCC
- First, identify (most efficiently) short, almost exact matches.
- Next, extend them to longer regions of similarity.
- Finally, optimize the alignment using an exact algorithm.

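DNA/RNA vs protein alphabet
DNA (4): A T G C
RNA (4): A U G C
Protein (20): ACDEFGHIKLMNPQRSTVWY
In DNA, an A-T mismatch scores the same as an A-G mismatch. In proteins, an A-G substitution scores much higher than an A-W substitution. Why is it different?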

The 20 Amino Acids

[Figure: amino acids A, W and G]

Scoring system for amino acid mismatches
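To make the contrast concrete, here is a small sketch comparing a flat DNA score with a handful of amino acid substitution scores. The slide does not name a matrix; the values below are the BLOSUM62 entries for A, G and W, used here as an assumption, and the DNA match/mismatch values are hypothetical.

```python
# Subset of BLOSUM62 (ASSUMED matrix; only the A, G, W entries are listed).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2,
}

def protein_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]

def dna_score(a, b, match=1, mismatch=-1):
    """Flat DNA scoring: every mismatch costs the same (hypothetical values)."""
    return match if a == b else mismatch

print(dna_score("A", "T"), dna_score("A", "G"))          # same penalty
print(protein_score("A", "G"), protein_score("A", "W"))  # A-G scores higher
```

Small amino acids such as A and G replace each other far more readily than A and the bulky W, which is what the unequal scores encode.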


BLAST (Protein Sequence Example)
First, identify (most efficiently) short, almost exact matches between the query sequence and the database.
Query sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA
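Splitting a sequence into overlapping words takes one line of code. A minimal sketch, with a hypothetical function name:

```python
def words(seq, k=3):
    """Return all overlapping words (k-mers) of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(words("FSGTWYA"))
# prints ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```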

BLAST: preprocessing of the database
Each database sequence is broken into words:
Seq 1: FSGTWYA: FSG, SGT, GTW, TWY, WYA
Seq 2: FDRTSYV: FDR, DRT, RTS, TSY, SYV
Seq 3: SWRTYVA: SWR, WRT, RTY, TYV, YVA
…
BAG OF WORDS (each word points to the sequences that contain it, e.g. Seq 3546, Seq 102, Seq 1):
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
…
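One way to picture the bag of words is as an inverted index from each word to the places where it occurs. The sketch below uses the three sequences from the slide; the dictionary layout and function name are assumptions, not the data structure of any particular BLAST implementation.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Map every k-letter word to a list of (sequence name, position) pairs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)

# Looking up a query word is now a single dictionary access.
print(index.get("GTW"))   # [('Seq 1', 2)]
print(index.get("RTY"))   # [('Seq 3', 2)]
```

The index is built once for the whole database, so each query pays only for its own word lookups rather than a scan of every sequence.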

BLAST
Query sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA…
DATABASE: FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS…
SEQ N: INVIEIAFDGTWTCATTNAMHEWASNINETEEN


BLAST
2. Extend word pairs as much as possible, i.e., as long as the total score increases, to obtain High-scoring Segment Pairs (HSPs).
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
(Q = query sequence, D = sequence in database)
3. Finally, optimize the alignment using an exact algorithm.
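The extension step can be sketched as an ungapped walk to the left and right of the seed word, keeping the best-scoring stretch and giving up once the running score falls too far below the best seen. The scoring (match +2, mismatch -1) and the drop-off limit are hypothetical values chosen for illustration; a real search would score with a substitution matrix.

```python
def extend_seed(query, subject, q_start, s_start, word_len,
                match=2, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed word into a high-scoring segment pair."""
    def score(a, b):
        return match if a == b else mismatch

    def walk(q, s, step):
        # Extend in one direction; remember the best gain and its length.
        gain = best_gain = length = best_len = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += score(query[q], subject[s])
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > x_drop:
                break  # score dropped too far below the best: stop
            q += step
            s += step
        return best_gain, best_len

    seed = sum(score(query[q_start + i], subject[s_start + i])
               for i in range(word_len))
    right_gain, right_len = walk(q_start + word_len, s_start + word_len, +1)
    left_gain, left_len = walk(q_start - 1, s_start - 1, -1)

    q_lo, s_lo = q_start - left_len, s_start - left_len
    length = left_len + word_len + right_len
    return (seed + left_gain + right_gain,
            query[q_lo:q_lo + length], subject[s_lo:s_lo + length])

Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
# The seed word GTW starts at position 12 in Q and position 9 in D (0-based).
print(extend_seed(Q, D, 12, 9, 3))
# prints (9, 'INIHFSGTW', 'IEIAFDGTW')
```

With these toy scores the seed grows leftward across three scattered mismatches but not rightward, where mismatches arrive before any match can pay for them.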

Running BLAST to predict the function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG
IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF
GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF
GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK
LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL
PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR

How to interpret a BLAST score:
The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant?
- Statistical significance
- Biological significance

How to interpret a BLAST search:
For each BLAST score we can calculate an expectation value (E-value). The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.

BLAST E-value: E = K m n e^(-λs)
- Increases linearly with the length of the query sequence
- Increases linearly with the length of the database
- Decreases exponentially with the score of the alignment
- K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
m = length of query; n = length of database; s = score
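These three dependencies follow from the usual Karlin-Altschul form of the E-value, E = K·m·n·e^(-λs), which is assumed here. The sketch below checks each property numerically; the values of K, λ, the lengths and the scores are illustrative assumptions only.

```python
import math

def e_value(s, m, n, K, lam):
    """Expected number of chance alignments scoring at least s (assumed form)."""
    return K * m * n * math.exp(-lam * s)

# Illustrative values only (assumptions, not taken from the slides).
K, lam = 0.134, 0.318
m, n = 300, 10**8          # query length, database length

base = e_value(60, m, n, K, lam)
print(e_value(60, 2 * m, n, K, lam) / base)   # doubling the query doubles E
print(e_value(60, m, 2 * n, K, lam) / base)   # doubling the database doubles E
print(e_value(70, m, n, K, lam) / base)       # 10 more score points: factor e^(-10*lam)
```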

What is a good E-value? (rule of thumb)
- E-values of less than ... show that sequences are almost always related.
- Greater E-values can represent functional relationships as well.
- Sometimes a real (biological) match has an E-value > 1.
- Sometimes a similar E-value occurs for a short exact match and a long, less exact match.


Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Sometimes corrections to the model are needed to infer biological significance.

Gap Scores
Standard solution: affine gap model
w_x = g + r(x - 1)
w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length
- Once-off cost for opening a gap
- Lower cost for extending the gap
- Changes required to the algorithm
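A short sketch of the affine model next to a flat per-residue penalty. The formula is the one on the slide; the values g = 10 and r = 1 are hypothetical.

```python
def affine_gap_penalty(x, g=10, r=1):
    """w_x = g + r(x - 1): pay g to open the gap, r for each further residue."""
    return 0 if x <= 0 else g + r * (x - 1)

def linear_gap_penalty(x, g=10):
    """Flat model for comparison: every gapped residue costs g."""
    return g * x

for x in (1, 5, 27):
    print(x, affine_gap_penalty(x), linear_gap_penalty(x))
# 1 10 10
# 5 14 50
# 27 36 270
```

Under the affine model one long gap, such as the 27-base intron in the human DNA example, costs far less than 27 separate gaps, which matches the observation that gaps tend to be longer than one residue.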

Gapped BLAST
4. Connect several HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.

BLAST
BLAST is a family of programs, depending on the type of query and database:
Query: DNA or protein
Database: DNA or protein