Problem with N-W and S-W

Problem with N-W and S-W (Needleman-Wunsch and Smith-Waterman)
- They are exhaustive, with stringent statistical thresholds.
- As the databases got larger, they began to take too long (computational constraints).
- Need a program whose algorithm is based on dynamic programming (substrings) but uses a heuristic approach (makes assumptions; quicker, but not as refined).
- Such a program can compare a sequence against 100 million others in a relatively short time, and the bigger the database, the better the results.

Two heuristic approaches
Each is based on different assumptions and calculations of probability thresholds (i.e. the statistical significance of results).
1. FASTA: all proteins in the database are equally likely to be related to the query → probability multiplied by the number of sequences in the database.
2. BLAST: the query is more likely to be related to a long sequence than to a short one → probability multiplied by N/n (N: total number of residues in the database; n: length of the subject sequence).

The BLAST algorithm: Basic Local Alignment Search Tool
- Breaks your sequence down into smaller segments (words).
- Does the same for all sequences in the database.
- Looks for exact matches, word by word, and extends those upstream and downstream one base at a time, allowing for a certain number of mismatches.
- Stops when the sequence ends or the statistical significance becomes too low (too many mismatches).
- Can find more than one area of similarity between two sequences.

The BLAST algorithm
STEP 1: k-mers in the query sequence.
query sequence: … PQGEFGFAT …
words: PQG, QGE, GEF, EFG, …
STEP 2: Score the words. Compare each 3-mer obtained in step 1 against all 20^3 = 8,000 possible 3-mers. With a threshold T, keep the best-scoring words.
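Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the toy scoring function (+5 for identical residues, -2 otherwise) stands in for a real substitution matrix, the threshold value is arbitrary, and the names `kmers`, `toy_score`, and `neighborhood` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmers(sequence, k=3):
    """Step 1: list every overlapping word of length k in the query."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def toy_score(a, b):
    """Assumed toy scoring: +5 for identity, -2 for mismatch (not a real matrix)."""
    return 5 if a == b else -2

def neighborhood(word, threshold):
    """Step 2: score the word against all 20^k candidates; keep those scoring >= threshold."""
    kept = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            kept.append("".join(candidate))
    return kept

query = "PQGEFGFAT"
words = kmers(query)
neighbors = neighborhood("PQG", threshold=8)
print(words)
print(len(neighbors), "neighborhood words kept for PQG")
```

With this toy scoring, a threshold of 8 keeps the exact word plus every word that differs at a single position; a real substitution matrix would keep a different, biologically weighted set.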

The BLAST algorithm
STEP 3: Organize the search words in a tree for efficient retrieval.
STEP 4: Repeat steps 2-3 for each 3-mer.
STEP 5: Scan the database sequences to find exact matches of words, then extend the exact matches to a high-scoring segment pair (HSP).
query sequence:    …  P  Q  G  E  F  G  F  A  T …
database sequence: …  A  Q  G  E  F  E  A  Q  T …
scores:            … -2  6  7  2  6  1 -2 -3  7 …
HSP score: 22
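The extension in step 5 can be sketched as follows. This is a simplified, ungapped illustration under stated assumptions: the toy scoring (+5 identity, -2 mismatch) is not the matrix used on the slide, the "stop when the score drops too far below the best seen" rule (`x_drop`) is one common way to end an extension, and the function names are hypothetical.

```python
def toy_score(a, b):
    """Assumed toy scoring: +5 for identity, -2 for mismatch (not a real matrix)."""
    return 5 if a == b else -2

def extend_one_way(query, subject, q, s, step, x_drop):
    """Walk from (q, s) in direction step; return (best gain, residues added at best)."""
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += toy_score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen
        q += step
        s += step
    return best, best_len

def extend_hit(query, subject, q_start, s_start, k=3, x_drop=6):
    """Extend an exact k-mer hit without gaps; return (score, query_start, query_end)."""
    seed = sum(toy_score(query[q_start + i], subject[s_start + i]) for i in range(k))
    right, r_len = extend_one_way(query, subject, q_start + k, s_start + k, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left + right, q_start - l_len, q_start + k + r_len

query = "PQGEFGFAT"
subject = "AQGEFEAQT"
score, start, end = extend_hit(query, subject, q_start=1, s_start=1)  # seed word "QGE"
print(score, query[start:end])
```

Because the scoring here is a toy, the segment and score it reports differ from the HSP score of 22 shown on the slide.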

The BLAST algorithm
STEP 6: Evaluate the significance of the HSP score.
Karlin-Altschul statistics: using the HSPs, we calculate the expected number of HSPs with score at least S:
$E = K m n e^{-\lambda S}$
K, λ: constants; m, n: lengths of the query and database sequence.

The BLAST algorithm
Bit scores: Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and λ. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. Therefore we normalize to a bit score
$S' = \frac{\lambda S - \ln K}{\ln 2}$
so that
$E = m n 2^{-S'}$
which is independent of K and λ.
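The relationship between raw score, E-value, and bit score can be checked numerically. The sketch below uses the standard Karlin-Altschul expressions E = K·m·n·e^(-λS) and S' = (λS - ln K)/ln 2; the parameter values are illustrative assumptions, not values taken from the slides.

```python
import math

def e_value(raw_score, m, n, K, lam):
    """Expected number of HSPs with score >= raw_score: E = K*m*n*exp(-lambda*S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, K, lam):
    """Normalized score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """The same E-value expressed through the bit score: E = m*n*2^(-S')."""
    return m * n * 2.0 ** (-bits)

# Illustrative values only (assumed): statistical parameters and lengths.
K, lam = 0.13, 0.32
m, n = 250, 1_000_000
S = 60

bits = bit_score(S, K, lam)
print(e_value(S, m, n, K, lam), e_value_from_bits(bits, m, n))
```

Both routes give the same E-value, which is the point of the normalization: once a score is reported in bits, only the search-space size m·n is needed to interpret it.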

The BLAST algorithm
P-values: The number of random HSPs with score ≥ S is described by a Poisson distribution, i.e. the probability of finding exactly a HSPs with score ≥ S is given by
$P(a) = e^{-E} \frac{E^a}{a!}$
The chance of finding zero HSPs with score ≥ S is $e^{-E}$, so the chance of finding at least one HSP with score ≥ S is
$P = 1 - e^{-E}$
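A small sketch of the conversion from E-value to P-value, assuming only the Poisson model described above (the function names are hypothetical):

```python
import math

def prob_exactly(a, E):
    """Poisson probability of exactly a HSPs with score >= S, given mean E."""
    return math.exp(-E) * E ** a / math.factorial(a)

def p_value(E):
    """Probability of at least one HSP with score >= S: 1 - exp(-E)."""
    return -math.expm1(-E)  # numerically stable form of 1 - exp(-E)

for E in (10.0, 1.0, 0.001):
    print(E, p_value(E))
```

For small E the P-value and the E-value are nearly identical, which is why BLAST reports can get away with quoting E-values alone.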

The BLAST algorithm
So far our considerations were for pairwise alignments. What do we do when we search a database? We correct for multiple testing.
- In the simplest case, all proteins in the database are a priori equally likely to be related to the query, so the E-value is corrected by the number of sequences in the database: E' = E*N, where N here is the number of sequences.
- An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues.
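The two corrections can be written out directly. This is a minimal sketch: the function names and the example numbers are assumptions chosen for illustration, and the two meanings of N on the slide are given separate parameter names to keep them apart.

```python
def correct_by_sequence_count(pairwise_e, num_sequences):
    """Every database sequence is equally likely to be related: E' = E * N."""
    return pairwise_e * num_sequences

def correct_by_length(pairwise_e, total_residues, subject_length):
    """Chance of relatedness proportional to subject length: E' = E * N / n."""
    return pairwise_e * total_residues / subject_length

# Assumed example: a database of 1,000,000 sequences holding 300,000,000 residues.
pairwise_e = 1e-8
print(correct_by_sequence_count(pairwise_e, 1_000_000))
print(correct_by_length(pairwise_e, 300_000_000, subject_length=300))
print(correct_by_length(pairwise_e, 300_000_000, subject_length=3000))
```

For a subject of average length the two corrections agree; under the length-based view a hit to a longer-than-average subject receives a smaller (more significant) corrected E-value.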

BLAST: Using online BLAST
The arguments to NCBIWWW.qblast are the BLAST program, the database, and the query sequence.
    from Bio.Blast import NCBIWWW
    result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
or, reading the query from a FASTA file:
    from Bio.Blast import NCBIWWW
    from Bio import SeqIO
    record = SeqIO.read("m_cold.fasta", format="fasta")
    # Submit just the sequence:
    result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
    # Or submit the whole record (sequence plus header information):
    result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))

BLAST: Using local BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- First we create a command line (for example) giving the BLAST program, query sequence, database, output file, and E-value threshold; -outfmt 5 requests XML output so that the result can be parsed afterwards:
    blastn -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
- Then we use the BLAST Python wrappers from Biopython:
    from Bio.Blast.Applications import NcbiblastnCommandline
    blastn_exe = "path to blastn"  # placeholder: location of the blastn executable
    blastn_cline = NcbiblastnCommandline(blastn_exe, query="query.fasta", db="opuntia.fasta", evalue=0.001, outfmt=5, out="opuntia.xml")
    stdout, stderr = blastn_cline()
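Recent Biopython releases flag the command-line wrappers as deprecated and recommend building the command yourself and running it with the standard `subprocess` module. The sketch below shows that route for the same example; the helper names `build_blastn_command` and `run_blastn` are hypothetical, and the file and database names are the ones used above.

```python
import subprocess

def build_blastn_command(query, db, out, evalue=0.001, blastn_exe="blastn"):
    """Assemble the blastn argument list; -outfmt 5 requests XML output."""
    return [blastn_exe, "-query", query, "-db", db,
            "-out", out, "-evalue", str(evalue), "-outfmt", "5"]

def run_blastn(cmd):
    """Run blastn; raises CalledProcessError if BLAST exits with an error."""
    subprocess.run(cmd, check=True)

cmd = build_blastn_command("opuntia.fasta", "nr", "opuntia.xml")
print(" ".join(cmd))
```

Calling `run_blastn(cmd)` performs the search once BLAST+ is installed and the database is available locally.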

BLAST: Parsing BLAST output
    from Bio.Blast import NCBIXML
    with open("opuntia.xml") as result_handle:
        blast_record = NCBIXML.read(result_handle)
    # blast_record.alignments is the list of alignments holding the BLAST output
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.001:
                print(alignment.title)
                print(alignment.length)
                print(hsp.expect)  # E-value
                print(hsp.query)   # query sequence
                print(hsp.match)   # alignment (match line)
                print(hsp.sbjct)   # matched database sequence

Algorithm evolution
- Smith-Waterman: local alignment algorithm; finds small, locally similar regions (substrings); matrix-based, with each cell in the matrix defining the end of a potential alignment.
- BLAST: start with the highest-scoring short pairs and extend up and down the sequence.
- Great, but when you're talking about millions of reads…
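The matrix-based idea behind Smith-Waterman can be sketched as below. This is a minimal score-only version with a linear gap penalty; the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration, and no traceback is performed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score and the cell (i, j) where it ends."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best alignment ending at a[i-1], b[j-1]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere, which makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell

print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths; that exhaustive fill is what BLAST's seed-and-extend heuristic avoids.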

Sanger Sequencing
- DNA is fragmented
- Cloned to a plasmid vector
- Cyclic sequencing reaction
- Separation by electrophoresis
- First generation: S35 isotope
- Second generation: fluorescent tags
- Third generation: automation

“Next” generation sequencing
- Not Sanger-based biochemistry
- DNA is fragmented and sequenced directly
- Much more chemistry, physics, and computer science
- Characterized by parallel sequencing, high throughput, and cost
- Generates billions of sequences per experiment; highly computational

- Human Genome Project
- ENCODE project
- HapMap project
- SNP consortium
- Individual human genomes: James Watson, Craig Venter, three Asian individuals

General Workflow (Illumina)

Workflow Outcomes

Data output
- Different technologies, though essentially similar.
- Differences in the number, length, and quality of the sequences and in the format of the output make data handling across platforms and analysis a challenge.
- The SOLiD 3 Plus currently provides up to 60 gigabases of data in a 12- to 14-day run, and up to a billion sequence tags per run; the Helicos platform generates 21 to 28 gigabases of data in eight days, and routinely analyzes 600 to 800 million usable strands per run. (2013 info.)

Sequencing Workflow

Step 4a. Data Analysis

Step 4b. Read mapping
- Align your sequences onto a reference genome or other region (unless you are de novo sequencing a new species).
- Determine the coordinates and annotations, if known.
Genome: .. TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC ..
Reads: ATCTTGTAGG, GAAACACAAAGTG, GTCTAGGGAAGAAGG
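The idea of placing reads on a reference can be illustrated with the slide's own sequences. This is a deliberately naive sketch using exact string search; real read mappers tolerate mismatches and use precomputed indexes, and the function name `map_reads` is hypothetical.

```python
def map_reads(genome, reads):
    """Return {read: 0-based start position, or -1 if the read has no exact match}."""
    return {read: genome.find(read) for read in reads}

genome = ("TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGG"
          "AAGAAGGTGTGTGACCAGGGAGGTCCC")
reads = ["ATCTTGTAGG", "GAAACACAAAGTG", "GTCTAGGGAAGAAGG"]

positions = map_reads(genome, reads)
for read, pos in positions.items():
    # Indent each read by its coordinate so it lines up under the genome.
    print(" " * max(pos, 0) + read, f"(start={pos})")
```

The reported start positions are the coordinates the slide refers to; annotations would then be looked up for those coordinates.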

Sequence Coverage
- Deep coverage = more accurate results
- Deep coverage => detecting variants
- Deep coverage = more expensive!
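Average coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below applies that estimate; the function name and the example numbers are assumptions for illustration only.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size

# Assumed example: 1,000,000 reads of 100 bases on a 10,000,000-base genome.
print(average_coverage(1_000_000, 100, 10_000_000))
```

Doubling the number of reads doubles the average depth, which is where the trade-off between accuracy and cost comes from.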

From geneious.com

NGS pipeline for finding variants from Dolled-Filhart et al, 2013

It’s all about the alignment Meyerson et al. Nature Reviews Genetics 11, 685-696 (October 2010)

NGS alignment algorithms
- Smith-Waterman
- BLAST
- Enter BWT (Burrows-Wheeler transform)
- BLAT is precomputed BLAST