Problem with N-W and S-W

They are exhaustive, with stringent statistical thresholds.
As the databases got larger, they began to take too long (computational constraints).
We need a program whose algorithm is based on dynamic programming (substrings) but uses a heuristic approach (makes assumptions; quicker, but not as refined).
Such a program can compare a sequence against 100 million others in a relatively short time. The bigger the database, the more optimal the results.

Two heuristic approaches
Each is based on different assumptions and calculations of probability thresholds (i.e. the statistical significance of results).
1. FASTA: all proteins in the database are equally likely to be related to the query, so the probability is multiplied by the number of sequences in the database.
2. BLAST: the query is more likely to be related to a long database sequence than to a short one, so the probability is multiplied by N/n.
   N: total number of residues in the database
   n: length of the subject sequence

Basic Local Alignment Search Tool
The BLAST algorithm
- Breaks your sequence down into smaller segments (words).
- Does the same for all sequences in the database.
- Looks for exact matches, word by word, and extends them upstream and downstream one base at a time, allowing for a certain number of mismatches.
- Stops when the sequence ends or the statistical significance becomes too low (too many mismatches).
- Can find more than one area of similarity between two sequences.

The BLAST algorithm
STEP 1: list the k-mers (words) in the query sequence.
    query sequence: … PQGEFGFAT…
    words: PQG QGE GEF EFG …
STEP 2: score the words. Compare each 3-mer obtained in step 1 with all 20^3 possible 3-mers and, using a threshold T, keep the best-scoring words.
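Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the scoring function `toy_score` (+2 for an identical residue, -1 otherwise) and the threshold value are assumptions made for the example, whereas real BLAST scores words with a substitution matrix such as BLOSUM62.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def words(seq, k=3):
    """Step 1: return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(a, b, match=2, mismatch=-1):
    """Hypothetical scoring: +2 per identical position, -1 otherwise."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, threshold):
    """Step 2: keep the candidates, out of all 20^k, scoring >= threshold."""
    candidates = ("".join(p) for p in product(AMINO_ACIDS, repeat=len(word)))
    return [c for c in candidates if toy_score(word, c) >= threshold]


query = "PQGEFGFAT"
query_words = words(query)
hits = neighborhood("PQG", threshold=3)
print(query_words)
print(len(hits), "words kept for PQG")
```

With this toy scheme a threshold of 3 keeps the word itself plus every word that differs from it at exactly one position.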

The BLAST algorithm
STEP 3: organize the search words in a tree for efficient retrieval.
STEP 4: repeat steps 2-3 for each 3-mer.
STEP 5: scan the database sequences for exact matches of the words, then extend each exact match to a high-scoring segment pair (HSP).
    query sequence:    … PQGEFGFAT…
    database sequence: … AQGEFEAQT…
    HSP, score: 22
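A minimal seed-and-extend sketch of steps 3 to 5, under stated assumptions: a Python dict stands in for the search tree, identical residues score their BLOSUM62 diagonal value, every mismatch gets a flat penalty `MISMATCH` (a simplification of the real matrix), and extension stops once the score falls more than `X_DROP` below the best seen so far. Both constants are hypothetical values chosen for the example.

```python
# Diagonal (identity) scores of BLOSUM62; mismatches are simplified below.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}
MISMATCH = -2  # assumption: flat penalty instead of real off-diagonal values
X_DROP = 5     # assumption: give up once the score drops this far below best


def pair_score(a, b):
    return BLOSUM62_DIAGONAL[a] if a == b else MISMATCH


def build_word_index(query, k=3):
    """Step 3 (with a dict instead of a tree): word -> query positions."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index


def extend_one_way(query, subject, i, j, step, start_score):
    """Walk from (i, j) in direction step; return (best score, residues kept)."""
    best = running = start_score
    added = best_added = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        running += pair_score(query[i], subject[j])
        added += 1
        if running > best:
            best, best_added = running, added
        elif best - running > X_DROP:
            break
        i += step
        j += step
    return best, best_added


def find_hsps(query, subject, k=3):
    """Step 5: exact word hits, then ungapped extension in both directions.

    Returns (score, query start, query end (exclusive), subject start) tuples.
    """
    index = build_word_index(query, k)
    hsps = set()
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], []):
            seed = sum(pair_score(c, c) for c in query[q:q + k])
            score, right = extend_one_way(query, subject, q + k, s + k, 1, seed)
            score, left = extend_one_way(query, subject, q - 1, s - 1, -1, score)
            hsps.add((score, q - left, q + k + right, s - left))
    return sorted(hsps, reverse=True)


hsps = find_hsps("PQGEFGFAT", "AQGEFEAQT")
print(hsps)
```

For the two sequences on the slide, both word hits (QGE and GEF) extend to the same segment QGEF, whose identity scores 5 + 6 + 5 + 6 add up to the HSP score of 22 shown above.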

The BLAST algorithm
STEP 6: evaluate the significance of the HSP score.
Karlin-Altschul statistics: using the HSPs, we calculate the expected number of HSPs with score at least $S$:
$E = K m n \, e^{-\lambda S}$
$K$, $\lambda$: constants
$m$, $n$: lengths of the query and database sequence
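The Karlin-Altschul expectation, $E = K m n \, e^{-\lambda S}$, is easy to evaluate directly. The sketch below is illustrative only: the function name is made up for the example, and the values of `K` and `LAM` are placeholders of roughly the size often quoted for ungapped BLOSUM62 scoring, so treat them as assumptions.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score` arising by chance."""
    return K * m * n * math.exp(-lam * score)


# Assumed, illustrative parameters (not taken from the slides).
K, LAM = 0.134, 0.318

# Query of length 9 against a single sequence of length 9 ...
e_small = expected_hsps(22, m=9, n=9, K=K, lam=LAM)
# ... and the same score against 100 million residues.
e_large = expected_hsps(22, m=9, n=100_000_000, K=K, lam=LAM)
print(e_small, e_large)
```

The same raw score becomes less significant as the search space $m \times n$ grows, and more significant (exponentially) as the score itself grows.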

The BLAST algorithm
Bit scores: raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters $K$ and $\lambda$. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. Therefore we normalize to a bit score
$S' = \frac{\lambda S - \ln K}{\ln 2}$
so that
$E = m n \, 2^{-S'}$
which is independent of $K$ and $\lambda$.
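As a quick check of the normalization, the sketch below converts a raw score to a bit score, $S' = (\lambda S - \ln K) / \ln 2$, and confirms numerically that $E = m n \, 2^{-S'}$ agrees with $E = K m n \, e^{-\lambda S}$. The function names and the parameter values are assumptions made for the example.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalize a raw score so that it no longer depends on K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E-value from a bit score: only the search space m * n is needed."""
    return m * n * 2.0 ** (-bits)


# Assumed, illustrative parameters (not taken from the slides).
K, LAM = 0.134, 0.318
raw, m, n = 22, 9, 1_000_000

bits = bit_score(raw, K, LAM)
e_from_bits = evalue_from_bits(bits, m, n)
e_from_raw = K * m * n * math.exp(-LAM * raw)
print(bits, e_from_bits, e_from_raw)
```

The two E-values agree up to floating-point error, which is why a bit score can be reported without also quoting the scoring system.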

The BLAST algorithm
P-values: the number of random HSPs with score ≥ S is described by a Poisson distribution, i.e. the probability of finding exactly $a$ HSPs with score ≥ S is
$P(a) = e^{-E} \frac{E^a}{a!}$
The chance of finding zero HSPs with score ≥ S is $e^{-E}$, so the chance of finding at least one HSP with score ≥ S is
$P = 1 - e^{-E}$
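The Poisson relationship between E-values and P-values can be sketched as follows. The function names are made up for the example; `math.expm1` is used only to keep precision when E is very small.

```python
import math


def prob_exactly(a, E):
    """Poisson probability of exactly `a` chance HSPs with score >= S."""
    return math.exp(-E) * E ** a / math.factorial(a)


def prob_at_least_one(E):
    """P-value: 1 - e^(-E), computed stably for small E."""
    return -math.expm1(-E)


for E in (10.0, 1.0, 0.01, 1e-6):
    print(E, prob_at_least_one(E))
```

For small E the P-value is almost equal to E, which is why the two are often quoted interchangeably for significant hits; for large E the P-value saturates near 1.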

The BLAST algorithm
So far our considerations were for pairwise alignments. What do we do when we search a database?
We correct for multiple testing. One view is that all proteins in the database are a priori equally likely to be related to the query. In this simplest case, the E-value is multiplied by the number of sequences in the database: $E' = E \cdot N$, where $N$ here is the number of database sequences.
An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length $n$ should be multiplied by $N/n$, where $N$ is now the total length of the database in residues.
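The two database corrections can be compared side by side. This is a sketch with hypothetical numbers (a database of 1,000 sequences totalling 300,000 residues); the function names are made up for the example.

```python
def correct_by_sequence_count(pairwise_e, num_sequences):
    """All database sequences equally likely to be related to the query."""
    return pairwise_e * num_sequences


def correct_by_length(pairwise_e, total_residues, subject_length):
    """Chance of relatedness proportional to the subject sequence length."""
    return pairwise_e * total_residues / subject_length


pairwise_e = 1e-6        # hypothetical pairwise E-value
num_sequences = 1_000    # hypothetical database size (sequences)
total_residues = 300_000  # hypothetical database size (residues)

by_count = correct_by_sequence_count(pairwise_e, num_sequences)
by_len_average = correct_by_length(pairwise_e, total_residues, 300)
by_len_long = correct_by_length(pairwise_e, total_residues, 600)
print(by_count, by_len_average, by_len_long)
```

For a subject of average length (300 residues here) the two corrections coincide; under the length-based view, a hit to a longer subject is penalized less and a hit to a shorter one more.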

BLAST Using online BLAST
    from Bio.Blast import NCBIWWW
    # qblast arguments: BLAST program, database type, sequence
    result_handle = NCBIWWW.qblast("blastn", "nt", " ")
or
    from Bio.Blast import NCBIWWW
    from Bio import SeqIO
    record = SeqIO.read("m_cold.fasta", format="fasta")
    # just the sequence
    result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
    # whole sequence information (the full FASTA record)
    result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))

BLAST Using local BLAST
- First we create a command line prompt (for example):
    blastn -query opuntia.fasta -db nt -out opuntia.xml -evalue 0.001 -outfmt 5
  The arguments are the BLAST program, query sequence, database, output file and E-value threshold; -outfmt 5 requests XML output so that the file can be parsed later.
- Then we use the BLAST Python wrappers from Biopython:
    from Bio.Blast.Applications import NcbiblastnCommandline
    blastn_exe = "path to blastn"
    blastn_cline = NcbiblastnCommandline(blastn_exe, query="query.fasta", db="opuntia.fasta", evalue=0.001, outfmt=5, out="opuntia.xml")
    blastn_cline()

BLAST Parsing BLAST output
    from Bio.Blast import NCBIXML
    result_handle = open("opuntia.xml")
    blast_record = NCBIXML.read(result_handle)
    # the list of alignments holds the information about the BLAST output
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.001:
                print(alignment.title)
                print(alignment.length)
                print(hsp.expect)   # E-value
                print(hsp.query)    # query sequence
                print(hsp.match)    # alignment
                print(hsp.sbjct)    # matched db seq.

Algorithm evolution
Smith-Waterman: local alignment algorithm that finds small, locally similar regions (substrings). It is matrix-based; each cell in the matrix defines the end of a potential alignment.
BLAST: start with the highest-scoring short pairs and extend up and down the sequence.
Great, but when you're talking about millions of reads…
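To make the matrix idea concrete, here is a minimal Smith-Waterman scoring sketch. It assumes a toy scoring scheme (match +2, mismatch -1, linear gap penalty -2) chosen for the example rather than a real substitution matrix, and it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (toy scoring scheme)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets an alignment restart anywhere, which makes it local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


score = smith_waterman_score("PQGEFGFAT", "AQGEFEAQT")
print(score)
```

Every cell of the len(a) × len(b) matrix is filled, so the cost grows with the product of the two lengths. That is what becomes impractical for large databases or millions of reads, and what the heuristics avoid.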

Sanger Sequencing
- DNA is fragmented
- Cloned to a plasmid vector
- Cyclic sequencing reaction
- Separation by electrophoresis
- First generation: 35S isotope
- Second generation: fluorescent tags
- Third generation: automation

“Next” generation sequencing
- Not Sanger-based biochemistry
- DNA is fragmented and sequenced directly
- Much more chemistry, physics, and computer science
- Characterized by parallel sequencing, high throughput, and cost
- Generates billions of sequences per experiment; highly computational

Human Genome Project
ENCODE project
HapMap project
SNP consortium
Individual human genomes: James Watson, Craig Venter, three Asian individuals

General Workflow (Illumina)

Workflow Outcomes

Data output
Different technologies, though essentially similar.
Differences in the number, length and quality of the sequences, and in the format of the output, make data handling and analysis across platforms a challenge.
The SOLiD 3 Plus currently provides up to 60 gigabases of data in a 12- to 14-day run, and up to a billion sequence tags per run; the Helicos platform generates 21 to 28 gigabases of data in eight days, and routinely analyzes 600 to 800 million usable strands per run.

Sequencing Workflow

Step 4a. Data Analysis

Step 4b. Read mapping
Align your sequences onto a reference genome or other region (unless you are de novo sequencing a new species).
Determine the coordinates and annotations, if known.
.. TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC .. Genome
            ATCTTGTAGG   GAAACACAAAGTG      GTCTAGGGAAGAAGG
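The simplest possible form of read mapping is exact string search. The sketch below, an illustration and not how production mappers work, places the three example reads on the example genome fragment with `str.find` and reports 0-based coordinates relative to that fragment.

```python
genome = (
    "TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGG"
    "TGTGTGACCAGGGAGGTCCC"
)
reads = ["ATCTTGTAGG", "GAAACACAAAGTG", "GTCTAGGGAAGAAGG"]


def map_reads(genome, reads):
    """Return {read: 0-based start of its first exact match, or -1}."""
    return {read: genome.find(read) for read in reads}


positions = map_reads(genome, reads)
for read, start in positions.items():
    # Indent each read so it lines up under its position in the genome.
    print(" " * start + read, f"(start={start})")
```

Exact search tolerates no sequencing errors or variants and rescans the genome for every read, which is why real pipelines use indexed, mismatch-tolerant aligners instead.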

Sequence Coverage Deep coverage = more accurate results
Deep coverage => detecting variants Deep coverage = more expensive!

From geneious.com

NGS pipeline for finding variants
from Dolled-Filhart et al, 2013

It’s all about the alignment
Meyerson et al. Nature Reviews Genetics 11, (October 2010)

NGS alignment algorithms
- Smith-Waterman
- BLAST
- Enter BWT (Burrows-Wheeler transform)
- BLAT is precomputed BLAST
