Bioinformatics Algorithms and Data Structures

Name: Bioinformatics Algorithms and Data Structures
Uploaded: 2017-10-15T04:41:31+00:00
Duration: PTM13S56
Description: Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures
BLAST
Lecturer: Dr. Rose
Slides: Adaptation of Nir Friedman’s slides from the Computational Methods in Molecular Biology course (Spring 2001) at Hebrew University, Jerusalem, Israel
February 21, 2007

BLAST
Q: What is BLAST?
A: Uhmmm, actually no, BLAST is an acronym: Basic Local Alignment Search Tool, a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
You can find it at:

BLAST
Q: Why do you care?
A: Because you are going to do a project.
U Membrane protein that transports sodium and hydrogen
J Tyrosinase: people lacking this are albino
NM_ MET, an oncogene: mutations in this cause cancer
NM_ MYC, another oncogene
NM_ Alcohol dehydrogenase: good to have when drinking
NM_ Myosin: one of the muscle proteins
XM_ Crystallin, the major protein in the lens
M Myelin basic protein: protects the neurons
NM_ Hemoglobin, the oxygen-carrying protein in RBCs
NM_ Albumin, the major serum protein: does a lot of things
NM_ Keratin, a skin and integument protein

BLAST
BLAST is designed to efficiently find alignments of a target string s against large databases.
Motivation: increase speed by finding fewer and better hot spots.
Idea: find high-scoring matches using a substitution matrix rather than exact matches.
We are still searching only for gapless matches.

High-Scoring Pair
Two strings s and t are a high-scoring pair (HSP) if d(s,t) > T.
Given a query s[1..n], BLAST constructs all words (fixed-length substrings of length k) w such that w scores above a word threshold t against some k-substring of s.
Each match of such a word in the database is called a hit.
Typical k: 12 for nucleotides, 3-5 for amino acids.
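The word-construction step can be sketched as follows. This is a toy illustration, not BLAST's actual implementation: it enumerates every possible word of length k, which is only feasible for tiny k, and `toy_score` is a hypothetical stand-in for a real substitution matrix.

```python
from itertools import product

def toy_score(a, b):
    # Hypothetical substitution score: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1

def neighborhood_words(query, k, threshold, score, alphabet):
    """Map each word scoring above `threshold` against some k-substring
    of `query` to the list of query positions it was derived from."""
    seeds = {}
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        # Brute-force enumeration of all |alphabet|**k candidate words.
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            total = sum(score(a, b) for a, b in zip(word, kmer))
            if total > threshold:
                seeds.setdefault(word, []).append(i)
    return seeds

seeds = neighborhood_words("ACGT", k=3, threshold=2,
                           score=toy_score, alphabet="ACGT")
```

With this toy scoring, a word passes the threshold only if it matches a query 3-mer in at least two positions, so near-matches such as "ACT" become seed words alongside the exact 3-mers.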

High-Scoring Pair
Try to extend each such hit to an alignment with maximal score (still with no gaps).
Keep all HSPs.
The threshold is chosen so that a random match with such a score is unlikely.
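A minimal sketch of gapless hit extension, under simplifying assumptions: it scans all the way to the ends of both strings and keeps the best-scoring extension on each side, whereas a production tool would stop early. The function names and `toy_score` are hypothetical.

```python
def toy_score(a, b):
    # Hypothetical substitution score: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1

def extend_hit(s, t, i, j, k, score):
    """Extend the hit s[i:i+k] / t[j:j+k] along its diagonal without gaps.
    Returns (best_score, start_in_s, start_in_t, length)."""
    seed = sum(score(s[i + d], t[j + d]) for d in range(k))

    def best_extension(pairs):
        # Best cumulative gain over all prefixes of the aligned letter pairs.
        best, best_len, running = 0, 0, 0
        for n, (a, b) in enumerate(pairs, start=1):
            running += score(a, b)
            if running > best:
                best, best_len = running, n
        return best, best_len

    r_gain, r_len = best_extension(zip(s[i + k:], t[j + k:]))
    l_gain, l_len = best_extension(zip(reversed(s[:i]), reversed(t[:j])))
    return seed + l_gain + r_gain, i - l_len, j - l_len, l_len + k + r_len

# The seed "CGT" at position 3 of both strings grows into "ACGTA".
result = extend_hit("CCACGTAGG", "TTACGTATT", i=3, j=3, k=3, score=toy_score)
```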

Finding Potential Matches
We can locate seed words in a large database in a single pass:
Construct an FSA (finite-state automaton) that recognizes seed words.
Use hashing techniques to locate matching words.
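The hashing variant can be sketched with a Python set, which is a hash table: every length-k window of the database is looked up once. The database layout (a dict from sequence id to sequence) and the function name are assumptions made for illustration.

```python
def find_hits(seed_words, database, k):
    """Scan each database sequence once and report (word, sequence_id,
    position) for every window that is one of the seed words."""
    seed_set = set(seed_words)  # hashed membership test per window
    hits = []
    for seq_id, seq in database.items():
        for j in range(len(seq) - k + 1):
            word = seq[j:j + k]
            if word in seed_set:
                hits.append((word, seq_id, j))
    return hits

database = {"d1": "TTACGTT", "d2": "GGGG"}
hits = find_hits({"ACG", "CGT"}, database, k=3)
```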

Extending Potential Matches
Once a seed is found, BLAST attempts to find a local alignment that extends the seed.
Seeds on the same diagonal are combined (as in FASTA).
[Figure with axes s and t]
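Two hits lie on the same diagonal when the difference between their database position and query position is equal. A small sketch of grouping hits that way, assuming hits are given as (query_position, database_position) pairs:

```python
from collections import defaultdict

def group_by_diagonal(hits):
    """Group (i, j) hits by diagonal j - i, sorted along each diagonal."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return {d: sorted(pairs) for d, pairs in diagonals.items()}

groups = group_by_diagonal([(5, 8), (0, 3), (2, 2), (1, 4)])
# Three hits share diagonal 3; one hit sits alone on diagonal 0.
```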

Which programs are used?
Originally BLAST did not allow gaps. Now people use gapped BLAST.
Gapped BLAST joins different diagonals.
For proteins, BLAST is superior.
For nucleotides, FASTA is better.

Review: Unrelated Sequences
Our model of unrelated sequences is simple:
Each position is sampled independently from a distribution over the alphabet $\Sigma$.
We assume there is a distribution $q(\cdot)$ that describes the probability of letters in such positions.
Then:
$$P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i])\, q(t[i])$$
R denotes the assumption that s and t are random unrelated strings.

Review: Related Sequences
We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor.
Let p(a,b) be a distribution over pairs of letters: p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.
$$P(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} p(s[i], t[i])$$
Here M denotes the assumption that s and t are related strings.

Review: Ratio Test for Alignment
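We compare the two models through the likelihood ratio:
$$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \prod_{i=1}^{n} \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$$
Taking the logarithm of both sides, we get
$$\log Q = \sum_{i=1}^{n} \log \frac{p(s[i],t[i])}{q(s[i])\,q(t[i])}$$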

Review: Probabilistic Interpretation of Scoring Rule
If we take
$$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\,q(b)}$$
then the score of an alignment is the log-ratio between the two models:
Score > 0 ⇒ M is more “probable”
Score < 0 ⇒ R is more “probable”
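To make the log-ratio scoring rule concrete, here is a small sketch that builds a score table $\sigma(a,b) = \log \frac{p(a,b)}{q(a)q(b)}$ and sums it over a gapless alignment. The two-letter alphabet and the probabilities are invented for illustration only.

```python
import math

def log_odds_matrix(p, q):
    """sigma(a, b) = log( p(a, b) / (q(a) * q(b)) ) for every letter pair."""
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}

def alignment_score(s, t, sigma):
    """Score of the gapless alignment of two equal-length strings."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

# Hypothetical distributions over a toy alphabet {A, B}.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
sigma = log_odds_matrix(p, q)

related = alignment_score("AAB", "AAB", sigma)    # positive: favors M
unrelated = alignment_score("AB", "BA", sigma)    # negative: favors R
```

Identical letters are more likely under the related model than under independent sampling, so they contribute positive terms; mismatches contribute negative ones.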

Problems with Scoring Rule
When searching for an optimal alignment in a big database, a number of problems arise with this simple scheme:
We are assuming P(M)=P(R); this assumes there are an equal number of related and unrelated sequences in the database.
When searching through a big database, there is a high probability that an unrelated sequence will receive a high score.
When searching for an optimal local alignment, we have many possible starting points, heavily biasing the score towards being a related sequence.

Prior Probability on the models
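What we really wish to calculate is:
$$\frac{P(M \mid s,t)}{P(R \mid s,t)} = \frac{P(s,t \mid M)\,P(M)}{P(s,t \mid R)\,P(R)}$$
The log score being:
$$\log \frac{P(M \mid s,t)}{P(R \mid s,t)} = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} + \log \frac{P(M)}{P(R)}$$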

Prior Probability on the models
Our threshold should be:
$$\log \frac{P(s,t \mid M)}{P(s,t \mid R)} > \log \frac{P(R)}{P(M)}$$

The Hazard of Large Databases
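Define
$$\varepsilon = P(\text{score}(s,t) > \alpha \mid R)$$
This is the probability that two unrelated sequences will match with score > $\alpha$ by chance.
Assume there are N strings in our database. Assuming that they are independent of each other, and all are unrelated to s, we have
$$P\big(\max_{i} \text{score}(s,t_i) > \alpha\big) = 1 - (1-\varepsilon)^{N}$$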

The Hazard of Large Databases
[Plot: curves f(x, 0.001), f(x, 0.0001), f(x, ) and f(x, ) for x from 0 to 100000; vertical axis from 0 to 1]
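The effect of database size can be checked numerically. The sketch below assumes the chance of at least one spurious match among n independent unrelated sequences is 1 - (1 - eps)^n, where eps is the per-sequence probability of exceeding the threshold by chance; the function name is hypothetical.

```python
def chance_hit_probability(eps, n):
    """Probability that at least one of n independent unrelated sequences
    scores above the threshold, given per-sequence probability eps."""
    return 1.0 - (1.0 - eps) ** n

small_db = chance_hit_probability(0.001, 1000)      # roughly 0.63
large_db = chance_hit_probability(0.0001, 100000)   # very close to 1
```

Even a per-sequence chance of one in ten thousand makes a spurious match nearly certain once the database holds a hundred thousand sequences.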

Local Matching
Question: Which local alignment query is expected to give a higher score: to a short sequence or to a long sequence?
A local match can begin at any of the nm entries in the DP matrix. The score is the optimum over all these starting points.
If all starting points were independent, we would need to calculate the probability of attaining such a score in nm trials.

Score Significance-Fasta
How meaningful is a score?
Calculate the distribution of scores and related scores.
Under reasonable assumptions, the scores for ungapped alignments behave according to the Extreme Value Distribution.

Extreme Value Distribution (BLAST)
We ask the following question: given a database of size n and a sequence of size m, what is the expected number of hits with score at least S?
$$E(S) = K\,m\,n\,e^{-\lambda S}$$
This number is called an E-score. The number of such hits follows a Poisson distribution with mean E(S).
K corrects for the dependencies.
$\lambda$ depends on the scoring matrix.
Doubling n, the size of the database, doubles the expectation.
Doubling S, the score, causes E(S) to decrease exponentially.

BLAST P-value
Recall the Poisson distribution:
$$P(x) = \frac{e^{-E} E^{x}}{x!}$$
Probability of finding no hits with a score ≥ S:
$$P(0) = e^{-E}$$
Therefore the probability of finding at least one hit with score ≥ S is
$$P = 1 - e^{-E}$$
This is called the P-value.
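A short numeric sketch of the two quantities, assuming the expectation takes the form E = K·m·n·e^(-λS) and that the number of hits is Poisson with mean E. The values of K and lam below are placeholders, not parameters of any real scoring matrix.

```python
import math

def expected_hits(m, n, score, K, lam):
    """Expected number of chance hits with score at least `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one hit when hits are Poisson with mean e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e

# Placeholder parameters chosen only for illustration.
e1 = expected_hits(m=300, n=1_000_000, score=50, K=0.1, lam=0.3)
e2 = expected_hits(m=300, n=2_000_000, score=50, K=0.1, lam=0.3)
p1 = p_value(e1)
```

For small E the P-value is approximately equal to E itself, which is why the two are often quoted interchangeably for strong hits.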

A Typical GenBank entry

Sequence Information

The Sequence

BLAST programs
BLASTN - Nucleotide query searching a nucleotide database.
BLASTP - Protein query searching a protein database.
BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.
TBLASTN - Protein query searching a translated nucleotide (6 frames) database.
TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.

BLAST Search

BLAST Output
List of hits:
  Database accession codes, name, description.
  Score in bits (usually > 30 bits is significant).
  Expectation value E().
For each hit:
  A header including hit name, description, length.
  Each hit may contain several HSPs; for each:
    Score and expectation value.
    How many identical residues.
    How many residues contributing positively to the score.
    The local alignment itself.

BLAST Output

PSI-BLAST (Position Specific Iterated)
BLAST provides a new automatic “profile-like” search.
Iterative procedure:
1. Perform BLAST on the database.
2. Use significant alignments to construct a “position specific” score matrix.
3. This matrix replaces the query sequence in the next round of database searching.
4. The program may be iterated until no new significant alignments are found.
Most commonly used search method today.

Multiple Alignment
Proteins can be classified into families:
  Common structure.
  Common function.
  Common evolutionary origin.
For a set of sequences belonging to some family:
  Each pair has some differences.
  But there are some common motifs in almost all sequences of the family.
A multiple alignment carries more information than a pairwise alignment.

Protein Families
Consider zinc fingers:
  All have the same function: bind to DNA.
  All have similar structure.
They constitute a protein family.
In a protein family some parts of the sequence (the functional parts) are more conserved than others.

Definition
A multiple alignment of strings $S_1, S_2, \dots, S_k$ is a series of strings with blanks $S'_1, S'_2, \dots, S'_k$ such that:
$|S'_1| = |S'_2| = \dots = |S'_k|$
$S'_j$ is an extension of $S_j$ obtained by insertion of blanks.

Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...

Example

Sum of Pairs
The score of a multiple alignment is the sum of pairwise distances between all pairs of sequences, for some scoring matrix.
This not only assumes that the alignment of each column is independent, but also that each pair of sequences is independent.
Each sequence is scored as if descended from k-1 sequences instead of one common ancestor.
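A sum-of-pairs score can be sketched in a few lines. The scoring values here (match +1, mismatch -1, letter against blank -2, blank against blank 0) are hypothetical placeholders for a real scoring matrix, and "." marks a blank as in the example rows.

```python
from itertools import combinations

def pair_score(a, b, gap="."):
    # Hypothetical scoring: blank/blank 0, letter/blank -2, match +1, mismatch -1.
    if a == gap and b == gap:
        return 0
    if a == gap or b == gap:
        return -2
    return 1 if a == b else -1

def sum_of_pairs(rows):
    """Sum the column-wise pairwise scores over every pair of aligned rows."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(rows, 2)
        for a, b in zip(r1, r2)
    )

total = sum_of_pairs(["AC.", "ACG", "A.G"])
```

Because every pair of rows is scored separately, a single conserved column contributes once per pair, which is the over-counting the slide warns about.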

Calculation of Multiple Alignment
The optimal alignment can be calculated exactly using k-dimensional dynamic programming.
Space complexity: $O(n^k)$
Time complexity: $O(2^k n^k)$
A heuristic program called ClustalW quickly finds a good multiple alignment.

Creating a PSSM
After aligning the sequences we see that there are some conserved regions.
We use the multiple alignment of BLAST results to create a Position Specific Scoring Matrix (PSSM).
This matrix represents information from a whole family; it is stricter in highly conserved regions.
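One simple way to turn aligned rows into a position-specific matrix is sketched below. It is an assumption-laden illustration rather than PSI-BLAST's actual procedure: each column's letter counts get a uniform pseudocount and are converted to log-odds against a background distribution; the names and parameters are hypothetical.

```python
import math
from collections import Counter

def build_pssm(rows, background, pseudocount=1.0, gap="."):
    """Return one dict per alignment column mapping letter -> log-odds score."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Score a gapless segment, one letter per PSSM column."""
    return sum(col[a] for col, a in zip(pssm, segment))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["AC", "AC", "AG"], background)
```

A fully conserved column (here the first, all "A") rewards its letter strongly and penalizes every other letter, while a mixed column spreads smaller scores across the letters it contains.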

