Outline. 1. what is BLAT & why we need it 2

Outline
1. What is BLAT & why we need it
2. BLAT's similarity & difference compared with BLAST
3. BLAT's application forms
4. BLAT's 3 major applications
5. Conclusion

What is BLAT & why we need it
Many alignment tools already exist:
- Smith-Waterman algorithm: solves the alignment problem for two short sequences
- FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible and fast alignment against large databases
- Sim4: does a fine job with cDNA alignment
- SAM, PSI-BLAST: slowly but surely find remote homologs

CONT The process of assembling and annotating the human genome
- requires aligning three million ESTs and 13 million mouse whole-genome random reads against the human genome
- must be done in less than two weeks, to leave time to process an updated genome every month or two
==> we need a very high-speed alignment algorithm, so the author developed BLAT, the BLAST-Like Alignment Tool

CONT
- BLAT (compared with existing tools):
  - more accurate
  - 500 times faster in mRNA/DNA alignment
  - 50 times faster in protein/protein alignment
- BLAT's steps:
  1. use nonoverlapping k-mers to create an index
  2. use the index to find homologous regions
  3. align these regions separately
  4. stitch the aligned regions into larger alignments
  5. revisit small internal exons possibly missed in the first stage, and adjust large gap boundaries that have canonical splice sites where feasible

CONT
- BLAT's speed & sensitivity are decided by:
  1. k-mer size (finding-hits step)
  2. mismatch scheme (aligning step)
  3. number of required index matches (finding-hits step)

BLAT's similarity & difference compared with BLAST
Similarity:
- both scan for relatively short matches (hits), i.e. build an index and then find hits
- both extend hits into high-scoring pairs (HSPs)

CONT Difference:
- BLAST builds an index of the query sequence, but BLAT builds an index of the database
- BLAST scans linearly through the database, but BLAT scans linearly through the query sequence
- BLAST triggers an extension when one or two hits occur in proximity to each other, but BLAT can trigger extensions on any number of perfect or near-perfect hits

CONT Difference:
- BLAST returns each area of homology between two sequences as a separate alignment, but BLAT stitches them together into a larger alignment
- BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT unsplices mRNA onto the genome

BLAT's application forms
Server-client:
- building the index is a relatively slow procedure
- a BLAT server is available that keeps the index in memory for clients to query
==> good for interactive applications
Stand-alone:
- suitable for batch runs on one or more CPUs

BLAT's 3 major applications & evaluation
- mRNA/DNA alignment
- mouse/human translated alignment
- client/server version to power interactive searches

CONT Evaluating mRNA/DNA Alignments (compared with Sim4)
- test set: 713 mRNAs remapped to genes on chromosome 22
- speed: BLAT: 26 sec; Sim4: 5 hr
- sensitivity (agreement with the annotated bases): BLAT: %; Sim4: %

CONT Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
- for human/mouse, it has been shown that gapless alignments are in many ways preferable to gapped alignments for detecting coding regions

CONT Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
Speed comparison:
method K N matrix time
WU-TBLASTX / s
WU-TBLASTX BLOSUM s
BLAT / s
BLAT / s
K: the size of a perfectly matching hit
N: how many hits are required to trigger a detailed alignment
matrix: scoring method

CONT Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
Sensitivity comparison:
method %chr 22 %RefSeq bases Enrichment %RefSeq exons
WU-TBLASTX % % x %
BLAT % % x %
%chr 22: percentage of chromosome 22 covered by the alignment
%RefSeq bases: percentage of bases inside human RefSeq coding sequence covered by the alignment
Enrichment: %RefSeq bases divided by %chr 22; a higher value indicates more specificity
%RefSeq exons: percentage of RefSeq coding exons covered by the alignment

CONT Server/client to power interactive searches
- handles thousands of interactive sequence searches per day
- the index is built just once and kept in memory for queries ==> efficient
- but not as efficient as the stand-alone version, because the server saves memory by keeping only the index, not the database

BLAT – The BLAST-Like Alignment Tool
W.James Kent Genome Research 2002 陳韋仰

Database & Query Sequence
Database: split into nonoverlapping K-mers
Query sequence: split into overlapping K-mers
[Figure: K-mers of the database and K-mers of the query sequence]

Three Search Criteria
- Single Perfect Matches
- Single Almost Perfect Matches
- Multiple Perfect Matches

Definition
K: the K-mer size
M: the match ratio between homologous areas
H: the size of a homologous area
G: the size of the database
Q: the size of the query sequence
A: the alphabet size (20 for amino acids, 4 for nucleotides)

T: how many nonoverlapping K-mers are in the homologous region?
$T = \lfloor H / K \rfloor$
H: homologous area size
K: K-mer size

Single Perfect
$p_1$: the probability that a specific K-mer in a homologous region of the database matches perfectly with the corresponding K-mer in the query
$p_1 = M^K$ (M: the match ratio between homologous areas)

Sensitivity
P: the probability that at least one nonoverlapping K-mer in the homologous region matches perfectly with the corresponding K-mer in the query
$P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T$ (T: number of nonoverlapping K-mers in the homologous region)

Specificity
F: the number of nonoverlapping K-mers that are expected to match by chance
$F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$
where $Q - K + 1$ is the number of K-mers in the query sequence and $G/K$ is the number of nonoverlapping K-mers in the database

Single Perfect (Nucleotide)
[Table: P for each value of M; parameters H = 100, G = 3 billion, Q = 500]
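The table's values did not survive in this transcript, but the single-perfect-match formulas can be evaluated directly. The sketch below is only an illustration: the function name `single_perfect` and the sample values of M and K are assumptions, while H, G, Q and A follow the slide's parameters.

```python
def single_perfect(M, K, H=100, G=3_000_000_000, Q=500, A=4):
    """Return (P, F) for the single-perfect-match criterion."""
    T = H // K                                  # nonoverlapping K-mers in the homologous region
    p1 = M ** K                                 # one specific K-mer matches perfectly
    P = 1 - (1 - p1) ** T                       # at least one of the T K-mers matches
    F = (Q - K + 1) * (G / K) * (1 / A) ** K    # K-mers expected to match by chance
    return P, F


# Illustrative values only: 90% identity and a few K-mer sizes.
for K in (8, 11, 14):
    P, F = single_perfect(M=0.90, K=K)
    print(f"K={K:2d}  sensitivity P={P:.3f}  chance matches F={F:.1f}")
```

Larger K lowers both P and F, which is the speed/sensitivity trade-off the criteria are meant to balance.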

Single Almost Perfect
$p_1$: the probability that a nonoverlapping K-mer in a homologous region of the database matches almost perfectly with the corresponding K-mer in the query
$p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K$ (one letter may mismatch)

Sensitivity
P: the probability that any nonoverlapping K-mer in the homologous region matches almost perfectly with the corresponding K-mer in the query
$P = 1 - (1 - p_1)^T$

Specificity
F: the number of nonoverlapping K-mers that are expected to match by chance
$F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K + (Q - K + 1) \cdot (G/K) \cdot K \cdot (1/A)^{K-1} \cdot (1 - 1/A)$

Single Almost Perfect (Nucleotide)
H = 100 ; G = 3 billion , Q = 500
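A minimal sketch of the almost-perfect-match formulas, for readers who want to reproduce numbers with the parameters above. The function name `single_almost_perfect` and the sample values of M and K are assumptions for illustration.

```python
def single_almost_perfect(M, K, H=100, G=3_000_000_000, Q=500, A=4):
    """Return (P, F) when one letter of the K-mer is allowed to mismatch."""
    T = H // K                                     # nonoverlapping K-mers in the region
    p1 = K * M ** (K - 1) * (1 - M) + M ** K       # exactly one mismatch, or none
    P = 1 - (1 - p1) ** T                          # at least one K-mer qualifies
    r = 1 / A                                      # chance that two random letters agree
    F = (Q - K + 1) * (G / K) * (K * r ** (K - 1) * (1 - r) + r ** K)
    return P, F


# Illustrative values only.
for K in (12, 14, 16):
    P, F = single_almost_perfect(M=0.90, K=K)
    print(f"K={K:2d}  sensitivity P={P:.3f}  chance matches F={F:.1f}")
```

Allowing one mismatch multiplies the chance-match count by K(A - 1) + 1 relative to a perfect match of the same K, so a larger K is needed to keep F manageable.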

Multiple Perfect
There must be N perfect matches, each no further than W letters from the previous one in the target coordinate, and all on the same diagonal.
Example: N = 2

Sensitivity
As in the single perfect case (N = 1), $p_1 = M^K$
$P_n$: the probability that there are exactly n matches within the homologous region
$P_n = \frac{T!}{n!\,(T-n)!} \, p_1^n \, (1 - p_1)^{T-n}$
The probability that there are N or more matches: $P = P_N + P_{N+1} + \dots + P_T$

Specificity
$F_N$: the number of chance matches of N K-mers, each separated by no more than W from the previous match
For N = 1: $F_1 = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$

Specificity (continued)
S: the probability of a second match occurring within W letters after the first
$S = 1 - (1 - (1/A)^K)^{W/K}$
Since the Nth match must lie within W letters after the (N-1)th match:
$F_N = S \cdot F_{N-1}$, hence $F_N = F_1 \cdot S^{N-1}$

Multiple Perfect (Nucleotide)
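The multiple-perfect-match formulas can be checked numerically with a short sketch. The function name `multiple_perfect` and the window W = 100 used in the example are assumptions (the transcript does not state W); the other parameters follow the earlier slides.

```python
from math import comb


def multiple_perfect(M, K, N, W, H=100, G=3_000_000_000, Q=500, A=4):
    """Return (P, F_N) when N perfect K-mer matches are required."""
    T = H // K                                   # nonoverlapping K-mers in the region
    p1 = M ** K                                  # one K-mer matches perfectly
    # P = P_N + P_(N+1) + ... + P_T, each term a binomial probability.
    P = sum(comb(T, n) * p1 ** n * (1 - p1) ** (T - n) for n in range(N, T + 1))
    F1 = (Q - K + 1) * (G / K) * (1 / A) ** K    # chance matches for a single hit
    S = 1 - (1 - (1 / A) ** K) ** (W / K)        # another chance hit within W letters
    return P, F1 * S ** (N - 1)


# Illustrative values only; W = 100 is an assumed window size.
for N in (1, 2, 3):
    P, F = multiple_perfect(M=0.90, K=11, N=N, W=100)
    print(f"N={N}  sensitivity P={P:.3f}  chance matches F={F:.3g}")
```

Each extra required hit multiplies F by S, which is far below 1 for nucleotide K-mers of this size, so specificity improves much faster than sensitivity degrades.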

Default Match Criteria
Nucleotide: two perfect 11-mers
Protein:
- stand-alone: single perfect 5-mer
- client/server: three perfect 4-mers

Implementation mickey

Algorithm
1. Search stage: the program detects regions of the two sequences which are likely to be homologous.
2. Alignment stage: the program examines these regions in more detail and produces alignments for them.

Search stage
1. Building up an index: creating nonoverlapping K-mers and recording their positions in the database.
2. Excluding useless K-mers: deleting from the index K-mers that occur too often or that contain ambiguity codes.
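The two index-building steps can be sketched as follows. This is an illustration under assumptions, not BLAT's actual code: the function name `build_index`, the `max_occurrences` cutoff and the toy sequence are all hypothetical.

```python
from collections import defaultdict


def build_index(database, k, max_occurrences=2):
    """Map each nonoverlapping K-mer of the database to its start positions."""
    index = defaultdict(list)
    # Step 1: take K-mers at positions 0, k, 2k, ... so they do not overlap.
    for pos in range(0, len(database) - k + 1, k):
        kmer = database[pos:pos + k]
        # Step 2a: skip K-mers containing ambiguity codes (anything but a, c, g, t).
        if set(kmer) <= set("acgt"):
            index[kmer].append(pos)
    # Step 2b: drop K-mers that occur too often to be informative.
    return {kmer: ps for kmer, ps in index.items() if len(ps) <= max_occurrences}


# Toy database; real K-mers are longer (e.g. 11 for nucleotides).
print(build_index("cacaattatcacgaccgc", k=3))
```

Because only every k-th position is indexed, the index holds roughly G/K entries, which is what lets it stay in memory for a whole genome.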

Search stage
[Figure: the database is cut into nonoverlapping K-mers; the index maps each K-mer to its database positions]

Search stage
[Figure: each overlapping K-mer of the query sequence is looked up in the index, yielding database positions P1, P2, …]

Search stage
[Figure: the hit list pairs each query sequence position with a matching database position]

Search stage (example)

Search stage (example)
From the previous page: hits whose diagonal values are equal lie on the same diagonal.
K-mer | Position (query position, database position) | Diagonal (DP - QP)
aat | 0, 3 | +3
cac | 6, 0 | -6
cac | 6, 9 | +3
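A small sketch of how a hit list with diagonals could be produced. The two toy sequences are assumptions, chosen only because they yield the same three hits as the example; the function name `find_hits` is likewise hypothetical.

```python
def find_hits(query, database, k):
    """Return (query position, database position, diagonal) for every K-mer hit."""
    # Index the nonoverlapping K-mers of the database.
    index = {}
    for dp in range(0, len(database) - k + 1, k):
        index.setdefault(database[dp:dp + k], []).append(dp)
    # Look up every overlapping K-mer of the query.
    hits = []
    for qp in range(len(query) - k + 1):
        for dp in index.get(query[qp:qp + k], []):
            hits.append((qp, dp, dp - qp))      # diagonal = DP - QP
    return hits


# Assumed toy sequences, not taken from the slides.
for qp, dp, diagonal in find_hits("aattctcac", "cacaattatcacgaccgc", k=3):
    print(f"query {qp}, database {dp}, diagonal {diagonal:+d}")
```

The hits at (0, 3) and (6, 9) share diagonal +3, so they are candidates for the same clump, while (6, 0) on diagonal -6 is not.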

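Search stage
[Figure: hits plotted by query coordinate against database coordinate, with the diagonals DP - QP = 0 and DP - QP < 0 marked; the database coordinate is divided into bucket 1, bucket 2, bucket 3, …]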

Search stage
1. Hits that are within the gap limit are bundled together into proto-clumps.
2. Hits within proto-clumps are then sorted along the database coordinate and put into real clumps if they are within the window limit.
[Figure: proto-clumps and clumps; label "Don't care"]

Search stage
- Clumps with fewer than the minimum number of hits are discarded.
- The rest are used to define regions of the database which are homologous to the query sequence.
- Clumps which are within 300 bases or 100 amino acids of each other in the database are merged together.
- 500 additional bases are added on each side to form the final homologous region.
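The merging and padding rules can be sketched on clumps that have already passed the minimum-hit filter. This is a simplified illustration: the function name `homologous_regions` and the representation of a clump as a `(start, end)` pair of database coordinates are assumptions.

```python
def homologous_regions(clumps, db_length, merge_distance=300, padding=500):
    """Merge nearby clumps and pad each merged region on both sides."""
    regions = []
    for start, end in sorted(clumps):
        if regions and start - regions[-1][1] < merge_distance:
            # Close enough to the previous region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    # Add the padding, clamped to the ends of the database.
    return [(max(0, s - padding), min(db_length, e + padding)) for s, e in regions]


clumps = [(1400, 1600), (1000, 1200), (5000, 5100)]
print(homologous_regions(clumps, db_length=10_000))
```

The padding gives the alignment stage room to find exon ends that lie just outside the seeded clump. For protein searches the merge distance would be 100 amino acids instead of 300 bases.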

Search stage
[Figure: 1. two clumps with a distance < 300 between them; 2. adding 500 bases on each side; 3. the resulting homologous region]

Nucleotide Alignments
1. Search hits: generate a hit list between the query and the homologous region of the database.
2. Extend hits

Nucleotide Alignments (Extend hits)
The extension first merges adjacent hits (overlapping hits are merged as well) and expands their ends as far as the cDNA and genomic DNA match perfectly. N's in the cDNA are allowed to match any single base.
[Figure: aligned blocks separated by unaligned areas]
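The perfect-match expansion of a single hit can be sketched as below. It is a simplified illustration that ignores the merging of adjacent hits; the function name `extend_exact` and the toy sequences are assumptions.

```python
def extend_exact(cdna, genome, qp, dp, k):
    """Grow a K-mer hit at (qp, dp) in both directions while the bases match.

    Returns (cDNA start, genome start, length) of the extended block.
    """
    def same(c, g):
        return c == g or c == "n"       # an N in the cDNA matches any single base

    q_start, d_start = qp, dp
    while q_start > 0 and d_start > 0 and same(cdna[q_start - 1], genome[d_start - 1]):
        q_start -= 1
        d_start -= 1

    q_end, d_end = qp + k, dp + k
    while q_end < len(cdna) and d_end < len(genome) and same(cdna[q_end], genome[d_end]):
        q_end += 1
        d_end += 1

    return q_start, d_start, q_end - q_start


# Toy example: the 3-mer hit "cgt" at position 3 of both sequences.
print(extend_exact("ttacgtnacc", "ggacgtaacg", qp=3, dp=3, k=3))
```

Blocks produced this way contain no mismatches or indels, so anything left between them is either an intron or a region for the later, more tolerant extension steps.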

Nucleotide Alignments (Extend hits)
The program then recurses, making up tiles and trying to match them in the unaligned areas. The recursion runs until either no tiles are found or the gap between aligned blocks in the genome or cDNA becomes less than 6 (5 in BLAT, which uses a smaller k to find matches in the gaps).
[Figure: tiles in the unaligned areas, which are possibly introns]

Nucleotide Alignments
Extensions that allow 1 or 2 mismatches, or 1 or 2 insertions or deletions (indels), are pursued if they are followed by multiple matches.

Protein Alignments
The hits from the search stage are kept and extended into HSPs, scoring +2 for a match and -1 for a mismatch.

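Protein Alignments
The HSPs become the nodes of a graph. An edge from HSP A to a later HSP B has weight = score of B - gap penalty.
Example: HSP A scores +32 and HSP B scores +45; the gap penalty is 20 (based on the distance between A and B), so the edge weight = 45 - 20 = +25.
[Figure: HSPs A and B plotted by query coordinate against database coordinate]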

Protein Alignments
When HSPs A (+26) and B (+20) overlap, a "crossover" point is selected to maximize the sum of the score of A up to the point and the score of B starting at the point.
[Figure: overlapping HSPs A and B plotted by query coordinate against database coordinate]

Protein Alignments
A dynamic program then extracts the maximal-scoring alignment by traversing this graph. The HSPs in the maximal-scoring alignment are removed, and if any HSPs are left the dynamic program is run again.
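One pass of such a dynamic program can be sketched as a best-path search over the HSP graph. This is a simplified illustration, not BLAT's code: it assumes at least one HSP, skips the crossover handling of overlapping HSPs, and the function name `best_chain`, the tuple layout and the constant gap penalty in the example are all assumptions.

```python
def best_chain(hsps, gap_penalty):
    """Find the maximal-scoring chain of HSPs.

    hsps: non-empty list of (q_start, q_end, d_start, d_end, score).
    gap_penalty: function (a, b) -> penalty for linking HSP a to HSP b.
    Returns (chain score, list of HSP indices in chain order).
    """
    order = sorted(range(len(hsps)), key=lambda i: hsps[i][0])
    best, prev = {}, {}
    for pos, i in enumerate(order):
        best[i], prev[i] = hsps[i][4], None          # chain consisting of i alone
        for j in order[:pos]:
            a, b = hsps[j], hsps[i]
            # Edge a -> b only if b lies after a in both coordinates.
            if a[1] <= b[0] and a[3] <= b[2]:
                candidate = best[j] + b[4] - gap_penalty(a, b)
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:                           # walk back along the best path
        chain.append(node)
        node = prev[node]
    return best[end], chain[::-1]


hsps = [
    (0, 20, 100, 120, 32),    # HSP A
    (30, 60, 140, 170, 45),   # HSP B
    (10, 25, 500, 515, 10),   # an HSP that fits neither before B nor after A
]
print(best_chain(hsps, gap_penalty=lambda a, b: 20))
```

To mirror the repeated runs described above, the HSPs in the returned chain would be removed from the list and `best_chain` called again until no HSPs remain.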

Stitching and Filling in
Alignments from multiple homologous regions are stitched together using the same algorithm that is used to stitch together protein HSPs.

CONCLUSION
- BLAT is a very effective tool for doing nucleotide alignments between mRNA and DNA of the same species
- it is more accurate and faster than Sim4
- BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it efficiently handles the sequence divergence introduced by sequencing error

CONT High-speed alignment programs have two major stages:
1. search stage: identifies regions likely to be homologous; an index of some sort is key
2. alignment stage: does detailed alignments of the previously defined homologous regions

CONT For search stage:
- BLAT indexes the database rather than the query sequence, so it only has to scan the short query sequence
- the program SSAHA also indexes the database, and it is an extremely effective tool for aligning genomic regions from the same organism against each other
- but SSAHA does not implement "unsplicing" and always uses a single perfect match as a seed; BLAT is more flexible in this respect

CONT For search stage:
- the challenge of indexing is twofold:
  1. size of the index
  2. time to generate the index
- size is not a problem because memory has become affordable
- generation time is not a problem because the index is not generated often
---> good in both batch and server/client mode

CONT For alignment stage: how the index is used is important for speed
- triggering an alignment for every match to the index is not always the optimal strategy
- multiple nearby matches, or longer k-mers that tolerate a mismatch, both give much better specificity for a given sensitivity than a single match (BLAT implements a quick algorithm for this)

Human-Mouse Alignments with BLASTZ (Schwartz et al. 2003)
Roger Yang 楊伍隆

Source S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller Human-Mouse Alignments with BLASTZ Genome Research, 2003; 13: 103–107

Mouse Genome Analysis Consortium Goals
- Study the mutation and selection that shape the mouse and human genomes
- Estimate the fraction of the human genome under selection
- Determine the degree of selection in regions under selection
- Measure regional variation in the rate and pattern of neutral evolution

Sensitivity and Specificity
Sensitivity is needed for aligning a large portion of the neutrally evolving genomic regions.

Homology (BIOLOGY)
- Orthologs: two similar genes in two different species that originated from a common ancestor
- Paralogs: copies of a gene that was duplicated within an organism and now occupies two different positions in the same genome

Software Design Issues: Algorithm selection
- Detecting neutrally evolving DNA requires high sensitivity
- Fast algorithms (e.g. BLAT) sacrifice sensitivity for speed
- BLASTZ, used by PipMaker, is selected for the study

Algorithm Evolution
- Gapped BLAST: the underlying algorithm
- BLASTZ: an independent implementation of Gapped BLAST, specifically designed for aligning two long genomic sequences
- Modified BLASTZ: aligns entire mammalian genomes, with better sensitivity

Software Design Issues BLASTZ
Same 3-step concept as Gapped BLAST:
1. Find short near-exact matches
2. Extend each short match without allowing gaps
3. Extend each gap-free match that exceeds a certain threshold by a dynamic programming procedure that allows gaps

Software Design Issues BLASTZ v.s. Gapped BLAST
- Option to require that the matching regions occur in the same order and orientation in both sequences
- Nucleotide substitution matrix [1] (A, C, G, T): 91 -114 -31 -123 100 -125 -100
1. Chiaromonte et al. (2002)

Software Design Issues New BLASTZ speed improvement
- Regions are dynamically masked if several other regions are mapped to them (e.g. zinc fingers or olfactory receptor genes on Chromosome 19)
- Instead of 8 consecutive exactly matching nucleotides, use a spaced seed [1] with 12 matching positions
- To increase sensitivity, allow a transition (A-G, G-A, C-T, T-C) in any one of the 12 positions
1. Ma et al. (2002), PatternHunter
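How a spaced seed with one tolerated transition could be tested is sketched below. The seed pattern itself is not preserved in this transcript, so the example uses a short toy pattern; the function name `seed_match` and the `"1"`/`"0"` pattern encoding are assumptions.

```python
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}


def seed_match(x, y, pattern, allow_transition=True):
    """Check whether windows x and y agree at the '1' positions of the pattern.

    x, y and pattern are assumed to have the same length. At most one of the
    required positions may differ, and only by a transition.
    """
    transitions = 0
    for a, b, p in zip(x, y, pattern):
        if p != "1" or a == b:
            continue                      # ignored position, or an exact match
        if allow_transition and (a, b) in TRANSITIONS:
            transitions += 1
            if transitions > 1:
                return False              # only one transition is tolerated
        else:
            return False                  # transversion at a required position
    return True


toy_pattern = "110101"                    # toy pattern, NOT the BLASTZ seed
print(seed_match("acgtac", "acttac", toy_pattern))   # differs only at an ignored position
print(seed_match("acgtac", "atgtac", toy_pattern))   # one c-t transition
print(seed_match("acgtac", "aagtac", toy_pattern))   # c-a transversion
```

Spreading the required positions over a longer window is what lets a spaced seed tolerate mismatches at the ignored positions while still demanding the same number of agreeing bases.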

Software Design Issues BLASTZ algorithm
1. Remove lineage-specific interspersed repeats from both sequences (using RepeatMasker for mouse and human).
BIOLOGY
- Lineage-specific: arising after the human-mouse split
- Interspersed repeat: non-functional copy of an RNA gene inserted into the genome by reverse transcriptase
- SINE (short interspersed element): a few hundred bp, about 11% of the human genome
- LINE (long interspersed element): a few thousand bp, about 21% of the human genome

2. For all pairs of spaced 12-mers (one from each sequence) that are identical except perhaps for one transition, do the following.
2.1 Extend the induced alignment in each direction without gaps.
2.2 If the gap-free alignment scores more than 3000, then
2.2.1 Repeat the extension step with gaps.
2.2.2 Retain the alignment if it scores above 5000.
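Step 2.1 can be illustrated with a one-directional gap-free extension. The stopping rule used here (give up once the running score falls more than `x_drop` below the best score seen) is an assumption, as the slides do not say how the extension terminates; the function name `extend_right` and the +1/-1 scoring in the example are also hypothetical.

```python
def extend_right(s, t, i, j, score, x_drop):
    """Extend a gap-free alignment rightward from s[i], t[j].

    Returns (best score, number of aligned columns achieving it).
    """
    total = best = best_len = 0
    n = 0
    while i + n < len(s) and j + n < len(t):
        total += score(s[i + n], t[j + n])
        n += 1
        if total > best:
            best, best_len = total, n      # new best prefix of the extension
        elif best - total > x_drop:
            break                          # score has dropped too far: stop
    return best, best_len


def simple_score(a, b):
    return 1 if a == b else -1             # toy scores, not the real matrix


print(extend_right("acgtacgtttttt", "acgtacgtaaaaa", 0, 0, simple_score, x_drop=2))
```

Running the same routine leftward and adding the two scores gives the gap-free score that step 2.2 compares against its threshold.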

3. Between each pair of adjacent alignments from step 2, repeat step 2 with a more sensitive seeding procedure (e.g. 7-mer exact match) and lower score thresholds: 2000 for gap-free (instead of 3000) and 2000 for gapped (instead of 5000).

4. Adjust sequence positions in the resulting alignments to make them refer to the original sequences (i.e., account for Step 1).

5. Filter the alignments.
- Use axtBest to separate orthologs from paralogs.
- Paralogs have greater sequence identity in a smaller region.
- Orthologs are usually longer with greater sequence identity.

Implementation and H/W Issues
Segmentation:
- Mouse (2.5 Gb) into ~100 30-Mb segments
- Human (2.8 Gb) into ~ Mb segments (10-kb overlap)
Running time (with 888-MHz PIII): 481 days of CPU time; 12 hours with 1024 PCs
Output: 9 Gbyte (relative positions); axt: 2.5 Gbyte (actual bases)
3.3% of the human genome is covered by multiple alignments

Software Evaluation Accuracy?
Protein alignments are checked against X-ray crystal structures Gene predictions are checked against cDNA library There’s no gold standard to verify genome alignment The easier to implement, the better

Software Evaluation Speed up options
Early in the project, with only unassembled reads available, all-vs-all is the only option With all the genome sequence known, comparison can be made between smaller regions of human and mouse

Software Evaluation Specificity
Conservation of synteny: Human chromosome 20 is considered to be completely homologous to parts of Mouse chromosome 2


Discussion
- BLAST, BLAT, and PatternHunter are tuned to align protein-coding regions, not fine-scale features of genome evolution
- BLASTZ design philosophy:
  - intended for all stages of sequencing
  - does not enforce critical a priori assumptions about which alignments are important; the tasks of processing and filtering the initial alignments are left to other, flexible programs
