1
Outline
- What is BLAT and why we need it
- BLAT's similarities to and differences from BLAST
- BLAT's application forms
- BLAT's three major applications
- Conclusion
2
What is BLAT and why we need it
Many alignment tools already exist:
- Smith-Waterman algorithm: solves the alignment problem for two short sequences
- FASTA, NCBI BLAST, MegaBLAST, WU-BLAST: provide flexible and fast alignment against large databases
- Sim4: does a fine job with cDNA alignment
- SAM, PSI-BLAST: slowly but surely find remote homology
3
CONT
Assembling and annotating the human genome requires:
- aligning three million ESTs and 13 million mouse whole-genome random reads against the human genome
- finishing in less than two weeks, to leave time to process an updated genome every month or two
==> A very high-speed alignment algorithm is needed, so the author developed BLAT, the BLAST-Like Alignment Tool
4
CONT
BLAT compared with existing tools:
- more accurate
- 500 times faster in mRNA/DNA alignment
- 50 times faster in protein/protein alignment
BLAT's steps:
1. Use nonoverlapping k-mers to create an index
2. Use the index to find homologous regions
3. Align these regions separately
4. Stitch the aligned regions into larger alignments
5. Revisit small internal exons possibly missed in the first stage, and adjust large gap boundaries to canonical splice sites where feasible
5
CONT
BLAT's speed and sensitivity are determined by:
1. k-mer size (hit-finding step)
2. mismatch scheme (aligning step)
3. number of required index matches (hit-finding step)
6
BLAT's similarities to and differences from BLAST
Similarities:
- both scan for relatively short matches (hits), i.e. build an index and then find hits
- both extend hits into high-scoring segment pairs (HSPs)
7
CONT
Differences:
- BLAST builds an index of the query sequence; BLAT builds an index of the database
- BLAST scans linearly through the database; BLAT scans linearly through the query sequence
- BLAST triggers an extension when one or two hits occur in proximity to each other; BLAT can trigger extensions on any number of perfect or near-perfect hits
8
CONT
Differences:
- BLAST returns each area of homology between two sequences separately; BLAT stitches them together into a larger alignment
- BLAT has special code to handle introns in RNA/DNA alignments, i.e. BLAT "unsplices" mRNA onto the genome
9
BLAT's application forms
Server/client
- building the index is a relatively slow procedure
- a BLAT server keeps the index in memory for clients to query
==> good for interactive applications
Stand-alone
- suitable for batch runs on one or more CPUs
10
BLAT's three major applications and their evaluation
- mRNA/DNA alignment
- Mouse/Human translated alignment
- client/server version to power interactive searches
11
CONT
Evaluating mRNA/DNA alignments (compared with Sim4)
- test set: 713 mRNAs remapped to genes on chromosome 22
- speed: BLAT 26 sec; Sim4 5 hr
- sensitivity (% agreement with the annotated bases): BLAT: %; Sim4: %
12
CONT
Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
- for Human/Mouse comparisons, it has been shown that gapless alignments are in many ways preferable to gapped alignments for detecting coding regions
13
CONT
Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
Speed comparison:
method       k    N    matrix    time
WU-TBLASTX             /         s
WU-TBLASTX             BLOSUM    s
BLAT                   /         s
BLAT                   /         s
k: the size of a perfectly matching hit
N: how many hits are required to trigger a detailed alignment
matrix: scoring method
14
CONT
Evaluating Mouse/Human Translated Alignment (compared with TBLASTX)
Sensitivity comparison:
method       %chr 22    %RefSeq bases    Enrichment    %RefSeq exons
WU-TBLASTX   %          %                x             %
BLAT         %          %                x             %
%chr 22: percentage of chromosome 22 covered by the alignment
%RefSeq bases: percentage of bases inside human RefSeq coding sequence covered by the alignment
Enrichment: column 2 / column 1; a higher value indicates more specificity
%RefSeq exons: percentage of RefSeq coding exons covered by the alignment
15
CONT
Server/client to power interactive searches
- thousands of interactive sequence searches per day
- the index is built only once and kept in memory for queries
===> efficient
- but not as efficient as the stand-alone version, because the server needs to save memory and so keeps only the index, not the database
16
BLAT – The BLAST-Like Alignment Tool
W. James Kent, Genome Research 2002
陳韋仰
17
Database & Query Sequence
Database: indexed as nonoverlapping K-mers
Query sequence: scanned as overlapping K-mers
18
Three Search Criteria
- Single Perfect Matches
- Single Almost Perfect Matches
- Multiple Perfect Matches
19
Definition
K: the K-mer size
M: the match ratio between homologous areas
H: the size of a homologous area
G: the size of the database
Q: the size of the query sequence
A: the alphabet size (20 for amino acids, 4 for nucleotides)
20
T: the number of nonoverlapping K-mers in the homologous region
$T = \lfloor H / K \rfloor$
H: homologous area size
K: K-mer size
21
Single Perfect
$p_1$: the probability that a specific K-mer in a homologous region of the database matches perfectly with the corresponding K-mer in the query
$p_1 = M^K$
(M: the match ratio between homologous areas)
22
Sensitivity
P: the probability that at least one nonoverlapping K-mer in the homologous region matches perfectly with the corresponding K-mer in the query
$P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T$
(T: the number of nonoverlapping K-mers in the homologous region; $p_1 = M^K$)
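The sensitivity formula for the single perfect criterion can be checked numerically. The Python sketch below is only an illustration: the function name and the example values M = 0.9 and K = 11 are assumptions made here, while H = 100 is the setting the slides use for their tables.

```python
def single_perfect_sensitivity(M: float, K: int, H: int) -> float:
    """P = 1 - (1 - M**K)**T, with T = H // K nonoverlapping K-mers."""
    T = H // K                 # number of nonoverlapping K-mers in the region
    p1 = M ** K                # chance that one specific K-mer matches perfectly
    return 1.0 - (1.0 - p1) ** T


if __name__ == "__main__":
    # Illustrative values: 90% identity, 11-mers, 100-base homologous area.
    print(single_perfect_sensitivity(M=0.9, K=11, H=100))
```

Smaller K gives more K-mers per region and a higher chance that each one matches, so sensitivity rises as K shrinks.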
23
Specificity
F: the number of nonoverlapping K-mers that are expected to match by chance
$F = (Q - K + 1) \cdot (G / K) \cdot (1 / A)^K$
($Q - K + 1$: the number of K-mers in the query sequence; $G / K$: the number of nonoverlapping K-mers in the database)
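To get a feel for the size of F, the expected number of chance matches can be evaluated directly. This is a minimal Python sketch; the function name is hypothetical, K = 11 is an illustrative choice, and G = 3 billion, Q = 500 follow the table settings quoted in the slides.

```python
def chance_matches_single_perfect(Q: int, G: float, K: int, A: int) -> float:
    """F = (Q - K + 1) * (G / K) * (1 / A)**K."""
    query_kmers = Q - K + 1        # overlapping K-mers in the query
    database_kmers = G / K         # nonoverlapping K-mers in the database
    return query_kmers * database_kmers * (1.0 / A) ** K


if __name__ == "__main__":
    # Nucleotide search (A = 4) of a 500-base query against a 3-billion-base database.
    print(chance_matches_single_perfect(Q=500, G=3e9, K=11, A=4))
```

Each extra letter in K divides the per-K-mer chance probability by A, which is why longer K-mers produce far fewer false hits.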
24
Single Perfect (Nucleotide)
[Table: sensitivity P by match ratio M; H = 100, G = 3 billion, Q = 500]
25
Single Almost Perfect
$p_1$: the probability that a nonoverlapping K-mer in a homologous region of the database matches almost perfectly with the corresponding K-mer in the query (one letter may mismatch)
$p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K$
26
Sensitivity
P: the probability that any nonoverlapping K-mer in the homologous region matches almost perfectly with the corresponding K-mer in the query
$P = 1 - (1 - p_1)^T$, with $p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K$
27
Specificity
F: the number of nonoverlapping K-mers that are expected to match by chance
$F = (Q - K + 1) \cdot (G / K) \cdot (1 / A)^K + (Q - K + 1) \cdot (G / K) \cdot \left( K \cdot (1 / A)^{K-1} \cdot (1 - 1 / A) \right)$
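The almost perfect criterion trades specificity for sensitivity, and both sides of that trade can be computed from the formulas above. The following Python sketch assumes the reconstructed form p1 = K * M^(K-1) * (1 - M) + M^K; function names and the example values are illustrative only.

```python
def near_perfect_p1(M: float, K: int) -> float:
    """Chance that one K-mer matches with at most one mismatching letter."""
    return K * M ** (K - 1) * (1.0 - M) + M ** K


def near_perfect_sensitivity(M: float, K: int, H: int) -> float:
    """P = 1 - (1 - p1)**T with T = H // K."""
    return 1.0 - (1.0 - near_perfect_p1(M, K)) ** (H // K)


def near_perfect_chance_matches(Q: int, G: float, K: int, A: int) -> float:
    """F = (Q-K+1) * (G/K) * ((1/A)**K + K * (1/A)**(K-1) * (1 - 1/A))."""
    perfect = (1.0 / A) ** K
    one_mismatch = K * (1.0 / A) ** (K - 1) * (1.0 - 1.0 / A)
    return (Q - K + 1) * (G / K) * (perfect + one_mismatch)


if __name__ == "__main__":
    print(near_perfect_sensitivity(M=0.9, K=11, H=100))
    print(near_perfect_chance_matches(Q=500, G=3e9, K=11, A=4))
```

For nucleotides and K = 11, allowing one mismatch multiplies the chance-match count by 1 + K * (A - 1) = 34 relative to the perfect-match criterion, so a near-perfect search normally uses a larger K.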
28
Single Almost Perfect (Nucleotide)
H = 100 ; G = 3 billion , Q = 500
29
Multiple Perfect
There must be N perfect matches, each no more than W letters from the previous one in the target coordinate, and all on the same diagonal
Example: N = 2
30
Sensitivity
$p_1 = M^K$, as in the single perfect case (N = 1)
$P_n$: the probability that there are exactly n matches within the homologous region
$P_n = \binom{T}{n} \cdot p_1^n \cdot (1 - p_1)^{T - n}$
The probability that there are N or more matches: $P = P_N + P_{N+1} + \dots + P_T$
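The sum over P_n is a binomial tail, which is easy to evaluate. This Python sketch assumes the binomial form with p1 = M^K and T = H // K; the function name and example values are illustrative and not taken from the slides.

```python
from math import comb


def multiple_perfect_sensitivity(M: float, K: int, H: int, N: int) -> float:
    """Probability of N or more perfect K-mer matches among T = H // K K-mers."""
    T = H // K
    p1 = M ** K
    return sum(
        comb(T, n) * p1 ** n * (1.0 - p1) ** (T - n)
        for n in range(N, T + 1)
    )


if __name__ == "__main__":
    # Requiring two perfect 11-mers instead of one lowers sensitivity.
    print(multiple_perfect_sensitivity(M=0.9, K=11, H=100, N=1))
    print(multiple_perfect_sensitivity(M=0.9, K=11, H=100, N=2))
```

With N = 1 the sum reduces to the single perfect formula 1 - (1 - p1)^T, which is a convenient consistency check.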
31
Specificity
$F_N$: the number of chance matches of N K-mers, each separated by no more than W from the previous match
For N = 1: $F_1 = (Q - K + 1) \cdot (G / K) \cdot (1 / A)^K$
32
Specificity (continued)
S: the probability of a second match occurring within W letters after the first
$S = 1 - \left( 1 - (1 / A)^K \right)^{W / K}$
=> Since the Nth match must fall within W letters after the (N-1)th match:
$F_N = S \cdot F_{N-1}$
$F_N = F_1 \cdot S^{N-1}$
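The recurrence for F_N shows how quickly chance matches vanish as more hits are required. A small Python sketch under the formulas above follows; the function name and the window size W = 100 used in the example are assumptions for illustration.

```python
def chance_matches_multiple(Q: int, G: float, K: int, A: int, N: int, W: int) -> float:
    """F_N = F_1 * S**(N - 1) for N required perfect K-mer matches."""
    per_kmer = (1.0 / A) ** K
    f1 = (Q - K + 1) * (G / K) * per_kmer          # single-match chance hits
    s = 1.0 - (1.0 - per_kmer) ** (W / K)          # chance of a follow-up match within W
    return f1 * s ** (N - 1)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, chance_matches_multiple(Q=500, G=3e9, K=11, A=4, N=n, W=100))
```

Because S is very small for realistic K, each additional required match cuts the expected number of chance hits by several orders of magnitude.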
33
Multiple Perfect (Nucleotide)
34
Default Match Criteria
Nucleotide: two perfect 11-mers
Protein:
- stand-alone: single perfect 5-mer
- client/server: three perfect 4-mers
35
Implementation mickey
36
Algorithm
1. Search stage: the program detects regions of the two sequences that are likely to be homologous.
2. Alignment stage: these regions are examined in more detail and alignments are produced for them.
37
Search stage
1. Building up an index: create the nonoverlapping K-mers of the database and record their positions.
2. Excluding useless K-mers: remove from the index K-mers that occur too often or that contain ambiguity codes.
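The two index-building steps can be sketched in a few lines of Python. This is not BLAT's actual data structure; the function name, the dictionary layout and the `max_occurrences` cutoff are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_filtered_index(database: str, k: int, max_occurrences: int) -> dict:
    """Index nonoverlapping k-mers, skipping ambiguous and over-represented ones."""
    index = defaultdict(list)
    # Step 1: nonoverlapping k-mers, so the step between start positions is k.
    for pos in range(0, len(database) - k + 1, k):
        kmer = database[pos:pos + k].upper()
        # Step 2a: drop k-mers containing ambiguity codes (anything but A, C, G, T).
        if set(kmer) <= set("ACGT"):
            index[kmer].append(pos)
    # Step 2b: drop k-mers that occur too often to be informative.
    return {kmer: hits for kmer, hits in index.items() if len(hits) <= max_occurrences}


if __name__ == "__main__":
    print(build_filtered_index("AAAAAAAAACGTNCGACG", k=3, max_occurrences=2))
```

In the example, `AAA` is dropped because it occurs three times and `NCG` because it contains an ambiguity code.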
38
Search stage
[Figure: the database is cut into nonoverlapping K-mers; the index maps each K-mer to its database positions]
39
Search stage
[Figure: overlapping K-mers from the query sequence are looked up in the index, which returns database positions P1, P2, ...]
40
Search stage
[Figure: each index match adds one entry (query sequence position, database position) to the hit list]
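The lookup that fills the hit list can be sketched as follows. The Python below is a toy version: the function names are hypothetical, and the two short sequences are made up so that a 3-mer search yields a handful of hits.

```python
from collections import defaultdict


def build_index(database: str, k: int) -> dict:
    """Map each nonoverlapping k-mer of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return dict(index)


def find_hits(query: str, index: dict, k: int) -> list:
    """Scan overlapping k-mers of the query; return (query_pos, database_pos) hits."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dbpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dbpos))
    return hits


if __name__ == "__main__":
    database = "cacaattatcacgaccgc"   # toy database, 3-mers at 0, 3, 6, 9, 12, 15
    query = "aattctcac"               # toy query
    print(find_hits(query, build_index(database, 3), 3))
```

Only the short query is scanned letter by letter; the database is touched solely through dictionary lookups, which is where the speed comes from.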
41
Search stage (example)
42
Search stage (example)
From the previous slide: hits with equal diagonal values lie on the same diagonal.
K-mer    Position (query position QP, database position DP)    Diagonal (DP - QP)
aat      0, 3                                                  +3
cac      6, 0                                                  -6
cac      6, 9                                                  +3
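Grouping hits by diagonal is a one-line computation per hit. The Python sketch below uses the three hits from the example table; the function name and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict


def group_by_diagonal(hits: list) -> dict:
    """Group (query_pos, database_pos) hits by diagonal = database_pos - query_pos."""
    diagonals = defaultdict(list)
    for qpos, dbpos in hits:
        diagonals[dbpos - qpos].append((qpos, dbpos))
    return dict(diagonals)


if __name__ == "__main__":
    example_hits = [(0, 3), (6, 0), (6, 9)]   # aat, cac, cac from the example
    print(group_by_diagonal(example_hits))
```

Hits that share a diagonal are consistent with a single gapless alignment, which is why the two hits on diagonal +3 support each other while the hit on diagonal -6 stands alone.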
43
Search stage
[Figure: hits plotted by query coordinate and database coordinate; the diagonals DP - QP = 0 and DP - QP < 0 are marked, and the database coordinate is divided into buckets 1, 2, 3, ...]
44
Search stage
1. Hits that are within the gap limit are bundled together into proto-clumps.
2. Hits within proto-clumps are then sorted along the database coordinate and put into real clumps if they are within the window limit.
45
Search stage
- Clumps with fewer than the minimum number of hits are discarded.
- The rest are used to define regions of the database that are homologous to the query sequence.
- Clumps that are within 300 bases or 100 amino acids of each other in the database are merged.
- 500 additional bases are added on each side to form the final homologous region.
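The merge-and-pad step can be sketched with simple interval arithmetic. In this Python illustration the function name and the (start, end) tuple representation of a clump are assumptions; the defaults 300 and 500 are the nucleotide values quoted above.

```python
def homologous_regions(clumps: list, db_length: int,
                       merge_distance: int = 300, padding: int = 500) -> list:
    """Merge clumps closer than merge_distance, then pad each region on both sides."""
    merged = []
    for start, end in sorted(clumps):
        if merged and start - merged[-1][1] < merge_distance:
            # Close enough to the previous clump: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Pad, clamping to the ends of the database sequence.
    return [(max(0, s - padding), min(db_length, e + padding)) for s, e in merged]


if __name__ == "__main__":
    clumps = [(1000, 1100), (1300, 1400), (5000, 5100)]
    print(homologous_regions(clumps, db_length=10000))
```

The padding gives the alignment stage room to find exon ends and small exons that produced no hits in the search stage.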
46
Search stage
[Figure: 1. two clumps with distance < 300 are merged; 2. 500 bases are added on each side; 3. the result is the homologous region]
47
Nucleotide Alignments
1. Search hits: generate a hit list between the query and the homologous region of the database.
2. Extend hits
48
Nucleotide Alignments (Extend hits)
- The extension first merges adjacent hits and expands their ends as far as the cDNA and genomic DNA match perfectly (overlapping hits are also merged).
- N's in the cDNA are allowed to match any single base.
- What remains between the extended blocks are unaligned areas.
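Perfect-match extension of a single hit can be sketched as below. This Python example is a simplified illustration, not BLAT's code: the function names are hypothetical, merging of adjacent hits is omitted, and only the rule that an N in the cDNA matches any base is modeled.

```python
def bases_match(cdna_base: str, genome_base: str) -> bool:
    """A cDNA base matches if it is identical or is the wildcard N."""
    return cdna_base == genome_base or cdna_base == "N"


def extend_perfect(cdna: str, genome: str, qpos: int, dpos: int, k: int) -> tuple:
    """Extend a k-mer hit in both directions while bases match perfectly.

    Returns (cdna_start, genome_start, length) of the extended block.
    """
    q_start, d_start = qpos, dpos
    while q_start > 0 and d_start > 0 and bases_match(cdna[q_start - 1], genome[d_start - 1]):
        q_start -= 1
        d_start -= 1
    q_end, d_end = qpos + k, dpos + k
    while q_end < len(cdna) and d_end < len(genome) and bases_match(cdna[q_end], genome[d_end]):
        q_end += 1
        d_end += 1
    return q_start, d_start, q_end - q_start


if __name__ == "__main__":
    cdna = "ACGTNACGT"
    genome = "TTACGTAACGTTT"
    # Seed: cdna[5:8] == genome[7:10] == "ACG"
    print(extend_perfect(cdna, genome, qpos=5, dpos=7, k=3))
```

In the example the extension runs through the N at cDNA position 4 and stops only at the ends of the cDNA.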
49
Nucleotide Alignments (Extend hits)
- The program then recurses, making up tiles and trying to match them in the unaligned areas (a smaller k is used to find matches in BLAT).
- The recursion runs until either no tiles are found or the gap between aligned blocks in the genome or cDNA becomes less than 6 (5 in BLAT).
- The remaining large gaps in the genome are possibly introns.
50
Nucleotide Alignments
- Extensions that allow 1 or 2 mismatches, if followed by multiple matches, are pursued.
- Extensions that allow 1 or 2 insertions or deletions (indels), if followed by multiple matches, are pursued.
51
Protein Alignments
The hits from the search stage are kept and extended into HSPs, scoring a match as +2 and a mismatch as -1.
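A gapless extension under the +2 / -1 scoring can be sketched as follows. This Python example only extends to the right and keeps the best-scoring prefix; the function name is hypothetical, and a real implementation would also extend to the left and would normally stop early once the score has dropped too far, which is omitted here.

```python
def extend_right(query: str, database: str, qpos: int, dpos: int,
                 match: int = 2, mismatch: int = -1) -> tuple:
    """Gapless extension to the right; returns (best_score, best_length)."""
    score = best_score = best_length = 0
    offset = 0
    while qpos + offset < len(query) and dpos + offset < len(database):
        same = query[qpos + offset] == database[dpos + offset]
        score += match if same else mismatch
        offset += 1
        if score > best_score:
            # Remember the longest prefix that reached the highest score so far.
            best_score, best_length = score, offset
    return best_score, best_length


if __name__ == "__main__":
    print(extend_right("ACDEFG", "ACDQFGW", 0, 0))
```

Keeping the best prefix rather than the full scan means trailing mismatches are trimmed from the HSP.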
52
Protein Alignments
[Figure: HSP A (score +32) and HSP B (score +45) plotted by query coordinate and database coordinate]
- Gap penalty = 20 (based on the distance between A and B)
- Weight of the edge from A to B = score of B - gap penalty = 45 - 20 = +25
53
Protein Alignments
[Figure: overlapping HSP A (score +26) and HSP B (score +20) plotted by query coordinate and database coordinate]
When A and B overlap, select a "crossover" point that maximizes the sum of the score of A up to the point and the score of B starting at the point.
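Choosing the crossover point is a small maximization over the overlap. The Python sketch below assumes the per-column scores of A and B across the overlapping columns are already available as two equal-length lists; that representation and the function name are illustrative assumptions.

```python
def best_crossover(scores_a: list, scores_b: list) -> tuple:
    """Return (cut, total): use A for columns [0, cut) and B for columns [cut, n)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("overlap score lists must have equal length")
    n = len(scores_a)
    suffix_b = sum(scores_b)       # score of B from the current cut to the end
    prefix_a = 0                   # score of A before the current cut
    best_cut, best_total = 0, suffix_b
    for cut in range(1, n + 1):
        prefix_a += scores_a[cut - 1]
        suffix_b -= scores_b[cut - 1]
        if prefix_a + suffix_b > best_total:
            best_cut, best_total = cut, prefix_a + suffix_b
    return best_cut, best_total


if __name__ == "__main__":
    # A scores well at the start of the overlap, B at the end.
    print(best_crossover([2, 2, -1, -1], [-1, -1, 2, 2]))
```

Running prefix and suffix sums keeps the search linear in the length of the overlap.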
54
Protein Alignments
- A dynamic program then extracts the maximal-scoring alignment by traversing this graph.
- The HSPs in the maximal-scoring alignment are removed, and if any HSPs are left the dynamic program is run again.
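The graph traversal can be sketched as a chaining dynamic program. In this Python illustration, HSPs are tuples (q_start, q_end, d_start, d_end, score), an edge A -> B exists only when B starts after A ends in both coordinates, and the gap penalty is passed in as a function; all of these are simplifying assumptions, and overlapping HSPs (handled by the crossover rule) are not chained here.

```python
def best_chain(hsps: list, gap_penalty) -> tuple:
    """Return (score, chain) for the maximal-scoring chain of HSPs."""
    if not hsps:
        return 0, []
    # Sorting by start position guarantees predecessors are processed first.
    order = sorted(range(len(hsps)), key=lambda i: (hsps[i][0], hsps[i][2]))
    best, prev = {}, {}
    for rank, j in enumerate(order):
        b = hsps[j]
        best[j], prev[j] = b[4], None
        for i in order[:rank]:
            a = hsps[i]
            if a[1] <= b[0] and a[3] <= b[2]:
                candidate = best[i] + b[4] - gap_penalty(a, b)
                if candidate > best[j]:
                    best[j], prev[j] = candidate, i
    node = max(best, key=best.get)
    score, chain = best[node], []
    while node is not None:
        chain.append(hsps[node])
        node = prev[node]
    return score, chain[::-1]


def all_chains(hsps: list, gap_penalty) -> list:
    """Extract the best chain, remove its HSPs, and repeat until none are left."""
    remaining, chains = list(hsps), []
    while remaining:
        score, chain = best_chain(remaining, gap_penalty)
        chains.append((score, chain))
        remaining = [h for h in remaining if h not in chain]
    return chains


if __name__ == "__main__":
    a = (0, 10, 0, 10, 32)
    b = (15, 30, 20, 35, 45)
    print(all_chains([a, b], lambda x, y: 20))   # constant penalty of 20 for the demo
```

With A scoring +32, B scoring +45 and a gap penalty of 20, the chain A -> B scores 32 + (45 - 20) = 57, matching the edge weight of +25 used in the slides.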
55
Stitching and Filling in
- Alignments are stitched together using the same algorithm that is used to stitch together protein HSPs.
- A query may yield multiple homologous regions.
56
CONCLUSION
- BLAT is a very effective tool for nucleotide alignments between mRNA and DNA of the same species
- it is more accurate and faster than Sim4
- BLAT's strategy for nucleotide alignments becomes less effective below 90% sequence identity, but it efficiently handles the sequence divergence introduced by sequencing error
57
CONT
A high-speed alignment program has two major stages:
1. Search stage: identify regions likely to be homologous (an index of some sort is key)
2. Alignment stage: do detailed alignments of the previously identified homologous regions
58
CONT
For the search stage:
- BLAT indexes the database rather than the query sequence, so it only has to scan the short query sequence
- The program SSAHA also indexes the database, and it is an extremely effective tool for aligning genomic regions from the same organism against each other
- but SSAHA does not implement "unsplicing" and always uses a single perfect match as a seed; BLAT is more flexible in this respect
59
CONT
For the search stage:
- the challenge of indexing is twofold:
1. size of the index
2. time to generate the index
- size is not a problem because memory has recently become affordable
- the index is not generated often
---> good in both batch and server/client mode
60
CONT
For the alignment stage:
How the index is used is important for speed
- triggering an alignment for every match to the index is not always the optimal strategy
- requiring multiple nearby matches, or using longer k-mers while tolerating a mismatch, both give much better specificity for a given sensitivity than a single match (BLAT implements a quick algorithm for this)
61
Human-Mouse Alignments with BLASTZ (Schwartz et al. 2003)
Roger Yang 楊伍隆
62
Source
S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller. Human-Mouse Alignments with BLASTZ. Genome Research, 2003; 13: 103–107.
63
Mouse Genome Analysis Consortium Goals
Study the mutation and selection that shape the mouse and human genomes:
- estimating the fraction of the human genome under selection
- determining the degree of selection acting on those regions
- measuring regional variation in the rate and pattern of neutral evolution
64
Sensitivity and Specificity
Sensitivity is needed to align a large portion of the neutrally evolving genomic regions.
65
Homology (BIOLOGY)
- Orthologs: two similar genes in two different species that originated from a common ancestor
- Paralogs: copies of a gene that was duplicated within an organism and now occupies two different positions in the same genome
66
Software Design Issues Algorithm selections
- Detecting neutrally evolving DNA requires high sensitivity
- Fast algorithms (e.g. BLAT) sacrifice sensitivity for speed
- BLASTZ, used by PipMaker, was selected for the study
67
Algorithm Evolution
- Gapped BLAST: specifically designed for aligning two long genomic sequences
- BLASTZ: another implementation of Gapped BLAST
- Modified BLASTZ: aligns entire mammalian genomes, with better sensitivity
68
Software Design Issues BLASTZ
Same 3-step concept as Gapped BLAST:
1. Find short near-exact matches
2. Extend each short match without allowing gaps
3. Extend each gap-free match that exceeds a certain threshold by a dynamic programming procedure that allows gaps
69
Software Design Issues BLASTZ v.s. Gapped BLAST
- Option to require that the matching regions occur in the same order and orientation in both sequences
- Nucleotide substitution matrix (1):
        A      C      G      T
A      91   -114    -31   -123
C    -114    100   -125    -31
G     -31   -125    100   -114
T    -123    -31   -114     91
1. Chiaromonte et al. (2002)
70
Software Design Issues New BLASTZ speed improvement
- Regions are dynamically masked if several other regions are mapped to them (e.g. zinc finger or olfactory receptor genes on chromosome 19)
- Instead of 8 consecutive exactly matching nucleotides, a spaced seed with 12 matching positions is used (1)
- To increase sensitivity, a transition (A-G, G-A, C-T, T-C) is allowed in any one of the 12 positions
1. Ma et al. (2002), PatternHunter
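A spaced-seed comparison that tolerates one transition can be sketched as below. The Python function name is hypothetical, and the 4-letter pattern in the example is a toy stand-in chosen for readability; it is not the 12-position seed BLASTZ uses, which the slide text does not preserve.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def spaced_seed_match(a: str, b: str, pattern: str) -> bool:
    """True if a and b agree at every '1' position, allowing at most one transition."""
    if not (len(a) == len(b) == len(pattern)):
        raise ValueError("sequences and pattern must have equal length")
    transitions_used = 0
    for x, y, flag in zip(a, b, pattern):
        if flag != "1" or x == y:
            continue                      # ignored position or exact match
        if (x, y) not in TRANSITIONS:
            return False                  # transversion at a required position
        transitions_used += 1
        if transitions_used > 1:
            return False                  # only one transition is tolerated
    return True


if __name__ == "__main__":
    toy_pattern = "1101"                  # toy pattern: position 2 is a "don't care"
    print(spaced_seed_match("ACGT", "ACTT", toy_pattern))
    print(spaced_seed_match("ACGT", "GCGT", toy_pattern))
    print(spaced_seed_match("ACGT", "CCGT", toy_pattern))
```

Ignoring some positions and tolerating a transition both make the seed more forgiving of the substitutions that are most common in diverged DNA.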
71
Software Design Issues BLASTZ algorithm
1. Remove lineage-specific interspersed repeats from both sequences (using RepeatMasker for mouse and human).
BIOLOGY
- Lineage-specific: inserted after the human-mouse split
- Interspersed repeat: non-functional copy of RNA genes inserted into the genome by reverse transcriptase
- SINE (Short Interspersed Nuclear Element): a few hundred bp, about 11% of the human genome
- LINE (Long Interspersed Nuclear Element): a few thousand bp, about 21% of the human genome
72
Software Design Issues BLASTZ algorithm
2. For all pairs of spaced 12-mers (one from each sequence) that are identical except perhaps for one transition, do the following:
2.1 Extend the induced alignment in each direction without gaps.
2.2 If the gap-free alignment scores more than 3000, then
2.2.1 Repeat the extension step with gaps.
2.2.2 Retain the alignment if it scores above 5000.
73
Software Design Issues BLASTZ algorithm
3. Between each pair of adjacent alignments from step 2, repeat step 2 with:
- a more sensitive seeding procedure (e.g. 7-mer exact match)
- lower score thresholds: 2000 for gap-free alignments (instead of 3000) and 2000 for gapped alignments (instead of 5000)
74
Software Design Issues BLASTZ algorithm
4. Adjust sequence positions in the resulting alignments to make them refer to the original sequences (i.e., account for Step 1).
75
Software Design Issues BLASTZ algorithm
5. Filter the alignments
- Use axtBest to separate orthologs from paralogs
- Paralogs have greater sequence identity in a smaller region.
- Orthologs are usually longer with greater sequence identity.
76
Implementation and H/W Issues
Segmentation
- Mouse (2.5 Gb) into ~100 30 MB segments
- Human (2.8 Gb) into ~ MB segments (10 kb overlap)
Running time (with 888 MHz PIII)
- 481 days of CPU time
- 12 hours with 1024 PCs
Output
- 9 GB (relative positions)
- axt: 2.5 GB (actual bases)
- 3.3% of the human genome is covered by multiple alignments
77
Software Evaluation Accuracy?
- Protein alignments are checked against X-ray crystal structures
- Gene predictions are checked against cDNA libraries
- There is no gold standard for verifying genome alignments
- The easier to implement, the better
78
Software Evaluation Speed up options
- Early in the project, with only unassembled reads available, all-vs-all comparison was the only option
- With the whole genome sequence known, comparisons can be made between smaller regions of human and mouse
79
Software Evaluation Specificity
Conservation of synteny: Human chromosome 20 is considered to be completely homologous to parts of Mouse chromosome 2
81
BIOLOGY
82
Discussion
- BLAST, BLAT, and PatternHunter are tuned to align protein-coding regions, not fine-scale features of genome evolution
BLASTZ design philosophy
- Intended for all stages of sequencing
- Does not enforce critical a priori assumptions about which alignments are important; the tasks of processing and filtering the initial alignments are left to other, flexible programs