Download presentation
Presentation is loading. Please wait.
Published byLucinda Arnold Modified over 9 years ago
1
Presented by Mario Flores, Xuepo Ma, and Nguyen Nguyen
2
Outline Short-read alignment – Algorithm – Results Comparisons between short-read and long- read alignment Long-read alignment – Algorithm – Results
3
Motivation Motivation: new DNA sequencing technologies call fast and accurate read alignment programs. MAQ: Pros: accurate, feature rich and fast enough to align short reads from single individual. Cons: MAQ does NOT support gapped alignment for single-end reads => unsuitable for alignment longer reads where indels may occur frequently. Alignment with BWT : efficiently align short sequencing reads against a large reference sequence allowing mismatches and gaps
4
Burrows Wheeler Transfrom actgct$ ctgct$a tgct$ac gct$act ct$actg t$actgc $actgct S[i]B[i]i X: actgct W: gcc Z=1
5
Inexact Matching - number of deference in string W Take string W=“gcc” for example. 1. W(0,0)=“g”, “g” is a substring of X, D(0)=0; 2. W(0,1)=“gc”, “gc” is a substring of X, D(1)=0; 3. W(0,2)=“gcc”, “gcc” is not a substring of X, D(2)=1.
6
Inexact Matching - Searching
7
6,6 2,3 4,4 6,6 3,3 1,1 2,3 3,3 6,6 3,3 1,1 3,3 6,6 3,3 1,1 0,6 X: actgct W: gcc t c a g t c a c a g t c a g t c a a 1,1 ^ ^ ^ ^ ^ 1 2 3 ^ 4 5 6
8
Exact Matching Let the D(i)=0, then the algorithm can search for the exact matching
9
Simulated data Accuracy BWA is more accurate than Bowtie and SOAPv2 based on criterion 1. Speed BWA is the fastest second only to SOAPv2. Memory MAQ’s memory footprint is 1GB, but it increases linearly with the number of reads to be aligned. BWA only uses 2.3 GB for single-end mapping and 3GB for paired-end ( as much as Bowtie). SOAPv2 uses 5.4 GB.
10
Differences between short-read and long- read alignment Short-read alignment Align full-length read Efficient for ungapped alignment or limited gaps Long-read alignment Find local matches Permissive about alignment gaps
11
Motivations Many programs for short sequencing Not many for reads>200 bp BLAT, SSAHA2 New platforms are producing longer sequences: Roche/454 >400bp, Illumina>100 bp, Pacific > 1000 bp Fast and accurate long-read alignment with Burrows-Wheeler transform New algorithm: Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
12
Before NGS FASTA 1988 BLAST 1997 MegaBLAST 2000 SSAHA2 2001 BLAT 2002 After NGS SOAP 2008 MAQ 2008 Bowtie 2009 BWA 2009 BWA-SW 2010
13
prefix trie Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW Overview Algorithm (1)Build FM-indices for reference and query sequences (2)Represent reference in a prefix trie (3)Represents query in prefix in DAWG (directed acyclic word graph) transformed from the prefix trie of the query sequence String GOOGOL ‘ ∧ ’ start of a string The two numbers in A node gives the SA interval of the node Prefix tree Prefix DAWG Example: a. 3 nodes has SA interval [4,4] b. Their parents have interval [1,2],[1,2] and [1,1] In prefix DAWG The [4,4] node has parents [1,2] and [1,1] Node [4,4] represents the strings ‘OG’, ‘OGO’, ‘OGOL’ ‘
14
Overview Algorithm (4) Dynamic programming with heuristics to accelerate algorithm Heuristics rules: A) Restrict the dynamic programming algorithm around good matches only B) Report only alignments largely non-overlapping Result of these heuristics is: Savings in computing time Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
15
Heuristic strategies for acceleration (1) Z best : Traverse G(W) in outer loop and T(X) in inner loop, and at each node u in G(W) only keep the top Z best scoring nodes in T(X) that match u rather than keeping all the matching nodes Where G(W) prefix DAWG of query sequence W T(X) prefix trie for reference sequence X u root of G(W) (2) Take only best few alignments covering each region of the query sequence Burrows Wheeler Aligner’s Smith-Waterman Alignment BWA-SW
16
Result Implementation of BWA-SW takes a BWA index and a query FASTA and FASTQ file as inputs. Typical sequencing reads requires less than 4GB. The peak memory is 6.4 GB in total on one query sequence with 1 million base pairs.
17
Simulated data Speed BWA-SW is fastest, and its speed is not sensitive to the read length or error rates. Memory BWA-SW uses about 4GB (as much as BLAT). SSAHA2 uses 2.4GB for >=500 bp reads, and 5.3 GB for shorter reads. BWA-SW supports multi-threading while SSAHA2 and BLAT do not. Accuracy BWA-SW can detect chimera reads, and produces fewer false chimeric reads given lower base errors.
18
Conclusion Short-read alignment cannot be used for long- read alignment due to: – Full-length read vs local matches. – Ungapped or limited gap vs larger number of gaps. BWA-short is more accurate, use less memory and competitively fast. BWA-long is the best in market in speed, accuracy and memory.
19
Questions ?????
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.