Aligning Reads
Ramesh Hariharan
Strand Life Sciences, IISc
What is Read Alignment?
Subject’s Genome: AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC
Reference Genome: AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
The Reference Genome is close to, but not quite the same as, the Subject’s Genome
Where do these match in the Reference?
What does “Match” mean?
Reference Genome: AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Exact Match: GCTACGCA
With Mismatches: CATAAAGAC
With Gaps: CACTT_AGT
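The first two kinds of match can be illustrated with a minimal Python sketch. It slides the read along the reference and counts substitutions at each offset; gaps are not handled here. The function names `hamming` and `find_with_mismatches` are hypothetical, chosen only for this illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(reference, read, max_mismatches=0):
    """Return every start offset where `read` matches `reference`
    with at most `max_mismatches` substitutions (no gaps)."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(window, read) <= max_mismatches:
            hits.append(start)
    return hits


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"

# Exact match: zero mismatches allowed.
exact_hits = find_with_mismatches(REFERENCE, "GCTACGCA")

# The same read-scanning loop, now tolerating one substitution.
strict_hits = find_with_mismatches(REFERENCE, "CATAAAGAC")
relaxed_hits = find_with_mismatches(REFERENCE, "CATAAAGAC", max_mismatches=1)
```

The read CATAAAGAC has no exact occurrence, but allowing one mismatch places it on CATAATGAC in the reference.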
Why mismatches and gaps?
The subject genome could be different from the reference
[Figure: reads aligned against the reference genome, showing a SNP and a deletion; these appear as mismatches and gaps]
The reading process could be erroneous
How many mismatches and gaps?
Short reads (~50 bases): few mismatches and gaps
Long reads (~1000 bases): many more mismatches and gaps
How do aligners fare?
BWA: very few mismatches and gaps
CoBWeb
BWA-SW: many mismatches and gaps
BowTie: only mismatches, no gaps
No paired read handling
No handling of adaptor trimming for small RNA
Separate handling for RNASeq
BowTie2
How does an Aligner work?
For simplicity, assume Exact Match
For each read, scan the entire reference genome sequence: SLOW!
Index the Reference
[Figure: a tree index built over the reference, shown with the example sequence CGACG]
How can we find Exact Matches of a read quickly with this index?
[Figure: the tree index over the reference, shown with the example sequence CGACG and the path labels CG and C marked]
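One way to picture the lookup is a minimal Python sketch of an uncompressed suffix trie. This is an assumption made for illustration: the slides do not say exactly which tree structure is used, and the names `build_suffix_trie`, `find_exact` and the `"starts"` key are hypothetical. Finding a read takes one step per base of the read, independent of the reference length.

```python
def build_suffix_trie(reference):
    """Insert every suffix of `reference` into a nested-dict trie.

    Each node keeps, under the key "starts", the start offsets of all
    suffixes that pass through it. Child nodes are keyed by single bases,
    so the multi-character key "starts" cannot collide with them.
    """
    root = {"starts": []}
    for start in range(len(reference)):
        node = root
        for base in reference[start:]:
            node = node.setdefault(base, {"starts": []})
            node["starts"].append(start)
    return root


def find_exact(trie, read):
    """Walk down from the root, one base at a time; the node reached
    lists every offset at which `read` occurs exactly."""
    node = trie
    for base in read:
        if base not in node:
            return []
        node = node[base]
    return node["starts"]


trie = build_suffix_trie("CGACG")
cg_hits = find_exact(trie, "CG")
gt_hits = find_exact(trie, "GT")
```

This naive trie stores every suffix in full, so its memory grows quadratically with the reference; real suffix trees are far more compact, but still large for a whole genome.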
The problem: the index takes 24GB
Can this structure be compressed?
The Burrows-Wheeler based Index
The Reference: CGAC$
All its circular shifts, sorted lexicographically:
AC$CG
CGAC$
C$CGA
GAC$C
$CGAC
The last column of the sorted shifts is the BWT
The Index: now an array instead of a tree
Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
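The idea can be sketched in a few lines of Python, under stated assumptions. The names `bwt_index` and `backward_search` are hypothetical. The sentinel `$` sorts before the letters here (Python's default string order), which may differ from the ordering drawn on the slide. Occurrence counts are recomputed by scanning the BWT, whereas a practical index would store sampled counts and a sampled suffix array to trade speed for memory.

```python
def bwt_index(text):
    """Return (suffix array, BWT string) for `text` plus a '$' sentinel."""
    text += "$"
    # Sorting suffixes is equivalent to sorting circular shifts,
    # because the sentinel occurs exactly once.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character just before each sorted suffix
    # (text[-1] is the sentinel, used when the suffix starts at 0).
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def backward_search(sa, bwt, pattern):
    """Exact-match `pattern` by processing it right to left."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first = {}  # first[c] = number of characters smaller than c
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt[:i]; unsampled, hence slow.
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)  # current range of rows in the sorted list
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt = bwt_index("CGAC")
c_hits = backward_search(sa, bwt, "C")
ac_hits = backward_search(sa, bwt, "AC")
```

The whole index is two flat arrays (`sa` and `bwt`), with no tree pointers to store.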
How about Mismatches and Gaps?
BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure
CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed
This 15-mer occurs at locations x1, x2…
This 15-mer occurs at locations x3, x4…
This whole 30-mer occurs at location x5
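The seeding step can be sketched as follows. This is an illustration under assumptions, not CoBWeb's implementation: a plain k-mer dictionary stands in for the BW Index, the seed length is 5 rather than 15 so that the toy reference is long enough, and the names `build_kmer_index` and `candidate_locations` are hypothetical. Each seed hit votes for the read start it implies; locations with votes would then be passed on to Smith-Waterman.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the offsets where it occurs.
    (A stand-in for the BW Index, used only to keep the sketch short.)"""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_locations(index, read, k):
    """Cut the read into non-overlapping k-mer seeds, look each one up,
    and count how many seeds agree on each implied read start."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return dict(votes)


REFERENCE = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
READ = "CCCATAAAGACCCAC"  # taken from the subject genome; one base differs

index = build_kmer_index(REFERENCE, 5)
votes = candidate_locations(index, READ, 5)
```

The middle seed contains the differing base and finds nothing, but the two outer seeds both point to the same start in the reference.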
Dynamic Programming
Given a location in the reference with a read anchor, how well does the read match here?
[Figure: the read placed against the reference at a 14-mer anchor]
Smith-Waterman (optimized for large gaps)
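A textbook Smith-Waterman scoring pass is sketched below in Python. It is not the large-gap-optimized variant mentioned on the slide: it uses a simple linear gap penalty, and the scores (match +2, mismatch -1, gap -2) are assumptions chosen for illustration. In practice it would be run only on the reference window around an anchor, not on the whole genome.

```python
def smith_waterman(reference, read, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (read_end, reference_end)).

    score[i][j] is the best score of any local alignment ending at
    read[i-1] and reference[j-1]; a score never drops below zero, which
    is what makes the alignment local.
    """
    rows, cols = len(read) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = read[i - 1] == reference[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            gap_in_reference = score[i - 1][j] + gap  # read base unpaired
            gap_in_read = score[i][j - 1] + gap       # reference base unpaired
            score[i][j] = max(0, diagonal, gap_in_reference, gap_in_read)
            if score[i][j] > best:
                best, best_end = score[i][j], (i, j)
    return best, best_end


# Reference window CATAATGAC against the read CATAAAGAC (one substitution).
window_score, window_end = smith_waterman("CATAATGAC", "CATAAAGAC")
```

With these scores, eight matching bases and one mismatch give 8 × 2 − 1 = 15, which is the best local score for this pair.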
Comparison with BWA
Read Length 50
Read Length
% faster than BWA with comparable results
CoBWeb: 3 mismatches and 2 gaps
BWA: 2 mismatches + 1 gap, possibly spanning multiple bases
Comparison with BWA-SW
Read Length
mismatches plus 10 gaps
Reads: 1m

              CoBWeb    BWA-SW
Time taken    1130s     2242s

Incorrectly Mapped
mapped incorrectly by BWA-SW
The remainder has poor BWA mapping quality
Avadis NGS
Alignment, DNA Var Detection, RNASeq, ChIPSeq, Small RNASeq
Thank You