Published by Georgiana Lyons. Modified over 9 years ago.
1
Short Read Mapper
Evan Zhen
CS 124
2
Introduction
Find a short sequence in a very long DNA sequence.
Motivation
– It is easy to sequence a person’s genome, but finding where a particular short subsequence is located in it is not.
3
Problem
Treat it as a standard string search problem, except that the strings contain only the characters A, T, C, and G.
– Given a substring S of length L and a very large reference string R, find all positions in R where S occurs with at most D mismatches within the L-character region.
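The problem statement above can be written down directly as a brute-force check. This is a minimal sketch that assumes a "mismatch" means a differing character at the same offset; the names `count_mismatches` and `find_positions` are illustrative and do not come from the slides.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of offsets at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_positions(R: str, S: str, D: int) -> list:
    """All 0-based start positions in R where S fits with at most D mismatches."""
    L = len(S)
    return [i for i in range(len(R) - L + 1)
            if count_mismatches(R[i:i + L], S) <= D]


R = "GGCTACCTTTTAACGATC"
print(find_positions(R, "TACCTTTT", 0))  # exact occurrence -> [3]
print(find_positions(R, "TACCATTT", 1))  # one differing character allowed -> [3]
```

This version touches every window of R, so it costs roughly len(R) × L character comparisons per query.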
4
Solution 1 – simple search
Match substring S against reference string R, character by character.
– On a mismatch, skip the character in R and keep comparing.
– If all characters of S match with at most D mismatches, return the position in R.
– Example:
  R - GGCTACCTTTTAACGATC
  S - TACCTTTT
  Match at the current alignment? No
7
Solution 1 – simple search (example, continued)
  R - GGCTACCTTTTAACGATC
  S - TACCTTTT
  Match? Yes: S occurs in R starting at position 3 (counting from 0).
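One way to read the "skip the character in R" rule is sketched below. It is an interpretation of the slides, not the author's Java code: each skipped character of R counts as one mismatch, and the sketch additionally assumes the first character of S must match so that the reported position is where S actually begins. `matches_at` and `simple_search` are hypothetical names.

```python
def matches_at(R: str, S: str, start: int, D: int) -> bool:
    """True if S can be matched from R[start], skipping at most D characters of R."""
    if R[start] != S[0]:          # assumption: S must begin exactly at `start`
        return False
    r, s, skipped = start, 0, 0
    while s < len(S):
        if r >= len(R):           # ran off the end of R before finishing S
            return False
        if R[r] == S[s]:
            r += 1
            s += 1
        else:
            skipped += 1          # treat the extra character in R as an insertion
            if skipped > D:
                return False
            r += 1
    return True


def simple_search(R: str, S: str, D: int) -> list:
    """All 0-based positions in R where S matches under the skipping rule."""
    return [i for i in range(len(R)) if matches_at(R, S, i, D)]


R = "GGCTACCTTTTAACGATC"
print(simple_search(R, "TACCTTTT", 0))  # exact match -> [3]
print(simple_search(R, "TACTTTT", 1))   # R has one extra C -> [3]
```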
8
Solution 2 – map + index
Map the reference string R into an index.
– Let p be a partition string, where length(p) < L.
– For every position in R, store the string p that starts at that position together with the position.
– With this map, searching for S becomes a lookup.
– To allow for mismatches, change characters in S and search again. Example: instead of searching for “AAT”, search for “GAT”.
Purpose
– The same map serves multiple searches, so R does not need to be processed more than once.
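The "change characters in S and search again" step can be sketched as generating every string within a given number of substitutions of a partition. This is one possible approach, not necessarily how the original Java code did it; `variants` and `ALPHABET` are illustrative names.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def variants(p: str, d: int) -> set:
    """All strings over ACGT that differ from p in at most d positions."""
    out = {p}
    for n_changes in range(1, d + 1):
        for positions in combinations(range(len(p)), n_changes):
            # For each chosen position, try every character except the original.
            choices = [[c for c in ALPHABET if c != p[i]] for i in positions]
            for replacement in product(*choices):
                chars = list(p)
                for i, c in zip(positions, replacement):
                    chars[i] = c
                out.add("".join(chars))
    return out


print("GAT" in variants("AAT", 1))  # True
print(len(variants("AAT", 1)))      # 10: the original plus 3 positions x 3 alternatives
```

The number of variants grows quickly with the partition length and with d, so each mismatch-tolerant query turns into many lookups.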
9
Solution 2 – map + index
Structure of the map: hashtable
– Key = partition string p
– Value = list of all positions of p in R
Building the map
– Read R character by character
– At each position, store the index of the partition p that starts there
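A minimal sketch of building such a map, using a Python dictionary in place of the Java hashtable. The function name `build_index` and the parameter `k` (the partition length) are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(R: str, k: int) -> dict:
    """Map every length-k substring of R to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(R) - k + 1):
        index[R[i:i + k]].append(i)
    return dict(index)


index = build_index("GGCTACCTTTTAACGATC", 5)
print(index["GGCTA"], index["GCTAC"], index["CTACC"])  # [0] [1] [2]
```

The map stores one position per window of R, so a reference of length n yields n − k + 1 stored positions.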
13
Solution 2 – map + index
Example
– R - GGCTACCTTTTAACGATC
– p – length = 5

Key (p)   Value (positions)
GGCTA     0
GCTAC     1
CTACC     2
14
Solution 2 – map + index
Example
– R - GGCTACCTTTTAACGATC
– p – length = 5

Key       Value (positions)
GGCTA     0
GCTAC     1
CTACC     2
...

S - CTACCTTTTA
To find S, break S into partitions of length(p). Look up each partition and check that the positions are consistent with one another: each partition must be found length(p) after the previous one.
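The lookup described above can be sketched as follows for the exact-match case. This is an illustration under stated assumptions, not the original implementation: `build_index` and `index_search` are hypothetical names, and the sketch rejects any S whose length is not a multiple of the partition length `k`.

```python
from collections import defaultdict


def build_index(R: str, k: int) -> dict:
    """Map every length-k substring of R to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(R) - k + 1):
        index[R[i:i + k]].append(i)
    return index


def index_search(index: dict, S: str, k: int) -> list:
    """Start positions of S in R, found by looking up S's length-k partitions."""
    if not S or len(S) % k != 0:
        raise ValueError("len(S) must be a positive multiple of the partition length")
    parts = [S[j:j + k] for j in range(0, len(S), k)]
    hits = []
    for start in index.get(parts[0], []):
        # Partition n must appear exactly n * k characters after the first one.
        # Membership tests on lists are linear; sets would be faster for long lists.
        if all(start + n * k in index.get(part, [])
               for n, part in enumerate(parts)):
            hits.append(start)
    return hits


index = build_index("GGCTACCTTTTAACGATC", 5)
print(index_search(index, "CTACCTTTTA", 5))  # CTACC at 2, TTTTA at 7 -> [2]
```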
15
Comparison
Simple search
– Pro: easy to implement
– Con: can potentially be very slow
Map + index
– Pro: faster than simple search when performing multiple searches on the same R
– Con: hard to implement
– Con: can potentially require a lot of memory to store the index
16
Why do these solutions work?
Because both search for a subsequence in a larger sequence.
Both handle mismatches
– Simple search: skips characters in R (i.e. handles “insertion” types)
– Map + index: because the map is partitioned, insertions are hard to detect, so the subsequence itself is altered instead (i.e. handles “mutation” types)
17
Implementation
Nothing hard-coded
– Constants such as the required length of S, the partition length, and the maximum number of mismatches can easily be changed
Used Java
– Later realized it was a bad idea, but by then it was too late to rewrite in a different language
Simple search
– Easily implemented
Map + index
– Map easily handled with a hashtable
– Accounting for mismatches was challenging
18
Analysis
Used Nick’s simulator to generate a reference sequence 1 million characters long
– Generating the map: ~15 sec
– Simple search: ~0.2 sec
– Index search: ~0.22 sec
Odd that the index search is slower
– Possible reason: the way I handle mismatches
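A comparison of this kind could be set up roughly as below. This is only a sketch: a random sequence stands in for the simulator's output, both searches are exact-match stand-ins rather than the mismatch-tolerant versions, and all names and sizes are assumptions. It prints whatever timings the local machine produces and makes no claim about reproducing the numbers above.

```python
import random
import time


def build_index(R, part_len):
    """Map each length-part_len substring of R to its start positions."""
    index = {}
    for i in range(len(R) - part_len + 1):
        index.setdefault(R[i:i + part_len], []).append(i)
    return index


def scan_exact(R, S):
    """Simple-search stand-in: compare S against every window of R."""
    L = len(S)
    return [i for i in range(len(R) - L + 1) if R[i:i + L] == S]


def index_exact(index, S, part_len):
    """Index-search stand-in: look up each partition of S and check the offsets."""
    parts = [S[j:j + part_len] for j in range(0, len(S), part_len)]
    later = [set(index.get(p, ())) for p in parts[1:]]
    return [start for start in index.get(parts[0], ())
            if all(start + (n + 1) * part_len in pos
                   for n, pos in enumerate(later))]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


random.seed(0)
n, part_len = 100_000, 10
R = "".join(random.choices("ACGT", k=n))
S = R[5000:5030]  # a read known to occur in R; its length is a multiple of part_len

index, t_build = timed(build_index, R, part_len)
hits_scan, t_scan = timed(scan_exact, R, S)
hits_index, t_index = timed(index_exact, index, S, part_len)
print(f"build {t_build:.3f}s  scan {t_scan:.3f}s  index {t_index:.4f}s")
```

Timing the build separately from the queries matters here, because the map is only worthwhile when its one-time cost is spread over many searches.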
19
Conclusion
Limitations
– The subsequence S must divide evenly into partitions of the size used for the map
– Because it is written in Java, memory limitations are possible
To-do
– Find a better way to handle mismatches in the index search
Future work
– Different algorithms
– Rewrite in a different language
20
Thank You