Short Read Mapper
Evan Zhen
CS 124

Introduction
Find a short sequence in a very long DNA sequence
Motivation
  – It is easy to sequence everyone’s genome, but finding where a particular short subsequence is located is not an easy task.

Problem
Treat it as a standard string search problem, except that the strings contain only the characters A, T, C, G
  – Given a substring S of length L and a very large reference string R, find all positions in R where S occurs with at most D mismatches across its L characters

Solution 1 – simple search
Match substring S against reference string R, character by character
  – On a mismatch, ignore the character in R and keep comparing
  – If all characters match within D mismatches, return the position in R
  – Example:
    R - GGCTACCTTTTAACGATC
    S - TACCTTTT
    Match at the start of R? No
    Match at position 3 of R (counting from 0)? Yes
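A minimal sketch of a simple search, written in Python for brevity rather than the Java used in the project. It assumes that a "mismatch" means a substituted character, as in the problem statement, and slides S along R one position at a time. It does not reproduce the slide's rule of skipping a character in R on a mismatch. The names `count_mismatches` and `simple_search` are illustrative, not taken from the original implementation.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def simple_search(reference, read, max_mismatches=0):
    """Return every start position in reference where read aligns
    with at most max_mismatches substituted characters."""
    hits = []
    read_len = len(read)
    # Try every alignment of the read against the reference.
    for start in range(len(reference) - read_len + 1):
        window = reference[start:start + read_len]
        if count_mismatches(window, read) <= max_mismatches:
            hits.append(start)
    return hits


R = "GGCTACCTTTTAACGATC"
print(simple_search(R, "TACCTTTT"))       # exact search (D = 0)
print(simple_search(R, "TACGTTTT", 1))    # one substitution allowed (D = 1)
```

Each search costs on the order of length(R) × L character comparisons, which is consistent with the "can potentially be very slow" concern raised in the comparison.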

Solution 2 – map + index
Map the reference string R into an index.
Let p = partition string, where length(p) < L.
For every position in R, store the index position and the string p that starts at that index.
Using this map, searching for S becomes just a lookup
To compensate for mismatches, change characters in S and search again
  – Example: instead of searching “AAT”, search “GAT”
Purpose
  – Allow multiple searches using the same map, so there is no need to process R multiple times.
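The "change characters in S and search again" step can be sketched as generating every string that differs from S by at most D substitutions. This is an assumption about how the variants might be enumerated; the slides do not say how the original Java code did it, and `substitution_variants` is a hypothetical name.

```python
from itertools import combinations, product

ALPHABET = "ATCG"


def substitution_variants(read, max_mismatches):
    """Return the set of strings that differ from read in at most
    max_mismatches positions (substitutions only)."""
    variants = {read}
    for k in range(1, max_mismatches + 1):
        # Choose which k positions to change.
        for positions in combinations(range(len(read)), k):
            # At each chosen position, try every base except the original one.
            choices = [[b for b in ALPHABET if b != read[i]] for i in positions]
            for replacement in product(*choices):
                chars = list(read)
                for i, base in zip(positions, replacement):
                    chars[i] = base
                variants.add("".join(chars))
    return variants


variants = substitution_variants("AAT", 1)
print("GAT" in variants, len(variants))
```

The number of variants grows quickly with the length of S and with D, so each extra allowed mismatch multiplies the number of lookups.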

Solution 2 – map + index
Structure of map: hashtable
  – Key = partition string p
  – Value = list of all positions of p
Building the map
  – Read R, character by character
  – At each position, store the index under the string p that starts there
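Building the map as described above might look like the following sketch. It uses a Python dictionary in place of a Java hashtable; `build_index` and its parameter `k` (the partition length) are illustrative names.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every length-k substring of reference to the list of
    positions (0-based) at which it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


index = build_index("GGCTACCTTTTAACGATC", 5)
print(index["GGCTA"], index["GCTAC"], index["CTACC"])
```

The map holds one entry per position in R, which is where the memory cost of this approach comes from.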

Solution 2 – map + index
Example
  – R - GGCTACCTTTTAACGATC
  – p: length = 5

    Key      Value (positions)
    GGCTA    0
    GCTAC    1
    CTACC    2
    ...

  – S - CTACCTTTTA
To find S, break S into partitions of length(p). Look up each partition and check that the positions found are consistent with one another: each partition must start length(p) characters after the previous one.
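The lookup and position check can be sketched as follows for exact matches only (no mismatch handling). The helper names are hypothetical, and the sketch assumes the length of S is a multiple of the partition length.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every length-k substring of reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def index_search(index, read, k):
    """Return the start positions of exact occurrences of read,
    using an index built with partition length k."""
    if len(read) == 0 or len(read) % k != 0:
        raise ValueError("read length must be a positive multiple of k")
    parts = [read[i:i + k] for i in range(0, len(read), k)]
    # Candidate starts are the positions of the first partition.
    candidates = set(index.get(parts[0], []))
    for n, part in enumerate(parts[1:], start=1):
        positions = set(index.get(part, []))
        # Partition n must start exactly n * k characters after the candidate.
        candidates = {c for c in candidates if c + n * k in positions}
    return sorted(candidates)


R = "GGCTACCTTTTAACGATC"
idx = build_index(R, 5)
print(index_search(idx, "CTACCTTTTA", 5))
```

Here "CTACC" is found at position 2 and "TTTTA" at position 7, which is exactly 5 characters later, so S is reported at position 2.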

Comparison
Simple Search
  – Pro: Easy to implement
  – Con: Can potentially be very slow
Map + Index
  – Pro: Faster than simple search when performing multiple searches against the same R
  – Con: Hard to implement; can potentially require a lot of memory for storing the indexes

Why do these solutions work?
Both search for a subsequence in a larger sequence
Both handle mismatches
  – Simple search: ignores characters in R (i.e. handles “insertion” types)
  – Map + index: since the map is partitioned, insertion types are hard to detect, so it adjusts the subsequence instead (i.e. handles “mutation” types)

Implementation
Nothing hard-coded
  – Constants such as the required length of S, the partition length, and the maximum number of mismatches can easily be changed
Used Java
  – Later realized it was a bad idea, but by then it was too late to rewrite in a different language
Simple search
  – Easily implemented
Map + index
  – Map easily handled with a hashtable
  – Accounting for mismatches was challenging

Analysis
Used Nick’s simulator to generate a reference sequence of length 1 million
  – Generating map: ~15 sec
  – Simple search: ~0.2 sec
  – Index search: ~0.22 sec
Odd that the index search is slower
  – Possible reason: the way I handle mismatches

Conclusion
Limitations
  – Subsequence S must be divisible into equal-sized partitions for the map
  – Because it’s written in Java, possible memory limitations
To-Do
  – Find a better way to handle mismatches in the index search
Future work
  – Different algorithms
  – Rewrite in a different language

Thank You
