1
DNA Sequencing By Dan Massa
2
Problem Description
3
Overview: Sequencing DNA to map the genomes of humans and other species became a goal of scientists at the turn of the 21st century. Using reference DNA and subject DNA, doctors and scientists can research diseases, disorders, and other deficiencies that can be seen through the analysis of DNA. Once Frederick Sanger developed a way to sequence DNA fragments and the information could be stored digitally, the goal became to build the superstring of base pairs that most accurately depicts a genome in the least amount of time possible.
4
The Need for Computers:
These DNA fragments are represented as strings that correspond to one side of each base pair, leaving an alphabet of {G, A, C, T}. The position of each bond in the structure is important, but because T-A and G-C are the only two pairings that exist, only one letter per pair needs to be stored to give an accurate depiction of the genome. The base pairs come in fragments called reads, which can be between 30bp and 50,000bp long, with a mean size typically between 500 and 1000bp, and which need to be assembled into a superstring.
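As a small illustration of why one letter per base pair is enough, the sketch below rebuilds the opposite strand from the stored one using only the T-A and G-C pairing rule. The function name `complement` is an illustrative choice, not something from the slides.

```python
# Pairing rule from the slide: T bonds with A, G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for every base in the stored strand."""
    return "".join(COMPLEMENT[base] for base in strand)


stored = "GACCCGC"
print(stored)              # the one side that is stored
print(complement(stored))  # the other side, derived on demand: CTGGGCG
```

Because the mapping is its own inverse, applying `complement` twice returns the stored strand, so no information is lost by keeping only one side.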
5
The Need For Parallel Algorithms
6
String Assembly is an NP-Complete Problem
String assembly is an NP-complete problem whose cost scales with the size of the completed superstring as well as the number of substrings. The completed superstring for a human being is approximately 3.5 billion base pairs, and a sequencing analysis typically has around million substrings. This demands approximately 1,000 CPU hours (about 41 days on a single CPU) per genome, so parallelization is necessary for timely diagnosis.
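The arithmetic behind these figures can be sketched as follows. The 1,000 CPU-hour figure is from the slide. The worker counts and the assumption of ideal linear speedup are illustrative only, and real assemblers fall short of that ideal because of communication overhead.

```python
def wall_clock_days(cpu_hours: float, workers: int) -> float:
    """Elapsed days if the work divides evenly across workers (idealized)."""
    return cpu_hours / workers / 24


def pairwise_comparisons(n_fragments: int) -> int:
    """Number of unordered fragment pairs an all-against-all comparison visits."""
    return n_fragments * (n_fragments - 1) // 2


for workers in (1, 8, 64, 512):  # hypothetical cluster sizes
    print(f"{workers:>4} workers: {wall_clock_days(1000, workers):6.2f} days")

# The pair count grows quadratically with the number of fragments.
for n in (14, 1_000, 1_000_000):
    print(f"{n:>9} fragments: {pairwise_comparisons(n):,} pairs")
```

The quadratic growth of the pair count is one reason the number of substrings matters as much as the length of the finished superstring.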
7
Cost Analysis
While the major drops in cost are attributed to medical equipment advancements, much of the price reduction from onward is attributed to advancements in sequencing tech and parallel processing, starting with CloudBurst and ABySS in 2009.
8
The Generic Algorithm
9
Generic Algorithm (Adapted from TIGR)
1. Calculate pairwise alignments of all fragments.
2. Choose the two fragments with the largest overlap.
3. Merge the chosen fragments.
4. Repeat steps 2 and 3 until only one fragment is left.
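A minimal Python sketch of the generic algorithm, assuming exact suffix-prefix matches stand in for the pairwise alignments and that the input is a non-empty list of error-free fragments. The function names are illustrative and this is not the TIGR implementation. It recomputes every overlap on each round, which keeps the code short at the cost of speed.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(frags):
    """Remove duplicates and fragments that sit entirely inside another one."""
    unique = list(dict.fromkeys(frags))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def greedy_assemble(fragments):
    """Repeatedly merge the pair with the largest overlap until one string is left."""
    frags = drop_contained(fragments)
    while len(frags) > 1:
        # Steps 1 and 2: score every ordered pair and pick the largest overlap.
        a, b = max(permutations(frags, 2), key=lambda pair: overlap(*pair))
        # Step 3: merge, writing the shared region only once.
        merged = a + b[overlap(a, b):]
        # Step 4: put the merged fragment back and repeat.
        frags = drop_contained([f for f in frags if f not in (a, b)] + [merged])
    return frags[0]


print(greedy_assemble(["ABCD", "CDEF", "EFGH"]))
```

The result always contains every input fragment, but a greedy choice is not guaranteed to give the shortest possible superstring.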
10
Example Test Case
Suppose you are trying to build this superstring:
GACCCGCTAGGGCCTTCCGCAGGTCAGTTCA
From the substrings:
CTAGGGCCT, TCCGCAGGT, GCCTTCC, GCTAGGG, GACCCGC, CCGCTAG, AGGGCC, CGCAG, CAGGT, TCAGT, TTCA, AGTT, CCTT, CCCG
Usually the superstring is not known in advance, but for the sake of the exercise, we need a goal for the end comparison. (Note: in practical applications, the substrings must also be analyzed backwards.)
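The sketch below checks where each substring sits in the goal superstring and prints the tiling. It also handles the "backwards" case under the assumption that this means the reverse complement, since a fragment may be read from either strand. The helper names `reverse_complement` and `locate` are illustrative.

```python
TARGET = "GACCCGCTAGGGCCTTCCGCAGGTCAGTTCA"
FRAGMENTS = [
    "CTAGGGCCT", "TCCGCAGGT", "GCCTTCC", "GCTAGGG", "GACCCGC", "CCGCTAG",
    "AGGGCC", "CGCAG", "CAGGT", "TCAGT", "TTCA", "AGTT", "CCTT", "CCCG",
]

_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Swap each base for its partner, then read the result right to left."""
    return seq.translate(_PAIRING)[::-1]


def locate(fragment: str, target: str):
    """Return (offset, orientation) of the first match, or None if absent."""
    pos = target.find(fragment)
    if pos != -1:
        return pos, "forward"
    pos = target.find(reverse_complement(fragment))
    if pos != -1:
        return pos, "reverse"
    return None


# Print each fragment under the position it covers in the goal string.
print(TARGET)
for frag in sorted(FRAGMENTS, key=lambda f: locate(f, TARGET)[0]):
    offset, _ = locate(frag, TARGET)
    print(" " * offset + frag)
```

All fourteen substrings on the slide match in the forward direction, so the reverse branch only matters for fragments read from the opposite strand.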
11
C T A G
12
Adjusting the Algorithm for Parallel Use
There are two ways to distribute the work.
1. Start with the longest substrings that are independent (no overlap) or nearly independent and distribute the algorithm across multiple threads, so long as the minimum superstring from each thread is compared across all threads.
2. Use one primary superstring and use the distributed processing power to find the next substring with the most overlap. This approach, also called de novo sequencing, avoids the issue of communicating threads not having the most up-to-date string when making the comparison.
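A rough sketch of the second approach, in which a single primary superstring is kept and the overlap search for the next substring is farmed out to a pool of workers. The names `best_join` and `assemble_parallel` are hypothetical. Python threads are used only to show the structure: CPython threads do not speed up CPU-bound string work, so a real implementation would use processes or MPI.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def best_join(superstring: str, fragment: str):
    """Return (overlap, merged) for the best way to attach fragment to either end."""
    if fragment in superstring:
        return len(fragment), superstring  # already covered, nothing to add
    best = (0, superstring + fragment)     # fallback: append with no overlap
    limit = min(len(superstring), len(fragment))
    for k in range(limit, 0, -1):          # fragment extends the right end
        if superstring.endswith(fragment[:k]):
            best = (k, superstring + fragment[k:])
            break
    for k in range(limit, 0, -1):          # fragment extends the left end
        if superstring.startswith(fragment[-k:]):
            if k > best[0]:
                best = (k, fragment[:-k] + superstring)
            break
    return best


def assemble_parallel(fragments, workers: int = 4) -> str:
    """Grow one primary superstring; workers score the candidates each round."""
    remaining = sorted(set(fragments), key=lambda f: (-len(f), f))
    superstring = remaining.pop(0)         # seed with the longest fragment
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            scores = list(pool.map(partial(best_join, superstring), remaining))
            winner = max(range(len(scores)), key=lambda i: scores[i][0])
            superstring = scores[winner][1]
            remaining.pop(winner)
    return superstring


print(assemble_parallel(["ABCD", "CDEF", "EFGH"]))
```

Every worker scores against the same superstring within a round, and the superstring changes only between rounds, which is how this layout sidesteps the stale-string problem.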
13
Error Checking and the Bowtie Algorithm
14
Error Checking
Contaminated samples, misreads, and other issues can arise during the medical aspect of sequencing, leading to imperfect data being fed into the sequence assembler. This means our substring comparison algorithm must account for the possibility of inaccurate base pairs.
15
Bowtie Algorithm Pseudocode
while (number of strings > 1) {
    for each substring {
        find the longest overlap with the superstring that occurs in only one place
        handle errors from incorrect base pairs
    }
    choose the substring with the longest overlap
    concatenate the chosen substring with the superstring
}
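One way to read the pseudocode above is sketched below in Python. It follows the slide's loop and is not a description of how the Bowtie tool itself works. The error handling is assumed to mean tolerating a small number of mismatched bases inside the overlap. The parameters `max_mismatches` and `min_len` are illustrative, and for brevity the sketch only extends the right end of the superstring and omits the uniqueness check.

```python
def tolerant_overlap(left: str, right: str, max_mismatches: int = 1, min_len: int = 3) -> int:
    """Longest k where the last k bases of left match the first k of right,
    allowing up to max_mismatches differing positions."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-k:], right[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def extend(superstring: str, fragments, max_mismatches: int = 1):
    """Append the best-overlapping fragment until none overlaps any more.
    Returns the grown superstring and the fragments that could not be placed."""
    remaining = list(fragments)
    while remaining:
        scored = [(tolerant_overlap(superstring, f, max_mismatches), f) for f in remaining]
        k, best = max(scored, key=lambda item: item[0])
        if k == 0:
            break  # nothing left overlaps the current right end
        superstring += best[k:]  # keep the superstring's bases in the overlap
        remaining.remove(best)
    return superstring, remaining


# "CCGATAG" is "CCGCTAG" with one misread base (C read as A).
print(extend("GACCCGC", ["CCGATAG"], max_mismatches=1))
print(extend("GACCCGC", ["CCGATAG"], max_mismatches=0))
```

The `min_len` floor matters once mismatches are allowed, because very short overlaps match by chance far more often.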
20
Successful Implementations
21
ABySS
Assembly By Short Sequences (ABySS) is a commonly used, open-source de novo short-read assembler. It is written in C++ and uses MPI for communication between nodes. Its adaptations from TIGR include ignoring base pair sequences shorter than 100 characters, because they have a high probability of not contributing to the superstring while still taking up computational time. It also distributes the subsequences across multiple storage units by assigning each base a numeric value {0, 1, 2, 3} and generating a hash to uniquely index the subsequences.
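The numeric encoding idea can be sketched in Python as below. The particular mapping (A=0, C=1, G=2, T=3), the base-4 packing, and the modulo rule for choosing a node are assumptions made for illustration. They are not taken from the ABySS source code, which is written in C++ and uses its own hash.

```python
# Assumed mapping; any fixed assignment of {0, 1, 2, 3} to the four bases works.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> int:
    """Pack a sequence into one integer by reading it as a base-4 number."""
    value = 0
    for base in seq:
        value = value * 4 + BASE_VALUE[base]
    return value


def owner(seq: str, num_nodes: int) -> int:
    """Pick the node that stores seq (a stand-in for a real hash function)."""
    return encode(seq) % num_nodes


for seq in ("ACGT", "GACC", "TTCA"):
    print(seq, encode(seq), "-> node", owner(seq, num_nodes=4))
```

The packed value is unique only among sequences of the same length, because leading A's encode as zero. Any node can compute `owner` locally, so finding where a subsequence lives does not require asking the other nodes.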
22
Implementation Notes/Benchmarks
ABySS Benchmarks (vs other assemblers):
23
Future Research
24
Other De Novo Short Read Sequencers
CloudBurst, SPAdes, Velvet, and SOAPdenovo (and many more) are useful sequencers that have individual pros and cons and unique approaches to problem solving, implementation, error handling, etc. Each of them could be looked into to make this project more robust.
25
Basic de novo assembler implementation
Basic algorithms exist that are not very sophisticated but are relatively easy to implement for benchmark testing on MST’s cluster system. This approach may be taken to understand the inner workings of the algorithms in depth. To understand the state of the medical field more deeply, research can be conducted into relevant implementations and their associated time and cost analysis.
26
References
27
References
Gallant, John, David Maier, and James A. Storer. "On Finding Minimal Length Superstrings." Journal of Computer and System Sciences 20.1 (1980): Web.
Schatz, M. C. "CloudBurst: Highly Sensitive Read Mapping with MapReduce." Bioinformatics (2009): Web.
Simpson, J. T., K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol. "ABySS: A Parallel Assembler for Short Read Sequence Data." Genome Research 19.6 (2009): Web.
Sutton, Granger G., Owen White, Mark D. Adams, and Anthony R. Kerlavage. "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects." Genome Science and Technology 1.1 (1995): Web.
28
References cont'd
Schatz, Michael C. "High Performance Computing for DNA Sequence Alignment and Assembly." Cold Spring Harbor Laboratory. Stone Ridge Technology, 18 May Web. 29 Sept