DNA Sequencing By Dan Massa.

Problem Description

Overview: Sequencing DNA to map the genomes of humans and other species became a goal of scientists at the turn of the 21st century. Using reference DNA and subject DNA, doctors and scientists can research diseases, disorders, and other deficiencies that can be seen through the analysis of DNA. Once Frederick Sanger developed a way to sequence DNA fragments, and the results could be stored digitally, the goal became to build the superstring of bases that most accurately depicts a genome in the least amount of time possible.

The Need for Computers:
These DNA fragments are represented as strings that correspond to one strand of the double helix, over the alphabet {G, A, C, T}. Because A pairs only with T and G pairs only with C, the base on the opposite strand is always implied, so only one letter per base pair needs to be stored to give an accurate depiction of the genome. The base pairs arrive in fragments called reads (overlapping reads are assembled into longer contigs), which can be between 30 bp and 50,000 bp long, with a mean size typically between 500 bp and 1,000 bp, and which need to be assembled into a superstring.
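Because only one strand is stored, the opposite strand can always be recomputed from the pairing rule. The sketch below illustrates that idea; the function name `reverse_complement` is a hypothetical helper chosen for this example, not something from the presentation.

```python
# Pairing rule: A <-> T and G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    # Complement every base, then reverse, because the two strands
    # run in opposite directions.
    return strand.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    fragment = "GACCCGC"
    print(fragment, "->", reverse_complement(fragment))
```

Applying the function twice returns the original string, which is why storing one letter per base pair loses no information.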

The Need For Parallel Algorithms

String Assembly is an NP-Complete Problem
Finding the shortest common superstring is NP-hard (its decision version is NP-complete), and the cost scales with the size of the completed superstring as well as the number of substrings. The completed superstring for a human being is a little over 3 billion base pairs, and a sequencing analysis typically involves on the order of millions of substrings. This demands approximately 1,000 CPU hours (about 41 days on a single CPU) per genome, so parallelization is necessary for timely diagnosis.
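The arithmetic behind the 41-day figure, and what parallelization could buy in the best case, can be sketched as follows. The function name is hypothetical, and the calculation assumes perfect linear scaling with no communication overhead, which real assemblers do not achieve.

```python
def ideal_wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Best-case wall-clock time in days if the work splits evenly across cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cpu_hours / cores / 24.0


if __name__ == "__main__":
    for cores in (1, 10, 100):
        days = ideal_wall_clock_days(1000, cores)
        print(f"{cores:>3} cores: {days:.2f} days")
```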

Cost Analysis
While the major drops in cost are attributed to advancements in medical equipment, much of the later price reduction is attributed to advancements in sequencing technology and parallel processing, starting with CloudBurst and ABySS in 2009.

The Generic Algorithm

Generic Algorithm (Adapted from TIGR)
1. Calculate pairwise alignments of all fragments.
2. Choose the two fragments with the largest overlap.
3. Merge the chosen fragments.
4. Repeat steps 2 and 3 until only one fragment is left.
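The steps above can be sketched in a few lines of Python. This is a minimal illustration assuming exact (error-free) suffix-to-prefix overlaps and a brute-force pairwise comparison; it is not the TIGR Assembler's actual implementation, and the names `overlap` and `greedy_assemble` are chosen for this example only.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    while len(frags) > 1:
        # Steps 1 and 2: score every ordered pair and keep the best one.
        k, a, b = max(
            ((overlap(x, y), x, y) for x, y in permutations(frags, 2)),
            key=lambda t: t[0],
        )
        # Step 3: merge the chosen pair on their overlap.
        merged = a + b[k:]
        # Remove the merged pair and anything the new string now contains.
        frags = [f for f in frags if f != a and f != b and f not in merged]
        frags.append(merged)
    return frags[0]


if __name__ == "__main__":
    print(greedy_assemble(["GACCCGC", "CCGCTAG", "GCTAGGG"]))
```

The result always contains every input fragment, but because the approach is greedy it is not guaranteed to be the shortest possible superstring.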

Example Test Case
Suppose you are trying to build this superstring:
GACCCGCTAGGGCCTTCCGCAGGTCAGTTCA
From the substrings:
CTAGGGCCT, TCCGCAGGT, GCCTTCC, GCTAGGG, GACCCGC, CCGCTAG, AGGGCC, CGCAG, CAGGT, TCAGT, TTCA, AGTT, CCTT, CCCG
Usually the superstring is not known in advance, but for the sake of the exercise, we need a goal for the end comparison. (Note: in practical applications the substrings must also be analyzed backwards.)
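Before using the test case, it is worth checking that it is self-consistent. The short script below, a sketch with hypothetical helper names, confirms that each substring occurs in the target superstring and that together the substrings cover every position of it.

```python
TARGET = "GACCCGCTAGGGCCTTCCGCAGGTCAGTTCA"
FRAGMENTS = [
    "CTAGGGCCT", "TCCGCAGGT", "GCCTTCC", "GCTAGGG", "GACCCGC",
    "CCGCTAG", "AGGGCC", "CGCAG", "CAGGT", "TCAGT",
    "TTCA", "AGTT", "CCTT", "CCCG",
]


def locate(fragments, target):
    """Map each fragment to the index of its first occurrence (-1 if absent)."""
    return {frag: target.find(frag) for frag in fragments}


def is_fully_covered(fragments, target):
    """True if every position of target lies inside at least one fragment."""
    covered = [False] * len(target)
    for frag in fragments:
        start = target.find(frag)
        while start != -1:
            for i in range(start, start + len(frag)):
                covered[i] = True
            start = target.find(frag, start + 1)
    return all(covered)


if __name__ == "__main__":
    for frag, pos in sorted(locate(FRAGMENTS, TARGET).items(), key=lambda p: p[1]):
        print(" " * pos + frag)
    print("fully covered:", is_fully_covered(FRAGMENTS, TARGET))
```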


Adjusting the Algorithm for Parallel Use
There are two ways to distribute the work.
Option 1: Start with the longest substrings that are independent (no overlap) or nearly independent, and let each thread grow its own superstring from one of them. This works so long as the minimum superstring from each thread is compared across all threads.
Option 2: Keep one primary superstring and use the distributed processing power to find the next substring with the most overlap. This avoids the issue of communicating threads not having the most up-to-date string when making the comparison.
In both cases the superstring is built without a reference genome, which is called de novo sequencing.
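The approach with one primary superstring can be sketched as follows: the remaining substrings are split into chunks, each worker reports the best candidate in its chunk, and the main thread merges the overall winner. All names here are hypothetical, and overlaps are assumed to be exact. In standard CPython, threads will not actually speed up this CPU-bound loop, so the thread pool only illustrates the structure; a process pool or MPI would be the realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_in_chunk(superstring, chunk):
    """Return (score, merged_string, fragment) for the best fragment in chunk."""
    best = (-1, superstring, None)
    for frag in chunk:
        if frag in superstring:
            # Already contained: absorbing it changes nothing.
            cand = (len(frag), superstring, frag)
        else:
            right = overlap(superstring, frag)  # fragment extends the right end
            left = overlap(frag, superstring)   # fragment extends the left end
            if right >= left:
                cand = (right, superstring + frag[right:], frag)
            else:
                cand = (left, frag + superstring[left:], frag)
        if cand[0] > best[0]:
            best = cand
    return best


def parallel_extend(fragments, workers=4):
    """Grow one primary superstring, searching the fragments in parallel."""
    remaining = sorted(set(fragments), key=lambda f: (-len(f), f))
    if not remaining:
        return ""
    superstring = remaining.pop(0)  # seed with a longest fragment
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            chunks = [remaining[i::workers] for i in range(workers)]
            current = superstring
            results = list(pool.map(lambda c: best_in_chunk(current, c), chunks))
            _, superstring, used = max(results, key=lambda r: r[0])
            remaining.remove(used)
    return superstring


if __name__ == "__main__":
    print(parallel_extend(["GACCCGC", "CCGCTAG", "GCTAGGG"]))
```

Every worker compares against the same snapshot of the superstring, and only the main thread updates it, so no worker ever sees a stale string.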

Error Checking and the Bowtie Algorithm

Error Checking
Contaminated samples, misreads, and other issues can arise during the laboratory side of sequencing, leading to imperfect data being fed into the sequence assembler. This means our substring comparison algorithm must account for the possibility of inaccurate base pairs.

Bowtie Algorithm Pseudocode
while (number of strings > 1) {
    for each substring {
        find the longest overlap with the superstring that occurs in exactly one place
        apply error handling for incorrect base pairs
    }
    choose the substring with the longest overlap
    concatenate that substring with the superstring
}
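One simple way to realize the error-handling step in the pseudocode is to accept an overlap even when a small number of positions disagree. The sketch below assumes substitution errors only (no insertions or deletions) and lets the superstring's bases win on a disagreement. The names and the `max_mismatches` and `min_len` parameters are hypothetical, and this is not the method used by the Bowtie tool itself.

```python
def tolerant_overlap(a: str, b: str, max_mismatches: int = 1, min_len: int = 3) -> int:
    """Longest k >= min_len where a's last k bases and b's first k bases
    differ in at most max_mismatches positions; 0 if there is none."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def merge_with_errors(superstring: str, frag: str, max_mismatches: int = 1, min_len: int = 3):
    """Append frag to superstring on their tolerant overlap.

    Inside the overlap the superstring's bases are kept, so a misread
    base in the fragment does not propagate. Returns (merged, overlap_length).
    """
    k = tolerant_overlap(superstring, frag, max_mismatches, min_len)
    return superstring + frag[k:], k


if __name__ == "__main__":
    # "CCGATAG" is "CCGCTAG" with one misread base.
    print(merge_with_errors("GACCCGC", "CCGATAG"))
```

The `min_len` threshold matters here: with mismatches allowed, very short overlaps match almost anything, so they are ignored.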

Successful Implementations

ABySS
ABySS (Assembly By Short Sequences) is a commonly used, open-source de novo short-read assembler. It is written in C++ and uses MPI for communication between nodes. Its adaptations from TIGR include ignoring sequences shorter than 100 characters, because they have a high probability of not contributing to the superstring while still taking up computational time. It also distributes the subsequences across multiple storage units by assigning each base a numeric value {0, 1, 2, 3} and generating a hash that uniquely indexes each subsequence.
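The idea of turning bases into the values {0, 1, 2, 3} and hashing the result to pick a storage location can be illustrated as below. This is only a sketch of the principle: the particular base-to-number mapping, the base-4 encoding, and the modulo step are assumptions made for this example and are not ABySS's actual hash function.

```python
# Assumed mapping for illustration; the source only gives the value set {0, 1, 2, 3}.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence: str) -> int:
    """Read the sequence as a base-4 number, one digit per base."""
    value = 0
    for base in sequence:
        value = value * 4 + BASE_VALUE[base]
    return value


def node_for(sequence: str, num_nodes: int) -> int:
    """Pick the node that stores this sequence.

    Any node can recompute this from the sequence alone, so no
    central lookup table is needed.
    """
    return encode(sequence) % num_nodes


if __name__ == "__main__":
    for seq in ("ACGT", "GACC", "TTCA"):
        print(seq, encode(seq), "-> node", node_for(seq, 4))
```

For sequences of equal length the encoding is unique, so two different subsequences never share an index even when they land on the same node.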

Implementation Notes/Benchmarks
ABySS Benchmarks (vs other assemblers):

Future Research

Other De Novo Short Read Sequencers
CloudBurst, SPAdes, Velvet, and SOAPdenovo (and many more) are useful tools, each with its own pros and cons and its own approach to problem solving, implementation, error handling, etc. Any of them could be looked into to make this project more robust.

Basic de novo assembler implementation
Basic, but not very sophisticated, algorithms exist that are relatively easy to implement for benchmark testing on MST’s cluster system. This approach may be taken to understand the inner workings of the algorithms more deeply. To understand the state of the medical field more deeply, research can be conducted into relevant implementations and their associated time and cost analyses.

References

Gallant, John, David Maier, and James A. Storer. "On Finding Minimal Length Superstrings." Journal of Computer and System Sciences 20.1 (1980). Web.
Schatz, M. C. "CloudBurst: Highly Sensitive Read Mapping with MapReduce." Bioinformatics (2009). Web.
Schatz, Michael C. "High Performance Computing for DNA Sequence Alignment and Assembly." Cold Spring Harbor Laboratory. Stone Ridge Technology, 18 May. Web. 29 Sept.
Simpson, J. T., K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol. "ABySS: A Parallel Assembler for Short Read Sequence Data." Genome Research 19.6 (2009). Web.
Sutton, Granger G., Owen White, Mark D. Adams, and Anthony R. Kerlavage. "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects." Genome Science and Technology 1.1 (1995). Web.
