Download presentation
Presentation is loading. Please wait.
1
How to Build a Horse: Final Report
Megan Smedinghoff
2
Project Goals Goal 1 Goal 2 Goal 3
Produce an assembly of the horse genome using the Celera Assembler Goal 2 Reconcile my assembly with the Arachne assembly done by Broad Institute in February 2007 Goal 3 Produce a parallelized version of the UMD Overlapper to help facilitate easier assembly of large organisms in the future
3
DNA Cells in all organisms have a DNA molecule in their nucleus
DNA is an abbreviation for DeoxyriboNucleic Acid DNA is translated into proteins, which determine the structure, function, and behavior of the cell
4
What does DNA look like? Linear, double-stranded, helical molecule.
Consists of four types of nucleotides (bases): adenine (A) thymine (T) guanine (G) cytosine (C)
5
What is a genome? A genome is the linear sequence of bases of the DNA molecule: …..GATGACATGTAT….. Bacteria: ~2-5 million bases Insects (fly): ~200 million bases Mammals: ~3 billion bases
6
The Horse Genome In February 2007, Broad Institute released a draft assembly of the horse genome The sequencing cost $15 million The assembly contains approximately 2.7 billion bases and was done using 6.8-fold coverage The horse that was sequenced
7
Why Sequence the Horse? Allows scientists to study diseases that primarily affect horses such as Glanders Over 80 known genetic conditions exist in both horses and humans; comparative genomics can lead to better treatments for both species Work done on the horse will lead to improvements in the assembly pipeline and allow easier assembly of large mammals in the future
8
Review of Genome Sequencing
DNA target sample SHEAR SIZE SELECT e.g., 10Kbp ± 8% std.dev. Primer End Reads (Mates) SEQUENCE 750bp Vector LIGATE & CLONE Slide courtesy of Art Delcher
9
Review of Assembly Process
Calculating Overlaps We compare each read to every other read and record pairs whose ends overlap by a certain amount (usually ~40 nucleotides) AGTGATTAGATGATAGTAGA | | | | | | | | | | | | GATGATAGTAGAGGATAGATTTA We use the University of Maryland overlapper to calculate the overlaps UMD Overlapper is advantageous because it corrects for sequencing errors and does additional trimming to help produce the most accurate list of overlaps possible
10
Review of Assembly Process
Unitigs We use overlap information to combine reads into unitigs Unitig Contigs We use mate pair information to combine unitigs into contigs Contig Scaffolds We further use mate pair information to determine how far apart contigs should be (the result is called a scaffold) Scaffold
11
Goal 1: Creating an Assembly
Running the Assembler It took nine days for the Celera Assembler to produce an assembly Celera generated over a terabyte of data The final assembly used 24.9 million reads (7.9x coverage) Results Assembly Celera Arachne Number of Scaffolds 59,044 9,687 Total Contigs In Scaffolds 126,810 55,316 Total Span of Scaffolds 2.5 billion bases 2.4 billion bases N50 Contig Size 77,479 bases 112,381 bases
12
Goal 2: Produce a Reconciled Assembly
Step 1: Match the Celera assembly to the Arachne assembly Step 2: Compare scaffolds from the two assemblies Step 3: Run reconciliation software Contig A Contig B Contig C Contig D Contig A’ Contig B’ Contig C’ Contig D’ Mapping of Celera Scaffold to Arachne Scaffold Celera Scaffold Arachne
13
Orientation Problems Definition:
An orientation problem occurs when a set of Celera contigs matches a set of Arachne contigs, but one of the contigs is placed in the opposite orientation Contig A Contig B Contig C Contig D Contig A’ Contig B’ Contig C’ Contig D’ Illustration of an orientation problem between the two assemblies Celera Scaffold Arachne
14
Ordering Problems Definition:
An ordering problem occurs when a set of Celera contigs map to a set of Arachne contigs, but the matches occur in a different order in each scaffold Contig A Contig B Contig C Contig D Contig A’ Contig C’ Contig B’ Contig D’ Illustration of an ordering problem between the two assemblies Celera Scaffold Arachne
15
Gap Size Problems Definition:
A gap problem occurs when the size of the Arachne gap is more than four standard deviations away from the estimated Celera gap size Contig A Contig B Contig C Contig D Contig A’ Contig B’ Contig C’ Contig D’ Illustration of an gap size problem between the two assemblies Celera Scaffold Arachne Gap
16
Results of the Comparison
Total Celera Scaffolds 59,044 Celera Scaffolds compared to Arachne 2,252 (96.6% of bases) Celera Scaffolds with Orientation Problems 42 Celera Scaffolds with Ordering Problems 109 Celera Scaffolds with Gap Size Problems 162 Celera Scaffolds with at least one problem 174 (93.1% of bases) Note: In most Celera scaffolds, there was at most one contig that matched a contig in the Arachne assembly. In these scaffolds, there was no way to look for discrepancies between the two assemblies.
17
Reconciliation Gaps Compressions Assembly A Assembly B Assembly A
Matching sequences Assembly A Assembly B Gap in Assembly B Compressions Assembly A Assembly B Matching sequences Compression error in Assembly A or expansion error in Assembly B
18
Reconciliation Results
Assembly Celera Arachne Reconciled Number of Compressions N/A 687 136 Assembly Size 2.512 billion bases 2.428 billion bases 2.433 billion bases N50 Contig Size 77,479 bases 112,729 bases 130,123 bases
19
Goal 3: Parallelizing the Overlapper
R O M M O R Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get candidate pairs for overlap Overlaps Additional Post-Processing Run “checker” on candidate pairs and produce edit-transcripts Output Overlap List
20
Parallelization of Overlapper
The parallelization occurs during the “checker” process when the reads with common minimizers are compared Since the reads are already grouped by minimizers, there is no overhead for doing the comparisons in parallel I used the Sun Grid Engine routines installed on the genome cluster to submit the jobs to available processors
21
Results of Parallelization
Bacteria Current Overlapper 30 Minutes Parallelized Overlapper 14 Minutes Fly Current Overlapper 21.5 Hours Parallelized Overlapper 6.25 Hours Horse Current Overlapper 9 Days Parallelized Overlapper 3.07 Days
22
Summary Goal 1 I created an assembly of the horse using the Celera assembler. The assembly size was 2.5 billion bases. Goal 2 I created a reconciled assembly that fixed 551 out of 687 compressions in the Arachne assembly and increased the assembly size by 4.3 million bases Goal 3 I created a parallelized version of the overlapper that significantly reduces the running time on larger genomes
23
Acknowledgements Thanks to Aleksey Zimin, Guillaume Marçais, Mike Roberts, James Yorke, and Poorani Subramanian for instruction and advice on this project Thanks also to Matthew Pragel for moral support
24
More Technical Overlapper Slide
R O M M O R Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get Candidate pairs for overlap Overlaps 10-8 Run “checker” on candidate pairs And produce edit-transcripts Output Overlap List Do Multi-compare and split overlaps when they have a certain number of differences Do Poisson test and eliminate any overlaps that have less than a 10-8 probability of occurring
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.