Mo17 shotgun project
Goal: sequence Mo17 gene space with inexpensive new technologies
Datasets in progress:
- Four phases of 454-FLX sequencing, to a maximum of ~12X
- Include ~3 kb paired-end sequencing (for short-range structural variation)
- Ultra-short-read Solexa or ABI SOLiD (for polishing)
- Preparation of methyl-spanning linkers to augment IBM map integration and detect rearrangements (Sanger end-sequence)
- (Ideally would add Mo17 BAC-ends from DuPont, if available)
Shotgun
Independent of tiling path
- Can detect non-repetitive gene space even within otherwise complex regions that may not be in the tiling path
Disadvantages of short reads
- Can't expect to recover repetitive sequences
Four Phases of Sequencing Complete in 2007
Sequencing contract established with 454/Roche. Four phases, including collaborative runs at no cost in Phases II-IV.
Phase I: underway (30 FLX runs). Library QC and initial assessment of data quality.
- 10 FLX runs totaling 1 Gb (~0.4X)
- 20 FLX-pair runs spanning 12 Gb (~5X span in 3 kb inserts)
- Assess quality, coverage, contamination, chimerism, accuracy
Phase II: 80 runs plus 30 runs from Roche (110 runs total). Rough draft stage.
- 40 FLX-pair runs spanning 36 Gb (total 48 Gb, ~10X span)
- 70 FLX runs for 7 Gb (total 8 Gb, ~3.5X sequence)
- Assess rough draft assembly (3 methods); compare with B73 and sorghum
Phases III and IV
Phase III (50 runs + 20 contributed)
– 20 FLX-pair runs (total spanning cover ~20X)
– 50 FLX runs (total 13 Gb sequence, ~5.5X)
– Draft assembly. Rough annotation.
– Assessment of structural variation based on 20X clone cover. Assessment complete by end of
Phase IV (60 runs + 30 contributed)
– 90 FLX runs (to reach total 22 Gb, ~10X)
– Data collection complete by end of
– Early 2008: final assembly. Integration with MSSL ends and IBM map. Proceed to annotation and full analysis.
Note: Later phases may use the next FLX release, with longer read lengths. To be conservative, sequence from FLX-pair reads is not included in the sequence coverage estimates.
Total sequencing cost for Phases I-IV: $1.6M
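The cumulative sequence totals quoted for the four phases (1, 8, 13 and 22 Gb) can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes a flat yield of 0.1 Gb per unpaired FLX run (implied by "10 FLX runs totaling 1 Gb") and a genome size of about 2.4 Gb, which is not stated on the slides. The names `GENOME_GB`, `GB_PER_FLX_RUN` and `cumulative_coverage` are hypothetical. FLX-pair runs are left out, as in the slides' own coverage estimates.

```python
# Assumptions (not from the slides): ~2.4 Gb genome, 0.1 Gb per unpaired FLX run.
GENOME_GB = 2.4
GB_PER_FLX_RUN = 0.1

# Unpaired FLX runs per phase, as listed on the slides.
FLX_RUNS = {"I": 10, "II": 70, "III": 50, "IV": 90}


def cumulative_coverage(runs_by_phase, gb_per_run=GB_PER_FLX_RUN, genome_gb=GENOME_GB):
    """Return [(phase, cumulative_gb, fold_coverage), ...] in phase order."""
    total_gb = 0.0
    rows = []
    for phase, runs in runs_by_phase.items():
        total_gb += runs * gb_per_run
        rows.append((phase, round(total_gb, 1), round(total_gb / genome_gb, 2)))
    return rows


if __name__ == "__main__":
    for phase, gb, fold in cumulative_coverage(FLX_RUNS):
        print(f"Phase {phase}: {gb} Gb total, ~{fold}X")
```

The fold-coverage values shift with the assumed genome size, so small differences from the rounded figures on the slides (~0.4X, ~3.5X, ~5.5X, ~10X) are expected.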
454-FLX reads are typically either mostly masked or mostly clean
[Figure; axis: percent masked by over-represented 16-mers]
- ~29% of reads have less than a quarter of positions masked
- ~58% of reads have more than two-thirds of positions masked
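One way to compute a per-read "percent masked" figure of this kind is to count every 16-mer across the read set and mark each position covered by a 16-mer seen more often than some cutoff. The following is a minimal sketch under stated assumptions, not the project's actual masking pipeline. The function names and the `threshold` cutoff are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k=16):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def masked_fraction(read, counts, k=16, threshold=2):
    """Fraction of positions covered by a k-mer seen more than `threshold` times.

    `threshold` is an illustrative cutoff, not a value taken from the slides.
    """
    if not read:
        return 0.0
    masked = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > threshold:
            for j in range(i, i + k):
                masked[j] = True
    return sum(masked) / len(read)


if __name__ == "__main__":
    # Toy reads with k=4 so that the example stays readable.
    reads = ["AAAAAAAA", "AAAAAAAA", "ACGTTGCA"]
    counts = count_kmers(reads, k=4)
    for read in reads + ["ACGTAAAAAA"]:
        print(read, masked_fraction(read, counts, k=4))
```

Binning these fractions over all reads would show whether they split into a mostly masked group and a mostly clean group.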
Mo17 unique full-length alignments vs. B73 MAGIs show high quality of unique alignments
[Figure; labels: "Residual repeats in MAGIs with multiple hits in 454 data", "Unique full alignments"]
SNPs and indels of 454 reads relative to MAGIs are consistent with a few percent variation between Mo17 and B73 (combines variation with sequencing errors)
[Figure; axes: SNPs or indels per base, frequency of reads]
Multiple alternate assembly plans
Divide and conquer
– Reduce ~100 million reads to ~50K unique gene spaces of ~thousands of reads each (~10 kb) by clustering based on various comparisons
  Plan A: de novo clustering of masked reads
  Plan B: map to B73, assemble (de novo for remainder)
  Plan C: sorghum-assisted
– Use various assemblers to lay out and produce a consensus for each cluster (454 assembly team engaged)
– Polish sequence with Solexa or SOLiD for accuracy
– Link with MSSL pairs, integrate with map
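The slides do not say how the de novo clustering in Plan A would be done. As one possible illustration, the sketch below groups reads by single linkage: two reads land in the same cluster when they share at least one unmasked k-mer. It assumes masked positions have been replaced by `N`. The function name `cluster_reads` and the union-find approach are assumptions made for this example, not the project's method.

```python
def cluster_reads(reads, k=16):
    """Single-linkage clustering: reads sharing any unmasked k-mer are merged.

    Masked bases are assumed to be written as 'N'; k-mers containing 'N' are skipped.
    Returns a sorted list of clusters, each a list of read indices.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue
            if kmer in first_seen:
                root_a, root_b = find(idx), find(first_seen[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy reads with k=4; reads 0 and 1 share the k-mer "GTAC".
    toy = ["ACGTACGG", "GTACGGTT", "NNNNTTTT", "CCCCGGGA"]
    print(cluster_reads(toy, k=4))
```

Single linkage is sensitive to any repeat k-mers that survive masking, which is one reason a real pipeline would likely need stricter merge criteria than a single shared k-mer.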
Backup analyses vs. B73 reference
SNP/variation detection by alignment to B73 sequence
- 454/Solexa/SOLiD (various successful models in other species at JGI, elsewhere)
Structural variation detection via paired-end placements
- Needs to be tolerant of chimerism rate
- Model of successful human structural analysis done with 454 (unpublished)
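A common way to make paired-end structural variation calls tolerant of chimeric pairs is to require several independent discordant pairs at the same location before reporting anything, since an isolated chimera rarely recurs. The sketch below illustrates that idea only. The ~3 kb expected insert size comes from the slides, while `tolerance`, `bin_size`, `min_support` and the function name are hypothetical choices.

```python
from collections import defaultdict


def discordant_clusters(pairs, expected=3000, tolerance=1000,
                        bin_size=5000, min_support=3):
    """Group discordant read pairs by position and keep well-supported bins.

    pairs: iterable of (chrom, left_pos, right_pos) for pairs placed on the reference.
    A pair is discordant when its mapped span differs from `expected` by more than
    `tolerance`. Bins with fewer than `min_support` discordant pairs are dropped,
    which filters out isolated chimeric pairs.
    """
    support = defaultdict(list)
    for chrom, left, right in pairs:
        span = right - left
        if abs(span - expected) > tolerance:
            support[(chrom, left // bin_size)].append(span)
    return {key: spans for key, spans in support.items()
            if len(spans) >= min_support}


if __name__ == "__main__":
    pairs = [
        ("chr1", 10000, 13000),   # concordant (~3 kb)
        ("chr1", 20100, 28100),   # discordant, supported by two more pairs
        ("chr1", 20400, 28500),
        ("chr1", 20900, 28800),
        ("chr2", 50000, 90000),   # isolated discordant pair, likely a chimera
    ]
    print(discordant_clusters(pairs))
```

Raising `min_support` trades sensitivity for robustness, so the right value depends on the chimerism rate measured in Phase I.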
Timeline
- Phase I: in progress, complete by end of month. Analysis to OK Phase II: ~10 days.
- Phase II: October
- Phase III: November
- Phase IV: December
- 454 sequencing complete by end of year
~58% of each BAC is masked by over-represented 16-mers
Outreach
Dick McCombie
Types of Outreach
- Public presentations
- Collaborations
- CSHL DNA Learning Center
Public Presentations
Collaborations
– The Maize Genetics and Genomics Database (MaizeGDB)
  -- Letter for Carolyn Lawrence (MaizeGDB)
– MaizeGDB: website text, links to data
– Gramene
– EBI Ensembl
– Affymetrix Maize Pilot Expression Array Project
– Optical map
– TWINSCAN
– Vmatch
– Full-Length cDNA Project
CSHL DNA Learning Center