Cotton Marker Database (CMD) for Genetic and Genome Research
www.cottonmarker.org
Anna Blenda, Pengfei Xuan, David Camak, Feng Luo, Don Jones (presenting author)
ICGI-2010, Canberra, Australia
CMD Objectives
- Collect and integrate all public cotton molecular markers (SSRs and SNPs) as a cotton community resource.
- Accelerate utilization of molecular markers in cotton breeding.
- Provide data retrieval and search tools.
- Provide stand-alone data mining tools.
- Facilitate collaboration domestically and internationally.
The CMD database resource is available for deposit and data mining of cotton genome sequencing data.
CMD New Features
1. Primer Redundancy
2. Traits
3. Updated CMap Viewer with QTLs
4. New System Platform - powerful computing
5. Future/Work in Progress
1. Primer Redundancy
View section: SSR Projects SNP Projects SSRs Markers Homology SSR Primers Primer Redundancy Panel Publications Maps
Importance of the Primer Redundancy Check:
- Initial step in the analysis of the CMD cotton SSRs collection redundancy.
- Avoidance of generating marker redundancy.
- Financial component is critical (spending money for non-redundant SSR markers only).
- Direct effect on the efficiency of the molecular breeding research.
Primer Redundancy Summary Page:
- 18,002 primer sequences analyzed;
- 2,570 (14.2%) redundant primer sequences;
- Types of primer sequence match: forward-forward; reverse-reverse; forward-reverse; reverse-forward.
Threshold value for primer sequence match: 81%
- The threshold value (81% or higher for sequence match) was chosen based on threshold value analyses (from 70% to 100% match);
- below 81% match, primer redundancy increases dramatically.
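The slides do not say how CMD scores a primer sequence match, so the following is only a minimal sketch of the idea, assuming a simple ungapped comparison. The function names `percent_match` and `redundancy_matches` and the example primers are hypothetical. The sketch compares the forward and reverse primers of two SSRs in all four combinations and reports those at or above the 81% threshold.

```python
from itertools import product

THRESHOLD = 0.81  # from the slide: 81% or higher counts as a match


def percent_match(a: str, b: str) -> float:
    """Best ungapped identity between a and b, as a fraction of the longer length.

    Assumption: this scoring scheme is illustrative, not CMD's actual method.
    """
    a, b = a.upper(), b.upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    best = 0
    # Slide b across a; offset is the position of b[0] relative to a[0].
    for offset in range(-len(b) + 1, len(a)):
        matches = sum(
            1
            for i, base in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == base
        )
        best = max(best, matches)
    return best / longest


def redundancy_matches(pair_a, pair_b, threshold=THRESHOLD):
    """Each pair is (forward, reverse). Return {match_type: score} for hits."""
    labels = ("forward", "reverse")
    hits = {}
    for (i, seq_a), (j, seq_b) in product(enumerate(pair_a), enumerate(pair_b)):
        score = percent_match(seq_a, seq_b)
        if score >= threshold:
            hits[f"{labels[i]}-{labels[j]}"] = score
    return hits


# Made-up primers for illustration only.
ssr_1 = ("ACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGG")
ssr_2 = ("ACGTACGTACGTACGTACGA", "GGGGGGGGGGCCCCCCCCCC")
print(redundancy_matches(ssr_1, ssr_2))
```

A production check would more likely rely on an alignment tool and might also consider reverse complements; the slides are silent on both points.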
List of Redundant Sequences
Primer Redundancy Individual Pages
Redundant primer Info from View/Search SSR pages
Downloads Page
CMD SSR Primer Redundancy results are available from the Downloads page in Excel format.
Search by Primer Redundancy
2. Traits in Cotton Linked to the Genetically Mapped SSRs
David Camak, undergraduate student (Erskine College)
Example of published traits and QTLs associated with traits
Spreadsheet with Annotated Trait Data
Columns: Trait Symbol; Trait Name; Trait Description; QTL/gene Name; QTL/gene?; Trait-linked SSR; Marker Type; Marker Interval for QTL; R2 Value; Cross; Linkage Group; SSR Genetic Position; QTL Start & Stop Positions; QTL Span; Publication Reference
David Camak annotated the Excel spreadsheet with information from current publications on agriculturally important traits in cotton and the mapped cotton SSRs linked with those traits.
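One row of the spreadsheet described above could be represented roughly as follows. This is a sketch only: the class name, field names, the centimorgan unit, and the example values are assumptions for illustration and are not taken from CMD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraitAnnotation:
    """One annotated row; field names loosely follow the spreadsheet columns."""
    trait_symbol: str
    trait_name: str
    qtl_or_gene_name: str
    linked_ssr: str
    cross: str
    linkage_group: str
    ssr_position_cm: Optional[float]  # assumed unit: centimorgans
    qtl_start_cm: Optional[float]
    qtl_stop_cm: Optional[float]
    r2: Optional[float]
    reference: str

    @property
    def qtl_span_cm(self) -> Optional[float]:
        """QTL span derived from the start and stop positions, if both are known."""
        if self.qtl_start_cm is None or self.qtl_stop_cm is None:
            return None
        return abs(self.qtl_stop_cm - self.qtl_start_cm)


# Placeholder values, not real CMD data.
row = TraitAnnotation(
    trait_symbol="FS",
    trait_name="Fiber Strength",
    qtl_or_gene_name="example_qtl",
    linked_ssr="EXAMPLE_SSR_001",
    cross="parent A x parent B",
    linkage_group="LG-example",
    ssr_position_cm=12.0,
    qtl_start_cm=10.0,
    qtl_stop_cm=25.5,
    r2=None,
    reference="placeholder citation",
)
print(row.qtl_span_cm)
```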
Results
- Twenty-nine agriculturally important traits were analyzed overall.
- The total number of SSR markers associated with those traits was 142.
- The total number of crosses/genetic maps analyzed was 15.
Initial results of David Camak’s undergraduate research project. The annotation of traits is being continued.
Agriculturally Important Traits Annotated
Boll Size; Boll Weight; Bolls per Plant; Color Components; Yellowness; Fiber Span Length (2.5%, 50%); Fiber Elongation; Fiber Fineness; Fiber Maturity; Fiber Perimeter; Fiber Strength; Fiber Micronaire; Lint Index; Lint Percentage; Lint Yield; Number of Seed/Boll; Reflectance; Seed Cotton Yield; Seed Index; Seed Weight; Spiny Bollworm Resistance; Short Fiber Index; Wall Thickness; Weight Fitness; Uniformity Index; Genic Male Sterility
Number of Cotton SSRs Associated/Linked with the Analyzed Agriculturally Important Cotton Traits
Agriculturally Important Traits 1
Columns: Agriculturally Important Traits | Number of SSRs Associated with Each Trait
Fiber Elongation 15 Fiber Length Yellowness 14 Fiber Strength 13 Fiber Strength (kNm/kg) 12 Fiber Reflectance Micronaire Boll Size (g) 11 Lint Percentage 10 Short Fiber Index Lint Cotton Yield (kg/ha) 9 Maturity 2.5% Fiber Span Length (mm) 8 Fiber Maturity Seed Cotton Yield (kg/ha) Seed Index (g) Fiber Elongation Percentage 7
Agriculturally Important Traits 2
Columns: Agriculturally Important Traits | Number of Unique SSRs Associated with Each Trait
Wall Thickness 7 50% Fiber Span Length (mm) 6 Boll Weight Genic Male Sterility Micronaire Reading 5 Weight Fitness Bolls per Plant 4 Fiber Perimeter Fiber Strength (cN/tex) 2 Spiny Bollworm Resistance Fiber Length (mm) 1 Fiber Length Uniformity
*These numbers are continually updated as molecular research and breeding uncover more trait-linked SSRs.
View Traits Go to CMD Homepage @ www.cottonmarker.org Click on Traits
Search SSRs Listed by Traits Traits by Published Symbol Results Click on any Trait
SSR Linked with Selected Trait List of SSRs Choose Trait Based on SSR Click
Trait Data From Spreadsheet Trait Information 1 QTL and Marker Information Positions for Genetic Mapping All Relevant Data from Spreadsheet Click on 1 or 2 2
1 Specific Molecular Marker Source Page Forward and Reverse Primer Sequences Molecular Marker (SSR) Other Useful Information Related to Specific SSR
2 Trait Search Page
Simple search for agriculturally important traits in cotton
Search feature available on any page, including the homepage
3. Updated CMap Viewer with QTLs
26 cotton genetic maps are available to view and compare in CMap Viewer; QTL information was added.
- Consensus map: BC1-RIL: ("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) 2009
- Reference map: Comprehensive Reference Map (CRM) CottonDB 2010
- BC1: (G. hirsutum "Emian22" x G. barbadense "Pima3-79") x "Emian22" 2008
- BC1: Hai-7124 x Junmian-1 2007
- BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"] 2007
- BC1: [(G. hirsutum "TM-1" x G. barbadense "Hai7124") x "TM-1"] 2006
- BC1: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") 2004
- BC2: (("Guazuncho2" (G. hirsutum) x "VH8-4602" (G. barbadense)) x "Guazuncho2") 2005
- DH: Vgs x (TM-1 x Hai-7124) 2005
- F2: Deltapine x Giza-83 2008
- F2: Deltapine-61 x Texas-701 2007
- F2: G. arboreum ("Jianglingzhongmian" x "Zhejiangxiaoshanlushu") 2008
- F2: Xinluzao-1 x Hai-7124 2008
- F2: G. hirsutum "CRI36" x G. barbadense "Hai7124" 2007
- F2: G. hirsutum "Handan208" x G. barbadense "Pima90" 2007
- F2: G. hirsutum "Handan208" x G. barbadense "Pima90" 2005
- F2: Hai-7124 x Junmian-1 2007
- F2: G. hirsutum race "Palmeri" x G. barbadense Acc. "K101" 2007
- F2: G. hirsutum race "Palmeri" x G. barbadense Acc. "K101" 2004
- F2: TM-1 x WT-936 2005
- F2: Yumian-1 x T586 2005
- F2: Acala-44 x Pima S-7 2004
- RIL: 7235 x TM-1 2007
- RIL: Zhongmiansuo-12 x 8891 2007
- RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) 2006
- RIL: "TM-1" (G. hirsutum (AD1)) x "3-79" (G. barbadense (AD2)) 2005
- 4WC: (Simian-3 x Sumian-12) x (Zhong-4133 x 8891) 2008
4. New Virtualization/HPC System Platform
- CMD was moved to virtual machines for high-performance computing (HPC); the live and development websites are hosted on virtual machines based on virtualization technology (ready for cloud computing);
- computing jobs submitted by users are transferred to the Clemson Palmetto HPC;
- very powerful computing resource (more than 5000 computing nodes);
- daily remote snapshot backup.
[Diagram: Xen Virtual Machine Monitor architecture. Labels: hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE); VM0 (Device Manager & Control s/w); VM1 (cmdweb): CMD web page, CMap, cgi-bin, CMBioTools; VM2 (databases): MySQL, PostgreSQL; VM3 (gmod); guest OS CentOS 5.4; ssh link to Palmetto HPC and Warriors HPC Cluster; remote daily backup snapshot]
Future
- 100 SSRs from Siva Kumpatla (Dow Agro): a collaborative project with Texas A&M; SSRs mapped on the TM-1 x 3-79 map;
- 200 SSRs from Ramesh Kantety;
- Updating of the mapped SSR data is in progress;
- More SNP data is coming;
- Annotation of traits/genes that are mapped and linked to SSRs/SNPs is in progress.
Future (cont.)
3 pipelines were designed (Pengfei Xuan):
1. Eukaryotic Automated Structural Annotation Pipeline
2. Transposable elements de novo
3. Transposable elements annotation
1). Eukaryotic Automated Structural Annotation Pipeline (work in progress)
- Aimed to identify a vast majority of genes;
- raw sequences are run through a series of programs and scripts (“pipeline”) in an automated way;
- generates a basic working gene set as a starting point for further work.
[Flowchart, Phases 1 to 3. Labels: Genome Sequence; Repeat Masker; Preliminary gene finding; Manually build gene models (200 genes); Use gene models as Training set; Gene Finder; Gene Finders; EST Database (PASA); Database Comparisons; Consensus prediction; EST based refinement (PASA); Manual check; Finalize best annotation]
The pipeline was designed by Pengfei for CUGI initially, but we are planning to implement it into CMD. It will be very handy when genome sequences are available.
2). Transposable Elements Computational Identification (work in progress)
This pipeline searches the genome sequences for TEs and creates a library file of TEs for a genome of interest.
The pipeline starts by comparing the genome with itself using BLASTER. Then it clusters the matches with GROUPER, RECON and PILER, clustering programs specific for interspersed repeats. For each cluster, it builds a multiple alignment from which a consensus sequence is derived. Finally, these consensus sequences are classified according to TE features, and redundancy is removed. At the end we obtain a library of classified, non-redundant consensus sequences.
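The step "builds a multiple alignment from which a consensus sequence is derived" can be illustrated with a simple majority-rule consensus. This is a sketch under stated assumptions, not the pipeline's own procedure: the function name `majority_consensus` is hypothetical, the input sequences are assumed to be already aligned to equal length with `-` as the gap character, and columns where a gap is the most common symbol are dropped.

```python
from collections import Counter


def majority_consensus(aligned_seqs, gap="-"):
    """Return the most common symbol per alignment column, skipping gap-majority columns."""
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Toy alignment of three copies from one cluster (made-up data).
cluster = ["ACGT-A", "ACGTTA", "ACCT-A"]
print(majority_consensus(cluster))
```

Ties between symbols are resolved by whichever symbol appears first in the column, which a real tool would handle more carefully (for example with IUPAC ambiguity codes).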
3). Transposable Elements Annotation (work in progress)
This pipeline mines a genome using a library of TEs from the TEdenovo pipeline. Identified TEs are filtered and annotated.
TEannot: this pipeline mines a genome with a library of TE sequences, for instance the one produced by the TEdenovo pipeline, using BLASTER, RepeatMasker and CENSOR. An empirical statistical filter is applied to discard false-positive matches. Short simple repeats (SSRs) are annotated along the way with TRF, RepeatMasker and MREPS. Then, using MATCHER (dynamic programming), the pipeline chains TE fragments belonging to the same disrupted copy. A "long join" procedure is subsequently applied to connect distant fragments. Finally, annotations are exported into GFF3 and gameXML files.
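To make the final export step concrete, here is a minimal sketch of writing TE annotations as GFF3 records (nine tab-separated columns after a `##gff-version 3` header). The function name `to_gff3`, the `source` label, the `ID`/`Name` attribute scheme, and the example hit are assumptions for illustration; they are not the output conventions of the pipeline described above.

```python
def to_gff3(hits, source="example_pipeline"):
    """Format (seqid, start, end, strand, family) tuples as GFF3 text.

    Coordinates are treated as 1-based and inclusive, as GFF3 requires.
    """
    lines = ["##gff-version 3"]
    for n, (seqid, start, end, strand, family) in enumerate(hits, start=1):
        if start > end:  # GFF3 requires start <= end
            start, end = end, start
        attributes = f"ID=te_{n};Name={family}"
        fields = [
            seqid,
            source,
            "transposable_element",  # feature type (Sequence Ontology term)
            str(start),
            str(end),
            ".",      # score not reported in this sketch
            strand,
            ".",      # phase applies only to CDS features
            attributes,
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Made-up hit for illustration.
example = to_gff3([("chr1", 250, 100, "+", "famA")])
print(example, end="")
```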
CMD TEAM
- Anna Blenda, PI: Research Assistant Professor, Genetics and Biochemistry, Clemson University
- Feng Luo, collaborator: Assistant Professor, School of Computing, Clemson University
- Pengfei Xuan: M.S. student, Computer Science, Clemson University
- David Camak, former member: currently M.S. student, Biology, SELU
Acknowledgements
Cotton Incorporated
www.cottonmarker.org Thank you!