Published by Julianna Hodge. Modified over 9 years ago.
Genome-Wide SNP Discovery from de novo Assemblies of Pepper (Capsicum annuum) Transcriptomes

Hamid Ashrafi 1, Jiqiang Yao 2, Kevin Stoffel 1, Sebastian R. Chin-Wo 3, Theresa Hill 1, Alexander Kozik 3 and Allen Van Deynze 1

1 Department of Plant Sciences, Seed Biotechnology Center, University of California, Davis, CA 95616
2 Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, Gainesville, FL 32610
3 Genome Center, University of California, Davis, CA 95616

Objectives

To capture as many transcribed genes as possible by sampling peppers from different cultivars and tissues at multiple stages of growth and development.
To discover putative SNPs among the three sampled pepper cultivars by sequencing their transcriptomes on the Illumina Genome Analyzer.
To annotate the transcriptome sequence to gain insight into pepper biological processes.
To use the annotated genes for QTL analysis and candidate gene discovery.

Materials and Methods

Plant Materials and cDNA Library Preparation

Seeds of three pepper (C. annuum) lines, ‘CM334’, ‘Maor’ and ‘Early Jalapeño’, were planted. Three cDNA libraries (one per line) were prepared from pooled RNA extracted from four tissues (root, young leaf, flower and fruit) using the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA, USA). Fruit tissues were collected at several developmental stages: developing fruit at 5, 10 and 20 days post pollination, breaker fruit and ripe fruit. The libraries were constructed by shearing the cDNAs and selecting 300-350 bp fragments on gels, and were then normalized using a double-stranded nuclease protocol. The cDNA libraries were sequenced on an Illumina Genome Analyzer IIx (GAIIx) (Illumina Inc., San Diego, CA) for 80-120 cycles at the UC Davis Genome Center core facility.
De Novo Assembly of NGS Sequences

The NGS data (GAIIx) were processed with our standard preprocessing pipeline, developed at UC Davis (Kozik, 2010). The Velvet (Zerbino and Birney, 2008) and CLC (CLCBIO, 2010) software packages were used to assemble the sequences. CAP3 was used to merge the three cultivar assemblies into the final assembly.

Assembly workflow (Fig. 1): for each cultivar (CM334, Early Jalapeño and Maor), two sets of trimmed reads (min 40 nt to max 85 nt, and min 25 nt to max 60 nt) were assembled with Velvet at k-mers 31, 35 and 41, and all k-mer assemblies were assembled with CAP3. One iteration of CLC assembly was run with all reads. The Velvet and CLC assemblies of each cultivar were combined with CAP3 into a cultivar assembly, and the three cultivar assemblies were combined with CAP3 into the pepper final assembly (reference sequence).

Results

Assembly Statistics

Assembly                              No. of Contigs   Total nt       N50
CM334                                 83,113           84,792,180     1,488
Early Jalapeño                        82,614           84,973,865     1,488
Maor                                  76,375           79,383,673     1,526
Pepper assembly (CM334, EJ and Maor)  123,261          135,019,787    1,647

Annotation

A total of 63,202 contigs (51.3%), with an average length of 1,495 nucleotides, had at least one hit in the GenBank non-redundant database. These contigs covered 94.5 M bases (70%) of the total assembly. The 60,055 contigs (48.7%) without any GenBank hit were on average 674 nucleotides long and covered 40.5 M bases (30%) of the total assembly.
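The N50 values in the Assembly Statistics table follow the usual definition: the contig length at which contigs of that length or longer contain at least half of all assembled bases. The poster does not say which tool reported these values, so the function below is only a minimal sketch of that standard definition; the function name and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when no contigs were supplied.
    raise ValueError("contig_lengths is empty")


# Hypothetical toy lengths, not data from the poster.
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))
```

In practice the lengths would be read from the assembly FASTA file; that parsing step is omitted here.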
Based on all BLASTX results, Vitis vinifera, Arabidopsis thaliana and Oryza sativa were the top three species among the BLAST hits (Fig. 3). The mapping step of Blast2GO identified 37,918 contigs (30.7%) with Gene Ontology (GO) terms. Biological Processes (BP) at different GO levels were generated; Fig. 4 shows the BP at level 2, and Fig. 5 shows the number of annotated sequences for each BP. KEGG maps for 150 biological pathways were generated and the contigs within each pathway were determined. For instance, Fig. 6 depicts the KEGG map of the Pyrimidine Metabolism pathway.

SNP Discovery

A total of 22,863 putative SNPs within 11,869 contigs were identified by our SNP discovery pipeline. The contigs with putative SNPs comprised 23,794 kb (17.6%) of the pepper transcriptome assembly. On average, 1 SNP per 1,040 bp of exonic regions of the pepper genome was identified.

Conclusions

Assembling the transcriptomes of three pepper cultivars together increased the total assembled bases by 50%.
The present pepper transcriptome assembly represents ~4% of the pepper genome (3,500 Mb).
We demonstrated that, for plants whose genome sequences are not yet available, transcriptome assembly is an alternative approach to SNP calling.
Annotation of 51% of the contigs, or 70% of the total assembled bases, indicates that the remaining ~49% of contigs are small contigs covering the 30% of bases that are unannotated.

References

Conesa, A., S. Götz, et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674-3676.
Kozik, A. (2010). Tool to process and manipulate Illumina sequences. http://code.google.com/p/atgc-illumina/downloads/list
Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25(14): 1754-1760.
Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
Zerbino, D. and E. Birney (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Res 18: 821-829.

Background and Significance

Molecular breeding of pepper (Capsicum spp.) has been hampered by the paucity of molecular markers. This is primarily due to the lack of a pepper genome sequence and the limited sequence resources available. In recent years, with more cost-effective sequencing technologies such as Illumina, sequencing of expressed genes (transcriptomes), gene discovery and allele mining are no longer insurmountable. To exploit the speed and scale of data from new sequencing technologies and to enrich the sequence resources of pepper, we sequenced the transcriptomes (RNA-seq) of three pepper lines: Maor, Early Jalapeño (EJ) and Criollo de Morelos 334 (CM334). We selected a wide range of tissues to represent as many expressed genes as possible. The reference sequence was constructed from >200 million Illumina reads (80-120 nt) using a combination of the Velvet, CLC and CAP3 software packages. BWA (Li and Durbin, 2009), SAMtools (Li et al., 2009b) and in-house Perl scripts were used to identify SNPs among the three pepper lines. The SNPs were filtered to be 100 bp apart from any putative intron-exon junction as well as from adjacent SNPs. After filtering, >22,000 high-quality putative SNPs were identified and bioinformatically mapped to pepper genetic maps. The reference sequence was annotated with the Blast2GO software (Conesa et al., 2005).

Acknowledgments

The authors would like to thank Enza Zaden, Nunhems, Rijk Zwaan, Syngenta, Vilmorin and the UC Discovery program for financial support. We would also like to thank the sequencing facility of the UC Davis Genome Center and the Bioinformatics core facility for providing the servers and computational power. The annotation would not have been possible without collaboration with Dr. R. Michelmore’s laboratory.
SNP Discovery Pipeline

BWA was used to map the reads of each of the three genotypes individually to the pepper final transcriptome assembly. SAMtools was used to generate pileups for each cultivar and to detect the differences between each cultivar and the reference sequence. Indels were screened out of the pileup files. Intron-exon junction positions were inferred in the reference sequence based on Arabidopsis gene models, using the intron finder on the Solanaceae Genome Network (SGN) website. In-house Perl scripts were used to create an allele call table for all three genotypes, and the SNPs were filtered against adjacent SNPs and the identified intronic regions. Sequences surrounding the SNPs (100 bases on each side) were extracted from the reference sequence to design assays.

Annotation of Reference Sequence

The Blast2GO program was used to annotate the reference sequence, obtain the statistics and generate KEGG maps (http://www.genome.jp/kegg/pathway.html).

Fig. 1 De novo assembly of pepper transcriptomes
Fig. 2 Distribution of contig length in the pepper transcriptome assembly (N50 = 1,647; mean = 1,095; max = 19,089; min = 265)
Fig. 3 Top species among the BLASTX hits
Fig. 4 Biological Processes at GO level 2
Fig. 5 Number of annotated sequences for each Biological Process
Fig. 6 KEGG map of Pyrimidine Metabolism
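The spacing filter in the SNP Discovery Pipeline (SNPs kept only when they are 100 bp away from inferred intron-exon junctions and from adjacent SNPs) can be sketched as below. The authors used in-house Perl scripts that are not shown on the poster, so this Python version is an illustration only. The function name, the per-contig position lists, the reading of "100 bp apart" as "at least 100 bp", and the choice to drop both members of a close SNP pair are all assumptions.

```python
from bisect import bisect_left


def filter_snps(snp_positions, junction_positions, min_distance=100):
    """Filter candidate SNP positions on a single contig.

    A SNP is kept only if it lies at least `min_distance` bp from every
    inferred intron-exon junction and from every other candidate SNP.
    (Assumed interpretation; the original Perl scripts are not available.)
    """
    snps = sorted(set(snp_positions))
    junctions = sorted(junction_positions)

    def near_junction(pos):
        # Only the nearest junction on each side needs to be checked.
        i = bisect_left(junctions, pos)
        for j in (i - 1, i):
            if 0 <= j < len(junctions) and abs(junctions[j] - pos) < min_distance:
                return True
        return False

    kept = []
    for idx, pos in enumerate(snps):
        left_ok = idx == 0 or pos - snps[idx - 1] >= min_distance
        right_ok = idx == len(snps) - 1 or snps[idx + 1] - pos >= min_distance
        if left_ok and right_ok and not near_junction(pos):
            kept.append(pos)
    return kept


# Hypothetical positions on one contig, not data from the poster.
print(filter_snps([150, 400, 450, 900], [200, 1000]))
```

With these toy inputs, position 150 is dropped for being 50 bp from the junction at 200, positions 400 and 450 are dropped for being 50 bp from each other, and position 900 is kept. Requiring 100 bp of clean flanking sequence matches the 100 bases on each side that were extracted for assay design.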