Comparative analyses of the potato and tomato transcriptomes

Presentation transcript:

Comparative analyses of the potato and tomato transcriptomes
David Francis, Allen Van Deynze, John Hamilton, Walter De Jong, David Douches, Sanwen Huang, and C. Robin Buell
Supported by the AFRI Plant Breeding, Genetics, and Genomics Program of USDA’s National Institute of Food and Agriculture

Questions
International Sol Project: How can a common set of genes/proteins give rise to such a wide range of morphologically and ecologically distinct organisms?
SolCAP: How can variation be harnessed to improve varieties that benefit the consumer, processors, and the environment?
Sequence data available to address these questions:
  Draft genome for doubled monoploid DM1-3 516R44 (S. tuberosum L. Phureja group)
  S. tuberosum, S. lycopersicum, and S. pimpinellifolium GAII transcriptomes
Technology:
  Next-generation sequencing
  SNP genotyping

What comparisons do we want to make?
How well do S. tuberosum expressed sequences align to DM1-3 516R44 genomic sequences?
How well do S. lycopersicum expressed sequences align to DM1-3 516R44 genomic sequences?
How is variation distributed within a species? Within a market class? Within a variety? Within a gene?
Which sequence variation is important to phenotypic variation?

Data Collection
  Library creation/QC
  GAII sequencing (single and paired end)
Assembly
Analysis
  Transcriptome complexity
  SNP calling/validation
  Identification of genes under selection
[Figure labels: 400, 300]

Illumina GA II Output for Potato

Sample       Total Clusters   Total PE Reads   PF Passed Clusters   % PF Passed Clusters   Total PE PF Reads   Actual PE Reads
Atlantic 1        7,601,277       15,202,554            6,382,748                  83.97          12,765,496
Atlantic 2       10,544,542       21,089,084            9,252,168                  87.74          18,504,336        30,185,186
Premier 1         7,812,394       15,624,788            6,652,121                  85.15          13,304,242
Premier 2        11,678,379       23,356,758            9,999,926                  85.63          19,999,852        31,949,096
Snowden 1         7,996,418       15,992,836            6,837,553                  85.51          13,675,106
Snowden 2        11,781,671       23,563,342           10,393,322                  88.22          20,786,644        33,288,120
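The derived columns in the potato output table follow from simple arithmetic: each cluster yields two paired-end reads, and % PF is the share of clusters that passed the filter. The sketch below recomputes those columns for the two Atlantic lanes. The function name `pf_summary` and the dictionary keys are illustrative only and are not taken from the project's pipeline.

```python
def pf_summary(total_clusters, pf_clusters):
    """Recompute the derived columns for one lane of paired-end sequencing."""
    return {
        "total_pe_reads": 2 * total_clusters,   # two reads per cluster
        "pct_pf": round(100 * pf_clusters / total_clusters, 2),
        "total_pe_pf_reads": 2 * pf_clusters,   # two reads per passing cluster
    }

# Cluster counts copied from the Atlantic rows of the table.
atlantic_1 = pf_summary(7_601_277, 6_382_748)
atlantic_2 = pf_summary(10_544_542, 9_252_168)

print(atlantic_1)
print(atlantic_2)
```

The recomputed read counts and percentages can be compared directly against the Total PE Reads, % PF Passed Clusters, and Total PE PF Reads columns.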

Velvet Assemblies of Potato Illumina Sequences
With a minimum k-mer of 31 and a minimum contig length of 150 bp:

Variety    Total Gb   Transcriptome Size (Mb)   No. Contigs   N50 (bp)   Maximum Contig (Kb)
Atlantic   1.8        38.4                      45215         666        11.2
Premier    1.9        38.2                      54917         408        6.6
Snowden    2.0        -                         58754         358        6.9
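For readers unfamiliar with the N50 column: under the usual definition, N50 is the contig length at which the contigs of that length or longer contain at least half of the assembled bases. The sketch below applies that definition to a made-up list of contig lengths. It is not Velvet's own code, and the `min_len` parameter only mirrors the 150 bp minimum contig length quoted on the slide.

```python
def n50(contig_lengths, min_len=150):
    """Return the N50 (bp) of the contigs that are at least `min_len` long."""
    lengths = sorted((n for n in contig_lengths if n >= min_len), reverse=True)
    if not lengths:
        raise ValueError("no contigs at or above the minimum length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

# Hypothetical contig lengths; the 100 bp contig is dropped by the filter.
toy_contigs = [1000, 800, 600, 400, 200, 150, 100]
print(n50(toy_contigs))
```

A higher N50 at a similar contig count, as for Atlantic compared with Premier and Snowden, indicates that more of the assembly sits in longer contigs.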

Velvet Assemblies of Potato Illumina Sequences
Alignment of S. tuberosum GAII-transcriptome contigs to the PGSC draft genome sequence from DM1-3 516R44:

Variety    Contigs   Align with GMAP (95% id, 50% cov)   Align with GMAP (95% id, 90% cov)
Atlantic   45214     32520                               27106
Premier    54917     41497                               37297
Snowden    58754     44479                               40708
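The GMAP counts are easier to compare across varieties as percentages of each assembly. The sketch below converts the counts from the slide into percentages of contigs aligned at each coverage threshold. The dictionary layout and the helper `aligned_percent` are illustrative assumptions.

```python
# (total contigs, aligned at 95% id / 50% cov, aligned at 95% id / 90% cov)
# Counts copied from the slide; the Atlantic total is the 45214 quoted there.
gmap_counts = {
    "Atlantic": (45214, 32520, 27106),
    "Premier": (54917, 41497, 37297),
    "Snowden": (58754, 44479, 40708),
}

def aligned_percent(total, aligned):
    """Percentage of contigs that aligned, rounded to one decimal place."""
    return round(100 * aligned / total, 1)

for variety, (total, cov50, cov90) in gmap_counts.items():
    print(variety, aligned_percent(total, cov50), aligned_percent(total, cov90))
```

In each variety, roughly 70 to 76 percent of contigs align at the 50% coverage threshold, and fewer align at the stricter 90% threshold.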

Tomato Illumina GA II Output

Variety    Insert Size   Read Length   Total Reads   PF Reads     %PF Passed   Total PF
FL7600     300           61/47         22,491,304    20,685,342   92.0
                         60            16,025,976    14,382,577   89.8
                                       15,645,164    13,985,875   89.4         49,053,794
NC84173    350           61/61         27,079,946    22,687,626   83.8
                                       11,058,431    10,366,811   93.8
                                       14,401,240    12,687,134   88.1         52,539,617
OH9242                                 26,960,898    24,874,218   92.3
                                       10,316,775     9,671,753
                                       14,676,814    12,879,812   87.8         51,954,487
T5                                     26,799,944    24,677,302   92.1
                                       16,822,639    14,738,351   87.6
                                       15,726,257    13,744,511   87.4         59,348,840
PI114490                               17,721,226    16,422,842   92.7
                                       17,115,349    14,902,672   87.1
                                       17,890,649    15,248,587   85.2         52,727,224
PI212816                               17,631,906    16,450,422   93.3
                                       18,238,179    15,354,882   84.2
                         84            21,829,622    18,500,235   84.8         57,699,707

Velvet Assemblies of Tomato Illumina Sequences
With a k-mer length of 31 and a minimum contig length of 150 bp:

Variety    Total Gb   Transcriptome Size (Mb)   No. Contigs   N50 (bp)   Maximum Contig (Kb)
FL7600     2.82       39.8                      59,581        424        12.1
NC84173    2.77       39.2                      60,534        496        13.3
OH9242     2.70       39.1                      59,051        476        11.6
T5         3.04       40.6                      60,031        632        14
PI114490   -          41                        61,310        690        11.7
PI212816   3.00       41.1                      66,118        471        -

Sequence quality: Viewing an Atlantic potato contig from the Velvet assembly

Alignment of contigs relative to DM1-3 516R44
FL7600 (93.7% id; 94.4% coverage)
Snowden (97.9% id; 94.7% coverage)

Identify intra-varietal SNPs

Query          SNPs     Filtered SNPs
Atlantic Asm   224748   150669
Premier Asm    265673   181800
Snowden Asm    258872   166253

[Figure: example A/C SNP]

Filtered SNP counts
Filtering on SNP quality and 1 SNP/150 bp window

Ref / Query   d 10    d 20    d 30    d 40    d 50    d 60    d 100
atlantic      21336   17509   14493   12150   10277   8673    4435
premier       21789   18050   15084   12477   10584   8919    4620
snowden       19997   16518   13694   11378   9689    8048    4173
              21117   17096   14106   11785   9790    8222    4228
              22951   18431   15016   12377   10300   8703    4371
              20972   16846   13709   11357   9479    7873    4113
              20777   16998   13984   11619   9647    8131    4186
              22101   17888   14701   12068   10124   8650    4223
              21083   16963   13792   11218   9359    7735    3896
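One possible reading of the filter named on this slide (SNP quality, a minimum read depth d, and at most one SNP per 150 bp window) is sketched below. The record layout, the quality cutoff of 20, and the decision to discard every SNP that has a passing neighbor inside the window are all assumptions. The slide does not give these details, and `filter_snps` is a hypothetical helper, not the SolCAP pipeline.

```python
from collections import defaultdict

def filter_snps(snps, min_depth=10, min_qual=20, window=150):
    """Keep SNPs that pass the depth and quality cutoffs and have no other
    passing SNP on the same contig within `window` bp.

    `snps` is an iterable of (contig, position, depth, quality) tuples.
    """
    by_contig = defaultdict(list)
    for contig, pos, depth, qual in snps:
        if depth >= min_depth and qual >= min_qual:
            by_contig[contig].append(pos)

    kept = []
    for contig, positions in by_contig.items():
        positions.sort()
        last = len(positions) - 1
        for i, pos in enumerate(positions):
            left_ok = i == 0 or pos - positions[i - 1] >= window
            right_ok = i == last or positions[i + 1] - pos >= window
            if left_ok and right_ok:
                kept.append((contig, pos))
    return kept

# Hypothetical SNP calls: (contig, position, depth, quality)
toy = [
    ("contig1", 100, 35, 40),
    ("contig1", 180, 5, 40),    # fails the depth cutoff
    ("contig1", 600, 50, 30),
    ("contig1", 700, 50, 30),   # within 150 bp of position 600: both dropped
    ("contig2", 50, 12, 25),
]
print(filter_snps(toy, min_depth=10))
print(filter_snps(toy, min_depth=30))
```

Raising `min_depth` can only shrink the retained set, which matches the direction of the trend from d 10 to d 100 in the table.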

Filtered SNP counts
Filtering on SNP quality and 1 SNP/150 bp window
[Figure: No. SNPs and validation rate plotted against depth of coverage]

Genotyping platforms…. Comments on quality control… Data…. direct comparison of sequence analysis of SNPs across populations

Comparison of two genes on tomato chromosome 9 BAC: COS and R-gene

COSII
Fresh Market vs Fresh Market: Identities = 573/573 (100%), Gaps = 0/573 (0%)
Fresh Market vs Processing: Identities = 569/569 (100%), Gaps = 0/569 (0%)
S. lycopersicum vs S. pimpinellifolium: Identities = 339/341 (99%), Gaps = 0/341 (0%)
Potato vs Potato: Identities = 606/612 (99%), Gaps = 0/612 (0%)
Tomato vs Potato: Identities = 914/948 (96%), Gaps = 6/948 (0%)

DIVERGED SEQUENCE
Fresh Market vs Fresh Market: Identities = 959/959 (100%), Gaps = 0/959 (0%)
Fresh Market vs Processing: Identities = 1560/1560 (100%), Gaps = 0/1560 (0%)
S. lycopersicum vs S. pimpinellifolium: Identities = 612/613 (99%), Gaps = 0/613 (0%)
Tomato vs Potato: Identities = 223/280 (79%), Gaps = 11/280 (3%)
Potato vs Potato: Identities = 246/278 (88%), Gaps = 7/278 (2%)
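The percentages in the COSII and diverged-sequence comparisons are consistent with integer truncation rather than rounding (6 gaps in 948 columns is shown as 0%, and 223/280 as 79%). The sketch below reproduces the reported integers and prints a one-decimal value next to each, which makes the tomato-potato contrast between the two genes clearer. The helper name `truncated_percent` is illustrative.

```python
def truncated_percent(count, length):
    """Integer percentage, truncated rather than rounded."""
    return (100 * count) // length

# (identities, alignment length) copied from the slides
comparisons = {
    "COSII, Tomato vs Potato": (914, 948),
    "Diverged, Tomato vs Potato": (223, 280),
    "Diverged, Potato vs Potato": (246, 278),
}

for label, (identities, length) in comparisons.items():
    exact = round(100 * identities / length, 1)
    print(label, truncated_percent(identities, length), exact)
```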

What patterns do we expect to see for genes “under selection”?
Low variation (fixed)
High Ka/Ks (mutations affect protein, possible diversifying selection)
Mutations (loss of function)
FST (genes that distinguish populations)
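To make the FST criterion concrete, the sketch below computes FST for a single biallelic marker from per-population allele frequencies, using the heterozygosity-based definition FST = (H_T - H_S) / H_T. The slides do not say which estimator was used, so this is an assumed textbook form. The function name, the equal default weights, and the example frequencies are all hypothetical.

```python
def fst_biallelic(freqs, sizes=None):
    """FST = (H_T - H_S) / H_T for one biallelic locus.

    `freqs` holds the frequency of one allele in each population;
    `sizes` optionally weights the populations (equal weights by default).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [s / total for s in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    if h_t == 0:
        return 0.0                                     # monomorphic everywhere
    return (h_t - h_s) / h_t

print(fst_biallelic([1.0, 0.0]))   # alternate alleles fixed in each population
print(fst_biallelic([0.5, 0.5]))   # identical frequencies
print(fst_biallelic([0.9, 0.1]))   # strongly differentiated
```

Values near 1 mark a marker whose alleles separate the populations, and values near 0 mark a marker that does not distinguish them.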

Population structure: coding vs. non-coding
[Figure panels: all 173 markers (K=6); 89 coding markers (K=5); 84 non-coding markers (K=6)]
[Population labels: Processing, Fresh-market, Vintage, Landrace; region labels: CA & OH, OH, CN and CA, OH, OH, CN]
500K burn-in / 750K MCMC reps, 20 runs for each K from 3 to 8

Distribution of FST for genes
[Figure: FST values for ovate, fw2.2, and sp6 in six panels, listed in slide order]
ovate: 0      fw2.2: 0      sp6: 0.14
ovate: 0.26   fw2.2: 0      sp6: 0.73
ovate: 0      fw2.2: 0.5    sp6: 1
ovate: 0      fw2.2: 0.42   sp6: 0.74
ovate: 0.14   fw2.2: 0.46   sp6: 0.05
ovate: 0.31   fw2.2: 0      sp6: 0.47

Examples of highly polymorphic genes within S. lycopersicum Note: I am working on a replacement that compares Ka/Ks for selected tomato and potato genes

Distribution of PM genes across populations is not random
[Figure panels: Processing, Fresh Market, Vintage, Wild]

Visit us at http://solcap.msu.edu/ Tools, Downloads

Conclusions
~5.7 Gb PF potato transcriptome sequence (3 varieties)
~14.3 Gb PF tomato transcriptome sequence (6 varieties)
The DM1-3 516R44 draft genome is an excellent scaffold for potato and tomato GAII transcriptome alignments.
SNPs are not evenly distributed in genes/genomes.
Genes with signatures of selection (Ka/Ks; high FST) tend to be genes associated with response to abiotic and biotic stress.
Co-adapted complexes result from selection during plant breeding.
Lessons learned: A control GAII sequence of DM1-3 516R44 would permit bioinformatic optimization of pipelines rather than relying on empirical validation.

Acknowledgments
Collaborators, CAU: Wencai Yang
Collaborators, CAAS: Sanwen Huang
Collaborators, OSU: Matt Robbins, Sung-Chur Sim, Troy Aldrich
Collaborators, Cornell: Walter De Jong, Lucas Mueller, Joyce van Eck
Collaborators, UCD: Allen Van Deynze, Kevin Stoffel, Alex Kozic
Collaborators, MSU: David Douches, C. Robin Buell, John Hamilton, Kelly Zarka
Funding: USDA/AFRI
This project is supported by the Agriculture and Food Research Initiative of USDA’s National Institute of Food and Agriculture.