GBS Bioinformatics Pipeline(s) Overview

GBS Bioinformatics Pipeline(s) Overview: Getting from sequence files to genotypes.
Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman
Presentation: Terry Casstevens, with supporting information from the coders.

Three Pipelines
Discovery Pipeline: Requires a reference genome. Multiple steps to get to genotypes. The hands-on tutorial is based on this pipeline.
Production Pipeline: Uses information from the Discovery Pipeline. One step from sequence to genotypes.
UNEAK Pipeline: For species without a reference genome. Fei Lu will present this tomorrow at 9:30.

Vocabulary
Sequence File: Text file containing DNA sequence reads and supplemental information from the Illumina platform.
Taxa: An individual sample.
GBS Bar Code: A short known sequence of DNA used to assign a GBS Tag to its original Taxa.
Key File: Text file used to assign a GBS Bar Code to a Taxa.
GBS Tag: DNA sequence consisting of a cut site remnant and additional sequence.
Plugin: TASSEL pipeline module that performs a specific task.

GBS Discovery Pipeline (diagram): Sequence -> Tag Counts -> TOPM; Sequence and Tag Counts -> Tags by Taxa; TOPM and Tags by Taxa -> SNP Caller -> Genotypes

Raw Sequence (Qseq)
HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG
HWI-ST397 0 3 68 15960 200043 0 1 GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA
HWI-ST397 0 3 68 15831 200053 0 1 ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG
HWI-ST397 0 3 68 15867 200049 0 1 CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT
HWI-ST397 0 3 68 15943 200048 0 1 GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA
HWI-ST397 0 3 68 15812 200062 0 1 TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA
HWI-ST397 0 3 68 15888 200067 0 1 CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG
HWI-ST397 0 3 68 15969 200067 0 1 CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
HWI-ST397 0 3 68 15786 200078 0 1 GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG
HWI-ST397 0 3 68 15830 200072 0 1 AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG
HWI-ST397 0 3 68 15863 200073 0 1 CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA
HWI-ST397 0 3 68 15762 200088 0 1 TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
HWI-ST397 0 3 68 15903 200085 0 1 GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
HWI-ST397 0 3 68 15921 200082 0 1 GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC
HWI-ST397 0 3 68 15984 200085 0 1 TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT
HWI-ST397 0 3 68 15788 200096 0 1 GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC
HWI-ST397 0 3 68 15842 200099 0 1 TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG
HWI-ST397 0 3 68 15876 200105 0 1 GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG
HWI-ST397 0 3 68 15937 200097 0 1 CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG
HWI-ST397 0 3 68 15958 200102 0 1 CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
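
As an illustration of how one such record could be split into fields, here is a minimal sketch. It assumes the usual Illumina Qseq column order (machine, run, lane, tile, x, y, index, read number, sequence), with the quality and filter columns that the slide excerpt omits simply ignored. The names `QseqRead` and `parse_qseq_line` are made up for this example and are not part of TASSEL.

```python
from typing import NamedTuple


class QseqRead(NamedTuple):
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str


def parse_qseq_line(line: str) -> QseqRead:
    """Split one whitespace-separated Qseq record into named fields.

    Only the first nine columns are used; quality and filter columns,
    if present, are ignored in this sketch.
    """
    f = line.split()
    if len(f) < 9:
        raise ValueError("expected at least 9 columns in a Qseq record")
    return QseqRead(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
                    f[6], int(f[7]), f[8])


# First record from the slide, with the sequence shortened for readability.
example = "HWI-ST397 0 3 68 15896 200039 0 1 GTCGATTCTGCTGACTTCATGGCTTCTG"
read = parse_qseq_line(example)
print(read.lane, read.tile, read.sequence[:8])
```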

Key File

GBS Tags
Fragment from GBS library: Barcode adapter, Cut site, Insert, Common adapter
'Good' reads (only the first 64 bases after the barcode are kept):
typical read: Barcode, Cut site, Insert (first 64 bases)
short fragment: Barcode, Cut site, Insert (<64 bp), Cut site, Common adapter
chimera or partial digestion: Barcode, Cut site, Insert (<64 bp), Cut site, 2nd Insert
Rejected reads:
adapter dimer: Barcode, Cut site, Common adapter
not matching barcode and cut site remnant
contains N in first 64 bases after the barcode
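
The acceptance rules on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the TASSEL implementation: the cut-site remnants `CAGC` and `CTGC` assume the ApeKI enzyme, the function name `extract_tag` is hypothetical, and truncating short fragments at a second cut site or at the common adapter is deliberately left out.

```python
from typing import Iterable, Optional, Tuple

TAG_LENGTH = 64
# Assumed cut-site remnants for ApeKI (recognition site GCWGC); other enzymes differ.
CUT_SITE_REMNANTS = ("CAGC", "CTGC")


def extract_tag(read: str, barcodes: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (barcode, tag) for a good read, or None if the read is rejected."""
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if not read.startswith(barcode):
            continue
        after_barcode = read[len(barcode):]
        if not after_barcode.startswith(CUT_SITE_REMNANTS):
            continue  # barcode matched but no cut-site remnant follows it
        tag = after_barcode[:TAG_LENGTH]  # keep at most 64 bases after the barcode
        if "N" in tag or "." in tag:
            return None  # uncalled base within the kept bases
        return barcode, tag
    return None  # no barcode plus cut-site remnant match


# Illustrative barcode only; real barcodes come from the key file.
result = extract_tag("GTCGATTCTGCTGACTTCATGG", ["GTCGATT"])
```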

Tag Counts
Using information from the key file, each sequence file is processed: tags are identified and counted. A tag shorter than 64 bases is padded to 64 bases. The tags and their counts are written to one tag count file per sequence file.
Plugins: QseqToTagCountsPlugin / FastqToTagCountsPlugin
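
A minimal sketch of the pad-and-count step is shown below. The slide does not say which base is used for padding, so `PAD_BASE = "A"` is an assumption, and `count_tags` is a hypothetical helper rather than the plugin's code.

```python
from collections import Counter
from typing import Iterable

TAG_LENGTH = 64
PAD_BASE = "A"  # assumption: the slide does not name the padding base


def count_tags(tags: Iterable[str]) -> Counter:
    """Pad each tag to TAG_LENGTH and count identical padded tags."""
    counts: Counter = Counter()
    for tag in tags:
        padded = tag[:TAG_LENGTH].ljust(TAG_LENGTH, PAD_BASE)
        counts[padded] += 1
    return counts


counts = count_tags(["CAGCTT", "CAGCTT", "CTGCAA"])
```

Padding alone cannot distinguish a short tag from a longer tag that happens to end in the padding base, so a real implementation would presumably also record each tag's original length.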

Master Tag Counts
The individual tag count files are merged into a master tag count file. A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).
Plugin: MergeMultipleTagCountsPlugin
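
The merge can be pictured as summing counts per tag across files and then applying the threshold to the summed totals. The sketch below uses in-memory dictionaries in place of tag count files; `merge_tag_counts` and `min_count` are illustrative names.

```python
from collections import Counter
from typing import Iterable, Mapping


def merge_tag_counts(files: Iterable[Mapping[str, int]], min_count: int) -> Counter:
    """Sum per-file tag counts, then keep tags seen at least min_count times."""
    merged: Counter = Counter()
    for counts in files:
        merged.update(counts)  # adds counts for tags already present
    return Counter({tag: n for tag, n in merged.items() if n >= min_count})


master = merge_tag_counts([{"TAG_A": 2, "TAG_B": 1}, {"TAG_A": 3, "TAG_C": 1}], min_count=2)
```

Because the filter runs after summing, a tag that is rare in every single file can still survive if its total across files reaches the threshold.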

Conversion of Tags to Fastq
Sequence aligners do not work with the tag count file format. In preparation for the alignment step, the master tag count file is converted to fastq format.
Plugin: TagCountsToFastqPlugin
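
One way to picture this conversion: each tag becomes a four-line fastq record. Tags carry no per-base qualities, so the sketch below writes a constant placeholder quality string. The header layout and the quality character are assumptions made for illustration, not the format the plugin actually writes.

```python
from typing import Mapping


def tags_to_fastq(tag_counts: Mapping[str, int]) -> str:
    """Render each tag as a four-line fastq record with placeholder qualities."""
    records = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        header = f"@tag{i};count={count}"  # hypothetical header layout
        quality = "I" * len(tag)           # placeholder: tags have no real qualities
        records.append(f"{header}\n{tag}\n+\n{quality}\n")
    return "".join(records)


fastq_text = tags_to_fastq({"ACGT": 3})
```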

Tag Alignment / TOPM
The GBS pipeline uses an external aligner for the initial alignment. The current version uses bowtie2, which produces the alignment in SAM format. The SAM file is then converted into our Tags On Physical Map (TOPM) format.
Tools: bowtie2, SAMConverterPlugin
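
To show what information the conversion has to pull out of the alignment, here is a small sketch that reads SAM records and keeps the tag name, chromosome, position and strand of each mapped tag. It relies only on the standard SAM columns and FLAG bits (0x4 unmapped, 0x10 reverse strand). `TagPosition` and `parse_sam` are hypothetical names, and the real TOPM format stores more than this.

```python
from typing import Iterable, Iterator, NamedTuple


class TagPosition(NamedTuple):
    tag_name: str
    chromosome: str
    position: int  # 1-based leftmost mapped position (SAM POS column)
    strand: str


def parse_sam(lines: Iterable[str]) -> Iterator[TagPosition]:
    """Yield the physical position of every mapped tag in a SAM stream."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # tag did not align
        strand = "-" if flag & 0x10 else "+"
        yield TagPosition(fields[0], fields[2], int(fields[3]), strand)


sam_lines = [
    "@HD\tVN:1.0",
    "tag0\t0\t1\t2100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "tag1\t16\t2\t500\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "tag2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
positions = list(parse_sam(sam_lines))
```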

TOPM

So Far We Have
Identified and counted GBS tags.
Converted the tag counts file to fastq.
Aligned the tags to a reference.
Converted the alignment to TOPM.

Tags by Taxa
In this step we identify which tags are present in which taxa.
Inputs: original sequence files, key file, master tag count file.
Recently migrated to the HDF5 file format: efficient storage, large data sets.
Plugin: SeqToTBTHDF5Plugin
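
Conceptually, the result is a taxa-by-tag table of counts. The sketch below builds one in memory from (barcode, tag) pairs, a barcode-to-taxon key and the master tag list. It uses plain dictionaries instead of HDF5, and `build_tags_by_taxa` is an illustrative name, not the plugin's API.

```python
from typing import Dict, Iterable, List, Tuple


def build_tags_by_taxa(
    reads: Iterable[Tuple[str, str]],
    key: Dict[str, str],
    master_tags: List[str],
) -> Dict[str, List[int]]:
    """Count, per taxon, how often each master tag was observed.

    reads: (barcode, tag) pairs; key: barcode -> taxon name.
    Reads with an unknown barcode or a tag outside the master list are skipped.
    """
    tag_index = {tag: i for i, tag in enumerate(master_tags)}
    tbt = {taxon: [0] * len(master_tags) for taxon in sorted(set(key.values()))}
    for barcode, tag in reads:
        taxon = key.get(barcode)
        column = tag_index.get(tag)
        if taxon is None or column is None:
            continue
        tbt[taxon][column] += 1
    return tbt


tbt = build_tags_by_taxa(
    reads=[("AAC", "T1"), ("AAC", "T1"), ("GGT", "T2"), ("GGT", "TX"), ("CCC", "T1")],
    key={"AAC": "taxon1", "GGT": "taxon2"},
    master_tags=["T1", "T2"],
)
```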

Tags By Taxa: Additional Operations
If many TBTs have been created, they are merged into one TBT.
Taxa that were sequenced multiple times are merged.
The TBT table is pivoted in preparation for SNP calling.
Plugin: ModifyTBTHDF5Plugin
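
Two of these operations, merging replicate taxa and pivoting, can be sketched on a small in-memory table. The functions and the sample names below are hypothetical, and the real plugin works on HDF5 files rather than dictionaries.

```python
from typing import Dict, List, Tuple


def merge_replicates(
    tbt: Dict[str, List[int]], merged_name: Dict[str, str]
) -> Dict[str, List[int]]:
    """Sum the tag counts of taxa that map to the same merged name."""
    merged: Dict[str, List[int]] = {}
    for taxon, counts in tbt.items():
        name = merged_name.get(taxon, taxon)
        if name in merged:
            merged[name] = [a + b for a, b in zip(merged[name], counts)]
        else:
            merged[name] = list(counts)
    return merged


def pivot(tbt: Dict[str, List[int]]) -> Tuple[List[str], List[List[int]]]:
    """Turn a taxon -> counts-per-tag table into one row per tag."""
    taxa = sorted(tbt)
    by_tag = [list(row) for row in zip(*(tbt[t] for t in taxa))]
    return taxa, by_tag


# Hypothetical sample names: the same line sequenced in two runs.
raw = {"lineA:run1": [1, 0], "lineA:run2": [2, 1], "lineB:run1": [0, 3]}
names = {"lineA:run1": "lineA", "lineA:run2": "lineA", "lineB:run1": "lineB"}
merged = merge_replicates(raw, names)
taxa, by_tag = pivot(merged)
```

After the pivot each row holds one tag's counts across all taxa, so SNP calling can look at a tag across every taxon in a single read.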

SNP Calling
Files used in SNP calling: TOPM, TBT, Pedigree File (optional)
Some key settings:
mnF: Minimum F (inbreeding coefficient)
mnMAF: Minimum Minor Allele Frequency
mnMAC: Minimum Minor Allele Count
mnLCov: Minimum Locus Coverage
Plugin: TagsToSNPByAlignmentPlugin
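
To make the four settings concrete, here is a sketch that computes the corresponding statistics for one site and tests them against thresholds. Several things are assumptions: locus coverage is taken as the proportion of taxa with a call, F is estimated as 1 - Hobs/Hexp with a biallelic expectation, and a site must pass all four thresholds. How the plugin actually defines and combines them is not stated on the slide.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]  # None marks a missing call


def site_stats(genotypes: List[Genotype]) -> Dict[str, Optional[float]]:
    """Coverage, minor allele count/frequency and inbreeding coefficient for one site."""
    called = [g for g in genotypes if g is not None]
    alleles = Counter(a for g in called for a in g)
    ranked = alleles.most_common()
    mac = ranked[1][1] if len(ranked) > 1 else 0  # count of second most common allele
    total = sum(alleles.values())
    maf = mac / total if total else 0.0
    het_obs = sum(1 for a, b in called if a != b) / len(called) if called else 0.0
    het_exp = 2 * maf * (1 - maf)  # biallelic expectation (assumption)
    f = 1 - het_obs / het_exp if het_exp > 0 else None
    coverage = len(called) / len(genotypes) if genotypes else 0.0
    return {"coverage": coverage, "mac": mac, "maf": maf, "f": f}


def passes(stats: Dict[str, Optional[float]], mn_f: float, mn_maf: float,
           mn_mac: int, mn_lcov: float) -> bool:
    """Apply all four thresholds; requiring all of them is an assumption of this sketch."""
    return (
        stats["coverage"] >= mn_lcov
        and stats["maf"] >= mn_maf
        and stats["mac"] >= mn_mac
        and stats["f"] is not None
        and stats["f"] >= mn_f
    )


site = [("A", "A"), ("A", "A"), ("G", "G"), ("A", "G"), None]
stats = site_stats(site)
```

In this toy site four of five taxa are called, the minor allele G is seen 3 times out of 8, and one of the four called genotypes is heterozygous.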

HapMap
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G 1 2100 + N N N N N N N R N A N
S1_2163 T/C 1 2163 + N N N N N N T C T T N
S1_13837 T/G 1 13837 + N N N N N N N G N N T
S1_14606 C/T 1 14606 + N N C N N N T T T T C
S1_2061 T/A 1 20601 + T N N N N N N A N N N
S1_68332 C/T 1 68332 + N N N N N N N N N N N
S1_68596 A/T 1 68596 + A N N N N N N N N A N
S1_69309 G/A 1 69309 + N G N N N N N A N N N
S1_79955 T/G 1 79955 + N T G T T N T T N N N
S1_79961 T/G 1 79961 + N T T T T N T T N N N
S1_80584 G 1 80584 + N N N N N N N N N N G
S1_80647 C/T 1 80647 + N N N N N N N C N N C
S1_81274 T/G 1 81274 + N N N N N N T G N N N
S1_108834 G/A 1 108834 + N N N N N N N N N N N
S1_112345 T/G 1 112345 + N N N N N N K T N N N
S1_115359 C/T 1 115359 + N N N N N N T C N T
S1_115362 T/C 1 115362 + N N N N N N N C N N N
S1_115405 G/A 1 115405 + G G A N N G G G G N
S1_115516 T/G 1 115516 + N N T N N N T T N N T
S1_116694 A/G 1 116694 + N A G N N N G A N N N
S1_119016 C/T 1 119016 + N N N N C N N C N N N
S1_155366 T/C 1 155366 + N T N N N N
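
Each genotype in this table is a single IUPAC nucleotide code: a plain base for a homozygote, an ambiguity code such as R (A/G) or K (G/T) for a heterozygote, and N for a missing call. The sketch below decodes one row. It assumes the abbreviated five-column layout shown on the slide (a full HapMap file has more leading columns), and `parse_hapmap_row` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Standard IUPAC codes for unphased diploid calls; N (missing) is not listed.
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def decode_call(code: str) -> Optional[Tuple[str, str]]:
    """Return the two alleles for an IUPAC code, or None for N or unknown codes."""
    return IUPAC.get(code.upper())


def parse_hapmap_row(line: str) -> Tuple[str, int, List[Optional[Tuple[str, str]]]]:
    """Parse a row laid out as on the slide: id, alleles, chrom, pos, strand, calls."""
    fields = line.split()
    site_id, position = fields[0], int(fields[3])
    calls = [decode_call(code) for code in fields[5:]]
    return site_id, position, calls


site_id, position, calls = parse_hapmap_row("S1_2100 A/G 1 2100 + N N N N N N N R N A N")
```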

GBS Discovery pipeline (diagram): Fastq -> Tag Counts -> TOPM; Fastq and Tag Counts -> Tags by Taxa; TOPM and Tags by Taxa -> SNP Caller -> Genotypes -> Filtered Genotypes

Production Pipeline

Why another pipeline?
The last maize build (30,000 taxa) with the discovery pipeline took weeks.
Most common alleles have been identified after the first few discovery builds.
Use the information from the discovery pipeline to call SNPs in new runs quickly.
Improve efficiency and automate.

GBS Bioinformatics Pipelines (diagram)
Discovery: Fastq -> Tag Counts -> TOPM; Fastq and Tag Counts -> Tags by Taxa; TOPM and Tags by Taxa -> SNP Caller -> Genotypes -> Filtered Genotypes
Production: Fastq and production TOPM (TagsOnPhysicalMap, carrying information from Discovery) -> Genotypes

Running the Production Pipeline
Required files: sequence file (fastq or qseq), key file, production TOPM
Software: TASSEL 3 Standalone and RawReadsToHapMapPlugin
Running the pipeline: one lane is processed at a time, producing HapMap files by chromosome, in about 40 minutes.

Testing Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site comparison: Discovery 48,139; Production 47,676. The difference is due to the maximum of 8 alleles.
99.98% correlation of genetic distance matrices.
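
A comparison of this kind can be sketched as a correlation between the off-diagonal entries of two taxa-by-taxa distance matrices. The slide does not say which correlation measure was used, so Pearson correlation over the upper triangle is an assumption, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def upper_triangle(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Collect entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def matrix_correlation(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Correlate the pairwise distances of two symmetric distance matrices."""
    return pearson(upper_triangle(a), upper_triangle(b))


# Toy distance matrices for three taxa; the second is the first scaled by two.
discovery = [[0.0, 0.1, 0.4], [0.1, 0.0, 0.3], [0.4, 0.3, 0.0]]
production = [[0.0, 0.2, 0.8], [0.2, 0.0, 0.6], [0.8, 0.6, 0.0]]
r = matrix_correlation(discovery, production)
```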

Next Steps In Pipeline Development
Hierarchical Data Format (HDF5): supports very large data sets and complex data structures.
Working to fuse the TOPM, TBT, key file, and pedigree file into one HDF5 repository.
Continued improvements to the SNP caller.
Ability to use tags not present in the reference.