GBS Bioinformatics Pipeline(s) Overview

Name: GBS Bioinformatics Pipeline(s) Overview
Uploaded: 2017-10-10T00:11:44+00:00
Duration: PTM28S45
Channel: Sylvia Dennis
Description: GBS Bioinformatics Pipeline(s) Overview

GBS Bioinformatics Pipeline(s) Overview
Getting from sequence files to genotypes.
Pipeline Coding: Ed Buckler, Jeff Glaubitz, James Harriman
Presentation: Terry Casstevens, with supporting information from the coders.

Three Pipelines
Discovery Pipeline: Requires a reference genome. Multiple steps to get to genotypes. Hands-on tutorial is based on this pipeline.
Production Pipeline: Uses information from the Discovery Pipeline. One step from sequence to genotypes.
UNEAK Pipeline: For species without a reference genome. Fei Lu will present this tomorrow at 9:30.

Vocabulary
Sequence File: Text file containing DNA sequence reads and supplemental information from the Illumina platform.
Taxa: An individual sample.
GBS Bar Code: A short known sequence of DNA used to assign a GBS Tag to its original Taxa.
Key File: Text file used to assign a GBS Bar Code to a Taxa.
GBS Tag: DNA sequence consisting of a cut site remnant and additional sequence.
Plugin: TASSEL pipeline module that performs a specific task.

GBS Discovery Pipeline
Diagram: Sequence -> Tag Counts -> TOPM; Sequence + Tag Counts -> Tags by Taxa; TOPM + Tags by Taxa -> SNP Caller -> Genotypes

Raw Sequence (Qseq)
HWI-ST GTCGATTCTGCTGACTTCATGGCTTCTGTTGACGACGATGTGGAACGAGCTGTTGTTGAAACTGATGAGGTTGCTGAGATCGGAAGAGCGGTTCAGCAGG
HWI-ST GAGAATCAGCTTTTCCAACACCTTGAGTTTGAGTATGCGATGACAGTTACTCTTACTGTCCATTGTCAGCATTGCCAGAGCTTGACCAGCTGAGATCGGA
HWI-ST ATGTACTGCACCGTTGCAAGCGAGCACCACCAAGCGGCGGTATGCACTTTGCAATATGTAGCTAGAATAGGATTTTCAGGTGATTAGGAGCGTAAAAAAG
HWI-ST CCAGCTCAGCCTGCATTCTTTCAAAAACTTCCAATGCCTCTCTTGGCCTAGCATTTTGGGCATACCCTGTGACCATTGCTGTCCATGCCACCATATCCTT
HWI-ST GATTTTACTGCACATCGGTCTTGTCACACCAGCTATACCTGTAGAGTTGCCTTCCACAGTTGTAGAGATCGGAAGAGCGGTTCAGCGGGACTGCCGAGAA
HWI-ST TCACCCAGCATCACGCCCCTTCACATCCAGTAAAACCCCTGAATGATGTGCTGTCACTGTTTGATATACAGTTGTTAACGTGAGGACGGGCTTTGAAGGA
HWI-ST CTTGACTGCCACCATGAATATGTGTTCCAAGTGCCACAAGGACTTGGCCCTGAAGCAAGAACAAGCCAAACTTGCAGAGATCGGAAGAGCGGTTCAGCAG
HWI-ST CCACAACTGCTCCATCTTTTCCATGAGACATTGCTCCCGCCATTGCACCCTTGGCATCAGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
HWI-ST GTATTCTGCACACGAATCAGCTGAGACACCAATTGGGCATGAATCAAATGGCGCCATTGCCGGGGATCGAACCCCGAATCAAATGGTGCCATTGCCACTG
HWI-ST AATATGCCAGCAGTTAAGAGAGTTCAAGATCCAGGGCTCATATTCAGTCACCTATATCAATTTCGAAATGGATTTCCAGGGTTTTAAGAGCCTAACAAAG
HWI-ST CTCCCTGCGGGTGCGCGCGACCCATCTTCAGTTGGAGCGTCTATCGGCGTTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTA
HWI-ST TGGTACGTCTGCGGAATGGCGTTTTTTATGCCTTAGTGGTTCGCAGAGCATTTGGCAGCTGAGATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
HWI-ST GGACCTACTGCCCAAGAACGGCTCACCCATCATCCGCTTTCTTCACCTTCCGTCTTCTTTGGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGAC
HWI-ST GAGAATCAGCGTGTACGGGGCACGGGGTGACTGCTGTTGCGTGCGAGGGCTGAGATCGGAAGAGCGGTTCAGCAGGAGTGCCGAGACCGATCTCGTATGC
HWI-ST TTCTCCAGCCGCATGGGCCGGAGACCAGAGAGGCCTCCCCAGGATTTGCACGATAGACCACGACTTATGGACGATTGGGAAGCCCTTGTTGGAAGGAAAT
HWI-ST GCGTCAGCAAATGCCCCAACAGCCAAGTCAGCAATTGCCTCAGCAACTTGGGCCACAAACACCACAGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCC
HWI-ST TAGGCCATCAGCTGACTTCCCGGGTGTGGAGAAAAGAGGGCCCCTCACTTCTCTCAAGTGCTGAGATCGGAAGAGCGGTTCAGCAGGAATGCGGAGACCG
HWI-ST GGACCTACTGCCGGCGGGACGAAAGCGGTTGTTGAATGATGGGGGTCACTAGGCCTTCCAGGGCCTTTAAGCGCGCGCTGAGATCGGAAGAGGGGTTCAG
HWI-ST CTCCCTGTTGAAGCATGTGCAAAAGAGCTTGTTCTCGGCCTTCTTCAAGCCATTCTCTTGGCAGACGGCTTTGCCTAGAAGTTTCGCCCCATCACCCTTG
HWI-ST CGCCTTATCTGCCCTCGCCGGTCATGGGGAGTGGTGCCCCTACCTCGGACAAGACAGATGCAGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG

Key File

GBS Tags
Fragment from GBS library (diagram labels): Barcode adapter, Cut site, Insert, Common adapter
'Good' reads (only the first 64 bases after the barcode are kept):
typical read: Barcode, Cut site, Insert (first 64 bases)
short fragment: Barcode, Cut site, Insert (<64bp), Cut site, Common adapter
chimera or partial digestion: Barcode, Cut site, Insert (<64bp), Cut site, 2nd Insert
Rejected reads:
adapter dimer: Barcode, Cut site, Common adapter
Not matching barcode and cut site remnant
Contains N in first 64 bases after the barcode

Tag Counts
With information from the key file, each sequence file is processed and its tags are identified and counted. If a tag is shorter than 64 bases, it is padded. The tags and counts are put into a tag count file for each sequence file.
Plugins: QseqToTagCountsPlugin / FastqToTagCountsPlugin
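The counting step can be illustrated with a minimal Python sketch. This is not the plugins' implementation: the function names are hypothetical, a single barcode and cut site remnant are passed in directly instead of being read from a key file, and the pad base ("A") is an assumption, since the slide only says that short tags are padded. Trimming short fragments at a second cut site or adapter is not handled.

```python
from collections import Counter

TAG_LENGTH = 64  # bases kept after the barcode, per the slide


def extract_tag(read, barcode, cut_site, pad_base="A"):
    """Return the fixed-length tag for a read, or None if the read is rejected.

    pad_base is an assumption; the slide does not name the padding character.
    """
    if not read.startswith(barcode + cut_site):
        return None  # does not match barcode and cut site remnant
    # The tag starts right after the barcode, so it includes the cut site remnant.
    tag = read[len(barcode):len(barcode) + TAG_LENGTH]
    if "N" in tag:
        return None  # contains N in the bases kept after the barcode
    return tag.ljust(TAG_LENGTH, pad_base)


def count_tags(reads, barcode, cut_site):
    """Count how often each accepted tag occurs in an iterable of reads."""
    counts = Counter()
    for read in reads:
        tag = extract_tag(read, barcode, cut_site)
        if tag is not None:
            counts[tag] += 1
    return counts
```

For example, `count_tags(reads, "ACGT", "CAGC")` returns a `Counter` keyed by 64-base tags; reads with a different barcode or an N in the kept region contribute nothing.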

Master Tag Counts
The individual tag count files are merged into a master tag count file. A minimum count is specified at the merge stage to exclude tags with low counts (likely sequencing errors).
Plugin: MergeMultipleTagCountsPlugin
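A rough sketch of the merge, assuming each tag count file has already been loaded as a mapping from tag to count. The function name is hypothetical, and applying the threshold inclusively to the summed count is an assumption; the slide does not say exactly how the plugin applies it.

```python
from collections import Counter


def merge_tag_counts(per_file_counts, min_count):
    """Sum counts for each tag across files, then drop low-count tags.

    Assumption: the minimum is applied to the summed count, inclusively.
    """
    master = Counter()
    for counts in per_file_counts:
        master.update(counts)  # adds counts for tags already seen
    return {tag: n for tag, n in master.items() if n >= min_count}
```

With two lanes `{"AAA": 3, "CCC": 1}` and `{"AAA": 2, "GGG": 1}` and a minimum count of 2, only `AAA` survives, with a merged count of 5.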

Conversion of Tags to Fastq
Sequence aligners do not work with the tag count file format. In preparation for the alignment step, the Master Tag Count file is converted to fastq format.
Plugin: TagCountsToFastqPlugin
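The conversion amounts to writing one four-line FASTQ record per tag. The sketch below is illustrative only: the record naming scheme and the constant placeholder quality character are assumptions (a merged tag has no per-base quality of its own), and the plugin may do either differently.

```python
def tags_to_fastq(tag_counts, quality_char="I"):
    """Render tags as FASTQ text, one four-line record per tag.

    The record names and the constant quality string are assumptions
    made for illustration.
    """
    lines = []
    for i, (tag, count) in enumerate(sorted(tag_counts.items())):
        lines.append(f"@tag{i}_count={count}")  # hypothetical naming scheme
        lines.append(tag)
        lines.append("+")
        lines.append(quality_char * len(tag))  # placeholder qualities
    return "".join(line + "\n" for line in lines)
```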

Tag Alignment / TOPM
The GBS pipeline uses an external aligner to do the initial alignment. The current version uses bowtie2, which produces the alignment in the SAM format.
We convert the SAM file into our tags on physical map (TOPM) format.
Plugin: SAMConverterPlugin
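To give a feel for what the conversion reads, here is a small sketch that pulls the tag name, chromosome, strand, and position out of one SAM record. The column order and the flag bits (0x4 unmapped, 0x10 reverse strand) come from the SAM format; the function name and the returned dictionary are hypothetical and are not the TOPM layout, which the slide does not describe.

```python
def parse_sam_line(line):
    """Extract tag name, chromosome, strand, and 1-based position from a SAM record.

    Returns None for header lines. The output dictionary is an illustrative
    structure, not the actual TOPM format.
    """
    if line.startswith("@"):
        return None  # SAM header line
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    if flag & 0x4:  # segment unmapped
        return {"tag": name, "chrom": None, "strand": None, "pos": None}
    strand = "-" if flag & 0x10 else "+"  # 0x10: aligned to the reverse strand
    return {"tag": name, "chrom": chrom, "strand": strand, "pos": pos}
```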

So Far We Have
Identified and counted GBS tags.
Converted the tag counts file to fastq.
Aligned the tags to a reference.
Converted the alignment to TOPM.

Tags by Taxa
In this step we identify which tags are present in which taxa.
Inputs: Original Sequence Files, Key File, Master Tag Count File
Recently migrated to HDF5 file format: efficient storage, large data sets.
Plugin: SeqToTBTHDF5Plugin
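A simplified sketch of the tags-by-taxa idea, using plain dictionaries where the pipeline uses HDF5. The function name, the argument layout, and the pad base are assumptions; the key file is reduced to a barcode-to-taxon mapping, and only tags present in the master tag list are recorded.

```python
def build_tbt(reads, barcode_to_taxon, master_tags, cut_site,
              tag_length=64, pad_base="A"):
    """Return tag -> taxon -> read count for tags in the master list.

    Hypothetical helper: a taxon carries a tag when its count is above zero.
    """
    tbt = {tag: {} for tag in master_tags}
    for read in reads:
        for barcode, taxon in barcode_to_taxon.items():
            if read.startswith(barcode + cut_site):
                start = len(barcode)
                tag = read[start:start + tag_length].ljust(tag_length, pad_base)
                if tag in tbt:
                    tbt[tag][taxon] = tbt[tag].get(taxon, 0) + 1
                break  # a read is assigned to at most one barcode
    return tbt
```

The sketch keeps counts rather than a yes/no flag; presence of a tag in a taxon is simply a count greater than zero.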

Tags By Taxa Additional Operations
If many TBTs have been created, they are merged into one TBT. Taxa that were sequenced multiple times are merged. The TBT table is pivoted in preparation for SNP calling.
Plugin: ModifyTBTHDF5Plugin
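The merge and pivot operations can be sketched on a dictionary-based table of the form tag -> sample -> count. Both function names are hypothetical, summing counts for repeated taxa is an assumption, and the pivot direction shown (tag-major to taxon-major) is also an assumption, since the slide does not state the orientation before or after pivoting.

```python
def merge_replicate_taxa(tbt, sample_to_taxon):
    """Sum counts of samples that map to the same taxon name (assumed behavior)."""
    merged = {}
    for tag, per_sample in tbt.items():
        row = merged.setdefault(tag, {})
        for sample, count in per_sample.items():
            taxon = sample_to_taxon.get(sample, sample)
            row[taxon] = row.get(taxon, 0) + count
    return merged


def pivot(tbt):
    """Turn tag -> taxon -> count into taxon -> tag -> count."""
    pivoted = {}
    for tag, per_taxon in tbt.items():
        for taxon, count in per_taxon.items():
            pivoted.setdefault(taxon, {})[tag] = count
    return pivoted
```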

SNP Calling
Files used in SNP Calling: TOPM, TBT, Pedigree File (optional)
Some Key Settings:
mnF: Minimum F (inbreeding coefficient)
mnMAF: Minimum Minor Allele Frequency
mnMAC: Minimum Minor Allele Count
mnLCov: Minimum Locus Coverage
Plugin: TagsToSNPByAlignmentPlugin
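To make the quantities behind mnMAF, mnMAC, and mnLCov concrete, the sketch below computes them for one site from single-letter genotype calls. The definitions used here are assumptions and not necessarily what the plugin computes: alleles are counted per chromosome in diploid calls, the minor allele is the second most common allele, and locus coverage is the fraction of taxa with a non-missing call. How the plugin combines the thresholds is not covered by the slide, so no filtering rule is shown.

```python
from collections import Counter

# IUPAC codes for heterozygous calls in single-letter genotypes.
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def site_stats(calls):
    """Return (minor allele count, minor allele frequency, coverage) for one site.

    Assumed definitions; see the accompanying text.
    """
    alleles = Counter()
    called = 0
    for call in calls:
        if call == "N":
            continue  # missing genotype
        called += 1
        if call in IUPAC_HET:
            alleles.update(IUPAC_HET[call])  # one copy of each allele
        else:
            alleles[call] += 2  # homozygous diploid call
    coverage = called / len(calls) if calls else 0.0
    if len(alleles) < 2:
        return 0, 0.0, coverage  # monomorphic or uncalled site
    minor_count = alleles.most_common()[1][1]
    maf = minor_count / sum(alleles.values())
    return minor_count, maf, coverage
```

For calls `N N N N N N N R N A N`, two of eleven taxa are called, giving three copies of A and one of G.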

HapMap
rs# alleles chrom pos strand SgSBRIL067:633Y5AAXX:2:C9 SgSBRIL019:633Y5AAXX:2:C3
S1_2100 A/G N N N N N N N R N A N
S1_2163 T/C N N N N N N T C T T N
S1_13837 T/G N N N N N N N G N N T
S1_14606 C/T N N C N N N T T T T C
S1_2061 T/A T N N N N N N A N N N
S1_68332 C/T N N N N N N N N N N N
S1_68596 A/T A N N N N N N N N A N
S1_69309 G/A N G N N N N N A N N N
S1_79955 T/G N T G T T N T T N N N
S1_79961 T/G N T T T T N T T N N N
S1_80584 G N N N N N N N N N N G
S1_80647 C/T N N N N N N N C N N C
S1_81274 T/G N N N N N N T G N N N
S1_ G/A N N N N N N N N N N N
S1_ T/G N N N N N N K T N N N
S1_ C/T N N N N N N T C N T
S1_ T/C N N N N N N N C N N N
S1_ G/A G G A N N G G G G N
S1_ T/G N N T N N N T T N N T
S1_ A/G N A G N N N G A N N N
S1_ C/T N N N N C N N C N N N
S1_ T/C N T N N N N

GBS Discovery pipeline
Diagram: Fastq -> Tag Counts -> TOPM; Fastq + Tag Counts -> Tags by Taxa; TOPM + Tags by Taxa -> SNP Caller -> Genotypes -> Filtered Genotypes

Production Pipeline

Why another pipeline?
The last maize build (30,000 taxa) with the discovery pipeline took weeks.
Most common alleles have been identified after the first few discovery builds.
Use the information from the discovery pipeline to call SNPs in new runs quickly.
Improve efficiency and automate.

GBS Bioinformatics Pipelines
Discovery: Fastq -> Tag Counts -> TagsOnPhysicalMap (TOPM); Fastq + Tag Counts -> Tags by Taxa; TOPM + Tags by Taxa -> SNP Caller -> Genotypes -> Filtered Genotypes
Production: Fastq + TOPM (from Discovery) -> Genotypes

Running the Production Pipeline
Required Files:
Sequence file (fastq or qseq)
Key file
Production TOPM
TASSEL 3 Standalone & RawReadsToHapMapPlugin
Running the Pipeline:
One lane processed at a time
HapMap files by chromosome
~40 minutes

Testing Production Pipeline
Compared HapMap files produced by the Discovery Pipeline and the Production Pipeline.
Site Comparison: Discovery 48,139; Production 47,676. Difference due to maximum 8 alleles.
99.98% correlation of genetic distance matrices.
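A comparison of this kind can be sketched as follows. The slide does not say which distance measure or correlation was used, so both are assumptions here: distance is the fraction of mismatching calls over sites called in both taxa, and the correlation is Pearson's, computed over all taxon pairs. The function names are hypothetical, and both inputs are assumed to cover the same taxa and sites.

```python
from itertools import combinations
from math import sqrt


def pairwise_distances(genotypes):
    """genotypes: taxon -> list of calls. Returns one distance per taxon pair.

    Assumed measure: mismatching calls / sites called in both taxa.
    """
    distances = []
    for a, b in combinations(sorted(genotypes), 2):
        shared = [(x, y) for x, y in zip(genotypes[a], genotypes[b])
                  if x != "N" and y != "N"]
        mismatches = sum(1 for x, y in shared if x != y)
        distances.append(mismatches / len(shared) if shared else 0.0)
    return distances


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(pairwise_distances(discovery), pairwise_distances(production))` on genotype tables from the two pipelines yields a value near 1.0 when they agree.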

Next Steps In Pipeline Development
Hierarchical Data Format (HDF5): supports very large data sets and complex data structures.
Working to fuse TOPM, TBT, Key File, and Pedigree File into one HDF5 repository.
Continued improvements to SNP caller.
Ability to use tags not present in the reference.
