A Minimum Information Standard for Reporting NGS Immunogenomic Genotyping Data Steven J. Mack, PhD Children’s Hospital Oakland Research Institute Immunogenomic.


A Minimum Information Standard for Reporting NGS Immunogenomic Genotyping Data
Steven J. Mack, PhD
Children’s Hospital Oakland Research Institute
Immunogenomic NGS Data Standards Consortium Meeting
ASHI 39th Annual Meeting, Chicago
November 17, 2013

Minimum Information Standards
Minimum Information (MI) standards and reporting guidelines identify the data, and the associated metadata, minimally necessary to allow an experiment to be reproduced and interpreted.
MIBBI: Minimum Information for Biological and Biomedical Investigations
  MI standards for reporting biological and biomedical research
  Developed by researchers, key vendors and software developers
Taylor et al. (2008) Nature Biotechnology. 26(8)

Classic MIBBI Example -- MIAME
Minimum Information About a Microarray Experiment
MIAME describes the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified.
Goals:
  recording and reporting microarray-based gene expression data
  define content and structure of the necessary information
Brazma et al. (2001) Nature Genetics 29, 365 – 371. doi: /ng

MIAME Elements
Six elements are required to support microarray-based publications.
1. The raw data for each hybridization
2. The final processed data for the hybridizations
3. Essential sample annotation
4. The experiment design including sample data relationships
5. Sufficient annotation of the array design
6. Essential experimental and data processing protocols
MIAME has been extended to pertain to specific microarray research fields:
  MIAME/Env: Environmental Transcriptomics
  MIAME/Nutr: Nutrigenomics
  MIAME/Plant: Plant Transcriptomics
  MIAME/Tox: Toxicogenomics

What should a MIBBI for Immunogenomic Data Do?
Facilitate downstream analysis for all current use cases
  E.g., generating NMDP allele codes, choosing a donor, data analysis
Foster reanalysis of genotyping results in different nomenclature epochs
  Re-evaluate a result under current, past and future nomenclatures
Backward compatibility with older molecular genotyping methods
  Share SBT, SSO & SSP results using some elements of this standard
Permit the evaluation of performance across NGS platforms and software
  Compare GS Junior vs. Ion Torrent for same samples/primary data
Without attempting to describe all of the technical aspects of an NGS genotyping experiment

Minimum Information for Reporting Immunogenomic NGS Genotyping
Define an NGS Genotyping Result with 10 Categories of Information
1. Sample Annotation
2. Reference Context
3. Full Genotype
4. Consensus Sequence
5. Unreferenced Sequences
6. Novel Polymorphisms
7. Sequence Regions Targeted
8. Read Metadata
9. Primary Data
10. Platform Documentation
[Diagram labels classifying the categories: Dynamic / Static; MIRING Message / Accessory Data; NGS/HTS Specific / Method Independent]
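As a rough illustration of how the ten categories could serve as a completeness checklist, the sketch below checks a message for missing categories. Representing a message as a dictionary keyed by category name is an assumption made for this example only; the slides do not define a message layout.

```python
from typing import Dict, List

# The ten MIRING categories, in the order used on the slides.
MIRING_CATEGORIES = [
    "Sample Annotation",
    "Reference Context",
    "Full Genotype",
    "Consensus Sequence",
    "Unreferenced Sequences",
    "Novel Polymorphisms",
    "Sequence Regions Targeted",
    "Read Metadata",
    "Primary Data",
    "Platform Documentation",
]


def missing_categories(message: Dict[str, object]) -> List[str]:
    """Return the categories that are absent or empty in a message.

    The dict-based message layout is hypothetical, not part of MIRING.
    """
    return [name for name in MIRING_CATEGORIES if not message.get(name)]


# Hypothetical partial message containing only a sample identifier.
partial = {"Sample Annotation": "sample12345"}
still_needed = missing_categories(partial)
```

Here `still_needed` lists the nine remaining categories, starting with "Reference Context".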

MIRING Element 1: Sample Annotation
Sample identifiers should be included in the genotyping report, and used consistently across all applicable categories of information.
Includes:
  project-specific sample identifiers
  barcode sequences that identify the sample in the primary data
Multiple identifiers can be included, but a single primary identifier should be applied across the message.
Q: Should Protected Health Information (PHI) be included in the message?

MIRING Element 2: Reference Context
Any reference sequences applied in the genotyping must be explicitly defined in each genotyping report. Different types of reference sequence can be applied for different aspects of genotyping.
The reference genome assembly (or a specific alternate assembly) used for any alignment of reads must be identified with a specific Genome Reference Consortium (GRC) release version.
  GRCh37.p13 (GRC human genome build 37 patch 13)
The reference allele sequence database used for read filtering or genotype calling from the consensus sequence must be identified with a particular IMGT/HLA, IPD-MHC or IPD-KIR Database release version.
  IMGT/HLA Database release version
  IPD-KIR Database release version
  IPD-MHC NHP Database release version 1.9.0

MIRING Element 3: Full Genotype
The genotype is the collection of all ambiguous alleles that are derived from the consensus sequence. All ambiguous alleles and ambiguous genotypes should be explicitly defined in the genotype. This is not a “best guess” for a two-allele genotype call.
Use Genotype List (GL) String format (or a comparable format), and provide a Uniform Resource Identifier (URI) when available.
  KIR3DL2*008/KIR3DL2*038+KIR3DL2*00701|KIR3DL2*027+KIR3DL2*016
Undetermined: how to denote novel alleles in the genotype string?
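To make the structure of the example GL String concrete, here is a minimal parsing sketch. It relies on the GL String delimiters in which "|" separates ambiguous genotypes, "+" separates gene copies, and "/" separates ambiguous alleles. The function name is hypothetical, and the "^" (locus) and "~" (phase) operators are deliberately left out of this sketch.

```python
from typing import List


def parse_gl_string(gl: str) -> List[List[List[str]]]:
    """Split a single-locus GL String into genotypes -> gene copies -> alleles.

    Only '|', '+' and '/' are handled; strings using '^' or '~' are
    rejected rather than parsed incorrectly.
    """
    if "^" in gl or "~" in gl:
        raise ValueError("this sketch handles only '|', '+' and '/'")
    return [
        [gene_copy.split("/") for gene_copy in genotype.split("+")]
        for genotype in gl.split("|")
    ]


# The example string from the slide.
example = "KIR3DL2*008/KIR3DL2*038+KIR3DL2*00701|KIR3DL2*027+KIR3DL2*016"
genotypes = parse_gl_string(example)
```

For the slide's example this yields two possible genotypes; in the first, one gene copy is ambiguous between KIR3DL2*008 and KIR3DL2*038.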

MIRING Element 4: Consensus Sequence
The consensus sequence is generated from the primary read data by the analysis software, and serves as the basis for the genotype. Consensus sequences should be formatted to identify any phase and/or ploidy information that has been generated by the NGS platform.
Use FASTA format blocks to report consensus sequences.
  >sample12345|allele_1|HLA-A|5'UTR|IMGT/HLA3.13.1|haploid|
  CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTGGCTCTCAGGGTCTCAGGCCCCGAAGG
  CGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAGTTTCTTTTCTCCCTCTCCCA
  ACCTACGTAGGGTCCTTCATCCTGGATACTCACGACGCGGACCCAGTTCTCACTCCCATTGGGTGTCGGG
  TTTCCAGAGAAGCCAATCAGTGTCGTCGCGGTCGCTGTTCTAAAGTCCGCACGCACCCACCGGGACTCAG
  ATTCTCCCCAGACGCCGAGG
Multiple FASTA blocks can be used to identify phase & ploidy.
A format is needed for metadata in the FASTA descriptor line.
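The slide notes that a metadata format for the FASTA descriptor line is still needed. As one possible reading of the example descriptor, the sketch below splits it on "|" into named fields. The field names (sample, allele, locus, feature, reference, ploidy) are assumptions inferred from the example, not a defined format.

```python
from typing import Dict

# Hypothetical field names, inferred from the slide's example descriptor.
DESCRIPTOR_FIELDS = ("sample", "allele", "locus", "feature", "reference", "ploidy")


def parse_descriptor(line: str) -> Dict[str, str]:
    """Parse a pipe-delimited FASTA descriptor line into a dict."""
    if not line.startswith(">"):
        raise ValueError("a FASTA descriptor line must start with '>'")
    # Drop the leading '>' and any trailing '|' before splitting.
    values = line[1:].strip().rstrip("|").split("|")
    if len(values) != len(DESCRIPTOR_FIELDS):
        raise ValueError(f"expected {len(DESCRIPTOR_FIELDS)} fields, got {len(values)}")
    return dict(zip(DESCRIPTOR_FIELDS, values))


fields = parse_descriptor(">sample12345|allele_1|HLA-A|5'UTR|IMGT/HLA3.13.1|haploid|")
```

A positional scheme like this is simple but fragile; a key=value scheme would tolerate optional fields better, which may be part of why the slide leaves the format open.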

MIRING Element 5: Unreferenced Sequences
Regions of the consensus sequence for which no reference allele sequence is available for any of the possible alleles in a given genotype must be explicitly noted in the genotyping report.
  Example: a genotyping result is based on a consensus sequence for exons 2-5, but for one of the alleles in the genotype, no reference sequence is available for exon 5.
These notations can take the form of a direct reference to entire sequences, or to ranges of positions in sequences, in the FASTA file of consensus sequences.
This category could be merged with category 4, by identifying sections of the consensus sequence that are present in the reference allele sequence database (e.g., in the FASTA descriptor line).

MIRING Element 6: Novel Polymorphisms
Novel polymorphisms in consensus sequences (nucleotide polymorphisms not included in the reference allele sequence database) must be explicitly noted.
Identify potential peptide changes, potential null alleles, or likely changes in protein expression, suggested by novel sequences.
Use Variant Call Format (VCF) to identify novel polymorphisms relative to the IMGT/HLA or IPD-KIR reference sequence for a given locus.
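As a sketch of what reporting a novel polymorphism in VCF could look like, the function below formats one VCF data line using the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The reference name `REF_ALLELE`, the position, and the bases are placeholders invented for illustration, not real variant data.

```python
VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"


def vcf_line(chrom: str, pos: int, ref: str, alt: str, info: str = ".") -> str:
    """Format one VCF data line; '.' marks a missing value.

    chrom names the reference sequence the position refers to. Here it is
    a placeholder standing in for a reference allele sequence.
    """
    if pos < 1:
        raise ValueError("VCF positions are 1-based")
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", info])


# Placeholder values only: a C>T substitution at position 100.
line = vcf_line("REF_ALLELE", 100, "C", "T")
```

A complete VCF file would also need the `##fileformat` meta-information line and any `##INFO` definitions before the column header.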

MIRING Element 7: Sequence Regions Targeted
The specific regions targeted in order to generate the genotyping result must be identified in the genotyping report. In some cases, these sequence regions may correspond to specific amplified features such as exons, introns, or UTRs, but in other cases, larger regions of genomic sequence may have been targeted.
Use the Browser Extensible Data (BED) format to identify sequence regions relative to a specific Genome Reference Consortium release version.
  track name="HLA-DRB1" description="assessed DRB1 features"
  Chr exon ,0,255
  Chr exon ,0,255
  Chr exon ,0,255
  Chr exon ,0,255
  Chr exon ,0,255
  Chr exon ,0,255
IMGT/HLA & IPD-KIR Databases do not reference a GRC assembly.
Some genes/sequence regions may not be in a GRC assembly.
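BED coordinates are 0-based and half-open, which is an easy source of off-by-one errors when regions are recorded as 1-based inclusive positions. The sketch below converts such a region into a four-column BED line. The function name is hypothetical and the coordinates are arbitrary placeholders, not real HLA-DRB1 exon positions.

```python
def bed_line(chrom: str, start_1based: int, end_1based: int, name: str) -> str:
    """Convert a 1-based inclusive region into a tab-separated BED line.

    BED uses a 0-based start and an exclusive end, so only the start
    shifts by one.
    """
    if not 1 <= start_1based <= end_1based:
        raise ValueError("expected 1 <= start <= end")
    return "\t".join([chrom, str(start_1based - 1), str(end_1based), name])


# Placeholder coordinates for illustration only.
region = bed_line("chr6", 1001, 1270, "exon2")
region_length = 1270 - (1001 - 1)  # BED end minus BED start = 270 bases
```

Because the end coordinate is exclusive, the region length is simply end minus start in BED terms.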

MIRING Element 8: Read Metadata
Primary data must include details of the cutoff values and reference sequences (e.g., IMGT/HLA Database version ) used to filter the data for read quality and/or mapping quality, along with the final read depth obtained and a confidence score of the zygosity for the SNPs used to infer the final genotype.
Is a specific format for these metadata needed?
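Since the slide leaves the metadata format open, the following is only one hypothetical way to hold the items it lists (cutoffs, reference database, read depth, zygosity confidence) in a simple record. All field names, the 0-to-1 confidence scale, and the example values are assumptions.

```python
from dataclasses import asdict, dataclass


@dataclass
class ReadMetadata:
    """Hypothetical container for the read metadata items named on the slide."""

    min_read_quality: int       # read-quality cutoff used for filtering
    min_mapping_quality: int    # mapping-quality cutoff used for filtering
    reference_database: str     # database name and release version
    final_read_depth: int       # read depth obtained after filtering
    zygosity_confidence: float  # assumed here to be on a 0.0-1.0 scale

    def __post_init__(self) -> None:
        if self.final_read_depth < 0:
            raise ValueError("read depth cannot be negative")
        if not 0.0 <= self.zygosity_confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Placeholder values; the release version is intentionally left unspecified.
record = ReadMetadata(20, 30, "IMGT/HLA <release>", 150, 0.99)
as_dict = asdict(record)
```

A flat record like this serializes easily to key-value text or JSON, which is one way the open question on the slide could be approached.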

MIRING Element 9: Primary Data
Unmapped reads with quality scores must be made available as the primary NGS data.
Primary data must be limited to full-length reads that include syntactically valid adapter and indexing/barcoding sequences. However, adapter sequences need not be included in the primary data.
Primary data should be made available as accessory data (e.g., deposited in the NCBI’s Sequence Read Archive) as opposed to being part of the message; references to this read data should be included in the message.
Use the Sanger FASTQ format to report NGS primary data.
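For readers unfamiliar with Sanger FASTQ, the sketch below reads four-line records and decodes the quality string, in which each character encodes a Phred score as its ASCII code minus 33. It assumes unwrapped records (one sequence line and one quality line per read); the read name and bases in the example are made up.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, Phred scores) from unwrapped Sanger FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Sanger encoding: Phred score = ASCII code - 33.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


# Made-up single record for illustration.
records = list(read_fastq(io.StringIO("@read1\nACGT\n+\nII5!\n")))
```

In this record the quality string `II5!` decodes to the scores 40, 40, 20 and 0.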

MIRING Element 10: Platform Documentation
The specific details of the methodology and pertinent versions of the platform and analysis software applied to obtain the genotyping result must be documented in a public fashion [e.g., in NCBI’s Genetic Testing Registry (GTR)]. References to this documentation should be included in the message.
Relevant platform information to be deposited should include:
  Instrument version
  Instrument software version identifier(s)
  Analysis software version identifier(s)
  Analysis software parameters used
  Reagent versions and lot number
  Sequence read lengths
  Expected amplicon/insert length
  Reference sequences applied
  Primer target locations
Requires a parallel MIBBI identifying the Minimum Information for Documenting Immunogenomic NGS Genotyping.
Should reference sequences unavailable in public databases (IMGT/HLA, IPD-KIR, GenBank, EMBL) be excluded?

MIRING Development
CHORI: Jill A. Hollenbach, Steven J. Mack
Life Technologies: Scott Conradson, Benjamin Gifford
Stanford University: Marcelo Fernandez-Viña, Paul J. Norman
Anthony Nolan Research Institute: James Robinson
NMDP: Martin Maiers, Robert P. Milius, Michael L. Heuer, David Roe