GIAB: Genome reference material development resources for clinical sequencing

Chunlin Xiao 1, Justin Zook 2, Shane Trask 1, Melissa Landrum 1, Marc Salit 2, Stephen Sherry 1, and the Genome-in-a-Bottle Consortium

1 NIH/NLM/NCBI, 45 Center Drive, Bethesda, MD
2 NIST, 100 Bureau Dr, Gaithersburg, MD

Data Visualization

The current consensus SNP call set for NA12878 generated by NIST can be visualized through the NCBI Get-RM browser. Other variant call sets for the same individual, generated by clinical laboratories with various technologies, can be uploaded as separate tracks for side-by-side comparison. The browser also allows you to upload your own data for display in the Sequence Viewer alongside NCBI-provided tracks.

References

Zook, J. et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology 32, 246–251
The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65
Durbin, R.M. et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061–

Abstract

Reference materials play important roles in validating the performance of sequencing platforms and in enabling the regulation of clinical applications. The Genome-in-a-Bottle (GIAB) project is a collaboration among NIST, FDA, NCBI, academic sequencing groups, sequencing technology developers, and clinical laboratories to develop analytical-grade reference genome materials and accompanying performance metrics for the development of regulations and professional standards for clinical sequencing. NCBI serves as the Data Coordination Center (DCC) and repository for the raw sequencing reads, mapped alignments, genotypes, and other details for each sample on a dedicated FTP site (ftp://ftp-trace.ncbi.nih.gov/giab/ftp). Here we describe the processes of data generation and submission, and how the community can access the data.
We are also developing a genome browser for data visualization. The GIAB consortium plans to release data to the public on a regular basis.

Data Submission and Accessioning

NCBI serves as the Data Coordination Center (DCC) and repository for the raw sequence, genotypes, and other details for each sample from the Genome-in-a-Bottle project. Currently the project is focusing on one sample, NA12878, the daughter of NA12891 and NA. We have created a drop-box for each of the submitters, including NIST, COMPLETE, GARVAN, ILLUMINA, INOVA, RTG, NCI, and NOVARTIS, so that they can upload their data to NCBI. Collaborators can submit raw sequence reads in FASTQ format, read alignments in BAM format, genotype data in VCF format, or analysis tools. The submitted data are then accessioned and archived at NCBI.

Data Distribution

All submitted GIAB data are made available by the DCC to the research community on a dedicated FTP site and an Aspera server. Users can download data, including FASTQ, BAM, and VCF files, from the FTP site (ftp://ftp-trace.ncbi.nih.gov/giab/ftp). The structure of the GIAB FTP site is very similar to that of the 1000 Genomes FTP site. The primary sequence data are organized by sample name under the “/data” directory, while the official genotype data are released under the “/release” directory. Intermediate data and method-development data are organized under the “/technical” directory. For each release, we create a sequence.index file to track all the FASTQ sequences along with their metadata, and an alignment.index file listing all the alignment BAM files used for generating variant calls.

To facilitate cloud-based data analysis, the whole GIAB data set has been mirrored to the Amazon cloud. Users with AWS access can reach the GIAB data through Amazon Simple Storage Service (S3); the bucket name is s3://giab/.
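As an illustration of how a sequence.index file might be used, the sketch below parses a small tab-delimited index and builds a download URL for each FASTQ file under the FTP root given above. The column names (FASTQ_FILE, MD5, SAMPLE_NAME, CENTER_NAME), the presence of a header row, and the sample rows are assumptions made for this example only; the actual GIAB index schema should be checked against the files on the FTP site.

```python
import csv
import io

# Hypothetical excerpt: the column names and rows are illustrative
# assumptions, not the actual GIAB sequence.index schema.
SAMPLE_INDEX = (
    "FASTQ_FILE\tMD5\tSAMPLE_NAME\tCENTER_NAME\n"
    "data/NA12878/example_run/reads_1.fastq.gz\t0123abcd\tNA12878\tEXAMPLE\n"
    "data/NA12878/example_run/reads_2.fastq.gz\t4567ef01\tNA12878\tEXAMPLE\n"
)

# FTP root as given in the text.
GIAB_FTP_ROOT = "ftp://ftp-trace.ncbi.nih.gov/giab/ftp"


def read_index(handle):
    """Parse a tab-delimited index with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def fastq_urls(records, root=GIAB_FTP_ROOT, path_column="FASTQ_FILE"):
    """Build a download URL for each record from its relative path."""
    return [f"{root}/{rec[path_column].lstrip('/')}" for rec in records]


records = read_index(io.StringIO(SAMPLE_INDEX))
urls = fastq_urls(records)
for url in urls:
    print(url)
```

The same reader could be pointed at an alignment.index file by passing a different `path_column`, provided that file is also tab-delimited with a header row, which is likewise an assumption here.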
(a) Layout of GIAB data at the NCBI FTP site (b) Layout of GIAB data at Amazon S3
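For readers who move between the two copies of the data, the following sketch translates an FTP URL into an S3 URI and splits an S3 URI into bucket and key. It assumes that the S3 mirror preserves the relative paths of the FTP site, which the text does not state explicitly; the helper names `ftp_to_s3` and `split_s3_uri` are hypothetical and exist only for this example.

```python
GIAB_FTP_ROOT = "ftp://ftp-trace.ncbi.nih.gov/giab/ftp"
GIAB_S3_ROOT = "s3://giab"


def ftp_to_s3(ftp_url, ftp_root=GIAB_FTP_ROOT, s3_root=GIAB_S3_ROOT):
    """Map an FTP URL under the GIAB root to an S3 URI.

    ASSUMPTION: the S3 mirror keeps the same relative paths as the FTP site.
    """
    prefix = ftp_root.rstrip("/") + "/"
    if not ftp_url.startswith(prefix):
        raise ValueError(f"not under the GIAB FTP root: {ftp_url}")
    return s3_root.rstrip("/") + "/" + ftp_url[len(prefix):]


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(scheme):].partition("/")
    return bucket, key


release_uri = ftp_to_s3(GIAB_FTP_ROOT + "/release/")
bucket, key = split_s3_uri(release_uri)
print(release_uri, bucket, key)
```

The bucket and key pair is the form that S3 client libraries generally expect, so a split like this is a natural first step before listing or downloading objects with whichever client is available.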