-- Don Preuss NCBI/NLM/NIH

Presentation transcript:

Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud
-- Don Preuss NCBI/NLM/NIH
"Every decade a new, lower priced computer class forms with new programming platform, network, and interface resulting in new usage and industry." - Bell's Law of computer classes

Outline
  Emerging trends in "Big Data", large-scale networking, and "the cloud" in the genomics community
  Trends in data transfer and data compression
  Cloud initiatives: 1,000 Genomes in the cloud

National Center for Biotechnology Information
Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:
  Create automated systems for knowledge about molecular biology, biochemistry, and genetics.
  Perform research into advanced methods of analyzing and interpreting molecular biology data.
  Enable biotechnology researchers and medical care personnel to use the systems and methods developed.
The NCBI advances science and health by providing access to biomedical and genomic information.
Builders and providers of GenBank, Entrez, BLAST, PubMed, dbGaP, SRA, dbSNP, PubChem, and much more.
Center for basic research and training in computational biology.

NCBI Daily Users
  Web page views: 28 million per day
  Web users: 3.1 million per day
  Data downloaded: 26.6 TB per day
  Peak web hits: 7,000 per second

Sequencers

DNA Sequencing Caught in Deluge of Data
BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day. BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.

Big Data in Scientific Discovery
  Physics: Large Hadron Collider
  Biology: 1000 Genomes Project
Trunnell 2012

NLM I2 Traffic Stats 2009-2012

Getting exponential growth under control

What is the Big Data Problem in Biology?
Example: Reducing the 1000 Genome Dataset
  Submitted BAM: read IDs as strings; original quality and recalibrated quality scores; additional analysis tags
  cSRA (lossless): read IDs as integers; 40-level read qualities using recalibrated quality scores
  cSRA (lossy): 8-level qualities for all sites; uniform binning of recalibrated quality scores
  Variant Call Format (VCF): genotype likelihoods for all variants
[Chart: Size (Terabytes). Bars: Total Project Size, Lossless cSRA, Lossy cSRA, Analysis Genotypes. Values: 250TB, 85TB, 30TB, 0.1TB]
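The "uniform binning" step in the lossy cSRA row can be illustrated with a short Python sketch. It maps Phred-style quality scores in the range 0-40 onto 8 equal-width bins and replaces each score with its bin's midpoint. The function name, the 0-40 range, and the choice of midpoint as the representative value are assumptions made for illustration; the slide does not describe the exact scheme used by cSRA.

```python
def bin_quality(q, levels=8, max_q=40):
    """Map a quality score to the midpoint of one of `levels` equal-width bins.

    Illustrative only: the real cSRA binning scheme is not specified here.
    """
    if q < 0:
        raise ValueError("quality scores must be non-negative")
    q = min(q, max_q)                      # clamp unusually high scores
    span = max_q + 1                       # number of distinct scores, 0..max_q
    index = q * levels // span             # which bin the score falls in
    return (2 * index + 1) * span // (2 * levels)  # integer midpoint of that bin


qualities = [2, 11, 25, 37, 40]
binned = [bin_quality(q) for q in qualities]
print(binned)
```

With fewer distinct quality values per read, the quality stream has less entropy and compresses better, at the cost of losing the original per-base scores.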

Flicek

Problem: Enable Access to Data
  The 1,000 genome data set is very large.
  Many sites do not have capacity for 50-200TB downloads.
  Request: Can the 1,000 genomes project store the data in the cloud?
  Reduce cost for extramural investigators and increase accessibility to data.
In addition, it supports Federal Open Data
  A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government…
  Latest release announced at #ICGH2011, more releases coming.
  Part of the National Big Data Initiative Announcement
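To put the 50-200TB download figure in perspective, the following Python sketch estimates how long such a transfer would take. The link speeds and the 80% sustained-efficiency factor are assumptions chosen for illustration; they are not numbers from the talk.

```python
def transfer_days(size_tb, link_mbps, efficiency=0.8):
    """Estimate days needed to move `size_tb` terabytes over a `link_mbps` link.

    `efficiency` is an assumed fraction of the nominal link rate that is
    actually sustained end to end.
    """
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400


for size_tb in (50, 200):
    for link_mbps in (100, 1000, 10000):
        days = transfer_days(size_tb, link_mbps)
        print(f"{size_tb} TB over {link_mbps} Mbps: {days:.1f} days")
```

Under these assumptions, 200TB over a 1 Gbps link takes roughly 23 days, which is consistent with the slide's point that many sites cannot practically download the full data set.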

Why is NCBI interested in cloud computing?
Quantity of Data: NCBI has petabytes of sequence data that is made available to researchers around the world.
Bandwidth: NIH has a good bit of network capacity, and network capacity is available for many sites to download data sets, especially those on Internet2. However, for many more sites it is not available, reducing their practical access to research data.
Analysis Tools and Platforms: Some need simple tools, for example to extract a portion of data (chromosome, area of interest). Others use more complex tools: genome browsers, analysis tools for epigenomics using Elastic MapReduce.
If we can bring compute to the data, we can improve access to the data.
References in this talk to any specific commercial product, process, service, manufacturer, company, or trademark do not constitute its endorsement or recommendation by the U.S. Government, HHS, or NIH. As an agency of the U.S. Government, NIH cannot endorse or appear to endorse any specific commercial products or services.
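The "simple tools" case above, extracting a chromosome or area of interest, can be sketched in a few lines of Python. This version filters tab-separated VCF text by chromosome and position range. The function name and the sample records are hypothetical, and a real pipeline would more likely use an indexed tool such as tabix rather than scanning every line.

```python
def extract_region(vcf_lines, chrom, start, end):
    """Yield VCF header lines plus records on `chrom` with start <= POS <= end.

    Positions are 1-based and inclusive, as in the VCF POS column.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line                      # keep header and metadata lines
            continue
        fields = line.split("\t", 2)        # only CHROM and POS are needed
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


# Hypothetical records, for illustration only.
sample = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t10500\t.\tA\tG",
    "1\t25000\t.\tC\tT",
    "2\t10500\t.\tG\tA",
]
for record in extract_region(sample, chrom="1", start=10000, end=20000):
    print(record)
```

When the data already sits next to the compute, a filter like this returns a small subset instead of forcing the researcher to download the whole file first.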

1,000 Genomes in the Clouds
  The 1,000 Genome Project files are loaded in Amazon S3.
  Millions of files have been uploaded (200TB).
  AMIs have been developed to analyze and review the data: CloudBioLinux, Galaxy.
  This is a public data set with storage provided by AWS.
  NIH is funding several efforts to port genome pipelines to cloud computing environments.
  Research labs, such as those at Emory and UCSC, have placed versions of their software in AWS to make 1,000 genome data readily accessible through browser interfaces in the cloud.
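Because the data set is public, its file listing can in principle be read with plain HTTP requests against the S3 REST interface. The Python sketch below builds a ListObjectsV2 request URL and parses the XML listing format that S3 returns. The bucket name `1000genomes`, the `phase3/` prefix, and the sample response are assumptions for illustration; the talk does not give the bucket name or layout, and the code parses a canned response rather than contacting AWS.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket, prefix="", max_keys=10):
    """Build an anonymous ListObjectsV2 URL for a public bucket."""
    query = urlencode({"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from an S3 listing response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("s3:Key", namespaces=S3_NS),
         int(item.findtext("s3:Size", namespaces=S3_NS)))
        for item in root.findall("s3:Contents", S3_NS)
    ]


# Hypothetical response body, shaped like an S3 listing.
SAMPLE_RESPONSE = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>1000genomes</Name>
  <Contents><Key>phase3/example_a.bam</Key><Size>1048576</Size></Contents>
  <Contents><Key>phase3/example_b.bam</Key><Size>2097152</Size></Contents>
</ListBucketResult>
""".strip()

print(list_url("1000genomes", prefix="phase3/"))
for key, size in parse_listing(SAMPLE_RESPONSE):
    print(key, size)
```

To run this against the live service, the URL from `list_url` would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_listing`.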

What is Galaxy
Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a structured, well-defined interface. On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analyses, a workflow system for convenient reuse, data management, sharing, publishing, and more.
Even more: Galaxy has made it easy for researchers to extend their compute power into cloud compute systems.
Tools like Galaxy make it possible for a researcher to take advantage of much greater compute power without having to worry about the infrastructure details.
http://usegalaxy.org
From ASMB tutorial
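The idea of wrapping a command-line tool in a structured interface that records what was run can be shown with a toy Python sketch. This is not Galaxy's API (Galaxy describes tools in XML definition files); the `WrappedTool` class, its fields, and the example command are all hypothetical and exist only to illustrate typed parameters plus a provenance history.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class WrappedTool:
    """Toy wrapper: declared parameters, a command template, and a run history."""
    name: str
    command: list                      # argv template with {placeholders}
    params: dict                       # parameter name -> type used to validate it
    history: list = field(default_factory=list)

    def run(self, **values):
        # Validate and coerce every declared parameter before running anything.
        for key, expected_type in self.params.items():
            if key not in values:
                raise ValueError(f"missing parameter: {key}")
            values[key] = expected_type(values[key])
        argv = [part.format(**values) for part in self.command]
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        # Record exactly what was executed so the analysis can be repeated.
        self.history.append({"tool": self.name, "argv": argv, "params": dict(values)})
        return result.stdout


# Hypothetical tool: repeats a word, using the current Python interpreter as the CLI.
repeat = WrappedTool(
    name="repeat",
    command=[sys.executable, "-c", "print('{word}' * {times})"],
    params={"word": str, "times": int},
)
print(repeat.run(word="ab", times="3"))
print(repeat.history)
```

The history list plays the role of the tracked analysis details: each entry holds enough information to rerun the same step later or to share it with a collaborator.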

Summary/Questions
  Compression will help slow this big data problem
  Other big data problems remain
  New file formats will compress data close to sequencers
  Last mile networking is a big issue, prevents access for researchers
  Cloud will enable access for many more researchers internationally and at underserved institutions
Email: donp@nih.gov