Bio-IT World Conference and Expo ‘12, April 25, 2012 A Nation-Wide Area Networked File System for Very Large Scientific Data William K. Barnett, Ph.D.

Slides:



Advertisements
Similar presentations
Summary of Cloud Computing (CC) from the paper Abovce the Clouds: A Berkeley View of Cloud Computing (Feb. 2009)
Advertisements

Statewide IT Conference30-September-2011 HPC Cloud Penguin on David Hancock –
Experiences from the SCinet Research Sandbox How to Tune Your Wide Area File System for a 100 Gbps Network Scott Michael LUG2012 April 24,2012.
Dawei Lin, Ph.D. Director, Bioinformatics Core UC Davis Genome Center July 20, 2008, SLIMS (Solexa sequencing.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Is 'Designing' Cyberinfrastructure - or, Even, Defining It - Possible? Peter A. Freeman National Science Foundation January 29, 2007 The views expressed.
Aleksi Kallio CSC – IT Center for Science Chipster and collaboration with other bioinformatics platforms.
WORKFLOWS IN CLOUD COMPUTING. CLOUD COMPUTING  Delivering applications or services in on-demand environment  Hundreds of thousands of users / applications.
An Introduction to the Open Science Data Cloud Heidi Alvarez Florida International University Robert L. Grossman University of Chicago Open Cloud Consortium.
Scientific Data Infrastructure in CAS Dr. Jianhui Scientific Data Center Computer Network Information Center Chinese Academy of Sciences.
IPlant Collaborative Powering a New Plant Biology iPlant Collaborative Powering a New Plant Biology.
Bioinformatics Core Facility Ernesto Lowy February 2012.
Statewide IT Conference, Bloomington IN (October 7 th, 2014) The National Center for Genome Analysis Support, IU and You! Carrie Ganote (Bioinformatics.
Next Generation Cyberinfrastructures for Next Generation Sequencing and Genome Science AAMC 2013 Information Technology in Academic Medicine Conference.
Empowering Bioinformatics Workflows Using the Lustre Wide Area File System across a 100 Gigabit Network Stephen Simms Manager, High Performance File Systems.
1 Developing a Data Management Plan C&IT Resources for Data Storage and Data Security Patrick Gossman Deputy CIO for Research January 16, 2014.
Genomics Virtual Lab: analyze your data with a mouse click Igor Makunin School of Agriculture and Food Sciences, UQ, April 8, 2015.
Genomics, Transcriptomics, and Proteomics: Engaging Biologists Richard LeDuc Manager, NCGAS eScience, Chicago 10/8/2012.
Bio-IT World Asia, June 7, 2012 High Performance Data Management and Computational Architectures for Genomics Research at National and International Scales.
The National Center for Genome Analysis Support as a Model Virtual Resource for Biologists Internet2 Network Infrastructure for the Life Sciences Focused.
Using Biological Cyberinfrastructure Scaling Science and People: Applications in Data Storage, HPC, Cloud Analysis, and Bioinformatics Training Scaling.
Cloud Computing & Amazon Web Services – EC2 Arpita Patel Software Engineer.
RNA-Seq 2013, Boston MA, 6/20/2013 Optimizing the National Cyberinfrastructure for Lower Bioinformatic Costs: Making the Most of Resources for Publicly.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
CSIU Submission of BLAST jobs via the Galaxy Interface Rob Quick Open Science Grid – Operations Area Coordinator Indiana University.
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
Bioinformatics Core Facility Guglielmo Roma January 2011.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
Using Biological Cyberinfrastructure Scaling Science and People: Applications in Data Storage, HPC, Cloud Analysis, and Bioinformatics Training Scaling.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
AUTOMATING DAAS DESKTOPS WITH CITRIX CORTEX Tony Sanchez WW Alliances Solutions Architecture Citrix Systems Inc SESSION CODE: CLI415 (c) 2011 Microsoft.
The National Center for Genomic Analysis Support: creating a national cyberinfrastructure environment for genomics researchers. William Barnett, Thomas.
Pti.iu.edu/sc14 The National Center for Genome Analysis Support Supercomputing 2014 November 17-21, 2014.
Providing National Cyberinfrastructure to Biologists, esp. Genomicists. William K. Barnett, Ph.D. (Director) Thomas G. Doak (Manager & Domain Biologist)
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,
Comprehensive Scientific Support Of Large Scale Parallel Computation David Skinner, NERSC.
Power and Cooling at Texas Advanced Computing Center Tommy Minyard, Ph.D. Director of Advanced Computing Systems 42 nd HPC User Forum September 8, 2011.
Galaxy Community Conference July 27, 2012 The National Center for Genome Analysis Support and Galaxy William K. Barnett, Ph.D. (Director) Richard LeDuc,
1 The Cloud and Desktop as a Service as a teaching tool for different research communities David Wallom Oxford e-Research Centre.
Globus.org/genomics Globus Galaxies Science Gateways as a Service Ravi K Madduri, University of Chicago and Argonne National Laboratory
Tackling I/O Issues 1 David Race 16 March 2010.
A Research Collaboratory for Open Source Software Research Yongqin Gao, Matt van Antwerp, Scott Christley, Greg Madey Computer Science & Engineering University.
Architecture of a platform for innovation and research Erik Deumens – University of Florida SC15 – Austin – Nov 17, 2015.
Canadian Bioinformatics Workshops
Bioinformatics Computation in the Cloud A Joint Collaboration Between Microsoft’s External Research and eXtreme Computing Groups
EGI-InSPIRE RI EGI Compute and Data Services for Open Access in H2020 Tiziana Ferrari Technical Director, EGI.eu
Galaxy based BLAST submission to distributed high throughput computing resources Rob Quick and Soichi Hayashi Open Science Grid Operations Indiana University.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
February 3, 2009 Bridging Academic and Medical Cultures Academic Research Systems and HIPAA William K. Barnett Anurag Shankar.
Unit 3 Virtualization.
Accessing the VI-SEEM infrastructure
What is HPC? High Performance Computing (HPC)
CyVerse Tools and Services
Tools and Services Workshop
University of Chicago and ANL
Joslynn Lee – Data Science Educator
Cloud computing-The Future Technologies
Bioinformatics Community of CNGrid A New Approach to Utilizing Grids
National Center for Genome Analysis Support
Recap: introduction to e-science
AWS COURSE DEMO BY PROFESSIONAL-GURU. Amazon History Ladder & Offering.
Future Data Architectures Big Data Workshop – April 2018
Richard LeDuc, Ph.D. (Manager)
Cloud computing mechanisms
IBM Power Systems.
Storing and Accessing G-OnRamp’s Assembly Hubs outside of Galaxy
Trip report: Visit to UPPNEX
Amazon Web Services.
Microsoft Virtual Academy
Presentation transcript:

Bio-IT World Conference and Expo ‘12, April 25, 2012 A Nation-Wide Area Networked File System for Very Large Scientific Data William K. Barnett, Ph.D. Richard LeDuc, Ph.D. National Center for Genome Analysis Support

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: Summary Changing genomics analytical needs NCGAS and its mission NCGAS cyberinfrastructure The 100 Gigabit demonstration Scaling genomics analysis NCGAS workflow and logical models Outcomes for life sciences research

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: Changing genomics analytical needs Next Gen sequencers are generating more data and getting cheaper Sequencing is:  Becoming commoditized at large centers and  Multiplying at individual labs Analytical capacity has not kept up  Bioinformatics support  Computational support (thousand points solution)  Storage support

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: NCGAS widening the analytical bottleneck Funded by National Science Foundation Large memory clusters for assembly Bioinformatics consulting for biologists Optimized software for better efficiency Open for business at:

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: Making it easier for Biologists Galaxy interface provides a “user friendly” window to NCGAS resources Supports many bioinformatics tools Available for both research and instruction. Common Rare Computational Skills LOW HIG H

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: NCGAS Cyberinfrastructure at IU Mason large memory cluster (512 GB/node) Quarry cluster (16 GB/node) Data Capacitor (~1 PB at 20 Gbps throughput) Research File System (RFS) for data storage Research Database Cluster for structured data High speed internal network (40 Gbps) Bioinformaticians and software engineers

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: GALAXY.IU.EDU Model Virtual box hosting Galaxy.IU.edu The host for each tool is configured to meet IU needs Quarry Mason Data Capacitor RFS UITS/NCGAS establishes tools, hardens them, and moves them into production. A custom Galaxy tool can be made to import data from the RFS to the DC. Individual labs can get duplicate boxes – provided they support it themselves. Policies on the DC guarantee that untouched data is removed with time.

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: NCGAS Workflow Demo at SC 11 STEP 1: data pre- processing, to evaluate and improve the quality of the input sequence STEP 2: sequence alignment to a known reference genome STEP 3: SNP detection to scan the alignment result for new polymorphisms Bloomington, INSeattle, WA

10 Gbps 100 Gbps NCGAS Mason (Free for NSF users) IU POD (12 cents per core hour) Amazon EC2 (20 cents per core hour) Data Capacitor NO data storage Charges Amazon Cloud Storage $80 – 120 per TB per month Lustre WAN File System Your Friendly Regional Sequencing Lab Your Friendly Neighborhood Sequencer Your Friendly National Sequencing Center NCGAS Logical Model

January 26, 2016Customize footer: View menu/Header and Footer Commodity Internet (1Gbps but highly variable) Internet2 (100Gbps) Gbps NLR to Sequencing Centers (10Gbps/link) IU Data Capacitor (20 Gbps throughput) Ultra SCSI 160 Disk (1.2 Gbps, 160 MBps) DDR3 SDRAM (51.2 Gbps, 6.4GBps, ) This Architecture Scales!

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: How would this work at scale? 1.Biologists use Galaxy to execute workflows 2.Sequence data mounted via Lustre WAN or automatically transferred using Internet2/NLR 3.Data Capacitor flows data into Mason or other clusters nationally or internationally 4.Data Capacitor mounts or mirrors reference data from NCBI or other sources 5.Results delivered through web interfaces and to visualization or other science tools

Bio-IT World. April 25, 2012National Center for Genome Analysis Support: Outcomes for Life Sciences Research… Sequencing is creating analytical problems that cannot be solved at sequencing centers or labs. Bioinformatics support, storage, data transfer, and computation all needed to do the science. NCGAS provides a model of a scaled infrastructure for biologists who cannot become bioinformaticians to accomplish their research.

Bio-IT World. April 25, 2012 National Center for Genome Analysis Support: Thank You Questions? Bill Barnett Rich LeDuc