Empowering Bioinformatics Workflows Using the Lustre Wide Area File System across a 100 Gigabit Network
Stephen Simms, Manager, High Performance File Systems, Indiana University

Today’s talk brought to you by NCGAS
– Funded by the National Science Foundation
– Large memory clusters for assembly
– Bioinformatics consulting for biologists
– Optimized software for better efficiency
– Open for business at:

Data: In the 21st century everything is data
– Patient data
– Nutritional data
– Musical data
Raw material for
– Scientific advancement
– Technological development

Better Technology = More Data

Better Telescopes
ODI – One Degree Imager (32k x 32k CCD)
– WIYN (Wisconsin, Indiana, Yale, NOAO) telescope in Arizona
– ODI will provide 1 billion pixels/image
Pan-STARRS
– Providing 1.4 billion pixels/image
– Currently has over 1 petabyte of images stored

Better Televisions
Ultra High Definition Television (UHDTV)
– 16 times more pixels than HDTV
– Last month LG began sales of an 84” UHDTV
– Tested at the 2012 Summer Olympics
– Storage media lags behind

Genomics
– Next Gen sequencers are generating more data and getting cheaper
– Sequencing is becoming commoditized at large centers and multiplying at individual labs
– Analytical capacity has not kept up: storage support, computational support (thousand points solution), bioinformatics support

Data Capacitor
– NSF funded
– Lustre storage (currently 1.1 PB)
– 24 servers with 10Gb NICs
– Short- to mid-term storage

The Lustre Filesystem
– Open source
– Supports many thousands of client systems
– Supports petabytes of storage
– Over 240 GB/s measured throughput at ORNL
– Scalable: aggregates separate servers for performance, with user-specified “stripes”
– Standard POSIX interface
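Striping is the user-visible knob here: spreading a file across several object storage targets multiplies bandwidth. Below is a minimal sketch of driving that from Python with the standard `lfs` utility; the mount point, stripe count, and stripe size are illustrative assumptions, not Data Capacitor settings.

```python
# Sketch: request a wide Lustre stripe layout by shelling out to `lfs`.
# Paths and stripe values are hypothetical examples.
import subprocess

def stripe_directory(path: str, stripe_count: int = 4, stripe_size: str = "4M") -> None:
    """Spread new files created in `path` across `stripe_count` storage targets."""
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

def show_stripe(path: str) -> str:
    """Report the current striping layout of a file or directory."""
    result = subprocess.run(
        ["lfs", "getstripe", path], check=True, capture_output=True, text=True
    )
    return result.stdout

if __name__ == "__main__":
    stripe_directory("/dc/scratch/project", stripe_count=8)  # wide stripe for large files
    print(show_stripe("/dc/scratch/project"))
```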

Lustre Scalable Object Storage
[architecture diagram: clients connect to a metadata server (MDS) and to object storage servers (OSS)]

Computation

Workflow - The Data Lifecycle

Data Lifecycle – Centralized Storage

NCGAS Cyberinfrastructure at IU
– Mason large memory cluster (512 GB/node)
– Quarry cluster (16 GB/node)
– Data Capacitor (1.1 PB)
– Research File System (RFS)
– Research Database Cluster for structured data
– Bioinformaticians and software engineers

Galaxy: Make it easier for Biologists
– The Galaxy interface provides a “user friendly” window to NCGAS resources
– Supports many bioinformatics tools
– Available for both research and instruction
[chart: number of users from common to rare against computational skill, low to high]

GALAXY.IU.EDU Model
– A virtual box hosts Galaxy.IU.edu; the host for each tool is configured to meet IU needs (Quarry, Mason, Data Capacitor, RFS)
– UITS/NCGAS establishes tools, hardens them, and moves them into production
– A custom Galaxy tool can be made to import data from the RFS to the DC
– Individual labs can get duplicate boxes, provided they support them themselves
– Policies on the DC guarantee that untouched data is removed with time (see the sketch below)
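The last point is essentially an access-time sweep of scratch space. Here is a minimal sketch of such a purge policy; the mount point and the 60-day window are assumptions, not the actual DC policy values.

```python
# Sketch of an atime-based purge: files untouched for a set number of days
# are removed from scratch space. Path and window are hypothetical.
import os
import time

PURGE_AFTER_DAYS = 60
SCRATCH_ROOT = "/dc/scratch"          # hypothetical mount point

def purge_untouched(root: str, max_age_days: int, dry_run: bool = True) -> None:
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:    # last access predates the window
                    print(("would remove " if dry_run else "removing ") + path)
                    if not dry_run:
                        os.remove(path)
            except FileNotFoundError:
                pass                                    # file vanished mid-scan; ignore

if __name__ == "__main__":
    purge_untouched(SCRATCH_ROOT, PURGE_AFTER_DAYS, dry_run=True)
```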

Increasing DC’s Utility
– If we’re getting high speed performance across campuses, what could we do across longer distances?
– Empower geographically distributed workflows
– Facilitate data sharing among colleagues
– Provide data everywhere all the time

10Gb Lustre WAN
– 977 MB/s between ORNL and IU
– Using a single Dell 2950 client
– Across a 10Gb TeraGrid connection

2007 Bandwidth Challenge Win: Five Applications Simultaneously
– Acquisition and Visualization: Live Instrument Data (Chemistry); Rare Archival Material (Humanities)
– Acquisition, Analysis, and Visualization: Trace Data (Computer Science); Simulation Data (Life Science, High Energy Physics)

Beyond a Demo
– To make Lustre across the wide area network useful and more than a demo, we needed to be able to span heterogeneous name spaces
– In Unix each user has a UID, and it can differ from system to system
– To preserve ownership across systems we created a method for doing so (a sketch of the idea follows)
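As a toy illustration only, not IU's actual implementation: the core idea is a per-site translation table from remote UIDs to the local UID that should own the file. All sites, UIDs, and names below are hypothetical.

```python
# Toy illustration of UID mapping across sites with heterogeneous name spaces.
# The same person may hold different numeric UIDs at different sites, so the
# file system must translate a (site, remote_uid) pair into a local UID
# before applying ownership. All values here are made up.
UID_MAP = {
    ("ornl", 5021): 38112,   # hypothetical user at ORNL -> local UID at IU
    ("psc",  1204): 38112,   # same person, different remote UID
    ("ornl", 5022): 40277,
}

NOBODY = 65534               # unknown users fall back to an unprivileged UID

def map_uid(site: str, remote_uid: int) -> int:
    """Translate a remote UID into the local UID that should own the file."""
    return UID_MAP.get((site, remote_uid), NOBODY)

assert map_uid("ornl", 5021) == map_uid("psc", 1204)   # ownership is preserved
assert map_uid("anl", 9999) == NOBODY                  # unknown users are squashed
```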

IU’s Data Capacitor WAN Filesystem
– Funded by Indiana University in 2008
– Put into production in April of that year
– Terabytes of storage available as a production service
– Centralized short-term storage for resources nationwide: simplifies use of distributed resources; project space exists for mid-term storage

Gas Giant Planet Research

2010: Lustre WAN at 100Gb

100 Gbit Testbed – Full Duplex Results
– Writing to Freiberg: 10.8 GB/s
– Writing to Dresden: 11.1 GB/s
– (100GbE link; site interconnects of 16 x 8 Gbit/s, 16 x 20 Gbit/s DDR IB, and 5 x 40 Gbit/s QDR IB)

100 Gbit Testbed – Uni-Directional Efficiency
– Lustre: 11.8 GByte/s (94.4%)
– TCP/IP: 98.5 Gbit/s (98.5%)
– Link: 100 Gbit/s (100.0%)
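As a quick sanity check, the percentages and throughputs are consistent if each percentage is read against the 100 Gbit/s link and 1 GByte/s is taken as 8 Gbit/s:

```python
# Sanity check on the efficiency figures, assuming each percentage is a
# fraction of the 100 Gbit/s link and 1 GByte/s = 8 Gbit/s.
LINK_GBIT = 100.0

lustre_gbit = 0.944 * LINK_GBIT      # 94.4 Gbit/s on the wire
print(lustre_gbit / 8)               # ~11.8 GByte/s of Lustre throughput

tcp_gbit = 0.985 * LINK_GBIT
print(tcp_gbit)                      # 98.5 Gbit/s of TCP/IP throughput
```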

2011: SCinet Research Sandbox
– Supercomputing 2011, Seattle: joint effort of SCinet and the Technical Program
– Software Defined Networking and 100 Gbps
– From Seattle to Indianapolis (2,300 miles)
– Demonstrations using Lustre WAN: network, benchmark, applications

Network, Hardware and Software Internet2 and ESnet, 50.5 ms RTT
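The 50.5 ms round trip is why tuning matters at this scale: at 100 Gbps the bandwidth-delay product is large, so a great deal of data must be kept in flight to fill the pipe. A back-of-envelope estimate follows; it is illustrative arithmetic, not the actual tuning used in the demo.

```python
# Bandwidth-delay product for the SC11 path: how much data must be
# "in flight" to keep a 100 Gbps link with 50.5 ms RTT full.
link_bps = 100e9          # 100 Gbit/s
rtt_s = 50.5e-3           # 50.5 ms round-trip time

bdp_bits = link_bps * rtt_s
bdp_bytes = bdp_bits / 8
print(f"{bdp_bytes / 1e6:.0f} MB in flight")   # ~631 MB of outstanding data
```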

Network, Hardware and Software

Application Results
– Peak: 6.2 GB/s
– Sustained: 5.6 GB/s

NCGAS Workflow Demo at SC11 (Bloomington, IN to Seattle, WA)
– STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
– STEP 2: sequence alignment to a known reference genome
– STEP 3: SNP detection to scan the alignment result for new polymorphisms
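The slide does not name the demo's exact tool chain, so the sketch below strings together commonly used stand-ins (FastQC for quality checks, bwa for alignment, samtools and bcftools for SNP calling) just to show the shape of such a three-step pipeline. Tool choices, flags, and file names are assumptions.

```python
# Sketch of a three-step resequencing pipeline of the kind demonstrated at SC11.
# FastQC, bwa, samtools, and bcftools are stand-ins; all file names are hypothetical.
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

REF, READS = "reference.fa", "reads.fastq"

# Step 1: pre-processing / quality evaluation of the input sequence
run(f"fastqc {READS}")

# Step 2: alignment to a known reference genome
run(f"bwa index {REF}")
run(f"bwa mem {REF} {READS} > aln.sam")
run("samtools sort -o aln.sorted.bam aln.sam")
run("samtools index aln.sorted.bam")

# Step 3: SNP detection over the alignment
run(f"bcftools mpileup -f {REF} aln.sorted.bam | bcftools call -mv -o variants.vcf")
```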

Monon 100
– Provides 100Gb connectivity between IU and Chicago
– Internet2 deploying 100Gb networks nationally
– New opportunities for sharing Big Data
– New opportunities for moving Big Data

[Chart comparing data rates]
– Commodity Internet (1 Gbps, but highly variable)
– Internet2 (100 Gbps)
– NLR to sequencing centers (10 Gbps/link)
– IU Data Capacitor WAN (20 Gbps throughput)
– Ultra SCSI 160 disk (1.2 Gbps, 160 MBps)
– DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
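To put those rates in perspective, here is a rough estimate of the time to move a 1 TB dataset over each network link, ignoring protocol overhead and assuming the link is the only bottleneck.

```python
# Rough time to move a 1 TB dataset at the network rates listed above.
TB_BITS = 1e12 * 8

links_gbps = {
    "Commodity Internet": 1,
    "Sequencing center link": 10,
    "Data Capacitor WAN": 20,
    "Internet2 backbone": 100,
}

for name, gbps in links_gbps.items():
    hours = TB_BITS / (gbps * 1e9) / 3600
    print(f"{name:24s} {hours:5.2f} h")   # e.g. ~2.2 h at 1 Gbps, ~0.02 h at 100 Gbps
```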

NCGAS Logical Model
[diagram: your friendly neighborhood sequencer, regional sequencing lab, and national sequencing center feed data over 10-100 Gbps links into the Lustre WAN file system / Data Capacitor (no data storage charges), which serves compute on NCGAS Mason (free for NSF users), IU POD (12 cents per core hour), or Amazon EC2 (20 cents per core hour; Amazon cloud storage $80-120 per TB per month)]
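A back-of-envelope comparison of compute cost using the per-core-hour rates on the slide; the 5,000 core-hour job size is an arbitrary assumption for illustration.

```python
# Compute cost at the per-core-hour rates quoted on the slide.
# The 5,000 core-hour job size is a made-up example.
CORE_HOURS = 5_000

rates = {                       # dollars per core hour
    "NCGAS Mason (NSF users)": 0.00,
    "IU POD": 0.12,
    "Amazon EC2": 0.20,
}

for platform, rate in rates.items():
    print(f"{platform:26s} ${rate * CORE_HOURS:9,.2f}")
# Storage adds $80-120 per TB per month on Amazon, versus no storage
# charge on the Data Capacitor.
```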

National Center for Genome Analysis Support (NCGAS)
– Using high speed networks like Internet2 and the Monon 100, the DC-WAN facility will ingest data from laboratories with next generation sequencers and serve reference data sets from sources like NCBI
– Data will be processed using IU’s cyberinfrastructure

Special Thanks To
– NCGAS: Bill Barnett and Rich LeDuc
– IU’s High Performance Systems Group
– Application owners and IU’s HPA Team
– IU’s Data Capacitor Team
– Matt Davy, Tom Johnson, Ed Balas, Jeff Ambern, Martin Swany
– Andrew Lee, Chris Robb, Matthew Zekauskas and Internet2
– Evangelos Chaniotakis, Patrick Dorn and ESnet
– Brocade: 10Gb cards, 100Gb cards, and optics
– Ciena: 100Gb optics
– DDN: 2 SFA 10K
– IBM: iDataPlex nodes
– Internet2, ESnet: network link and equipment
– Whamcloud: Lustre support

Thank you! Stephen Simms, High Performance File Systems