1
Empowering Bioinformatics Workflows Using the Lustre Wide Area File System across a 100 Gigabit Network
Stephen Simms, Manager, High Performance File Systems, Indiana University, ssimms@indiana.edu
2
Today’s talk brought to you by NCGAS
–Funded by the National Science Foundation
–Large memory clusters for assembly
–Bioinformatics consulting for biologists
–Optimized software for better efficiency
Open for business at: http://ncgas.org
3
Data
In the 21st century everything is data:
–Patient data
–Nutritional data
–Musical data
Raw material for:
–Scientific advancement
–Technological development
4
Better Technology = More Data
5
Better Telescopes
ODI – One Degree Imager (32k x 32k CCD)
–WIYN (Wisconsin, Indiana, Yale, NOAO) telescope in Arizona
–ODI will provide 1 billion pixels/image
Pan-STARRS
–Providing 1.4 billion pixels/image
–Currently has over 1 petabyte of images stored
6
Better Televisions
Ultra High Definition Television (UHDTV)
–16 times more pixels than HDTV
–Last month LG began sales of an 84” UHDTV
–Tested at the 2012 Summer Olympics
–Storage media lags behind
7
Genomics
Next-gen sequencers are generating more data and getting cheaper.
Sequencing is:
–Becoming commoditized at large centers and
–Multiplying at individual labs
Analytical capacity has not kept up:
–Storage support
–Computational support (thousand points solution)
–Bioinformatics support
8
Data Capacitor
–NSF funded in 2005
–535 terabytes of Lustre storage (currently 1.1 PB)
–24 servers with 10Gb NICs
–Short- to mid-term storage
Photo credits:
http://www.flickr.com/photos/shadowstorm/404158384/
http://www.flickr.com/photos/dvd5/163647219/
http://www.flickr.com/photos/vidiot/431357888/
9
The Lustre Filesystem
–Open source
–Supports many thousands of client systems
–Supports petabytes of storage
–Over 240 GB/s measured throughput at ORNL
Scalable:
–aggregates separate servers for performance
–user-specified “stripes”
Standard POSIX interface
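The user-specified “stripes” above are how Lustre aggregates servers for performance: a file is cut into fixed-size chunks laid out round-robin across object storage targets (OSTs). A minimal sketch of that offset-to-OST mapping, with illustrative names and defaults (not Lustre internals):

```python
# Sketch of round-robin striping: which OST holds a given byte of a
# striped file. Function and parameter names are illustrative.

def ost_for_offset(offset, stripe_size=1 << 20, stripe_count=4, start_ost=0):
    """Return the OST index holding the byte at `offset`.

    With `stripe_size` bytes written to each OST in turn, chunk k of
    the file lands on OST (start_ost + k) mod stripe_count.
    """
    chunk = offset // stripe_size
    return (start_ost + chunk) % stripe_count

# A 4 MB file striped 1 MB at a time across 4 OSTs touches all of them,
# so a large read can pull from four servers in parallel:
osts = {ost_for_offset(i << 20) for i in range(4)}
```

In a real deployment the layout is chosen per file or directory with the `lfs setstripe` utility (stripe count and stripe size options); the round-robin placement itself is handled by Lustre.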
10
Lustre Scalable Object Storage
(Diagram: clients talk to a metadata server (MDS) and object storage servers (OSS))
11
Computation
12
Workflow - The Data Lifecycle
13
http://www.flickr.com/photos/davesag/4307240/in/set-799526/
14
Data Lifecycle – Centralized Storage
15
NCGAS Cyberinfrastructure at IU
–Mason large-memory cluster (512 GB/node)
–Quarry cluster (16 GB/node)
–Data Capacitor (1.1 PB)
–Research File System (RFS)
–Research Database Cluster for structured data
–Bioinformaticians and software engineers
16
Galaxy: Make it easier for Biologists
–The Galaxy interface provides a “user friendly” window to NCGAS resources
–Supports many bioinformatics tools
–Available for both research and instruction
(Chart: computational skill across the user base; low skills are common, high skills rare)
17
GALAXY.IU.EDU Model
A virtual box hosting Galaxy.IU.edu fronts Quarry, Mason, the Data Capacitor, and the RFS; the host for each tool is configured to meet IU needs.
–UITS/NCGAS establishes tools, hardens them, and moves them into production.
–A custom Galaxy tool can be made to import data from the RFS to the DC.
–Individual labs can get duplicate boxes, provided they support them themselves.
–Policies on the DC guarantee that untouched data is removed over time.
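The purge policy in the last bullet amounts to a periodic scan over file access times. A minimal sketch, assuming a 60-day idle window (an invented example, not IU's actual policy or tooling):

```python
import os
import time

def purge_candidates(root, max_idle_days=60):
    """Walk `root` and yield paths not accessed in `max_idle_days`.

    Sketch of a scratch-purge scan over a file tree; the 60-day
    window is an assumed example value.
    """
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except FileNotFoundError:
                continue  # file removed while we were scanning
```

A production purger would also honor project-space exemptions and notify owners before deleting anything; this sketch only identifies candidates.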
18
Increasing the DC’s Utility
If we’re getting high-speed performance across campuses, what could we do across longer distances?
–Empower geographically distributed workflows
–Facilitate data sharing among colleagues
–Provide data everywhere, all the time
19
2006 – 10 Gb Lustre WAN
–977 MB/s between ORNL and IU
–Using a single Dell 2950 client
–Across a 10Gb TeraGrid connection
20
2007 Bandwidth Challenge Win: Five Applications Simultaneously
Acquisition and Visualization:
–Live instrument data (Chemistry)
–Rare archival material (Humanities)
Acquisition, Analysis, and Visualization:
–Trace data (Computer Science)
–Simulation data (Life Science)
–High Energy Physics
21
Beyond a Demo
To make Lustre across the wide area network useful, and more than a demo, we needed to span heterogeneous name spaces:
–In Unix each user has a UID
–A user’s UID can differ from system to system
–To preserve ownership across systems we created a method for mapping UIDs
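The ownership problem above can be sketched as a translation table keyed by (site, local UID): whatever UID a user has on each client system, writes land on the shared file system under one canonical owner. The table, names, and numbers below are invented for illustration and are not IU's actual mechanism:

```python
# Sketch of cross-site UID translation. All identifiers here are
# hypothetical examples.

UID_MAP = {
    ("siteA", 1001): "ssimms",   # same person, different local UIDs
    ("siteB", 5044): "ssimms",
    ("siteA", 1002): "aliced",
}

CANONICAL_UID = {"ssimms": 70001, "aliced": 70002}

def translate_uid(site, local_uid):
    """Map a (site, local UID) pair to the file system's canonical UID,
    so files written from either site get the same owner."""
    user = UID_MAP.get((site, local_uid))
    if user is None:
        raise KeyError(f"no mapping for uid {local_uid} at {site}")
    return CANONICAL_UID[user]
```

The essential property is that the same person maps to the same canonical UID regardless of which client system the I/O came from.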
22
IU’s Data Capacitor WAN Filesystem
–Funded by Indiana University in 2008
–Put into production in April of 2008
–360 TB of storage available as a production service
Centralized short-term storage for resources nationwide:
–Simplifies use of distributed resources
–Project space exists for mid-term storage
23
Gas Giant Planet Research
24
2010: Lustre WAN at 100Gb
25
100 Gbit Testbed – Full Duplex Results 16*8 Gbit/s 16*20 Gbit/s DDR IB 5*40 Gbit/s QDR IB 16*8 Gbit/s 100GbE Writing to Freiberg 10.8 GB/s Writing to Dresden 11.1 GB/s
26
100 Gbit Testbed – Uni-Directional Efficiency
–Lustre: 11.79 GByte/s (94.4%)
–TCP/IP: 98.5 Gbit/s (98.5%)
–Link: 100 Gbit/s (100.0%)
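A back-of-envelope check of these percentages (a sketch: 8 × 11.79 gives roughly 94.3 Gbit/s, so the slide's 94.4% presumably reflects the unrounded measurement):

```python
# Convert the measured Lustre rate to Gbit/s and express both
# measurements as a fraction of the 100 Gbit/s link.
LINK_GBIT = 100.0
lustre_gbyte = 11.79            # measured Lustre throughput, GByte/s
tcp_gbit = 98.5                 # measured TCP/IP throughput, Gbit/s

lustre_gbit = lustre_gbyte * 8                  # ~94.3 Gbit/s on the wire
lustre_eff = 100 * lustre_gbit / LINK_GBIT      # ~94% of link capacity
tcp_eff = 100 * tcp_gbit / LINK_GBIT            # 98.5% of link capacity
```

The striking point is how small the gap is between raw TCP/IP and a full POSIX file system over the same link.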
27
2011: SCinet Research Sandbox
Supercomputing 2011, Seattle
–Joint effort of SCinet and the Technical Program
Software-defined networking and 100 Gbps
–From Seattle to Indianapolis (2,300 miles)
Demonstrations using Lustre WAN:
–network
–benchmark
–applications
28
Network, Hardware and Software Internet2 and ESnet, 50.5 ms RTT
29
Network, Hardware and Software
30
Application Results
–Peak: 6.2 GB/s
–Sustained: 5.6 GB/s
31
NCGAS Workflow Demo at SC11 (Bloomington, IN to Seattle, WA)
–STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
–STEP 2: sequence alignment to a known reference genome
–STEP 3: SNP detection to scan the alignment result for new polymorphisms
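The three steps above can be sketched as a tiny pipeline. The real demo ran production bioinformatics tools; these pure-Python stand-ins only illustrate the shape of the workflow (QC, then alignment, then mismatch reporting):

```python
# Toy three-step sequencing workflow: pre-processing, alignment,
# SNP detection. All logic here is a simplified stand-in.

def preprocess(reads):
    """Step 1: drop reads with ambiguous 'N' bases (stand-in for QC)."""
    return [r for r in reads if "N" not in r]

def align(read, reference):
    """Step 2: place `read` at the reference position with the fewest
    mismatches, allowing at most one."""
    best = None  # (position, [(pos, ref_base, read_base), ...])
    for i in range(len(reference) - len(read) + 1):
        mismatches = [
            (i + j, reference[i + j], read[j])
            for j in range(len(read))
            if reference[i + j] != read[j]
        ]
        if len(mismatches) <= 1 and (best is None or len(mismatches) < len(best[1])):
            best = (i, mismatches)
    return best

def call_snps(reads, reference):
    """Step 3: report every aligned mismatch as a candidate SNP."""
    snps = set()
    for read in reads:
        hit = align(read, reference)
        if hit:
            snps.update(hit[1])
    return sorted(snps)

def run_workflow(reads, reference):
    return call_snps(preprocess(reads), reference)
```

In the demo, each stage read its input from and wrote its output to the wide-area Lustre mount, so the same data was visible in Bloomington and Seattle without explicit transfers.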
32
Monon 100
–Provides 100Gb connectivity between IU and Chicago
–Internet2 deploying 100Gb networks nationally
–New opportunities for sharing Big Data
–New opportunities for moving Big Data
33
Bandwidth comparison:
–Commodity Internet: 1 Gbps, but highly variable
–Internet2: 100 Gbps
–NLR to sequencing centers: 10 Gbps/link
–IU Data Capacitor WAN: 20 Gbps throughput
–Ultra SCSI 160 disk: 1.2 Gbps (160 MB/s)
–DDR3 SDRAM: 51.2 Gbps (6.4 GB/s)
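To put that ladder in practical terms, a small sketch converting each rate into transfer time for a hypothetical 1 TB sequencing dataset (decimal units; protocol overhead and the disk/RAM endpoints ignored):

```python
# Time to move a dataset at each network rate from the comparison.

def transfer_seconds(terabytes, gbit_per_s):
    """Seconds to move `terabytes` (decimal TB) at `gbit_per_s`."""
    bits = terabytes * 1e12 * 8
    return bits / (gbit_per_s * 1e9)

rates = {
    "commodity Internet (1 Gbps)": 1,
    "NLR link (10 Gbps)": 10,
    "DC-WAN throughput (20 Gbps)": 20,
    "Internet2 (100 Gbps)": 100,
}
hours = {name: transfer_seconds(1, g) / 3600 for name, g in rates.items()}
```

At 1 Gbps a terabyte takes over two hours even in the best case; at 100 Gbps it is under a minute and a half, which is what makes centralized wide-area storage practical.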
34
NCGAS Logical Model
Compute:
–NCGAS Mason (free for NSF users)
–IU POD (12 cents per core hour)
–Amazon EC2 (20 cents per core hour)
Storage:
–Data Capacitor (Lustre WAN file system): no data storage charges
–Amazon cloud storage: $80–120 per TB per month
Data sources, connected at 10 Gbps and 100 Gbps: your friendly neighborhood sequencer, your friendly regional sequencing lab, your friendly national sequencing center
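The cost trade-off in this model can be sketched with the per-core-hour and per-TB rates on the slide. The 1,000 core-hour, 5 TB job is a hypothetical example, and these are the slide's 2012-era prices, not current ones:

```python
# Rough cost comparison using the slide's rates. The job size is a
# hypothetical example; Amazon storage is taken at $100/TB/month,
# the middle of the slide's $80-120 range.

def compute_cost(core_hours, rate_per_core_hour):
    return core_hours * rate_per_core_hour

def storage_cost_per_month(terabytes, rate_per_tb_month):
    return terabytes * rate_per_tb_month

# Hypothetical job: 1,000 core-hours plus 5 TB held for one month.
iu_pod = compute_cost(1000, 0.12)  # $120; Data Capacitor storage is free
ec2 = compute_cost(1000, 0.20) + storage_cost_per_month(5, 100)  # $200 + $500
```

Because the Data Capacitor carries no storage charges, the gap widens the longer data must be retained, which is the point of the logical model.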
35
National Center for Genome Analysis Support (NCGAS)
Using high-speed networks like Internet2 and the Monon 100, the DC-WAN facility will ingest data from laboratories with next-generation sequencers and serve reference data sets from sources like NCBI. Data will be processed using IU’s cyberinfrastructure.
36
Special Thanks To
–NCGAS: Bill Barnett and Rich LeDuc
–IU’s High Performance Systems Group
–Application owners and IU’s HPA Team
–IU’s Data Capacitor Team
–Matt Davy, Tom Johnson, Ed Balas, Jeff Ambern, Martin Swany
–Andrew Lee, Chris Robb, Matthew Zekauskas and Internet2
–Evangelos Chaniotakis, Patrick Dorn and ESnet
–Brocade: 10Gb cards, 100Gb cards, and optics
–Ciena: 100 Gb optics
–DDN: 2 SFA 10K
–IBM: iDataPlex nodes
–Internet2, ESnet: network link and equipment
–Whamcloud: Lustre support
37
Thank you! Stephen Simms ssimms@iu.edu High Performance File Systems hpfs-admin@iu.edu