The National Center for Genomic Analysis Support: creating a national cyberinfrastructure environment for genomics researchers

William Barnett, Thomas G. Doak, Le-Shin Wu, Carrie L. Ganote
Indiana University

Services Offered

Bioinformatics consulting – including advice on library preparation, choice of assembly software, and recommended parameters.
Cyberinfrastructure – the hardware and system support to manage and analyze genomics data at scale (see Cyberinfrastructure and Architecture).
Archival data storage – archival tape storage for long-term safe deposit of final results and raw data. In addition, the IUScholarWorks repository can be linked to the archived data, providing a convenient link for access to raw or supplementary data.
Curated software support – popular software tools are installed, optimized, and maintained on IU machines; the Mason software list below gives examples.
Genome Browser deployment – a genome browser loaded with your data and hosted by NCGAS (see the JBrowse figure).
Graphical interfaces for bioinformatics tools – Galaxy and GenePattern, two web-based portals for bioinformatics analysis deployed by NCGAS. The Trinity assembler has a web portal in its own version of Galaxy.

Software supported on the Mason cluster as of October, 2014 [5]

abyss          fastqc          oases
allpathslg     galaxy          picard
amos           gatk            raxml
arachne        genomemapper    rsem
bedtools       gmap            sam2counts
bio3d          hmmer           samtools
bioconductor   khmer           scythe
blat           macs            shore
bowtie         maker           smrt
bwa            metamos         soapdenovo
cd-hit         mlrho           sra-toolkit
celera         mothur          stacks
cufflinks      mummer          tophat
cutadapt       ninja           transabyss
cytoscape      namd            trinityrnaseq
edena          novoalign       velvet

Cyberinfrastructure and Architecture

Mason large memory cluster – with 512 GB of memory and 32 cores in each of its 18 nodes, the Mason cluster is a real workhorse of bioinformatics analysis.
Open Science Grid – with a highly distributed grid architecture, the OSG provides opportunistic cycles, allowing a user to potentially run thousands of tasks at once.
XSEDE – the Extreme Science and Engineering Discovery Environment awards allocations on some of the country's fastest and largest supercomputers.
Data storage and movement – an optional 50TB allocation is available to NCGAS users on the 15/7PB Scholarly Data Archive for archiving. The Data Capacitor 2 is IU's 5PB high performance file system, tuned for fast reading and writing of large files. These systems are tied into the 100GigE Internet2 backbone.

Figure: The Mason Cluster in IU's Data Center.
Figure: JBrowse [6] on the IQ Wall, Cyberinfrastructure Building, Bloomington IN.

BLAST on the OSG

BLAST [1] is an essential bioinformatics tool, heavily used in genomics to infer homology between sequences. BLAST treats each input as a separate entity and can be run in a highly parallel fashion, which makes it an ideal target for running on a grid.

Figure: A typical Galaxy setup showing the different connections machines use to communicate, handle data, and send jobs.
Figure: The BLAST on OSG tool viewed through the Galaxy [2-4] interface.
Figure (panels: Original Concept; Grid + Cluster Concept): These diagrams show two communication setups between the Galaxy server and the job, running through HTCondor on the Open Science Grid.

User Feedback

NCGAS conducted two recent surveys to assess the needs of genomics researchers. The first survey was addressed to NCGAS users, and the second went to NSF-funded biologists.

Results from NCGAS User Survey

On a 1-5 Likert scale, where 1 is "very dissatisfied" and 5 is "very satisfied," the average overall score for NCGAS services was 4.4 ± 1.4 (95% confidence interval).
63% of respondents indicated, "I could not have done my research without NCGAS," while another 30% indicated NCGAS was helpful to completing their research.
Common comments from participants included requests for better data handling, documentation and training, and more personnel for NCGAS.

Results from NSF Survey

Results from the second survey include:
Bioinformatics consulting – the largest perceived need among the participants of this study was grant-funded bioinformatics consulting support.
Data storage and movement – after consulting support, the handling of data was the next obstacle where participants indicated help would be important. This includes the long- and short-term storage of data, as well as the movement of large data sets.
Cyberinfrastructure – high performance computing with sufficient processing power and memory is another area researchers would find helpful.
Curated software support – participants chose curated, installed and maintained, published software applications among needed services.

References

[1] BLAST: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:
[2] Galaxy: Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biol Aug 25;11(8):R86.
[3] Galaxy: Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists." Current Protocols in Molecular Biology Jan; Chapter 19:Unit
[4] Galaxy: Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research Oct; 15(10):
[5] PY3: Barnett, William K.; Stewart, Craig A. (2014). National Center for Genome Analysis Program Year 3 Report – September 15, 2013 – September 14,
[6] JBrowse: Skinner, M. E., Uzilov, A. V., Stein, L. D., Mungall, C. J., and Holmes, I. H. (2009). "JBrowse: a next-generation genome browser." Genome Research, 19(9):
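
Supplementary sketch: the BLAST on the OSG text notes that BLAST treats each input as a separate entity, which is what makes it suitable for a grid. A minimal Python illustration of that idea follows, assuming the queries arrive as one multi-record FASTA text. It is not the NCGAS tool's implementation; the names read_fasta and split_queries are hypothetical, and the round-robin split is only one possible way to divide the work. Each returned chunk could then be handed to its own grid job, and the per-chunk results concatenated afterwards.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    parts: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_queries(text: str, n_jobs: int) -> List[str]:
    """Distribute query records round-robin into at most n_jobs FASTA chunks.

    Hypothetical helper: because each query is searched independently,
    any partition of the records gives the same combined result.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(n_jobs)]
    for i, (header, seq) in enumerate(read_fasta(text)):
        buckets[i % n_jobs].append(f">{header}\n{seq}\n")
    # Drop empty buckets when there are fewer records than jobs.
    return ["".join(b) for b in buckets if b]


if __name__ == "__main__":
    demo = ">q1\nACGT\nACGT\n>q2\nTTTT\n>q3\nGGCC\n"
    for job_id, chunk in enumerate(split_queries(demo, 2)):
        print(f"--- chunk {job_id} ---")
        print(chunk, end="")
```

Splitting by record rather than by byte offset keeps every query sequence whole, so no job sees a truncated input.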