HII Technical Infrastructure

Data Management

TEDDY 'omics data:
- Quantity of data: ~550 TB
- Diversity of data sources: 9 labs, 28 analytes
- Analytical partners: 9 EAP groups, 47 HPC users, 76 data sharing platform users
- Data releases: >65

Data Management

Total raw data storage as of APR 2018: 2.0 PB
Total expected Case-Control data: ~550 TB

- Dietary Biomarkers: 4.1 MB
- Exome: 100 GB
- Gene Expression: 12 to 14 TB
- Metabolomics: 16 to 24 TB
- Microbiome & Metagenomics: 86 TB
- Proteomics: 2 to 3 TB
- SNPs: 60 GB
- RNA Sequencing: 150 TB
- Whole genome sequencing: 250 TB
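
As a quick sanity check, the per-assay figures roughly account for the ~550 TB Case-Control estimate. A minimal Python tally using the midpoint of each range (decimal units assumed):

```python
# Midpoints of the per-assay estimates above, in TB.
volumes_tb = {
    "Dietary Biomarkers": 4.1e-6,            # 4.1 MB
    "Exome": 0.1,                            # 100 GB
    "Gene Expression": (12 + 14) / 2,
    "Metabolomics": (16 + 24) / 2,
    "Microbiome & Metagenomics": 86,
    "Proteomics": (2 + 3) / 2,
    "SNPs": 0.06,                            # 60 GB
    "RNA Sequencing": 150,
    "Whole Genome Sequencing": 250,
}

print(f"{sum(volumes_tb.values()):.1f} TB")  # -> 521.7 TB, in line with ~550 TB
```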

Technical Infrastructure

Objective: Comprehensively store, manage, and share HII Big Data assets in support of 'omics analysis, allowing analytical partners to bring their analyses to the data.

Components:
- Data Infrastructure
  - Clinical Data Warehouse
  - Big Data Repository
  - Controlled Vocabulary Repository
  - Laboratory Transfer CLI
  - Data Exchange API (hdExchange)
- Analytical Infrastructure
  - High Performance Computing (HPC) Cluster
  - Analysis Software Library

Technical Infrastructure (diagram)

Analytical Infrastructure: HPC Cluster Hardware

The HPC platform consists of two clusters:
- HII (hii): 90+ nodes with ~1,600 cores / 8 TB memory
- RC (circe): 400+ nodes with ~5,000 cores / 12 TB memory

Nodes in the clusters are upgraded and expanded on a continual basis. Specifications of the latest 40 nodes provisioned:
- Processors: 20 cores per node (2x 10-core Xeon E5-2650 v3 @ 2.30 GHz, Haswell microarchitecture)
- Memory: 128 GB @ 2133 MHz
- MPI/storage interconnect: QDR InfiniBand @ 32 Gb/s

All nodes have access to the following storage arrays:
- 1.7 PB DDN GPFS (home directories, group shares, and scratch)
- 300 TB Oracle ZFS (genetic sequence files and curated results)

https://usf-hii.github.io/pages/hii-hpc.html
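
For context on how an analytical partner might run work on these nodes, here is a minimal sketch of submitting a batch job, assuming a SLURM scheduler (see the hii-hpc page above for the actual setup); the job name, resource values, mount points, and alignment command are hypothetical placeholders:

```python
import subprocess
import textwrap

# Hypothetical SLURM batch script: resource values, paths, and the
# alignment command are placeholders, not documented values.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=rnaseq-align
    #SBATCH --ntasks=20
    #SBATCH --mem=64G
    #SBATCH --time=04:00:00

    # Home, group shares, and scratch live on the 1.7 PB GPFS array;
    # sequence files on the 300 TB ZFS array (mount points assumed).
    srun align_reads /gpfs/scratch/$USER/sample.fastq
""")

with open("job.sbatch", "w") as handle:
    handle.write(job_script)

# Hand the script to the scheduler (assumes SLURM's sbatch is on PATH).
subprocess.run(["sbatch", "job.sbatch"], check=True)
```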

Data Infrastructure: hdExchange

hdExchange API:
- Primary mechanism for programmatically accessing TEDDY clinical and 'omics data from the HPC environment
- Hides the complexities of backend data management, providing a single point of contact for straightforward access to data assets

https://exchange.hiidata.org/documentation.htm

Data Infrastructure: hdExchange Specifications

- RESTful Web API (W3C standards for REST architecture)
- Token-based API authentication
- Synchronous delivery of tabular data: clinical metadata & data dictionary
- Asynchronous processing of data file requests: background process and message queue for scalability and Big Data
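
To make the two delivery modes concrete, here is a minimal Python sketch. Only the token-based authentication and the synchronous/asynchronous split come from the slide; the endpoint paths, field names, and polling interval are hypothetical (consult the documentation link above for the real routes):

```python
import time
import requests

BASE = "https://exchange.hiidata.org"              # real host; paths below are assumed
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}  # token-based authentication

# Synchronous: tabular clinical metadata / data dictionary returned directly.
resp = requests.get(f"{BASE}/api/dictionary", headers=HEADERS)
resp.raise_for_status()
dictionary = resp.json()

# Asynchronous: a large 'omics file request is queued to a background
# process, and the client polls until the prepared file is ready.
job = requests.post(f"{BASE}/api/file-requests",
                    headers=HEADERS, json={"dataset": "rna_seq"}).json()
while True:
    status = requests.get(f"{BASE}/api/file-requests/{job['id']}",
                          headers=HEADERS).json()
    if status["state"] == "ready":
        break
    time.sleep(30)  # polling interval is arbitrary
```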

Data Infrastructure: hdExplore

Data Sharing Platform:
- Web application providing an interactive user interface for accessing TEDDY clinical metadata and associated documentation, along with a suite of data manipulation and visualization tools
- Accessible only internally to authorized investigators
