
1 Infrastructure for Sharing Very Large Data Sets
Antonio M. Ferreira, PhD
Executive Director, Center for Simulation and Modeling
Research Associate Professor, Departments of Chemistry and Computational & Systems Biology
University of Pittsburgh
http://www.sam.pitt.edu

2 Parts of the Infrastructure Puzzle
Hardware: networking, storage, compute
Software: beyond scp/rsync (Globus, gtdownload)
Policies: not all data is "free"; access controls

3 Parts of the Infrastructure Puzzle
Hardware: networking, storage, compute
Software: beyond scp/rsync (Globus, gtdownload, bbcp, etc.)
Policies: not all data is "free"; access controls
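Globus transfers can also be scripted rather than driven through the web portal. Below is a minimal sketch using the Globus Python SDK (globus-sdk); the client ID, endpoint UUIDs, and paths are placeholders to replace with your own app registration and collections, and the interactive native-app login shown is only one of several auth options.

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # hypothetical Globus app registration
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"     # e.g. a campus data transfer node
DST_ENDPOINT = "DEST-ENDPOINT-UUID"       # e.g. cluster scratch storage

# Interactive native-app login: visit the printed URL, then paste the code back.
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Authorization code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Describe the transfer; checksum sync means a restarted job only moves changed files.
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT, label="TCGA staging", sync_level="checksum"
)
tdata.add_item("/tcga/brca/", "/scratch/tcga/brca/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted Globus transfer task:", task["task_id"])
```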

4 The "Old" Model
Diagram of the traditional memory hierarchy: CPU core, L1/L1i cache, L2 cache, L3 cache, and main memory, connected by the bus to disk.

5 Network Is the New Bus
The same memory-hierarchy diagram, with the network sitting alongside the bus as the path to disk.

6 Data Sources at Pitt
TCGA: currently 1.1 PB, growing by ~50 TB/mo.; Pitt is the largest single contributor
UPMC hospital system: 27 individual hospitals generating clinical and genomic data; ~30,000 patients in BRCA alone
LHC: generates more than 10 PB/year; Pitt is a Tier 3 site

7 TCGA Data Breakdown

Cancer | Pitt contribution | All universities' contribution | Pitt's percentage
Mesothelioma (MESO) | 9 | 37 | 24.32
Prostate adenocarcinoma (PRAD) | 95 | 427 | 22.25
Kidney renal clear cell carcinoma (KIRC) | 107 | 536 | 19.96
Head and Neck squamous cell carcinoma (HNSC) | 74 | 517 | 14.31
Breast Invasive Carcinoma (BRCA) | 149 | 1061 | 14.04
Ovarian serous cystadenocarcinoma (OV) | 63 | 597 | 10.55
Uterine Carcinosarcoma (UCS) | 6 | 57 | 10.53
Thyroid carcinoma (THCA) | 49 | 500 | 9.80
Skin Cutaneous Melanoma (SKCM) | 41 | 431 | 9.51
Bladder Urothelial Carcinoma (BLCA) | 23 | 268 | 8.58
Uterine Corpus Endometrial Carcinoma (UCEC) | 44 | 556 | 7.91
Lung adenocarcinoma (LUAD) | 31 | 500 | 6.20
Pancreatic adenocarcinoma (PAAD) | 7 | 113 | 6.19
Colon adenocarcinoma (COAD) | 21 | 449 | 4.68
Lung squamous cell carcinoma (LUSC) | 21 | 493 | 4.26
Stomach adenocarcinoma (STAD) | 15 | 373 | 4.02
Kidney renal papillary cell carcinoma (KIRP) | 9 | 227 | 3.96
Rectum adenocarcinoma (READ) | 6 | 169 | 3.55
Sarcoma (SARC) | 7 | 199 | 3.52
Pheochromocytoma and Paraganglioma (PCPG) | 4 | 179 | 2.23
Liver hepatocellular carcinoma (LIHC) | 3 | 240 | 1.25
Cervical Squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 3 | 242 | 1.24
Esophageal carcinoma (ESCA) | 2 | 165 | 1.21
Adrenocortical Carcinoma (ACC) | 0 | 92 | 0.00
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) | 0 | 38 | 0.00
Glioblastoma multiforme (GBM) | 0 | 661 | 0.00
Kidney chromophobe (KICH) | 0 | 113 | 0.00
Acute Myeloid Leukemia (LAML) | 0 | 200 | 0.00
Brain Lower Grade Glioma (LGG) | 0 | 516 | 0.00

8

9 How Quickly Do You Need Your Data? http://fasterdata.es.net/home/requirements-and-expectations
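As a rough worked version of the fasterdata guidance, the sustained rate you need is just the dataset size in bits divided by the time window. The sketch below prints that figure for a few cases; the deadlines are illustrative, while the 50 TB and 1.1 PB sizes echo the TCGA numbers from slide 6.

```python
def required_gbps(size_tb: float, hours: float) -> float:
    """Sustained network rate (Gbit/s) needed to move size_tb terabytes in `hours` hours."""
    bits = size_tb * 1e12 * 8           # terabytes (decimal) -> bits
    return bits / (hours * 3600) / 1e9  # bits per second -> gigabits per second

if __name__ == "__main__":
    # Illustrative cases: 1 TB in an hour, a month of TCGA growth in a day, a full mirror in a week.
    for size_tb, hours in [(1, 1), (50, 24), (1100, 7 * 24)]:
        rate = required_gbps(size_tb, hours)
        print(f"{size_tb:>6} TB in {hours:>4} h -> {rate:7.2f} Gbit/s sustained")
```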

10 How Do We Leverage This on Campus? http://noc.net.internet2.edu/i2network/maps-documentation/maps.html

11 Science DMZ http://fasterdata.es.net/science-dmz/science-dmz-architecture/

12 After the DMZ
Now that you have a DMZ, what's next? It's the last mile: it is relatively easy to bring 100 Gbps to the data center, but another thing entirely to deliver such speeds to clients (disk, compute, etc.).
How do we address the challenge? Data Center Ethernet (DCE) and InfiniBand (IB) are converging. Right now, a high-bandwidth network to storage is probably the best we can do; select users and laboratories get 10 GE to their systems.
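To make the last-mile gap concrete, here is a back-of-the-envelope sketch comparing a 100 Gbps ingest rate with what a single client can absorb. The client NIC and disk figures are generic assumptions for illustration, not measurements from our systems.

```python
# Back-of-the-envelope: how much of a 100 Gbps DMZ link one client can actually use.
DMZ_GBPS = 100.0          # link speed into the data center
CLIENT_NIC_GBPS = 10.0    # a "select" lab with 10 GE to its systems
CLIENT_DISK_MBPS = 500.0  # assumed sustained write speed of one client's local disk

dmz_gbytes_per_s = DMZ_GBPS / 8
nic_gbytes_per_s = CLIENT_NIC_GBPS / 8
disk_gbytes_per_s = CLIENT_DISK_MBPS / 1000

print(f"DMZ delivers      {dmz_gbytes_per_s:5.2f} GB/s")
print(f"One 10 GE client  {nic_gbytes_per_s:5.2f} GB/s "
      f"({DMZ_GBPS / CLIENT_NIC_GBPS:.0f} such clients to fill the link)")
print(f"One local disk    {disk_gbytes_per_s:5.2f} GB/s "
      f"({dmz_gbytes_per_s / disk_gbytes_per_s:.0f} disks' worth of write bandwidth)")
```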

13 Campus 100GE Networking

14 Pitt/UPMC Networking

15 Beyond the Campus: XSEDE
The most advanced, powerful, and robust collection of integrated digital resources and services in the world: 11 supercomputers, 3 dedicated visualization servers, over 2 PFLOPS of peak computational power.
A single virtual system that scientists can use to interactively share computing resources, data, and expertise …
Online training for XSEDE and general HPC topics; annual XSEDE conference.
Learn more at http://www.xsede.org

16 PSC/Pitt Storage http://www.psc.edu/index.php/research-programs/advanced-systems/data-exacell

17 SLASH2 Architecture

18 After the DMZ (cont.)
Need the right file systems to back-end a DMZ: Lustre/GPFS. How do you pull data from the high-speed network? Where will it land? The DMZ explicitly avoids certain security restrictions.
Access controls: genomics/bioinformatics is growing enormously, but the DMZ is likely not HIPAA-compliant. Is the data EPHI? Can we let it live with non-EPHI data?

19 Current File Systems
/home directories are traditional NFS.
SLASH2 filesystem for long-term storage: 1 PB of total storage, accessible from both PSC and Pitt compute hardware.
Lustre for "active" data: 5 GB/s total throughput, 800 MB/s single-stream performance.
InfiniBand connectivity: important for both compute and I/O.
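Using the Lustre figures from this slide, a short sketch of what they imply for staging data (the 50 TB dataset size is illustrative): roughly six to seven concurrent streams are needed to approach the aggregate rate, and single-stream staging is several times slower.

```python
import math

AGGREGATE_GBPS = 5.0      # GB/s total Lustre throughput (from the slide)
SINGLE_STREAM_GBPS = 0.8  # GB/s per stream, i.e. 800 MB/s (from the slide)
DATASET_TB = 50.0         # illustrative dataset size, e.g. one month of TCGA growth

streams_needed = math.ceil(AGGREGATE_GBPS / SINGLE_STREAM_GBPS)
hours_single = DATASET_TB * 1000 / SINGLE_STREAM_GBPS / 3600
hours_aggregate = DATASET_TB * 1000 / AGGREGATE_GBPS / 3600

print(f"Streams to saturate Lustre: {streams_needed}")
print(f"Staging {DATASET_TB:.0f} TB single-stream: {hours_single:.1f} h")
print(f"Staging {DATASET_TB:.0f} TB at full rate:  {hours_aggregate:.1f} h")
```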

20 Computing on Distributed Genomes
How do we make this work once we get the data? We need the APIs.
Genomic data from UPMC: UPMC has the data collection but lacks HPC systems for analysis.

21 Institute for Personalized Medicine
A Pitt/UPMC joint venture: Drug Discovery Institute, Pitt Cancer Institute, UPMC Cancer Institute, UPMC Enterprise Analytics.
Aims to improve patient care, discover novel uses for existing therapeutics, develop novel therapeutics, and enable genomics-based research and treatment.

22 What Is PGRR?
What PGRR is:
1. A common information technology framework for accessing deidentified national big-data datasets that are important for Personalized Medicine
2. A portal that allows you to use this data easily with tools and resources provided by the Simulation and Modeling Center (SaM), Pittsburgh Supercomputing Center (PSC), and UPMC Enterprise Analytics (EA)
3. A managed environment to help you meet the information security and regulatory requirements for using this data
4. A process for helping you stay current about updates and modifications made to these datasets
What PGRR is not:
1. A place to store your individual research results
2. A system to access UPMC clinical data
3. A service for analyzing data on your behalf

23 Pittsburgh Genome Resource Repository (PGRR) architecture diagram: TCGA data is pulled from sources such as NCI/CGHub (via Globus Online/GridFTP) into Data Exacell storage (SLASH2) spanning PSC and Pitt (IPM, UPCI). PSC systems include Blacklight and Sherlock with supercell (100 TB), Brashear (290 TB), Xyratex (240 TB), and local Blacklight storage (75 TB and 100 TB); Pitt systems include Frank with Panasas (40 TB), the IPM portal, pipeline codes, and Virtuoso database nodes. Sites are joined by a 10 Gbit replication link (throttled to 2 Gbit), with InfiniBand and 1 Gbit (assumed) connectivity internally, SLASH2 metadata servers, and separate BAM and non-BAM stores growing to ~1 PB of BAM data and 33 TB of non-BAM data.

24 How Do We Protect Data?
Genomic data (~424 TB): deidentified genomic data, plus patient genomic data from the UPMC system.
DUAs (Data Use Agreements): an umbrella document signed by all Pitt/UPMC researchers; required training for all users; access restricted to DUA users only.
dbGaP (not HIPAA): we host the data, but the user (via the DUA) is ultimately responsible for data protection.
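Operationally, "access restricted to DUA users only" typically comes down to POSIX group membership on the data mount. The sketch below is a minimal illustration of that idea using only the Python standard library; the group name and data path are hypothetical, and real enforcement lives in the filesystem permissions/ACLs rather than in a script.

```python
import grp
import os
import pwd
import sys

DUA_GROUP = "pgrr_dua"    # hypothetical Unix group for researchers who signed the DUA
DATA_ROOT = "/pgrr/tcga"  # hypothetical mount point for the protected data

def user_has_dua(username: str) -> bool:
    """True if the user is in the DUA group (supplementary or primary membership)."""
    group = grp.getgrnam(DUA_GROUP)
    primary_gid = pwd.getpwnam(username).pw_gid
    return username in group.gr_mem or primary_gid == group.gr_gid

if __name__ == "__main__":
    user = pwd.getpwuid(os.getuid()).pw_name
    if not user_has_dua(user):
        sys.exit(f"{user} has not signed the DUA: access to {DATA_ROOT} denied.")
    print(f"{user} may access {DATA_ROOT}")
```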

25 TCGA Access Rules

26 Controlling Access

27 PGRR Data Notifications

28 Acknowledgements
Albert DeFusco (Pitt/SaM), Brian Stengel (Pitt/CSSD), Rebecca Jacobson (Pitt/DBMI), Adrian Lee (Pitt/Cancer Institute), J. Ray Scott (PSC), Jared Yanovich (PSC), Phil Blood (PSC)

29 Center for Simulation and Modeling
Center for Simulation and Modeling (SaM), 326 Eberly, (412) 648-3094, http://www.sam.pitt.edu
Co-directors: Ken Jordan & Karl Johnson
Associate Director: Michael Barmada
Executive Director: Antonio Ferreira
Administrative Coordinator: Wendy Janocha
Consultants: Albert DeFusco, Esteban Meneses, Patrick Pisciuneri, Kim Wong
Network Operations Center (NOC), RIDC Park: Lou Passarello, Jeff Raymond, Jeff White
Swanson School of Engineering (SSoE): Jeremy Dennis

