GCC Genomics Core Computing

1 GCC Genomics Core Computing

2 Current situation: GCC 1.0
Sequencer: Roche 454
Per run: ~1 million reads, ~2 GB raw data
Current cluster (UZ network): 8C 16 GB, 8C 16 GB, 2 TB UZ NAS storage

3 New sequencer: 1000x increase
1.1 TB / run (200 Gbp), ~1000 million reads
One run takes 8 days!
Basic analysis of 1 full run: < 1 week on 3 nodes with 48 GB RAM and 8 CPU cores each (and needs 7 TB of space)
Full-capacity sequencing = full-capacity use of 24 CPU cores
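As a rough sanity check on the figures on this slide, the sketch below computes the sustained output rate of one run and an upper bound on the CPU-hours of its basic analysis. The constant and function names are illustrative only. Two assumptions are not from the slide: decimal units (1 TB = 10^6 MB), and the full 7 days counted as a ceiling, since the slide only says "< 1 week".

```python
# Back-of-envelope figures for one run, using the numbers on this slide.
# Assumptions (not from the slide): decimal units, and the full 7 days of
# analysis time taken as an upper bound.

RUN_SIZE_TB = 1.1        # raw data per run
RUN_DAYS = 8             # duration of one sequencing run
NODES = 3                # nodes used for basic analysis
CORES_PER_NODE = 8
ANALYSIS_DAYS_MAX = 7    # "< 1 week"


def sustained_rate_mb_per_s(size_tb: float, days: float) -> float:
    """Average output rate of the sequencer in MB/s over a whole run."""
    return size_tb * 1e6 / (days * 24 * 3600)


def max_cpu_hours(nodes: int, cores_per_node: int, days: float) -> float:
    """Upper bound on CPU-hours if every core is busy for the whole period."""
    return nodes * cores_per_node * days * 24


if __name__ == "__main__":
    rate = sustained_rate_mb_per_s(RUN_SIZE_TB, RUN_DAYS)
    hours = max_cpu_hours(NODES, CORES_PER_NODE, ANALYSIS_DAYS_MAX)
    print(f"Sequencer output: ~{rate:.1f} MB/s sustained")
    print(f"Basic analysis: at most {hours:.0f} CPU-hours per run")
```

Because a run lasts 8 days and its analysis occupies 24 cores for up to 7 days, back-to-back runs would keep those 24 cores almost continuously busy, which appears to be the point of the last line of the slide.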

4 Meta-analyses & post-analyses
Needs are several-fold higher than for basic run analyses
Integrate multiple runs (e.g., patients versus controls, families, etc.)
Integrate with previous data
Integrate with publicly available data
–RNA-Seq + gene expression data from GEO
Integrate with other data sources
–DNA-Seq + RNA-Seq + Methyl-Seq
Integrate with genome browsers
–Galaxy, UCSC, Ensembl
Make analysis pipelines available to users as a service
Custom analyses as a service or in collaboration

5 Ideal computing setup
[Diagram: High Performance Computing (HPC); 500 MB/s]

6 UZ-GBIOMED-VSC
[Diagram; labels as recovered from the slide, in their original order]
8C 16 GB, 8C 16 GB, 2 TB UZ NAS storage
–Additional RAM (32 GB or 48 GB per node)
–Additional storage? DAS or NAS? Dell, NetApp?
Open-MPI, SGE: distributed computing
Torque/PBS: distributed computing
Flexible computing
~100 CPUs, 6 GB RAM/core; NetApp + DDN storage
Servers, storage, switches
Software: academic tools, CLCBio?
Software: CASAVA (parallelized by user); academic: bowtie, bwa, …; CLCBio?
UZ patient data
Software: CASAVA, CLCBio, Roche
Computing (0.5 EUR / CPU-hour); storage (750-1500 EUR / TB)
VSC, gbiomed, UZ

7 To be discussed
How can the HiSEQ2000 choose between the UZ and KULeuven networks to send run data to storage?
–1Gb
–350 GB / run compressed
Where to store data after secondary analysis?
–Cheap storage
–External HDD
–Tape
Who does what?
–Jeroen / Jan for UZ?
–Stein / Gert / Raf for Biomed?
Can we already buy additional RAM for the UZ cluster?
Can we connect gbiomed servers directly to UZ storage?
–What are the requirements?
Estimate load over 3 levels
–# users
–# runs
–Difficult to estimate now; evaluate after 1 year
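To put the network question in numbers, the sketch below estimates the ideal transfer time for one run's compressed data. It assumes that "1Gb" on the slide means a 1 Gbit/s link and that 350 GB is a decimal figure. It ignores protocol overhead and disk throughput, so real transfers would be slower. The function name is illustrative, and the 10 Gbit/s case is included only for comparison.

```python
# Ideal (line-rate) transfer time for one run's compressed data.
# Assumptions: "1Gb" means a 1 Gbit/s link, sizes are decimal, and
# protocol overhead and disk speed are ignored.

def transfer_minutes(size_gb: float, link_gbit_per_s: float) -> float:
    """Minutes needed to move size_gb gigabytes over the given link."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    size_gbit = size_gb * 8  # bytes to bits
    return size_gbit / link_gbit_per_s / 60


if __name__ == "__main__":
    for link in (1, 10):
        print(f"{link:>2} Gbit/s: {transfer_minutes(350, link):.1f} min")
```

Even at 1 Gbit/s the ideal transfer time is under an hour per run, so under these assumptions the choice of network is more likely to hinge on access policy and sustained throughput than on raw link speed.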

8 What’s next
Decide on gbiomed hardware
List the things needed at UZ
Start testing CASAVA on the UZ system and on VSC
Test CLCBio on the UZ system for Illumina data
Test with 1000 Genomes data



11 Storage
How much do we need?
–1.1 TB per run
–7 TB space during analysis
BUT: keep only runs that are being analyzed
–~3 at a time?
–10-15 TB
After analysis:
–Data delivered to client
–Data compressed and moved to offline storage
Cheap HDD array? Tape? External HDDs?
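The online storage estimate can be explored with a small calculation. The sketch below assumes that each run kept online occupies either its raw size (1.1 TB) or its peak analysis footprint (7 TB). That two-state model and the names used are illustrative assumptions, not something the slide states.

```python
# Online working-set estimate for runs kept on fast storage.
# Assumption: a run occupies either its raw size or its peak analysis size.

RAW_TB_PER_RUN = 1.1       # raw data per run
ANALYSIS_TB_PER_RUN = 7.0  # peak space while a run is being analyzed


def working_set_tb(runs_kept: int, runs_at_peak: int) -> float:
    """Storage needed when runs_at_peak of runs_kept runs are mid-analysis."""
    if not 0 <= runs_at_peak <= runs_kept:
        raise ValueError("runs_at_peak must be between 0 and runs_kept")
    idle = runs_kept - runs_at_peak
    return runs_at_peak * ANALYSIS_TB_PER_RUN + idle * RAW_TB_PER_RUN


if __name__ == "__main__":
    for at_peak in range(4):
        print(f"{at_peak} of 3 runs at peak: {working_set_tb(3, at_peak):.1f} TB")
```

Under this model the 10-15 TB figure corresponds roughly to one or two of the three runs being at peak at the same time. All three at peak simultaneously would need about 21 TB.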

12 Proposal for GCC2.0 (ideas under construction)
[Diagram; labels as recovered from the slide, in their original order]
UZ computing nodes (existing): 8C 16 GB, 8C 16 GB, 2 TB; UZ NetApp storage
Patient-related data
Non-patient-related data (e.g., model organisms, cell lines, …)
32C 256 GB, 8C 48 GB
gbiomed computing nodes
Fast interconnect; high I/O bandwidth
Illumina HiSEQ2000, Roche 454
ICTS/VSC: NetApp + DDN storage
VSC (existing), pay per CPU-hour
Non-patient-related data
10Gb link
Legend: ! = to create, to test, or to open

13 GCC2.0 features
Divide and conquer: solution at 3 levels
–UZ: for UZ-patient-related data (protected)
–Gbiomed: ad hoc, flexible computing for research (non-UZ-patient-related data)
–VSC: high-performance computing (non-UZ-patient-related data)
Storage (too expensive to duplicate)
–VSC storage with Gbiomed access (create a fast 10Gb interconnect from ICTS to gbiomed)
–UZ storage with Gbiomed access (create an ‘open-access’ policy for non-patient-related data)
–Gbiomed ad hoc storage (HDDs in the local servers)
Computing
–VSC for HPC
–Servers in UZ (patient-related data)
–Servers in gbiomed (for research-related ad hoc analyses, web services, development, software testing, …)
Requires fast (10Gb ethernet) access to ICTS storage and fast (and open) access to UZ-open storage

14 GCC2.0 Cost, Timing & Effort estimates
Budget from Stichting tegen Kanker
–200-250 K left for computing
Solution for the first 3 years should be possible (excluding bioinformatics manpower)
Budget spread between VSC, gbiomed and UZ: to be decided internally in the Genomics Core
VSC: x%
–Storage (86,400 EUR for 32 TB; ~80 TB is needed for 25 runs per year)
–Computing time (29,594 EUR for 55,000 CPU-hours)
Gbiomed local servers and local storage: y%
UZ additional storage: z%
Software licenses (CLCBio) (price quote requested)
–More investments needed over time (e.g., new hardware is only for 3 years)
Timing: 31 August 2010?
Estimated effort (to be discussed)
–VSC:
   Create 10Gb ethernet link to gbiomed (cost?)
   … mandays for startup and testing (network connections, storage, software)
   Maintenance included in price
–Genomics Core bioinformaticians (VRC, CME):
   … mandays for startup and testing
–Gbiomed IT:
   … mandays for setting up local servers & integration with ICTS storage
   … FTE for maintenance of local servers
–UZ:
   … mandays for additional storage and setup of the NetApp share
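The quoted VSC prices can be turned into unit prices with a short calculation. The sketch below also extrapolates the storage quote linearly to the ~80 TB mentioned on the slide. Linear pricing is an assumption made here for illustration, and the names are not from the slide. It also reads "86.400" and "29.594" as thousands (86,400 and 29,594 EUR).

```python
# Unit prices implied by the VSC quotes on this slide.
# Assumption (not from the slide): storage cost scales linearly with capacity.

STORAGE_QUOTE_EUR = 86_400
STORAGE_QUOTE_TB = 32
STORAGE_NEEDED_TB = 80      # "~80 TB is needed for 25 runs per year"
COMPUTE_QUOTE_EUR = 29_594
COMPUTE_QUOTE_HOURS = 55_000


def unit_price(total_eur: float, quantity: float) -> float:
    """Price per unit (per TB or per CPU-hour) implied by a quote."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return total_eur / quantity


if __name__ == "__main__":
    eur_per_tb = unit_price(STORAGE_QUOTE_EUR, STORAGE_QUOTE_TB)
    eur_per_cpu_hour = unit_price(COMPUTE_QUOTE_EUR, COMPUTE_QUOTE_HOURS)
    extrapolated = eur_per_tb * STORAGE_NEEDED_TB
    print(f"Storage: {eur_per_tb:.0f} EUR / TB")
    print(f"Computing: {eur_per_cpu_hour:.3f} EUR / CPU-hour")
    print(f"{STORAGE_NEEDED_TB} TB at the quoted rate: {extrapolated:.0f} EUR")
```

If the 200-250 K budget is in EUR and pricing is indeed linear, 80 TB at the quoted rate would by itself take up most of that budget, which may be worth checking before the x/y/z split is decided.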

15 Hurdles to overcome
1) 10Gb ethernet link between VSC and gbiomed
–For non-UZ-patient-related data
–To transfer Illumina data to VSC
–To run ad hoc analyses on local gbiomed servers, connected to the VSC storage, without the need to duplicate the storage solution and the data (too costly)
–An absolute requirement
–Currently not available
–A necessary investment for future VSC-BMW interactions
2) UZ-patient-related data cannot be transferred to VSC storage, nor computed at VSC
–Can VSC provide a secure transfer, storage and computing environment for UZ data? If not, data analysis and storage for UZ data remain in UZ.
3) Link between UZ storage and gbiomed for non-patient-related data
–Gbiomed-UZ
–10Gb link is possible in principle. Perhaps during the transition period (while waiting for the 10Gb VSC-gbiomed link)?

16 Alternatives
All-in-one solution
PSSCLabs
Public tender

17 Bioinformatics analyses
Estimated effort from a Genomics Core bioinformatician for basic analysis of 1 run: ~2-3 mandays
–Included in service fee?
–This analysis will not be satisfactory for most projects
Fee-based bioinformatics and data analysis service for more advanced analyses?
Many users have a bioinformatician in the group or already collaborate with bioinformaticians
Contribution in the service fee for GCC hardware & maintenance costs and software licenses
Estimated effort:
–Either only basic analysis services are offered: ½ FTE bioinformatics postdoc
–Or basic plus advanced bioinformatics services are offered: 1 FTE bioinformatics postdoc

