Serono Science Scientific computing and high performance applications
- Research Computing in Serono
- Hardware environment
- High performance computing applications
Drug development pipeline
Stages: Target discovery, Target validation, Screening + H2L, PCD, Phase I/II, Phase III/IV, Marketing
Sciences: Proteomics, Genomics, Chemistry, Human Genetics, Biostatistics, Transcriptomics, Mouse genetics, Cell biology, Pharmacology
Technologies: WGS, microarrays, Protein arrays, siRNA, Caliper, combichem, MS, TaqMan, imaging, cellomics, x-ray, NMR, Y2H, HTS
Data types: genomes, Transcript map, SNPs, Protein structure, Patient data, protein map, interactions, screening data, phenotype, images, haplotype, pathways, Structure/activity
Research Computing Vision & Missions
Use in-silico technologies to help identify and progress therapeutic proteins and small molecules that will successfully feed the development pipeline.
By:
- Research Knowledge Management: delivering across Research an integrated information environment that puts scientific data, information and knowledge at the scientist’s fingertips
- Computational Life Science: developing cutting-edge scientific applications enabling in-silico drug discovery and driving Serono’s competitive advantage
- Advanced Data Analysis: providing advanced and pervasive data analysis competencies to make sense of high-throughput and complex data
- Research IT environment: providing the computing and communication infrastructure to deliver the vision
Research computing activities
1. Data processing
2. Predictions and simulation
3. Data analysis
4. Data management
(Diagram labels: Technology driven, Interpretation)
Advanced Data Analysis - Issues
Analysis cannot be performed in silos: we need information systems able to correlate data from all sorts of experiments (genome scans, DGE, RNAi, cell assays, proteomics, interactions, phenotypes).
(Chart labels: Data complexity, Amount of data; timeline 2000, 2002, 2004; periods 2001-3 and 2004-7)
- High content cell assays, genomic sequence, QSAR
- High density microarrays as a discovery platform
- Genome scan data: 100,000 SNPs, hundreds of patients, several diseases
- Biomarker identification through proteomics and transcriptomics
- Compendium, Virtual Combinatorial Library
- Multidimensional decision making
- Microarrays: complex data (time series, complex tissue deconvolution, disease models, full transcriptome)
Grid for the life sciences – differences

                               Physics        Biology
Theory                         « complete »   Nonexistent or imprecise
Level of abstraction (model)   Single         Multiple
Volume                         Very high      Low-medium
Data complexity                Low
(Diagram labels)
- User-friendly interfaces (Web based)
- Generic end-user access
- In silico generation tools, e.g. text mining, data analysis
- Corporate Database
- Core integrated Oracle-based systems
- Drill-down, E-notebook, Publish
- LIMS / QC (shown three times)
- Specialized, complex power-user interfaces
HPC Hardware environment
Systems:
- SGI Origin 3900, 64 proc (cc-NUMA), IRIX, 128 GB
- SGI Altix 3700 BX2, 16 proc
- Timelogic Decypher FPGA bioinformatics accelerator x 4
- SGI Origin 3900, 32 proc
- Linux Xeon cluster, 50 proc
- 10 TB CXFS SAN (Geneva only)
Uses:
- Computational chemistry (docking, combichem, compendium, pharmacophore, structure resolution)
- Bioinformatics (public domain tools, sequence databases, peptide identification, in-silico modeling)
- BLAST, SW, profiles
- Same as above, Boston, Paris
- Distributed data storage
High performance computing applications in Serono today
- Large scale sequence to sequence comparisons
- Genome wide analysis (microRNA, focused gene prediction, gw profiles, etc.)
- Sequence database monitoring
- Gene index and data mapping
- Large scale proteomics (peptide identification)
- Virtual screening
- In-silico biology
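The sequence-to-sequence comparisons listed above rest on local alignment scoring such as Smith-Waterman (the "SW" named on the hardware slide). The following is a minimal Python sketch of that scoring, not the accelerated implementation run on the FPGA boards: the scoring values (match +2, mismatch -1, linear gap -1) and the helper name `rank_database` are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    best = 0
    for ca in a:
        curr = [0]                     # first column is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap         # gap in b
            left = curr[j - 1] + gap   # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every entry, highest score first.

    `database` is a hypothetical {name: sequence} mapping.
    """
    scored = [(smith_waterman_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)
```

Each comparison costs time proportional to the product of the two sequence lengths, and every query must be scored against every database entry, which is why this workload is a natural fit for dedicated hardware.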
Smart is better than More
Virtual Combinatorial Database
- In combinatorial chemistry design, one scaffold and 4 groups of 800 reagents each generate a library of 320 billion virtual compounds
- Enumerated substructure search would take years of CPU time and 1 petabyte of storage
- A proprietary non-enumerated search retrieves hits in just a few seconds
Virtual Screening
- Fast pre-filtering reduces the number of compounds passed to time-consuming docking studies
- Useful for new compound acquisition and for targets with a known protein structure, not for primary screen (replating)
- Usual size of virtual screens in Serono: ~1000 compounds
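The proprietary non-enumerated search is not described in the slides. The sketch below only illustrates, under a strong simplifying assumption, why avoiding enumeration pays off: if a query can be split into independent requirements per reagent position, each group can be filtered on its own and the hit count obtained as a product. The feature tags, reagent names and `make_group` helper are hypothetical, and real substructure queries that span the scaffold and several groups would not decompose this cleanly.

```python
from itertools import islice, product
from math import prod


def library_size(groups):
    """Number of virtual compounds: product of the group sizes."""
    return prod(len(g) for g in groups)


def matching_reagents(groups, required):
    """Filter each group independently: one check per reagent, not per compound."""
    return [[name for name, feats in g.items() if req <= feats]
            for g, req in zip(groups, required)]


def count_hits(matching):
    """Hits in the full library, computed without enumerating it."""
    return prod(len(m) for m in matching)


def first_hits(matching, n):
    """Lazily materialise only the first n hit combinations."""
    return list(islice(product(*matching), n))


def make_group(position, size=800):
    """Synthetic reagents (hypothetical tags): every 4th aromatic, every 10th amine."""
    group = {}
    for j in range(size):
        feats = set()
        if j % 4 == 0:
            feats.add("aromatic")
        if j % 10 == 0:
            feats.add("amine")
        group[f"R{position}_{j}"] = frozenset(feats)
    return group


groups = [make_group(p) for p in range(1, 5)]
query = [{"aromatic"}, set(), {"amine"}, {"aromatic", "amine"}]
matching = matching_reagents(groups, query)
checks = sum(len(g) for g in groups)   # reagent checks actually performed
```

Here `checks` is 3,200 (the sum of the group sizes), while `library_size(groups)` is the product. With exactly 800 reagents in each of four groups that product is about 410 billion, the same order of magnitude as the 320 billion quoted on the slide, which presumably reflects the actual group sizes.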
Future grid applications
- Large scale in-silico modeling
- Protein-protein interaction
- QM-based, dynamic virtual screening
- Data grids
- Imaging
Past grid evaluations (corporate PC idle cycles)
- High deployment costs (IT resources)
- Concern about availability of PC resources (habits and procedures)
- Foreseen replacement of desktops by even less available laptops
- Modification of software to run effectively on the grid
- Previous studies show that a large corporate grid of 1000 desktops is not more efficient than a 64 proc dedicated cluster (Novartis)
The in-house idle-cycle grid model is not efficient
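The comparison of 1000 desktops with a 64 proc cluster can be read as a break-even figure. The sketch below is a back-of-the-envelope calculation only: it assumes a desktop processor is as fast as a cluster processor (`relative_speed=1.0`), and the availability and parallel-efficiency values in the usage note are hypothetical, not taken from the Novartis study.

```python
def breakeven_fraction(dedicated_procs, desktops, relative_speed=1.0):
    """Fraction of a dedicated processor each desktop must deliver
    for the grid to match the cluster."""
    return dedicated_procs / (desktops * relative_speed)


def effective_processors(desktops, availability, parallel_efficiency,
                         relative_speed=1.0):
    """Dedicated-processor equivalents delivered by an idle-cycle grid."""
    return desktops * availability * parallel_efficiency * relative_speed
```

`breakeven_fraction(64, 1000)` gives 0.064: each desktop has to contribute at least 6.4% of a dedicated processor, net of all overheads. With, say, 20% availability and 30% parallel efficiency (illustrative values), `effective_processors(1000, 0.2, 0.3)` comes to about 60 processor equivalents, just under the 64 proc cluster.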
Issues in the pharma industry
- IP considerations
- Competitive intelligence
- Security policies
- Obsession with proprietary data and know-how
Is the current « all in-house » model sustainable?
Distributed (grid-enabled) public domain bioinformatics services will become pervasive anyway and will supersede the capabilities available in-house