From Bio-Informatics towards e-BioScience
L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group, Department of Computer Science, Universiteit van Amsterdam
Background information: experimental sciences. There is a tendency to look ever deeper into: matter (e.g. physics); the universe (e.g. astronomy); life (e.g. life sciences). The instrumental consequences are increases in detector resolution & sensitivity and in automation & robotization. Experiments therefore change in nature and become increasingly complex.
Impact in the life sciences Impact of high throughput methods e.g. Omics experimentation genome ===> genomics
New technologies in Life Sciences research. [Figure: cell components and the corresponding methodology/technology: DNA (Genomics), RNA (Transcriptomics), protein (Proteomics), metabolites (Metabolomics)]
Omics impact
Impact in the life sciences: impact of high-throughput methods, e.g. omics experimentation (genome ===> genomics). Instrumentation used in omics experimentation: transcriptomics via, among others, micro-arrays; proteomics via, among others, Mass Spectroscopy (MS); metabolomics via, among others, MS & Nuclear Magnetic Resonance (NMR).
Results in a paradigm shift in the life sciences. Past experiments were hypothesis driven: evaluate a hypothesis; complement existing knowledge. Present experiments are data driven: discover knowledge from large amounts of data.
Life sciences research: from gene to function. [Figure labels: cell nucleus; gene (DNA); gene expression by RNA synthesis; mRNA (poly-A tail); mRNA translation by protein synthesis; protein (NH2 to COOH); function-1, function-2, ..., function-n; whole-genome sequence projects; genome-wide micro-array analysis; “high-throughput” protein analysis.] Protein function: prediction by bioinformatics; proof by laboratory research.
Developments towards Bio-informatics & e-Science: Experiments become increasingly complex, driven by the increase in detector developments. This results in an increase in the amount and complexity of data. Something has to be done to harness this development: bio-informatics, to translate data into useful biological, medical, pharmaceutical & agricultural knowledge.
The what of Bioinformatics Bioinformatics is redefining rules and scientific approaches, resulting in the ‘new biology’. Within this new paradigm the traditional scientific boundaries are blurred, leaving no clear line between ‘dry or computational’ and ‘wet-based’ approaches
Role of bioinformatics. [Figure: cell components and methodology: DNA (Genomics), RNA (Transcriptomics), protein (Proteomics), metabolites (Metabolomics), leading to Integrative/System Biology. Bioinformatics spans: data generation/validation; data integration/fusion; data usage/user interfacing.]
Two sides of Bioinformatics. The scientific responsibility: to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledge. The technological responsibility: to manage and integrate huge amounts of heterogeneous data sources from high-throughput experimentation. Hence the need for e-Science support.
Developments towards Bio-informatics & e-Science: Experiments become increasingly complex, driven by the increase in detector developments. This results in an increase in the amount and complexity of data. Something has to be done to harness this development: bio-informatics, to translate data into useful biological, medical, pharmaceutical & agricultural knowledge; and virtualization of experimental resources, enabling sharing & leading to e-BioScience.
[Diagram labels: life science/genomics research consortia and industry; life science application areas; e-Bioscience and life science innovation domain; Bioinformatics; e-Bioscience & research infrastructure; e-Science & research infrastructure; generic e-Science ICT development and support; Grid infrastructure; network infrastructure and computing capacity]
Why e-BioScience: There is an increasing necessity to use results from other scientists, e.g. to share data & information.
Re-use and sharing of biological data (1). The information content of omics data is extremely high; however, the data are subject to noise and to biological and technical variation. How to induce biological principles from these genome-wide data sets? Approach: develop methodology for “reverse engineering” of biological mechanisms. This is the biggest challenge in bioinformatics today, and it creates a need for external data sources for in-silico experimentation. Two practices for re-use and sharing of data: (1) collectively compile huge amounts of relevant data and make these available to the community, e.g. bio-banking and compendia (such as NIH’s Affymetrix SNP repository); (2) re-use information from different and diverse experiments to discover phenomena.
Re-use and sharing of biological data (2). Compendium example: re-use and sharing of Huntington data. Datasets: 404 Affymetrix Gene chips of measurements on extremely rare human brain samples (Hodges et al., Hum. Mol. Genetics, 2006), available from the NCBI GEO database (MIAME). Goal: find genes involved in Huntington’s Disease. Approach: reanalyze the gene expression data; combine genotype data and clinical data (e.g. using SigWin); extend the experiments with own ChIP-on-chip data.
Resource Identification software: a repository of relevant meta-information from: data warehouses, e.g. GEO, ArrayExpress, Protein Interaction database; literature (mining of PubMed using Collexis); information resources specialized on diseases, genes, proteins, e.g. OMIM, GenBank, Ensembl.
Why e-BioScience: There is an increasing necessity to use results from other scientists, e.g. to share data & information: data repositories; cohort studies in bio-banking; biodiversity. And to share expensive and complex equipment: Mass Spectroscopy; MRI; other.
Problems for the realization of e-BioScience. The life science field is still in an early stage of development, and: first principles are not understood at all; as a consequence, experimental methods are not well established and will not be for some time to come; because of the new forms of omics instrumentation there is a need for design-of-experimentation methods; there is a lack of correct logging of the conditions under which experiments are done; large amounts of data are produced that require, among others, statistical techniques for interpretation. As a consequence, results can be interpreted in multiple ways.
Problems for the realization of e-BioScience. Problems for bioinformatics & e-Bioscience: rationalisation at this early stage is almost impossible; pre-standardization & standardization are almost non-existent; where there are standards, they are inadequate because they can be interpreted in multiple ways (like MIAME for micro-arrays); in addition, there are commercial end-user products that are difficult to integrate; users lack the training necessary to handle these complex experimental situations. The only possible solution is to create a flexible experimentation environment for the end-users.
Role of ICT in e-BioScience. e-Science is a new form of science methodology complementing theoretical and experimental sciences. It uses generic methods and an ICT infrastructure to support this methodology: Web services as a paradigm for using and accessing information; the Grid as a method of accessing & sharing computing resources by virtualization. What is missing in e-BioScience: a connection between the biological problem & e-Bioscience; user-oriented tools that can be re-used and extended; a general model of ICT-based integration; semantic support, i.e. ontologies and semantic support for workflows to make user knowledge explicit.
Consequences for bio-informatics & e-BioScience: A considerable amount of experimentation is necessary before a well-established methodology will emerge. The VL-e approach might be a good model, and it produces an environment in which the necessary experimentation can be realized.
Enhancing the scientific process: e-BioLab
Basic concept of e-BioLab: an actual laboratory in which problem domain experts (biologists, medical doctors) and scientists from enabling disciplines work jointly and in a creative manner on the analysis and design of -omics experiments. A tangible space in which biologists, aided by e-scientists, will have the full potential of VL-e at their disposal.
Motivation: interacting with the problem domain requires an environment in which the domain can be opened up, and in which ideas, hunches and notions on the data and crude models of the biology can be visualized.
Problem domain experts can focus on the biology because they are shielded from technical details by e-scientists. Viewpoints on the research question and the data can be expressed and visualized semi-instantaneously. Ideas and analyses can be retained and documented. Facilities for remote collaboration are present*.
[Figure labels: Biologists; e-BioScientist; e-BioOperator; readily accessible data + models; data mining; small integration experiments + integration methods; easy visualization; vague results; basic model of problem area]
* Rauwerda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
Enhancing the scientific process: e-BioLab (2)
Realization: a large high-resolution display (26.2 Mpixel) with a high-bandwidth (10 Gbit/s) connection to a render cluster; full access to the computational facilities and GRID middleware of VL-e; e-whiteboards and tablet PCs to share and store ideas; high-definition video cameras for remote collaboration; a highly adaptable lab configuration.
Research into: Problem Solving Environments for the biology under study; formulation of scientific workflows that allow for sufficient interactivity and guarantee reproducibility; maintaining an electronic lab journal for e-science experimentation.
Methods for: information management of omics data; biological domain interaction / resource identification; modeling of biological information and knowledge; remote scientific co-operation; man-machine interaction.
High resolution displays in e-bioscience. Example: concurrently display, in a discussion with a remote partner: clustering results of microarray experiments; interesting pathways that are predominant in certain clusters; Gene Ontology categories; results from literature mining; Gene Set Enrichment of categories identified in literature mining; notions depicted on the e-whiteboards. [Figure labels: clustering; SOM; gene lists; interesting pathways; GO categories; literature mining; GSEA; remote whiteboard; video remote collaboration]
Virtual Lab for e-Science research philosophy: multidisciplinary research and development of related ICT infrastructure; generic application support. Application cases are drivers for computer & computational science and engineering research. Problem solving is partly generic and partly specific; components are re-used via generic solutions whenever possible.
Generic e-Science services. [Diagram labels: Grid services, which harness multi-domain distributed resources (technology push); generic e-Science services; domain-generic e-BioScience services (microarray pipeline, mass spectroscopy pipeline, pathway visualization, protein annotation); domain-specific tools (application pull)]
Generic e-Science services. [Diagram labels: Grid services, which harness multi-domain distributed resources (technology push); generic e-Science services; domain-generic e-Science services (micro-array transcriptomics pipeline, mass spectroscopy proteomics pipeline); domain-specific tools (application pull)]
Bioinformatics methods in VL-e (1). Example 1, an application specific method modified by e-science into a generic one: SigWin*. Starting point: an application specific method for detecting windows of increased gene expression on chromosomes** (implemented in C and Perl for SAGE technology). Motivation: broad interest from molecular biology in the positional behaviour of any measurement data that can be mapped onto DNA sequences. SigWin e-Science version: a GRID-based modular workflow for detecting windows of significance in any sequence of values; widely applicable, from gene expression to meteorology data; modules reusable for alternative workflows, e.g. protein modification; scalable to very large datasets.
* Inda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
** Versteeg et al., Genome Research, 2003
Bioinformatics methods: SigWin, a significant window detector; a generalisation of the RIDGE method. [Figure panels: human gene expression; temperature in Amsterdam; DNA curvature of the Escherichia coli chromosome]
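The idea of detecting windows of significance in any sequence of values can be illustrated with a small sketch. This is not the SigWin or RIDGE implementation: the sliding-window median, the permutation-based threshold, and the names `sliding_medians` and `significant_windows` are assumptions chosen only to show the principle that a window is flagged when its summary value is unlikely under a shuffled (position-free) version of the same data.

```python
import random
from statistics import median


def sliding_medians(values, window):
    """Median of every contiguous window of the given size."""
    return [median(values[i:i + window])
            for i in range(len(values) - window + 1)]


def significant_windows(values, window, n_perm=200, alpha=0.05, seed=0):
    """Start indices of windows whose median exceeds a permutation threshold.

    The threshold is the (1 - alpha) quantile of the window medians seen
    after shuffling the sequence, i.e. after destroying positional structure.
    This is an illustrative criterion, not the published SigWin statistic.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.extend(sliding_medians(shuffled, window))
    null.sort()
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    observed = sliding_medians(values, window)
    return [i for i, m in enumerate(observed) if m > threshold]


# A flat signal with one elevated stretch at positions 40..48.
signal = [1.0] * 40 + [10.0] * 9 + [1.0] * 40
print(significant_windows(signal, window=5))
```

Because the function only sees a list of numbers, the same sketch applies to expression values along a chromosome or to a temperature series, which is the sense in which the method is generic.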
Bioinformatics methods in VL-e (2). Example 2, an application specific method composed of generic and specific modules in a workflow: OligoRAP*. Purpose: a re-annotation workflow for oligo libraries. Motivation: rapidly evolving knowledge in genome analysis requires frequent re-assessment of the molecules that are used to measure gene expression. OligoRAP uses a set of application generic (BIOMOBY) BLAT and BLAST sequence alignment (web)services, and an application specific (BIOMOBY) annotation analysis service. BIOMOBY is a de-facto standard for bio-informatics webservices. Joint work of the sequence analysis lab and the micro-array lab. Workflow: adjustable filtering criteria make the quality level of oligos explicit; workflow provenance makes re-annotation reproducible.
* P. Neerincx, H. Rauwerda, F. Verster, A. Kommadath, T.M. Breit, J.A.M. Leunissen, Poster ISMB 2006
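How adjustable filtering criteria can make the quality level of an oligo explicit is sketched below. The `Hit` record, its fields, the threshold defaults and the three labels are all hypothetical; they are not taken from OligoRAP or from the BIOMOBY services, and only illustrate the pattern of turning alignment hits plus explicit thresholds into a stated quality label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment hit of an oligo against a target (hypothetical fields)."""
    oligo_id: str
    target: str       # e.g. a transcript identifier
    identity: float   # fraction of matching bases, 0..1
    coverage: float   # fraction of the oligo that is aligned, 0..1


def annotate(hits, min_identity=0.95, min_coverage=0.90):
    """Label each oligo by the number of distinct targets passing the filter."""
    passing = {}
    for h in hits:
        targets = passing.setdefault(h.oligo_id, set())
        if h.identity >= min_identity and h.coverage >= min_coverage:
            targets.add(h.target)
    labels = {}
    for oligo, targets in passing.items():
        if not targets:
            labels[oligo] = "no reliable target"
        elif len(targets) == 1:
            labels[oligo] = "unique"
        else:
            labels[oligo] = "cross-hybridizing"
    return labels


hits = [
    Hit("oligo1", "geneA", 1.00, 1.00),
    Hit("oligo2", "geneB", 0.98, 0.95),
    Hit("oligo2", "geneC", 0.97, 0.92),
    Hit("oligo3", "geneD", 0.80, 0.60),
]
print(annotate(hits))
print(annotate(hits, min_identity=0.975))  # stricter criteria change the labels
```

Recording the thresholds used alongside the labels is what would make such a re-annotation run reproducible.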
Virtual Lab for e-Science research philosophy: multidisciplinary research and development of related ICT infrastructure; generic application support. Application cases are drivers for computer & computational science and engineering research. Problem solving is partly generic and partly specific; components are re-used via generic solutions whenever possible. Rationalization of the experimental process: reproducible & comparable. Two research experimentation environments: proof of concept for application experimentation; rapid prototyping for computer & computational science experimentation.
Medical Diagnosis and Imaging Problem Solving Environment
Objective: to study the design and implementation of a PSE for medical diagnosis and imaging to support and enhance the clinical diagnostic and therapeutic decision process.
Partners: Universiteit van Amsterdam (UvA); Academisch Medisch Centrum (AMC); Vrije Universiteit Medisch Centrum (VUMC); Philips Research; Philips Medical Systems; TU Delft; IBM.
Applications:
1. Eddy current reduction
2. Matched Masked Bone Elimination
3. Functional brain imaging, DWI and fiber tracking
4. MR virtual colonoscopy
5. Parallel MEG data analyses
6. Grid-based data storage, retrieval and sharing
7. Interactive 3D medical visualization
Brain Imaging and Fiber Tractography. Diffusion Weighted Imaging (DWI): restricted Brownian motion results in anisotropy that can be measured; >= 6 measurements are reduced to one tensor per voxel; the eigenvector with the largest eigenvalue gives the principal diffusion direction. Whole-volume fiber tracking can take many hours, depending on the size of the volume and the number of measurements per voxel; it is suitable for parallelization. Visualization techniques.
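The per-voxel step can be sketched as follows, assuming the diffusion tensor has already been fitted as a symmetric 3x3 matrix (the six measurements correspond to its six independent components). The function name `principal_direction` and the use of plain power iteration are choices made for this illustration, not the method used in the project; production code would normally call a linear algebra library.

```python
import math


def principal_direction(tensor, iterations=100):
    """Dominant eigenvalue and unit eigenvector of a symmetric 3x3 tensor.

    Uses power iteration, which converges to the eigenvector of the largest
    eigenvalue for a positive semi-definite tensor, provided the start vector
    is not orthogonal to that eigenvector.
    """
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(tensor[r][c] * v[c] for c in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("zero tensor has no principal direction")
        v = [x / norm for x in w]
    # Rayleigh quotient v^T D v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[r] * sum(tensor[r][c] * v[c] for c in range(3)) for r in range(3)
    )
    return eigenvalue, v


# Illustrative tensor with strong diffusion along x (arbitrary units).
D = [[1.7, 0.0, 0.0],
     [0.0, 0.3, 0.0],
     [0.0, 0.0, 0.2]]
print(principal_direction(D))
```

Since every voxel is processed independently, this step can be distributed over voxels or sub-volumes, which is why the computation parallelizes well.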
Medical Diagnosis and Imaging Problem Solving Environment
VL-e generic services. Provides: scientific visualization techniques; image processing algorithms. Uses: experiment editor; parallel processing techniques.
Application specific services: access to PACS, DICOM; interfaces to medical scanners (MRI); in-house developed algorithms (Eddy Current Reduction, Matched Masked Bone Elimination); patient privacy.
Grid services: storage facilities (SRB); High Performance Computing platforms; High Performance Visualization platforms.
[Diagram labels: Medical Applications; Virtual Laboratory VL-e Environment; Grid Middleware; Surfnet]
Eddy current reduction. Residual eddy currents in DWI cause shear, magnification and translation of the images; 2D matching is used to correct this. The matching is computationally expensive, so it is parallelized through domain decomposition, with computing cycles obtained via the Grid. Integrated PACS solution. [Figure: effects of residual eddy currents on a Philips 3T Intera with DWI. Figure by Erik-Jan Vlieger, AMC.]
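Parallelization through domain decomposition can be sketched by treating every 2D slice as an independent task. The example below is a simplification under stated assumptions: it estimates only an integer translation per slice by brute-force search (the real correction also handles shear and magnification), `best_shift` and `match_volume` are hypothetical names, and a local thread pool stands in for Grid jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def best_shift(reference, distorted, max_shift=3):
    """Integer translation (dy, dx) that best maps the reference onto the
    distorted slice, found by minimizing the sum of squared differences."""
    rows, cols = len(reference), len(reference[0])

    def cost(dy, dx):
        total = 0.0
        for y in range(rows):
            for x in range(cols):
                sy, sx = y + dy, x + dx
                inside = 0 <= sy < rows and 0 <= sx < cols
                moved = distorted[sy][sx] if inside else 0.0
                total += (reference[y][x] - moved) ** 2
        return total

    shifts = [(dy, dx)
              for dy in range(-max_shift, max_shift + 1)
              for dx in range(-max_shift, max_shift + 1)]
    return min(shifts, key=lambda s: cost(*s))


def match_volume(reference_slices, distorted_slices, workers=4):
    """Domain decomposition: each slice pair is matched independently,
    and results are returned in slice order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(best_shift, reference_slices, distorted_slices))


def point_image(y, x, size=8):
    """Test image with a single bright pixel (illustration only)."""
    image = [[0.0] * size for _ in range(size)]
    image[y][x] = 1.0
    return image


reference = [point_image(3, 3), point_image(4, 4)]
distorted = [point_image(4, 5), point_image(4, 3)]
print(match_volume(reference, distorted))
```

For CPU-bound pure Python code a thread pool gives little speed-up; the point of the sketch is the decomposition itself, since the same per-slice task could be submitted to separate processes or Grid nodes without changing `best_shift`.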
Medical Diagnosis and Imaging Problem Solving Environment. [Diagram: VL experiment topology, with labels: data retrieval, acquisition; filtering, analyses, simulation; image processing, data storage; 2D/3D visualization]
The situation in the Netherlands. The Netherlands Bio-Informatics Center (NBIC) was set up as part of the Dutch genomics initiative, the Netherlands Genomics Initiative (NGI). Its aim was to organize bio-informatics in the Netherlands and to generate sufficient critical mass, also in order to support the other genomics initiatives as a technology center. Organizational structure: Board of directors (Dr van Kampen, scientific director; Drs R. Kok, executive director; Prof. Dr. Hertzberger, adjunct scientific director); Supervisory board; International Advisory board; Scientific Committee; Program Steering Group.
Current NBIC activities. Currently NBIC runs three programs, and it took the initiative for and participates in another three joint activities, besides collaborations such as those with SURF (networking) and VL-e (e-Science).
NBIC programs: BioRange, a bio-informatics research program of 25 M$ & 25 M$ matching; BioAssist, a 10 M$ support program; BioWise, a 3 M$ education program.
Participation in: Computational life sciences, a 5 M$ program with, among others, physics, chemistry and computational science; Pilot grid roll-out, a 3 M$ Grid roll-out & support program with the Dutch Foundation for computing (NCF) and others; BIG GRID, a 35 M$ GRID and e-Science program in the Netherlands together with NCF, physics, VL-e and others.
Program activities. BioRange has four program lines: micro-array related bio-informatics; proteomics related bio-informatics; integrated bio-informatics; informatics research for bio-informatics. All program lines comprise a number of collaborative projects with participation of groups all over the Netherlands. BioAssist runs two program lines: establishment of an e-bioscience support environment; establishment of a generic e-science infrastructure. In future there will also be an extension towards biomedical applications, as was illustrated.
The VL-e infrastructure. [Diagram labels. Layers: application specific service; application potential generic service & Virtual Lab. services; Grid & network services. VL-e Proof of Concept Environment: Telescience, Medical Application, Bio Informatics Applications; Virtual Laboratory; Grid Middleware; Surfnet. VL-e Experimental Environment: Virtual Lab. rapid prototyping (interactive simulation); additional Grid services (OGSA services); network service (lambda networking). VL-e Certification Environment: Test & Cert. VL-software; Test & Cert. Grid Middleware; Test & Cert. Compatibility.]
[Diagram labels. VL-e Proof of Concept Environment: Telescience, Medical Application, Bio Applications; Virtual Laboratory; Grid Middleware; Surfnet. VL-e Experimental Environment: rapid prototyping (interactive simulation); additional Grid services (OGSA services); network service (lambda networking). Between the environments: e-Science roll-out; application feedback; stable application & VL-e component; unstable application & VL-e component. Funding labels: Big Grid; xxxx; BioAssist; total 25 M$ support + 25 M$ matching; total 35 M$ support.]
Conclusions. Omics experiments change the face of the life sciences. Bioinformatics can be considered an essential enabler and is a form of e-Science; it will help to realize the necessary paradigm shift in life science experimentation. Better support of experimentation & optimal use of the ICT infrastructure require rationalization of the experimentation process. Information management is an essential technology. Bioinformatics cannot be decoupled from e-Bioscience applications. e-Bioscience also has to comprise biomedical applications.