Providing an environment where every data-driven researcher will thrive Professor Carole Goble University of Manchester, UK

Pipelines –Scientific workflows over (web) services –Data pipelines, model population and validation, simulation sweeps –Distributed, federated datasets and analyses combined with local datasets and analysis –Opening up resources. e-Laboratories –Crowd-sourcing, group curating and sharing/reusing scientific assets. –Web 2.0 and Semantic Web. –Social networking, community content, collaborative filtering –Sharing and exchanging “Research Objects” –Opening up capabilities and capacity.

Pan European collaboration. Systems Biology of Microorganisms 13 projects, 91 institutes –Different research outcomes –A cross-section of microorganisms, incl. bacteria, archaea and yeast. Record and describe the dynamic molecular processes occurring in microorganisms by computerized mathematical models. –Modellers meet experimentalists Pool research capacities, data, models and know-how. Retrospectively. BaCell-SysMO COSMIC SUMO KOSMOBAC SysMO-LAB PSYSMO Valla MOSES TRANSLUCENT STREAM SulfoSYS + two more

Data-driven Multiple ‘omics –genomics, transcriptomics –proteomics, metabolomics Images, Reaction Kinetics Models Data sets + experiments + models –SBML, Agent-based, Mechanics based Analysis of data

Systems biology workflows in MCISB

High throughput experimental methods Public data sets (e.g. EBI) Web Services ~ 1400 NAR January Issue Little databases Lab books Spreadsheets Private and Shared. Proliferation Derived data Long tail. Little Data

My Datasets My Analytics Big Data Group Science Data services “Little” Data “Local” Science Publish Access

Massive decentralisation – wikis, sticks, spreadsheets Massive centralisation – commons, clouds, curated core facilities Tremendous fragility Digital Dust in Data Tombs

Picking Pain Points. Keeping it Real. Project Directors –Data remains with us under our control. –We control who sees what. –Just enough exchange. SysMO PALs –Spreadsheets. –Yellow Pages. –Standard Operating Procedures.

An education Modellers vs Experimentalists Computational thinking Systems thinking

Gray's Laws (modified) Working Now, Working to working –Gateways and ramps –Jam today, jam tomorrow –Just enough, just in time –Work with what you've got already 20 questions –Is there any group generating kinetic data? –Is this data available? –Who is working with which organism? –What methods are being used to determine enzyme activity? –Under which experimental conditions are my partners measuring glucose concentration?

Help people search for and find stuff Data Services Processes Models Software Experts

SysMO SEEK Assets Catalogue. Archive. Social Network. Sharing Space. Gateway. Yellow Pages –People. Expertise. Projects. Institutions. Facilities. Studies. Data –Experimental data sets and analysed results. –Gateway to data stores – SABIO-RK, ‘omics Models –Store. Stimulate. Publish. Curate. –Gateway to COPASI, JWS Online, BioModels. Processes –Laboratory protocols – Standard Operating Procedures –Bioinformatics analyses – computational workflows - Taverna –Model population and validation – workflows – Taverna –Gateway to myExperiment, MolMeth, OpenWetWare…. Interlinking ASSETS CATALOGUE

Linking data to process Standard Operating Procedures Models Software Provenance The Lab Book Retrospective method reconstruction The myth of reproducible science

Scientists willing to share methods and protocols. SOPs an early win. Defined standard metadata model based on Nature Protocols. Seeded.

Linking data with stuff Research Objects for packaging and exchanging Assets –Workflows linked to models linked to data linked to SOPs –Encapsulate community standards –Mixed resources: External and central. –Trust –“Preservation Packet” –Bechhofer et al 2010 forthcoming in The Future of The Web for Collaborative Science SBRML –Systems Biology Results Markup Language –To tie to the SBML
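The packaging idea on this slide (workflows linked to models linked to data linked to SOPs, with a mix of central and external resources) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the `Asset` and `ResearchObject` classes, their fields, and the `chain_from` helper are hypothetical and are not taken from the Research Object work cited above.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    kind: str       # hypothetical labels: "workflow", "model", "data", "sop"
    location: str   # pointer to a central or an external store
    external: bool = False


@dataclass
class ResearchObject:
    title: str
    assets: dict = field(default_factory=dict)   # name -> Asset
    links: list = field(default_factory=list)    # (source name, target name)

    def add(self, asset):
        self.assets[asset.name] = asset

    def link(self, source, target):
        # Refuse links to assets that are not part of the package.
        if source not in self.assets or target not in self.assets:
            raise KeyError("both ends of a link must be in the package")
        self.links.append((source, target))

    def chain_from(self, start):
        """Follow links from one asset, e.g. workflow -> model -> data -> SOP."""
        chain, seen, current = [start], {start}, start
        while True:
            nxt = next(
                (t for s, t in self.links if s == current and t not in seen),
                None,
            )
            if nxt is None:
                return chain
            chain.append(nxt)
            seen.add(nxt)
            current = nxt
```

Keeping only names, kinds and locations in the package is one way to let external and centrally held resources sit side by side without copying either.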

At the coal-face The Spreadsheet. The Content Management Systems. Legacy assets are assets. Metadata ramps.

The Content Management System Lightweight and flexible. Low take-on, hidden operations costs. Knowledgeable Civilians. Looks nice. Anarchy amenable.

Spreadsheets Template distribution Template mapping SysMOLab

Everyone wants metadata. No one wants to collect it. Standards mayhem Metadata millstones Most data is thrown away. Metadata for my sake Metadata compliance by stealth Preparation for publishing

CIMR: Core Information for Metabolomics Reporting
MIABE: Minimal Information About a Bioactive Entity
MIACA: Minimal Information About a Cellular Assay
MIAME: Minimum Information About a Microarray Experiment
MIAME/Env: MIAME / Environmental transcriptomic experiment
MIAME/Nutr: MIAME / Nutrigenomics
MIAME/Plant: MIAME / Plant transcriptomics
MIAME/Tox: MIAME / Toxicogenomics
MIAPA: Minimum Information About a Phylogenetic Analysis
MIAPAR: Minimum Information About a Protein Affinity Reagent
MIAPE: Minimum Information About a Proteomics Experiment
MIARE: Minimum Information About a RNAi Experiment
MIASE: Minimum Information About a Simulation Experiment
MIENS: Minimum Information about an ENvironmental Sequence
MIFlowCyt: Minimum Information for a Flow Cytometry Experiment
MIGen: Minimum Information about a Genotyping Experiment
MIGS: Minimum Information about a Genome Sequence
MIMIx: Minimum Information about a Molecular Interaction Experiment
MIMPP: Minimal Information for Mouse Phenotyping Procedures
MINI: Minimum Information about a Neuroscience Investigation
MINIMESS: Minimal Metagenome Sequence Analysis Standard
MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
MIPFE: Minimal Information for Protein Functional Evaluation
MIQAS: Minimal Information for QTLs and Association Studies
MIqPCR: Minimum Information about a quantitative Polymerase Chain Reaction experiment
MIRIAM: Minimal Information Required In the Annotation of biochemical Models
MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
STRENDA: Standards for Reporting Enzymology Data
TBC: Tox Biology Checklist
BioPAX: Biological Pathways Exchange
FuGE: Functional Genomics Experiment
MGED: Microarray Experimental Conditions
MIBBI: Minimum Information for Biological and Biomedical Investigations
Minimum Information Models 63% 47%

Just Enough Results Model Harvest standards e.g. MIAME (MIBBI.org) Analyse consortium schemas and spreadsheets JERMs for each data type – microarray, metabolomics, proteomics.... Map project data sources to JERMs. Distribute JERM spreadsheet templates “I only want to collect and share just enough results”
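The step of mapping project data sources to JERMs can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the field list, the two project mappings, and the `to_jerm` helper are hypothetical, since the slides do not give the actual JERM schema.

```python
# Hypothetical minimal field set for one data type; the real JERM
# fields are not given in the slides.
JERM_FIELDS = ("organism", "measurement", "value", "unit")

# Hypothetical per-project mappings from local spreadsheet column
# headers to the shared field names.
PROJECT_MAPPINGS = {
    "project_a": {"Species": "organism", "Assay": "measurement",
                  "Reading": "value", "Units": "unit"},
    "project_b": {"org": "organism", "what": "measurement",
                  "val": "value", "unit": "unit"},
}


def to_jerm(project, row):
    """Translate one spreadsheet row (header -> cell) into the shared
    field names. Unmapped columns are dropped; missing fields are reported."""
    mapping = PROJECT_MAPPINGS[project]
    record = {mapping[col]: cell for col, cell in row.items() if col in mapping}
    missing = [name for name in JERM_FIELDS if name not in record]
    return record, missing
```

Reporting missing fields rather than rejecting the row fits the "just enough" stance: a project can share what it has and fill gaps later.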

JERM Spreadsheets Templates Controlled vocabulary plug in RDF for ripping, mashing and comparing spreadsheets. A little semantics goes a long way

Reward curation Local curation at the point of capture – ISA-TAB for ‘omics. Centralised curation – SBML, CellML, SBO Automated curation. Which data is worth curating?

Blue-Collar Science. Curator Credit Curator Career Funding. Personal and institutional visibility Scholarly citation metrics Federate workloads Unpopular with the big data providers.

Commons-based Quality Control.

Progressive Curation: “lazy evaluation” metadata. Just enough, just in time. Jam today and jam tomorrow. Gain against pain: Very BAD; Good, but Unlikely; Just right.

Sensitive sharing. Collaborate to compete Good reasons not to. Just enough just in time sharing. Data kept at host. Registered centrally through harvesting. Pre-Publication sharing vs Publication
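The arrangement of keeping data at the host while registering it centrally through harvesting can be sketched in Python as follows. This is an illustration under assumptions, not the SysMO-DB implementation: the `RegistryEntry` fields, the `harvest` and `visible_to` helpers, and the "project"/"public" visibility labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    title: str
    host: str        # where the data actually lives
    location: str    # pointer back to the host's copy
    visibility: str  # hypothetical labels: "project" or "public"


def harvest(host, listings):
    """Build central registry entries from a host's metadata listing.
    Only descriptive fields and a pointer are copied, never the payload."""
    return [
        RegistryEntry(
            item["title"],
            host,
            item["location"],
            item.get("visibility", "project"),  # default to restricted
        )
        for item in listings
    ]


def visible_to(entries, public_only):
    """Return the entries a requester may see."""
    return [e for e in entries if not public_only or e.visibility == "public"]
```

Defaulting to the restricted label reflects the project directors' requirement that data stays under the owner's control until they choose otherwise.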

Competitive advantage. Academic vanity. Adoption. Reputation. Scrutiny. Being scooped. Misinterpretation. Reputation. Legal issues. Rewards Risks Nature 461, 145 (10 September 2009) | doi: /461145a

Access Permissions Just Enough Sharing Reusing myExperiment

Reward sharing and reusing not reinventing. Technically. Culturally. Institutionally. Credit and Risk Mitigation.

Attribution. Trust. Credit Reward and Provenance Reusing myExperiment

Some pretty key things Data citation Stable and shared ids and names –A nightmare. –Sharednames.org –Biosharing.org Versioning and Provenance –Models, software, data sets –Ensembl web service doesn’t report version number.

Data commons, Data havens For data after the project has ended. For the common good or me. Tidy and untidy data. Beth’s Provenance Objects Bio2RDF

Access and availability of data and data analysis resources Web services underpin the ESFRI ELIXIR programme. Interfaces that are understandable and stable. Designed for people too. No access, no tools, no point (Keith Haines) Deposition to community databanks that minimise pain.

What is it? Is it working?

Data analysis, model population and data pipelining ramps. Crossing the adoption chasm There is a world of complexity for data preparation, processing and analysis Science Informatics Sweatshops. E-Laboratories. Workflows. Portals. Pre-cooked processes and process templates. Pre-cooked interfaces. Training.

Lymphoma Prediction Workflow. Microarray from tumor tissue. Microarray preprocessing. Lymphoma prediction. caArray. GenePattern. Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample. Wei Tan, Univ. Chicago. Ack. Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT).

myExperiment Communities Supermarket shoppers Tool builders Trainers and Trainees

Drop and Compute Local folder synchronised and shared via cloud Condor job submitted by drag and drop Results appear in Dropbox Ian Cottam

Bashing against local IT NO – you can’t access that datastore / run your analysis. Joined up thinking.

Data + Publications Data trapped in documents Supplemental information Text mining Text mining workflows Text mining to find method and controls

Reflect. Elsevier Challenge Winner 2009

Manual and Auto-mark up [Oscar-3]

Do not underestimate the power of Interactive Visualisation and Browsing Pre-cooked complex queries. Navigation. With my data. At the click of a button.

Distributed Annotation Service Upload and overlay my data

SysMO summary Providing an environment where every data-driven researcher will thrive Reality is messy. –Extreme Technology Determinism vs Voluntarist Sociocultural shaping Extreme and continuous partnership with users. –Act Local Think Global Agile development environment facilitated stream of features to tackle pain points. –Leverage other e-Laboratories, Maintaining scientists’ buy-in. Socio-Political Axis dominates the Technical Axis. –Collaboration evolutions, Confidence in exchange.

Coordination Sustainability Interoperability Adoption Capacity Data Six Action Plan Areas

Capacity building of our skills base Influence training and capacity building programmes. Promote training for young and mid-career researchers and research technologists. Enable mixed skilled research teams to include research and information technologists. Value and reward highly skilled research and information technologists within HE institutions with a career structure.

Data Silo culture Funding silos Discipline silos

Academic Credit and Risk Mitigation for sharing, curating, and reusing not reinventing

Data and Software are free like puppies are free

University of Stellenbosch, South Africa University of Manchester, UK Jacky Snoep EML Research gGmbH, Germany Isabel Rojas University of Manchester, UK Olga Krebs Wolfgang Müller Sergejs Aleksejevs Carole Goble Stuart Owen Katy Wolstencroft Finn Bacal

Links myGrid Project – SysMO-DB – myExperiment – Taverna – JWS Online – SABIO-RK –