FuGE: A framework for developing standards for functional genomics
Angel Pizarro, University of Pennsylvania
Andrew Jones, University of Manchester
Overview
Challenge of building data standards
Introduction to FuGE
Current status
Formats developed using FuGE
Data Standards for HT Genomics
Major challenges in developing standards:
–Technology still evolving
–Heterogeneous data formats (and data types) from software and instruments
–“Important” info about starting sample is almost unlimited
–Large quantities of metadata needed to validate results
BUT: most of these problems are shared by microarrays, proteomics, metabolomics, etc.
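Experiment Workflow
[Diagram: Material → Treatment → Material → Treatment → Material → Treatment → Material → Data Acquisition → Data → Data Transformation → Data]
Legend: Material and Data = inputs and outputs of Protocols; Treatment, Data Acquisition and Data Transformation = each an instance of some Protocol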
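The workflow on this slide can be sketched as a chain of protocol applications, each consuming and producing Materials or Data. This is a minimal illustration, not the FuGE API: the class attributes, the `lineage` helper and the item names ("material 0", "treatment 1", ...) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Material:
    name: str


@dataclass
class Data:
    name: str


@dataclass
class Protocol:
    name: str


@dataclass
class ProtocolApplication:
    """One run of a Protocol, linking its inputs to its outputs."""
    protocol: Protocol
    inputs: List[Union[Material, Data]]
    outputs: List[Union[Material, Data]]


def lineage(item_name: str, applications: List[ProtocolApplication]) -> List[str]:
    """Walk backwards from an output to the starting material (hypothetical helper)."""
    produced_by = {out.name: app for app in applications for out in app.outputs}
    path = [item_name]
    while path[-1] in produced_by:
        app = produced_by[path[-1]]
        path.append(app.protocol.name)
        path.append(app.inputs[0].name)  # sketch: follow the first input only
    return list(reversed(path))


# Illustrative chain mirroring the slide: three treatments, one acquisition,
# one transformation. All names are placeholders.
m = [Material(f"material {i}") for i in range(4)]
raw, processed = Data("raw data"), Data("processed data")
steps = [
    ProtocolApplication(Protocol("treatment 1"), [m[0]], [m[1]]),
    ProtocolApplication(Protocol("treatment 2"), [m[1]], [m[2]]),
    ProtocolApplication(Protocol("treatment 3"), [m[2]], [m[3]]),
    ProtocolApplication(Protocol("data acquisition"), [m[3]], [raw]),
    ProtocolApplication(Protocol("data transformation"), [raw], [processed]),
]

history = lineage("processed data", steps)
print(" -> ".join(history))
```

Because every output records which protocol application produced it, any data set can be traced back to the starting sample.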
Functional Genomics Experiment (FuGE) Object Model
Merges of the MAGE and PEDRo models were attempted
–Result was an even more complex model that still left other FG technologies untouched
–Main motivation was to reuse MAGE sample prep and ontology components
FuGE was created as a project independent of MGED and PSI
Model of common components across FG to enable synergy between standards
–Sample description, protocols, investigation structure
Architecture Details
FuGE mainly represented as a UML model
–UML 1.4 using MagicDraw 9.5
Uses AndroMDA to produce platform-specific models
–XML Schema
–Language bindings and APIs: Java, Perl, C, etc.
–Database schema
FuGE Structure
[Diagram: FuGE packages. Common: Description, Audit, Ontology, Protocol, Reference. Bio: Investigation, Data, Material, Conceptual Molecule]
Common:
–General data format management
–Auditing
–Referencing external resources
–Protocols
Bio:
–Investigation structure
–Data
–Materials (organisms, solutions, compounds)
–Theoretical molecules, e.g. sequences
FuGE Workflow
FuGE is an Enabler
Serves as a basis for developing new formats
–PSI-GPS and MGED are using FuGE to develop their new data formats
Existing formats can be tied together using FuGE
–mzData does not describe the biosource separation procedure (gels, LC, etc.)
–CPAS from FHCRC does this
Use 1: Extending FuGE
Use 2: Tie Together External Formats
[Diagram labels: Protocol, ProtocolApplication, Material (inputMaterial), ExternalData (outputData), mzData file, file format definition]
Protocol definition says “See ExternalData file for parameters” (rather than storing params in Protocol)
Parser will exist to extract data / parameters from mzData file
Material can be used to describe the sample. This connects the MS data with a separation workflow
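A rough sketch of this second use, assuming simplified stand-ins for the classes named on the slide. The attribute names, the `parameters_for` helper, the file name `run01.mzData.xml` and the stub parser are hypothetical. The stub does not read mzData; it only marks where a real parser would plug in.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Material:
    name: str
    description: str


@dataclass
class ExternalData:
    location: str     # path or URI of the external file
    file_format: str  # name of the file format definition, e.g. "mzData"


@dataclass
class Protocol:
    name: str
    # The protocol stores no parameters itself; it points at the external file.
    parameter_note: str = "See ExternalData file for parameters"


@dataclass
class ProtocolApplication:
    protocol: Protocol
    input_material: Material
    output_data: ExternalData


Parser = Callable[[str], Dict[str, str]]


def parameters_for(app: ProtocolApplication, parsers: Dict[str, Parser]) -> Dict[str, str]:
    """Pick a parser by file format and let it extract the parameters."""
    parser = parsers[app.output_data.file_format]
    return parser(app.output_data.location)


def stub_mzdata_parser(location: str) -> Dict[str, str]:
    # Placeholder only: a real parser would read the mzData XML at `location`.
    return {"source_file": location}


# The Material describes the sample, linking the MS data to a separation workflow.
fraction = Material("LC fraction 7", "output of a separation workflow")
run = ProtocolApplication(
    Protocol("mass spectrometry"),
    fraction,
    ExternalData("run01.mzData.xml", "mzData"),
)
params = parameters_for(run, {"mzData": stub_mzdata_parser})
print(run.protocol.parameter_note, params)
```

Keeping the parameters in the external file avoids duplicating them in the Protocol, at the cost of needing one parser per supported file format.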
Status of FuGE
Milestone 1 release – Sep 2005
Milestone 2 release – Dec 2005
–Acceptance by PSI and MGED at this time
Milestone 3 – Spring 2006
–Milestone 2 of GelML and spML
Version 1.0 – Fall 2006
FuGE Extensions
MAGE V2
–Format for microarray data and annotations
GelML
–Format for methods + results of 2D gels
–Milestone 1 Dec 2005
–Release scheduled for Spring/Summer 2006
spML
–Sample processing: liquid chromatography, capillary electrophoresis, centrifugation
–Milestone 1 Dec 2005
CPAS uses a FuGE-inspired manifest for experiments
Metabolomics community considering
PRIDE contemplating FuGE for data format
Flow Cytometry community interested
MIACA?
Summary
FuGE should help convergence of omics data formats:
–Single description of the sample for all types of experiment
–Shared representation of protocols
–Investigation and workflow structure for integrating different omics projects
–Good starting point, proven development methodology
Acknowledgements
Other FuGE developers
–Andrew Jones (Manchester)
–Michael Miller (Rosetta), Paul Spellman (Lawrence Berkeley)
–MGED, PSI, Fred Hutch CRC, Genologics, and various
Contact:
While I have your attention…
Space cost
–Ultra expensive ~$19/GB ($380 for 20GB)
–Cheap (TeraStation NAS) ~$0.80/GB ($16 for 20GB)
–Ultra cheap ($500 PC) ~$0.50/GB ($10 for 20GB)
MIAPE confounding factors
–Will never have a complete list
–We are implicitly telling investigators that they don’t know how to do good science (a Bad Thing)
–Instead, require quality assessment statistics on the data (variance, reproducibility, etc.)
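As one possible reading of "quality assessment statistics", the sketch below summarises a set of replicate measurements by mean, sample variance and coefficient of variation. The slide does not say which statistics to use: treating the coefficient of variation as a reproducibility measure is an assumption, and the replicate values are invented for illustration.

```python
import statistics
from typing import Dict, List


def quality_summary(replicates: List[float]) -> Dict[str, float]:
    """Summarise replicate measurements of the same quantity."""
    mean = statistics.mean(replicates)
    variance = statistics.variance(replicates)  # sample variance (n - 1 denominator)
    cv = statistics.stdev(replicates) / mean    # coefficient of variation
    return {"mean": mean, "variance": variance, "cv": cv}


# Made-up replicate intensities for one feature; not real data.
replicate_intensities = [10.0, 12.0, 14.0]
summary = quality_summary(replicate_intensities)
print(summary)
```

A lower coefficient of variation across replicates indicates more reproducible measurements, whatever the list of confounding factors was.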