1
The Representation of Scientific Data Frank.Gibson@ncl.ac.uk
2
Overview Recording, archiving and sharing the process and the results of experiments is a challenge. What to store? How to store it? Why?
3
Science is complicated
4
Technology Complex experimental workflow Advances in instrumentation High-throughput methods
6
Analysis is complicated 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
7
Analysis New algorithms and software Data integration From multiple sources Genomics Proteomics Metabolomics Neuroscience Systems biology
8
2D Image analysis
9
Problems “In the standard model, one collects data, publishes a paper or papers and then gradually loses the original dataset.” The New Knowledge Economy and Science and Technology Policy, Geoffrey Bowker, University of California, San Diego
10
Problems Large, complex datasets are commonplace Heterogeneous data formats – vendor specific, lab specific Multitude of analysis methods – proprietary, open source
11
Benefits Knowledge discovery – results Sharing of best practice Evaluation of results Sharing of data Re-use
12
Re-use of neuroscience datasets Data that are shared and can be interpreted can often be used to address multiple questions. Data collected with one question in mind often turn out to be highly valuable for addressing other questions. (1) Hippocampus recordings for mapping place fields were the basis for high-profile papers addressing questions concerning temporal organization of neural codes (PMID: 12891358). (2) Paired recordings using extracellular and intracellular electrodes, originally collected for detecting dendritically generated action potentials, provide ground truth for testing and comparing spike-sorting techniques (PMID: 10899214).
13
CARMEN Code, Analysis, Repository and Modelling for e-Neuroscience www.carmen.org.uk Engineering and Physical Sciences Research Council
14
Virtual Laboratory for Neurophysiology Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated
15
Cost Infrastructure Acquisition – data and metadata Developing a common representation Potential benefits are not always experienced by data producers Lab experimenter vs bioinformatician
16
Data pyramid Raw data Derived data Results Processing
17
Mass Spectrometry Data pyramid Raw data Derived data Results Processing
18
How do we store the data? Dictated by form of access Raw data, typically vendor specific formats for vendor specific software analysis Derived data – unlimited formats – higher level of access required to determine results Results – often queries over derived data Problematic if derived data are represented in inconsistent structures – consistent representation is valuable
19
Metadata Description of results Sample How it was generated Equipment Processing steps Expensive to capture Important to validate result Lab-book
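The metadata items listed on this slide could be held in a small structured record rather than only in a lab-book. The following is a minimal sketch in Python; the class names, field names and example values are all hypothetical and are not taken from any standard mentioned in this presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    """One processing step applied to the data (hypothetical structure)."""
    name: str
    software: str


@dataclass
class ResultMetadata:
    """Metadata needed to interpret and validate one result."""
    sample: str
    generation_method: str  # how the data was generated
    equipment: str
    processing_steps: List[ProcessingStep] = field(default_factory=list)

    def summary(self) -> str:
        """Render a one-line, lab-book style description of the record."""
        steps = " -> ".join(step.name for step in self.processing_steps)
        return f"{self.sample} | {self.generation_method} | {self.equipment} | {steps}"


# Hypothetical example values, for illustration only.
record = ResultMetadata(
    sample="sample A",
    generation_method="2D gel electrophoresis",
    equipment="gel scanner X",
    processing_steps=[
        ProcessingStep("spot detection", "image tool Y"),
        ProcessingStep("spot matching", "image tool Y"),
    ],
)
print(record.summary())
```

Keeping the processing steps as an ordered list preserves the path from raw data to result, which is what a reader needs in order to validate that result.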
20
Standards Science is a challenge Scientific data is complex Different data representations add further complexity to complex science We need a common representation of data to get back to just complex science Lots of individuals have created formats in isolation – only works for their data in their lab
21
What is a standard? A document “established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context” BSI - http://www.bsi-global.com/en/Standards-and-Publications/About-standards/Glossary/
22
Community standards development
23
Knowledge Standards: allow working together for knowledge discovery
24
Standards bodies W3C – World Wide Web Consortium IEEE – Institute of Electrical and Electronics Engineers OMG – Object Management Group
25
Life science communities
Society | Domain | Website
The Genomics Standards Consortium (GSC) | Genomics | http://darwin.nox.ac.uk/gsc/
Microarray and Gene Expression Data Society (MGED) | Genomics | www.mged.org
Proteomics Standards Initiative (PSI) | Proteomics | http://psidev.info
Metabolomics Standards Initiative (MSI) | Metabolomics | www.metabolomicssociety.org
Flow Cytometry experiment Community | Flow Cytometry | www.flowcyt.org
27
Technologies for data standards Important to adopt a technology that provides a clear representation of the domain The model and the model documentation capture a shared understanding of the domain Many technologies exist which support modelling Each focuses on a different use, such as validation, code generation and data transmission
28
Technologies being used Simple text documents or spreadsheets XML – Extensible Markup Language RDF – Resource Description Framework UML – Unified Modeling Language OWL – Web Ontology Language OBO – Open Biomedical Ontologies format
29
Simple documents A list of what is required MIxxx Minimum Information XXX MIAME Minimum Information About a Microarray Experiment MIAPE Minimum Information About a Proteomics Experiment
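A "list of what is required" lends itself to a mechanical check. The sketch below assumes a checklist of four item names; these names are hypothetical and are not the contents of MIAME or MIAPE, each of which defines its own required items.

```python
from typing import Dict, List, Optional

# Hypothetical checklist items; a real MIxxx document defines its own.
REQUIRED_ITEMS = ["sample", "protocol", "equipment", "analysis_software"]


def check_minimum_information(report: Dict[str, Optional[str]]) -> List[str]:
    """Return the checklist items that are absent or empty in the report."""
    missing = []
    for item in REQUIRED_ITEMS:
        value = report.get(item)
        if value is None or not value.strip():
            missing.append(item)
    return missing


# Hypothetical experiment report with one empty and one absent item.
report = {
    "sample": "tissue extract A",
    "protocol": "protocol v1",
    "equipment": "",
}
missing = check_minimum_information(report)
print(missing)
```

A check like this only confirms that each item was reported. It says nothing about whether the reported values are correct.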
30
MIAPE:GE Identifies the minimum information required to report the use of n-dimensional gel electrophoresis in a proteomics experiment
33
XML Widely used for representing biological information Mark up sections with elements Validates against a schema [XML example; element tags lost in extraction, values: Bioinformatics students; Frank Gibson; Representation of scientific data; Students all fell asleep]
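The idea of marking up sections with elements and checking them against an agreed structure can be sketched as follows. The text values come from this slide's example; the element names (lecture, audience, speaker, title, outcome) are assumptions, since the original tags did not survive. The Python standard library does not validate against an XML Schema, so a fixed list of expected child elements stands in for the schema here.

```python
import xml.etree.ElementTree as ET

# Element names are hypothetical; the text values are from the slide.
DOCUMENT = """<lecture>
  <audience>Bioinformatics students</audience>
  <speaker>Frank Gibson</speaker>
  <title>Representation of scientific data</title>
  <outcome>Students all fell asleep</outcome>
</lecture>"""

# Stand-in for a schema: the children a lecture must contain, in order.
EXPECTED_CHILDREN = ["audience", "speaker", "title", "outcome"]


def conforms(xml_text: str) -> bool:
    """Check well-formedness, the root tag and the order of child elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML
    children = [child.tag for child in root]
    return root.tag == "lecture" and children == EXPECTED_CHILDREN


print(conforms(DOCUMENT))
print(conforms("<lecture><speaker>Frank Gibson</speaker></lecture>"))
```

A real schema language such as XML Schema can also constrain data types and repetition, which this simple check does not attempt.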
34
UML An implementation-independent model Allows multiple technology implementations of the same model Such as XML, Java, relational tables
35
The numbers indicate the multiplicity of the relationship, with * meaning “many”; 1..* means one or more. One or more instances of JetEngine can be associated with one or more instances of Aeroplane. A filled diamond indicates containment: an Aeroplane cannot exist without a JetEngine. An arrow shows the direction of the relationship. An open-headed arrow indicates inheritance: Pilot and Passenger are both kinds of Person, inheriting the attributes “name” and “DOB”.
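One possible code implementation of the diagram described above is sketched here in Python. It covers inheritance and the 1..* multiplicity on the engine side only. The `serial` attribute is hypothetical, and this is one of many ways the same model could be realised.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    name: str
    dob: str


@dataclass
class Pilot(Person):
    """A kind of Person; inherits name and dob."""


@dataclass
class Passenger(Person):
    """A kind of Person; inherits name and dob."""


@dataclass
class JetEngine:
    serial: str  # hypothetical attribute, not on the slide


@dataclass
class Aeroplane:
    engines: List[JetEngine]

    def __post_init__(self) -> None:
        # Multiplicity 1..*: an Aeroplane cannot exist without a JetEngine.
        if not self.engines:
            raise ValueError("An Aeroplane needs at least one JetEngine")


pilot = Pilot(name="A. Pilot", dob="1970-01-01")
plane = Aeroplane(engines=[JetEngine("E1"), JetEngine("E2")])
print(isinstance(pilot, Person), len(plane.engines))
```

The same model could equally be mapped to an XML schema or to relational tables, which is the point of keeping the UML independent of any one implementation.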
36
Functional Genomics Experiment (FuGE) Model of common components in science investigations, such as materials, data, protocols, equipment and software. Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.
38
GelML
39
RDF Overcomes limited expressivity of XML Allows the semantic meaning of statements to be captured
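RDF expresses statements as subject, predicate, object triples. The sketch below models triples as plain Python tuples to show the idea without an RDF library. The `ex:` identifiers are hypothetical, and real RDF data would use full URIs.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements: each one is (subject, predicate, object).
triples: Set[Triple] = {
    ("ex:protein1", "rdf:type", "ex:Protein"),
    ("ex:protein1", "ex:encodedBy", "ex:gene1"),
    ("ex:gene1", "rdf:type", "ex:Gene"),
}


def objects(subject: str, predicate: str) -> List[str]:
    """Return every object stated for the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects_of_type(type_name: str) -> List[str]:
    """Return every subject declared to be of the given type."""
    return sorted(s for s, p, o in triples if p == "rdf:type" and o == type_name)


print(objects("ex:protein1", "ex:encodedBy"))
print(subjects_of_type("ex:Gene"))
```

Because each statement names its relationship explicitly, statements from different sources can be pooled into one set and queried together.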
40
Uniprot(beta) in RDF
41
Ontologies for Life science They have emerged for two reasons Consistent annotation of data To add meaning and understanding that can be interpreted computationally Bio-ontologies registered on the OBO Foundry
43
Bio-ontologies OBO format Flat file format, more suited to controlled vocabularies, made popular by GO OWL W3C recommendation, designed for computers not humans
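The flat-file nature of OBO format can be seen in a minimal reader for `[Term]` stanzas. The two terms below are hypothetical and are not taken from sepCV, GO or any real ontology. The sketch handles only simple `tag: value` lines and ignores the rest of the format.

```python
from typing import Dict, List, Optional

# Hypothetical two-term vocabulary written as OBO-style stanzas.
OBO_TEXT = """[Term]
id: EX:0000001
name: separation technique

[Term]
id: EX:0000002
name: gel electrophoresis
is_a: EX:0000001
"""


def parse_terms(text: str) -> List[Dict[str, List[str]]]:
    """Collect the tag-value pairs of each [Term] stanza."""
    terms: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line.startswith("["):
            current = None  # some other stanza type: skip its lines
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


terms = parse_terms(OBO_TEXT)
print(terms[1]["name"], terms[1]["is_a"])
```

Values are stored in lists because a tag such as `is_a` may occur more than once within a stanza.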
44
sepCV In OBO
45
OBI An ontology for all investigations in the life sciences Implemented in OWL Large community involvement sepCV to be integrated within OBI
46
Tools Tools are important Biologists don’t want to look at XML Need data entry tools – a website… Direct export of data and metadata from instruments Equipment vendors and manufacturers need to be involved in the “community” of standards development Tools lag behind development of the standard
47
Symba - data entry and storage
48
The Representation of Scientific Data The Road Map
49
Patience Standards development is slow. It requires: A measure of technical and political consensus An organisational framework Individuals who are willing to contribute time and expertise, both domain experts and knowledge engineers (modellers)
50
The Problem Identify the problem Identify the users that need the problem solved Requirements gathering – what do the users need? See if someone else has already done it! If so, use it and go to the pub
51
Implementation Define the problem – MIxxx Model the problem – UML (FuGE) Generate an implementation (XML) Define semantics - Ontologies
52
Testing and Review
Stage One: Requirements gathering – Extensive interactions with the community – Consideration of several (informal) use cases – Internal generation of first draft of guidelines
Stage Two: Module Testing – Guidelines used to document real experiments – Feedback gathered on coherence and practical usability
Stage Three: Committee review – Build an invited panel of leaders in the particular technique – Send draft for ‘review’ by experts on an individual basis – Final round of discussion by panel on email list
Stage Four: Controlled release – Make the module publicly available – Recommend to organised groups and proactive individuals – Provide mechanisms to gather feedback – Released alongside practical examples of use cases
Stage Five: Enforcement – Offer the module to journals, repositories and funders for review, with a view to their enforcing it (either to get published, or to get money)
Cycle
54
Tool support Can occur in parallel but often after release Abstraction away from the model Simple data entry – often website
55
Standards for Gel electrophoresis
56
Pitfalls Re-invention: don’t re-invent the wheel! If it exists, use it. Over-ambition: make pragmatic compromises; don’t over-complicate it or it will not get used (keep it simple, stupid). Under-investment: of money and time, but most importantly in the people who will use it.
57
What is the point? Facilitate consistent computational analysis Develop one piece of code to do one thing instead of lots of code to do one thing Easier lab management of data Storage and analysis Allow data integration and systems biology Efficient science
58
Take away message MIxxx FuGE OBI They have done the hard work Re-use, extend and contribute
59
Questions?
60
Data mining mine Keep out mine Data is mine, mine mine…. Data store