Standards for Data Sharing Program Oversight Chair: Colin Ingram, Newcastle UK.


1 Standards for Data Sharing Program Oversight Chair: Colin Ingram, Newcastle UK

2 www.incf.org/about/programs/datasharing Mission: to develop metadata and data standards for reproducible research; to develop standards for archiving, storing, sharing, and re-using neuroscience data and databases

3 Electrophysiology Task Force – Topics. Data formats. Metadata: what metadata is required? Collection and management of metadata. Persistent identifiers for data: evaluate existing systems and develop recommendations. Promoting data sharing: data sharing statements in publications; discussions with publishers on best strategies. Interoperability of electrophysiology data sharing sites: CARMEN (carmen.org.uk), CRCNS (crcns.org), G-Node (g-node.org).

4 Data format Goal: Develop a standard format for representing basic electrophysiology data types in HDF5 along with metadata required to interpret the data

5 Basic data types Basic data types included in Neuroshare: Time Series - continuous recording. Time Series Segment - short sections of recorded data, usually encompassing spike waveforms. Neural Event - times of neural events, e.g. spikes, optionally associated with sorted units (neurons). Experimental Event - list of times and values. Used to describe stimuli or other experimental events. Other data types: Images, image stacks - for imaging data Feature vectors - for spike sorting
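As a rough illustration of the four Neuroshare data types listed on slide 5, the sketch below models them as Python dataclasses. The class names, field names, and sample values are assumptions made for illustration; they are not taken from the Neuroshare specification or from any of the proposed formats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimeSeries:
    """Continuous recording sampled at a fixed rate."""
    sampling_rate_hz: float
    samples: List[float]

    def duration_s(self) -> float:
        # Duration follows from the sample count and the sampling rate.
        return len(self.samples) / self.sampling_rate_hz


@dataclass
class TimeSeriesSegment:
    """Short section of recorded data, e.g. a single spike waveform."""
    start_time_s: float
    sampling_rate_hz: float
    samples: List[float]


@dataclass
class NeuralEvent:
    """Times of neural events (spikes), optionally tied to a sorted unit."""
    times_s: List[float]
    unit_id: Optional[int] = None


@dataclass
class ExperimentalEvent:
    """Times and values describing stimuli or other experimental events."""
    times_s: List[float]
    values: List[str]


# Hypothetical example values.
ts = TimeSeries(sampling_rate_hz=1000.0, samples=[0.0] * 2500)
spikes = NeuralEvent(times_s=[0.012, 0.340, 1.105], unit_id=3)
stim = ExperimentalEvent(times_s=[0.0, 1.0], values=["grating_on", "grating_off"])
print(ts.duration_s(), spikes.unit_id, len(stim.values))
```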

6 Why HDF5? General-purpose container for storing numerical array data, commonly used in science. Hierarchical organization: Groups – like directories on a file system; Datasets – arrays (both are called "nodes"). Allows attaching metadata (key-value pairs) to groups and datasets. Requires a more detailed specification to form a standard.

7 Object model & API vs. File format standard Object model & API standard – uses library to access data. Adds abstraction, separating specification of data from how data is stored in the file. Allows using different storage back ends. File format standard – Specification of file format is the standard. Requires using the HDF5 API to access data.
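The distinction on slide 7 can be sketched in a few lines. In the object model and API approach, callers work with a recording object while the storage back end is interchangeable. Everything below (the `Backend`, `MemoryBackend`, `JsonFileBackend`, and `Recording` names, and the use of JSON in place of HDF5) is a hypothetical stand-in chosen so the example runs with the standard library only; it is not the API of any system named in this talk.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, List


class Backend(ABC):
    """Storage back end: knows how bytes are laid out, nothing else."""

    @abstractmethod
    def write(self, name: str, samples: List[float]) -> None: ...

    @abstractmethod
    def read(self, name: str) -> List[float]: ...


class MemoryBackend(Backend):
    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def write(self, name: str, samples: List[float]) -> None:
        self._store[name] = list(samples)

    def read(self, name: str) -> List[float]:
        return list(self._store[name])


class JsonFileBackend(Backend):
    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> Dict[str, List[float]]:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as fh:
            return json.load(fh)

    def write(self, name: str, samples: List[float]) -> None:
        data = self._load()
        data[name] = list(samples)
        with open(self.path, "w") as fh:
            json.dump(data, fh)

    def read(self, name: str) -> List[float]:
        return self._load()[name]


class Recording:
    """Object model: callers see time series, not the storage layout."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend

    def add_time_series(self, channel: int, samples: List[float]) -> None:
        self.backend.write(f"channel_{channel:02d}/timeSeries", samples)

    def time_series(self, channel: int) -> List[float]:
        return self.backend.read(f"channel_{channel:02d}/timeSeries")


# The same calling code works against both back ends.
results = []
with tempfile.TemporaryDirectory() as tmp:
    for backend in (MemoryBackend(), JsonFileBackend(os.path.join(tmp, "rec.json"))):
        rec = Recording(backend)
        rec.add_time_series(5, [0.1, 0.2, 0.3])
        results.append(rec.time_series(5))
print(results[0] == results[1])
```

In a file format standard, by contrast, the layout written by one of these back ends would itself be the specification, and readers would access it directly.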

8 Existing systems that use HDF5. Object model & API standard: NEO – electrophysiology data, for Python. NeXus format – for particle physics. NetCDF4 – mainly used for geographic data, but also used by RIKEN Brain Science Institute to store neuroscience data. Ovation (http://physionconsulting.com). File format standard: klusta-team Kwik format (Ken Harris, electrophysiology). brainliner.jp – writes MATLAB files to HDF5. NeuroHDF. Eglen MEA format (by Stephen Eglen, retina data).

9 Proposed systems epHDF – File format standard for basic electrophysiology data types Pandora – Object model & API, more general by allowing many types of data. Semantic Web – File format standard; uses semantic web technologies for metadata.

10 epHDF Specifies format for storing each of the basic electrophysiology data types JSON used to specify metadata Uses XML namespace mechanism and reserved names to associate data and metadata in file to externally defined schemas

11 epHDF – associating data to schema. Root-level attribute specifies an abbreviation and location of schema. Example: / (HDF5 root node) has attribute schema = ["ephdf http://url_of_schema_location", … ]. Nodes containing data either use a standard name or reference a schema entity. Example: /channel_05/timeSeries – dataset node contains time series values; timeSeries is a standard name. If not using a standard name, the node has the attribute schema = "ephdf:timeSeries". Data in nodes have a standard layout. Example: a timeSeries has a 1-D (N) or 2-D (MxN) array of numeric values. N is the number of samples. If 2-D, M is the number of channels.
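The lookup described on slide 11 can be sketched as follows. This is a minimal illustration using plain dictionaries in place of HDF5 attributes. The schema URL, the function names, and the assumption that standard names belong to the "ephdf" abbreviation are all hypothetical and not part of the proposal.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed set of standard names; the slide only mentions timeSeries.
STANDARD_NAMES = {"timeSeries"}


def parse_schema_attribute(entries: List[str]) -> Dict[str, str]:
    """Map each abbreviation in the root 'schema' attribute to its location."""
    prefixes = {}
    for entry in entries:
        prefix, location = entry.split(None, 1)  # "ephdf <url>"
        prefixes[prefix] = location.strip()
    return prefixes


def resolve_type(node_path: str, node_attrs: Dict[str, str],
                 prefixes: Dict[str, str]) -> Tuple[str, str]:
    """Return (schema_location, entity) for a data node."""
    ref = node_attrs.get("schema")
    if ref is None:
        # No explicit reference: the node name must be a standard name.
        name = node_path.rsplit("/", 1)[-1]
        if name not in STANDARD_NAMES:
            raise ValueError(f"{node_path}: no schema reference or standard name")
        ref = f"ephdf:{name}"  # assumption: standard names live under 'ephdf'
    prefix, entity = ref.split(":", 1)
    return prefixes[prefix], entity


def time_series_layout_ok(shape: Sequence[int]) -> bool:
    """A timeSeries is 1-D (N samples) or 2-D (M channels x N samples)."""
    return len(shape) in (1, 2)


# Hypothetical root attribute; the real schema location is not given in the talk.
root_attrs = {"schema": ["ephdf http://example.org/ephdf-schema"]}
prefixes = parse_schema_attribute(root_attrs["schema"])

by_name = resolve_type("/channel_05/timeSeries", {}, prefixes)
by_attr = resolve_type("/channel_05/raw", {"schema": "ephdf:timeSeries"}, prefixes)
print(by_name == by_attr, time_series_layout_ok((4, 30000)))
```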

12 epHDF – metadata. { "ephdf:basic": { "sampling_rate": { "value":, "unit": "Hz" }, "time_divisor": { "value": 1000, "units": "Hz" }, "units": { "name": "Volts", "value_map": { "minValue":, "scaleFactor": } }, "starting_time": { "dateTime": "2012-12-24T14:05:23", "decimalSeconds": 0.00034 }, "duration": { "value":, "divisor": "sampling_rate" } } } Metadata is stored as a string, in an attribute of a node (group or dataset). It enables interpreting the numeric data in each electrophysiology data type. The value_map specification enables numeric values to be stored in whatever form is most efficient (e.g. 16-bit integer). Other metadata schemas can be added in parallel to "ephdf:basic" to include more information.
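To show how such a metadata string might be used, the sketch below parses a filled-in variant of the JSON and converts stored integers to physical values and sample indices to timestamps. All numeric values are hypothetical, since the slide leaves them blank. The formula physical = minValue + raw * scaleFactor is an assumed interpretation of value_map, and treating decimalSeconds as a fractional offset added to dateTime is likewise an assumption.

```python
import json
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical values filled in for illustration only.
metadata_json = """
{
  "ephdf:basic": {
    "sampling_rate": {"value": 20000, "unit": "Hz"},
    "units": {"name": "Volts",
              "value_map": {"minValue": -0.005, "scaleFactor": 1e-6}},
    "starting_time": {"dateTime": "2012-12-24T14:05:23",
                      "decimalSeconds": 0.00034}
  }
}
"""


def to_physical(raw: List[int], basic: Dict) -> List[float]:
    """Convert stored integers to physical units (assumed linear mapping)."""
    vm = basic["units"]["value_map"]
    return [vm["minValue"] + r * vm["scaleFactor"] for r in raw]


def sample_time(basic: Dict, index: int) -> datetime:
    """Wall-clock time of a sample, assuming decimalSeconds adds to dateTime."""
    start = datetime.fromisoformat(basic["starting_time"]["dateTime"])
    offset = basic["starting_time"]["decimalSeconds"]
    rate = basic["sampling_rate"]["value"]
    return start + timedelta(seconds=offset + index / rate)


# The metadata is stored as a string attribute, so it is parsed on read.
basic = json.loads(metadata_json)["ephdf:basic"]
volts = to_physical([0, 5000, 10000], basic)
print(basic["units"]["name"], volts)
print(sample_time(basic, 20000).isoformat())
```

Storing raw integers plus a value map keeps the array compact while leaving the physical interpretation recoverable from the metadata alone.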

13 Electrophysiology Task Force – File Format Standard. Generically describes multidimensional data (enough information to make a plot). Elements for annotating the data, e.g. to define regions of interest, segments, and events. Elements for the definition of provenance. Can be combined with a model for metadata (uses odML). Pandora – data model for storing neuroscience data and metadata.

14 Electrophysiology Task Force – File Format Standard Pandora - Data model for storing neuroscience data and metadata

15 Semantic Web. A way to represent semantic information on the web. Works using "triples": subject, predicate, object. (Also called RDF, for "Resource Description Framework".) Requires a unique ID (UID) for each subject, object, and predicate. Example (subject, predicate, object): (London, isType, City). Ontologies are used to define different classes and predicates; this allows representing all kinds of information. A way to specify a labeled directed graph. Allows searching for data using a standard protocol (SPARQL).
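The triple model can be illustrated with a small in-memory store and a pattern match in which None acts as a wildcard, loosely analogous to a variable in a SPARQL query. This is a plain-Python sketch, not an RDF library. Only the (London, isType, City) triple comes from the slide; the other two triples and the `match` function are made up for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def match(triples: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that agree with every non-None field of the pattern."""
    for s, p, o in triples:
        if ((subject is None or s == subject)
                and (predicate is None or p == predicate)
                and (obj is None or o == obj)):
            yield (s, p, o)


triples = [
    ("London", "isType", "City"),      # from the slide
    ("Newcastle", "isType", "City"),   # hypothetical
    ("London", "locatedIn", "UK"),     # hypothetical
]

# "Which subjects are of type City?"
cities = [s for s, _, _ in match(triples, predicate="isType", obj="City")]
# "What is known about London?" Each triple is one labeled edge of the graph.
about_london = list(match(triples, subject="London"))
print(cities, about_london)
```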

16 Semantic Web for electrophysiology data Define ontologies for electrophysiology data types and concepts Have conventions for assigning unique IDs to datasets (arrays) stored in HDF5 Represent all metadata using RDF

17 Example – identifying data type and units. HDF5 dataset: channel_1_ts (1-D array of numeric values). RDF specifying metadata: (channel_1_ts, isType, timeSeries); (channel_1_ts, hasUnit, Voltage).

18 Example – relation. HDF5 datasets: channel_1_ts (time series); unit_3_spikes (array of spike times). RDF specifying metadata: (unit_3_spikes, createdBy, sort_45); (sort_45, isType, spikeSortProcess); (sort_45, hasSource, channel_1_ts); (sort_45, hasAlgorithm, klusta_kwik).
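The relation example on slide 18 amounts to a small provenance graph. The sketch below stores its four triples as plain tuples and walks from a derived dataset back to its source signal and algorithm. The identifiers are those on the slide; the helper functions `objects` and `provenance` are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# The four triples from slide 18.
triples: List[Triple] = [
    ("unit_3_spikes", "createdBy", "sort_45"),
    ("sort_45", "isType", "spikeSortProcess"),
    ("sort_45", "hasSource", "channel_1_ts"),
    ("sort_45", "hasAlgorithm", "klusta_kwik"),
]


def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """All objects reachable from subject via predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def provenance(graph: List[Triple], dataset: str) -> List[Dict[str, object]]:
    """Describe each process that created the dataset."""
    records = []
    for process in objects(graph, dataset, "createdBy"):
        records.append({
            "process": process,
            "types": objects(graph, process, "isType"),
            "sources": objects(graph, process, "hasSource"),
            "algorithms": objects(graph, process, "hasAlgorithm"),
        })
    return records


history = provenance(triples, "unit_3_spikes")
print(history)
```

Because the triples live outside the arrays, a second sorting run could be recorded by adding triples that point at the same channel_1_ts, without touching the original dataset.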

19 Semantic Web approach - Potential advantage All metadata is semantically machine processable Allows rich set of concepts to be expressed (Sophia Ananiadou – text mining for semantic analysis). Enables data integration Incorporation of NI-DM provenance model? Allows external annotation of data sets without modification of data set Create new data set via “mashup”. Example: new spike sorting algorithm, have different list of spikes, but reference original time series signal. Supported by W3C community

20 How to evaluate? – use cases. Test different approaches using example data sets. Blackrock Microsystems sample file: has 144 channels, time series, time series segments, and experimental events. CRCNS.org pvc (primary visual cortex) data sets 1, 2, 3: have multielectrode time series segments and neural event data types; each uses a different stimulus. CRCNS.org hc-2 and hc-3 data sets: files used by the Buzsaki Lab for hippocampus data; same format as Ken Harris. Eglen MEA recording format. Hdf5Manager: system developed at the University of West Bohemia; includes extensive metadata. Brainliner.jp.

