The design and implementation of the Neurophysiology Data translation Format (NDF)
Developed by Bojian Liang, Martyn Fletcher, Jim Austin. Advanced Computer Architectures Group, Dept. of Computer Science, University of York, York, YO10 5DD, UK. {bojian, martyn.fletcher,
Presented by Leslie Smith, University of Stirling

Slide 2 Overview
Data problems / issues.
Our solution: the Neurophysiology Data translation Format (NDF).
What is NDF and what does it provide?
Future work.

Slide 3 The CARMEN Project
The CARMEN (Code Analysis, Repository and Modelling for e-Neuroscience) project provides an environment for sharing neurophysiological experimental data and algorithms using Grid technology. It is a consortium effort to create a virtual laboratory for neurophysiology, led by 11 UK universities in collaboration with other academic and commercial partners, for the benefit of the neuroscience community.

Slide 4 The data interchangeability problem
The CARMEN system has to handle a wide range of incoming data types as well as derived data. These data are often unreadable unless you use vendor-specific software or know the encoding format. Data may be used by human users or by services, and in a processing chain the output of one service may be the input of other services. It is impractical to have services that use arbitrary input and output data formats, particularly for workflows. There is therefore a need for data translation that gives resources access to a standard data format, enabling an environment where data can be processed in a consistently interpretable way by both human users and machines.

Slide 5 Remote data issues
Remote data: to avoid unnecessary data downloading / moving and processing:
a. A user needs to know as much as possible about the data before it is downloaded or processed.
b. A service needs to verify that the data is a valid input type before processing it.
c. A workflow editor needs information to pre-verify the type of an input data set taken from a remote data repository, or from the output of another service, when constructing a workflow script.
Questions:
1. How do we interrogate and understand remote data without downloading / accessing the whole binary data set?
2. A file extension is not enough to pre-verify a workflow input / output file, so where does a workflow editor get the information to perform this verification?

Slide 6 Partial data access issues
Sub-dataset selection and partial data extraction / downloading:
a. Neurophysiological experimental data are complex data sets, and most CARMEN services are designed to process only one of the data types within a data set.
b. Raw data contains multiple channels from the acquisition equipment, but only some of these channels may be wanted.
c. The volume of data in a channel may be very large, while only some channels and time intervals are of interest.
d. Processed data and raw data may be mixed in the same data set.
Questions:
1. Can we tell a service exactly which data portions we need to process?
2. Can we download (or use) only the channels (or parts of channels) of interest?

Slide 7 Evolving data type issues
In a research environment, new data types / formats are created whenever new scientific instruments or services / algorithms are introduced. It is difficult or impossible to specify these precisely in advance.
Questions:
1. Can we create services that accept new data types as input?
2. Can we create services that produce new data types as output?
3. Can all this be done in a consistent manner, using the predefined data types?
4. How can a service that uses new data types perform pre-verification in the same way as for the predefined data types?

Slide 8 Can a well designed metadata system solve the problems?
Use of a generic metadata system: most users are specialists and will not appreciate many of the generic metadata specifications. When manually completing a metadata upload form, a user does not know which fields are required for the data set; consequently, the uploaded metadata may be incomplete and unusable. On uploading a data set, the metadata may not be directly available to the user – a special tool for a particular data format may be required. It is impractical to upload metadata manually for a huge number of data files. Automatically uploading metadata is equivalent to having a data standard: it implies that the metadata is already included in the data set and that a data standard must be used. Metadata for temporary data sets, such as the output of a service (which may be the input of other services), is not available from the metadata system. Separating the metadata from a data set also harms the data set's portability.
Our conclusion: the metadata used for the above purposes should be integrated with the data set.

Slide 9 Basic data types
The primary data types are:
TIMESERIES: continuous time series.
NEURALEVENT: events such as spike times.
EVENT: other event data (e.g. stimuli).
SEGMENT: sections of TIMESERIES data.
GMATRIX: generic matrix data, user-defined.
IMAGE: image data.
Since the content is described using XML, additional data types can be added to cope with new developments in electrophysiology.
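
As a concrete, purely illustrative picture of this small type system, the sketch below encodes the primary types as a C enumeration. The identifier names and the idea of a numeric type code are assumptions made for this sketch; they are not taken from the NDF specification itself.

/* Illustrative only: the enum below mirrors the primary NDF data types
 * listed above.  The identifier names and numeric codes are assumptions
 * made for this sketch, not values defined by the NDF specification. */
typedef enum {
    NDF_TIMESERIES,   /* continuous time series             */
    NDF_NEURALEVENT,  /* events such as spike times         */
    NDF_EVENT,        /* other event data, e.g. stimuli     */
    NDF_SEGMENT,      /* sections of TIMESERIES data        */
    NDF_GMATRIX,      /* generic, user-defined matrix data  */
    NDF_IMAGE         /* image data                         */
} ndf_primary_type;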

Slide 10 The NDF data format (1)
NDF wraps metadata and binary data together with a configuration file.
1. A separate NDF configuration file, in XML format, minimizes the work needed to extract metadata from a data set, obviating the need to look inside the associated binary data file. It is only necessary to download the NDF configuration file, and the metadata can then be viewed easily in a web browser.
2. Two semi-defined data types are extendable on an application basis, and conventional vendor data files may also be “wrapped” as an NDF data set. A particular ID field allows these application-specified data to be identified.
3. NDF supports the most commonly used numerical data types, from 8-bit integer to double-precision floating point. This helps reduce data size by allowing the most efficient data type to be used, and reduces the network traffic load when downloading / uploading NDF data sets.
4. The NDF data format permits the download of data “regions of interest” (partial data access) rather than the whole data set, reducing network traffic. Partial access to a zipped MAT file stream is supported.

Slide 11 The NDF data format (2)
5. For a data processing chain, a history or “foot-print” of each previous process can be included in the output data. This information is useful (and may be required) for later processing or reference. In particular, other researchers can easily repeat the work by referring to the data processing history records.
6. NDF supports image data and image sequence data.
7. A separate XML file can be used to store experimental event data, annotations and additional third-party data objects.
8. NDF minimizes the need for re-implementation of research tools currently used by neuroscientists and researchers. A MAT file is used as the main numerical data file format; this is a publicly described data format.
9. NDF supports multiple data files for one data channel. This allows the data size of either a single channel or the full data set to exceed 2 GB on both 32-bit and 64-bit operating systems.
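
Point 1 of the format description above is the key practical benefit: the XML configuration file can be fetched and inspected on its own. The sketch below, using libxml2, shows the general idea of previewing such a header without touching the binary data. The file name and element layout are assumptions for illustration; the real NDF schema is not reproduced in these slides.

/* Minimal sketch: preview an NDF-style XML configuration file without
 * opening the associated binary (MAT) data.  Requires libxml2.  The
 * file name "experiment.ndf.xml" and the element layout are assumptions
 * for illustration, not the published NDF schema. */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    xmlDocPtr doc = xmlReadFile("experiment.ndf.xml", NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "could not parse the configuration file\n");
        return 1;
    }

    /* Walk the immediate children of the root element and print their
     * names: enough to see what kinds of data the set contains. */
    xmlNodePtr root = xmlDocGetRootElement(doc);
    for (xmlNodePtr cur = root ? root->children : NULL; cur; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE)
            printf("section: %s\n", (const char *)cur->name);
    }

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}

Such a program would be built with, for example, gcc preview.c $(xml2-config --cflags --libs).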

Slide 12 The CARMEN Portal: NDF Data Channel & Time Selector

Slide 13 The NDF Data I/O API (1)
The NDF API:
Is implemented as a C library.
Provides a low-level I/O interface for accessing the NDF data set, including the XML format header file, the MAT format host data files and the XML format annotation files.
Translates the XML tree / nodes to C-style data structures.
Insulates clients from the MAT data format (and from image format data).
Provides a standard way to manage data structure memory.
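
The slides do not give the actual function names of the C API, so the fragment below is only a hypothetical sketch of the last two points: header information parsed from the XML tree is exposed as a plain C structure, with a matching create / free pair so that clients manage memory in one standard way. All names and fields are assumptions.

/* Hypothetical sketch (not the real NDF API): expose parsed header
 * information as a plain C structure with a matching create / free
 * pair, so callers never free individual fields themselves. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    char   *label;        /* channel label taken from the XML header */
    double  sample_rate;  /* samples per second                      */
    size_t  n_samples;    /* number of samples in the host data file */
} channel_info;

static channel_info *channel_info_create(const char *label,
                                         double sample_rate,
                                         size_t n_samples)
{
    channel_info *ci = calloc(1, sizeof *ci);
    if (ci == NULL)
        return NULL;
    ci->label = malloc(strlen(label) + 1);
    if (ci->label == NULL) {
        free(ci);
        return NULL;
    }
    strcpy(ci->label, label);
    ci->sample_rate = sample_rate;
    ci->n_samples = n_samples;
    return ci;
}

static void channel_info_free(channel_info *ci)
{
    if (ci == NULL)
        return;
    free(ci->label);
    free(ci);
}

int main(void)
{
    channel_info *ci = channel_info_create("tetrode-01", 25000.0, 1000000);
    if (ci != NULL) {
        /* ... a client would read ci->label, ci->sample_rate here ... */
        channel_info_free(ci);
    }
    return 0;
}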

Slide 14 The NDF Data I/O API (2)
The NDF API also:
Supports a multiple-run data writing mode for large data sets with a known total data length.
Supports a multiple-run data writing mode for data streams with an unknown total data length.
Supports zipped data streams within MAT files.
Supports partial data reading of both compressed and uncompressed data in a MAT file.
Automatically manages data file splitting for large data sets.
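
As a rough illustration of what partial data reading buys, the sketch below seeks straight to a window of samples in an uncompressed host file and reads only that window. The file name, header size and flat array-of-double layout are assumptions for illustration; a real NDF host file is a MAT file with its own internal structure, and the compressed case additionally requires decompressing the zipped stream.

/* Concept sketch of partial data access: read only a window of samples
 * from an uncompressed host file instead of the whole channel.  The
 * file name, fixed header size and flat array-of-double layout are
 * assumptions; real NDF host data lives inside MAT files. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long   header_bytes = 128;      /* assumed fixed header size */
    const long   first_sample = 100000;   /* start of the window       */
    const size_t n_samples    = 2048;     /* length of the window      */

    FILE *f = fopen("channel01.dat", "rb");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    double *window = malloc(n_samples * sizeof *window);
    if (window == NULL) {
        fclose(f);
        return 1;
    }

    /* Skip the header and all samples before the window, then read
     * just the samples of interest. */
    if (fseek(f, header_bytes + first_sample * (long)sizeof(double),
              SEEK_SET) == 0) {
        size_t got = fread(window, sizeof(double), n_samples, f);
        printf("read %zu of %zu samples\n", got, n_samples);
    }

    free(window);
    fclose(f);
    return 0;
}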

Slide 15 The NDF MatLab Toolbox
The NDF MatLab Toolbox has been implemented on top of the NDF C library API. It consists of a set of object-oriented MatLab classes and functions that provide high-level support for NDF data I/O. A “multiple data formats to NDF” converter is embedded in the toolbox as the data input module. The toolbox provides full protection and auto-correction for misused data types in parameter structures. It has been used within CARMEN service code, and is also used as a set of convenient desktop tools for NDF data I/O and data conversion by researchers.

Slide 16 Future work
Expand the specification to improve compatibility with data sets from fields other than neuroscience.
Provide services for partial downloading of remote data sets.
Provide services for data preview of remote data sets.
Extend the data converter to support conversion from additional appropriate formats.
… and enable future-proofing!
Detailed information is available at the CARMEN portal.