Overview and Motivation of the ICAT Software Suite Kerstin Kleese van Dam
Science and Technology Facilities Council: STFC employs more than 2200 staff at 7 locations: Swindon, where the headquarters is based; the Rutherford Appleton Laboratory; the Daresbury Laboratory; the Chilbolton Observatory; the UK Astronomy Technology Centre in Edinburgh; the Isaac Newton Group of Telescopes on La Palma; and the Joint Astronomy Centre in Hawaii.
Research and Science Support at STFC: Deliver world class science. Engender world class science. Communicate world class science. Annually over visiting Scientists from around the world from both Academia and Industry.
Why an Integrated e-Infrastructure is required [Diagram: Scientist, Experiment, Computing, HPC, Analysis, Storage]
What STFC aim to achieve with their e-Infrastructure: Enabling users to get rapid access to their current and past data, related experiments, publications etc., leading to improved analysis through more complete information. Creating a powerful, long-lasting scientific knowledge resource.
Integrated e-Infrastructure [Diagram: Proposal, Proposal System, Experiment, Data Acquisition System, Secure Storage, Data Analysis, Publication, E-Pubs, Metadata Catalogue, Information] All data and metadata capture is automated.
e-Infrastructure – Access to Multiple Facilities (2) [Diagram: Data Portal; SNS - ORNL; ISIS – TS1 + 2; DLS; CLF; CSL - Canada; SRS + ERLP]
How we achieve the integration [Diagram: Scientist, Experiment, Computing, HPC, Analysis, Storage, Metadata]
ICAT Software Suite
The ICAT software suite centrally catalogues all experiment-related information and extracted key results. Wherever possible, information is gathered automatically through integration with existing IT systems such as proposal systems or data acquisition systems. The catalogue and the data it references are accessible via a well-defined API for easy embedding into any application. [Diagram: Distributed Data, Metadata Catalogue, Generic Catalogue Access Interface, Data Access and Analysis Applications]
Underlying Data Infrastructure [Diagram: Online Proposal System; User Office System, incl. User Database, Scheduling, Health and Safety, Proposal Management; Metadata Catalogue; Data Acquisition System; Storage Management System; Data Access Portal; Single Sign On; Account Creation and Management] The ICAT Software Suite provides the crucial integration of key functions.
ICAT and the STFC Proposal Systems: The online proposal system is the entry point to the Data Management System and is a rich source of contextual information about the user's experiment.
ICAT and STFC Data Acquisition: Plug-ins for the data acquisition systems ensure automatic, quality-controlled collection of data and metadata. ICAT can be easily linked to any existing system. ISIS: SECI (C#, .NET) with links to LabView and openGenie. DLS: Generic Data Acquisition (Java, on top of EPICS). CLF: for laser diagnostics (LabView).
ICAT and DLS Storage Management: DLS uses the Storage Resource Broker (SRB) for its storage management; this has been integrated with ICAT for data access and delivery. Main advantages: decoupling of the physical file location from the logical one; strict security; expandable to many storage systems.
ICAT and ISIS Storage Management: ISIS uses its own in-house data storage access system, Data.ISIS. Like SRB, it abstracts from the physical location of the files and delivers the same advantages in terms of security and the decoupling of logical and physical file locations.
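The decoupling of logical and physical file locations described for SRB and Data.ISIS can be illustrated with a minimal Python sketch. It is not the SRB or Data.ISIS interface: the class `StorageCatalogue`, its methods, the logical name and the file URLs are all hypothetical and only show the idea that clients keep one stable logical name while the file itself is moved.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorageCatalogue:
    """Hypothetical lookup table from logical names to physical replicas."""

    replicas: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, logical_name: str, physical_url: str) -> None:
        # Record one more physical copy under the same logical name.
        self.replicas.setdefault(logical_name, []).append(physical_url)

    def move(self, logical_name: str, old_url: str, new_url: str) -> None:
        # Relocating a file changes only the catalogue entry, not the logical name.
        urls = self.replicas[logical_name]
        urls[urls.index(old_url)] = new_url

    def resolve(self, logical_name: str) -> str:
        # Return the first known physical location for a logical name.
        urls = self.replicas.get(logical_name)
        if not urls:
            raise KeyError(f"no replica registered for {logical_name}")
        return urls[0]


# Illustrative values only (assumed names and paths).
catalogue = StorageCatalogue()
logical = "isis/hrpd/HRP00145.RAW"
catalogue.register(logical, "file:///disk01/raw/HRP00145.RAW")
before = catalogue.resolve(logical)
catalogue.move(logical, before, "file:///archive/tape07/HRP00145.RAW")
after = catalogue.resolve(logical)
```

An application that stored only `logical` keeps working after the move, which is the property the two storage slides emphasise.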
ICAT Architecture [Diagram: Online Proposal System; User Office System, incl. User Database, Scheduling, Health and Safety, Proposal Management; Metadata Catalogue; Data Acquisition System; Storage Management System; Data Access Portal; Single Sign On; Account Creation and Management] The ICAT Software Suite provides the crucial integration of key functions.
ICAT 3.3 Aims and Objectives: ICAT API Version 3.3 aims to be the Grid-aware software infrastructure that enables applications to exploit the capabilities of the ICAT catalogue. Data Portal Version 3.3 aims to be the Grid-aware software infrastructure that serves the Data Search and Retrieval (DSR) requirements of STFC. It makes use of the ICAT API 3.3.
Overall Architecture Principles: The ICAT software suite has a modular design with clear functional boundaries for each component. Core functionalities have been grouped together, and customisable presentation layers are separated from the function layer to achieve easy maintenance, easy customisation and insulation from changes to underlying areas. All interactions with the ICAT catalogue now go through the ICAT API.
Core Scientific Metadata Model (CSMD)
Rich Data at STFC Scientific Data of the highest Quality is produced at STFC Facilities and Departments. The continuity and longevity of STFC has led to a unique wealth of Information. How about a system that would give access to all of it independent of where it was produced?
Model Motivation (1): Most scientists think in terms of studies, during which they perform a number of investigations, e.g. experiments, observations, measurements and simulations. Results from these investigations usually run through different stages: raw data, analysed or derived data, and end results. Data should be grouped accordingly. Metadata and software (e.g. STFC DataPortal) should allow the user to search for interesting data. Not all information captured in specific metadata schemas (e.g. CML) would be used to search for this data or to distinguish one data set from another, so the model should make it possible to select particular parameters.
Model Motivation (2): A common general format/standard for the metadata of scientific studies and data holdings did not exist. Proposing a model and an implementation aims to: –Form a specification for the types of metadata that should be captured during scientific studies –Ease citation, collaboration, exploitation and integration –Allow easy integration of distributed heterogeneous metadata systems into a homogeneous (albeit virtual) platform. Therefore the Core Scientific Metadata Model (CSMD) was developed.
General Layout: Why, i.e. what was the need. What is it: description; supports keyword searches and taxonomic approaches; data organisation like a file system, but with support for linking to a database as well. Where is it used: projects and software. What are the users likely to search on? What distinguishes one study/investigation/data set from the next?
Metadata Model Structure: The Core Scientific Metadata Model (CSMD) is a study-data set orientated model holding study information about: –Topic (indexing) –Provenance –Data Holding –Legal Notes: copyright, patents and conditions of use etc. relating to the study and the data in the study –Related Material: publications, community information and related links –(Access Conditions). [Diagram: Metadata Granule, Topic, Study, Access Conditions, Related Material, Legal Note, Data Holding, Investigation, Data Collection, Atomic Data Object, with 1-to-many cardinalities]
Model Breakdown: Provenance. The Study contains the following metadata: –The Study Name –The Study Institution –The Investigator –Extended Study Information (Abstract, Funding, Start and End times) –Investigations
Investigations: A Study can have more than one investigation; possible investigation types include experiment, simulation, measurement etc. Investigations contain: –Name –Investigation Type –Abstract –Resource –Link to DataHolding
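The Study and Investigation fields listed above can be written down as a small data structure. The following Python sketch is one possible reading, not the ICAT schema: the class and attribute names are assumptions derived from the bullet points, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Investigation:
    name: str
    investigation_type: str  # e.g. "experiment", "simulation", "measurement"
    abstract: str = ""
    resource: str = ""
    data_holding: Optional[str] = None  # link to the DataHolding, if any


@dataclass
class Study:
    name: str
    institution: str
    investigator: str
    # Extended study information
    abstract: str = ""
    funding: str = ""
    start: Optional[str] = None
    end: Optional[str] = None
    # A study can have more than one investigation
    investigations: List[Investigation] = field(default_factory=list)

    def investigations_of_type(self, kind: str) -> List[Investigation]:
        # Select investigations by their enumerated type.
        return [i for i in self.investigations if i.investigation_type == kind]


# Invented example values, for illustration only.
study = Study(name="Example study", institution="STFC", investigator="A. Scientist")
study.investigations.append(Investigation("Run 1", "experiment"))
study.investigations.append(Investigation("Model A", "simulation"))
experiments = study.investigations_of_type("experiment")
```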
Topic (for indexing): Keywords: –Discipline (i.e. domain) –Keyword Source (e.g. domain dictionary) –Keyword. Subjects: –Discipline –Subject Source (e.g. domain taxonomy) –Subject
Access Condition & Related Material: Access Conditions contain a list of users or groups who are allowed access to the metadata and data, or a pointer to an access control system which holds this information for the study. Related Material contains one or more links and/or textual descriptions of material related to this study, e.g. earlier or parallel studies.
Data: Data Description holds a logical description of the Study's data: –Data Name –Type of Data –Status –Data Topic –Parameters –Related Data Ref –Relation type (e.g. derived). Data Location contains the link between logical names (e.g. URIs) and physical URLs: –Data Name –Locator(s). (In the case of Atomic Data Objects, locators can refer to files as well as to named Selects on a database, i.e. virtual data objects.)
More on Parameters: Parameters contain a lot of information about the atomic data objects (ADOs) and collections. A collection/ADO can have many parameter entries; each parameter entry contains: –Parameter derivation (e.g. measured/fixed) –The value –The units –Range –Error margin. Parameter aggregation is also supported.
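A parameter entry with the fields listed above might look like the following Python sketch. The slide does not say how aggregation works, so the `aggregate` function shows only one plausible interpretation (summarising the same parameter across several atomic data objects into a collection-level entry with a mean value and a min/max range). All names here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    derivation: str  # e.g. "measured" or "fixed"
    value: float
    units: str
    value_range: Optional[Tuple[float, float]] = None
    error: Optional[float] = None


def aggregate(parameters: List[Parameter]) -> Parameter:
    """Summarise one parameter over several data objects (assumed semantics)."""
    if not parameters:
        raise ValueError("nothing to aggregate")
    names = {p.name for p in parameters}
    units = {p.units for p in parameters}
    if len(names) != 1 or len(units) != 1:
        raise ValueError("can only aggregate one parameter in one unit")
    values = [p.value for p in parameters]
    return Parameter(
        name=parameters[0].name,
        derivation="aggregated",
        value=sum(values) / len(values),          # mean over the data objects
        units=parameters[0].units,
        value_range=(min(values), max(values)),   # spread over the data objects
    )


# Invented temperatures for three data objects in one collection.
per_file = [Parameter("temperature", "measured", t, "K") for t in (10.0, 20.0, 30.0)]
collection_level = aggregate(per_file)
```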
Cardinality Issues: The model recommends a certain cardinality of elements. Certain metadata components are necessary for an instance of the implemented model to exist; treating everything as optional is not acceptable. It is thought that implementations may adapt the cardinalities to their needs; the model attempts to remain ideal (i.e. to use the most common cardinality).
Enumeration Issues: Enumerations (or controlled vocabularies), e.g. types of investigator or types of institution, are distinct from the model, as taxonomies are. However, they are necessary for the model to work, so implementations (e.g. the STFC DataPortal implementation of the model) propose some enumerations for common things. It is hoped that implementations will use recognised and relevant controlled vocabularies where they are available.
Conformance Level: A complete metadata study-dataset record requires a large amount of metadata to be stored and processed, so it is useful to have conformance levels. The model uses 5 levels. Each level specifies that more metadata (and indexing information) should be held.
Level 1 Type of Information captured: –Study and Investigation metadata with indexing at the Study level Level 1 metadata is similar to library/publication style metadata (e.g. DublinCore)
Level 2 Type of Information captured: –Level 1 + DataHolding metadata (i.e. DataSets and DataObjects)
Level 3 Type of Information captured: –Level 2 + related material, Access condition, indexing to data collection levels
Level 4 Type of Information captured: –Level 3 + indexing to data object level and data object parameter information
Level 5 Type of Information captured: –All metadata components are filled: Level 4 + funding, resources used, facilities used etc.
Conformance Levels: L1 is similar to library/publication style metadata (e.g. DublinCore). The current DataPortal uses somewhere between L4 and L5; the new systems designed with CSMD conform to L4+. Benefit of conformance levels: the higher the level of conformance to the CSMD, the richer the clients that operate on the data can be, e.g. identifying datasets and atomic data objects which link directly to keywords/taxonomies and not just studies.
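The five conformance levels are cumulative, so the level of a record can be worked out by checking which groups of metadata are present. The Python sketch below assumes a record is summarised as a set of component names; the component names themselves are hypothetical labels chosen to mirror the Level 1 to Level 5 slides, not identifiers from the CSMD.

```python
from typing import Iterable

# Each entry lists what a level adds on top of the previous one (assumed labels).
LEVEL_REQUIREMENTS = [
    (1, {"study", "investigation", "study_index"}),
    (2, {"data_holding"}),
    (3, {"related_material", "access_conditions", "collection_index"}),
    (4, {"object_index", "object_parameters"}),
    (5, {"funding", "resources_used", "facilities_used"}),
]


def conformance_level(present: Iterable[str]) -> int:
    """Return the highest level whose requirements, and all lower ones, are met."""
    have = set(present)
    level = 0
    for candidate, required in LEVEL_REQUIREMENTS:
        if not required <= have:
            break  # levels are cumulative, so stop at the first gap
        level = candidate
    return level


# A record with data holdings and access conditions but no related material.
partial = {"study", "investigation", "study_index", "data_holding", "access_conditions"}
partial_level = conformance_level(partial)

# A record with every component filled in.
complete = set().union(*(required for _, required in LEVEL_REQUIREMENTS))
complete_level = conformance_level(complete)
```

A record that fills only part of a level stays at the level below, which matches the description of the DataPortal as sitting somewhere between L4 and L5.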
CSMD Used on DataPortal: The implementation is used as the Data Interface for the DataPortal. It gives a single view of heterogeneous systems/schemas. It acts as a stress test of the model: –Limitations feed into Model Requirements –New requirements feed back into implementation
ICAT Schema 3.3
Specifics of the ICAT 3.3 Schema
ICAT 3.3 Schema - Facility
ICAT 3.3 Schema - Study
ICAT 3.3 Schema – Study (2)
ICAT 3.3 Schema – Study (3) [Diagram: Study, Investigation, Study Status]
ICAT 3.3 Schema - Investigation
ICAT 3.3 Schema - Instrument
ICAT 3.3 Schema - Shift
ICAT 3.3 Schema – Shift (2)
ICAT 3.3 Schema - Keywords
ICAT 3.3 Schema - Topic
ICAT 3.3 Schema – Topic (2) [Diagram: Topic, Topic List]
What is an Ontology? Ontologies are used to capture knowledge about a domain of interest. An ontology describes the concepts in the domain and the relationships that hold between those concepts.
Advantages of Ontologies: They provide increased flexibility when representing frequently changing viewpoints of information: alterations can simply be made in the model without having to alter the applications based on it. They allow a unified view of heterogeneous data sources. They remove conflicts and terminological uncertainties. They facilitate moderated searches and optimisation of the search results.
Why are Ontologies a useful Solution? At present over 1,700,000 keywords describing experiments are held in the ISIS ICAT, many of which are synonyms. These keywords are used to index experimental studies. However, this is seen as a limited method: the free-text keywords have no context, are hard for non-experts to map to terms used by facilities in the same domain, and are harder still to map to terms used outside it. The creation of ontologies at ISIS will aid the mapping of concrete manifestations of familiar terms within one domain, as well as of related concepts in different domains. This will facilitate searching of data by category and grouping of data by keyword across studies. It could also aid cross-facility searching of related scientific data from the various facilities housed at STFC, e.g. CLF and DLS.
A Protégé-OWL Ontology: Classes, Individuals, Properties. A class is a concept in the domain (e.g. a class of People, a class of Pets, a class of Countries); a class is a collection of elements with similar properties. Individuals are instances of classes: America can be an instance of the class Country. [Diagram: individuals Gemma, Mathew, Fluffy, Fido, Italy, America, England; classes Person, Pet, Country; properties livesIn, hasSibling, hasPet]
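The three building blocks named on this slide (classes, individuals, properties) can be mimicked with plain Python data structures. This is a toy sketch, not OWL and not the Protégé API. The class, individual and property names come from the slide, but which individual is linked to which by each property is not recoverable from the text, so the specific links below are assumptions.

```python
class TinyOntology:
    """Toy store of individuals, their classes, and property links."""

    def __init__(self):
        self.instance_of = {}  # individual -> class name
        self.triples = set()   # (subject, property, object)

    def add_individual(self, name, cls):
        self.instance_of[name] = cls

    def relate(self, subject, prop, obj):
        # Both ends of a property must be known individuals.
        for name in (subject, obj):
            if name not in self.instance_of:
                raise KeyError(f"unknown individual: {name}")
        self.triples.add((subject, prop, obj))

    def individuals(self, cls):
        return sorted(n for n, c in self.instance_of.items() if c == cls)

    def objects(self, subject, prop):
        return sorted(o for s, p, o in self.triples if s == subject and p == prop)


onto = TinyOntology()
for person in ("Gemma", "Mathew"):
    onto.add_individual(person, "Person")
for pet in ("Fluffy", "Fido"):
    onto.add_individual(pet, "Pet")
for country in ("Italy", "America", "England"):
    onto.add_individual(country, "Country")

# Assumed links, for illustration only.
onto.relate("Gemma", "hasPet", "Fluffy")
onto.relate("Gemma", "hasSibling", "Mathew")
onto.relate("Gemma", "livesIn", "England")

countries = onto.individuals("Country")
gemma_pets = onto.objects("Gemma", "hasPet")
```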
ISIS Facilities Ontology Hierarchy
ISIS Facilities Ontology [Diagram: classes ISISExperiment, CrystallographyGroupExperiment, DataFile, Year, Instrument, Investigator, InvestigationTitle; properties wasConductedIn, hasInvestigator, hasUsedInstrument, hasTitle, hasDataFileName; example individuals HRP00145.RAW, 1986, Pete Jones, HRPD, Hydrazinium, Protein Crystallography GroupExperiment]
Sample, Investigator and Experiment Ontologies [Diagrams: Sample, Investigator, Experiment]
Ontology Maintainer: A web application for graphically displaying current versions of an ontology. Currently ontologies are built within Protégé, an editing environment. It is difficult to show constructed ontologies to other domain experts. The OntoMaintainer allows users to visualise an ontology and enter feedback on the classification and structure of the hierarchy. It encourages collaboration between domain experts (scientists) and ontology builders by allowing members of the community to be involved in the development and maintenance of ontologies.
Topic Mapping Tool: The Mapping Tool provides a way of linking proposal system data to the structure of the ontology. Data is mapped to the ontology structure according to a set of defined rules. [Diagram: Proposal System Database, Mapping Rules, Ontology]
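Rule-based mapping of proposal data onto an ontology can be sketched as a list of pattern rules applied to each proposal record. The slides do not show the actual rule format, so everything below is an assumption: the `MappingRule` class, the column names, the regular expressions and the ontology class names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MappingRule:
    column: str          # proposal-system column the rule inspects
    pattern: str         # regular expression applied to the column value
    ontology_class: str  # ontology class assigned when the pattern matches

    def matches(self, record: Dict[str, str]) -> bool:
        value = record.get(self.column, "")
        return re.search(self.pattern, value, flags=re.IGNORECASE) is not None


# Hypothetical rules; the real tool's rules are not given in the slides.
RULES = [
    MappingRule("sample_type", r"^\s*liquid\s*$", "LiquidSample"),
    MappingRule("instrument", r"^\s*hrpd\s*$", "PowderDiffractometer"),
]


def map_record(record: Dict[str, str], rules: List[MappingRule] = RULES) -> List[str]:
    """Return the ontology classes whose rules match this proposal record."""
    return [rule.ontology_class for rule in rules if rule.matches(record)]


record = {"sample_type": " Liquid", "instrument": "hrpd"}
classes = map_record(record)
```

Because the rules are data rather than code, domain experts could review or extend them without touching the mapping function, which fits the collaborative approach described for the OntoMaintainer.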
Mapping Tool
Object: Sample Detail. Fields: Name, Chemical Formula, SampleType.
SampleType: Liquid
Name: poly{1,4-phenylene-[9,9-bis(4-phenoxybutylsulfonate)]fluorene-2,7-diyl}; C12E5; D2O | Chemical Formula: C37H30S2O8; C22H46O6; D2O
Name: poly{9,9-bis[6-(N,N-trimethylammonium)hexyl]fluorene-co-1,4-phenylene}; C12E5; D2= | Chemical Formula: C37H52N2I2: C22H46=6; D2=
Ontologies would help maximise the value of data collected at ISIS and other STFC facilities by improving the access, navigation and reuse of data. Ontologies would facilitate the mapping of terms across STFC facilities, which would allow cross-facility searching: for example, external users would be able to search for all experiments carried out across STFC using a powder diffractometer (instrument) even if they do not know the local names of the specific instruments. The OntoMaintainer will facilitate the process of creating and maintaining ontologies by providing a means of getting feedback directly from domain experts.
ICAT 3.3 Schema - Investigator
ICAT 3.3 Schema – Investigator (2) [Diagram: Investigator, Facility User]
ICAT 3.3 Schema - Sample
ICAT 3.3 Schema – Sample Parameter
ICAT 3.3 Schema – Dataset
ICAT 3.3 Schema – Dataset (2)
ICAT 3.3 Schema – Dataset (3)
ICAT 3.3 Schema – Dataset Status
ICAT 3.3 Schema – Dataset Type
ICAT 3.3 Schema – Dataset Parameter
ICAT 3.3 Schema – Data File
ICAT 3.3 Schema – Data File (1)
ICAT 3.3 Schema – Data File (2)
ICAT 3.3 Schema – Related Data Files
ICAT 3.3 Schema – Data File Parameter
ICAT 3.3 Schema – Authorisation
Other ICAT Related Schema
ICAT API Session Schema: The schema has 3 tables: USER, USER_SESSION and MYPROXY_SERVERS. USER: all users who have logged in. USER_SESSION: all users' sessions on ICAT. MYPROXY_SERVERS: configuration information about which server to log on to.
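The relationship between the three session tables can be sketched with an in-memory database. Only the table names come from the slide. The columns, keys and sample values are assumptions, and SQLite (via Python's standard `sqlite3` module) stands in here for the Oracle database used at STFC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- All users who have logged in (columns are assumed).
    CREATE TABLE user (
        id      INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL UNIQUE
    );
    -- All user sessions on ICAT (columns are assumed).
    CREATE TABLE user_session (
        id         INTEGER PRIMARY KEY,
        user_ref   INTEGER NOT NULL REFERENCES user(id),
        session_id TEXT NOT NULL UNIQUE,
        expires_at TEXT NOT NULL
    );
    -- Which MyProxy server to log on to (columns are assumed).
    CREATE TABLE myproxy_servers (
        id   INTEGER PRIMARY KEY,
        host TEXT NOT NULL,
        port INTEGER NOT NULL
    );
    """
)

# Placeholder values, for illustration only.
cursor = conn.execute("INSERT INTO user (user_id) VALUES (?)", ("alice",))
conn.execute(
    "INSERT INTO user_session (user_ref, session_id, expires_at) VALUES (?, ?, ?)",
    (cursor.lastrowid, "abc123", "2030-01-01T00:00:00"),
)
conn.execute(
    "INSERT INTO myproxy_servers (host, port) VALUES (?, ?)",
    ("myproxy.example.org", 7512),
)

# Find which user owns a given session.
rows = conn.execute(
    "SELECT u.user_id, s.session_id "
    "FROM user_session AS s JOIN user AS u ON u.id = s.user_ref"
).fetchall()
```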
ICAT Core Database Schema
ICAT Core Session
ICAT Core Event
ICAT Core User
ICAT Core DataBase
ICAT Core Database: At STFC the core ICAT catalogue runs on an Oracle 10g RAC clustered database server. The system has been customised to make efficient use of the features Oracle offers. If required, however, these customisations could be removed in the future.
ICAT API
ICAT API Version 3.3 (1): The ICAT API version 3.3 is the interface that any application should use to interact with the core ICAT system catalogue. At present it is used by applications such as the ISIS XML ingest, the DLS Generic Data Acquisition System, DLS DDH and the DataPortal. The API offers a wide range of web services for easy interaction with the ICAT core catalogue.
ICAT API Version 3.3 (2): The ICAT API version 3.3 consists of three main components: Web Services offered to other applications; ICAT Catalogue Interactions; ICAT Catalogue Session Management.
ICAT API Version 3.3 (3): The ICAT API version 3.3 uses JPQL and SQL to interact directly with the underlying Oracle databases. It has been written in Java using EJB3, JPA and JAX-WS.
ICAT API Version 3.3 (4): Web Services are offered to other applications for the search, listing, ingest, deletion and modification of: Authentication; Investigation, Datafile and Dataset Information; Investigator; Keywords; Publication; Sample; Download.
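The split of the API into web services, catalogue interactions and session management can be pictured with a small in-memory sketch. This is not the ICAT API, which is written in Java and exposed as web services: the classes `SessionManager`, `Catalogue` and `CatalogueService`, their methods and the sample data are hypothetical and only illustrate how a service layer might check the session before touching the catalogue.

```python
import secrets
import time


class SessionManager:
    """Session management: issues session ids and checks that they are still valid."""

    def __init__(self, lifetime_seconds=3600.0):
        self._lifetime = lifetime_seconds
        self._sessions = {}  # session id -> (user, expiry time)

    def login(self, user, now=None):
        now = time.time() if now is None else now
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = (user, now + self._lifetime)
        return session_id

    def user_for(self, session_id, now=None):
        now = time.time() if now is None else now
        user, expiry = self._sessions.get(session_id, (None, 0.0))
        if user is None or now >= expiry:
            self._sessions.pop(session_id, None)
            raise PermissionError("session missing or expired")
        return user


class Catalogue:
    """Catalogue interactions: a list stands in for the database."""

    def __init__(self):
        self._investigations = []

    def add(self, title, keywords):
        self._investigations.append({"title": title, "keywords": set(keywords)})

    def search_by_keyword(self, keyword):
        return [i["title"] for i in self._investigations if keyword in i["keywords"]]


class CatalogueService:
    """Service layer: every call validates the session, then queries the catalogue."""

    def __init__(self, sessions, catalogue):
        self._sessions = sessions
        self._catalogue = catalogue

    def search_by_keyword(self, session_id, keyword, now=None):
        self._sessions.user_for(session_id, now=now)
        return self._catalogue.search_by_keyword(keyword)


sessions = SessionManager(lifetime_seconds=3600.0)
catalogue = Catalogue()
catalogue.add("Example diffraction study", ["diffraction", "powder"])
service = CatalogueService(sessions, catalogue)
sid = sessions.login("alice", now=0.0)
titles = service.search_by_keyword(sid, "diffraction", now=10.0)
```

Keeping the session check in one layer means client applications such as a portal never query the catalogue directly, in line with the principle that all interactions go through the API.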
DataPortal
DataPortal for ICAT Version 3.3: The DataPortal is a highly customisable web interface for interacting with ICAT version 3.3. There are at present two distinct versions, one for ISIS and one for DLS. The underlying functionality is the same, but the graphical presentation and the choice of services used vary. The DataPortal offers a number of search interfaces and the ability to explore investigations and download associated data.
Top Left Hand Menu
Bottom Left Hand Menu
Top Right Hand Menu
Session Expire
DLS DataPortal
Questions?