caGrid Executive Introduction caGrid 1.3 Justin Permar caGrid Knowledge Center kc.nci.nih.gov/CaGrid/KC
2 Agenda
- Vision and Use Cases
- caGrid Introduction
- Building and Using caBIG Applications
- Component / Service Survey
- Grid Interactions
- Grid Service Deployment
3 Vision “Imagine, if you will, a resource that would give individual scientists the capacity to easily view aggregate information on thousands of patients; a system that would also allow both patients and physicians to have complete medical records - including the patient's personal genome, tests performed over time, and medications taken - available at the click of a mouse. Rather than recruiting patients into clinical trials by who walks into the clinic or by individual referral, clinician-scientists could scan a database for patients precisely matched to their study, even if the study is looking for patients with specific genomic alterations, mutations, or translocations.” “In efforts to increase both the efficacy and efficiency of cancer care, managers of healthcare systems would have patient outcome data from hospitals across the country to utilize in benchmarking their own outcomes in key areas and managing cost. These brief examples are just a glimpse of the power that could come from such an interconnected national biomedical resource.” Source: John Niederhuber, Director, NCI
4 About caBIG®
caBIG® stands for the cancer Biomedical Informatics Grid®. caBIG® is an information network enabling all constituencies in the cancer community – researchers, physicians, and patients – to share data and knowledge. The components of caBIG® are widely applicable beyond cancer as well.
The mission of caBIG® is to develop a truly collaborative information network that accelerates the discovery of new approaches for the detection, diagnosis, treatment, and prevention of cancer, ultimately improving patient outcomes.
The goals of caBIG® are to:
- Connect scientists and practitioners through a shareable and interoperable infrastructure
- Develop standard rules and a common language to more easily share information
- Build or adapt tools for collecting, analyzing, integrating, and disseminating information associated with cancer research and care
Source:
5 Driving needs: cancer Biomedical Informatics Grid
- A multitude of “legacy” information systems, most of which cannot be readily shared between institutions
- An absence of tools to connect different databases
- An absence of common data formats
- A huge and growing volume of data that must be collected, analyzed, and made accessible
- Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results
- Difficulty in identifying and accessing available resources
- An absence of information infrastructure to share data within an institution, or among multiple institutions
- Redundant effort from re-building the same applications at multiple institutions
6 What is the Grid?
“Controlled and coordinated resource sharing and problem solving in dynamic, scalable virtual organizations.” [1]
Securely sharing (with policies!):
- Computers
- Software
- Data
- Other Resources
[1] The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.
7 What is caBIG?
- Common, widely distributed infrastructure that addresses common caBIG needs and permits the cancer research community to focus on innovation
- Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange
- Collection of interoperable applications developed to common standards
- Cancer research data available for mining and integration
8 Why Grid for caBIG?
Informatics Requirement | Advantage of Grid
Controlled | Secure, role-based, locally-controlled access
Comprehensive | Data from multiple types of sources
Connected | Syntactic and Semantic Interoperability
Convenient | Simplified and customizable interfaces
Cost | Cost effective – builds on existing technologies
Compliant | Implements policy & technical standards
Credible | Built on experience & best practices
Adapted from Muzna Mirza, MD, MSHI’s presentation on Global Public Health Grid
10 What is caGrid to caBIG?
- The “G” in caBIG: cancer Biomedical Informatics Grid
- Provides the software infrastructure that underlies the tools and applications of caBIG
- Analogous to the “power grid”: a multitude of applications with differing requirements can seamlessly be plugged in to a common infrastructure
11 What is caGrid? (2)
- Biomedical applications that share data all have common needs for syntactic and semantic interoperability
- caGrid aims to be a platform for interoperability
- caGrid is a Grid software toolkit aimed at software developers creating Grid applications
- caGrid provides:
  - the GAARDS toolkit, a standard security platform
  - metadata services that add semantic information to all Grid services
  - Introduce, a toolkit to develop Grid services
- The Grid is a trusted network that supports collaborative biomedical research. “Getting on the Grid” involves joining the trusted network by applying for and utilizing Grid credentials
12 Compatibility and Interoperability caBIG ® provides standards-based compatibility guidelines for creating software systems that are syntactically and semantically interoperable.
13 The Grid Allows Users to Find and Utilize Data and Analytical Resources
Grid service information is advertised to a Grid service directory called the Index Service. This service is used to locate Grid services relevant to your research objectives.
[Figure: a data or analytical resource (e.g., caBIO) is exposed as a Grid service, which advertises itself to the Grid Service Directory (Index Service); client applications and users on the Grid discover services through the directory.]
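The advertise/discover pattern described on this slide can be sketched in a few lines. This is a toy in-memory directory for illustration only; the class names, fields, and URLs are assumptions and do not reflect the actual Index Service interface.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Metadata a service advertises (hypothetical fields, not the caGrid schema)."""
    url: str
    kind: str  # "data" or "analytical"
    keywords: set = field(default_factory=set)


class ServiceDirectory:
    """Toy in-memory stand-in for a Grid service directory."""

    def __init__(self):
        self._entries = {}

    def advertise(self, entry):
        # Re-advertising the same URL replaces the earlier entry.
        self._entries[entry.url] = entry

    def discover(self, keyword, kind=None):
        # Return the URLs of matching services, optionally filtered by kind.
        return sorted(
            e.url
            for e in self._entries.values()
            if keyword in e.keywords and (kind is None or e.kind == kind)
        )


directory = ServiceDirectory()
directory.advertise(ServiceEntry("https://example.org/cabio", "data", {"gene", "pathway"}))
directory.advertise(ServiceEntry("https://example.org/align", "analytical", {"gene"}))
print(directory.discover("gene", kind="data"))
```

A client never needs to know service locations in advance; it only needs the directory and a description of what it is looking for.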
14 caGrid: High Level View Once a caBIG ® tool is adopted or adapted by members of the research community, the tool is connected to the Grid to securely share data and analysis routines with collaborating researchers.
15 Infrastructure Focus Areas
Leveraging Grid technologies and standards as an interoperability platform
- Metadata Infrastructure: surfacing the wealth of existing caBIG data-oriented metadata on the grid; providing new service-oriented metadata
- Security: integrating existing systems and applications with Grid security; lowering the burden of implementing grid-wide and local policy
- Tooling for Service Developers: powerful platform for bringing applications and data to the grid
- Facilitating Grid-wide operations: federated query, workflow execution, resource discovery
- Making the Grid more accessible: graphical installation and configuration, higher-level object-oriented APIs, web portals, graphical administrative applications
- Quality: comprehensive testing infrastructure, automated builds and test execution on multiple platforms, dashboard with historical archive
16 More About Security
Comprehensive security is critical for collaboration scenarios involving biomedical data sharing. The caGrid security components, collectively known as GAARDS, include the following services:
- Dorian – Allows users to log in to the Grid
- Authentication Service – Integrates existing institutional login capabilities with the Grid
- Grid Grouper – Allows institutions to implement group-based security policies
- Grid Trust Service – Provides capabilities for Grid entities to trust each other
- Credential Delegation Service – Provides the ability to securely transfer Grid credentials to others
- Web Single Sign-On – Allows a single login to provide access to multiple web applications that utilize Grid services
17 caGrid Integration with Existing Information Systems
caGrid is an informatics platform that integrates and augments existing informatics infrastructure. Examples include the following:
- caGrid integrates existing repositories of semantic information such as ontology servers
- caGrid integrates with existing institutional login systems (e.g., LDAP)
- caGrid shares data from existing databases and files
In summary, caGrid integrates with existing systems to share and analyze data for multi-institutional clinical and research scenarios.
18 Getting Started with caGrid
To get started developing Grid applications, first install caGrid. Use the caGrid installer to load caGrid onto your development machine; using the installer is the easiest way to install caGrid.
Features include:
- Guided, wizard-like interface for easy installation
- The installer can be used to re-configure existing installations
- The only requirement to run the installer is the Sun® Java™ 5 Development Kit
20 caGrid Community Involvement: Building Grid Applications
- caGrid itself provides no real “data” or “analysis” to caBIG; caGrid enables the community to build services that share and analyze data
- The real “value” of the grid comes from bringing this information to the “end user”
- Community members develop end user applications which consume the resources provided by the grid
- A Grid data service shares data securely with collaborators
- A Grid analytical service analyzes data
- A Grid application utilizes multiple Grid services to aid clinical and research workflows
21 caCORE Development Process
caCORE is a robust set of tools and resources to support the development of caBIG®-compatible systems. NCI offers comprehensive training for caCORE tools.
1. Create an Information Model using a modeling tool (result: Information Models)
2. Perform Semantic Integration using the SIW (result: Vocabularies)
3. Generate Code and Interfaces using the caCORE SDK Code Generator (result: APIs)
4. Transform the Model into Metadata using the UML Loader (result: CDEs)
5. Generate a Grid Service using caGrid (result: Grid)
Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
22 UML Model Creation Process
Supporting repositories:
- Enterprise Vocabulary Services (EVS): stores controlled terminologies used during semantic annotation. The SIW pulls concepts from EVS and attaches them to model components.
- cancer Data Standards Repository (caDSR): holds Common Data Elements (CDEs). UML model elements that are semantically annotated are added to the caDSR as CDEs.
Steps:
1. Logical Model: create a Logical Model (UML class diagram) using Enterprise Architect
2. Data Model: create a Data Model (database schema) using Enterprise Architect
3. Semantics: semantically annotate the UML Model using the SIW
4. Mapping: map the Logical Model to the Data Model using caAdapter
5. Load Model: the model is complete and ready for compatibility review and load into caDSR
23 caBIG® Compatibility Guidelines: Areas of Interoperability
- Semantic Interoperability (VCDE): Information Models; Vocabularies and Ontologies; Common Data Elements (CDEs)
- Syntactic Interoperability (Architecture): Programming and Messaging Interfaces (APIs)
An application must meet the criteria specified in all four areas (Vocabularies, Information Models, APIs, CDEs) to be "caBIG® Compatible".
Reference: Dr. Robert Freimuth, Vocabulary Knowledge Center Director
24 caBIG® Compatibility Guidelines: Levels of Maturity
Each level applies across the four areas (Vocabularies, Information Models, APIs, CDEs):
- Legacy: Implies no interoperability with an external system or resource
- Bronze: Minimum requirements to achieve basic interoperability
- Silver: Rigorous requirements to significantly reduce the barrier of use for parties not involved with development of that resource
- Gold: Extensions to silver that add standardization and harmonization practices to enable full syntactic and semantic interoperability
Source:
Using caBIG Applications
27 caGrid 1.3 Core Services
All caGrid Core Services were redeployed on all caBIG® Grids (OSU Training, QA, Stage, and Production) for this release. The (12) caGrid 1.3 Core Services are:
Metadata Services | Security Services | Business Activity Services
Global Model Exchange Service** | Authentication Service** | Federated Query Processor Service**
Index Service** | Credential Delegation Service | BPEL Workflow Service
Metadata Model Service* | Dorian Service** | Taverna Workflow Service*
 | Grid Grouper Service |
 | Grid Trust Service (Master & Slave) |
* New for 1.3
** Significantly Rewritten or Enhanced for 1.3
28 What’s the use of metadata?
Service metadata is critical for finding Grid resources relevant to particular research and clinical scenarios. Metadata describes the service functionality and the meaning of data that are shared by a Grid service.
- Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios
  Solution: Grid services register with a Grid service directory
- Scenario: Users want to view the structure and relationships of data on the Grid
  Solution: The UML model defines the content of Grid data types and relationships between these types
- Scenario: Users need to know the format of the data described in a UML model
  Solution: XML schemas, stored in a Grid repository, define the data format to act as the foundation for syntactic interoperability
- Scenario: Scientists want to identify the meaning of the data described in a UML model
  Solution: Grid data is annotated with semantic information, such as use of community-approved vocabulary and concept definitions
29 What caGrid services provide this functionality?
- Scenario: Scientists and others using the Grid want to find and utilize existing data sources and algorithms relevant to their research scenarios
  Service: The Index Service included in caGrid is a Grid-wide service directory that serves as the “white” and “yellow” pages of the Grid
- Scenario: Users want to view the structure and relationships of data on the Grid
  Service: Every data service provides a data model that represents the information in the UML model
- Scenario: Users need to know the format of the data described in a UML model
  Service: The Global Model Exchange (GME) Service is a Grid-wide repository for XML schemas
- Scenario: Scientists want to identify the meaning of the data described in a UML model
  Service: The Metadata Model Service (MMS) is used to add semantic information to caGrid services. The MMS is also used to generate a Grid representation of the data in your UML model, including semantic information
30 How does caGrid use the caBIG semantic repositories?
All caGrid Services are expected to publish a set of standard metadata which draws heavily from the metadata registered in caDSR and EVS.
- Common Metadata describes generic information about the service, such as the Cancer Center providing it, points of contact, etc.
- The Service’s operations are defined, and their inputs and outputs described, using CDEs in caDSR and vocabulary from EVS
- Data Services additionally describe the domain Model they are exposing:
  - Classes, attributes, and associations from the UML model
  - Semantics of the UML model
31 What security problems exist for multi-institutional data sharing scenarios?
- Inter-institutional “trust”: What institutions participate in the Grid? How can you verify that an identity was issued by the institution that it claims to be from?
- User authentication: How does a user prove their identity? How can we check that the identity is legitimate?
- User authorization: How can institutions that share Grid services grant privileges to their collaborators? How can institutions that share data ensure their collaborators can only access data that the institutions intend to share?
- Data Integrity: How can institutions be sure that data they are sharing is transmitted properly?
- Data Security: How can institutions be sure that they share data only with whom they intend to share data?
- Allowing services to retrieve and analyze data on your behalf
32 What caGrid Services Address these Security Scenarios?
- Inter-institutional “trust”
  - The Grid Trust Service (GTS) is used to establish a trust fabric, which is a collection of authoritative certificate authorities
- User authentication
  - Dorian has a CA that is an essential part of the trust fabric
  - Dorian issues both host certificates and user credentials that are trusted by others in the Grid because they have synchronized with the trust fabric
  - The Authentication Service allows institutions to integrate their local user management systems with the Grid
- User authorization
  - Grid Grouper provides group management, which, in turn, allows service developers to add group-based authorization policies
  - The Common Security Module (CSM) can be used to protect individual data elements shared by a Grid data service
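The two checks described above (is the identity's issuer part of the trust fabric, and is the identity in a permitted group) can be sketched as follows. This is a simplified illustration under stated assumptions: the names `GroupRegistry` and `authorize`, and the string identities, are hypothetical and are not the GTS or Grid Grouper APIs.

```python
class GroupRegistry:
    """Toy group store mapping a group name to member identities (hypothetical)."""

    def __init__(self):
        self._members = {}

    def add_member(self, group, identity):
        self._members.setdefault(group, set()).add(identity)

    def is_member(self, group, identity):
        return identity in self._members.get(group, set())


def authorize(identity, issuer, trusted_issuers, groups, required_group):
    """Allow access only to trusted identities that belong to the required group."""
    # Trust check: the issuing authority must be part of the trust fabric.
    if issuer not in trusted_issuers:
        return False
    # Authorization check: group-based policy chosen by the service developer.
    return groups.is_member(required_group, identity)


trusted_issuers = {"CN=Example Dorian CA"}  # assumed contents of the trust fabric
groups = GroupRegistry()
groups.add_member("trial-investigators", "alice")

print(authorize("alice", "CN=Example Dorian CA", trusted_issuers, groups, "trial-investigators"))
```

Keeping the trust decision separate from the group decision mirrors the split between inter-institutional trust and per-service authorization on this slide.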
33 What caGrid Services Address these Security Scenarios? (2)
- Data Integrity: caGrid supports checksums to ensure that data has not been altered during transmission
- Data Security: caGrid supports encryption to ensure that data cannot be read by others during transmission
- Allowing services to work for you: The Credential Delegation Service (CDS) allows you to hand your credential to a third party for a specified period of time
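Two of the ideas on this slide lend themselves to a short sketch: verifying a checksum on received data, and limiting a delegated credential to a specified period. The slide does not name a checksum algorithm, so SHA-256 is an assumption here, and the `Delegation` record is a hypothetical model rather than the CDS interface.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def checksum(payload: bytes) -> str:
    """Hex digest of the payload (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(payload).hexdigest()


def is_intact(payload: bytes, expected: str) -> bool:
    """True when the received payload matches the sender's checksum."""
    return checksum(payload) == expected


@dataclass(frozen=True)
class Delegation:
    """Hypothetical record of a credential delegated for a limited time."""
    delegator: str
    delegate: str
    expires_at: datetime

    def usable_by(self, identity: str, now: datetime) -> bool:
        # Only the named delegate may use it, and only before it expires.
        return identity == self.delegate and now < self.expires_at


sent = b"subject=S1;qt_ms=400"
digest = checksum(sent)

issued = datetime(2010, 1, 1, 12, 0, tzinfo=timezone.utc)
grant = Delegation("alice", "workflow-service", issued + timedelta(hours=2))
```

The receiver recomputes the digest and compares it with the one sent alongside the data; the delegate's access lapses on its own once the period ends.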
34 How do Grid applications use core caGrid services?
- The user community adds data services and analytical services to the Grid. These services share data and analytical resources with others
- Multi-institutional collaborations will require the use of multiple Grid services
- caGrid provides “higher-level” services that utilize the aforementioned Grid services:
  - The Federated Query Processor (FQP) provides applications with capabilities to aggregate data from multiple (equivalent) data services and to join data from multiple data services
  - The workflow services allow users to specify interactions between services to achieve a desired result. For example: retrieve all ECG data for subjects in our clinical trial and calculate the mean QT value, storing the data in a results data service
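The ECG example can be traced end to end with a small sketch: aggregate records from two equivalent data services, join them against an enrollment service, and compute the mean QT value. All records, field names, and helper functions below are made up for illustration; they stand in for remote services and are not the FQP or workflow service APIs.

```python
from statistics import mean

# Two equivalent ECG data services at different sites (hypothetical records).
site_a_ecg = [{"subject": "S1", "qt_ms": 400}, {"subject": "S2", "qt_ms": 420}]
site_b_ecg = [{"subject": "S3", "qt_ms": 380}, {"subject": "S4", "qt_ms": 450}]

# A separate service that knows which subjects are enrolled in which trial.
enrollment = [
    {"subject": "S1", "trial": "T1"},
    {"subject": "S3", "trial": "T1"},
    {"subject": "S4", "trial": "T2"},
]


def aggregate(*services):
    """Combine results from several equivalent data services."""
    return [record for service in services for record in service]


def join_on_subject(ecg_records, enrollment_records, trial):
    """Keep only ECG records whose subject is enrolled in the given trial."""
    enrolled = {r["subject"] for r in enrollment_records if r["trial"] == trial}
    return [r for r in ecg_records if r["subject"] in enrolled]


ecg_all = aggregate(site_a_ecg, site_b_ecg)
trial_ecg = join_on_subject(ecg_all, enrollment, "T1")

# Stand-in for storing the result in a results data service.
results_store = {"trial": "T1", "mean_qt_ms": mean(r["qt_ms"] for r in trial_ecg)}
print(results_store)
```

Aggregation and join are separate steps here on purpose: the first spans equivalent services, the second relates different kinds of services.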
35 Other caGrid Utilities and APIs
- CQL and DCQL
  - CQL is the “caGrid Query Language” that is used to retrieve data from caGrid data services
  - DCQL is the distributed query language that is used for federated query processing
- Web Single Sign On
  - The Web Single Sign On component allows users to sign in once and use multiple secure web applications
- Introduce
  - Grid application developers use the Introduce toolkit to create data and analytical services
  - The Introduce toolkit can be extended to add project-specific functionality
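The relationship between a single-service query and a distributed query can be illustrated with a toy model. The dictionary layout, predicate names, and functions below are assumptions made for this sketch; they are not the actual CQL or DCQL syntax, which this slide does not show.

```python
import operator

# Illustrative predicate names; the real query language may differ.
PREDICATES = {
    "EQUAL_TO": operator.eq,
    "GREATER_THAN": operator.gt,
    "LESS_THAN": operator.lt,
}


def run_query(query, store):
    """Evaluate a query against one data service (modelled as a dict of lists)."""
    candidates = store.get(query["target"], [])
    attr = query.get("attribute")
    if attr is None:
        return list(candidates)
    test = PREDICATES[attr["predicate"]]
    return [
        obj for obj in candidates
        if attr["name"] in obj and test(obj[attr["name"]], attr["value"])
    ]


def run_distributed(query, stores):
    """Run the same query on several data services and concatenate the results."""
    return [obj for store in stores for obj in run_query(query, store)]


service_a = {"Gene": [{"symbol": "BRCA1"}, {"symbol": "TP53"}]}
service_b = {"Gene": [{"symbol": "BRCA1", "source": "b"}]}

query = {
    "target": "Gene",
    "attribute": {"name": "symbol", "predicate": "EQUAL_TO", "value": "BRCA1"},
}
print(run_distributed(query, [service_a, service_b]))
```

The point of the sketch is the layering: the distributed form reuses the single-service query unchanged and only adds the list of services to consult.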
36 An example Introduce development process (0 lines of developer code!) Create Semantically Harmonized Data Model Grid-ify Generate Data Resource
38 Grid Workflows utilize core Grid Services The Grid services that are included in caGrid provide a core set of features available for Grid usage scenarios Grid workflows are software implementations of real-life clinical and research workflows Figure: Example Data Analysis Workflow
39 Example Image Analysis Scenario Each image processing step is a Grid service Each step in background correction is an operation Source: Joel H. Saltz, Scott Oster, Shannon L. Hastings, Stephen Langella, Renato A. Ferreira, Justin D. Permar, Ashish Sharma, David W. Ervin, Tony C. Pan, Umit V. Catalyurek, Tahsin M. Kurc, "Translational research design templates, Grid computing, and HPC", IEEE International Symposium on Parallel and Distributed Processing., : pp. 1-15, June,
41 Joining the Grid
1. During Grid service creation, the service creator specifies the authentication and authorization requirements for the service
   - For example, a service can require that users must authenticate with the service in order to communicate
   - Specify authorization options (CSM/Grid Grouper) that are needed to support data retrieval and analysis operations that the service offers. A service can require authorization at the service level, operation level, and data level (give the user permission to retrieve only what they are allowed to view)
2. Configure a container to host the service
   - Two types of containers: secure and non-secure
   - A non-secure container can only host non-secure services and does not support authentication or authorization
   - A secure container can host secure and non-secure services and will support authentication and authorization as specified by the service
   - A secure container has its own identity that it uses to communicate with the rest of the Grid
3. Deploy the service to the container and start the container
   - The service advertises itself to the Grid service directory
   - The service directory, in turn, asks your service for information about its operations and data
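The container hosting rule above (a secure container hosts both kinds of service; a non-secure container hosts only non-secure services) reduces to a one-line check. The `Service` and `Container` classes are hypothetical models for this sketch, not caGrid or container APIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Service:
    name: str
    secure: bool  # True if the service requires authentication/authorization


@dataclass
class Container:
    secure: bool
    hosted: list = field(default_factory=list)

    def can_host(self, service):
        # Secure containers host anything; non-secure ones host only non-secure services.
        return self.secure or not service.secure

    def deploy(self, service):
        if not self.can_host(service):
            raise ValueError(f"{service.name} requires a secure container")
        self.hosted.append(service.name)


secure_container = Container(secure=True)
plain_container = Container(secure=False)

secure_container.deploy(Service("trial-data", secure=True))
secure_container.deploy(Service("public-catalog", secure=False))
plain_container.deploy(Service("public-catalog", secure=False))
```

Attempting `plain_container.deploy(Service("trial-data", secure=True))` raises `ValueError` in this sketch, which is the failure a deployer would want to see before the service ever starts.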
42 The Role of Grid Policy
The virtual organizations that join a Grid collectively establish (and enforce) policies that govern the use of the Grid:
- Security policies: How long can a user Grid session last?
- Data sharing policies: Sharing de-identified data? Limited data sets? PHI?
- Service level agreements: What requirements are imposed on service providers?
- Other domain-specific policies
43 Project Resources and Communication
- cagrid.org: Software Downloads, Documentation, Tutorials, Technical Paper and Presentations, FAQs
- caBIG® caGrid Knowledge Center: Knowledge Base, Forums, Enterprise Support, Community engagement
- caGrid GForge Home (project website): Feature Requests, Bug Reports
- caGrid Portal (web portal)
44 Acknowledgments THANK YOU caGrid Development team caBIG® Documentation and Training team