Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture
Peter Brezany
University of Vienna

Collecting Data
  Satellites
  Laboratories (microscopes, MRI/CT scanners, ...)
  Computer simulations
  Experiments (high energy physics, ...)
  Business
  Data repositories
  Analysis

Motivation
  Computational Grid – a new-generation infrastructure
  Challenge: advanced analysis of data managed by the Grid
  Typical data in modern Grid applications:
    – files, file collections, relational and XML DBs, virtual data, data objects
  The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
  Our research aims:
    – Phase 1: Knowledge discovery Grid system (GridMiner)
    – Phase 2: Intelligent Grid system (WisdomGrid)

Outline
  Motivation
  Background and Related Work
  Basic Concepts and GridMiner Architecture
  Grid Data Integration System
  Data Mining Layer
  Implementation Issues and Experiments
  Future Research
  Conclusions

Background and Related Work
  Basic Grid development (Globus 1) – metacomputing
  Data Grid (Globus 2, DataGrid of CERN, etc.)
  Semantic Grid (myGrid)
  Open Grid Services Architecture (Globus 3, OGSA-DAIS)
  Parallel and Distributed Data Mining and Data Warehousing
  Knowledge Grid (GridMiner and work of others)
  Web Intelligence

GridMiner Requirements
  Open architecture
  Data distribution, complexity, heterogeneity, and large data size
  Applying different kinds of analysis strategies
  Compatibility with existing Grid infrastructure
  Openness to tools and algorithms
  Scalability
  Grid, network, and location transparency
  Security and data privacy
  OLAP support

GridMiner (Layered) Abstract Architecture
  [Figure; layers: Computational & Data Grid, Information Grid, Knowledge Grid; labels: Data to Knowledge, Control, User Interface]
  Built on K.G. Jeffery's proposal

GridMiner Conceptual Architecture
  [Figure; surviving label: JobControl]

Service Architecture Based on OGSA-DAIS

Data Distribution Scenarios
  1. Single data source
  2. Federated data sources with different types of partitioning

Example
  Vertical and horizontal distribution of the virtual data source

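The two kinds of partitioning named on the slides above can be illustrated with a minimal sketch. It is not the GridMiner mediation service: the function names, site names, columns, and rows below are all hypothetical, and real sources would be accessed through Grid data services rather than in-memory lists. The sketch only shows the idea that horizontally partitioned fragments are combined by a union of rows, while vertically partitioned fragments are combined by a join on a shared key.

```python
def union_horizontal(*fragments):
    """Horizontal partitioning: each source holds different rows of the same schema."""
    rows = []
    for fragment in fragments:
        rows.extend(fragment)
    return rows


def join_vertical(key, *fragments):
    """Vertical partitioning: each source holds different columns, linked by a shared key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            # Collect all columns seen for this key into one new record.
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical fragments: sites A and B split the rows, site C holds an extra column.
site_a = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
site_b = [{"id": 3, "age": 47}]
site_c = [
    {"id": 1, "outcome": "good"},
    {"id": 2, "outcome": "poor"},
    {"id": 3, "outcome": "good"},
]

# The "virtual data source" seen by the client is the recombined table.
virtual_source = join_vertical("id", union_horizontal(site_a, site_b), site_c)
for row in virtual_source:
    print(row)
```

In this toy setting the client sees one table with the columns `id`, `age`, and `outcome`, regardless of how the rows and columns are spread over the three sites.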
Mapping Schema

Grid Data Mediation Services

Architecture of a Data Mining System

Components of the Data Mining Layer
  GridMiner Service Factory
  GridMiner Service Registry
  GridMiner Data Mining Service
  GridMiner Preprocessing Service
  GridMiner Presentation Service
  GridMiner Orchestration Service

Centralized Data Mining

Parallel and Distributed Data Mining

GridMiner Orchestration Service

GridMiner Job Specification Language

Implementation
  Prototype implementation of the Mediation Service for horizontal data partitioning
  Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release
  We use:
    – Weka, a freely available Java-based data mining system (data preprocessing and data mining tasks; main-memory oriented)
    – a home-grown Java implementation of the SPRINT algorithm (disk-oriented)

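SPRINT chooses decision tree splits with the gini index, evaluated while scanning sorted attribute lists. The sketch below illustrates only that split criterion on a single numeric attribute. It is not the Java implementation mentioned on the slide: it works in main memory and recomputes the class counts at every candidate split, whereas SPRINT keeps presorted attribute lists on disk and updates class histograms incrementally. The function names and the sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_numeric_split(values, labels):
    """Scan the sorted attribute list; return (threshold, weighted gini) of the best split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, score)
    return best


# Hypothetical training records: patient age and a binary outcome class.
ages = [23, 35, 41, 58, 62, 70]
outcome = ["good", "good", "good", "poor", "poor", "poor"]

threshold, score = best_numeric_split(ages, outcome)
print(threshold, score)  # 49.5 0.0
```

The toy data is perfectly separable, so the best split lies midway between 41 and 58 and both partitions are pure.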
Experimental Environment
  Test data suites
    – synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
    – TBI (Traumatic Brain Injury) databases
  Grid testbed
    – Vienna
    – CERN
    – Dublin
    – Zagreb
    – Cracow
  Goals in the first phases
    – Verifying model accuracy
    – Overhead of the service layers

Extending the Functionality

OLAM

Example: Mining Patterns for Data Classification and Associations
  use database dat1, dat2
  mine classifications
  analyze patient_outcome
  using g_parsimony
  display as tree

  use database DBs attributes
  mine associations
  using method_attributes
  display as rules

Workflow 1: Interactive Mode

Workflow 2: Batch Mode

Workflow 3: Hybrid Mode

Execution Model Based on Static Workflow

Execution Model Based on Dynamic Workflow

Towards the Wisdom Grid (WG)

WG Architecture
  [Figure; labels: Wisdom Grid, Agent Grid Service, Knowledge Base Service, Knowledge Discovery Service, Agent Platform, External Services, External Knowledge Base, Domain Knowledge Agents, Knowledge Explorer Agent, End User (personal) Agent, Grid, KB]

Work-Flow
  [Figure; labels: End User Agent, Knowledge Agent, Knowledge Explorer Agent, Knowledge Base service, External Agents, Knowledge Base, Agent Service, Knowledge discovery service, Services...]

Knowledge Discovery Service
  Client for other services
  Knowledge Discovery in Databases
    – GridMiner
    – data mining
    – on-line analytical processing (OLAP)
  Web Mining
    – semantic web
    – online libraries
  Web/Grid Services
  Knowledge Explorer Agent

Knowledge Base Service / KB
  KBS: search, query, and expand the knowledge base
  KB: database that stores particular data about real objects, the relations between these objects, and their properties
    – Consists of ontologies and instances
    – Information about resources (location, query language) on the Web
    – Web/Grid services, agents
    – References to online databases
  Languages: XML/RDF/DAML+OIL/DAML-S/OWL

Ontology - example
  [Figure: Patient is Human; Patient has Age]
  Language: DAML+OIL

Knowledge Base - example
  [Figure; concepts: Patient, Human, Temperature, Value, Attribute, DatabaseTables; relations: is, has; database reference: jdbc://foo/hospital, table:PATIENTS, attribute:PAT_ID]

Semantic mediator
  Distributed heterogeneous databases
    – Different database schemas
    – Different query languages
    – Different names of attributes/tables
  ... but the same semantics!
  WG enables semantic mediation at a higher level

Semantic mediator (cont.)
  Ontology: Patient is Human; Patient has Age; Patient has Blood Type
  Tables in the two databases (Database in Hospital X, Database in Hospital Z):
    – PATIENTS (PAT_ID, PAT_AGE, PAT_BLOOD_TYPE, ...)
    – PAT_TAB (ID, AGE, BT, ...)
  Property mappings:
    – AGE samePropertyAs PAT_AGE
    – BT samePropertyAs PAT_BLOOD_TYPE

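The samePropertyAs links can be read as a lookup table from ontology properties to local column names. The following sketch shows how one request phrased in ontology terms could be rewritten for each hospital schema. Several things in it are assumptions rather than part of the slides: the assignment of PATIENTS to Hospital X and PAT_TAB to Hospital Z, the use of SQL for both sources (the slides note that query languages may differ), and the names `SCHEMAS` and `build_query`.

```python
# Hypothetical schema registry derived from the samePropertyAs links.
# Which table belongs to which hospital is an assumption.
SCHEMAS = {
    "hospital_x": {
        "table": "PATIENTS",
        "columns": {"id": "PAT_ID", "age": "PAT_AGE", "blood_type": "PAT_BLOOD_TYPE"},
    },
    "hospital_z": {
        "table": "PAT_TAB",
        "columns": {"id": "ID", "age": "AGE", "blood_type": "BT"},
    },
}


def build_query(source, properties):
    """Rewrite a request over ontology properties into a query on one local schema."""
    schema = SCHEMAS[source]
    # Alias every local column back to the shared ontology property name.
    columns = ", ".join(f'{schema["columns"][p]} AS {p}' for p in properties)
    return f'SELECT {columns} FROM {schema["table"]}'


for source in SCHEMAS:
    print(build_query(source, ["age", "blood_type"]))
```

Because each local column is aliased to the ontology property name, the results from both databases come back with the same column names and can be merged without further renaming.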
Distributed Knowledge base
  [Figure; classes and properties: uri:fooX#Patient, uri:fooY#Human, uri:fooZ#Temperature, uri:fooX#Ill_Person; relations: is subclass, has property, is same class as]

Agent Grid Service
  Gives the system the ability to communicate with the outside world in standard languages
  FIPA standards
  ACL: Agent Communication Language
  KQML: Knowledge Query and Manipulation Language
  Agent platform (JADE, FIPA-OS)
  Agents
    – Domain Knowledge Agent
    – Knowledge Explorer Agent
    – End-user Agent (personal)

Querying
  End-user agent
    – with own ontology: subset of ontology; merging of ontologies
    – without own ontology: negotiating about domain of interest
  Queries created from ontology
  Templates

Answers
  Mined knowledge (GridMiner)
    – Decision trees/rules
      » (clinical pathways)
    – Association rules
  Instances of domain ontology
    – Particular data
    – References
    – Links to Web sites
    – Information about other knowledge providers

Case Study - Medical Application
  Q (End User (personal) Agent): Outcome? + data about patient's condition
  A: probability of survival + references to the diagnoses
  [Figure; other labels: Knowledge Agent, Knowledge Explorer Agent, Knowledge Discovery Service, GridMiner, Training set, Test set, Hospital Databases, Knowledge Base, Semantic Web/Grid, resources]

Conclusions and Future Work
  Application and extension of Grid technology to knowledge discovery – an important, but non-traditional Grid application domain
  Introduction of a new Grid Data Mediation Service
  Future work
    – Performance evaluation on large synthetic data volumes
    – Coupling of the Data Mining services architecture with the OLAP services architecture
    – Development of a knowledge discovery oriented Grid Workflow Language and the appropriate Workflow Engine
    – Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
    – Development of the Wisdom Grid