Institut für Softwarewissenschaft - Universität Wien, P. Brezany
Grids, Grid Technologies and Data Mining. Peter Brezany, Institut für Softwarewissenschaft.

Presentation transcript:

Slide 1: Grids, Grid Technologies and Data Mining
Peter Brezany, Institut für Softwarewissenschaft, Universität Wien

Slide 2: Grid and Grid Technologies
Grid computing has emerged as an important field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. The Grid itself is meant to connect computing resources over a wide area network. Internet computing and Grid technologies promise to change the way we tackle complex problems. Harnessing these new technologies effectively will transform scientific disciplines ranging from high-energy physics to the life sciences.
The Grid research field can be further divided into two subdomains:
- Computational Grid: a natural extension of the earlier cluster computer
- Data Grid: efficient management, placement, and replication of large amounts of data; once the data are in place, computational tasks can be run.

Slide 3: Data Mining on (Data) Grids
Data mining on the Grid (DMG): finding data patterns in an environment with geographically distributed data and computation, that is, an environment with specialized data management, data placement, and data replication.
A good DMG algorithm analyzes data in a distributed fashion with modest data communication overhead.
A typical DMG algorithm involves local data analysis followed by the generation of a global data model.
Huge data volumes are involved, so high-performance I/O is needed.
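The "local analysis, then global model" pattern can be illustrated with a deliberately small sketch. It assumes that the global model is just a mean and variance, and that each site only ships a three-number summary; the names `LocalSummary`, `analyze_locally`, and `build_global_model` and the site data are hypothetical and not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Sufficient statistics that a site computes over its own data."""
    count: int
    total: float
    total_sq: float


def analyze_locally(values):
    """Local step: runs where the data lives; raw values never leave the site."""
    return LocalSummary(
        count=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def build_global_model(summaries):
    """Global step: merge the small per-site summaries into one model."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    mean = sum(s.total for s in summaries) / n
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    return {"count": n, "mean": mean, "variance": variance}


# Hypothetical measurements held at three different sites.
site_data = {
    "site_a": [2.0, 4.0],
    "site_b": [6.0],
    "site_c": [8.0, 10.0],
}
summaries = [analyze_locally(values) for values in site_data.values()]
model = build_global_model(summaries)
```

Only the summaries cross the network, so the communication cost is independent of the number of records at each site. Real DMG algorithms exchange richer local models, but the division of work is the same.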

Slide 4: Application Examples
- Determining how the emergence of hepatitis C depends on weather patterns: this requires access to a large hepatitis C database at one location and an environmental database at another location.
- Two major financial organizations want to cooperate. They need to share data patterns relevant to the data mining task, but they do not want to share the data itself because it is sensitive, so combining the databases may not be feasible.
- A major multinational corporation wants to analyze its customer transaction records to develop successful business strategies quickly. It has thousands of establishments throughout the world, and collecting all the data in a centralized data warehouse, followed by analysis with existing commercial data mining software, takes too long.
- Telemedical applications: see the next 2 slides.

Slide 5: Components of Telemedical Applications
[Diagram; labels: Web, Raw Medical Data, Reconstructed Medical Data, Derived Medical Data, Database]

Slide 6: Telemedical Collaboration - Example
- A patient living in a remote village has a heart problem.
- An EEG is taken by the local doctor, and all the patient's details are stored in the doctor's PC-based telemedical system.
- MRI and CT scans are taken in different departments of a general hospital and stored in the telemedical DB.
- A consultant compiles a report and saves it in the DB.
- If necessary, a 3D ultrasound scan is taken in a specialized clinic and a further report is compiled.
- Because complicated surgery is required, an external specialist uses Virtual Reality techniques to define how the surgery should be planned.
- The resulting operation is recorded on video, e.g., for education.
→ Data mining support/assistance is needed.

Slide 7: Motivations and History

Slide 8: Grid Computing Concept
Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, and trust relationships.

Slide 9: Grid Computing Concept (2)
The term "the Grid" was coined in the mid-1990s to denote a proposed distributed computing infrastructure for science and engineering.
The aim is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.
Resources: computers, files, data, sensors, networks, laboratory equipment, etc.
Sharing is highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs.
A set of individuals and/or institutions defined by such sharing forms a virtual organization (VO).
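The three questions of controlled sharing (what is shared, who may share, and under which conditions) can be written down as a small rule check. This is only a sketch under stated assumptions: the classes `SharingRule` and `VirtualOrganization`, the CPU-hour limit, and all names in the example are hypothetical and do not come from any Grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class SharingRule:
    """One provider-defined rule: what is shared, with whom, on which terms."""
    resource: str
    allowed_members: frozenset
    allowed_operations: frozenset
    max_cpu_hours: float  # hypothetical condition on use


@dataclass
class VirtualOrganization:
    name: str
    rules: list = field(default_factory=list)

    def may_use(self, member, resource, operation, cpu_hours):
        """Return True only if some rule explicitly permits the request."""
        return any(
            rule.resource == resource
            and member in rule.allowed_members
            and operation in rule.allowed_operations
            and cpu_hours <= rule.max_cpu_hours
            for rule in self.rules
        )


vo = VirtualOrganization("hepatitis-study")
vo.rules.append(
    SharingRule(
        resource="cluster.site-a",
        allowed_members=frozenset({"alice", "bob"}),
        allowed_operations=frozenset({"run_job"}),
        max_cpu_hours=100.0,
    )
)
```

Anything not explicitly permitted is refused, which mirrors the idea that sharing is defined "clearly and carefully" rather than granted by default.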

Slide 10: Grid Computing Concept (3)
Grid technologies complement rather than compete with existing distributed computing technologies. For example, CORBA focuses on enabling resource sharing within a single organization, whereas Grid technologies focus on dynamic, cross-organizational sharing.

Slide 11: Grid Communities and Applications: Home Computers Evaluate AIDS Drugs
Community =
- 1000s of home computer users
- Philanthropic computing vendor (Entropia)
- Research group (Scripps)
Common goal = advance AIDS research

Slide 12: The Nature of Grid Architecture
A Grid architecture identifies fundamental system components, specifies the purpose and function of these components, and indicates how these components interact with one another.
Interoperability is the central issue to be addressed. In a network environment, interoperability means common protocols.
The Grid architecture is first and foremost a protocol architecture, with protocols defining the basic mechanisms by which VO users and resources negotiate, establish, manage, and exploit sharing relationships.
Standard protocols make it easy to define standard services that provide enhanced capabilities, and to construct Application Programming Interfaces and Software Development Kits.

Slide 13: The Nature of Grid Architecture (2)
Just as the Web revolutionized information sharing by providing a universal protocol and syntax (HTTP and HTML) for information exchange, so we require standard protocols and syntaxes for general resource sharing.
A Grid protocol definition specifies
- how distributed system elements interact with one another in order to achieve a specified behavior, and
- the structure of the information exchanged during this interaction.
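To make the two parts of a protocol definition concrete, the sketch below fixes a message structure (two dataclasses serialized as JSON) and an interaction (one request answered by one reply). The message types `EnquiryRequest` and `EnquiryReply`, the JSON envelope, and the attribute names are all invented for illustration; they are not taken from any real Grid protocol.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EnquiryRequest:
    """Structure of the request: which resource, which attributes."""
    resource: str
    attributes: list


@dataclass
class EnquiryReply:
    """Structure of the reply: the requested attribute values."""
    resource: str
    values: dict


def encode(message):
    """Wrap a message in a small envelope that names its type."""
    envelope = {"type": type(message).__name__, "body": asdict(message)}
    return json.dumps(envelope, sort_keys=True)


def handle(raw_request, state):
    """Interaction rule: exactly one EnquiryReply per EnquiryRequest."""
    envelope = json.loads(raw_request)
    if envelope["type"] != "EnquiryRequest":
        raise ValueError("unexpected message type")
    request = EnquiryRequest(**envelope["body"])
    values = {name: state[name] for name in request.attributes if name in state}
    return encode(EnquiryReply(resource=request.resource, values=values))


# Hypothetical state of one resource and a single request/reply exchange.
state = {"load": 0.25, "free_disk_gb": 120}
raw_reply = handle(encode(EnquiryRequest("node-1", ["load", "uptime"])), state)
reply = json.loads(raw_reply)
```

Because both sides agree only on the message structure and the exchange rule, either side can be reimplemented independently, which is the interoperability argument made on the slide.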

Slide 14: The Nature of Grid Architecture (3)
A Grid service is defined solely by the protocol that it speaks and the behaviors that it implements.
There are standard Grid services for:
- access to computation
- access to data
- resource discovery
- coscheduling (mechanisms for coordinating operations across multiple resources)
- data replication, etc.
The definition of the above services allows us to enhance the services offered to VO participants and also to abstract away resource-specific details.

Slide 15: The Nature of Grid Architecture (4)
Why do we also consider Application Programming Interfaces (APIs) and Software Development Kits (SDKs)?
There is more to VOs than interoperability, protocols, and services. Developers must be able to develop sophisticated applications in complex and dynamic execution environments. Users must be able to operate these applications.
Standard abstractions, APIs, and SDKs can accelerate code development, enable code sharing, and enhance application portability.
Summary: identification and definition of 1. protocols → 2. services → 3. APIs and SDKs.

Slide 16: Grid Architecture
The architecture is organized into layers (see the next slide).
Components within each layer share common characteristics but can build on capabilities and behaviors provided by any lower layer.
Resource and Connectivity protocols facilitate the sharing of individual resources. They are designed so that they can be implemented on top of a diverse range of resource types, defined at the Fabric layer, and can in turn be used to construct a wide range of global services and application-specific behaviors at the Collective layer.

Slide 17: Layered Grid Architecture (By Analogy to Internet Architecture)
Grid layers, from top to bottom:
- Application
- Collective, "Coordinating multiple resources": ubiquitous infrastructure services, app-specific distributed services
- Resource, "Sharing single resources": negotiating access, controlling use
- Connectivity, "Talking to things": communication (Internet protocols) & security
- Fabric, "Controlling things locally": access to, & control of, resources
Internet Protocol Architecture, from top to bottom:
- Application
- Transport
- Internet
- Link

Slide 18: Fabric: Interface to Local Control
The Grid Fabric layer provides the resources to which shared access is mediated.
Fabric components implement the local resource-specific operations that occur as a result of sharing operations at higher levels.
At a minimum, resources should implement enquiry mechanisms that permit discovery of their structure and state, and resource management mechanisms that provide some control of the delivered quality of service.

Slide 19: Fabric: Interface to Local Control (2)
A resource-specific characterization of capabilities:
- Computational resources: Mechanisms for starting programs and for monitoring and controlling the execution of the resulting processes.
- Storage resources: Mechanisms for putting and getting files. Enquiry functions for determining hardware and software characteristics and information about available space and utilization.
- Network resources: Mechanisms that provide control over the resources allocated to network transfers. Enquiry functions to determine network characteristics and load.
- Code repositories: Managing versioned source and object code.
- Catalogs: Catalog query and update operations.
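For the storage case, the capabilities listed above (put, get, and enquiry) map naturally onto a small interface. The following is a minimal sketch, assuming an in-memory store stands in for a real storage system; `StorageResource`, `InMemoryStorage`, and the enquiry field names are hypothetical.

```python
from abc import ABC, abstractmethod


class StorageResource(ABC):
    """Minimal Fabric-level view of a storage resource."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None:
        """Store a file under the given name."""

    @abstractmethod
    def get(self, name: str) -> bytes:
        """Return the contents of a stored file."""

    @abstractmethod
    def enquire(self) -> dict:
        """Report structure and state, e.g. capacity and free space."""


class InMemoryStorage(StorageResource):
    """Toy implementation that keeps files in a dictionary."""

    def __init__(self, capacity_bytes: int):
        self._capacity = capacity_bytes
        self._files = {}

    def _used(self) -> int:
        return sum(len(data) for data in self._files.values())

    def put(self, name, data):
        # Overwriting a file frees its old size before the new size is added.
        used_after = self._used() - len(self._files.get(name, b"")) + len(data)
        if used_after > self._capacity:
            raise OSError("not enough free space")
        self._files[name] = bytes(data)

    def get(self, name):
        return self._files[name]

    def enquire(self):
        used = self._used()
        return {
            "capacity_bytes": self._capacity,
            "used_bytes": used,
            "free_bytes": self._capacity - used,
        }


store = InMemoryStorage(capacity_bytes=16)
store.put("scan.dat", b"12345678")
```

Higher layers would program against `StorageResource` only, so a disk array or a tape archive could be substituted without changing them.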

Slide 20: Connectivity: Communicating Easily and Securely
The Connectivity layer defines the core communication and authentication protocols required for Grid-specific network transactions.
Communication protocols enable the exchange of data between Fabric layer resources.
Authentication protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources.

Slide 21: Connectivity (2)
Authentication solutions for VO environments should have the following characteristics:
- Single sign-on: Users must be able to "log on" (authenticate) just once and then have access to multiple Grid resources defined by the Fabric layer, without further user intervention.
- Delegation: A user must be able to endow a program with the ability to run on that user's behalf, so that the program is able to access the resources on which the user is authorized.
- Integration with various local security solutions: Grid security solutions must be able to interoperate with various local security solutions.
- User-based trust relationships: If a user has the right to use sites A and B, the user should be able to use sites A and B together without requiring that A's and B's security administrators interact.
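Single sign-on and delegation can be sketched with a signed, time-limited token that a program presents on the user's behalf. This is a strong simplification and an assumption of this example: it uses a shared-secret HMAC from the Python standard library, whereas production Grid security relies on public-key credentials. The functions `delegate` and `verify` and the token fields are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, message: str) -> str:
    """Keyed signature over the token fields."""
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


def delegate(secret, user, program, expires_at):
    """The user signs once; the token lets `program` act for `user` until it expires."""
    message = f"{user}|{program}|{expires_at}"
    return {
        "user": user,
        "program": program,
        "expires_at": expires_at,
        "signature": sign(secret, message),
    }


def verify(secret, token, now):
    """Any site holding the secret can check the token without contacting the user."""
    message = f"{token['user']}|{token['program']}|{token['expires_at']}"
    expected = sign(secret, message)
    return hmac.compare_digest(expected, token["signature"]) and now < token["expires_at"]


secret = b"demo-secret"  # hypothetical key shared by the VO's sites
token = delegate(secret, user="alice", program="gridminer-job", expires_at=1000)

# The same token is accepted at two sites: one sign-on, several resources.
accepted_at_site_a = verify(secret, token, now=500)
accepted_at_site_b = verify(secret, token, now=600)
```

The expiry time limits how long a delegated program can act for the user, and any change to the token fields invalidates the signature.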

Slide 22: Resource: Sharing Single Resources
The Resource layer defines protocols (and APIs and SDKs) for secure initiation, monitoring, and control of sharing operations on individual resources.
The primary classes of Resource layer protocols:
- Information protocols are used to obtain information about the structure and state of a resource, e.g., its configuration, current load, and usage policy.
- Management protocols are used to negotiate access to a shared resource, specifying, for example, resource requirements and the operations to be performed, such as process creation or data access. A protocol may support monitoring the status of an operation and controlling (e.g., terminating) the operation.

Slide 23: Collective: Coordinating Multiple Resources
The Collective layer contains protocols and services (and APIs and SDKs) that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources. This layer can, e.g., implement:
- Directory services, which allow VO participants to discover the existence and/or properties of VO resources.
- Co-allocation, scheduling, and brokering services, which allow VO participants to request the allocation of one or more resources for a specific purpose and the scheduling of tasks on the appropriate resources.
- Monitoring and diagnostics services, which support the monitoring of VO resources for failure, adversarial attack ("intrusion detection"), overload, and so forth.
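A directory lookup followed by a brokering decision can be sketched in a few lines. The example assumes the directory is a plain list of resource descriptions and that the broker simply prefers the least-loaded resource that meets the requirement; the functions `discover` and `broker` and the resource entries are hypothetical.

```python
def discover(directory, min_cpus):
    """Directory service: find resources whose properties meet the request."""
    return [entry for entry in directory if entry["cpus"] >= min_cpus]


def broker(directory, min_cpus):
    """Brokering service: pick the least-loaded suitable resource, or None."""
    candidates = discover(directory, min_cpus)
    if not candidates:
        return None
    return min(candidates, key=lambda entry: entry["load"])


# Hypothetical resources registered by a VO.
directory = [
    {"name": "cluster_a", "cpus": 64, "load": 0.80},
    {"name": "cluster_b", "cpus": 32, "load": 0.25},
    {"name": "workstation_c", "cpus": 4, "load": 0.05},
]
choice = broker(directory, min_cpus=16)
```

Here the broker never talks to a single resource directly; it reasons over the whole collection, which is what distinguishes Collective services from Resource layer protocols.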

Slide 24: Collective (2)
- Data replication services support the management of VO storage (and perhaps also network and computing) resources to maximize data access performance with respect to metrics such as response time, reliability, and cost.
- Grid-enabled programming systems enable familiar programming models to be used in Grid environments, e.g., a Grid-enabled implementation of the Message Passing Interface (MPI).
- Software discovery services discover and select the best software implementation and execution platform based on the parameters of the problem being solved.
- Community authorization servers enforce community policies governing resource access.
- Collaboratory services support the coordinated exchange of information within potentially large user communities.

Slide 25: Applications
Applications are constructed in terms of, and by calling upon, services defined at any layer.
Effective application development can often benefit from the use of higher-level languages and frameworks (e.g., the Common Component Architecture, CORBA, etc.). These higher-level systems can build on protocols, services, and APIs provided within the Grid architecture.

Slide 26: Protocols, Services, and Interfaces Occur at Each Level
[Diagram; labels: Applications; Languages/Frameworks; Collective Service APIs and SDKs; Collective Service Protocols; Collective Services; Resource APIs and SDKs; Resource Service Protocols; Resource Services; Connectivity APIs; Connectivity Protocols; Local Access APIs and Protocols; Fabric Layer]

Slide 27: Data Grid
The need for Data Grids stems from the fact that scientific applications such as data analysis in High Energy Physics, climate modeling, or earth observation are very data intensive, and a large community of researchers all around the globe wants fast access to the data.
Future Data Grid applications: Medical Grids and E-Business Grids.
Grid Data Warehousing and Grid Data Mining: a new and challenging field.

Slide 28: Storage Model
There are 2 different kinds of files:
- Master files (owned by their creators)
- Replica files. There may be many replicas of a master file. Replicas are owned by, managed by, and may be deleted by, the Grid.
The notion of replicas is new, and critical in a Grid environment.
Example: Before a DataGrid job can run at site A, data at site B may need to be copied to site A. This data may then be used by subsequent jobs at site A, or may be needed by jobs at site C, which has a better network connection to site A than to site B. For this reason, the data should be kept at site A as long as possible.
The ReplicaManager keeps track of all replica data so that the replica selection service can select the optimal replica to use for a given job, or request the creation of a new replica.
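The site A, B, C example can be replayed with a small sketch of replica bookkeeping and selection. It assumes that "optimal" means the lowest transfer cost to the job's site and that costs are known in advance; the class below only borrows the name ReplicaManager from the slide, and its methods, the `select_replica` function, the file name, and the cost table are hypothetical.

```python
class ReplicaManager:
    """Keeps track of which sites hold a replica of each logical file."""

    def __init__(self):
        self._sites_by_file = {}

    def register(self, logical_name, site):
        self._sites_by_file.setdefault(logical_name, set()).add(site)

    def locations(self, logical_name):
        return set(self._sites_by_file.get(logical_name, ()))


def select_replica(manager, logical_name, job_site, transfer_cost):
    """Return the site to read from: a local replica if any, else the cheapest."""
    sites = manager.locations(logical_name)
    if not sites:
        raise LookupError(f"no replica of {logical_name!r}")
    if job_site in sites:
        return job_site
    # sorted() makes the choice deterministic when two costs are equal.
    return min(sorted(sites), key=lambda site: transfer_cost[(site, job_site)])


# Hypothetical transfer costs (source site, destination site).
transfer_cost = {("B", "A"): 3.0, ("A", "C"): 1.0, ("B", "C"): 5.0}

manager = ReplicaManager()
manager.register("run42.dat", "B")                 # data initially at site B
first_source = select_replica(manager, "run42.dat", "A", transfer_cost)

manager.register("run42.dat", "A")                 # copy kept at site A
second_source = select_replica(manager, "run42.dat", "C", transfer_cost)
```

The job at site C reads from site A rather than site B, which is why keeping the replica at site A pays off.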

Slide 29: SQLDatabaseService
This service allows very large amounts of metadata held in any type of local or remote RDBMS to be stored, retrieved, and queried efficiently.
The database can be used for the implementation of catalogs.

GridMiner: A Framework for Data Mining on Grids
A new research field

Slide 31: Architecture of a Data Mining System
[Diagram; labels: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Knowledge base; Database; Data warehouse; Filtering; Data cleaning, data integration]

Slide 32: Decomposition of a Knowledge Discovery Process
Preprocessing
- data cleaning
- data transformation
- data reduction
Data mining (e.g., association rules)
- find frequent itemsets
- generate association rules
Evaluation of discovered patterns
Graphical User Interface
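The two data mining steps named above, finding frequent itemsets and generating association rules, can be sketched on a toy data set. This is a brute-force illustration that enumerates candidate itemsets level by level, not an optimized algorithm; the function names, thresholds, and transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t) / n
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none of a larger size
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) for rules that pass the threshold."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                left = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                confidence = support / itemsets[left]
                if confidence >= min_confidence:
                    rules.append((left, itemset - left, confidence))
    return rules


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.9)
```

On this data the only rule that reaches the confidence threshold is "butter implies bread", since every transaction containing butter also contains bread.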

Slide 33: Our Philosophy
Data mining systems can be decomposed into a set of communicating components → distributed component architecture.
Placement of data-processing functionalities is critical.
Grid data mining research is tightly coupled to the ongoing work on parallel I/O for Grids (e.g., the Armada project at Dartmouth College, USA).

Slide 34: Basic Grid Data Mining Models
1. Local data analysis followed by the generation of a global data model, adapting distributed data mining techniques. No data replication.
2. Data mining system components are optimally located on the Grid. No dynamic data replication.
3. Data mining system components are optimally located on the Grid. Dynamic data replication is considered.

Slide 35: Data Storage and the Components
[Diagram; labels: Site A, Site B, Site C, Site D, Site E, Preprocessing, Preprocessing, Local DM, Construction of the Global Model, GUI]