1 Towards an Open Service Framework for Cloud-based Knowledge Discovery Domenico Talia ICAR-CNR & UNIVERSITY OF CALABRIA, Italy Cloud.

Slides:



Advertisements
Similar presentations
Bringing Grid & Web Services Together
Advertisements

Towards Data Mining Without Information on Knowledge Structure
A Lightweight Platform for Integration of Mobile Devices into Pervasive Grids Stavros Isaiadis, Vladimir Getov University of Westminster, London {s.isaiadis,
1 Senn, Information Technology, 3 rd Edition © 2004 Pearson Prentice Hall James A. Senns Information Technology, 3 rd Edition Chapter 7 Enterprise Databases.
1. 2 Configuring the Cloud Inside and out Paul Anderson publications/mysore-2010-talk.pdf School of.
1 Copyright © 2010, Elsevier Inc. All rights Reserved Fig 2.1 Chapter 2.
11 Application of CSF4 in Avian Flu Grid: Meta-scheduler CSF4. Lab of Grid Computing and Network Security Jilin University, Changchun, China Hongliang.
1 Mixing Public and private clouds a Practical Perspective Maarten Koopmans Nordunet Conference 2009 Maarten Koopmans Nordunet Conference 2009.
© 2010 Pearson Addison-Wesley. All rights reserved. Addison Wesley is an imprint of Chapter 11: Structure and Union Types Problem Solving & Program Design.
Towards a GRID Operating System: from GLinux to a Pervasive GVM Domenico TALIA DEIS University of Calabria ITALY CoreGRID Workshop.
1 From Grids to Service-Oriented Knowledge Utilities research challenges Thierry Priol.
C. Mastroianni, D. Talia, O. Verta - A Super-Peer Model for Resource Discovery Services in Grids A Super-Peer Model for Building Resource Discovery Services.
Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
How Distributed Data Mining Tasks can Thrive as Services on Grids Domenico Talia and Paolo Trunfio Università della Calabria, Italy
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Title Subtitle.
Addition Facts
Year 6 mental test 5 second questions
GridPP July 2003Stefan StonjekSlide 1 SAM middleware components Stefan Stonjek University of Oxford 7 th GridPP Meeting 02 nd July 2003 Oxford.
Chapter 1 Introduction Copyright © Operating Systems, by Dhananjay Dhamdhere Copyright © Introduction Abstract Views of an Operating System.
Secure Virtual Machine Execution Under an Untrusted Management OS Chunxiao Li Anand Raghunathan Niraj K. Jha.
Distributed and Parallel Processing Technology Chapter2. MapReduce
Shredder GPU-Accelerated Incremental Storage and Computation
The Platform as a Service Model for Networking Eric Keller, Jennifer Rexford Princeton University INM/WREN 2010.
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 31 Slide 1 Service-centric Software Engineering.
ABC Technology Project
Describing Complex Products as Configurations using APL Arrays.
Kensington Oracle Edition: Open Discovery Workflow Meets Oracle 10g Professor Yike Guo.
25 July, 2014 Hailiang Mei, TU/e Computer Science, System Architecture and Networking 1 Hailiang Mei Remote Terminal Management.
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
1 Evaluations in information retrieval. 2 Evaluations in information retrieval: summary The following gives an overview of approaches that are applied.
IONA Technologies Position Paper Constraints and Capabilities for Web Services
Database System Concepts and Architecture
31242/32549 Advanced Internet Programming Advanced Java Programming
Executional Architecture
Global Analysis and Distributed Systems Software Architecture Lecture # 5-6.
Chapter 5 Test Review Sections 5-1 through 5-4.
HJ-Hadoop An Optimized MapReduce Runtime for Multi-core Systems Yunming Zhang Advised by: Prof. Alan Cox and Vivek Sarkar Rice University 1.
KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin.
Addition 1’s to 20.
25 seconds left…...
Week 1.
We will resume in: 25 Minutes.
Copyright © 2012 Pearson Education, Inc. Chapter 12: Theory of Computation Computer Science: An Overview Eleventh Edition by J. Glenn Brookshear.
A SMALL TRUTH TO MAKE LIFE 100%
Choosing an Order for Joins
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 31 Slide 1 Service-centric Software Engineering 1.
Chapter 13 The Data Warehouse
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 13 Slide 1 Application architectures.
How Cells Obtain Energy from Food
1 Distributed Agents for User-Friendly Access of Digital Libraries DAFFODIL Effective Support for Using Digital Libraries Norbert Fuhr University of Duisburg-Essen,
11/19/2002Yun (Helen) He, SC20021 MPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi- Dimensional Array.
Large Scale Computing Systems
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
WORKFLOWS IN CLOUD COMPUTING. CLOUD COMPUTING  Delivering applications or services in on-demand environment  Hundreds of thousands of users / applications.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Ch 4. The Evolution of Analytic Scalability
Energy Issues in Data Analytics Domenico Talia Carmela Comito Università della Calabria & CNR-ICAR Italy
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
DOE BER Climate Modeling PI Meeting, Potomac, Maryland, May 12-14, 2014 Funding for this study was provided by the US Department of Energy, BER Program.
Master Thesis Defense Jan Fiedler 04/17/98
DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS University of Calabria ITALY Grid-Based Data Mining and.
Service - Oriented Middleware for Distributed Data Mining on the Grid ,劉妘鑏 Antonio C., Domenico T., and Paolo T. Journal of Parallel and Distributed.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
University of Technology
Data Warehousing and Data Mining
Ch 4. The Evolution of Analytic Scalability
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

1 Towards an Open Service Framework for Cloud-based Knowledge Discovery Domenico Talia ICAR-CNR & UNIVERSITY OF CALABRIA, Italy Cloud Futures 2010 – April 8-9, Redmond, WA

2 Goal Discuss a strategy based on the use of services for the design of distributed knowledge discovery tasks and applications on Cloud, Grids and large distributed systems. Outline how service-oriented knowledge discovery tasks can be developed as a collection of Grid/Web/Cloud services. Investigate how they can be used to develop distributed data analysis applications exploiting the SOA model in a distributed computing scenario.

3 Complex Big Problems Bigger and more complex problems must be solved by large scale distributed computing. DATA SOURCES are larger and larger and distributed. The main problem is not storing DATA, it is analyse, mine, and process DATA for understanding it.

4 Today the information stored in digital data archives is enormous and its size is still growing very rapidly. Data Availability or Data Deluge? The world has created or 750 exabytes (750 billion gigabytes) of digital information in In 2010, it will create more than 1 zettabyte. ( source: IDC)

5 Whereas until some decades ago the main problem was the shortage of information, the challenge now seems to be the very large volume of information to deal with and the associated complexity to process it and to extract significant and useful parts or summaries. Data Availability or Data Deluge?

6 Distributed Data Intensive Apps The use of computers changed our way to make discoveries and is improving both speed and quality of the scientific discovery processes. In this scenario HPC, Cloud and Grid systems provide an effective computational support for distributed data intensive application and for knowledge discovery from large and distributed data sets. Grid systems, parallel computers, and cloud computing systems are key technologies for e-Science. They can be used in integrated frameworks through service interfaces.

7 Distributed Data Mining on Clouds Knowledge discovery (KDD) and data mining (DM) are: Compute- and data-intensive processes/tasks Often based on distribution of data, algorithms, and users. Large scale systems like Clouds and Grids integrate both distributed computing and parallel computing, thus they are key infrastructures for high-performance distributed knowledge discovery. (Data Analytics Clouds) They also offer security, resource information, data access and management, communication, scheduling, fault detection, …

8 Distributed Data Analysis Patterns Data parallelism? Task parallelism? Managing data dependencies Data management: input, intermediate, output Dynamic task graphs/workflows (data dependencies) Dynamic data access involving large amounts of data Parallel data mining and/or Distributed data mining Programming distributed mining operations/taks/patterns 8

9 Programming Levels Grain size MPI, OpenMP, threads, MapReduce, RMI, HPF,… Components, Patterns, Distributed Objects, … Web Services, Grid Services, Workflows, Mushup, … Process #

10 Services for distributed data mining Exploiting the SOA model it is possible to define basic services for supporting distributed data mining tasks/applications in large scale distributed systems for science and industry (from a private Cloud to Interclouds). Those services can address all the aspects that must be considered in data mining and in knowledge discovery processes data selection and transport services, data analysis services, knowledge models representation services, and knowledge visualization services.

11 Collection of Services for Distributed Data Mining It is possible to define services corresponding to Single KDD Steps All steps that compose a KDD process such as preprocessing, filtering, and visualization are expressed as services. Single Data Mining Tasks Here are included tasks such as classification, clustering, and association rules discovery. Distributed Data Mining Patterns This level implements, as services, patterns such as collective learning, parallel classification and meta-learning models. Data Mining Applications or KDD processes This level includes the previous tasks and patterns composed in a multi-step workflow.

12 This collection of data mining services can constitute an Open Service Framework for Grid-based Data Mining Allowing developers to program distributed KDD processes as a composition of single and/or aggregated services available over a Cloud. Those services should exploit other basic Cloud services for data transfer, replica management, data integration and querying. Data mining services Open Service Framework for Cloud-based Data Mining

13 Data mining Cloud services By exploiting the Cloud services features it is possible to develop data mining services accessible every time and everywhere (remotely and from small devices). This approach may result in Service-based distributed data mining applications Data mining services for communities/virtual organizations. Distributed data analysis services on demand. A sort of knowledge discovery eco-system formed of a large numbers of decentralized data analysis services.

14 Data Mining Services: Are they programming abstractions? Apparently not, in a traditional approach. Yes, if we consider the user and application requirements in handling data and in understanding what is useful in it. Basic services as simple operations; Service programming languages for composing them; Complex services and their complex composition; Towards distributed programming patterns for services.

15 Services for Distributed Data Mining Service-based systems we developed Weka4WS Knowledge Grid Mobile Data Mining Services

16 Weka4WS KnowledgeFlow Programming a data mining workflow and run them in parallel.

17 Service-oriented Knowledge Grid

18 Knowledge Grid: Application Design start end data mining file transfer voting schema: /../DMService.wsdl operation: clusterize argument: EM argument: -I 100 -N -1 -S data mining file transfer schema: /../DMService.wsdl operation: classify argument: J48 argument: -C M 2...

19 An example Example of distributed classification The dataset is split into smaller chunks Each chunk is processed in parallel The best model of each processing is chosen

20 Service Oriented Mobile Data Mining The main reaserch goal here is to support a user to access data mining services on mobile devices. The system includes three components: Data providers Mining servers Mobile clients Data provider Mining server Data store Mining server Data store Mobile client Data provider Mobile client Data provider Mobile client

21 Service Oriented Mobile Data Mining A user can choose the mining algorithm and select which part of a result (data mining model) he wants to visualize.

22 The Public Resource Computing paradigm (PRC) is currently used to execute large scientific applications with the help of private computers PRC model can be exploited to program to P2P data mining tasks involving hundreds or thousands of nodes. Highly decentralized data analysis tasks can be programmed as large collections of tasks or services.

23 Impact of Service Overhead Execution times dataset download data mining In a Grid scenario the data mining step represents from 85% to 88% of the total execution time, the dataset download takes about 11%, while the other steps range from 0.5% to 4%.

24 Weka4WS: Application Speedup

25 Summary New HPC infrastructures allow us to attack new problems, BUT require to solve more challenging problems. New programming models and environments are required Data is becoming a BIG player, programming data analysis applications and services is a must. New ways to efficiently compose different models and paradigms are needed. Relationships between different programming levels (from libraries to services) must be addressed. In a long-term vision, pervasive collections of data analysis services and applications must be accessed and used as public utilities. We must be ready for managing with this scenario.

26 Thanks