A Grid-Based Middleware’s Support for Processing Distributed Data Streams Liang Chen Advisor: Gagan Agrawal Computer Science & Engineering
Introduction-Motivation
  Data stream processing and analysis
    Data stream: data arrive continuously and need to be processed in real time
  Data stream applications:
    Online network intrusion detection
    Sensor networks
    Network fault management system for telecommunication network elements
    Computer vision based surveillance
  Common features of data streams
    Continuous arrival
    Enormous volume
    Real-time constraints
    Data sources could be distributed
Introduction-Motivation Network Fault Management System analyzing alarm message streams [Figure labels: Switch Network, X, Network Fault Management System]
Introduction-Motivation Computer Vision Based Surveillance
Introduction-Motivation
  Challenges & possible solutions
    Challenge 1: processing is data- and/or computation-intensive
      Solution: Grid computing technologies
    Challenge 2: real-time analysis is required
      Solution: self-adaptation functionality is desired
Introduction-Motivation
  From the point of view of developers who are interested in data stream applications
    They would like to concentrate on the applications themselves
    They would not like to spend effort on
      Grid computing
      Adaptation functionality
Introduction-Our Approach
  A middleware that is based on Grid standards and tools and provides self-adaptation functionality
  The middleware is referred to as GATES (Grid-based AdapTive Execution on Streams)
    Applications are automatically distributed to proper computing nodes
    Applications automatically adapt to a varying environment without developers having to implement adaptation algorithms
System Architecture and Design (From Application Perspective)
  Break down a task into several sub-tasks so that the sub-tasks form a pipeline
  Implement each sub-task in Java
  Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.
    specify how many stages (sub-tasks) the pipeline has
    specify where the code implementing the sub-tasks resides
  Launch the application by running a Java program (StreamClient.class) provided by GATES
System Architecture and Design (Architecture) [Figure: Stage A, Stage B, and Stage C of an application. Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]
System Architecture and Design (Example)
public class Sampling-Stage implements StreamProcessing {
    …
    void init() {…}
    void work(buffer in, buffer out) {
        // Tell GATES about the adjustment parameter and its range
        GATES.Information-About-Adjustment-Parameter(min, max, 1);
        while (true) {
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
            // Pick up the value that GATES currently suggests
            sampling-ratio = GATES.getSuggestedParameter();
        }
    }
}
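The stage example above relies on GATES classes that are not shown here. As a rough, self-contained illustration of the same shape (stages that read from an input buffer, process, and write to an output buffer), the sketch below chains two sampling stages with bounded queues from the Java standard library. Every name in it (`SketchStage`, `SamplingSketchStage`, `PipelineSketch`, the `END` marker) is hypothetical and is not part of GATES; integers stand in for images, and "keep every k-th item" stands in for sampling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a pipeline stage; this is NOT the GATES StreamProcessing interface.
interface SketchStage {
    void work(BlockingQueue<Integer> in, BlockingQueue<Integer> out) throws InterruptedException;
}

// Forwards every k-th item it sees, a crude analogue of sampling.
class SamplingSketchStage implements SketchStage {
    static final int END = Integer.MIN_VALUE; // end-of-stream marker (an assumption of this sketch)
    private final int keepEvery;

    SamplingSketchStage(int keepEvery) {
        if (keepEvery < 1) throw new IllegalArgumentException("keepEvery must be >= 1");
        this.keepEvery = keepEvery;
    }

    @Override
    public void work(BlockingQueue<Integer> in, BlockingQueue<Integer> out) throws InterruptedException {
        int seen = 0;
        while (true) {
            int item = in.take();                       // blocks until the upstream stage delivers
            if (item == END) { out.put(END); return; }  // pass the marker on and stop
            if (seen++ % keepEvery == 0) { out.put(item); }
        }
    }
}

class PipelineSketch {
    private interface Task { void run() throws InterruptedException; }

    // Pushes 0..n-1 through two sampling stages joined by bounded queues and collects the output.
    static List<Integer> run(int n, int keepFirst, int keepSecond) {
        BlockingQueue<Integer> source = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> middle = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> sink = new ArrayBlockingQueue<>(16);
        SketchStage first = new SamplingSketchStage(keepFirst);
        SketchStage second = new SamplingSketchStage(keepSecond);

        startDaemon(() -> {                             // producer standing in for a data source
            for (int i = 0; i < n; i++) source.put(i);
            source.put(SamplingSketchStage.END);
        });
        startDaemon(() -> first.work(source, middle));
        startDaemon(() -> second.work(middle, sink));

        List<Integer> results = new ArrayList<>();
        try {
            for (int item = sink.take(); item != SamplingSketchStage.END; item = sink.take()) {
                results.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the pipeline", e);
        }
        return results;
    }

    private static void startDaemon(Task task) {
        Thread t = new Thread(() -> {
            try {
                task.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```

Calling `PipelineSketch.run(10, 2, 2)` sends ten items through both stages. The bounded queues make a slow stage push back on the one before it, which is the kind of queue load that a self-adaptation scheme can observe.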
Self-adaptation Algorithm
  Given a queue’s long-term factor at each stage, we want to improve the method of adjusting values of an adaptation parameter
    Should the adaptation parameter be modified, and if so, in which direction?
    How to find a new value (update the value) of the adaptation parameter?
Enhanced Self-adaptation Algorithm Should the adaptation parameter be modified, and if so, in which direction? The answer is related to load status of queues at two consecutive stages
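The exact decision rule is not spelled out in this text, so the following is only a sketch of what a direction decision based on two consecutive queues could look like. It assumes that the long-term factor lies in [-1, 1], that a single threshold separates the load states, and that a larger parameter value means more work for the current stage and more output for the next one. The names `LoadState`, `Direction`, and `DirectionSketch` are hypothetical, and the rule itself is an illustration, not the GATES algorithm.

```java
// Illustrative only: states, threshold, and rule are assumptions, not the GATES algorithm.
enum LoadState { UNDERLOADED, NORMAL, OVERLOADED }

enum Direction { DECREASE, HOLD, INCREASE }

class DirectionSketch {
    // Classifies a queue from a long-term factor assumed to lie in [-1, 1].
    static LoadState classify(double longTermFactor, double threshold) {
        if (longTermFactor > threshold) return LoadState.OVERLOADED;
        if (longTermFactor < -threshold) return LoadState.UNDERLOADED;
        return LoadState.NORMAL;
    }

    // Assumes a larger parameter value means more work here and more output downstream.
    static Direction decide(LoadState own, LoadState next) {
        if (own == LoadState.OVERLOADED || next == LoadState.OVERLOADED) {
            return Direction.DECREASE;  // some queue is filling up: shed work
        }
        if (own == LoadState.UNDERLOADED && next == LoadState.UNDERLOADED) {
            return Direction.INCREASE;  // both queues have slack: spend it on quality
        }
        return Direction.HOLD;          // otherwise leave the parameter alone
    }
}
```

Under these assumptions, an overloaded queue at either stage pushes the parameter down, and the parameter only goes up when both queues have slack.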
Enhanced Self-adaptation Algorithm Performance Parameter [Figure: configurations of stages A, B, and C, grouped into Convergent States and Non-Convergent States]
Enhanced Self-adaptation Algorithm Summary of Load States
Enhanced Self-adaptation Algorithm
  How to determine the new value for the adaptation parameter
    Linear update: increase or decrease by a fixed value
      Hard to find a proper fixed value
      Previous method
    Binary tree search
Enhanced Self-adaptation Algorithm [Figure: binary tree search. First panel: Left Border, Current Value, New Value, Right Border. Second panel: Left Border, Current Value, Right Border]
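One way to read the border labels above is as an interval that is halved on every update: the current value becomes the border on the side being left behind, and the new value is the midpoint between it and the opposite border. The sketch below implements that reading. It is an interpretation of the figure labels rather than the documented GATES procedure, and the class and method names are hypothetical.

```java
// Illustrative sketch; class and method names are assumptions, not GATES code.
class BorderSearchSketch {
    private double left;     // left border of the interval still being searched
    private double right;    // right border of the interval still being searched
    private double current;  // current value of the adaptation parameter

    BorderSearchSketch(double min, double max, double initial) {
        if (!(min <= initial && initial <= max)) {
            throw new IllegalArgumentException("initial value must lie in [min, max]");
        }
        this.left = min;
        this.right = max;
        this.current = initial;
    }

    // Moves the left border to the current value, then jumps halfway to the right border.
    double increase() {
        left = current;
        current = (current + right) / 2.0;
        return current;
    }

    // Moves the right border to the current value, then jumps halfway to the left border.
    double decrease() {
        right = current;
        current = (left + current) / 2.0;
        return current;
    }

    double current() {
        return current;
    }
}
```

Starting from `new BorderSearchSketch(0.0, 1.0, 0.5)`, a call to `increase()` yields 0.75 and a following `decrease()` yields 0.625, so the step size shrinks without a fixed increment having to be chosen. In this sketch the borders only ever narrow; how they would be widened again when conditions change is not covered here.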
Data Mining Applications & System Evaluation Two Data mining applications Clustream: Clustering data arriving in data streams
Data Mining Applications & System Evaluation Dist-Freq-Counting: finding frequent itemsets from distributed streams
Resource Allocation Schemes
  Problem definition
    Grid resource scheduling for pipelined processing and real-time distributed streaming applications
    Mapping workflows onto a Grid is an NP-complete problem
  Static part: the resource allocation problem for GATES is to determine a deployment configuration
  Dynamic part
Static Allocation Scheme
  Static allocation problem: determining a deployment configuration
  Objective: automatically generate a deployment configuration from the information about available resources
    The number of data sources and their locations
    The destination
    The number of stages making up the pipeline
    The number of instances of each stage
    How the instances connect to each other
    The node where each instance is placed
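The items listed above can be collected into a small data model. The sketch below is one hypothetical way to do so, with a completeness check that every stage has at least one instance and that every instance of a non-final stage feeds some instance of the next stage. The class names, field names, and the check are assumptions made for illustration; they are not the GATES configuration format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data model for a deployment configuration; names are assumptions.
class StageInstance {
    final int stage;                                          // index of the pipeline stage
    final String node;                                        // node where this instance is placed
    final List<StageInstance> downstream = new ArrayList<>(); // instances it sends data to

    StageInstance(int stage, String node) {
        this.stage = stage;
        this.node = node;
    }
}

class DeploymentConfigSketch {
    final List<String> sourceNodes;  // locations of the data sources
    final String destinationNode;    // where the final results are delivered
    private final List<List<StageInstance>> stages = new ArrayList<>();

    DeploymentConfigSketch(List<String> sourceNodes, String destinationNode, int stageCount) {
        this.sourceNodes = new ArrayList<>(sourceNodes);
        this.destinationNode = destinationNode;
        for (int i = 0; i < stageCount; i++) {
            stages.add(new ArrayList<>());
        }
    }

    // Places a new instance of the given stage on the given node.
    StageInstance place(int stage, String node) {
        StageInstance instance = new StageInstance(stage, node);
        stages.get(stage).add(instance);
        return instance;
    }

    // Connects an instance to an instance of the immediately following stage.
    void connect(StageInstance from, StageInstance to) {
        if (to.stage != from.stage + 1) {
            throw new IllegalArgumentException("instances must belong to consecutive stages");
        }
        from.downstream.add(to);
    }

    int stageCount() {
        return stages.size();
    }

    int instanceCount(int stage) {
        return stages.get(stage).size();
    }

    // True if every stage has an instance and every non-final instance has a successor.
    boolean isComplete() {
        for (int s = 0; s < stages.size(); s++) {
            List<StageInstance> instances = stages.get(s);
            if (instances.isEmpty()) return false;
            boolean last = (s == stages.size() - 1);
            for (StageInstance instance : instances) {
                if (!last && instance.downstream.isEmpty()) return false;
            }
        }
        return true;
    }
}
```

A static allocation scheme would then amount to choosing the arguments of `place` and `connect` from the information about available resources.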
Static Allocation Scheme Examples of deployment configurations
Related work
  Grid resource allocation
    Condor, Realtor, ACDS, etc.
    Main difference: our work focuses on Grid resource allocation for workflow applications
  Adaptation through a middleware
    Cheng et al.’s adaptation framework, SWiFT, Conductor, DART, ROAM
    Main difference: our work focuses on general support for adaptation at run time
Summary
  Grid computing could be an effective solution for distributed data stream processing
  GATES
    Distributed processing
    Exploits Grid web services
    Self-adaptation to meet the real-time constraints
    Grid resource allocation schemes