1 A Grid-Based Middleware’s Support for Processing Distributed Data Streams Liang Chen Advisor: Gagan Agrawal Computer Science & Engineering.


2 Roadmap Introduction –Motivation –Overview of our research System Overview and Initial Evaluation –Introduce system architecture and design –Discuss a self-adaptation algorithm –Evaluation Supporting Self-Adaptation in Stream Data Mining –Improve the self-adaptation algorithm –Evaluate the system by using two data mining applications Resource Allocation Schemes –Motivation –Static Resource Allocation Scheme –Dynamic Resource Allocation Scheme Future work –How to improve Dynamic Resource Allocation Scheme –Time-Varying Visualization Related work Summary

3 Introduction- Motivation Data stream processing and analysis –Data stream: data arrives continuously and needs to be processed in real time –Data stream applications: Network Fault Management System for telecommunication network elements; computer vision based surveillance; online network intrusion detection; traffic sensors

4 Introduction- Motivation Common features of data streams –Continuous arrival –Enormous volume –Real-time constraints –Data sources could be distributed

5 Introduction- Motivation Network Fault Management System analyzing alarm message streams [Figure labels: Switch Network, X, Network Fault Management System]

6 Introduction- Motivation Computer Vision Based Surveillance

7 Introduction- Motivation Challenges & possible solutions –Challenge 1: data and/or computation intensive

8 Introduction- Motivation Challenges & possible solutions –Challenge 1: data and/or computation intensive –Solution: Grid computing technologies

9 Introduction- Motivation Challenges & possible solutions –Challenge 1: data and/or computation intensive –Solution: Grid computing technologies –Challenge 2: real-time analysis is required –Solution: self-adaptation functionality is desired

10 Introduction- Motivation From the point of view of developers who are interested in data stream applications –Would like to concentrate on the applications themselves –Would not like to spend effort on: Grid computing; the adaptation function

11 Introduction- Our Solution GATES (Grid-based AdapTive Execution on Stream) is a middleware that supports distributed data stream processing [Figure labels: Internet, Globus-OGSA, GATES, Applications, Web service]

12 Introduction- Our Solution Middleware: –Built on OGSA –Self-adaptive –Grid resource scheduling schemes Applications built on GATES are: –Automatically distributed to proper computing nodes –Automatically self-adaptive to a varying environment, without implementing specific algorithms or multiple versions –Automatically optimized for system performance

14 System Architecture and Design (From Application Perspective) –Break the task down into several sub-tasks so that the sub-tasks form a pipeline –Implement each sub-task in Java –Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e. specify how many stages the pipeline has and where the code that processes each sub-task resides –Launch the application by running a Java program (StreamClient.class) provided by GATES

15 System Architecture and Design (Architecture)

16 System Architecture and Design (Architecture) [Figure: pipeline of Stage A, Stage B, and Stage C. Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]

17 System Architecture and Design (Example)
public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
        …
        while (true) {
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
        }
        …
    }
}
Slide callouts: sampling-ratio = GATES.getSuggestedParameter(); GATES.Information-About-Adjustment-Parameter(min, max, 1)

18 Adaptation algorithm Goal: balanced pipeline Observation –Adaptation parameters exist: performance parameters; accuracy parameters Issues & Solutions –No specific information about applications: expose adaptation parameter(s) to GATES, and use queueing theory to tune the adaptation parameters –Filter out short-term bursts while staying sensitive to long-term behavior: long-term load factor –Quickly find converged values of the tunable (adaptation) parameters

19 Equations Adaptation algorithm
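The equations from this slide did not survive in the transcript. As a rough illustration of what a "long-term load factor" could look like, the Java sketch below smooths queue occupancy with an exponentially weighted moving average, so a single burst moves the value only slightly while sustained load moves it steadily. The class name, the smoothing weight `alpha`, and the occupancy-based definition are all assumptions made for illustration; they are not the formula GATES uses.

```java
/** Hypothetical long-term load factor: an exponentially weighted moving average of queue occupancy. */
final class LongTermLoadFactor {
    private final double alpha; // assumed smoothing weight in (0, 1]; smaller reacts more slowly
    private double value;       // smoothed occupancy in [0, 1]
    private boolean seeded;     // false until the first observation arrives

    LongTermLoadFactor(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Folds one observation (current queue length over capacity) into the average. */
    double update(int queueLength, int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        double occupancy = Math.min(1.0, Math.max(0.0, (double) queueLength / capacity));
        value = seeded ? alpha * occupancy + (1.0 - alpha) * value : occupancy;
        seeded = true;
        return value;
    }

    double get() {
        return value;
    }
}
```

With a small `alpha`, one full-queue sample after a quiet period raises the factor only a little, whereas a long run of full-queue samples drives it toward 1.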

20 Initial Evaluation Two applications –A counting sample application –A computational steering application Three experiments were conducted –The first ran the counting sample application on GATES –The other two ran the computational steering application

21 Initial Evaluation Experiment One: Non-adaptive vs. Adaptive Version
Performance comparison
Network Bandwidth (Kilo-Byte/sec.) | 40 (sec.) | 80 (sec.) | 120 (sec.) | 160 (sec.) | Adaptive Version (Kilo-Byte/sec.)
1    | 462.3 | 612.9 | 459.9 | 671   | 463.5
10   | 187.7 | 193.3 | 509.1 | 302.1 | 234.9
100  | 246.4 | 466.7 | 296.2 | 371.6 | 387.1
1000 | 240.4 | 298.8 | 307.7 | 478   | 399.9
Accuracy comparison
Network Bandwidth (Kilo-Byte/sec.) | 40 (sec.) | 80 (sec.) | 120 (sec.) | 160 (sec.) | Adaptive Version (Kilo-Byte/sec.)
1    | 0.891 | 0.962 | 0.981 | 0.987 | 0.986
10   | 0.896 | 0.963 | 0.983 | 0.992 | 0.986
100  | 0.887 | 0.957 | 0.979 | 0.988 | 0.974
1000 | 0.879 | 0.963 | 0.983 | 0.989 | 0.988

22 Initial Evaluation The Experiment Two: Self-Adaptation with Different Processing Requirements

23 Initial Evaluation The Experiment Three: Self-Adaptation with Different Data Generation Rates

24 Roadmap Introduction –Motivation –Overview of our research System Overview and Initial Evaluation –Introduce system architecture and design –Discuss a self-adaptation algorithm –Evaluation Supporting Self-Adaptation in Stream Data Mining –Improve the self-adaptation algorithm –Evaluate the system by using two data mining applications Resource Allocation Schemes –Motivation –Static Resource Allocation Scheme –Dynamic Resource Allocation Scheme Future work –How to improve Dynamic Resource Allocation Scheme –Time-Varying Visualization Related work Summary

25 Enhanced Self-adaptation Algorithm Given a queue’s long-term load factor at each stage, we want to improve the method of adjusting the value of an adaptation parameter 1. Should the adaptation parameter be modified, and if so, in which direction? 2. How do we find a new value for (update the value of) the adaptation parameter?

26 Enhanced Self-adaptation Algorithm Should the adaptation parameter be modified, and if so, in which direction? –The answer is related to load status of queues at two consecutive stages

27 Enhanced Self-adaptation Algorithm [Figure: load states of a three-stage pipeline A, B, C for a performance parameter, grouped into convergent states and non-convergent states]

28 Enhanced Self-adaptation Algorithm Summary of Load States

29 Enhanced Self-adaptation Algorithm How to determine the new value for the adaptation parameter –Linear update: increase or decrease by a fixed value (it is hard to find a proper fixed value) –Previous method –Binary tree search

30 Enhanced Self-adaptation Algorithm How to determine the new value for the adaptation parameter: binary tree search
public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
        while (true) {
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
        }
        …
    }
}
Slide callouts: sampling-ratio = GATES.getSuggestedParameter(); GATES.Information-About-Adaptation-Parameter(min, max, init_value)

31 Enhanced Self-adaptation Algorithm [Figure: binary tree search over the parameter range. Labels: Left Border, Current Value, Right Border, New Value]
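The figure for this slide is lost, but its labels (left border, current value, right border, new value) suggest a bisection-style update. The Java sketch below shows one plausible reading: the parameter starts at `init_value` inside `[min, max]`, and each adjustment moves one border to the current value and jumps to the midpoint of the remaining interval. The class and method names are hypothetical, and the exact border-update rule is an assumption rather than the documented GATES behavior.

```java
/** Hypothetical bisection-style update of an adaptation parameter within [min, max]. */
final class BinarySearchParameter {
    private double left;    // left border of the interval still under consideration
    private double right;   // right border of the interval still under consideration
    private double current; // value currently suggested to the application

    BinarySearchParameter(double min, double max, double initValue) {
        if (!(min <= initValue && initValue <= max)) {
            throw new IllegalArgumentException("need min <= initValue <= max");
        }
        this.left = min;
        this.right = max;
        this.current = initValue;
    }

    /** The parameter should grow: discard everything below the current value. */
    double increase() {
        left = current;
        current = (current + right) / 2.0;
        return current;
    }

    /** The parameter should shrink: discard everything above the current value. */
    double decrease() {
        right = current;
        current = (left + current) / 2.0;
        return current;
    }

    double current() {
        return current;
    }
}
```

Compared with a fixed-step linear update, the interval halves on every adjustment, so the value settles in a logarithmic number of steps. In this sketch the borders only ever shrink, so a real system would also need some way to widen them again when the environment changes.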

33 Data Mining Applications & System Evaluation Two Data mining applications –Clustream: Clustering data arriving in data streams

34 Data Mining Applications & System Evaluation Dist-Freq-Counting: finding frequent itemsets from distributed streams

35 Data Mining Applications & System Evaluation

44 Roadmap Introduction –Motivation –Overview of our research System Overview and Initial Evaluation –Introduce system architecture and design –Discuss a self-adaptation algorithm –Evaluation Supporting Self-Adaptation in Stream Data Mining –Improve the self-adaptation algorithm –Evaluate the system by using two data mining applications Resource Allocation Schemes –Problem Definition –Static Resource Allocation Scheme –Dynamic Resource Allocation Scheme Future work –How to improve Dynamic Resource Allocation Scheme –Time-Varying Visualization Related work Summary

45 Resource Allocation Schemes Problem Definition –Grid resource scheduling for pipelined processing and real-time distributed streaming applications –Mapping workflows onto a Grid is an NP-complete problem –Static part: the resource allocation problem for GATES is to determine a deployment configuration –Dynamic part

46 Static Allocation Scheme Given: –The number of data sources and their locations –The destination –The number of stages in the pipeline To be determined: –The number of instances of each stage –How the instances connect to each other –The node where each instance is placed Static allocation problem: determining a deployment configuration Objective: automatically generate a deployment configuration from the information about available resources

47 Examples of deployment configurations Static Allocation Scheme

48 Static Allocation Scheme Challenge –Given an application having m stages, n data sources, and k available computing nodes for placement of stages’ instances –The number of possible configurations is:
$$F(2, n, k) = 1$$
$$F(m, n, k) = \sum_{i} s_n(i) \cdot F(m-1, i, k-i) \cdot P_k^i$$
$$F(3, n, k) \ge n^n$$

49 Static Allocation Scheme Assumptions: –Network bandwidth, rather than CPU capability, is the critical resource –Bandwidth inside a cluster is larger than the bandwidth of the networks connecting clusters –We know the network topology, the list of available clusters, and the resource information of the clusters –We know where the data sources and the destination are

50 Static Allocation Scheme Observation –The data arrival rates are very high at the first one or two stages –The arrival rates decrease significantly at the following stages –Prim’s algorithm to construct a Minimum Spanning Tree Three steps: –Create a key path corresponding to each data source (Prim Algorithm, MST) –Merge the key paths to create a layout tree –Map each node in a layout tree to a computing node

51 Static Allocation Scheme –Network topology

52 Static Allocation Scheme Make each data source a starting node and apply the minimum spanning tree algorithm (Prim) to the graph
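As a minimal sketch of this step, the Java code below runs Prim's algorithm from a data source over a small cost matrix and then reads off the tree path from the source to the destination, which is one way to obtain a key path. How link bandwidth is turned into an edge cost is not stated in the slides, so the `cost` matrix here is an abstract assumption (a negative entry means "no link"), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical helper: Prim's MST rooted at a data source, plus the tree path to the destination. */
final class PrimKeyPath {
    /** Returns parent[] of a minimum spanning tree rooted at source; cost[u][v] < 0 means no link. */
    static int[] prim(double[][] cost, int source) {
        int n = cost.length;
        int[] parent = new int[n];
        double[] best = new double[n];      // cheapest known edge connecting each node to the tree
        boolean[] inTree = new boolean[n];
        Arrays.fill(parent, -1);
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[source] = 0.0;
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++) {   // pick the cheapest node not yet in the tree
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            }
            if (best[u] == Double.POSITIVE_INFINITY) break; // remaining nodes are unreachable
            inTree[u] = true;
            for (int v = 0; v < n; v++) {   // relax the edges leaving u
                if (!inTree[v] && cost[u][v] >= 0 && cost[u][v] < best[v]) {
                    best[v] = cost[u][v];
                    parent[v] = u;
                }
            }
        }
        return parent;
    }

    /** Walks parent pointers back from the destination to recover the source-to-destination path. */
    static List<Integer> keyPath(int[] parent, int source, int destination) {
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = destination; v != -1; v = parent[v]) path.addFirst(v);
        if (path.getFirst() != source) {
            throw new IllegalArgumentException("destination is not connected to the source");
        }
        return path;
    }
}
```

Running `prim` once per data source and calling `keyPath` on each result yields one key path per source, which the later slides then merge into a layout tree.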

53 Static Allocation Scheme

54 Static Allocation Scheme A key path

56 Static Allocation Scheme Another key path

57 Static Allocation Scheme Merge two key paths to get a layout tree

59 Static Allocation Scheme Issues –When the number of tree nodes along a key path is larger than the number of stages, transporters are automatically added –When the number of tree nodes along a path is smaller than the number of stages, additional stages are deployed at the parent node of the data source Optimization
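The two cases on this slide can be sketched as a small assignment routine in Java. The sketch assumes that surplus stages arise when a path has fewer nodes than the pipeline has stages, that stages are placed as close to the data source as possible (in line with the earlier observation that arrival rates are highest at the first stages), and that the hosting nodes start at the parent of the data source. The class name, method name, and return shape are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mapping of pipeline stages onto the nodes of one key path. */
final class KeyPathMapping {
    /**
     * hosts:  number of path nodes, counted from the parent of the data source.
     * stages: number of pipeline stages to place.
     * Returns, per host, the stage indices it runs; an empty list marks a transporter.
     */
    static List<List<Integer>> assign(int hosts, int stages) {
        List<List<Integer>> plan = new ArrayList<>();
        for (int h = 0; h < hosts; h++) plan.add(new ArrayList<>());
        int surplus = Math.max(0, stages - hosts); // stages that cannot get a node of their own
        int stage = 0;
        for (int h = 0; h < hosts && stage < stages; h++) {
            // Assumption: the first host (parent of the data source) absorbs all surplus stages.
            int count = (h == 0) ? 1 + surplus : 1;
            for (int c = 0; c < count; c++) plan.get(h).add(stage++);
        }
        return plan; // hosts left with an empty list only forward data
    }
}
```

For example, four hosts and two stages leaves the last two hosts as transporters, while two hosts and four stages stacks the first three stages on the first host.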

60 Evaluations of Resource Allocation Schemes The deployment configuration created and optimized by our algorithm vs. the best one chosen manually vs. all possible configurations

61 Evaluations of Resource Allocation Schemes Environment –Network topologies were randomly generated –The distributed counting sampling application –manual-config: a configuration determined manually –auto-config: a configuration generated by the algorithm –opt-config: a configuration optimized by removing unnecessary transporters

62 Evaluations of Resource Allocation Schemes Comparing auto-config, manual-config, and opt-config

63 Evaluations of Resource Allocation Schemes Comparing auto-config with 120 Other Configurations

64 Evaluations of Resource Allocation Schemes Comparing opt-config with All Possible Configurations

65 Evaluations of Resource Allocation Schemes Overhead of Dynamic Scheme

66 Evaluations of Resource Allocation Schemes Comparing Dynamic and Static Scheme

68 Future Work Improve Dynamic Resource Allocation Scheme –CPU cycles and network bandwidths Motivation –Currently, only network bandwidth is considered as a constraint when scheduling Grid resources –Few related works propose a metric that integrates both

69 Future Work Improve Dynamic Resource Allocation Scheme –CPU cycles and network bandwidths [Figure: stages A, B, C with input buffers and output buffers]

70 Future Work Improve Dynamic Resource Allocation Scheme –CPU cycles and network bandwidths A stage’s desirability for CPU cycles and network bandwidths Edge weight

71 Future Work Improve Dynamic Resource Allocation Scheme –Support efficient dynamic migration Motivation –To ensure successful migration, large-volume checkpoints are required –Re-establishing the original execution environment is time consuming –Our current migration strategy is “applications are required to be stateless” –Instead of checkpoints, a lightweight structure is desirable

72 Future Work Improve Dynamic Resource Allocation Scheme –Support efficient dynamic migration Migration Summary Structure (MSS) –Data stream applications store accumulated summary information in memory –GATES provides an API function to pre-allocate a block of memory to store the MSS –This eliminates the statelessness constraint

73 Future Work Time-Varying Visualization –Motivation The application involves large-volume datasets The places where data are generated are distributed A typical client-server architecture is not scalable, and the server is the bottleneck Adaptation in interactive visualization is desirable; however, manually adjusting parameters in a Grid environment is hard

74 Future Work Time-Varying Visualization –Solutions GATES could fulfill the requirements –Grid-based –Provides automatic adaptation –Automatically allocates Grid resources Issues –Number of stages –Appropriate adaptation parameters –Careful evaluation

75 Related work –dQUOB (dynamic QUery OBjects) –DataCutter –Q-Fabric (Christian Poellabauer et al.) –The partitionable services framework for adapting distributed applications (Anca-Andreea Ivan et al.) –Other research from the database community

76 Summary Grid computing could be an effective solution for distributed data stream processing GATES –Distributed processing –Exploit grid web services –Self-adaptation to meet the real-time constraints
