Published by Allan Malone. Modified over 9 years ago.
1 A Grid-Based Middleware’s Support for Processing Distributed Data Streams
Liang Chen
Advisor: Gagan Agrawal
Computer Science & Engineering
2 Roadmap
Introduction
– Motivation
– Overview of our research
System Overview and Initial Evaluation
– Introduce system architecture and design
– Discuss a self-adaptation algorithm
– Evaluation
Supporting Self-Adaptation in Stream Data Mining
– Improve the self-adaptation algorithm
– Evaluate the system by using two data mining applications
Resource Allocation Schemes
– Motivation
– Static Resource Allocation Scheme
– Dynamic Resource Allocation Scheme
Future work
– How to improve Dynamic Resource Allocation Scheme
– Time-Varying Visualization
Related work
Summary
3 Introduction - Motivation
Data stream processing and analysis
– Data stream: data arrives continuously and needs to be processed in real time
– Data stream applications:
   – Network Fault Management System for Telecommunication Network Elements
   – Computer Vision Based Surveillance
   – Online Network Intrusion Detection
   – Traffic Sensors
4 Introduction - Motivation
Common features of data streams
– Continuous arrival
– Enormous volume
– Real-time constraints
– Data sources could be distributed
5 Introduction - Motivation
Network Fault Management System analyzing alarm message streams
[Diagram labels: Switch Network, X, Network Fault Management System]
6 Introduction- Motivation Computer Vision Based Surveillance
7 Introduction - Motivation
Challenges & possible solutions
– Challenge 1: data and/or computation intensive
[Diagram labels: Switch Network, X]
8 Introduction - Motivation
Challenges & possible solutions
– Challenge 1: data and/or computation intensive
– Solution: Grid computing technologies
[Diagram label: Switch Network]
9 Introduction - Motivation
Challenges & possible solutions
– Challenge 1: data and/or computation intensive
– Solution: Grid computing technologies
– Challenge 2: real-time analysis is required
– Solution: self-adaptation functionality is desired
10 Introduction - Motivation
From the point of view of developers who are interested in data stream applications
– They would like to concentrate on the applications themselves
– They would rather not spend effort on:
   – Grid computing
   – Adaptation functionality
11 Introduction - Our Solution
GATES (Grid-based AdapTive Execution on Stream) is a middleware that supports distributed data stream processing
[Diagram labels: Internet, Globus-OGSA, GATES, Applications, Web service]
12 Introduction - Our Solution
Middleware:
– Built on OGSA
– Self-adaptive
– Grid resource scheduling schemes
Applications built on GATES are:
– Automatically distributed to proper computing nodes
– Automatically self-adaptive to a varying environment, without the developer implementing specific algorithms or multiple versions
– Automatically optimized for system performance
14 System Architecture and Design (From Application Perspective)
Break the task down into several sub-tasks so that the sub-tasks form a pipeline
Implement each sub-task in Java
Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.
– specify how many stages the pipeline has
– specify where the code that processes each sub-task resides
Launch the application by running a Java program (StreamClient.class) provided by GATES
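As a rough illustration of the first step, the sketch below chains two sub-tasks into a pipeline through queues in plain Java. It does not use the GATES API: `PipelineSketch`, `run`, and `demo` are hypothetical names, and the stages run one after another here rather than concurrently on separate Grid nodes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Illustrative sketch only: sub-tasks chained into a pipeline through queues.
// All names are hypothetical; this is not how GATES itself is implemented.
final class PipelineSketch {

    // Pushes every input item through each stage in order.
    static List<Integer> run(List<Function<Integer, Integer>> stages, List<Integer> input) {
        Queue<Integer> in = new ArrayDeque<>(input);
        for (Function<Integer, Integer> stage : stages) {
            Queue<Integer> out = new ArrayDeque<>();   // queue between two stages
            while (!in.isEmpty()) {
                out.add(stage.apply(in.poll()));
            }
            in = out;                                  // output feeds the next stage
        }
        return new ArrayList<>(in);
    }

    // A two-stage pipeline: double each item, then add one.
    static List<Integer> demo() {
        List<Function<Integer, Integer>> stages = new ArrayList<>();
        stages.add(x -> x * 2);
        stages.add(x -> x + 1);
        List<Integer> input = new ArrayList<>();
        input.add(1);
        input.add(2);
        input.add(3);
        return run(stages, input);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real deployment each stage would be a separate class named in the XML configuration file, and the queues would connect processes on different nodes.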
15 System Architecture and Design (Architecture)
16 System Architecture and Design (Architecture)
[Diagram: a pipeline of Stage A, Stage B, and Stage C. Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]
17 System Architecture and Design (Example)
public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
        …
        while (true) {
            // read the next image from the GATES input buffer
            Image img = get-from-buffer-in-GATES(in);
            // down-sample it using the current sampling ratio
            Image img-sample = Sampling(img, sampling-ratio);
            // hand the result to the GATES output buffer
            put-to-buffer-in-GATES(img-sample, out);
        }
        …
    }
}
Calls highlighted on the slide:
sampling-ratio = GATES.getSuggestedParameter();
GATES.Information-About-Adjustment-Parameter(min, max, 1)
18 Goal: a balanced pipeline
Observation
– Adaptation parameters exist
   – Performance parameters
   – Accuracy parameters
Issues & Solutions
– Issue: no specific information about applications is available
   – Solution: expose the adaptation parameter(s) to GATES
– Queueing theory is used to present how to tune adaptation parameters
– Issue: filter out short-term bursts while staying sensitive to long-term behavior
   – Solution: a long-term load factor
– Issue: quickly find converged values of the tunable (adaptation) parameters
   – Solution: the adaptation algorithm
[Diagram labels: A, B, C]
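The idea of filtering out short-term bursts while tracking long-term behavior can be sketched with an exponentially weighted average of queue occupancy. This is an assumption made for illustration, not the formula from the talk (the equations on the next slide were not captured); `LoadFactorTracker`, the weight `alpha`, and the occupancy scale are hypothetical.

```java
// Illustrative sketch only: an exponentially weighted long-term load factor.
// The smoothing rule and all names are assumptions, not the GATES API.
final class LoadFactorTracker {
    private final double alpha;   // weight of the newest sample, in (0, 1]
    private double longTerm;      // smoothed queue occupancy in [0, 1]
    private boolean initialized;

    LoadFactorTracker(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    // Feeds one observation: current queue length relative to its capacity.
    double observe(int queueLength, int capacity) {
        double sample = (double) queueLength / capacity;
        if (!initialized) {
            longTerm = sample;        // first sample seeds the average
            initialized = true;
        } else {
            longTerm = alpha * sample + (1.0 - alpha) * longTerm;
        }
        return longTerm;
    }

    double value() {
        return longTerm;
    }
}

class LoadFactorDemo {
    public static void main(String[] args) {
        LoadFactorTracker tracker = new LoadFactorTracker(0.1);
        int[] lengths = {10, 10, 90, 10, 10};   // one short burst
        for (int len : lengths) {
            System.out.printf("queue=%d longTerm=%.3f%n", len, tracker.observe(len, 100));
        }
    }
}
```

With a small `alpha`, a single burst moves the factor only slightly, while a sustained change in queue length eventually dominates it.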
19 Equations Adaptation algorithm
20 Initial Evaluation
Two applications
– A counting sample application
– A computational steering application
Three experiments were conducted
– The first ran the counting sample application on GATES
– The other two ran the computational steering application
21 Initial Evaluation
Experiment One: Non-adaptive vs. Adaptive Version

Performance comparison
Network Bandwidth (Kilo-Byte/Sec.)   40 (sec.)   80 (sec.)   120 (sec.)   160 (sec.)   Adaptive Version (Kilo-Byte/Sec.)
1                                    462.3       612.9       459.9        671          463.5
10                                   187.7       193.3       509.1        302.1        234.9
100                                  246.4       466.7       296.2        371.6        387.1
1000                                 240.4       298.8       307.7        478          399.9

Accuracy comparison
Network Bandwidth (Kilo-Byte/Sec.)   40 (sec.)   80 (sec.)   120 (sec.)   160 (sec.)   Adaptive Version (Kilo-Byte/Sec.)
1                                    0.891       0.962       0.981        0.987        0.986
10                                   0.896       0.963       0.983        0.992        0.986
100                                  0.887       0.957       0.979        0.988        0.974
1000                                 0.879       0.963       0.983        0.989        0.988
22 Initial Evaluation
Experiment Two: Self-Adaptation with Different Processing Requirements
23 Initial Evaluation
Experiment Three: Self-Adaptation with Different Data Generation Rates
25 Enhanced Self-adaptation Algorithm
Given the long-term load factor of the queue at each stage, we want to improve the method of adjusting the value of an adaptation parameter
1. Should the adaptation parameter be modified, and if so, in which direction?
2. How do we find a new value for (update the value of) the adaptation parameter?
26 Enhanced Self-adaptation Algorithm
Should the adaptation parameter be modified, and if so, in which direction?
– The answer depends on the load status of the queues at two consecutive stages
27 Enhanced Self-adaptation Algorithm
Performance Parameter
[Figure: nine load-state panels for stages A, B, C, grouped into Convergent States and Non-Convergent States]
28 Enhanced Self-adaptation Algorithm Summary of Load States
29 Enhanced Self-adaptation Algorithm
How to determine the new value for the adaptation parameter
– Linear update: increase or decrease by a fixed value
   – Hard to find a proper fixed value
– Previous method
– Binary tree search
30 Enhanced Self-adaptation Algorithm
How to determine the new value for the adaptation parameter: binary tree search
public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
        while (true) {
            // read the next image from the GATES input buffer
            Image img = get-from-buffer-in-GATES(in);
            // down-sample it using the current sampling ratio
            Image img-sample = Sampling(img, sampling-ratio);
            // hand the result to the GATES output buffer
            put-to-buffer-in-GATES(img-sample, out);
        }
        …
    }
}
Calls highlighted on the slide:
sampling-ratio = GATES.getSuggestedParameter();
GATES.Information-About-Adaptation-Parameter(min, max, init_value)
31 Enhanced Self-adaptation Algorithm
[Figure labels: Left Border, Current Value, Right Border, New Value]
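One plausible reading of the border labels above is a bisection-style update: each adjustment moves one border to the current value and picks the midpoint of the remaining range. The sketch below illustrates that reading only; `BinarySearchTuner` and its methods are hypothetical names, not part of GATES, and the real algorithm may handle borders differently.

```java
// Illustrative sketch only: bisection-style update of one adaptation parameter.
// Border names follow the slide; class and method names are hypothetical.
final class BinarySearchTuner {
    private double left;      // left border of the remaining search range
    private double right;     // right border of the remaining search range
    private double current;   // value currently suggested to the application

    BinarySearchTuner(double min, double max, double initValue) {
        if (!(min <= initValue && initValue <= max)) {
            throw new IllegalArgumentException("initValue must lie in [min, max]");
        }
        this.left = min;
        this.right = max;
        this.current = initValue;
    }

    // The parameter should grow: search the upper half of the range.
    double increase() {
        left = current;
        current = (current + right) / 2.0;
        return current;
    }

    // The parameter should shrink: search the lower half of the range.
    double decrease() {
        right = current;
        current = (left + current) / 2.0;
        return current;
    }

    double value() {
        return current;
    }
}

class BinarySearchTunerDemo {
    public static void main(String[] args) {
        BinarySearchTuner tuner = new BinarySearchTuner(0.0, 1.0, 0.5);
        System.out.println(tuner.increase());   // 0.75
        System.out.println(tuner.decrease());   // 0.625
    }
}
```

Compared with a linear update, the step size shrinks by half on every adjustment, so no fixed step value has to be chosen in advance.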
33 Data Mining Applications & System Evaluation
Two data mining applications
– CluStream: clustering data arriving in data streams
34 Data Mining Applications & System Evaluation
– Dist-Freq-Counting: finding frequent itemsets from distributed streams
35-43 Data Mining Applications & System Evaluation (nine slides with figures only)
45 Resource Allocation Schemes
Problem Definition
– Grid resource scheduling for pipelined processing and real-time distributed streaming applications
– Mapping workflows onto the Grid is an NP-complete problem
– Static part: the resource allocation problem for GATES is to determine a deployment configuration
– Dynamic part
46 Static Allocation Scheme
Static allocation problem: determining a deployment configuration
Given:
– The number of data sources and their locations
– The destination
– The number of stages that make up the pipeline
To be determined (marked "?" on the slide):
– The number of instances of each stage
– How the instances connect to each other
– The node where each instance is placed
Objective: automatically generate a deployment configuration from the information about available resources
47 Static Allocation Scheme
Examples of deployment configurations
48 Static Allocation Scheme
Challenge
– Given an application having
   – m stages
   – n data sources
   – k available computing nodes for placement of the stages’ instances
– The number of possible configurations is:
   $F(2, n, k) = 1$
   $F(m, n, k) = \sum_{i} s_n(i) \cdot F(m-1, i, k-i) \cdot P_k^i$
   $F(3, n, k) \ge n^n$
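To get a feel for how fast this count grows, the recurrence can be evaluated directly. The sketch below rests on assumptions the slide does not state: that s_n(i) is the Stirling number of the second kind (ways to split n sources into i non-empty groups), that P_k^i is the number of ordered selections of i nodes out of k, and that the sum runs over i = 1..min(n, k). `ConfigCount` and its method names are hypothetical.

```java
import java.math.BigInteger;

// Illustrative sketch only: evaluates the configuration-count recurrence under
// the assumptions named in the lead-in (Stirling numbers, permutations, bounds).
final class ConfigCount {

    // Stirling number of the second kind S(n, i), via the standard recurrence.
    static BigInteger stirling2(int n, int i) {
        BigInteger[][] s = new BigInteger[n + 1][i + 1];
        for (int a = 0; a <= n; a++) {
            for (int b = 0; b <= i; b++) {
                s[a][b] = BigInteger.ZERO;
            }
        }
        s[0][0] = BigInteger.ONE;
        for (int a = 1; a <= n; a++) {
            for (int b = 1; b <= Math.min(a, i); b++) {
                s[a][b] = s[a - 1][b - 1].add(BigInteger.valueOf(b).multiply(s[a - 1][b]));
            }
        }
        return s[n][i];
    }

    // P(k, i) = k! / (k - i)!, the number of ordered choices of i nodes from k.
    static BigInteger permutations(int k, int i) {
        BigInteger p = BigInteger.ONE;
        for (int j = 0; j < i; j++) {
            p = p.multiply(BigInteger.valueOf(k - j));
        }
        return p;
    }

    // F(m, n, k) with base case F(2, n, k) = 1.
    static BigInteger count(int m, int n, int k) {
        if (m < 2) {
            throw new IllegalArgumentException("m must be at least 2");
        }
        if (m == 2) {
            return BigInteger.ONE;
        }
        BigInteger total = BigInteger.ZERO;
        for (int i = 1; i <= Math.min(n, k); i++) {
            total = total.add(stirling2(n, i)
                    .multiply(permutations(k, i))
                    .multiply(count(m - 1, i, k - i)));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count(3, 3, 4));   // prints 64
    }
}
```

Under these assumptions F(3, n, k) works out to k^n whenever k ≥ n, which agrees with the bound F(3, n, k) ≥ n^n and shows why exhaustive enumeration is impractical.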
49 Static Allocation Scheme
Assumptions:
– Network bandwidth, rather than CPU capability, is the critical resource
– Bandwidths of networks inside a cluster are larger than bandwidths of networks connecting clusters
– We know the network topology, the list of available clusters, and the resource information of the clusters
– We know where the data sources and the destination are
50 Static Allocation Scheme
Observation
– The data arrival rates are very high at the first one or two stages
– The arrival rates decrease significantly at the following stages
– Prim’s algorithm is used to construct a Minimum Spanning Tree
Three steps:
– Create a key path corresponding to each data source (Prim’s algorithm, MST)
– Merge the key paths to create a layout tree
– Map each node in the layout tree to a computing node
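The first of the three steps can be sketched as follows: run Prim's algorithm from a data source and read off the tree path to the destination as that source's key path. The edge weights here are an assumption (a lower weight stands for a better link, for example the reciprocal of bandwidth), and `KeyPath`, `primParents`, and `keyPath` are hypothetical names rather than GATES code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: Prim's algorithm from a data source, then the
// tree path from that source to the destination taken as its key path.
final class KeyPath {

    // w[u][v] > 0 is the cost of link u-v; 0 means no link.
    // Returns parent[v] in the spanning tree rooted at source (-1 for the root
    // and for unreachable nodes).
    static int[] primParents(double[][] w, int source) {
        int n = w.length;
        double[] best = new double[n];
        int[] parent = new int[n];
        boolean[] inTree = new boolean[n];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        Arrays.fill(parent, -1);
        best[source] = 0.0;
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++) {          // cheapest node not yet in the tree
                if (!inTree[v] && (u == -1 || best[v] < best[u])) {
                    u = v;
                }
            }
            if (best[u] == Double.POSITIVE_INFINITY) {
                break;                             // remaining nodes are unreachable
            }
            inTree[u] = true;
            for (int v = 0; v < n; v++) {          // relax links leaving the tree
                if (w[u][v] > 0 && !inTree[v] && w[u][v] < best[v]) {
                    best[v] = w[u][v];
                    parent[v] = u;
                }
            }
        }
        return parent;
    }

    // Nodes on the tree path from source to destination, or an empty list.
    static List<Integer> keyPath(double[][] w, int source, int destination) {
        int[] parent = primParents(w, source);
        List<Integer> path = new ArrayList<>();
        for (int v = destination; v != -1; v = parent[v]) {
            path.add(v);
        }
        Collections.reverse(path);
        if (path.get(0) != source) {
            return Collections.emptyList();        // destination not connected
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] w = {
            {0, 1, 0, 5},
            {1, 0, 1, 0},
            {0, 1, 0, 1},
            {5, 0, 1, 0}
        };
        System.out.println(keyPath(w, 0, 3));      // prints [0, 1, 2, 3]
    }
}
```

Repeating this for every data source yields one key path per source; merging those paths into a layout tree and mapping tree nodes to computing nodes are the remaining two steps and are not shown here.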
51 Static Allocation Scheme
– Network topology
52 Static Allocation Scheme
Take each data source as a starting node and apply the minimum spanning tree algorithm (Prim) to the graph
53 Static Allocation Scheme
54 Static Allocation Scheme
A key path
55 Static Allocation Scheme
56 Static Allocation Scheme
Another key path
57 Static Allocation Scheme
Merge two key paths to get a layout tree
58 Static Allocation Scheme
59 Static Allocation Scheme
Issues
– When the number of tree nodes along a key path is larger than the number of stages: transporters are automatically added
– When the number of tree nodes along a path is smaller than the number of stages: additional stages are deployed at the parent node of the data source
Optimization
60 Evaluations of Resource Allocation Schemes
The deployment configuration created and optimized by our algorithm vs. the best one chosen manually vs. all possible configurations
61 Evaluations of Resource Allocation Schemes
Environment
– Network topologies were randomly generated
– The distributed counting sampling application
– manual-config: a configuration determined manually
– auto-config: a configuration generated by the algorithm
– opt-config: a configuration optimized by removing unnecessary transporters
62 Evaluations of Resource Allocation Schemes Comparing auto-config, manual-config, and opt-config
63 Evaluations of Resource Allocation Schemes Comparing auto-config with 120 Other Configurations
64 Evaluations of Resource Allocation Schemes Comparing opt-config with All Possible Configurations
65 Evaluations of Resource Allocation Schemes Overhead of Dynamic Scheme
66 Evaluations of Resource Allocation Schemes Comparing Dynamic and Static Scheme
68 Future Work
Improve Dynamic Resource Allocation Scheme
– CPU cycles and network bandwidths
Motivation
– Currently, only network bandwidth is considered as a constraint when scheduling Grid resources
– Few related works propose a metric that integrates both
69 Future Work
Improve Dynamic Resource Allocation Scheme
– CPU cycles and network bandwidths
[Diagram labels: A, B, C, Input buffers, Output buffers]
70 Future Work
Improve Dynamic Resource Allocation Scheme
– CPU cycles and network bandwidths
A stage’s desirability for CPU cycles and network bandwidths
Edge weight
71 Future Work
Improve Dynamic Resource Allocation Scheme
– Support efficient dynamic migration
Motivation
– To ensure successful migration, large-volume checkpoints are required
– Re-establishing the original execution environment is time consuming
– Our current migration strategy is that applications are required to be stateless
– Instead of checkpoints, a light-weight structure is desirable
72 Future Work
Improve Dynamic Resource Allocation Scheme
– Support efficient dynamic migration
Migration Summary Structure (MSS)
– Data stream applications store accumulated summary information in memory
– GATES provides an API function to pre-allocate a block of memory to store the MSS
– This eliminates the constraint of statelessness
73 Future Work
Time-Varying Visualization
– Motivation
   – The application involves large-volume datasets
   – The places where data are generated are distributed
   – A typical client-server architecture is not scalable, and the server is the bottleneck
   – Adaptation in interactive visualization is desirable; however, manually adjusting parameters in the Grid environment is hard
74 Future Work
Time-Varying Visualization
– Solutions
   GATES could fulfill the requirements
   – Grid-based
   – Provides automatic adaptation
   – Automatically allocates Grid resources
   Issues
   – Number of stages
   – Appropriate adaptation parameters
   – Careful evaluation
75 Related work
dQUOB (dynamic QUery OBjects)
Data Cutter
Q-fabric (Christian Poellabauer et al.)
The partitionable services framework for adapting distributed applications (Anca-Andreea Ivan et al.)
Other research from the database community
76 Summary
Grid computing could be an effective solution for distributed data stream processing
GATES
– Distributed processing
– Exploits Grid web services
– Self-adaptation to meet real-time constraints