1 Scaling Up Classifiers to Cloud Computers
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla
University of Notre Dame
2 Motivation
- Data mining on sets larger than a single machine's memory is difficult.
- Ensemble classification can improve accuracy over a single classifier.
- Ensemble classification can be distributed because each member of the ensemble is independent.
- Scalable (in terms of data size and number of resources) distributed ensemble classification architectures tend to be finely tailored to an application and algorithm.
3 Outline
- Distributed Data Mining
- Data Mining on Clouds
- Abstraction for Distributed Data Mining
- Implementing the Abstraction
- Evaluating the Abstraction
- Take-aways
4 Distributed Data Mining
For training data D, testing set T, and classifier F:
1. Divide D into N partitions with partitioner P.
2. Run N copies of F, one on each partition, generating a set of votes on T for each partition.
3. Collect the votes from all copies of F and combine them into a final result R.
5 Challenges in Distributed DM
When dealing with large amounts of data (MB to GB to TB), there are systems problems in addition to data mining problems. Why should data miners have to be distributed systems experts too?
Scalable (in terms of data size and number of resources) distributed data mining architectures tend to be finely tailored to an application and algorithm.
6 Proposed Solution
An abstraction framework for distributed data mining: an abstraction allows users to declare a distributed workload based on only what they know (sequential programs, data).
Why an abstraction? Abstractions hide many complexities from users. Unlike a specially-tailored implementation, a conceptual abstraction provides a general-purpose solution for a problem, which may be implemented in any of several ways depending on requirements.
7 Clusters versus Cloud Computers
Clusters:
- Small (4-16) to very large
- Use a shared filesystem, often centralized
- Assign dedicated resources, often in large blocks
- Often static and generally homogeneous
- Managed by a batch or grid engine
Cloud computers:
- Large (~500 CPUs, ~300 disks @ ND)
- Use individual disks rather than a central FS
- Assign resources dynamically, without a guarantee of dedicated access
- Commodity, dynamic, and heterogeneous
- Managed by a batch or grid engine
8 Implementing the Abstraction
There are several factors to consider:
- How many nodes to use for computation?
- How many nodes to use for data?
- How to connect the data and computation nodes?
9 Streaming
Each process is connected via a data stream. Data exists only in buffers in memory, and stream writers block until stream readers have consumed the buffer. Requires all processes to run in parallel at once to complete, and is not robust to failure.
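A minimal single-machine sketch of the streaming model, using a one-slot in-memory channel between a writer and a reader; it illustrates why every stage must run concurrently and why a single failed stage stalls the whole pipeline:

    import threading, queue

    buf = queue.Queue(maxsize=1)    # one in-memory buffer, never spilled to disk

    def writer(records):
        for r in records:
            buf.put(r)              # blocks until the reader has consumed the buffer
        buf.put(None)               # end-of-stream marker

    def reader():
        while buf.get() is not None:
            pass                    # consume the record (e.g., feed the classifier)

    threading.Thread(target=writer, args=(range(100),)).start()
    reader()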
10 Pull
Partitioning is done ahead of computation, and partitions are stored on the source node. Computation jobs pull in the proper partition from the source node. Flexible and robust to failure, but not scalable to a large number of computation nodes.
11 [Diagram: Pull architecture. Partitions P1-P4 are stored on the source node (.data); computation jobs matched by the Condor Matchmaker pull their partitions from it.]
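A minimal sketch of a Pull-style job, assuming hypothetical paths and a stand-in 'classifier' command; matchmaking is handled by Condor, and this sketch shows only the copy-at-job-start step:

    import shutil, subprocess

    SOURCE = "/net/source/.data"    # hypothetical path to the source node's partitions

    def run_job(i):
        local = f"/tmp/partition.{i}"
        shutil.copy(f"{SOURCE}/partition.{i}", local)   # pull the partition at job start
        # 'classifier' is a stand-in for the sequential classifier program
        subprocess.run(["classifier", local, "testing.set"], check=True)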
12 Push
Work assignments are made ahead of partitioning, and the partitioner distributes data to where it will be used. Data are accessed locally where possible, or accessed in place remotely. This improves scalability to larger numbers of computation nodes, but can decrease flexibility and increase reliance on unreliable nodes.
13 [Diagram: Push architecture. Partitions P1-P4 are distributed out to the computation nodes matched by the Condor Matchmaker.]
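A minimal sketch of Push-style placement, assuming hypothetical hostnames and scp as the transport; it shows only the idea of writing each partition to the node expected to consume it:

    import subprocess

    NODES = ["node01", "node02", "node03", "node04"]    # hypothetical compute hosts

    def push_partitions(partition_files):
        # write each partition directly to the node expected to consume it,
        # so the later computation reads it locally
        for part, node in zip(partition_files, NODES):
            subprocess.run(["scp", part, f"{node}:/tmp/{part}"], check=True)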
14 Hybrid
Push to a well-known set of intermediate nodes, then pull from those nodes. This combines the advantages of Pull (flexibility, reliability) and Push (I/O performance).
15 [Diagram: Hybrid architecture. The source data (.data) is pushed to intermediate servers holding P1-P4; jobs matched by the Condor Matchmaker pull from those servers.]
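A minimal sketch of the Hybrid staging step, with hypothetical server names: partitions are pushed to a small, well-known set of intermediate servers, and jobs later pull each partition from the server recorded for it:

    import subprocess

    SERVERS = ["data01", "data02"]    # small, well-known set of intermediate servers

    def stage(partition_files):
        placement = {}
        for i, part in enumerate(partition_files):
            server = SERVERS[i % len(SERVERS)]    # spread partitions across servers
            subprocess.run(["scp", part, f"{server}:/data/{part}"], check=True)
            placement[part] = server    # jobs later pull part from placement[part]
        return placement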
16 Implementing the Abstraction
The effectiveness of these possibilities hinges on the flexibility, reliability, and performance of their components. An example of such a component is the partitioning algorithm.
18 Partitioning Algorithms
Shuffle: take one instance at a time from the training data and copy it into a partition.
Chop: build one partition at a time, copying all of its instances from the training data.
19 [Diagram: Shuffle. Instances A-L from the training data are copied one at a time into the partitions.]
20 [Diagram: Chop. Instances A-L from the training data are copied into the partitions one partition at a time.]
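A minimal in-memory sketch of the two partitioners described on slide 18 (round-robin assignment is assumed for Shuffle; the real partitioners stream instances into local files or to remote nodes):

    def shuffle(instances, n):
        # one instance at a time: deal instances across all n partitions,
        # so all n outputs are written to throughout the pass
        parts = [[] for _ in range(n)]
        for i, inst in enumerate(instances):
            parts[i % n].append(inst)
        return parts

    def chop(instances, n):
        # one partition at a time: copy a contiguous block per partition
        size = (len(instances) + n - 1) // n
        return [instances[i * size:(i + 1) * size] for i in range(n)]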
21 [Chart: partitioning performance on 5.4GB of data. Locals: using fgets, fprintf. R16s: using fgets, chirp_stream_write, intra-sc0 cluster.]
22 [Chart: Locals: using fgets, fprintf. R{4,16}s: using fgets, chirp_stream_write, intra-sc0 cluster.]
24 Partitioning Conclusions
- Remote partitioning is faster, but less reliable, than local partitioning.
- Shuffle is slower locally and to a small number of remote hosts, but scales better to a large number of remote hosts.
- Shuffle is less robust than Chop for large data sets.
25 Evaluating the Architectures
Evaluation is based on performance and scalability. Classifier algorithms were decision trees, K-nearest neighbors, and support vector machines.
26 Protein Data Set (3.3M instances, 170MB), Using Decision Trees
27 KDDCup Data Set (4.9M instances, 700MB), Using Decision Trees
28 Alpha Data Set (400K instances, 1.8GB), Using KNN
29 System Architectures
Push: fastest (remote partitioning, mainly local access, etc.). Needs 1-to-1 matching or heavy preference; pure 1-to-1 matching is possible, but more fragile.
Pull: slowest (local partitioning, transfer at job start), but most robust (central data; "any" host can run jobs).
Hybrid: a combination: push to a subset of nodes, then pull. Faster than Pull (remote partitioning, multiple servers) and more robust than Push (small set of servers).
30 Future Work
- Performance vs. accuracy for long-tail jobs: is there a viable tradeoff between better turnaround time and degraded classification accuracy?
- Efficient data management on multicores
- Hierarchical abstraction framework: submit jobs to clouds of subnets of multicores
31 Conclusions
The Hybrid method is amenable to both cluster-like environments and larger, more diverse clouds, and its use of intermediate data servers mitigates some of Shuffle's problems.
A fundamental limit on scalability is the available memory on each workstation: for our largest sets, even 16 nodes were not sufficient to run effectively.
32 Questions?
Data Analysis and Inference Laboratory: Karsten Steinhaeuser (ksteinha@cse.nd.edu), Nitesh V. Chawla (nchawla@cse.nd.edu)
Cooperative Computing Laboratory: Christopher Moretti (cmoretti@cse.nd.edu), Douglas Thain (dthain@cse.nd.edu)
Acknowledgements: NSF CNS-06-43229, CCF-06-21434, CNS-07-20813
33 [Diagram: Push architecture, repeated from slide 13.]
34 Architecture Conclusions
Hybrid works very well:
- Fast partitioning (remote, few nodes)
- Reliable enough partitioning (few reliable nodes)
- Fast enough access (no central bottleneck)
- Reliable enough access (few reliable nodes)
Remaining issue: Hybrid robustness versus efficiency. As we increase the number of nodes for performance, robustness decreases.
35 Swift (Clifford, TG07 Tutorial)
- Accessing messy data: describe logical structure by XML Schema
- Implementing complex computations with the SwiftScript language
- Expression, discovery, and reuse of analyses
- Hiding the complexity of distributed systems
- Scaling to large data and complex analyses
Their grid engine and data engine currently scale to 50-100 workers, but many GBs of data.
36 [Diagram: Swift Architecture (Clifford, TG07 Tutorial). A SwiftScript specification is compiled into an abstract computation, scheduled and run by the execution engine (Karajan with the Swift runtime) on virtual nodes, with a virtual data catalog, provenance collection, status reporting, and provisioning via the Falkon Resource Provisioner (e.g., on Amazon EC2).]
37 Map-Reduce (Dean and Ghemawat, '04)
Sample application: identify all unique nouns and verbs in 1M documents.
[Diagram: map tasks emit the nouns and verbs found in each document; a reduce step merges them into unique nouns and unique verbs. Inputs: (file, word); intermediates: (word, count); output: (word, count).]
38 Classify versus Map-Reduce
The mapper is the task assignment: F on subclassifiers D1, D2, ..., DN. The reducer is the collection: D1, D2, ..., DN -> Collector -> Classification.
But what about the testing set? On the FS? Sent with the tasks? It's a variable implementation detail! And how is data managed? It's a variable implementation detail!
Logical partitioning is on the FS, and not a first-class member of the model... but we want to study data placement and access.
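A sketch of the classification workload phrased in map-reduce terms, with assumed function shapes (F returns one vote per test instance): the mapper runs a subclassifier on one partition and emits votes keyed by test instance; the reducer combines them by majority:

    from collections import Counter

    def mapper(partition, T, F):
        # run a subclassifier on one partition; key = test instance index
        for idx, label in enumerate(F(partition, T)):
            yield idx, label

    def reducer(idx, votes):
        # combine the votes for one test instance into a classification
        return idx, Counter(votes).most_common(1)[0][0]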
40 FREERIDE
Cluster abstraction for parallelizing data mining codes.
FREERIDE-G: an additional abstraction for access to remote data repositories (still from clusters).
Scales to 16-32 compute nodes and 2-8 data servers, with datasets up to 2GB.
Both must use data mining codes with particular structures in order to exploit parallelism.
41 Beta Data Set (400K instances, 1.8GB), Using SVMs