Published by Virgil Patrick. Modified over 9 years ago.
1
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING
Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, William Perrizo
North Dakota State University, Computer Science Department
2
Outline
Introduction to Distributed Data Mining
  Demands
  Existing Projects
  Architecture
Importance of a Layered Architecture
A Prototype System
  System Architecture (layered)
  Server
  Communication
  Client & GUI
  System Characteristics
Conclusion
3
Demands of distributed data mining
Large dataset size
Diversity of data
Geographic distribution of users and resources
Computationally intensive result generation
4
Large-scale distributed data mining projects
Kensington project: mining enterprise data distributed across the Internet.
Papyrus project: based on mobile agents implemented using Java.
PaDDMAS (Parallel and Distributed Data Mining Application Suite): a component-based tool set that integrates pre-developed or custom packages.
JAM (Java Agents for Meta-learning): an agent-based distributed system developed to mine data stored at different sites.
BODHI: collective data mining with stress on learning from vertically partitioned data.
The main limitations of all of these applications: they are tied to particular datasets and particular algorithms.
5
Architecture
Client-Server
  Advantage: able to use high-performance computing on the server side to do the data mining.
Agent based
Hybrid
6
Importance of a Layered Architecture
A layered framework helps to manage complexity.
Provides the flexibility to add, remove, or modify layers and the components of a layer.
Allows for better tracking of the progress of large, complex projects.
Human input is required to tune the data and the algorithms to suit the need (the mix of greyware versus software can be changed over time).

The framework helps to manage complexity. When the software development process is divided in this manner, developers can focus on the appropriate level of scale, which is often difficult to do in large projects. That is, it helps the analyst or designer recognize when they are dealing with requirements or design elements at too low a level. This becomes obvious when the designer starts incorporating design elements two or more levels below the level being designed, for example when specifying libraries at the application level. Also, by following the appropriate type of documentation for each level according to the enterprise practice, LSD allows for better tracking of the progress of large, complex projects.
7
System Architecture
DataMIME™ developed as proof-of-concept.
Based on patent-pending "P-tree technology".
Efficient and scalable system.
Flexible plug-ins.
[Figure 1: Conceptual view of the system. Labels: Client Side; Server Side; Internet; Master Server; One of the Slave Servers; Capture dataset to DataMIME™; Integrate data (synchronize to existing); Mine on DataMIME™; System performance analysis.]

In this section we explain the proposed layered architecture using an example system, DataMIME™, that was developed as a proof-of-concept. DataMIME™ is an efficient and scalable data mining system that provides the flexibility of plugging in new data mining applications when needed. Clients can interact with the DataMIME™ system to capture their data and convert it into the Ptree format, after which they can apply different data mining applications. The actual data converter, along with all the data mining applications, executes on the server side. Figure 1 depicts a conceptual view of the system.
8
Server Architecture
Data capture and integration layer (DCI/DII)
Data mining interface (DMI)
Distributed Ptree Management Interface (DPMI)
Uniform data structure
Data mining algorithms (DMA)
Client-server communication
Client interface
Multi-threaded, concurrent, and distributed
[Figure 2: Organization of the layers. Labels: DCI/DII Layer (room for new feeder); DMA Layer (plugs for new algorithms; already plugged algorithm); DMI Layer; DPMI: Distributed Ptree Management Interface; Distributed Ptree database.]

The DataMIME™ server has a layered architecture: DCI/DII, DMI, DMA, DPMI, and Ptree Data. Figure 2 describes the organization of the layers.
9
The Distributed P-tree Database
The Distributed P-tree Database (DPD) collects all data in vertical format, as opposed to the ubiquitous horizontal (record-based) data structure used in DBMSs, as Predicate-trees (P-trees) based on the patent-pending P-tree technology.
P-trees can be 0-dimensional, 1-dimensional, 2-dimensional, etc.
The next slide shows the detailed construction of 1-D P-trees from a generic horizontal table of data.
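As a rough illustration of the vertical format, the sketch below decomposes a small horizontal table into one bit slice per bit position of each attribute. The function name `bit_slices`, the 3-bit attribute width, and the sample rows are assumptions made for illustration; none of them come from DataMIME™. The slice labels follow the R11, R12, ... naming used on the construction slide.

```python
from typing import Dict, List, Tuple


def bit_slices(rows: List[Tuple[int, ...]], bits: int) -> Dict[str, List[int]]:
    """Vertically project each bit position of each attribute.

    Slice "R{i}{j}" holds bit j (1 = most significant) of attribute i
    for every row, in row order.
    """
    slices: Dict[str, List[int]] = {}
    n_attrs = len(rows[0]) if rows else 0
    for i in range(n_attrs):
        for j in range(bits):
            shift = bits - 1 - j  # j = 0 selects the most significant bit
            slices[f"R{i + 1}{j + 1}"] = [(row[i] >> shift) & 1 for row in rows]
    return slices


# Hypothetical table R(A1, A2, A3, A4) with 3-bit attribute values.
table = [(7, 0, 1, 4), (2, 3, 7, 5), (7, 0, 1, 4)]
slices = bit_slices(table, bits=3)
print(slices["R11"])  # high-order bit of A1 for each row
```

Each slice is a plain list here; in the architecture described above, each slice would then be compressed into a basic P-tree.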
10
Predicate tree technology
Current practice: structure data into horizontal records, then process them vertically (scans).
Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree.
[Figure: a horizontally structured table R(A1 A2 A3 A4) is scanned vertically and projected into R[A1], R[A2], R[A3], R[A4], then into the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, which are compressed into the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43.]

For example, the compression of R11 into P11 goes as follows. The 1-dimensional Ptree representation of R11, denoted P11, is constructed top-down by recording the truth of the universal predicate "pure 1" in a tree, recursively on halves, until purity is achieved.
1. Whole is pure1? false, so 0
2. Left half pure1? false, so 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false, so 0
4. Left half of right half pure1? false, so 0
5. Right half of right half pure1? true, so 1
6. Left half of left half of right half pure1? true, so 1
7. Right half of left half of right half pure1? false, so 0
A branch also ends when the answer is true, because the segment is then pure.

Horizontally AND basic Ptrees. To count occurrences of the tuple (7,0,1,4), AND the basic Ptrees, using the complement (') at every bit position where the value has a 0 (7 = 111, 0 = 000, 1 = 001, 4 = 100), and take the count of 1s in the result:
P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43
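The recursive construction can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the `PNode` class, the function names, and the sample bit slice are assumptions. The sample slice 0 0 0 0 1 0 1 1 was chosen because it produces the same seven answers as the numbered steps on this slide; the slide's actual bit values are not recoverable from the transcript.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PNode:
    pure1: int                       # 1 if the segment is all 1s, else 0
    count: int                       # number of 1 bits in the segment
    left: Optional["PNode"] = None   # children exist only for mixed segments
    right: Optional["PNode"] = None


def build_ptree(bits: List[int]) -> PNode:
    """Record the truth of 'pure 1' recursively on halves until purity."""
    ones = sum(bits)
    if ones == len(bits):            # pure1: record 1, branch ends
        return PNode(1, ones)
    if ones == 0:                    # pure0: record 0, branch ends
        return PNode(0, 0)
    mid = len(bits) // 2             # mixed: record 0, recurse on both halves
    return PNode(0, ones, build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def preorder(node: PNode) -> List[int]:
    """List the recorded pure1 answers, parent first, left before right."""
    out = [node.pure1]
    for child in (node.left, node.right):
        if child is not None:
            out.extend(preorder(child))
    return out


r11 = [0, 0, 0, 0, 1, 0, 1, 1]       # assumed bit slice, for illustration
p11 = build_ptree(r11)
print(preorder(p11), p11.count)
```

Storing the count of 1s at each node is a design choice of this sketch; it makes the root count available without walking the tree.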
11
2-D P-tree Data Structure
Peano or Z-ordering
Quadrants are Pure-1, Pure-0, or Mixed
Root Count (count of 1s in the tree)
Provides an efficient format for ANDing, ORing, and Complementing.
Lossless, compressed, count-computation-ready representations.
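A small sketch of the 2-D idea, under stated assumptions: the grid is square with a power-of-two side, quadrants are visited in Z-order (upper-left, upper-right, lower-left, lower-right), and the string labels and function names are hypothetical rather than the DataMIME™ API.

```python
from typing import List, Union

# A quadrant is "pure1", "pure0", or a list of four child quadrants
# in Z-order: upper-left, upper-right, lower-left, lower-right.
Quad = Union[str, List["Quad"]]


def build_quadtree(grid: List[List[int]]) -> Quad:
    """Compress a 2^k x 2^k bit grid by splitting mixed quadrants."""
    flat = [bit for row in grid for bit in row]
    if all(flat):
        return "pure1"
    if not any(flat):
        return "pure0"
    h = len(grid) // 2
    return [
        build_quadtree([row[:h] for row in grid[:h]]),   # upper-left
        build_quadtree([row[h:] for row in grid[:h]]),   # upper-right
        build_quadtree([row[:h] for row in grid[h:]]),   # lower-left
        build_quadtree([row[h:] for row in grid[h:]]),   # lower-right
    ]


def root_count(node: Quad, size: int) -> int:
    """Count of 1s under a node that covers a size x size block."""
    if node == "pure1":
        return size * size
    if node == "pure0":
        return 0
    return sum(root_count(child, size // 2) for child in node)


grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(grid)
print(tree, root_count(tree, 4))
```

The root count is computed from the compressed tree alone, which is what makes the representation count-computation-ready; the original grid is recoverable from the tree, so nothing is lost.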
12
DCI/DII (Data Capture and Data Integration Interface) layer
This layer allows the user to capture data and integrate it into the required format (P-tree format).
The main component of this layer is the feeder.
An individual feeder can process a particular format of incoming data.
If no existing feeder can process a particular format, the user can write a new feeder and plug it into this architecture very easily.
[Figure: DCI/DII Layer, showing an already plugged feeder and room for a new feeder.]
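One way to picture the feeder plug-in mechanism is a small registry keyed by data format. Everything named here (`Feeder`, `CsvFeeder`, `plug_in`, `capture`, the `"csv"` format key) is hypothetical; the slides do not describe the actual feeder interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class Feeder(ABC):
    """Hypothetical plug-in interface: one feeder per incoming data format."""

    format_name: str

    @abstractmethod
    def parse(self, raw: bytes) -> List[Tuple[int, ...]]:
        """Turn raw input into integer rows ready for P-tree conversion."""


class CsvFeeder(Feeder):
    format_name = "csv"

    def parse(self, raw: bytes) -> List[Tuple[int, ...]]:
        lines = raw.decode("utf-8").strip().splitlines()
        return [tuple(int(v) for v in line.split(",")) for line in lines]


FEEDERS: Dict[str, Feeder] = {}


def plug_in(feeder: Feeder) -> None:
    """Register a feeder under the format it handles."""
    FEEDERS[feeder.format_name] = feeder


def capture(format_name: str, raw: bytes) -> List[Tuple[int, ...]]:
    """Route incoming data to the feeder registered for its format."""
    if format_name not in FEEDERS:
        raise LookupError(f"no feeder for format {format_name!r}")
    return FEEDERS[format_name].parse(raw)


plug_in(CsvFeeder())
rows = capture("csv", b"7,0,1,4\n2,3,7,5\n")
print(rows)
```

Supporting a new format then means writing one more `Feeder` subclass and calling `plug_in`, without touching the rest of the layer.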
13
DMI (Data Mining Interface) layer
DMI does counting, the most important operation for data mining provided by P-trees, on:
  basic P-trees
  value P-trees
  tuple P-trees
  interval P-trees
  cube P-trees
DMI also provides the P-tree algebra, which has four operations, AND, OR, NOT (complement), and XOR, to implement the pointwise logical operations on P-trees for the Data Mining Algorithms (DMA) layer.
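The four pointwise operations and the counting they support can be sketched on uncompressed bit lists. This is a simplification for readability: real P-trees are compressed, and the function names, the 3-bit width, and the sample column are assumptions.

```python
from typing import List

Bits = List[int]


def p_and(a: Bits, b: Bits) -> Bits:
    return [x & y for x, y in zip(a, b)]


def p_or(a: Bits, b: Bits) -> Bits:
    return [x | y for x, y in zip(a, b)]


def p_xor(a: Bits, b: Bits) -> Bits:
    return [x ^ y for x, y in zip(a, b)]


def p_not(a: Bits) -> Bits:
    return [1 - x for x in a]


def root_count(a: Bits) -> int:
    """Number of 1 bits, i.e. the number of matching rows."""
    return sum(a)


def value_ptree(basic: List[Bits], value: int) -> Bits:
    """AND the basic slices of one attribute (most significant first),
    complementing each slice where the value has a 0 bit."""
    width = len(basic)
    result = [1] * len(basic[0])
    for j, slice_j in enumerate(basic):
        bit = (value >> (width - 1 - j)) & 1
        result = p_and(result, slice_j if bit else p_not(slice_j))
    return result


# Assumed column A1 = [7, 2, 7, 5], stored as three bit slices.
p11, p12, p13 = [1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1]
print(root_count(value_ptree([p11, p12, p13], 7)))  # rows where A1 == 7
```

A tuple P-tree follows the same pattern: AND the value P-trees of each attribute, then take the root count.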
14
Distributed Ptree Management Interface (DPMI) Layer
The DPMI layer provides access, location, and concurrency transparency by hiding the fact that:
  data representation may differ and the resource access protocol may vary
  resources may be located in different places
  resources may be shared by several competing users.
By resource we mean data and its converted form, the Ptree.
15
DMA (Data Mining Algorithms) layer
This layer is a collection of data mining tools (algorithms).
Upon receiving a request from the client side, an algorithm is started for mining.
This layer depends on the DMI for accessing the meta-information and the counts it requires.
Built-in algorithms in the current DataMIME™ system:
  Ptree-based K Nearest Neighbor (PKNN) [7]
  Podium Incremental Neighbor Evaluator (PINE) [10]
  P-BAYESIAN
The architecture has the flexibility to plug in any new algorithm on this layer.
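A plug-in point for algorithms might look like the registry below. It is a sketch under assumptions: the decorator `register_algorithm`, the `handle_request` dispatcher, and the `"pknn"` command string are invented for illustration, and the PKNN body is a placeholder rather than the published algorithm.

```python
from typing import Callable, Dict

Algorithm = Callable[[Dict[str, str]], Dict[str, str]]
ALGORITHMS: Dict[str, Algorithm] = {}


def register_algorithm(command: str) -> Callable[[Algorithm], Algorithm]:
    """Decorator that plugs an algorithm in under a command name."""
    def register(fn: Algorithm) -> Algorithm:
        ALGORITHMS[command] = fn
        return fn
    return register


@register_algorithm("pknn")
def pknn_stub(params: Dict[str, str]) -> Dict[str, str]:
    # Placeholder only: a real P-tree kNN would obtain its counts via DMI.
    return {"status": "ok", "algorithm": "pknn", "k": params.get("k", "1")}


def handle_request(request: Dict[str, str]) -> Dict[str, str]:
    """Start the algorithm named by the request, if one is plugged in."""
    command = request.get("command")
    algorithm = ALGORITHMS.get(command)
    if algorithm is None:
        return {"status": "error", "message": f"unknown command: {command}"}
    return algorithm(request)


print(handle_request({"command": "pknn", "k": "3"}))
```

Adding an algorithm is then a matter of decorating one more function; the dispatcher does not change.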
16
Communication
The communication between different layers is designed in such a way that it minimizes the data flow over the network.

In the DCI and the DMA communication protocols, a client creates a connection, sends a request, receives a response, and closes the connection. A client sends only one request in a single-threaded connection.

In the DCI protocol, a request contains a text-based header, which may be followed by a set of binary files with a checksum for each file. The header contains a command request to the server, the number of files, and, if the request contains files, information about each file (name and length). The response to a request is a line with a message indicating the outcome of the request.

A DMA protocol request has a similar structure: a header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key/value pairs (properties), similar to a Java properties file, followed by a terminator. The response to a DMA protocol request also contains key/value pairs. Each request contains the property 'command', whose value is the name of the requested command. Other key/value pairs may represent arguments for the requested command. Depending on the command name and its parameters, the server calls different data mining algorithms to respond to the request.
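The DMA request layout described above can be sketched as an encoder and decoder. The slides give only the general shape, so the concrete wire details here are assumptions: a blank line as the header terminator, one `name length checksum` line before each file, and CRC-32 as the checksum. File names containing spaces and property values containing newlines are not handled.

```python
import zlib
from typing import Dict, List, Tuple

File = Tuple[str, bytes]


def encode_request(props: Dict[str, str], files: List[File]) -> bytes:
    """Key/value header, a blank-line terminator, then checksummed files."""
    if "command" not in props:
        raise ValueError("every request needs a 'command' property")
    header = "".join(f"{k}={v}\n" for k, v in props.items()).encode("utf-8")
    body = b""
    for name, data in files:
        meta = f"{name} {len(data)} {zlib.crc32(data)}\n".encode("utf-8")
        body += meta + data
    return header + b"\n" + body


def decode_request(payload: bytes) -> Tuple[Dict[str, str], List[File]]:
    """Inverse of encode_request; rejects files whose checksum is wrong."""
    head, _, rest = payload.partition(b"\n\n")
    props = dict(
        line.split("=", 1) for line in head.decode("utf-8").splitlines()
    )
    files: List[File] = []
    while rest:
        meta, _, rest = rest.partition(b"\n")
        name, length, checksum = meta.decode("utf-8").split(" ")
        data, rest = rest[: int(length)], rest[int(length):]
        if zlib.crc32(data) != int(checksum):
            raise ValueError(f"checksum mismatch for {name}")
        files.append((name, data))
    return props, files


request = encode_request(
    {"command": "pknn", "k": "3"},
    [("unclassified.bin", b"\x00\n\n\xff")],
)
print(decode_request(request))
```

Sending explicit lengths lets file contents contain any bytes, including the header terminator, and keeps each connection to the single request and response that the protocol describes.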
17
Client Structure
On the client side, DataMIME™ has a graphical user interface (GUI) for visual interaction with the user. The two main functionalities are:
Capture: sends datasets, along with their meta-information (a description of the data), to the DCI/DII layer of the server for capturing.
Mining: sends requests to the DMA layer to apply data mining applications to previously captured datasets, and presents the results.
[Figure: Client side. Labels: DCI Meta Data; DCI Meta-data generator; Client side DMA; Prediction Model; DMA Visualization Tool; Unclassified data.]
18
Client and GUI
Data Capturing
Data Mining
On the client side, DataMIME™ has a graphical user interface (GUI) for visual interaction with the user.
19
System Characteristics
Initially we raised certain issues related to providing scalable data mining services on the web. In the previous section we described a layered architecture that can address most of these issues. In this section we describe the characteristics of DataMIME™, a prototype system implemented as a proof-of-concept. To increase usability, we have designed and implemented the system with an increased emphasis on extensibility and flexibility. We have developed a wide variety of functions and algorithms. Most algorithms have turned out to be superior to other well-known methods in terms of speed and/or accuracy. We summarize the characteristics as follows.

The system can handle formatted, record-based, relational-like data with numerical and/or categorical attributes. The data can be in text format, relational format, or TIFF image format. In addition, easy conversion from any other machine-readable format can be provided through customized data feeders.
Users can do any data analysis and mining on data sets in the system, or on any new data they capture or integrate into the system.
The system is capable of handling large quantities of data and mining them in scalable time.
Clients of the system can run on UNIX and Microsoft Windows (including 95, 98, NT, 2000, XP, and Server 2003) platforms, with the server designed to be a UNIX-based system.
20
System Characteristics (cont.)
The system supports major RDBMS platforms.
The system has an N-tier architecture, providing high flexibility. The server engine can be run on a single machine or distributed across multiple computers for better scalability and efficiency.
The system can automate data ETL (extraction, transformation, and load) processes or let the users handle everything manually.
The system has an open architecture that provides a high degree of software extensibility and integration capabilities. Users can not only use the system-provided approaches in association rule mining, classification, prediction, and similarity search; they can also write their own data mining algorithms using the Ptree API, then compile and deploy them in the DataMIME™ environment to see the performance of their own algorithms.
With large amounts of data, data operations require time to process. The system provides a high level of asynchronous background operation, performing most data-intensive operations in the background or offline and allowing users to continue their work.
The system minimizes the flow of data across the network.
21
Conclusion
We have shown the importance of having a layered architecture for a distributed data mining system.
Key elements were identified in deciding on the different layers.
We were able to identify a unique, efficient vertical data structure at the lowest layer that can take advantage of the latest hardware.
To facilitate data distribution, a management layer is also recognized.
Two other layers are defined: the data capture layer and the data mining layer.
A prototype system was developed as a proof-of-concept to show the feasibility of the approach.