1 Overview of Distributed Data Mining Xiaoling Wang March 11, 2003

2 Data Mining
“We are drowning in information, but starving for knowledge.” - John Naisbitt
What is data mining?
– Closely related to knowledge discovery
– Discovering useful, usually unknown patterns from data
– Data: a set of facts F (e.g., cases in a database)
– Pattern: an expression E describing facts in a subset F_E of F

3 Goals of Data Mining
Goals
– Prediction
– Description
Domains
– Induction, Compression, Querying, Approximation, Search

4 Basic Techniques of Data Mining
Basic techniques
– Clustering
– Association rule discovery
– Classification
– Sequential pattern discovery
– Outlier detection

5 Data Warehouse Architecture
[Diagram components: data sources, extractor, data transformation & integration, data warehouse, data mining algorithm]

6 Distributed Data Mining Framework
[Diagram: each data source feeds a data mining algorithm that produces a local model; local model aggregation combines the local models into a final model]

7 Distributed Data Source Definitions
Homogeneous
– Contain the same set of attributes across distributed data sites
Heterogeneous
– Define different sets of attributes across distributed data sites
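The two definitions can be illustrated with a toy table. This is a minimal sketch, not part of the original slides: the records, the attribute names, and the helpers `split_homogeneous` and `split_heterogeneous` are all hypothetical, and the shared `id` key in the heterogeneous case is an assumption.

```python
# Hypothetical toy records; attribute names are made up for illustration.
records = [
    {"id": 1, "age": 34, "income": 52000, "fraud": False},
    {"id": 2, "age": 51, "income": 87000, "fraud": True},
    {"id": 3, "age": 27, "income": 31000, "fraud": False},
    {"id": 4, "age": 45, "income": 64000, "fraud": False},
]


def split_homogeneous(records, n_sites):
    """Every site keeps ALL attributes but only a subset of the rows."""
    return [records[i::n_sites] for i in range(n_sites)]


def split_heterogeneous(records, attribute_groups, key="id"):
    """Every site keeps ALL rows but only its own attributes (plus a shared key)."""
    return [
        [{key: r[key], **{a: r[a] for a in group}} for r in records]
        for group in attribute_groups
    ]


homogeneous_sites = split_homogeneous(records, 2)
heterogeneous_sites = split_heterogeneous(records, [["age"], ["income", "fraud"]])

for site in homogeneous_sites:
    print("homogeneous site attributes:", sorted(site[0]))
for site in heterogeneous_sites:
    print("heterogeneous site attributes:", sorted(site[0]))
```

In the homogeneous split every site reports the same attribute list; in the heterogeneous split each site reports a different one.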

8 Distributed Data Mining Techniques
Distributed classifier learning
– Meta-learning framework
– Distributed learning with knowledge probing
Collective data mining
Distributed clustering
Distributed association rule mining
Others

9 Meta-learning
Chan, Florida Institute of Technology & Stolfo, Columbia University
“base classifiers” and “meta-classifier”
Meta-learning rules: voting, arbitrating, and combining
Scalability, efficiency, portability, compatibility, adaptivity, extensibility, and effectiveness
For heterogeneous data sites, apply bridging methods

10 Meta-learning Framework
[Diagram components: training data, learning algorithms, classifiers, predictions, validation data, meta-level training data, meta-learning (arbitration and combining), final classifier system]
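A minimal sketch of two of the meta-learning rules named above, voting and combining. It rests on several assumptions that are not in the slides: the base learner is a one-feature threshold "stump", labels are 0/1, and the combiner is a simple lookup table from base predictions to the most common true label on validation data. All function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def train_stump(samples):
    """Base learner (assumed): pick the threshold t minimizing errors of 'x >= t -> 1'."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    threshold = best[1]
    return lambda x: int(x >= threshold)


def vote(classifiers, x):
    """Voting rule: return the class predicted by most base classifiers."""
    return Counter(c(x) for c in classifiers).most_common(1)[0][0]


def train_combiner(classifiers, validation):
    """Combining rule: learn a meta-classifier from (base predictions, true label) pairs."""
    table = defaultdict(Counter)
    for x, y in validation:
        table[tuple(c(x) for c in classifiers)][y] += 1

    def meta_classifier(x):
        key = tuple(c(x) for c in classifiers)
        if key in table:
            return table[key].most_common(1)[0][0]
        return vote(classifiers, x)  # fall back to voting for unseen patterns

    return meta_classifier


# Each site trains a base classifier on its own data only.
site_data = [
    [(1, 0), (2, 0), (6, 1), (7, 1)],
    [(0, 0), (3, 0), (5, 1), (9, 1)],
    [(2, 0), (4, 0), (4.5, 1), (8, 1)],
]
base_classifiers = [train_stump(samples) for samples in site_data]

# Held-out validation data supplies the meta-level training set.
validation = [(1, 0), (4.7, 1), (5.5, 1), (8, 1)]
meta = train_combiner(base_classifiers, validation)

print("vote(4.8) =", vote(base_classifiers, 4.8))
print("meta(4.8) =", meta(4.8))
```

The two rules can disagree: voting only counts base predictions, while the combiner learns from validation data which prediction patterns go with which class.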

11 Distributed Learning with Knowledge Probing
Guo & Sutiwaraphun, Imperial College
Objective: distributed classification
Meta-learning based technique
Applied on homogeneous data sites
Knowledge probing: extract descriptive knowledge from a black-box model by learning from a new data set whose classes are assigned by that model

12 DLKP (Cont.)
[Diagram components: data sources 1 … k, local model derivation at each source, local models, probing set, probing strategy, prediction scheme, final model]
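The knowledge-probing idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the local models are hypothetical threshold classifiers, the black box is their majority vote, and the descriptive learner is a single readable threshold rule.

```python
from collections import Counter


def black_box(local_models, x):
    """Opaque combined model (assumed here to be a majority vote of local models)."""
    return Counter(m(x) for m in local_models).most_common(1)[0][0]


def learn_threshold_rule(labelled):
    """Descriptive learner (assumed): threshold t for the rule 'class 1 if x >= t'."""
    candidates = sorted({x for x, _ in labelled})
    return min(
        candidates,
        key=lambda t: sum((x >= t) != bool(y) for x, y in labelled),
    )


def knowledge_probing(local_models, probing_set):
    """Label the probing set with the black box, then learn a readable model from it."""
    labelled = [(x, black_box(local_models, x)) for x in probing_set]
    return learn_threshold_rule(labelled)


# Hypothetical local models, one per data site.
local_models = [
    lambda x: int(x >= 6),
    lambda x: int(x >= 5),
    lambda x: int(x >= 4.5),
]
probing_set = [0.5, 2.0, 3.5, 4.8, 5.2, 6.5, 9.0]  # unlabelled probing data

threshold = knowledge_probing(local_models, probing_set)
print(f"descriptive rule: class 1 if x >= {threshold}")
```

The final model is a single rule that can be read directly, even though the labels it was trained on came from an ensemble.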

13 Collective Data Mining (CDM)
Kargupta, University of Maryland & Park, Washington State University
Objective: predictive data modeling
Applied to heterogeneous (vertically partitioned) data sites
Foundation: any function can be represented in a distributed fashion using an appropriate set of (orthonormal) basis functions
Example: Collective Principal Component Analysis (CPCA)

14 CDM Framework
Step 1: Generate approximate orthonormal basis coefficients at each local site
Step 2: Move a chosen sample of the data from each site to a single site; generate approximate basis coefficients corresponding to non-linear cross terms
Step 3: Combine the local models; transform the result into the user-described representation; output the model
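The basis-function foundation of CDM can be written out as a worked equation. The notation is an assumption chosen for illustration and does not appear in the slides: ψ_k are orthonormal basis functions, w_k their coefficients, and A and B two sites that hold disjoint attribute sets.

```latex
% Assumed notation: \psi_k are orthonormal basis functions, w_k their coefficients.
f(\mathbf{x}) = \sum_{k \in \Xi} w_k \, \psi_k(\mathbf{x}),
\qquad
\langle \psi_j, \psi_k \rangle = \delta_{jk},
\qquad
w_k = \langle f, \psi_k \rangle .

% With attributes split across two sites, \mathbf{x} = (\mathbf{x}_A, \mathbf{x}_B):
f(\mathbf{x}) =
  \underbrace{\sum_{k \in \Xi_A} w_k \, \psi_k(\mathbf{x}_A)}_{\text{site } A}
+ \underbrace{\sum_{k \in \Xi_B} w_k \, \psi_k(\mathbf{x}_B)}_{\text{site } B}
+ \underbrace{\sum_{k \in \Xi_{AB}} w_k \, \psi_k(\mathbf{x}_A, \mathbf{x}_B)}_{\text{cross terms}}
```

Read this way, Step 1 estimates the coefficients of the single-site terms locally, and Step 2 needs a shared sample because the cross terms depend on attributes held at different sites.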

15 Distributed Clustering
Derived from parallel center-based clustering algorithms such as k-means
Applied to homogeneous scenarios
Two basic approaches
– Approximate the underlying distance measure by aggregation
– Provide the exact measure by data broadcasting
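A minimal sketch of one common way to run k-means over homogeneous sites: each site sends only per-cluster sums and counts, and a coordinator aggregates them into new centroids. This illustrates the general aggregation idea and is not necessarily the specific algorithms the slide refers to. The one-dimensional data and all function names are hypothetical.

```python
def local_statistics(points, centroids):
    """Per-cluster [sum, count] for one site's points under the current centroids."""
    stats = [[0.0, 0] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
        stats[nearest][0] += p
        stats[nearest][1] += 1
    return stats


def aggregate(all_stats, centroids):
    """Coordinator step: merge the sites' statistics into updated centroids."""
    updated = []
    for k, old in enumerate(centroids):
        total = sum(site[k][0] for site in all_stats)
        count = sum(site[k][1] for site in all_stats)
        updated.append(total / count if count else old)  # keep empty clusters in place
    return updated


def distributed_kmeans(sites, centroids, rounds=10):
    """Iterate local statistics and aggregation; raw points never leave their site."""
    for _ in range(rounds):
        all_stats = [local_statistics(points, centroids) for points in sites]
        updated = aggregate(all_stats, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


sites = [[1.0, 2.0, 9.0], [1.5, 8.0, 10.0], [0.5, 9.5]]  # one list of points per site
final_centroids = distributed_kmeans(sites, [0.0, 5.0])
print(final_centroids)
```

Only k sums and k counts travel per site per round, which is the communication saving that motivates this style of algorithm.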

16 Distributed Association Rule Mining
Two main approaches
– Count Distribution (CD): data is partitioned homogeneously across several data sites
– Data Distribution (DD): maximizes parallelism
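The counting step of a Count Distribution style scheme can be sketched as below, assuming every site holds the full candidate set and only support counts are exchanged. The transactions, the candidate generation (all item pairs), and the function names are hypothetical, and a full Apriori loop over itemset sizes is omitted.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """Support counts of every candidate itemset on one site's local partition."""
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def count_distribution(sites, candidates, min_support):
    """Sum local counts into global counts and keep the frequent itemsets."""
    global_counts = Counter()
    for transactions in sites:
        global_counts.update(local_counts(transactions, candidates))
    return {c: n for c, n in global_counts.items() if n >= min_support}


# Transactions partitioned across two sites, all with the same item vocabulary.
sites = [
    [{"bread", "milk"}, {"bread", "butter"}],
    [{"bread", "milk", "butter"}, {"milk"}],
]
items = sorted(set().union(*(t for site in sites for t in site)))
candidates = [frozenset(pair) for pair in combinations(items, 2)]

frequent = count_distribution(sites, candidates, min_support=2)
for itemset, support in frequent.items():
    print(sorted(itemset), support)
```

No transaction leaves its site; the price is that every site must hold and count the complete candidate set.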

17 Applications of Distributed Data Mining
Credit card fraud detection
Intrusion detection
Information retrieval from the Internet
Ad hoc sensor networks

18 Challenges of Distributed Data Mining
Real-time distributed data mining
Adapting to changing environments, new data, and new patterns

