Mining Time-Changing Data Streams


Mining Time-Changing Data Streams Advisor: Dr. Hsu Graduate: Yung-Chu Lin 2002/3/6 IDS Lab Seminar

Outline: Motivation, Objective, Hoeffding Bounds, The VFDT Algorithm, The CVFDT Algorithm, Window Size, Time and Space Complexity, Empirical Study, Conclusion, Opinion

Motivation: The volume of accumulated data, and the time span it covers, keep growing while the data is stored for future use. A real, large database is not a random sample drawn from a stationary distribution; the process generating the data can change over time.

Objective: Solve the classification problem in the presence of concept drift.

Hoeffding Bounds: After n independent observations of a real-valued variable r with range R, the Hoeffding bound guarantees, with confidence 1 - δ, that the true mean of r differs from the observed mean by at most ε = sqrt(R² ln(1/δ) / (2n)). For a probability, R = 1; for an information gain over C classes, R = log₂ C.
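The bound on this slide is straightforward to compute; a minimal sketch (the function name and example values are illustrative, not from the slides):

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon such that, with confidence 1 - delta, the true mean of a
    variable with range R is within epsilon of the mean of n observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# R = 1 for a probability; R = log2(C) for an information gain over C classes.
print(hoeffding_bound(R=1.0, delta=1e-4, n=1000))  # about 0.068
```

Note how the bound shrinks as 1/sqrt(n): quadrupling the number of observations halves ε.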

Concept of the VFDT Algorithm: Initialize the Hoeffding tree HT. Repeat: sort arriving examples into leaves and update each leaf's statistics; after every nmin examples at a leaf (nmin is the minimum number of examples between split checks), compute Gain(x) for every attribute and decide whether or not to split that leaf.
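The split decision at a leaf can be sketched as follows (the tie-break parameter τ and the numbers in the test are illustrative, not taken from the slides):

```python
import math

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    """Split when the best attribute's gain advantage over the runner-up
    exceeds the Hoeffding bound, or when the bound itself has shrunk below
    the tie-break threshold tau (the attributes are effectively tied)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return (g_best - g_second) > eps or eps < tau
```

With few examples the bound is loose, so close gains do not trigger a split; as more examples accumulate, the bound tightens and even near-ties get resolved.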

The VFDT Algorithm

Example for the Algorithms (the PlayTennis data from Mitchell, 1997):

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
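Information gain on this dataset (the standard PlayTennis data from Mitchell, 1997) can be computed directly; a small sketch for the Outlook attribute:

```python
import math
from collections import Counter

# (Outlook, PlayTennis) pairs for days D1..D14 of the PlayTennis data.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows):
    # Entropy of the whole sample minus the weighted entropy of each partition.
    g = entropy([y for _, y in rows])
    for v in set(x for x, _ in rows):
        subset = [y for x, y in rows if x == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

print(round(info_gain(data), 3))  # 0.247, i.e. Gain(S, Outlook)
```

This is the Gain(x) quantity that VFDT estimates at each leaf from its sufficient statistics.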

Concept of the CVFDT Algorithm: An extension of VFDT that adds the ability to detect and respond to changes in the example-generating process. There is no need to learn a new model from scratch every time the concept changes. CVFDT periodically scans HT and its alternate trees, looking for internal nodes whose sufficient statistics indicate that a different attribute would now make a better split, and grows an alternate subtree at each such node. When an alternate subtree becomes more accurate than the old subtree, the old subtree is replaced by the new one.
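CVFDT keeps its sufficient statistics in sync with a sliding window by incrementing counts when an example arrives and decrementing them when the oldest example is forgotten. A minimal sketch of that bookkeeping (class and method names are mine, not the paper's):

```python
from collections import defaultdict, deque

class WindowStats:
    """Sliding-window sufficient statistics: counts of (attribute, value,
    class) triples over the last w examples."""
    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.counts = defaultdict(int)

    def add(self, example, label):
        # Increment counts for the arriving example.
        self.window.append((example, label))
        for attr, value in example.items():
            self.counts[(attr, value, label)] += 1
        # Forget the oldest example once the window is full.
        if len(self.window) > self.w:
            old, old_label = self.window.popleft()
            for attr, value in old.items():
                self.counts[(attr, value, old_label)] -= 1
```

The decrement step is the essence of ForgetExample: the statistics reflect only the most recent w examples, so drift eventually shows up in the counts.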

The CVFDT Algorithm

The CVFDTGrow Procedure

The ForgetExample and CheckSplitValidity Procedures

Window Size: No single window size w is appropriate for every concept and every type of drift. Shrink w when many of the nodes in HT become questionable at once, or in response to a rapid change in the data rate. Increase w when few nodes are questionable, which suggests the concept is stable.
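A toy version of that heuristic (all thresholds and factors here are illustrative; the paper does not prescribe these values):

```python
def adjust_window(w, questionable_fraction, shrink=0.5, grow=1.1,
                  high=0.3, low=0.01, w_min=1_000, w_max=1_000_000):
    """Shrink the window sharply when many nodes are questionable (likely
    drift); grow it slowly while the concept looks stable."""
    if questionable_fraction > high:
        return max(w_min, int(w * shrink))
    if questionable_fraction < low:
        return min(w_max, int(w * grow))
    return w
```

The asymmetry (fast shrink, slow growth) lets the learner react quickly to drift while cautiously regaining statistical power afterwards.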

Time and Space Complexity: VFDT takes O(l_v d v c) time per example and CVFDT O(l_c d v c), where l_v and l_c are the longest paths in the respective trees, d is the number of attributes, v is the maximum number of values per attribute, and c is the number of classes. CVFDT's memory requirement is O(n d v c), where n is the number of nodes in CVFDT's main tree and alternate trees.

Empirical Study: synthetic data and Web data

Synthetic Data: A hyperplane in d-dimensional space is the set of points x that satisfy ∑_{i=1}^{d} w_i x_i = w_0, where x_i is the i-th coordinate of x. Examples with ∑_i w_i x_i ≥ w_0 are labeled positive; examples with ∑_i w_i x_i < w_0 are labeled negative.
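A hypothetical generator for such data (assuming each coordinate is drawn uniformly from [0, 1], as the next slide describes; function names are mine):

```python
import random

def hyperplane_label(x, weights, w0):
    # Positive iff the point lies on or above the hyperplane sum(w_i x_i) = w0.
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= w0 else 0

def make_example(weights, w0, d):
    x = [random.random() for _ in range(d)]  # each x_i uniform in [0, 1]
    return x, hyperplane_label(x, weights, w0)

# Concept drift can be simulated by gradually changing the weights over time,
# which moves the hyperplane and flips the labels of points near it.
```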

Synthetic Data (cont'd): Initialize each weight to 0.2, except w_0, which is set to 0.25d; each coordinate x_i is drawn from [0, 1]. To label an example, substitute its coordinates into the left-hand side of the hyperplane equation to obtain a sum s: |s| ≤ 0.1·w_0 → positive; |s| ≤ 0.2·w_0 → negative.

Synthetic Data (cont'd): 5 million training examples; δ = 0.0001; f = 20,000; nmin = 300; τ = 0.05; w = 100,000.

Synthetic Data (cont'd)

Synthetic Data (cont’d) CVFDT took 4.3 times longer than VFDT VFDT’s average memory allocation over the course of the run was 23MB while CVFDT’s was 16.5MB The average number of nodes in VFDT’s tree was 2696 and in CVFDT’s tree was 677(132: alternate tree, 545: main tree) 2002/3/6 IDS Lab Seminar

Conclusion: The paper introduces CVFDT, which learns accurate models from even the most demanding high-speed, concept-drifting data streams. It keeps a decision tree up to date over a sliding window of examples.

Opinion: Many techniques suffer from concept drift, so perhaps the ideas in this paper could be applied to other techniques as well, such as association rule mining and clustering algorithms.