Decision Trees for Mining Data Streams


Decision Trees for Mining Data Streams 240B Fall 2018 Liangkun Zhao

Outline Introduction Background Knowledge Challenges Numerical Interval Pruning A New Sampling Approach Experimental Results Conclusion & Future Work

Introduction Decision tree construction is an important data mining problem. Attention has shifted from disk-resident datasets to data streams. Key issues: 1. Only one pass over the entire data 2. Real-time constraints 3. Limited memory. Domingos and Hulten’s algorithm guarantees a probabilistic bound on the accuracy of the decision tree. But…

Background knowledge Decision tree classifier: given a data set D = {t1, t2, …, tN}, ti = <x, c> ∈ X × C. X – domain of data instances, X = X1 × X2 × … × Xm. Xj – domain of attribute j, either a categorical set or a numerical range (e.g. [1, 100]). C – domain of class labels. The goal is to find a function f: X → C. The tree is typically binary, with a split point at each internal node, and is built in a growing stage followed by a pruning stage. Selecting the attribute and the split condition that maximizes information gain is the key step in decision tree construction.
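
To make the key step concrete, here is a minimal Python sketch (the function names are my own, not from the slides) of entropy-based information gain for one candidate binary split on a numerical attribute; the streaming algorithms discussed later compute the same quantity from class histograms instead of raw instances.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, split_point):
    """Gain of the binary split `value <= split_point` on one numerical attribute."""
    left = [y for v, y in zip(values, labels) if v <= split_point]
    right = [y for v, y in zip(values, labels) if v > split_point]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children
```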

Challenges The total size of the data is typically much larger than the available memory, so it is not possible to store all the data and re-read it. A single-pass algorithm is required, and it must also meet the real-time constraint.

Template of algorithm AQ – active queue: the set of decision tree nodes that have not yet been split (a sketch of this template is given below).
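
A rough Python sketch of that template (helper names such as `sort`, `update_statistics`, and `enough_samples` are hypothetical placeholders, not the paper's API): stream instances are routed to the leaves in the active queue, each active leaf accumulates statistics, and a leaf is removed from the queue once the statistical test allows a split.

```python
def build_tree_from_stream(stream, root, split_leaf, enough_samples):
    """Generic single-pass template; AQ holds the leaves that have not been split yet."""
    AQ = {root}                                  # active queue of unsplit nodes
    for instance, label in stream:               # one pass over the data stream
        leaf = root.sort(instance)               # route the instance to its current leaf
        if leaf not in AQ:
            continue
        leaf.update_statistics(instance, label)  # class histograms, interval counts, ...
        if enough_samples(leaf):                 # statistical test (Hoeffding / normal)
            children = split_leaf(leaf)          # choose attribute and split point
            AQ.discard(leaf)
            AQ.update(children)                  # new leaves start collecting samples
        if not AQ:                               # no active leaves left: tree is grown
            break
    return root
```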

Challenges When do we have sufficient samples to make the splitting decision? What information needs to be stored from the samples for the splitting decision? How do we effectively examine all possible splitting conditions within a node? The number of distinct values associated with a numerical attribute can be large, which can make it computationally demanding to choose the best split point.

Sampling Statistical test (evaluated on a sample): g(xa) − g(xb) > ϵ, where g(xi) is the estimate of the best gain we could get from attribute xi and ϵ is a small positive number. Original test (evaluated on the complete data set): the same comparison with the true gains, i.e. xa really is the best attribute to split on.

Sampling α – the probability that the original test holds given that the statistical test holds; it should be as close to 1 as possible. ϵ can be viewed as a function of α and the sample size |S|. Domingos and Hulten use the Hoeffding bound to construct this function.
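
For reference, a sketch of the standard Hoeffding-bound form used in this setting (assuming the usual formulation: R is the range of the gain function, e.g. log2 k for entropy with k classes, and δ = 1 − α):

```python
from math import log, log2, sqrt

def hoeffding_epsilon(num_classes, delta, sample_size):
    """epsilon as a function of the confidence (delta = 1 - alpha) and sample size |S|."""
    R = log2(num_classes)   # range of the entropy-based gain function
    return sqrt(R * R * log(1.0 / delta) / (2.0 * sample_size))

# The split is made once the observed gain difference between the two best
# attributes exceeds epsilon: g(xa) - g(xb) > hoeffding_epsilon(k, delta, n).
```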

Numerical Interval Pruning Basic idea: partition the range of a numerical attribute into intervals, then use a statistical test to prune these intervals. An interval is pruned if it does not appear likely to include the best split point. Small class histograms: the class histograms for all categorical attributes, plus the numerical attributes with a small number of distinct values. Concise class histograms: for each interval, record the number of instances with each class label whose value of the numerical attribute falls within that interval.

Numerical Interval Pruning Detailed information: kept in one of two formats 1. a class histogram for the samples that fall within the interval, or 2. the samples themselves, maintained per class label. The detailed information associated with a pruned interval does not need to be processed, giving a significant reduction in execution time with no loss of accuracy.
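
One way to picture the per-interval bookkeeping (my own illustrative sketch, not the paper's exact data structure): the attribute range is cut into equal-width intervals, each interval keeps per-class counts, and only the boundaries of unpruned intervals are examined as candidate split points.

```python
from collections import defaultdict

class ConciseClassHistogram:
    """Per-interval class counts for one numerical attribute (illustrative only)."""
    def __init__(self, lo, hi, num_intervals):
        self.lo = lo
        self.width = (hi - lo) / num_intervals
        self.counts = [defaultdict(int) for _ in range(num_intervals)]
        self.pruned = [False] * num_intervals    # set True by the statistical test

    def add(self, value, label):
        i = min(int((value - self.lo) / self.width), len(self.counts) - 1)
        self.counts[i][label] += 1

    def candidate_split_points(self):
        """Interval boundaries still worth evaluating; pruned intervals are skipped."""
        return [self.lo + i * self.width
                for i, is_pruned in enumerate(self.pruned) if not is_pruned]
```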

NIP algorithm for numerical attributes Main challenge: prune the intervals aggressively but correctly. Over-pruning: an interval is pruned after analyzing only a small sample and may have to be unpruned later. Under-pruning: an interval that could be pruned is kept because it has not been pruned yet.

A new sampling approach The Hoeffding-bound-based result is independent of the data instances in the data set. Our method is still independent of the data instances, but additionally exploits properties of gain functions such as entropy.

A new sampling approach zα – the (1 − α)-th percentile of the standard normal distribution. τ² – variance. x, y – variables.

A new sampling approach We are interested in the sample size that could separate g(xa) and g(xb), where xa and xb are the points that maximize the gain of the split function for the top two attributes Xa and Xb. The resulting sample size satisfies Nn ≤ Nh, where Nh is the sample size required by the Hoeffding bound (conclusion from the paper).

Experimental results Sample-H is the version that uses the Hoeffding bound. ClassHist-H uses the Hoeffding bound and creates full class histograms. NIP-H and NIP-N use numerical interval pruning, with the Hoeffding bound and the normal-distribution-based test for the entropy function, respectively.

Experimental results Dataset – 10 million training records with 6 numerical attributes and 3 categorical attributes. Memory bound – 60 MB. The results are consistent with what was reported by Domingos and Hulten.

Conclusion & future work The numerical interval pruning (NIP) approach significantly reduces the processing time for numerical attributes without any loss of accuracy. The properties of the entropy gain function are used to reduce the sample size required to obtain the same probabilistic bound. Future work: 1. Other ways of creating intervals. 2. Extending the work to drifting data streams.

Reference Ruoming Jin and Gagan Agrawal. Efficient Decision Tree Construction on Streaming Data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.