
Decision Trees and more!

Learning OR with few attributes
Target function: OR of k literals
Goal: learn in time
– polynomial in k and log n
– ε and δ constants
ELIM makes “slow” progress
– might disqualify only one literal per round
– might remain with O(n) candidate literals

ELIM: Algorithm for learning OR
Keep a list of all candidate literals
For every example whose classification is 0:
– Erase all the literals that are 1.
Correctness:
– Our hypothesis h: an OR of our set of literals.
– Our set of literals includes the target OR literals.
– Every time h predicts zero: we are correct.
Sample size:
– m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
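A minimal Python sketch of ELIM as described above. The data representation (examples as 0/1 tuples, literals as (index, sign) pairs) is an assumption of this sketch, not the slides' notation.

```python
# ELIM: learn an OR of literals from labeled Boolean examples.
# Representation (assumed here): an example x is a 0/1 tuple; a literal is
# (i, s) meaning "x[i] == s", so s=1 is x_i and s=0 is not(x_i).

def elim(sample):
    """sample: list of (x, y) with x a 0/1 tuple and y in {0, 1}."""
    n = len(sample[0][0])
    literals = {(i, s) for i in range(n) for s in (0, 1)}   # all 2n candidate literals
    for x, y in sample:
        if y == 0:
            # a literal satisfied by a negative example cannot be in the target OR
            literals -= {(i, s) for (i, s) in literals if x[i] == s}
    return literals

def predict(literals, x):
    # hypothesis h = OR of the remaining literals
    return int(any(x[i] == s for (i, s) in literals))

# Example: target is x_0 OR not(x_2) over n = 3 variables.
data = [((1, 0, 1), 1), ((0, 1, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0)]
lits = elim(data)
assert all(predict(lits, x) == y for x, y in data if y == 0)   # h never errs on negatives
```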

Set Cover - Definition
Input: S_1, …, S_t with S_i ⊆ U
Output: S_{i1}, …, S_{ik} with ∪_j S_{ij} = U
Question: Are there k sets that cover U?
NP-complete

Set Cover: Greedy algorithm
j = 0; U_j = U; C = ∅
While U_j ≠ ∅:
– Let S_i be arg max |S_i ∩ U_j|
– Add S_i to C
– Let U_{j+1} = U_j – S_i
– j = j + 1
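A runnable sketch of the greedy rule above, assuming the sets are given as Python sets; the function and variable names are illustrative.

```python
def greedy_set_cover(universe, sets):
    """universe: a set; sets: list of sets whose union is the universe.
    Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # the set with the largest intersection with the uncovered elements
        i = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("input sets do not cover the universe")
        chosen.append(i)
        uncovered -= sets[i]
    return chosen

# Example: an optimal cover uses 2 sets; greedy never uses more than k*ln(|U|+1).
U = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {3, 4}, {4, 5}, {1, 5}]
print(greedy_set_cover(U, S))   # [0, 2]
```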

Set Cover: Greedy Analysis
At termination, C is a cover.
Assume there is a cover C* of size k.
C* is a cover for every U_j
Some S in C* covers at least |U_j|/k elements of U_j
Analysis of U_j: |U_{j+1}| ≤ |U_j| – |U_j|/k
Solving the recursion: number of sets j ≤ k ln(|U|+1)
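A worked solution of the recursion, filling in the step the slide leaves implicit:

```latex
|U_{j+1}| \le |U_j|\Bigl(1-\tfrac{1}{k}\Bigr)
\quad\Longrightarrow\quad
|U_j| \le |U|\Bigl(1-\tfrac{1}{k}\Bigr)^{j} \le |U|\,e^{-j/k}.
```

So for j = k ln(|U|+1) we get |U_j| ≤ |U|/(|U|+1) < 1, i.e. U_j = ∅, and the greedy algorithm stops after at most k ln(|U|+1) iterations.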

Building an Occam algorithm
Given a sample T of size m
– Run ELIM on T
– Let LIT be the set of remaining literals
– Assume there exist k literals in LIT that classify all of the sample T correctly
Negative examples T^-:
– any subset of LIT classifies T^- correctly

Building an Occam algorithm
Positive examples T^+:
– Search for a small subset of LIT which classifies T^+ correctly
– For a literal z build S_z = {x ∈ T^+ : z satisfies x}
– Our assumption: there are k sets that cover T^+
– Greedy finds k ln m sets that cover T^+
Output h = OR of the k ln m literals
Size(h) < k ln m log(2n)
Sample size m = O(k log n log(k log n))
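A sketch of the Occam algorithm above, combining the ELIM and greedy-cover ideas; the representation of literals as (index, value) pairs is an assumption of this sketch.

```python
def occam_or(sample):
    """Learn a short OR: run ELIM, then greedily cover the positive examples
    with as few of the surviving literals as possible."""
    n = len(sample[0][0])
    # --- ELIM phase: keep only literals consistent with all negative examples
    lit = {(i, s) for i in range(n) for s in (0, 1)}
    for x, y in sample:
        if y == 0:
            lit -= {(i, s) for (i, s) in lit if x[i] == s}
    # --- Greedy set-cover phase over the positive examples
    positives = [x for x, y in sample if y == 1]
    cover_of = {z: {j for j, x in enumerate(positives) if x[z[0]] == z[1]} for z in lit}
    uncovered, hypothesis = set(range(len(positives))), []
    while uncovered:
        z = max(cover_of, key=lambda z: len(cover_of[z] & uncovered))
        hypothesis.append(z)
        uncovered -= cover_of[z]
    return hypothesis   # an OR of about k*ln(m) literals

data = [((1, 0, 1), 1), ((0, 1, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0)]
print(occam_or(data))
```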

k-DNF
Definition:
– A disjunction of terms, each of at most k literals
Example:
– Term: T = x_3 ∧ x_1 ∧ x_5
– DNF: T_1 ∨ T_2 ∨ T_3 ∨ T_4

Learning k-DNF
Extended input:
– For each AND of k literals define a “new” input T
– Example: T = x_3 ∧ x_1 ∧ x_5
– Number of new inputs: at most (2n)^k
– Can compute the new inputs easily in time k(2n)^k
– The k-DNF is an OR over the new inputs.
– Run the ELIM algorithm over the new inputs.
Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
Running time: same.
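A sketch of the feature expansion, assuming examples are 0/1 tuples; it enumerates all conjunctions of up to k literals and evaluates them, so the expanded examples can be fed to ELIM as in the earlier sketch.

```python
from itertools import combinations, product

def expand_k_conjunctions(x, k):
    """Map a 0/1 example x to the values of all conjunctions of at most k literals.
    There are O((2n)^k) such conjunctions; each new coordinate is 1 iff the
    conjunction is satisfied by x."""
    n = len(x)
    features = []
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):        # which variables appear
            for signs in product((0, 1), repeat=size):   # negated or not
                features.append(int(all(x[i] == s for i, s in zip(idx, signs))))
    return tuple(features)

# A k-DNF over x becomes an OR of single "new" variables over the expanded input,
# so ELIM applied to [(expand_k_conjunctions(x, k), y) for x, y in sample] learns it.
print(len(expand_k_conjunctions((1, 0, 1), 2)))   # 2*3 + 4*3 = 18 features for n=3, k=2
```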

Learning Decision Lists
Definition:
[Figure: a decision list – a chain of nodes testing literals such as x_4, x_7, x_1, each with an output value, ending in a default leaf]

Learning Decision Lists
Similar to ELIM. Input: a sample S of size m.
While S is not empty:
– For a literal z build T_z = {x | z satisfies x}
– Find a T_z whose examples all have the same classification
– Add z to the decision list
– Update S = S – T_z
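A sketch of this greedy decision-list learner, under the same 0/1 representation assumed in the earlier sketches; the trailing ("default", label) rule plays the role of the final leaf.

```python
def learn_decision_list(sample):
    """Greedy DL learner: repeatedly find a literal whose satisfying examples
    all share one label, emit (literal, label), and discard those examples."""
    n = len(sample[0][0])
    literals = [(i, s) for i in range(n) for s in (0, 1)]
    S = list(sample)
    rules = []
    while S:
        labels_all = {y for _, y in S}
        if len(labels_all) == 1:                      # remaining examples agree: default leaf
            rules.append(("default", labels_all.pop()))
            break
        for i, s in literals:
            T = [(x, y) for x, y in S if x[i] == s]
            labels = {y for _, y in T}
            if T and len(labels) == 1:                # non-empty, pure bucket found
                rules.append(((i, s), labels.pop()))
                S = [(x, y) for x, y in S if x[i] != s]
                break
        else:
            raise ValueError("sample is not consistent with any decision list over literals")
    return rules

def dl_predict(rules, x):
    for cond, label in rules:
        if cond == "default" or x[cond[0]] == cond[1]:
            return label

data = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((0, 0), 1)]
rules = learn_decision_list(data)
assert all(dl_predict(rules, x) == y for x, y in data)
```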

DL algorithm: correctness
The output decision list is consistent.
Number of decision lists:
– Length < n+1
– Node: 2n literals
– Leaf: 2 values
– Total bound: (2·2n)^(n+1)
Sample size:
– m = O(n log n/ε + (1/ε) ln(1/δ))

k-DL
Each node is a conjunction of k literals
Includes k-DNF (and k-CNF)
[Figure: a k-DL whose nodes test conjunctions such as x_3 ∧ x_1 and x_5 ∧ x_7]

Learning k-DL
Extended input:
– For each AND of k literals define a “new” input
– Example: T = x_3 ∧ x_1 ∧ x_5
– Number of new inputs: at most (2n)^k
– Can compute the new inputs easily in time k(2n)^k
– The k-DL is a DL over the new inputs.
– Run the DL algorithm over the new inputs.
Sample size
Running time

Open Problems
Attribute efficient learning:
– Decision lists: very limited results
– Parity functions: negative?
– k-DNF and k-DL

Decision Trees
[Figure: a decision tree with internal nodes testing x_1 and x_6]

Learning Decision Trees Using DL
Consider a decision tree T of size r.
Theorem:
– There exists a log(r+1)-DL L that computes T.
Claim: There exists a leaf in T of depth at most log(r+1).
(Otherwise every leaf would have depth greater than log(r+1) and the tree would have more than r leaves. The path to that shallow leaf is a conjunction of at most log(r+1) literals: make it the first node of the decision list, remove the leaf, and recurse.)
Learn a Decision Tree using a Decision List.
Running time: n^(log s)
– n: number of attributes
– s: tree size

Decision Trees
[Figure: a decision tree with threshold tests such as x_1 > 5 and x_6 > …]

Decision Trees: Basic Setup
Basic class of hypotheses H.
Input: sample of examples
Output: decision tree
– Each internal node: a predicate from H
– Each leaf: a classification value
Goal (Occam's Razor):
– Small decision tree
– Classifies all (most) examples correctly.

Decision Tree: Why?
Efficient algorithms:
– Construction
– Classification
Performance: comparable to other methods
Software packages:
– CART
– C4.5 and C5

Decision Trees: This Lecture
Algorithms for constructing DT
A theoretical justification
– Using boosting
Future lecture:
– DT pruning

Decision Trees Algorithm: Outline
A natural recursive procedure:
– Decide a predicate h at the root.
– Split the data using h.
– Build the right subtree (for h(x)=1).
– Build the left subtree (for h(x)=0).
Running time:
– T(s) = O(s) + T(s^+) + T(s^-) = O(s log s)
– s = tree size

DT: Selecting a Predicate
Basic setting: split a leaf v on predicate h into children v_1 (h=0) and v_2 (h=1), with
– Pr[f=1] = q at v
– Pr[h=0] = u and Pr[f=1 | h=0] = p at v_1
– Pr[h=1] = 1-u and Pr[f=1 | h=1] = r at v_2
Clearly: q = u·p + (1-u)·r

Potential function: setting
Compare predicates using a potential function.
– Inputs: q, u, p, r
– Output: value
Node dependent:
– For each node and predicate assign a value.
– Given a split: u·val(v_1) + (1-u)·val(v_2)
– For a tree: weighted sum over the leaves.

PF: classification error
Let val(v) = min{q, 1-q}
– Classification error.
The average potential only drops.
Termination:
– When the average is zero
– Perfect classification

PF: classification error
Is this a good split?
– q = Pr[f=1] = 0.8
– u = Pr[h=0] = 0.5, p = Pr[f=1 | h=0] = 0.6
– 1-u = Pr[h=1] = 0.5, r = Pr[f=1 | h=1] = 1
Initial error: 0.2
After the split: 0.4·(1/2) + 0·(1/2) = 0.2
The error does not drop, even though the split is clearly informative.

Potential Function: requirements
When zero: perfect classification.
Strictly concave.

Potential Function: requirements
Every change is an improvement:
– by concavity, val(q) = val(u·p + (1-u)·r) ≥ u·val(p) + (1-u)·val(r), with strict inequality when p ≠ r.
[Figure: val(q) lies above the chord through val(p) and val(r)]

Potential Functions: Candidates
Potential functions:
– val(q) = Gini(q) = 2q(1-q)  (CART)
– val(q) = entropy(q) = -q log q - (1-q) log(1-q)  (C4.5)
– val(q) = sqrt{2q(1-q)}
Assumptions:
– Symmetric: val(q) = val(1-q)
– Concave
– val(0) = val(1) = 0 and val(1/2) = 1
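A small sketch of the three candidate potentials, written exactly as the slide states them (normalizations vary across references, so treat the constants as this slide's convention):

```python
import math

def gini(q):                      # CART: 2q(1-q)
    return 2 * q * (1 - q)

def entropy(q):                   # C4.5: -q log q - (1-q) log(1-q), base 2
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def sqrt_gini(q):                 # sqrt{2 q (1-q)}
    return math.sqrt(2 * q * (1 - q))

def split_value(val, u, p, r):
    """Potential of a split: u*val(p) + (1-u)*val(r)."""
    return u * val(p) + (1 - u) * val(r)

# The split from the earlier example (q=0.8, u=0.5, p=0.6, r=1.0):
# classification error showed no drop, but each strictly concave potential does drop.
q, u, p, r = 0.8, 0.5, 0.6, 1.0
for val in (gini, entropy, sqrt_gini):
    print(val.__name__, round(val(q) - split_value(val, u, p, r), 3))
```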

DT: Construction Algorithm
Procedure DT(S):  (S = sample)
– If all the examples in S have the same classification b: create a leaf of value b and return
– For each h compute val(h,S) = u_h·val(p_h) + (1-u_h)·val(r_h)
– Let h' = arg min_h val(h,S)
– Split S using h' into S_0 and S_1
– Recursively invoke DT(S_0) and DT(S_1)
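A minimal sketch of the DT(S) procedure above, taking axis-aligned Boolean tests as the predicate class H and Gini as val; the tree representation and variable names are assumptions of this sketch.

```python
def gini(q):
    return 2 * q * (1 - q)

def build_dt(S, val=gini):
    """S: list of (x, y) with x a 0/1 tuple and y in {0,1}.
    Returns ('leaf', b) or ('node', i, subtree_for_x[i]=0, subtree_for_x[i]=1)."""
    labels = {y for _, y in S}
    if len(labels) == 1:                                # all examples agree: make a leaf
        return ("leaf", labels.pop())
    n = len(S[0][0])
    best = None
    for i in range(n):                                  # H = tests "x_i == 1"
        S0 = [(x, y) for x, y in S if x[i] == 0]
        S1 = [(x, y) for x, y in S if x[i] == 1]
        if not S0 or not S1:
            continue                                    # useless split
        u = len(S0) / len(S)
        p = sum(y for _, y in S0) / len(S0)             # Pr[f=1 | h=0]
        r = sum(y for _, y in S1) / len(S1)             # Pr[f=1 | h=1]
        score = u * val(p) + (1 - u) * val(r)           # val(h, S)
        if best is None or score < best[0]:
            best = (score, i, S0, S1)
    if best is None:                                    # no split separates S: majority leaf
        return ("leaf", int(sum(y for _, y in S) * 2 >= len(S)))
    _, i, S0, S1 = best
    return ("node", i, build_dt(S0, val), build_dt(S1, val))

def dt_predict(tree, x):
    while tree[0] == "node":
        tree = tree[3] if x[tree[1]] == 1 else tree[2]
    return tree[1]

# XOR of two bits: needs a depth-2 tree.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
T = build_dt(data)
assert all(dt_predict(T, x) == y for x, y in data)
```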

DT: Analysis
Potential function:
– val(T) = Σ_{v leaf of T} Pr[v]·val(q_v)
For simplicity: use the true probabilities.
Bounding the classification error:
– error(T) ≤ val(T)
– study how fast val(T) drops
Given a tree T, define T(l,h) as the tree obtained from T by splitting leaf l with predicate h, where:
– h: predicate
– l: leaf of T
[Figure: the tree T with one leaf replaced by a test on h]

Top-Down algorithm
Input: s = size; H = predicates; val(); T_0 = single-leaf tree
For t from 1 to s do:
– Let (l,h) = arg max_{(l,h)} {val(T_t) – val(T_t(l,h))}
– T_{t+1} = T_t(l,h)
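A sketch of the top-down procedure, again with Boolean tests and Gini; unlike the recursive DT(S) above, at every step it greedily picks the single (leaf, predicate) pair giving the largest global drop in val(T). Names are illustrative.

```python
def gini(q):
    return 2 * q * (1 - q)

def leaf_val(S, m, val=gini):
    """Contribution Pr[leaf] * val(q_leaf) of a leaf holding subsample S; m = total sample size."""
    q = sum(y for _, y in S) / len(S)
    return (len(S) / m) * val(q)

def top_down(sample, s, val=gini):
    """Grow a tree for s steps. The tree is represented just by the list of
    subsamples reaching its leaves, which is enough to track val(T)."""
    m = len(sample)
    n = len(sample[0][0])
    leaves = [sample]                                    # T_0: a single leaf
    for _ in range(s):
        best = None                                      # (drop, leaf index, S0, S1)
        for li, S in enumerate(leaves):
            for i in range(n):                           # candidate predicate h: "x_i == 1"
                S0 = [(x, y) for x, y in S if x[i] == 0]
                S1 = [(x, y) for x, y in S if x[i] == 1]
                if not S0 or not S1:
                    continue
                drop = leaf_val(S, m, val) - leaf_val(S0, m, val) - leaf_val(S1, m, val)
                if best is None or drop > best[0]:
                    best = (drop, li, S0, S1)
        if best is None:                                 # no leaf can be split any further
            break
        _, li, S0, S1 = best
        leaves[li:li + 1] = [S0, S1]                     # T_{t+1} = T_t(l, h)
    return sum(leaf_val(S, m, val) for S in leaves)      # val(T_s)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(top_down(data, s=3))                               # 0.0: XOR is fit after 3 splits
```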

Theoretical Analysis
Assume H satisfies the weak learning hypothesis:
– for each distribution D there is an h s.t. error(h) < 1/2 - γ
Show that in every step there is a significant drop in val(T).
Results are weaker than AdaBoost
– but the algorithm was never intended to do it!
Use weak learning to show a large drop in val(T) at each step.
Modify the initial distribution to be unbiased.

Theoretical Analysis
Let val(q) = 2q(1-q)
Local drop at a node: at least 16γ²[q(1-q)]²
Claim: At every step t there is a leaf l s.t.:
– Pr[l] ≥ ε_t/2t
– error(l) = min{q_l, 1-q_l} ≥ ε_t/2
– where ε_t is the error at stage t
Proof!
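The slide leaves the proof to the lecture; here is one standard counting argument (a sketch, taking the stage-t tree to have at most t leaves, which matches the slide's 2t constant):

```latex
\epsilon_t \;=\; \sum_{l}\Pr[l]\,\mathrm{error}(l)
\;\le\; \frac{\epsilon_t}{2}\sum_{l:\,\mathrm{error}(l)<\epsilon_t/2}\Pr[l]
\;+\; \sum_{l:\,\mathrm{error}(l)\ge\epsilon_t/2}\Pr[l]\,\mathrm{error}(l)
\;\le\; \frac{\epsilon_t}{2} \;+\; \sum_{l:\,\mathrm{error}(l)\ge\epsilon_t/2}\Pr[l]\,\mathrm{error}(l).
```

Hence the leaves with error(l) ≥ ε_t/2 contribute at least ε_t/2 to the error; since there are at most t of them and error(l) ≤ 1, some such leaf has Pr[l] ≥ Pr[l]·error(l) ≥ ε_t/2t, and by construction it also has error(l) ≥ ε_t/2.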

Theoretical Analysis
Drop at time t at least:
– Pr[l]·γ²·[q_l(1-q_l)]² ≥ γ²·ε_t³ / t
For the Gini index:
– val(q) = 2q(1-q)
– q ≥ q(1-q) ≥ val(q)/2
Drop at least O(γ²[val(q_t)]³ / t)

Theoretical Analysis
Need to solve when val(T_k) < ε
Bound k.
Time: exp{O((1/γ²)(1/ε²))}

Something to think about
AdaBoost: very good bounds
DT with Gini index: exponential
Comparable results in practice
How can it be?