
1 Decision Trees and more!

2 Learning OR with few attributes
Target function: OR of k literals
Goal: learn in time
– polynomial in k and log n
– ε and δ constants
ELIM makes "slow" progress:
– might disqualify only one literal per round
– might remain with O(n) candidate literals

3 ELIM: Algorithm for learning OR
Keep a list of all candidate literals.
For every example whose classification is 0:
– Erase all the literals that are 1.
Correctness:
– Our hypothesis h: an OR of our set of literals.
– Our set of literals includes the target OR literals.
– Every time h predicts zero: we are correct.
Sample size:
– m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
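A minimal Python sketch of ELIM (the encoding of literals as (variable index, sign) pairs and the helper names are mine, not part of the slides):

```python
from typing import List, Tuple

Literal = Tuple[int, int]  # (variable index, sign): sign=1 means x_i, sign=0 means not(x_i)

def literal_value(lit: Literal, x: List[int]) -> int:
    i, sign = lit
    return x[i] if sign == 1 else 1 - x[i]

def elim(examples: List[Tuple[List[int], int]], n: int) -> List[Literal]:
    """Keep every literal that is 0 on all negative examples."""
    candidates = {(i, s) for i in range(n) for s in (0, 1)}
    for x, y in examples:
        if y == 0:
            # A literal that evaluates to 1 on a negative example cannot be in the target OR.
            candidates = {lit for lit in candidates if literal_value(lit, x) == 0}
    return sorted(candidates)

def predict(literals: List[Literal], x: List[int]) -> int:
    # Hypothesis h: the OR of the remaining literals.
    return int(any(literal_value(lit, x) == 1 for lit in literals))
```

Because the target's own literals always survive, h predicts 1 whenever the target does, which matches the slide's correctness claim that h is right whenever it predicts zero.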

4 Set Cover - Definition
Input: S_1, …, S_t with S_i ⊆ U
Output: S_{i_1}, …, S_{i_k} with ∪_j S_{i_j} = U
Question: Are there k sets that cover U?
NP-complete

5 Set Cover: Greedy algorithm
j = 0; U_j = U; C = ∅
While U_j ≠ ∅:
– Let S_i be arg max_i |S_i ∩ U_j|
– Add S_i to C
– Let U_{j+1} = U_j − S_i
– j = j + 1
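The greedy rule in Python (a sketch; the example universe and sets at the bottom are made up for illustration):

```python
from typing import List, Set, TypeVar

T = TypeVar("T")

def greedy_set_cover(universe: Set[T], sets: List[Set[T]]) -> List[Set[T]]:
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    cover: List[Set[T]] = []
    while uncovered:
        best = max(sets, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            raise ValueError("the given sets do not cover the universe")
        cover.append(best)
        uncovered -= best
    return cover

# An optimal cover of U below uses 2 sets; greedy also finds 2.
U = {1, 2, 3, 4, 5, 6}
S = [{1, 2, 3}, {4, 5, 6}, {1, 4}, {2, 5}, {3, 6}]
print(greedy_set_cover(U, S))   # [{1, 2, 3}, {4, 5, 6}]
```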

6 Set Cover: Greedy Analysis
At termination, C is a cover.
Assume there is a cover C* of size k.
C* is a cover for every U_j.
Some S in C* covers at least |U_j|/k elements of U_j.
Analysis of U_j:
– |U_{j+1}| ≤ |U_j| − |U_j|/k
Solving the recursion: number of sets j ≤ k ln(|U|+1)
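A quick derivation of the bound (a sketch, under the slide's assumption of an optimal cover C* of size k):

|U_{j+1}| \le |U_j|\left(1 - \tfrac{1}{k}\right) \;\Rightarrow\; |U_j| \le |U|\left(1 - \tfrac{1}{k}\right)^{j} \le |U|\, e^{-j/k},

so |U_j| < 1, i.e. U_j = ∅, once j > k \ln|U|; the greedy loop therefore runs at most k \ln|U| + 1 times, which gives the stated bound.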

7 Building an Occam algorithm
Given a sample T of size m:
– Run ELIM on T
– Let LIT be the set of remaining literals
– Assume there exist k literals in LIT that classify the entire sample T correctly
Negative examples T⁻:
– any subset of LIT classifies T⁻ correctly

8 Building an Occam algorithm
Positive examples T⁺:
– Search for a small subset of LIT that classifies T⁺ correctly
– For a literal z build S_z = {x ∈ T⁺ | x satisfies z}
– Our assumption: there are k sets that cover T⁺
– Greedy finds k ln m sets that cover T⁺
Output h = OR of the k ln m chosen literals
size(h) < k ln m · log(2n)
Sample size: m = O(k log n · log(k log n))
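Putting the two steps together, a self-contained sketch of the Occam learner for ORs (the function name occam_or is made up; literals are (index, sign) pairs as before):

```python
from typing import Dict, List, Set, Tuple

Literal = Tuple[int, int]              # (variable index, sign); sign=1 -> x_i, sign=0 -> not(x_i)
Example = Tuple[Tuple[int, ...], int]  # (bits, label)

def lit_val(lit: Literal, x) -> int:
    i, s = lit
    return x[i] if s else 1 - x[i]

def occam_or(sample: List[Example], n: int) -> List[Literal]:
    # Step 1 (ELIM): drop every literal that is 1 on some negative example.
    lits: Set[Literal] = {(i, s) for i in range(n) for s in (0, 1)}
    for x, y in sample:
        if y == 0:
            lits = {z for z in lits if lit_val(z, x) == 0}
    # Step 2: greedily cover the positive examples, viewing each surviving
    # literal z as the set S_z of positives on which z is 1.
    positives = [x for x, y in sample if y == 1]
    cover_sets: Dict[Literal, Set[int]] = {
        z: {j for j, x in enumerate(positives) if lit_val(z, x) == 1} for z in lits
    }
    uncovered = set(range(len(positives)))
    chosen: List[Literal] = []
    while uncovered:
        z = max(cover_sets, key=lambda w: len(cover_sets[w] & uncovered), default=None)
        if z is None or not cover_sets[z] & uncovered:
            raise ValueError("sample is not consistent with an OR of the surviving literals")
        chosen.append(z)
        uncovered -= cover_sets[z]
    return chosen  # hypothesis h = OR of the chosen literals
```

The output OR is consistent with the whole sample: it is 0 on T⁻ because ELIM kept only literals that are 0 there, and it is 1 on every covered positive example.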

9 k-DNF
Definition:
– a disjunction of terms, each of at most k literals
Example term: T = x_3 ∧ x_1 ∧ x_5
Example DNF: T_1 ∨ T_2 ∨ T_3 ∨ T_4

10 Learning k-DNF
Extended input:
– For each AND of at most k literals define a "new" input T
– Example: T = x_3 ∧ x_1 ∧ x_5
– Number of new inputs: at most (2n)^k
– Can compute the new inputs easily, in time k(2n)^k
– The k-DNF is an OR over the new inputs.
– Run the ELIM algorithm over the new inputs.
Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
Running time: same.
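A sketch of the input expansion (the encoding of a term as index/sign pairs is my choice, not fixed by the slides):

```python
from itertools import combinations, product
from typing import List

def expand(x: List[int], n: int, k: int) -> List[int]:
    """Map an n-bit example to one bit per conjunction of at most k literals."""
    def lit_val(i: int, sign: int) -> int:
        return x[i] if sign else 1 - x[i]

    features: List[int] = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):
                # the new input is 1 iff the whole term is satisfied by x
                features.append(int(all(lit_val(i, s) for i, s in zip(idxs, signs))))
    return features

# Example usage: expand a 5-bit example for k = 3.
print(len(expand([1, 0, 1, 1, 0], n=5, k=3)))   # number of new inputs (at most (2n)^k)
```

Feeding the expanded examples to ELIM then learns the k-DNF as a plain OR over the new coordinates.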

11 Learning Decision Lists
Definition: a list of nodes, each testing a literal; if the literal is 1 on the input, output that node's value, otherwise continue to the next node.
[Figure: a decision list with nodes x_4, x_7, x_1, branches labeled 1/0, and ±1 output values]

12 Learning Decision Lists
Similar to ELIM.
Input: a sample S of size m.
While S is not empty:
– For each literal z build T_z = {x ∈ S | x satisfies z}
– Find a z whose T_z is non-empty and whose examples all have the same classification
– Add z (with that classification) to the decision list
– Update S = S − T_z
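A sketch of this greedy construction in Python (Rivest-style; representing the list as (literal, label) rules with a trailing default rule is my choice):

```python
from typing import List, Optional, Tuple

Literal = Tuple[int, int]              # (variable index, sign); sign=1 -> x_i, sign=0 -> not(x_i)
Example = Tuple[Tuple[int, ...], int]  # (bits, label)

def lit_val(lit: Literal, x) -> int:
    i, s = lit
    return x[i] if s else 1 - x[i]

def learn_decision_list(sample: List[Example], n: int) -> List[Tuple[Optional[Literal], int]]:
    """Peel off a homogeneous T_z each round; finish with a default rule."""
    S = list(sample)
    rules: List[Tuple[Optional[Literal], int]] = []
    literals = [(i, s) for i in range(n) for s in (0, 1)]
    while S:
        if len({y for _, y in S}) == 1:
            rules.append((None, S[0][1]))   # all remaining examples agree: default rule
            break
        for z in literals:
            T_z = [(x, y) for x, y in S if lit_val(z, x) == 1]
            labels = {y for _, y in T_z}
            if T_z and len(labels) == 1:
                rules.append((z, labels.pop()))
                S = [(x, y) for x, y in S if lit_val(z, x) == 0]
                break
        else:
            raise ValueError("sample is not consistent with any decision list over single literals")
    return rules

def predict_dl(rules: List[Tuple[Optional[Literal], int]], x) -> int:
    for z, label in rules:
        if z is None or lit_val(z, x) == 1:
            return label
    return 0  # unreachable when the list ends with a default rule
```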

13 DL algorithm: correctness
The output decision list is consistent.
Number of decision lists:
– Length ≤ n+1
– Node: 2n literals
– Leaf: 2 values
– Total bound: (2·2n)^{n+1}
Sample size:
– m = O((n log n)/ε + (1/ε) ln(1/δ))

14 k-DL
Each node is a conjunction of at most k literals.
Includes k-DNF (and k-CNF).
[Figure: a 2-DL whose nodes test x_4 ∧ x_2, x_3 ∧ x_1 and x_5 ∧ x_7, with branches labeled 1/0 and ±1 output values]

15 Learning k-DL
Extended input:
– For each AND of at most k literals define a "new" input
– Example: T = x_3 ∧ x_1 ∧ x_5
– Number of new inputs: at most (2n)^k
– Can compute the new inputs easily, in time k(2n)^k
– The k-DL is a DL over the new inputs.
– Run the DL algorithm over the new inputs.
Sample size: as for DL, with n replaced by (2n)^k.
Running time: as for DL, over the (2n)^k new inputs.

16 Open Problems
Attribute efficient learning:
– Decision lists: very limited results
– Parity functions: negative?
– k-DNF and k-DL

17 Decision Trees
[Figure: a decision tree with internal nodes x_1 and x_6, branches labeled 1/0, and ±1 leaves]

18 Learning Decision Trees Using DL
Consider a decision tree T of size r.
Theorem:
– There exists a log(r+1)-DL L that computes T.
Claim: there exists a leaf in T of depth at most log(r+1).
Learn a decision tree using a decision list: run the k-DL algorithm with k = log(r+1).
Running time: n^{log s}
– n: number of attributes
– s: tree size

19 Decision Trees
[Figure: a decision tree with threshold tests x_1 > 5 and x_6 > 2 at the internal nodes and ±1 leaves]

20 Decision Trees: Basic Setup
Basic class of hypotheses H.
Input: a sample of examples
Output: a decision tree
– Each internal node: a predicate from H
– Each leaf: a classification value
Goal (Occam's razor):
– Small decision tree
– Classifies all (most) examples correctly

21 Decision Tree: Why?
Efficient algorithms:
– Construction
– Classification
Performance: comparable to other methods
Software packages:
– CART
– C4.5 and C5

22 Decision Trees: This Lecture
Algorithms for constructing DTs
A theoretical justification
– using boosting
Future lecture:
– DT pruning

23 Decision Trees Algorithm: Outline
A natural recursive procedure.
Decide on a predicate h at the root.
Split the data using h.
Build the right subtree (for h(x)=1).
Build the left subtree (for h(x)=0).
Running time:
– T(s) = O(s) + T(s⁺) + T(s⁻) = O(s log s)
– s = tree size

24 DT: Selecting a Predicate
Basic setting: a node v splits on a predicate h into v_1 (the h=0 branch) and v_2 (the h=1 branch), with
– Pr[h=0] = u, Pr[h=1] = 1−u
– Pr[f=1] = q (at v)
– Pr[f=1 | h=0] = p, Pr[f=1 | h=1] = r
Clearly: q = u·p + (1−u)·r

25 Potential function: setting
Compare predicates using a potential function.
– Inputs: q, u, p, r
– Output: a value
Node dependent:
– For each node and predicate assign a value.
– Given a split: u·val(v_1) + (1−u)·val(v_2)
– For a tree: a weighted sum over the leaves.

26 PF: classification error
Let val(v) = min{q, 1−q}
– the classification error.
The average potential only drops.
Termination:
– when the average is zero
– perfect classification

27 PF: classification error
Is this a good split?
– u = Pr[h=0] = 0.5, 1−u = Pr[h=1] = 0.5
– q = Pr[f=1] = 0.8
– p = Pr[f=1 | h=0] = 0.6, r = Pr[f=1 | h=1] = 1
Initial error: 0.2
After the split: 0.4·(1/2) + 0·(1/2) = 0.2

28 Potential Function: requirements
When zero: perfect classification.
Strictly concave (so that a genuine split, with p ≠ r, strictly decreases the average potential).

29 Potential Function: requirements
Every change is an improvement.
[Figure: a concave val(·) curve with q = u·p + (1−u)·r; the chord value u·val(p) + (1−u)·val(r) lies below val(q)]

30 Potential Functions: Candidates
Potential functions:
– val(q) = Gini(q) = 2q(1−q)   (CART)
– val(q) = entropy(q) = −q log q − (1−q) log(1−q)   (C4.5)
– val(q) = sqrt{2q(1−q)}
Assumptions:
– Symmetric: val(q) = val(1−q)
– Concave
– val(0) = val(1) = 0 and val(1/2) = 1
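A small sketch of the candidate potentials and of the split value u·val(p) + (1−u)·val(r), reusing the numbers from the classification-error example above (the function names are mine):

```python
import math

def gini(q: float) -> float:
    return 2.0 * q * (1.0 - q)

def entropy(q: float) -> float:
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def sqrt_index(q: float) -> float:
    return math.sqrt(2.0 * q * (1.0 - q))

def classification_error(q: float) -> float:
    return min(q, 1.0 - q)

def split_value(val, u: float, p: float, r: float) -> float:
    """Weighted potential after the split: u*val(p) + (1-u)*val(r)."""
    return u * val(p) + (1.0 - u) * val(r)

# The split from the classification-error slide: u = 0.5, q = 0.8, p = 0.6, r = 1.
u, q, p, r = 0.5, 0.8, 0.6, 1.0
print(classification_error(q), split_value(classification_error, u, p, r))  # 0.2 -> 0.2 (no drop)
print(gini(q), split_value(gini, u, p, r))                                  # 0.32 -> 0.24 (drops)
```

On that split the classification error stays at 0.2 while the Gini potential drops from 0.32 to 0.24, which is exactly why a strictly concave potential is preferred.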

31 DT: Construction Algorithm
Procedure DT(S), where S is the sample:
If all the examples in S have the same classification b:
– Create a leaf of value b and return.
For each h compute val(h, S):
– val(h, S) = u_h·val(p_h) + (1−u_h)·val(r_h)
Let h' = arg min_h val(h, S).
Split S using h' into S_0 and S_1.
Recursively invoke DT(S_0) and DT(S_1).
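A minimal recursive sketch of DT(S) over boolean attributes, with Gini as val and H taken to be single-variable tests (both are illustrative choices, not fixed by the slides):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Example = Tuple[Tuple[int, ...], int]   # (bits, label in {0, 1})

@dataclass
class Node:
    feature: Optional[int] = None       # index tested at an internal node
    left: Optional["Node"] = None       # subtree for feature == 0
    right: Optional["Node"] = None      # subtree for feature == 1
    label: Optional[int] = None         # set at leaves

def gini(q: float) -> float:
    return 2.0 * q * (1.0 - q)

def build_dt(S: List[Example], n: int, val: Callable[[float], float] = gini) -> Node:
    labels = {y for _, y in S}
    if len(labels) == 1:
        return Node(label=labels.pop())
    best, best_val = None, float("inf")
    for h in range(n):                  # here H = single-variable tests x_h
        S0 = [(x, y) for x, y in S if x[h] == 0]
        S1 = [(x, y) for x, y in S if x[h] == 1]
        if not S0 or not S1:
            continue
        u = len(S0) / len(S)
        p = sum(y for _, y in S0) / len(S0)
        r = sum(y for _, y in S1) / len(S1)
        v = u * val(p) + (1 - u) * val(r)
        if v < best_val:
            best, best_val = h, v
    if best is None:                    # no predicate separates the sample
        return Node(label=int(sum(y for _, y in S) * 2 >= len(S)))
    S0 = [(x, y) for x, y in S if x[best] == 0]
    S1 = [(x, y) for x, y in S if x[best] == 1]
    return Node(feature=best, left=build_dt(S0, n, val), right=build_dt(S1, n, val))

def classify(t: Node, x: Tuple[int, ...]) -> int:
    while t.label is None:
        t = t.right if x[t.feature] == 1 else t.left
    return t.label
```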

32 DT: Analysis
Potential function:
– val(T) = Σ_{v leaf of T} Pr[v]·val(q_v)
For simplicity: use the true probabilities.
Bounding the classification error:
– error(T) ≤ val(T)
– study how fast val(T) drops
Given a tree T, define T(l, h), where:
– h: a predicate
– l: a leaf
– T(l, h): T with leaf l replaced by an internal node testing h

33 Top-Down algorithm
Input: s = size; H = predicates; val()
T_0 = the single-leaf tree
For t from 0 to s−1 do:
– Let (l, h) = arg max_{(l,h)} {val(T_t) − val(T_t(l, h))}
– T_{t+1} = T_t(l, h)
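A sketch of this top-down variant: at every round, split the (leaf, predicate) pair with the largest drop in val(T). To stay short it tracks only the partition of the sample into leaves, not the tree structure itself, and again uses single-variable tests and Gini (all of this is my framing of the slide, not a prescribed implementation):

```python
from typing import Callable, List, Tuple

Example = Tuple[Tuple[int, ...], int]   # (bits, label in {0, 1})

def gini(q: float) -> float:
    return 2.0 * q * (1.0 - q)

def contribution(S: List[Example], m: int, val) -> float:
    """Pr[leaf] * val(q_leaf): the leaf's share of val(T)."""
    q = sum(y for _, y in S) / len(S)
    return (len(S) / m) * val(q)

def top_down(sample: List[Example], n: int, s: int,
             val: Callable[[float], float] = gini) -> List[List[Example]]:
    """Run s rounds of best-first splitting; return the final partition into leaves."""
    m = len(sample)
    leaves: List[List[Example]] = [list(sample)]
    for _ in range(s):
        best = None                      # (drop in val(T), leaf index, S0, S1)
        for li, S in enumerate(leaves):
            for h in range(n):
                S0 = [(x, y) for x, y in S if x[h] == 0]
                S1 = [(x, y) for x, y in S if x[h] == 1]
                if not S0 or not S1:
                    continue
                drop = contribution(S, m, val) - contribution(S0, m, val) - contribution(S1, m, val)
                if best is None or drop > best[0]:
                    best = (drop, li, S0, S1)
        if best is None or best[0] <= 0:
            break                        # no split improves the potential
        _, li, S0, S1 = best
        leaves[li:li + 1] = [S0, S1]     # replace the split leaf by its two children
    return leaves
```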

34 Theoretical Analysis
Assume H satisfies the weak learning hypothesis:
– for every distribution D there is an h such that error(h) < 1/2 − γ
Show that in every step there is a significant drop in val(T).
The resulting bounds are weaker than AdaBoost's
– but the algorithm was never intended to do boosting!
Proof outline:
– Use weak learning to show a large drop in val(T) at each step.
– Modify the initial distribution to be unbiased.

35 Theoretical Analysis
Let val(q) = 2q(1−q).
Local drop at a node: at least 16γ²[q(1−q)]².
Claim: at every step t there is a leaf l such that
– Pr[l] ≥ ε_t/(2t)
– error(l) = min{q_l, 1−q_l} ≥ ε_t/2
– where ε_t is the error at stage t
Proof!

36 Theoretical Analysis
Drop at time t: at least
– Pr[l]·γ²·[q_l(1−q_l)]² ≥ γ²·ε_t³/t (up to constants)
For the Gini index:
– val(q) = 2q(1−q)
– q ≥ q(1−q) = val(q)/2
Drop at least O(γ²·[val(q_t)]³/t)
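Combining the claim from the previous slide with the local-drop bound (a sketch, constants not optimized): since q(1−q) ≥ min{q, 1−q}/2,

\text{drop} \;\ge\; \Pr[l]\cdot 16\gamma^{2}\,[q_l(1-q_l)]^{2} \;\ge\; \frac{\epsilon_t}{2t}\cdot 16\gamma^{2}\left(\frac{\epsilon_t}{4}\right)^{2} \;=\; \frac{\gamma^{2}\,\epsilon_t^{3}}{2t}.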

37 Theoretical Analysis
Need to solve: when is val(T_k) < ε?
Bound k.
Time: exp{O((1/γ²)·(1/ε²))}

38 Something to think about
AdaBoost: very good bounds.
DT with the Gini index: exponential.
Comparable results in practice.
How can that be?

