Approximating Optimal Binary Decision Trees Brent Heeringa (joint work with Micah Adler) 18 November 2005
Question:
- I am thinking of a computer scientist. Which one?
- Rule: Ask YES/NO questions from a finite set Q
Q1: MIT Professor? YES
Q2: Author of a popular CS text? YES
Q3: Inventor of RSA? NO
[Figure: the decision tree built so far, with nodes Q1 and Q2 and their YES/NO branches]
Decision Tree Problem (DT)
- Input: A set $X = \{x_1, \dots, x_n\}$ of binary strings (called items)
  - Each item has exactly $m$ bits
  - E.g. if $m = 5$ then $x_i$ might be
- Solution: A binary tree with $n$ leaves
  - Each internal node indexes some bit $k$ ⇒ partitions the items into two groups (bit $k$ = 0 and bit $k$ = 1)
  - Each item is a leaf ($n$ leaves total)
- Cost: Total sum of leaf depths
- Optimal Solution: DT with minimum cost
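The problem statement above can be made concrete with a small exhaustive solver. This is only a sketch for tiny inputs, not the method of the talk: it assumes items are distinct, equal-length Python strings of '0'/'1', and the name `optimal_cost` is hypothetical. It relies on the fact that splitting a group pushes every item in it one level deeper, so a split contributes the group size to the sum of leaf depths.

```python
from functools import lru_cache


def optimal_cost(items):
    """Minimum total leaf depth over all decision trees for `items`.

    Assumption: items are distinct, equal-length strings of '0'/'1'.
    Exponential time; intended only for very small examples.
    """
    items = tuple(sorted(set(items)))
    m = len(items[0]) if items else 0

    @lru_cache(maxsize=None)
    def solve(group):
        # Zero or one item needs no further test.
        if len(group) <= 1:
            return 0
        best = None
        for k in range(m):
            zeros = tuple(x for x in group if x[k] == "0")
            ones = tuple(x for x in group if x[k] == "1")
            if not zeros or not ones:
                continue  # bit k does not separate anything in this group
            # Every item in the group moves one level deeper.
            cost = len(group) + solve(zeros) + solve(ones)
            if best is None or cost < best:
                best = cost
        return best

    return solve(items)


print(optimal_cost(["00", "01", "10", "11"]))
print(optimal_cost(["000", "001", "010", "100"]))
```

The first call describes a perfectly balanced instance (four leaves at depth 2); in the second, every bit isolates a single item, so any tree is a path.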
Example: Cost = 9
Example: Cost = 8 OPTIMAL!
Alternative: Cost = 8
Decision Trees
- Decision Trees (DT) model many natural tasks in
  - Medical Diagnosis
  - Experiment Design
- DT is the 20-questions problem
- DT is NP-Complete
  - Reduction from Set Cover (Exact Cover by 3 Sets) ⇒ [Hyafil and Rivest]
Outline
- Problem Introduction
- A Greedy Approximation Algorithm for DT
- An Analysis of the Greedy Algorithm
  - $(\ln n + 1)$-approximation
- Other Results and Open Problems
A Greedy DT Algorithm
IDEA: Always choose the bit that most evenly partitions the items
A Greedy DT Algorithm
IDEA: Always choose the bit that most evenly partitions the items
GREEDY-DT(X)
  If X = Ø
    Return NIL
  Else if |X| = 1
    Return a leaf holding the single item of X
  Else
    k ← index of the bit most evenly separating X
    T ← new tree node
    T[left] ← GREEDY-DT({x ∈ X | x(k) = 0})
    T[right] ← GREEDY-DT({x ∈ X | x(k) = 1})
    Return T
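A minimal Python sketch of the greedy rule, under some assumptions that are not on the slide: items are distinct, equal-length strings of '0'/'1'; a leaf is the item string itself; an internal node is a tuple `(k, left, right)`; ties between equally balanced bits go to the lowest index. The names `greedy_dt` and `tree_cost` are hypothetical.

```python
def greedy_dt(items):
    """Build a decision tree by always taking the most balanced bit.

    Assumption: items are distinct, equal-length strings of '0'/'1'.
    Returns None for no items, the item itself for a single item,
    and (k, left, right) otherwise.
    """
    items = list(items)
    if not items:
        return None
    if len(items) == 1:
        return items[0]

    n = len(items)
    best_k, best_gap = None, None
    for k in range(len(items[0])):
        ones = sum(1 for x in items if x[k] == "1")
        if ones == 0 or ones == n:
            continue  # bit k separates nothing
        gap = abs(n - 2 * ones)  # 0 means a perfectly even split
        if best_gap is None or gap < best_gap:
            best_k, best_gap = k, gap
    if best_k is None:
        raise ValueError("items must be distinct")

    zeros = [x for x in items if x[best_k] == "0"]
    ones = [x for x in items if x[best_k] == "1"]
    return (best_k, greedy_dt(zeros), greedy_dt(ones))


def tree_cost(tree, depth=0):
    """Total sum of leaf depths, the DT cost measure."""
    if tree is None:
        return 0
    if isinstance(tree, str):
        return depth
    _, left, right = tree
    return tree_cost(left, depth + 1) + tree_cost(right, depth + 1)


tree = greedy_dt(["00", "01", "10", "11"])
print(tree, tree_cost(tree))
```

Breaking ties by lowest bit index is an arbitrary choice made here for determinism; the slide does not specify a tie-breaking rule.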
Optimal vs. Greedy
[Figure: optimal tree T* and greedy tree T over the items a to h]
Cost(T) = 26, Cost(T*) = 25
Outline
- Problem Introduction
- A Greedy Approximation Algorithm for DT
- An Analysis of the Greedy Algorithm
  - $(\ln n + 1)$-approximation
- Other Results and Open Problems
Approximation Algorithm Review
- Minimization Problem
- $C$ = cost given by approximation algorithm
- $C_{opt}$ = cost of optimal solution
- $\alpha$-approximation: $C \le \alpha \cdot C_{opt}$; $\alpha$ may be a function of the input size $n$
Analysis Outline
- Accounting Scheme
  - Each pair of items $\{x_i, x_j\}$ is separated exactly once in any decision tree ⇒ true for Greedy and Optimal
  - Distribute the cost of the greedy tree among item pairs
- Analyze the cost of the greedy tree w.r.t. the structure of the optimal tree
Theorem: The greedy algorithm yields a tree whose cost is at most $(\ln n + 1)$ times the cost of the optimal tree
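Definitions and Notation
- Consider each pair of items $\{x_i, x_j\}$
- $S_{ij}$: the node of the greedy tree $T$ that separates $x_i$ from $x_j$
- $S_{ij}$ also denotes the set of items below that node
- $S_{ij}^{+}$ and $S_{ij}^{-}$: the item sets of its two children, named so that $|S_{ij}^{+}| \ge |S_{ij}^{-}|$
- $|S_{ij}| = |S_{ij}^{+}| + |S_{ij}^{-}|$
[Figure: greedy tree $T$; node $S_{ij}$ with children $S_{ij}^{+}$ and $S_{ij}^{-}$, separating $x_i$ from $x_j$]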
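Accounting Method
- Assign cost $c_{ij}$ to each pair of items $\{x_i, x_j\}$
  - Distribute $|S_{ij}|$ equally among the $|S_{ij}^{+}| \cdot |S_{ij}^{-}|$ pairs of items split at $S_{ij}$
- $c_{ij} = \dfrac{|S_{ij}|}{|S_{ij}^{+}| \, |S_{ij}^{-}|}$
[Figure: greedy tree $T$ with node $S_{ij}$ separating $x_i$ from $x_j$]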
Accounting Method
- Example: node $S_{ij} = \{a,b,c,d,e,f\}$ with children $\{a,b,c,d\}$ and $\{e,f\}$
  - $|S_{ij}| = 6$, $|S_{ij}^{+}| = 4$, $|S_{ij}^{-}| = 2$
  - $c_{af} = c_{ce} = 6/8 = 3/4$
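The accounting scheme can be checked numerically. The sketch below is illustrative only: it represents a tree as nested pairs with string leaves, and the shape of the subtrees under $\{a,b,c,d\}$ and $\{e,f\}$ is an assumption, since the slide shows only the top split. The function names are hypothetical. Each node charges its size equally to the pairs it separates, and the charges sum to the total leaf depth.

```python
from fractions import Fraction
from itertools import product


def leaves(tree):
    """Items below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def pair_costs(tree, costs=None):
    """Map each item pair to |S| / (|S+| * |S-|) at the node S splitting it."""
    if costs is None:
        costs = {}
    if isinstance(tree, str):
        return costs
    left, right = tree
    left_items, right_items = leaves(left), leaves(right)
    size = len(left_items) + len(right_items)
    share = Fraction(size, len(left_items) * len(right_items))
    for a, b in product(left_items, right_items):
        costs[frozenset((a, b))] = share
    pair_costs(left, costs)
    pair_costs(right, costs)
    return costs


def leaf_depth_sum(tree, depth=0):
    """Tree cost as defined for DT: total sum of leaf depths."""
    if isinstance(tree, str):
        return depth
    left, right = tree
    return leaf_depth_sum(left, depth + 1) + leaf_depth_sum(right, depth + 1)


# Top split {a,b,c,d} | {e,f} as on the slide; deeper splits are assumed.
tree = ((("a", "b"), ("c", "d")), ("e", "f"))
costs = pair_costs(tree)
print(costs[frozenset("af")], sum(costs.values()), leaf_depth_sum(tree))
```

Using `Fraction` keeps the shares exact, so the equality between the summed pair costs and the tree cost can be compared without floating-point tolerance.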
Greedy Tree Cost
- Cost of greedy tree $T$: $\mathrm{Cost}(T) = \sum_{i<j} c_{ij}$
- Free to order the pair costs in any way we like!
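Reorder $c_{ij}$ according to $T^*$
- Free to order the pair costs in any way we like!
- Group each pair $\{x_i, x_j\}$ under the internal node $Z$ of the optimal tree $T^*$ that separates it: $\mathrm{Cost}(T) = \sum_{Z \in T^*} \sum_{x_i \in Z^{+}} \sum_{x_j \in Z^{-}} c_{ij}$
[Figure: optimal tree $T^*$; node $Z$ with children $Z^{+}$ and $Z^{-}$, separating $x_i$ from $x_j$]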
A Lemma
[Figure: optimal tree $T^*$; node $Z$ with children $Z^{+}$ and $Z^{-}$, separating $x_i$ from $x_j$]
Lemma: For any node $Z$ in $T^*$
Proof of the Theorem
Lemma: For any node $Z$ in $T^*$
Step justifications: (lemma) ($|Z| \le n$) (Def. of tree cost) (CLRS)
[Figure: optimal tree $T^*$; node $Z$ with children $Z^{+}$ and $Z^{-}$, separating $x_i$ from $x_j$]
Proving the Lemma
Lemma: For any node $Z$ in $T^*$
Goal: Relate pair cost (defined w.r.t. the greedy tree) to the optimal tree
Claim 1:
[Figure: greedy tree $T$; node $S_{ij}$ separating $x_i$ from $x_j$, with $S_{ij}^{Z+}$ and $S_{ij}^{Z-}$ marked]
Claim 2
Claim 2: For any $Z$ in $T^*$, for any $x_i$ in $Z^{+}$:
Step justification: (claim 1)
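Proof of Claim 2
Claim 2: For any $Z$ in $T^*$, for any $x_i$ in $Z^{+}$:
- Example: $Z^{-} = \{a, b, c, d, e, f\}$
- Order $Z^{-}$ from 1 to 6 according to when $x_j$ is split from $x_i$ in the greedy tree
- When the $t$-th item is split from $x_i$, $|S_{ij}^{Z-}| \ge 6 - t + 1$, where $S_{ij}^{Z-}$ is the part of $S_{ij}$ that lies in $Z^{-}$
- In the figure, $x_i$ is split from $\{a,b\}$, then $c$, then $\{d,e\}$, then $f$, so the corresponding greedy nodes hold at least 6, 4, 3, and 1 items of $Z^{-}$
[Figure: path to $x_i$ in greedy tree $T$ through nodes $S_{i1}, \dots, S_{i5}$]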
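The bound $|S_{ij}^{Z-}| \ge 6 - t + 1$ is the kind of bound that turns a sum of reciprocals into a harmonic number. The sketch below assumes that is how it is used (the claim's own formula is not shown here) and simply evaluates $\sum_{t=1}^{6} 1/(6 - t + 1)$ exactly, comparing it with $\ln 6 + 1$. The function name `reciprocal_bound` is hypothetical.

```python
from fractions import Fraction
import math


def reciprocal_bound(size):
    """Sum of 1 / (size - t + 1) for t = 1..size, i.e. the harmonic number H_size."""
    return sum(Fraction(1, size - t + 1) for t in range(1, size + 1))


bound = reciprocal_bound(6)
print(bound)                              # 49/20
print(float(bound) <= math.log(6) + 1)    # True
```

The comparison in the last line reflects the standard inequality $H_n \le \ln n + 1$.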
Wrapping up the Proof:
Lemma: For any node $Z$ in $T^*$
Claim 2: For any $Z$ in $T^*$, for any $x_i$ in $Z^{+}$ (and likewise for any $x_j$ in $Z^{-}$):
Step justifications: (claim 1) (claim 2)
QED
Outline
- Problem Introduction
- A Greedy Approximation Algorithm for DT
- An Analysis of the Greedy Algorithm
  - $(\ln n + 1)$-approximation
- Other Results and Open Problems
DT has no PTAS unless P=NP
- MAX3SAT5 [Feige]:
  - 3CNF; each variable appears in exactly 5 clauses
  - Thm: There exists a universal constant $\epsilon > 0$ such that it is NP-Hard to distinguish 3SAT5 formulas that are satisfiable from those in which at most $(1 - \epsilon)|C|$ clauses can be simultaneously satisfied.
- Gap-preserving reduction from MAX3SAT5 to DT
  - Via a set cover
DT has no PTAS unless P=NP All clauses satisfied: Cost:
DT has no PTAS unless P=NP
At most $(1 - \epsilon)|C|$ clauses satisfied: Cost:
The ConDT Problem:
- Input: A set $X = \{x_1, \dots, x_n\}$ of $m$-bit binary strings (called items)
  - Each item $x_i$ has a label TRUE or FALSE
- Solution: A binary tree
  - Each internal node is a bit $k$; each leaf is a label
  - The tree correctly labels each item (consistent)
- Cost: Total number of leaves
- Optimal Solution: Consistent decision tree with minimum number of leaves
- Not possible to approximate size-$s$ DTs with size-$s^k$ DTs (for any constant $k$) unless NP is in DTIME[$2^{m^{\epsilon}}$] for some $\epsilon < 1$
Open Problems
- Gap in approximation ratios between lower and upper bounds
  - Techniques from ConDT don't work
- Items with weights
  - Tests with weights
  - Minimize:
Fin