Treatment Learning: Implementation and Application
Ying Hu
Electrical & Computer Engineering
University of British Columbia
Outline
1. An example
2. Background Review
3. TAR2 Treatment Learner
   TARZAN: Tim Menzies
   TAR2: Ying Hu & Tim Menzies
4. TAR3: improved TAR2
   TAR3: Ying Hu
5. Evaluation of treatment learning
6. Application of Treatment Learning
7. Conclusion
First Impression
Boston Housing Dataset (506 examples, 4 classes; class labels shown: low, high)
C4.5's decision tree: [figure]
Treatment learner:
  6.7 <= rooms < 9.8 and 12.6 <= parent teacher ratio < …
  … <= nitric oxide < 1.9 and … <= living standard < 39
Review: Background
What is KDD?
– KDD = Knowledge Discovery in Databases [fayyad96]
– Data mining: one step in the KDD process
– Machine learning: learning algorithms
Common data mining tasks
– Classification
    Decision tree induction (C4.5) [quinlan86]
    Nearest neighbors [cover67]
    Neural networks [rosenblatt62]
    Naive Bayes classifier [duda73]
– Association rule mining
    APRIORI algorithm [agrawal93]
    Variants of APRIORI
Treatment Learning: Definition
– Input: classified dataset
    Assume: classes are ordered
– Output: Rx = conjunction of attribute-value pairs
    Size of Rx = # of pairs in the Rx
– confidence(Rx w.r.t. Class) = P(Class|Rx)
– Goal: to find Rx that have different levels of confidence across classes
– Evaluate Rx: lift
– Visualization form of output
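The confidence definition above can be sketched in a few lines of Python. This is a minimal illustration, not the TAR2 implementation: the function names, the `class` column name, and the toy rows are all hypothetical, and lift is left out because the slide does not give its formula.

```python
from collections import Counter


def matches(row, rx):
    """True if the row satisfies every attribute-value pair in the treatment."""
    return all(row.get(attr) == value for attr, value in rx.items())


def confidence(rows, rx, class_key="class"):
    """Return P(Class | Rx) for each class seen among the rows selected by rx."""
    selected = [row for row in rows if matches(row, rx)]
    if not selected:
        return {}
    counts = Counter(row[class_key] for row in selected)
    return {label: n / len(selected) for label, n in counts.items()}


# Hypothetical toy data with two ordered classes: low < high.
rows = [
    {"rooms": "many", "crime": "low", "class": "high"},
    {"rooms": "many", "crime": "high", "class": "high"},
    {"rooms": "many", "crime": "low", "class": "low"},
    {"rooms": "few", "crime": "low", "class": "low"},
    {"rooms": "few", "crime": "high", "class": "low"},
]

baseline = confidence(rows, {})                # empty Rx selects every row
treated = confidence(rows, {"rooms": "many"})  # Rx of size 1
print(baseline, treated)
```

An Rx is interesting when the treated distribution differs from the baseline, here by shifting probability from the low class toward the high class.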
Motivation: Narrow Funnel Effect
When is enough learning enough?
– Using fewer than 50% of the attributes decreases accuracy by only 3-5% [shavlik91]
– A 1-level decision tree is comparable to C4 [Holte93]
– Data engineering: ignoring 81% of the features results in a 2% increase in accuracy [kohavi97]
– Scheduling: random sampling outperforms complete search (depth-first) [crawford94]
Narrow funnel effect
– Control variables vs. derived variables
– Treatment learning: finding funnel variables
TAR2: The Algorithm
Search + attribute utility estimation
– Estimation heuristic: Confidence1
– Search: depth-first search
    Search space: confidence1 > threshold
Discretization: equal width interval binning
Reporting Rx
– Lift(Rx) > threshold
Software package and online distribution
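Equal width interval binning, the discretization step named above, can be sketched as follows. This is a generic version of the technique under stated assumptions, not TAR2's own code: the function name and the choice of three bins are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1.

    The range [min, max] is split into n_bins intervals of equal width.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they share a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Each bin index then serves as a discrete attribute value, so a numeric range such as `6.7 <= rooms < 9.8` becomes one attribute-value pair.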
The Pilot Case Study
Requirement optimization
– Goal: find an optimal set of mitigations in a cost-effective manner
[Diagram nodes: Risks, Mitigations, Requirements, Cost, Benefit; edge labels: reduce, relates, incur, achieve]
Iterative learning cycle
The Pilot Study (continued)
Cost-benefit distribution (30/99 mitigations)
Compared to Simulated Annealing
Problem of TAR2
Runtime vs. Rx size
To generate Rx of size r:
To generate Rx of sizes [1..N]:
TAR3: The Improvement
Random sampling
– Key idea:
    Confidence1 distribution = probability distribution
    Sample Rx from the confidence1 distribution
– Steps:
    Place items (a_i) in increasing order according to confidence1 value
    Compute the CDF of each a_i
    Sample a uniform value u in [0..1]
    The sample is the least a_i whose CDF > u
– Repeat until we get an Rx of the given size
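The sampling steps above can be sketched in Python. This is an illustrative version only, not the TAR3 source. It assumes confidence1 scores are non-negative with at least one positive value, normalises them so the CDF ends at 1, and treats items as opaque labels that are merely kept distinct. The function names and the example scores are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_item(scores, rng):
    """Draw one item with probability proportional to its score."""
    # Step 1: place items in increasing order of score.
    items = sorted(scores, key=scores.get)
    total = sum(scores.values())
    # Step 2: cumulative distribution over the ordered items.
    cdf = list(accumulate(scores[item] / total for item in items))
    # Step 3: uniform draw; random() returns a value in [0, 1).
    u = rng.random()
    # Step 4: the least item whose CDF exceeds u (clamped against rounding).
    index = bisect_right(cdf, u)
    return items[min(index, len(items) - 1)]


def sample_rx(scores, size, rng):
    """Repeat the draw until the treatment holds `size` distinct items."""
    positive = [item for item, score in scores.items() if score > 0]
    if size > len(positive):
        raise ValueError("not enough items with a positive score")
    rx = []
    while len(rx) < size:
        item = sample_item(scores, rng)
        if item not in rx:
            rx.append(item)
    return rx


# Hypothetical confidence1 scores for four attribute-value pairs.
scores = {"rooms=high": 0.5, "ptratio=low": 0.3, "nox=high": 0.15, "age=old": 0.05}
print(sample_rx(scores, 2, random.Random(0)))
```

Because high-scoring items occupy wider intervals of the CDF, they are drawn more often, which steers the search toward promising treatments without enumerating every conjunction.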
Comparison of Efficiency
[Figure panels:]
– Runtime vs. data size
– Runtime vs. Rx size
– Runtime vs. TAR2
Comparison of Results
Mean and STD in each round
Final Rx: TAR2=19, TAR3=20
10 UCI domains, identical best Rx
pilot2 dataset (58 * 30k)
External Evaluation
FSS framework
– Learners: C4.5, Naive Bayes
– Feature subset selector: TAR2less
– All attributes (10 UCI datasets) -> learning
– Some attributes -> learning
– Compare accuracy
The Results
– Accuracy using Naïve Bayes (avg increase = 0.8%)
– Number of attributes
– Accuracy using C4.5 (avg decrease = 0.9%)
Comparison with Other FSS Methods
– # of attributes selected (C4.5)
– # of attributes selected (Naive Bayes)
17/20: fewest attributes selected
Further evidence for funnels
Applications of Treatment Learning
Download site:
Collaborators: JPL, WV, Portland, Miami
Application examples
– Pair programming vs. conventional programming
– Identify software metrics that are superior error indicators
– Identify attributes that make FSMs easy to test
– Find the best software inspection policy for a particular software development organization
Other applications:
– 1 journal, 4 conference, 6 workshop papers
Main Contributions
– New learning approach
– A novel mining algorithm
– Algorithm optimization
– Complete package and online distribution
– Narrow funnel effect
– Treatment learner as FSS
– Applications in various research domains
Some notes follow
Rx Definition Example
– Input example: classified dataset
– Output example: Rx = conjunction of attribute-value pairs
– confidence(Rx w.r.t. C) = P(C|Rx)
TAR2 in Practice
Domains containing narrow funnels
– A tail in the confidence1 distribution
– A small number of variables that have disproportionately large confidence1 values
– Satisfactory Rx of small size (<6)
Background: Classification
2-step procedure
– The learning phase
– The testing phase
Strategies employed
– Eager learning
    Decision tree induction (e.g. C4.5)
    Neural networks (e.g. backpropagation)
– Lazy learning
    Nearest neighbor classifiers (e.g. k-nearest neighbor classifier)
Background: Association Rule
ID  Transactions
1   A, B, C, E, F
2   B, C, E
3   B, C, D, E
4   …
Possible rule: B => C,E [support = 2%, confidence = 80%]
Where
  support(X => Y) = P(X ∪ Y), the fraction of transactions that contain both X and Y
  confidence(X => Y) = P(Y|X)
Representative algorithms
– APRIORI
    Apriori property of large itemsets
– Max-Miner
    More concise representation of the discovered rules
    Different pruning strategies
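Support and confidence for a rule such as B => C,E can be computed directly from a transaction list. The sketch below uses the joint-occurrence definition of support that is standard in association rule mining. The function names are hypothetical, and the last two transactions are invented for illustration; they are not the rows elided on the slide, so the printed values will not match the 2% and 80% quoted there.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_confidence(transactions, x, y):
    """P(Y | X): support of X and Y together divided by support of X."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, set(x) | set(y)) / support_x


transactions = [
    {"A", "B", "C", "E", "F"},  # transaction 1 from the slide
    {"B", "C", "E"},            # transaction 2 from the slide
    {"B", "C", "D", "E"},       # transaction 3 from the slide
    {"A", "B", "D"},            # hypothetical, added for illustration
    {"A", "F"},                 # hypothetical, added for illustration
]

rule_support = support(transactions, {"B", "C", "E"})
rule_conf = rule_confidence(transactions, {"B"}, {"C", "E"})
print(rule_support, rule_conf)
```

On this toy list, three of five transactions contain B, C and E, and three of the four transactions containing B also contain C and E.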
Background: Extension
CBA classifier
– CBA = Classification Based on Association
– X => Y, Y = class label
– More accurate than C4.5 (16/26)
JEP classifier
– JEP = Jumping Emerging Patterns
    Support(X w.r.t. D1) = 0, Support(X w.r.t. D2) > 0
    Model: collection of JEPs
    Classify: maximum collective impact
– More accurate than both C4.5 & CBA (15/25)
Background: Standard FSS Methods
– Information gain attribute ranking
– Relief
– Principal Component Analysis (PCA)
– Correlation-based feature selection
– Consistency-based subset evaluation
– Wrapper subset evaluation
Comparison
Relation to classification
– Class boundary / class density
– Class weighting
Relation to association rule mining
– Multiple classes / no class
– Confidence-based pruning
Relation to change-detection algorithms
– support: |P(X|y=c1)-P(X|y=c2)|
– confidence: |P(y=c1|X)-P(y=c2|X)|
– Bayes' rule
Confidence Property
Universal-existential upward closure
  R1: Age.young -> Salary.low
  R2: Age.young, Gender.m -> Salary.low
  R3: Age.young, Gender.f -> Salary.low
Long rules tend to have high confidence
Large Rx tend to have high lift values
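The closure property can be checked on a small example. When the added attribute splits the covered rows into disjoint groups (here Gender is m or f), the confidence of R1 is a weighted average of the confidences of R2 and R3, so at least one of the longer rules is at least as confident as R1. The sketch below is an illustration under that assumption; the function name and the rows are hypothetical.

```python
def rule_confidence(rows, antecedent, consequent):
    """P(consequent | antecedent) over a list of dict rows."""
    covered = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    if not covered:
        return 0.0
    hits = [r for r in covered if all(r[k] == v for k, v in consequent.items())]
    return len(hits) / len(covered)


# Hypothetical data: 8 young people (4 m, 4 f) and 2 old people.
rows = (
    [{"age": "young", "gender": "m", "salary": "low"}] * 3
    + [{"age": "young", "gender": "m", "salary": "high"}] * 1
    + [{"age": "young", "gender": "f", "salary": "low"}] * 1
    + [{"age": "young", "gender": "f", "salary": "high"}] * 3
    + [{"age": "old", "gender": "m", "salary": "high"}] * 2
)

low = {"salary": "low"}
r1 = rule_confidence(rows, {"age": "young"}, low)
r2 = rule_confidence(rows, {"age": "young", "gender": "m"}, low)
r3 = rule_confidence(rows, {"age": "young", "gender": "f"}, low)
print(r1, r2, r3)  # 0.5 0.75 0.25
```

Here R2 is more confident than R1 while R3 is less confident, which is why extending a rule can always preserve or raise confidence along at least one branch.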
TAR3: Usability
Usability: more user-friendly
– Intuitive, default setting