Published by Eustace Lynch. Modified over 8 years ago.
1
Bump Hunting
Outline: the objective; the PRIM algorithm; beam search.
References:
Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with the data flood (STT, 65) (pp. 697-700). Den Haag, the Netherlands: STT/Beweton.
Friedman, J.H. and Fisher, N.I. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9, 123-143.
2
Bump Hunting - The objective
Find regions of the feature space where the outcome variable has a high average value. In classification, this means a region where the majority of the samples belong to one class.
The decision rule is an intersection of several conditions, each on one predictor variable:
If condition 1 & condition 2 & … & condition N, then predict value …
Ex: if $0 < x_1 < 1$ & $2 < x_2 < 5$ & … & $-1 < x_n < 0$, then class 1
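A rule of this form can be written as a small predicate. The sketch below is only an illustration: the names `in_box` and `predict` and the dictionary layout of the box are assumptions, not something taken from the slides.

```python
# Hypothetical box: predictor index -> (lower, upper) bound.
# Predictors that are not listed are left unconstrained.
box = {0: (0.0, 1.0), 1: (2.0, 5.0)}

def in_box(x, box):
    """True if every constrained coordinate of x lies strictly inside its interval."""
    return all(low < x[j] < high for j, (low, high) in box.items())

def predict(x, box, inside_label=1, outside_label=0):
    """Apply the rule: 'if all conditions hold, then predict inside_label'."""
    return inside_label if in_box(x, box) else outside_label

print(predict([0.5, 3.0], box))  # both conditions hold
print(predict([0.5, 6.0], box))  # second condition fails
```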
3
Bump Hunting - The objective
When the dimension is high and there are many such boxes, the problem is not easy.
4
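Bump Hunting - The objective
Let’s formalize the problem:
Predictors: $x = (x_1, x_2, \ldots, x_p)$
Target variable: $y$, either continuous or binary
Feature space: $S = S_1 \times S_2 \times \cdots \times S_p$, where $S_j$ is the set of possible values of $x_j$
Find a subspace $R \subset S$ such that $\bar{y}_R = \mathrm{ave}(y_i \mid x_i \in R) \gg \bar{y}$, where $\bar{y}$ is the global mean of $y$.
Note: when $y$ is binary (coded 0/1), $\bar{y}_R$ is the proportion of class 1 in $R$, i.e. an estimate of $\Pr(y = 1 \mid x \in R)$.
Define any box: $B = s_1 \times s_2 \times \cdots \times s_p$, where $s_j \subseteq S_j$.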
5
Bump Hunting - The objective
Box in continuous feature space: the box restricts each variable $x_j$ to an interval.
6
Bump Hunting - The objective
Box in categorical feature space: the box restricts each categorical variable $x_j$ to a subset of its categories.
7
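Bump Hunting - PRIM
Sequentially find boxes in subsets of the data.
Support of a box: $\beta_B = \frac{1}{N} \sum_{i=1}^{N} 1(x_i \in B)$, the fraction of observations that fall in $B$.
Continue searching for boxes until there is not enough support for a new box.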
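The support of a box and the mean of the outcome inside it are the two quantities the search keeps track of. A minimal sketch follows, assuming a box stored as a dictionary of closed intervals; the helper names and the toy data are hypothetical.

```python
def in_box(x, box):
    """True if x satisfies every interval constraint in box (closed intervals)."""
    return all(low <= x[j] <= high for j, (low, high) in box.items())

def support_and_mean(X, y, box):
    """Return (fraction of observations inside the box, mean of y inside the box)."""
    inside = [yi for xi, yi in zip(X, y) if in_box(xi, box)]
    beta = len(inside) / len(X)
    mean = sum(inside) / len(inside) if inside else float("nan")
    return beta, mean

# Toy data: one predictor, binary outcome.
X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 1, 1, 0]
beta, mean = support_and_mean(X, y, {0: (0.3, 0.7)})
print(beta, mean)
```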
8
Bump Hunting - PRIM “Patient Rule Induction Method” Two steps: (1) Patient successive top-down refinement (2) Bottom-up recursive expansion These are greedy algorithms.
9
Bump Hunting - PRIM
Peeling:
Begin with a box B containing all the data (or all remaining data in later steps).
Remove the sub-box b* that maximizes the mean of y in B - b*.
Each candidate sub-box b is defined on a single variable (peeling in only one dimension), and only a small percentile of the data is peeled each time.
10
Bump Hunting - PRIM
This is a greedy hill-climbing algorithm. Stop iterating when the support drops to a predetermined threshold.
Why is it called “patient”? Because only a small fraction of the data is removed at each step.
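The peeling loop can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it peels a fixed count of points (a fraction `alpha` of the current box) instead of using quantile thresholds, it ignores ties in the predictor values, and the names `peel_once`, `peel`, `alpha` and `min_support` are assumptions.

```python
def box_mean(y, idx):
    """Mean of y over the observations whose indices are in idx."""
    return sum(y[i] for i in idx) / len(idx)

def peel_once(X, y, idx, alpha):
    """Try removing the lowest or highest alpha fraction on each single variable.

    Returns the remaining indices of the candidate with the highest mean,
    or None if no candidate leaves any data.
    """
    n_peel = max(1, int(alpha * len(idx)))
    best = None
    for j in range(len(X[0])):
        order = sorted(idx, key=lambda i: X[i][j])
        for keep in (order[n_peel:], order[:-n_peel]):
            if keep and (best is None or box_mean(y, keep) > box_mean(y, best)):
                best = keep
    return best

def peel(X, y, alpha=0.1, min_support=0.2):
    """Peel repeatedly; stop before the support falls below min_support."""
    idx = list(range(len(X)))
    trajectory = [idx]
    while True:
        candidate = peel_once(X, y, idx, alpha)
        if candidate is None or len(candidate) / len(X) < min_support:
            break
        idx = candidate
        trajectory.append(idx)
    return trajectory

# Toy data: the bump is the band 0.4 <= x_0 < 0.7; x_1 is uninformative.
X = [[i / 20, (7 * i % 20) / 20] for i in range(20)]
y = [1 if 8 <= i < 14 else 0 for i in range(20)]
trajectory = peel(X, y, alpha=0.1, min_support=0.2)
for idx in trajectory:
    print(f"support={len(idx) / len(X):.2f}  mean={box_mean(y, idx):.2f}")
```

The whole trajectory of boxes is returned rather than only the last one, so that a box can be chosen afterwards by trading off support against mean.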
11
Bump Hunting - PRIM
Pasting:
In peeling, box boundaries are determined without knowledge of later peels, so some non-optimal steps can be taken. The final box can often be improved by adjusting its boundaries.
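One way to picture pasting is as the reverse of peeling: widen one boundary at a time and keep the change only if the box mean goes up. The sketch below is a simplification under stated assumptions: it extends a boundary to the next observed value rather than pasting back a fixed fraction of the data, and the function names and toy data are hypothetical.

```python
def inside(x, box):
    """True if x satisfies every closed-interval constraint in box."""
    return all(lo <= x[j] <= hi for j, (lo, hi) in box.items())

def mean_in(X, y, box):
    """Mean of y inside the box (-inf for an empty box)."""
    vals = [yi for xi, yi in zip(X, y) if inside(xi, box)]
    return sum(vals) / len(vals) if vals else float("-inf")

def paste(X, y, box):
    """Greedily widen one boundary at a time while the box mean increases."""
    box = dict(box)
    improved = True
    while improved:
        improved = False
        current = mean_in(X, y, box)
        for j, (lo, hi) in list(box.items()):
            below = [x[j] for x in X if x[j] < lo]
            above = [x[j] for x in X if x[j] > hi]
            candidates = []
            if below:
                candidates.append((max(below), hi))  # move lower bound outward
            if above:
                candidates.append((lo, min(above)))  # move upper bound outward
            for cand in candidates:
                trial = {**box, j: cand}
                if mean_in(X, y, trial) > current:
                    box, improved = trial, True
                    break
            if improved:
                break
    return box

# Toy data: a peeled box that ended up too narrow.
X = [[0], [1], [2], [3], [4]]
y = [0, 5, 4, 6, 0]
pasted = paste(X, y, {0: (2, 2)})
print(pasted, mean_in(X, y, pasted))
```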
12
Bump Hunting - PRIM example
21
2/7
22
Bump Hunting - PRIM example
23
3/7
24
Bump Hunting - PRIM example The winner is:
25
Bump Hunting - PRIM example The next peel: 1. And β= 0.4
26
Bump Hunting - PRIM example
27
Bump Hunting - Beam search algorithm
At each step, the w best sub-boxes (each defined on a single variable) are selected, subject to a minimum support requirement.
Greedier than PRIM: each step can peel away much more data than a single PRIM peel.
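A beam search over boxes can be sketched as follows. This is an illustration under assumptions, not the algorithm exactly as presented: candidate sub-boxes are generated by moving one bound of one variable to an observed value, candidates are ranked by mean and then by support, and the names `refinements`, `beam_search`, `depth` and `min_support` are hypothetical.

```python
def covered(X, box):
    """Indices of the observations inside the box (closed intervals)."""
    return [i for i, x in enumerate(X)
            if all(lo <= x[j] <= hi for j, (lo, hi) in box.items())]

def score(X, y, box):
    """Return (mean of y in the box, support of the box)."""
    idx = covered(X, box)
    if not idx:
        return float("-inf"), 0.0
    return sum(y[i] for i in idx) / len(idx), len(idx) / len(X)

def refinements(X, box):
    """Sub-boxes that tighten one bound of one variable to an observed value."""
    idx = covered(X, box)
    for j in range(len(X[0])):
        lo, hi = box.get(j, (float("-inf"), float("inf")))
        for v in sorted({X[i][j] for i in idx}):
            if v > lo:
                yield {**box, j: (v, hi)}
            if v < hi:
                yield {**box, j: (lo, v)}

def beam_search(X, y, w=2, depth=2, min_support=0.2):
    """Keep the w best boxes at each level; refine each of them on one variable."""
    beam = [{}]  # the empty box covers all the data
    for _ in range(depth):
        pool, seen = [], set()
        for box in beam:
            for cand in refinements(X, box):
                key = tuple(sorted(cand.items()))
                if key in seen:
                    continue
                seen.add(key)
                mean, sup = score(X, y, cand)
                if sup >= min_support:
                    pool.append((mean, sup, cand))
        if not pool:
            break
        pool.sort(key=lambda t: (t[0], t[1]), reverse=True)
        beam = [cand for _, _, cand in pool[:w]]
    return beam

# Toy data: the bump is 3 <= x_0 <= 5.
X = [[i] for i in range(10)]
y = [1 if 3 <= i <= 5 else 0 for i in range(10)]
beam = beam_search(X, y, w=2, depth=2, min_support=0.3)
print(beam[0], score(X, y, beam[0]))
```

Note how a single refinement can discard a large share of the data in one step, which is the contrast with PRIM's small peels.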
28
Bump Hunting - Beam search algorithm
29
Bump Hunting - Beam search algorithm
w = 2
31
Bump Hunting - About PRIM
It is a greedy search, but a “patient” one, and this is important. Methods that partition the data much faster, e.g. beam search and CART, could be less successful.
The “patient” strategy makes it easier to recover from earlier “unfortunate” steps, since we don’t run out of data too fast.
It doesn’t drop predictors because of high correlation among them.
32
Neural networks (1) Introduction Fitting neural networks
33
Neural network
K-class classification: K nodes in the top layer
Continuous outcome: a single node in the top layer
34
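Neural network
$Z_m$ are created from linear combinations of the inputs: $Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$, $m = 1, \ldots, M$.
$Y_k$ is modeled as a function of linear combinations of the $Z_m$: $T_k = \beta_{0k} + \beta_k^T Z$, $f_k(X) = g_k(T)$, $k = 1, \ldots, K$.
For regression, one can use the identity $g_k(T) = T_k$. Typically $K = 1$.
K-class classification: $g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l}$ (multilogit model)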
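The forward computation of such a single-hidden-layer network can be sketched as below. The parameter names (`alpha0`, `alpha`, `beta0`, `beta`), the sigmoid activation and the toy weights are assumptions made for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    """Multilogit output: exponentiate and normalize (shifted for stability)."""
    top = max(t)
    e = [math.exp(v - top) for v in t]
    total = sum(e)
    return [v / total for v in e]

def forward(x, alpha0, alpha, beta0, beta, classify=True):
    """Hidden units Z_m, linear combinations T_k, then the output function g_k."""
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(row, x)))
         for a0, row in zip(alpha0, alpha)]
    t = [b0 + sum(b * zm for b, zm in zip(row, z))
         for b0, row in zip(beta0, beta)]
    return softmax(t) if classify else t  # identity output for regression

# Toy network: 2 inputs, 2 hidden units, 3 classes.
alpha0, alpha = [0.1, -0.2], [[0.3, -0.1], [0.2, 0.4]]
beta0, beta = [0.0, 0.1, -0.1], [[0.7, -0.5], [0.2, 0.3], [-0.4, 0.6]]
probs = forward([1.0, -2.0], alpha0, alpha, beta0, beta)
print(probs, sum(probs))
```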
35
Neural network
36
Neural network
A simple network with linear functions:
y1: x1 + x2 + 0.5 ≥ 0
y2: x1 + x2 − 1.5 ≥ 0
z1 = +1 if and only if both y1 = 1 and y2 = −1
“bias”: intercept
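This small network can be checked by hand or in code. In the sketch below, the two hidden units use the weights and biases given above; the output unit's weights (1, -1) and bias -1 are an assumption, chosen only so that z1 = +1 exactly when y1 = +1 and y2 = -1. Feeding it inputs coded as -1/+1 (also an assumption) shows that it computes an exclusive-or.

```python
def sign_unit(v):
    """Threshold unit: +1 if v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1

def network(x1, x2):
    y1 = sign_unit(x1 + x2 + 0.5)   # bias (intercept) +0.5
    y2 = sign_unit(x1 + x2 - 1.5)   # bias (intercept) -1.5
    # Assumed output unit: fires only for y1 = +1 and y2 = -1.
    return sign_unit(y1 - y2 - 1)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, network(x1, x2))
```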
37
Neural network
39
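Fitting Neural Networks
Set of parameters (weights): $\theta = \{\alpha_{0m}, \alpha_m;\ m = 1, \ldots, M\} \cup \{\beta_{0k}, \beta_k;\ k = 1, \ldots, K\}$
Objective function:
Regression: sum of squared errors, $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2$
Classification: cross-entropy (deviance), $R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$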
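The two objective functions are easy to state in code. This is a minimal sketch, assuming `Y` holds one row of targets per observation (one-hot rows for classification) and `F` holds the fitted values in the same layout; both names are hypothetical.

```python
import math

def squared_error(Y, F):
    """Sum over observations and outputs of (y_ik - f_k(x_i))^2."""
    return sum((y - f) ** 2 for yi, fi in zip(Y, F) for y, f in zip(yi, fi))

def cross_entropy(Y, F):
    """Minus the sum of y_ik * log f_k(x_i); terms with y_ik = 0 contribute nothing."""
    return -sum(y * math.log(f)
                for yi, fi in zip(Y, F) for y, f in zip(yi, fi) if y > 0)

# Two observations, two classes, an uninformative fit.
Y = [[1, 0], [0, 1]]
F = [[0.5, 0.5], [0.5, 0.5]]
print(squared_error(Y, F), cross_entropy(Y, F))
```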
40
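Fitting Neural Networks
$R(\theta)$ is minimized by gradient descent, which in this setting is called “back-propagation”.
Middle-layer values for each data point: $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, $z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi})$
We use the squared error loss for demonstration: $R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$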
41
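Fitting Neural Networks
Derivatives:
$\partial R_i / \partial \beta_{km} = -2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_{0k} + \beta_k^T z_i)\, z_{mi}$
$\partial R_i / \partial \alpha_{ml} = -\sum_{k=1}^{K} 2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_{0k} + \beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_{0m} + \alpha_m^T x_i)\, x_{il}$
Descent along the gradient:
$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \partial R_i / \partial \beta_{km}^{(r)}$
$\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \partial R_i / \partial \alpha_{ml}^{(r)}$
$\gamma_r$: learning rate; k: output unit index; m: hidden unit index; l: input (predictor) index; i: observation index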
42
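Fitting Neural Networks
Write the derivatives as $\partial R_i / \partial \beta_{km} = \delta_{ki} z_{mi}$ and $\partial R_i / \partial \alpha_{ml} = s_{mi} x_{il}$, where $\delta_{ki} = -2\,(y_{ik} - f_k(x_i))\, g_k'(\beta_{0k} + \beta_k^T z_i)$ is the “error” at output unit k and $s_{mi}$ is the “error” at hidden unit m.
By definition, these errors satisfy the back-propagation equations:
$s_{mi} = \sigma'(\alpha_{0m} + \alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}$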
43
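Fitting Neural Networks
General workflow of back-propagation:
Forward: fix the weights and compute the fitted values $\hat{f}_k(x_i)$.
Backward: compute the output-layer errors $\delta_{ki}$, then back-propagate them to compute the hidden-layer errors $s_{mi}$.
Use both sets of errors to compute the gradients for the updates.
Update the weights.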
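The workflow above can be sketched for one observation, squared error loss and an identity output function (so the derivative of the output function is 1). The variable names (`a0`, `A` for the hidden-layer intercepts and weights, `b0`, `B` for the output layer), the learning rate `gamma` and the toy numbers are assumptions made for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, a0, A, b0, B):
    """Forward pass: hidden values z and outputs f (identity output function)."""
    z = [sigmoid(a0[m] + sum(w * xl for w, xl in zip(A[m], x)))
         for m in range(len(A))]
    f = [b0[k] + sum(w * zm for w, zm in zip(B[k], z)) for k in range(len(B))]
    return z, f

def loss(x, y, a0, A, b0, B):
    _, f = forward(x, a0, A, b0, B)
    return sum((yk - fk) ** 2 for yk, fk in zip(y, f))

def backprop(x, y, a0, A, b0, B):
    """Backward pass: output errors delta, hidden errors s, then the gradients."""
    z, f = forward(x, a0, A, b0, B)
    delta = [-2.0 * (yk - fk) for yk, fk in zip(y, f)]
    # sigma'(v) = z * (1 - z) for the sigmoid
    s = [z[m] * (1.0 - z[m]) * sum(B[k][m] * delta[k] for k in range(len(B)))
         for m in range(len(A))]
    grad_B = [[delta[k] * z[m] for m in range(len(A))] for k in range(len(B))]
    grad_A = [[s[m] * xl for xl in x] for m in range(len(A))]
    # s and delta are also the gradients for the intercepts a0 and b0.
    return s, grad_A, delta, grad_B

def gradient_step(x, y, a0, A, b0, B, gamma=0.05):
    """One descent step along the negative gradient with learning rate gamma."""
    g_a0, g_A, g_b0, g_B = backprop(x, y, a0, A, b0, B)
    a0 = [w - gamma * g for w, g in zip(a0, g_a0)]
    b0 = [w - gamma * g for w, g in zip(b0, g_b0)]
    A = [[w - gamma * g for w, g in zip(rw, rg)] for rw, rg in zip(A, g_A)]
    B = [[w - gamma * g for w, g in zip(rw, rg)] for rw, rg in zip(B, g_B)]
    return a0, A, b0, B

# Toy network: 2 inputs, 2 hidden units, 1 output.
x, y = [1.0, -2.0], [0.5]
a0, A = [0.1, -0.2], [[0.3, -0.1], [0.2, 0.4]]
b0, B = [0.05], [[0.7, -0.5]]
s, grad_A, delta, grad_B = backprop(x, y, a0, A, b0, B)
before = loss(x, y, a0, A, b0, B)
after = loss(x, y, *gradient_step(x, y, a0, A, b0, B))
print(before, after)
```

A finite-difference comparison against `loss` is a cheap way to confirm that analytic gradients like these are coded correctly.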
44
Fitting Neural Networks
Parallel computing can be used: each hidden unit passes and receives information only to and from units with which it shares a connection.
Online training: the fitting scheme allows the network to handle very large training sets and to update the weights as new observations come in.
Training a neural network is an “art”: the model is generally overparametrized, and the optimization problem is nonconvex and unstable.
A neural network model is a black box and is hard to interpret directly.
45
Fitting Neural Networks
Initialization
When the weight vectors are close to length zero, the inputs to the sigmoid are all close to zero, where the sigmoid curve is close to linear. The overall model is therefore close to linear, i.e. a relatively simple model. (This can be seen as a regularized solution.)
Start with very small weights and let the neural network learn the necessary nonlinear relations from the data.
Starting with large weights often leads to poor solutions.
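The near-linearity of the sigmoid around zero can be checked numerically against its first-order Taylor expansion, 0.5 + v/4. In the sketch below, the initializer `init_weights`, its `scale` of 0.1 and the fixed seed are assumptions chosen for illustration, not recommended settings.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear_approx(v):
    """First-order Taylor expansion of the sigmoid around 0."""
    return 0.5 + v / 4.0

def init_weights(n_rows, n_cols, scale=0.1, seed=0):
    """Small random starting weights, uniform in [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]

# Small inputs: the sigmoid and its linear approximation nearly coincide.
# Large inputs: the sigmoid saturates and the approximation breaks down.
for v in (0.05, 0.1, 3.0):
    print(v, sigmoid(v), linear_approx(v))

weights = init_weights(2, 3)
```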