1 Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 2013/2014 Summer

2 Classification: predicts categorical class labels.
Typical applications:
{credit history, salary} -> credit approval (Yes/No)
{Temp, Humidity} -> Rain (Yes/No)
Classification mathematically

3 Linear Classification
A binary classification problem: the data above the red line belongs to class 'x', the data below the red line belongs to class 'o'.
Examples: SVM, Perceptron, probabilistic classifiers.
(Figure: scatter plot of 'x' and 'o' points separated by a line.)

4 Neural Networks
Analogy to biological systems (indeed a great example of a good learning system).
Massive parallelism allows for computational efficiency.
The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.

5 Neural Networks
Advantages: prediction accuracy is generally high; robust, works when training examples contain errors; output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; fast evaluation of the learned target function.
Criticism: long training time; difficult to understand the learned function (weights); not easy to incorporate domain knowledge.

6 Network Topology
Input variables: number of inputs, number of hidden layers, number of nodes in each hidden layer, number of output nodes.
Can handle discrete or continuous variables: normalise continuous variables to the 0..1 interval; for discrete variables use k inputs (one per level), and k output nodes per level if k > 2.
Example: A has three distinct values a1, a2, a3, giving three input variables I1, I2, I3; when A = a1, I1 = 1 and I2 = I3 = 0.
Feed-forward: no cycles back to input units. Fully connected: each unit connects to each unit in the next (forward) layer.

7 Multi-Layer Perceptron (figure): input vector x_i feeds the input nodes; weights w_ij connect them to hidden nodes, which connect to the output nodes producing the output vector.

8 Example: Sample iterations
A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99). The learning rate is 1 for simplicity. Inputs I1 = 1.0, I2 = 1.0, true output T = 0, predicted output P = 0.63.

9 (Figure: the 2-2-1 XOR network with inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.4, -0.2, 0.3, -0.4, 0.5, hidden outputs O3 = 0.65 and O4 = 0.48, and output O5 = 0.63.)
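A minimal Python sketch of the forward pass through a 2-2-1 logistic network like the one in the figure. How the six weights map onto specific connections (and whether the original network has bias terms) is not recoverable from the transcript, so the assignment below is an assumption and the rounded outputs may differ slightly from the O3/O4/O5 values shown on the slide.

```python
# Forward pass through a small 2-2-1 logistic network (assumed weight mapping).
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

I1, I2 = 1.0, 1.0            # inputs from the slide
w13, w23 = 0.3, 0.4          # assumed input -> hidden-3 weights
w14, w24 = 0.1, -0.2         # assumed input -> hidden-4 weights
w35, w45 = 0.5, -0.4         # assumed hidden -> output weights

O3 = logistic(w13 * I1 + w23 * I2)   # hidden unit 3
O4 = logistic(w14 * I1 + w24 * I2)   # hidden unit 4
O5 = logistic(w35 * O3 + w45 * O4)   # output unit 5

print(round(O3, 2), round(O4, 2), round(O5, 2))
```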

10 Variable Encodings
Continuous variables, e.g. dollar amounts, averages (average sales, volume), ratios (income to debt, payment to loan), physical measures (area, temperature, ...).
Transfer to the 0-1 (or 0.1-0.9) interval, or to -1.0..+1.0 (or -0.9..0.9).
z-scores: z = (x - mean_x) / std_x.

11 Continuous variables
When a new observation arrives it may be out of range. What to do: plan for a larger range; reject out-of-range values; or peg values lower than the minimum to the range minimum and values higher than the maximum to the range maximum.
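A small Python sketch of these continuous-variable encodings: min-max scaling to the 0..1 interval with out-of-range values pegged to the range limits, plus the z-score formula from the previous slide. The function names and example numbers are illustrative.

```python
# Min-max scaling with pegging, and z-score standardisation.
def minmax_01(x, lo, hi):
    x = max(lo, min(x, hi))          # peg values below lo / above hi to the range
    return (x - lo) / (hi - lo)

def z_score(x, mean, std):
    return (x - mean) / std

print(minmax_01(2500.0, lo=0.0, hi=10000.0))    # 0.25
print(minmax_01(12000.0, lo=0.0, hi=10000.0))   # out of range -> pegged to 1.0
print(z_score(2500.0, mean=4000.0, std=1500.0)) # -1.0
```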

12 Ordinal variables
Discrete integers, e.g. age ranges (young, mid, old), income (low, mid, high), number of children.
Transfer to the 0-1 interval. Example: 5 categories of age (1 young, 2 mid-young, 3 mid, 4 mid-old, 5 old) are transferred to values between 0 and 1.

13 Thermometer coding
0 -> 0 0 0 0 -> 0/16 = 0
1 -> 1 0 0 0 -> 8/16 = 0.5
2 -> 1 1 0 0 -> 12/16 = 0.75
3 -> 1 1 1 0 -> 14/16 = 0.875
Useful for academic grades or bond ratings, where a difference on one side of the scale is more important than on the other side.
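A short Python sketch of the thermometer coding above: category k turns on the first k of four bits, and the bit pattern is read as a fraction of 16.

```python
# Thermometer coding of an ordinal level into 4 bits and a value in [0, 1).
def thermometer(level, n_bits=4):
    bits = [1] * level + [0] * (n_bits - level)                      # e.g. 2 -> [1, 1, 0, 0]
    value = sum(b << (n_bits - 1 - i) for i, b in enumerate(bits))   # 1100 -> 12
    return bits, value / 2 ** n_bits                                 # 12/16 = 0.75

for level in range(4):
    print(level, thermometer(level))
# 0 -> 0/16 = 0.0, 1 -> 8/16 = 0.5, 2 -> 12/16 = 0.75, 3 -> 14/16 = 0.875
```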

14 Nominal Variables
Examples: gender, marital status, occupation.
1 - Treat like ordinal variables. Example: marital status with 5 codes (single, divorced, married, widowed, unknown) mapped to -1, -0.5, 0, 0.5, 1. The network treats them as ordinal even though the order does not make sense.

15 2 - Break into flags: one variable for each category (1-of-N coding).
Gender has three values: male, female, unknown.
Male: 1 -1 -1
Female: -1 1 -1
Unknown: -1 -1 1

16 1-of-(N-1) coding:
Male: 1 -1
Female: -1 1
Unknown: -1 -1
3 - Replace the variable with a numerical one.
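A small Python sketch of the 1-of-N and 1-of-(N-1) flag codings, using the -1/+1 flags from the slides; the category list is illustrative.

```python
# Flag (dummy) codings for a nominal variable.
GENDER = ["male", "female", "unknown"]

def one_of_n(value, categories):
    return [1 if value == c else -1 for c in categories]

def one_of_n_minus_1(value, categories):
    # The last category is the reference level: it is coded as all -1s.
    return [1 if value == c else -1 for c in categories[:-1]]

print(one_of_n("female", GENDER))           # [-1, 1, -1]
print(one_of_n_minus_1("unknown", GENDER))  # [-1, -1]
```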

17 Time Series variables
Stock market prediction. Output: IMKB100 at t. Inputs: IMKB100 at t-1, t-2, t-3, ...; Dollar at t-1, t-2, t-3, ...; interest rate at t-1, t-2, t-3, ...
Day-of-week variables: ordinal coding Monday 1 0 0 0 0, ..., Friday 0 0 0 0 1; or nominal coding mapping Monday to Friday from -1 to 1 or 0 to 1.
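A small Python sketch of building the lagged inputs described above; the series values are made-up illustration data.

```python
# Build (inputs, target) rows where the target is the series at t and the
# inputs are its values at t-1, t-2, t-3.
def make_lagged_rows(series, n_lags=3):
    rows = []
    for t in range(n_lags, len(series)):
        inputs = [series[t - k] for k in range(1, n_lags + 1)]  # t-1, t-2, t-3
        target = series[t]
        rows.append((inputs, target))
    return rows

index = [100.0, 102.5, 101.0, 103.2, 104.8, 104.1]   # illustrative index values
for inputs, target in make_lagged_rows(index):
    print(inputs, "->", target)
```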

18 A Neuron (figure): the n-dimensional input vector x (x0, x1, ..., xn) is mapped into the output y by means of the scalar product with the weight vector w (w0, w1, ..., wn), the weighted sum, followed by a nonlinear activation function f.

19 A Neuron (figure repeated): weighted sum of the input vector x with the weight vector w, passed through the activation function f to give the output y.
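A minimal Python sketch of the neuron in the figure, assuming a logistic activation and treating x0 = 1 as the input that carries the bias weight w0.

```python
# A single neuron: y = f(sum_i w_i * x_i), with x[0] = 1 as the bias input.
import math

def neuron(x, w):
    net = sum(wi * xi for wi, xi in zip(w, x))   # weighted sum
    return 1.0 / (1.0 + math.exp(-net))          # logistic activation f

x = [1.0, 0.5, -1.2, 0.3]   # x0 = 1 plus three inputs (illustrative values)
w = [0.1, 0.4, -0.2, 0.7]   # w0 (bias) plus three weights
print(neuron(x, w))
```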

20 Network Training
The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
Steps: initialize weights with random values; repeat until the classification error is lower than a threshold (each pass is an epoch): feed the input tuples into the network one by one, and for each unit compute the net input as a linear combination of all the inputs to the unit, compute the output value using the activation function, compute the error, and update the weights and the bias.

21 Example: Stock market prediction
Input variables: individual stock prices at t-1, t-2, t-3, ...; the stock index at t-1, t-2, t-3, ...; inflation rate, interest rate, exchange rates ($).
Output variable: the predicted stock price at the next time step.
Train the network with known cases: adjust the weights and experiment with different topologies. Test the network, then use the tested network for predicting unknown stock prices.

22 Other business applications (1): Marketing and sales
Prediction: sales forecasting, price elasticity forecasting, customer response.
Classification: target marketing, customer satisfaction, loyalty and retention.
Clustering: segmentation.

23 Other business applications (2): Risk management
Prediction: credit scoring, financial health.
Classification: bankruptcy classification, fraud detection, credit scoring.
Clustering: credit scoring, risk assessment.

24 Other business applications (3): Finance
Prediction: hedging, futures prediction, forex and stock prediction.
Classification: stock trend classification, bond rating.
Clustering: economic rating, mutual fund selection.

25 Perceptrons (WK p. 91, sec 4.2)
N inputs Ii, i = 1..N; a single output O; two classes C0 and C1 denoted by 0 and 1.
One node output: O = 1 if w1*I1 + w2*I2 + ... + wn*In + w0 > 0, and O = 0 if w1*I1 + w2*I2 + ... + wn*In + w0 < 0.
Sometimes θ is used for the constant term w0, called the bias or threshold in ANN terminology.

26 Artificial Neural Nets: Perceptron (figure): inputs x1, ..., xd plus x0 = +1, weights w0, w1, ..., wd, and output y = g(weighted sum).

27 Perceptron training procedure (rule) (1)
Find weights w that separate each training sample correctly. Initial weights are chosen randomly.
Weight updating: samples are presented in sequence, and after presenting each case the weights are updated:
w_i(t+1) = w_i(t) + Δw_i(t), θ(t+1) = θ(t) + Δθ(t),
where Δw_i(t) = η(T - O)I_i and Δθ(t) = η(T - O).
O: output of the perceptron; T: true output for the case; η: learning rate, 0 < η < 1, usually around 0.1.

28 Perceptron training procedure (rule) (2)
Each case is presented and the weights are updated after presenting it. If the error is not zero, present all the cases once more; each such cycle is called an epoch. Continue until the error is zero, which happens for perfectly separable samples.
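A compact Python sketch of the perceptron training procedure from the last two slides: present cases in sequence, update each weight by η(T - O)I_i and the bias by η(T - O), and stop after an epoch with zero errors. The AND data set and the zero starting weights are illustrative choices.

```python
# Perceptron training with the delta update eta*(T - O)*I_i.
def perceptron_output(weights, bias, inputs):
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net > 0 else 0

def train_perceptron(samples, n_inputs, eta=0.1, max_epochs=100):
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:          # sequential presentation
            out = perceptron_output(weights, bias, inputs)
            if out != target:
                errors += 1
                weights = [w + eta * (target - out) * x for w, x in zip(weights, inputs)]
                bias += eta * (target - out)
        if errors == 0:                         # no error in this epoch: converged
            break
    return weights, bias

# AND function: linearly separable, so the procedure converges.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(samples, n_inputs=2))
```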

29 Perceptron convergence theorem: if the sample is linearly separable, the perceptron will eventually converge and separate all of the sample correctly (error = 0).
The learning rate can even be one. To increase stability it is gradually decreased, although this slows down convergence.
Linearly separable: a line or hyperplane can separate all of the sample correctly.

30 If the classes are not perfectly linearly separable, i.e. if a plane or line cannot separate the classes completely, the procedure will not converge and will keep on cycling through the data forever.

31 (Figure: two scatter plots of 'o' and 'x' points, one linearly separable and one not linearly separable.)

32 Example calculations
Two inputs with w1 = 0.25, w2 = 0.5, and w0 (or θ) = -0.5. Suppose I1 = 1.5, I2 = 0.5, learning rate η = 0.1, and true output T = 0.
The perceptron separates this as 0.25*1.5 + 0.5*0.5 - 0.5 = 0.125 > 0, so O = 1.
Updates: w1(t+1) = 0.25 + 0.1*(0-1)*1.5 = 0.1; w2(t+1) = 0.5 + 0.1*(0-1)*0.5 = 0.45; θ(t+1) = -0.5 + 0.1*(0-1) = -0.6.
With the new weights: O = 0.1*1.5 + 0.45*0.5 - 0.6 = -0.225 < 0, so O = 0 and there is no error.
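A few lines of Python that reproduce the arithmetic of this worked example (values rounded for display).

```python
# One perceptron update with eta = 0.1, T = 0, inputs I1 = 1.5 and I2 = 0.5.
eta = 0.1
w1, w2, theta = 0.25, 0.5, -0.5
I1, I2, T = 1.5, 0.5, 0

net = w1 * I1 + w2 * I2 + theta
O = 1 if net > 0 else 0
print(net, O)                            # 0.125 > 0, so O = 1 (misclassified)

w1 += eta * (T - O) * I1                 # 0.25 + 0.1*(-1)*1.5 = 0.10
w2 += eta * (T - O) * I2                 # 0.50 + 0.1*(-1)*0.5 = 0.45
theta += eta * (T - O)                   # -0.5 + 0.1*(-1)    = -0.60
print(round(w1, 2), round(w2, 2), round(theta, 2))

net = w1 * I1 + w2 * I2 + theta          # 0.15 + 0.225 - 0.6 = -0.225 < 0
print(round(net, 3), 1 if net > 0 else 0)
```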

33 (Figure: before the update, the decision line 0.25*I1 + 0.5*I2 - 0.5 = 0 crosses the axes at I1 = 2 and I2 = 1, and the point (1.5, 0.5), whose true class is 0, falls on the class-1 side. After the update, the line 0.1*I1 + 0.45*I2 - 0.6 = 0 crosses the axes at I1 = 6 and I2 = 1.33, and the point is classified as class 0.)

34 XOR: the exclusive-OR problem
Two inputs I1 and I2. When both agree (I1 = 0 and I2 = 0, or I1 = 1 and I2 = 1) the class is 0 (O = 0); when they disagree (I1 = 0 and I2 = 1, or I1 = 1 and I2 = 0) the class is 1 (O = 1). One line cannot solve XOR, but two lines can.

35 (Figure: the four XOR points on the I1-I2 plane; a single line cannot separate these classes.)

36 Multi-layer networks (study sec 4.3 in WK)
One-layer networks can separate classes with a hyperplane; two-layer networks can separate any convex region; and three-layer networks can separate any non-convex boundary. Examples: see notes.

37 ANN for classification (figure): inputs x1, ..., xd plus x0 = +1, connected through weights w_Kd to output units o1, o2, ..., oK.

38 (Figure: points of class 'o' inside triangle ABC and points of class '+' outside it, with a network of hidden nodes a, b, c, a +1 bias input, and output node d.)
Inside the triangle ABC is class O; outside the triangle is class +. Class = O if I1 + I2 >= 10, I1 <= I2, and I2 <= 10.
Output of hidden node a: 1 (class O side) if w11*I1 + w12*I2 + w10 >= 0, and 0 (class + side) if w11*I1 + w12*I2 + w10 < 0. So the weights are w11 = 1, w12 = 1, and w10 = -10.

39 Output of hidden node b: 1 if w21*I1 + w22*I2 + w20 >= 0, and 0 if w21*I1 + w22*I2 + w20 < 0. So the weights are w21 = -1, w22 = 1, and w20 = 0.
Output of hidden node c: 1 if w31*I1 + w32*I2 + w30 >= 0, and 0 if w31*I1 + w32*I2 + w30 < 0. So the weights are w31 = 0, w32 = -1, and w30 = 10.

40 An object is class O if all hidden units predict it as class O. The output is 1 if w'a*Ha + w'b*Hb + w'c*Hc + w_d >= 0, and 0 if w'a*Ha + w'b*Hb + w'c*Hc + w_d < 0.
Weights of output node d: w'a = 1, w'b = 1, w'c = 1, and w_d = -3 + x, where x is a small number.
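A small Python sketch of the hand-built network from slides 38-40, using step (threshold) units. The hidden-unit weights are the ones given on the slides; the output bias is written as -3 + x, and x = 0.5 is assumed here.

```python
# Two-layer threshold network that recognises the triangular class-O region.
def step(net):
    return 1 if net >= 0 else 0

def hidden_layer(I1, I2):
    Ha = step(1 * I1 + 1 * I2 - 10)    # I1 + I2 >= 10
    Hb = step(-1 * I1 + 1 * I2 + 0)    # I1 <= I2
    Hc = step(0 * I1 - 1 * I2 + 10)    # I2 <= 10
    return Ha, Hb, Hc

def classify(I1, I2, x=0.5):
    Ha, Hb, Hc = hidden_layer(I1, I2)
    return "O" if step(1 * Ha + 1 * Hb + 1 * Hc - 3 + x) else "+"

print(classify(4, 8))    # inside the triangle  -> O   (4+8 >= 10, 4 <= 8, 8 <= 10)
print(classify(8, 4))    # violates I1 <= I2    -> +
print(classify(2, 3))    # violates I1+I2 >= 10 -> +
```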

41 ADBC is the union of two convex regions, in this case triangles; each triangular region can be separated by a two-layer network.
(Figure: a first hidden layer feeding a second hidden layer with nodes e and f; d separates ABC, e separates ADB, and ADBC is the union of ABC and ADB.)
The output is class O if w''f0 + w''f1*He + w''f2*Hf >= 0, with w''f0 = -0.99, w''f1 = 1, and w''f2 = 1.
Two hidden layers can separate any non-convex region.

42 In practice the boundaries are not known, but by increasing the number of hidden nodes a two-layer perceptron can separate any convex region, if it is perfectly separable. Adding a second hidden layer and OR-ing the convex regions, any non-convex boundary can be separated, if it is perfectly separable. The weights are unknown but are found by training the network.

43 For prediction problems: any function can be approximated with a one-hidden-layer network. (Figure: a curve of Y against X approximated by the network.)

44 Network Training
The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
Steps: initialize weights with random values; feed the input tuples into the network one by one; for each unit, compute the net input as a linear combination of all the inputs to the unit, compute the output value using the activation function, compute the error, and update the weights and the bias.

45 Multi-Layer Perceptron (figure repeated): input vector x_i, weights w_ij, hidden nodes, output nodes, output vector.

46 Back-propagation algorithm
LMS uses a linear activation function, which is not so useful; the threshold activation function is very good at separating but is not differentiable. Back-propagation uses the logistic function O = 1/(1 + exp(-N)) = (1 + exp(-N))^-1, where N = w1*I1 + w2*I2 + ... + wN*IN + θ.
The derivative of the logistic function is dO/dN = O*(1 - O), expressed as a function of the output, where O = 1/(1 + exp(-N)) and 0 <= O <= 1.

47 Minimize the total error E = (1/2) Σ_{d=1..N} Σ_{k=1..M} (T_k,d - O_k,d)^2, where N is the number of cases, M is the number of output units, T_k,d is the true value of sample d at output unit k, and O_k,d is the predicted value of sample d at output unit k.
The algorithm updates the weights by a method similar to the delta rule. For each output unit, Δw_ij = η Σ_{d=1..N} O_d(1 - O_d)(T_d - O_d) I_i,d, or, when cases are presented sequentially, Δw_ij(t) = η O(1 - O)(T - O) I_i and Δθ_j(t) = η O(1 - O)(T - O). Here O(1 - O)(T - O) is the error term.

48 So Δw_ij(t) = η * error_j * I_i and Δθ_j(t) = η * error_j for all training samples, and the new weights are w_i(t+1) = w_i(t) + Δw_i(t), θ_j(t+1) = θ_j(t) + Δθ_j(t).
For hidden-layer weights no target value is available, so Δw_ij(t) = η O_d(1 - O_d)(Σ_{k=1..M} error_k * w_kh) I_i and Δθ_j(t) = η O_d(1 - O_d)(Σ_{k=1..M} error_k * w_kh).
The error of each output unit is weighted by the corresponding weight and summed to find the error derivative: the weight from hidden unit h to output unit k is responsible for part of the error in output unit k.
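A minimal Python sketch of one per-case back-propagation update for a small 2-2-1 logistic network, using the error terms above (output unit: O(1-O)(T-O); hidden unit: O_h(1-O_h) Σ_k error_k * w_kh). Bias (θ) updates are omitted for brevity and the weight values are illustrative, not the ones from the course figure.

```python
# One per-case back-propagation step for a 2-2-1 logistic network.
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

eta = 1.0
I = [1.0, 1.0]                        # inputs
T = 0.0                               # target output
w_hidden = [[0.3, 0.4], [0.1, -0.2]]  # w_hidden[h][i]: input i -> hidden h (illustrative)
w_out = [0.5, -0.4]                   # hidden h -> output (illustrative)

# Forward pass
H = [logistic(sum(w * x for w, x in zip(w_hidden[h], I))) for h in range(2)]
O = logistic(sum(wk * hk for wk, hk in zip(w_out, H)))

# Backward pass: error terms (hidden errors use the *old* output weights)
err_out = O * (1 - O) * (T - O)
err_hidden = [H[h] * (1 - H[h]) * err_out * w_out[h] for h in range(2)]

# Weight updates: delta_w = eta * error of downstream unit * upstream output
w_out = [w_out[h] + eta * err_out * H[h] for h in range(2)]
w_hidden = [[w_hidden[h][i] + eta * err_hidden[h] * I[i] for i in range(2)]
            for h in range(2)]

print(round(O, 3), round(err_out, 4), [round(w, 3) for w in w_out])
```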

49 Example: Sample iterations
A network suggested to solve the XOR problem (figure 4.7 of WK, pages 96-99). The learning rate is 1 for simplicity, with I1 = I2 = 1 and true output T = 0.

50 (Figure repeated: the 2-2-1 XOR network with inputs I1 = 1 and I2 = 1, connection weights 0.1, 0.4, -0.2, 0.3, -0.4, 0.5, hidden outputs O3 = 0.65 and O4 = 0.48, and output O5 = 0.63.)


52 Exercise: carry out one more iteration for the XOR problem.

53 Practical applications of BP: revision by epoch or by case
dE/dw_j = Σ_{i=1..N} O_i(1 - O_i)(T_i - O_i) I_ij, where i = 1..N indexes the samples, N is the sample size, j indexes the inputs, and I_ij is input variable j for sample i.
This is the theoretical (actual) derivative: the information in all samples is used in one update of weight j, and the weights are revised after each epoch.

54 If samples are presented one by one, weight j is updated after presenting each sample by dE/dw_j = O_i(1 - O_i)(T_i - O_i) I_ij; this is just one term in the epoch (gradient) formula of the derivative and is called case updating. Updating by case is more common and gives better results, being less likely to get stuck in local minima.
Random or sequential presentation: in each epoch the cases are presented in sequential order or in random order.

55 (Figure: sequential presentation shows cases 1 2 3 4 5 ... in the same order in every epoch; random presentation shows a different random order in each epoch, e.g. 1 2 5 4 3 ..., then 3 2 1 4 5 ..., then 5 1 4 2 3 ...)

56 Neural Networks: random initial state
Weights and biases are initialized to random values, usually between -0.5 and 0.5. The final solution may depend on the initial values of the weights, and the algorithm may converge to different local minima.
Learning rate and local minima: if the learning rate η is too small, convergence is slow; if it is too large, learning is much faster but oscillates. With a small learning rate a local minimum is less likely.

57 Momentum
A momentum term is added to the update equations: Δw_ij(t+1) = η * error_derivative_j * I_i + mom * Δw_ij(t) and Δθ_j(t+1) = η * error_derivative_j + mom * Δθ_j(t).
The momentum term slows down changes of direction, which helps avoid falling into a local minimum, and it speeds up convergence by increasing the gradient (adding a value to it) when the error surface falls into flat regions.
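A tiny Python sketch of the momentum update above: the new weight change is the usual gradient step plus a fraction (mom) of the previous change. The gradient values are illustrative.

```python
# Momentum update: delta(t+1) = eta * grad_term + mom * delta(t).
def momentum_step(weight, grad_term, prev_delta, eta=0.1, mom=0.9):
    delta = eta * grad_term + mom * prev_delta
    return weight + delta, delta            # keep delta for the next iteration

w, prev = 0.2, 0.0
for grad in [0.05, 0.04, 0.04, 0.03]:       # illustrative error-derivative terms
    w, prev = momentum_step(w, grad, prev)
    print(round(w, 4), round(prev, 4))
```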

58 Stopping criteria
Limit the number of epochs; stop when the improvement in error is too small; sample the error after a fixed number of epochs and measure the reduction in error; stop when no Δw value changes by more than a threshold.

59 Overfitting (sec 4.6.5, pp 108-112, Mitchell)
The training error E monotonically decreases as the number of iterations increases (fig 4.9 in Mitchell), while the validation or test error in general decreases first and then starts increasing.
Why? As training progresses some weight values become high and fit noise in the training data rather than representative features of the population.

60 What to do
Weight decay: slowly decrease the weights, or put a penalty on the error function for high weights.
Monitor the validation-set error as well as the training-set error as a function of iterations (see figure 4.9 in Mitchell).

61 Error and complexity (sec 4.4 of WK, pp 102-107)
The error rate on the training set decreases as the number of hidden units is increased; the error rate on the test set first decreases, flattens out, and then starts increasing as the number of hidden units is increased.
Start with zero hidden units and gradually increase the number of units in the hidden layer. At each network size, 10-fold cross-validation or sampling with different initial weights may be used to estimate the error, and the error may be averaged.

62 A General Network Training Procedure
Define the problem. Select input and output variables and make the necessary transformations. Decide on the algorithm: gradient descent or stochastic approximation (delta rule). Choose the transfer function: logistic or hyperbolic tangent. Select a learning rate and a momentum after experimenting with possibly different rates.

63 A General Network Training Procedure (cont.)
Determine the stopping criteria: stop after the error decreases to a given level or after a number of epochs.
Start from zero hidden units and increment the number of hidden units. For each number of hidden units, repeat: train the network on the training data set, and perform cross-validation to estimate the test error rate by averaging over different test samples for a set of initial weights; find the best initial weights.

64 Network Pruning and Rule Extraction
Network pruning: a fully connected network will be hard to articulate; N input nodes, h hidden nodes and m output nodes lead to h(m + N) weights. Pruning removes some of the links without affecting the classification accuracy of the network.
Extracting rules from a trained network: discretize the activation values, replacing each individual activation value by its cluster average while maintaining the network accuracy; enumerate the outputs from the discretized activation values to find rules between activation values and output; find the relationship between the inputs and the activation values; combine the two to obtain rules relating the output to the input.

65 Neural Network Approach
Represent each cluster as an exemplar, acting as a "prototype" of the cluster; new objects are assigned to the cluster whose exemplar is the most similar according to some distance measure.
Typical methods: SOM (Self-Organizing feature Map) and competitive learning, which involves a hierarchical architecture of several units (neurons) that compete in a "winner-takes-all" fashion for the object currently being presented.

66 Self-Organizing Feature Map (SOM)
SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs), map all the points in a high-dimensional source space into a 2- or 3-d target space such that the distance and proximity relationships (i.e., topology) are preserved as much as possible.
Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space. Clustering is performed by having several units compete for the current object: the unit whose weight vector is closest to the current object wins, and the winner and its neighbors learn by having their weights adjusted.
SOMs are believed to resemble processing that can occur in the brain and are useful for visualizing high-dimensional data in 2- or 3-D space.
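A minimal Python sketch of one SOM training step in the winner-takes-all style described above; the grid size, learning rate, neighbourhood radius and data are illustrative.

```python
# One SOM update: find the closest unit, then pull it and its grid neighbours
# toward the presented object.
import random

GRID = 5                                   # 5 x 5 map
DIM = 3                                    # input dimension
units = [[[random.random() for _ in range(DIM)]
          for _ in range(GRID)] for _ in range(GRID)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def som_step(x, eta=0.5, radius=1):
    # Winner-takes-all: the unit whose weight vector is closest to x wins.
    wi, wj = min(((i, j) for i in range(GRID) for j in range(GRID)),
                 key=lambda ij: dist2(units[ij[0]][ij[1]], x))
    # The winner and its neighbours learn by moving toward x.
    for i in range(max(0, wi - radius), min(GRID, wi + radius + 1)):
        for j in range(max(0, wj - radius), min(GRID, wj + radius + 1)):
            w = units[i][j]
            units[i][j] = [wk + eta * (xk - wk) for wk, xk in zip(w, x)]

som_step([0.9, 0.1, 0.4])
```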

67 Web Document Clustering Using SOM
The result of SOM clustering of 12088 Web articles; the picture on the right shows drilling down on the keyword "mining". Based on the websom.hut.fi Web page.
