Learning Decision Trees Brief tutorial by M Werner
Medical Diagnosis Example
Goal
– Diagnose a disease from a blood test
Clinical Use
– A blood sample is obtained from a patient
– The blood is tested to measure the current expression of various proteins, say by using a DNA microarray
– The data is analyzed to produce a Yes or No answer
Data Analysis
Use a decision tree such as:
[Figure: decision tree whose internal nodes test P1 > K1, P2 > K2, P3 > K3 and P4 > K4; each node has Y and N branches, and the leaves are labeled Yes or No.]
How to Build the Decision Tree
– Start with samples of blood from patients known either to have the disease or not (the training set). Suppose there are 20 patients, of whom 10 are known to have the disease and 10 are known not to.
– From the training set, get expression levels for all proteins of interest. For example, with 20 patients and 50 proteins we get a 50 × 20 array of real numbers: rows are proteins, columns are patients.
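The training-set layout described above can be sketched as follows. This is only an illustration: the expression values are synthetic random numbers, and the names `expression` and `labels` are assumptions introduced here, not part of the original tutorial.

```python
import random

N_PROTEINS = 50   # rows of the array
N_PATIENTS = 20   # columns of the array

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# labels[p] is 1 if patient p is known to have the disease, else 0.
labels = [1] * 10 + [0] * 10

# expression[r][p] is the level of protein r measured in patient p.
# Synthetic placeholder values; real values would come from the blood test.
expression = [[rng.random() for _ in range(N_PATIENTS)]
              for _ in range(N_PROTEINS)]

# A row holds one protein's level across all patients.
protein_0_levels = expression[0]

# A column holds one patient's full expression profile.
patient_0_profile = [row[0] for row in expression]
```

Keeping proteins as rows means that each candidate test "Px > Kx" only needs to look at a single row together with the labels.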
Choosing the decision nodes
– We would like the tree to be as short as possible
– Start with all 20 patients in one group (10 have the disease, 10 don’t)
– Choose a protein and a level that gains the most information
– Possible splitting condition: Px > Kx splits the 10/10 group into 9/3 (mostly diseased) and 1/7 (mostly not diseased)
– Alternative splitting condition: Py > Ky splits the 10/10 group into 7/7 and 3/3, leaving both subgroups evenly mixed
How to determine information gain
– Purity: a measure of the extent to which the patients in a group share the same outcome.
– A group that splits 1/7 is fairly pure: most patients don’t have the disease.
– 0/8 is even purer.
– 4/4 is the opposite of pure. This group is said to have high entropy: knowing that a patient is in this group does not make her more or less likely to have the disease.
– The decision tree should reduce entropy as test conditions are evaluated.
Measuring Purity (Entropy)
Let f(i,j) = Prob(Outcome = j in node i). For example, if node 2 has a 9/3 split:
– f(2,0) = 9/12 = 0.75
– f(2,1) = 3/12 = 0.25
Gini impurity: $I_G(i) = 1 - \sum_j f(i,j)^2$
Entropy: $I_E(i) = -\sum_j f(i,j) \log_2 f(i,j)$
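As a worked check for the 9/3 node, assuming the standard definitions (Gini impurity as one minus the sum of squared outcome probabilities, entropy as the negative sum of each probability times its base-2 logarithm), the numbers come out as:

```latex
\begin{align*}
I_G(2) &= 1 - \left(0.75^2 + 0.25^2\right) = 1 - 0.625 = 0.375 \\
I_E(2) &= -\left(0.75 \log_2 0.75 + 0.25 \log_2 0.25\right) \approx 0.311 + 0.500 = 0.811 \text{ bits}
\end{align*}
```

Both measures are zero for a perfectly pure node and largest for an even split.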
Computing Entropy
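A minimal sketch of the computation, applied to the example groups 1/7, 0/8 and 4/4 mentioned earlier. The function names `entropy` and `gini` are chosen here for illustration, and the base-2 logarithm (entropy in bits) is an assumption.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a group, given the patient count per outcome."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Outcomes with zero patients contribute nothing to the sum.
    # p * log2(1/p) is the same as -p * log2(p).
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a group, given the patient count per outcome."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

for group in [(1, 7), (0, 8), (4, 4)]:
    print(group, round(entropy(group), 3), round(gini(group), 3))
```

The 4/4 group gives the maximum of 1 bit, the 0/8 group gives 0, and the 1/7 group lands in between at about 0.544 bits.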
The goal is to use the test that best reduces the total entropy in the subgroups
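One common way to score a test is information gain: the parent group's entropy minus the size-weighted average entropy of its subgroups. The sketch below applies this to the two candidate splits of the 10/10 group (9/3 with 1/7, and 7/7 with 3/3). Weighting by subgroup size and the name `information_gain` are assumptions made here, not details stated in the tutorial.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a group, given the patient count per outcome."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the subgroups."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = (10, 10)

# Px > Kx: subgroups of 9/3 and 1/7.
gain_x = information_gain(parent, [(9, 3), (1, 7)])

# Py > Ky: subgroups of 7/7 and 3/3.
gain_y = information_gain(parent, [(7, 7), (3, 3)])

print(round(gain_x, 3), round(gain_y, 3))
```

The first split gains roughly 0.296 bits while the second gains nothing, which matches the intuition that Px > Kx is the better test.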
Building the Tree
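A greedy recursive procedure is one way to put the pieces together: pick the test with the highest information gain, split the group, and repeat on each subgroup until it is pure. The sketch below is an assumption about how this could be done, not the tutorial's own procedure. The dictionary tree layout, the midpoint thresholds, the helper names, and the tiny two-protein data set are all hypothetical. For convenience it takes one profile per patient, i.e. the transpose of the proteins-by-patients array.

```python
from math import log2

def entropy(labels):
    """Entropy in bits of a list of outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((labels.count(v) / n) * log2(n / labels.count(v))
               for v in set(labels))

def best_split(profiles, labels):
    """Return (gain, protein_index, threshold) of the best test, or None."""
    base, best = entropy(labels), None
    for p in range(len(profiles[0])):
        levels = sorted(set(profile[p] for profile in profiles))
        # Candidate thresholds: midpoints between consecutive distinct levels.
        for lo, hi in zip(levels, levels[1:]):
            k = (lo + hi) / 2
            above = [y for x, y in zip(profiles, labels) if x[p] > k]
            below = [y for x, y in zip(profiles, labels) if x[p] <= k]
            weighted = (len(above) * entropy(above)
                        + len(below) * entropy(below)) / len(labels)
            if best is None or base - weighted > best[0]:
                best = (base - weighted, p, k)
    return best

def build_tree(profiles, labels):
    """Return a leaf label, or a dict describing a test and its subtrees."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure group: stop here
    split = best_split(profiles, labels)
    if split is None or split[0] <= 0:
        # No test helps: answer with the most common outcome.
        return max(set(labels), key=labels.count)
    _, p, k = split
    yes = [(x, y) for x, y in zip(profiles, labels) if x[p] > k]
    no = [(x, y) for x, y in zip(profiles, labels) if x[p] <= k]
    return {"protein": p, "threshold": k,
            "yes": build_tree([x for x, _ in yes], [y for _, y in yes]),
            "no": build_tree([x for x, _ in no], [y for _, y in no])}

def predict(tree, profile):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        branch = "yes" if profile[tree["protein"]] > tree["threshold"] else "no"
        tree = tree[branch]
    return tree

# Hypothetical toy data: 6 patients, 2 proteins each.
profiles = [[5.1, 0.2], [4.8, 0.9], [5.5, 0.3],
            [1.2, 0.8], [0.9, 0.1], [1.5, 0.7]]
labels = ["Yes", "Yes", "Yes", "No", "No", "No"]
tree = build_tree(profiles, labels)
```

On this toy data a single test on the first protein separates the two outcomes, so the tree has one decision node. Stopping when no test has positive gain keeps the sketch short, at the cost of missing splits that only pay off after a second test.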
Links
– publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
– Decision Trees & Data Mining, Andrew Moore Tutorial