Learning: Identification Trees
Larry M. Manevitz
All rights reserved
Given data, find rules characterizing it.
Problems:
– The data is multi-dimensional.
– We want rules that generalize.
Sample Data (Winston)

Name    Hair     Height    Weight    Lotion   Burnt?
Sarah   blonde   average   light     no       yes
Dana    blonde   tall      average   yes      no
Alex    brown    short     average   yes      no
Annie   blonde   short     average   no       yes
Emily   red      average   heavy     no       yes
Pete    brown    tall      heavy     no       no
John    brown    average   heavy     no       no
Katie   blonde   short     light     yes      no
How to analyze the data?
– Look for an identical match (unlikely to exist).
– Find the best match (other techniques).
– Build a decision tree that
  – gives correct decisions for all the data, and
  – is as simple as possible (for generalizability).
Decision Trees
Hair color?
– blonde → Lotion?
  – no:  Sarah, Annie (all yes)
  – yes: Dana, Katie (all no)
– red:   Emily (yes)
– brown: Alex, Pete, John (all no)
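To make this tree concrete, here is a minimal sketch of it as a Python function; the function and argument names are illustrative, not from the slides.

def is_burnt(hair, lotion):
    # Root test: hair color.
    if hair == "blonde":
        # Blonde branch: the outcome depends on whether lotion was used.
        return lotion == "no"   # Sarah, Annie: burnt; Dana, Katie: not
    if hair == "red":
        return True             # Emily: burnt
    return False                # brown hair (Alex, Pete, John): not burnt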
Another Decision Tree
A larger tree for the same data (figure): Height is tested at the root, with further tests on Hair Color and Weight below.
Want to make homogeneous sets.
Idea: choose the test that minimizes disorder.
Disorder can be measured by "average entropy", a formula from information theory (given below).
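The slides do not spell out the formula; the standard definitions, consistent with the numbers computed later, are the Shannon entropy of a set and the sample-weighted average of the entropies of a test's branches:

\[ \mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c \]
\[ \mathrm{AvgEntropy}(T) = \sum_{b \in \mathrm{branches}(T)} \frac{n_b}{n_t}\,\mathrm{Entropy}(S_b) \]

where p_c is the fraction of samples in S belonging to class c, S_b is the subset sent down branch b, n_b = |S_b|, and n_t is the total number of samples reaching the test.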
Disorder or Entropy
– If the two classes are perfectly balanced, the entropy is 1 (the highest value for two classes).
– If all members are in one class, the entropy is 0 (note: 0 log 0 is taken to be 0).
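A quick check of the two extremes for a two-class set:

\[ -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1,
\qquad -1\cdot\log_2 1 - 0\cdot\log_2 0 = 0. \]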
On our tabular example, we compute the average entropy for each candidate test and choose the one with the lowest value:

Test     Average entropy
Hair     0.50
Height   0.69
Lotion   0.61
Weight   0.94

Hair has the lowest average entropy, so it becomes the root test.
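As a sketch of where these numbers come from (the data layout and function names below are my own, not from the lecture), the following computes the average entropy of each test over the table above:

from math import log2
from collections import Counter

# (name, hair, height, weight, lotion, burnt)
DATA = [
    ("Sarah", "blonde", "average", "light",   "no",  "yes"),
    ("Dana",  "blonde", "tall",    "average", "yes", "no"),
    ("Alex",  "brown",  "short",   "average", "yes", "no"),
    ("Annie", "blonde", "short",   "average", "no",  "yes"),
    ("Emily", "red",    "average", "heavy",   "no",  "yes"),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  "no"),
    ("John",  "brown",  "average", "heavy",   "no",  "no"),
    ("Katie", "blonde", "short",   "light",   "yes", "no"),
]
TESTS = {"Hair": 1, "Height": 2, "Weight": 3, "Lotion": 4}

def entropy(labels):
    """Shannon entropy of a list of class labels (0 log 0 treated as 0)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def avg_entropy(rows, attr_index):
    """Sample-weighted average entropy of the branches produced by a test."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[-1])
    total = len(rows)
    return sum(len(lbls) / total * entropy(lbls) for lbls in branches.values())

for name, idx in TESTS.items():
    print(f"{name:7s} {avg_entropy(DATA, idx):.2f}")
# Prints roughly: Hair 0.50, Height 0.69, Weight 0.94, Lotion 0.61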
Summary
– Compute the disorder (average entropy) for each possible test.
– Choose the test with the smallest average entropy.
– Then continue recursively on each subbranch that is not yet homogeneous.
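The whole procedure can be sketched recursively. This reuses DATA, TESTS, entropy, and avg_entropy from the previous sketch; it is an illustration of the idea, not the lecture's code.

def build_tree(rows, tests):
    """Recursively build an identification tree.
    rows  : list of data tuples (class label in the last position)
    tests : dict mapping test name -> attribute index, as above
    """
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:          # homogeneous set: stop with a leaf
        return labels[0]
    if not tests:                      # no tests left: return the majority label
        return max(set(labels), key=labels.count)
    # Pick the test with the smallest average entropy.
    best = min(tests, key=lambda t: avg_entropy(rows, tests[t]))
    idx = tests[best]
    remaining = {t: i for t, i in tests.items() if t != best}
    branches = {}
    for row in rows:
        branches.setdefault(row[idx], []).append(row)
    return (best, {value: build_tree(subset, remaining)
                   for value, subset in branches.items()})

print(build_tree(DATA, TESTS))
# e.g. ('Hair', {'blonde': ('Lotion', {'no': 'yes', 'yes': 'no'}),
#                'brown': 'no', 'red': 'yes'})

The returned tree matches the one shown earlier: Hair at the root, with a Lotion test on the blonde branch.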