THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY
CSIT 5220: Reasoning and Decision under Uncertainty
L10: Model-Based Classification and Clustering
Nevin L. Zhang
Room 3504

L10: Model-Based Classification and Clustering
- Probabilistic Models (PMs) for Classification
- PMs for Clustering

Classification
- The problem:
  - Given data:
  - Find mapping (A1, A2, …, An) → C
- Possible solutions
  - ANN
  - Decision tree (Quinlan)
  - …
  - (SVM: Continuous data)

Probabilistic Approach to Classification

Will Boss Play Tennis?

Bayesian Networks for Classification
- The Naïve Bayes model often performs well in practice
- Drawbacks of Naïve Bayes:
  - It assumes the attributes are mutually independent given the class variable
  - This assumption is often violated, leading to double counting of evidence
- Fixes:
  - General BN classifiers
  - Tree-augmented Naïve Bayes (TAN) models
  - …
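
As an illustration of the Naïve Bayes model, here is a minimal sketch for categorical attributes. The function names, the Laplace smoothing constant `alpha`, and the toy weather data are assumptions made for this example; they do not come from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Collect the counts needed for P(C) and P(A_i | C) (alpha = Laplace smoothing)."""
    class_counts = Counter(labels)
    n_attrs = len(rows[0])
    # value_counts[i][c][v] = number of class-c rows whose i-th attribute equals v
    value_counts = [defaultdict(Counter) for _ in range(n_attrs)]
    domains = [set() for _ in range(n_attrs)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][c][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains, alpha

def posterior(model, row):
    """Return P(C = c | row) for every class c, assuming attributes are independent given C."""
    class_counts, value_counts, domains, alpha = model
    total = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        s = math.log(n_c / total)                      # log P(c)
        for i, v in enumerate(row):
            num = value_counts[i][c][v] + alpha
            den = n_c + alpha * len(domains[i])
            s += math.log(num / den)                   # log P(a_i | c)
        log_scores[c] = s
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}

def classify(model, row):
    post = posterior(model, row)
    return max(post, key=post.get)

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
prediction = classify(model, ("overcast", "hot"))
```

Because each attribute contributes its own factor, two strongly correlated attributes would each add their evidence separately, which is the double counting mentioned above.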

Bayesian Networks for Classification
- General BN classifier
  - Treat the class variable as just another variable
  - Learn a BN
  - Classify the next instance based on the values of the variables in the Markov blanket of the class variable
  - Tends to perform poorly: the prediction depends only on the Markov boundary of the class variable, so information in the remaining attributes goes unused
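
To make the Markov blanket concrete, the sketch below computes it for a node in a DAG: its parents, its children, and the other parents of its children. The function name and the example structure are hypothetical and chosen only for illustration.

```python
def markov_blanket(parents, target):
    """parents maps each node to the list of its parents in the DAG."""
    blanket = set(parents.get(target, []))   # parents of the target
    for node, pa in parents.items():
        if target in pa:
            blanket.add(node)                # a child of the target
            blanket.update(pa)               # the child's other parents
    blanket.discard(target)
    return blanket

# Hypothetical learned structure over the class C and attributes A1..A5
parents = {
    "C": ["A1"],
    "A1": [],
    "A2": ["C", "A3"],
    "A3": [],
    "A4": ["A2"],
    "A5": [],
}
blanket_of_c = markov_blanket(parents, "C")
```

In this made-up structure, A4 and A5 fall outside the blanket of C, so their observed values would not affect the classification.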

Bayesian Networks for Classification
- Tree-Augmented Naïve Bayes (TAN) model
  - Capture dependence among attributes using a tree structure
  - During learning:
    - First learn a tree among the attributes using the Chow-Liu algorithm (a special structure learning problem, easy)
    - Add the class variable and estimate parameters
  - Classification: arg max_c P(C=c | A1=a1, …, An=an), computed by BN inference
- Many other methods
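
When all attributes are observed, the TAN classification rule can be written out as a product of local factors. The notation π(i) for the index of the tree parent of attribute A_i is an assumption introduced here; for the root of the attribute tree, which has no attribute parent, the factor is simply P(a_i | c).

```latex
\[
  P(c \mid a_1, \dots, a_n) \;\propto\; P(c) \prod_{i=1}^{n} P\bigl(a_i \mid a_{\pi(i)}, c\bigr),
  \qquad
  c^{*} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P\bigl(a_i \mid a_{\pi(i)}, c\bigr)
\]
```

Compared with Naïve Bayes, each attribute factor is conditioned on one additional attribute, which is how the tree captures dependence among attributes.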

Chow-Liu Trees
- Task: Find a tree model over the observed variables that has maximum likelihood given the data.
- Maximized loglikelihood
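
The formula for the maximized loglikelihood did not survive in this text. The standard identity for complete data is sketched below; the symbols are assumptions introduced here: D is the data, N the number of data cases, T a tree structure, θ its parameters, and the hats denote quantities computed from the empirical distribution.

```latex
\[
  \max_{\theta} \log P(D \mid T, \theta)
  \;=\; N \sum_{(X, Y) \in T} \hat{I}(X; Y) \;-\; N \sum_{X} \hat{H}(X)
\]
```

The entropy term does not depend on T, so maximizing the likelihood over trees amounts to maximizing the sum of empirical mutual information over the edges of the tree.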
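
Chow-Liu Trees
- Mutual Information: I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]
- The task is equivalent to finding a maximum spanning tree of the weighted, undirected graph whose nodes are the observed variables and whose edge between X and Y has weight equal to the empirical mutual information I(X; Y)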
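
A minimal sketch of how the edge weights of this graph could be computed from data, using empirical frequencies as probability estimates. The function names and the three toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # P(x, y) * log( P(x, y) / (P(x) P(y)) ), written with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (px[x] * py[y]))
    return mi

def chow_liu_edge_weights(columns):
    """columns: dict mapping variable name -> list of values. Returns (weight, u, v) triples."""
    names = sorted(columns)
    return [(mutual_information(columns[u], columns[v]), u, v)
            for i, u in enumerate(names) for v in names[i + 1:]]

# Hypothetical toy data: A and B are copies of each other, C is independent of both
columns = {"A": [0, 0, 1, 1], "B": [0, 0, 1, 1], "C": [0, 1, 0, 1]}
weights = chow_liu_edge_weights(columns)
```

In this toy data the edge between A and B gets weight log 2, while both edges involving C get weight 0.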

Maximum Spanning Trees
- Illustration of Kruskal's Algorithm
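
Since the illustration itself is not reproduced here, the following sketch shows Kruskal's algorithm adapted to maximum spanning trees: edges are considered in order of decreasing weight, and an edge is kept only if it joins two different components. The union-find helper and the example weights are assumptions for illustration.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """weighted_edges: iterable of (weight, u, v). Returns the chosen (u, v, weight) edges."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the component representative, compressing the path
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Heaviest edges first (for a minimum spanning tree, sort ascending instead)
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge connects two components, so no cycle is created
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical mutual-information weights between four variables
nodes = ["A", "B", "C", "D"]
edges = [(0.9, "A", "B"), (0.8, "B", "C"), (0.7, "A", "C"),
         (0.3, "C", "D"), (0.1, "B", "D")]
tree = maximum_spanning_tree(nodes, edges)
```

Here the edge A-C is skipped because A and C are already connected through B, and B-D is skipped because D is already attached through C.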

L10: Probabilistic Models (PMs) for Classification and Clustering
- Probabilistic Models (PMs) for Classification
- PMs for Clustering

A Medical Application
- In medical diagnosis, a gold standard sometimes exists
- Example: Lung Cancer
  - Symptoms: persistent cough, hemoptysis (coughing up blood), constant chest pain, shortness of breath, fatigue, etc.
  - Information for diagnosis: symptoms, medical history, smoking history, X-ray, sputum
  - Gold standard: biopsy, the removal of a small sample of tissue for examination under a microscope by a pathologist

A Medical Application
- Sometimes a gold standard does not exist
- Example: Rheumatoid Arthritis (RA)
  - Symptoms: back pain, neck pain, joint pain, joint swelling, morning joint stiffness, etc.
  - Information for diagnosis: symptoms, medical history, physical exam, and lab tests, including a test for rheumatoid factor (an antibody found in the blood of about 80 percent of adults with RA)
  - No gold standard:
    - None of the symptoms or their combinations is a clear-cut indicator of RA
    - The presence or absence of rheumatoid factor does not by itself indicate whether one has RA

LC Analysis of Hannover Rheumatoid Arthritis Data
- Class-specific probabilities
- Cluster 1: “disease” free
- Cluster 2: “back-pain type”
- Cluster 3: “Joint type”
- Cluster 4: “Severe type”
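
The slides do not say how the latent class (LC) model was fitted. One common choice is the EM algorithm, sketched below for binary symptoms under the assumption that symptoms are independent given the cluster. The function names, the small pseudo-count, and the two-cluster toy data are all hypothetical; this is not the Hannover data or the four-cluster analysis.

```python
import math
import random

def fit_latent_class(data, k, n_iter=50, seed=0):
    """EM for a latent class model over binary symptoms.
    Returns (cluster priors, class-specific probabilities, loglikelihood per iteration)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    priors = [1.0 / k] * k
    # theta[z][j] = P(symptom j present | cluster z), randomly initialised
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    history = []
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities for every case
        resp, loglik = [], 0.0
        for x in data:
            joint = []
            for z in range(k):
                p = priors[z]
                for j in range(d):
                    p *= theta[z][j] if x[j] else 1.0 - theta[z][j]
                joint.append(p)
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        history.append(loglik)
        # M-step: re-estimate parameters from expected counts
        for z in range(k):
            weight = sum(r[z] for r in resp)
            priors[z] = weight / n
            for j in range(d):
                present = sum(r[z] * x[j] for r, x in zip(resp, data))
                # tiny pseudo-count (an assumption) keeps probabilities off 0 and 1
                theta[z][j] = (present + 1e-3) / (weight + 2e-3)
    return priors, theta, history

def most_likely_class(x, priors, theta):
    """Index of the cluster with the highest posterior probability for case x."""
    scores = []
    for z, pz in enumerate(priors):
        p = pz
        for j, xj in enumerate(x):
            p *= theta[z][j] if xj else 1.0 - theta[z][j]
        scores.append(p)
    return scores.index(max(scores))

# Hypothetical toy data: two clearly separated symptom patterns, 10 cases each
pattern_a, pattern_b = [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]
data = [pattern_a] * 10 + [pattern_b] * 10
priors, theta, history = fit_latent_class(data, k=2)
```

The rows of `theta` play the role of the class-specific probabilities: reading off which symptoms are likely in each cluster is what allows clusters to be given interpretive names such as those listed above.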