Use of Active Learning for Selective Annotation of Training Data in a Supervised Classification System for Digitized Histology
Scott Doyle (1), Michael Feldman (2), John Tomaszewski (2), Anant Madabhushi (1)
(1) Department of Biomedical Engineering, Rutgers, The State University of New Jersey
(2) Department of Surgical Pathology, University of Pennsylvania
Outline: Background (Digital Prostate Histopathology, Supervised Classification, Active Learning); Methodology (Active Learning, Data Description, Experimental Setup); Experimental Results; Concluding Remarks
Prostate Cancer Detection: ~1 million biopsies per year in the USA; tissue samples per biopsy; 80% benign diagnosis; large amount of data to analyze
Computer-Aided Diagnosis: identifies regions of interest / suspicion; quantitative; automated; reduces variability; supervised classification system. Reference: Doyle, S., Feldman, M., Tomaszewski, J., Madabhushi, A. “A Hierarchical Computer-aided Classification Scheme for Automated Detection of Prostatic Adenocarcinoma from Digitized Histology,” APIII 2006
Supervised Classification: expert segmentation is required for training; histopathology is expensive and time-consuming to annotate; cost per training sample is high
Supervised Classification: choosing training samples at random is inefficient; possible redundancy with existing training data; no guarantee of improved accuracy
Solution: Active Learning. Choose training samples intelligently, not randomly; increased accuracy per training sample; forced choice of training, maximized accuracy. Useful where there is a large amount of unlabeled data and annotations are expensive. Ideally suited for histopathology data.
[Figure: Active Learning Classifier Performance. Axes: Accuracy, # of Training Samples. Curves: Random Learning, Active Learning.]
Previous Work: Liu [2004], Vogiatzis and Tsapatsoulis [2006]: gene microarray data; Yao et al. [2008]: content-based image retrieval; little work done in histopathology with Active Learning
Active Learning Methodology [flowchart]: Training Data (Labeled, obtained from pathologist; Unlabeled) -> Build Classifier -> Classify Unlabeled Training -> Classification: Cancer, Non-cancer, Uncertain
Active Learning Methodology: Identify Informative Regions. Uncertain classification: informative samples; obtain expert labels and combine with original set. Certain classification: uninformative; eliminate, since labeling these adds no information.
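The split between informative and uninformative samples can be illustrated with a minimal sketch. It assumes that uncertainty is measured by how evenly an ensemble's votes are divided; the slide does not state the actual criterion, and the function name, the sample ids, and the 0.3 / 0.7 thresholds are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_certainty(
    cancer_vote_fraction: Dict[str, float],
    low: float = 0.3,   # hypothetical lower threshold
    high: float = 0.7,  # hypothetical upper threshold
) -> Tuple[List[str], List[str]]:
    """Split unlabeled samples into (informative, uninformative) lists.

    cancer_vote_fraction maps a sample id to the fraction of ensemble
    members that voted "cancer" for that sample.
    """
    informative, uninformative = [], []
    for sample_id, fraction in sorted(cancer_vote_fraction.items()):
        if low <= fraction <= high:
            informative.append(sample_id)    # uncertain: send to the expert
        else:
            uninformative.append(sample_id)  # confident: labeling adds little
    return informative, uninformative


votes = {"roi_a": 0.95, "roi_b": 0.50, "roi_c": 0.05, "roi_d": 0.65}
to_label, skipped = split_by_certainty(votes)
print(to_label)  # ['roi_b', 'roi_d']
print(skipped)   # ['roi_a', 'roi_c']
```

Only the samples in `to_label` would be passed to the pathologist; the rest are left out of the annotation request.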
Active Learning Methodology: New Training Set -> Generate New Classifier
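Taken together, the methodology slides describe a loop: build a classifier, classify the unlabeled pool, query the expert on the uncertain samples, and retrain. The sketch below shows that loop under strong simplifying assumptions: a toy nearest-centroid classifier on a single scalar feature stands in for the real classifier, and `ask_expert` is a hypothetical stand-in for the pathologist. None of these names come from the slides.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[float, int]]  # (feature value, label); 1 = cancer, 0 = non-cancer


def fit_centroids(labeled: Labeled) -> Dict[int, float]:
    """Toy stand-in classifier: the mean feature value of each class."""
    model = {}
    for label in (0, 1):
        values = [x for x, y in labeled if y == label]
        model[label] = sum(values) / len(values)
    return model


def margin(x: float, model: Dict[int, float]) -> float:
    """Small margin = nearly equidistant from both class means = uncertain."""
    return abs(abs(x - model[0]) - abs(x - model[1]))


def active_learning(labeled: Labeled, unlabeled: List[float],
                    ask_expert: Callable[[float], int],
                    rounds: int, per_round: int) -> Tuple[Dict[int, float], Labeled]:
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = fit_centroids(labeled)                    # build classifier
        unlabeled.sort(key=lambda x: margin(x, model))    # most uncertain first
        queried, unlabeled = unlabeled[:per_round], unlabeled[per_round:]
        labeled += [(x, ask_expert(x)) for x in queried]  # obtain expert labels
    return fit_centroids(labeled), labeled                # new classifier, new training set


initial = [(0.0, 0), (10.0, 1)]
pool = [1.0, 4.8, 5.3, 9.0, 2.0, 8.0]
model, training_set = active_learning(initial, pool, lambda x: int(x > 5.0),
                                      rounds=1, per_round=2)
print(training_set[2:])  # [(4.8, 0), (5.3, 1)]
```

In this toy run the two samples nearest the decision boundary are queried first, while the samples far from it are never sent for labeling.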
Feature Extraction [figure panels: Original Image, Cancer Region, Feature Images]
Classification: Feature Images -> C4.5 Decision Tree; “Random Forest” [Breiman, 2001]; majority voting determines classification. Reference: Doyle, S., Madabhushi, A., Feldman, M., Tomaszewski, J.: A Boosting Cascade for Automated Detection of Prostate Cancer from Digitized Histology, MICCAI, Lecture Notes in Computer Science, Vol. 4191, pp
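Majority voting over an ensemble of trees can be sketched as follows. The three one-split "trees" here are hypothetical stand-ins, not C4.5 trees, and the feature values and thresholds are invented for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], str]


def majority_vote(trees: List[Tree], features: Sequence[float]) -> str:
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical one-split "trees", each thresholding the feature vector.
trees = [
    lambda f: "cancer" if f[0] > 0.5 else "non-cancer",
    lambda f: "cancer" if f[1] > 0.2 else "non-cancer",
    lambda f: "cancer" if f[0] + f[1] > 1.5 else "non-cancer",
]

label = majority_vote(trees, [0.7, 0.4])
print(label)  # cancer
```

Using an odd number of trees avoids ties in the two-class case.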
Image Data Description: 27 H&E stained digital biopsy samples. Data breakdown: Initial Training Set, Unlabeled Training Set, Testing Set. Active Learning samples are drawn from the Unlabeled Training Set. Groups are rotated so that all images are tested.
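The rotation of groups can be sketched as below. The slide gives the total of 27 samples but not the group sizes, so the equal three-way split is an assumption, as are the image names and the `rotate_groups` helper.

```python
from typing import Dict, List


def rotate_groups(groups: List[List[str]]) -> List[Dict[str, List[str]]]:
    """Return one role assignment per rotation of the three image groups."""
    roles = ["initial_training", "unlabeled_training", "testing"]
    assignments = []
    for shift in range(len(groups)):
        rotated = groups[shift:] + groups[:shift]
        assignments.append(dict(zip(roles, rotated)))
    return assignments


images = [f"biopsy_{i:02d}" for i in range(27)]      # 27 samples, as on the slide
groups = [images[0:9], images[9:18], images[18:27]]  # equal split is an assumption
folds = rotate_groups(groups)

# Every image appears in the testing role exactly once across the rotations.
tested = sorted(name for fold in folds for name in fold["testing"])
print(tested == images)  # True
```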
Classification: three training groups evaluated. Initial set: Initial Training. Active Learning set: Initial Training + Active Learning. Random Learning set: Initial Training + Random Learning.
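One way to assemble the three training groups is sketched below. It assumes that the Active Learning and Random Learning sets receive the same number of added samples so the comparison is budget-matched; the slide does not state this, and the function, the sample ids, and the `budget` parameter are hypothetical.

```python
import random
from typing import Dict, List


def build_training_sets(initial: List[str], ranked_pool: List[str],
                        budget: int, seed: int = 0) -> Dict[str, List[str]]:
    """ranked_pool lists unlabeled samples from most to least uncertain."""
    rng = random.Random(seed)  # fixed seed so the random draw is repeatable
    return {
        "initial": list(initial),
        "active": list(initial) + ranked_pool[:budget],
        "random": list(initial) + rng.sample(ranked_pool, budget),
    }


sets = build_training_sets(["s1", "s2"], ["u3", "u1", "u4", "u2"], budget=2)
print(sets["active"])                              # ['s1', 's2', 'u3', 'u1']
print(len(sets["active"]) == len(sets["random"]))  # True
```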
Results: Qualitative [figure panels: Original Image, Random Learning, Active Learning]
Results: Qualitative [figure panels: Random Learning, Active Learning]
Results: Qualitative [figure panels: Original Image, Random Learning, Active Learning]
Results: Qualitative [figure panels: Active Learning, Random Learning]
Quantitative Evaluation
Concluding Remarks: Maximize classification accuracy by choosing training intelligently: efficiently obtain annotations; make the most use of the “training budget”. Build Active Learning into clinical applications: online training correction / modification; user feedback.
Acknowledgements: The Coulter Foundation (WHCF ); New Jersey Commission on Cancer Research; The National Cancer Institute (R21CA , R03CA ); The US Department of Defense (427327); The Society for Medical Imaging and Informatics