Yukun Chen, Subramani Mani
Discovery Systems Lab (DSL), Department of Biomedical Informatics, Vanderbilt University
May 2010


 Introduction
 Datasets in the challenge
 Probabilistic models
 Querying methods
 Other methods for active learning
 Experiments and results
 Conclusion

 The active learning challenge is based on the pool-based active learning model.
 In practice, labeling is costly, while unlabeled observational data is abundantly available at low cost.
 An active learner can find the most informative instances and achieve high learning accuracy with minimal querying cost.
 In the challenge, we need to optimize the global score (ALC score) through our choice of probabilistic prediction model, querying strategy, and more.
 Learning from the challenge datasets is not easy: the data is very sparse, has unbalanced class labels, has a high-dimensional feature space, and has missing values.
 Uncertainty sampling with biasing consensus (USBC) is our basic active learning strategy for prediction and querying for labels.

[Table: overview of the development datasets and the final datasets]

 The Random Forests (RF) classifier is the basic prediction model we used in this challenge.
 We built a multi-model committee of multiple RF classifiers.
 The final prediction was based on the consensus posterior probability (CPP), the committee's combined posterior probability for each sample.
 We also considered the variance of the posterior probabilities across the models; this high-variance filter was used in the querying method.
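The committee consensus and the variance filter can be sketched as follows. The slide's exact CPP formula did not survive the transcript, so averaging across committee members is an assumption, and the threshold value is illustrative.

```python
import numpy as np

def consensus_posterior(probs):
    """Consensus posterior probability (CPP), here taken as the mean
    P(y=1|x) across the RF committee (the averaging form is assumed;
    the slide's exact formula is not shown in the transcript).

    probs: array of shape (n_models, n_samples).
    """
    return probs.mean(axis=0)

def high_variance_filter(probs, threshold=0.01):
    """Boolean mask: True for samples whose posterior variance across
    the committee is at most `threshold`. High-variance samples (strong
    committee disagreement) are excluded before querying."""
    return probs.var(axis=0) <= threshold
```

For example, with three models scoring three samples, `consensus_posterior` averages each column, and `high_variance_filter` drops the sample the models disagree on most.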

 The querying method ranks the samples by their informative values and outputs the most informative sample(s) to query.
 Least confidence with bias (LCB) was our basic querying method.
 The informative value of a sample is a function of its CPP and a bias factor pp (the positive fraction of the current training set in active learning).
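A minimal sketch of the ranking step. The precise functional form combining CPP and pp is not given in the transcript; treating pp as the uncertainty threshold (query the samples whose CPP is closest to the positive fraction, instead of closest to 0.5) is an assumed reading of the biasing.

```python
import numpy as np

def lcb_query_order(cpp, pp):
    """Least confidence with bias (LCB), sketched: rank unlabeled
    samples so those whose consensus posterior (CPP) lies closest to
    the bias factor pp (positive fraction of the current training set)
    come first. Using pp as the threshold is an assumed form."""
    return np.argsort(np.abs(np.asarray(cpp) - pp))
```

With a low positive fraction, this pulls the queried samples toward the rare positive class rather than the symmetric 0.5 boundary.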

 It is very important to have a good starting point on the learning curve in active learning, i.e., the prediction performance after knowing just one positive label.
 A purely unsupervised method (for example, a metric based on distance, similarity, or a clustering result) might not be good enough for prediction.
 We therefore combined unsupervised and supervised learning:
 (1) For all samples, compute the cosine similarity to the positive-labeled seed;
 (2) Assign negative labels to the K samples with the smallest cosine similarity values;
 (3) Train our multiple models on the one given positive sample and the K predicted negative samples, and predict for the remaining samples.
 Comparison of initial AUC between the cosine similarity function alone and the semi-supervised learning method:

Dataset     Initial AUC (cosine similarity)   Initial AUC (semi-supervised)
HIVA        +/– 0.41%                         +/– 0.65%
IBN_SINA    +/– 0.28%                         +/– 0.28%
NOVA        +/– 0.39%                         +/– 0.38%
ORANGE      +/– 0.51%                         +/– 0.78%
SYLVA       +/– 0.27%                         +/– 0.22%
ZEBRA       +/– 0.27%                         +/– 0.48%
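Steps (1)-(3) above can be sketched as below (numpy-only; the function and variable names are ours, not from the slides).

```python
import numpy as np

def bootstrap_from_seed(X, seed_idx, k):
    """Build the initial training set from one positive seed:
    (1) cosine similarity of every sample to the seed,
    (2) the k least similar samples get predicted negative labels,
    (3) return (indices, labels) to train the committee on.
    """
    seed = X[seed_idx]
    denom = np.linalg.norm(X, axis=1) * np.linalg.norm(seed)
    sims = (X @ seed) / np.where(denom == 0, 1.0, denom)
    sims[seed_idx] = np.inf               # never select the seed as a negative
    negatives = np.argsort(sims)[:k]      # k smallest similarities
    idx = np.concatenate(([seed_idx], negatives))
    labels = np.array([1] + [0] * k)
    return idx, labels
```

The returned pair is the one-positive-plus-K-predicted-negatives training set fed to the RF committee for the initial prediction.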

 For some datasets (ZEBRA, ORANGE, HIVA, and NOVA), our models did not predict well when the training set was small. Bad initial performance can badly hurt the global score, which is based on the learning curve in log2 space (see the learning curves with respect to initial batch size).
 We ran a batch size validation to search for the minimal sufficient size of the initial training set.
 This prevented a significant drop in performance at the beginning for our prediction model.
 Batch size validation result figures for ZEBRA, IBN_SINA, and NOVA:
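For reference, the global (ALC) score that the batch size validation protects can be approximated as the normalized area under the AUC-vs-log2(labels) learning curve; normalizing against a constant random baseline of AUC = 0.5 is our reading of the challenge's convention.

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal area (written out to avoid numpy version differences)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def alc_score(n_labels, aucs):
    """Approximate ALC global score: area under the AUC vs. log2(number
    of labels queried) curve, rescaled so a constant random curve
    (AUC = 0.5) scores 0 and a perfect curve (AUC = 1) scores 1."""
    x = np.log2(np.asarray(n_labels, dtype=float))
    area = _trapezoid(aucs, x)
    a_max = _trapezoid(np.ones(len(x)), x)
    a_rand = _trapezoid(np.full(len(x), 0.5), x)
    return (area - a_rand) / (a_max - a_rand)
```

Because the x-axis is logarithmic, the earliest queries dominate the area, which is why a poor start with a tiny initial batch can sink the whole score.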

[Figure: ZEBRA learning curves under USBC (area under the ROC curve vs. log2 of the number of labels queried) for initial batch sizes of 2, 16, 256, 1024, 4096, and 16384]

 (1) Initialization:
 (1.1) Run preprocessing steps (missing value imputation, PCA, etc.) if needed.
 (1.2) Assign the batch size as a function of the iteration, depending on the batch size validation result.
 (2) Run semi-supervised learning for the initial prediction and basic uncertainty sampling to rank and query samples.
 (3) Run uncertainty sampling with biasing consensus (USBC) in the iterations of active learning:
 (3.1) Add predicted negative samples to the training set (if activated).
 (3.2) Train 5 RF models and predict for all unlabeled samples.
 (3.3) Apply the high-variance filter (if activated).
 (3.4) Run uncertainty sampling with bias to rank and query samples (the bias factor is a function of the positive fraction and the size of the training set).
 (4) Output the learning curves and the global ALC score.
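Step (1.2) can be sketched as a doubling schedule. The actual function used is not specified in the transcript, so the geometric growth below is an assumption, chosen to match the log2 x-axis of the learning curve.

```python
def batch_size(iteration, initial_batch):
    """Assumed batch schedule for step (1.2): query `initial_batch`
    labels first, then double the batch at each iteration."""
    return initial_batch * (2 ** iteration)

def labels_queried(n_iterations, initial_batch):
    """Total number of labels queried after n_iterations of this schedule."""
    return sum(batch_size(i, initial_batch) for i in range(n_iterations))
```

Under such a schedule the learning curve is sampled at evenly spaced points in log2 space, so each iteration contributes a comparable slice of the ALC area.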

The results for the development datasets:

Dataset     ALC   AUC         Initial AUC   Initial Batch Size   Use Filter   Use Predicted Negative
HIVA              +/– 0.79%   +/– 0.65%     1                    No
IBN_SINA          +/– 0.09%   +/– 0.28%     1                    No           Yes
NOVA              +/– 0.14%   +/– 0.38%     16                   Yes
ORANGE            +/– 1.11%   +/– 0.78%     1                    No           Yes
SYLVA             +/– 0.04%   +/– 0.22%     1                    No
ZEBRA             +/– 0.56%   +/– 0.48%     16384                No

The results for the final datasets:

Dataset     ALC   AUC         Initial AUC   Initial Batch Size   Use Filter   Use Predicted Negative   Rank
A                 +/– 0.39%                                      No           Yes                      9
B                 +/– 0.44%                                      No           Yes                      12
C                 +/– 0.52%                                      No                                    12
D                 +/– 0.33%                                      Yes                                   12
E                 +/– 0.39%                                      No                                    1
F                 +/– 0.09%                                      No                                    3

[Figure: learning curves on the final datasets. Global scores: A: 0.36; B: 0.13; C: 0.19; D: 0.54; E: 0.63; F: 0.79]

 For dataset E, the global score benefited from the batch size validation, and semi-supervised learning generated a good starting point. We won on dataset E.
 For dataset F, the learning curve based on USBC is acceptable, except that the initial performance is not stable. We were ranked 3rd on F.
 For dataset D, the batch size validation was also effective, and the high-variance filter successfully prevented a significant drop in the curve; but the starting point is quite low.
 For dataset A, USBC worked well once the training set had at least 64 samples. However, the low initial performance hurt our global score.
 Datasets B and C are the hardest datasets, like HIVA and ORANGE. Our prediction models were not effective on these datasets.

 Our strategy involves more than a prediction model and a query model: semi-supervised learning and batch size validation are also important parts of the active learning process.
 Our methods need further evaluation on additional datasets.
 The active learning challenge is still a very open problem.
 One possible future direction is to automatically assign the batch size as a function of predictive performance and informativeness.