Active Learning Challenge Isabelle Guyon (Clopinet, California) Gavin Cawley (University of East Anglia, UK) Olivier Chapelle (Yahoo!, California) Gideon Dror (Academic College of Tel-Aviv-Yaffo, Israel) Vincent Lemaire (Orange, France) Amir Reza Saffari Azar (Graz University of Technology, Austria) Alexander Statnikov (New York University, USA)
Active Learning Challenge What is the problem?
Active Learning Challenge Labeling data is expensive $$ $$$$$
Active Learning Challenge Examples of domains Chemo-informatics Handwriting and speech recognition Image processing Text processing Marketing Ecology Embryology
Active Learning Challenge What is active learning?
Active Learning Challenge What is out there?
Active Learning Challenge Scenarios Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
Active Learning Challenge De novo queries De novo queries implicitly assume interventions on the system under study: not for this challenge
Active Learning Challenge Focus on pool-based AL Simplest scenario for a challenge. Training data: labels can be queried. Test data: unknown labels. Methods developed for pool-based AL should also be useful for stream-based AL.
Active Learning Challenge Example (a) Toy 2-class problem: 400 instances, Gaussian distributed. (b) Linear logistic regression model trained with 30 random instances (accuracy = 0.7). (c) Linear logistic regression model trained with 30 actively queried instances using uncertainty sampling (accuracy = 0.9). Burr Settles, 2009
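Uncertainty sampling, as used in panel (c), can be sketched in a few lines. This is a minimal illustration, not the code behind the figure: the function names, the fixed weights, and the tiny pool are all hypothetical. For a two-class logistic model, the most uncertain instances are those whose predicted probability is closest to 0.5.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of the positive class under a linear logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def most_uncertain(weights, bias, pool, k=1):
    """Indices of the k pool instances whose probability is closest to 0.5."""
    scored = [(abs(predict_proba(weights, bias, x) - 0.5), i)
              for i, x in enumerate(pool)]
    scored.sort()  # smallest distance to 0.5 first
    return [i for _, i in scored[:k]]

# Hypothetical unlabeled pool and model parameters, for illustration only.
pool = [(-3.0, 0.5), (0.1, -0.2), (2.5, 1.0), (-0.8, 0.3)]
weights, bias = (1.0, 1.0), 0.0
picked = most_uncertain(weights, bias, pool, k=2)
```

Here the two instances nearest the decision boundary (indices 1 and 3) are the ones selected for labeling.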
Active Learning Challenge Learning curve Burr Settles, 2009
Active Learning Challenge Other methods Expected model change (greatest gradient if sample were used for training) Query by committee (query the sample subject to largest disagreement) Bayesian active learning (maximize change in revised posterior distribution) Expected error reduction (maximize generalization performance improvement) Information density (ask for examples both informative and representative) Burr Settles, 2009
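Of the methods listed, query by committee is easy to sketch. The version below measures disagreement with vote entropy, which is one common choice among several; the function names and the example votes are assumptions made for illustration.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the label votes cast by committee members for one instance."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def query_by_committee(committee_votes):
    """Index of the pool instance on which the committee disagrees most.

    committee_votes[i] holds the labels predicted for pool instance i,
    one label per committee member.
    """
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))

# Hypothetical votes of a 3-member committee on 3 pool instances.
votes = [[1, 1, 1], [1, -1, 1], [-1, -1, -1]]
chosen = query_by_committee(votes)
```

Only the second instance splits the committee, so it is the one queried.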
Active Learning Challenge Datasets
Active Learning Challenge Data donors This project would not have been possible without generous donations of data:
Chemoinformatics -- Charles Bergeron, Kristin Bennett and Curt Breneman (Rensselaer Polytechnic Institute, New York) contributed a dataset, which will be used for final testing.
Embryology -- Emmanuel Faure, Thierry Savy, Louise Duloquin, Miguel Luengo Oroz, Benoit Lombardot, Camilo Melani, Paul Bourgine, and Nadine Peyriéras (Institut des systèmes complexes, France) contributed the ZEBRA dataset.
Handwriting recognition -- Reza Farrahi Moghaddam, Mathias Adankon, Kostyantyn Filonenko, Robert Wisnovsky, and Mohamed Chériet (Ecole de technologie supérieure de Montréal, Quebec) contributed the IBN_SINA dataset.
Marketing -- Vincent Lemaire, Marc Boullé, Fabrice Clérot, Raphael Féraud, Aurélie Le Cam, and Pascal Gouzien (Orange, France) contributed the ORANGE dataset, previously used in the KDD Cup 2009.
We also reused data made publicly available on the Internet:
Chemoinformatics -- The National Cancer Institute (USA) for the HIVA dataset.
Ecology -- Jock A. Blackard, Denis J. Dean, and Charles W. Anderson (US Forest Service, USA) for the SYLVA dataset (Forest cover type).
Text processing -- Tom Mitchell (USA) and Ron Bekkerman (Israel) for the NOVA dataset (derived from the Twenty Newsgroups).
Active Learning Challenge Development datasets
Active Learning Challenge Difficulties Sparse data Missing values Unbalanced classes Categorical variables Noisy data Large datasets
Active Learning Challenge Final test datasets Will serve to do the final ranking. Will be from the same domains. May have different data representations and distributions. No feedback: the results will not be revealed until the end of the challenge.
Active Learning Challenge Protocol
Active Learning Challenge Virtual Lab Joint work with: Constantin Aliferis, New York University Gregory F. Cooper, University of Pittsburgh André Elisseeff, Nhumi, Zürich Jean-Philippe Pellet, IBM Zürich Alexander Statnikov, New York University Peter Spirtes, Carnegie Mellon Virtual cash
Active Learning Challenge Step by step instructions Download the data. You get 1 labeled example. Then iterate: 1. Predict 2. Sample 3. Submit a query 4. Retrieve the labels
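The predict / sample / query / retrieve cycle can be simulated locally. Everything in this sketch is hypothetical: `LabelOracle` stands in for the Virtual Lab (one unit of virtual cash per label is an assumption), the predictor is a toy 1-nearest-neighbour rule on 1-D data, and the sampling strategy simply picks the points farthest from any labeled point.

```python
class LabelOracle:
    """Hypothetical stand-in for the label server: sells labels for virtual cash."""
    def __init__(self, hidden_labels, budget):
        self._labels = hidden_labels
        self.budget = budget  # assumed cost: one unit of cash per label

    def query(self, indices):
        if len(indices) > self.budget:
            raise ValueError("not enough virtual cash")
        self.budget -= len(indices)
        return {i: self._labels[i] for i in indices}

def predict(x, labeled, data):
    """Toy predictor: label of the closest labeled point."""
    nearest = min(labeled, key=lambda i: abs(data[i] - x))
    return labeled[nearest]

def sample(data, labeled, batch_size):
    """Toy strategy: unlabeled points farthest from every labeled point."""
    unlabeled = [i for i in range(len(data)) if i not in labeled]
    unlabeled.sort(key=lambda i: min(abs(data[i] - data[j]) for j in labeled),
                   reverse=True)
    return unlabeled[:batch_size]

def run(data, oracle, seed_index, seed_label, batch_size=2):
    labeled = {seed_index: seed_label}           # you start with 1 labeled example
    history = []
    while oracle.budget > 0 and len(labeled) < len(data):
        preds = [predict(x, labeled, data) for x in data]            # 1. predict
        history.append((len(labeled), preds))
        batch = sample(data, labeled, min(batch_size, oracle.budget))  # 2. sample
        labeled.update(oracle.query(batch))      # 3. submit query, 4. retrieve labels
    return labeled, history

data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
oracle = LabelOracle([-1, -1, -1, 1, 1, 1], budget=4)
labeled, history = run(data, oracle, seed_index=0, seed_label=-1)
```

Each entry of `history` pairs the number of labels held with the predictions made at that point, which is the raw material for a learning curve.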
Active Learning Challenge Two phases Development phase: 6 datasets available; can try as many times as you want; Matlab users can run queries on their computers; others can use the labels (provided). Final test phase: 6 new datasets available; a single try; no feedback.
Active Learning Challenge Evaluation
Active Learning Challenge AUC score For each set of samples queried, we assess the predictions of the learning machine with the Area under the ROC curve.
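As a reminder of what the AUC measures, here is a small sketch that computes it as the fraction of (positive, negative) pairs ranked correctly, counting ties as one half. This pairwise form is equivalent to the area under the ROC curve but is quadratic in the number of examples, so it is meant for illustration only; the function name and the +1/-1 label encoding are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: real-valued predictions, larger means "more positive".
    labels: +1 for the positive class, anything else for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
```

In the example, three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75.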
Active Learning Challenge Area under the Learning Curve (ALC) Linear interpolation. Horizontal extrapolation. [Figure panels: one query; five queries; thirteen queries.] Lazy: ask for all labels at once
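The two rules named on this slide, linear interpolation between measured points and horizontal extrapolation after the last one, can be sketched as follows. The axis scaling and any normalization used in the official score are not specified here, so this sketch assumes a plain linear axis and divides by the axis width to keep the result on the AUC scale; the function name and the sample curves are hypothetical.

```python
def area_under_learning_curve(points, total_labels):
    """Area under a learning curve, normalized by the width of the x range.

    points: (number_of_labels, auc) pairs, one per query round.
    total_labels: right end of the axis, i.e. all labels purchased.
    """
    pts = sorted(points)
    if pts[-1][0] < total_labels:
        # Horizontal extrapolation: performance stays flat after the last query.
        pts.append((total_labels, pts[-1][1]))
    width = pts[-1][0] - pts[0][0]
    if width <= 0:
        raise ValueError("need at least two distinct x positions")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid rule is exactly linear interpolation between the points.
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / width

# Hypothetical curves: a "lazy" learner buys all labels in one query,
# an active learner reaches good performance early.
lazy = area_under_learning_curve([(1, 0.5), (100, 0.9)], total_labels=100)
active = area_under_learning_curve([(1, 0.5), (10, 0.85), (100, 0.9)], total_labels=100)
```

With these made-up numbers the lazy strategy scores lower than the active one, even though both end at the same final AUC, because the straight line from the first to the last point lies below the curve that rises early.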
Active Learning Challenge Prizes If you win on… 1 dataset: $100; 2 datasets: $200; 3 datasets: $400; 4 datasets: $800; 5 datasets: $1600; 6 datasets: $3200! Plus travel awards for top ranking students.
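The prize schedule doubles with each additional dataset won. A small sketch, assuming the formula quoted in the conclusion is read as $100 times 2 to the power (n-1), where n is the number of datasets won (the function name is hypothetical):

```python
def prize(n_datasets_won):
    """Prize in dollars for winning on n datasets, assuming P(n) = 100 * 2**(n-1)."""
    if n_datasets_won < 1:
        return 0
    return 100 * 2 ** (n_datasets_won - 1)

schedule = [prize(n) for n in range(1, 7)]
```

Under that reading, the schedule for one to six datasets is 100, 200, 400, 800, 1600 and 3200 dollars.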
Active Learning Challenge Schedule
Active Learning Challenge Conclusion Try our new challenge, learn, and win!!!! –Workshops: AISTATS 2010, Sardinia, May 2010; WCCI 2010 Workshop, Barcelona, July 2010. Travel awards for top ranking students. –Proceedings published by JMLR & IEEE. –Prizes: P(n) = $100 * 2^(n-1), where n is the number of datasets won. –Your problem solved by dozens of research groups: help us organize the next challenge!