Kiri Wagstaff, Jet Propulsion Laboratory, California Institute of Technology. July 25, 2012, Association for the Advancement of Artificial Intelligence. CHALLENGES FOR MACHINE LEARNING IMPACT ON THE REAL WORLD.

Presentation transcript:

CHALLENGES FOR MACHINE LEARNING IMPACT ON THE REAL WORLD
Kiri Wagstaff, Jet Propulsion Laboratory, California Institute of Technology
July 25, 2012, Association for the Advancement of Artificial Intelligence
© 2012, California Institute of Technology. Government sponsorship acknowledged. This talk was prepared at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with NASA.

MACHINE LEARNING IS GOOD FOR: [Image slide. Photos: Matthew W. Jackson, Eugene Fratkin; citation: Nguyen et al., 2008]

WHAT IS ITS IMPACT? (i.e., publishing results to impress other ML researchers) [Diagram: data flows into the "Machine Learning world," which emits only accuracy numbers: 76%, 83%, 89%, 91%]

ML RESEARCH TRENDS THAT LIMIT IMPACT
1. Data sets disconnected from meaning
2. Metrics disconnected from impact
3. Lack of follow-through

UCI DATA SETS
"The standard Irvine data sets are used to determine percent accuracy of concept classification, without regard to performance on a larger external task." - Jaime Carbonell
But that was way back in 1992, right?
UCI: online archive of data sets provided by the University of California, Irvine [Frank & Asuncion, 2010]

UCI DATA SETS TODAY

1. DATA SETS DISCONNECTED FROM MEANING
UCI initially … UCI today …
"Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one." - UCI Mushroom data set page
Did you know that the mushroom data set has 3 classes, not 2? Have you ever used this knowledge to interpret your results on this data set?
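As a concrete illustration of how easy this is to miss, here is a minimal sketch that pulls the mushroom data from the UCI archive and inspects its labels. The URL and column layout follow the archive's agaricus-lepiota files as of this writing; they are assumptions about the hosting, not part of the talk.

```python
# Sketch: a quick look at the UCI mushroom data's labels.
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")
cols = ["class"] + [f"attr_{i}" for i in range(1, 23)]  # 22 categorical attributes
df = pd.read_csv(url, header=None, names=cols)

# Only two labels survive: 'e' (edible) and 'p' (poisonous).
# The original "unknown edibility" species were folded into 'p',
# so a classifier's "poisonous" class is actually a merged class.
print(df["class"].value_counts())
```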

DATA SETS CAN BE USEFUL BENCHMARKS
1. They enable direct empirical comparisons with other techniques, and reproducing others' results. But there is no standard for reproducibility.
2. Results are easier to interpret, since data set properties are well understood. But we don't actually understand these data sets, and the field doesn't require any interpretation.
Too often, we fail at both goals.
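On the reproducibility point, nothing in our publishing pipeline forces an experiment to be rerunnable. Below is a minimal sketch of the kind of provenance record that would help; the function name and fields are illustrative, not a proposal from the talk.

```python
# Sketch: log everything needed to rerun an experiment next to its result.
import hashlib
import json
import random
import sys

import numpy as np
import sklearn


def provenance_record(X: np.ndarray, seed: int, results: dict) -> dict:
    """Bundle library versions, data hash, and seed with the reported numbers."""
    return {
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "sklearn": sklearn.__version__,
        "dataset_sha256": hashlib.sha256(X.tobytes()).hexdigest(),  # pins the exact data
        "random_seed": seed,  # pins splits, initialization, etc.
        "results": results,
    }


seed = 42
random.seed(seed)
np.random.seed(seed)
X = np.random.rand(100, 4)  # stand-in for the real training data
# ... train and evaluate a model here ...
record = provenance_record(X, seed, {"accuracy": 0.96})
print(json.dumps(record, indent=2))
```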

BENCHMARK RESULTS THAT MATTER
Show me either:
1. Data set properties that permit generalization of results. Does your method work on binary data sets? Real-valued features? Specific covariance structures? Overlapping classes?
2. How your improvement matters to the originating field. A 4.6% improvement in detecting cardiac arrhythmia? We could save lives! 96% accuracy in separating poisonous and edible mushrooms? Not good enough for me to trust it!
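A hedged sketch of what reporting such properties could look like in practice; the profile fields below are illustrative choices, not a standard from the talk.

```python
# Sketch: summarize the data set properties a reader needs in order to
# judge whether a benchmark result generalizes to their own data.
import numpy as np


def profile_dataset(X: np.ndarray, y: np.ndarray) -> dict:
    classes, counts = np.unique(y, return_counts=True)
    return {
        "n_samples": int(X.shape[0]),
        "n_features": int(X.shape[1]),
        "all_binary_features": bool(np.isin(X, [0, 1]).all()),
        "class_balance": {int(c): round(float(n) / y.size, 3)
                          for c, n in zip(classes, counts)},
    }


rng = np.random.default_rng(0)
X = rng.random((500, 8))        # stand-in data: real-valued features
y = rng.integers(0, 2, 500)     # stand-in binary labels
print(profile_dataset(X, y))
```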

2. METRICS DISCONNECTED FROM IMPACT
Accuracy, RMSE, precision, recall, F-measure, AUC, … deliberately ignore problem-specific details. They cannot tell us:
- WHICH items were classified correctly or incorrectly
- What impact a 1% change has (what does it mean?)
- How to compare across problem domains
"The approach we proposed in this paper detected correctly half of the pathological cases, with acceptable false positive rates (7.5%), early enough to permit clinical intervention." - "A Machine Learning Approach to the Detection of Fetal Hypoxia during Labor and Delivery," Warrick et al., 2010
This doesn't mean accuracy, etc. are bad measures, just that they should not remain abstractions.
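A sentence like Warrick et al.'s can be produced mechanically from a confusion matrix. A minimal sketch, with made-up prediction counts chosen only to match the quoted figures (half of cases detected, 7.5% false positives):

```python
# Sketch: translate abstract classifier metrics into the originating
# field's terms. All counts here are fabricated for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1] * 40 + [0] * 960)                      # 40 pathological cases
y_pred = np.array([1] * 20 + [0] * 20 + [1] * 72 + [0] * 888)  # hypothetical detector

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
detection_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)

# Report domain-grounded counts, not just "recall" and "FPR".
print(f"Detected {tp} of {tp + fn} pathological cases ({detection_rate:.0%}),")
print(f"with {fp} false alarms among {fp + tn} healthy deliveries "
      f"({false_positive_rate:.1%}).")
```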

3. LACK OF FOLLOW-THROUGH
[Diagram: the ML research program stops short of real-world deployment ("This is hard!"), driven by ML publishing incentives]

CHALLENGES FOR INCREASING IMPACT
Increase the impact of your work:
1. Employ meaningful evaluation methods: direct measurement of impact when possible; translate abstract metrics into domain context.
2. Involve the world outside of ML.
3. Choose problems to tackle based on expected impact.
Increase the impact of the field:
1. Evaluate impact in your reviews.
2. Contribute to the upcoming MLJ Special Issue (Machine Learning for Science and Society).
3. More ideas? Contribute to mlimpact.com.

MLIMPACT.COM