1
Capturing Human Insight for Visual Learning
Kristen Grauman, Department of Computer Science, University of Texas at Austin
Work with Sudheendra Vijayanarasimhan, Adriana Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang, and Jeff Donahue
Frontiers in Computer Vision Workshop, MIT, August 22, 2011
2
Problem: how to capture human insight about the visual world?
The complex space of visual objects, activities, and scenes. [tiny image montage by Torralba et al.]
The point-and-label annotation "mold" is restrictive, and human effort is expensive.
3
Problem: how to capture human insight about the visual world?
The complex space of visual objects, activities, and scenes. [tiny image montage by Torralba et al.]
Our approach:
- Listen: explanations, comparisons, implied cues, ...
- Ask: actively learn
4
Deepening human communication to the system
What is this? How do you know? Is it ‘furry’? What property is changing here? Which is more ‘open’? What’s worth mentioning? Do you find him attractive? Why?
[Donahue & Grauman ICCV 2011; Hwang & Grauman BMVC 2010; Parikh & Grauman ICCV 2011, CVPR 2011; Kovashka et al. ICCV 2011]
5
Soliciting rationales
We propose to ask the annotator not just what, but also why.
Is the team winning? Is her form perfect? Is it a safe route? How can you tell?
6
Soliciting rationales [Donahue & Grauman, ICCV 2011]
Annotation task: Is her form perfect? How can you tell? (good form: pointed toes, balanced; bad form: falling, knee angled)
Each spatial or attribute rationale is turned into a synthetic contrast example that influences the classifier, following Zaidan et al. [HLT 2007].
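The mechanics of these contrast examples can be sketched compactly in the style of Zaidan et al. [HLT 2007]: suppress the features an annotator's rationale points to, and fold the resulting pseudo-examples into SVM training. This is a hedged illustration rather than the authors' implementation; the mask encoding, the scaling parameter mu, and the pseudo-example weight are assumptions.

```python
# Hedged sketch of rationale-based training (after Zaidan et al., HLT 2007):
# a contrast example suppresses the evidence the annotator highlighted, and
# the original-minus-contrast difference is added as a soft extra constraint.
# Feature extraction, masks, and hyperparameters are assumed for illustration.
import numpy as np
from sklearn.svm import LinearSVC

def build_rationale_training_set(X, y, rationale_masks, mu=0.1, pseudo_weight=0.5):
    """X: (n, d) features; y: (n,) labels in {+1, -1};
    rationale_masks: (n, d) with 1s on the dimensions (spatial region or
    attribute) that the annotator's rationale highlighted."""
    X_contrast = X * (1.0 - rationale_masks)        # rationale evidence removed
    X_pseudo = (X - X_contrast) / mu                # scaled contrast difference
    X_aug = np.vstack([X, X_pseudo])
    y_aug = np.concatenate([y, y])                  # pseudo-examples keep the label
    w = np.concatenate([np.ones(len(y)), pseudo_weight * np.ones(len(y))])
    return X_aug, y_aug, w

# Usage (illustrative): train a linear SVM on the augmented, weighted set.
# X_aug, y_aug, w = build_rationale_training_set(X, y, masks)
# clf = LinearSVC(C=1.0).fit(X_aug, y_aug, sample_weight=w)
```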
7
Rationale results [Donahue & Grauman, ICCV 2011]
Scene Categories: How can you tell the scene category?
Hot or Not: What makes them hot (or not)?
Public Figures: What attributes make them (un)attractive?
Rationales were collected from hundreds of MTurk workers.
8
Rationale results [Donahue & Grauman, ICCV 2011]

PubFig          Originals   +Rationales
Male            64.60%      68.14%
Female          51.74%      55.65%

Hot or Not      Originals   +Rationales
Male            54.86%      60.01%
Female          55.99%      57.07%

Scenes (mean AP)  Originals   +Rationales
Kitchen           0.1196      0.1395
Living Rm         0.1142      0.1238
Inside City       0.1299      0.1487
Coast             0.4243      0.4513
Highway           0.2240      0.2379
Bedroom           0.3011      0.3167
Street            0.0778      0.0790
Country           0.0926      0.0950
Mountain          0.1154      0.1158
Office            0.1051      0.1052
Tall Building     0.0688      0.0689
Store             0.0866      0.0867
Forest            0.3956      0.4006
9
Learning what to mention
Issue: presence of objects != significance.
Our idea: learn a cross-modal representation that accounts for "what to mention."
Training input: human-given descriptions, e.g., TAGS: Cow, Birds, Architecture, Water, Sky.
Textual cues: frequency, relative order, mutual proximity. Visual cues: texture, scene, color, ...
10
Learning what to mention: importance-aware semantic space
The two views (the image and its human-given description) are projected into a common importance-aware semantic space. [Hwang & Grauman, BMVC 2010]
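One plausible way to build such a two-view space is canonical correlation analysis between visual descriptors and tag features that encode how prominently humans mention each object; the published method uses a kernelized variant with richer cues, so the linear CCA, the toy order-based encoding, the small vocabulary, and the random placeholder data below are all assumptions for illustration.

```python
# Rough sketch: learn a shared semantic space with CCA between a visual view
# and a textual view that weights tags by how early humans mention them.
# The real method is kernelized and uses richer textual cues; everything
# below (features, vocabulary, data) is a placeholder for illustration.
import numpy as np
from sklearn.cross_decomposition import CCA

def tag_importance_features(tag_lists, vocab):
    """Encode each image's tags by mention order: earlier tags get larger
    weights, absent tags get 0 (a stand-in for order/frequency/proximity cues)."""
    F = np.zeros((len(tag_lists), len(vocab)))
    for i, tags in enumerate(tag_lists):
        for rank, t in enumerate(tags):
            if t in vocab:
                F[i, vocab[t]] = 1.0 / (rank + 1)
    return F

rng = np.random.default_rng(0)
vocab = {"cow": 0, "birds": 1, "architecture": 2, "water": 3, "sky": 4}
X_visual = rng.random((100, 64))                    # placeholder image descriptors
tag_lists = [list(rng.permutation(list(vocab)))[:3] for _ in range(100)]  # placeholder tag orderings
X_text = tag_importance_features(tag_lists, vocab)

cca = CCA(n_components=4)
cca.fit(X_visual, X_text)
z_query = cca.transform(X_visual[:1])   # project a new image into the shared space
# Retrieval would then rank the database by distance to z_query in this space.
```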
11
Learning what to mention: results [Hwang & Grauman, BMVC 2010]
For each query image, retrievals from our method are compared against words+visual and visual-only baselines.
12
Problem: how to capture human insight about the visual world?
The complex space of visual objects, activities, and scenes. [tiny image montage by Torralba et al.]
Our approach:
- Listen: explanations, comparisons, implied cues
- Ask: actively learn
13
Traditional active learning
At each cycle, the current model selects the most informative or uncertain example from the unlabeled data, the annotator provides its label, and the example joins the labeled set used to retrain the model.
[Mackay 1992, Freund et al. 1997, Tong & Koller 2001, Lindenbaum et al. 2004, Kapoor et al. 2007, ...]
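For reference, a minimal uncertainty-sampling loop matching this diagram might look like the following; the linear SVM, the margin-based uncertainty score, the oracle interface, and the round count are assumptions for illustration.

```python
# Minimal uncertainty-sampling loop (a sketch, not any paper's exact setup):
# retrain on the labeled pool, query the annotator on the example closest to
# the current decision boundary, and move it into the labeled set.
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X_labeled, y_labeled, X_unlabeled, oracle, n_rounds=20):
    # Assumes the initial labeled set already contains both classes.
    X_l, y_l, X_u = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    for _ in range(n_rounds):
        clf = SVC(kernel="linear").fit(X_l, y_l)        # current model
        margins = np.abs(clf.decision_function(X_u))    # distance to the boundary
        i = int(np.argmin(margins))                     # most uncertain example
        label = oracle(X_u[i])                          # ask the annotator
        X_l = np.vstack([X_l, X_u[i:i + 1]])
        y_l = np.append(y_l, label)
        X_u = np.delete(X_u, i, axis=0)
    return clf
```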
14
Challenges in active visual learning
Annotation tasks vary in cost and informativeness; multiple annotators work in parallel; the unlabeled pools of data are massive.
[Vijayanarasimhan & Grauman NIPS 2008, CVPR 2009; Vijayanarasimhan et al. CVPR 2010, CVPR 2011; Kovashka et al. ICCV 2011]
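One simple way to fold cost into the selection is to rank candidate annotations by estimated information gain per unit of predicted annotation cost (e.g., predicted seconds of annotator effort); the gain and cost estimators in this sketch are placeholders, not the papers' value-of-information formulation.

```python
# Hedged sketch of cost-sensitive selection: prefer the annotation (example,
# task) whose expected benefit is largest per unit of predicted cost. The
# info_gain_fn and cost_fn estimators are assumed placeholders.
import numpy as np

def select_annotation(candidates, info_gain_fn, cost_fn):
    """candidates: list of (example, task) pairs, where a task might be a
    coarse image label, a bounding box, or a full segmentation."""
    scores = [info_gain_fn(ex, task) / max(cost_fn(ex, task), 1e-6)
              for ex, task in candidates]
    return candidates[int(np.argmax(scores))]
```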
15
Sub-linear time active selection [Jain, Vijayanarasimhan, Grauman, NIPS 2010]
We propose a novel hashing approach to identify the most uncertain examples in sub-linear time: the unlabeled data are indexed in a hash table, the current classifier is hashed as a query, and the colliding examples become the actively selected candidates.
For 4.5 million unlabeled instances, this takes 10 minutes of machine time per iteration, vs. 60 hours for a naïve scan.
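The flavor of the hashing step can be conveyed with a toy sketch: index the unlabeled pool once with paired random projections, then hash the current hyperplane as a query so that points nearly perpendicular to it (i.e., close to the decision boundary) tend to collide with it. The bit counts, the single hash table, and the random data below are simplifications and assumptions, not the paper's exact construction.

```python
# Toy sketch in the spirit of hyperplane hashing for active selection:
# database points and the classifier's hyperplane normal are hashed so that
# points near the decision boundary land in the query's bucket, avoiding a
# full scan. Parameters and the single-table setup are assumptions.
import numpy as np
from collections import defaultdict

class HyperplaneHash:
    def __init__(self, dim, n_bits=6, seed=0):
        rng = np.random.default_rng(seed)
        # Each 2-bit hash uses an independent pair of random Gaussian directions.
        self.U = rng.standard_normal((n_bits, dim))
        self.V = rng.standard_normal((n_bits, dim))

    def hash_point(self, x):
        bits = np.concatenate([(self.U @ x) >= 0, (self.V @ x) >= 0])
        return tuple(bits.astype(int))

    def hash_hyperplane(self, w):
        # The query flips the sign on the second half, so points nearly
        # perpendicular to w collide with the query most often.
        bits = np.concatenate([(self.U @ w) >= 0, (self.V @ (-w)) >= 0])
        return tuple(bits.astype(int))

# Index the unlabeled pool once (placeholder random data).
dim, n = 32, 20000
X_unlabeled = np.random.randn(n, dim)
hasher = HyperplaneHash(dim)
table = defaultdict(list)
for i, x in enumerate(X_unlabeled):
    table[hasher.hash_point(x)].append(i)

# Each active-learning round probes with the current hyperplane normal w and
# ranks only the colliding candidates by |w . x| (uncertainty).
w = np.random.randn(dim)
candidates = table[hasher.hash_hyperplane(w)]
best = min(candidates, key=lambda i: abs(w @ X_unlabeled[i])) if candidates else None
```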
16
Live active learning results on Flickr test set
Outperforms the status quo data collection approach. [Vijayanarasimhan & Grauman, CVPR 2011]
17
Summary
- Humans are not simply "label machines"
- Widen access to visual knowledge: new forms of input, often requiring associated new learning algorithms
- Manage large-scale annotation efficiently: cost-sensitive active question asking
- Live learning: moving beyond canned datasets