Labeling Images for FUN!!! Yan Cao, Chris Hinrichs

How do you improve learning systems? Get more processing power (faster computers, more memory, more parallelism). Find a more sophisticated algorithm. Or get lots and lots of quality data.

Why Manually Label Images? It is a job that's easy for humans but challenging for computer vision. Why? To acquire ground truth:
– Segmentation, i.e. extracting objects from an image, is hard
– Objects appear in multiple poses and views
– Depth ordering: deciding which object is in front when two overlap
– Relationships between objects and their parts, e.g. a face and its eyes, a car and its wheels

The general idea for making computers do the labeling is supervised learning. It requires enough training data (images with manually pre-assigned labels) and classifiers that are trained on that data and then used to label query images. If we also want to segment the query images, the training images must include boundary information for the objects they contain. A minimal sketch of this pipeline appears below.
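
The sketch assumes scikit-learn; load_labeled_images is a hypothetical stand-in for features extracted from real, manually labeled images, not part of any system described here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def load_labeled_images():
    # Hypothetical helper: in practice the features come from real images
    # with manually pre-assigned labels (e.g., data gathered by ESP).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))    # one feature vector per image
    y = rng.integers(0, 3, size=200)  # one label per image
    return X, y

X, y = load_labeled_images()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC().fit(X_train, y_train)  # train a classifier on labeled images
print("accuracy on held-out images:", clf.score(X_test, y_test))
```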

Who is willing to volunteer? Manually labeling numerous images is a tedious job. What motivates people to do something anyway?
– Money: you know you will be paid
– Fun: you enjoy doing it
– Respect: recognition from others

ESP – an image labeling game. Rules:
– The server randomly assigns you a partner (who could be a bot)
– Both partners see the same image on their screens
– When the labels typed by the two partners match, both score points and move on to the next image
– Some images carry taboo words that cannot be used as labels

ESP – an image labeling game. Rules (continued):
– Partners strive to agree on as many images as they can in 2.5 minutes
– Partners can skip an image when both click the "Pass" button
– The more images the partners agree on, the higher their final score

ESP – an image labeling game

Taboo Words. These are gained from gameplay:
– When an image is shown for the first time in ESP, it has no taboo words
– When the image is used again, a label agreed on in earlier rounds becomes a taboo word
There are at most 6 taboo words per image. Taboo words guarantee that each image accumulates many different labels.

Good Label Threshold. This is the threshold for adding a label to an image's taboo list. If the threshold is 1, a label becomes taboo as soon as one pair of partners agrees on it. If the threshold is 10, a label becomes taboo once 10 pairs of partners have agreed on it for that image. A sketch of this bookkeeping follows.
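
A sketch of the threshold logic, assuming a simple in-memory store; the names (THRESHOLD, TABOO_LIMIT, record_agreement) are illustrative, with only the threshold rule and the six-word cap taken from the slides above.

```python
from collections import defaultdict

THRESHOLD = 10   # agreements needed before a label becomes taboo
TABOO_LIMIT = 6  # at most 6 taboo words per image

agreements = defaultdict(int)  # (image_id, label) -> number of agreements
taboo = defaultdict(set)       # image_id -> set of taboo words

def record_agreement(image_id, label):
    """Called whenever a pair of partners agrees on `label` for `image_id`."""
    agreements[(image_id, label)] += 1
    if (agreements[(image_id, label)] >= THRESHOLD
            and len(taboo[image_id]) < TABOO_LIMIT):
        taboo[image_id].add(label)  # future pairs may no longer use this label

for _ in range(10):
    record_agreement("img42", "car")
print(taboo["img42"])  # {'car'} once the threshold is reached
```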

Image Source. Images are randomly selected from the Web using a small set of filters, via "Random Bounce Me", which returns random images from Google's index. An image qualifies if it is:
– Large enough (>20 pixels on either dimension)
– Of aspect ratio between 1/4.5 and 4.5
– Not a blank or single-color image
These filters are sketched below.
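
A sketch of those filters using Pillow; the size, aspect-ratio, and blank-image checks follow the slide, but the single-color test here is just one plausible implementation.

```python
from PIL import Image

def passes_filters(path):
    img = Image.open(path)
    w, h = img.size
    if w <= 20 or h <= 20:                  # must be large enough
        return False
    if not (1 / 4.5 <= w / h <= 4.5):       # aspect ratio in (1/4.5, 4.5)
        return False
    lo, hi = img.convert("L").getextrema()  # (min, max) gray levels
    if lo == hi:                            # blank / single-color image
        return False
    return True
```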

Evaluation. Are the labels relevant to the images? Run searches over the labeled images in the ESP database. Are the players motivated by the game? Compute statistics over the user logs. What is the labeling rate? Count how many images are labeled within a given time period.

Accuracy of Labels. 20 images were randomly selected from ESP, and 15 participants were asked to label each with 6 labels, given no information about the taboo words. When the participants' labels were compared with those obtained from the game, 83% of them matched. For every image, the 3 most common words entered by the participants were among the ESP labels.

Example: some images labeled with “car”

Is it fun? Over 80% of users played the game on multiple dates. In 4 months, 33 players each spent more than 50 hours playing.

Labeling Rate. If 5,000 users were online around the clock (an easy figure for online games to reach), all of the images in Google's index (425,000,000) would be labeled within a month! A back-of-the-envelope check appears below.
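
A back-of-the-envelope check of that claim; the per-pair labeling rate below is an assumption for illustration, not a figure from the slides.

```python
players = 5000
pairs = players // 2
images_per_pair_per_minute = 4      # assumed throughput per pair
minutes_per_month = 60 * 24 * 30

labeled = pairs * images_per_pair_per_minute * minutes_per_month
print(f"{labeled:,} images/month")  # 432,000,000 -- on the order of the
                                    # 425,000,000 images cited above
```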

More than Labeling. What if the players could tell us more about the images, such as where the objects are located? Peekaboom is a game that is fun to play and, at the same time, collects information beyond labels.

Peekaboom

Rules of Peekaboom. Pairs of partners are randomly arranged by the server. One partner sees the whole image and its label (the Boom side); the other sees a blank screen with an input box at the bottom (the Peek side). The Boom partner clicks on the image, and each click reveals a 20-pixel-radius area to the Peek partner, as sketched below.
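
A sketch of that reveal mechanic using NumPy: each Boom click exposes a disc of 20-pixel radius on the Peek side. Purely illustrative, not Peekaboom's actual code.

```python
import numpy as np

def reveal(mask, cx, cy, radius=20):
    """Mark mask pixels within `radius` of the click (cx, cy) as visible."""
    h, w = mask.shape
    ys, xs = np.ogrid[:h, :w]
    mask[(xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2] = True
    return mask

visible = np.zeros((240, 320), dtype=bool)  # the Peek side starts blank
reveal(visible, 160, 120)                   # one Boom click
print(visible.sum(), "pixels revealed")     # ~ pi * 20**2, about 1257
```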

Rules of Peekaboom (continued). Based on the revealed parts, the Peek partner types labels until one matches the label shown on the Boom side. The Boom partner can give hints to help the Peek partner find the right label:
– "Ping" the key parts of the image
– Indicate how the word is related to the image

Hints given by the boom partner

Rules of Peekaboom (continued). The partners alternate between the Peek and Boom roles. For an image with a hard-to-guess label, the partners can choose to pass. The more images they correctly label in 2.5 minutes, the higher their score. To make the game more fun, bonus rounds are added and users are ranked by their scores.

Information collected by Peekaboom:
– How the word relates to the image (from hints)
– Which pixels are necessary to guess the word
– Which pixels lie inside the object, animal, or person (from pings)
– The most salient aspects of the objects in the image (from the sequence of clicks)
– Poor image-word pairs to eliminate (from passing)

Applications based on the information. Improving image-search results: images in which the word refers to a higher fraction of the total pixels should be ranked higher. Deriving bounding boxes of objects.

Applications based on the information Using Ping data for pointing

Evaluation. Do people have fun? More than 90% of players played multiple times on different days, and the players on the "Top Scores" list all played over 53 hours. Accuracy of the collected data: bounding boxes from participants and from Peekaboom were compared by overlap percentage, and pings from participants vs. Peekaboom matched with 100% accuracy!

LabelMe. Russell et al., MIT CSAIL.

Improving on image captions. Many image databases are available that provide a caption for every image saying what it contains. LabelMe instead lets users draw their own boundaries around objects and label them directly. LabelMe's authors claim the pictures are taken from a wide variety of places (they seem to be mostly street scenes and other travel photos, plus a few interiors of houses).

How do you participate? Just go to the site's URL. You are given an image, which may or may not already have boundaries drawn on it. If you see an object you can identify, draw a boundary around it; when you close the polygon, you are asked for a label. There are no rules on how to choose labels or on how closely to draw the boundaries. The authors trust your judgment, but more importantly, the results reflect people's differing ideas.

How good are the bounding boxes? It varies.

More general results: the 25th, 50th, and 75th percentile images by polygon count for some common object types.

We can learn something about the way people take pictures from the distribution of where objects are located. Generally, people are standing when they take pictures.

What do the average objects look like?

Tying it in with WordNet. Some labels are synonyms or near-synonyms: man/woman, person, pedestrian; car, automobile, cab, SUV. Each label is looked up in WordNet. The authors report that 93% of labels found a matching WordNet entry, though some manual word-sense disambiguation had to be done. This allows queries to match at various levels of specificity in the WordNet tree and supports more general queries. The lookup step is sketched below.
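
A sketch of the WordNet lookup using NLTK's wordnet corpus (requires a one-time nltk.download('wordnet')); taking the first synset is a simplification, since the authors did some disambiguation by hand.

```python
from nltk.corpus import wordnet as wn

for label in ["car", "automobile", "pedestrian"]:
    synsets = wn.synsets(label, pos=wn.NOUN)
    if synsets:
        s = synsets[0]  # first sense; manual disambiguation may be needed
        print(label, "->", s.name(),
              "| hypernyms:", [h.name() for h in s.hypernyms()])
# "car" and "automobile" map to the same synset, so queries for either
# can match labels written with the other.
```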

Some general queries & results, using WordNet

Dealing with occlusion: simple rules. If an object's polygon is completely contained in another, it is inside it. Of two overlapping polygons, the one with more control points in the overlapping region is probably on top. One could also use features like color histograms to match the overlapping region to one region or the other, but that is expensive, complicated, and doesn't work as well. The two simple rules are sketched below.
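
A sketch of the two rules using shapely polygons; an illustration of the logic above, not the authors' implementation.

```python
from shapely.geometry import Point, Polygon

def relation(poly_a, pts_a, poly_b, pts_b):
    """Judge the depth relation between two labeled polygons.

    pts_a / pts_b are the user-drawn control points of each polygon.
    """
    if poly_b.contains(poly_a):  # completely contained => A is inside B
        return "A inside B"
    if poly_a.contains(poly_b):
        return "B inside A"
    overlap = poly_a.intersection(poly_b)
    if overlap.is_empty:
        return "disjoint"
    # More control points in the overlapping region => probably on top.
    a_in = sum(overlap.contains(Point(p)) for p in pts_a)
    b_in = sum(overlap.contains(Point(p)) for p in pts_b)
    return "A on top" if a_in >= b_in else "B on top"
```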

Depth ordering results

Image search reranking. Segment the query image, extract features, compare them with the features of regions labeled with the search terms, and reorder the results by the strength of the correlation.

80 Million Tiny Images. Torralba et al.

Shrinking images. How much information does an image need to contain for us to identify its contents? Why not ask humans before asking computers? Torralba et al. looked for the minimum resolution humans need in order to identify the contents of an image.

Can you tell what these are?

Note that for color images, human accuracy levels off at 32x32; for grayscale, the same happens at 64x64. Humans did much better at 32x32 resolution than the best recognition algorithms did at full resolution. A 32x32 color image has 32x32x3 = 3072 dimensions, and a 64x64 grayscale image has 4096, with very nearly the same accuracy, so roughly 3000 dimensions suffice for recognition.

Next: acquire a huge number of images. Where do you start? Even at reduced resolution, there are far too many images out there to get them all. Start with WordNet: for each of the 75,062 concrete nouns in WordNet, run an image-retrieval search on many image search engines; they used Google, Cydral, AltaVista, Flickr, Picsearch, and Webshots. Then eliminate duplicates and solid-color images. About 10% of the words were rare and had no matching images.

Finding nearest neighbors. A distance metric is needed to compare the tiny images. They examine three: SSD (sum of squared differences), Warp, and Shift.

SSD. Plain SSD sums the squared differences over all dimensions. Computing the distance between all pairs this way is too expensive, so they used only the top 19 principal components, and their experiments showed that this works reliably. The idea is sketched below.
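
A sketch of nearest-neighbor search over the top 19 principal components, with scikit-learn standing in for whatever the authors used; ranking by Euclidean distance is equivalent to ranking by SSD.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Stand-in data: 10,000 flattened 32x32 color images.
tiny = np.random.rand(10000, 32 * 32 * 3)

proj = PCA(n_components=19).fit_transform(tiny)  # top 19 principal components
nn = NearestNeighbors(n_neighbors=50).fit(proj)  # Euclidean rank == SSD rank

dists, idx = nn.kneighbors(proj[:1])  # neighbors of the first image
```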

Warp & Shift Warp: Just warp the image in some simple way, like flipping, scaling or translating, and see if that improves the SSD. Shift: Allow each pixel to shift in a 5x5 window, and take the best SSD from that. (Crude approximation of general warping.)

Effect of DB size. As the database grows, the quality of the nearest neighbors noticeably improves, even up to ~100,000,000 images.

Applications:
– Object recognition
– Image-retrieval reranking
– Person detection & localization
– Image colorization

Recognition. Recognition is done by finding a query's neighbors and retrieving the WordNet entry for each. Each neighbor corresponds to a unique leaf node in the WordNet tree and casts a single "vote". The votes are unified over the tree, weighting each internal node by how many voted branches pass through it; classification proceeds by following the link to the highest-voted child node. Vote propagation is sketched below.
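
A sketch of the vote propagation; the parent map here is a toy stand-in for the WordNet hypernym tree.

```python
from collections import Counter

# Toy hypernym tree: child -> parent (the real one comes from WordNet).
parent = {"tabby": "cat", "cat": "animal", "collie": "dog",
          "dog": "animal", "animal": "entity"}

def accumulate(leaf_votes):
    """Propagate each leaf's votes up to the root, so internal nodes are
    weighted by how many voted branches pass through them."""
    totals = Counter()
    for leaf, v in leaf_votes.items():
        node = leaf
        while node is not None:
            totals[node] += v
            node = parent.get(node)
    return totals

votes = accumulate({"tabby": 12, "collie": 5})
print(votes)  # "animal" and "entity" accumulate 17 votes each
```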

Image search reranking. Run an image search on, say, "person" on any retrieval engine. Then, for each returned image, measure the correlation between the search term and that image's neighbor set, and rerank the results by the strength of that correlation.

Person detection. Shown are images matched to the WordNet node "person" together with their nearest neighbors. Note that the neighbors match the part of the person visible in the query image, as well as the poses and clothing colors. Here the system only reports whether the best match passes through the "person" internal node. The Internet has a large bias toward images containing people, so this method will not carry over equally well to things that are not people.

Person localization. Given a portion of an image, we can find its neighbors and measure the correlation with "person" in that set. Extending this, we can find the portion of a query image whose neighbor set has the highest correlation with "person"; that region is very likely to contain a person.

Colorization. Given a grayscale query image, find its neighbor set, take the average color of the set, and apply that coloring to the grayscale image. Surprisingly, this works, especially given that the neighbor images are not even all of the same type of object! A sketch of the idea follows.
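
A sketch of the idea: average the neighbors' colors, then rescale that average so its brightness follows the grayscale query. This simplification works in plain RGB rather than a proper luminance/chrominance space, and assumes float images in [0, 1].

```python
import numpy as np

def colorize(gray, neighbors):
    """gray: (H, W) array in [0, 1]; neighbors: list of (H, W, 3) images."""
    mean_color = np.mean(neighbors, axis=0)        # average neighbor color
    luma = mean_color.mean(axis=2, keepdims=True)  # its per-pixel brightness
    # Rescale the average color so its brightness matches the query image.
    return np.clip(mean_color * gray[..., None] / (luma + 1e-6), 0, 1)
```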