Published by Joan Gallagher. Modified over 9 years ago.
1
Obtaining Data for Face Recognition from the Web
By Tal Blum
Advisor: Henry Schneiderman
2
Sample Images
3
Overview
- System Purpose
- Collecting Data methods
- System Structure
- Problems
- Numbers & Statistics
4
System Purpose
Collecting face images from the WWW for:
- Data for face recognition purposes
- A system that people can submit images to, which tells them which celebrities they most resemble
Goal: to collect images of 1000 people, with at least 50 images for each
5
Collection vs. Web Collecting
- Cost
- Data size
- Aging
- Controlled setting
- Limited backgrounds, poses, lightings, etc.
- Duplicates
- Metadata
- Alignment
- Tagging errors
- Authorization
6
System Overview
[Pipeline diagram]
Stages: Names Extraction; Cleaning/Refinement/Remove duplicates; Spidering; Download; Remove duplicates, remove faceless; Manual Tagging
Data passed between stages: HTML text; Names Files; Names Files; URLs; Images; Face images
7
Names Extraction
- Sources:
  - Web Directories
  - Types: Actors, Politicians, Sports players, Singers …
  - Infomedia project
- Extract names from HTML
- Result: Names Files
- Cleaning
  - Duplicates removed
  - Refinement
8
Spidering
- 5 different image search engines: Altavista, Yahoo-news, Yahoo, Picsearch, Alltheweb
- Different interfaces
- Different result quality
- Limited availability
- Query refinement
- Quoted names
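The "quoted names" refinement can be sketched as below: wrapping the name in double quotes asks a search engine for the exact phrase rather than the individual words. This is only an illustration; the function name `build_query_url`, the base URL and the `q` parameter are hypothetical and do not describe the interface of any of the engines listed.

```python
from urllib.parse import urlencode

def build_query_url(base_url: str, name: str, param: str = "q") -> str:
    """Return a search URL whose query is the person's name as a quoted phrase."""
    # Collapse repeated whitespace, then wrap the name in double quotes
    # so the engine treats it as an exact phrase.
    phrase = '"' + " ".join(name.split()) + '"'
    # urlencode percent-encodes the quotes and turns spaces into '+'.
    return base_url + "?" + urlencode({param: phrase})

# Hypothetical engine endpoint, for illustration only.
print(build_query_url("http://search.example/images", "George  Michael"))
```

Since each engine has a different interface, a real spider would need one such URL builder (and one result parser) per engine.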
9
Downloading
- Gets the URLs and downloads them
- Only about 2/3 of the URLs were downloaded
- Works in the background
http://news.bbc.co.uk/media/images/38378000/jpg/_38941_bushap150.jpg
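A minimal sketch of a download step that tolerates failures, since only about 2/3 of the URLs could be fetched. The slides do not say how the background work was done; the thread pool, the function names `fetch` and `download_all`, and the timeout are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def download_all(urls, fetcher=fetch, workers=8):
    """Fetch URLs concurrently; return (downloaded, failed).

    downloaded maps url -> bytes; failed lists the URLs that raised an error.
    """
    def attempt(url):
        try:
            return url, fetcher(url)
        except Exception:
            # Dead links, timeouts, etc. are recorded rather than aborting the run.
            return url, None

    downloaded, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the URLs.
        for url, data in pool.map(attempt, urls):
            if data is None:
                failed.append(url)
            else:
                downloaded[url] = data
    return downloaded, failed
```

Passing the fetch function as a parameter keeps the sketch testable without network access.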
10
Remove duplicates, remove faceless
- Remove duplicates: uses simple heuristics to compare files
- Remove faceless: uses Schneiderman's face detection algorithm to find faces in the images
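The slides do not specify which heuristics were used to compare files. One simple possibility, shown here purely as an assumption, is to treat files as duplicates when they have the same size and the same content hash. The function name `find_duplicates` is hypothetical.

```python
import hashlib
from pathlib import Path

def find_duplicates(paths):
    """Group byte-identical files; return a list of groups with 2+ paths each."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = {}
    for p in map(Path, paths):
        by_size.setdefault(p.stat().st_size, []).append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Second pass: hash only the files that share a size.
        by_hash = {}
        for p in same_size:
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

This only catches exact copies; the same photo resized or recompressed by another site would not be detected by this sketch.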
11
Manual Tagging
- Decide who is the person by that name
- Choose between several people in the image
- Add additional metadata such as age, race, gender …
- Problems: unrelated images & multiple people by the same name
- Possible classification errors
- Go over millions of images
12
Manual Tagging
13
Manual Tagging – Face extraction
14
Problems - Name Duplicates
- Example: George Bush, President George Bush, George W. Bush
- Another example: Wham (a band), George Michael
15
Problems - Name Duplicates
- Solution: detect duplicates on 3 levels
  - Names: automatic, manual
  - URLs
  - By recognition errors
- Approaches
  - Semi-automatic
  - Fully-automatic
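As an illustration of the automatic name-level check, the sketch below groups names that become identical after lowercasing, stripping punctuation and dropping titles. The title list and the function names are hypothetical, and this is not claimed to be the method the system used.

```python
import re
from collections import defaultdict

# Hypothetical list of titles to ignore when comparing names.
TITLES = {"president", "senator", "sir", "dr", "mr", "mrs", "ms"}

def name_key(name: str) -> str:
    """Normalize a name: lowercase, keep only a-z words, drop titles."""
    # Note: this simple pattern discards non-ASCII letters.
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(w for w in words if w not in TITLES)

def group_names(names):
    """Map each normalized key to the original names that share it."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return dict(groups)

groups = group_names(["George Bush", "President George Bush", "George W. Bush"])
for key, members in groups.items():
    print(key, "->", members)
```

Here "George Bush" and "President George Bush" fall into one group, while "George W. Bush" keeps its own key. Deciding whether such groups refer to the same person, or linking cases like Wham and George Michael, is left to the manual step.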
16
Numbers & Statistics
- We collected 36,000 people's names
- For each, we spidered up to 1000 URLs
- On average, only 1/3 of the URLs reach the manual stage
- So far we have run the system on 9500 people
- Total number of URLs: 1,500,000
- 1,000,000 image files, totaling 60 GB
- An average of 157 URLs per person, or 182 per person when excluding people with no URLs
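The figures above can be cross-checked with a little arithmetic. The input numbers come from the slide; the derived values (people with at least one URL, average image size) are inferences that assume the 182 average covers the same 1,500,000 URLs and that GB means decimal gigabytes.

```python
total_urls = 1_500_000
people_run = 9_500
image_files = 1_000_000
total_bytes = 60 * 10**9          # assuming decimal gigabytes

# Average URLs per person over everyone the system was run on.
urls_per_person = total_urls / people_run

# Implied number of people with at least one URL, from the 182 average.
people_with_urls = total_urls / 182

# Share of URLs that ended up as image files.
download_fraction = image_files / total_urls

# Average size of one image file in kilobytes.
avg_image_kb = total_bytes / image_files / 1000

print(int(urls_per_person))        # 157
print(round(people_with_urls))     # 8242
print(round(download_fraction, 2)) # 0.67
print(avg_image_kb)                # 60.0
```

The 0.67 download share agrees with the earlier statement that only about 2/3 of the URLs were downloaded.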
17
More Information
Contacts:
- Tal Blum, tblum@cmu.edu
- Henry Schneiderman, hws@cs.cmu.edu
Acknowledgement to David Fields
18
THE END