Obtaining Data for Face Recognition from the web. By Tal Blum. Advisor: Henry Schneiderman.



Sample Images

Overview
- System Purpose
- Collecting Data methods
- System Structure
- Problems
- Numbers & Statistics

System Purpose
Collecting face images from the WWW for:
- Data for face recognition purposes
- A system to which people can submit images and which tells them which celebrities they most resemble
- Goal: to collect images of 1000 people, with at least 50 images for each

Collection Vs. Web Collecting
- Cost
- Data size
- Aging
- Controlled Setting
- Limited backgrounds, poses, lightings, etc.
- Duplicates
- Metadata
- Alignment
- Tagging Errors
- Authorization

System Overview
[Pipeline diagram]
Stages: Names Extraction; Cleaning / Refinement / remove duplicates; Spidering; Download; remove duplicates, remove faceless; Manual Tagging
Data labels: html text; Names Files; Names Files; URLs; Images; Face images

Names Extraction
- Sources:
  - Web Directories
  - Types: Actors, Politicians, Sports players, singers …
  - Informedia project
- Extract names from html
- Result: Names Files
- Cleaning
  - Duplicates Removed
  - Refinement
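The slide does not say how names were pulled out of the HTML. As a minimal sketch only, assuming that names appear as link text on a directory page, the Python below collects link texts and keeps those that look like names. The class `LinkTextParser`, the function `extract_names`, and the "2 to 4 capitalized words" filter are illustrative assumptions, not the authors' method.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._parts = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._parts = []

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._parts).split())
            if text:
                self.link_texts.append(text)


def extract_names(html):
    """Return de-duplicated link texts that look like personal names."""
    parser = LinkTextParser()
    parser.feed(html)
    seen, names = set(), []
    for text in parser.link_texts:
        words = text.split()
        # Assumed filter: 2 to 4 capitalized words (the slides state no rule).
        if 2 <= len(words) <= 4 and all(w[0].isupper() for w in words):
            key = text.lower()
            if key not in seen:
                seen.add(key)
                names.append(text)
    return names
```

A filter this crude lets through non-names such as "Contact Us", which is one reason a later cleaning and refinement pass is still needed.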

Spidering
- 5 different image search engines:
  - Altavista, Yahoo-news, Yahoo, Picsearch, Alltheweb
  - Different interfaces
  - Different result quality
  - Limited availability
- Query refinement
  - Quoted names
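A quoted-name query against several engines with different interfaces might be built roughly as follows. This is a sketch under stated assumptions: the endpoints and parameter names in `ENGINES` are hypothetical placeholders, since the slides do not give the real engines' URL formats.

```python
from urllib.parse import urlencode

# Hypothetical endpoints and parameter names, one entry per engine.
# The real engines' interfaces are not described in the slides.
ENGINES = {
    "engine_a": ("https://images.example-a.test/search", "q"),
    "engine_b": ("https://images.example-b.test/find", "query"),
}


def build_queries(name):
    """Return {engine: url} with the name wrapped in double quotes."""
    # Normalize whitespace, then quote so the engine treats it as a phrase.
    quoted = '"{}"'.format(" ".join(name.split()))
    return {
        engine: base + "?" + urlencode({param: quoted})
        for engine, (base, param) in ENGINES.items()
    }
```

Keeping each engine's endpoint and parameter name in one table isolates the per-engine interface differences from the rest of the spider.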

Downloading
- Gets the URLs and downloads them
- Only about 2/3 of the URLs were downloaded
- Works in the background
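Because a sizeable share of URLs fail, a downloader has to record failures rather than stop on them. The following is a minimal sketch, assuming a thread pool for background work; `fetch`, `download_all`, and the `workers` parameter are illustrative names, not part of the original system.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its bytes (raises on failure)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_fn=fetch, workers=8):
    """Return ({url: bytes}, [failed urls]); failures are recorded, not raised."""
    ok, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything first so downloads run concurrently.
        futures = {url: pool.submit(fetch_fn, url) for url in urls}
        for url, future in futures.items():
            try:
                ok[url] = future.result()
            except Exception:
                # Dead links, timeouts, etc. are expected for many URLs.
                failed.append(url)
    return ok, failed
```

Passing `fetch_fn` as a parameter lets the failure handling be exercised with a stub instead of real network access.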

remove duplicates, remove faceless
- Uses simple heuristics to compare files
- Uses Schneiderman's face detection algorithm to find faces in the images
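The slides do not specify which heuristics were used to compare files. One simple possibility, shown here purely as an assumed stand-in, is to treat two files as duplicates when their size and content hash match. The function name `remove_duplicate_files` is hypothetical, and the face-detection step is not sketched because it relies on Schneiderman's detector.

```python
import hashlib


def remove_duplicate_files(files):
    """Keep the first file name for each distinct content.

    files: iterable of (name, data) pairs, where data is bytes.
    """
    seen, kept = set(), []
    for name, data in files:
        # Assumed heuristic: identical size and SHA-256 digest means duplicate.
        key = (len(data), hashlib.sha256(data).hexdigest())
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept
```

An exact-match check like this misses the same picture saved at a different size or compression level, so it only catches byte-identical copies.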

Manual Tagging
- Decide who the person by that name is
- Choose between several people in the image
- Add additional metadata such as age, race, gender …
- Problems: unrelated images & multiple people by the same name
- Possible classification errors
- Go over millions of images

Manual Tagging

Manual Tagging – Face extraction

Problems - Name Duplicates
- Example:
  - George Bush,
  - President George Bush,
  - George W. Bush
- Another example:
  - Wham (a band)
  - George Michael

Problems - Name Duplicates
- Solution: Detect duplicates on 3 levels
  - Names – automatic, manual
  - URLs
  - By Recognition errors
- Approaches
  - Semi-automatic
  - Fully-automatic
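For the automatic, name-level check, one possible approach is to reduce each name to a normalized key and group names that share a key. This is a sketch under assumptions: the `TITLES` list and the rule of dropping single-letter initials are illustrative choices, not taken from the slides.

```python
# Assumed list of honorifics to ignore; the slides do not give one.
TITLES = {"president", "senator", "sir", "dr", "mr", "mrs", "ms"}


def name_key(name):
    """Lower-case the name, drop titles and single-letter initials."""
    words = [w.strip(".,").lower() for w in name.split()]
    words = [w for w in words if len(w) > 1 and w not in TITLES]
    return " ".join(words)


def group_name_variants(names):
    """Return the groups of names that share the same normalized key."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A key like this would group "George Bush", "President George Bush" and "George W. Bush", but it can also merge two different people with similar names, and it cannot link a band such as Wham to George Michael. Cases like those are left to the manual, URL-level and recognition-level checks.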

Numbers & Statistics
- We collected people names
- For each we spidered up to 1000 URLs
- On average only 1/3 of the URLs reach the manual stage.
- So far we have run the system on 9500 people
- Total # of URLs: 1,500,000
- 1,000,000 image files totaling 60GB.
- An average of 157 URLs per person, or 182 per person not including people with no URLs
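The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated above. The download rate, average file size and count of people with at least one URL are derived values, not figures from the slides, and the file size assumes decimal gigabytes.

```python
people = 9500
urls = 1_500_000
images = 1_000_000
total_bytes = 60 * 10**9  # assumption: 60GB read as decimal gigabytes

# Stated on the slide as 157 (the exact quotient is about 157.9).
urls_per_person = urls / people

# Derived: share of URLs that became image files, about 2/3.
download_rate = images / urls

# Derived: average image size in kilobytes.
avg_image_kb = total_bytes / images / 1000

# Derived: people with at least one URL, implied by the 182 average.
people_with_urls = urls / 182

print(int(urls_per_person), round(download_rate, 2),
      avg_image_kb, round(people_with_urls))
```

The derived download rate of roughly two thirds agrees with the downloading slide, and the 182 average implies that a bit over 1,200 of the 9500 people returned no URLs at all.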

More Information
- Contacts: Tal Blum, Henry Schneiderman
- Acknowledgement to David Fields

THE END