Tamara Berg Retrieval 790-133 Language and Vision.

Presentation transcript:

Tamara Berg Retrieval Language and Vision

How big is the web? The first Google index in 1998 already had 26 million pages. By 2000 the Google index reached the one billion mark. July 25, 2008 – Google announced that search had discovered one trillion unique URLs.

Slide from Takis Metaxas

How hard is it to go from one page to another? Over 75% of the time there is no directed path from one random web page to another. When a directed path exists its average length is 16 clicks. When an undirected path exists its average length is 7 clicks. Short average path between pairs of nodes is characteristic of a small-world network (“six degrees of separation”, Stanley Milgram). Kleinberg: The small-world phenomenon

Information Retrieval Information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web Wikipedia

Slide from Takis Metaxas

The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page

“The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page. Google – a misspelling of googol.

The Google Search Engine Founded 1998 (1996) by two Stanford students Originally academic / research project that later became a commercial tool Distinguishing features (then!?): - Special (and better) ranking - Speed - Size Slide from Jeff Dean

The web in 1997 Internet was growing very quickly “Junk results often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results).” Need high precision in the top results

Google’s first search engine

Components of Web Search Service Components: Web crawler, Indexing system, Search system, Advertising system. Considerations: Economics, Scalability, Legal issues. Slide from William Y. Arms

Web Searching: Architecture Documents stored on many Web servers are indexed in a single central index. The central index is implemented as a single system on a very large number of computers. Examples: Google, Yahoo! Diagram labels: Web servers with Web pages; Crawl; Web pages retrieved by crawler; Build index; Index to all Web pages; Search. Slide from William Y. Arms

What is a Web Crawler? Web Crawler A program for downloading web pages. Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set. A focused web crawler downloads only those pages whose content satisfies some criterion. Also known as a web spider Slide from William Y. Arms

Simple Web Crawler Algorithm Basic Algorithm: Let S be the set of URLs to pages waiting to be indexed. Initially S is a set of known seeds. Take an element u of S and retrieve the page, p, that it references. Parse the page p and extract the set of URLs L it has links to. Update S = S + L - u. Repeat as many times as necessary. [Large production crawlers may run continuously] Slide from William Y. Arms
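The basic crawler loop can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the web is a hypothetical in-memory dictionary (TOY_WEB) instead of real HTTP fetches, and a "seen" set is added (it is not on the slide) so that URLs already queued or visited are not added to S again.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs it links to.
# A real crawler would fetch and parse pages over HTTP instead.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

def crawl(seeds, get_links, max_pages=1000):
    """Return the URLs visited, in visiting order."""
    frontier = deque(seeds)   # S: URLs waiting to be indexed
    seen = set(seeds)         # assumption: remember URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        u = frontier.popleft()        # take an element u of S
        visited.append(u)             # "retrieve" the page p that u references
        for link in get_links(u):     # L: the URLs that p links to
            if link not in seen:      # S = S + L - u, skipping known URLs
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a"], lambda u: TOY_WEB.get(u, []))
print(order)
```

Using a queue gives breadth-first order; a focused crawler would instead filter or prioritize the frontier by some content criterion.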

Indexing the Web Goals: Precision. Short queries applied to very large numbers of items lead to large numbers of hits. The goal is that the first hits presented should satisfy the user's information need, which requires ranking hits in an order that fits the user's requirements. Recall is not an important criterion. Completeness of index is not an important factor. Comprehensive crawling is unnecessary. Slide from William Y. Arms
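Precision in the first hits is commonly measured as precision at k: the fraction of the top k results that are relevant. The sketch below uses that standard definition; the document IDs and relevance judgments are made up for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked results that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

# Hypothetical ranking returned for a query, and the set judged relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(precision_at_k(ranked, relevant, 2))
print(precision_at_k(ranked, relevant, 4))
```

Recall would divide by the number of relevant documents instead, which is why it depends on how complete the index is.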

Concept of Relevance and Importance Document measures Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document. Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity. Web search engines rank documents by a weighted combination of estimates of relevance and importance. Slide from William Y. Arms
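A weighted combination of relevance and importance might look like the sketch below. The mixing weight alpha, the score ranges, and the example documents are all assumptions for illustration; the slide does not say how real search engines set the weights.

```python
def rank(docs, alpha):
    """Rank documents by alpha * relevance + (1 - alpha) * importance.

    docs maps a name to (relevance, importance), both assumed scaled to [0, 1].
    alpha is a hypothetical mixing weight, not a value from the slides.
    """
    def score(name):
        relevance, importance = docs[name]
        return alpha * relevance + (1 - alpha) * importance
    return sorted(docs, key=score, reverse=True)

# Made-up estimates for three documents.
docs = {"A": (0.9, 0.1), "B": (0.6, 0.8), "C": (0.2, 0.9)}

print(rank(docs, alpha=0.5))  # importance and relevance weighted equally
print(rank(docs, alpha=1.0))  # relevance only
```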

Relevance Words in document (stored in inverted index) Location information – for use of proximity in multi-word search. In page title, page url? Visual presentation details – font size of words, words in bold.
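Word occurrences and their locations can be stored together in a positional inverted index. The following is a minimal sketch under simple assumptions (whitespace tokenization, lowercase matching, made-up documents); it shows how stored positions support a proximity check for a two-word query.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def min_distance(index, w1, w2, doc_id):
    """Smallest gap (in word positions) between w1 and w2 in one document."""
    p1 = index.get(w1, {}).get(doc_id, [])
    p2 = index.get(w2, {}).get(doc_id, [])
    if not p1 or not p2:
        return None  # at least one word is missing from the document
    return min(abs(a - b) for a in p1 for b in p2)

# Hypothetical two-document collection.
docs = {"d1": "web search engine", "d2": "search the whole web"}
index = build_index(docs)

print(index["web"])
print(min_distance(index, "web", "search", "d1"))
print(min_distance(index, "web", "search", "d2"))
```

A smaller distance suggests the words appear as a phrase, so "d1" would be preferred for the query "web search".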

Relevance: Anchor Text The source of Document A contains marked-up text for a link whose anchor text is: The Faculty of Computing and Information Science. The anchor text can be considered descriptive metadata about the linked document. Slide from William Y. Arms

Importance - PageRank Algorithm Used to estimate popularity of documents Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages. Slide from William Y. Arms

Intuitive Model (Basic Concept) Basic (no damping) A user: 1. Starts at a random page on the web 2. Selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2 a very large number of times Pages are ranked according to the relative frequency with which they are visited. Slide from William Y. Arms
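The random-surfer model can be simulated directly. The sketch below uses a hypothetical three-page graph in which every page has at least one outgoing link (the undamped model has no rule for dead ends) and a fixed random seed so the run is repeatable.

```python
import random

def random_surfer(links, steps, seed=0):
    """Simulate the undamped surfer and return visit frequencies per page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)              # 1. start at a random page
    for _ in range(steps):                # 3. repeat many times
        page = rng.choice(links[page])    # 2. follow a random hyperlink
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

# Hypothetical graph: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
freq = random_surfer(links, 200_000)
print(freq)
```

For this particular graph the long-run visit frequencies work out to 0.4, 0.2 and 0.4 for pages 0, 1 and 2, and the simulated frequencies should land close to those values.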

PageRank

Example

Basic Algorithm: Matrix Representation Slide from William Y. Arms

Basic Algorithm: Normalize by Number of Links from Page Slide from William Y. Arms

Basic Algorithm: Weighting of Pages Initially all pages have weight 1/n, so every element of w_0 is 0.17 in this example. Recalculate weights: w_1 = B w_0. If the user starts at a random page, the j-th element of w_1 is the probability of reaching page j after one step. Slide from William Y. Arms

Basic Algorithm: Iterate Iterate: w_k = B w_{k-1}. The sequence w_0, w_1, w_2, w_3, ... converges to w. At each iteration, the sum of the weights is 1. Slide from William Y. Arms
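The matrix form of the basic algorithm can be sketched as follows. The three-page graph is a made-up example, not the one on the slides, and plain Python lists stand in for a matrix library. B is built so that column i holds 1/(number of links out of page i) for each page that i links to.

```python
def link_matrix(links, n):
    """Normalized link matrix: B[j][i] = 1/outdegree(i) if page i links to j."""
    B = [[0.0] * n for _ in range(n)]
    for i, outs in links.items():
        for j in outs:
            B[j][i] = 1.0 / len(outs)
    return B

def mat_vec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def iterate(B, steps):
    """Apply w_k = B w_{k-1}, starting from uniform weights 1/n."""
    n = len(B)
    w = [1.0 / n] * n
    for _ in range(steps):
        w = mat_vec(B, w)
    return w

# Hypothetical graph: page -> pages it links to (no dangling pages).
links = {0: [1, 2], 1: [2], 2: [0]}
B = link_matrix(links, 3)
w = iterate(B, 100)
print([round(x, 4) for x in w])
```

Because every column of B sums to 1, the weights keep summing to 1 at each step. For this graph the iteration settles at weights of 0.4, 0.2 and 0.4.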

Special Cases of Hyperlinks on the Web There is no link out of {2, 3, 4} Slide from William Y. Arms

Google PageRank with Damping A user: 1. Starts at a random page on the web 2a. With probability 1-d, selects any random page and jumps to it (Teleport!) 2b. With probability d, selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2a and 2b a very large number of times Pages are ranked according to the relative frequency with which they are visited. [For dangling nodes, always follow 2a.] Slide from William Y. Arms

The PageRank Iteration The basic method iterates using the normalized link matrix, B: w_k = B w_{k-1}. The limit w is an eigenvector of B. PageRank iterates using a damping factor, d. The method iterates: w_k = (1 - d) w_0 + d B w_{k-1}, where w_0 is a vector with every element equal to 1/n. Slide from William Y. Arms

The PageRank Iteration The iteration expression with damping can be re-written. Let R be a matrix with every element equal to 1/n. Then R w_{k-1} = w_0 (the sum of the elements of w_{k-1} equals 1). Let G = dB + (1 - d)R (G is called the Google matrix). The iteration formula w_k = (1 - d) w_0 + d B w_{k-1} is equivalent to w_k = G w_{k-1}, so that w is an eigenvector of G. Slide from William Y. Arms
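The equivalence of the two forms can be checked numerically. The sketch below uses a hypothetical 3-page link matrix and an arbitrary weight vector that sums to 1; it compares G w against (1 - d) w_0 + d B w.

```python
def mat_vec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def google_matrix(B, d):
    """G = d*B + (1 - d)*R, where every element of R equals 1/n."""
    n = len(B)
    return [[d * B[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

# Hypothetical normalized link matrix (each column sums to 1).
B = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
d = 0.85
n = len(B)
w0 = [1.0 / n] * n
w = [0.2, 0.3, 0.5]          # any weight vector whose elements sum to 1

lhs = mat_vec(google_matrix(B, d), w)                        # G w
rhs = [(1 - d) * w0[i] + d * bw for i, bw in enumerate(mat_vec(B, w))]
max_diff = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_diff)
```

The identity relies on the weights summing to 1; for a vector with a different sum, R w would no longer equal w_0 and the two forms would differ.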

Iterate with Damping Iterate: w_k = G w_{k-1} (d = 0.7), starting from w_0 with every element equal to 0.17. The sequence w_0, w_1, w_2, w_3, ... converges to w. Slide from William Y. Arms

Choice of d Conceptually, values of d that are close to 1 are desirable as they emphasize the link structure of the Web graph, but... The rate of convergence of the iteration decreases as d approaches 1. The sensitivity of PageRank to small variations in data increases as d approaches 1. It is reported that Google uses a value of d = 0.85 and that the computation converges in about 50 iterations Slide from William Y. Arms
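The effect of d on the rate of convergence can be seen even on a tiny example. The sketch below counts damped iterations until the weights change by less than a tolerance; the graph, the tolerance, and the stopping rule (sum of absolute changes) are assumptions for illustration, and the counts it produces say nothing about the real Web graph.

```python
def iterations_to_converge(B, d, tol=1e-10, max_iter=10_000):
    """Count iterations of w_k = (1-d) w_0 + d B w_{k-1} until the change < tol."""
    n = len(B)
    w0 = [1.0 / n] * n
    w = w0[:]
    for k in range(1, max_iter + 1):
        Bw = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
        new_w = [(1 - d) * w0[i] + d * Bw[i] for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            return k
    return max_iter

# Hypothetical normalized link matrix (each column sums to 1).
B = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

counts = {d: iterations_to_converge(B, d) for d in (0.5, 0.85, 0.99)}
print(counts)
```

On this graph the iteration count grows as d moves toward 1, which matches the trade-off described above.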

Image retrieval

Types of queries 1) Text query based retrieval 2) Image query based retrieval

1) Text query retrieval

2) Image query retrieval Content based image retrieval: Analyze visual content of images – Extract features – Build visual descriptor of each image (query and database images). For a query image, match descriptors between query and database images. Return closest matches in ranked order by similarity.
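The match-and-rank step can be sketched as a nearest-neighbor search over descriptors. In this illustration the descriptors are short made-up vectors and similarity is plain Euclidean distance; both are assumptions, since the slide does not fix a feature type or distance.

```python
import math

def rank_by_similarity(query_desc, database):
    """Return database image names ordered from closest to farthest."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(database, key=lambda name: dist(query_desc, database[name]))

# Hypothetical 3-number descriptors for three database images.
database = {
    "sunset": [0.9, 0.4, 0.1],
    "forest": [0.1, 0.8, 0.2],
    "ocean":  [0.1, 0.3, 0.9],
}
query = [0.8, 0.5, 0.2]

print(rank_by_similarity(query, database))
```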

Image query retrieval Query Image

Reminder: Image Representation Represent the image as a spatial grid of average pixel colors. Convert the database of images to this representation. Represent the query image in this representation. Find images from the database that are similar to the query. Photo by: marielito
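The grid-of-average-colors representation can be computed as below. This is a sketch under simple assumptions: an image is a nested list of (r, g, b) tuples, the grid size is chosen by the caller, and the image has at least as many rows and columns as the grid.

```python
def grid_descriptor(image, rows, cols):
    """Average color of each grid cell, flattened to one list of numbers.

    image is a list of pixel rows; each pixel is an (r, g, b) tuple.
    Assumes the image has at least `rows` pixel rows and `cols` pixel columns.
    """
    h, w = len(image), len(image[0])
    desc = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows   # cell's pixel rows
            x0, x1 = c * w // cols, (c + 1) * w // cols   # cell's pixel columns
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            for channel in range(3):
                desc.append(sum(p[channel] for p in cell) / len(cell))
    return desc

# Hypothetical 2x2 image: black, red, green, blue.
image = [[(0, 0, 0), (255, 0, 0)],
         [(0, 255, 0), (0, 0, 255)]]

print(grid_descriptor(image, 1, 1))   # one cell: the average over the whole image
print(grid_descriptor(image, 2, 2))   # one cell per pixel
```

Two images can then be compared by the distance between their descriptors, provided both use the same grid size.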

Image query retrieval Query Image Database Images

Image query retrieval Query Image Ranked Results – database images ranked by similarity to query

Image query retrieval What’s easy? What’s difficult?

Image Retrieval Image relevance Image importance

Text info Idea – most images have associated text. Analyze words around picture & associated with picture (title, words, links, etc). For a query word return pictures based on standard web search on text associated with image.

Human info Just leave the content analysis/labeling to people. ESP game Luis von Ahn, Ruoran Liu and Manuel Blum

User data Watch what people click on!

Text+Image info

Image Retrieval Image relevance Image importance

PageRank For web pages – use links between two pages as a measure of their similarity. For images – use number of matching features between two images as a measure of their similarity. – Features – SIFT features (based on histograms of edges in different directions). – Two features are considered matching if their SSD distance is below a threshold.
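The idea of running PageRank on image similarity can be sketched as follows. Everything concrete here is an assumption for illustration: 2-number tuples stand in for SIFT descriptors, the SSD threshold and damping factor are arbitrary, and images with no matches are given a uniform column. It is not presented as the implementation used by any real system.

```python
def count_matches(feats_a, feats_b, threshold):
    """Number of features in A that have a feature in B with SSD below threshold."""
    def ssd(f, g):
        return sum((x - y) ** 2 for x, y in zip(f, g))
    return sum(1 for f in feats_a if any(ssd(f, g) < threshold for g in feats_b))

def image_rank(images, threshold=0.05, d=0.85, steps=200):
    """Damped PageRank where link weights are matching-feature counts."""
    n = len(images)
    S = [[0.0 if i == j else float(count_matches(images[i], images[j], threshold))
          for j in range(n)] for i in range(n)]
    # Column-normalize S so each column sums to 1 (uniform if no matches).
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col = sum(S[i][j] for i in range(n))
        for i in range(n):
            B[i][j] = S[i][j] / col if col > 0 else 1.0 / n
    w0 = [1.0 / n] * n
    w = w0[:]
    for _ in range(steps):
        w = [(1 - d) * w0[i] + d * sum(B[i][j] * w[j] for j in range(n))
             for i in range(n)]
    return w

# Hypothetical feature sets for three images returned for one query.
images = [
    [(0, 0), (1, 1), (5, 5)],   # shares features with both other images
    [(0, 0.1), (1, 1.1)],       # shares two features with image 0
    [(5, 5.1), (9, 9)],         # shares one feature with image 0
]
scores = image_rank(images)
ranking = sorted(range(len(images)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The image that shares features with the most other images ends up with the highest score, which is the sense in which it is the most "important" result.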

Pros/Cons Where will it work well? Where will it fail? What happens to polysemous queries? What about logos?

Text + Image PageRank How could we extend this algorithm to incorporate image and text information?

Animals on the Web Tamara L. Berg & David Forsyth

I want to find lots of good pictures of monkeys… What can I do?

Google Image Search -- monkey Circa 2006

Google Image Search -- monkey

Words alone won’t work

Flickr Search - monkey Even with humans doing the labeling, the data is extremely noisy -- context, polysemy, photo sets Words alone still won’t work!

Our Results

General Approach - Vision alone won’t solve the problem. - Text alone won’t solve the problem. -> Combine the two!

Animals on the Web Extremely challenging visual categories. Free text on web pages. Take advantage of language advances. Combine multiple visual and textual cues.

Goal: Classify images depicting semantic categories of animals in a wide range of aspects, configurations and appearances. Images typically portray multiple species that differ in appearance.

Animals on the Web Outline: Harvest pictures of animals from the web using Google Text Search. Select visual exemplars using text based information. Use visual and textual cues to extend to similar images.

Harvested Pictures 14,051 images for 10 animal categories. 12,886 additional images for monkey category using related monkey queries (primate, species, old world, science…)

Text Model Latent Dirichlet Allocation (LDA) on the words in collected web pages to discover 10 latent topics for each category. Each topic defines a distribution over words. Select the 50 most likely words for each topic.
Example Frog Topics:
1.) frog frogs water tree toad leopard green southern music king irish eggs folk princess river ball range eyes game species legs golden bullfrog session head spring book deep spotted de am free mouse information round poison yellow upon collection nature paper pond re lived center talk buy arrow common prince
2.) frog information january links common red transparent music king water hop tree pictures pond green people available book call press toad funny pottery toads section eggs bullet photo nature march movies commercial november re clear eyed survey link news boston list frogs bull sites butterfly court legs type dot blue

Select Exemplars Rank images according to whether they have these likely words near the image in the associated page (word score). Select up to 30 images per topic as exemplars.
1.) frog frogs water tree toad leopard green southern music king irish eggs folk princess river ball range eyes game species legs golden bullfrog session head...
2.) frog information january links common red transparent music king water hop tree pictures pond green people available book call press...

Senses There are multiple senses of a category within the Google search results. Ask the user to identify which of the 10 topics are relevant to their search. Merge. Optional second step of supervision – ask user to mark erroneously labeled exemplars.

Image Model Match Pictures of a category

Geometric Blur Shape Feature (figure labels: Sparse Signal, Geometric Blur) (A.) Berg & Malik ‘01. Captures local shape, but allows for some deformation. Robust to differences in intra-category object shape. Used in current best object recognition systems: Zhang et al, CVPR 2006; Frome et al, NIPS 2006

Image Model (cont.) Color Features: Histogram of what colors appear in the image. Texture Features: Histograms of 16 filters.

Scoring Images For each query feature apply a 1-nearest neighbor classifier. Sum votes for relevant class. Normalize. Combine 4 cue scores (word, shape, color, texture) using a linear combination. Figure labels: Query, Relevant Exemplar, Irrelevant Exemplar, Relevant Features, Irrelevant Features.
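The voting and combination steps can be sketched as below. The features are made-up 2-number tuples, the distance is squared Euclidean, and the equal cue weights are purely hypothetical, since the slide says only that a linear combination is used.

```python
def nn_vote_score(query_feats, relevant_feats, irrelevant_feats):
    """Fraction of query features whose nearest exemplar feature is relevant."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    votes = 0
    for q in query_feats:
        best_relevant = min(sq_dist(q, f) for f in relevant_feats)
        best_irrelevant = min(sq_dist(q, f) for f in irrelevant_feats)
        if best_relevant < best_irrelevant:   # 1-nearest neighbor is relevant
            votes += 1
    return votes / len(query_feats)           # normalize by number of features

def combine(scores, weights):
    """Linear combination of per-cue scores."""
    return sum(weights[cue] * scores[cue] for cue in weights)

# Hypothetical shape features for a query image and the exemplars.
query = [(0, 0), (1, 0), (9, 9)]
relevant = [(0, 0.1), (1, 0.2)]
irrelevant = [(10, 10)]
shape_score = nn_vote_score(query, relevant, irrelevant)

# Made-up scores for the other cues and equal (hypothetical) weights.
scores = {"word": 0.8, "shape": shape_score, "color": 0.5, "texture": 0.4}
weights = {"word": 0.25, "shape": 0.25, "color": 0.25, "texture": 0.25}
total = combine(scores, weights)
print(shape_score, total)
```

Here two of the three query features fall closest to a relevant exemplar feature, so the shape score is 2/3 before it is combined with the other cues.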

Classification Comparison Words Words + Picture

Cue Combination: Monkey

Cue Combination: Frog, Giraffe

Re-ranking Precision Classification Performance Google

Re-ranking Precision Monkey Category Classification Performance Google Monkey

Ranked Results:

Commercial systems