SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.


Last Time
- Web Search
  - Directories vs. search engines
  - How web search differs from other search
    - Type of data searched over
    - Types of searches done
    - Types of searchers doing the search
  - Web queries are short
    - This probably means people are often using search engines to find starting points
    - Once at a useful site, they must follow links or use site search
  - Web search ranking combines many features

What About Ranking?
- Lots of variation here
  - Pretty messy in many cases
  - Details usually proprietary and fluctuating
- Combining subsets of:
  - Term frequencies
  - Term proximities
  - Term position (title, top of page, etc.)
  - Term characteristics (boldface, capitalized, etc.)
  - Link analysis information
  - Category information
  - Popularity information
- Most use a variant of vector-space ranking to combine these
- Here's how it might work:
  - Make a vector of weights for each feature
  - Multiply this by the counts for each feature
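The weight-times-counts combination on this slide can be sketched as a dot product of a feature-count vector with a weight vector. The feature names and weight values below are illustrative assumptions, not any engine's actual (proprietary) settings.

```python
# Sketch of vector-space-style feature combination for ranking.
# Feature names and weights are invented for illustration.

FEATURE_WEIGHTS = {
    "term_frequency": 1.0,   # raw occurrences of query terms
    "term_in_title": 5.0,    # term position matters more in the title
    "term_proximity": 2.0,   # query terms appearing close together
    "inlink_count": 3.0,     # link analysis signal
}

def score(feature_counts: dict) -> float:
    """Multiply each feature count by its weight and sum the results."""
    return sum(FEATURE_WEIGHTS.get(f, 0.0) * c
               for f, c in feature_counts.items())

# A page where the query term appears 4 times, once in the title,
# with 2 inlinks: 4*1.0 + 1*5.0 + 2*3.0 = 15.0
print(score({"term_frequency": 4, "term_in_title": 1, "inlink_count": 2}))
```

Changing the weight vector changes the ranking without touching the per-document counts, which is why engines can tune ranking behavior frequently.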

From description of the NorthernLight search engine, by Mark Krellenstein

High-Precision Ranking
- Proximity search can help get high-precision results if the query has more than one term
  - Hearst '96 paper:
    - Combines Boolean and passage-level proximity
    - Shows significant improvements when retrieving the top 5, 10, 20, 30 documents
    - Results reproduced by Mitra et al. '98
    - Google uses something similar

Boolean Formulations, Hearst '96 Results

Spam
- Spam: undesired content
- Web spam: content is disguised as something it is not, in order to
  - Be retrieved more often than it otherwise would be
  - Be retrieved in contexts where it otherwise would not be retrieved

Web Spam
- What are the types of web spam?
  - Add extra terms to get a higher ranking
    - Repeat "cars" thousands of times
  - Add irrelevant terms to get more hits
    - Put a dictionary in the comments field
    - Put extra terms in the same color as the background of the web page
  - Add irrelevant terms to get different types of hits
    - Put "sex" in the title field of sites that are selling cars
  - Add irrelevant links to boost your link-analysis ranking
- There is a constant "arms race" between web search companies and spammers

Commercial Issues
- General Internet search is often commercially driven
  - The commercial sector sometimes hides things, making them harder to track than research
  - On the other hand, most CTOs of search engine companies used to be researchers, and so help us out
  - Commercial search engine information changes monthly
  - Sometimes motivations are commercial rather than technical
    - Goto.com uses payments to determine ranking order
    - iwon.com gives out prizes

Web Search Architecture

- Preprocessing
  - Collection gathering phase
    - Web crawling
  - Collection indexing phase
- Online
  - Query servers
  - This part is not covered in the readings

From description of the FAST search engine, by Knut Risvik

Standard Web Search Engine Architecture (diagram)
- Crawl the web
- Check for duplicates, store the documents
- Create an inverted index
- Search engine servers match the user query against the inverted index and show the results (DocIds) to the user

More detailed architecture, from Brin & Page '98. It covers only the preprocessing in detail, not the query serving.

Inverted Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge
- Some systems partition the indexes across different machines; each machine handles a different part of the data
- Other systems duplicate the data across many machines; queries are distributed among the machines
- Most do a combination of these
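For concreteness, here is a minimal sketch of the inverted-index structure itself: a map from each term to a postings list of (document ID, position) pairs. The two sample documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = {
    1: "web search engines index the web",
    2: "inverted index for web search",
}
index = build_inverted_index(docs)
print(index["web"])    # [(1, 0), (1, 5), (2, 3)]
print(index["index"])  # [(1, 3), (2, 1)]
```

Storing positions (not just document IDs) is what makes the proximity features discussed earlier possible. Partitioning this structure across machines means splitting either the term space or the document space.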

From description of the FAST search engine, by Knut Risvik
- In this example, the data for the pages is partitioned across machines, and each partition is allocated multiple machines to handle the queries
- Each row can handle 120 queries per second
- Each column can handle 7M pages
- To handle more queries, add another row

Cascading Allocation of CPUs
- A variation on this that produces a cost savings:
  - Put high-quality/common pages on many machines
  - Put lower-quality/less common pages on fewer machines
  - A query goes to the high-quality machines first
  - If no hits are found there, it goes to the other machines
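The cascade can be sketched as a tiered lookup: try the heavily replicated high-quality tier first, and fall back to the smaller tier only on a miss. The tier contents here are invented for illustration.

```python
# Sketch of cascading allocation: query the high-quality tier first,
# fall back to the low-quality tier only when the first tier has no hits.
# The tier contents are invented.

HIGH_QUALITY_TIER = {"cars": ["hq-doc-1", "hq-doc-2"]}
LOW_QUALITY_TIER = {"cars": ["lq-doc-9"], "obscure": ["lq-doc-3"]}

def search(term: str) -> list:
    hits = HIGH_QUALITY_TIER.get(term, [])
    if hits:                                # hit in the well-provisioned tier
        return hits
    return LOW_QUALITY_TIER.get(term, [])   # fall back to fewer machines

print(search("cars"))     # served entirely by the high-quality tier
print(search("obscure"))  # falls through to the low-quality tier
```

The cost savings comes from most queries terminating in the first tier, so the rarely-hit pages need far less replication.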

Web Crawlers
- How do the web search engines get all of the items they index?
- Main idea:
  - Start with known sites
  - Record information for these sites
  - Follow the links from each site
  - Record information found at new sites
  - Repeat

Web Crawlers
- How do the web search engines get all of the items they index?
- More precisely:
  - Put a set of known sites on a queue
  - Repeat the following until the queue is empty:
    - Take the first page off of the queue
    - If this page has not yet been processed:
      - Record the information found on this page (positions of words, links going out, etc.)
      - Add each link on the current page to the queue
      - Record that this page has been processed
- In what order should the links be followed?
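The steps above can be sketched as a queue-driven crawl. Fetching and link extraction are stubbed out with an invented toy link graph; a real crawler would fetch pages over HTTP and respect robots.txt.

```python
from collections import deque

# Invented toy link graph standing in for fetched pages.
LINKS = {
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-a", "site-d"],
    "site-c": [],
    "site-d": ["site-b"],
}

def crawl(seeds):
    """Queue-based crawl: process each page once, enqueue its links."""
    queue = deque(seeds)
    processed = []
    seen = set(seeds)
    while queue:
        page = queue.popleft()          # take the first page off the queue
        processed.append(page)          # "record information" for the page
        for link in LINKS.get(page, []):
            if link not in seen:        # only enqueue unprocessed pages
                seen.add(link)
                queue.append(link)
    return processed

print(crawl(["site-a"]))  # ['site-a', 'site-b', 'site-c', 'site-d']
```

Because the frontier here is a FIFO queue, this particular sketch visits pages in breadth-first order; swapping the queue for a stack would make it depth-first, which is exactly the ordering question the slide raises.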

Page Visit Order
- Animated examples of breadth-first vs. depth-first search on trees (animations visible only in presentation mode):
  - Structure to be traversed
  - Breadth-first search
  - Depth-first search
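In place of the animations, the difference between the two visit orders comes down to the frontier data structure: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first. The toy tree is invented for illustration.

```python
from collections import deque

# Invented toy tree: node -> children.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}

def bfs(start):
    """Breadth-first: visit all nodes at one depth before the next."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()              # FIFO frontier
        order.append(node)
        queue.extend(TREE[node])
    return order

def dfs(start):
    """Depth-first: follow one branch to the bottom before backtracking."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()                  # LIFO frontier
        order.append(node)
        stack.extend(reversed(TREE[node]))  # keep left-to-right child order
    return order

print(bfs("root"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs("root"))  # ['root', 'a', 'a1', 'a2', 'b', 'b1']
```

Note how breadth-first visits both top-level branches before any leaves, while depth-first exhausts the first branch entirely before touching the second.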

Depth-First Crawling (more complex: graphs & sites)
(Diagram: six sites, each containing several interlinked pages, showing the order in which a depth-first crawler visits them)

Breadth-First Crawling (more complex: graphs & sites)
(Diagram: the same six sites, showing the order in which a breadth-first crawler visits them)

Web Crawling Issues
- Keep-out signs
  - A file called robots.txt tells the crawler which directories are off limits
- Freshness
  - Figure out which pages change often
  - Recrawl these often
- Duplicates, virtual hosts, etc.
  - Convert page contents with a hash function
  - Compare new pages to the hash table
- Lots of problems
  - Server unavailable
  - Incorrect HTML
  - Missing links
  - Infinite loops
- Web crawling is difficult to do robustly!
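The duplicate-detection step can be sketched with a content hash: hash each fetched page and skip any page whose hash has been seen before. SHA-256 is used here as an illustrative choice of hash function.

```python
import hashlib

def content_hash(page_text: str) -> str:
    """Hash the page contents so duplicates map to the same key."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

seen_hashes = set()

def is_duplicate(page_text: str) -> bool:
    """Compare the new page's hash against the table of seen hashes."""
    h = content_hash(page_text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

print(is_duplicate("hello web"))  # False (first time seen)
print(is_duplicate("hello web"))  # True  (same content, e.g. a mirror)
print(is_duplicate("hello Web"))  # False (any byte difference changes the hash)
```

Comparing fixed-size hashes instead of full page bodies keeps the duplicate table small; the trade-off, as the last line shows, is that an exact-content hash misses near-duplicates that differ by even one byte.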

Cha-Cha
- Cha-Cha searches an intranet
  - Sites associated with an organization
- Instead of hand-edited categories:
  - Computes the shortest path from the root for each hit
  - Organizes search results according to which subdomain the pages are found in

Cha-Cha Web Crawling Algorithm
- Start with a list of servers to crawl
  - For UCB, simply start with
- Restrict the crawl to certain domain(s)
  - *.berkeley.edu
- Obey the No Robots standard
- Follow hyperlinks only
  - Do not read local filesystems
  - Links are placed on a queue
  - Traversal is breadth-first
- See the first lecture or the technical papers for more information
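The "shortest path from the root" that Cha-Cha computes for each hit falls out naturally from a breadth-first pass over the link graph, since BFS reaches every page by a shortest path first. The tiny intranet link graph below is invented for illustration.

```python
from collections import deque

# Invented intranet link graph: page -> outgoing links.
LINKS = {
    "www.berkeley.edu": ["sims.berkeley.edu", "cs.berkeley.edu"],
    "sims.berkeley.edu": ["sims.berkeley.edu/courses"],
    "cs.berkeley.edu": [],
    "sims.berkeley.edu/courses": [],
}

def shortest_paths(root: str) -> dict:
    """BFS from the root; record the first (= shortest) path to each page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            if link not in paths:           # first discovery is shortest
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths

paths = shortest_paths("www.berkeley.edu")
print(paths["sims.berkeley.edu/courses"])
# ['www.berkeley.edu', 'sims.berkeley.edu', 'sims.berkeley.edu/courses']
```

The recorded path doubles as a display hierarchy: grouping hits by the first few elements of their path is one way to organize results by subdomain, as the previous slide describes.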

Summary
- Web search differs from traditional IR systems
  - Different kind of collection
  - Different kinds of users/queries
  - Different economic motivations
- Ranking combines many features in a difficult-to-specify manner
  - Link analysis and proximity of terms seem especially important
  - This is in contrast to the term-frequency orientation of standard search
    - Why?

Summary (cont.)
- Web search engine architecture
  - Similar in many ways to standard IR
  - Indexes are usually duplicated across machines to handle many queries quickly
- Web crawling
  - Used to create the collection
  - Can be guided by quality metrics
  - Is very difficult to do robustly

Web Search Statistics

Searches per Day (information from searchenginewatch.com; data missing for fast.com, Excite, NorthernLight, etc.)

Web Search Engine Visits (information from searchenginewatch.com)

Percentage of web users who visit the site shown (information from searchenginewatch.com)

Search Engine Size, July 2000 (information from searchenginewatch.com)

Does size matter? You can't access many hits anyhow. (information from searchenginewatch.com)

Increasing numbers of indexed pages, self-reported (information from searchenginewatch.com)

Increasing numbers of indexed pages, more recent, self-reported (information from searchenginewatch.com)

Web Coverage (information from searchenginewatch.com)

From description of the FAST search engine, by Knut Risvik

Directory sizes (information from searchenginewatch.com)