Search - on the Web and Locally. Related directly to Web Search Engines: Part 1 and Part 2, IEEE Computer, June & August 2006.

First -- projects
The class web page suggests these types of projects:
1. A detailed literature review on one of the topics of this course. This involves discovering, reading, summarizing, and comparing published material about either search technology or personal information management. Conference papers are an appropriate source of materials. Materials found on the web are fine, as long as you do a suitable evaluation of the credibility of the resource.
2. A comparative review of a number of tools for one type of information management. For example, you might compare several photo management tools, describing each, listing the features that set each apart from the others, and then summarizing their strengths and weaknesses. Your report would conclude with your evaluation of the state of the art of this type of information management, based on your review of these materials.
3. A significant contribution to an open source project related to our topics. Do you have a way to improve Lucene? Can you find a tool for managing information that you can improve? You must prepare your project for evaluation by the class and for submission to the open source project organization.
4. A totally new tool that you have created. Have you had an idea for a useful tool and never got around to doing anything about it? Maybe this will be the beginning of an important product.

First - the search
Describe your experience in finding the required reading:
– What steps did you take?
– Were there any problems?
– Was anything about the search difficult?
– Was anything different from what you expected?

Initial discussion
What surprised you in these articles?
– What did you recognize from previous courses but did not expect to see in a discussion of Web search?
– What works differently from the image you had?
What would you like to have learned that was not included?
What are the biggest areas of challenge to the Web search enterprise?
Are there things that cannot be solved?
– Are there issues of scale that are just impossible?
– Are there limitations that just cannot be overcome?
Are there problems to solve that require more work but are within the range of manageable improvements?

The Web Search
Three distinct phases:
– Crawling
– Indexing
– Searching
Each has specific challenges to address.

Crawlers
Basic process (a sketch follows):
– Open an HTML page that has at least one anchor tag (<a href="...">link description</a>)
– Send an HTTP request to the site and receive the page
– Parse the page, looking for other anchor tags
– Place the anchors on a queue for further processing
– Submit the actual page for indexing and storing
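Below is a minimal sketch of that loop in Python, using only the standard library. The index_page stub and the breadth-first queue are assumptions for illustration; a real crawler adds politeness delays, robots.txt checks, and far more error handling.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class AnchorExtractor(HTMLParser):
        """Collect the href attribute of every anchor tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def index_page(url, text):
        """Stub: a real system would parse, store, and index the page here."""

    def crawl(seed, limit=100):
        """Basic crawl loop: fetch a page, queue its anchors, index the page."""
        queue, seen = deque([seed]), {seed}
        while queue and len(seen) <= limit:
            url = queue.popleft()
            try:
                page = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue                      # unreachable page: skip it
            parser = AnchorExtractor()
            parser.feed(page)                 # parse, looking for anchor tags
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute not in seen:      # place new anchors on the queue
                    seen.add(absolute)
                    queue.append(absolute)
            index_page(url, page)             # submit the page for indexing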

Indexers
Scanning:
– "For each indexable term … the indexer writes a posting consisting of a document number and a term number to a temporary file."
– Parse this sentence: What is an indexable term? A posting? A document number? A term number? What does a posting look like?
Invert the file:
– Sort by term, secondarily by document number
– Record the start location and list length for each term
(A toy version of both steps follows.)
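This sketch makes the vocabulary concrete. Here a posting is simply a (term number, document number) pair, with both numbers assigned by order of appearance; that is an assumption for illustration, not the article's exact format.

    def build_index(docs):
        """Scan and invert a list of document strings.

        Step 1 (scan): write one posting (term number, document number)
        for each indexable term of each document.
        Step 2 (invert): sort postings by term, secondarily by document
        number, then record the start location and list length per term.
        """
        term_numbers = {}
        postings = []
        for doc_num, text in enumerate(docs):
            for word in set(text.lower().split()):   # crude "indexable term"
                term_num = term_numbers.setdefault(word, len(term_numbers))
                postings.append((term_num, doc_num))
        postings.sort()                              # term, then document number
        offsets = {}                                 # term number -> (start, length)
        start = 0
        for i, (term, _) in enumerate(postings):
            if i + 1 == len(postings) or postings[i + 1][0] != term:
                offsets[term] = (start, i - start + 1)
                start = i + 1
        return postings, offsets, term_numbers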

Searching (Query Processing)
– Look up each query term in the term dictionary
– Get its postings list
– Find the documents that match all search terms: find the documents for each term and merge the lists where common documents occur (a sketch follows)
– Rank the documents and report them, as many as required or until the end of the list
It is still possible to find a result on one search and not find that same item on a subsequent search of the same terms.
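The merge of two postings lists is a single linear pass over document numbers kept in sorted order; a sketch of the two-term AND case, with multi-term queries folding this over one list per term:

    def intersect(list_a, list_b):
        """AND-merge two sorted lists of document numbers."""
        i = j = 0
        matches = []
        while i < len(list_a) and j < len(list_b):
            if list_a[i] == list_b[j]:
                matches.append(list_a[i])    # document contains both terms
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                i += 1                       # advance the list with the smaller number
            else:
                j += 1
        return matches

For example, intersect([1, 4, 7, 9], [2, 4, 9]) yields [4, 9], the documents that contain both terms.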

Expanding from the basics
Each of the phases of web searching is simple in concept but complicated by the sheer magnitude of the task. The same ideas applied on a smaller scale, in a company intranet for example, can be done efficiently. The Web presents special challenges.

Crawling
A single machine running a simple crawling algorithm would not do well in finding all Web pages. Search engines instead rely on large data centers:
– Redundancy and fault tolerance
– Parallel operation
(See the SIGCSE talk by Marissa Mayer of Google.)

Crawling reality
Speed, with amazing numbers: at one second per HTTP request, a single machine can fetch at most 86,400 pages per day, which works out to 634 years for 20 billion pages
Politeness:
– Avoid overwhelming web servers (see the gatekeeper sketch below)
Excluded content:
– robots.txt
Duplicate content:
– Identifying duplicates can be tricky. Why?
Continuous crawling:
– Keeping the index current
– Note the comment about "current time". How would you fix that?
– A priority queue for the crawling schedule. Why?
Spam
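The politeness and exclusion points can be combined in one small gatekeeper. The sketch below uses the standard-library robots.txt parser; the one-second delay and the "ExampleBot" agent name are arbitrary assumptions.

    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    robots_cache = {}   # host -> parsed robots.txt rules
    last_fetch = {}     # host -> time of our previous request to that host

    def polite_can_fetch(url, agent="ExampleBot", delay=1.0):
        """Honor robots.txt and pause between requests to the same host."""
        host = urlparse(url).netloc
        if host not in robots_cache:
            rules = RobotFileParser(f"http://{host}/robots.txt")
            try:
                rules.read()               # fetch and parse the exclusion rules
            except OSError:
                pass                       # unreachable robots.txt: parser refuses all
            robots_cache[host] = rules
        if not robots_cache[host].can_fetch(agent, url):
            return False                   # excluded content: do not crawl
        wait = delay - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)               # politeness: do not overwhelm the server
        last_fetch[host] = time.time()
        return True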

Indexing large collections
The Web is the ultimate "large collection": estimating 500 terms in each of 20 billion pages gives 10 trillion entries!
Divide and conquer, as the crawler did:
– Each indexer builds a partial file in memory
– It stops when memory is full
– Write the partial file to disk, clear memory, and start over
Merge the partial files to make the full index, as sketched below.
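The final merge is a classic external k-way merge. A sketch, assuming each partial file was written with one posting per line and formatted so that lexicographic line order matches (term, document) order (for example, zero-padded numbers):

    import heapq

    def merge_partial_indexes(paths, out_path):
        """k-way merge of sorted partial posting files into the full index."""
        files = [open(p) for p in paths]
        try:
            with open(out_path, "w") as out:
                # heapq.merge streams the sorted inputs in overall sorted
                # order while holding only one line per file in memory.
                for line in heapq.merge(*files):
                    out.write(line)
        finally:
            for f in files:
                f.close()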

Data structures for indexing
Trees, tries, hash tables:
– Various ways to organize the terms for easy lookup
Numbers of terms:
– Not just all the words in all languages
– Acronyms, proper names, etc.
– Must also deal with common phrases: separate index entries (postings) for common word combinations
Compression:
– Saves space, increases processing (see the sketch after this slide)
Anchor text: fie on those who use "click here"!
Link popularity score:
– Give a page a score based on its popularity, and also on query-independent factors
– Think about the implications of this
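The slide does not name a compression scheme, but variable-byte encoding of document-number gaps is a standard illustration of the space-versus-processing tradeoff:

    def vbyte_encode(doc_numbers):
        """Encode a sorted postings list as variable-byte gap values."""
        out = bytearray()
        prev = 0
        for doc in doc_numbers:
            gap = doc - prev         # store small gaps, not large absolute numbers
            prev = doc
            while True:
                low7 = gap & 0x7F
                gap >>= 7
                if gap:
                    out.append(low7)          # more bytes follow: high bit clear
                else:
                    out.append(low7 | 0x80)   # last byte of this gap: high bit set
                    break
        return bytes(out)

Decoding simply reverses the loop; the index shrinks because most gaps fit in one byte, at the cost of extra work for every posting read.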

Query Processing
Most queries are short and do not provide much context.
Result quality: use some of the techniques from information retrieval
– Once a preliminary list of responses is obtained, treat it as the collection and use IR techniques to improve the quality of the response. One limitation: there is no way to judge how complete the initial list is.
– The exact techniques are part of the trade secrets of the companies.
Speeding up query evaluation (skipping is sketched below):
– Skipping
– Early termination
– Document numbering
– Caching
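Of those four speedups, skipping is the simplest to sketch: place skip pointers every sqrt(n) entries of a postings list so the intersection can leap over stretches that cannot contain a match. The interval choice and list representation here are assumptions.

    import math

    def intersect_with_skips(list_a, list_b):
        """AND-merge two sorted postings lists, skipping ahead in list_a."""
        skip = max(1, int(math.sqrt(len(list_a))))   # conventional skip interval
        i = j = 0
        matches = []
        while i < len(list_a) and j < len(list_b):
            if list_a[i] == list_b[j]:
                matches.append(list_a[i])
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                # Follow skip pointers while the target still falls short of b's
                # current document; strict < keeps potential matches reachable.
                while i + skip < len(list_a) and list_a[i + skip] < list_b[j]:
                    i += skip
                i += 1
            else:
                j += 1
        return matches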