1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.

Presentation transcript:

1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google

2 Course Administration

3 Web Search
Goal
  Provide information discovery for large amounts of open access material on the web
Challenges
  Volume of material -- several billion items, growing steadily
  Items created dynamically or in databases
  Great variety -- length, formats, quality control, purpose, etc.
  Inexperience of users -- range of needs
  Economic models to pay for the service

4 Strategies
Subject hierarchies
  Yahoo! -- use of human indexing
Web crawling + automatic indexing
  General -- Google, AltaVista, Ask Jeeves, NorthernLight, ...
  Subject based -- Psychcrawler, PoliticalInformation.Com, Inomics.Com, ...
Mixed models
  Human directed web crawling and automatic indexing -- BBC News

5 Components of Web Search Service
Components
  Web crawler
  Indexing system
  Search system
Considerations
  Economics
  Scalability

6 Economic Models
Subscription
  Monthly fee with logon provides unlimited access (introduced by InfoSeek)
Advertising
  Access is free, with display advertisements (introduced by Lycos)
  Can lead to distortion of results to suit advertisers
Licensing
  Costs of the company are covered by fees, licensing of software, and specialized services

7

8 Cost Example (Google)
85 people
  50% technical, 14 Ph.D. in Computer Science
Equipment
  2,500 Linux machines
  80 terabytes of spinning disks
  30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
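To make the growth figures on the slide above concrete, here is a small worked sketch in Python. It assumes the 20% monthly increase compounds and stays constant, which the slide does not claim; the function names are illustrative only.

```python
def project_daily_searches(start_per_day, monthly_growth, months):
    """Compound a daily search volume forward by a fixed monthly growth rate."""
    return start_per_day * (1.0 + monthly_growth) ** months


def months_to_reach(start_per_day, monthly_growth, target_per_day):
    """Count whole months until the compounded volume reaches the target."""
    volume = start_per_day
    months = 0
    while volume < target_per_day:
        volume *= 1.0 + monthly_growth
        months += 1
    return months


if __name__ == "__main__":
    # Figures from the slide: 5.5 million searches per day, growing 20% per month.
    one_year = project_daily_searches(5.5e6, 0.20, 12)
    print(f"After 12 months: {one_year / 1e6:.1f} million searches per day")
    print("Months to double:", months_to_reach(5.5e6, 0.20, 11e6))
```

Under that assumption the load doubles in about four months, which is why the slide pairs the search volume with a steady stream of newly installed machines.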

9 Indexing Goals: Precision
Short queries applied to very large numbers of items lead to large numbers of hits.
Usability requires:
  Ranking hits in an order that fits the user's requirements
  Effective presentation
    helpful summary records
    removal of duplicates
    grouping results from a single site
Completeness of index is not the most important factor.
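The presentation steps named above (removal of duplicates, grouping results from a single site) can be sketched as follows. This is a minimal illustration, not how any particular search service does it: it assumes duplicates are exact matches after whitespace and case normalization, and that "site" means the URL's host name. The function name `group_hits` is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def group_hits(hits):
    """Drop duplicate pages and group the remaining hits by site.

    hits: list of (url, text) pairs, already in ranked order.
    Returns a list of (host, [urls]) pairs; sites appear in the order
    of their best-ranked hit.
    """
    seen_fingerprints = set()
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for url, text in hits:
        # Assumption: two pages are duplicates if their normalized text is identical.
        normalized = " ".join(text.lower().split())
        fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if fingerprint in seen_fingerprints:
            continue
        seen_fingerprints.add(fingerprint)
        host = urlsplit(url).netloc.lower()
        groups.setdefault(host, []).append(url)
    return list(groups.items())


if __name__ == "__main__":
    ranked = [
        ("http://a.example/1", "Hello  World"),
        ("http://b.example/x", "other page"),
        ("http://a.example/2", "something else"),
        ("http://c.example/mirror", "hello world"),  # duplicate of the first hit
    ]
    for host, urls in group_hits(ranked):
        print(host, urls)
```

Real services would need near-duplicate detection rather than exact matching, but the grouping logic would look similar.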

10 Effective Information Retrieval
Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).
  Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
Full text indexing with ranked retrieval (e.g., news articles).
  Excellent for relatively homogeneous material, but requires available full text.

11 Effective Information Retrieval (cont)
Full text indexing with contextual information and ranked retrieval (e.g., Google).
  Excellent for mixed textual information with rich structure.
Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval).
  Promising, but still experimental.

12 Google: Ranking
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. PageRank
The balance between 3, 4, and 5 is not made public.
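Of the ranking factors listed above, PageRank (item 5) is the one with a published definition. The sketch below shows a simple power-iteration version on a tiny link graph. It is an illustration only: the damping factor of 0.85, the fixed iteration count, and the handling of pages with no outgoing links are assumptions made here, and how PageRank is combined with items 3 and 4 is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

In this toy graph, page "c" is linked from both other pages and ends up ranked above "b", which has only one incoming link.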

13 Usability: Display of Results

14 Usability: Dynamic Abstracts
Query: Cornell sports
  LII: Law about...Sports
  ... sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports
Query: NCAA Tarkanian
  LII: Law about...Sports
  ... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ...
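The two abstracts above come from the same page but differ because each is built around the query terms. A minimal sketch of that idea follows; it assumes a simple fixed-width character window around each matching word, which is an illustrative choice and not a description of Google's method. The function name `dynamic_abstract` is hypothetical.

```python
import re


def dynamic_abstract(text, query, width=40):
    """Build a query-dependent abstract from windows around matching words."""
    terms = {term.lower() for term in query.split()}
    fragments = []
    covered_until = -1  # end of the last window, to avoid overlapping fragments
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in terms and match.start() > covered_until:
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            fragments.append(text[start:end].strip())
            covered_until = end
    if not fragments:
        return ""
    return "... " + " ... ".join(fragments) + " ..."


if __name__ == "__main__":
    page = "Sports law covers many areas. See NCAA v. Tarkanian for state action."
    print(dynamic_abstract(page, "sports", width=10))
    print(dynamic_abstract(page, "Tarkanian", width=10))
```

Running it with the two queries produces two different abstracts of the same text, mirroring the slide's example.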

15 Limitations of Web Crawling
Time delay. Typically a monthly cycle. Crawlers are ineffective with sites that change rapidly, e.g., news.
Pages not linked to. Crawlers find only those pages that are linked by paths from their seeds.
Depth of crawl. Crawlers do not index every page on a site (algorithms to avoid crawler traps).
but...
Creators of information are increasingly organizing it to be accessible to the web search services (e.g., Springer-Verlag)
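Two of the limitations above (pages not linked to, and depth of crawl) follow directly from how a crawler traverses links. The sketch below simulates a breadth-first crawl over an in-memory link graph rather than the live web; the graph, the page names, and the `max_depth` cutoff are all hypothetical, and a fixed depth limit is only one simple way to avoid crawler traps.

```python
from collections import deque


def crawl(link_graph, seeds, max_depth):
    """Breadth-first crawl of a link graph, starting from the seed pages.

    link_graph: dict mapping a page to the pages it links to.
    Returns a dict of discovered page -> depth at which it was found.
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in link_graph.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


if __name__ == "__main__":
    graph = {
        "home": ["a", "b"],
        "a": ["deep"],
        "deep": ["deeper"],
        "orphan": ["home"],  # links out, but nothing links to it
    }
    print(crawl(graph, ["home"], max_depth=2))
```

Here "orphan" is never found because no path leads to it from the seed, and "deeper" is skipped because it lies beyond the depth limit.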

16 Scalability
[Figure: The growth of the web]

17 Scalability
Web search services are centralized systems
Over the past 3-5 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.
Will this continue?
Possible areas for concern are telecommunications costs, disk access rates.

18 Case Study: Google
Python with C/C++
Linux
Module-based architecture
Multi-machine
Multi-thread

19 Performance
Storage
  –Scale with the size of the Web
  –Repository is comparatively small
  –Good/Fast compression/decompression
System
  –Crawling, Indexing, Sorting
  –Last two simultaneously
Searching
  –Bounded by disk I/O

20 Image Search: indexing by contextual information only

21 Google API

22 Selective searching

23 Google News

24 Conclusion
Google:
  –Scalable search engine
  –Complete architecture
Many research ideas arise
  –Always something to improve
High quality search is the dominant factor
  –precision
  –presentation of results