1 CS/INFO 430 Information Retrieval Lecture 21 Web Search 3.



2 Course Administration
Wednesday, November 16: No discussion class
Thursday, November 17: No lecture, no office hours

3 Scalability
Question: How big is the Web and how fast is it growing?
Answer: Nobody knows.
Estimates of the Crawled Web:
[year missing]: [...],000 pages
1997: 1,000,000 pages
2000: 1,000,000,000 pages
2005: 8,000,000,000 pages
Rough estimates of the Crawlable Web suggest at least 4x.
Rough estimates of the Deep Web suggest at least 100x.
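The dated estimates on this slide imply very different growth rates before and after 2000. A minimal sketch of that arithmetic, taking the three dated figures at face value and assuming that "4x" and "100x" are multiples of the 2005 crawled estimate (an interpretation, not something the slide states):

```python
# Dated estimates of the crawled Web, copied from the slide.
crawled = {1997: 1_000_000, 2000: 1_000_000_000, 2005: 8_000_000_000}

def annual_growth_factor(y0, y1, estimates=crawled):
    """Geometric-mean growth factor per year between two estimates."""
    return (estimates[y1] / estimates[y0]) ** (1 / (y1 - y0))

early = annual_growth_factor(1997, 2000)   # 1000x over 3 years
late = annual_growth_factor(2000, 2005)    # 8x over 5 years

# Assumption: the multipliers apply to the 2005 crawled estimate.
crawlable_lower_bound = 4 * crawled[2005]
deep_web_lower_bound = 100 * crawled[2005]

print(f"1997-2000: about {early:.1f}x per year")
print(f"2000-2005: about {late:.2f}x per year")
```

On these figures the crawled Web grew roughly tenfold per year up to 2000 and roughly 1.5-fold per year afterwards.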

4 Scalability: Software and Hardware Replication Search service index server document server spell checking advertisement server

5 Scalability: Large-scale Clusters of Commodity Computers "Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...." Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October ghemawat.pdf
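The quoted observation follows from simple probability: even when each machine is individually reliable, a large cluster almost always has something broken. A minimal sketch, assuming machines fail independently and that each one is out of service with a hypothetical probability of 0.1% at any instant (neither assumption comes from the paper):

```python
def prob_some_machine_down(n_machines, p_down):
    """P(at least one of n independent machines is down) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_down) ** n_machines

def expected_machines_down(n_machines, p_down):
    """Expected number of machines out of service at a given instant."""
    return n_machines * p_down

# Hypothetical per-machine downtime of 0.1%.
P_DOWN = 0.001
for n in (1, 100, 2_500, 15_000):
    print(n,
          round(prob_some_machine_down(n, P_DOWN), 4),
          round(expected_machines_down(n, P_DOWN), 1))
```

At cluster sizes in the thousands the chance that everything is working at once becomes small, which is why the software has to treat failure as the normal case.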

6 Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
Storage
– Scale with the size of the Web
– Compression/decompression
System
– Crawling, indexing, sorting simultaneously
Searching
– Bounded by disk I/O

7 Scalability: Numbers of Computers
Very rough calculation:
In March 2000, 5.5 million searches per day required 2,500 computers.
In fall 2004, computers are about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8 = about 15,000
Some industry estimates (based on Google's capital expenditure) suggest that Google or Yahoo may have had as many as 80,000 computers in spring 2005.
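The very rough calculation on this slide can be written out as a small function. This is only a sketch of the slide's own arithmetic: it assumes the number of machines scales linearly with query load and inversely with per-machine power, and the function and parameter names are illustrative.

```python
def estimated_computers(target_searches_per_day,
                        base_searches_per_day=5.5e6,  # March 2000, from the slide
                        base_computers=2_500,         # March 2000, from the slide
                        speedup=8):                   # fall 2004 vs. 2000, from the slide
    """Scale the machine count linearly with load, then divide by the speedup."""
    load_ratio = target_searches_per_day / base_searches_per_day
    return load_ratio * base_computers / speedup

estimate = estimated_computers(250e6)
print(round(estimate))
```

The exact quotient is a little over 14,000, which the slide rounds to "about 15,000". Either way it is several times smaller than the industry estimate of 80,000 machines.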

8 Scalability of Staff: Growth of Google
In 2000: 85 people, 50% technical, 14 Ph.D. in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day. Increase rate was 20% per month.
By fall 2002, Google had grown to over 400 people. In 2004, Google hired 1,000 new people.
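A 20% monthly increase compounds quickly. A minimal sketch of what that rate would mean if it were sustained, which is a hypothetical assumption rather than a claim made on the slide; the 250 million target is the 2004 figure quoted elsewhere in this lecture:

```python
import math

def project(searches_per_day, monthly_rate, months):
    """Load after compounding a fixed monthly growth rate."""
    return searches_per_day * (1 + monthly_rate) ** months

def months_to_reach(start, target, monthly_rate):
    """Months of compound growth needed to go from start to target."""
    return math.log(target / start) / math.log(1 + monthly_rate)

yearly_factor = project(1.0, 0.20, 12)               # growth over 12 months
months_needed = months_to_reach(5.5e6, 250e6, 0.20)  # 5.5M -> 250M searches/day

print(f"{yearly_factor:.1f}x per year, {months_needed:.0f} months to reach 250M/day")
```

At 20% per month the load grows almost ninefold in a year and would pass 250 million searches per day in under two years. Since that level is reported for 2004, the growth rate evidently slowed after March 2000.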

9 Scalability: Staff Programming: Have very well trained staff. Isolate complex code. Most coding is single image. System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers). Customer service: Automate everything possible, but complaints, large collections, etc. still require staff.

10 Scalability of Staff: The Neptune Project The Neptune Clustering Software: Programming API and runtime support, which allows a network service to be programmed quickly for execution on a large-scale cluster in handling high-volume user traffic. The system shields application programmers from the complexities of replication, service discovery, failure detection and recovery, load balancing, resource monitoring and management. Tao Yang, University of California, Santa Barbara

11 Scalability: the long term
Web search services are centralized systems.
Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.
Will this continue?
Possible areas for concern are: staff costs, telecommunications costs, disk and memory access rates, equipment costs.

12 Growth of Web Searching
In November 1997: AltaVista was handling 20 million searches/day. Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reported 250 million web searches/day, and estimated that the total number over all engines was 500 million searches/day.
Moore's Law and Web searching
In 7 years, Moore's Law predicts computer power increased by a factor of at least 2^4 = 16. It appears that computing power is growing at least as fast as web searching.
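The comparison between Moore's Law and search volume can be checked with a few lines of arithmetic. This is a sketch only: the 21-month doubling period is simply what four doublings in seven years works out to, the 18-month period is the commonly quoted figure and is an assumption here, and comparing AltaVista alone in 1997 with all engines in 2004 is a loose comparison.

```python
def moore_factor(years, doubling_months):
    """Growth in computing power after `years`, doubling every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

# Slide figures: 20M searches/day (AltaVista, 1997) vs. 500M/day (all engines, 2004).
search_growth = 500e6 / 20e6

conservative = moore_factor(7, 21)  # four doublings in 7 years, the slide's lower bound
faster = moore_factor(7, 18)        # assumption: 18-month doubling

print(search_growth, conservative, round(faster, 1))
```

With the conservative period computing power grows 16-fold against a 25-fold rise in searches; with an 18-month period the two are nearly equal.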

13 Search Engine Spam: Objective
Success of commercial Web sites depends on the number of visitors that find the site while searching for a particular product.
85% of searchers look at only the first page of results.
A new business sector – search engine optimization
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. International Joint Conference on Artificial Intelligence,
Drost, I. and Scheffer, T., Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. 16th European Conference on Machine Learning, Porto, 2005

14 Search Engine Spam: Techniques
Text based: Add keywords to a page in the hope that search engines will index it, e.g., in meta-tags, in a special type of format, etc.
Cloaking: Return a different page to Web crawlers than to ordinary downloads. (Can also be used to help Web search, e.g., by providing a text version of a highly visual page.)
Link-based: (see next slide)

15 Link Spamming: Techniques
Link farms: Densely connected arrays of pages. Farm pages propagate their PageRank to the target, e.g., by a funnel-shaped architecture that points directly or indirectly towards the target page. To camouflage link farms, tools fill in inconspicuous content, e.g., by copying news bulletins.
Link exchange services: Listings of (often unrelated) hyperlinks. To be listed, businesses have to provide a back link that enhances the PageRank of the exchange service.
Guestbooks, discussion boards, and weblogs: Automatic tools post large numbers of messages to many sites; each message contains a hyperlink to the target website.
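The effect of a link farm can be seen on a toy graph. The sketch below uses a plain power-iteration PageRank and the simplest possible farm, a single layer of pages that all point at the target; real farms are more elaborate, and the page names, farm size, and damping factor of 0.85 are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; links maps each page to the pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical toy web: three ordinary pages plus a target that nobody links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}
before = pagerank(web)["target"]

# Add a 20-page farm: every farm page links only to the target.
farm = {f"farm{i}": ["target"] for i in range(20)}
after = pagerank({**web, **farm})["target"]

print(f"target PageRank before farm: {before:.4f}, after: {after:.4f}")
```

Although the farm pages have no inbound links themselves, each one passes its small baseline rank to the target, so the target's share rises even though the total rank is now spread over six times as many pages.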

16 Link Spamming: Defenses Manual identification of spam pages and farms to create a blacklist. Automatic classification of pages using machine learning techniques. BadRank algorithm. The "bad rank" is initialized to a high value for blacklisted pages. It propagates bad rank to all referring pages (with a damping factor) thus penalizing pages that refer to spam.
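The slide gives only an outline of BadRank, so the following is one possible reading of it rather than the algorithm itself: blacklisted pages are seeded with a bad rank of 1.0, and each page inherits a damped share of the bad rank of the pages it links to, divided by the number of inbound links of the page linked to. The update rule, the seed value, and the page names are all assumptions made for illustration.

```python
def bad_rank(links, blacklist, damping=0.85, iterations=50):
    """Propagate 'badness' backwards along links, from spam pages to their referrers.

    links maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    inbound = {p: 0 for p in pages}
    for targets in links.values():
        for q in targets:
            inbound[q] += 1
    seed = {p: (1.0 if p in blacklist else 0.0) for p in pages}
    bad = dict(seed)
    for _ in range(iterations):
        bad = {
            p: seed[p] + damping * sum(bad[q] / inbound[q] for q in links.get(p, []))
            for p in pages
        }
    return bad

# Hypothetical graph: "shill" links to a blacklisted page, "friend" links to "shill".
links = {
    "spam": [],
    "shill": ["spam"],
    "friend": ["shill"],
    "honest": ["news"],
    "news": [],
}
bad = bad_rank(links, blacklist={"spam"})
for page in sorted(bad, key=bad.get, reverse=True):
    print(page, round(bad[page], 4))
```

Pages that refer to spam directly are penalized most, pages two links away somewhat less, and pages with no path to a blacklisted page are unaffected.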

17 Search Engine Friendly Pages
Good ways to get your page indexed and ranked highly:
Use straightforward URLs, with simple structure, which do not change with time.
Submit your site to be crawled.
Provide a site map of the pages that you wish to be crawled.
Have the words that you would expect to see in queries:
- in the content of your pages.
- in and tags
Attempt to have links to your page from appropriate authorities.
Avoid suspicious behavior.

18 Adding audience information to ranking
Conventional information retrieval: A given query returns the same set of hits, ranked in the same sequence, irrespective of who submitted the query.
If the search service has information about the user: The results set and/or the ranking can be varied to match the user's profile.
Example: In an educational digital library, the order of search results can be varied for:
instructor v. student
grade level of course

19 Adding audience information to ranking
Metadata based methods:
Label documents with controlled vocabulary to define intended audience.
Provide users with means to specify their needs, either through a profile (preferences) or by a query parameter.
Automatic methods:
Capture persistent information about user behavior.
Adjust tf.idf rankings using terms derived from user behavior.
Data-mining to capture user information raises privacy concerns.
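One simple way to adjust tf.idf rankings with terms derived from user behavior is to add a weighted second score computed from profile terms. The sketch below is an assumption about how this could be done, not the method of any particular service; the documents, the profile terms, and the weight of 0.5 are invented for illustration.

```python
import math
from collections import Counter

def tfidf_scores(terms, docs):
    """docs maps doc id -> list of tokens. Returns doc id -> sum of tf * idf."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(tf[t] * math.log(n / df[t]) for t in terms if df[t])
    return scores

def personalized_scores(query_terms, profile_terms, docs, profile_weight=0.5):
    """Query score plus a weighted score for terms taken from the user's profile."""
    base = tfidf_scores(query_terms, docs)
    boost = tfidf_scores(profile_terms, docs)
    return {d: base[d] + profile_weight * boost[d] for d in docs}

# Hypothetical educational digital library.
docs = {
    "lesson": "photosynthesis lesson plan for grade five teachers".split(),
    "paper": "photosynthesis research paper on chloroplast proteins".split(),
    "recipe": "bread recipe with yeast".split(),
}
plain = tfidf_scores(["photosynthesis"], docs)
teacher = personalized_scores(["photosynthesis"], ["lesson", "teachers"], docs)
print(plain)
print(teacher)
```

Without a profile the lesson plan and the research paper tie for the query "photosynthesis"; with an instructor's profile terms the lesson plan moves ahead, while the unrelated document still scores zero.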

20 How many of these services collect information about the user?

21 Other Uses of Web Crawling and Associated Technology
The technology developed for Web search services has many other applications. Conversely, technology developed for other Internet applications can be applied in Web searching.
Related objects (e.g., Amazon's "Other people bought the following").
Recommender and reputation systems (e.g., Epinions' reputation system).

22 Context: Image Searching HTML source From the Information Science web site Captions and other adjacent text on the web page
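Since a crawler cannot read the image itself, the searchable context comes from the HTML around it. A minimal sketch using Python's standard html.parser module that collects the alt text and caption for each image; the sample markup is hypothetical, and treating a figcaption element as the caption is an assumption about how a page might be marked up.

```python
from html.parser import HTMLParser

class ImageContext(HTMLParser):
    """Collect src, alt text, and caption text for each <img> in a page."""

    def __init__(self):
        super().__init__()
        self.images = []          # one dict per image, in document order
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.images.append(
                {"src": a.get("src", ""), "alt": a.get("alt", ""), "caption": ""}
            )
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        # Attach caption text to the most recently seen image.
        if self._in_caption and self.images:
            self.images[-1]["caption"] += data

# Hypothetical page fragment.
page = (
    '<figure><img src="building.jpg" alt="Campus building">'
    "<figcaption>Entrance to the department</figcaption></figure>"
)
parser = ImageContext()
parser.feed(page)
print(parser.images)
```

The collected alt text and caption can then be indexed as if they were the text of the image.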

23 Browsing
Users give queries of 2 to 4 words.
Most users click only on the first few results; few go beyond the fold on the first page.
80% of users use a search engine to find sites:
search to find site
browse to find information
Amit Singhal, Google, 2004
Browsing is a major topic in the lectures on Usability.

24 Evaluation: Web Searching
Test corpus must be dynamic
– The web is dynamic: 10%-20% of URLs change every month
– Spam methods change continually
Queries are time sensitive
– Topics are hot and then not
– Need to have a sample of real queries
Languages
– At least 90 different languages
– Reflected in cultural and technical differences
Amit Singhal, Google, 2004