
Web Crawlers IST 497 Vladimir Belyavskiy 11/21/02

Overview: Introduction to crawlers; focused crawling; issues to consider; parallel crawlers; ambitions for the future; conclusion.

Introduction: What is a crawler, and why are crawlers important? Crawlers are used by many; their main use is to create indexes for search engines. A tool was needed to keep track of web content: in March of 2002 there were 38,118,962 web sites, and the Web has doubled in size in less than two years.

Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do; a crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links. Following links isn't greatly useful in itself, of course: the list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but crawlers are also used for other purposes.
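The loop described above can be made concrete with a short sketch. This is illustrative code, not part of the original presentation; the 100-page limit, 10-second timeout, and use of Python's standard library are assumptions.

```python
# A minimal breadth-first crawl loop: fetch a page, extract its links,
# and add them to the frontier of pages still to visit.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=100):
    """Follow links starting from `seed`, visiting at most `max_pages` pages."""
    frontier = deque([seed])              # URLs waiting to be fetched (FIFO queue)
    visited = set()                       # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                      # skip unreachable or malformed URLs
        visited.add(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))   # resolve relative links
    return visited
```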

Focused Crawling: A focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics. Topics are specified using exemplary documents (not keywords). The crawler follows the most relevant links and ignores irrelevant parts of the Web, which leads to significant savings in hardware and network resources.
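One way to picture the relevance test is sketched below. The slides do not say how relevance is computed; the bag-of-words cosine similarity and the 0.2 threshold here are assumptions chosen only for illustration.

```python
# Hypothetical relevance check for a focused crawler: a page is crawled only
# if it is sufficiently similar to at least one exemplary document.
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_relevant(page_text, exemplars, threshold=0.2):
    """True if the page resembles at least one exemplary document (assumed threshold)."""
    page_vec = Counter(page_text.lower().split())
    return any(cosine(page_vec, Counter(doc.lower().split())) >= threshold
               for doc in exemplars)
```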

Issues to consider: Where to start crawling? One approach is keyword search: the user specifies keywords, the crawler searches for pages matching the given criteria, and popular seed sites are found using weighted degree measures; this approach was used for 966 Yahoo category searches (e.g. Business/Electronics). Another approach is user input: the user gives example documents and the crawler compares pages against them to find matches.
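The slides do not define the weighted degree measure, so the sketch below is only a guess at its shape: candidate seed sites are scored by a weighted combination of in-links and out-links, and the top scorers become seeds. The 0.7/0.3 weights and the function names are hypothetical.

```python
# Hypothetical weighted-degree scoring for choosing seed pages.
def weighted_degree(in_degree, out_degree, w_in=0.7, w_out=0.3):
    """Score a candidate site; the weights are illustrative only."""
    return w_in * in_degree + w_out * out_degree


def pick_seeds(candidates, k=5):
    """candidates maps URL -> (in_degree, out_degree); return the k best seeds."""
    return sorted(candidates,
                  key=lambda url: weighted_degree(*candidates[url]),
                  reverse=True)[:k]
```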

Issues to consider: URLs found are stored in a queue, stack, or deque. Which link do you crawl next? One ordering metric is breadth-first: URLs are placed in the queue in the order they are discovered, so the first link found is the first to be crawled.
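A tiny illustration of how the frontier data structure determines the crawl order (illustrative code, with placeholder page names):

```python
# FIFO queue -> breadth-first order; LIFO stack -> depth-first order.
from collections import deque

discovered = ["page1.html", "page2.html", "page3.html"]   # links in discovery order

queue = deque(discovered)
first_bfs = queue.popleft()    # breadth-first: "page1.html", the earliest link found

stack = list(discovered)
first_dfs = stack.pop()        # depth-first: "page3.html", the most recently found link
```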

Issues to consider: Backlink count counts the number of links pointing to a page, and the site with the greatest number of links is given priority. PageRank also counts backlinks, but backlinks from popular pages are given extra value (e.g. a link from Yahoo); it works the best.
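The two metrics can be sketched as follows. This is illustrative code: the graph representation, the 0.85 damping factor, and the iteration count are assumptions, and the PageRank loop is the standard iterative formulation rather than anything specific to the presentation.

```python
# Backlink count and a basic iterative PageRank over a link graph.
# `graph` maps every page to the list of pages it links to, and every page
# is assumed to appear as a key.
def backlink_count(graph):
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts


def pagerank(graph, d=0.85, iters=20):
    """Pages whose backlinks come from popular pages end up with higher scores."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in graph}
        for p, targets in graph.items():
            if targets:
                share = d * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                for q in new:                     # spread rank of dangling pages
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```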

Issues to consider: What pages should the crawler download? There is not enough space or time to fetch everything. How is content kept fresh? Fixed order: an explicit list of URLs to visit. Random order: start from a seed and follow links. Purely random: refresh pages on demand. In most cases, the crawler cannot download all pages on the Web; even the most comprehensive search engine currently indexes only a small fraction of the entire Web [LG99, BB99]. Given this fact, it is important for the crawler to carefully select the pages and to visit "important" pages first, so that the fraction of the Web that is visited (and kept up-to-date) is more meaningful. Once the crawler has downloaded a significant number of pages, it has to start revisiting them in order to detect changes and refresh the downloaded collection. Because Web pages change at very different rates [CGM00a, WM99], the crawler needs to decide carefully which pages to revisit and which pages to skip in order to achieve high "freshness" of pages. For example, if a certain page rarely changes, the crawler may want to revisit it less often, in order to visit more frequently changing ones. What counts as a change is defined by the user, e.g. a 30% change in a page, or changes in 3 different columns.

Issues to consider: Estimate the frequency of changes. Visit pages once a week for five weeks, estimate each page's change frequency, and adjust the revisit frequency based on the estimate. This is the most effective method.
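A deliberately simple sketch of that idea follows; the naive estimator (changes observed divided by visits) and the 1-to-30-day bounds are assumptions standing in for the more careful estimators in the crawling literature.

```python
# Estimate how often a page changes from weekly observations, then derive a
# revisit interval: frequently changing pages are revisited more often.
def estimate_change_rate(change_history):
    """change_history[i] is True if the page had changed at weekly check i."""
    return sum(change_history) / len(change_history) if change_history else 0.0


def revisit_interval_days(change_rate, min_days=1, max_days=30):
    """Pages that changed on every visit get min_days; static pages get max_days."""
    if change_rate <= 0:
        return max_days
    return max(min_days, min(max_days, round(7 / change_rate)))
```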

Issues to consider: How to minimize the load on visited sites? The crawler should obey the constraints a site publishes, such as crawler-related HTML tags and the robots.txt file (for example, "User-Agent: *" followed by "Disallow: /" blocks all crawlers), and it should avoid spider traps. When the crawler collects pages from the Web, it consumes resources belonging to other organizations [Kos95]. For example, when the crawler downloads page p on site S, the site needs to retrieve page p from its file system, consuming disk and CPU resources. After this retrieval, the page needs to be transferred through the network, which is another resource shared by multiple organizations. Therefore, the crawler should minimize its impact on these resources [Rob]; otherwise, the administrators of a Web site or of a particular network may complain and sometimes may completely block access by the crawler.
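Python's standard library already covers the robots.txt part of this; the sketch below checks a placeholder URL against the site's rules and adds a crude fixed delay between requests. The example.com URL and the 2-second delay are assumptions.

```python
# Politeness checks: consult robots.txt and pause between requests to a host.
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

url = "https://example.com/some/page.html"
if rp.can_fetch("*", url):     # obey the site's Disallow rules for our user agent
    time.sleep(2)              # crude per-host delay to limit server load
    # ... fetch the page here ...
else:
    print("robots.txt forbids crawling", url)
```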

Parallel Crawlers: The Web is too big to be crawled by a single crawler, so the work should be divided. Independent assignment: each crawler starts with its own set of URLs and follows links without consulting other crawlers. This reduces communication overhead, but some overlap is unavoidable.

Parallel Crawlers: Dynamic assignment: a central coordinator divides the Web into partitions, crawlers crawl their assigned partition, and links to URLs outside a partition are handed back to the central coordinator. Static assignment: the Web is partitioned up front and divided among the crawlers, and each crawler only crawls its own part of the Web. In dynamic assignment, the central coordinator may become a major bottleneck, because it has to maintain a large number of URLs reported from all C-procs and has to constantly coordinate them. For static assignment, the user must know in advance what they want to crawl, and they may not know all the desired domains.
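Static assignment is often realized by hashing, as in the sketch below: the host name of each URL is hashed to a crawler index, so every crawler can decide locally whether a discovered link belongs to it. The hashing scheme is an assumption, not something the slides specify.

```python
# Static assignment of URLs to parallel crawler processes by hashing the host.
from hashlib import md5
from urllib.parse import urlparse


def partition(url, num_crawlers):
    """Map a URL to a crawler index based on its host name."""
    host = urlparse(url).netloc
    return int(md5(host.encode()).hexdigest(), 16) % num_crawlers


def belongs_to_me(url, my_id, num_crawlers):
    """True if this crawler process is responsible for the URL's partition."""
    return partition(url, num_crawlers) == my_id
```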

Evaluation: Content quality is better for a single-process crawler; with multiple processes there is either overlap or they do not cover all of the content. Overall, crawlers are useful tools. The content quality of a parallel crawler may be worse than that of a single-process crawler, because many importance metrics depend on the global structure of the Web (e.g. backlink count): each C-proc in a parallel crawler may know only the pages it has downloaded itself and may make a poor crawling decision based solely on its own pages. In contrast, a single-process crawler knows all the pages it has downloaded and can make a more informed decision. Also, certain parts of a domain can only be reached from other domains; if a crawler is not allowed to access the other domain, it will not be able to crawl those documents.

Future: Query interface pages; detect web page changes better (e.g. http://www.weatherchannel.com); separate dynamic from static content; share data better between servers and crawlers. As more and more pages are dynamically generated, some pages are "hidden" behind a query interface and are reachable only when the user issues keyword queries to that interface. In order to crawl them, the crawler has to figure out what keywords to issue; it can use the context of pages to guess the keywords and retrieve the data. Some web pages change only in certain sections. For example, on eBay prices change frequently but the product description does not; crawlers should ignore changes in the dynamic portion, since it is irrelevant to the description of the page, and thereby save resources by not downloading pages all the time. A mechanism also needs to be developed that allows crawlers to subscribe to the changes they are interested in. Both servers and crawlers would benefit if the changes made on the server were published: the crawler could make better crawling decisions, the amount of information the crawler needs to save would be limited, and traffic on the server would be reduced.
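A mechanism along these lines already exists in plain HTTP: a conditional GET. The sketch below sends If-Modified-Since with the time of the last visit, and the server replies 304 with no body if the page is unchanged. This is a generic HTTP example, not the subscription mechanism the slide proposes, and the URL and date are placeholders.

```python
# Conditional GET: avoid re-downloading a page that has not changed.
from urllib.error import HTTPError
from urllib.request import Request, urlopen

req = Request("https://example.com/page.html",                 # placeholder URL
              headers={"If-Modified-Since": "Thu, 21 Nov 2002 00:00:00 GMT"})
try:
    body = urlopen(req, timeout=10).read()   # 200: the page changed, re-index it
except HTTPError as err:
    if err.code == 304:
        body = None                          # unchanged since the last crawl
    else:
        raise
```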

Bibliography: Cheng, Rickie & Kwong, April (April 2000), http://sirius.cs.ucdavis.edu/teaching/289FSQ00/project/Reports/crawl_init.pdf. Cho, Junghoo (2002), http://rose.cs.ucla.edu/~cho/papers/cho-thesis.pdf. Dom, Brian (March 1999), http://www8.org/w8-papers/5a-search-query/crawling/. Polytechnic University, CIS Department, http://hosting.jrc.cec.eu.int/langtech/Documents/Slides-001220_Scheer_OSILIA.pdf

The End Any Questions?