
Optimal Crawling Strategies for Web Search Engines (Wolf, Sethuraman, Ozsen). Presented by Rajat Teotia.

Overview:
- Why do we care?
- Purpose of the paper
- Proposed solution for optimal crawling
- Pros and cons
- Current trends

Why Do We Care?
- Search engines use crawlers, in an automated manner, to build local repositories of web pages.
- These local copies are used for later processing, such as building the index and running ranking algorithms.
- Because of the dynamic nature of websites, web pages are updated frequently.
- To keep the local copies of these pages fresh, we need an efficient crawling mechanism.

Purpose of the Paper
This paper provides efficient solutions to:
1. The optimal crawling frequency problem.
2. The crawl scheduling problem.
3. Minimizing the average level of staleness over all web pages.
4. Minimizing the search engine embarrassment level metric.
5. Using efficient resource allocation algorithms to achieve an optimal crawling mechanism.

Solution: Minimize Staleness over All Web Pages
- The size of the web is estimated at more than 10 billion pages, and studies suggest that around 25%-30% of web pages change daily.
- To maintain a fresh web page repository, an efficient crawling algorithm must be used.
- Two main aspects of building an efficient crawling algorithm:
  1) Optimal frequency: the number of crawls for each web page over a fixed period of time, and the ideal times between those crawls.
  2) Efficient scheduling of these crawling processes.
- Update patterns vary: some pages are updated in a quasi-deterministic manner, while others tend to be updated according to a Poisson process.

Solution: Optimal Frequency Problem
- Compute a probability function that captures whether the search engine holds a stale copy of web page i at an arbitrary time t in the interval [0, T].
- From this, compute a time-average staleness estimate by averaging the probability over all t in [0, T].
- Choose crawl times that minimize this time-average staleness estimate (formalized in the sketch below).
- Weight pages by importance, reflecting how likely they are to appear in query results; this is what the search engine embarrassment metric captures.
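As a concrete illustration (my sketch, not taken from the slides): if page i is assumed to follow a Poisson update process with rate \lambda_i and its crawl times in [0, T] are \tau_1 < \dots < \tau_k, then the staleness probability and its time average are

$$p_i(t) = 1 - e^{-\lambda_i (t - \tau_j)}, \qquad \tau_j \le t < \tau_{j+1},$$

$$\bar{p}_i = \frac{1}{T} \int_0^T p_i(t)\, dt.$$

The optimal-frequency problem then amounts to choosing k and the crawl times \tau_j so that a weighted sum of the \bar{p}_i over all pages is minimized.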

Search Engine Embarrassment Level Metric
- Measures how often a client issues a query and is served a result page that is inconsistent with its current content. Four cases (summarized compactly below):
- Case 1: the lucky case; a stale page is not returned to the user.
- Case 2: the unlucky case; a stale page is returned but the user does not click it.
- Case 3: a stale page is returned and clicked, but the user still finds content that answers the query.
- Case 4: a stale page is returned and clicked, and its content is inconsistent with respect to the query; this is the case that embarrasses the search engine.
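One compact way to write the metric (a sketch; the paper's exact weighting may differ) is to weight each page's time-average staleness by how often that page is returned for a query and clicked:

$$E = \sum_i w_i \, \bar{p}_i,$$

where w_i is the estimated rate at which page i is served and clicked, and \bar{p}_i is the time-average staleness defined above. Only Case 4 events embarrass the engine, which is why click behavior enters the weights.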

Solution: Greedy Approach for Resource Allocation
- The probability of a page being clicked is tied to its position in the results, i.e., to the weight of the page.
- In the quasi-deterministic update case, a crawl should be scheduled at each potential update time.
- To solve the resource allocation problem, i.e., to find the optimal number of crawls per page, the authors use dynamic programming and greedy algorithms.
- The optimal crawl count is searched for between a minimum and a maximum bound (a greedy sketch follows this slide).
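To make the allocation step concrete, here is a minimal greedy sketch. It is mine, not the paper's exact algorithm: it assumes Poisson updates, evenly spaced crawls, and a crawl budget of at least one per page; average_staleness, greedy_allocate, and all parameter names are illustrative.

```python
import heapq
import math

def average_staleness(lam, x, T):
    """Time-average staleness of a page with Poisson update rate lam
    that is crawled x times, evenly spaced over [0, T]."""
    dt = T / x
    # Average of 1 - exp(-lam * t) over one inter-crawl interval of length dt.
    return 1.0 - (1.0 - math.exp(-lam * dt)) / (lam * dt)

def greedy_allocate(weights, rates, total_crawls, T):
    """Hand out crawls one at a time, always to the page whose weighted
    staleness drops the most from one extra crawl. Because the marginal
    gain shrinks as crawls accumulate, this greedy choice is optimal
    for this separable objective."""
    n = len(weights)
    crawls = [1] * n  # start with one crawl per page

    def gain(i):
        cur = average_staleness(rates[i], crawls[i], T)
        nxt = average_staleness(rates[i], crawls[i] + 1, T)
        return weights[i] * (cur - nxt)

    heap = [(-gain(i), i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(total_crawls - n):
        _, i = heapq.heappop(heap)
        crawls[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return crawls

# A heavily weighted, fast-changing page should receive most of the crawls.
print(greedy_allocate(weights=[5.0, 1.0], rates=[2.0, 0.5], total_crawls=10, T=1.0))
```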

Solution: Optimal Scheduling Problem
- Given the number of crawls needed to keep each page fresh over a time period T, the problem is to decide the optimal times at which those crawls occur.
- In most cases, scheduling a crawl a bit early or a bit late does not hurt performance much.
- For a quasi-deterministic update process, however, being late is acceptable but being early is useless: the update has not happened yet.
- This scheduling problem can be posed and solved as a transportation / network flow problem, depicted as a bipartite network graph with flow on one side.

Solution: Optimal Scheduling Problem (continued)
- Let C be the total number of crawlers and S the number of crawl tasks in time period T.
- Each crawl-task node has a supply of 1 unit, and there is one demand node per (time slot, crawler) pair.
- Tasks are indexed by 1 ≤ l ≤ S and crawlers by 1 ≤ k ≤ C, where k identifies an individual crawler and l a crawl task.
- Solving this transportation problem guarantees an optimal integral flow, i.e., an optimal crawl schedule (a toy formulation follows this slide).
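A toy version of this formulation (my sketch, not the paper's code): enumerate one column per (time slot, crawler) pair, give each task a cost that grows with the distance from its ideal crawl time, and solve the resulting assignment problem with the Hungarian method. The penalty shape, where crawling early is heavily punished for quasi-deterministic pages, is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Crawl tasks with ideal crawl times; quasi-deterministic tasks should
# not be crawled before their expected update time.
ideal_times = np.array([1.0, 2.0, 2.0, 4.0])      # one entry per task
quasi_det   = np.array([False, True, False, True])

num_crawlers, num_slots = 2, 4
slot_times = np.arange(1.0, num_slots + 1)        # time of each slot

# Columns enumerate (slot, crawler) pairs; rows are crawl tasks.
col_times = np.repeat(slot_times, num_crawlers)
cost = np.abs(ideal_times[:, None] - col_times[None, :])
# Make "early" crawls prohibitively expensive for quasi-deterministic tasks.
early = col_times[None, :] < ideal_times[:, None]
cost = np.where(quasi_det[:, None] & early, 1e6, cost)

rows, cols = linear_sum_assignment(cost)
for task, col in zip(rows, cols):
    print(f"task {task} -> slot t={col_times[col]}, crawler {col % num_crawlers}")
```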

Parameterization Issues for the Update Process
- The time of the last crawl by itself says nothing about updates that have occurred since that crawl.
- Crawl times, observed update patterns, and page data can be combined to estimate the statistical properties of the update process.
- This information can then be used to build a probability distribution for the inter-update times of any page (a simple estimator is sketched below).
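For instance, a minimal sketch under the Poisson assumption (one of the update models the paper considers): if a page was checked on n equally spaced visits and found changed on x of them, the maximum-likelihood rate estimate follows from P(change within one interval) = 1 - e^{-\lambda \Delta}. The function name and parameters below are mine.

```python
import math

def estimate_update_rate(n_visits, n_changes, interval):
    """MLE of a Poisson update rate from repeated crawls at a fixed
    interval: solve 1 - exp(-lam * interval) = n_changes / n_visits."""
    if n_changes >= n_visits:
        raise ValueError("change seen on every visit; rate estimate is unbounded")
    return -math.log(1.0 - n_changes / n_visits) / interval

# Changed on 6 of 30 daily crawls -> roughly 0.22 updates per day.
print(estimate_update_rate(30, 6, interval=1.0))
```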

Pros:
- Precisely describes an optimal crawling process for reducing the staleness of web pages.
- Gives a good introduction to the search engine embarrassment metric.
- Provides dynamic-programming schemes for computing the optimal number of crawls for a dynamic page.
- Gives a clear picture of the optimal crawling schedule.

Cons:
- The research data is quite dated, and much progress has been made since then.
- No strategy is proposed for handling content replication.
- The rise of blogs, forums, and social networking sites has changed the way page weights are calculated.

Conclusion:
- The crawling process can improve the quality of service a search engine provides.
- The optimal crawling process and its scheduling algorithm play a vital role in determining the quality and freshness of web pages.
- The overall objective is to reduce the search engine embarrassment metric and to provide the best possible search results.

Further Research:
- Event-driven web page crawlers, able to fetch Ajax-based content.
- Adaptive, model-based crawling strategies: fixed-order vs. random-order crawling.
- Implementing ranking-based crawling strategies.
- Formulating crawling strategies that account for page replication, so the crawling workload can be reduced.

Current Trends:
- Building adaptive, model-based web crawlers.
- Using separate crawling strategies for finding fresh pages and for deep crawls (e.g., Google's organic crawl): a fresh bot fetches fresh pages, while a deep-crawl bot indexes all web pages.
- Duplicate-content-aware crawling to reduce crawling load.

Current Trends (continued):
- URL ordering and queuing based on priority.
- Context-focused crawling for better results.
- Distributed crawling and multi-threaded crawlers.
- Crawling and real-time web search.

References
- J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. "Optimal crawling strategies for web search engines." Proceedings of the 11th International Conference on World Wide Web (WWW 2002), May 7-11, 2002, Honolulu, Hawaii, USA.
- J. Edwards, K. S. McCurley, and J. A. Tomlin (2001). "An adaptive model for optimizing performance of an incremental web crawler." Proceedings of the Tenth Conference on World Wide Web, Hong Kong: Elsevier Science, pp. 106–113.
- M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori (2000). "Focused crawling using context graphs." Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt.
- G. Pant, P. Srinivasan, and F. Menczer (2004). "Crawling the Web." In M. Levene and A. Poulovassilis (eds.), Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer, pp. 153–178.
- Articles from Search Engine Journal, Search Engine Roundtable, and Wikipedia.

Q & A