SLASHPack Collector - 5/4/2006
SLASHPack: Collector Performance Improvement and Evaluation
Rudd Stevens, CS 690, Spring 2006



Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.


Introduction
- SLASHPack Toolkit (Semi-LArge Scale Hypertext Package).
- Sponsored by Prof. Chris Brooks; engineered for initial clients Nancy Montanez and Ryan King.
- Collector component: a framework for collecting documents.
- Goal: evaluate and improve its performance.

Contact and Information Sources
Contact Information:
- Rudd Stevens, rstevens (at) cs.usfca.edu
Project Website:
Project Sponsor:
- Professor Christopher Brooks, Department of Computer Science, University of San Francisco, cbrooks (at) cs.usfca.edu

Stages
1. Add a protocol module for the Weblog data set.
2. Performance testing using the Weblog and HTTP modules; identify problem areas.
3. Modify the Collector to improve scalability and performance.
4. Repeat the performance testing and evaluate the performance improvements.

Implementation
- Language: Python (2.4 or later).
- Platform: any Python-supported OS. (Developed and tested under Linux.)
- Progress: fully built, newly re-factored for performance and usability.

High-Level Design
- SLASHPack is designed as a framework.
- Modular components that contain sub-modules.
- The Collector is pluggable: protocol modules, parsers, filters, output writers, etc.
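
The pluggable design described above could be sketched as a small base class that protocol modules implement; the class and method names below are illustrative assumptions (written as a Python 3 sketch), not the toolkit's actual API.

```python
class ProtocolModule:
    """Hypothetical base interface a Collector protocol module implements."""

    def fetch(self, url):
        """Retrieve one document; return (status, content)."""
        raise NotImplementedError


class HttpModule(ProtocolModule):
    def fetch(self, url):
        # Stub: a real module would issue an HTTP request here.
        return ("200", "<html>stub content for %s</html>" % url)


def collect(module, urls):
    # The Collector core talks only to the abstract interface,
    # so protocol modules can be swapped without touching it.
    return [module.fetch(u) for u in urls]


results = collect(HttpModule(), ["http://example.com/"])
print(results[0][0])  # 200
```

The same shape would let a Weblog-file module or any other data source plug in beside HTTP.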

High-Level Design (cont.)
[architecture diagram]

Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.

Performance Testing
Large-scale text collection:
- Weblog data set.
- Long web crawls.
Performance-test monitoring:
- Python profiling.
- Integrated statistics.
Functionality testing:
- Python logging.
- Functionality test runs.
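
Python's built-in profiler (the "Python profiling" above) can be driven programmatically; a minimal sketch, where the workload function is a stand-in for the Collector's document-processing loop, not actual Collector code:

```python
import cProfile
import io
import pstats

def process_documents(n):
    # Stand-in workload for the document-processing loop.
    return sum(len(str(i)) for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
total = process_documents(50000)
profiler.disable()

# Capture the top few entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(3)
report = buf.getvalue()
print(total)
```

Sorting by cumulative time is what surfaces hot spots like the robots look-up discussed later.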

Collector Runtime Statistics
UrlFrontier
- URL Frontier size, current number of links: 3465
- URLs requested from frontier: 659
- URL Frontier, current number of server queues: 78
- URLs delivered from frontier: 639
Collector
- Documents per second:
- Total runtime: 2 minutes, seconds
UrlSentry
- URLs filtered using robots: 38
- URLs filtered for depth: 9
- URLs processed: 5881
- URLs filtered using filters: 165
UrlBookkeeper
- Duplicate URLs: 1557
- URLs recorded: 4104

Collector Runtime Statistics (cont.)
DocFingerprinter
- Documents written: 386
- Average document size (bytes):
- HTTP status responses: 200: : : 8302: : 91403: 7 401: 1 400: : 1
- Duplicate documents: 51
- Total documents collected: 561
- Documents by mimetype: text/xml: 1, image/jpeg: 1, text/html: 451, image/gif: 1, text/plain: 106, application/octet-stream: 1

Challenges
Large text (XML) files:
- 21 1-GB XML files.
- ~450,000 files per XML file.
- ~10 million files after processing.
Memory/Storage:
- Disk space.
- Memory usage during (XML) processing.
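
Gigabyte-scale XML is the classic case for stream parsing rather than building a full in-memory tree; a Python 3 sketch using xml.etree.ElementTree.iterparse, where the <post> element name is an invented stand-in for the Weblog schema:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a multi-gigabyte Weblog dump; the <post>/<title>
# element names are assumptions, not the actual schema.
xml_data = "<posts>" + "".join(
    "<post><title>t%d</title></post>" % i for i in range(3)
) + "</posts>"

titles = []
for event, elem in ET.iterparse(io.StringIO(xml_data), events=("end",)):
    if elem.tag == "post":
        titles.append(elem.findtext("title"))
        elem.clear()  # release the element so memory stays bounded

print(titles)  # ['t0', 't1', 't2']
```

Clearing each element after use is what keeps memory flat no matter how large the input file is.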

Weblog Raw Data
""Evolve!"" Flickr Darwin (chuckdarwin)
<html><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><title>""Evolve!""</title></head><body>
<div style="text-align: center;"><font size="+1"><a href=" 1cfcd b1a&amp;ex= &amp;partner=rssnyt&amp;emc=rss&amp;pagewanted=print">7/7 and 9/11?</a></font></div></body></html>
Press

Weblog Processed Data
WeblogPosts
""Evolve!"" Flickr Darwin (chuckdarwin) Press
text/plain
9949bba4ac535d18c3f11db66cdb194e
Jmx0O2h0bWwmZ3Q7CiZsdDtoZWFkJmd0OwombHQ7bWV0YSBjb250ZW50P ...
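
The processed record stores an md5 fingerprint plus the content HTML-escaped and base64-encoded; a Python 3 sketch of that transformation (the exact SLASHPack pipeline may differ), which reproduces the "Jmx0O2h0bWwmZ3Q7" prefix visible in the sample above:

```python
import base64
import hashlib
import html

raw = "<html>"  # start of a raw weblog post

escaped = html.escape(raw)  # "&lt;html&gt;"
encoded = base64.b64encode(escaped.encode("utf-8")).decode("ascii")
# Assumption for illustration: fingerprint taken over the raw content.
fingerprint = hashlib.md5(raw.encode("utf-8")).hexdigest()

print(encoded)  # Jmx0O2h0bWwmZ3Q7
```

Escaping before encoding explains why the stored blob decodes to "&lt;html&gt;..." rather than raw markup.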

Original Design
[architecture diagram]

Problems to Address
- Overall collection performance: streamline processing.
- Robots-file look-up: incredibly slow and inefficient. (Not mine!)
- Thread interaction: efficient use of threads and queues to process data.
- Inefficient code: Python code is not always the fastest; miniDom XML parsing.
- Faster data structures: re-work the collection protocols and DNS prefetch; re-structure the URL Frontier and URL Bookkeeper.
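
A common fix for slow robots handling is to parse each host's robots.txt once and cache the parser; a Python 3 sketch using the standard library's urllib.robotparser (the caching layer is an assumption for illustration, not SLASHPack's code):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotCache:
    """Cache one parsed robots.txt per host so repeated look-ups are cheap."""

    def __init__(self):
        self._parsers = {}

    def allowed(self, url, agent="SLASHPack"):
        host = urlsplit(url).netloc
        if host not in self._parsers:
            rp = RobotFileParser()
            # Offline sketch: feed rules directly instead of rp.read(),
            # which would fetch http://<host>/robots.txt over the network.
            rp.parse(["User-agent: *", "Disallow: /private/"])
            self._parsers[host] = rp
        return self._parsers[host].can_fetch(agent, url)

cache = RobotCache()
print(cache.allowed("http://example.com/index.html"))  # True
print(cache.allowed("http://example.com/private/x"))   # False
```

With the cache in place, only the first URL per host pays the parse cost; every later check is a dictionary hit.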

New Design
[architecture diagram]

Performance Modifications
- Structure re-design (threading): more queues, more independence.
- Robot parser: string creation, debug calls.
- URL Frontier: more efficient data structures.
- Protocol modules: more efficient data structures; re-factoring for reliable collection.
- XML parsing: switch to a faster parser; removal of the DOM parser.
- DNS pre-fetching: more efficient structuring.
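
The "more queues, more independence" restructuring is essentially a producer/consumer pipeline; a minimal Python 3 sketch with queue.Queue and worker threads (the stage names are illustrative, not the Collector's):

```python
import queue
import threading

url_queue = queue.Queue()
doc_queue = queue.Queue()

def fetch_worker():
    # Fetch stage: pull URLs, emit (url, content) pairs downstream.
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: shut this worker down
            break
        doc_queue.put((url, "content of %s" % url))
        url_queue.task_done()

workers = [threading.Thread(target=fetch_worker) for _ in range(2)]
for w in workers:
    w.start()

for i in range(5):
    url_queue.put("http://example.com/%d" % i)
for _ in workers:      # one sentinel per worker
    url_queue.put(None)
for w in workers:
    w.join()

print(doc_queue.qsize())  # 5
```

Because each stage only touches its own queues, stages can be tuned or replaced independently, which is the independence the slide refers to.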

New Data Structures
Dictionary fields for the Base data type (must be implemented by any data protocol). Documents are now passed as a dictionary to the storage component.

Key          Value                         Type
datatype     user-defined datatype name    string
status       HTTP document status          string
url          URL of document               string
date         collection date               string
crawlname    name of current crawl         string
size         byte length of content        string
mimetype     MIME type of document         string
fingerprint  md5sum hash of content        string
content      raw text of document          string
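
A sketch of building that base dictionary in Python 3; the helper function and its defaults are hypothetical, but the field names follow the table above:

```python
import hashlib
from datetime import datetime, timezone

def make_document(url, content, status="200", crawlname="testcrawl",
                  mimetype="text/html", datatype="http"):
    # Assemble the base dictionary a protocol module would hand to the
    # storage component. Hypothetical helper, not the toolkit's code.
    return {
        "datatype": datatype,
        "status": status,
        "url": url,
        "date": datetime.now(timezone.utc).isoformat(),
        "crawlname": crawlname,
        "size": str(len(content)),
        "mimetype": mimetype,
        "fingerprint": hashlib.md5(content.encode("utf-8")).hexdigest(),
        "content": content,
    }

doc = make_document("http://example.com/", "<html>hello</html>")
print(doc["size"], doc["mimetype"])  # 18 text/html
```

Passing a flat dictionary keeps the storage component oblivious to which protocol produced the document.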

Outline
1. Introduction, system overview and design.
2. Performance modifications, re-factoring and re-structuring.
3. Performance testing results and evaluation.

Performance Comparison
Initial results:
- Weblog data set, w/o parsing, robots: 161 doc/s, 50 min.
- Weblog data set, w/ parsing, robots: 3.9 doc/s, 162 min (killed).
- HTTP web crawl, 100 docs w/ parsing, robots: 0.2 doc/s, 16 min 13 s.
- HTTP web crawl, 150 docs w/ parsing, robots: 0.3 doc/s, 21 min 3 s.
Modified results:
- Weblog data set, w/o parsing, robots: 170 doc/s, 42 min.
- Weblog data set, w/ parsing, robots: 186 doc/s, 63 min.
- HTTP web crawl, 100 docs w/ parsing, robots: 2.2 doc/s, 1 min 10 s.
- HTTP web crawl, 150 docs w/ parsing, robots: 2.9 doc/s, 1 min 14 s.

Performance Comparison (cont.)
Hardware considerations (HTTP web crawl for 500 documents):
- Pentium 4, 2.4 GHz, 1 GB RAM: 3.7 doc/s, 3 min 18 s, 728 docs total (faster connection).
- Pentium 4, 2.0 GHz, 1 GB RAM: 3.7 doc/s, 4 min 25 s, 725 docs total.
- Pentium 4, 3.2 GHz HT, 2 GB RAM: 4.3 doc/s, 2 min 47 s, 717 docs total (faster connection).

Performance Comparison (cont.)
Comparison to other web crawlers (published results, 1999):
- Google: 33.5 doc/s
- Internet Archive: 46.3 doc/s
- Mercator: 112 doc/s
Consideration of functionality:
- More than just a web crawler.
- Handles multiple MIME types.

Available Documentation
- Pydoc API, generated with Epydoc.
- Use and configuration guide (README), including a quick-start guide.
- Full report: complete specification of the Collector, its use, configuration, and development background.

Future Work
- Addition of pluggable modules.
- Improved fingerprint sets.
- Improved Python memory management and threading.

References
- Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler.
- Soumen Chakrabarti. Mining the Web. Ch. 2.
- Heritrix, Internet Archive.
- Python Performance Tips.
- Prof. Chris Brooks and the SLASHPack Team.

Conclusion
Four stages:
- Addition of a protocol module for the Weblog data set.
- Performance testing and identifying problem areas.
- Modifying the Collector to improve scalability and performance.
- Repeating the performance testing and evaluating the improvements.
Results:
- Expanded functionality for data types.
- Modifications improved performance.
- More stable and flexible design.