Mobile Web Crawling
Master Thesis Defense
Jan Fiedler
04/17/98
Presentation Outline
  - Resource Discovery Problem
  - Web Crawling Techniques
      - Traditional Web Crawling
      - Mobile Web Crawling
  - Mobile Crawling Architecture
      - Distributed Runtime Environment
      - Application Framework
  - Performance Evaluation
  - Summary and Conclusion
04/17/98 jfiedler@cise.ufl.edu
Resource Discovery Problem
  - The Web forms a large distributed hypertext system
      - 1.6 million Web sites
      - 320 million Web documents
      - 40% of the Web content changes within a month
      - exponential growth rate
      - lack of structure (i.e. no strict hierarchy)
  - Goal: overlay the distributed Web structure with a centralized information system that allows resource discovery
Web Indices and Search Engines
  - Search engine statistics:
      - index size: 30-110 million pages (approx. 700GB)
      - Web coverage: 10%-35%
      - daily crawl: 3-10 million pages (approx. 60GB)
  - Year 2000 estimates:
      - index size: 880 million pages (approx. 5.6TB)
      - daily crawl: 80 million pages (approx. 480GB)
  - Traditional Web crawling will experience severe scaling problems in the near future.
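The figures above can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes) and assumes that each quoted size belongs to the upper end of its page range; neither assumption is stated on the slide. Under those assumptions the current figures and the year 2000 estimates imply the same average page size, so the estimates appear to scale the page count by a factor of 8 at a constant size per page.

```python
GB = 10**9  # decimal gigabyte (assumption, the slide does not say which unit is meant)

def bytes_per_page(total_gb, pages):
    """Average number of bytes per page implied by a total size and a page count."""
    return total_gb * GB / pages

# Current figures, pairing each size with the upper end of its range (assumption).
index_1998 = bytes_per_page(700, 110_000_000)
crawl_1998 = bytes_per_page(60, 10_000_000)

# Year 2000 estimates (5.6TB written as 5600GB).
index_2000 = bytes_per_page(5600, 880_000_000)
crawl_2000 = bytes_per_page(480, 80_000_000)

# Growth factor of the index in pages.
growth = 880_000_000 / 110_000_000

print(f"index: {index_1998:.0f} vs {index_2000:.0f} bytes per page")
print(f"crawl: {crawl_1998:.0f} vs {crawl_2000:.0f} bytes per page")
print(f"index growth factor: {growth:.0f}")
```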
Traditional Crawling Overview
Traditional Web Crawling
  - Characteristics of traditional Web crawling:
      - remote data access
      - focus on rapid data retrieval
      - centralized, database-oriented architecture
      - brute-force download of Web content
      - resource-intensive approach
  - Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.
Mobile Crawling Overview
Mobile Web Crawling
  - Characteristics of mobile Web crawling:
      - local data access
      - focus on effective data retrieval
      - distributed, data-source-oriented architecture
      - intelligent download of significant Web content
      - resource-preserving approach
  - Mobility allows a Web crawler to analyze Web pages before investing Web resources in their transmission.
Mobile Crawling Advantages
  - Remote page selection
      - determine the significance of a page prior to transmission
      - applicable to specialized search engines
  - Remote page filtering
      - use an effective page representation model
      - applicable to non-fulltext search engines
  - Remote page compression
      - compress page data prior to transmission
      - applicable to all search engines
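The three remote operations above can be sketched as a small pipeline that runs at the data source before anything is sent over the network. This is only an illustration: the function names, the keyword test used for selection, the tag stripping used for filtering and the choice of zlib for compression are all assumptions, not the method used in the thesis.

```python
import re
import zlib

def filter_page(html):
    """Remote page filtering (assumed model): drop markup, keep the visible text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())

def select_pages(pages, keywords):
    """Remote page selection (assumed test): keep pages whose text contains a keyword."""
    wanted = {k.lower() for k in keywords}
    selected = {}
    for url, html in pages.items():
        words = set(re.findall(r"[a-z0-9]+", filter_page(html).lower()))
        if wanted & words:
            selected[url] = html
    return selected

def compress_page(text):
    """Remote page compression: shrink the data before it is transmitted."""
    return zlib.compress(text.encode("utf-8"))

def crawl_remotely(pages, keywords):
    """Select, filter and compress pages; the result is what would be transmitted."""
    return {url: compress_page(filter_page(html))
            for url, html in select_pages(pages, keywords).items()}

# Hypothetical pages held by the remote Web server.
pages = {
    "/a.html": "<html><body><h1>Mobile crawler</h1>"
               + "<p>crawler rules and facts</p>" * 20
               + "</body></html>",
    "/b.html": "<html><body><p>cooking recipes</p></body></html>",
}

payload = crawl_remotely(pages, ["crawler"])
for url, data in payload.items():
    print(url, len(pages[url]), "bytes raw,", len(data), "bytes transmitted")
```

Only pages that pass the selection step contribute to the payload, which is why selection suits specialized search engines while compression helps every kind of crawler.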
Crawler Specification
  - Rule-based programming paradigm
      - represent crawler data as facts (e.g. page-facts)
      - describe crawler behavior as a set of rules which operate upon facts
  - Advantages
      - it is easier to specify crawling rules than to devise a crawling algorithm
      - no need to model control flow
      - rule-based programs have very simple runtime states
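A minimal sketch of what such a specification could look like, assuming facts are plain tuples and each rule is a function that derives new facts from existing ones. The fact layouts ("page", "link", "relevant", "visit") and the rule names are hypothetical and do not reflect the actual rule language of the thesis.

```python
# Facts are tuples whose first element names the kind of fact (assumed layout).
facts = {
    ("page", "/a.html", "Mobile crawler design"),
    ("page", "/b.html", "Cooking recipes"),
    ("link", "/a.html", "/c.html"),
    ("link", "/b.html", "/d.html"),
}

def mark_relevant(facts):
    """Rule: a page-fact whose text mentions 'crawler' yields a relevant-fact."""
    return {("relevant", f[1]) for f in facts
            if f[0] == "page" and "crawler" in f[2].lower()}

def queue_links(facts):
    """Rule: every link-fact that starts at a relevant page yields a visit-fact."""
    relevant = {f[1] for f in facts if f[0] == "relevant"}
    return {("visit", f[2]) for f in facts
            if f[0] == "link" and f[1] in relevant}

# The whole runtime state is the fact set; there is no explicit control flow.
facts |= mark_relevant(facts)
facts |= queue_links(facts)

print(sorted(f for f in facts if f[0] in ("relevant", "visit")))
```

Because the state is just a set of facts, it can be serialized and moved along with the crawler without capturing a call stack.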
Mobile Crawling Architecture
Mobile Crawling Architecture
  - Distributed Crawler Runtime Environment
      - provide a platform-independent execution environment
      - virtual machine for remote crawler execution
      - communication layer for crawler migration
  - Application Framework
      - support for crawler specification and configuration
      - crawler manager for crawler specification
      - query engine as crawler/application interface
      - archive manager as database connectivity framework
Crawler Virtual Machine
  - How to execute a rule-based crawler specification?
      - crawler execution = rule application upon the fact base
      - use an inference engine for the rule application process
  1. Initialization: insert rules and facts into the inference engine
  2. Rule application: start the rule application process within the inference engine
  3. Finalization: extract rules and facts once the rule application has stopped
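The three phases above can be sketched with a toy forward-chaining engine. The slides do not name the inference engine that was used, so the TinyEngine class, its method names and the example rule are hypothetical stand-ins that only illustrate the control structure.

```python
class TinyEngine:
    """Toy forward-chaining inference engine (hypothetical stand-in)."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def load(self, rules, facts):
        self.rules = list(rules)
        self.facts = set(facts)

    def run(self):
        # Apply every rule repeatedly until no rule derives a new fact.
        changed = True
        while changed:
            changed = False
            for rule in self.rules:
                new_facts = rule(frozenset(self.facts)) - self.facts
                if new_facts:
                    self.facts |= new_facts
                    changed = True

    def dump(self):
        return list(self.rules), set(self.facts)

def execute_crawler(rules, facts):
    engine = TinyEngine()
    engine.load(rules, facts)   # 1. Initialization
    engine.run()                # 2. Rule application
    return engine.dump()        # 3. Finalization

def follow_links(facts):
    """Rule: a page linked from a reached page is reached as well."""
    reached = {f[1] for f in facts if f[0] == "reached"}
    return {("reached", f[2]) for f in facts
            if f[0] == "link" and f[1] in reached}

start_facts = {
    ("reached", "/index.html"),
    ("link", "/index.html", "/a.html"),
    ("link", "/a.html", "/b.html"),
    ("link", "/c.html", "/d.html"),
}

rules_out, facts_out = execute_crawler([follow_links], start_facts)
print(sorted(f[1] for f in facts_out if f[0] == "reached"))
```

The extracted rules and facts are what a communication layer would ship back to the crawler's origin once execution at the remote site has stopped.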
Crawler Virtual Machine
Crawler Query Engine
  - How to access the crawler knowledge?
      - provide a query facility to query the crawler fact base
      - implement a SQL subset as the query language
      - represent query results as data tuples, not as facts
      - allows the user to reason about crawling results
      - query engine implementation uses the inference engine
  - The query engine serves as the primary interface between the user application and the mobile crawler.
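To illustrate the idea of querying a fact base with SQL and getting plain tuples back, the sketch below loads page-facts into an in-memory SQLite table. This is a swapped-in technique for illustration only: the thesis implements its query engine on top of the inference engine, and the fact layout, table schema and function name here are assumptions.

```python
import sqlite3

# Hypothetical page-facts: (kind, url, size in bytes).
facts = [
    ("page", "/a.html", 2048),
    ("page", "/b.html", 512),
    ("page", "/c.html", 4096),
]

def query_facts(facts, sql, params=()):
    """Load page-facts into a temporary table, run the query, return data tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page (url TEXT, size INTEGER)")
        conn.executemany(
            "INSERT INTO page VALUES (?, ?)",
            [f[1:] for f in facts if f[0] == "page"],
        )
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

rows = query_facts(
    facts,
    "SELECT url, size FROM page WHERE size > ? ORDER BY size",
    (1000,),
)
print(rows)
```

The result is a list of ordinary tuples, so the calling application does not need to know how facts are represented inside the crawler.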
Crawler Query Engine
Performance Evaluation Setup
  - Use distributed virtual machines to support mobile as well as traditional Web crawling
Performance Evaluation
  - Controlled environment setup
      - static HTML data set with known properties
      - personal HTTP server
      - unshared communication channel (dial-up line)
  - Measurements
      1. network load for the traditional (stationary) crawler
      2. network load for the mobile crawler without page compression
      3. network load for the mobile crawler with page compression
Benefit of Remote Page Selection
  - Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection
Benefit of Remote Page Filtering
  - Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% of page data preserved)
Benefit of Page Compression
  - Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages
Costs and Benefits
  - Overhead
      - overhead due to crawler migration (<5K)
      - overhead due to fact-based data representation (6%)
  - Benefits without page compression
      - as soon as less than 85% of each page needs to be preserved
      - as soon as less than 90% of all pages are transmitted
  - Benefits with page compression
      - reduction in network load by a factor of 4.5
Summary and Conclusion
  - Mobile crawling advantages:
      - the approach fits better in a distributed Web environment
      - the approach is beneficial for all types of search engines
      - better support for specialized search engines
      - network overhead due to crawler mobility is small
  - Mobile crawling solves the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.
  - The approach provides a base for smart Web crawling.
Future Work
  - Security
      - crawler identification based on digital signatures
      - restrict crawler execution to positively identified crawlers
      - implement the virtual machine as a secure sandbox
  - Crawler mobility support
      - integrate the virtual machine into Web servers
  - Mobile crawling algorithms
      - optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)