(Web) Crawlers Domain - Presentation 2, April
Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch
Crawlers
1. Crawlers: Background
2. Unified Domain Model
3. Individual Applications
   3.1 WebSphinx
   3.2 WebLech
   3.3 Grub
   3.4 Aperture
4. Summary and Conclusions
Crawlers - Background
What is a crawler?
- Collects information about web pages
- There is a near-infinite number of web pages and no central directory
- Uses links contained within pages to discover new pages to visit
How do crawlers work?
- Pick a starting page URL (the seed)
- Load the starting page from the internet
- Find all links in the page and enqueue them
- Extract any desired information from the page
- Repeat (a minimal version of this loop is sketched below)
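The following is a minimal, self-contained sketch of that loop in Java. It is an illustration only: the class name, the seed URL, and the regex-based link extraction are assumptions made for this example and are not taken from any of the applications covered later.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal illustrative crawler loop: seed -> fetch -> extract links -> enqueue.
public class MiniCrawler {
    // Naive href extractor; real crawlers use a proper HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.org/");            // seed URL (placeholder)
        int limit = 10;                                   // keep the demo small

        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;              // skip already-visited pages

            StringBuilder page = new StringBuilder();     // fetch the page
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) page.append(line).append('\n');
            } catch (Exception e) {
                continue;                                 // ignore unreachable pages
            }

            Matcher m = LINK.matcher(page);               // extract and enqueue links
            while (m.find()) frontier.add(m.group(1));

            System.out.println("Visited " + url + " (" + page.length() + " chars)");
        }
    }
}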
Crawlers - Background
Rules that apply across the domain:
- All crawlers have a URL Fetcher
- All crawlers have a Parser (Extractor)
- Crawlers are multi-threaded processes
- All crawlers have a Crawler Manager
- All crawlers have a Queue structure
- The domain is strongly related to the search-engine domain
Unified Domain Class Diagram (figure)
Classes in the diagram: Spider, SpiderConfig, Queue, Thread, Extractor, Fetcher, Robots, Scheduler, StorageManager, PageData, Filter, CrawlerHelper, DB, ExternalDB, Merger
The figure marks which classes are common features and which were added by code modeling.
(A minimal interface sketch of the core classes follows below.)
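As a rough illustration, the core classes of the unified domain could be expressed as Java interfaces like the ones below. The method names are assumptions made for this sketch, not signatures taken from any of the four applications.

import java.net.URL;
import java.util.List;

// Illustrative interfaces only; names and signatures are assumptions.
interface Fetcher        { String fetch(URL url) throws Exception; }       // URL Fetcher
interface Extractor      { List<URL> extractLinks(String page); }          // Parser / Extractor
interface Queue          { void enqueue(URL url); URL dequeue(); boolean isEmpty(); }
interface Robots         { boolean isAllowed(URL url); }                   // robots.txt handling
interface StorageManager { void store(URL url, String page); }             // persist page data
interface Spider         { void crawl(URL seed); }                         // the crawler manager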
Unified Domain Sequence Diagram (figure)
Phases shown: pre-crawling phase; pre-fetching phase (start of main loop, optional objects); fetching and extracting phase (optional object); post-processing phase; finish crawling phase (end of main loop)
(A sketch of how these phases map onto the interfaces above follows below.)
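The sketch below shows, purely as an assumption, one way those phases could be wired together using the interfaces from the previous sketch; it is not code from the unified model itself.

// Illustrative only: mapping the sequence-diagram phases onto the domain interfaces above.
final class SpiderSketch implements Spider {
    private final Queue frontier;
    private final Fetcher fetcher;
    private final Extractor extractor;
    private final Robots robots;              // optional object in the diagram
    private final StorageManager storage;

    SpiderSketch(Queue q, Fetcher f, Extractor e, Robots r, StorageManager s) {
        frontier = q; fetcher = f; extractor = e; robots = r; storage = s;
    }

    @Override public void crawl(java.net.URL seed) {
        frontier.enqueue(seed);                           // pre-crawling phase
        while (!frontier.isEmpty()) {                     // main loop
            java.net.URL url = frontier.dequeue();        // pre-fetching phase
            if (robots != null && !robots.isAllowed(url)) continue;
            try {
                String page = fetcher.fetch(url);         // fetching phase
                for (java.net.URL link : extractor.extractLinks(page)) {
                    frontier.enqueue(link);               // extracting phase
                }
                storage.store(url, page);                 // post-processing phase
            } catch (Exception ignored) {
                // skip pages that fail to download or parse
            }
        }                                                 // finish crawling phase
    }
}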
Unified Domain - Applications
- For the User Modeling group, the applications were the first chance to see things in practice
- For the entire group, the applications provided a fresh view of the domain, which led to many changes (Assignment 2)
- With everyone viewing the applications in the domain context, most differences could be explained as application-specific
- Interesting experiment: let a new Code Modeling group use the applications as the basis for the domain?
WebSphinx
- WebSphinx: Website-Specific Processors for HTML INformation eXtraction (2002)
- The WebSphinx class library provides support for writing web crawlers in Java (see the sketch below)
- Designation: small-scope crawls for mirroring, offline viewing, and hyperlink trees
- Extensible to saving information about page elements
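Typical use of the library is to subclass its Crawler class and override the methods that decide which links to follow and what to do with each page. The sketch below is written from memory of that documented style; the exact class and method names (Crawler, Link, Page, shouldVisit, visit, setRoot) should be treated as assumptions and checked against the WebSphinx version in use.

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Hedged sketch of a WebSphinx-style crawler; verify names against the library version.
public class TitlePrinter extends Crawler {
    // Decide whether a discovered link should be followed.
    public boolean shouldVisit(Link link) {
        return "example.org".equals(link.getURL().getHost());    // stay on one site
    }

    // Called once for every page the crawler downloads.
    public void visit(Page page) {
        System.out.println(page.getURL() + "  " + page.getTitle());
    }

    public static void main(String[] args) throws Exception {
        TitlePrinter crawler = new TitlePrinter();
        crawler.setRoot(new Link("http://example.org/"));         // seed page
        crawler.run();                                            // start crawling
    }
}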
WebSphinx Hyperlink Tree (figure)
WebSphinx Class Diagram (figure)
Classes in the diagram: Spider, Queue (Configuration), Fetcher, PageData, StorageManager, Extractor, Scheduler, Settings, Link, Mirror, Element, Thread, Robots, Filters
- Mirror: a collection of files (pages) intended to provide an exact copy of another website
- Element: web pages are composed of many elements; elements can be nested (a table, for example, contains many child elements)
- Link: a link is a type of element, usually an anchor tag, which points to a specific page or file; storing information about each link relative to our seeds can help us analyze results
WebSphinx (figure)
WebLech
WebLech allows you to "spider" a website and recursively download all the pages on it.
WebLech
WebLech is a fully featured website download/mirror tool in Java. It supports:
- downloading websites
- emulating standard web-browser behavior
WebLech is multithreaded and will feature a GUI console.
WebLech
- Open source: the MIT License means it is totally free and you can do what you want with it
- Pure Java code means you can run it on any Java-enabled computer
- Multi-threaded operation for downloading lots of files at once
- Supports basic HTTP authentication for accessing password-protected sites
- HTTP referrer support maintains link information between pages (needed to spider some websites)
WebLech
Lots of configuration options (illustrated in the sketch below):
- Depth-first or breadth-first traversal of the site
- Candidate URL filtering, so you can stick to one web server, one directory, or just spider the whole web
- Configurable caching of downloaded files allows restarting without needing to download everything again
- URL prioritization, so you can get interesting files first and leave boring files till last (or ignore them completely)
- Checkpointing, so you can snapshot spider state in the middle of a run and restart without lots of reprocessing
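To make the list above concrete, here is a hypothetical configuration object covering the same options. The class and field names are invented for this sketch; WebLech itself reads its settings from a properties file and does not expose this API.

// Hypothetical configuration sketch; not WebLech's actual configuration API.
public class SpiderConfigSketch {
    enum Traversal { DEPTH_FIRST, BREADTH_FIRST }

    Traversal traversal = Traversal.BREADTH_FIRST;   // depth-first or breadth-first traversal
    String urlFilter = "https://example.org/docs/";  // stick to one server or one directory
    boolean cacheDownloads = true;                   // restart without re-downloading everything
    String[] interestingSubstrings = {".html"};      // prioritize interesting files first
    int checkpointIntervalSeconds = 300;             // snapshot spider state mid-run

    public static void main(String[] args) {
        SpiderConfigSketch cfg = new SpiderConfigSketch();
        System.out.println("Traversal: " + cfg.traversal
                + ", filter: " + cfg.urlFilter
                + ", checkpoint every " + cfg.checkpointIntervalSeconds + "s");
    }
}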
WebLech Class Diagram (figure)
WebLech Sequence Diagram (figure)
WebLech Common Features (figure)
WebLech Common Features, continued (figure)
WebLech Unique Features (figure)
Grub Crawler
- A little bit about NASA's SETI
- What are distributed crawlers? (a conceptual sketch follows below)
- Why distributed crawlers?
- Pros & cons of distributed crawlers
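For context, the sketch below illustrates the basic idea behind distributed crawling in the SETI@home style that Grub follows: a coordinator hands out work (URLs) to volunteer clients, which fetch pages on their own machines and report results back. It is a conceptual illustration only; the class names and the protocol are assumptions, not Grub's actual design.

import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Conceptual sketch of distributed crawling; not Grub's actual protocol or classes.
class Coordinator {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    Coordinator(List<String> seeds) { pending.addAll(seeds); }

    // Hand the next URL to a requesting client (null when nothing is pending).
    String assignWork() { return pending.poll(); }

    // Receive a client's result and enqueue any newly discovered URLs.
    void reportResult(String url, List<String> discoveredLinks) {
        System.out.println("Client finished " + url);
        pending.addAll(discoveredLinks);
    }
}

class VolunteerClient implements Runnable {
    private final Coordinator coordinator;
    VolunteerClient(Coordinator c) { coordinator = c; }

    public void run() {
        String url;
        while ((url = coordinator.assignWork()) != null) {
            // A real client would fetch and parse the page here; this sketch just reports it.
            coordinator.reportResult(url, List.of());
        }
    }
}

public class DistributedCrawlSketch {
    public static void main(String[] args) throws InterruptedException {
        Coordinator c = new Coordinator(List.of("https://example.org/a", "https://example.org/b"));
        Thread t1 = new Thread(new VolunteerClient(c));
        Thread t2 = new Thread(new VolunteerClient(c));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}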
Grub Class Diagram (figure)
Grub Class Diagram (2): Spider & Thread; Config & Robot (figure)
Grub Class Diagram (3): Fetcher; Extractor; Queue & Storage Manager (figure)
Grub Sequence Diagram (figure)
Grub Sequence Diagram, continued (figure)
Grub Use Case (figure)
Aperture
- Development year: 2005
- Designation: crawling and indexing
- Crawls different information systems
- Handles many common file formats
- Flexible architecture
Main process phases (sketched below):
1. Fetch information from a chosen source
2. Identify the source type (MIME type detection)
3. Extract full text and metadata
4. Store and index the information
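The sketch below walks through those four phases on a single local file, just to make the flow concrete. It is not Aperture code: the class name, the extractor map, and the input file are all assumptions made for the illustration.

import java.io.IOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative fetch -> identify -> extract -> store pipeline; not Aperture's API.
public class PipelineSketch {
    // One "extractor" per MIME type; a real system registers many more formats.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    public PipelineSketch() {
        extractors.put("text/plain", bytes -> new String(bytes));
        extractors.put("text/html", bytes -> new String(bytes).replaceAll("<[^>]*>", " "));
    }

    public void process(Path source) throws IOException {
        byte[] data = Files.readAllBytes(source);                                    // 1. fetch
        String mime = URLConnection.guessContentTypeFromName(source.toString());     // 2. identify type
        Function<byte[], String> extractor = extractors.getOrDefault(mime, b -> ""); // 3. choose extractor
        String text = extractor.apply(data);                                         //    extract full text
        System.out.println(source + " [" + mime + "]: " + text.length() + " chars"); // 4. store/index
    }

    public static void main(String[] args) throws IOException {
        new PipelineSketch().process(Path.of("example.html"));                       // hypothetical input
    }
}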
Aperture Web Demo
Go to:
Aperture Class Diagram (figure)
- Aperture offers a crawler for each data source; our domain focuses on web crawling
- Aperture offers many extractors which are able to extract data and metadata from files, sites, calendars, etc.
Domain classes in the diagram: Spider, SpiderConfig, Queue, Thread, Scheduler, Robots, Fetcher, CrawlerHelper, DB, Extractor, StorageManager, plus the crawler types and extractor types
Aperture-specific classes:
- DataObject / RDFContainer (unique to Aperture): represent a source object after it has been fetched; the object includes the source data and metadata in RDF format
- Mime (unique to Aperture): identifies the source type in order to choose the correct extractor
- CrawlReport (interface, unique to Aperture): helps the crawler keep the necessary information about crawl status changes, failures, and successes
Aperture Sequence Diagram (figure)
Summary - ADOM
- ADOM was helpful in establishing the domain requirements
- With a better understanding of ADOM, abstraction became easier; the level of abstraction improved (increased) with each assignment
- Using XOR and OR constraints on relations was helpful in creating the domain class diagram
- It was difficult not to get carried away with "it's optional, no harm in adding it" decisions
Summary - Domain Modeling
- Difficulty in modeling functional entities: their functions are often contained within another class
- Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
- Vast differences in application scale
- Next time, we'll pick a different domain…
Crawlers
Thank you!
Any questions?