Slide 1: Crawlers and Crawling Strategies
CSCI 572: Information Retrieval and Search Engines, Summer 2010
Slide 2: Outline
– Crawlers
  – Web
  – File-based
– Characteristics
– Challenges
Slide 3: Why Crawling?
– Origins are in the web
  – The web is a big "spiderweb", so a "spider" crawls it
– A focused approach to navigating the web
  – It's not about visiting all pages at once
  – …or at random
  – There needs to be a sense of purpose: some pages are more important, or simply different, than others
– Content-driven
  – Different crawlers for different purposes
Slide 4: Different classifications of Crawlers
– Whole-web crawlers
  – Must deal with different concerns than more focused vertical crawlers or content-based crawlers
  – Politeness; the ability to handle any and all protocols defined in the URL space
  – Deal with URL filtering, freshness, and recrawling strategies
  – Examples: Heritrix, Nutch, Bixo, crawler-commons, clever uses of wget and curl, etc.
Slide 5: Different classifications of Crawlers
– File-based crawlers
  – Don't require an understanding of protocol negotiation (a hard problem in its own right!)
  – Assume that the content is already local
  – Uniqueness is in the methodology for:
    – File identification and selection
    – Ingestion
  – Examples: OODT CAS, scripting (ls/grep/UNIX), internal appliances (Google), Spotlight
Slide 6: Web-scale Crawling
– What do you have to deal with?
  – Protocol negotiation
    – How do you get data from FTP, HTTP, SMTP, HDFS, RMI, CORBA, SOAP, BitTorrent, or ed2k URLs?
    – Build a flexible protocol layer like Nutch did?
  – Determination of which URLs are important or not (see the filtering sketch below)
    – Whitelists
    – Blacklists
    – Regular expressions
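As a minimal sketch of whitelist/blacklist URL filtering with regular expressions (the patterns below are made up for illustration, and this is not Nutch's actual urlfilter code):

import re

# Hypothetical include/exclude rules; a real crawl would load these from configuration.
WHITELIST = [re.compile(r"^https?://([a-z0-9-]+\.)*example\.edu/")]
BLACKLIST = [
    re.compile(r"\.(jpg|png|gif|css|js)$", re.IGNORECASE),   # skip static assets
    re.compile(r"[?&](sessionid|sid)="),                     # skip session-tracking URLs
]

def accept_url(url: str) -> bool:
    """Keep a URL only if it matches at least one whitelist rule and no blacklist rule."""
    if not any(p.search(url) for p in WHITELIST):
        return False
    return not any(p.search(url) for p in BLACKLIST)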
Slide 7: Politeness
– How do you take into account that web servers and Internet providers can and will:
  – Block you after a certain number of concurrent attempts
  – Block you if you ignore the crawling preferences codified in, e.g., a robots.txt file (see the check sketched below)
  – Block you if you don't specify a User-Agent
  – Identify you based on:
    – Your IP
    – Your User-Agent
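A hedged example of honoring robots.txt using Python's standard library; the crawler name is a hypothetical placeholder:

import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "cs572-demo-crawler"   # hypothetical: always declare who you are

def allowed_by_robots(url: str) -> bool:
    """Download the host's robots.txt and ask whether our agent may fetch this URL."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()   # fetches and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)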
Slide 8: Politeness
– Queuing is very important
– Maintain host-specific crawl patterns and policies
  – Sub-collection based, using regexes
– Threading and brute force are your enemies
– Respect robots.txt
– Declare who you are
(A per-host queue with crawl delays is sketched below.)
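A minimal sketch of host-aware queuing, assuming a fixed delay per host; a real crawler would also honor any Crawl-delay a site declares:

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

CRAWL_DELAY = 2.0   # assumed seconds between requests to the same host

class PoliteQueue:
    """Hand out URLs so that no single host is hit faster than CRAWL_DELAY."""
    def __init__(self):
        self.per_host = defaultdict(deque)
        self.next_ok = defaultdict(float)   # earliest time each host may be contacted again

    def add(self, url):
        self.per_host[urlparse(url).netloc].append(url)

    def next_url(self):
        now = time.time()
        for host, urls in self.per_host.items():
            if urls and now >= self.next_ok[host]:
                self.next_ok[host] = now + CRAWL_DELAY
                return urls.popleft()
        return None   # nothing is ready; the caller should wait briefly and retry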
Slide 9: Crawl Scheduling
– When and where should you crawl?
  – Based on URL freshness within some N-day cycle? (see the sketch below)
    – Relies on uniquely identifying URLs, and on approaches for doing so
  – Based on per-site policies?
    – Some sites are less busy at certain times of day
    – Some sites are on higher-bandwidth connections than others
    – Profile this?
– Adaptive fetching/scheduling
  – Deciding the above on the fly while crawling
– Regular fetching/scheduling
  – Profiling the above and storing it away in policy/config
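A hedged illustration of the freshness idea: recrawl a URL only when its last fetch is older than an assumed N-day cycle. The 7-day window and the in-memory table are placeholders.

from datetime import datetime, timedelta

RECRAWL_CYCLE = timedelta(days=7)   # assumed N-day freshness window
last_fetched = {}                   # url -> datetime of the last successful fetch

def due_for_recrawl(url, now=None):
    """A URL is due if it has never been fetched or its last fetch is older than the cycle."""
    now = now or datetime.utcnow()
    last = last_fetched.get(url)
    return last is None or (now - last) > RECRAWL_CYCLE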
Slide 10: Data Transfer
– Download in parallel? Download sequentially? (both are sketched below)
– What do you do with the data once you've crawled it: is it cached temporarily or persisted somewhere?
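A minimal sketch of sequential versus parallel fetching with Python's thread pool; the worker count and keeping pages in memory are assumptions for illustration, not a recommendation.

import concurrent.futures
import urllib.request

def fetch(url):
    """Download one URL and return (url, bytes); a real crawler would persist or cache this."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

def fetch_all(urls, parallel=True, workers=8):
    if not parallel:
        return [fetch(u) for u in urls]              # sequential download
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))           # parallel download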
Slide 11: Identification of the Crawl Path
– Uniform Resource Locators (URLs)
– Inlinks
– Outlinks
– Parsed data
  – The source of inlinks and outlinks
– Identification of the URL protocol scheme/path
  – Deduplication (sketched below)
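A hedged sketch of turning outlinks from parsed data into a deduplicated frontier; the normalization rules (lowercase scheme and host, drop the fragment) are deliberately simplistic assumptions.

from urllib.parse import urljoin, urlparse, urlunparse

seen = set()   # normalized URLs already scheduled or fetched

def normalize(url):
    """Lowercase the scheme and host and drop the fragment so equivalent URLs dedup to one entry."""
    p = urlparse(url)
    return urlunparse((p.scheme.lower(), p.netloc.lower(), p.path or "/", p.params, p.query, ""))

def extract_frontier(base_url, outlinks):
    """Resolve relative outlinks against the page URL and keep only URLs not seen before."""
    fresh = []
    for link in outlinks:
        url = normalize(urljoin(base_url, link))
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh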
Slide 12: File-based Crawlers
– Crawling remote content, getting politeness right, dealing with protocols, and scheduling is hard!
– Let some other component do that for you
  – CAS PushPull is a great example
  – Staging areas, delivery protocols
– Once you have the content, there is still interesting crawling strategy left
Slide 13: What's hard? The file is already here
– Identifying which files are important, and which aren't
  – Content detection and analysis
    – MIME type, URL/filename regexes, magic-byte detection, XML root-element detection, and combinations of them
    – Apache Tika
– Mapping identified file types to mechanisms for extracting their content and ingesting it (see the dispatch sketch below)
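A hedged sketch of that mapping: detect a MIME type, then dispatch to an extractor. The handler functions are hypothetical placeholders; a real system would lean on Apache Tika for both detection and extraction.

import mimetypes

def extract_pdf(path):
    """Hypothetical placeholder; real code would hand the file to a PDF parser (e.g., via Tika)."""
    return f"pdf content of {path}"

def extract_html(path):
    """Hypothetical placeholder for an HTML text extractor."""
    return f"html content of {path}"

EXTRACTORS = {
    "application/pdf": extract_pdf,
    "text/html": extract_html,
}

def ingest(path):
    """Guess the MIME type from the file name and dispatch to the matching extractor."""
    mime, _ = mimetypes.guess_type(path)
    handler = EXTRACTORS.get(mime)
    if handler is None:
        return None   # unknown type: skip it, or fall back to a generic text extractor
    return handler(path)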
Slide 14: Quick intro to content detection
– By URL or file name
  – People codified classification into URLs or file names
  – Think file extensions
– By MIME magic
  – Think digital signatures
– By XML schemas and classifications
  – Not all XML is created equal
– By combinations of the above (combined in the sketch below)
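A minimal sketch combining the three signals: magic bytes, the XML root element, and the file extension. The signature table is tiny and illustrative; Apache Tika's detector covers far more types and edge cases.

import mimetypes
import xml.etree.ElementTree as ET

MAGIC = {b"%PDF-": "application/pdf", b"\x89PNG": "image/png", b"PK\x03\x04": "application/zip"}

def detect(path):
    # 1. Magic bytes: the digital signature at the start of the file.
    with open(path, "rb") as f:
        head = f.read(8)
    for signature, mime in MAGIC.items():
        if head.startswith(signature):
            return mime
    # 2. XML root element: not all XML is created equal.
    if head.lstrip().startswith((b"<?xml", b"<")):
        try:
            root = ET.parse(path).getroot()
            return f"application/xml; root={root.tag}"
        except ET.ParseError:
            pass
    # 3. Fall back to the file name / extension.
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"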
Slide 15: Case Study: OODT CAS
– A set of components for science data processing
– Deals with file-based crawling
Slide 16: File-based Crawler Types
– Auto-detect crawler
– Met extractor crawler
– Std product crawler
Slide 17: Other Examples of File Crawlers
– Spotlight
  – Indexes your hard drive on a Mac and makes it readily available for fast free-text search
  – Involves CAS/Tika-like interactions
– Scripting with ls and grep
  – You may find yourself doing this to run batch processing rapidly
  – Don't encode the data transfer into the script! That mixes concerns (see the sketch below)
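A hedged Python stand-in for the ls/grep pattern: it only selects matching files and hands the paths to a separately supplied ingest step, so data transfer stays out of the selection logic. The .nc pattern and the staging-area layout are assumptions.

import os
import re

PATTERN = re.compile(r"\.nc$")   # assumed: pick up netCDF granules dropped into a staging area

def select_files(staging_dir):
    """Walk a local staging area and yield paths matching the pattern; no transfer happens here."""
    for root, _dirs, names in os.walk(staging_dir):
        for name in names:
            if PATTERN.search(name):
                yield os.path.join(root, name)

def crawl(staging_dir, ingest):
    """Hand each selected file to the caller's ingest callback (keeping the concerns separate)."""
    for path in select_files(staging_dir):
        ingest(path)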
Slide 18: Challenges
– Reliability
  – If a web-scale crawl fails partway through, how do you mitigate the failure?
– Scalability
  – Web-based vs. file-based
– Commodity versus appliance
  – Google, or build your own
– Separation of concerns
  – Separate processing from ingestion from acquisition
Slide 19: Wrapup
– Crawling is a canonical piece of a search engine
– Its utility is seen in data systems across the board
– Determine your acquisition strategy vis-à-vis your processing and ingestion strategies
– Separate and insulate
– Identify content flexibly