© 2016 A. Haeberlen, Z. Ives CIS 455/555: Internet and Web Systems 1 University of Pennsylvania Crawling and Publish/Subscribe February 22, 2016
© 2016 A. Haeberlen, Z. Ives Announcements First midterm will be next Wednesday (Mar 2) Needs to be spread over several rooms - please watch Piazza for announcements You will get an email with a specific seat assignment, too Open-book, open-notes – you may use the slides, the assigned readings, your textbook, and your notes from class. This has obvious consequences for the kinds of questions I can ask you You may bring your laptop or tablet, as long as you can guarantee that all wireless interfaces will be disabled If we catch someone using unauthorized materials, or with any wireless interface enabled, they will receive an immediate zero + OSC referral We can't guarantee that you will be seated near a power outlet Covers all the material up to, and including, February 29th Reading for next time: Ghemawat et al.: "The Google File System" 2 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Plan for today Basic crawling Mercator Publish/subscribe XFilter 3 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives Motivation Suppose you want to build a search engine Need a large corpus of web pages How can we find these pages? Idea: crawl the web What else can you crawl? For example, social network, publication network,... 4 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 5 Crawling: The basic process
What state do we need?
Q := Queue of URLs to visit
P := Set of pages already crawled
Basic process:
1. Initialize Q with a set of seed URLs
2. Pick the first URL from Q and download the corresponding page
3. Extract all URLs from the page (<a> tags / anchor links, CSS, DTDs, scripts, optionally image links)
4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
5. If Q is not empty, repeat from step 2
Can one machine crawl the entire web? Of course not! Need to distribute crawling across many machines.
University of Pennsylvania
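The loop above translates almost directly into code. Here is a minimal, single-threaded sketch in Java; the class and method names (SimpleCrawler, download, extractUrls, meetsCriteria) are made up for illustration, and a real crawler would add the politeness, robots.txt handling, and error handling discussed on the following slides.

import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of the basic crawl loop (hypothetical names, no politeness yet)
public abstract class SimpleCrawler {
    private final Queue<URI> q = new ArrayDeque<>();   // Q: URLs to visit
    private final Set<URI> p = new HashSet<>();        // P: pages already crawled

    public void crawl(List<URI> seeds) throws Exception {
        q.addAll(seeds);                                // 1. initialize Q with seed URLs
        while (!q.isEmpty()) {                          // 5. repeat while Q is not empty
            URI url = q.remove();                       // 2. pick the first URL from Q
            if (!p.add(url)) continue;                  // skip URLs we already crawled
            String page = download(url);                // 2. download the corresponding page
            for (URI link : extractUrls(url, page)) {   // 3. extract all URLs from the page
                if (meetsCriteria(link) && !p.contains(link)) {
                    q.add(link);                        // 4. append acceptable, unseen URLs
                }
            }
        }
    }

    protected abstract String download(URI url) throws Exception;
    protected abstract List<URI> extractUrls(URI base, String page);
    protected abstract boolean meetsCriteria(URI url);
}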
© 2016 A. Haeberlen, Z. Ives Crawling visualized 6 University of Pennsylvania [Figure: the Web divided into the seed URLs, URLs already crawled, pages currently being crawled, the "frontier" of URLs that will eventually be crawled, URLs that do not fit our criteria (too deep, etc.), and the unseen web]
© 2016 A. Haeberlen, Z. Ives Crawling complications What order to traverse in? Polite to do BFS - why? Malicious pages Spam pages / SEO Spider traps (incl. dynamically generated ones) General messiness Cycles, site mirrors, duplicate pages, aliases Varying latency and bandwidth to remote servers Broken HTML, broken servers,... Web masters' stipulations How deep to crawl? How often to crawl? Continuous crawling; freshness 7 University of Pennsylvania Need to be robust!
© 2016 A. Haeberlen, Z. Ives SEO: "White-hat" version There are several ways you can make your web page easier to crawl and index: Choose a good title Use <meta> tags, e.g., a description meta tag Use meaningful URLs with actual words BAD: http://xyz.com/BQF/17823/bbq GOOD: http://xyz.com/cars/ferrari/458-spider.html Provide an XML Sitemap Use mostly text for navigation (not Flash, JavaScript,...) Descriptive file names, anchor texts, ALT tags More information from search engines, e.g.: http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf https://www.microsoft.com/web/seo 8 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives SEO: "Black-hat" version Tries to trick the search engine into ranking pages higher than it normally would: Shadow domains Doorway pages Keyword stuffing Hidden or invisible text Link farms Page hijacking Blog/comment/wiki spam Scraper sites, article spinning Cloaking... 9 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Normalization; eliminating duplicates Some of the extracted URLs are relative URLs Example: /~ahae/papers/ from www.cis.upenn.edu Normalize it: http://www.cis.upenn.edu:80/~ahae/papers Duplication is widespread on the web If the fetched page is already in the index, do not process it Can verify using document fingerprint (hash) or shingles 10 University of Pennsylvania
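Both steps are easy to sketch with the standard library. Below, a relative link is resolved against the page it was found on using java.net.URI, and an MD5 hash of the page content serves as a simple document fingerprint (shingling is not shown); the class name is made up.

import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class UrlAndContentDedup {
    // Resolve a relative URL against the page it was found on, e.g.
    // "/~ahae/papers/" found on http://www.cis.upenn.edu/ becomes
    // http://www.cis.upenn.edu/~ahae/papers/
    public static URI normalize(URI base, String link) {
        return base.resolve(link).normalize();
    }

    // A simple document fingerprint: MD5 over the page content.
    // Two pages with the same hash are treated as duplicates.
    public static String fingerprint(String pageContent) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(pageContent.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest).toString(16);
    }

    public static void main(String[] args) throws Exception {
        URI base = URI.create("http://www.cis.upenn.edu/");
        System.out.println(normalize(base, "/~ahae/papers/"));
        System.out.println(fingerprint("<html>...</html>"));
    }
}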
© 2016 A. Haeberlen, Z. Ives 11 Crawler etiquette Explicit politeness Look for meta tags; for example, ignore pages that have <meta name="robots" content="noindex"> Implement the robot exclusion protocol; for example, look for, and respect, robots.txt Implicit politeness Even if no explicit specifications are present, do not hit the same web site too often University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 12 Robots.txt
What should be in robots.txt? See http://www.robotstxt.org/robotstxt.html
To exclude all robots from a server:
User-agent: *
Disallow: /
To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
Example: http://www.cis.upenn.edu/robots.txt contains:
User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
User-agent: googlebot
Disallow: *.csi
User-agent: *
Crawl-delay: 5
University of Pennsylvania
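Once robots.txt has been fetched, the crawler must check every candidate URL path against the applicable Disallow rules. A minimal, made-up sketch of that check (it only handles User-agent and Disallow prefixes; wildcards, Allow, and Crawl-delay are ignored):

import java.util.ArrayList;
import java.util.List;

// Very small robots.txt matcher: collects Disallow prefixes that apply to
// our user agent (or to "*") and checks paths against them.
public class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsRules(String robotsTxt, String ourAgent) {
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                String agent = line.substring("user-agent:".length()).trim();
                applies = agent.equals("*") || agent.equalsIgnoreCase(ourAgent);
            } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty()) disallowed.add(prefix);
            }
        }
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules r = new RobotsRules(
            "User-agent: BobsCrawler\nDisallow: /news/\nDisallow: /tmp/", "BobsCrawler");
        System.out.println(r.isAllowed("/news/today.html"));  // false
        System.out.println(r.isAllowed("/index.html"));       // true
    }
}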
© 2016 A. Haeberlen, Z. Ives 13 Recap: Crawling How does the basic process work? What are some of the main challenges? Duplicate elimination Politeness Malicious pages / spider traps Normalization Scalability University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Plan for today Basic crawling Mercator Publish/subscribe XFilter 14 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives 15 Mercator: A scalable web crawler Written entirely in Java Expands a “URL frontier” Avoids re-crawling same URLs Also considers whether a document has been seen before Same content, different URL [when might this occur?] Every document has signature/checksum info computed as it’s crawled Despite the name, it does not actually scale to a large number of nodes But it would not be too difficult to parallelize Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW'99) University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 16 Mercator architecture
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
Source: Mercator paper
University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 17 Mercator’s polite frontier queues Tries to go beyond breadth-first approach Goal is to have only one crawler thread per server What does this mean for the load caused by Mercator? Distributed URL frontier queue: One subqueue per worker thread The worker thread is determined by hashing the hostname of the URL Thus, only one outstanding request per web server University of Pennsylvania
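The core of the politeness guarantee is the assignment of URLs to subqueues by hostname hash, so a given server is only ever contacted by one worker thread. A small sketch of that idea (PoliteFrontier and its methods are made-up names, not Mercator's actual classes):

import java.net.URI;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Polite frontier sketch: one subqueue per worker thread, chosen by hashing
// the hostname, so all URLs of a host are handled by the same thread.
public class PoliteFrontier {
    private final BlockingQueue<URI>[] subqueues;

    @SuppressWarnings("unchecked")
    public PoliteFrontier(int numWorkerThreads) {
        subqueues = new BlockingQueue[numWorkerThreads];
        for (int i = 0; i < numWorkerThreads; i++) {
            subqueues[i] = new LinkedBlockingQueue<>();
        }
    }

    public void enqueue(URI url) throws InterruptedException {
        String host = url.getHost() == null ? "" : url.getHost().toLowerCase();
        int idx = Math.floorMod(host.hashCode(), subqueues.length);
        subqueues[idx].put(url);      // all URLs of this host land in the same subqueue
    }

    public URI nextFor(int workerThreadId) throws InterruptedException {
        return subqueues[workerThreadId].take();  // each worker only drains its own subqueue
    }
}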
© 2016 A. Haeberlen, Z. Ives Mercator's HTTP fetcher First, needs to ensure robots.txt is followed Caches the contents of robots.txt for various web sites as it crawls them Designed to be extensible to other protocols Had to write its own HTTP requestor in Java – their Java version didn't have timeouts Today, can use setSoTimeout() Could use Java non-blocking I/O, but they use multiple threads and synchronous I/O Multi-threaded DNS resolver 18 University of Pennsylvania
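With the current standard library, the timeout problem Mercator had to work around no longer exists. The sketch below fetches a page through HttpURLConnection with explicit connect and read timeouts (a socket-level alternative is Socket.setSoTimeout()); it assumes Java 9+ for InputStream.readAllBytes and omits redirects, size limits, and robots.txt checks.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Simple fetcher with timeouts; a real crawler would also limit document size,
// honor robots.txt, and handle redirects and errors.
public class TimeoutFetcher {
    public static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);   // give up if the server does not accept the connection
        conn.setReadTimeout(5000);      // give up if the server stops sending data
        conn.setRequestProperty("User-Agent", "cis455-crawler");
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch("http://www.cis.upenn.edu/robots.txt").length());
    }
}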
© 2016 A. Haeberlen, Z. Ives 19 Other caveats Infinitely long URL names (good way to get a buffer overflow!) Aliased host names Alternative paths to the same host Can catch most of these with signatures of document data (e.g., MD5) Comparison to Bloom filters Crawler traps (e.g., CGI scripts that link to themselves using a different name) May need to have a way for human to override certain URL paths – see Section 5 of paper Checkpointing!! University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Mercator document statistics
PAGE TYPE / PERCENT:
text/html: 69.2%
image/gif: 17.9%
image/jpeg: 8.1%
text/plain: 1.5%
pdf: 0.9%
audio: 0.4%
zip: 0.4%
postscript: 0.3%
other: 1.4%
[Figure: Histogram of document sizes (60M pages)]
20 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 21 Further considerations May want to prioritize certain pages as being most worth crawling Focused crawling tries to prioritize based on relevance May need to refresh certain pages more often University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Where to go from here You will need to build a crawler for HW2 MS2 and for the final project Please learn from others' experiences! Several crawling-related papers are linked from the reading list - e.g., the Google paper, the Mercator paper,... Reading these papers carefully before you begin will save you a lot of time Get a sense of what to expect Avoid common problems and bottlenecks Identify designs that won't work well 22 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Plan for today Basic crawling Mercator Publish/subscribe XFilter 23 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives The publish/subscribe model Each publisher produces events Example: Web page update, stock quote, announcement,... Each subscriber wants a subset of the events But usually not all How do we implement this efficiently? 24 University of Pennsylvania [Figure: publishers' Events are matched against subscribers' Interests by some matching component]
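In code, the model reduces to two roles plus a broker that decides which events match which interests. A minimal in-memory sketch (all names here are made up, not any particular system's API); note that it scans every subscription for every event, which is exactly the efficiency problem the rest of the lecture is about.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Minimal publish/subscribe broker: subscribers register an interest
// (a predicate over events) and a callback; publishers just call publish().
public class PubSubBroker<E> {
    private static final class Subscription<E> {
        final Predicate<E> interest;
        final Consumer<E> subscriber;
        Subscription(Predicate<E> interest, Consumer<E> subscriber) {
            this.interest = interest;
            this.subscriber = subscriber;
        }
    }

    private final List<Subscription<E>> subscriptions = new CopyOnWriteArrayList<>();

    public void subscribe(Predicate<E> interest, Consumer<E> subscriber) {
        subscriptions.add(new Subscription<>(interest, subscriber));
    }

    public void publish(E event) {
        for (Subscription<E> s : subscriptions) {
            if (s.interest.test(event)) s.subscriber.accept(event);  // only matching events are delivered
        }
    }

    public static void main(String[] args) {
        PubSubBroker<String> broker = new PubSubBroker<>();
        broker.subscribe(e -> e.contains("stock"), e -> System.out.println("Got: " + e));
        broker.publish("stock quote: PENN 42");   // delivered
        broker.publish("weather update");         // filtered out
    }
}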
© 2016 A. Haeberlen, Z. Ives Example: RSS Web server publishes XML file with events Clients periodically request the file to see if there are new events Is this a good solution? 25 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Interest-based crawling Suppose we want to crawl XML documents based on user interests We need several parts: A list of “interests” – expressed in an executable form, perhaps XPath queries A crawler – goes out and fetches XML content A filter / routing engine – matches XML content against users’ interests, sends them the content if it matches 26 University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives Plan for today Basic crawling Mercator Publish/subscribe XFilter 27 University of Pennsylvania NEXT
© 2016 A. Haeberlen, Z. Ives 28 XML-Based information dissemination Basic model (XFilter, YFilter, Xyleme): Users are interested in data relating to a particular topic, and know the schema /politics/usa//body A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties XPath (here used to match documents, not nodes) University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 29 Engine for XFilter [Altinel & Franklin 00] University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 30 How does it work? Each XPath segment is basically a subset of regular expressions over element tags Convert into finite state automata Parse data as it comes in – use SAX API Match against finite state machines Most of these systems use modified FSMs because they want to match many patterns at the same time University of Pennsylvania
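As a toy illustration of the FSM idea, the sketch below drives a single hard-coded machine for the path /politics/usa/body from SAX startElement events; real XFilter/YFilter-style engines run many such machines at once and also handle //, *, and predicates, which this sketch does not.

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Toy streaming matcher: one FSM for the path /politics/usa/body,
// advanced by SAX startElement events.
public class ToyPathMatcher extends DefaultHandler {
    private static final String[] PATH = {"politics", "usa", "body"};
    private int state = 0;          // index of the next expected element
    private int depth = 0;          // current depth in the document
    boolean matched = false;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        depth++;
        // advance the FSM only if the element appears at the expected depth
        if (state < PATH.length && depth == state + 1 && qName.equals(PATH[state])) {
            state++;
            if (state == PATH.length) matched = true;   // reached the accepting state
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        depth--;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<politics topic=\"president\"><usa><body>text</body></usa></politics>";
        ToyPathMatcher handler = new ToyPathMatcher();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println("Query matched: " + handler.matched);   // true
    }
}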
© 2016 A. Haeberlen, Z. Ives 31 Path nodes and FSMs XPath parser decomposes XPath expressions into a set of path nodes These nodes act as the states of corresponding FSM A node in the Candidate List denotes the current state The rest of the states are in corresponding Wait Lists Simple FSM for /politics[@topic=“president”]/usa//body: [Figure: three states Q1_1, Q1_2, Q1_3, reached on the transitions politics, usa, and body] University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 32 Decomposing into path nodes
Each path node records:
Query ID
Position in state machine
Relative Position (RP) in tree: 0 for the root node if it is not preceded by "//"; -1 for any node preceded by "//"; else 1 + (number of "*" nodes from the predecessor node)
Level: if the current node has a fixed distance from the root, then 1 + distance; else if RP = -1, then -1; else 0
Finally, NextPathNodeSet points to the next path node
Applying these rules to the two example queries:
Q1 = /politics[@topic="president"]/usa//body
Q1-1 (politics): Position 1, RP 0, Level 1
Q1-2 (usa): Position 2, RP 1, Level 2
Q1-3 (body): Position 3, RP -1, Level -1
Q2 = //usa/*/body/p
Q2-1 (usa): Position 1, RP -1, Level -1
Q2-2 (body): Position 2, RP 2, Level 0
Q2-3 (p): Position 3, RP 1, Level 0
University of Pennsylvania
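A short sketch of this decomposition for a restricted XPath syntax (only /, //, and * steps; predicates and the NextPathNodeSet links are omitted, and the class names are invented):

import java.util.ArrayList;
import java.util.List;

// Decompose a restricted XPath (only /, //, *) into path nodes with
// Position, Relative Position (RP), and Level, following the rules above.
public class PathNodeDecomposer {
    public static class PathNode {
        public final String queryId, element;
        public final int position, relativePosition, level;
        PathNode(String queryId, String element, int position, int rp, int level) {
            this.queryId = queryId; this.element = element;
            this.position = position; this.relativePosition = rp; this.level = level;
        }
        @Override public String toString() {
            return queryId + "-" + position + " (" + element + "): RP=" + relativePosition + ", Level=" + level;
        }
    }

    public static List<PathNode> decompose(String queryId, String xpath) {
        List<PathNode> nodes = new ArrayList<>();
        String[] steps = xpath.split("/");          // empty entries mark the "//" separators
        int position = 0, distanceFromRoot = 0, wildcardsSincePredecessor = 0;
        boolean fixedDistance = true, precededByDoubleSlash = false;
        for (int i = 1; i < steps.length; i++) {    // steps[0] is empty (leading '/')
            String step = steps[i];
            if (step.isEmpty()) { precededByDoubleSlash = true; fixedDistance = false; continue; }
            if (step.equals("*")) { wildcardsSincePredecessor++; distanceFromRoot++; continue; }
            position++;
            int rp = precededByDoubleSlash ? -1
                   : (position == 1 && wildcardsSincePredecessor == 0 ? 0 : 1 + wildcardsSincePredecessor);
            int level = fixedDistance ? 1 + distanceFromRoot : (rp == -1 ? -1 : 0);
            nodes.add(new PathNode(queryId, step, position, rp, level));
            distanceFromRoot++;
            wildcardsSincePredecessor = 0;
            precededByDoubleSlash = false;
        }
        return nodes;
    }

    public static void main(String[] args) {
        decompose("Q1", "/politics/usa//body").forEach(System.out::println);
        decompose("Q2", "//usa/*/body/p").forEach(System.out::println);
    }
}

Running main prints path nodes for Q1 and Q2 with the RP and Level values listed above.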
© 2016 A. Haeberlen, Z. Ives 33 Query index Query index entry for each XML tag Two lists: Candidate List (CL) and Wait List (WL) divided across the nodes “Live” queries’ states are in CL; “pending” queries + states are in WL Events that cause state transition are generated by the XML parser [Figure: Query Index with a CL and WL per element name (politics, usa, body, p); the first path node of each query (Q1-1, Q2-1) starts in a CL, the remaining nodes (Q1-2, Q1-3, Q2-2, Q2-3) start in WLs] University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 34 Encountering an element Look up the element name in the Query Index and all nodes in the associated CL Validate that we actually have a match [Figure: on startElement: politics, the Query Index entry for politics has path node Q1-1 in its CL; each path node entry stores Query ID, Position, Rel. Position, Level, and a NextPathNodeSet pointer] University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 35 Validating a match We first check that the current XML depth matches the level in the user query: If the level in the CL node is less than 1, then ignore the height; else the level in the CL node must equal the height This ensures we’re matching at the right point in the tree! Finally, we validate any predicates against attributes (e.g., [@topic=“president”]) University of Pennsylvania
© 2016 A. Haeberlen, Z. Ives 36 Processing further elements Queries that don’t meet validation are removed from the Candidate Lists For other queries, we advance to the next state We copy the next node of the query from the WL to the CL, and update the RP and level When we reach a final state (e.g., Q1-3), we can output the document to the subscriber When we encounter an end element, we must remove the corresponding path nodes from the CL University of Pennsylvania
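To make the CL mechanics concrete, here is a heavily simplified, made-up sketch of the event processing for absolute paths without // or * (so validation is just "level == current depth"); predicates, RP bookkeeping, and the real XFilter data structures are all omitted.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified XFilter-style event processing: advance path nodes on
// startElement, deliver on final states, and undo promotions on endElement.
public class SimpleXFilterEngine {
    static class PathNode {
        final String queryId, element; final int level; final PathNode next;
        PathNode(String queryId, String element, int level, PathNode next) {
            this.queryId = queryId; this.element = element; this.level = level; this.next = next;
        }
    }

    // element name -> Candidate List; successor nodes are reached via "next"
    private final Map<String, List<PathNode>> candidateLists = new HashMap<>();
    private final Deque<List<PathNode>> addedAtDepth = new ArrayDeque<>();
    private int depth = 0;

    public void addQuery(String queryId, String... elements) {
        PathNode next = null;
        for (int i = elements.length - 1; i >= 0; i--) {       // build the node chain back to front
            next = new PathNode(queryId, elements[i], i + 1, next);
        }
        candidateLists.computeIfAbsent(next.element, k -> new ArrayList<>()).add(next);
    }

    public void startElement(String name) {
        depth++;
        List<PathNode> promoted = new ArrayList<>();
        for (PathNode node : new ArrayList<>(candidateLists.getOrDefault(name, List.of()))) {
            if (node.level != depth) continue;                  // validation failed
            if (node.next == null) {
                System.out.println("Query " + node.queryId + " matched -> deliver document");
            } else {                                            // advance: copy next node into its CL
                candidateLists.computeIfAbsent(node.next.element, k -> new ArrayList<>()).add(node.next);
                promoted.add(node.next);
            }
        }
        addedAtDepth.push(promoted);
    }

    public void endElement(String name) {
        for (PathNode node : addedAtDepth.pop()) {              // undo promotions made at this depth
            candidateLists.get(node.element).remove(node);
        }
        depth--;
    }

    public static void main(String[] args) {
        SimpleXFilterEngine engine = new SimpleXFilterEngine();
        engine.addQuery("Q1", "politics", "usa", "body");
        engine.startElement("politics");
        engine.startElement("usa");
        engine.startElement("body");    // prints: Query Q1 matched -> deliver document
        engine.endElement("body"); engine.endElement("usa"); engine.endElement("politics");
    }
}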
© 2016 A. Haeberlen, Z. Ives A simpler approach Instantiate a DOM tree for each document Traverse and recursively match XPaths Pros and cons? 37 University of Pennsylvania
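The DOM-based alternative really is just a few lines with the standard javax.xml APIs: parse the entire document, then evaluate each subscriber's XPath against it. The sketch below shows this; the obvious costs are that the whole document must fit in memory and that every XPath is evaluated from scratch, one query at a time.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// DOM-based matching: build the whole tree, then evaluate each XPath against it.
public class DomMatcher {
    public static void main(String[] args) throws Exception {
        String xml = "<politics topic=\"president\"><usa><body>text</body></usa></politics>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> interests = List.of(
                "/politics[@topic=\"president\"]/usa//body",   // Q1: matches
                "//usa/*/body/p");                             // Q2: does not match
        for (String query : interests) {
            NodeList hits = (NodeList) xpath.evaluate(query, doc, XPathConstants.NODESET);
            System.out.println(query + " -> " + (hits.getLength() > 0 ? "match" : "no match"));
        }
    }
}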
© 2016 A. Haeberlen, Z. Ives 38 Recap: Publish-subscribe model Publish-subscribe model Publishers produce events Each subscriber is interested in a subset of the events Challenge: Efficient implementation Comparison: XFilter vs RSS XFilter Interests are specified with XPaths (very powerful!) Sophisticated technique for efficiently matching documents against many XPaths in parallel University of Pennsylvania