CIS 455/555: Internet and Web Systems
Crawling and Publish/Subscribe
February 22, 2016
© 2016 A. Haeberlen, Z. Ives, University of Pennsylvania

Announcements

- First midterm will be next Wednesday (Mar 2)
  - Needs to be spread over several rooms - please watch Piazza for announcements
  - You will get an email with a specific seat assignment, too
- Open-book, open-notes: you may use the slides, the assigned readings, your textbook, and your notes from class
  - This has obvious consequences for the kinds of questions I can ask you
  - You may bring your laptop or tablet, as long as you can guarantee that all wireless interfaces will be disabled
  - If we catch someone using unauthorized materials, or with any wireless interface enabled, they will receive an immediate zero + OSC referral
  - We can't guarantee that you will be seated near a power outlet
- Covers all the material up to, and including, February 29th
- Reading for next time: Ghemawat et al.: "The Google File System"

Plan for today

- Basic crawling  (NEXT)
- Mercator
- Publish/subscribe
- XFilter

Motivation

- Suppose you want to build a search engine
  - Need a large corpus of web pages
  - How can we find these pages?
- Idea: crawl the web
- What else can you crawl?
  - For example, a social network, a publication network, ...

Crawling: The basic process

- What state do we need?
  - Q := Queue of URLs to visit
  - P := Set of pages already crawled
- Basic process:
  1. Initialize Q with a set of seed URLs
  2. Pick the first URL from Q and download the corresponding page
  3. Extract all URLs from the page (<a> anchor tags, CSS, DTDs, scripts, optionally image links)
  4. Append to Q any URLs that a) meet our criteria, and b) are not already in P
  5. If Q is not empty, repeat from step 2
- Can one machine crawl the entire web?
  - Of course not! Need to distribute crawling across many machines.
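A minimal Java sketch of this loop follows; fetch(), extractUrls(), and meetsCriteria() are illustrative stubs, not part of any course API:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class BasicCrawler {
        public static void crawl(List<String> seeds) {
            Deque<String> q = new ArrayDeque<>(seeds);    // Q: URLs to visit
            Set<String> p = new HashSet<>();              // P: pages already crawled
            while (!q.isEmpty()) {                        // step 5: repeat until Q is empty
                String url = q.removeFirst();             // step 2: pick the first URL...
                if (!p.add(url)) continue;                // ...unless we already crawled it
                String page = fetch(url);                 // step 2: download the page
                for (String link : extractUrls(page)) {   // step 3: extract all URLs
                    if (meetsCriteria(link) && !p.contains(link)) {
                        q.addLast(link);                  // step 4: append to Q
                    }
                }
            }
        }

        // Stubs - fetching, link extraction, and filtering are covered on later slides
        private static String fetch(String url) { return ""; }
        private static List<String> extractUrls(String page) { return List.of(); }
        private static boolean meetsCriteria(String url) { return true; }
    }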

Crawling visualized

[Figure: the Web drawn as nested regions - the seed URLs, the URLs already crawled, the pages currently being crawled, the "frontier" of URLs that will eventually be crawled, the URLs that do not fit our criteria (too deep, etc.), and the unseen web beyond.]

Crawling complications

- What order to traverse in?
  - Polite to do BFS - why?
- Malicious pages
  - Spam pages / SEO
  - Spider traps (incl. dynamically generated ones)
- General messiness
  - Cycles, site mirrors, duplicate pages, aliases
  - Varying latency and bandwidth to remote servers
  - Broken HTML, broken servers, ...
- Web masters' stipulations
  - How deep to crawl? How often to crawl?
  - Continuous crawling; freshness
- Need to be robust!

SEO: "White-hat" version

There are several ways you can make your web page easier to crawl and index:
- Choose a good title
- Use <meta> tags, e.g., a description
- Use meaningful URLs with actual words
  - BAD: an opaque URL of numeric IDs and parameters; GOOD: a URL whose path contains descriptive words
- Provide an XML Sitemap
- Use mostly text for navigation (not Flash, JavaScript, ...)
- Descriptive file names, anchor texts, ALT tags
- More information from search engines, e.g., Google's starter guide (optimization-starter-guide.pdf)

SEO: "Black-hat" version

Tries to trick the search engine into ranking pages higher than it normally would:
- Shadow domains
- Doorway pages
- Keyword stuffing
- Hidden or invisible text
- Link farms
- Page hijacking
- Blog/comment/wiki spam
- Scraper sites, article spinning
- Cloaking
- ...

Normalization; eliminating duplicates

- Some of the extracted URLs are relative URLs
  - Example: a link /~ahae/papers/ must be resolved against the base URL of the page it was found on to produce an absolute URL (see the sketch below)
- Duplication is widespread on the web
  - If the fetched page is already in the index, do not process it
  - Can verify using a document fingerprint (hash) or shingles
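Java's java.net.URI implements this resolution directly; the base URL below is assumed purely for illustration:

    import java.net.URI;

    public class Normalize {
        public static void main(String[] args) {
            URI base = URI.create("http://www.cis.upenn.edu/~ahae/index.html"); // assumed base
            URI abs = base.resolve("/~ahae/papers/").normalize();
            System.out.println(abs); // http://www.cis.upenn.edu/~ahae/papers/
        }
    }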

Crawler etiquette

- Explicit politeness
  - Look for meta tags; for example, ignore pages that carry a <meta name="robots" content="noindex"> tag
  - Implement the robot exclusion protocol; for example, look for, and respect, robots.txt
- Implicit politeness
  - Even if no explicit specifications are present, do not hit the same web site too often

Robots.txt

- What should be in robots.txt? See the robots exclusion standard (robotstxt.org)
- To exclude all robots from a server:

    User-agent: *
    Disallow: /

- To exclude one robot from two directories:

    User-agent: BobsCrawler
    Disallow: /news/
    Disallow: /tmp/

- A real-world example:

    User-agent: *
    Disallow: /_mm/
    Disallow: /_notes/
    Disallow: /_baks/
    Disallow: /MMWIP/

    User-agent: googlebot
    Disallow: *.csi

    User-agent: *
    Crawl-delay: 5
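A minimal sketch of a robots.txt reader in Java; it only collects the Disallow prefixes that apply to a given user-agent (a real parser must also handle Allow, grouped User-agent lines, Crawl-delay, and wildcard rules like *.csi above):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsRules {
        public static List<String> disallowedPrefixes(BufferedReader robots, String agent)
                throws IOException {
            List<String> rules = new ArrayList<>();
            boolean applies = false;
            String line;
            while ((line = robots.readLine()) != null) {
                line = line.split("#", 2)[0].trim();       // strip comments
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:")) {
                    String ua = line.substring(11).trim().toLowerCase();
                    applies = ua.equals("*") || agent.toLowerCase().contains(ua);
                } else if (applies && lower.startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) rules.add(path);  // an empty Disallow allows everything
                }
            }
            return rules;
        }
    }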

Recap: Crawling

- How does the basic process work?
- What are some of the main challenges?
  - Duplicate elimination
  - Politeness
  - Malicious pages / spider traps
  - Normalization
  - Scalability

Plan for today

- Basic crawling
- Mercator  (NEXT)
- Publish/subscribe
- XFilter

Mercator: A scalable web crawler

- Written entirely in Java
- Expands a "URL frontier"
- Avoids re-crawling same URLs
- Also considers whether a document has been seen before
  - Same content, different URL [when might this occur?]
- Every document has signature/checksum info computed as it's crawled
- Despite the name, it does not actually scale to a large number of nodes
  - But it would not be too difficult to parallelize

Heydon and Najork: Mercator, a scalable, extensible web crawler (WWW'99)

Mercator architecture

1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it's new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL

(Source: Mercator paper)

Mercator's polite frontier queues

- Tries to go beyond the breadth-first approach
  - Goal is to have only one crawler thread per server
  - What does this mean for the load caused by Mercator?
- Distributed URL frontier queue:
  - One subqueue per worker thread
  - The worker thread is determined by hashing the hostname of the URL
  - Thus, only one outstanding request per web server
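The hostname-to-subqueue mapping can be sketched in a few lines of Java (illustrative, not Mercator's actual code); every URL on the same server lands in the same worker's subqueue:

    import java.net.URI;
    import java.net.URISyntaxException;

    public class FrontierRouter {
        // Assumes absolute http(s) URLs, so getHost() is non-null
        public static int subqueueFor(String url, int numWorkers) throws URISyntaxException {
            String host = new URI(url).getHost().toLowerCase();
            return Math.floorMod(host.hashCode(), numWorkers);
        }
    }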

Mercator's HTTP fetcher

- First, needs to ensure robots.txt is followed
  - Caches the contents of robots.txt for various web sites as it crawls them
- Designed to be extensible to other protocols
- Had to write their own HTTP requestor in Java - their Java version didn't have timeouts
  - Today, can use setSoTimeout()
  - Could use Java non-blocking I/O (java.nio), but they use multiple threads and synchronous I/O
- Multi-threaded DNS resolver
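For instance, a socket read timeout now takes two calls (a sketch; the host and timeout values are arbitrary):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class TimeoutSocket {
        public static Socket open(String host) throws Exception {
            Socket s = new Socket();
            s.connect(new InetSocketAddress(host, 80), 5000); // connect timeout, in ms
            s.setSoTimeout(5000); // a read blocking > 5s throws SocketTimeoutException
            return s;
        }
    }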

Other caveats

- Infinitely long URL names (good way to get a buffer overflow!)
- Aliased host names
- Alternative paths to the same host
- Can catch most of these with signatures of document data (e.g., MD5)
  - Comparison to Bloom filters
- Crawler traps (e.g., CGI scripts that link to themselves using a different name)
  - May need to have a way for a human to override certain URL paths - see Section 5 of the paper
- Checkpointing!!
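A content signature is easy to compute with the standard library (a sketch): two URLs whose pages produce the same digest are treated as the same document. Unlike a Bloom filter, storing the full hashes costs more memory but gives no false positives.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Fingerprint {
        public static String md5Hex(byte[] page) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5").digest(page);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }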

Mercator document statistics (60M pages)

  Page type    Percent
  text/html      69.2%
  image/gif      17.9%
  image/jpeg      8.1%
  text/plain      1.5%
  pdf             0.9%
  audio           0.4%
  zip             0.4%
  postscript      0.3%
  other           1.4%

[Figure: histogram of document sizes]

Further considerations

- May want to prioritize certain pages as being most worth crawling
  - Focused crawling tries to prioritize based on relevance
- May need to refresh certain pages more often

Where to go from here

- You will need to build a crawler for HW2 MS2 and for the final project
- Please learn from others' experiences!
  - Several crawling-related papers are linked from the reading list - e.g., the Google paper, the Mercator paper, ...
  - Reading these papers carefully before you begin will save you a lot of time
    - Get a sense of what to expect
    - Avoid common problems and bottlenecks
    - Identify designs that won't work well

Plan for today

- Basic crawling
- Mercator
- Publish/subscribe  (NEXT)
- XFilter

The publish/subscribe model

- Each publisher produces events
  - Example: web page update, stock quote, announcement, ...
- Each subscriber wants a subset of the events
  - But usually not all
- How do we implement this efficiently?

[Figure: publishers emit events into a matching component ("?"), which forwards each event only to the subscribers whose interests it matches.]

Example: RSS

- Web server publishes an XML file with events
- Clients periodically request the file to see if there are new events
- Is this a good solution?

Interest-based crawling

- Suppose we want to crawl XML documents based on user interests
- We need several parts:
  - A list of "interests" - expressed in an executable form, perhaps XPath queries
  - A crawler - goes out and fetches XML content
  - A filter / routing engine - matches XML content against users' interests, and sends them the content if it matches

Plan for today

- Basic crawling
- Mercator
- Publish/subscribe
- XFilter  (NEXT)

XML-based information dissemination

- Basic model (XFilter, YFilter, Xyleme):
  - Users are interested in data relating to a particular topic, and know the schema
  - Example interest, expressed as an XPath (here used to match documents, not nodes): /politics/usa//body
- A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties

Engine for XFilter [Altinel & Franklin 00]

[Figure: XFilter architecture - user XPath queries are parsed into path nodes and stored in the query index; incoming XML documents pass through an event-based parser whose events drive the filter engine, and matching documents are disseminated to the corresponding users.]

How does it work?

- Each XPath segment is basically a subset of regular expressions over element tags
  - Convert into finite state automata
- Parse data as it comes in - use the SAX API
  - Match against finite state machines
- Most of these systems use modified FSMs because they want to match many patterns at the same time
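As a warm-up before XFilter's actual data structures, here is a generic Java sketch of such an automaton: one linear XPath is compiled into steps, and a stack of active state sets follows the SAX parser's start/end events. This illustrates the FSM idea only; it is not XFilter's algorithm.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class StreamMatcher {
        private final String[] steps;   // element name required at each step
        private final boolean[] desc;   // desc[i]: step i is preceded by "//"
        private final Deque<Set<Integer>> stack = new ArrayDeque<>();
        private boolean matched = false;

        public StreamMatcher(String[] steps, boolean[] desc) {
            this.steps = steps;
            this.desc = desc;
            stack.push(Set.of(0));      // before the root, we are waiting for step 0
        }

        // Call from a SAX ContentHandler's startElement()
        public void startElement(String tag) {
            Set<Integer> next = new HashSet<>();
            for (int s : stack.peek()) {
                if (steps[s].equals(tag)) {
                    if (s + 1 == steps.length) matched = true;
                    else next.add(s + 1);
                }
                if (desc[s]) next.add(s);  // "//" lets a state keep waiting at any depth
            }
            stack.push(next);
        }

        // Call from endElement(): discard states entered under this element
        public void endElement(String tag) { stack.pop(); }

        public boolean matched() { return matched; }
    }

For example, /politics/usa//body becomes new StreamMatcher(new String[]{"politics", "usa", "body"}, new boolean[]{false, false, true}).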

Path nodes and FSMs

- The XPath parser decomposes XPath expressions into a set of path nodes
- These nodes act as the states of the corresponding FSM
  - A node in the Candidate List denotes the current state
  - The rest of the states are in the corresponding Wait Lists
- Simple FSM for /politics/usa//body: Q1_1 -> Q1_2 -> Q1_3, advancing on the elements politics, usa, and body (the final state accepts)

Decomposing into path nodes

For each path node, the parser records:
- Query ID
- Position in the state machine
- Relative Position (RP) in the tree:
  - 0 for the root node if it's not preceded by "//"
  - -1 for any node preceded by "//"
  - else 1 + (number of "*" nodes since the predecessor node)
- Level:
  - If the current node has a fixed distance from the root, then 1 + distance
  - else if RP = -1, then -1, else 0
- Finally, NextPathNodeSet points to the next node

Example: Q1 = /politics/usa//body decomposes into Q1-1, Q1-2, Q1-3; Q2 = //usa/*/body/p decomposes into Q2-1, Q2-2, Q2-3.
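In code, a path node is essentially a small record (a sketch; the field names follow this slide, not XFilter's source):

    public class PathNode {
        String queryId;        // e.g., "Q1"
        int position;          // position in the query's state machine, starting at 1
        int relativePos;       // 0: root; -1: preceded by "//"; else 1 + #intervening "*"
        int level;             // 1 + distance from root if fixed; -1 if RP is -1; else 0
        PathNode[] nextPathNodeSet;

        // Worked example for Q2 = //usa/*/body/p:
        //   Q2-1 (usa):  RP = -1 (preceded by "//"),   level = -1
        //   Q2-2 (body): RP =  2 (one "*" in between), level =  0
        //   Q2-3 (p):    RP =  1 (no "*" in between),  level =  0
    }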

Query index

- Query index entry for each XML tag
- Two lists: Candidate List (CL) and Wait List (WL), divided across the nodes
  - "Live" queries' states are in the CL; "pending" queries' states are in the WL
- Events that cause state transitions are generated by the XML parser

Initial index for Q1 = /politics/usa//body and Q2 = //usa/*/body/p:

  Element    CL      WL
  politics   Q1-1    -
  usa        Q2-1    Q1-2
  body       -       Q1-3, Q2-2
  p          -       Q2-3

Encountering an element

- Look up the element name in the Query Index, and consider all nodes in the associated CL
- Validate that we actually have a match
- Example: on startElement(politics), the lookup for "politics" finds Q1-1 in its CL; the entry's fields (Query ID, Position, Relative Position, Level, NextPathNodeSet) are then used to validate the match

Validating a match

- We first check that the current XML depth matches the level in the user query:
  - If the level in the CL node is less than 1, then ignore the height
  - else the level in the CL node must equal the height
- This ensures we're matching at the right point in the tree!
- Finally, we validate any predicates against attributes
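In code form the depth check is tiny (a sketch; 'level' comes from the candidate-list node, 'depth' is the parser's current element depth):

    public class MatchValidation {
        static boolean levelOk(int level, int depth) {
            // level 0 or -1 leaves the depth unconstrained;
            // otherwise the node matches only at exactly that depth
            return level < 1 || level == depth;
        }
    }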

Processing further elements

- Queries that don't meet validation are removed from the Candidate Lists
- For other queries, we advance to the next state
  - We copy the next node of the query from the WL to the CL, and update the RP and level
- When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
- When we encounter an end element, we must remove that element from the CL

A simpler approach

- Instantiate a DOM tree for each document
- Traverse it and recursively match XPaths
- Pros and cons?
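For comparison, the DOM route fits in a few lines using the JDK's built-in parser and XPath engine (a sketch). It is simple to write, but it parses every document fully and evaluates every subscription separately - exactly the per-query cost that XFilter's shared index avoids.

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import java.io.StringReader;

    public class DomMatcher {
        public static boolean matches(String xml, String xpath) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.BOOLEAN);
        }
    }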

Recap: Publish-subscribe model

- Publish-subscribe model
  - Publishers produce events
  - Each subscriber is interested in a subset of the events
  - Challenge: efficient implementation
- Comparison: XFilter vs. RSS
- XFilter
  - Interests are specified with XPaths (very powerful!)
  - Sophisticated technique for efficiently matching documents against many XPaths in parallel