Web Crawling
Notes by Aisha Walcott
Modeling the Internet and the Web: Probabilistic Methods and Algorithms
Authors: Baldi, Frasconi, Smyth
Outline
- Basic crawling
- Selective crawling
- Focused crawling
- Distributed crawling
- Web dynamics: age/lifetime of documents
- Anchors are very useful in search engines; they are the text "on top" of a link on a webpage, e.g. <a href="URL"> anchor text </a>
- Many topics presented here have pointers to a number of references
Basic Crawling
A simple crawler uses a graph algorithm such as BFS (a sketch follows this slide)
- Maintains a queue, Q, that stores URLs to be fetched
- Two repositories: D stores documents, E stores URLs already seen
- Given S0 (seeds): an initial collection of URLs
Each iteration:
- Dequeue a URL, fetch the document, and parse it for new URLs
- Enqueue new URLs not already in E (the web graph is not acyclic, so visited URLs must be tracked)
Termination conditions:
- Time allotted to crawling has expired
- Storage resources are full
When the crawl stops, Q and D still hold data; anchors pointing to the unvisited URLs in Q can still be used to answer queries (many search engines do this)
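A minimal sketch of such a BFS crawler, assuming Python with the requests and BeautifulSoup libraries; the seed list, page limit, and function name are illustrative, not from the notes.

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def basic_crawl(seeds, max_pages=100):
        """BFS crawl: Q holds the frontier, D stores documents, E stores seen URLs."""
        Q = deque(seeds)          # frontier queue
        D = {}                    # URL -> document text
        E = set(seeds)            # URLs already seen (the web graph has cycles)

        while Q and len(D) < max_pages:           # termination: page budget exhausted
            url = Q.popleft()
            try:
                resp = requests.get(url, timeout=5)
            except requests.RequestException:
                continue                          # skip unreachable URLs
            D[url] = resp.text
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                new_url = urljoin(url, a["href"]) # resolve relative links
                if new_url not in E:              # enqueue only unseen URLs
                    E.add(new_url)
                    Q.append(new_url)
        return D, Q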
Practical Modifications & Issues
- Time to download a doc is unknown: DNS lookup may be slow; network congestion and connection delays
- Exploit bandwidth: run concurrent fetching threads
- Crawlers should be respectful of servers and not abuse resources at the target site (robots exclusion protocol); see the sketch after this list
- Multiple threads should not fetch from the same server simultaneously or too often: broaden the crawling fringe (more servers) and increase the time between requests to the same server
- Storing Q and D on disk requires careful external-memory management
- Crawlers must avoid aliases and "traps", where the same doc is addressed by many different URLs
- The web is dynamic and changes in topology and content
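A hedged sketch of per-host politeness, assuming Python's standard urllib.robotparser; the minimum delay value, user-agent string, and helper name are illustrative.

    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    MIN_DELAY = 2.0                 # seconds between requests to the same host (illustrative)
    _last_hit = {}                  # host -> time of last request
    _robots = {}                    # host -> parsed robots.txt

    def polite_can_fetch(url, user_agent="notes-crawler"):
        """Check robots.txt and enforce a minimum delay per host before fetching."""
        host = urlparse(url).netloc
        rp = _robots.get(host)
        if rp is None:
            rp = RobotFileParser(f"http://{host}/robots.txt")
            try:
                rp.read()           # fetch and parse robots.txt once per host
            except OSError:
                pass                # unreachable robots.txt: can_fetch() stays conservative
            _robots[host] = rp
        if not rp.can_fetch(user_agent, url):
            return False            # disallowed by the robots exclusion protocol
        elapsed = time.time() - _last_hit.get(host, 0.0)
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)   # wait before hitting the same server again
        _last_hit[host] = time.time()
        return True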
Selective Crawling
- Recognize the relevance or importance of sites and limit fetching to the most important subset
- Define a scoring function for relevance, s_θ^(π)(u), where u is a URL, π is the relevance criterion, and θ is the set of parameters
- E.g. best-first search, using the score to order the queue (a sketch follows)
- Measure efficiency as r_t / t, where t = # pages fetched so far and r_t = # fetched pages with score above a threshold (ideally r_t = t)
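A minimal best-first sketch, assuming Python; the helpers score(u), fetch(u), and parse_links(d) are left abstract, and score(u) stands in for s_θ^(π)(u).

    import heapq

    def best_first_crawl(seeds, score, fetch, parse_links, max_pages=100):
        """Best-first crawl: the frontier is a heap ordered by the relevance score.

        score(u)        -> float in [0, 1]   (stand-in for s_theta^(pi)(u))
        fetch(u)        -> document text, or None on failure
        parse_links(d)  -> iterable of URLs found in document d
        """
        frontier = [(-score(u), u) for u in seeds]   # negate: heapq is a min-heap
        heapq.heapify(frontier)
        seen, fetched = set(seeds), {}

        while frontier and len(fetched) < max_pages:
            neg_s, url = heapq.heappop(frontier)     # highest-scoring URL first
            doc = fetch(url)
            if doc is None:
                continue
            fetched[url] = doc
            for v in parse_links(doc):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-score(v), v))

        # Efficiency r_t / t: fraction of fetched pages scoring above a threshold
        t = len(fetched)
        r_t = sum(1 for u in fetched if score(u) > 0.5)   # 0.5 is an illustrative threshold
        return fetched, (r_t / t if t else 0.0)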
Ex: Scoring Functions (Selective Crawling)
- Depth: limit the # of docs downloaded from a single site by a) setting a threshold, b) depth in the directory tree, or c) limiting path length; maximizes breadth
  s^(depth)(u) = 1 if |root(u) ~> u| < δ, 0 otherwise, where root(u) is the root of the site containing u and |root(u) ~> u| is the path length from the root to u
- Popularity: assign importance to the most popular pages, e.g. a relevance function based on backlinks (links that point to the URL)
  s^(backlinks)(u) = 1 if indegree(u) > τ, 0 otherwise, for some threshold τ
- PageRank: a measure of popularity that recursively assigns each link a weight proportional to the popularity of the doc it comes from
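A sketch of the two binary scores above, assuming Python; the threshold values and the way the site depth is derived from the URL path are illustrative.

    from urllib.parse import urlparse

    def depth_score(u, delta=4):
        """1 if the path from the site root to u is shorter than delta, else 0."""
        path = urlparse(u).path
        depth = len([p for p in path.split("/") if p])   # |root(u) ~> u| as path segments
        return 1 if depth < delta else 0

    def backlink_score(u, indegree, tau=10):
        """1 if u has more than tau known backlinks, else 0.

        indegree: dict mapping URL -> number of links pointing to it,
        accumulated as the crawl discovers links (an assumption of this sketch).
        """
        return 1 if indegree.get(u, 0) > tau else 0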
Focused Crawling
- Searches for info related to a certain topic; not driven by generic quality measures
- Approaches: relevance prediction, context graphs, reinforcement learning
- Examples: CiteSeer; the Fish search algorithm, where agents accumulate energy for relevant docs and consume energy for network resources (a sketch follows)
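A toy sketch of the fish-energy idea mentioned above, assuming Python; the energy amounts and the helpers fetch, parse_links, and is_relevant are illustrative, not from the notes.

    from collections import deque

    def fish_crawl(seeds, fetch, parse_links, is_relevant,
                   start_energy=3, gain=2, cost=1, max_pages=100):
        """Fish-style crawl: each frontier URL carries energy; relevant docs add
        energy to the offspring links, every fetch costs energy, and exhausted
        branches die out."""
        frontier = deque((u, start_energy) for u in seeds)
        seen, fetched = set(seeds), {}

        while frontier and len(fetched) < max_pages:
            url, energy = frontier.popleft()
            doc = fetch(url)
            if doc is None:
                continue
            fetched[url] = doc
            # Pay for the network resources used, gain energy if the doc is relevant
            child_energy = energy - cost + (gain if is_relevant(doc) else 0)
            if child_energy <= 0:
                continue                      # this branch of the crawl dies out
            for v in parse_links(doc):
                if v not in seen:
                    seen.add(v)
                    frontier.append((v, child_energy))
        return fetched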
Relevance Prediction (Focused Crawling)
- Define a score as the conditional probability that a doc is relevant given the text in the doc: s^(topic)_θ(u) = P(c | d(u), θ), where c is the topic of interest, θ are the adjustable parameters of the classifier, and d(u) is the content of the doc at vertex u
- Strategies for approximating the topic score before u is fetched:
  - Parent-based: score a fetched doc and extend the score to all URLs in that doc, i.e. use P(c | d(v), θ) where v is a parent of u ("topic locality")
  - Anchor-based: use only the anchor text d(v, u) where the link to u appears, i.e. P(c | d(v, u), θ) ("semantic linkage")
- E.g. a naïve Bayes classifier trained on relevant docs (a sketch follows)
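A minimal parent-based scoring sketch, assuming Python with scikit-learn; the tiny training set, the labels, and the choice of a multinomial naïve Bayes model over bag-of-words counts are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Illustrative training set: docs labeled 1 if on-topic (class c), 0 otherwise
    train_docs = ["reinforcement learning agents rewards",
                  "web crawler frontier queue politeness",
                  "cake recipe with chocolate frosting",
                  "football match results and league table"]
    train_labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)        # bag-of-words counts
    clf = MultinomialNB().fit(X, train_labels)      # theta: the fitted NB parameters

    def topic_score_parent_based(parent_text):
        """Approximate P(c | d(v), theta): score every URL found in the parent doc
        with the probability that the parent's text is on topic."""
        x = vectorizer.transform([parent_text])
        return clf.predict_proba(x)[0, 1]           # probability of class c

    # The same classifier can be applied anchor-based, i.e. to the anchor text d(v, u)
    print(topic_score_parent_based("learning a crawling policy with rewards"))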
Context Graphs (Focused Crawling)
- Take advantage of knowledge of internet topology
- Train a machine learning system to predict "how far" relevant info can be expected to be found
- E.g. a 2-layer context graph: a layered graph built around node u
- After training, predict the layer a new doc belongs to, indicating the # of links to follow before relevant info is reached (a sketch follows)
[Figure: 2-layer context graph centered on node u, with Layer 1 and Layer 2]
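A hedged sketch of turning a layer prediction into crawl priority, assuming Python with scikit-learn; the (doc text, layer) training pairs and the choice of TF-IDF plus naïve Bayes are illustrative assumptions, not the book's exact construction.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Illustrative training data: documents labeled with their context-graph layer
    # (0 = target/relevant doc, 1 = one link away, 2 = two links away)
    layer_docs = ["deep learning survey neural networks",      # layer 0
                  "machine learning course reading list",       # layer 1
                  "university computer science department"]     # layer 2
    layer_labels = [0, 1, 2]

    vec = TfidfVectorizer()
    layer_clf = MultinomialNB().fit(vec.fit_transform(layer_docs), layer_labels)

    def crawl_priority(doc_text):
        """Predict the layer of a fetched doc; a smaller layer means relevant info
        is expected fewer links away, so its outlinks get higher priority."""
        layer = int(layer_clf.predict(vec.transform([doc_text]))[0])
        return layer          # use as the key of a min-priority frontier queue

URLs extracted from a doc predicted to be in layer 0 or 1 would then be expanded before those found on docs predicted to be in deeper layers.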
Reinforcement Learning (Focused Crawling)
- Immediate reward when the crawler downloads a relevant doc
- A policy learned by RL can guide the agent toward high long-term cumulative reward
- Internal state of the crawler: the sets of fetched and discovered URLs
- Actions: fetching a URL in the queue of discovered URLs
- The state space is too large to handle exactly, so approximations are needed (a sketch follows)
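A toy sketch of one possible approximation, assuming Python: the huge crawl state is collapsed to words in a link's anchor text, and values are updated with a Q-learning-style rule. This is only an illustration of the idea, not the book's algorithm; the features, learning rate, and discount are assumptions.

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount (illustrative)
    weights = defaultdict(float)      # word -> learned contribution to the value

    def url_features(anchor_text):
        """Collapse the crawl state down to words in the anchor text (an assumption)."""
        return anchor_text.lower().split()

    def value(anchor_text):
        """Approximate long-term value of fetching the URL behind this anchor."""
        feats = url_features(anchor_text)
        return sum(weights[w] for w in feats) / max(len(feats), 1)

    def update(anchor_text, reward, successor_anchors):
        """Q-learning-style update: immediate reward (1 if the fetched doc was
        relevant) plus the discounted best value among the newly discovered links."""
        best_next = max((value(a) for a in successor_anchors), default=0.0)
        td_error = reward + GAMMA * best_next - value(anchor_text)
        for w in url_features(anchor_text):
            weights[w] += ALPHA * td_error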
Distributed Crawling
- Scale the system by "divide and conquer": several crawlers run in parallel
- Want to minimize overlap (the same page fetched by more than one crawler)
- Characterize the interaction between crawlers along three dimensions: coordination, confinement, partitioning
Coordination (Distributed Crawling)
- The way different crawlers agree about the subset of pages each of them is responsible for
- If two crawlers are completely independent, overlap is controlled only by choosing different seeds (URLs)
- It is hard to compute the partition of seeds that minimizes overlap
- Instead, partition the web into subgraphs; each crawler is responsible for fetching docs from its own subgraph
- The partition is static or dynamic based on whether or not it changes during crawling (static crawlers are more autonomous; with a dynamic partition, crawlers are subject to reassignment by an external coordinator)
Confinement (Distributed Crawling)
- Assumes static coordination; defines how strictly each crawler should operate within its own partition
- What happens when a crawler pops "foreign" URLs from its queue (URLs belonging to another partition)?
- Three suggested modes (see the sketch after this list):
  - Firewall: never follow inter-partition links; poor coverage
  - Crossover: follow foreign links when Q has no more local URLs; good coverage, potentially high overlap
  - Exchange: never follow inter-partition links, but periodically communicate foreign URLs to the responsible crawler(s); no overlap, potentially perfect coverage, but extra bandwidth
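A sketch of how a single crawler might treat a popped URL under the three modes, assuming Python; the mode names are from the notes, but the function shape, my_partition predicate, and exchange outbox are illustrative.

    def handle_url(url, mode, my_partition, local_q, outbox):
        """Decide what to do with a URL popped from the frontier.

        my_partition(url) -> True if this crawler is responsible for url
        local_q           -> this crawler's frontier of local URLs
        outbox            -> URLs to ship to the responsible crawler (exchange mode)
        Returns True if the URL should be fetched now.
        """
        if my_partition(url):
            return True                          # local URL: always fetch
        if mode == "firewall":
            return False                         # drop inter-partition links
        if mode == "crossover":
            return len(local_q) == 0             # follow foreign links only when idle
        if mode == "exchange":
            outbox.append(url)                   # hand off to the responsible crawler
            return False
        raise ValueError(f"unknown mode: {mode}")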
Partitioning (Distributed Crawling)
- The strategy used to split URLs into non-overlapping subsets, each assigned to one crawler
- E.g. a hash function of the host's IP address assigning it to a crawler (a sketch follows)
- Can take geographical dislocation into account (assign hosts to a nearby crawler)
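A minimal hash-partitioning sketch, assuming Python; hashing the hostname rather than a resolved IP address, and using a stable md5-based hash, are illustrative choices.

    import hashlib
    from urllib.parse import urlparse

    def assign_crawler(url, num_crawlers):
        """Map a URL's host to one of num_crawlers partitions.

        Hashing the host (not the full URL) keeps every page of a site in the
        same partition, so per-site politeness stays local to one crawler.
        A stable hash (md5) is used so all crawlers agree on the assignment.
        """
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_crawlers

    # Example: every URL on the same host lands in the same partition
    print(assign_crawler("http://example.org/a.html", 4))
    print(assign_crawler("http://example.org/b/c.html", 4))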
Web Dynamics
- How info on the web changes over time
- A search engine with a collection of docs is (α, β)-current if the probability that a doc is β-current is at least α (β is the "grace period")
- E.g. how many docs must be refreshed per day to be (0.9, 1 week)-current?
- Assume changes in the web are random and independent; model them as a Poisson process (a sketch follows)
- "Dot com" sites are much more dynamic than "dot edu" sites
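A sketch of the β-currency probability under the Poisson change model, assuming Python; averaging over a fixed refresh interval is an illustrative way to check (α, β)-currency, not the book's exact derivation, and the numbers at the end are made up.

    import math

    def prob_beta_current(t_since_fetch, rate, beta):
        """P(doc is beta-current t time units after its last fetch).

        The doc is beta-current iff no change occurred in the last
        max(0, t - beta) time units; with a Poisson change process of the given
        rate, that happens with probability exp(-rate * max(0, t - beta)).
        """
        return math.exp(-rate * max(0.0, t_since_fetch - beta))

    def avg_currency_over_cycle(refresh_interval, rate, beta, steps=1000):
        """Average beta-currency of a doc re-fetched every refresh_interval days."""
        times = [refresh_interval * (i + 0.5) / steps for i in range(steps)]
        return sum(prob_beta_current(t, rate, beta) for t in times) / steps

    # Illustrative numbers: a doc changing on average every 50 days (rate = 1/50),
    # a one-week grace period, and a refresh every 30 days; compare the result to alpha = 0.9
    print(avg_currency_over_cycle(refresh_interval=30, rate=1 / 50, beta=7))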
Lifetime and Aging of Documents
Model based on reliability theory from industrial engineering
[Table: cdf and pdf expressions for document lifetime and age]
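A hedged sketch of what the table presumably summarized: under the Poisson change model with rate λ, the lifetime T of a document (time until its next change) is exponential, and the expected age of a stored copy follows from the same assumption.

    % Exponential lifetime under a Poisson change process with rate \lambda
    F_T(t) = P(T \le t) = 1 - e^{-\lambda t}        % cdf of the lifetime T
    f_T(t) = \lambda e^{-\lambda t}                 % pdf of the lifetime T
    E[T]   = 1/\lambda                              % mean time between changes
    % Expected age of a stored copy, t time units after its last refresh:
    E[A(t)] = \int_0^t (t - s)\,\lambda e^{-\lambda s}\,ds
            = t - \frac{1 - e^{-\lambda t}}{\lambda}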