Modeling the Internet and the Web
School of Information and Computer Science, University of California, Irvine
Advanced Crawling Techniques (Chapter 6)

Outline
– Selective Crawling
– Focused Crawling
– Distributed Crawling
– Web Dynamics

Web Crawler
A program that autonomously navigates the web and downloads documents.
A simple crawler:
– start with a seed URL S0
– download all pages reachable from S0
– repeat the process for each new page
– until a sufficient number of pages has been retrieved
An ideal crawler:
– recognizes relevant pages
– limits fetching to the most relevant pages

Nature of the Crawl
Broadly categorized into:
– Exhaustive crawl: broad coverage; used by general-purpose search engines
– Selective crawl: fetch pages according to some criterion, e.g. popular pages or pages similar to a given topic; exploits semantic content and rich contextual aspects

Selective Crawling
Retrieve web pages according to some criterion.
Page relevance is determined by a scoring function s_θ^(ξ)(u), where:
– ξ is the relevance criterion
– θ are its parameters
For example, a Boolean relevance function:
– s(u) = 1 if the document is relevant
– s(u) = 0 if the document is irrelevant

Selective Crawler
Basic approach (see the sketch below):
– sort the fetched URLs according to their relevance score
– use best-first search to obtain pages with a high score first
– the search is then steered towards the most relevant pages
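The best-first strategy maps naturally onto a priority queue. The following is a minimal Python sketch, assuming a placeholder score(url) relevance function and a crude regex-based link extractor (both illustrative, not from the chapter); a production crawler would add politeness delays, robots.txt handling, and a real HTML parser.

```python
import heapq
import re
import urllib.request
from urllib.parse import urljoin

def score(url):
    # Placeholder relevance function s(u): here a toy depth-based score,
    # so shallower URLs get a higher score and are fetched first.
    return -url.count("/")

def fetch(url):
    # Download one page; failures simply yield an empty document.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return ""

def extract_links(base_url, html):
    # Very rough href extraction, good enough for a sketch.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def selective_crawl(seeds, max_pages=100):
    frontier = [(-score(u), u) for u in seeds]      # negate: heapq is a min-heap
    heapq.heapify(frontier)
    visited, pages = set(), {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)            # best-first: highest score next
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return pages

# Example (network access required):
# pages = selective_crawl(["http://example.com/"], max_pages=10)
```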

Examples of Scoring Functions
Depth:
– the length of the path from the site homepage to the document
– limit the total number of levels retrieved from a site
– maximizes coverage breadth
Popularity:
– assign relevance according to which pages are more important than others
– e.g. estimate the number of backlinks

Examples of Scoring Functions
PageRank:
– assigns a value of importance to each page
– the value is proportional to the popularity of the source document
– it is estimated by a measure of the indegree of a page
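For illustration, a minimal power-iteration sketch of PageRank on a small link graph; the damping factor of 0.85 and the toy graph are assumptions for the example, not values from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping every page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if not outlinks:                      # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Toy graph (hypothetical): c has the highest indegree and ends up with the highest rank.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```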

Efficiency of a Selective Crawler
[Figure: crawling efficiency compared for a BFS crawler, a crawler using backlinks, and a crawler using PageRank]

Focused Crawling
Fetch pages within a certain topic.
Relevance function:
– use text categorization techniques
– s_θ^(topic)(u) = P(c | d(u), θ)
Parent-based method:
– the score of the parent is extended to the children URLs
Anchor-based method:
– the anchor text is used for scoring pages

Focused Crawler
Basic approach:
– classify crawled pages into categories
– use a topic taxonomy, provide example URLs, and mark the categories of interest
– use a Bayesian classifier to estimate P(c | p)
– compute a relevance score for each page: R(p) = Σ_{c ∈ good} P(c | p)
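A sketch of the relevance computation R(p), assuming a classifier object that returns a mapping from categories to P(c | p); the keyword-count "classifier" and the category names below are illustrative stand-ins, not the Bayesian classifier described in the chapter.

```python
def relevance(page_text, classifier, good_categories):
    # R(p) = sum of P(c | p) over the categories marked as good.
    probs = classifier.predict_proba(page_text)
    return sum(p for c, p in probs.items() if c in good_categories)

class ToyClassifier:
    """Stand-in classifier: keyword counts normalized into pseudo-probabilities."""
    def __init__(self, keywords_by_category):
        self.keywords = keywords_by_category

    def predict_proba(self, text):
        text = text.lower()
        counts = {c: sum(text.count(k) for k in kws) + 1e-9
                  for c, kws in self.keywords.items()}
        total = sum(counts.values())
        return {c: v / total for c, v in counts.items()}

clf = ToyClassifier({"sports": ["football", "match"], "finance": ["stock", "bond"]})
print(relevance("the stock market and bond yields", clf, good_categories={"finance"}))
```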

Focused Crawler
Soft focusing:
– compute the score of a fetched document v
– extend that score to all URLs u extracted from it: s_θ^(topic)(u) = P(c | d(v), θ)
– if the same URL is fetched from multiple parents, update s(u)
Hard focusing:
– for a crawled page d, find the leaf node c* with the highest probability
– if c* or some ancestor of c* is marked as good, extract the URLs from d
– otherwise the crawl is pruned at d
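A sketch of the hard-focusing decision, assuming a taxonomy stored as a child-to-parent mapping and a classifier (such as the toy one in the previous sketch) whose categories are the taxonomy leaves; all names here are hypothetical.

```python
def ancestors(node, parent_of):
    # Walk up the taxonomy from a leaf towards the root (the root maps to None).
    while node is not None:
        yield node
        node = parent_of.get(node)

def should_expand(page_text, classifier, parent_of, good_nodes):
    """Hard focusing: extract links only if c* or one of its ancestors is marked good."""
    probs = classifier.predict_proba(page_text)        # {leaf category: P(c | p)}
    best_leaf = max(probs, key=probs.get)              # c*
    return any(node in good_nodes for node in ancestors(best_leaf, parent_of))

# Example with a two-level toy taxonomy:
# parent_of = {"soccer": "sports", "tennis": "sports", "sports": None}
# should_expand(text, clf, parent_of, good_nodes={"sports"})
```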

Efficiency of a Focused Crawler

Context-Focused Crawlers
Classifiers are trained:
– to estimate the link distance between a crawled page and the relevant pages
– using a context graph of L layers for each seed page

Context Graphs
– the seed page forms layer 0
– layer i contains all the parents of the nodes in layer i − 1

Context Graphs
To compute the relevance function:
– a set of Naïve Bayes classifiers is built, one for each layer
– compute P(t | c_l) from the pages in each layer
– compute P(c_l | p) for a new page p
– the class with the highest probability is assigned to the page
– if P(c_l | p) falls below a given threshold, the page is assigned to the 'other' class

Context Graphs
– maintain a queue for each layer
– sort each queue by the probability scores P(c_l | p)
– for its next URL, the crawler picks the top page from the non-empty queue with the smallest l
– this retrieves first the pages that are expected to be closer to the relevant pages
– and explores the outlinks of such pages (see the sketch below)
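A sketch of the layered frontier, assuming pages have already been assigned a layer and a probability P(c_l | p) by the classifiers; the data structures are illustrative only.

```python
import heapq

class LayeredFrontier:
    """One priority queue per context-graph layer; always pop from the smallest non-empty layer."""
    def __init__(self, num_layers):
        self.queues = [[] for _ in range(num_layers)]

    def push(self, url, layer, prob):
        # Negate the probability: heapq is a min-heap, we want the highest P(c_l | p) first.
        heapq.heappush(self.queues[layer], (-prob, url))

    def pop(self):
        for queue in self.queues:                 # smallest layer index wins
            if queue:
                neg_prob, url = heapq.heappop(queue)
                return url, -neg_prob
        raise IndexError("frontier is empty")

frontier = LayeredFrontier(num_layers=3)
frontier.push("http://example.org/a", layer=2, prob=0.4)
frontier.push("http://example.org/b", layer=1, prob=0.9)
print(frontier.pop())   # ('http://example.org/b', 0.9): layer 1 is served before layer 2
```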

Reinforcement Learning
Learning which actions yield the maximum reward.
To maximize rewards, the learning agent:
– reuses previously tried actions that produced effective rewards
– explores better action selections in the future
Properties:
– trial-and-error method
– delayed rewards

Elements of Reinforcement Learning
Policy π(s, a):
– the probability of taking action a in state s
Reward function r(a):
– maps state-action pairs to a single number
– indicates the immediate desirability of the state
Value function V^π(s):
– indicates the long-term desirability of states
– takes into account the states that are likely to follow and the rewards available in those states

Reinforcement Learning
The optimal policy π* maximizes the value function over all states.
LASER uses reinforcement learning for indexing web pages:
– for a user query, determine relevance using TF-IDF
– propagate rewards into the web graph, discounting them at each step, by value iteration
– after convergence, documents at distance k from u contribute γ^k times their relevance (γ being the discount factor) to the relevance of u
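A sketch of the general idea of discounted relevance propagation over the link graph; this is a generic value-iteration-style loop, not the actual LASER implementation, and the graph, base scores and discount factor γ are assumptions.

```python
def propagate_relevance(graph, base_score, gamma=0.5, iterations=20):
    """graph: dict page -> list of pages it links to.
    base_score: dict page -> immediate (e.g. TF-IDF) relevance for the query."""
    value = dict(base_score)
    for _ in range(iterations):
        new_value = {}
        for page, outlinks in graph.items():
            # A page's value is its own relevance plus the discounted value of the
            # pages it links to, so documents k links away end up contributing
            # roughly gamma**k times their relevance.
            new_value[page] = base_score.get(page, 0.0) + gamma * sum(
                value.get(q, 0.0) for q in outlinks)
        value = new_value
    return value

graph = {"u": ["v"], "v": ["w"], "w": []}
print(propagate_relevance(graph, {"u": 0.0, "v": 0.0, "w": 1.0}))
# {'u': 0.25, 'v': 0.5, 'w': 1.0}: w's relevance reaches u discounted by gamma**2
```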

Fish Search
Web agents behave like fish in the sea:
– they gain energy when a relevant document is found
– agents then search for more relevant documents
– they lose energy when exploring irrelevant pages
Limitations:
– assigns discrete relevance scores: 1 for relevant, 0 or 0.5 for irrelevant
– low discrimination of the priority of pages

Shark Search Algorithm
Introduces real-valued relevance scores based on:
– the ancestral relevance score
– the anchor text
– the textual context of the link
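A hedged sketch of how such a real-valued score might combine the three signals; the decay and weighting constants below are illustrative, not the constants of the published shark-search algorithm.

```python
def shark_score(parent_score, anchor_sim, context_sim, decay=0.5, anchor_weight=0.8):
    """Combine ancestral, anchor-text and link-context evidence into one priority score.
    parent_score: inherited relevance of the page the link was found on.
    anchor_sim / context_sim: similarity of the anchor text / surrounding text to the query."""
    inherited = decay * parent_score
    neighborhood = anchor_weight * anchor_sim + (1.0 - anchor_weight) * context_sim
    return inherited + (1.0 - decay) * neighborhood

# A link with relevant anchor text found on a moderately relevant parent page:
print(shark_score(parent_score=0.6, anchor_sim=0.9, context_sim=0.3))
```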

Distributed Crawling
A single crawling process is:
– insufficient for large-scale engines
– limited by data being fetched through a single physical link
Distributed crawling offers:
– a scalable system
– divide and conquer
– decreased hardware requirements
– increased overall download speed and reliability

Parallelization
– physical links reflect geographical neighborhoods
– edges of the Web graph are associated with "communities" that may span geographical borders
– hence, significant overlap among the collections of fetched documents is possible
Performance measures for parallelization:
– communication overhead
– overlap
– coverage
– quality

Performance of Parallelization
Communication overhead:
– the fraction of bandwidth spent coordinating the activity of the separate processes, relative to the bandwidth usefully spent fetching documents
Overlap:
– the fraction of duplicate documents
Coverage:
– the fraction of documents reachable from the seeds that are actually downloaded
Quality:
– e.g. some scoring functions depend on link structure, which can be partially lost

Crawler Interaction
A study by Cho and Garcia-Molina (2002):
– defines a framework to characterize the interaction among a set of crawlers
– along several dimensions: coordination, confinement, partitioning

Coordination
The way different processes agree about the subset of pages each should crawl.
Independent processes:
– the degree of overlap is controlled only by the seeds
– significant overlap is expected
– picking good seed sets is a challenge
Coordinating a pool of crawlers:
– partition the Web into subgraphs
– static coordination: the partition is decided before crawling and not changed thereafter
– dynamic coordination: the partition is modified during crawling (the reassignment policy must be controlled by an external supervisor)

Confinement
Specifies how strictly each (statically coordinated) crawler should operate within its own partition.
Firewall mode:
– each process remains strictly within its partition
– zero overlap, but poor coverage
Crossover mode:
– a process follows inter-partition links when its queue contains no more URLs in its own partition
– good coverage, but potentially high overlap
Exchange mode:
– a process never follows inter-partition links
– it periodically dispatches the foreign URLs to the appropriate processes
– no overlap and perfect coverage, at the cost of communication overhead (see the sketch below)
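A sketch of exchange-mode link handling, assuming a partition_of(url) assignment function (such as the hash-based one sketched after the Partitioning slide below) and a per-process outbox; the messaging layer is only hinted at.

```python
from collections import defaultdict

def handle_links(my_id, links, partition_of, frontier, outboxes):
    """Exchange mode: keep links in our own partition, buffer foreign ones for dispatch."""
    for url in links:
        owner = partition_of(url)
        if owner == my_id:
            frontier.append(url)              # crawl it ourselves
        else:
            outboxes[owner].append(url)       # never follow it; hand it over later

def dispatch(outboxes, send):
    """Periodically flush buffered foreign URLs to the processes that own them."""
    for owner, urls in outboxes.items():
        if urls:
            send(owner, list(urls))           # e.g. over a message queue
            urls.clear()

# Toy usage with a trivial placeholder partition function:
frontier, outboxes = [], defaultdict(list)
handle_links(0, ["http://a.example/x", "http://b.example/y"],
             partition_of=lambda u: 0 if "a.example" in u else 1,
             frontier=frontier, outboxes=outboxes)
dispatch(outboxes, send=lambda owner, urls: print("dispatch to process", owner, urls))
```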

Crawler Coordination
Let A_ij be the set of documents belonging to partition i that can be reached from the seeds S_j.

Partitioning
A strategy to split URLs into non-overlapping subsets to be assigned to each process:
– compute a hash function of the IP address in the URL, e.g. if n ∈ {0, …, 2^32 − 1} is the integer corresponding to the IP address and m is the number of processes, documents with n mod m = i are assigned to process i
– take into account the geographical dislocation of networks
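A minimal sketch of the hash-style assignment described above, using the host's resolved IPv4 address as the integer n; DNS resolution is required, so the example call is left commented out.

```python
import socket
import struct
from urllib.parse import urlparse

def ip_as_int(url):
    """Resolve the URL's host and pack its IPv4 address into an integer n in {0, ..., 2**32 - 1}."""
    host = urlparse(url).hostname
    packed = socket.inet_aton(socket.gethostbyname(host))
    return struct.unpack("!I", packed)[0]

def assign_process(url, m):
    """Assign the URL to crawler process i = n mod m."""
    return ip_as_int(url) % m

# Example (needs network access for the DNS lookup):
# print(assign_process("http://www.example.com/page.html", m=4))
```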

Web Dynamics
The rate of change of information on the Web:
– used by search engines for updating their index
The notion of recency proposed by Brewington and Cybenko (2000):
– the index entry for a document indexed at time t_0 is β-current at time t if the document has not changed during [t_0, t − β], where β is a grace period
– a search engine for a given collection is (α, β)-current if the probability that a document is β-current is at least α

Lifetime of Documents
T, the lifetime of a component:
– the time at which the component breaks down
– a continuous random variable
The cumulative distribution function (cdf) of the lifetime: F(t) = P(T ≤ t)
Reliability (survivorship function): S(t) = 1 − F(t), the probability that the component is still functioning at time t

Aging of Components
Age of a component: the time elapsed since the last replacement.
The cdf of age gives the expected fraction of components that are still operating at time t.
The age probability density function (pdf) g(t) is proportional to the survivorship function S(t).

Lifetime vs. Age

Aging of Documents
– S(t) is the probability that a document last changed at time zero remains unmodified at time t
– G(t) is the expected fraction of documents that are older than t
– the probability that a document will be modified before an additional time h has passed is the conditional probability P(t < T ≤ t + h | T > t)
– the change rate is λ(t) = f(t) / S(t), where f(t) is the lifetime pdf

Age and Lifetime on the Web
Assuming a constant change rate λ, the lifetime can be estimated from the observed age:
– F(t) = 1 − e^(−λt), g(t) = f(t) = λ e^(−λt)
– use the Last-Modified timestamp in the HTTP header
Brewington and Cybenko (2000):
– 7 million Web pages
– observed between 1999 and 2000
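A numeric sketch of the estimation step under the constant-change-rate assumption: for an exponential model the mean observed age estimates 1/λ, so λ is the reciprocal of the mean age. The ages below are hypothetical values in days.

```python
import math

def estimate_change_rate(ages):
    """Maximum-likelihood estimate of lambda under the exponential model: 1 / mean age."""
    return len(ages) / sum(ages)

def prob_changed_within(t, lam):
    """F(t) = 1 - exp(-lambda * t): probability that the document changes within time t."""
    return 1.0 - math.exp(-lam * t)

ages_in_days = [3.0, 10.0, 25.0, 60.0, 2.0]      # hypothetical Last-Modified ages
lam = estimate_change_rate(ages_in_days)
print("estimated lambda:", lam, "changes per day")
print("P(change within 7 days):", prob_changed_within(7.0, lam))
```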

Empirical Distributions

Growth with Time
An essential property of the Web, its growth over time, is not captured by the model:
– most documents are young
Suppose an exponential distribution for growth:
– if documents were never edited, their age would be the time since their creation
– this trivial growth model already yields an exponential age distribution
A realistic model should take into account both Web growth and document refreshing:
– use a hybrid model

Meaningful Timestamps
Servers often do not return meaningful timestamps:
– especially for dynamic Web pages
Thus, change rates are better estimated using lifetimes rather than ages.
Brewington and Cybenko (2000):
– an improved model that explicitly takes into account the probability of observing a change, given the change rate and the timespan between observations

Improved Model
Document changes are governed by an underlying Poisson process:
– the probability of observing a change is independent of previous changes
Mean lifetimes are Weibull distributed.
Change rates and timespans are independent.
The resulting lifetime distribution is a mixture of exponentials, weighting each change rate λ by the Weibull density w(1/λ) over the estimated mean lifetimes.

Estimated Mean Lifetime Distribution

Refresh Interval
Estimate how often a crawler should refresh the index of a search engine to guarantee that it remains (α, β)-current.
– consider a single document
– let t = 0 be the time the document was last fetched
– let I be the interval between two consecutive visits

Refresh Interval
The probability that at a particular time t the document is unmodified in [0, t − β]:
– e^(−λ(t − β)) if t > β, and 1 if t ≤ β (within the grace period the entry cannot be out of date)
The probability that a collection of documents is β-current is obtained by averaging this quantity over t, uniform in the revisit interval [0, I], and over the distribution of change rates λ in the collection.
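A numeric sketch for a single document under the exponential model: the closed-form average below comes from integrating e^(−λ(t − β)) over t uniform in [0, I], and a bisection search then finds the largest revisit interval that keeps the document β-current with probability at least α. The change rate, grace period and target α are hypothetical.

```python
import math

def beta_currency(lam, I, beta):
    """Probability that a document with change rate lam is beta-current when it is
    revisited every I time units (t taken uniform in [0, I])."""
    if I <= beta:
        return 1.0
    # (1 / I) * [ beta + integral from beta to I of exp(-lam * (t - beta)) dt ]
    return (beta + (1.0 - math.exp(-lam * (I - beta))) / lam) / I

def max_refresh_interval(lam, beta, alpha, hi=10000.0, tol=1e-6):
    """Largest I (found by bisection) such that beta_currency(lam, I, beta) >= alpha."""
    lo = beta
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if beta_currency(lam, mid, beta) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo

lam = 1.0 / 30.0     # hypothetical: the page changes on average once every 30 days
print(beta_currency(lam, I=18.0, beta=7.0))
print(max_refresh_interval(lam, beta=7.0, alpha=0.95))
```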

Example
Brewington and Cybenko (2000):
– a reindexing period of about 18 days
– assuming a Web size of 800 million pages
– to guarantee that 95% of the repository is current up to one week ago (i.e. α = 0.95, β = one week)

Freshness
Another measure of recency.
The freshness φ(t) of a document at time t:
– a binary function indicating whether the document is up-to-date in the index at time t
Expected freshness: the probability that the document did not change in (0, t], E[φ(t)] = e^(−λt)
– corresponds to β-currency with β = 0
If document d is refreshed regularly every I time units, the average freshness is (1 − e^(−λI)) / (λI).
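A small sketch of the average-freshness formula, comparing a few refresh intervals for a hypothetical change rate.

```python
import math

def average_freshness(lam, I):
    """Mean of e^(-lambda * t) over one refresh interval of length I."""
    return (1.0 - math.exp(-lam * I)) / (lam * I)

lam = 1.0 / 30.0                          # page changes on average once every 30 days
for I in (1.0, 7.0, 30.0):
    print("refresh every", I, "days -> average freshness", round(average_freshness(lam, I), 3))
```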

Index Age
The index age a(t) of a document is the age of the document if the local copy is outdated, or zero if the local copy is fresh:
– e.g. if the document was modified at time s ∈ (0, t), its index age is t − s
Expected index age at time t: E[a(t)] = t − (1 − e^(−λt)) / λ
Index age averaged over (0, I): ā = I/2 − 1/λ + (1 − e^(−λI)) / (λ² I)

Recency and Synchronization Policies
Not all sites change their pages at the same rate.
Cho and Garcia-Molina (1999):
– monitored about 720,000 popular Web pages
– a vast fraction of this portion of the Web is dynamic
– dramatically different results across top-level domains

Study Results
Average change interval found in the study:
– 270 popular sites were monitored for changes from 17 February to 24 June
– a sample of 3,000 pages was collected from each site by visiting pages in breadth-first order from the homepage

Optimal Synchronization Policy
The resource allocation policy should take site-specific dynamics into account.
Simplifying assumptions:
– N documents of interest
– estimated change rates λ_i, i = 1, …, N
– a crawler regularly fetches each document i with a refresh interval I_i
– each document is fetched in constant time
– B: the available bandwidth, i.e. the number of documents that can be fetched in a time unit

Optimal Resource Allocation
Limited bandwidth constraint: Σ_{i=1..N} 1/I_i = B
The problem of optimal resource allocation:
– select the refresh intervals I_i to maximize a recency measure of the resulting index
– e.g. maximize the average freshness Σ_i φ̄(λ_i, I_i) subject to the bandwidth constraint (see the sketch below)
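A small numeric sketch of the allocation problem: spread the total fetch frequency B over the documents by repeatedly giving a small slice of bandwidth to the page whose average freshness improves the most. This greedy heuristic is only an illustration, not the closed-form optimal policy; the change rates and budget are hypothetical.

```python
import math

def avg_freshness(lam, freq):
    """Average freshness of a page with change rate lam refreshed freq times per time unit."""
    if freq <= 0.0:
        return 0.0
    I = 1.0 / freq
    return (1.0 - math.exp(-lam * I)) / (lam * I)

def allocate(change_rates, bandwidth, steps=1000):
    """Greedy marginal-gain allocation of the total fetch frequency (sum over i of 1/I_i = B)."""
    freqs = [0.0] * len(change_rates)
    delta = bandwidth / steps
    for _ in range(steps):
        # Give the next slice of bandwidth to the page whose freshness gains the most.
        gains = [avg_freshness(lam, f + delta) - avg_freshness(lam, f)
                 for lam, f in zip(change_rates, freqs)]
        freqs[gains.index(max(gains))] += delta
    return [1.0 / f if f > 0 else float("inf") for f in freqs]   # refresh intervals I_i

# Hypothetical change rates (per day) and a budget of 3 fetches per day in total:
print(allocate([1.0, 0.1, 0.01], bandwidth=3.0))
```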

Example: WebFountain (Edwards et al. 2001)
A fully distributed and incremental crawler:
– no central control
– the repository entry of a document is updated as soon as it is fetched
– the crawling process is never regarded as complete
– changes are detected when a document is re-fetched
– documents are grouped by similar rates of change
– the trade-off between re-fetching (freshness) and exploring (coverage) is controlled by optimizing the number of old and new URLs fetched