Bring Order to The Web. Ruey-Lung Hsiao, May 4, 2000.

Coverage of the Web (est. 1 billion total pages)
[Bar chart: estimated Web coverage of eight search engines (FAST, AltaVista, Excite, Northern Light, Google, Inktomi, Go, Lycos), with individual coverage ranging from roughly 6% to 38%. Report date: Feb. 3, 2000]

Focused Crawling
Toward topic-specific web resource discovery: a focused crawler analyzes its crawl boundary to find the links relevant to the topic, avoiding irrelevant regions of the web.
- Ranking: the core of an information retrieval/discovery system
- Classification: another view of ranking
Importance metrics for target pages:
- Similarity to the driving query
- Backlink count
- PageRank, HITS
- Location metrics
  (the above metrics are adopted from ref. 1)
- Estimated Q-value (adopted from ref. 2)
- ...

PageRank (1/3)
Consider a random web surfer who:
- jumps to a random page with probability ε
- with probability (1-ε), follows a random hyperlink on the current page
Transition probability matrix: ε·U + (1-ε)·A
- U: the uniform distribution
- A: the (row-normalized) adjacency matrix
The query-independent ranking is the stationary probability distribution of this Markov chain. (adopted from ref. 6)
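The random-surfer model above can be sketched as power iteration on a small made-up graph (the four-page link structure and ε = 0.15 are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical 4-page web; links[u] lists the pages that page u points to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, eps = 4, 0.15

# A: adjacency matrix with each row normalized by the outlink count,
# so A[u, v] is the probability of following a link from u to v.
A = np.zeros((n, n))
for u, outs in links.items():
    for v in outs:
        A[u, v] = 1.0 / len(outs)

# U: uniform jump distribution; M is the surfer's transition matrix.
U = np.full((n, n), 1.0 / n)
M = eps * U + (1 - eps) * A

# Power iteration toward the stationary distribution (the PageRank vector).
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = r @ M

print(r)  # ranks sum to 1; page 3, with no inlinks, ranks lowest
```

Because M mixes in the uniform jump, the chain is irreducible and the iteration converges regardless of the starting vector.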

PageRank (2/3)
Simplified definition of PageRank:
  R(u) = c · Σ_{v ∈ B_u} R(v) / N_v
- F_u: the set of pages u points to
- B_u: the set of pages that point to u
- N_u = |F_u|: the number of outlinks of u
[Diagram: simplified PageRank calculation. Each page divides its rank evenly among its outlinks: a page with rank 100 and two outlinks passes 50 along each; a page with rank 9 and three outlinks passes 3 along each; a page receiving a 50-link and a 3-link has rank 53.]

PageRank (3/3)
[Diagram: a rank sink. Pages linking only within a closed loop accumulate rank and never redistribute it under the simplified definition.]
Definition of PageRank:
  R'(u) = c · Σ_{v ∈ B_u} R'(v) / N_v + c · E(u)
- E(u): a vector over the web pages that corresponds to a source of rank
- c: a decay factor
Personalized PageRank:
- Aside from solving rank sinks, E turns out to be a powerful parameter for adjusting page ranks.
- Changing E from the uniform distribution to a biased distribution favors a specific topic.
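The effect of the rank source E can be seen numerically on a toy graph containing a rank sink (the five-page link structure and c = 0.85 are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical 5-page web: pages 2, 3, 4 form a closed loop (a rank sink);
# pages 0 and 1 link into the loop but receive no links back.
links = {0: [1, 2], 1: [2], 2: [3], 3: [4], 4: [2]}
n, c = 5, 0.85

A = np.zeros((n, n))
for u, outs in links.items():
    for v in outs:
        A[u, v] = 1.0 / len(outs)

# Without a rank source: R(u) = c * sum_{v in B_u} R(v)/N_v.
# Pages outside the loop lose all of their rank to the sink.
R = np.full(n, 1.0 / n)
for _ in range(200):
    R = c * (R @ A)

# With a uniform rank source E: R'(u) = c * sum_{v in B_u} R'(v)/N_v + c * E(u).
# Every page keeps a share of rank; biasing E toward topic pages instead of
# using the uniform distribution would give a personalized ranking.
E = np.full(n, 1.0 / n)
Rp = np.full(n, 1.0 / n)
for _ in range(200):
    Rp = c * (Rp @ A) + c * E

print(R[:2], Rp[:2])  # without E, pages 0 and 1 end at rank 0; with E they do not
```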

Reinforcement Learning (1/4)
Goal: autonomous agents learn to choose optimal actions to achieve their goals, i.e. they learn a control strategy, or policy, for choosing actions.
Method: use reward (reinforcement) to learn a successful agent function.
Model: at each step the agent perceives a state s_t and a reward r_t from the environment and responds with an action a_t, producing the sequence s0, a0, r0, s1, a1, r1, s2, ...
Goal: learn to choose actions that maximize the discounted cumulative reward r0 + γ·r1 + γ²·r2 + ..., where 0 ≤ γ < 1.
(adopted from ref. 3)

Reinforcement Learning (2/4)
Interaction between agent and environment:
- The agent can perceive a set S of distinct states of its environment.
- The agent has a set A of distinct actions it can perform.
- The environment responds with a reward r_t = r(s_t, a_t).
- The environment produces the succeeding state s_{t+1} = δ(s_t, a_t).
- r and δ are part of the environment and not necessarily known to the agent.
Markov decision process (MDP): the functions r(s_t, a_t) and δ(s_t, a_t) depend only on the current state and action.
Formulating a policy:
- The agent learns π : S → A, selecting the next action a_t based on the state s_t.
- Such a policy should maximize the cumulative value
    Vπ(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... = Σ_{i=0..∞} γ^i · r_{t+i}
- π* = argmax_π Vπ(s), for all s

Reinforcement Learning (3/4)
Example (γ = 0.9): a grid world with goal state G, where the action that enters G earns immediate reward 100 and all other actions earn 0.
[Diagrams: the immediate reward values r(s,a); one optimal policy; the Q(s,a) values (100, 90, 81, 72); the V*(s) values (100, 90, 81).]
Sample value computations:
  V = 100 + 0.9·0 + ... = 100
  V = 0 + 0.9·100 + 0.9²·0 + ... = 90
  V = 0 + 0.9·0 + 0.9²·100 + 0.9³·0 + ... = 81
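The sample values above can be checked by summing the discounted reward sequences directly (a minimal sketch; the helper name is ours):

```python
# A state's value is the discounted sum of the rewards collected along the
# path to G: reward 100 on the step entering G, 0 elsewhere; gamma = 0.9.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

v1 = discounted_return([100])        # one step from G: 100
v2 = discounted_return([0, 100])     # two steps from G: ~90
v3 = discounted_return([0, 0, 100])  # three steps from G: ~81
print(v1, v2, v3)
```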

Reinforcement Learning (4/4): Q Learning
- It is difficult to learn π* : S → A directly, because the training data does not provide examples of the form <s, a>.
- The agent prefers state s1 over s2 whenever V*(s1) > V*(s2).
- The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a) plus the value V* of the immediate successor state, discounted by γ:
    π*(s) = argmax_a [ r(s, a) + γ·V*(δ(s, a)) ]
- The corresponding measure Q:
    Q(s, a) = r(s, a) + γ·V*(δ(s, a))  =>  π*(s) = argmax_a Q(s, a)
- Relation between Q and V*: V*(s) = max_{a'} Q(s, a')
- Estimate the Q-value iteratively: Q'(s, a) ← r + γ·max_{a'} Q'(s', a')
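The iterative update converges quickly on a tiny deterministic chain (this two-state stand-in for the slide's grid world, and every name in it, is our own illustration):

```python
# Tabular Q-learning on a minimal deterministic chain s0 -> s1 -> s2,
# where entering the goal s2 earns reward 100 and ends the episode.
GAMMA = 0.9
GOAL = 2

def step(s):
    """Deterministic transition; reward 100 only on entering the goal."""
    s_next = s + 1
    return s_next, 100 if s_next == GOAL else 0

# One action per state, so the Q table is just one value per non-goal state.
Q = [0.0, 0.0]

for _ in range(10):  # a few episodes suffice here
    s = 0
    while s != GOAL:
        s_next, r = step(s)
        # Iterative update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        best_next = 0.0 if s_next == GOAL else Q[s_next]
        Q[s] = r + GAMMA * best_next
        s = s_next

print(Q)  # converges to the slide's values: Q(s1) = 100, Q(s0) = 0.9 * 100 = 90
```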

Efficient Web Spidering (1/3)
Paper: "Using Reinforcement Learning to Spider the Web Efficiently" (ICML '98), part of the Cora project for building domain-specific search engines containing computer science research papers. A series of related papers appeared in the AAAI-99 symposium on IA and at IJCAI '99.
Demo site: http://www.cora.justresearch.com
Why use reinforcement learning?
- The performance of a topic-specific spider is measured in terms of reward over time.
- The environment presents situations with delayed reward.
How is it done?
- Learn a mapping from the text in the neighborhood of a hyperlink to the expected (discounted) number of relevant pages that can be found as a result of following that hyperlink.
- Use naïve Bayes to classify the text into a corresponding finite number of classes.

Efficient Web Spidering (2/3)
Obtaining training data:
- Off-line training on 4 CS department web sites, comprising 53,012 documents and 592,216 hyperlinks.
- The state transition function T and the reward function R are known during training.
- Crawling: learn the Q function.
Calculating the Q function:
[Diagram: a hyperlink to a target page yields reward 1; hyperlinks to other pages yield reward 0; each hyperlink's Q-value reflects the discounted reward obtainable by following it.]
Neighborhood of a hyperlink: the anchor text of the link, headers, and the page title of the linked document.

Efficient Web Spidering (3/3)
Mapping text to Q-values (given the Q-values calculated for the hyperlinks in the training data):
- Discretize the discounted sums of reward values into bins; place the text in the neighborhood of each hyperlink into the bin corresponding to its Q-value.
- Train a naïve Bayes text classifier on those texts.
- For each new hyperlink, calculate the probabilistic class membership of each bin; the estimated Q-value of that hyperlink is the weighted average of the bins' values.
Evaluation:
- Measurement: the number of hyperlinks followed before 75% of the target pages are found.
- Reinforcement learning: 16% of the hyperlinks. Breadth-first: 48% of the hyperlinks.
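The final estimation step, a weighted average of bin values under the classifier's class-membership probabilities, can be sketched as follows; the bin values and probabilities are invented for illustration, not taken from the paper:

```python
# The estimated Q-value of a hyperlink is the weighted average of the bins'
# representative values, weighted by the (hypothetical) naive Bayes
# class-membership probabilities for the link's neighborhood text.

def estimate_q(bin_probs, bin_values):
    """Probability-weighted average of the bin values."""
    assert abs(sum(bin_probs) - 1.0) < 1e-9
    return sum(p * v for p, v in zip(bin_probs, bin_values))

bin_values = [0.0, 1.0, 3.0, 8.0]   # one representative Q-value per bin
probs = [0.1, 0.2, 0.6, 0.1]        # hypothetical classifier output
q_hat = estimate_q(probs, bin_values)
print(q_hat)  # ~2.8
```

A spider using this estimate would follow the pending hyperlink with the highest q_hat first.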

References
1. Efficient Crawling Through URL Ordering. Junghoo Cho, Hector Garcia-Molina, Lawrence Page. 7th WWW Conference.
2. Using Reinforcement Learning to Spider the Web Efficiently. Jason Rennie, Andrew McCallum. ICML '98.
3. Machine Learning. Tom M. Mitchell. McGraw-Hill.
4. The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd.
5. Mining the Link Structure of the World Wide Web. Soumen Chakrabarti et al., 1999.
6. Information Retrieval on the Web. Tutorial at SIGIR '98.