Web Information retrieval (Web IR)

Web Information retrieval (Web IR), Handout #9: Connectivity Ranking. Ali Mohammad Zareh Bidoki, ECE Department, Yazd University, alizareh@yaduni.ac.ir. Autumn 2011

Outline:
- PageRank
- HITS
- Personalized PageRank
- HostRank
- DistanceRank

Ranking: Definition. Ranking is the process that estimates the quality of the set of results retrieved by a search engine. It is the most important part of a search engine.

Ranking Types:
- Content-based (classical IR)
- Connectivity-based (web): query-independent or query-dependent
- User-behavior based

Web information retrieval:
- Queries are short: 2.35 terms on average
- Huge variety in documents: language, quality, duplication
- Huge vocabulary: hundreds of millions of terms
- Deliberate misinformation: spamming! In content-based ranking, a page's rank is completely under the control of the page's author.

Ranking in Web IR. Ranking is a function of both the query terms and the hyperlink structure. Using the content of other pages to rank the current page puts the rank outside the control of the page's author, so spamming becomes hard.

Connectivity-based Ranking:
- Query-independent: PageRank
- Query-dependent: HITS

Google's PageRank Algorithm. Idea: mine the structure of the web graph. Each web page is a node; each hyperlink is a directed edge.

PageRank. Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A), so the quality of a page is related to its in-degree. Recursion: the quality of a page is related to its in-degree and to the quality of the pages linking to it.

Definition of PageRank (the random-surfer model). Consider the following infinite random walk (surf): initially the surfer is at a random page; at each step, the surfer proceeds to a randomly chosen successor of the current page (each with probability 1/outdegree). The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
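A minimal simulation of this definition may help; the toy graph, the function name, and the restart-at-a-random-page behavior for sink pages are illustrative assumptions, not part of the slides:

```python
import random
from collections import Counter

# A toy web graph as adjacency lists (hypothetical pages).
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def simulate_surfer(graph, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = list(graph)
    page = rng.choice(pages)                 # start at a random page
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        successors = graph[page]
        # jump to a random page at a sink; the damping-factor slide refines this
        page = rng.choice(successors) if successors else rng.choice(pages)
    return {p: visits[p] / steps for p in pages}

print(simulate_surfer(web))
```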

PageRank (cont.). By the previous theorem, PageRank is the stationary probability distribution of this Markov chain, i.e. PR(p) = Σ_{q → p} PR(q)/O(q), where O(q) is the out-degree of q.

PageRank (cont.). Example: if page A (with 4 out-links) and page B (with 3 out-links) both link to page P, then the PageRank of P is PR(A)/4 + PR(B)/3.

Damping Factor (d). The web graph is not strongly connected, so convergence of PageRank is not guaranteed; sinking pages (pages without out-links) and trapping pages cause the random walk to get stuck. With the damping factor d, the surfer proceeds to a randomly chosen successor of the current page with probability d, or jumps to a randomly chosen web page with probability (1-d): PR(p) = (1-d)/n + d · Σ_{q → p} PR(q)/O(q), where n is the total number of nodes in the graph.

PageRank Vector (Linear Algebra). R is the rank vector (an eigenvector); r_i is the rank value of page i. P is the matrix with p_ij = 1/O(i) if i points to j, and p_ij = 0 otherwise. The goal is to find the eigenvector of P with eigenvalue one, i.e. R = P^T R; iterating this to convergence is the power method. Using the damping factor we have R = (1-d)·E + d·P^T R, with e_i = 1/n.
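A sketch of this power method, assuming a small adjacency-list graph (the sink-page handling is one common choice, not specified on the slide):

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10, max_iter=100):
    """Power method for R = (1-d)*E + d*P^T R, with e_i = 1/n."""
    n = len(adj)
    P = np.zeros((n, n))                     # P[i, j] = 1/O(i) if i points to j
    for i, outs in enumerate(adj):
        if outs:
            P[i, list(outs)] = 1.0 / len(outs)
        else:
            P[i, :] = 1.0 / n                # treat sink pages as linking to all pages
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:    # L1 change below tolerance: converged
            break
        r = r_new
    return r

adj = [{1, 2}, {2}, {0}]                     # hypothetical 3-page graph
print(pagerank(adj))
```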

PageRank Properties. Advantages: finds popularity; it is offline (scores are precomputed). Disadvantages: it is query-independent; all pages compete together; unfairness.

HITS (an online, query-dependent algorithm): Hyperlink-Induced Topic Search, by Kleinberg.

HITS. The algorithm produces two scores per page:
- Authority: a page is very authoritative if it receives many citations; citations from important pages weigh more than citations from less important pages.
- Hub: hubness shows the importance of a page as a directory; a good hub is a page that links to many authoritative sites.
For each vertex v Є V in a subgraph of interest, a(v) is the authority of v and h(v) is the hubness of v.

HITS example: if pages 2, 3, and 4 link to page 1, and page 1 links to pages 5, 6, and 7, then a(1) = h(2) + h(3) + h(4) and h(1) = a(5) + a(6) + a(7).

Authority and Hubness Convergence. Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
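The mutual reinforcement is exactly the iteration below, a minimal sketch with an assumed toy subgraph (normalization per step is the standard way to keep the scores bounded):

```python
import numpy as np

def hits(adj, iters=50):
    """Iterate a = A^T h (authorities) and h = A a (hubs) with normalization."""
    n = len(adj)
    A = np.zeros((n, n))                     # A[u, v] = 1 if u links to v
    for u, outs in enumerate(adj):
        for v in outs:
            A[u, v] = 1.0
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h                          # a(v) = sum of h(u) over u -> v
        h = A @ a                            # h(v) = sum of a(w) over v -> w
        a /= np.linalg.norm(a)               # normalize to keep scores bounded
        h /= np.linalg.norm(h)
    return a, h

adj = [{1, 2}, {2}, {0}]                     # hypothetical subgraph of interest
print(hits(adj))
```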

HITS Example. Find a base subgraph: start with a root set R = {1, 2, 3, 4} of nodes relevant to the topic; expand R to include all the children and a fixed number d of parents of the nodes in R. This yields a new set S, the base subgraph (see the sketch below). The real version of HITS is based on site relations.
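A small sketch of the expansion step; the dictionary-backed `children`/`parents` lookups are hypothetical stand-ins for whatever link index a real system would query:

```python
def base_subgraph(root, children, parents, d=50):
    """Expand root set R with all children and at most d parents of each node."""
    S = set(root)
    for v in root:
        S.update(children(v))                # every page v links to
        S.update(list(parents(v))[:d])       # a fixed number d of pages linking to v
    return S

# hypothetical link index backed by dictionaries
out_links = {1: [2], 2: [3], 3: [1], 4: [1]}
in_links = {1: [3, 4], 2: [1], 3: [2], 4: []}
S = base_subgraph({1, 2}, lambda v: out_links.get(v, []),
                  lambda v: in_links.get(v, []), d=2)
print(S)                                     # {1, 2, 3, 4}
```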

Topic-Sensitive PageRank (TSPR). It precomputes the importance scores offline, as with ordinary PageRank; however, it computes multiple importance scores for each page: a set of scores of the importance of a page with respect to various topics. At query time, these importance scores are combined, based on the topics of the query, to form a composite PageRank score for the pages matching the query.

TSPR (Cont.). We have n topics on the web; rank_t(v) is the rank of page v in topic t. The difference from original PageRank is in the E vector: it is not uniform, and we have n different E vectors. There are therefore n ranking values for each page. Problem: finding the topic of a page and of a query (we do not know the user's interest).

TSPR (Cont.). c_j = category j. Given a query q, let q' be the context of q (here q' = q). P(q'|c_j) is computed from the class term vector D_j (term counts over the documents below each of the 16 top-level categories; D_jt gives the total number of occurrences of term t in documents listed below class c_j).
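Putting this together, the query-time combination described two slides back can be written as follows (a reconstruction following Haveliwala's formulation; rank_j(d) denotes the topic-j PageRank of page d):

P(c_j | q') ∝ P(c_j) · P(q' | c_j) = P(c_j) · Π_i P(q'_i | c_j)
s(q, d) = Σ_j P(c_j | q') · rank_j(d)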

TSPR (Cont.). The quantity P(c_j) is not as straightforward. Here it is taken to be uniform, although we could personalize the query results for different users by varying this distribution; in other words, for some user k, we can use a prior distribution P_k(c_j) that reflects the interests of user k. This provides an alternative framework for user-based personalization, rather than directly varying the damping vector E.

TrustRank. The web contains good and bad (spam) pages; TrustRank is used to combat spamming. It proposes techniques to semi-automatically separate reputable, good pages from spam: it first selects a small set of seed pages to be evaluated by an expert; once the reputable seed pages are manually identified, it uses the link structure of the web to discover other pages that are likely to be good.

TrustRank (cont.). Idea: good pages link to other good pages, and bad pages link to other bad pages.

TrustRank (cont.). It formalizes the notion of a human checking a page for spam by a binary oracle function O over all pages p: O(p) = 1 if p is a good page, and O(p) = 0 if it is spam.

Trust Damping and Trust Splitting. Trust is damped: it decreases as we move further from the seed pages. Trust is split: a page's trust is divided among the pages it links to.

Computing the trustiness of each page. Goal: trust propagation over the link graph, like a biased PageRank whose jump vector E is computed from the normalized oracle vector; for example, if O(5) = O(10) = O(15) = 1, then E(5) = E(10) = E(15) = 0.33.
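A sketch of this propagation as biased PageRank, under the assumption (consistent with the slide) that the oracle seed vector replaces the uniform jump vector; the toy graph is hypothetical and sink leakage is left unhandled:

```python
import numpy as np

def trustrank(adj, seeds, d=0.85, iters=50):
    """Biased PageRank: the jump vector E is the normalized oracle (seed) vector."""
    n = len(adj)
    E = np.zeros(n)
    E[list(seeds)] = 1.0 / len(seeds)        # e.g., three good seeds -> 1/3 each
    P = np.zeros((n, n))                     # P[i, j] = 1/O(i) if i points to j
    for i, outs in enumerate(adj):
        if outs:
            P[i, list(outs)] = 1.0 / len(outs)
    t = E.copy()
    for _ in range(iters):
        t = (1 - d) * E + d * (P.T @ t)      # trust flows along links, damped
    return t

adj = [{1}, {2}, {0, 3}, set()]              # hypothetical graph; page 3 is a sink
print(trustrank(adj, seeds={0}))
```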

HostRank. Previous link-analysis algorithms generally work on a flat link graph, ignoring the hierarchical structure of the Web. They suffer from two problems: the sparsity of the link graph and biased ranking of newly emerging pages. HostRank considers both the hierarchical structure and the link structure of the Web.

Example of Domain, Host, Directory (for instance: domain example.com, host www.example.com, directory www.example.com/docs/).

Supernode & Hierarchical Structure of a Web Graph. The upper layer is an aggregated link graph consisting of supernodes (such as domains, hosts, and directories). The lower-layer graph is the hierarchical tree structure, in which each node is an individual web page inside a supernode and the edges are the hierarchical links between the pages.

Hierarchical Random Walk Model. 1. At the beginning of each browsing session, the user randomly selects a supernode. 2. After the user finishes reading a page in a supernode, he or she selects one of the following three actions with a certain probability: going to another page within the current supernode; jumping to another supernode that is linked by the current supernode; ending the browsing session.

Two stages in HostRank: first, compute the score of each supernode by a random walk (PageRank) on the aggregated graph; second, propagate the score among the pages inside each supernode using the Dissipative Heat Conductance (DHC) model. (A simplified sketch follows.)
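A rough two-stage sketch. The uniform split in stage 2 is only a crude stand-in for the DHC propagation the paper actually uses, and the host-level graph is hypothetical; it reuses the `pagerank()` sketch from earlier:

```python
def hostrank(super_adj, members, rank_fn):
    """Stage 1: rank supernodes; stage 2: spread each score over member pages."""
    s = rank_fn(super_adj)                   # e.g., the pagerank() sketch above
    page_score = {}
    for k, pages in enumerate(members):
        for p in pages:
            # uniform split is a crude stand-in for the paper's DHC propagation
            page_score[p] = s[k] / len(pages)
    return page_score

super_adj = [{1}, {0}]                       # hypothetical host-level graph
members = [["a.com/1", "a.com/2"], ["b.com/1"]]
print(hostrank(super_adj, members, pagerank))   # reuses pagerank() from earlier
```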

Ranking & Crawling Challenges:
- Rich-get-richer problem
- Unfairness
- Low precision
- Spamming phenomenon

Popularity and Quality. Definition 1: the popularity of page p at time t, P(p, t), is the fraction of web users who like the page; we can interpret the PageRank of a page as its popularity on the web. Definition 2: the quality of page p, Q(p), is the probability that an average user will like the page when seeing it for the first time.

Rich-get-richer Problem. It causes young, high-quality pages to receive less popularity than they deserve. It stems from search-engine bias: the entrenchment effect.

Entrenchment Effect. Search engines show entrenched (already popular) pages at the top; users discover pages via search engines and tend to focus on the top results, so user attention goes to the entrenched pages while new, unpopular pages are overlooked.

Popularity as a Surrogate for Quality. Search engines want to measure the "quality" of pages, but quality is hard to define and measure, so various "popularity" measures are used in ranking, e.g., in-links, PageRank, and user traffic.

Measuring Search-Engine Bias. Random-surfer model: users follow links randomly and never use search engines. Search-dominant model: users always start with a search engine and only visit pages returned by it. It has been found that it takes 60 times longer for a new page to become popular under the search-dominant model than under the random-surfer model.

Popularity Evaluation (plots of popularity evolution under the random-surfer and search-dominant models).

Some Definitions

Relation between Popularity & Visit Rate in the Random-Surfer Model. The visit rate is proportional to popularity: V(p, t) = r1 · P(p, t), where r1 is constant. We can therefore consider PageRank as popularity (the current PageRank of a page represents the probability that a person arrives at the page by following links on the web randomly).

Popularity evolution

Popularity evolution (Q(p) = 1)

Relation in Search Dominant Model

Random Surfer vs. Search Dominant

Search Dominant Formula Detail (found from an AltaVista log: a power law)

Rank Promotion (by Pandey)

Relationship Between Popularity and Quality. Popularity depends on the number of users who "like" a page, so it relies on both the quality and the awareness of the page: P(p, t) = A(p, t) · Q(p), where A(p, t) is the fraction of users aware of page p. Popularity is different from quality, but strongly correlated with it when awareness is large.

Exploitation vs. Exploration. Exploit popular, high-quality pages, but also explore the others: this gives non-popular pages an opportunity.

DistanceRank: An Intelligent Ranking Algorithm for Web Pages

DistanceRank Goals. A connectivity-based ranking algorithm that is less sensitive to the rich-get-richer problem, in addition to providing better ranking.

Definitions. Definition 1: if page i points to page j, then the weight of the link between i and j is log10 O(i), where O(i) is the out-degree of i. Definition 2: the distance between two pages i and j, denoted d_ij, is the weight of the shortest path from i to j; we call this the logarithmic distance.

An Example. If the path p-r-t is the shortest path between p and t, the distance between p and t equals log(2) + log(3) (p has 2 out-links and r has 3). Similarly, the distance between p and s is log(2) + log(4) (s is reached through a page with 4 out-links). A shortest-path sketch with these logarithmic weights follows.
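A minimal Dijkstra sketch for the logarithmic distance; the graph below is a hypothetical one consistent with the example (O(p) = 2, O(r) = 3, O(q) = 4):

```python
import heapq, math

def log_distances(adj, src):
    """Dijkstra where every edge leaving page i costs log10(O(i))."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue                         # stale heap entry
        if not adj[i]:
            continue                         # sink page: no outgoing edges
        w = math.log10(len(adj[i]))          # weight of every out-edge of i
        for j in adj[i]:
            nd = d + w
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return dist

adj = {"p": ["r", "q"], "r": ["t", "a", "b"], "q": ["s", "c", "d", "e"],
       "t": [], "a": [], "b": [], "c": [], "d": [], "e": [], "s": []}
print(log_distances(adj, "p"))               # d(p,t) = log 2 + log 3
```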

Distance as a Random-Surfer Model. If the distance between i and j is less than the distance between i and k (d_ij < d_ik), then the probability that a random surfer starting from i reaches j is greater than the probability of reaching k.

Definition 3. If d_ij is the (logarithmic) distance between pages i and j as in Definition 2, then d_j denotes the average distance of page j, defined as d_j = (1/N) · Σ_i d_ij, where N is the number of web pages.

Distance as a Ranking Criterion. A page with a smaller average distance from the other pages gets a higher rank. A page with many input links should have a low average distance, and if the pages pointing to it themselves have low distance, its average distance should be lower still.

Disadvantage of Average Distance. The main problem with the average distance is its complexity, O(|V|·|E|), so implementing it is not practical on the real web with roughly 25 billion pages.

The DistanceRank. In general, suppose O(i) denotes the number of forward (outgoing) links of page i and B(j) denotes the set of pages that point to page j. The DistanceRank of page j, denoted d_j, is given by d_j = min over i Є B(j) of (d_i + log O(i)). Low complexity!

Intelligent Surfer. Intuitively, when a user starts browsing from a random page, s/he does not have any background about the web. Then, by surfing and visiting web pages, s/he clicks links based on both his/her previous experiences and the current status (content) of the web pages. Continuously, s/he accumulates knowledge that lets him/her reach the goal and suitable pages faster.

DistanceRank based on Reinforcement Learning. State: web pages (V). Action: move from i to j (click on j). Punishment: log O(i). Objective: minimize the punishments (distance). First, a definition of RL: in reinforcement learning, learning takes place through interaction; the learner or decision maker is called the agent. Everything the agent interacts with, i.e., all conditions outside the agent's control, is called the environment. The agent interacts with the environment by performing a series of actions, and in response to each action the environment gives the agent a reward or a punishment. The agent and the environment interact over successive time steps t = 0, 1, 2, …; at each time step the agent receives a representation of the environment's state (S is the set of all possible states) and selects an action to reach a new state. The agent's main goal is to maximize the rewards (or minimize the punishments) received over time.

DistanceRank is similar to Q-Learning. The update is d_j(t+1) = (1-α)·d_j(t) + α·min over i Є B(j) of (log O(i) + γ·d_i(t)), where log O(i) is the instantaneous punishment received in the transition from state i to j; d_j(t) and d_i(t) are the average distance values of pages j and i at time t, and d_j(t+1) is the average distance of page j at time t+1. It maps onto the intelligent surfer: the learning rate α is initially one and then decreases over time, and γ is the discount factor. The computation is iterative, and finally the distances are sorted in increasing order. (A sketch follows.)
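A minimal sketch of this update; the 1/t decay of the learning rate is one plausible choice (the slide only says α starts at one and decreases), and the sample graph matches the example on the next slide:

```python
import math

def distance_rank(in_links, out_degree, gamma=0.5, iters=20):
    """d_j <- (1-a)*d_j + a * min over i in B(j) of [log10(O(i)) + gamma*d_i]."""
    n = len(in_links)
    d = [0.0] * n
    for t in range(1, iters + 1):
        a = 1.0 / t                          # one way to decay the learning rate
        new_d = []
        for j in range(n):
            if in_links[j]:
                best = min(math.log10(out_degree[i]) + gamma * d[i]
                           for i in in_links[j])
                new_d.append((1 - a) * d[j] + a * best)
            else:
                new_d.append(d[j])           # a page with no in-links keeps its value
        d = new_d
    return sorted(range(n), key=lambda j: d[j])   # rank: increasing distance

# the 5-node example graph of the next slide, 0-indexed:
# 1 -> 2, 2 -> 4, 3 -> 2, 4 -> {1, 5}, 5 -> 3
in_links = [[3], [0, 2], [4], [1], [3]]
out_degree = [1, 1, 1, 2, 1]
print(distance_rank(in_links, out_degree))
```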

DistanceRank example: a 5-node graph with γ = 0.5 (the formulas imply 1 → 2, 2 → 4, 3 → 2, 4 → {1, 5}, 5 → 3):

Node  Distance formula        Iteration 4
1     d1 = 0.5·d4 + log 2     log 2
2     d2 = 0.5·min(d3, d1)    0.25·log 2
3     d3 = 0.5·d5             0.5·log 2
4     d4 = 0.5·d2             0.12·log 2
5     d5 = 0.5·d4 + log 2     log 2

Experimental Results. We used the University of California, Berkeley web site, which includes 5 million web pages, to evaluate DistanceRank. Three scenarios were used: (1) crawl scheduling, where the goal was to find more important pages faster during the crawling process; (2) rank ordering, where we compared the ordering produced by DistanceRank with that of PageRank and measured their similarity; (3) comparison with Google.

Rank Ordering. We used Kendall's metric to measure the correlation between two rank lists; the PageRank ordering is taken as the ideal.

Algorithm       Kendall's Tau
Breadth-first   0.11
Back Link       0.40
OPIC            0.62
DistanceRank    0.75
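For reference, one standard way to compute such an agreement score is SciPy's Kendall tau; the two rank lists below are hypothetical, not the paper's data:

```python
from scipy.stats import kendalltau

# hypothetical positions of the same six pages under two orderings
pagerank_positions = [1, 2, 3, 4, 5, 6]
distancerank_positions = [1, 3, 2, 4, 6, 5]
tau, _ = kendalltau(pagerank_positions, distancerank_positions)
print(tau)        # 1.0 = identical orderings, -1.0 = fully reversed
```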

Rich-get-richer Problem. It is proved that the popularity of each page is computed as follows, where P_p and P_c are the previous and current popularity, respectively. There is a relation between popularity and quality (Q = 0.80).

Popularity growth in DistanceRank (Q = 0.8): DistanceRank is closer to the page quality.

Experimental result. DistanceRank is less sensitive to the "rich get richer" problem in comparison with PageRank: d_j = α·(difference between current and previous distance) + previous distance. DistanceRank is closer to the page quality.

DistanceRank Convergence. The ordering obtained after only 5 iterations agrees closely with the ordering after 20 iterations for 5 million pages, so the time complexity of the algorithm reduces to O(p·|E|), where p << N and p is the number of iterations needed for convergence.

Different learning rates