Google Search and Internet Information Retrieval, Zhi-Ming Ma (马志明), May 16, 2008


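About 626,000 results match the query 中国科学院数学与系统科学研究院 (Academy of Mathematics and Systems Science, Chinese Academy of Sciences); results are shown below. (Search took 0.45 seconds.) How can Google rank 626,000 pages in 0.45 seconds?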

A main task of Internet (Web) information retrieval = design and analysis of search engine (SE) algorithms, which involve plenty of mathematics

HITS: 1998, Jon Kleinberg, Cornell University. PageRank: 1998, Sergey Brin and Larry Page, Stanford University.

Nevanlinna Prize (2006): Jon Kleinberg. One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web. Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure. Kleinberg introduced the idea of “authorities” and “hubs”: an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.

PageRank, the ranking system used by the Google search engine: query independent and content independent, using only the web graph structure.
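To make "using only the web graph structure" concrete, here is a minimal sketch of PageRank by power iteration. The function name `pagerank`, the toy link matrix, the damping value 0.85, and the treatment of pages without out-links are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def pagerank(adjacency, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for the PageRank vector of a small directed graph.

    adjacency[i, j] = 1 means page i links to page j. Pages without
    out-links are treated as linking to every page (an assumed convention).
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Row-stochastic link matrix; rows of dangling pages become uniform.
    P = np.where(out_degree[:, None] > 0,
                 A / np.maximum(out_degree[:, None], 1.0),
                 1.0 / n)
    # Random surfer: follow a link with probability alpha, else jump uniformly.
    G = alpha * P + (1.0 - alpha) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = pi @ G
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

# Toy graph with four pages; page 2 is linked to by pages 0, 1 and 3.
links = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])
ranks = pagerank(links)
```

No query and no page content enter the computation: the only input is the link matrix.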

PageRank as a Function of the Damping Factor. Paolo Boldi, Massimo Santini, Sebastiano Vigna, DSI, Università degli Studi di Milano. WWW 2005 paper. Section 3, General Behaviour: 3.1 Choosing the damping factor; 3.2 Getting close to 1. Let $r^*$ denote the limit of the PageRank vector as the damping factor tends to 1.
– Can we somehow characterise the properties of $r^*$?
– What makes $r^*$ different from the other (infinitely many, if P is reducible) limit distributions of P?

Conjecture 1: $r^*$ is the limit distribution of P when the starting distribution is uniform.

Websites provide plenty of information:
– Pages in the same website may share the same IP, run on the same web server and database server, and be authored or maintained by the same person or organization.
– There may be high correlations between pages in the same website in terms of content, page layout and hyperlinks.
– Websites contain a higher density of hyperlinks inside them (about 75%) and a lower density of edges between them.

HostGraph loses much transition information. Can a surfer jump from page 5 of site 1 to a page in site 2?

From: [mailto:s06-pc-chairs- Sent: April 4, 2006, 8:36 To: Tie-Yan Liu; Subject: [SIGIR2006] Your Paper #191 Title: AggregateRank: Bring Order to Web Sites. 29th Annual International Conference on Research & Development on Information Retrieval (SIGIR’06, August 6–11, 2006, Seattle, Washington, USA).

Ranking Websites, a Probabilistic View Ying Bao, Gang Feng, Tie-Yan Liu, Zhi-Ming Ma, and Ying Wang Internet Mathematics, Volume 3 (2007), Issue 3

– We suggest evaluating the importance of a website by the mean frequency of visits to the website for the Markov chain on the Internet graph describing random surfing.
– We show that this mean frequency is equal to the sum of the PageRanks of all the webpages in that website (hence it is referred to as PageRankSum).

– We propose a novel algorithm (the AggregateRank algorithm), based on the theory of stochastic complement, to calculate the rank of a website.
– The AggregateRank algorithm can approximate PageRankSum accurately, while its computational complexity is much lower than that of PageRankSum.

– By constructing return-time Markov chains restricted to each website, we also describe the probabilistic relation between PageRank and AggregateRank.
– The complexity and the error bound of the AggregateRank algorithm, with experiments on real data, are discussed at the end of the paper.

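n web pages in N websites: site $S_i$ contains $n_i$ pages, so that $n = n_1 + \dots + n_N$.

The stationary distribution $\pi(\alpha)$ of the random-surfer transition matrix $P(\alpha)$, known as the PageRank vector, is given by $\pi(\alpha) P(\alpha) = \pi(\alpha)$ with $\pi(\alpha) e = 1$. We may rewrite the stationary distribution as $\pi(\alpha) = (\pi_1(\alpha), \dots, \pi_N(\alpha))$, with $\pi_i(\alpha)$ as a row vector of length $n_i$.

We define the one-step transition probability from the website $S_i$ to the website $S_j$ by $c_{ij}(\alpha) = \frac{\pi_i(\alpha)}{\|\pi_i(\alpha)\|_1} P_{ij}(\alpha) e$, where $P_{ij}(\alpha)$ is the $n_i \times n_j$ block of $P(\alpha)$ and e is an $n_j$-dimensional column vector of all ones.

The N×N matrix $C(\alpha) = (c_{ij}(\alpha))$ is referred to as the coupling matrix, whose elements represent the transition probabilities between websites. It can be proved that $C(\alpha)$ is an irreducible stochastic matrix, so that it possesses a unique stationary probability vector. We use $\xi(\alpha)$ to denote this stationary probability, which can be obtained from $\xi(\alpha) C(\alpha) = \xi(\alpha)$ with $\xi(\alpha) e = 1$.

Since $\sum_i \|\pi_i(\alpha)\|_1 c_{ij}(\alpha) = \sum_i \pi_i(\alpha) P_{ij}(\alpha) e = \pi_j(\alpha) e = \|\pi_j(\alpha)\|_1$, one can easily check that $\xi(\alpha) = (\|\pi_1(\alpha)\|_1, \dots, \|\pi_N(\alpha)\|_1)$ is the unique solution to $\xi(\alpha) C(\alpha) = \xi(\alpha)$ with $\xi(\alpha) e = 1$. We shall refer to $\xi(\alpha)$ as the AggregateRank.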

That is, the probability of visiting a website is equal to the sum of the PageRanks of all the pages in that website. This conclusion is consistent with our intuition.
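The claim can be checked numerically on a toy example. The sketch below builds a coupling matrix from the PageRank vector, using the standard stochastic-complement form (each site's PageRank block normalized to sum to 1, times the block of the transition matrix, times a vector of ones), and compares its stationary distribution with the per-site sums of PageRank. The five-page graph, the two-site partition, and all function names are hypothetical.

```python
import numpy as np

def stationary(M, iters=5000):
    """Stationary row vector of a positive stochastic matrix, by power iteration."""
    v = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        v = v @ M
    return v / v.sum()

def coupling_matrix(G, pi, sites):
    """c_ij = (pi_i / ||pi_i||_1) G_ij e, where `sites` lists the page indices of each site."""
    N = len(sites)
    C = np.zeros((N, N))
    for i, Si in enumerate(sites):
        weights = pi[Si] / pi[Si].sum()
        for j, Sj in enumerate(sites):
            # Row sums of block G_ij give, per page of site i, the chance of landing in site j.
            C[i, j] = weights @ G[np.ix_(Si, Sj)].sum(axis=1)
    return C

alpha, n = 0.85, 5
links = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
], dtype=float)
P = links / links.sum(axis=1, keepdims=True)
G = alpha * P + (1.0 - alpha) / n      # random-surfer transition matrix
pi = stationary(G)                     # PageRank vector

sites = [[0, 1], [2, 3, 4]]            # assumed partition of pages into websites
C = coupling_matrix(G, pi, sites)
xi = stationary(C)                     # stationary distribution over websites
page_rank_sum = np.array([pi[s].sum() for s in sites])
```

On this example `xi` and `page_rank_sum` agree up to numerical precision, as the argument above predicts.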

The transition probability from $S_i$ to $S_j$ summarizes all the cases in which the random surfer jumps from any page in $S_i$ to any page in $S_j$ within a one-step transition. Therefore, the transition in this new HostGraph is in accordance with the real behavior of Web surfers. In this regard, the rank calculated from the coupling matrix C(α) is more reasonable than those of previous works.

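Let $N_A(n)$ denote the number of visits to a website A during the first n steps of the chain X, that is, $N_A(n) = \sum_{k=1}^{n} 1_{\{X_k \in A\}}$. We have that the mean frequency $N_A(n)/n$ converges, as $n \to \infty$, to the sum of the PageRanks of the pages in A.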

Assume a starting state in website A, i.e. $X_0 \in A$. We define $\tau_0 = 0$ and inductively $\tau_{k+1} = \inf\{n > \tau_k : X_n \in A\}$. It is clear that all the variables $\tau_k$ are stopping times for X.

Let denote the transition matrix of the return-time Markov chain for site Similarly, we have

Since Therefore Suppose that AggregateRank, i.e. the stationary distribution of is

Based on the above discussion, the direct approach to computing the AggregateRank ξ(α) is to accumulate PageRank values (denoted by PageRankSum). However, this approach is infeasible because computing PageRank is not a trivial task when the number of web pages is as large as several billion. Therefore, efficient computation becomes a significant problem.

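AggregateRank algorithm
1. Divide the n × n transition matrix $P(\alpha)$ into N × N blocks $P_{ij}(\alpha)$ according to the N sites.
2. Construct the stochastic matrix $\tilde{P}_{ii}(\alpha)$ for $P_{ii}(\alpha)$ by changing the diagonal elements of $P_{ii}(\alpha)$ to make each row sum up to 1.
3. Determine $u_i(\alpha)$ from $u_i(\alpha) \tilde{P}_{ii}(\alpha) = u_i(\alpha)$ with $u_i(\alpha) e = 1$.
4. Form an approximation $C^*(\alpha)$ to the coupling matrix $C(\alpha)$ by evaluating $c^*_{ij}(\alpha) = u_i(\alpha) P_{ij}(\alpha) e$.
5. Determine the stationary distribution of $C^*(\alpha)$ and denote it by $\xi^*(\alpha)$, i.e., $\xi^*(\alpha) C^*(\alpha) = \xi^*(\alpha)$ with $\xi^*(\alpha) e = 1$.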
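The five steps can be sketched on a toy graph as follows. This is an illustrative reading of the steps, not the authors' implementation: the function names, the five-page graph, the two-site partition and the use of plain power iteration for every stationary distribution are all assumptions.

```python
import numpy as np

def stationary(M, iters=5000):
    """Stationary row vector of a stochastic matrix, by power iteration."""
    v = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        v = v @ M
    return v / v.sum()

def aggregate_rank(G, sites):
    """Approximate site ranks from local (per-site) stationary vectors."""
    N = len(sites)
    C_star = np.zeros((N, N))
    for i, Si in enumerate(sites):
        # Steps 1-2: take the diagonal block and adjust its diagonal
        # so that every row sums to 1.
        block = G[np.ix_(Si, Si)].copy()
        block[np.diag_indices_from(block)] += 1.0 - block.sum(axis=1)
        # Step 3: stationary vector of the adjusted block.
        u_i = stationary(block)
        # Step 4: approximate coupling entries u_i G_ij e.
        for j, Sj in enumerate(sites):
            C_star[i, j] = u_i @ G[np.ix_(Si, Sj)].sum(axis=1)
    # Step 5: stationary distribution of the approximate coupling matrix.
    return stationary(C_star), C_star

alpha, n = 0.85, 5
links = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
], dtype=float)
P = links / links.sum(axis=1, keepdims=True)
G = alpha * P + (1.0 - alpha) / n
sites = [[0, 1], [2, 3, 4]]            # assumed partition of pages into websites

xi_star, C_star = aggregate_rank(G, sites)

# Reference values for comparison: per-site sums of the full PageRank vector.
pi = stationary(G)
page_rank_sum = np.array([pi[s].sum() for s in sites])
```

The point of the design is that steps 2 and 3 only ever touch one site's block at a time, so the full n × n PageRank computation is avoided; `page_rank_sum` is computed here only so that the approximation can be compared against it.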

Experiments. In our experiments, the data corpus is the benchmark data for the Web track of TREC 2003 and 2004, which was crawled from the .gov domain in 2002. It contains 1,247,753 webpages in total.

We get 731 sites in the .gov dataset. The largest website contains 137,103 web pages, while the smallest one contains only 1 page.

Performance Evaluation of Ranking Algorithms based on Kendall's distance
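As a reminder of what such a measure computes, here is a minimal sketch of a normalized Kendall distance between two rankings of the same items: the fraction of item pairs that the two rankings order differently. The exact variant used in the experiments is not stated in the slides, so the normalization and the function name are assumptions.

```python
from itertools import combinations

def kendall_distance(rank_a, rank_b):
    """Fraction of item pairs ordered differently by the two rankings.

    rank_a and rank_b list the same items from best to worst.
    Returns 0.0 for identical rankings and 1.0 for reversed ones.
    """
    pos_a = {item: r for r, item in enumerate(rank_a)}
    pos_b = {item: r for r, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)

same = kendall_distance(["a", "b", "c"], ["a", "b", "c"])
reversed_order = kendall_distance(["a", "b", "c"], ["c", "b", "a"])
one_swap = kendall_distance(["a", "b", "c"], ["b", "a", "c"])
```

A smaller distance to the PageRankSum ranking means a site ranking agrees with it on more pairs of sites.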

Similarity between PageRankSum and the other three ranking results.

From: Sent: Thursday, April 03, :48 AM
Dear Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li,
We are pleased to inform you that your paper, Title: BrowseRank: Letting Web Users Vote for Page Importance, has been accepted for oral presentation as a full paper and for publication as an eight-page paper in the proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval. Congratulations!!

Building model Properties of Q process: –Stationary distribution: –Jumping probability: –Embedded Markov chain: is a Markov chain with the transition probability matrix

Main conclusion 1
– The mean staying time on page i: the more important a page is, the longer the staying time on it.
– The mean first re-visit time at page i: the more important a page is, the shorter the re-visit time and the higher the visit frequency.

Main conclusion 2 – is the stationary distribution of –The stationary distribution of discrete model is easy to compute Power method for Log data for
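The two conclusions can be combined in a small sketch. It relies on a standard fact about continuous-time Markov processes: the stationary probability of a state is proportional to the stationary probability of the embedded (jump) chain times the mean staying time in that state. The symbols on the slide were not preserved, so the names `jump_matrix` and `mean_stay`, and the toy numbers standing in for values estimated from log data, are assumptions.

```python
import numpy as np

def embedded_stationary(P, tol=1e-12, max_iter=10000):
    """Stationary distribution of the embedded Markov chain, by the power method."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        new_pi = pi @ P
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

def continuous_time_stationary(jump_matrix, mean_stay):
    """Weight the embedded chain's stationary distribution by mean staying times."""
    pi_tilde = embedded_stationary(jump_matrix)
    weighted = pi_tilde * np.asarray(mean_stay, dtype=float)
    return weighted / weighted.sum()

# Toy data: three pages; users stay five times longer on page 1.
jump_matrix = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
mean_stay = [1.0, 5.0, 1.0]
importance = continuous_time_stationary(jump_matrix, mean_stay)

# Consistency check: the result is stationary for the generator
# diag(1 / mean_stay) (jump_matrix - I) of the continuous-time process.
generator = np.diag(1.0 / np.array(mean_stay)) @ (jump_matrix - np.eye(3))
residual = importance @ generator
```

In this toy case the jump chain treats all pages alike, so the differences in importance come entirely from the staying times.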

Further questions
How about inhomogeneous processes?
– Statistical results show that different periods of time have different visiting frequencies.
– Poisson processes with different intensities.
Marked point processes
– Hyperlinks are not reliable.
– Users’ real behavior should be considered.

Relevance Ranking
Many features for measuring relevance
– Term distribution (anchor, URL, title, body, proximity, ...)
– Recommendation & citation (PageRank, click-through data, ...)
– Statistics or knowledge extracted from web data
Questions
– What is the optimal ranking function to combine different features (or evidences)?
– How to measure relevance?

Learning to Rank
What are the optimal weightings for combining the various features?
– Use machine learning methods to learn the ranking function
– Human relevance system (HRS)
– Relevance verification tests (RVT)
Wei-Ying Ma, Microsoft Research Asia

Learning to Rank [Figure: Learning System, Model, Ranking System; objective: min Loss] Wei-Ying Ma, Microsoft Research Asia

Learning to Rank (Cont)
State-of-the-art algorithms for learning to rank take the pairwise approach
– Ranking SVM
– RankBoost
– RankNet (employed at Live Search)
Wei-Ying Ma, Microsoft Research Asia

Learning to rank. The goal of learning to rank is to construct a real-valued function that can generate a ranking of the documents associated with a given query. The state-of-the-art methods transform the learning problem into one of classification and then perform the learning task:

For each query, it is assumed that there are two categories of documents: positive and negative (representing relevant and irrelevant with respect to the query). Then document pairs are constructed between positive documents and negative documents. In the training process, the query information is actually ignored.
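The pair construction just described can be sketched as follows. The feature vectors, labels, linear scoring function and hinge loss are illustrative assumptions (the hinge loss is in the spirit of Ranking SVM, without its regularization term); none of the names come from the slides.

```python
from itertools import product

def make_pairs(docs):
    """Build (positive, negative) feature pairs for one query.

    docs is a list of (feature_vector, label), with label 1 for relevant
    and 0 for irrelevant documents.
    """
    positives = [x for x, y in docs if y == 1]
    negatives = [x for x, y in docs if y == 0]
    return list(product(positives, negatives))

def pairwise_hinge_loss(w, pairs):
    """Average hinge loss of the linear scoring function f(x) = w . x over pairs."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    losses = [max(0.0, 1.0 - (score(p) - score(q))) for p, q in pairs]
    return sum(losses) / len(losses)

query_a = [((1.0, 0.2), 1), ((0.3, 0.1), 0), ((0.1, 0.4), 0)]
query_b = [((0.9, 0.9), 1), ((0.8, 0.0), 1), ((0.2, 0.5), 0)]

# Pairs are built within each query, then pooled: after this point the
# training objective no longer knows which query a pair came from.
pairs = make_pairs(query_a) + make_pairs(query_b)

loss_good = pairwise_hinge_loss((2.0, 0.0), pairs)
loss_zero = pairwise_hinge_loss((0.0, 0.0), pairs)
```

Because the pooled loss averages over pairs, a query with many pairs weighs more than a query with few, which is one way the query information gets lost.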

[5] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. In Proc. of SIGIR’06, pages 186–193, 2006.
[11] T. Qin, T.-Y. Liu, M.-F. Tsai, X.-D. Zhang, and H. Li. Learning to search web pages with query-level loss functions. Technical Report MSR-TR.

We reformulate the learning-to-rank problem by introducing the concepts of ‘query’ and ‘distribution given query’ into its mathematical formulation. More precisely, we assume that queries are drawn independently from a query space Q according to an (unknown) probability distribution.

It should be noted that if, then the bound makes sense. This condition can be satisfied in many practical cases. As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability. For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function. These analyses agree largely with our experiments and the experiments in [5] and [11].

Rank aggregation. Rank aggregation combines the ranking results of entities from multiple ranking functions in order to generate a better ranking. The individual ranking functions are referred to as base rankers, or simply rankers.

Score-based aggregation Rank aggregation can be classified into two categories [2]. In the first category, the entities in individual ranking lists are assigned scores and the rank aggregation function is assumed to use the scores (denoted as score-based aggregation) [11][18][28].

Order-based aggregation. In the second category, only the orders of the entities in individual ranking lists are used by the aggregation function (denoted as order-based aggregation). Order-based aggregation is employed in meta-search, for example, where only order (rank) information from individual search engines is available.

Previously order-based aggregation was mainly addressed with the unsupervised learning approach, in the sense that no training data is utilized; methods like Borda Count [2][7][27], median rank aggregation [9], genetic algorithm [4], fuzzy logic based rank aggregation [1], Markov Chain based rank aggregation [7] and so on were proposed.
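As an example of an unsupervised, order-based method, here is a minimal sketch of Borda Count: each base ranker gives an item as many points as there are items ranked below it, and items are sorted by total points. The function name, the tie-breaking rule and the toy rankings are assumptions.

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate rankings of the same items with Borda Count.

    Each ranking lists items from best to worst. An item at position p in
    a list of n items earns n - 1 - p points. Ties are broken alphabetically.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    return sorted(scores, key=lambda item: (-scores[item], str(item)))

base_rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
aggregated = borda_count(base_rankings)
```

Only the positions in each list are used, never the rankers' scores, and no training data is involved.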

It turns out that the optimization problems for the Markov Chain based methods are hard, because they are not convex optimization problems. We are able to develop a method for the optimization of one Markov Chain based method, called Supervised MC2. We prove that we can transform the optimization problem into that of Semidefinite Programming. As a result, we can efficiently solve the issue.

Next Generation Web Search? (Web Search > 3.0)
Directions for new innovations
– Process-centric vs. data-centric
– Infrastructure for Web-scale data mining
– Intelligence & knowledge discovery
Wei-Ying Ma, Microsoft Research Asia

Web Search – Past, Present, and Future. Wei-Ying Ma, Web Search and Mining Group, Microsoft Research Asia