1
윤언근 2008. 10. 29 DataMining lab
2
The Web has grown exponentially in size, but this growth has not been limited to good-quality pages: spam and malicious sites have grown along with it. To date, most work on web page ranking has focused on dynamic ranking (query-dependent ranking).
3
However, static ranking (query-independent ranking) is also crucially important for a search engine and provides several benefits:
Relevance
Efficiency
▪ The search engine's index is ordered by static rank.
▪ By traversing the index from high-quality to low-quality pages, the dynamic ranker may abort the search early.
Crawl Priority
▪ Search engines need a way to prioritize their crawl: to determine which pages to re-crawl and how often to seek out new pages.
▪ The static rank of a page is used to determine this prioritization.
4
Google was the first commercially successful search engine to use the PageRank algorithm, and PageRank has been regarded as the best method for static ranking.
5
A machine learning approach to static ranking has advantages beyond the quality of the ranking itself. Because the measure consists of many features, it is harder for malicious users to manipulate:
▪ By adjusting the ranking model in advance of the spammers' attempts, the spammers' actions can be blocked.
This paper's contribution is a systematic study of static features, including PageRank.
6
The basic idea: a link from one Web page to another is an implicit endorsement; when creating a page, an author presumably chooses to link to pages deemed to be of good quality. The PageRank score for node j:
▪ PR(j) = (1 − d)/N + d · Σ_{i ∈ B_j} PR(i) / |F_i|
where F_i is the set of pages that page i links to, B_j is the set of pages that link to page j, N is the number of pages, and d is the damping factor. Though much work has been done on optimizing the PageRank computation, it remains a relatively slow, expensive property to compute.
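As an illustration (not from the paper), a minimal power-iteration sketch of this computation in Python; the four-page graph, the damping factor d = 0.85, and the convergence tolerance are assumed values:

```python
# Minimal PageRank power-iteration sketch (illustrative only).
# The graph, damping factor, and tolerance are assumptions for this example.

def pagerank(out_links, d=0.85, tol=1e-8, max_iter=100):
    """out_links: dict mapping page -> list of pages it links to (F_i)."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(max_iter):
        new_pr = {}
        for j in pages:
            # B_j: pages that link to j; each contributes PR(i) / |F_i|
            incoming = sum(pr[i] / len(out_links[i])
                           for i in pages if j in out_links[i])
            new_pr[j] = (1 - d) / n + d * incoming
        converged = max(abs(new_pr[p] - pr[p]) for p in pages) < tol
        pr = new_pr
        if converged:
            break
    return pr

if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(toy_graph))
```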
7
RankNet is a modification of the standard neural network back-propagation algorithm. The RankNet cost function is based on the difference between a pair of network outputs. For each pair of feature vectors ⟨x_i, x_j⟩ in the training set, where the ranking of item i is higher than the ranking of item j, RankNet computes the network outputs o_i and o_j and assigns the pair a cost C_ij = log(1 + e^(o_j − o_i)): the larger o_j − o_i is, the larger the cost.
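A small sketch of this pairwise cost (Python/NumPy; the example scores are made up, and the pair ⟨i, j⟩ is assumed to be one where item i should be ranked above item j):

```python
import numpy as np

def ranknet_pair_cost(o_i: float, o_j: float) -> float:
    """Pairwise cost for a pair where item i should outrank item j.

    log(1 + exp(o_j - o_i)) is small when o_i >> o_j (correct order)
    and grows as o_j - o_i grows (wrong order).
    """
    return float(np.log1p(np.exp(o_j - o_i)))

# The network scores item i higher than item j -> low cost.
print(ranknet_pair_cost(o_i=2.0, o_j=0.5))   # ~0.20
# The network scores the pair in the wrong order -> higher cost.
print(ranknet_pair_cost(o_i=0.5, o_j=2.0))   # ~1.70
```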
8
Much work in machine learning has been done on the problems of classification and regression. Let X = {x_i} be a collection of feature vectors (vectors of real numbers) and Y = {y_i} be a collection of associated classes, where y_i is the class of the object described by feature vector x_i.
Classification problem
▪ Learn a function f that maps y_i = f(x_i), for all i.
▪ When y_i is real-valued as well, this is called regression.
Regression view of static ranking
▪ If we let x_i = the features of page i and y_i = the value (rank) of that page, we could learn a regression function that maps each page's features to its rank.
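As a toy illustration of that regression view (NumPy least squares; the feature vectors and target values below are fabricated and do not come from the paper):

```python
import numpy as np

# Toy regression view of static ranking: map page features to a rank value.
# The feature values and target ranks below are fabricated for illustration.
X = np.array([[0.9, 120.0, 35.0],    # e.g. [popularity, inlinks, words] per page
              [0.2,   4.0, 10.0],
              [0.7,  60.0, 22.0],
              [0.1,   2.0,  5.0]])
y = np.array([4.0, 1.0, 3.0, 0.0])   # target value (rank) for each page

# Fit a linear model y ~ X @ w + b via least squares.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

new_page = np.array([0.5, 30.0, 18.0, 1.0])        # unseen page's features (+ bias)
print("predicted static value:", new_page @ w)
```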
9
Recent work on the ranking problem attempts to optimize the ordering of the objects rather than predict their exact values. The goal of the ranking problem:
▪ Learn a function f such that f(x_i) > f(x_j) for all pairs ⟨i, j⟩ ∈ Z,
▪ where Z is the set of pairs for which item i is ranked higher than item j.
10
The features are organized into groups:
PageRank: this paper optionally uses the PageRank of a page as a feature.
Popularity: the number of times the page has been visited by users over some period of time.
Anchor text and inlinks: features based on the information associated with links to the page.
Page: simple features such as the number of words and the frequency of the most common term.
Domain: features computed as averages across all pages in the domain.
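A purely hypothetical sketch of what one per-page feature vector might look like; the field names and values are illustrative assumptions, not the paper's actual feature definitions:

```python
from dataclasses import dataclass, asdict

# Hypothetical grouping of per-page static features; names and values are
# illustrative only and do not reproduce the paper's exact feature set.
@dataclass
class PageFeatures:
    pagerank: float          # PageRank group
    visits: int              # Popularity group
    num_inlinks: int         # Anchor text and inlinks group
    anchor_text_terms: int   # Anchor text and inlinks group
    num_words: int           # Page group
    top_term_frequency: int  # Page group
    domain_avg_words: float  # Domain group (average over pages in the domain)

    def as_vector(self) -> list:
        """Flatten the grouped features into one input vector for the ranker."""
        return list(asdict(self).values())

example = PageFeatures(pagerank=0.0003, visits=1200, num_inlinks=45,
                       anchor_text_terms=300, num_words=850,
                       top_term_frequency=31, domain_avg_words=640.0)
print(example.as_vector())
```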
11
This paper proposes fRank (for feature-based ranking), which applies RankNet to this set of features.
12
This paper needed the correct ordering for a set of pages. It employed a dataset of 28,000 queries randomly selected by the MSN search engine, where the probability that a query is selected is proportional to its frequency. Each page judged for a query is assigned a rating from 0 to 4.
13
Common queries are more likely to be judged than uncommon queries, so the judged pages tend to be of higher quality. The data is converted from query-dependent to query-independent: the query is removed, and each page keeps the maximum rating over all judgments it received.
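A small sketch of that conversion (Python; the sample (query, URL, rating) triples are fabricated for illustration):

```python
from collections import defaultdict

# Fabricated (query, url, rating) judgments; ratings range from 0 to 4.
judgments = [
    ("seattle weather", "weather.example.com", 4),
    ("forecast",        "weather.example.com", 3),
    ("seattle weather", "blog.example.org",    1),
    ("old blog",        "blog.example.org",    2),
]

# Query-dependent -> query-independent: drop the query and keep, for each URL,
# the maximum rating over all judgments it received.
static_label = defaultdict(int)
for _query, url, rating in judgments:
    static_label[url] = max(static_label[url], rating)

print(dict(static_label))
# {'weather.example.com': 4, 'blog.example.org': 2}
```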
14
Because this paper evaluates pages on queries that occur frequently, the data indicates the correct index ordering, assigning high value to pages that are relevant for common queries.
15
This paper chose pairwise accuracy to evaluate the quality of a static ranking. Pairwise accuracy is the fraction of pairs of pages for which the static ranking agrees with the human judgments. If s(x) is the static ranking assigned to page x and H(x) is the human judgment of relevance for x, then consider the following sets:
▪ H_p = {⟨x, y⟩ : H(x) > H(y)}, the pairs ordered by the human judgments
▪ S_p = {⟨x, y⟩ : s(x) > s(y)}, the pairs where the static ranking puts x above y
Pairwise accuracy = |H_p ∩ S_p| / |H_p|, the fraction of humanly ordered pairs that the static ranking orders the same way.
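A minimal sketch of computing pairwise accuracy (Python; the static scores and judgments below are made up):

```python
from itertools import permutations

def pairwise_accuracy(s: dict, h: dict) -> float:
    """Fraction of pairs ordered by the human judgments (H(x) > H(y))
    that the static ranking also orders with s(x) > s(y)."""
    judged_pairs = [(x, y) for x, y in permutations(h, 2) if h[x] > h[y]]
    agreeing = sum(1 for x, y in judged_pairs if s[x] > s[y])
    return agreeing / len(judged_pairs)

# Made-up static scores s(x) and human judgments H(x) for four pages.
s = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1}
h = {"a": 4,   "b": 2,   "c": 1,   "d": 0}
print(pairwise_accuracy(s, h))   # 5 of 6 judged pairs agree -> 0.833...
```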
16
This measure was chosen for two reasons:
First, the discrete human judgments provide only a partial ordering over Web pages, making it difficult to apply measures that require a total ordering.
Second, pairwise accuracy works directly on pairs of documents, measuring how often the static ranking agrees with the human judgments on each pair, which matches the pairwise form of the training objective.
17
This paper trained fRank using the following parameters:
2-layer network
10 hidden nodes
input weights
▪ all initialized to zero
output weights
▪ initialized from a uniform random distribution in the range [-0.1, 0.1]
This paper used 'tanh' as the transfer function from the input to the hidden layer and a linear function from the hidden layer to the output.
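A minimal NumPy sketch of a network with this shape (10 tanh hidden units, linear output, zero-initialized input weights, uniform [-0.1, 0.1] output weights); the number of input features and the forward-pass-only scope are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed number of input features; the slides do not specify it.
n_inputs, n_hidden = 7, 10

# Input-to-hidden weights initialized to zero, as described in the slides.
W_in = np.zeros((n_inputs, n_hidden))
b_in = np.zeros(n_hidden)

# Hidden-to-output weights drawn uniformly from [-0.1, 0.1]; linear output.
W_out = rng.uniform(-0.1, 0.1, size=n_hidden)
b_out = 0.0

def frank_score(x: np.ndarray) -> float:
    """Forward pass: tanh hidden layer, linear output (training loop omitted)."""
    hidden = np.tanh(x @ W_in + b_in)
    return float(hidden @ W_out + b_out)

# Before training, the zero input weights make every score 0.0.
print(frank_score(np.ones(n_inputs)))
```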
18
The learning rate is decayed during training, where k is the initial rate (0.001) and ℰ tracks the decay steps. In the paper's experiments, computing fRank for all 5 billion Web pages was approximately 100 times faster than computing PageRank for the same set. In general, lower training rates require more training iterations; a higher training rate allows the network to converge more rapidly, but the chance of reaching a non-optimal solution is greater.
19
Basic results
Results for individual feature sets
20
Ablation Result
21
fRank performance as feature sets are added
22
Top ten URLs for PageRank vs. fRank: PageRank's top pages are dominated by technology sites, while fRank's are more consumer-oriented sites.