
1 Web Search Engines Page Ranking

2 Web Search Engine A web search engine is a tool that searches the Web for documents matching specified keywords and returns a list of the documents in which the keywords were found.

3 Market Share

4 Components of a Web Search Engine 1. User Interface 2. Parser 3. Web Crawler 4. Database 5. Ranking Engine

5 User Interface The user interface is the part of the web search engine that interacts with users, allowing them to submit queries and view query results.

6 Parser The parser is the component that provides term (keyword) extraction for both sides: it determines the keywords of the user query and all the terms of the web documents that have been scanned by the crawler. The term extraction procedure includes the following subprocedures, sketched in the code below: 1. Tokenization 2. Normalization 3. Stemming 4. Stop word handling
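
A minimal sketch of these four steps in Python, assuming English text; the STOP_WORDS list and the crude suffix-stripping stem() are illustrative stand-ins for a real stop list and a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "in", "of", "and", "or", "to", "is"}  # tiny sample list

def tokenize(text):
    # 1. Tokenization: split raw text into candidate terms
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    # 2. Normalization: fold case (real parsers also handle accents, hyphens, ...)
    return token.lower()

def stem(token):
    # 3. Stemming: crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def extract_terms(text):
    # 4. Stop-word handling: drop terms that carry little content
    return [stem(normalize(t)) for t in tokenize(text) if normalize(t) not in STOP_WORDS]

print(extract_terms("Crawling and indexing of the linked pages"))
# -> ['crawl', 'index', 'link', 'page']
```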

7 Web Crawler A web crawler is a relatively simple automated program, or script, that methodically scans, or "crawls", Internet pages to create an index of the data it is looking for. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer.

8 How Does a Web Crawler Work? When a web crawler visits a web page, it reads the visible text, the hyperlinks, and the content of the various tags used in the site, such as keyword-rich meta tags. Using the information gathered by the crawler, the search engine then determines what the site is about and indexes the information. Lastly, the website is included in the search engine's database and in its page ranking process. A minimal sketch of this loop follows.
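
A hedged, minimal crawler sketch using only the Python standard library; the fetch-store-extract-enqueue loop is the point, while a real crawler also honors robots.txt, rate-limits requests, and parses page content into the index. The seed URL and max_pages limit are illustrative.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    index, frontier, seen = {}, deque([seed]), {seed}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):   # unreachable page or unsupported URL
            continue
        index[url] = html               # a real engine would extract terms here
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index

# pages = crawl("https://example.com/")   # hypothetical seed URL
```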

9 Web Crawler Architecture

10 Database The database is the component that stores all the text and metadata describing the web documents scanned by the crawler.

11 Ranking Engine The ranking engine is essentially the ranking algorithm: it operates on the current data indexed by the crawler to produce an ordering of the web documents by relevance to the user query.

12 INFORMATION RETRIEVAL

13 Information Retrieval Information retrieval is a process that must be handled while crawling the web documents of the World Wide Web. After retrieval, the most important question is how to organize the information so that it can be processed in an efficient manner. Several information retrieval models are available that organize the data efficiently for the matching process determined by the system.

14 Boolean Information Retrieval Model It is an exact-match model: a page is either retrieved or not, according to whether it matches the query. It uses the Boolean operators AND, OR, and NOT. It was the first model used in information retrieval, but it has several disadvantages: 1. Its main disadvantage is that it does not provide a ranking of the retrieved documents. 2. The model either retrieves a document or not, which can lead the system to make rather frustrating all-or-nothing decisions. The toy example below illustrates this behaviour.
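
A toy illustration of exact matching, assuming a tiny hand-built inverted index: set operations answer the Boolean query, and no scores are produced.

```python
docs = {
    1: {"web", "search", "engine"},
    2: {"web", "crawler"},
    3: {"page", "ranking"},
}

def postings(term):
    # documents containing the term (a real engine precomputes this index)
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Query: web AND (crawler OR ranking) -> pure set operations, no ranking
result = postings("web") & (postings("crawler") | postings("ranking"))
print(sorted(result))  # [2] -- doc 2 matches exactly; nothing is scored
```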

15 Vector Space Model It is a model in which each concept is represented as a vector in a common space. The space includes: terms (the vocabulary), documents, and the query.

16 Page Weight-Vector Computation Any document d in the vector space is likewise represented as a weight vector, in which each weight component is computed using either the TF (Term Frequency) scheme or the TF-IDF scheme (TF combined with Inverse Document Frequency). According to the chosen scheme, a weight vector, in which each element wi corresponds to the weight of term ti, is computed for the document d.

17 TF Weighting Scheme In the TF scheme, the weight of a term ti in document dj is the number of times ti appears in dj, denoted fij (see the snippet below). The scheme computes intra-class weight, but what about inter-class?
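
In code, the TF weight is just a count; the sample document is invented for illustration.

```python
from collections import Counter

doc = "web search engines rank web pages".split()
tf = Counter(doc)   # f_ij: raw count of each term in the document
print(tf["web"])    # 2
```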

18 TF-IDF Weighting Scheme Suppose there are N web pages in the database and dfi is the number of documents in which ti appears at least once. Then the inverse document frequency is computed as: idfi = log(N / dfi) Then, the term weight is computed by: wij = tfij x idfi This scheme reflects the inter-class effect: if a term appears in many pages, it is probably not very important.
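
A small sketch computing TF-IDF weights over an invented three-document corpus, using the idf formulation above.

```python
import math
from collections import Counter

corpus = [
    "web search engines rank web pages".split(),
    "a web crawler scans web pages".split(),
    "page ranking orders results".split(),
]
N = len(corpus)

def df(term):
    # number of documents containing the term at least once
    return sum(1 for doc in corpus if term in doc)

def tf_idf(term, doc):
    return Counter(doc)[term] * math.log(N / df(term))

print(round(tf_idf("web", corpus[0]), 3))   # frequent across docs -> low idf
print(round(tf_idf("rank", corpus[0]), 3))  # rare across docs -> higher idf
```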

19 Similarity Measure Formulas For a document weight vector D and a query weight vector Q:
Dot product: sim(D,Q) = D·Q = Σi wi,D x wi,Q
Cosine: sim(D,Q) = D·Q / (|D| |Q|)
Dice: sim(D,Q) = 2(D·Q) / (|D|² + |Q|²)
Jaccard: sim(D,Q) = D·Q / (|D|² + |Q|² − D·Q)
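
The four measures written out for plain lists of weights; the example vectors D and Q are arbitrary.

```python
import math

def dot(d, q):
    return sum(x * y for x, y in zip(d, q))

def cosine(d, q):
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

def dice(d, q):
    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

def jaccard(d, q):
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

D, Q = [3.0, 1.0], [1.0, 2.0]
for measure in (dot, cosine, dice, jaccard):
    print(measure.__name__, round(measure(D, Q), 3))
```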

20 Introduction There are a great many web servers on the Internet and numerous web pages on each of them. It is therefore very important for any web search engine to rank the pages so that the pages containing the most useful data about the searched keyword or subject are listed in the higher places. To provide the desired ordering of the web pages, a page ranking algorithm utilizes some valuable metrics about the web pages and orders the pages accordingly.

21 Introduction (cont.) Together with the development of the Internet and the popularity of the World Wide Web, web page ranking systems have drawn significant attention. Many web search engines have been introduced so far, but they still have difficulty providing completely relevant answers to general-subject queries. The main reason is not a lack of data but rather an excess of data.

22 Most Popular Ranking Algorithms 1. PageRank Algorithm (Google) 2. HITS Algorithm (IBM)

23 PageRank Algorithm (Google) The "PageRank" algorithm, proposed by Google founders Sergey Brin and Lawrence Page, is one of the most common page ranking algorithms and is currently used by the leading search engine Google. In general, the algorithm uses the linking (citation) information among pages as the core metric of the ranking procedure. The existence of a link from page p1 to page p2 may indicate that the author of p1 is interested in page p2.

24 PageRank Algorithm The PageRank metric PR(p) defines the importance of page p to be the sum of the importance of the pages that point to p. More formally, consider pages p1,…,pn, which link to a page pi, and let cj be the total number of links going out of page pj. Then the PageRank of page pi is given by: PR(pi) = (1-d) + d[PR(p1)/c1 + … + PR(pn)/cn] where d is the damping factor. An iterative implementation is sketched below.
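
A minimal iterative implementation of this (original, non-normalized) formulation; the toy link graph is invented, and a production implementation would also handle dangling pages and test for convergence instead of running a fixed number of iterations.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # uniform starting values
    for _ in range(iterations):
        # PR(p) = (1 - d) + d * sum over in-neighbors q of PR(q) / c_q
        pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                   for q in pages if p in links[q])
              for p in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # who links to whom
print({p: round(v, 3) for p, v in pagerank(links).items()})
```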

25 PageRank Algorithm This damping factor d makes sense because users will only continue clicking on links for a finite amount of time before they get distracted and start exploring something completely unrelated. With probability d, the user clicks on one of the cj links of the current page pj at random; with the remaining probability (1-d), the user jumps to an unrelated page. The damping factor is usually set to 0.85, so it is easy to infer that every page distributes 85% of its original PageRank evenly among all the pages to which it points.

26 HITS Algorithm (IBM) It is executed at query time, not at indexing time, with the performance hit that accompanies query-time processing. Thus, the hub (outgoing) and authority (incoming) scores assigned to a page are query-specific. It is not commonly used by search engines. It computes two scores per document, hub and authority, as opposed to PageRank's single score. It is computed on a small subset of 'relevant' documents, not on all documents as is the case with PageRank.

27 Problems of the PageRank Algorithm 1. It is a static algorithm: because of its cumulative scheme, popular pages tend to stay popular. 2. The popularity of a site does not guarantee that it contains the information the searcher desires, so a relevance factor also needs to be included. 3. The amount of data available on the Internet is huge, and the algorithm is not fast enough. 4. It should support personalized search, so that personal preferences are reflected in the search results.

28 REVIEW OF PAPERS

29 Papers to be Reviewed
1. "SimRank: A Page Rank Approach based on Similarity Measure", Shaojie Qiao, Tianrui Li, Hong Li, Yan Zhu, Jing Peng, Jiangtao Qiu (Intelligent Systems and Knowledge Engineering (ISKE), 2010 International Conference)
2. "A Relation Based Page Rank Algorithm for Semantic Web Search Engines", Fabrizio Lamberti, Andrea Sanna, and Claudio Demartini (IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 1, January 2009)
3. "Page Quality: In Search of an Unbiased Web Ranking", Junghoo Cho, Sourashis Roy, Robert E. Adams (SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA)

30 SimRank: A Page Rank Approach based on Similarity Measure The algorithm is proposed with the main aim of overcoming one of the most substantial drawbacks of the PageRank algorithm: it does not take content relevance into consideration (topic drift). First, the paper proposes a new similarity measure to compute the similarity of pages and applies it to partition a web database into several web social networks (WSNs).

31 Contributions of the Paper 1. It uses a new similarity measure, derived from the vector space model, to compute the similarity between pages based on their terms, and applies it to partition the Web into distinct WSNs. 2. It proposes a weighted page rank algorithm, called SimRank, that considers the relevance of a page to the given query, which probably improves the accuracy of page scoring. 3. It also introduces a new web crawler, developed by the authors, that can filter out useless pages.

32 Similarity Measure The algorithm mainly uses a variation of the vector space model's TF-IDF weighting scheme; the slightly varied weight computation is given in the paper. Lastly, it uses the Jaccard similarity formula to compute the similarity between documents.

33 SimRank Algorithm It consists of two phases: 1. Apply the similarity measure within the k-means algorithm to partition the crawled web pages into distinct WSNs. 2. Use an improved PageRank algorithm in which two distinct weight values are assigned to the title and the body of a page, respectively. The sketch below illustrates the title/body weighting idea.
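
A hedged sketch of the second phase's title/body weighting idea only; the weight values and the stand-in similarity function are assumptions for illustration, not the parameters used in the paper.

```python
TITLE_WEIGHT, BODY_WEIGHT = 0.7, 0.3   # assumed values, for illustration

def similarity(query_terms, text):
    # stand-in for the TF-IDF/Jaccard similarity described earlier
    terms = set(text.lower().split())
    q = set(query_terms)
    return len(q & terms) / len(q | terms) if q | terms else 0.0

def page_score(query_terms, title, body):
    # title matches count more toward the page's score than body matches
    return (TITLE_WEIGHT * similarity(query_terms, title)
            + BODY_WEIGHT * similarity(query_terms, body))

print(round(page_score(["page", "ranking"],
                       "Page ranking algorithms",
                       "A survey of link analysis methods"), 3))
```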

34 Web Crawler of SimRank It can eliminate useless documents, such as advertising web pages, resulting in savings of both disk space and time.

35 Experiment Results (Figure: accuracy of page ranking)

36 A Relation Based Page Rank Algorithm for Semantic Web Search Engines This paper proposes a relation-based page rank algorithm to be used in conjunction with Semantic Web search engines; it relies simply on information that can be extracted from user queries and from annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition.

37 Semantic Web In the Semantic Web, each page possesses semantic metadata that record additional details concerning the web page itself.

38 What is an Ontology? An ontology is an explicit and formal specification of a conceptualization. Ontologies provide a shared understanding of a domain, which allows interoperability between systems at the semantic level. Components of an ontology: 1. Terms 2. Relations

39 Semantic Web Infrastructure

40 Ontology Graph

41 Annotation Graphs (for two pages: one describing activities, accommodations, and sightseeing places in Rome; the other a hotel in the historical center of Rome, close to museums)

42 Query Subgraph and Page Subgraph In a query subgraph, the nodes are the concepts that have been specified within the query. Concepts are linked by a weighted edge only if there exists at least one relation between those concepts in the ontology; the weight is the actual number of such relations. Similarly, a page subgraph is built from the annotation associated with the page itself. A sketch of the construction follows.
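
A sketch of the query-subgraph construction under the stated rule; the tiny triple-store ontology and the concept names are invented for illustration.

```python
from itertools import combinations

# (concept, relation, concept) triples standing in for an ontology
ontology = [
    ("hotel", "locatedIn", "city"),
    ("hotel", "near", "museum"),
    ("museum", "locatedIn", "city"),
    ("hotel", "offers", "accommodation"),
]

def query_subgraph(concepts):
    edges = {}
    for a, b in combinations(concepts, 2):
        # edge weight = number of ontology relations between the two concepts
        weight = sum(1 for s, _, o in ontology if {s, o} == {a, b})
        if weight > 0:   # link concepts only if at least one relation exists
            edges[(a, b)] = weight
    return edges

print(query_subgraph(["hotel", "museum", "city"]))
# {('hotel', 'museum'): 1, ('hotel', 'city'): 1, ('museum', 'city'): 1}
```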

43 Graph showing the transitions needed to reach the page-query relevance graph

44 Main Steps of the Algorithm (1) 1. The algorithm starts from a page subgraph computed over an annotated page. (Figures: ontology graph; page subgraph.)

45 Main Steps of the Algorithm (2) 2. It generates all the possible combinations of the edges belonging to the subgraph itself, not including cycles, since there could exist pages in which some concepts show no relations with other concepts but could still be of interest to the user (a spanning forest). (Figures: spanning forest; page subgraph built on the given query.)

46 Main Steps of the Algorithm (3) 3. It continues by reducing the number of edges in the page subgraph and computes the probability that each of the resulting subgraphs, obtained as a combination of the remaining edges, is the one that matches the user's intention.

47 Page Quality: In Search of an Unbiased Web Ranking This paper addresses the most important problem of Google's page ranking algorithm: PageRank is biased against unpopular pages, especially those created recently. The metric it tries to use for ranking is not the current popularity of a page, but the probability that a web user will like the page when seeing it for the first time.

48 The paper first introduces some definitions of the critical notions involved and their measurability. It then derives some useful formulas based on those definitions. Lastly, it proposes a quality estimator formula that takes the quality of a page into consideration in the ranking procedure.

49 Page Quality It defines the quality of a page p, Q(p), as the conditional probability that an average user will like the page when seeing it for the first time: Q(p) = P(Lp | Ap) where Ap represents the event that the user becomes newly aware of page p by visiting it for the first time, and Lp represents the event that the user likes the page.

50 Its main idea is based on the following observations: 1. The creation of a link often indicates that a user likes the page. 2. A high-quality page will be liked by most of its visitors, so its popularity may increase more rapidly than that of other pages.

51 Quality Estimation Principle The algorithm suggests that for a page p, both the current popularity P(p) and the popularity increase should be considered. From this principle, a quality estimation formula is derived; it is made precise over the next few slides.

52 Definitions Popularity: the popularity of a page p at time t, P(p,t), is the fraction of web users who like the page. Visit Popularity: the visit popularity of a page p at time t, V(p,t), is the number of visits or page views the page gets within a unit time interval at time t. User Awareness: the user awareness of a page p at time t, A(p,t), is the fraction of web users who are aware of the page at time t.

53 LEMMA 1 From the definitions given before, it can be derived that the popularity of a page p at time t, P(p,t), equals the fraction of web users who are aware of p at time t, A(p,t), times the quality of p: P(p,t) = A(p,t) * Q(p) Why can't we compute Q(p) just by using this formula? Because the awareness A(p,t) cannot be observed directly.

54 Propositions Popularity-equivalence hypothesis: the number of visits to page p within a unit time interval at time t is proportional to how many people like the page. That is, V(p,t) = r * P(p,t) where r is a normalization constant. Random-visit hypothesis: all web users are equally likely to visit a particular page.

55 LEMMA 2 The user awareness of p at time t, A(p,t), can be computed from its past popularity through the following formula: A(p,t) = 1 - e^(-(r/n) * integral from 0 to t of P(p,τ) dτ) where n is the total number of web users. (This is the solution of dA/dt = (r/n) * P(p,t) * (1 - A(p,t)), which follows from the two hypotheses above.)

56 Combining the two lemmas, the popularity of page p evolves over time according to: dP(p,t)/dt = (r/n) * P(p,t) * (Q(p) - P(p,t)) a logistic growth in which popularity rises along an S-curve toward the limit Q(p). A small numerical simulation follows.
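
A small numerical simulation of this dynamic under the two hypotheses; the parameter values (quality Q, ratio r/n, step size) are arbitrary and chosen only to make the S-shaped growth visible.

```python
def simulate(Q=0.8, r_over_n=0.5, steps=100, dt=0.5):
    A, history = 0.01, []                    # small initial awareness
    for _ in range(steps):
        P = A * Q                            # Lemma 1: popularity = awareness * quality
        A += r_over_n * P * (1 - A) * dt     # hypotheses: dA/dt = (r/n) * P * (1 - A)
        history.append(P)
    return history

trace = simulate()
print(round(trace[0], 4), round(trace[-1], 4))   # starts tiny, saturates near Q
```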

57 Quality Estimator The quality of a page is proportional to its popularity increase and inversely proportional to its current popularity. It is also inversely proportional to the fraction of users who are still unaware of the page, 1 - A(p,t). In symbols: Q(p) ∝ (dP(p,t)/dt) / [P(p,t) * (1 - A(p,t))]

58 Relative Popularity Increase Function In this equation, the two main factors P(p,t) and dP(p,t)/dt can easily be computed by downloading the same page several times, but the factor A(p,t) is hard to compute. So, by omitting the factor A(p,t), another notion appears, called the relative popularity increase function and denoted I(p,t) = (dP(p,t)/dt) / P(p,t).

59 Problematic Behaviour From the time-evolution graph, it is possible to see that I(p,t) can serve as a good estimator of page quality, but it exhibits a problem: it suddenly drops and converges to a lower limit.

60 Solution As can also be seen from the time-evolution graphs of P(t) and I(t), when I(t) goes down, P(t) tends to go up; the two behave in a complementary way. So taking the combined contribution of P(t) and I(t) as the quality estimator makes sense: the quality of page p, Q(p), is taken to be the sum of its relative popularity increase I(p,t) and its popularity P(p,t): Q(p) = I(p,t) + P(p,t) A sketch of estimating this from crawl snapshots follows.
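
A sketch of applying the estimator to popularity snapshots from successive crawls, using a finite difference for dP/dt; the snapshot values are made up, and the choice of denominator in the finite difference is an implementation detail, not taken from the paper.

```python
def quality_estimate(p_prev, p_curr, dt=1.0):
    relative_increase = (p_curr - p_prev) / (dt * p_curr)   # I(p,t) ≈ (dP/dt) / P
    return relative_increase + p_curr                       # Q(p) = I(p,t) + P(p,t)

snapshots = [0.10, 0.14, 0.18]    # one page's popularity over three crawls
print(round(quality_estimate(snapshots[0], snapshots[1]), 3))
print(round(quality_estimate(snapshots[1], snapshots[2]), 3))
```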

61 Changing Quality As can be seen from the graph of I(p,t) + P(p,t), the estimated quality of the web page stays at the same level, which is not the case in real life. To handle this situation, the formula is updated by including the term P(p,t) in a cumulative way, under the assumption that the quality of page p, Q(p), changes from Q1 to Q2 at time T. (The exact updated formula is given in the paper.)

62 QUESTIONS?

