
Web Search Engines Page Ranking

Web Search Engine A Web search engine is a tool that searches documents on the Web for specified keywords and returns a list of the documents in which the keywords were found.

Market Share

Components of a Web Search Engine
1. User Interface
2. Parser
3. Web Crawler
4. Database
5. Ranking Engine

User Interface It is the part of the Web search engine that interacts with the users, allowing them to submit queries and view the query results.

Parser It is the component providing term (keyword) extraction for both sides: the parser determines the keywords of the user query and all the terms of the Web documents that have been scanned by the crawler. The term extraction procedure includes the following subprocedures, illustrated in the sketch below:
1. Tokenization
2. Normalization
3. Stemming
4. Stop word handling
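To make the four subprocedures concrete, here is a minimal, hypothetical Python sketch; the stop-word list is tiny and the suffix-stripping step is a crude stand-in for a real stemmer such as Porter's:

import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it"}  # tiny illustrative list

def extract_terms(text: str) -> list[str]:
    """Illustrative term-extraction pipeline: tokenize, normalize,
    stem (naive suffix stripping), and drop stop words."""
    tokens = re.findall(r"[a-zA-Z]+", text)                    # 1. tokenization
    tokens = [t.lower() for t in tokens]                       # 2. normalization (case folding)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]   # 3. crude stemming stand-in
    return [t for t in tokens if t and t not in STOP_WORDS]    # 4. stop-word handling

print(extract_terms("The crawler is scanning linked pages"))
# ['crawler', 'scann', 'link', 'page']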

Web Crawler A web crawler is a relatively simple automated program, or script, that methodically scans or "crawls" through Internet pages to create an index of the data it is looking for. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer.

How Does a Web Crawler Work? When a web crawler visits a web page, it reads the visible text, the hyperlinks, and the content of the various tags used on the site, such as keyword-rich meta tags. Using the information gathered by the crawler, a search engine then determines what the site is about and indexes the information. Lastly, the website is included in the search engine's database and its page-ranking process.
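A minimal, illustrative crawler using only the Python standard library; it reads visible text and hyperlinks exactly as described above, but omits the robots.txt handling, politeness delays, and URL canonicalization that a production crawler needs:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkAndTextParser(HTMLParser):
    """Collects visible text and href links, roughly what a crawler reads."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl from a seed URL; returns {url: extracted text}."""
    frontier, seen, index = [seed_url], set(), {}
    while frontier and len(index) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkAndTextParser()
        parser.feed(html)
        index[url] = " ".join(parser.text)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return index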

Web Crawler Architecture

Database It is the component that stores all the text and metadata describing the web documents scanned by the crawler.

Ranking Engine This component is essentially the ranking algorithm: it operates on the current data indexed by the crawler to provide an ordering of the web documents by relevance to the user query.

INFORMATION RETRIEVAL

Information Retrieval Information retrieval is a process that needs to be handled while crawling the web documents of the World Wide Web. After retrieval, the most important part is how to organize the information so that it can be processed in an efficient manner. Several information retrieval models provide efficient organization of the data for the matching process determined by the system.

Boolean Information Retrieval Model It is an exact-match model: a page is either retrieved or not, according to whether it matches the query. It uses the Boolean operators AND, OR, and NOT. It was the first model used in information retrieval, but it has several disadvantages (a toy implementation follows below):
1. Its main disadvantage is that it does not provide a ranking of the retrieved documents.
2. The model either retrieves a document or not, which can lead the system to make rather frustrating decisions.
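A toy illustration of exact-match Boolean retrieval over an inverted index (the corpus contents are invented); note the result is an unordered set, with no ranking among the matching documents:

# Toy corpus and inverted index
docs = {
    1: "web search engine ranking",
    2: "web crawler index",
    3: "page ranking algorithm",
}
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
# Query: web AND ranking AND NOT crawler
result = (index.get("web", set())
          & index.get("ranking", set())
          & (all_ids - index.get("crawler", set())))
print(sorted(result))  # [1] -- exact match only, no ordering among results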

Vector Space Model It is a model in which each concept is represented as a vector in a common space. The space includes: terms (the vocabulary), documents, and the query.

Page Weight-Vector Computation Any document d in the vector space is represented as a weight vector, in which each weight component is computed using either the TF (Term Frequency) scheme or the TF-IDF (TF with Inverse Document Frequency) scheme. According to the chosen scheme, a weight vector, in which each element wi corresponds to the weight of term ti, is computed for the document d.

TF Weighting Scheme In the TF scheme, the weight of a term ti in document dj is the number of times that ti appears in dj, denoted fij. The scheme computes the intra-class weight, but what about the inter-class effect?

TF-IDF Weighting Scheme Suppose there are N web pages in the database and dfi is the number of documents in which ti appears at least once. Then the inverse document frequency is computed as:
idfi = log(N / dfi)
Then, the term weight is computed by:
wij = tfij x idfi
This scheme reflects the inter-class effect: if a term appears in many pages, it is probably not very important.
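As a sketch of how these weights might be computed, the following Python fragment builds TF-IDF vectors for a toy three-document collection (the document contents are invented for illustration):

import math
from collections import Counter

docs = {
    "d1": ["web", "search", "engine", "web"],
    "d2": ["page", "ranking", "web"],
    "d3": ["ranking", "algorithm"],
}
N = len(docs)
# df_i: number of documents containing term t_i at least once
df = Counter(term for terms in docs.values() for term in set(terms))

def tf_idf(doc_id: str) -> dict[str, float]:
    """Weight vector for one document: w_ij = tf_ij * idf_i,
    with idf_i = log(N / df_i)."""
    tf = Counter(docs[doc_id])
    return {t: f * math.log(N / df[t]) for t, f in tf.items()}

print(tf_idf("d1"))
# 'web' appears in all 3 docs, so idf = log(1) = 0: frequent everywhere => unimportant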

Similarity Measure Formulas
Dot product: sim(D,Q) = Σi wdi x wqi
Cosine: sim(D,Q) = Σi wdi x wqi / (sqrt(Σi wdi²) x sqrt(Σi wqi²))
Dice: sim(D,Q) = 2 x Σi wdi x wqi / (Σi wdi² + Σi wqi²)
Jaccard: sim(D,Q) = Σi wdi x wqi / (Σi wdi² + Σi wqi² − Σi wdi x wqi)
(Figure: document vector D and query vector Q plotted in a two-term space t1, t2.)
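A direct transcription of these formulas in Python, with documents and queries represented as sparse term-to-weight dictionaries (the example vectors are made up):

import math

def dot(d: dict[str, float], q: dict[str, float]) -> float:
    return sum(w * q.get(t, 0.0) for t, w in d.items())

def cosine(d, q):
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

def dice(d, q):
    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

def jaccard(d, q):
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

D = {"web": 0.0, "search": 1.1, "engine": 1.1}   # e.g. a TF-IDF vector
Q = {"search": 1.0, "ranking": 1.0}
print(round(cosine(D, Q), 3), round(dice(D, Q), 3), round(jaccard(D, Q), 3))
# 0.5 0.498 0.331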

Introduction There are many web servers on the Internet and numerous web pages on each of them. It is therefore important for any web search engine to rank the pages, listing the pages containing the most useful data for the searched keyword or subject at higher positions. To provide the desired ordering of web pages, a page ranking algorithm is a technique that utilizes valuable metrics about the web pages and orders the pages accordingly.

Introduction (cont.) Together with the development of the Internet and the popularity of the World Wide Web, web page ranking systems have drawn significant attention. Many web search engines have been introduced, but they still have difficulty providing completely relevant answers to general-subject queries. The main reason is not a lack of data but rather an excess of data.

Most Popular Ranking Algorithms
1. PageRank Algorithm (Google)
2. HITS Algorithm (IBM)

PageRank Algorithm (Google) The “PageRank” algorithm, proposed by the Google founders Sergey Brin and Lawrence Page, is one of the most common page ranking algorithms and is currently used by the leading search engine Google. In general, the algorithm uses the linking (citation) information among the pages as the core metric in the ranking procedure. The existence of a link from page p1 to page p2 may indicate that the author of p1 is interested in page p2.

PageRank Algorithm The PageRank metric PR(p) defines the importance of page p to be the sum of the importance of the pages that point to p. More formally, consider pages p1,…,pn, which link to a page pi, and let cj be the total number of links going out of page pj. Then the PageRank of page pi is given by:
PR(pi) = d + (1 − d) x [PR(p1)/c1 + … + PR(pn)/cn]
where d is the damping factor.

PageRank Algorithm This damping factor d makes sense because users will only continue clicking on links for a finite amount of time before they get distracted and start exploring something completely unrelated. With the remaining probability (1 − d), the user clicks on one of the cj links on page pj at random. The damping factor is usually set so that (1 − d) = 0.85, so it is easy to infer that every page distributes 85% of its original PageRank evenly among all pages to which it points.
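A minimal power-iteration sketch of the formula exactly as these slides state it (the convention where d ≈ 0.15 is the random-jump probability); the graph, page names, and iteration count are invented, and production implementations also handle pages without out-links and far larger graphs:

def pagerank(links: dict[str, list[str]], d: float = 0.15, iters: int = 50) -> dict[str, float]:
    """Iteratively evaluates the slide's formula
    PR(pi) = d + (1 - d) * sum(PR(pj) / cj for each pj linking to pi),
    where cj is the number of out-links of pj."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                 # arbitrary initial scores
    for _ in range(iters):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = d + (1 - d) * incoming
        pr = new
    return pr

# Tiny web: a <-> b, both link to c; c links back to a
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
print(pagerank(links))  # c collects rank from a and b; a benefits from c's single out-link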

HITS Algorithm (IBM) It is executed at query time, not at indexing time, with the associated performance hit that accompanies query-time processing. Thus, the hub (outgoing) and authority (incoming) scores assigned to a page are query-specific. It is not commonly used by search engines. It computes two scores per document, hub and authority, as opposed to the single score of PageRank. It is processed on a small subset of 'relevant' documents, not all documents as is the case with PageRank.
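A compact sketch of the HITS iteration on a tiny, hypothetical root set of pages; hub and authority scores reinforce each other and are renormalized every round:

import math

def hits(links: dict[str, list[str]], iters: int = 50):
    """authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of authority scores of pages p links to;
    both normalized after every iteration."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values()))
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"a": ["b", "c"], "b": ["c"], "c": []}
hub, auth = hits(links)
print(auth)  # c is the strongest authority; a is the strongest hub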

Problems of the PageRank Algorithm
1. It is a static algorithm: because of its cumulative scheme, popular pages tend to stay popular.
2. The popularity of a site does not guarantee that it holds the information the searcher desires, so a relevance factor also needs to be included.
3. The data available on the Internet is huge, and the algorithm is not fast enough.
4. It should support personalized search, so that personal specifications are met by the search results.

REVIEW OF PAPERS

Papers to be Reviewed
1. “SimRank: A Page Rank Approach based on Similarity Measure”, Shaojie Qiao, Tianrui Li, Hong Li, Yan Zhu, Jing Peng, Jiangtao Qiu. Intelligent Systems and Knowledge Engineering (ISKE), 2010 International Conference.
2. “A Relation Based Page Rank Algorithm for Semantic Web Search Engines”, Fabrizio Lamberti, Andrea Sanna, and Claudio Demartini. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 1, January 2009.
3. “Page Quality: In Search of an Unbiased Web Ranking”, Junghoo Cho, Sourashis Roy, Robert E. Adams. SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA.

SimRank: A Page Rank Approach based on Similarity Measure The algorithm is proposed with the main aim of overcoming one of the most substantial drawbacks of the PageRank algorithm: it does not take content relevance into consideration (theme drift). First, a new similarity measure is proposed to compute the similarity of pages, and it is applied to partition a web database into several web social networks (WSNs).

Contributions of the Paper
1. It uses a new similarity measure, derived from the vector space model, to compute the similarity between pages based on terms, and applies it to partition the Web into distinct WSNs.
2. It proposes a weighted page rank algorithm, called SimRank, that considers the relevance of a page to the given query, which probably improves the accuracy of page scoring.
3. It also introduces a new web crawler, developed by the authors, that is able to filter out useless pages.

Similarity Measure The algorithm mainly uses a variation of the vector space model's TF-IDF weighting scheme; the slightly varied weight computation is realized as follows (formula omitted). Lastly, it uses the Jaccard similarity measure formula to compute the similarity between documents.

SimRank Algorithm It consists of two phases, sketched below:
1. Apply the similarity measure within the k-means algorithm to partition the crawled web pages into distinct WSNs.
2. Use an improved PageRank algorithm in which two distinct weight values are assigned to the title and the body of a page, respectively.
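The following sketch illustrates the two phases under stated assumptions: scikit-learn's stock TF-IDF and k-means stand in for the paper's modified similarity measure, and the page texts, cluster count, and title/body weights are invented for illustration:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Phase 1 (sketch): vectorize crawled pages and partition them into
# k "web social networks" with k-means.
pages = [
    "football league match results",
    "soccer world cup goals",
    "stock market index prices",
    "exchange rates and market news",
]
X = TfidfVectorizer().fit_transform(pages)
wsn_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(wsn_labels)  # e.g. [0 0 1 1]: sports pages vs. finance pages

# Phase 2 (sketch): weight title terms more than body terms, a stand-in
# for SimRank's two distinct weights for title and body.
def page_score(query_terms, title, body, w_title=2.0, w_body=1.0):
    return sum(w_title * title.count(t) + w_body * body.count(t) for t in query_terms)

print(page_score(["market"], title=["stock", "market"], body=["market", "news"]))  # 3.0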

Web Crawler of SimRank It is able to eliminate useless documents, such as advertising web pages, resulting in both disk and time savings.

Experiment Results (Figure: accuracy of page ranking.)

A Relation Based Page Rank Algorithm for Semantic Web Search Engines This paper proposes a relation-based page rank algorithm to be used in conjunction with Semantic Web search engines; it relies simply on information that can be extracted from user queries and on annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition.

Semantic Web In the Semantic Web, each page possesses semantic metadata that record additional details concerning the web page itself.

What is an Ontology? An ontology is an explicit and formal specification of a conceptualization. Ontologies provide a shared understanding of a domain, which allows interoperability between semantics. Components of an ontology:
1. Terms
2. Relations

Semantic Web Infrastructure

Ontology Graph

Annotation Graphs (Figure: annotation graphs for two pages, annotated respectively as "Activities, accommodations, and sightseeing places in Rome" and "Hotel in the historical center of Rome, close to museums".)

Query Subgraph and Page Subgraph In a query subgraph, nodes represent the concepts that were specified within the query. Concepts are linked by a weighted edge only if there exists at least one relation between those concepts in the ontology; the weight is the actual number of such relations. Similarly, a page subgraph is built from the annotation associated with the page itself.
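A small sketch of how such a query subgraph could be built; the ontology relations and concept names below are invented for illustration:

# Hypothetical ontology: each entry is a relation between two concepts.
ontology_relations = [
    ("hotel", "city", "locatedIn"),
    ("hotel", "museum", "near"),
    ("museum", "city", "locatedIn"),
    ("hotel", "city", "offersViewOf"),
]

def query_subgraph(query_concepts: set[str]) -> dict[tuple[str, str], int]:
    """Nodes are the concepts in the query; an edge appears only if at
    least one ontology relation links the pair, weighted by how many do."""
    edges: dict[tuple[str, str], int] = {}
    for a, b, _rel in ontology_relations:
        if a in query_concepts and b in query_concepts:
            key = tuple(sorted((a, b)))
            edges[key] = edges.get(key, 0) + 1
    return edges

print(query_subgraph({"hotel", "city"}))  # {('city', 'hotel'): 2}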

(Figure: graph showing the transitions needed to reach the page-query relevance graph.)

Main Steps of the Algorithm (1) 1. The algorithm starts from a page subgraph computed over an annotated page. (Figures: ontology graph and page subgraph.)

Main Steps of the Algorithm (2) 2. It generates all the possible combinations of the edges belonging to the subgraph itself, not including cycles, since there could exist pages in which some concepts do not show any relations with other concepts but could still be of interest to the user (a spanning forest). (Figures: spanning forest and page subgraph built on the given query.)

Main Steps of the Algorithm (3) 3. It continues by reducing the number of edges in the page subgraph and computes the probability that each of the resulting subgraphs, obtained as a combination of the remaining edges, is the one that matches the user's intention.

Page Quality: In Search of an Unbiased Web Ranking The algorithm is proposed with the aim of addressing the most important problem of Google's page ranking algorithm: PageRank is biased against unpopular pages, especially those created recently. The ranking metric it tries to use is not the current popularity of the page, but the probability that a web user will like the page when seeing it for the first time.

The paper first introduces some definitions for the critical notions of concern and their measurability. It then derives some useful formulas based on those definitions. Lastly, it proposes a quality estimator formula that takes the quality of a page into consideration in the ranking procedure.

Page Quality It defines the quality of a page p, Q(p), as the conditional probability that an average user will like the page when visiting it for the first time:
Q(p) = P(Lp | Ap)
where Ap represents the event that the user becomes newly aware of page p by visiting it for the first time, and Lp represents the event that the user likes the page.

Its main idea is based on the following observations:
1. The creation of a link often indicates that a user likes the page.
2. A high-quality page will be liked by most of its visitors, so its popularity may increase more rapidly than that of others.

Quality Estimation Principle The algorithm suggests that for a page p, both the current popularity P(p) and the popularity increase should be considered. From the definitions, a quality-estimation formula is derived on the following slides.

Definitions
Popularity: the popularity of a page p at time t, P(p,t), is the fraction of web users who like the page.
Visit popularity: the visit popularity of a page p at time t, V(p,t), is the number of visits (page views) the page gets within a unit time interval at time t.
User awareness: the user awareness of page p at time t, A(p,t), is the fraction of web users who are aware of the page at time t.

LEMMA 1 From the definitions given before, it can be derived that the popularity of a page p at time t, P(p,t), is equal to the fraction of web users who are aware of p at time t, A(p,t), times the quality of p:
P(p,t) = A(p,t) x Q(p)
Why can't we compute Q(p) just by utilizing this formula? Because the awareness A(p,t) is not directly observable; the next slides show how to estimate it.

Propositions
Popularity-equivalence hypothesis: the number of visits to page p within a unit time interval at time t is proportional to how many people like the page. That is, V(p,t) = r x P(p,t), where r is a normalization constant.
Random-visit hypothesis: all web users will visit a particular page with equal probability.

LEMMA 2 The algorithm shows that the user awareness of p at time t, A(p,t), can be computed from its past popularity. Under the two hypotheses, unaware users become aware at the rate dA(p,t)/dt = (r/n) x P(p,t) x (1 − A(p,t)), which solves to:
A(p,t) = 1 − exp(−(r/n) x ∫[0,t] P(p,τ) dτ)
where n is the total number of web users.

The popularity of page p then evolves over time through the following formula, obtained by substituting Lemma 1 into the awareness dynamics:
dP(p,t)/dt = (r/n) x P(p,t) x (Q(p) − P(p,t))
so P(p,t) follows an S-shaped (logistic) curve that saturates at Q(p).

Quality Estimator The quality of a page is proportional to its popularity increase and inversely proportional to its current popularity. It is also inversely proportional to the fraction of the users who are unaware of the page, 1 − A(p,t):
Q(p) = (n/r) x (dP(p,t)/dt) / [P(p,t) x (1 − A(p,t))]

Relative Popularity Increase Function In the equation above, we can easily compute the two main factors P(p,t) and dP(p,t)/dt by downloading the same page several times, but it is hard to compute the factor A(p,t). So, by omitting the factor A(p,t), another notion arises, called the relative popularity increase function and denoted I(p,t):
I(p,t) = (dP(p,t)/dt) / P(p,t)

Problematic Behaviour From the time-evolution graph, it is possible to see that I(p,t) can serve as a good estimator for page quality, but it exhibits a problem: it suddenly drops and converges to a lower limit (once popularity saturates, the popularity increase, and hence I, goes to zero).

Solution As can be seen from the time-evolution graphs of P(t) and I(t), when I(t) goes downwards, P(t) tends upwards; in this scheme they act in a complementary way. So taking the combined effect of P(t) and I(t) as the quality estimator makes sense: the quality of page p, Q(p), is always equal to the sum of its relative popularity increase I(p,t) and its popularity P(p,t):
Q(p) = I(p,t) + P(p,t)
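A quick numerical check of this claim under the model's own hypotheses: simulating the popularity evolution derived above (with r/n normalized to 1, and a made-up true quality) shows that I(p,t) + P(p,t) recovers Q(p) at every instant:

# Simulate dP/dt = P * (Q - P), the evolution derived from Lemmas 1-2
# with r/n = 1, and verify that I(p,t) + P(p,t) equals the true quality.
Q_true = 0.8      # hypothetical true quality of the page
P = 0.001         # initial popularity: the page is new and almost unknown
dt = 0.01
estimate = 0.0
for _ in range(5000):
    dP_dt = P * (Q_true - P)      # popularity increase at time t
    I = dP_dt / P                 # relative popularity increase I(p,t)
    estimate = I + P              # should equal Q at every instant
    P += dP_dt * dt
print(f"I + P = {estimate:.4f}, true Q = {Q_true}")  # I + P = 0.8000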

Changing Quality As can be seen from the graph of I(p,t) + P(p,t), the estimated quality of the web page stays at the same level, which is not the case in real life. To handle this situation, the formula should be updated by including the term P(p,t) in a cumulative way (formula omitted), where it is assumed that the quality of page p, Q(p), changes from Q1 to Q2 at time T.

QUESTIONS ?