Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.


1 Chapter 8 Web Structure Mining Part-1 1

2 Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web. Deals with the topology of hyperlinks, with or without the description of the links. 2

3 Why?  The model can be used to classify web pages.  Helpful to create information such as the similarity and relationship between different websites.  Useful for discovering website type. 3

4 Website type Web structure mining is a suitable tool for discovering authority sites and overview sites for the subjects Authority sites contain information about the subject Overview sites point to many authority sites 4

5 Web Content Mining/ Web Structure Mining  Web Content Mining explores the structure within the document  Web Structure Mining studies citation relationship of documents within the web. 5

6 Algorithms for Web Structure Mining PageRank algorithm (Google Founders)  Looks at the number of links to a website and the importance of the referring links  Computed before the user enters the query. HITS algorithm (Hyperlink-Induced Topic Search)  User receives two lists of pages for a query (authority pages and hub pages, i.e. overview pages that link to authorities)  Computations are done after the user enters the query. 6

7 PageRank 7

8 PageRank Algorithm  The idea of the algorithm came from academic citation literature.  It was developed in 1998 as part of the Google search engine prototype  Studies citation relationship of documents within the web.  Google search engine ranks documents as a function of both the query terms and the hyperlink structure of the web. 8

9 Definition of PageRank  PageRank produces a ranking independent of a user's query.  The importance of a web page is determined by the number of important web pages pointing to it and by the number of out-links on each of those pages. 9

10 [Figure: illustration drawn by Felipe Micaroni Lalli (micaroni@gmail.com)] 10

11 Example of Backlinks Page A is a backlink of page B and page C, while page B and page C are backlinks of page D. A backlink is an incoming link (inlink); the number of outgoing links (outlinks) of a page is its OutDegree. 11

12 Example-1 PR(A) = 0.25 + 0.25 + 0.25 = 0.75 [Figure: pages A, B, C, D] 12

13 Example-2 PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.0833 = 0.4583 [Figure: pages A, B, C, D] 13
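The arithmetic in Example-1 and Example-2 can be checked with a few lines of Python. This is only a sketch: the function name `sum_contributions` is hypothetical, and the initial rank of 0.25 per page is inferred from the numbers on the slides (0.125 = 0.25/2). No damping factor is applied here, matching the two examples.

```python
def sum_contributions(backlinks):
    """Sum PR(v) / OutDegree(v) over the backlinks of one page (no damping)."""
    return sum(rank / out_degree for rank, out_degree in backlinks)

# Example-1: B, C and D each have rank 0.25 and a single outlink, to A.
example_1 = sum_contributions([(0.25, 1), (0.25, 1), (0.25, 1)])

# Example-2: B has 2 outlinks, C has 1, D has 3 (assumed initial rank 0.25 each).
example_2 = sum_contributions([(0.25, 2), (0.25, 1), (0.25, 3)])

print(round(example_1, 4), round(example_2, 4))
```

Each backlink passes on its own rank divided by its number of outlinks, so a page with many outlinks contributes less to each page it points to.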

14 Page Ranking A page will have a high page rank if:  There are many pages pointing to it.  There are some pages pointing to it which have high page ranks. In other words:  Pages well cited from around the web are worth looking at.  Pages that have only one citation, but from a highly ranked web page, are also worth looking at. 14

15 Damping Factor  The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the surfer will continue clicking is the damping factor d. 15

16 Damping Factor d The damping factor is subtracted from 1 and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward. 16

17 Computing PageRank The PageRank of a page u is computed as follows: PageRank(u) = (1 – d) + d × Σ PageRank(v) / OutDegree(v), summed over all pages v that link to u, where OutDegree(v) is the number of links going out of page v and the parameter d is the damping factor, a real number between 0 and 1. The value of d is generally taken as 0.85. 17
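A minimal Python sketch of a single PageRank update for one page, following the description above: (1 – d) plus d times the sum of PageRank(v)/OutDegree(v) over the pages v that link to the page. The function name `pagerank_update` and the dictionary layout are illustrative assumptions, not part of the original slides.

```python
def pagerank_update(page, ranks, in_links, out_degree, d=0.85):
    """Return the new rank of `page` from the current ranks of its backlinks."""
    incoming = sum(ranks[v] / out_degree[v] for v in in_links[page])
    return (1 - d) + d * incoming

# Hypothetical two-page network in which A and B link to each other.
ranks = {"A": 1.0, "B": 1.0}
in_links = {"A": ["B"], "B": ["A"]}
out_degree = {"A": 1, "B": 1}

new_a = pagerank_update("A", ranks, in_links, out_degree)
print(new_a)
```

A page with no backlinks gets an empty sum, so its rank falls back to the baseline (1 – d).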

18 PageRank Algorithm 18

19 Applied Example 19

20 A Simple Network of Pages (Ian Roger, 2006) Two pages A and B link to each other, so OutDegree(A) = 1 and OutDegree(B) = 1. We do not know what their PageRanks should be to begin with, so we take a guess of 1.0, assume d = 0.85, and perform the following calculations: PageRank(A) = (1 – d) + d (PageRank(B)/1) PageRank(B) = (1 – d) + d (PageRank(A)/1) PageRank(A) = 0.15 + 0.85 * 1 = 1 PageRank(B) = 0.15 + 0.85 * 1 = 1 The calculation returns a PageRank of 1 for both A and B, so the guess of 1.0 is already stable. 20

21 A Simple Network of Pages (Ian Roger, 2006) Now we plug in 0 as the guess and perform the calculations again: PageRank(A) = 0.15 + 0.85 * 0 = 0.15 PageRank(B) = 0.15 + 0.85 * 0.15 = 0.2775 We now have a new value for PageRank(B), so we use it to recalculate PageRank(A) and continue: PageRank(A) = 0.15 + 0.85 * 0.2775 = 0.3859 PageRank(B) = 0.15 + 0.85 * 0.3859 = 0.4780 21

22 Example-cont. Repeating the calculations, we get: PageRank(A) = 0.15 + 0.85 * 0.4780 = 0.5563 PageRank(B) = 0.15 + 0.85 * 0.5563 = 0.6229 If we repeat the calculations, eventually the PageRanks for both the pages converge to 1. 22
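The repeated calculation for the two-page network can be sketched in Python as below. The function name `iterate_two_pages` is hypothetical. Like the slides, it reuses the freshly computed PageRank(A) when computing PageRank(B) in the same round, and it assumes d = 0.85 with an initial guess of 0.

```python
def iterate_two_pages(initial=0.0, d=0.85, iterations=3):
    """Iterate PageRank for two pages A and B that link only to each other."""
    pr_a = pr_b = initial
    history = []
    for _ in range(iterations):
        pr_a = (1 - d) + d * pr_b  # A's only backlink is B (OutDegree 1)
        pr_b = (1 - d) + d * pr_a  # B uses the value of A just computed
        history.append((pr_a, pr_b))
    return history

first_rounds = iterate_two_pages(iterations=3)
many_rounds = iterate_two_pages(iterations=100)

for pr_a, pr_b in first_rounds:
    print(f"PageRank(A) = {pr_a:.4f}  PageRank(B) = {pr_b:.4f}")
print(f"after 100 rounds: {many_rounds[-1][0]:.4f} {many_rounds[-1][1]:.4f}")
```

Each round shrinks the distance to 1 by a factor of 0.85 × 0.85, which is why the values approach 1 whatever the initial guess.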

23 Rank Sink  Pages A and B both have rank, but they never pass any of it on to the rest of the web. [Figure: rank sink diagram] 23

24 Remarks on PageRank Algorithm  A page with no successors has no way to pass on its importance. Likewise, a group of pages that has no links out of the group will eventually collect all the importance of the Web. 24
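The second remark can be illustrated with a small Python sketch. It assumes a hypothetical three-page graph and, to isolate the effect, propagates rank without any damping (equivalent to d = 1), which is a simplification and not the damped formula used on the earlier slides. Pages A and B link only to each other, while C links into the group.

```python
def propagate(ranks, links):
    """One undamped round: every page splits its rank evenly over its outlinks."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks

# Hypothetical graph: A and B form a closed group, C points into it.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}

for _ in range(5):
    ranks = propagate(ranks, links)

print(ranks)
```

After the first round C holds no rank at all, and the total of 3.0 stays inside the group {A, B}. The (1 – d) term in the damped formula gives every page a baseline rank and so limits this effect.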

25 PageRank Toolbar 25

26 Sample Scores with Their Meaning 26

27 Toolbar PageRank and Corresponding Real PageRank 27

28 Activity  There is a link from page A to both B and C. There are also links from pages B and C to A.  Begin with an initial PageRank value of 0.  Complete 6 iterations. [Figure: pages A, B, C] 28

