1
Chapter 7 Web Structure Mining
L. Malak Bagais
2
Web Structure Mining
Deals mainly with discovering the model underlying the link structure of the web.
Deals with the topology of hyperlinks, with or without the description of the links.
3
Why?
The model can be used to classify web pages.
It helps to generate information such as the similarity and relationship between different websites.
It is useful for discovering website type.
4
Website type
Web structure mining is a suitable tool for discovering authority sites and overview sites for a subject.
Authority sites contain information about the subject.
Overview sites point to many authority sites.
5
Web Content Mining/ Web Structure Mining
Web Content Mining explores the structure within the document.
Web Structure Mining studies the citation relationships of documents within the web.
6
Algorithms for Web Structure Mining
PageRank algorithm (Google founders)
Looks at the number of links to a website and the importance of the referring links.
Computed before the user enters the query.
HITS algorithm (Hyperlink-Induced Topic Search)
The user receives two lists of pages for a query (authority pages and hub pages).
Computations are done after the user enters the query.
7
PageRank
8
PageRank Algorithm
The idea of the algorithm came from the academic citation literature.
It was developed in 1998 as part of the Google search engine prototype.
It studies the citation relationships of documents within the web.
The Google search engine ranks documents as a function of both the query terms and the hyperlink structure of the web.
9
Definition of PageRank
PageRank produces a ranking that is independent of the user's query. The importance of a web page is determined by the number of other important web pages that point to it and by the number of outlinks those pages have.
10
Artwork by Felipe Micaroni Lalli (micaroni@gmail.com).
11
Example of Backlinks
Page A is a backlink of page B and page C, while page B and page C are backlinks of page D.
12
Example-1
[Figure: pages A, B, C and D]
PR(A) = 0.75
13
Example-2
PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 = 0.125 + 0.25 + 0.0833 = 0.4583
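The arithmetic in Example-2 can be checked with a short sketch. It assumes, as the numbers imply, that B, C and D each start with a PageRank of 0.25 and have 2, 1 and 3 outgoing links respectively; the variable names are illustrative only.

```python
# Starting PageRank of the pages that link to A (assumed: 0.25 each).
initial_rank = {"B": 0.25, "C": 0.25, "D": 0.25}
# Outgoing links per page, read off the denominators in Example-2.
out_degree = {"B": 2, "C": 1, "D": 3}

# Each page passes an equal share of its rank along every outgoing link.
pr_a = sum(initial_rank[p] / out_degree[p] for p in initial_rank)
print(round(pr_a, 4))
```

The printed value should agree with the 0.4583 given in the example.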
14
Page Ranking
A page will have a high page rank if:
There are many pages pointing to it.
There are some pages pointing to it which have high page ranks.
In other words:
Pages well cited from around the web are worth looking at.
Pages that have only one citation, but from a highly rated web page, are also worth looking at.
15
Damping Factor
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking.
The probability, at any step, that the person will continue is the damping factor d.
16
Damping Factor d
The damping factor is subtracted from 1, and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores.
So any page's PageRank is derived in large part from the PageRanks of other pages.
The damping factor adjusts the derived value downward.
17
Computing PageRank
The PageRank of a page u is computed as follows:
$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{\mathrm{OutDegree}(v)}$$
where B(u) is the set of pages that link to u, OutDegree(v) is the number of links going out of page v, and d is the damping factor, a real number between 0 and 1. The value of d is generally taken as 0.85.
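A minimal sketch of this computation follows, assuming the graph is given as a dictionary that maps each page to the list of pages it links to, with no duplicate links. The function name `pagerank`, its parameters and the fixed iteration count are illustrative choices, not part of the original slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / OutDegree(v)) over backlinks v."""
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # initial guess of 1.0 for every page
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # Sum the shares contributed by every page v that links to u.
            incoming = sum(
                rank[v] / len(links[v])
                for v in links
                if u in links[v]
            )
            new_rank[u] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Two pages that link only to each other.
two_pages = pagerank({"A": ["B"], "B": ["A"]})
print(two_pages)
```

All ranks are updated from the previous iteration's values here; updating them one at a time, with each new value used immediately, is an equally common choice.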
18
PageRank Algorithm
19
Applied Example
20
A Simple Network of Pages (Ian Roger, 2006)
OutDegree(A) = 1 and OutDegree(B) = 1. We do not know what their PageRanks should be to begin with, so we take a guess of 1.0 and, assuming d = 0.85, perform the following calculations:
PageRank(A) = (1 – d) + d (PageRank(B)/1)
PageRank(B) = (1 – d) + d (PageRank(A)/1)
PageRank(A) = 0.15 + 0.85 * 1 = 1
PageRank(B) = 0.15 + 0.85 * 1 = 1
We calculated that the PageRank of A and B is 1.
21
A Simple Network of Pages (Ian Roger, 2006)
Now, we plug in 0 as the guess and perform the calculations again:
PageRank(A) = 0.15 + 0.85 * 0 = 0.15
PageRank(B) = 0.15 + 0.85 * 0.15 = 0.2775
We now have another guess for PageRank(A), so we use it to calculate PageRank(B) and continue:
PageRank(A) = 0.15 + 0.85 * 0.2775 = 0.385875
PageRank(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
22
Example-cont.
Repeating the calculations, we get:
PageRank(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
PageRank(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
If we repeat the calculations, eventually the PageRanks of both pages converge to 1.
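The repeated calculation for the two-page network can be sketched as a loop. It assumes d = 0.85, an initial guess of 0 for both pages, and that each newly computed value is used immediately in the next line, as in the worked example; the variable names and the choice of 100 passes are illustrative.

```python
d = 0.85
pr_a, pr_b = 0.0, 0.0  # initial guess of 0 for both pages
history = []

for _ in range(100):
    pr_a = (1 - d) + d * pr_b  # A's only backlink is B, and OutDegree(B) = 1
    pr_b = (1 - d) + d * pr_a  # uses the freshly updated PageRank(A)
    history.append((pr_a, pr_b))

print(history[0])   # values after the first pass
print(history[-1])  # values after 100 passes
```

The first pass gives roughly 0.15 and 0.2775, and after enough passes both values settle very close to 1.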
23
Rank Sink
A and B both have rank, but they will never circulate any rank.
24
Remarks on PageRank Algorithm:
A page with no successors has no scope to send its importance. Likewise, a group of pages that have no links out of the group will eventually collect all the importance of the Web.
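The second remark can be illustrated with a small sketch. It uses a hypothetical three-page graph in which A and B link only to each other and C links into the group, and it sets d = 1 so that rank moves only along links. Both the graph and the choice of d are assumptions made for the illustration.

```python
# Hypothetical graph: A and B form a closed group, C links into it.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
rank = {page: 1 / 3 for page in links}  # equal starting rank, total of 1

for _ in range(10):
    # With d = 1, a page's new rank is just the sum of the shares it receives.
    rank = {
        u: sum(rank[v] / len(links[v]) for v in links if u in links[v])
        for u in links
    }

print(rank)
```

Nothing links to C, so its rank drops to 0 after the first pass, and all of the rank ends up inside the A and B group.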
25
PageRank Toolbar
26
Sample Scores with Their Meaning
27
Toolbar PageRank and Corresponding Real PageRank
28
Activity
There is a link from page A to both B and C. There is also a link from each of pages B and C to A.
Begin with an initial PageRank value of 0 and complete 6 iterations.
29
HITS Algorithm
30
HITS Algorithm
Hyperlink-Induced Topic Search
Algorithm developed by Kleinberg in 1998.
Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant sub-graph of the web.
Based on mutually recursive facts:
Hubs point to lots of authorities.
Authorities are pointed to by lots of hubs.
31
Authorities and Hubs
Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
In-degree (the number of pointers to a page) is one simple measure of authority.
However, in-degree treats all links as equal.
Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).
32
Introduction HITS
33
HITS Algorithm
34
HITS Algorithm
Step-1: retrieve the set of results to the search query.
An authority value is computed as the sum of the scaled hub values of the pages that point to that page.
A hub value is the sum of the scaled authority values of the pages it points to.
The algorithm performs a series of iterations, each consisting of two basic steps:
Authority update
Hub update
35
Authority update
Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it.
That is, a node is given a high authority score by being linked to by pages that are recognized as Hubs for information.
36
Hub update
Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to.
That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.
37
Authority/Hub Score
1. Start with each node having a hub score and an authority score of 1.
2. Run the Authority Update Rule.
3. Run the Hub Update Rule.
4. Normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores.
5. Repeat from the second step as necessary.
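The steps above can be sketched as follows, assuming the query's result graph is given as a dictionary that maps each page to the list of pages it points to and contains at least one link. The function name `hits`, the fixed iteration count and the example page names are illustrative, not taken from the slides.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to p.
        auth = {
            p: sum(hub[q] for q in links if p in links[q])
            for p in pages
        }
        # Hub update: sum the authority scores of the pages p points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize each score vector by its Euclidean length.
        auth_norm = math.sqrt(sum(v * v for v in auth.values()))
        hub_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1 points to a1 and a2, h2 points to a1 only.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(hub)
print(auth)
```

In this small graph a1 receives the higher authority score because both hubs point to it, and h1 receives the higher hub score because it points to both authorities.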
38
Difference from PageRank
HITS is executed at query time, not at indexing time; thus, the hub and authority scores assigned to a page are query-specific.
It computes two scores per document, hub and authority, as opposed to a single score.
It is processed on a small subset of ‘relevant’ documents, not on all documents as is the case with PageRank.
39
HITS Algorithm
Focuses on broad-topic queries that are likely to be answered with too many pages.
The more a page is pointed to by other pages, the more popular the page.
Popular pages are more likely to include relevant information than non-popular pages.
40
Authority and Hubness
[Figure: pages 2, 3 and 4 point to page 1; page 1 points to pages 5, 6 and 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
41
Numerical Example
42
[Figure: initial hub vector, authority vector and updated hub vector]
43
Comparison of PageRank and HITS
PageRank: computed for all web pages stored in the database, prior to the query. HITS: performed on the set of retrieved web pages, for each query.
PageRank: computes authorities only. HITS: computes authorities and hubs.
PageRank: non-trivial to compute. HITS: easy to compute, but real-time execution is hard.
PageRank: hard to spam. HITS: relatively easy to spam.
44
Index Quality
45
Search Engines
Henzinger and colleagues (1999) argue that the quality of pages in a search engine’s index is one of the important measures of search engine effectiveness.
46
Sampling Web Pages
One approach to sampling web pages approximately uniformly at random is based on the idea of a random walk.
In a random walk, a page is visited with probability roughly proportional to its PageRank value.
47
Experiment
Two long random walks were performed.
First walk (Walk1): 18 hours; the crawler downloaded 2,867,466 pages.
Second walk (Walk2): 54 hours; the crawler downloaded 6,219,704 pages.
48
Measure Index Quality
1. Choose a sample of pages by random walk, so that pages are sampled in proportion to the PageRank distribution.
2. Check whether each sampled page is in the search engine index S.
3. Estimate the quality of S as the percentage of sampled pages that are in S.
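The final estimate is a simple proportion, sketched below. The sample is assumed to have been drawn already by the random walk; the function name `index_quality` and the example.com URLs are hypothetical placeholders.

```python
def index_quality(sampled_pages, index):
    """Percentage of sampled pages that are present in the index."""
    if not sampled_pages:
        raise ValueError("the sample must contain at least one page")
    found = sum(1 for page in sampled_pages if page in index)
    return 100.0 * found / len(sampled_pages)


# Hypothetical sample of four pages and a search engine index holding three of them.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
]
index_s = {"http://example.com/a", "http://example.com/b", "http://example.com/d"}
quality = index_quality(sample, index_s)
print(quality)
```

Because the sample is biased toward pages with high PageRank, an index that covers more of those pages scores higher on this measure.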
49
Index Quality for Different Search Engines
(Copyright 1999 Hewlett-Packard Development Company, L. P. Reproduced with permission)
50
Index Quality Per Page for Different Search Engines
(Copyright 1999 Hewlett-Packard Development Company, L. P. Reproduced with permission)
51
Most Frequently Visited Pages
52
Most Frequently Visited Hosts
53
Social Network Analysis
54
Social Networks
“A social network is a social structure made up of individuals (or organizations) called "nodes", which are tied (connected) by one or more specific types of interdependency, such as friendship, kinship, or common interest” (Wikipedia)
55
Social Network Analysis
Social network analysis (SNA) views social relationships in terms of network theory, consisting of nodes and ties (also called edges, links, or connections).
Nodes are the individual actors within the networks, and ties are the relationships between the actors.
56
SNA
Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities.
57
SNA
SNA provides both a visual and a mathematical analysis of human relationships.
Management consultants use this methodology with their business clients and call it Organizational Network Analysis [ONA].
58
SNA
To understand networks and their participants, we evaluate the location of actors in the network. Measuring the network location is finding the centrality of a node.
These measures give us insight into the various roles and groupings in a network -- who are the connectors, experts, leaders, bridges, isolates, where are the clusters and who is in them, who is in the core of the network, and who is on the periphery?
59
Source: http://www.orgnet.com/sna.html
60
1. Degree centrality
Social network researchers measure network activity for a node by using the concept of degrees.
Degree: the number of direct connections a node has.
Diane has the most direct connections in the network, making hers the most active node in the network. She is a 'connector' in this network.
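Degree can be counted directly from a list of ties. The sketch below uses a small hypothetical network, not the one from the slide's figure, and treats every tie as undirected.

```python
from collections import defaultdict

# Hypothetical undirected ties between actors.
edges = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
    ("Dan", "Eve"),
]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1  # each tie adds one direct connection to both endpoints
    degree[v] += 1

most_connected = max(degree, key=degree.get)
print(dict(degree))
print(most_connected)
```

Ann has three direct connections, more than anyone else in this toy network, so she plays the 'connector' role here.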
61
2. Betweenness centrality
While Diane has many direct ties, Heather has few direct connections -- fewer than the average in the network. However, she has one of the best locations in the network -- she is between two important constituencies. She plays a 'broker' role in the network. A node with high betweenness has great influence over what flows -- and does not -- in the network. Heather may control the outcomes in a network.
62
3. Closeness centrality
Fernando and Garth have fewer connections than Diane, yet the pattern of their direct and indirect ties allows them to access all the nodes in the network more quickly than anyone else. They have the shortest paths to all others -- they are close to everyone else.
They are in an excellent position to monitor the information flow in the network -- they have the best visibility into what is happening in the network.
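One common way to quantify closeness is the number of other nodes divided by the sum of shortest-path distances to them. The sketch below uses that definition with a breadth-first search on a hypothetical connected, undirected network; the definition and the graph are assumptions, since the slides do not give a formula.

```python
from collections import deque


def closeness(graph, start):
    """(number of reachable nodes) / (sum of shortest-path distances to them)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


# Hypothetical path-shaped network: A - B - C - D - E.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
scores = {node: closeness(graph, node) for node in graph}
print(scores)
```

The middle node C gets the highest score because its paths to everyone else are the shortest, even though it has no more direct ties than B or D.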
63
Web Structure Mining & SNA
Applying social network analysis to study and model the link structure of the web.
64
Measuring link weight
Transverse Links
Between pages with different domain names.
Intrinsic Links
Between pages with the same domain name.
Kleinberg suggested deleting intrinsic links from the graph and keeping only transverse links when computing hub and authority scores.
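A minimal sketch of separating the two link types follows. It assumes that "domain name" can be approximated by the host part of the URL, which is a simplification, and it uses hypothetical example URLs; `is_transverse` is an illustrative helper name.

```python
from urllib.parse import urlparse


def is_transverse(source_url, target_url):
    """True when the two pages sit on different hosts (assumed proxy for domain name)."""
    return urlparse(source_url).hostname != urlparse(target_url).hostname


# Hypothetical links as (source, target) pairs.
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # intrinsic
    ("http://a.example/index.html", "http://b.example/"),            # transverse
]

transverse_links = [link for link in links if is_transverse(*link)]
print(transverse_links)
```

Only the link between the two different hosts is kept; the navigation link inside a.example is dropped before any scores are computed.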