Graph and Link Mining.


1 Graph and Link Mining

2 Graphs - Basics
A graph is a powerful abstraction for modeling entities and their pairwise relationships. G = (V, E), with a set of nodes V = {v1, ..., v5} and a set of edges E = {(v1, v2), ..., (v4, v5)}. Examples: social networks, Twitter followers, the Web, collaboration graphs.
[Figure: example graph on nodes v1, v2, v3, v4, v5]

3 Undirected Graphs
Undirected graph: the edges are unordered pairs; they can be traversed in either direction. Degree of a node: the number of edges incident on the node. Path: a sequence of edges leading from one node to another. Connected component: a set of nodes such that there is a path between any two nodes in the set.
[Figure: undirected graph on nodes v1, v2, v3, v4, v5]
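The definitions of degree and connected component can be sketched in a few lines of Python. The edge list below is hypothetical (the slide's figure is not recoverable from the transcript), and the names `adjacency` and `connected_components` are illustrative choices.

```python
from collections import deque

# Hypothetical undirected edge list on five nodes (not taken from the slides).
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v4", "v5")]

# Adjacency sets: each undirected edge is stored in both directions.
adjacency = {node: set() for node in nodes}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree of a node = number of edges incident on it.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}


def connected_components(adjacency):
    """Return the connected components as sorted node lists, found by BFS."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


components = connected_components(adjacency)
print(degree)
print(components)
```

With this edge list there are two components, one containing v1, v2, v3 and one containing v4, v5.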

4 Directed Graphs
Directed graph: edges are ordered pairs; they can be traversed only in the direction from the first node to the second. In-degree and out-degree of a node: the number of incoming and outgoing edges, respectively. Path: a sequence of directed edges leading from one node to another. Strongly connected component: a set of nodes such that there is a directed path between any two nodes in the set.
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
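A minimal sketch of in-degree, out-degree and strongly connected components follows. The directed edge list is hypothetical, chosen so that the graph has two components. Components are found by checking mutual reachability, which is simple but quadratic; linear-time algorithms such as Tarjan's or Kosaraju's would be the usual choice on large graphs.

```python
# Hypothetical directed edges (source, target); an assumption for illustration.
nodes = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1"),
         ("v3", "v4"), ("v4", "v5"), ("v5", "v4")]

successors = {node: [] for node in nodes}
in_degree = {node: 0 for node in nodes}
for source, target in edges:
    successors[source].append(target)
    in_degree[target] += 1
out_degree = {node: len(targets) for node, targets in successors.items()}


def reachable(successors, start):
    """Return every node that can be reached from start along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Two nodes share a strongly connected component when each reaches the other.
reach = {node: reachable(successors, node) for node in nodes}
components = []
assigned = set()
for u in nodes:
    if u in assigned:
        continue
    component = sorted(v for v in reach[u] if u in reach[v])
    components.append(component)
    assigned.update(component)

print(in_degree)
print(out_degree)
print(components)
```

Here v3 can reach v4, but v4 cannot reach v3, so the two groups stay separate components.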

5 Examples of Graphs we Might Mine
Airline route maps are useful: the information can tell you about both history and politics. Call detail records tell us about relationships between people. Who got in trouble about a decade ago for using this info? The Web is based on (hyper)links between documents. Social networks form graphs. Link analysis is the data mining technique that addresses relationships and connections.

6 6 Degrees of Separation
Claim: there are at most 6 degrees of separation between any two people. This is important in social networks: LinkedIn tells you how you connect to others, and your network expands with each link. Stanley Milgram wasn't the first to note the small-world effect, but he popularized it with a famous experiment: how close are two random people? He picked people in Omaha, Nebraska or Wichita, Kansas, and a target person in Boston. Each source person was asked to send a letter to the target person or, if they did not know the target, to send it to someone more likely to know them. The average path length was 5.5 or 6, but only 64 of 296 letters arrived (this is often not highlighted).

7 Examples of Applications
Identifying authoritative sources of information on the WWW by analyzing page links: Google and PageRank (we will come back to this). Understanding physician referral patterns. Analyzing telephone call patterns: MCI Friends and Family. You call Mary Smith, who is not on MCI, so MCI asks her to join. But your wife does not know Mary Smith! Oops! Far-fetched? Facebook does it all the time! Identifying fraud: in the past, someone would purchase several stolen calling cards and use them to call the same person. That is a clue.

8 Mining the graph structure
A graph is a combinatorial object with a certain structure. Mining the structure of the graph reveals information about the entities in the graph. E.g., if in the Facebook graph I find 100 people who are all linked to each other, these people are likely to be a community: the community discovery problem. By counting the number of friends each user has in the Facebook graph, I can find the most important nodes: the node importance problem.

9 Importance problem What are the most important nodes in the graph?
What are the most authoritative pages on the web? Who are the important users in Facebook? What are the most influential Twitter accounts?

10 Link Analysis
First generation search engines: viewed documents as flat text files; could not cope with size, spamming, or user needs. Second generation search engines: ranking becomes critical; shift from relevance to authoritativeness (authoritativeness: the static importance of the page); a success story for network analysis and a huge commercial success. It all started with two graduate students at Stanford. Everyone knows the company, right?

11 Link Analysis: Intuition
A link from page p to page q denotes endorsement: page p considers page q an authority on a subject. Use the graph of recommendations to assign an authority value to every page. The same idea applies to other graphs as well, e.g. the Twitter graph, where user p follows user q.

12 Constructing the graph
[Figure: a graph with an authority weight w attached to each node]
Goal: output an authority weight for each node, also known as centrality or importance.

13 Rank by Popularity
Rank pages according to the number of incoming edges (in-degree, degree centrality).
[Figure: Red, Yellow, Blue, Purple and Green pages; the weights shown are w=3, w=2, w=2, w=1, w=1]
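Ranking by in-degree can be sketched as follows. The pages A to E and the links between them are hypothetical, since the link structure of the slide's figure is not recoverable. They are chosen only so that the in-degrees come out as 3, 2, 2, 1, 1.

```python
from collections import Counter

# Hypothetical links (source page, target page); not the slide's actual figure.
links = [("B", "A"), ("C", "A"), ("D", "A"),
         ("A", "B"), ("E", "B"),
         ("A", "C"), ("D", "C"),
         ("B", "D"),
         ("C", "E")]

pages = sorted({page for link in links for page in link})

# In-degree of a page = number of links pointing to it.
in_degree = Counter(target for _, target in links)

# Highest in-degree first; ties are broken alphabetically.
ranking = sorted(pages, key=lambda page: (-in_degree[page], page))

for page in ranking:
    print(page, in_degree[page])
```

Every incoming link counts the same here, regardless of which page it comes from.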

14 Popularity
What matters is not only how many pages link to you, but also how important those pages are. Good authorities are pointed to by good authorities: a recursive definition of importance.

15 PageRank
Good authorities are pointed to by good authorities: the value of a page is determined by the value of the pages that link to it. How do we implement that? Each node distributes its authority value equally among the nodes it links to. The authority value of each node is the sum of the authority fractions it collects from the nodes that link to it. Solving the resulting system of equations gives the authority values for the nodes. For the three-node example, with the nodes numbered 1 to 3:
w1 + w2 + w3 = 1
w1 = w2 + w3
w2 = 1/2 w1
w3 = 1/2 w1
Solution: w1 = 1/2, w2 = 1/4, w3 = 1/4
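The distribution rule can be checked directly on the three-node example. This sketch numbers the nodes 1 to 3 and assumes the link structure implied by the equations: node 1 links to nodes 2 and 3, and each of those links back to node 1. The function name `redistribute` is illustrative. It applies one round of "each node splits its value equally among the nodes it links to" and confirms that the stated solution does not change.

```python
from fractions import Fraction

# Assumed structure, inferred from the equations: 1 -> 2, 1 -> 3, 2 -> 1, 3 -> 1.
out_links = {1: [2, 3], 2: [1], 3: [1]}


def redistribute(weights, out_links):
    """Each node splits its weight equally among the nodes it links to."""
    new_weights = {node: Fraction(0) for node in out_links}
    for node, targets in out_links.items():
        share = weights[node] / len(targets)
        for target in targets:
            new_weights[target] += share
    return new_weights


# The solution stated on the slide: 1/2, 1/4, 1/4.
weights = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}

# A solution of the equations is a fixed point of the redistribution step.
assert redistribute(weights, out_links) == weights
assert sum(weights.values()) == 1
print(weights)
```

Exact fractions are used so that the fixed-point check is an equality rather than a floating-point approximation.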

16 A More Complex Example
w1 = 1/3 w4 + 1/2 w5
w2 = 1/2 w1 + w3 + 1/3 w4
w3 = 1/2 w1 + 1/3 w4
w4 = 1/2 w5
w5 = w2
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
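One way to solve these five equations exactly is Gaussian elimination over fractions. The sketch below assumes the out-links implied by the coefficients (v1 links to v2 and v3; v2 to v5; v3 to v2; v4 to v1, v2 and v3; v5 to v1 and v4) and adds the normalization w1 + ... + w5 = 1, as in the three-node example. The variable names are illustrative.

```python
from fractions import Fraction

# Out-links read off the coefficients of the equations (an inferred edge list).
out_links = {
    "v1": ["v2", "v3"],
    "v2": ["v5"],
    "v3": ["v2"],
    "v4": ["v1", "v2", "v3"],
    "v5": ["v1", "v4"],
}
nodes = list(out_links)
n = len(nodes)
index = {node: i for i, node in enumerate(nodes)}

# Augmented matrix: row i encodes  w_i - sum_j w_j / outdeg(j) = 0
# over all nodes j that link to node i.
rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
for i in range(n):
    rows[i][i] = Fraction(1)
for source, targets in out_links.items():
    for target in targets:
        rows[index[target]][index[source]] -= Fraction(1, len(targets))

# The balance equations are linearly dependent, so the last one is
# replaced by the normalization  w_1 + ... + w_n = 1.
rows[-1] = [Fraction(1)] * (n + 1)

# Gauss-Jordan elimination with exact arithmetic.
for col in range(n):
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    pivot_value = rows[col][col]
    rows[col] = [x / pivot_value for x in rows[col]]
    for r in range(n):
        if r != col and rows[r][col] != 0:
            factor = rows[r][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]

weights = {node: rows[index[node]][n] for node in nodes}
for node, weight in weights.items():
    print(f"{node}: {weight}")
```

For this graph the solver gives w1 = 2/11, w2 = 3/11, w3 = 3/22, w4 = 3/22, w5 = 3/11, so v2 and v5 carry the most authority.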

17 Random Walks on Graphs
What we described is equivalent to a random walk on the graph. Random walk: start from a node chosen uniformly at random; pick one of its outgoing edges uniformly at random and follow it; repeat. Some nodes will be visited more often than others. Those are more important. Importance is based not only on the number of incoming links, but also on how often the predecessor nodes are visited. A value like Google's PageRank indicates how often a node would be visited.
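The random walk can be simulated directly. This sketch uses the five-node graph from slide 16, with the out-links inferred from that slide's equations. The function name, step count and fixed seed are arbitrary choices for illustration.

```python
import random
from collections import Counter

# Out-links inferred from the slide 16 equations (treat as an assumption).
out_links = {
    "v1": ["v2", "v3"],
    "v2": ["v5"],
    "v3": ["v2"],
    "v4": ["v1", "v2", "v3"],
    "v5": ["v1", "v4"],
}


def random_walk_frequencies(out_links, steps, seed=0):
    """Simulate the walk and return the fraction of steps spent at each node."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    current = rng.choice(sorted(out_links))  # uniform random start node
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        # Follow one of the outgoing edges, chosen uniformly at random.
        current = rng.choice(out_links[current])
    return {node: visits[node] / steps for node in out_links}


frequencies = random_walk_frequencies(out_links, steps=200_000)
for node, share in sorted(frequencies.items()):
    print(node, round(share, 3))
```

The sketch assumes every node has at least one outgoing edge; a node without out-links would stop the walk. Over a long run, v2 and v5 are visited most often.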

18 Random walks on graphs
Question: what is the probability of being at a specific node? p_i: probability of being at node i at this step. p'_i: probability of being at node i in the next step. After many steps the probabilities converge to the stationary distribution of the random walk.
p'1 = 1/3 p4 + 1/2 p5
p'2 = 1/2 p1 + p3 + 1/3 p4
p'3 = 1/2 p1 + 1/3 p4
p'4 = 1/2 p5
p'5 = p2
[Figure: directed graph on nodes v1, v2, v3, v4, v5]
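Applying the update equations repeatedly until the probabilities stop changing can be sketched as below. The out-links are the ones implied by the five equations. The uniform starting distribution, the tolerance and the function names are assumptions of this sketch.

```python
# Out-links implied by the update equations (an inferred edge list).
out_links = {
    "v1": ["v2", "v3"],
    "v2": ["v5"],
    "v3": ["v2"],
    "v4": ["v1", "v2", "v3"],
    "v5": ["v1", "v4"],
}


def step(p, out_links):
    """One step of the walk: each node passes its probability on equally."""
    nxt = {node: 0.0 for node in out_links}
    for node, targets in out_links.items():
        share = p[node] / len(targets)
        for target in targets:
            nxt[target] += share
    return nxt


def stationary_distribution(out_links, tolerance=1e-12, max_steps=10_000):
    """Iterate from the uniform distribution until the change is negligible."""
    p = {node: 1.0 / len(out_links) for node in out_links}
    for _ in range(max_steps):
        nxt = step(p, out_links)
        if max(abs(nxt[node] - p[node]) for node in p) < tolerance:
            return nxt
        p = nxt
    return p


stationary = stationary_distribution(out_links)
for node, probability in stationary.items():
    print(node, round(probability, 4))
```

Convergence is not guaranteed on every graph. The sketch assumes each node has an outgoing edge and that the walk is not periodic. On the three-node example of slide 15, for instance, the same iteration started from the uniform distribution alternates between two distributions instead of settling.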

19 How Does PageRank Work?
Arbitrarily initialize all pages to a PageRank of 1. Repeatedly recompute the value of each page from the pages that link to it. Eventually the values will converge. PageRank is what caused Google to succeed: prior to that only content mattered, not link structure.

20 Benefits of PageRank
It is not trivial to fool PageRank. You can create dummy pages that point to your page, but since no one is pointing to those pages, they will have low PageRank and will not help much. You can also make the dummy pages point to one another, but without being pointed to by an outside authority, the impact will be limited. Still, it is clear that Google must have many tweaks to catch cases like this (link spam or link farms).

21 Social Network Analysis
Social Network Analysis Overview (5 minutes). What is Social Network Analysis (4 minutes).

