Network Mining, by Jerry Scripps


1 Network Mining
Jerry Scripps

2 Overview
- What is network mining?
- Motivation
- Preliminaries
  - definitions
  - metrics
  - network types
- Network mining techniques

3 What is Network Mining?
[Diagram: Network Mining shown among related fields: Statistics, Graph Theory, Social Network Analysis, Machine Learning, Data Mining, Computer Science, Mathematics, Pattern Recognition]

4 What is Network Mining?
Border Disciplines
- Statistics
- Computer Science
- Physics
- Math
- Psychology
- Law Enforcement
- Sociology
- Military
- Biology
- Medicine
- Chemistry
- Business

5 What is Network Mining?
Examples:
- Discovering communities within collaboration networks
- Finding authoritative web pages on a given topic
- Selecting the most influential people in a social network

6 Network Mining – Motivation
Emerging Data Sets
- World wide web
- Social networking
- Collaboration databases
- Customer or Employee sets
- Genomic data
- Terrorist sets
- Supply Chains
- Many more…

7 Network Mining – Motivation
Direct Applications
- What is the community around msu.edu?
- What are the authoritative pages?
- Who has the most influence?
- Who is a likely member of a terrorist cell?
- Is this a news story about crime, politics or business?

8 Network Mining – Motivation
Indirect Applications
- Convert ordinary data sets into networks
- Integrate network mining techniques into other techniques

9 Preliminaries
- Definitions
- Metrics
- Network Types

10 Definitions
- Node (vertex, point, object)
- Link (edge, arc)
- Community

11 Metrics
Node
- Degree
- Closeness
- Betweenness
- Clustering coefficient
Node Pair
- Graph distance
- Min-cut
- Common neighbors
- Jaccard's coefficient
- Adamic/Adar
- Preferential attachment
- Katz
- Hitting time
- Rooted PageRank
- SimRank
- Bibliographic metrics
Network
- Characteristic path length
- Clustering coefficient
- Min-cut
- Diameter
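As a rough illustration of two of the node metrics listed above, the sketch below computes degree and the local clustering coefficient on a small undirected graph. The toy graph and the function names are assumptions made for this example; they do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy undirected graph stored as an adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def degree(g, v):
    """Number of links incident to node v."""
    return len(g[v])

def clustering_coefficient(g, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    neighbors = g[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to check
    linked_pairs = sum(1 for x, y in combinations(neighbors, 2) if y in g[x])
    return linked_pairs / (k * (k - 1) / 2)

for node in sorted(graph):
    print(node, degree(graph, node), clustering_coefficient(graph, node))
```

Node "a" has three neighbors but only one linked neighbor pair (b, c), so its coefficient is lower than that of "b" or "c", whose two neighbors are linked to each other.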

12 Network Types – Random

13 Network Types – Small World
[Figure: regular, small-world and random networks (Watts & Strogatz)]

14 Networks – Scale-free
Barabasi & Bonabeau
- Degree distribution follows a power law, P(k) ~ 1/k^n
- Can be found in a wide variety of real-world networks

15 Network recap
Network Type | Clustering coefficient | Characteristic path length | Power Law
Random       | Low                    | Low                        | No
Regular      | High                   | High                       | No
Small world  | High                   | Low                        | ?
Scale-free   | ?                      | ?                          | Yes

16 Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

17 Link-Based Classification
- Include features from linked objects: building a single model on all features
- Fusion of link and attribute models

18 Link-Based Classification
Chakrabarti, et al.
- Copying data from neighboring web pages actually reduced accuracy
- Using the label from a neighboring page improved accuracy
[Figure: linked web pages with binary feature vectors and class labels A and B; one page's label is unknown]

19 Link-Based Classification
Lu & Getoor
- Define vectors for attributes and links
  - Attribute data OA(X)
  - Link data LD(X) constructed using
    - mode (single feature – class of plurality)
    - count (feature for each class – count for neighbors)
    - binary (feature for each class – 0/1 if exists)
[Figure: attribute features OA and link features LD (example count vector 2 1 0 …, binary vector 1 1 0 …) feeding either a single model or two separate models]
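The three link-feature constructions named above (mode, count, binary) can be sketched as follows. This is only an illustration: the class list, the function name and the tie-breaking rule for the mode are assumptions, not details taken from Lu & Getoor.

```python
from collections import Counter

CLASSES = ["A", "B", "C"]  # hypothetical label set for the example

def link_features(neighbor_labels, classes=CLASSES):
    """Build mode, count and binary link features from neighbor labels."""
    counts = Counter(neighbor_labels)
    # count: one feature per class, the number of neighbors with that class
    count_vec = [counts.get(c, 0) for c in classes]
    # binary: one feature per class, 1 if any neighbor has that class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]
    # mode: the plurality class (assumption: ties go to the earlier class)
    mode = max(classes, key=lambda c: counts.get(c, 0)) if neighbor_labels else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

# A node whose neighbors are labeled A, A and B
print(link_features(["A", "A", "B"]))
```

For neighbors labeled A, A and B this yields the count vector 2 1 0 and the binary vector 1 1 0, matching the example values on the slide.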

20 Link-Based Classification
Lu & Getoor
- Define probabilities for both
  - Attribute
  - Link
- Class estimation:

21 Collective Classification
- Uses both attributes and links
- Iteratively update the unlabeled instances
  - message passing, loopy belief nets, etc.
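A much-simplified sketch of the iterative idea is shown below. It is not message passing or loopy belief propagation: it assumes an attribute-only model has already produced an initial guess for each unlabeled node, then repeatedly relabels each unlabeled node with the plurality label of its neighbors. The graph, labels and function name are hypothetical.

```python
from collections import Counter

def iterative_classification(graph, known, initial_guess, max_iters=10):
    """Relabel unlabeled nodes by neighbor plurality until nothing changes."""
    labels = dict(initial_guess)  # assumed output of an attribute-only model
    labels.update(known)          # known labels are never changed
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if node in known:
                continue
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if not votes:
                continue
            # Plurality vote; ties go to the alphabetically first label.
            best = max(sorted(votes), key=lambda c: votes[c])
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy network: c and e are unlabeled.
graph = {
    "a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
known = {"a": "A", "b": "A", "d": "B", "f": "B"}
guess = {"c": "B", "e": "A"}
print(iterative_classification(graph, known, guess))
```

In this toy run both initial guesses are overturned by the neighbors' labels: c becomes A and e becomes B.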

22 Link-Based Classification Summary
- Using class of neighbors improves accuracy
- Using separate models for attribute and link data further improves accuracy
- Other considerations:
  - improvements are possible by using community information
  - knowledge of network type could also benefit classifier

23 Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

24 Link Prediction

25 Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
- Graph distance
- Common neighbors
- Jaccard's coefficient
- Adamic/Adar
- Preferential attachment
- Katz
- Hitting time
- Rooted PageRank
- SimRank
Metric groups: neighborhood, ensemble of paths
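Four of the neighborhood-based metrics in this list have short closed forms. The sketch below scores a candidate node pair with each of them on a toy undirected graph; the graph and function names are assumptions for illustration only.

```python
import math

# Hypothetical toy undirected graph stored as an adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by all neighbors of x or y."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors weighted by 1 / log(degree), favoring rare ones."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    """Product of the two node degrees."""
    return len(g[x]) * len(g[y])

# Score the unlinked pair (b, d), whose only shared neighbor is a.
for metric in (common_neighbors, jaccard, adamic_adar, preferential_attachment):
    print(metric.__name__, metric(graph, "b", "d"))
```

Higher scores suggest a more likely future link; in practice every unlinked pair would be scored and the pairs ranked.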

26 Link Prediction - results

27 Link Prediction – newer methods
- maximum likelihood
- stochastic block model
- probabilistic

28 Link Prediction – summary
- There is room for growth – best predictor has accuracy of only around 9%
- Predicting collaborations is difficult
- New problem could be to predict the direction of the link

29 Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding
- Link Completion

30 Ranking

31 Ranking – Markov Chain Based
- Random-surfer analogy
- Problem with cycles
- PageRank uses random vector
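The random-surfer idea can be sketched with a plain power iteration. In this minimal version the "random vector" is taken to be a uniform jump to any page, with the commonly used damping factor of 0.85; both choices, along with the toy graph and function name, are assumptions rather than details from the slides.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration over a dict mapping each page to its out-links.

    Assumes every link target also appears as a key in `links`.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Random jump: the surfer lands on any page with equal probability.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical toy web: a and b both link to c, and c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The random jump is what keeps the iteration from getting trapped in the a-c cycle: without it, page b would end with no rank at all and the remaining rank would oscillate between a and c.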

32 Ranking – summary
- Other methods such as HITS and SALSA are also based on Markov chains
- Ranking has been applied in other areas:
  - text summarization
  - anomaly detection

33 Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

34 Influence

35 Influence Maximization
- Problem: find the best nodes to activate
- Approaches:
  - degree – fast but not effective
  - greedy – effective but slow
  - improvements to greedy: degree heuristics and Shapley value
  - use communities
  - cost-benefit – probabilistic approach

36 Maximizing influence model-based
- Problem – finding the k best nodes to activate to maximize the number of nodes activated
- Models:
  - independent cascade – when activated, a node has a one-time chance to activate each neighbor j with probability p_ij
  - linear threshold – a node becomes activated when the percentage of its neighbors that are active crosses a threshold
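Both diffusion models are short enough to simulate directly. The sketch below is one possible reading of the two definitions, run on a toy undirected path graph; the function names, the probability table and the thresholds are all hypothetical.

```python
import random

def independent_cascade(graph, prob, seeds, rng):
    """Each newly activated node gets one chance to activate each neighbor."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                # prob[(u, v)] plays the role of p_ij for the link u -> v
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def linear_threshold(graph, thresholds, seeds):
    """A node activates once its fraction of active neighbors reaches its threshold."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in sorted(graph):
            if v in active or not graph[v]:
                continue
            fraction = sum(1 for u in graph[v] if u in active) / len(graph[v])
            if fraction >= thresholds[v]:
                active.add(v)
                changed = True
    return active

# Hypothetical toy path graph a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
prob = {(u, v): 0.5 for u in graph for v in graph[u]}
print(independent_cascade(graph, prob, ["a"], random.Random(0)))
print(linear_threshold(graph, {v: 0.5 for v in graph}, ["a"]))
```

The independent cascade is random, so its spread is normally averaged over many runs; the linear threshold run is deterministic once the thresholds are fixed.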

37 Maximizing influence model-based
- Models: independent cascade & linear threshold
- A function f: S → S* can be created using either model
- Functions use a Monte Carlo, hill-climbing solution
- The functions are submodular: the gain from adding a node to a set S is at least the gain from adding it to any set T with S ⊆ T. Finding the optimal set is proven in another work to be NP-hard, but a hill-climbing solution gets to within 1 − 1/e of the optimum.
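A minimal sketch of greedy hill-climbing with Monte Carlo spread estimates is given below, assuming the independent cascade model with a single uniform activation probability p. The names, the number of runs and the toy graph are illustrative assumptions, and the code makes no attempt at the speed-ups mentioned elsewhere in the talk.

```python
import random

def simulate_cascade(graph, p, seeds, rng):
    """One independent-cascade run with uniform activation probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph[u]:
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def expected_spread(graph, p, seeds, runs, rng):
    """Monte Carlo estimate of the number of nodes a seed set activates."""
    return sum(len(simulate_cascade(graph, p, seeds, rng)) for _ in range(runs)) / runs

def greedy_seeds(graph, p, k, runs=200, seed=0):
    """Hill-climbing: repeatedly add the node with the best estimated spread."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(min(k, len(graph))):
        best, best_spread = None, -1.0
        for v in sorted(graph):
            if v in chosen:
                continue
            spread = expected_spread(graph, p, chosen + [v], runs, rng)
            if spread > best_spread:
                best, best_spread = v, spread
        chosen.append(best)
    return chosen

# Hypothetical toy network: a star around h plus a separate pair x - y.
graph = {
    "h": ["l1", "l2", "l3"], "l1": ["h"], "l2": ["h"], "l3": ["h"],
    "x": ["y"], "y": ["x"],
}
print(greedy_seeds(graph, p=0.3, k=2))
```

Each greedy step re-estimates the spread for every remaining candidate, which is why the approach is effective but slow on large networks.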

38 Maximizing influence – cost/benefit
- Assumptions:
  - product x sells for $100
  - a discount of 10% can be offered to various prospective customers
- If the customer purchases, the profit is:
  - 90 if discount is offered
  - 100 if discount is not offered
- Expected lift in profit (ELP) from offering discount is:
  - 90*P(buy|discount) - 100*P(buy|no discount)
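The ELP formula is simple arithmetic, sketched below with made-up purchase probabilities. The function name and the probabilities are assumptions; only the 90 and 100 profit figures come from the slide.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            profit_discount=90.0, profit_no_discount=100.0):
    """ELP = 90 * P(buy | discount) - 100 * P(buy | no discount)."""
    return profit_discount * p_buy_discount - profit_no_discount * p_buy_no_discount

# Hypothetical customer 1: the discount raises the purchase probability a lot.
print(expected_lift_in_profit(0.50, 0.40))

# Hypothetical customer 2: the discount does not change the purchase probability.
print(expected_lift_in_profit(0.25, 0.25))
```

A positive ELP suggests the discount is worth offering to that customer; a negative one means the discount only gives away margin.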

39 Maximizing influence – cost/benefit
- Goal is to find M that maximizes global ELP
- Three approximations used:
  - single pass
  - greedy
  - hill-climbing
- X_i is the decision of customer i to buy
- Y is the vector of product attributes
- M is the vector of marketing decisions
- f is a function to set the i-th element of M
- r_0 and r_1 are revenue gained
- c is the cost of marketing

40 Comparison of approaches
                     | Cost/benefit                 | Model-based
Size of starting set | variable, based on max. lift | fixed
Uses attributes      | yes                          | no
Probabilities        | extracted from data set      | assigned to links
- An extension would be to spread influence to the largest number of communities
- Improvements can be made in speed

41 Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

42 Communities

43 Gibson, Kleinberg and Raghavan
- Query a search engine to obtain the root set
- Base set: add forward and back links
- Use HITS to find top 10 hubs and authorities
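The HITS step can be sketched as a mutual-reinforcement loop: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The toy base set, the iteration count and the function name below are assumptions for illustration.

```python
import math

def hits(links, iters=50):
    """Return (hub, authority) scores for a dict mapping page -> out-links."""
    nodes = sorted(set(links) | {w for out in links.values() for w in out})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores of the pages that link here.
        auth = {v: 0.0 for v in nodes}
        for u, out in links.items():
            for w in out:
                auth[w] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # Hub: sum of authority scores of the pages linked to.
        hub = {v: sum(auth[w] for w in links.get(v, ())) for v in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {v: h / norm for v, h in hub.items()}
    return hub, auth

# Hypothetical base set: two hub pages pointing at two candidate authorities.
base_set = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(base_set)
print(sorted(auth, key=auth.get, reverse=True)[:2])
print(sorted(hub, key=hub.get, reverse=True)[:2])
```

Sorting by score and keeping the top entries corresponds to selecting the top hubs and authorities as on the slide.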

44 Flake, Lawrence and Giles
- Uses Min-cut
- Start with seed set
- Add linked nodes
- Find nodes from outgoing links
- Create virtual source node
- Add virtual sink linking it to all nodes
- Find the min-cut of the virtual source and sink
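The final min-cut step can be illustrated with a small max-flow routine (Edmonds-Karp style, using breadth-first augmenting paths). This is a simplified sketch, not the authors' procedure: the toy graph, the unit capacities and the way the virtual source "s" and sink "t" are attached (source to one seed, sink to one distant node) are all assumptions. The community is read off as the set of nodes still reachable from the source once no augmenting path remains.

```python
from collections import deque

def min_cut_source_side(capacity, source, sink):
    """Return the nodes on the source side of a minimum source-sink cut."""
    # Residual capacities, with zero-capacity reverse edges added.
    residual = {source: {}, sink: {}}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    def reachable_parents():
        # Breadth-first search along edges with spare residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_parents()
        if sink not in parents:
            return set(parents)  # no augmenting path left: this is the cut
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u

# Hypothetical toy graph: triangles a-b-c and x-y-z joined by the bridge c-x.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
         ("x", "y"), ("y", "z"), ("x", "z")]
capacity = {}
for u, v in edges:
    capacity.setdefault(u, {})[v] = 1
    capacity.setdefault(v, {})[u] = 1
capacity["s"] = {"a": 10}   # virtual source attached to the seed node a
capacity["z"]["t"] = 10     # virtual sink attached to a distant node z
print(sorted(min_cut_source_side(capacity, "s", "t") - {"s"}))
```

Here the cheapest cut is the single bridge c-x, so the community found around seed a is the first triangle.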

45 Community Finding
- Girvan and Newman – divisive, repeatedly removes the links with the highest betweenness
- Clauset, et al. – agglomerative, uses modularity
- Shi & Malik – spectral clustering
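Modularity, the quality score behind the agglomerative approach above, compares the fraction of links inside each community with the fraction expected if links were placed at random with the same node degrees. The sketch below computes it for a given partition of an unweighted, undirected graph; the toy graph and function name are assumptions, and the agglomerative merging itself is not shown.

```python
def modularity(edges, community_of):
    """Q = sum over communities of (internal links / m) - (degree sum / 2m)^2."""
    m = len(edges)
    internal = {}    # links with both ends in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Hypothetical toy graph: two triangles joined by a single bridge link.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
         ("x", "y"), ("y", "z"), ("x", "z")]
two_groups = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
one_group = {node: 1 for node in two_groups}
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the graph at the bridge scores higher than lumping everything into one community, which scores zero; an agglomerative method keeps merging communities only while such a score improves.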

46 Communities - summary
- There are many options for building communities around a small group of nodes
- Possible future directions
  - finding communities in networks having different link types
  - impact of network type on community finding techniques

