CS 599: Social Media Analysis, University of Southern California. Social Ties and Link Prediction. Kristina Lerman, University of Southern California.

Link Prediction Does network structure contain enough information to predict what new links will form in the future? Will nodes 33 and 28 become friends in the future? What about nodes 27 and 4?

Who to follow

Strength of social ties (review) Strong ties –surrounded by many mutual friends –characterized by lots of shared time together Weak ties –have few mutual friends –serve as bridges to diverse parts of the network –provide access to novel information

The Link-Prediction Problem for Social Networks (Liben-Nowell & Kleinberg) To what extent can the evolution of a social network be modeled using features intrinsic to the network itself? Formalize the link prediction problem –Given a snapshot of a network, infer which new interactions between nodes are likely to occur in the future Propose link prediction heuristics based on measures for analyzing the “proximity” of nodes in a network. Evaluate link prediction heuristics on large coauthorship networks. Future coauthorships can be extracted from network topology.

The intuition In many networks, people who are “close” belong to the same social circles and will inevitably encounter one another and become linked themselves. Link prediction heuristics measure how “close” people are. [Figure: two example networks with nodes x and y; in the first the red nodes are close to each other, in the second they are more distant.]

Link prediction heuristics Local: common neighbors (CN), Jaccard (JC), Adamic-Adar (AA), preferential attachment (PA), … Global: Katz score, hitting time, PageRank, …

Local link prediction heuristics –Common neighbors (CN): neighborhood overlap –Jaccard (JC) –Adamic-Adar (AA) –Preferential attachment (PA)

Local link prediction heuristics –Common neighbors (CN) –Jaccard (JC): fraction of common neighbors –Adamic-Adar (AA) –Preferential attachment (PA)

Local link prediction heuristics –Common neighbors (CN) –Jaccard (JC) –Adamic-Adar (AA): number of common neighbors, with each common neighbor z down-weighted by the log of its degree –Preferential attachment (PA)

Local link prediction heuristics –Common neighbors (CN) –Jaccard (JC) –Adamic-Adar (AA) –Preferential attachment (PA): better-connected nodes are more likely to form new links
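
The four local heuristics above can be sketched in a few lines of Python. This is a minimal illustration, assuming an undirected graph stored as a dict of neighbor sets; the toy graph and the function names are hypothetical and do not come from the slides.

```python
import math

# Hypothetical toy graph (undirected), stored as node -> set of neighbors.
graph = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
    "a": {"x"},
    "b": {"x", "y", "c"},
    "c": {"x", "y", "b"},
    "d": {"y"},
}

def common_neighbors(g, x, y):
    """CN: size of the neighborhood overlap."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """JC: common neighbors as a fraction of all neighbors of x or y."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """AA: each common neighbor z counts 1 / log(degree of z).

    A common neighbor is linked to both x and y, so its degree is at
    least 2 and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def preferential_attachment(g, x, y):
    """PA: product of the two degrees."""
    return len(g[x]) * len(g[y])

print(common_neighbors(graph, "x", "y"))         # 2
print(jaccard(graph, "x", "y"))                  # 0.5
print(preferential_attachment(graph, "x", "y"))  # 9
```

In this toy graph x and y share neighbors b and c, both of degree 3, so the Adamic-Adar score is 2 / log 3.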

Global link prediction heuristics –Katz score: counts the paths between two nodes, with each path down-weighted according to its length –Hitting time: expected time for a random walk starting at x to reach y –…
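
As a rough sketch of the Katz score, the snippet below sums walks of length 1 up to a cutoff, weighting walks of length l by beta**l. The damping factor `beta`, the cutoff `max_len`, and the 4-node path graph are illustrative assumptions; the exact score is an infinite sum, which this truncation only approximates.

```python
def katz_scores(adj, beta=0.1, max_len=4):
    """Truncated Katz score: sum over l = 1..max_len of beta**l * (A**l)[x][y].

    adj is a square 0/1 adjacency matrix given as a list of lists.
    (A**l)[x][y] counts the walks of length l from x to y.
    """
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # current power of A, starting at A**1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        # Advance to the next power of the adjacency matrix.
        power = [
            [sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        weight *= beta
    return total

# Hypothetical path graph 0 - 1 - 2 - 3.
path_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
katz = katz_scores(path_adj)
print(katz[0][2] > katz[0][3])  # True: node 2 is closer to node 0 than node 3 is
```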

Data Collaboration networks of physicists –Core nodes: authors who published at least 3 papers during the training period and at least 3 papers during the test period Training data: graph G(t_0, t_0') of collaborations during the time period [t_0, t_0'], with V core nodes and E_old edges Test data: graph G(t_1, t_1') of collaborations during a later time period [t_1, t_1'], with V core nodes and E_new edges

Evaluation metric Score node pairs using a link prediction heuristic p; new links are more likely among high-scoring pairs. Each link prediction heuristic p outputs a ranked list L of predicted new collaborations: pairs in V × V − E_old. Focus the evaluation on the new links E*_new between core nodes. Performance metric: how many of the top n pairs in the ranked list L are actual new links in E*_new?
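
The performance metric can be illustrated as follows. This is a sketch only: the pair scores and the set of new links are made-up values, and `top_n_hits` is a hypothetical helper name.

```python
def top_n_hits(scores, new_edges, n):
    """Count how many of the n highest-scoring candidate pairs are real new links.

    scores maps candidate pairs (not already linked in training) to the
    value assigned by some heuristic; new_edges lists the links that
    actually appeared in the test period.
    """
    ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)
    # Undirected links: compare pairs regardless of endpoint order.
    new = {frozenset(edge) for edge in new_edges}
    return sum(1 for pair in ranked[:n] if frozenset(pair) in new)

# Hypothetical heuristic scores for four candidate pairs.
pair_scores = {("a", "b"): 3.0, ("a", "c"): 2.0, ("b", "d"): 1.0, ("c", "d"): 0.5}
observed_new = [("b", "a"), ("c", "d")]

print(top_n_hits(pair_scores, observed_new, n=2))  # 1
```

Only the pair (a, b) is both in the top 2 and among the observed new links, so the heuristic gets 1 hit out of 2 predictions.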

Results Heuristics vs random predictor

Results Heuristics vs graph distance predictor

Summary Graph-based link prediction heuristics outperform random guessing by a factor of 40. However, they still predict only 16% of new collaborations at best, leaving much room for improvement.

Link prediction in complex networks: a survey Presenter: Yuan Shi USC ID: CSCI 599 Social Media Analysis L Lu and T Zhou, “Link prediction in complex networks: a survey”, Physica A 390(6): (2011)

Link Prediction Estimate the likelihood of the existence of a link between two nodes, based on observed links and the attributes of nodes Applications –Biological networks: identifying links between nodes through field or laboratory experiments is costly –Online social networks: predicting friendship and recommending new friends (predicting future links in evolving networks)

Problem Description and Evaluation Metrics Undirected network G = (V, E) Universal set U containing all |V|(|V|-1)/2 possible links Task: find the missing links in U − E. Evaluation: randomly split E into two sets: a training set E^T and a probe (validation) set E^P k-fold cross-validation –Randomly partition E into k subsets –Each time, one subset is selected as the probe set and the others form the training set –Repeat k times, each time with a different probe set
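
A minimal sketch of the k-fold split of the edge set, assuming edges are given as a list of pairs. The helper name `k_fold_edge_splits`, the fixed seed, and the toy edge list are illustrative choices, not part of the slides.

```python
import random

def k_fold_edge_splits(edges, k, seed=0):
    """Yield (training_edges, probe_edges) pairs for k-fold cross-validation."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled edges into k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        probe = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, probe

# Hypothetical edge list: a chain of 11 nodes, 10 edges.
toy_edges = [(i, i + 1) for i in range(10)]
splits = list(k_fold_edge_splits(toy_edges, k=5))

for train, probe in splits:
    print(len(train), len(probe))  # 8 2 on every fold
```

Every edge lands in exactly one probe set, so each link is predicted once from a training graph that does not contain it.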

Evaluation Metrics A link prediction algorithm produces a ranking of all candidate links AUC (area under the receiver operating characteristic curve) –Considers the whole ranked list –The probability that a randomly chosen missing link is given a higher score than a randomly chosen nonexistent link Precision –Focuses on the top ranks –Take the top-L predicted links; if L_r of them are correct, the precision is L_r/L
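
Both metrics can be computed directly from a table of scores. The sketch below compares every probe link with every nonexistent link instead of sampling pairs, and counts ties as half a win (a common convention, assumed here); the scores and link labels are invented for illustration.

```python
def auc(scores, probe_links, nonexistent_links):
    """Probability that a probe (missing) link outscores a nonexistent link.

    Exhaustive version: compares all pairs; ties count as half a win.
    """
    wins = ties = 0
    for m in probe_links:
        for n in nonexistent_links:
            if scores[m] > scores[n]:
                wins += 1
            elif scores[m] == scores[n]:
                ties += 1
    return (wins + 0.5 * ties) / (len(probe_links) * len(nonexistent_links))

def precision_at(scores, probe_links, top_l):
    """Fraction of the top_l highest-scoring candidates that are probe links."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_l]
    probe = set(probe_links)
    return sum(1 for link in ranked if link in probe) / top_l

# Hypothetical scores: p* are probe links, n* are nonexistent links.
link_scores = {"p1": 0.9, "p2": 0.4, "n1": 0.5, "n2": 0.4, "n3": 0.1}
probe = ["p1", "p2"]
nonexistent = ["n1", "n2", "n3"]

print(auc(link_scores, probe, nonexistent))  # 0.75
print(precision_at(link_scores, probe, 2))   # 0.5
```

Here 4 of the 6 comparisons are wins and 1 is a tie, giving (4 + 0.5) / 6 = 0.75; a random scorer would be expected to reach about 0.5.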

Similarity-Based Algorithms Assign a score s_xy to each pair of nodes x and y The attributes of nodes are generally hidden -> focus on structural similarity: two nodes are considered likely to be linked if they have similar network structure Similarity indices –Local similarity indices: only use local information –Global similarity indices: use global information, more accurate but costly –Quasi-local indices: a tradeoff between local and global

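Local similarity indices 10 indices are discussed, including: Common neighbors (CN): $s_{xy} = |\Gamma(x) \cap \Gamma(y)|$ Adamic-Adar index (AA): $s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} 1/\log k_z$ Resource allocation index (RA): $s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} 1/k_z$ where $\Gamma(x)$ is the set of neighbors of x and $k_z$ is the degree of node z. Intuition for RA: similarity(x, y) = the amount of resource y receives from x. x sends some resource to y, with their common neighbors acting as transmitters. Each transmitter has one unit of resource and distributes it equally among all its neighbors.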
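
The resource-passing intuition translates into code almost literally: each common neighbor z holds one unit of resource and gives an equal share, 1 / (degree of z), to each of its neighbors, so y receives the sum of those shares. The sketch below assumes a dict-of-sets graph; the toy graph, with one hub `h` and one low-degree common neighbor `z`, is hypothetical.

```python
import math

def resource_allocation(g, x, y):
    """RA: sum over common neighbors z of 1 / degree(z)."""
    return sum(1.0 / len(g[z]) for z in g[x] & g[y])

def adamic_adar_index(g, x, y):
    """AA, for comparison: sum over common neighbors z of 1 / log(degree(z))."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

# Hypothetical graph: x and y share a hub h (degree 4) and a node z (degree 2).
hub_graph = {
    "x": {"h", "z"},
    "y": {"h", "z"},
    "h": {"x", "y", "p", "q"},
    "z": {"x", "y"},
    "p": {"h"},
    "q": {"h"},
}

print(resource_allocation(hub_graph, "x", "y"))  # 0.75 = 1/4 + 1/2
print(adamic_adar_index(hub_graph, "x", "y"))
```

The hub contributes half as much as z under RA (1/4 versus 1/2), whereas under AA it also contributes exactly half (1/log 4 versus 1/log 2) in this example; for larger hubs RA discounts more steeply than AA because 1/k falls faster than 1/log k.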

Local similarity indices: evaluation Metric: AUC. Each number is averaged over 10 implementations. Real-world networks: PPI (protein-protein interaction), NS (co-authorship), Grid (electrical power grid), PB (US political blogs), INT (router-level Internet), USAir (US air transportation). Results: RA performs best; CN and AA have the second-best performance.

Global similarity indices 7 indices are discussed. Some examples: Katz index; average commute time (computed from the Laplacian matrix); random walk with restart (a direct application of the PageRank algorithm). Global indices –Pros: more accurate than local indices –Cons: 1) time-consuming; 2) global topological information may not be available
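
Random walk with restart can be sketched with plain power iteration. The walker follows a random link with probability `c` and jumps back to its start node with probability 1 - c; the similarity of x and y is taken as the sum of the two directed visiting probabilities. The value c = 0.85, the iteration count, and the path graph are assumptions made for illustration.

```python
def rwr_vector(adj, start, c=0.85, iters=200):
    """Stationary visiting probabilities of a walk that restarts at `start`."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    q = [0.0] * n
    q[start] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            if degree[i] == 0:
                continue  # isolated nodes pass nothing on
            share = c * q[i] / degree[i]
            for j in range(n):
                if adj[i][j]:
                    nxt[j] += share
        nxt[start] += 1.0 - c  # restart mass returns to the start node
        q = nxt
    return q

def rwr_score(adj, x, y, c=0.85):
    """Symmetric RWR similarity: probability mass x sends to y plus y sends to x."""
    return rwr_vector(adj, x, c)[y] + rwr_vector(adj, y, c)[x]

# Hypothetical path graph 0 - 1 - 2 - 3.
line_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(rwr_score(line_adj, 0, 1) > rwr_score(line_adj, 0, 3))  # True
```

Each score needs a full pass over the graph per iteration, which illustrates why global indices are described as time-consuming compared with local ones.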

Quasi-local indices 3 indices are discussed. Local Path Index (LP) –Outperforms local indices such as RA, AA and CN –Is competitive with global indices at much lower computational cost Local Random Walk (LRW) and Superposed Random Walk (SRW): both are defined at time step t in terms of an initial configuration function q [equations not shown]. Some experiments show that LRW and SRW perform better than LP.
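
For the Local Path index, a commonly cited definition is S = A^2 + eps * A^3, where A is the adjacency matrix and eps is a small free parameter. The slide text does not show the formula, so treat this definition, the value eps = 0.01, and the example graph as assumptions of the sketch.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def local_path_index(adj, eps=0.01):
    """LP similarity matrix: paths of length 2 plus eps times paths of length 3."""
    n = len(adj)
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    return [[a2[i][j] + eps * a3[i][j] for j in range(n)] for i in range(n)]

# Hypothetical path graph 0 - 1 - 2 - 3.
chain_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
lp = local_path_index(chain_adj)
print(lp[0][2])  # 1.0: one path of length 2, none of length 3
print(lp[0][3])  # 0.01: no common neighbor, one path of length 3
```

Common neighbors alone would score the pair (0, 3) as zero; the length-3 term lets LP break such ties while still using only nearby structure.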

Maximum Likelihood Methods Methodology –Assume some organizing principles of the network structure –Rules and parameters are obtained by maximizing the likelihood of the observed structure –Likelihood of any non-observed link can be calculated according to those rules and parameters Pros: provide valuable insights into the network organization Cons: Time consuming; Prediction accuracy is not very high

Example: Hierarchical Structure Model Assumptions: 1. Each internal node r of the dendrogram is associated with a probability p_r 2. The probability of linking a pair of leaves equals p_r', where r' is their lowest common ancestor. The p_r are obtained from statistics of the graph by maximizing the likelihood [equations not shown]. Prediction: 1. Sample a large number of dendrograms with probability proportional to their likelihood 2. Compute the link probability by averaging the corresponding probability over all sampled dendrograms
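
For reference, the likelihood in the hierarchical random graph formulation is usually written as below. The symbols are assumptions introduced here, since the slide's equations are not in the transcript: for an internal node r of dendrogram D, L_r and R_r are the numbers of leaves in its left and right subtrees, and E_r is the number of observed edges whose endpoints have r as lowest common ancestor.

```latex
\mathcal{L}\bigl(D, \{p_r\}\bigr)
  = \prod_{r \in D} p_r^{E_r} \, (1 - p_r)^{L_r R_r - E_r}

% Setting the derivative of the log-likelihood with respect to p_r to zero:
\frac{E_r}{p_r} - \frac{L_r R_r - E_r}{1 - p_r} = 0
\quad\Longrightarrow\quad
p_r^{*} = \frac{E_r}{L_r R_r}
```

As a worked example under these definitions, an internal node with L_r = 3 leaves on the left, R_r = 2 on the right, and E_r = 3 observed edges crossing between them gets p_r* = 3 / (3 × 2) = 0.5.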

Application Reconstruction of Networks –Not easy to reconstruct the “true” network since generally no one knows how many links are missing –Reliability of a network Classification of Partially Labeled Networks –Predict the labels of these unlabeled nodes based on the known labels and the network structure –Approach: add artificial links between every pair of labeled and unlabeled nodes Global optimization is difficult -> use greedy algorithms

Application Evaluation of Network Evolving Mechanisms –A link prediction algorithm reveals which factors account for the existence of links –Example: similarity indices for the Chinese city airline network CN: topological effects DIS: geographical distance POPU: population GDP TI: tertiary industry (the third sector of GDP)

Outlook Link prediction in directed networks Multi-dimensional networks, where links could have different meanings (e.g. positive/negative) Hybrid algorithms to combine different similarity indices Leveraging external information (e.g. attributes) to improve accuracy Time-series link prediction approach considering the temporal evolutions of link occurrences

Romantic partnerships and the dispersion of social ties

Romantic Partnerships and the Dispersion of Social Ties (Backstrom & Kleinberg) Questions –Who are the most important individuals in a person’s social neighborhood? –What are the defining structural signatures of a person’s social neighborhood? Contributions –Dispersion: a new measure for estimating tie strength –Characterize romantic relationships in terms of network structure –Empirical study of this characteristic across Facebook population

Who are the most important people in one’s social neighborhood? Following Granovetter, researchers use the number of mutual friends (embeddedness) to identify strong ties –Close friends, who share much time together –Emotionally intense interactions [Figure: two example networks on nodes A to F; in the first the A-B tie is highly embedded in the network, in the second the A-B tie is not embedded in the network.]

Romantic ties Embeddedness is not able to identify “significant others” (romantic relationships, e.g., spouse, partner, boy/girlfriend) Ego network: the social neighborhood of an individual, showing all his/her friends and the links between them [Figure: ego network of an individual. Who is the “significant other”?]

Social foci People have large clusters of friends corresponding to well-defined foci of interaction in their lives –These links have high embeddedness but are not very strong ties In contrast, romantic partners may have lower embeddedness, but they often involve mutual friends from different foci [Figure: ego network of an individual, with clusters of college friends and co-workers.]

Embeddedness vs dispersion Embeddedness: u and v have many mutual neighbors. Links u-b, u-c, and u-f have embeddedness 5; link u-h has embeddedness 4. Dispersion: the mutual neighbors of u and v are not well connected to one another, and hence u and v are the only intermediaries between these different parts of the network. Link u-h has high dispersion: u and h are the only intermediaries between c and f.

Link dispersion s(u,h)=4 disp(u,b)=1
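
One way to make the dispersion idea concrete is the sketch below. It counts pairs of mutual friends of u and v that are neither directly linked nor connected through any other common neighbor inside u's ego network; this particular distance rule, the function names, and the toy ego network are assumptions made for illustration rather than a statement of the paper's exact procedure.

```python
from itertools import combinations

def embeddedness(g, u, v):
    """Number of mutual neighbors of u and v."""
    return len(g[u] & g[v])

def dispersion(g, u, v):
    """Count pairs of mutual friends of u and v that are 'far apart'.

    A pair (s, t) counts when s and t are not linked and share no
    neighbor inside u's ego network other than u and v themselves.
    """
    ego = g[u] | {u}
    common = g[u] & g[v]
    total = 0
    for s, t in combinations(sorted(common), 2):
        if t in g[s]:
            continue  # directly linked, so not dispersed
        shared = (g[s] & g[t] & ego) - {u, v}
        if not shared:
            total += 1
    return total

# Hypothetical ego network of u: a and b know each other, c knows neither.
ego_graph = {
    "u": {"v", "a", "b", "c"},
    "v": {"u", "a", "b", "c"},
    "a": {"u", "v", "b"},
    "b": {"u", "v", "a"},
    "c": {"u", "v"},
}

print(embeddedness(ego_graph, "u", "v"), dispersion(ego_graph, "u", "v"))  # 3 2
print(embeddedness(ego_graph, "u", "a"), dispersion(ego_graph, "u", "a"))  # 2 0
```

In this toy network the u-v link bridges c and the a-b cluster, so it has positive dispersion, while the u-a link sits entirely inside one cluster and has none.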

Evaluation Ego networks of 1.3 million Facebook users, selected uniformly at random from among all users aged at least 20, with between 50 and 2000 friends, who list a spouse or relationship partner in their profile Rank all friends by importance and attempt to identify the romantic partner Measure: precision at the first position

Performance –How well does dispersion predict the “significant other”? –Measured as the precision of the top-ranked person in the individual’s egonet –Beats other measures of interaction between users: viewing of profiles, sending of messages, and co-presence at events (photos)

Performance as a function of neighborhood size Performance is best when the neighborhood size is around 100 nodes (56%), and drops moderately (to 33%) as the egonet size increases by an order of magnitude to 1000 Interaction features are better for larger neighborhoods, because users with larger neighborhoods are more active

Performance as a function of user’s time on site

Best performance when combining features Predict relationship status of users –Ground truth: 60% of users are in a relationship Demographic features (age, gender, country, and time on site) work better than network-based features (dispersion) Best performance combining demographic and network features

How does performance vary based on age of the relationship?

Marriage Performance of dispersion measures increases as people approach time of their marriage

Persistence of relationships Transition probability from the status ‘in a relationship’ to the status ‘single’ over a 60-day period. The transition probabilities decrease monotonically, and by significant factors, for users with high normalized or recursive dispersion to their respective partners.

Summary Graph structure contains information predictive of individual relationships –New collaborations –Romantic partnerships In many cases, graph-based algorithms outperform feature-based machine learning algorithms These results suggest complex interactions between personal relationships and global network structure