Social Network Analysis with Apache Spark and Neo4j
Charles Copley, Nathan Begbie, Eli Copley
OVERVIEW
Introduction to social network concepts
Workshop data & data handling
Applied visualisation and network computations
By the end of the workshop, participants will have the basic skills needed to learn to use Apache Spark with Neo4j for social network analysis.
01 Introduction to Social Networks Introduction to Concepts & Terminology Used in Social Network Analysis
SOCIAL NETWORK ANALYSIS
Levels of Analysis
→ Individuals affect other individuals
→ Individual behaviours and decisions determine network structures and dynamics
→ Network properties and an individual's network location affect individual behaviour
→ Network structures, dynamics and evolution mechanisms at time 1 affect network dynamics and structures at time 2
SOCIAL NETWORK CONCEPTS & TERMINOLOGY
[Figure: example network labelling isolates, a node (degree = 4), a component and an edge]
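The terms on this slide can be made concrete with a small sketch. The network below is a hypothetical toy example, not the workshop dataset: node degree is the number of edges touching a node, an isolate is a node with degree 0, and a component is a set of nodes that can all reach one another. In this sketch an isolate is counted as a component of size one.

```python
from collections import deque

def degrees(nodes, edges):
    """Return {node: degree} for an undirected graph."""
    deg = {n: 0 for n in nodes}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def components(nodes, edges):
    """Return the connected components as a list of node sets."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    comp.add(nxt)
                    queue.append(nxt)
        result.append(comp)
    return result

# Hypothetical toy network: A has four edges, H has none.
nodes = ["A", "B", "C", "D", "E", "F", "G", "H"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("F", "G")]

deg = degrees(nodes, edges)
isolates = [n for n, d in deg.items() if d == 0]
comps = components(nodes, edges)

print("degree of A:", deg["A"])
print("isolates:", isolates)
print("number of components:", len(comps))
```

Here node A has degree 4, H is the only isolate, and the network splits into three components.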
SOCIAL NETWORK CONCEPTS & TERMINOLOGY
Homophily: birds of a feather flock together
Image from Moody, J. (2004)
Sourced by Ambika Samarthya-Howard, Praekelt.Org
SOCIAL NETWORK CONCEPTS & TERMINOLOGY
Influence and Selection
We influence and are influenced by the people we are connected to, but we also select those who are similar to us.
[Figure: network diagrams with nodes numbered 1 to 5]
SOCIAL NETWORK CONCEPTS & TERMINOLOGY Triadic Closure Triad
SOCIAL NETWORK CONCEPTS & TERMINOLOGY
How connected are your friends?
[Figure: three example networks with clustering coefficients 1/3, 2/3 and 3/3]
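The local clustering coefficient answers the question on this slide: of all the pairs of friends a person has, what fraction are themselves connected? The sketch below assumes an undirected network stored as a dictionary of neighbour sets. The node names (`me`, `x`, `y`, `z`) are hypothetical, chosen so that a person with three friends reproduces the slide's values of 1/3, 2/3 and 3/3.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected adjacency dict {node: set(neighbours)}."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def clustering_coefficient(graph, node):
    """Fraction of pairs of `node`'s friends that are connected to each other."""
    friends = graph[node]
    k = len(friends)
    if k < 2:
        return 0.0  # no pairs of friends exist; returning 0 is a convention
    possible = k * (k - 1) // 2
    actual = sum(1 for a, b in combinations(friends, 2) if b in graph[a])
    return actual / possible

# "me" has three friends, so there are three possible friend-to-friend ties.
ego_ties = [("me", "x"), ("me", "y"), ("me", "z")]
one = build_graph(ego_ties + [("x", "y")])
two = build_graph(ego_ties + [("x", "y"), ("y", "z")])
three = build_graph(ego_ties + [("x", "y"), ("y", "z"), ("x", "z")])

for label, graph in [("one tie", one), ("two ties", two), ("three ties", three)]:
    print(label, clustering_coefficient(graph, "me"))
```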
PageRank
Your influence is determined by the influence of the people you are connected to.
Your influence is passed on to the people that you link to.
Then you iterate, many times.
[Figure: example network with node values PR = 1.35, PR = 1.35 and PR = 0.15]
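The iteration described on this slide can be sketched in a few lines. This is a minimal illustration, not the Spark or Neo4j implementation. It assumes the non-normalised update PR(n) = (1 - d) + d * sum(PR(m) / out_degree(m)) over the nodes m that link to n, with a damping factor d = 0.85. The three-node graph is a hypothetical example and is not claimed to be the one drawn on the slide.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterative PageRank on a directed graph.

    links: {node: [nodes it links to]}
    Uses the non-normalised form, so ranks average roughly 1 per node.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the baseline (1 - damping).
        new_rank = {n: 1.0 - damping for n in nodes}
        for source, targets in links.items():
            if not targets:
                continue  # simplification: a node with no out-links passes nothing on
            share = rank[source] / len(targets)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Hypothetical graph: A and B link to each other; C links to both, nobody links to C.
links = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}
ranks = pagerank(links)

for node in sorted(ranks):
    print(node, round(ranks[node], 3))
```

Under these assumptions a node that nobody links to settles at 1 - d = 0.15, which is consistent with the PR = 0.15 value shown on the slide. In this toy graph A and B converge to about 1.425 each.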
02 Workshop Data Why and how we use specific tools to handle large network datasets
DATASET
US National Longitudinal Study of Adolescent Health (Add Health)
A longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The data include race, gender and grade.
See: http://www.cpc.unc.edu/projects/addhealth
Reference: Goodreau, Handcock, Hunter, Butts and Morris (2008). A Statnet Tutorial. Journal of Statistical Software, February 2008, Volume 24. https://www.jstatsoft.org/article/view/v024i09
DATA HANDLING
Raw Data → Distributed Computation → Graph Database
Raw Data: holds your primary data (could also be in a database).
Distributed Computation: first import the data into Spark for data handling, formatting and calculation.
Graph Database: then move the data into Neo4j, which allows you to query relationship patterns and conduct SNA.
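The three stages can be illustrated with a small stand-in that uses only the Python standard library in place of Spark, so that it runs anywhere. Everything specific here is an assumption for illustration: the raw CSV layout, the `Student` label, the `id` property and the cleaning rules. Only the `knows` relationship type is taken from the workshop query. In the workshop itself the cleaning step would run in Spark and the statements would be executed against Neo4j.

```python
import csv
import io

# Stage 1: raw data (hypothetical friendship nominations, with noise).
RAW_EDGES = """source,target
1,2
2,3
2,3
3,1
4,4
"""

def load_edges(raw_text):
    """Parse the raw CSV text into (source, target) integer pairs."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [(int(row["source"]), int(row["target"])) for row in reader]

def clean_edges(edges):
    """Stage 2 stand-in: drop self-loops and duplicate rows, keeping order."""
    seen, cleaned = set(), []
    for edge in edges:
        if edge[0] != edge[1] and edge not in seen:
            seen.add(edge)
            cleaned.append(edge)
    return cleaned

def to_cypher(edges):
    """Stage 3: build one Cypher MERGE statement per relationship."""
    statements = []
    for source, target in edges:
        statements.append(
            f"MERGE (a:Student {{id: {source}}}) "
            f"MERGE (b:Student {{id: {target}}}) "
            f"MERGE (a)-[:knows]->(b)"
        )
    return statements

edges = clean_edges(load_edges(RAW_EDGES))
statements = to_cypher(edges)

for statement in statements:
    print(statement)
```

Using MERGE rather than CREATE means that re-running the load does not duplicate nodes or relationships.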
03 Data Practical Visualising network data and computing basic metrics
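DATA PRACTICAL
A recommender system could consist of searching for people connected to your friends, as LinkedIn does, for example.
Person 1 knows Person 2 → Person 2 knows Person 3, so Person 3 can be suggested to Person 1.
The query below returns up to 10 triads in which all three people already know one another:
MATCH (p1)-[r1:knows]-(p2), (p1)-[r2:knows]-(p3), (p3)-[r3:knows]-(p2)
RETURN p1, p2, p3, r1, r2
LIMIT 10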
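The friend-of-a-friend idea can also be sketched outside the database. The example below assumes an undirected `knows` network held as a dictionary of neighbour sets. The people and ties are hypothetical. Candidates are ranked by the number of mutual friends, which is one simple scoring choice among many.

```python
from collections import Counter

def recommend(graph, person, limit=10):
    """Suggest friends-of-friends that `person` does not already know.

    Candidates are ordered by number of mutual friends, then by name.
    """
    friends = graph.get(person, set())
    counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            if candidate != person and candidate not in friends:
                counts[candidate] += 1  # one more mutual friend
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]

# Hypothetical network of "knows" ties.
knows = {
    "Person 1": {"Person 2", "Person 4"},
    "Person 2": {"Person 1", "Person 3", "Person 5"},
    "Person 3": {"Person 2", "Person 4"},
    "Person 4": {"Person 1", "Person 3"},
    "Person 5": {"Person 2"},
}

suggestions = recommend(knows, "Person 1")
print(suggestions)
```

Person 3 is suggested first because it shares two friends with Person 1 (Person 2 and Person 4), followed by Person 5 with one mutual friend.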
Thank you! Any questions? charles@praekelt.org nathan@praekelt.org eli@praekelt.org
More Reading
Social Network Analysis with Big Data: Charles Copley, Head of Data Science at Praekelt. https://medium.com/mobileforgood/social-network-analysis-using-apache-spark-and-neo4j-1ccba3c8af9a
Homophily and Influence: Aral, S. (2013) What Would Ashton Do? Harvard Business Review. On how homophily and social location impact our choices. https://hbr.org/2013/05/what-would-ashton-do-and-does-it-matter
Weak Ties, Social Capital: Granovetter, M. S. (1973) The Strength of Weak Ties. American Journal of Sociology, 78(6), 1360-1380.