Social Sub-groups


Social Sub-groups: Overview
Background: Continue discussion of social subgroups.
  Wayne Baker: Social structure in a place where there should be none
  Scott Feld: What causes clustering in a network? Opportunity and interests
Methods: Search procedures for network subgroups
  Segregation statistics
  Iterative search procedures
  Cluster analysis

Social Sub-groups
Wayne Baker: The Social Structure of a National Securities Market
  1) Behavioral assumptions of economic actors
  2) Micro-structure of networks
  3) Macro-structure of networks
  4) Price consequences
Under standard economic assumptions, people should act rationally and act only on price. This would result in expansive and homogeneous (i.e., random) networks. It is, in fact, this structure that allows microeconomic theory to predict that prices will settle to an optimal equilibrium.

Baker’s Model:

He makes two assumptions in contrast to standard economic assumptions: a) that people do not have access to perfect information and b) that some people act opportunistically. He then shows how these assumptions change the underlying mechanisms in the market, focusing on price volatility as a marker for uncertainty. The key on the exchange floor is the “market makers”: people who keep the process active and keep trading alive, and thus do not ‘hoard’ (which would lower profits system-wide).

Baker’s Model:
Micro-networks: Actors should trade extensively and widely. Why might they not?
  A) Physical factors (noise and distance)
  B) Avoid risk and build trust
Macro-networks: Should be undifferentiated. Why not?
  A) Large crowds should be more differentiated than small crowds. Why?
Price consequences: Markets should clear. They often don’t. Why? Network differentiation reduces economic efficiency, leading to less information and more volatile prices.

Baker: Use frequency of exchange to identify the network. Baker finds that the structure of this network significantly (and differentially) affects the price volatility of the network.

Baker: Because size is the primary determinant of clustering in this setting, he concludes that the standard economic assumption of large market = efficient is unwarranted.

Scott Feld: Focal Organization of Social Ties
Feld wants to look at the effects of constraint & opportunity for mixing, to situate relational activity within a wider context. The contexts form “foci”: “A social, psychological, legal or physical entity around which joint activities are organized” (p.1016). People with similar foci will be clustered together. He contrasts this with social balance theory.
Claim: much of the clustering attributed to interpersonal balance processes is really due to focal clustering. (Note that this is not a theoretically fair critique, given that balance theory can easily accommodate non-personal balance factors (like smoking or group membership), but it is a good empirical critique: most researchers haven’t properly accounted for foci.)

Identifying Primary groups:
1) Measures of fit. To identify a primary group, we need some measure of how clustered the network is. Usually, this is a function of the number of ties that fall within groups relative to the number of ties that fall between groups.
2) Algorithmic approaches to maximizing (1). Once we have such an index, we need a method for searching through the network to maximize the fit. We next go over various algorithms that search on different fit criteria.
3) Generalized cluster analysis. In addition to maximizing a group function such as (1), we can use the relational distance directly and look for clusters in the data. We next go over two different styles of cluster analysis.

Measuring Cluster fit. Many options. For a review, see:
  Frank, K. A. "Identifying Cohesive Subgroups." Social Networks
  Fershtman, M. "Cohesive Group Detection in a Social Network by the Segregation Matrix Index." Social Networks
  Richards, William D. NEGOPY. Burnaby, B.C., Canada: Simon Fraser University.

Segregation Index (Freeman, L. C. "Segregation in Social Networks." Sociological Methods and Research)
Freeman asked how we could identify segregation in a social network. Theoretically, he argues, if a given attribute (group label) does not matter for social relations, then relations should be distributed randomly with respect to the attribute. Thus, the difference between the number of cross-group ties expected by chance and the number observed measures segregation.

Segregation Index Consider the (hypothetical) network below. There are two attributes in this network: people with Blue eyes and Brown eyes and people who are square or not (they must be hip).

Segregation Index
Mixing Matrix:
            Blue  Brown
  Blue         6     17
  Brown       17     16
            Hip  Square
  Hip        20       3
  Square      3      30

Segregation Index
To calculate the expected number of ties, use the standard formula for a contingency table: (row marginal * column marginal) / total. In matrix form: E(X) = R*C/T
Observed:
            Blue  Brown
  Blue         6     17
  Brown       17     16
Expected:
            Blue  Brown
  Blue      9.45  13.55
  Brown    13.55  19.45

Segregation Index (Blue / Brown, observed vs. expected)
E(X) = (13.55 + 13.55) = 27.1
X = (17 + 17) = 34
Seg = (E(X) - X) / E(X) = (27.1 - 34) / 27.1 = -6.9 / 27.1 = -0.25

Segregation Index (Hip / Square, observed vs. expected)
E(X) = (13.55 + 13.55) = 27.1
X = (3 + 3) = 6
Seg = (E(X) - X) / E(X) = (27.1 - 6) / 27.1 = 21.1 / 27.1 = 0.78
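The two worked calculations above can be reproduced with a short Python sketch (standard library only; this is not the SAS module the slides use). The function name `segregation_index` is made up for illustration. The Brown row of the eye-color matrix (17, 16) is inferred from the stated cross-group count of 17 and the network total of 56 ties.

```python
def segregation_index(mixing):
    """(expected cross-group ties - observed cross-group ties) / expected."""
    k = len(mixing)
    total = sum(sum(row) for row in mixing)
    row_marg = [sum(row) for row in mixing]
    col_marg = [sum(col) for col in zip(*mixing)]
    # Expected count per cell is row marginal * column marginal / total;
    # only the off-diagonal (cross-group) cells enter the index.
    expected_cross = sum(row_marg[i] * col_marg[j] / total
                         for i in range(k) for j in range(k) if i != j)
    observed_cross = sum(mixing[i][j]
                         for i in range(k) for j in range(k) if i != j)
    return (expected_cross - observed_cross) / expected_cross

# Mixing matrices from the slides (Brown row inferred, see lead-in).
eye = [[6, 17], [17, 16]]
hip = [[20, 3], [3, 30]]

print(round(segregation_index(eye), 2))
print(round(segregation_index(hip), 2))
```

A negative value means more cross-group ties than chance would predict; a value near 1 means almost all ties stay within groups.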

In SAS, you need to create a mixing matrix to calculate the segregation index. Mixmat.mod will do this. It does so using an indicator matrix (the example attributes here are Blue and Square).

Segregation Index
M = I`AI
You get the mixing matrix by pre-multiplying the adjacency matrix by the transpose of the indicator matrix and post-multiplying by the indicator matrix:
M (k x k) = I` (k x n) * A (n x n) * I (n x k)
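As a rough illustration of M = I`AI outside SAS, here is a Python sketch using plain lists. The four-node network and the helper names (`transpose`, `matmul`, `mixing_matrix`) are hypothetical and are not taken from Mixmat.mod.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mixing_matrix(adj, groups):
    """Return (labels, M) where M = I' A I and I is the n x k indicator matrix."""
    labels = sorted(set(groups))
    indicator = [[1 if g == lab else 0 for lab in labels] for g in groups]
    return labels, matmul(matmul(transpose(indicator), adj), indicator)

# Hypothetical symmetric network: ties 0-1, 0-2, 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
groups = ["blue", "blue", "brown", "brown"]

labels, m = mixing_matrix(adj, groups)
print(labels)
print(m)
```

Because the adjacency matrix is symmetric, each within-group tie is counted twice on the diagonal, which matches the n_i(n_i - 1) count of possible within-group ties used for the odds ratio.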

Segregation Index In practice, how does the segregation index work? This is a plot of the extent of race segregation in a high school, by the racial heterogeneity of the high school

Segregation Index
One problem with the segregation index is that it is not ‘margin free.’ That is, if you were to change the distribution of the category of interest (say race) by a constant but not the core association between race and friendship choice, you can get a different segregation level. One antidote to this problem is to use odds ratios. In this case, an odds ratio tells us the relative likelihood that two people in the same category will choose each other as friends.

Odds Ratios
The odds ratio tells us how much more likely people in the same group are to nominate each other. You calculate the odds ratio based on the number of ties in a group and their relative size, based on the following table:
                 Member of:
                 Same Group   Different Group
  Friends            A              B
  Not Friends        C              D
OR = AD / BC

Odds Ratios
There are 6 hip people and 9 square people in this network.
Observed ties:
            Hip  Square
  Hip        20       3
  Square      3      30
This implies the following number of possible ties in the network (diagonal = n_i(n_i - 1), off-diagonal = n_i * n_j):
            Hip  Square
  Hip        30      54
  Square     54      72
                Group
                Same   Dif
  Friend  Yes     50     6
          No      52   102
OR = (50)(102) / (52)(6) = 16.35
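The same calculation, written as a small Python sketch for checking the arithmetic. The function name `same_group_odds_ratio` is invented for illustration, and the code assumes directed dyad counts, as the n_i(n_i - 1) formula implies.

```python
def same_group_odds_ratio(mixing, sizes):
    """Odds of a tie within groups relative to the odds of a tie between groups."""
    k = len(sizes)
    same_ties = sum(mixing[i][i] for i in range(k))
    diff_ties = sum(mixing[i][j] for i in range(k) for j in range(k) if i != j)
    # Possible directed dyads: n_i * (n_i - 1) within, n_i * n_j between.
    same_possible = sum(n * (n - 1) for n in sizes)
    diff_possible = sum(sizes[i] * sizes[j]
                        for i in range(k) for j in range(k) if i != j)
    a, b = same_ties, diff_ties            # friends: same, different
    c = same_possible - same_ties          # not friends, same group
    d = diff_possible - diff_ties          # not friends, different group
    return (a * d) / (b * c)

hip = [[20, 3], [3, 30]]   # observed Hip/Square mixing matrix
sizes = [6, 9]             # 6 hip people, 9 square people

print(round(same_group_odds_ratio(hip, sizes), 2))
```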

Segregation index compared to the odds ratio: r = .95 (plot of the Friendship Segregation Index and Log(Same-Sex Odds Ratio))

Algorithms that maximize this type of fit (density / tie ratio based)
Factions in UCI-NET: Multiple options for the exact factor maximized. I recommend either the density or the correlation function, and I would calculate the distance in each case.
Frank’s KliqueFinder (the AJS paper we just read): I have it, but I’ve yet to be able to get it to work. The folks at UCI-NET are planning on incorporating it into the next version.
Fershtman’s SMI: Never seen it programmed, though I use some of the ideas in the CROWDS algorithm discussed below.

Factions
Once you read your data into UCI-NET you can use Factions, which in many ways is the easiest option, though only if your networks are not too big.

Factions
Input dataset: name of the network you want to cluster
Fit criterion:
  Sum of the in-group ties
  Density of in-group ties
  Correlation of observed tie patterns to an ideal (block diagonal)
  “Other” - Steve Borgatti’s ‘special function’ - no idea what it means.
Are diagonals valid? Depends on the data of interest
Convert to geodesic: I recommend doing this if your network is fairly sparse
Maximum # of iterations in a series: I usually go with the defaults. (Same with the next three options)
Output: the name of the partition you want to save

Cluster analysis
In addition to tools like FACTIONS, we can use the distance information contained in a network to cluster observations that are ‘close’ to each other. In general, cluster analysis is a set of techniques that allows you to identify collections of objects that are similar to each other in some degree. A very good reference is the SAS/STAT manual section called “Introduction to clustering procedures.” (See also Wasserman and Faust, though the coverage is spotty.) We are going to start with the general problem of hierarchical clustering applied to any set of analytic objects based on similarity, and then transfer that to clustering nodes in a network.

Cluster analysis
Imagine a set of objects (say people) arrayed in a two dimensional space. You want to identify groups of people based on their position in that space. How do you do it? (Plot axes: how cool you are, how smart you are.)

Cluster analysis
Start by choosing a pair of people who are very close to each other (such as 15 & 16) and now treat that pair as one point, with a value equal to the mean position of the two nodes.

Cluster analysis Now repeat that process for as long as possible.

Cluster analysis This process is captured in the cluster tree (called a dendrogram)
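The merge-and-repeat procedure on the preceding slides can be sketched in Python as follows. This is an illustrative average-distance (average linkage) version, not what SAS does internally. The five points and the function names are hypothetical. The recorded merge list is the information a dendrogram draws.

```python
from itertools import combinations
from math import dist

def avg_dist(points, a, b):
    # Mean Euclidean distance over all cross-cluster pairs of members.
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def average_linkage(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members of a, members of b, merge distance)
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: avg_dist(points, *pair))
        merges.append((sorted(a), sorted(b), avg_dist(points, a, b)))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters, merges

# Hypothetical data: three points near the origin, two far away.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters, merges = average_linkage(points, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

This brute-force version rescans every pair on each pass, so it is only suitable for small examples.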

Cluster analysis
As with the network cluster algorithms, there are many options for clustering. The three that I use most are:
  Ward’s Minimum Variance -- the one I use almost 95% of the time
  Average Distance -- the one used in the example above
  Median Distance -- very similar
Again, the SAS manual is the best single place I’ve found for information on each of these techniques.
Some things to keep in mind:
  Units matter. The example above draws together pairs horizontally because the range there is smaller. Get around this by standardizing your data.
  This is an inductive technique. You can find clusters in a purely random distribution of points. Consider the following example.

Cluster analysis
The data in this scatter plot are produced using this code:
data random;
  do i=1 to 20;
    x=rannor(0);
    y=rannor(0);
    output;
  end;
run;

Resulting dendrogram

Cluster analysis Resulting cluster solution

Cluster analysis
Cluster analysis works by building a distance matrix between each pair of points. In the example above, it used the Euclidean distance, which in two dimensions is simply the physical distance between the points in a plot. It can work on any number of dimensions. To use cluster analysis in a network, we base the distance on the path distance between pairs of people in the network. Consider again the blue-eye hip example:

Cluster analysis Distance Matrix
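A path-distance matrix like this one can be built from an adjacency matrix with a breadth-first search from each node. The Python sketch below is an assumption about the general technique, not a port of any SAS module. The four-node chain is hypothetical, and marking unreachable pairs as None is a choice made here.

```python
from collections import deque

def path_distances(adj):
    """Geodesic (shortest-path) distance between every pair of nodes."""
    n = len(adj)
    dmat = [[None] * n for _ in range(n)]  # None = not reachable
    for source in range(n):
        dmat[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                # First visit in BFS order gives the shortest path length.
                if adj[u][v] and dmat[source][v] is None:
                    dmat[source][v] = dmat[source][u] + 1
                    queue.append(v)
    return dmat

# Hypothetical chain 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
for row in path_distances(chain):
    print(row)
```

Before feeding such a matrix to a clustering routine, unreachable pairs need an explicit numeric value (for example, something larger than the longest observed path); which value to use is a modeling decision.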

Cluster analysis
The distance matrix implies a space that nodes are embedded within. Using something like MDS, we can represent the space implied by the distance matrix in two dimensions. This is the image of the network you would get if you did that.

When you use variables, the cluster analysis program generates a distance matrix. We can instead use the network distance matrix directly. If we do that with this example network, we get the following:

Cluster analysis

In SAS you use two commands to get a cluster analysis. The first does the hierarchical clustering. The second analyzes the cluster output to create the tree.
Example 1. Using variables to define the space (like income and musical taste):
proc cluster data=a method=ave out=clustd std;
  var x y;
  id node;
run;
proc tree data=clustd ncl=5 out=cluvars;
run;
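The std option in the proc cluster call above standardizes the variables before clustering, which is the "units matter" advice from earlier. As a rough Python equivalent of that step (an illustrative sketch with made-up data; the function name `standardize` is hypothetical), each variable is rescaled to mean 0 and standard deviation 1:

```python
from statistics import mean, stdev

def standardize(points):
    """Rescale each coordinate to mean 0 and (sample) standard deviation 1."""
    columns = list(zip(*points))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, centers, scales))
            for p in points]

# Hypothetical data: x spans a few units, y spans hundreds.
raw = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0), (4.0, 400.0)]
z = standardize(raw)
for point in z:
    print(point)
```

After this step a one-unit difference means "one standard deviation" on every axis, so no variable dominates the distance calculation just because of its scale.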

Cluster analysis
Example 2. Using a pre-defined distance matrix to define the space (as in a social network). You first create the distance matrix (in IML), then use it in the cluster program.
proc iml;
  %include 'c:\moody\sas\programs\modules\reach.mod';
  /* blue eye example */
  mat2=j(15,15,0);
  mat2[1,{ }]=1;
  /* lines cut here */
  mat2[15,{ }]=1;
  dmat=reach(mat2);
  mattrib dmat format=1.0;
  print dmat;
  id=1:nrow(dmat);
  id=id`;
  ddat=id||dmat;
  create ddat from ddat;   /* creates the dataset */
  append from ddat;
quit;
data ddat (type=dist);     /* tells SAS it is a distance matrix */
  set ddat;
run;

Cluster analysis
Example 2. Using a pre-defined distance matrix to define the space (as in a social network). Once you have it, the cluster program is just the same.
proc cluster data=ddat method=ward out=clustd;
  id col1;
run;
proc tree data=clustd ncl=3 out=netclust;
  copy col1;
run;
proc freq data=netclust;
  tables cluster;
run;
proc print data=netclust;
  var col1 cluster;
run;

The CROWDS algorithm combines the density approach above with an initial cluster analysis and a routine for determining how many clusters are in the network. It does so by using the Segregation index and all of the information from the cluster hierarchy, combining two groups only if it improves the segregation fit for both groups.

The one other program you should know about is NEGOPY. NEGOPY is a program that combines elements of the density-based approach and the graph-theoretic approach to find groups and positions. Like CROWDS, NEGOPY assigns people both to groups and to ‘outsider’ or ‘between’ group positions. It also tells you how many groups are in the network. It’s a DOS-based program, and a little clunky to use, but NEGWRITE.MOD will translate your data into NEGOPY format if you want to use it. There are many other approaches. If you’re interested in some specifically designed for very large networks (10,000+ nodes), I’ve developed something I call Recursive Neighborhood Means that seems to work fairly well.