
1 Lecture 10 Measures and Metrics

2 Network Growth Patterns
1. Network Segmentation
2. Graph Densification
3. Diameter Shrinkage

3 1. Network Segmentation
Often, in evolving networks, segmentation takes place: the large network is decomposed over time into three parts.
Giant component: as network connections stabilize, a giant component forms, containing a large proportion of the network's nodes and edges.
Stars: isolated parts of the network that form star structures. A star is a tree with one internal node and n leaves.
Singletons: orphan nodes disconnected from all other nodes in the network.

4 2. Graph Densification
Networks densify over time: the number of edges grows superlinearly in the number of nodes, following $E(t) \propto V(t)^a$ with exponent $1 < a < 2$ (Leskovec et al., KDD 2005).

5 Densification in Real Networks
[Figure: log-log plots of edge count versus node count V(t) for physics citations (exponent 1.69) and patent citations (exponent 1.66). Source: Leskovec et al., KDD 2005.]

6 3. Diameter Shrinking
In many real networks, the diameter shrinks over time; examples include the arXiv citation graph and the affiliation network.

7 Community Evolution Communities also expand, shrink, or dissolve in dynamic networks

8 Evaluating the Communities
Evaluation with ground truth
Evaluation without ground truth

9 Evaluation with Ground Truth
When ground truth is available, we have at least partial knowledge of what communities should look like: we are given the correct community (clustering) assignments.
Measures: precision and recall (or F-measure), purity, and normalized mutual information (NMI).
In the Venn-diagram view: TP is the intersection of the two oval shapes; TN is the rectangle minus the two oval shapes; FP is the circle minus the blue part; FN is the blue oval minus the circle.
Accuracy = (TP+TN)/(TP+TN+FP+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN)

10 Precision and Recall
True Positive (TP): similar members are assigned to the same community (a correct decision).
True Negative (TN): dissimilar members are assigned to different communities (a correct decision).
False Negative (FN): similar members are assigned to different communities (an incorrect decision).
False Positive (FP): dissimilar members are assigned to the same community (an incorrect decision).

11 Precision and Recall: Example
TP + FP = C(6,2) + C(8,2) = 15 + 28 = 43. FP counts the dissimilar pairs placed within the same cluster; FN counts the similar pairs wrongly put into different clusters; TN counts dissimilar pairs placed in different clusters.
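To make the pairwise counting concrete, here is a minimal sketch (not from the slides; the function name pairwise_prf is made up) that computes pairwise precision, recall, and F-measure from a found assignment and ground-truth labels:

```python
# A minimal sketch: pairwise precision/recall/F-measure for a community
# assignment against ground-truth labels.
from itertools import combinations

def pairwise_prf(found, truth):
    """found, truth: lists giving the community / true label of each node."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(found)), 2):
        same_found = found[i] == found[j]
        same_truth = truth[i] == truth[j]
        if same_found and same_truth:
            tp += 1
        elif same_found and not same_truth:
            fp += 1
        elif not same_found and same_truth:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```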

12 F-Measure
The F-measure combines precision and recall as their harmonic mean: $F = \frac{2 \cdot P \cdot R}{P + R}$.

13 Purity
We can assume the majority label in a community represents that community, and compare the majority label against the label of each member to evaluate the community.
Purity: the fraction of instances whose label equals their community's majority label; that is, Purity = (1/N) × Σ over communities of (the number of members carrying that community's majority label), where N is the total number of data points.
Purity can easily be manipulated by making every point a singleton community (of size 1), or by forming very large communities.
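A minimal sketch of the purity computation described above (not from the slides; the helper name purity is made up):

```python
# A minimal sketch: purity of found communities against ground-truth labels.
from collections import Counter

def purity(found, truth):
    """found, truth: lists giving the community / true label of each node."""
    total = 0
    for community in set(found):
        members = [truth[i] for i in range(len(found)) if found[i] == community]
        total += Counter(members).most_common(1)[0][1]  # size of the majority label
    return total / len(found)
```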

14 Mutual Information
Mutual information (MI): the amount of information that two random variables share. It measures how much the uncertainty about one variable is reduced by knowing the other.

15 Normalizing Mutual Information (NMI)

16 Normalized Mutual Information

17 Normalized Mutual Information
NMI values close to one indicate high similarity between communities found and labels Values close to zero indicate high dissimilarity between them

18 Normalized Mutual Information: Example
Found communities (H): [1,1,1,1,1,1, 2,2,2,2,2,2,2,2]
Actual labels (L): [2,1,1,1,1,1, 2,2,2,2,2,2,1,1]
Counts: n_h: n_1 = 6, n_2 = 8; n_l: n_1 = 7, n_2 = 7; joint counts n_{h,l}: (h=1, l=1) = 5, (h=1, l=2) = 1, (h=2, l=1) = 2, (h=2, l=2) = 6.
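A minimal sketch of NMI for this example (not from the slides). It normalizes MI by the geometric mean of the two entropies, so the exact value may differ if the slides use another normalization:

```python
# A minimal sketch: NMI between found communities H and actual labels L,
# normalizing MI by sqrt(H(H) * H(L)).
import math
from collections import Counter

def nmi(found, truth):
    n = len(found)
    ph = Counter(found)                  # n_h counts
    pl = Counter(truth)                  # n_l counts
    phl = Counter(zip(found, truth))     # n_{h,l} joint counts
    mi = sum(c / n * math.log((c / n) / ((ph[h] / n) * (pl[l] / n)))
             for (h, l), c in phl.items())
    hh = -sum(c / n * math.log(c / n) for c in ph.values())
    hl = -sum(c / n * math.log(c / n) for c in pl.values())
    return mi / math.sqrt(hh * hl)

found = [1] * 6 + [2] * 8
truth = [2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
print(nmi(found, truth))   # NMI for the example above
```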

19 Evaluation without Ground Truth
Evaluation with semantics: a simple way of analyzing detected communities is to examine other attributes of community members (posts, profile information, generated content, etc.) to see whether the members are coherent.
Coherency is often checked via human subjects, either directly or through labor markets such as Amazon Mechanical Turk.
To help analyze communities, one can use word frequencies: generate a list of frequent keywords for each community and let human subjects judge whether the keywords represent a coherent topic.
Evaluation using clustering quality measures: use a clustering quality measure such as the sum of squared errors (SSE), run two or more community detection algorithms, compare the results, and pick the algorithm with the better quality measure.

20 Cocitation and Bibliographic coupling
Cocitation of two vertices $i$ and $j$ is the number of vertices that have outgoing edges to both: $C_{ij} = \sum_{k=1}^{n} A_{ik} A_{jk} = \sum_{k=1}^{n} A_{ik} A^{T}_{kj}$, so $C = A A^{T}$.
Bibliographic coupling of $i$ and $j$ is the number of vertices to which both point: $B_{ij} = \sum_{k=1}^{n} A_{ki} A_{kj} = \sum_{k=1}^{n} A^{T}_{ik} A_{kj}$, so $B = A^{T} A$.
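A small illustration (not from the slides) of the matrix forms C = A Aᵀ and B = Aᵀ A with numpy; the example graph is made up, and the convention A[i, k] = 1 for an edge k → i is assumed to match the citation convention above:

```python
# A minimal sketch: cocitation and bibliographic coupling from a directed
# adjacency matrix A, where A[i, k] = 1 means an edge k -> i.
import numpy as np

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])   # vertex 2 points to vertices 0 and 1

C = A @ A.T    # C[i, j]: number of vertices pointing to both i and j
B = A.T @ A    # B[i, j]: number of vertices that both i and j point to
print(C[0, 1], B[0, 1])   # cocitation of 0 and 1 is 1; their coupling is 0
```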

21 Edge independent paths: if they share no common edge
Vertex-independent paths: if they share no common vertex except the start and end vertices.
Vertex-independent implies edge-independent.
Independent paths are also called disjoint paths; these sets of paths are not necessarily unique.
Connectivity of a pair of vertices: the maximal number of independent paths between them.
Used to identify bottlenecks and resilience to failures.

22 Cut Sets and Maximum Flow
A minimum cut set is the smallest cut set that will disconnect a specified pair of vertices; it need not be unique.
Menger's theorem: if there is no cut set of size less than n between a pair of vertices, then there are at least n independent paths between the same vertices.
This implies that the size of the minimum cut set equals the maximum number of independent paths, for both edge and vertex independence.
The maximum flow between a pair of vertices is the number of edge-independent paths times the edge capacity (assuming all edges have the same capacity); with unit capacities it is simply the number of edge-independent paths.
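A minimal sketch (not from the slides) using networkx, assumed available, that computes the minimum cut-set size and the number of vertex-independent paths between a pair of vertices; by Menger's theorem these equal the corresponding connectivities:

```python
# A minimal sketch: local edge and vertex connectivity between two vertices.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("s", "a"), ("s", "b"), ("a", "t"), ("b", "t"), ("a", "b")])

print(nx.edge_connectivity(G, "s", "t"))   # size of the minimum s-t edge cut set
print(nx.node_connectivity(G, "s", "t"))   # number of vertex-independent s-t paths
```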

23

24 Transitivity
A relation $\circ$ is said to be transitive if $a \circ b$ and $b \circ c$ together imply $a \circ c$.
Perfect transitivity in a network → cliques.
Partial transitivity: if $u$ knows $v$ and $v$ knows $w$, then $u$ is likely to know $w$.
$C = \frac{\text{number of closed paths of length two}}{\text{number of paths of length two}} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$

25 Clustering Coefficient and Triples
Triple: an ordered set of three nodes connected by two edges (open triple) or three edges (closed triple).
A triangle can miss any one of its three edges, so a triangle contains 3 triples.
$v_i v_j v_k$ and $v_j v_k v_i$ are different triples: they have the same members, but the first misses edge $e(v_k, v_i)$ and the second misses edge $e(v_i, v_j)$.
$v_i v_j v_k$ and $v_k v_j v_i$ are the same triple.

26 [Global] Clustering Coefficient
The clustering coefficient measures transitivity in undirected graphs: count paths of length two and check whether the third edge exists.
Since every triangle contains 6 closed paths of length two, $C = \frac{6 \times \text{number of triangles}}{\text{number of paths of length two}}$; equivalently, $C = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}$.
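A quick check of the global clustering coefficient with networkx (assumed available); the example graph is made up:

```python
# A minimal sketch: global clustering coefficient (transitivity) of a small graph.
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])  # one triangle plus a pendant edge
# transitivity = 3 * (#triangles) / (#connected triples)
print(nx.transitivity(G))   # 3 * 1 / 5 = 0.6
```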

27 [Global] Clustering Coefficient: Example

28 Local Clustering Coefficient
The local clustering coefficient measures transitivity at the node level; it is commonly employed for undirected graphs.
It computes how strongly the neighbors of a node $v$ (nodes adjacent to $v$) are themselves connected: the number of connected pairs of neighbors divided by the number of pairs of neighbors.
In an undirected graph, the denominator can be rewritten as $k_v (k_v - 1)/2$, where $k_v$ is the degree of $v$.

29 Local Clustering Coefficient: Example
Thin lines depict connections to neighbors; dashed lines are the missing connections among neighbors; solid lines indicate connected neighbors.
When all neighbors are connected, C = 1; when none of the neighbors are connected, C = 0.
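A minimal sketch (not from the slides) that computes the local clustering coefficient directly from the definition; the helper local_clustering and the example graph are made up:

```python
# A minimal sketch: local clustering coefficient of each node.
import networkx as nx
from itertools import combinations

def local_clustering(G, v):
    neighbors = list(G[v])
    k = len(neighbors)
    if k < 2:
        return 0.0
    connected_pairs = sum(1 for a, b in combinations(neighbors, 2) if G.has_edge(a, b))
    return connected_pairs / (k * (k - 1) / 2)

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4)])
print({v: local_clustering(G, v) for v in G})   # agrees with nx.clustering(G)
```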

30 Structural Metrics: Clustering coefficient

31 Local Clustering and Redundancy
$C_i = \frac{\text{number of connected pairs of neighbors of } i}{\text{number of pairs of neighbors of } i}$, and the average local clustering coefficient is $C_{WS} = \frac{1}{n} \sum_{i=1}^{n} C_i$.
Redundancy $R_i$: $C_i = \frac{R_i}{k_i - 1}$, i.e., $R_i = C_i (k_i - 1)$.

32 Reciprocity
How likely is it that a node you point to points back to you?
$r = \frac{1}{m} \sum_{ij} A_{ij} A_{ji} = \frac{1}{m} \mathrm{Tr}(A^{2})$

33 Reciprocity
"If you become my friend, I'll be yours."
Reciprocity is a simplified version of transitivity: it considers closed loops of length 2.
If node $v$ is connected to node $u$, then $u$ exhibits reciprocity by connecting back to $v$.

34 Reciprocity: Example Reciprocal nodes: 𝑣1, 𝑣2
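A minimal sketch (not from the slides) that evaluates r = Tr(A²)/m for a small directed graph and compares it with networkx's built-in reciprocity (networkx assumed available):

```python
# A minimal sketch: reciprocity of a directed graph as (1/m) * Tr(A^2).
import networkx as nx
import numpy as np

G = nx.DiGraph([(1, 2), (2, 1), (2, 3)])   # v1 and v2 reciprocate; (2, 3) does not
A = nx.to_numpy_array(G)
m = G.number_of_edges()
print(np.trace(A @ A) / m)     # 2 reciprocated directed edges out of 3
print(nx.reciprocity(G))       # same value from networkx
```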

35 Signed Edges and Structural balance
Edges are signed: positive for friends, negative for enemies.
The friend of my friend → my friend; the enemy of my enemy → my friend.
Structural balance: every loop contains an even number of negative links.
A structurally balanced network can be partitioned into groups in which internal links are positive and links between groups are negative.

36 Social Balance Theory
Social balance theory concerns consistency in friend/foe relationships among individuals: informally, relationships are consistent when a friend's friend is a friend and an enemy's enemy is a friend.
In the network, positive edges denote friendships ($w_{ij} = 1$) and negative edges denote enmity ($w_{ij} = -1$), where $w_{ij}$ is the value of the edge between nodes $i$ and $j$.
A triangle of nodes $i$, $j$, and $k$ is balanced if and only if $w_{ij}\, w_{jk}\, w_{ki} \geq 0$.

37 Social Balance Theory: Possible Combinations
For any cycle, if the product of the edge values is positive, then the cycle is socially balanced.
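A minimal sketch (not from the slides): a cycle is balanced exactly when the product of its edge signs is positive; the helper is_balanced is made up:

```python
# A minimal sketch: check whether a signed cycle is balanced by multiplying
# its edge signs (+1 friend, -1 foe).
from math import prod

def is_balanced(cycle_signs):
    """cycle_signs: list of +1/-1 edge values around a closed cycle."""
    return prod(cycle_signs) > 0

print(is_balanced([+1, +1, +1]))   # all friends: balanced
print(is_balanced([+1, -1, -1]))   # two foes: balanced
print(is_balanced([+1, +1, -1]))   # one foe: unbalanced
```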

38 Similarity
Structural equivalence: two nodes are similar if they share many of the same neighbors.
Jaccard similarity: $\sigma_{ij} = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|}$
Cosine similarity: $\sigma_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{k_i k_j}}$
Pearson coefficient ($r_{ij}$): given the degrees of two nodes, compares the number of common neighbors they have with the number expected by chance.
Euclidean distance: $d_{ij} = \sum_k (A_{ik} - A_{jk})^2$
Regular equivalence: two nodes are similar if their neighbors are themselves similar.
Katz similarity: $\sigma_{ij} = \alpha \sum_{kl} A_{ik} A_{jl} \sigma_{kl}$; a simpler recursive form is $\boldsymbol{\sigma} = \alpha A \boldsymbol{\sigma} + I$.
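A minimal sketch (not from the slides) of the Jaccard and cosine structural-equivalence similarities, computed from neighbor sets with networkx (assumed available); the helper names and example graph are made up:

```python
# A minimal sketch: structural-equivalence similarities between two nodes.
import math
import networkx as nx

def jaccard(G, i, j):
    ni, nj = set(G[i]), set(G[j])
    return len(ni & nj) / len(ni | nj)

def cosine(G, i, j):
    ni, nj = set(G[i]), set(G[j])
    return len(ni & nj) / math.sqrt(len(ni) * len(nj))

G = nx.Graph([(1, 3), (1, 4), (2, 3), (2, 4), (2, 5)])
print(jaccard(G, 1, 2), cosine(G, 1, 2))   # nodes 1 and 2 share neighbors 3 and 4
```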

39 Homophily and Assortative Mixing
Assortativity: the tendency to be linked with nodes that are similar in some way.
Humans: age, race, nationality, language, income, education level, etc.
Citations: papers tend to cite papers from similar fields.
Web pages: language.
Disassortativity: the tendency to be linked with nodes that are different in some way.
Network providers: end users vs. other providers.
Assortative mixing can be based on an enumerative characteristic or a scalar characteristic.

40 Assortativity: An Example
The friendship network in a US high school in 1994.
Colors represent races (white: whites; grey: blacks; light grey: Hispanics; black: others).
High assortativity between individuals of the same race.

41 Assortativity Significance
Assortativity significance is the difference between the measured assortativity and the expected assortativity; the higher this difference, the more significant the observed assortativity.
Example: in a school where half the population is white and the other half is Hispanic, we expect 50% of the connections to be between members of different races. If all connections are between members of different races, we have a significant finding.

42 Modularity (enumerative)
Modularity measures the extent to which nodes are connected to like nodes in a network: it is positive if there are more edges between nodes of the same type than the expected value, and negative otherwise.
$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$, where $\delta(c_i, c_j)$ is 1 if $c_i$ and $c_j$ are of the same type and 0 otherwise.
Equivalently, $Q = \sum_{r} (e_{rr} - a_r^2)$, where $e_{rr}$ is the fraction of edges that join vertices of type $r$ to vertices of type $r$, and $a_r$ is the fraction of ends of edges attached to vertices of type $r$.
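A minimal sketch (not from the slides) that evaluates the first formula for Q directly on a small two-community graph; the helper modularity and the example graph are made up:

```python
# A minimal sketch: modularity Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j).
import networkx as nx

def modularity(G, labels):
    """labels: dict mapping each node to its type/community."""
    nodes = list(G)
    A = nx.to_numpy_array(G, nodelist=nodes)
    k = A.sum(axis=1)
    two_m = k.sum()
    Q = 0.0
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if labels[u] == labels[v]:
                Q += A[i, j] - k[i] * k[j] / two_m
    return Q / two_m

G = nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4)])
labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
print(modularity(G, labels))   # two triangles joined by one edge: Q = 5/14
```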

43 Assortative coefficient (enumerative)
Modularity is almost always less than 1, so we can normalize it by its maximum value $Q_{max}$:
$r = \frac{Q}{Q_{max}} = \frac{\sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)}{2m - \sum_{ij} \frac{k_i k_j}{2m} \delta(c_i, c_j)}$

44 Assortative coefficient (scalar)
$r = \frac{\sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) x_i x_j}{\sum_{ij} \left( k_i \delta_{ij} - \frac{k_i k_j}{2m} \right) x_i x_j}$
$r = 1$: perfectly assortative; $r = -1$: perfectly disassortative; $r = 0$: non-assortative.
Node degree is usually used as the scalar attribute:
$r = \frac{\sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) k_i k_j}{\sum_{ij} \left( k_i \delta_{ij} - \frac{k_i k_j}{2m} \right) k_i k_j}$
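A quick illustration (not from the slides) of the scalar coefficient with node degree as the attribute, using networkx's degree_assortativity_coefficient (networkx assumed available):

```python
# A minimal sketch: degree assortativity (scalar assortative coefficient with x_i = k_i).
import networkx as nx

print(nx.degree_assortativity_coefficient(nx.star_graph(5)))       # hub links only to leaves: r close to -1
print(nx.degree_assortativity_coefficient(nx.barbell_graph(5, 2))) # two cliques joined by a path
```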

45 Modularity: Matrix Form
Let $\Delta \in \{0,1\}^{n \times k}$ denote the indicator (membership) matrix, where $k$ is the number of types and $\Delta_{ir} = 1$ exactly when node $i$ is of type $r$.
The Kronecker delta function can be reformulated using the indicator matrix: $\delta(c_i, c_j) = \sum_{r} \Delta_{ir} \Delta_{jr}$.
Therefore, $Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \sum_{r} \Delta_{ir} \Delta_{jr}$.

46 Normalized Modularity: Matrix Form
Let the modularity matrix be $B = A - \frac{d\, d^{T}}{2m}$, where $d \in \mathbb{R}^{n \times 1}$ is the degree vector.
Modularity can then be reformulated as $Q = \frac{1}{2m} \mathrm{Tr}(\Delta^{T} B \Delta)$.
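A minimal sketch (not from the slides) of the matrix form, reusing the example graph and partition from the earlier modularity sketch; the indicator-matrix name Delta follows the reconstruction above:

```python
# A minimal sketch: Q = (1/2m) * Tr(Delta^T B Delta) with B = A - d d^T / 2m.
import networkx as nx
import numpy as np

G = nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4)])
nodes = list(G)
A = nx.to_numpy_array(G, nodelist=nodes)
d = A.sum(axis=1, keepdims=True)            # degree vector (n x 1)
two_m = d.sum()
B = A - (d @ d.T) / two_m                   # modularity matrix

# indicator matrix: column 0 = community {1, 2, 3}, column 1 = community {4, 5, 6}
Delta = np.array([[1, 0]] * 3 + [[0, 1]] * 3)
Q = np.trace(Delta.T @ B @ Delta) / two_m
print(Q)                                    # same value as the double-sum form
```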

47 Modularity Example The number of edges between nodes of the same color is less than the expected number of edges between them

48 Assortativity Coefficient of Various Networks
M.E.J. Newman. Assortative mixing in networks

49 Measuring Assortativity for Ordinal Attributes
A common measure for analyzing the relationship between ordinal values is covariance, which describes how two variables change together.
In our case we have a network, and we are interested in how the values assigned to nodes that are connected (via edges) are correlated.

50 Covariance Variables
The value assigned to node $v_i$ is $x_i$. We construct two variables, $X_L$ and $X_R$: for any edge $(v_i, v_j)$, we assume that $x_i$ is observed from variable $X_L$ and $x_j$ is observed from variable $X_R$.
$X_L$ represents the ordinal values associated with the left node (the first node) of each edge, and $X_R$ represents the values associated with the right node (the second node).
We need to compute the covariance between variables $X_L$ and $X_R$.

51 Covariance Variables: Example
Node values: $x_A = 18$, $x_B = 20$, $x_C = 21$; each undirected edge is listed in both directions.
List of edges: (A, C), (C, A), (C, B), (B, C)
$X_L$: (18, 21, 21, 20)
$X_R$: (21, 18, 20, 21)

52 Covariance
For two given column variables $X_L$ and $X_R$, the covariance is $\mathrm{Cov}(X_L, X_R) = E(X_L X_R) - E(X_L)E(X_R)$, where $E(X_L)$ is the mean of the variable and $E(X_L X_R)$ is the mean of the element-wise product of $X_L$ and $X_R$.

53 Covariance

54 Normalizing Covariance
The Pearson correlation $\rho(X, Y)$ is the normalized version of covariance. In our case: $\rho(X_L, X_R) = \frac{\mathrm{Cov}(X_L, X_R)}{\sigma(X_L)\,\sigma(X_R)}$, where $\sigma(X) = \sqrt{E\left[ (X - E(X))^2 \right]}$.
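A minimal sketch (not from the slides) that computes the covariance and Pearson correlation for the X_L and X_R values from the example above:

```python
# A minimal sketch: covariance and Pearson correlation of X_L and X_R.
import numpy as np

XL = np.array([18, 21, 21, 20], dtype=float)
XR = np.array([21, 18, 20, 21], dtype=float)

cov = (XL * XR).mean() - XL.mean() * XR.mean()        # E(XL XR) - E(XL) E(XR)
rho = cov / (XL.std() * XR.std())                     # normalize by standard deviations
print(cov, rho)
print(np.corrcoef(XL, XR)[0, 1])                      # same correlation via numpy
```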

55 Correlation Example


Download ppt "Lecture 10 Measures and Metrics."

Similar presentations


Ads by Google