Mining Collaboration Patterns from a Large Developer Network (PP-DSN) MF1432031 李玉
Mining Collaboration Patterns
- The Internet and communication devices connect developers from diverse locations.
- Software development is globally distributed and driven by common interests.
- Mining collaboration patterns helps us learn about and aid collaboration: its dynamics, properties, and weaknesses.
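Mining Collaboration Patterns
- Goal: extract patterns.
- High level of detail: network-level statistics.
- Low level of detail: topological patterns.
- Data source: SourceForge.net (the world's largest open-source software development platform and repository, providing open-source software with a platform for storage, collaboration, and release).
- The network is a large graph: node → developer, edge → collaboration.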
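The slide only states that nodes are developers and edges are collaborations. As a minimal sketch, the code below assumes that two developers collaborate when they are listed as contributors to the same project; this rule, the `project_members` mapping, and all names in it are illustrative assumptions rather than the deck's actual data format.

```python
from itertools import combinations


def build_collaboration_graph(project_members):
    """Return an undirected graph as {developer: set of neighbors}.

    Assumption: an edge links two developers who share a project.
    """
    graph = {}
    for members in project_members.values():
        unique = sorted(set(members))
        # Register every developer, even those who work alone.
        for dev in unique:
            graph.setdefault(dev, set())
        # Link every pair of developers on the same project.
        for u, v in combinations(unique, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical input: project name -> list of contributors.
projects = {
    "proj_a": ["alice", "bob", "carol"],
    "proj_b": ["carol", "dave"],
    "proj_c": ["erin"],
}
graph = build_collaboration_graph(projects)
```

Storing the graph as a dictionary of neighbor sets keeps it undirected by construction and leaves solo developers as isolated nodes.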
Mining Collaboration Patterns
- Q1: How connected are the developers? Is every developer connected to every other developer in the network? If not, how many clusters of connected developers are there?
- Q2: What are some characteristics of developer collaboration clusters?
- Q3: What are some common topological collaboration patterns appearing in these clusters?
- Q4: Within a connected collaboration cluster, is every developer connected to every other developer within 6 hops, following the small-world phenomenon?
Definitions and Concepts
- Collaboration graph G = (N, E, NL): an undirected graph, where N is the set of nodes (developers), E is the set of edges [u, v], and NL is the set of node labels.
- Collaboration pattern P = (N, E).
- Definition 1 (Sub-Graph Isomorphism): Consider two graphs G = (N, E, NL) and G' = (N', E', NL'). A sub-graph isomorphism is an injective function f : N → N' such that (1) ∀n ∈ N, l(n) = l'(f(n)); and (2) ∀(u, v) ∈ E, (f(u), f(v)) ∈ E'. Here l and l' denote the node-labeling functions of G and G'. The function (mapping) f is referred to as the embedding of G in G'.
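Definition 1 can be checked directly on small graphs. The following is a brute-force sketch that tries every injective assignment; it is exponential in the pattern size and is meant only to make the two conditions concrete, not to stand in for a real graph-matching algorithm. The function name, the argument layout, and the labels "x" and "y" are assumptions made for the example.

```python
from itertools import permutations


def find_embedding(g_nodes, g_edges, g_labels, h_nodes, h_edges, h_labels):
    """Return an embedding f of G in H as a dict, or None if none exists."""
    # Undirected edges are compared as unordered pairs.
    h_edge_set = {frozenset(e) for e in h_edges}
    g_nodes = list(g_nodes)
    # Each permutation is one injective candidate mapping N -> N'.
    for image in permutations(h_nodes, len(g_nodes)):
        f = dict(zip(g_nodes, image))
        # Condition (1): labels are preserved.
        if any(g_labels[n] != h_labels[f[n]] for n in g_nodes):
            continue
        # Condition (2): every edge of G maps onto an edge of H.
        if all(frozenset((f[u], f[v])) in h_edge_set for u, v in g_edges):
            return f
    return None


# H is a labeled path a - b - c.
h_nodes = ["a", "b", "c"]
h_edges = [("a", "b"), ("b", "c")]
h_labels = {"a": "x", "b": "x", "c": "y"}

# G is a single edge whose endpoints are labeled x and y.
edge_embedding = find_embedding(
    [1, 2], [(1, 2)], {1: "x", 2: "y"}, h_nodes, h_edges, h_labels
)

# A triangle cannot be embedded in a path.
triangle_embedding = find_embedding(
    [1, 2, 3], [(1, 2), (2, 3), (1, 3)], {1: "x", 2: "x", 3: "y"},
    h_nodes, h_edges, h_labels,
)
```

Here the edge embeds through nodes b and c, while the triangle has no embedding because the path lacks the edge between a and c.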
Definitions and Concepts
- Definition 2 (Pattern Match): A pattern P = (N, E) matches a graph G = (N', E', NL') if there exists an injective function f : N → N' such that ∀(u, v) ∈ E, (f(u), f(v)) ∈ E'.
- Definition 3 (Frequent Pattern Mining): Given a graph dataset GSet and a threshold msup, find all patterns that appear in more than msup graphs in GSet. (If P is frequent, every sub-graph P' of P is also frequent.)
- Definition 4 (Closed Graphs): Given a set of graphs GSet, a graph pattern g is closed if there does not exist another pattern g' such that g' is a super-graph of g and both g and g' match the same set of graphs in GSet. If such a g' exists, we say that g is subsumed by g'.
Definitions and Concepts
- Definition 5 (Frequent Closed Pattern Mining): Given a graph dataset GSet and a threshold msup, find all patterns that are frequent and closed.
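The definitions of pattern match, frequent pattern, and closed pattern can be combined in a small sketch. It assumes the candidate patterns are already given and checks closedness only against those candidates, whereas a real miner enumerates the patterns itself. The brute-force matcher, the pattern names, and the three toy graphs are assumptions made for illustration.

```python
from itertools import permutations


def matches(pattern, graph):
    """Pattern match: an injective, edge-preserving map exists (no labels)."""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges):
            return True
    return False


def is_proper_supergraph(big, small):
    """True if `small` embeds in `big` and `big` has more nodes or edges."""
    same_size = (len(big[0]), len(big[1])) == (len(small[0]), len(small[1]))
    return matches(small, big) and not same_size


def frequent_closed(candidates, gset, msup):
    """Names of candidates that are frequent (support > msup) and closed."""
    support = {
        name: frozenset(i for i, g in enumerate(gset) if matches(p, g))
        for name, p in candidates.items()
    }
    result = []
    for name, pattern in candidates.items():
        if len(support[name]) <= msup:
            continue  # not frequent
        subsumed = any(
            other != name
            and support[other] == support[name]
            and is_proper_supergraph(candidates[other], pattern)
            for other in candidates
        )
        if not subsumed:
            result.append(name)
    return sorted(result)


# Each pattern or graph is (node list, edge list).
candidates = {
    "edge": (["x", "y"], [("x", "y")]),
    "path3": (["x", "y", "z"], [("x", "y"), ("y", "z")]),
    "triangle": (["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")]),
}
gset = [
    (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]),
    (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]),
    (["a", "b", "c"], [("a", "b"), ("b", "c")]),
]
closed_patterns = frequent_closed(candidates, gset, msup=1)
```

In this toy dataset the single edge is frequent but not closed, because the three-node path matches exactly the same graphs and is a super-graph of it.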
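Approach
- Build the collaboration network.
- Extract clusters as connected components (CC) and store them in a graph database (CGD).
- Top-k topological pattern mining (graph mining and graph matching):
  - Split CGD into CGDl (large graphs) and CGDs (small graphs).
  - Mine the frequent closed patterns in CGDs, which gives sup(P, CGDs).
  - Match each pattern P against the graphs in CGDl, which gives sup(P, CGDl).
  - sup(P, CGD) = sup(P, CGDs) + sup(P, CGDl)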
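The first steps of the approach, extracting connected components and separating large from small ones, can be sketched as follows. The split criterion used here (node count against a `max_small_nodes` threshold) and the toy graph are assumptions for illustration only; the deck does not spell out the rule at this point.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of {node: set of neighbors} as sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        members = set()
        while queue:
            node = queue.popleft()
            members.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(members)
    return components


def split_by_size(components, max_small_nodes):
    """Split components into (large, small) using a hypothetical node threshold."""
    large = [c for c in components if len(c) > max_small_nodes]
    small = [c for c in components if len(c) <= max_small_nodes]
    return large, small


# Toy network: one chain of three developers, one pair, one solo developer.
network = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
components = connected_components(network)
cgd_large, cgd_small = split_by_size(components, max_small_nodes=2)
```

Because every component lands in exactly one of the two lists, the number of graphs a pattern matches overall is the sum of its matches in each list, which is what the support formula expresses.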
Experiment
- Initial data:
  - Inactive projects filtered out (criteria on the slide: no developers, >= 100 downloads): 192,706 → 28,087.
  - Contributor information is used, not SVN/CVS committer identifiers.
- How connected are the developers? (55,694 developers)
  - The network is not a single connected graph; it has many disjoint components (6,744).
  - The share of developers who work alone is very low: 1.5% (838).
  - There is one very large cluster of 30,111 developers (54.07%), the core community.
  - The other clusters are much smaller (the second largest has 117).
Experiment
- What are some characteristics of developer collaboration clusters?
- Power law (Matthew effect): the number of links a node has multiplied by the number of such nodes is a constant (rich-get-richer).
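One rough way to check for power-law behavior is to fit a straight line to the frequency distribution on a log-log scale. The sketch below uses an ordinary least-squares slope, which is a crude diagnostic rather than a rigorous fit; the function name and the synthetic cluster sizes are assumptions chosen so that size times frequency is constant.

```python
import math
from collections import Counter


def log_log_slope(values):
    """Least-squares slope of log(frequency) against log(value)."""
    freq = Counter(values)
    xs = [math.log(value) for value in freq]
    ys = [math.log(count) for count in freq.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic cluster sizes where size * frequency = 120 for every size.
cluster_sizes = [1] * 120 + [2] * 60 + [3] * 40 + [4] * 30
slope = log_log_slope(cluster_sizes)
```

For this synthetic input the slope is -1 up to floating-point error, which corresponds to the "product is a constant" description on the slide; real data would be noisier and need a more careful fit.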
Experiment
- Connectivity: number of edges divided by number of nodes.
- Degree of a node: its number of neighbors (measured on the largest CC).
- The results suggest that having indirect "neighbors" helps foster more collaboration among developers.
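Both measures are simple to compute on an adjacency-set representation. This is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the toy cluster is made up for the example.

```python
def connectivity(graph):
    """Number of edges divided by number of nodes."""
    num_nodes = len(graph)
    # Each undirected edge appears in two neighbor sets, so halve the total.
    num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return num_edges / num_nodes


def node_degrees(graph):
    """Degree of each node: its number of neighbors."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


# Toy cluster: a triangle a-b-c with one extra developer d attached to c.
cluster = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
ratio = connectivity(cluster)
degrees = node_degrees(cluster)
```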
Experiment
- What are some common topological collaboration patterns?
- Large (more than 254 nodes and 254 edges)
- Small (more than 254 nodes or 254 edges)
- Very small
Experiment
- Analysis:
  - Frequent patterns are small, at most 6 nodes.
  - Triadic closure principle: (G2, G3, 96.5%), (G6, G7, 94.2%), (G7, G8, 99.6%).
  - The least dense pattern is a chain; the most dense is a complete graph: (G2, G3, 96.5%), (G4, G8, 92.9%), (G10, G20, 90.0%). This suggests that indirect links are likely to turn into direct links.
  - Lower-ranked patterns are derived from higher-ranked ones by expanding a hub: (G2, G1), (G6, G3).
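Triadic closure can be quantified in a graph as the share of connected triples (two edges meeting at one developer) whose endpoints are also directly linked. The sketch below computes that share; it is one common measure and is not claimed to be how the percentages on the slide were obtained. The function name and toy graph are assumptions.

```python
from itertools import combinations


def closed_triple_fraction(graph):
    """Fraction of connected triples u - center - v where u and v are linked."""
    closed = 0
    total = 0
    for center, neighbors in graph.items():
        # Every pair of neighbors of `center` forms one connected triple.
        for u, v in combinations(sorted(neighbors), 2):
            total += 1
            if v in graph[u]:
                closed += 1
    return closed / total if total else 0.0


# Triangle a-b-c plus developer d, who is linked only to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
fraction = closed_triple_fraction(graph)
```

In the toy graph three of the five connected triples are closed; the two open ones both pass through c and end at d, which are exactly the indirect links that triadic closure predicts may become direct.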
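Experiment
- Does six degrees of separation exist?
- Six degrees of separation: no more than five people stand between you and any stranger; that is, you can get to know any stranger through at most five intermediaries.
- Largest CC: 30,111 nodes; the average shortest path length is 6.55, so the phenomenon exists.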
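The average shortest path length of a connected component can be computed with one breadth-first search per node. This is a minimal sketch on an unweighted, undirected graph stored as neighbor sets; only reachable pairs are counted, so it should be applied to a single connected component. The function name and toy chain are assumptions.

```python
from collections import deque


def average_shortest_path(graph):
    """Mean hop distance over all ordered pairs of distinct reachable nodes."""
    total = 0
    pairs = 0
    for source in graph:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
avg = average_shortest_path(chain)
```

The running time is proportional to the number of nodes times the size of the graph, which stays manageable for a component of tens of thousands of nodes.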
Conclusions
- Not all developers are connected to every other developer; there are many disjoint clusters.
- The numbers of nodes and the connectivity follow a power law, but the numbers of edges and the node degrees do not.
- Project sizes, developer participation, cluster sizes (largest one).
- The small-world phenomenon exists.
- Indirect links tend to become direct links.
- Patterns grow by expanding hubs.
- Collaboration networks are not random: developers are preferentially connected.
- Linchpin nodes.
Thank you!