1
Mining Collaboration Patterns from a Large Developer Network (PP-DSN) MF1432031 李玉
2
Mining Collaboration Patterns
Internet, communication devices
Developers from diverse locations
Globally distributed software development
Common interests
Mining collaboration patterns: learn, aid
Dynamics, properties, weaknesses
3
Mining Collaboration Patterns
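Extract patterns
High level of detail: network-level statistics
Low level of detail: topological patterns
SourceForge.net (the world's largest open-source software development platform and repository; it gives open-source software a platform for storage, collaboration, and release)
A large network graph: node → developer, edge → collaboration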
4
Mining Collaboration Patterns
Q1: How connected are the developers? Are all developers connected to every other developer in the network? If not, how many clusters of connected developers are there in the network?
Q2: What are some characteristics of developer collaboration clusters?
Q3: What are some common topological collaboration patterns appearing in these developer collaboration clusters?
Q4: Within a connected collaboration cluster, are all developers connected to every other developer within 6 hops, following the small-world phenomenon?
5
Definitions and Concepts
Collaboration graph G = (N, E, NL): undirected; N is the set of nodes (developers), E the set of edges [u, v], NL the node labels
Collaboration pattern P = (N, E)
Definition 1 (Sub-Graph Isomorphism): Consider two graphs G = (N, E, NL) and G’ = (N’, E’, NL’). A sub-graph isomorphism is an injective function f : N → N’ such that (1) ∀n ∈ N, l(n) = l’(f(n)), where l(n) is the label of node n; (2) ∀(u, v) ∈ E, (f(u), f(v)) ∈ E’. The function or mapping f is referred to as the embedding of G in G’.
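Definition 1 can be checked directly on small graphs. The following is a minimal brute-force sketch, not the method used in the presentation: it tries every injective mapping, so it is only practical for tiny patterns. The function name `find_embedding` and the dictionary-based label representation are assumptions made for illustration.

```python
from itertools import permutations


def find_embedding(g_nodes, g_edges, g_labels, h_nodes, h_edges, h_labels):
    """Return an embedding f of G in H (Definition 1) as a dict, or None.

    Hypothetical helper: graphs are given as node lists, edge lists of
    (u, v) pairs, and dicts mapping each node to its label.
    """
    # Edges are undirected, so store them as unordered pairs.
    h_edge_set = {frozenset(e) for e in h_edges}
    g_nodes = list(g_nodes)
    # Each permutation of len(g_nodes) distinct H nodes is an injective map.
    for image in permutations(h_nodes, len(g_nodes)):
        f = dict(zip(g_nodes, image))
        # Condition (1): labels must be preserved.
        if any(g_labels[n] != h_labels[f[n]] for n in g_nodes):
            continue
        # Condition (2): every edge of G must map to an edge of H.
        if all(frozenset((f[u], f[v])) in h_edge_set for u, v in g_edges):
            return f
    return None
```

For example, a three-node path embeds in a triangle, but a triangle does not embed in a three-node path, because the mapping only has to preserve the edges of the smaller graph.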
6
Definitions and Concepts
Definition 2 (Pattern Match): A pattern P = (N, E) matches a graph G = (N’, E’, NL’) if there exists an injective function f : N → N’ such that ∀(u, v) ∈ E, (f(u), f(v)) ∈ E’.
Definition 3 (Frequent Pattern Mining): Given a graph dataset GSet and a threshold msup, find all patterns that appear in more than msup graphs in GSet. (If P is frequent, every sub-graph P’ of P is also frequent.)
Definition 4 (Closed Graphs): Given a set of graphs GSet, a graph pattern g is closed if there does not exist another pattern g’ such that g’ is a super-graph of g and both g and g’ match the same set of graphs in GSet. If such a g’ exists, we say that g is subsumed by g’.
7
Definitions and Concepts
Definition 5 (Frequent Closed Pattern Mining): Given a graph dataset GSet and a threshold msup, find all patterns that are frequent and closed.
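Definitions 2 to 5 can be illustrated with a small sketch. It assumes a hand-supplied dictionary of candidate patterns and only tests closedness against those candidates, so it is a simplification of real frequent closed pattern mining, which enumerates the patterns itself. All function names here are hypothetical, and graphs are assumed to be simple (no duplicate edges).

```python
from itertools import permutations


def matches(pattern, graph):
    """Definition 2: does pattern (nodes, edges) match graph (nodes, edges)?"""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[u], f[v])) in g_edge_set for u, v in p_edges):
            return True
    return False


def matched_graphs(pattern, gset):
    """Indices of the graphs in gset that the pattern matches."""
    return frozenset(i for i, g in enumerate(gset) if matches(pattern, g))


def is_proper_supergraph(big, small):
    """True if small embeds in big and big has extra nodes or edges."""
    bigger = len(big[0]) > len(small[0]) or len(big[1]) > len(small[1])
    return bigger and matches(small, big)


def frequent_closed(candidates, gset, msup):
    """Definition 5, restricted to the supplied candidates (an assumption)."""
    cover = {name: matched_graphs(p, gset) for name, p in candidates.items()}
    # Definition 3: frequent means matching more than msup graphs.
    frequent = [name for name in candidates if len(cover[name]) > msup]
    closed = []
    for name in frequent:
        # Definition 4: subsumed if a super-graph matches the same graphs.
        subsumed = any(
            other != name
            and cover[other] == cover[name]
            and is_proper_supergraph(candidates[other], candidates[name])
            for other in frequent
        )
        if not subsumed:
            closed.append(name)
    return closed
```

With a triangle, a triangle with a tail, and a three-node path as the dataset, a single edge is frequent but not closed, because the three-node path is a super-graph that matches exactly the same graphs.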
8
Approach Collaboration network
Extract clusters: connected components (cc) → database (CGD)
Top-k topological pattern mining (graph mining & graph matching)
CGD is divided into CGDl and CGDs
Mine frequent closed patterns in CGDs: sup(P, CGDs)
Match each P in CGDl: sup(P, CGDl)
sup(P, CGD) = sup(P, CGDs) + sup(P, CGDl)
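The first step, extracting connected components and dividing them into two sets, might look like the sketch below. It uses a plain breadth-first search; the function names and the single node-count threshold `max_small_nodes` are assumptions for illustration, not the exact split used in the presentation.

```python
from collections import defaultdict, deque


def connected_components(edges, nodes=()):
    """Return connected components as sorted node lists, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for n in nodes:
        adj[n]  # register isolated developers with no collaborations
    adj = dict(adj)
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            n = queue.popleft()
            component.append(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        components.append(sorted(component))
    components.sort(key=lambda c: (-len(c), c))
    return components


def split_by_size(components, max_small_nodes):
    """Split components into (small, large) by a hypothetical node threshold."""
    small = [c for c in components if len(c) <= max_small_nodes]
    large = [c for c in components if len(c) > max_small_nodes]
    return small, large
```

Because the two sets are disjoint and together cover every component, the number of components a pattern matches overall is the sum of the counts in each set, which is what the support formula above expresses.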
10
Experiment
Initial: filter inactive (no developers & >=100 downloads): 192,706 → 28,087
Use contributor information, not SVN/CVS committer identifiers
How connected are the developers? (55,694)
The network is not a connected one: many disjoint components (6,744)
The share of developers who work alone is very low: 1.5% (838)
One very large cluster: 30,111 developers, 54.07%, the core community
The others are much smaller (the second largest has 117)
11
Experiment
What are some characteristics of developer collaboration clusters?
Power law (Matthew effect): the product of the number of links a node has and the number of nodes with that many links is a constant (rich-get-richer)
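The power-law statement on this slide (links per node times the number of such nodes is constant) corresponds to a straight line of slope -1 on a log-log plot of value against frequency. The sketch below estimates that slope with an ordinary least-squares fit. This is a rough diagnostic under stated assumptions, not a rigorous power-law test, and the function name `loglog_slope` is hypothetical.

```python
import math
from collections import Counter


def loglog_slope(values):
    """Least-squares slope of log(frequency) against log(value).

    Assumes positive values and at least two distinct values.
    """
    counts = Counter(values)
    xs = [math.log(v) for v in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

For instance, 64 nodes with 1 link, 32 with 2, 16 with 4, and 8 with 8 keep the product constant at 64, and the fitted slope comes out at -1 up to floating-point error.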
12
Experiment
Connectivity: edges/nodes
Degree of nodes: number of neighbors (largest cc)
Suggests that having indirect “neighbors” helps foster more collaboration among developers
13
Experiment
14
Experiment
15
Experiment What are some common topological collaboration patterns?
Large (more than 254 nodes and 254 edges)
Small (more than 254 nodes or 254 edges)
Very small
16
Experiment
Analysis:
Frequent patterns have small sizes, most 6 nodes
Triadic closure principle
(G2, G3, 96.5%), (G6, G7, 94.2%), (G7, G8, 99.6%)
Least dense: chain. Most dense: complete graph
(G2, G3, 96.5%), (G4, G8, 92.9%), (G10, G20, 90.0%)
Suggests that indirect links are likely to turn into direct links
Lower rank derived from higher rank, expand hub: (G2, G1), (G6, G3)
17
Experiment
Does six-degree-of-separation exist?
No more than five people separate you from any stranger; that is, you can reach any stranger through at most five intermediaries
Largest CC nodes: the average shortest path is 6.55, so it exists
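The average shortest path quoted on this slide can be computed with one breadth-first search per node. The sketch below assumes an undirected, connected graph given as an adjacency dict with at least two nodes; the name `average_shortest_path` is hypothetical. For a component with tens of thousands of nodes this all-pairs approach is slow, so sampling source nodes would be a reasonable shortcut.

```python
from collections import deque


def average_shortest_path(adj):
    """Mean hop count over all ordered pairs of distinct reachable nodes."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            n = queue.popleft()
            for m in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + 1
                    queue.append(m)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

On a four-node chain 1-2-3-4 the six pairwise distances are 1, 2, 3, 1, 2, 1, so the average is 10/6.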
18
Conclusions
Not all developers are connected to every other developer
Many disjoint clusters
Nodes and connectivity follow a power law, but edges and node degrees do not
Project sizes, developer participation, cluster sizes (largest one)
Small-world phenomenon exists
Indirect → direct
Expand hub
Collaborative networks are not as random; they are preferentially connected
Linchpin nodes
19
Thank you!