Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.orghttp://www.mmds.org

2 Nodes: Football Teams Edges: Games played Can we identify node groups? (communities, modules, clusters) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

3 NCAA conferences Nodes: Football Teams Edges: Games played J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

4 Can we identify functional modules? Nodes: Proteins Edges: Physical interactions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

5 Functional modules Nodes: Proteins Edges: Physical interactions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

6 Can we identify social communities? Nodes: Facebook Users Edges: Friendships J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

7 High school Summer internship Stanford (Squash) Stanford (Basketball) Social communities Nodes: Facebook Users Edges: Friendships J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 Non-overlapping vs. overlapping communities J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org8

9 Network Adjacency matrix Nodes J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 What is the structure of community overlaps: Edge density in the overlaps is higher! 10 Communities as “tiles” J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

11 This is what we want! Communities in a network J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 1) Given a model, we generate the network:  2) Given a network, find the “best” model J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org12 C A B D E H F G C A B D E H F G Generative model for networks

 Goal: Define a model that can generate networks  The model will have a set of “parameters” that we will later want to estimate (and detect communities)  Q: Given a set of nodes, how do communities “generate” edges of the network? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org13 C A B D E H F G Generative model for networks

 Generative model B(V, C, M, { p c }) for graphs:  Nodes V, Communities C, Memberships M  Each community c has a single probability p c  Later we fit the model to networks to detect communities 14 Model Network Communities, C Nodes, V Model pApA pBpB Memberships, M J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

15 Model Network Communities, C Nodes, V Community Affiliations pApA pBpB Memberships, M Think of this as an “OR” function: If at least 1 community says “YES” we create an edge J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

16 Model Network J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested 17J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 Detecting communities with AGM: 19 C A B D E H F G 1) Affiliation graph M 2) Number of communities C 3) Parameters p c J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org21

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org22 00.10 0.04 0.1000.020.06 0.100.0200.06 0.040.06 0 0100 1011 0101 0110 Flip biased coins

 Given graph G(V,E) and Θ, we calculate likelihood that Θ generated G: P(G|Θ) 00.9 0 0 0 0 00 0 Θ= B(V, C, M, {p c }) 0110 1010 1101 0010 G P(G|Θ ) G 23 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org AB

24 P(|) AGM arg max

26 u v J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

27 j Communities Nodes

28 Node community membership strengths J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 01.200.2 0.5000.8 01.810

29J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 BigCLAM takes 5 minutes for 300k node nets  Other methods take 10 days  Can process networks with 100M edges! 32J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 Extension: Make community membership edges directed!  Outgoing membership: Nodes “sends” edges  Incoming membership: Node “receives” edges 35J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013. Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach  Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks  Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013. Community Detection in Networks with Node Attributes J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org39

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Similar presentations

Presentation on theme: "Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.

Similar presentations

Presentation on theme: "Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these."— Presentation transcript:

Similar presentations

About project

Feedback