A Scalable Pattern Mining Approach to Web Graph Compression with Communities Greg Buehrer and Kumar Chellapilla Microsoft Live Labs.

Slides:

Advertisements

Similar presentations

Information Retrieval in Practice

Advertisements

Mining Association Rules

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.

Mining Frequent Patterns Using FP-Growth Method Ivan Tanasić Department of Computer Engineering and Computer Science, School of Electrical.

CSE 634 Data Mining Techniques

Benchmarking traversal operations over graph databases Marek Ciglan 1, Alex Averbuch 2 and Ladialav Hluchý 1 1 Institute of Informatics, Slovak Academy.

Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.

Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.

Introduction to Computer Science 2 Lecture 7: Extended binary trees

gSpan: Graph-based substructure pattern mining

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

© 2005 IBM Corporation Discovering Large Dense Subgraphs in Massive Graphs David Gibson IBM Almaden Research Center Ravi Kumar Yahoo! Research* Andrew.

22C:19 Discrete Structures Trees Spring 2014 Sukumar Ghosh.

10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.

Nadia Andreani Dwiyono DESIGN AND MAKE OF DATA MINING MARKET BASKET ANALYSIS APLICATION AT DE JOGLO RESTAURANT.

Web Graph representation and compression Thanks to Luciana Salete Buriol and Debora Donato.

1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.

Data Mining Association Analysis: Basic Concepts and Algorithms

CS Lecture 9 Storeing and Querying Large Web Graphs.

Data Mining Association Analysis: Basic Concepts and Algorithms

CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.

1 On Compressing Web Graphs Michael Mitzenmacher, Harvard Micah Adler, Univ. of Massachusetts.

1 Lecture 18 Syntactic Web Clustering CS

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.

Code and Decoder Design of LDPC Codes for Gbps Systems Jeremy Thorpe Presented to: Microsoft Research

Fast Algorithms for Association Rule Mining

Performance and Scalability: Apriori Implementation.

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.

Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:

Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.

Mining Optimal Decision Trees from Itemset Lattices Dr, Siegfried Nijssen Dr. Elisa Fromont KDD 2007.

MapReduce and Graph Data Chapter 5 Based on slides from Jimmy Lin’s lecture slides ( (licensed.

Efficient Identification of Overlapping Communities Jeffrey Baumes Mark Goldberg Malik Magdon-Ismail Rensselaer Polytechnic Institute, Troy, NY.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,

Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )

Querying Business Processes Under Models of Uncertainty Daniel Deutch, Tova Milo Tel-Aviv University ERP HR System eComm CRM Logistics Customer Bank Supplier.

Mining High Utility Itemset in Big Data

An Efficient Algorithm for Enumerating Pseudo Cliques Dec/18/2007 ISAAC, Sendai Takeaki Uno National Institute of Informatics & The Graduate University.

1/52 Overlapping Community Search Graph Data Management Lab, School of Computer Science

Frequent Subgraph Discovery Michihiro Kuramochi and George Karypis ICDM 2001.

Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.

Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?

University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.

1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane.

Data Mining Find information from data data ? information.

ACM SIGMOD International Conference on Management of Data, Beijing, June 14 th, Keyword Search on Relational Data Streams Alexander Markowetz Yin.

Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.

CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.

HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,

Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.

1 Top Down FP-Growth for Association Rule Mining By Ke Wang.

Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),

Approach to Data Mining from Algorithm and Computation Takeaki Uno, ETH Switzerland, NII Japan Hiroki Arimura, Hokkaido University, Japan.

Gspan: Graph-based Substructure Pattern Mining

Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.

Cohesive Subgraph Computation over Large Graphs

Frequency Counts over Data Streams

Data Mining Association Analysis: Basic Concepts and Algorithms

Data Mining: Concepts and Techniques

Association rule mining

Byung Joon Park, Sung Hee Kim

HEXA: Compact Data Structures for Faster Packet Processing

Query-Friendly Compression of Graph Streams

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

A Framework for Testing Query Transformation Rules

Presentation transcript:

A Scalable Pattern Mining Approach to Web Graph Compression with Communities Greg Buehrer and Kumar Chellapilla Microsoft Live Labs

Motivation Who links to me? How many hops is it from me to Kevin Bacon? What is the growth/impact of social network X? Are these web pages part of a link farm? + =>

Web Graph Compression Goal: Reduce the memory footprint of the graph Existing Approaches [WWW04, DCC02, SHS] – Sort by URL to improve similarity between near nodes – Encode Id lists using a reference to a list in a near node, say within 5 nodes, called REFERENCE – Sort outlinks to minimize gap, code gap instead of Id, using Huffman coding (or a similar flat code) – called GAP – Zeta Codes – Flat codes to code the gap (no lookup table required) designed for power law distributions NodeOutlinks 12,10,12,14,18 25,10,12,14,18 NodeOutlinks 12,10,12,14,18 2(-1) +5,-2 NodeOutlinks 18,2,2,2,4 2(-1) +5,-2

Our Approach Mine for Dense Bipartite Graphs 20 Links [CN99, KDD00]

Virtual Node Miner Virtual Node 9 Links (20/9) = 2.2x compression

Finding Bipartite Graphs Cast adjacency list as a transactional data set Use pattern mining to find frequent itemsets Use an approximate mining strategy Cust 1: milk bread cereal Cust 2: milk bread eggs sugar Cust 3: milk bread butter Cust 4: eggs sugar Node 1 Outlinks: 12,13,14,17 Node 2 Outlinks: 12,13,14,19 Node 3 Outlinks: 12,13,14,33 Node 4 Outlinks: 3,4,12,13,14 =>

Webgraph Compression via Probabilistic Itemset Mining Perform mining in several steps 1.Cluster/group similar nodes together using min- wise hashing 2.Finds patterns in the correlated group 3.Create virtual nodes 4.Substitute VN into graph 5.Iterate Find Patterns Remove Patterns Add Virtual Node Cluster

Step 1 – Clustering A.Use K min hashes to reduce each outlink list from variable length to length K, obtaining an n*K matrix Id1234K 1hashAhashFhashWhashChashB 2hashFhashRhashA hashB 3hashAhashEhashAhashFhashC nhashGhashRhashEhashG

Clustering (cont) B. Sort the matrix Id1234K 1hashAhashFhashWhashChashB 2hashFhashRhashA hashB 3hashAhashEhashAhashFhashC nhashGhashRhashEhashG Id1234K 3hashAhashEhashAhashFhashC 1hashAhashFhashWhashChashB 2hashFhashRhashA hashB nhashGhashRhashEhashG

Clustering (cont) 3.Traverse the columns lexicographically, grouping nodes with the same hash value If we reach K or have a small set, mine it Id1234K 3hashAhashEhashAhashFhashC 1hashAhashEhashAhashChashB 2hashAhashEhashWhashAhashB nhashAhashRhashEhashG

Step 2 - Mining 1.Scan all node outlinks and record a histogram of outlink ID frequencies Node IdOutlinks 236,10,5,12,15, 1,2,3 1021,2,3,20 552,3, 10,12,1,5,6, ,7,8,9,3 131,2,3,8 641,2,3,5,6,10,12,15 431,2,5,10,22,31,8,23,36,6 4311,2,5,10,21,31,67,8,23,36,6 Id Count

Node IdOutlinks 236,10,5,12,15, 1,2,3 1021,2,3,20 552,3,1,5 2041,7,8,9,3 131,2,3,8 641,2,3,5,6,10,12,15 431,2,5,10,22,31,8,23,36,6 4311,2,5,10,21,31,67,8,23,36,6 Mining (cont) 2.Reorder each node’s outlink list based on the histogram (delete those with count=1) Id Count Node IdOutlinks 231,2,3,5,6,10, 12, ,2,3 551,2,3,5,6,10,12, ,3,8 131,2,3,8 641,2,3,5,6,10,12,15 431,2,5,6,10,8,23,31, ,2,5,6,10,8,23,31,36

Mining (cont) Node IdOutlinks 231,2,3,5,6,10, 12, ,2,3 551,2,3,5,6,10,12, ,3,8 131,2,3,8 641,2,3,5,6,10,12,15 431,2,5,6,10,8,23,31, ,2,5,6,10,8,23,31,36 3: {204} 8: {13} 23: {43,431} 31: {43,431} 36: {43,431} 10: {43,431} 8: {43,431} 8: {204} 1: {23} 2: {23} 3: {23} 5: {23} 6: {23} 10: {23} 12: {23} 15: {23} 1: {23,102} 2: {23,102} 3: {23,102} 5: {23} 15: {23,55,64} 12: {23,55,64} 3: {13,23,55,64,102} 5: {23,55,64} 1: {13,23,43,55,64,102,204,431} 2: {13,23,43,55,64,102,431} 3. Build a trie of the node outlink lists 6: {23,55,64} 10: {23,55,64} 5: {43,431} 6: {43,431}

Mining (cont) 4.Walk the trie and add candidate nodes to a list $ = (L-1)*(F-1) |P|Node List$ 943,4318 |P|Node List$ 943, ,23,43,55,64,102,4318 |P|Node List$ 943, ,23,43,55,64,102, ,55,6414 3: {204} 8: {13} 23: {43,431} 31: {43,431} 36: {43,431} 10: {43,431} 8: {43,431} 8: {204} 1: {23} 2: {23} 3: {23} 5: {23} 6: {23} 10: {23} 12: {23} 15: {23} 1: {23,102} 2: {23,102} 3: {23,102} 5: {23} 15: {23,55,64} 12: {23,55,64} 3: {13,23,55,64,102} 5: {23,55,64} 1: {13,23,43,55,64,102,204,431} 2: {13,23,43,55,64,102,431} 6: {23,55,64} 10: {23,55,64} 5: {43,431} 6: {43,431} |P|Node List$ 943, ,23,43,55,64,102, ,55, ,23,55,64,1028

Mining Stage (cont) 5.Sort the list based on their $ – Including a Virtual Node for a pattern may rule out another pattern |P|Node List$ 943, ,55, ,23,55,64, ,23,43,55,64,102,4316 |P|Node List$ 823,55, , ,23,55,64, ,23,43,55,64,102,4316

Node IdOutlinks 236,10,5,12,15, 1,2,3 1021,2,3,20 552,3, 10,12,1,5,6, ,7,8,9,3 131,2,3,8 641,2,3,5,6,10,12,15 431,2,5,10,22,31,8,23,36,6 4311,2,5,10,21,31,67,8,23,36,6 Mining (cont) 6.Remove the top item in the list and make a virtual node of it (replacing outlink IDs along the way) |P|Node List$ 823,55, , ,23,55,64, ,23,43,55,64,102,4316 Node IdOutlinks 23V1 102V3,20 55V1 2041,7,8,9,3 13V3,8 64V1 43V2,22 431V2,21,67 V11,2,5,6,10,8,23,31,36 V21,2 V31,2,3,5,6,10,12,15

Empirical Evaluation Goal: Evaluate along 3 axes – Compression, Scalability, Patterns Discovered – Implementation in C++ – Windows Server 2003, 16GB RAM, 2.8GHz core Datasets from WebGraph data repository

Compression Afforded by VNodes Webbase2001 is old and only has 8 edges/node

Total Compression

Compression Comparison Bits per edge for Virtual Node Miner and WebGraph

Scalability

Virtual Node Properties

Communities are far apart Reference schemes typically have a small window size

Vs Traditional Mining σ=5000 σ=1000 σ=500 σ=100 σ=75 σ=65 σ=50 VNM VNM8core Closed Sets Gen. Closed Sets Comp. VNM VNM1Iteration Closed Sets VNM5Iterations EU-2005

Take Home Message Web Graph Compression Contribution – Supports any URL ordering, any labeling – Supports any encoding scheme – Seeds for community discovery – High compression ratio – Scales well – Can be extended Data Mining – Log-linear itemset miner – Interesting data sets for pattern mining

Ongoing Work Computations on the compressed graph Ease of importing/updating data Compression for the full graph

Thanks! [JCSS98] A. Broder, M. Charikar, A. Frieze, M. Mitzenmache. Min-wise Independent Permutations. In Journal of Computer and System Sciences, [CN99] R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins. Trawling the Web for emerging cyber-communities. In CN [KDD00] G. Flake, S. Lawrence and C. Giles. Efficient identification of web communities. In KDD [SIG00] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD [DCC02] K. Randall, R. Stata, R. Wickremesinghe and J. Wiener. The Link database: Fast access to graphs of the web. In DCC [WWW04] P. Boldi and S. Vigna. The webgraph framework i: Compression Techniques. In WWW [VLDB05] D. Gibson, R. Kumar and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB External References

End of Talk

Extra slides for question support

Length of Virtual Nodes

Compression as a Function of Pattern Length

Empirical Evaluation Scalability and Execution Time

Semantics Community 11: A link farm for inlinks pattern Community 31: ringtones.mobilefun.co.uk Community 16: Community 40:

Optimality What if we were given every itemset and its frequency for free? Optimality is intractable An approximate solution may prove useful 1,2,4,5,9,10,12,13,14,18,23,34

Existing Itemset Mining Algorithms Existing solutions have worst case exponential runtimes [FIMI03] – Our use case is worst case (support=2) – Even streaming algorithms have worst case exponential runtime complexities Other patterns besides itemsets, such as closed sets, maximal sets, and top-K sets also have exponential runtimes

Compression Components Huffman coding degrades as VN compression increases