Shuai Ma
Graph Search & Social Networks

Graphs are everywhere, and quite a few are huge graphs!



Application Scenarios: Software plagiarism detection [1]
Traditional plagiarism detection tools may not be applicable to serious software plagiarism problems.
A new tool based on graph pattern matching:
– Represent the source code as program dependence graphs [2].
– Use graph pattern matching to detect plagiarism.

Application Scenarios: Recommender systems [3]
Recommendations have found use in many emerging applications, such as social matching systems.
Graph search is a useful tool for recommendations.
– A headhunter wants to find a biologist (Bio) to help a group of software engineers (SEs) analyze genetic data.
– To do this, (s)he uses an expertise recommendation network G, where a node denotes a person labeled with expertise, and an edge indicates a recommendation, e.g., HR1 recommends Bio1, and AI1 recommends DM1.

Application Scenarios: Transport routing [4]
Graph search is a common practice in transportation networks, due to the wide application of Location-Based Services.
Example: Mark, a driver in the U.S., wants to go from Irvine to Riverside in California.
– If Mark wants to reach Riverside by car in the shortest time, the problem can be expressed as the shortest path problem. Using existing methods, we can get the shortest path from Irvine, CA to Riverside, CA, traveling along State Route 261.
– If Mark drives a truck delivering hazardous materials, he may not be allowed to cross some bridges or railroad crossings. In this case, we can use a pattern graph containing specific route constraints (such as regular expressions) to find the optimal transport routes.
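The two routing cases can be sketched as one shortest-path search with an optional edge filter. This is a minimal sketch and not the method of [4]: the road network, the travel times, and the `hazmat_ok` attribute are made-up illustrative values, and a simple per-edge predicate stands in for the richer route constraints (such as regular expressions) mentioned on the slide.

```python
import heapq

def shortest_time(graph, source, target, allowed=lambda attrs: True):
    """Dijkstra's algorithm; `allowed` decides whether an edge may be used."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter route to u was already found
        for v, minutes, attrs in graph.get(u, []):
            if not allowed(attrs):
                continue  # edge violates the route restriction
            nd = d + minutes
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None  # target unreachable under the given restriction

# Hypothetical road network: node -> [(neighbor, minutes, edge attributes)]
roads = {
    "Irvine": [("Bridge", 20, {"hazmat_ok": False}),
               ("Bypass", 35, {"hazmat_ok": True})],
    "Bridge": [("Riverside", 25, {"hazmat_ok": True})],
    "Bypass": [("Riverside", 30, {"hazmat_ok": True})],
}

car_time = shortest_time(roads, "Irvine", "Riverside")
truck_time = shortest_time(roads, "Irvine", "Riverside",
                           allowed=lambda attrs: attrs["hazmat_ok"])
```

In this toy network the car can use the bridge, while the truck is forced onto the slower bypass, so the same search returns different routes once the restriction is applied.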

Application Scenarios: Biological data analysis [5]
A large amount of biological data can be represented by graphs, and it is important to analyze biological data with graph search techniques.
– “Protein-interaction network (PIN) analysis provides valuable insight into an organism’s functional organization and evolutionary behavior.”
– For example, PIN analysis can reveal the topological properties of a PIN formed by high-confidence human protein interactions obtained from various public interaction databases.

Outline
– What is graph search?
– Graph search, why bother?
– Three types of graph search
– Challenges & related techniques
– Summary

What is Graph Search?

A unified definition [6] (in the name of graph matching):
Given a pattern graph G_p and a data graph G:
– check whether G_p “matches” G; and
– identify all “matched” subgraphs.
Remarks:
– Two classes of queries:
  – Boolean queries (Yes or No)
  – Functional queries, which may use Boolean queries as a subroutine
– Graphs contain a set of nodes and a set of edges, typically with labels
– Pattern graphs are typically small (e.g., 10), but data graphs are usually huge (e.g., 10^8)

What is Graph Search?
Different semantics of “match” imply different “types” of graph search, including, but not limited to, the following:
– Shortest paths/distances [4]
– Subgraph isomorphism [12]
– Graph homomorphism and its extensions [10]
– Graph simulation and its extensions [8,9]
– Graph keyword search [7]
– Neighborhood queries [11]
– …
Graph search is a very general concept!

Graph Search, Why Bother?

Social Networks are the New Media
Social networks are graphs:
– The nodes are the people and groups.
– The links/edges show relationships or flows between the nodes.

Social Networks are the New Media
Social networks are becoming an important way to get information in everyday life!

The need for a Social Search Engine
– File systems: very simple search functionalities
– Databases - mid 1960’s: SQL language
– World Wide Web: keyword search engines
– Social networks - late 1990’s:
Graph search is a new paradigm for social computing!
Facebook launched “graph search” on 16th January, 2013.
Assault on Google, Yelp, and LinkedIn with new graph search; Yelp was down more than 7%.

Graph Search vs. RDBMS [13]
Query: Find the name of all of Alberto Pepe's friends.
Step 1: The person.name index -> the identifier of Alberto Pepe. [O(log_2 n)]
Step 2: The friend.person index -> k friend identifiers. [O(log_2 x) : x << m]
Step 3: The k friend identifiers -> k friend names. [O(k log_2 n)]

Graph Search vs. RDBMS [13]
Query: Find the name of all of Alberto Pepe's friends.
Step 1: The vertex.name index -> the vertex with the name Alberto Pepe. [O(log_2 n)]
Step 2: The vertex returned -> the k friend names. [O(k + x)]
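The contrast between the two query plans can be sketched in a few lines. This is only an illustration under assumptions: the `person` and `friend` tables, the sorted lists standing in for indexes, the `Vertex` class, and every name other than Alberto Pepe are hypothetical and not taken from [13].

```python
import bisect

# Relational style: two tables, with sorted lists standing in for indexes.
person = {1: "Alberto Pepe", 2: "Ann", 3: "Bob", 4: "Cat"}   # id -> name
friend = [(1, 2), (1, 3), (2, 4)]                            # (person, friend)
name_index = sorted((name, pid) for pid, name in person.items())
friend_index = sorted(friend)

def friends_relational(name):
    # Step 1: binary search on the name index for the person's identifier.
    i = bisect.bisect_left(name_index, (name,))
    pid = name_index[i][1]
    # Step 2: binary search on the friend index for the k friend identifiers.
    lo = bisect.bisect_left(friend_index, (pid,))
    hi = bisect.bisect_left(friend_index, (pid + 1,))
    friend_ids = [f for _, f in friend_index[lo:hi]]
    # Step 3: one lookup per friend identifier (a dict stands in for an index).
    return [person[f] for f in friend_ids]

# Graph style: each vertex holds direct references to its neighbors.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.friends = []

vertices = {name: Vertex(name) for name in person.values()}
for p, f in friend:
    vertices[person[p]].friends.append(vertices[person[f]])

def friends_graph(vertex_index, name):
    # Step 1: a single index lookup finds the start vertex.
    v = vertex_index[name]
    # Step 2: follow the friend edges directly, with no further index lookups.
    return [f.name for f in v.friends]

relational_result = friends_relational("Alberto Pepe")
graph_result = friends_graph(vertices, "Alberto Pepe")
```

Both functions return the same names; the difference is that the relational plan goes back to an index for every hop, while the graph plan only uses an index to find the starting vertex.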

Social Search vs. Web Search
– Phrases, short sentences vs. keywords only
– (Simple Web) pages vs. entities
– Lifeless vs. full of life
– History vs. future
The International Conference on Application of Natural Language to Information Systems (NLDB) started in 1995.
“It’s interesting, and over the last 10 years, people have been trained on how to use search engines more effectively.” (Keywords & Search In 2013: Interview With A. Goodman & M. Wagner)

Interesting Coincidence!
DB people started working on graphs at around the same time!
Social computing & Web 2.0

Three Types of Graph Search
– Cohesive subgraphs
– Keyword search on graphs
– Graph pattern matching

Cohesive Subgraphs
Cohesive subgroups are subsets of actors among whom there are relatively strong, direct, intense, frequent or positive ties [14].
– Different cohesive subgroups are formed according to different cohesive relations, which are further specified by application needs.
Social networks can be represented as graphs, so we formalize cohesive subgroups as cohesive subgraphs.
– Correspondingly, the problem of finding cohesive subgraphs in graphs is referred to as cohesive subgraph search.

Cohesive Subgraphs
Various cohesive subgraphs (clique, n-clan, k-plex, k-core)
Maximal clique: a maximal clique is a maximal complete subgraph.
Example: “Padgett's Florentine Families”
Main issues:
– Cliques can overlap
– Too many or too few cliques emerge
– The problem is NP-complete

Cohesive Subgraphs
Various cohesive subgraphs (clique, n-clan, k-plex, k-core)
N-clique: an n-clique is a maximal subgraph in which the largest distance between any two nodes is no greater than n.
N-clan: an n-clan is an n-clique in which the diameter is no greater than n.
K-core: a k-core is a maximal subgraph in which the degree of each node is no smaller than k.
Example: “Padgett's Florentine Families”
The cohesive relations become gradually looser.
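Of these notions, the k-core is the easiest to compute: repeatedly delete nodes whose remaining degree is below k until none is left. The following is a minimal sketch of that peeling procedure on a small, made-up undirected graph; the function name `k_core` and the node names are illustrative assumptions.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    `adj` maps each node to the set of its neighbors.
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    queue = [v for v in adj if degree[v] < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue  # already peeled off earlier
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                if degree[w] < k:
                    queue.append(w)  # w no longer has k surviving neighbors
    return alive

# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

core1 = k_core(adj, 1)
core2 = k_core(adj, 2)
core3 = k_core(adj, 3)
```

Here the 2-core is the triangle, because peeling e leaves d with a single neighbor, and the 3-core is empty. Each node and edge is handled a constant number of times, which is why k-cores scale to large graphs while maximal cliques do not.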

Keyword Search on Graphs
Given a set of keywords and a data graph, the problem is to determine a group of densely linked nodes in the graph such that the nodes together
– contain all the keywords, and
– satisfy some structural constraints [7].
Remarks:
1. Different “structure constraints” imply different types of keyword search.
2. Keyword search is a very simple but user-friendly information retrieval mechanism.

Keyword Search on Graphs
Minimum spanning tree [7]
Given keywords: {A, B}

Keyword Search on Graphs
– The input lacks structural constraints, so the results require ranking.
– r-clique [15]
– The use of the structural constraints lacks justification.

Graph Pattern Matching [17]
Given two directed graphs G1 (pattern graph) and G2 (data graph),
– decide whether G1 “matches” G2 (Boolean queries);
– identify “subgraphs” of G2 that match G1.
Matching semantics:
– Traditional: subgraph isomorphism
– Emerging applications: graph simulation and its extensions, etc.

Subgraph Isomorphism [12]
Given a pattern graph Q and a subgraph G_s of data graph G:
– Q matches G_s if there exists a bijective function f: V_Q → V_Gs such that
  – for each node u in Q, u and f(u) have the same label; and
  – (u, u') is an edge in Q if and only if (f(u), f(u')) is an edge in G_s.
Goodness:
– Keeps the exact structure topology between Q and G_s
Badness:
– May return exponentially many matched subgraphs
– The decision problem is NP-complete
– In certain scenarios, too restrictive to find matches
These hinder the usability in emerging applications, e.g., social networks.
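To make the definition concrete, here is a brute-force sketch that tries every injective assignment of pattern nodes to data nodes. It assumes that G_s is taken to be the image of Q under f, so it only checks that labels agree and that every pattern edge maps to a data edge. The function name, node names and labels are illustrative assumptions, and the enumeration is exponential, in line with the NP-completeness noted above.

```python
from itertools import permutations

def subgraph_isomorphisms(q_labels, q_edges, g_labels, g_edges):
    """Enumerate label- and edge-preserving injective mappings from Q into G."""
    q_nodes = list(q_labels)
    matches = []
    # Each permutation of len(q_nodes) data nodes is one injective candidate f.
    for image in permutations(g_labels, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if any(q_labels[u] != g_labels[f[u]] for u in q_nodes):
            continue  # some node is mapped to a node with a different label
        if all((f[u], f[w]) in g_edges for (u, w) in q_edges):
            matches.append(f)
    return matches

# Pattern: a two-node cycle between an A-labeled and a B-labeled node.
q_labels = {"u": "A", "w": "B"}
q_edges = {("u", "w"), ("w", "u")}

# Data graphs over the same nodes: a cycle of length four and a cycle of length two.
g_labels = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
long_cycle = {("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")}
short_cycle = {("a1", "b1"), ("b1", "a1")}

no_match = subgraph_isomorphisms(q_labels, q_edges, g_labels, long_cycle)
one_match = subgraph_isomorphisms(q_labels, q_edges, g_labels, short_cycle)
```

The long cycle yields no match because no pair of data nodes is connected in both directions, which illustrates how strict the exact-topology requirement is.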

Graph Simulation
Given pattern graph Q(Vq, Eq) and data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if
– (1) for each (u, v) ∈ R, u and v have the same label; and
– (2) for each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E such that (u′, v′) ∈ R.
Graph G matches pattern Q via graph simulation if there exists a total match relation M:
– for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
– Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match.
– Simulation was initially proposed for the analysis of programs; simulation and its extensions were recently introduced for social networks.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n^2))!
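The match relation can be computed as a fixpoint: start from all label-compatible pairs and repeatedly drop pairs that violate condition (2). The sketch below is a naive version of that idea under illustrative names and data; it is not tuned to reach the quadratic bound quoted above, which needs more careful bookkeeping.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum match relation as {pattern node: set of data nodes},
    or None if some pattern node has no match (G does not match Q)."""
    succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
    # Condition (1): start from all pairs with equal labels.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Condition (2): a match v of u needs a child that matches u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not matches for matches in sim.values()):
        return None
    return sim

# Pattern: a two-node cycle between an A-labeled and a B-labeled node.
q_labels = {"u": "A", "w": "B"}
q_edges = {("u", "w"), ("w", "u")}

# Data: a cycle of length four plus an isolated A-labeled node a3.
g_labels = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "a3": "A"}
g_edges = {("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")}

relation = graph_simulation(q_labels, q_edges, g_labels, g_edges)
```

The two-node pattern cycle is matched by the whole four-node data cycle, since every A node has a B child and vice versa, while the isolated node a3 is pruned. This shows both the flexibility of simulation and the kind of topology it does not preserve.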

Subgraph Isomorphism
Set up a team to develop a new software product.
Graph simulation returns F3, F4 and F5; subgraph isomorphism returns empty!
Subgraph isomorphism is too strict for emerging applications.

Terrorist Collaboration Network
“Those who were trained to fly didn’t know the others. One group of people did not know the other group.” (Osama Bin Laden, 2001)

Strong Simulation [16,17]
Subgraph isomorphism
– Goodness
  – Keeps (strong) structure topology
– Badness
  – May return an exponential number of matched subgraphs
  – Decision problem: NP-complete
  – In certain scenarios, too restrictive to find sensible matches
Graph simulation
– Goodness
  – Solvable in quadratic time
– Badness
  – Loses structure topology (how much? open question)
  – Only returns a single matched subgraph
Balance between complexity and the capability to capture topology!

Strong Simulation
Graph simulation loses graph structures:
– Disconnected
– Tree
– Long cycle

Strong Simulation
Strong simulation: bring duality and locality into graph simulation
Duality (dual simulation)
– Both child and parent relationships
– Simulation considers only child relationships
Locality
– Restricting matches within a ball
– When social distance increases, the closeness of relationships decreases and the relationships may become irrelevant
The semantics of strong simulation is well defined
– The results are unique
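The two ingredients can be sketched separately. The code below is not the strong simulation algorithm of [16]; it only illustrates duality, by adding a parent check to the simulation fixpoint, and locality, by computing the ball of nodes within a given number of hops of a center. All names and the toy data are illustrative assumptions.

```python
from collections import deque

def dual_simulation(q_labels, q_edges, g_labels, g_edges):
    """Maximum relation preserving both child and parent relationships."""
    succ = {v: set() for v in g_labels}
    pred = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
        pred[w].add(v)
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Child condition: a match of u needs a child matching u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
            # Parent condition: a match of u2 needs a parent matching u.
            keep = {v for v in sim[u2] if pred[v] & sim[u]}
            if keep != sim[u2]:
                sim[u2] = keep
                changed = True
    return None if any(not m for m in sim.values()) else sim

def ball(g_edges, center, radius):
    """Nodes within `radius` hops of `center`, ignoring edge direction."""
    nbrs = {}
    for v, w in g_edges:
        nbrs.setdefault(v, set()).add(w)
        nbrs.setdefault(w, set()).add(v)
    hops = {center: 0}
    todo = deque([center])
    while todo:
        v = todo.popleft()
        if hops[v] == radius:
            continue  # do not expand beyond the ball's boundary
        for w in nbrs.get(v, ()):
            if w not in hops:
                hops[w] = hops[v] + 1
                todo.append(w)
    return set(hops)

# Pattern: an A-labeled node with a B-labeled child.
q_labels = {"u": "A", "w": "B"}
q_edges = {("u", "w")}

# Data: a1 -> b1 -> c1, plus a B-labeled node b2 that has no parent.
g_labels = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
g_edges = {("a1", "b1"), ("b1", "c1")}

dual = dual_simulation(q_labels, q_edges, g_labels, g_edges)
near_a1 = ball(g_edges, "a1", 1)
```

Plain simulation would accept b2 as a match for the B-labeled pattern node, because that node has no outgoing pattern edges to check; the parent condition removes it. The ball then limits which data nodes are considered around a given center.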

Strong Simulation
[Figure: subgraph isomorphism, strong simulation, dual simulation, and graph simulation]
Topology preservation and bounded matches

Strong Simulation
A new matching model referred to as strong simulation
A cubic time algorithm
Three main optimization techniques
– Query minimization: an O(n^2) algorithm
– Dual simulation filtering: first compute the match graph of dual simulation, then project on each ball of the data graph
– Connectivity pruning: based on the connectivity theorem
A distributed algorithm
– Data locality property
– Boundary nodes and radius
Towards revising conventional notions of graph matching

Analyses
Graph search            | User-friendliness | Result accuracy
Cohesive subgraphs      |                   |
Keyword search          | Keywords          | Result ranking
Graph pattern matching  | Pattern graphs    | More accurate (well structure constrained)
A novel approach to combining the advantages and overcoming the shortcomings of existing graph search.

Challenges & Related techniques

Big Data
Big Data refers to datasets that grow so large that they are difficult to capture, store, manage, share, analyze and visualize with traditional (database) software tools.
“Big data” has become a buzzword, and a common focus of both the industrial and academic communities!

More Data Beats Better Algorithms

Kepler's third law of planetary motion
The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit.

Social networks are “big data”
Facebook:
– Volume: 10 × 10^8 users, 2400 × 10^8 photos, 10^4 × 10^8 page visits
– Velocity: 7.9 new users per second, over 60 thousand per day
– Variety: text (weibo, blogs), figures, videos, relationships (topology)
– Value: 1.5 × 10^8 dollars in 2007, 3 × 10^8 dollars in 2008, 6 ~ 7 × 10^8 dollars in 2009, 10 × 10^8 dollars in …
Further, data are often dirty due to missing data and data uncertainty [18, 19].

Challenges
– The amount of data has reached the order of hundreds of millions.
– The data are updated all the time, and the daily updates reach the order of hundreds of thousands.
– As with traditional relational data, the new applications have data quality problems such as data uncertainty and missing data.
Graph search with high efficiency, striking a balance between performance and accuracy.
Consider the dynamic changes and timing characteristics of data.
Solve the data quality problems.

Related Techniques

Distributed Processing
Real-life graphs are typically way too large:
– Yahoo! web graph: 14 billion nodes
– Facebook: over 0.8 billion users
It is NOT practical to handle large graphs on single machines.
Real-life graphs are naturally distributed:
– Google, Yahoo! and Facebook have large-scale data centers
Distributed graph processing is inevitable.
It is natural to study “distributed graph search”!

Distributed Processing
Model of computation [3]:
– A cluster of identical machines (with one acting as coordinator);
– Each machine can directly send an arbitrary number of messages to another one;
– All machines cooperate through local computations and message-passing.
Complexity measures:
1. Visit times: the maximum number of visits to a machine (interactions)
2. Makespan: the evaluation completion time (efficiency)
3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption)

Incremental Techniques
Google Percolator [20]:
– Converting the indexing system to an incremental system
– Reduces the average document processing latency by a factor of 100
– Processes the same number of documents per day, while reducing the average age of documents in Google search results by 50%
It is a great waste to compute everything from scratch!

Data Preprocessing
Data sampling
– Instead of dealing with entire data graphs, it reduces the size of data graphs by sampling and allows a certain loss of precision.
– The sampling process should ensure that the sampled data reflect the characteristics and information of the original data graphs as much as possible.
Data compression
– It generates small graphs from the original data graphs that preserve only the information relevant to queries.
– A specific compression method is applied to a specific query application, so data graph compression is not universal across query applications.
– Examples: reachability queries, neighbor queries

Data Preprocessing
Indexing
There are three main criteria for measuring the goodness of an indexing method:
– The space of a graph index
– The construction time of a graph index
– The query time with a graph index
Data partitioning
– Partition a data graph into relatively “small” graphs
– Hashing is a simple approach to random partitioning.
– There are well-established tools, e.g., Metis.
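Hash-based partitioning can be sketched as follows. This is a minimal illustration under assumptions: the fragment layout (`nodes`, `edges`, `cut_edges`) and the choice of CRC32 as the hash function are not from the slides, and an edge whose endpoints land in different fragments is recorded with the fragment of its source node.

```python
import zlib

def hash_partition(edges, num_parts):
    """Assign each node to a fragment by hashing its identifier."""
    def part(node):
        # CRC32 is used only because it is deterministic across runs.
        return zlib.crc32(str(node).encode("utf-8")) % num_parts

    fragments = [{"nodes": set(), "edges": [], "cut_edges": []}
                 for _ in range(num_parts)]
    for u, v in edges:
        pu, pv = part(u), part(v)
        fragments[pu]["nodes"].add(u)
        fragments[pv]["nodes"].add(v)
        if pu == pv:
            fragments[pu]["edges"].append((u, v))      # local edge
        else:
            fragments[pu]["cut_edges"].append((u, v))  # crosses fragments
    return fragments

# Hypothetical edge list of a small directed graph.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
fragments = hash_partition(edges, 2)
```

Hashing balances node counts on average but ignores the topology, so many edges may end up as cut edges; tools such as Metis instead try to keep the number of cut edges small.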

Summary

– We have introduced graph search: a new paradigm for social computing
– We have discussed the history and applications of graph search
– We have introduced and analyzed three types of graph search:
  – Cohesive subgraphs
  – Keyword search on graphs
  – Graph pattern matching
– We have also discussed the challenges and related techniques
– We have presented some useful techniques to solve the problems

References
[1] Chao Liu, Chen Chen, Jiawei Han, and Philip S. Yu. GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis. KDD.
[2] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349.
[3] Shuai Ma, Yang Cao, Jinpeng Huai, and Tianyu Wo. Distributed Graph Pattern Matching. WWW.
[4] M. Rice and V. J. Tsotras. Graph Indexing of Road Networks for Shortest Path Queries with Label Restrictions. VLDB.
[5] David A. Bader and Kamesh Madduri. A Graph-Theoretic Analysis of the Human Protein-Interaction Network Using Multicore Parallel Algorithms. Parallel Computing.
[6] Shuai Ma, Yang Cao, Tianyu Wo, and Jinpeng Huai. Social Networks and Graph Matching. Communications of CCF.
[7] C. C. Aggarwal and H. Wang. Managing and Mining Graph Data. Springer.
[8] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE.
[9] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu. Graph Pattern Matching: From Intractable to Polynomial Time. VLDB.
[10] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu. Graph Homomorphism Revisited for Graph Matching. VLDB.

References
[11] Hossein Maserrat and Jian Pei. Neighbor Query Friendly Compression of Social Networks. KDD.
[12] Brian Gallagher. Matching Structure and Semantics: A Survey on Graph-Based Pattern Matching. AAAI FS.
[13] Marko A. Rodriguez and Peter Neubauer. The Graph Traversal Pattern. Graph Data Management, 2011.
[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press.
[15] Mehdi Kargar and Aijun An. Keyword Search in Graphs: Finding r-cliques. VLDB.
[16] Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo. Capturing Topology in Graph Pattern Matching. VLDB.
[17] Wenfei Fan. Graph Pattern Matching Revised for Social Network Analysis. ICDT.
[18] Eytan Adar and Christopher Re. Managing Uncertainty in Social Networks. IEEE Data Eng. Bull., 30(2), pp. 15-22.
[19] Gueorgi Kossinets. Effects of Missing Data in Social Networks. Social Networks 28.
[20] Daniel Peng and Frank Dabek. Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI.

Homepage:
Address: Room G1122, New Main Building, Beihang University
Acknowledgement: Yang Cao, Wenfei Fan, Kaiyu Feng, Jinpeng Huai, Jia Li, Jianzhong Li, Xudong Liu, Nan Tang, Tianyu Wo, Yinghui Wu, …

Book Recommendation
– Databases and Logic
– Computational Complexity
– Algorithms
– Formal Languages
– Statistics and Social Networks
– Graph Theory

Thanks!