Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis
Advanced Data Management Technologies Laboratory
Department of Computer Science, University of Pittsburgh
Data in Social Networks
A social network manages user profiles, updates and connections
How can this data be managed in a scalable way?
  Key-value stores offer performance under high load
Some observations about social networks
  A profile view usually includes data from a user’s friends
    Spatial locality
  A friend’s profile is often visited next
    Temporal locality
  Requests might ask for updates from several users
  Web pages might include pieces of several user profiles
  A single request requires connecting to many machines
Connections in a Social Network
[Figure: graph of users and their connections; the only surviving label is “Alice”.]
Leveraging Locality
Can we take advantage of the connections?
What if we stored connected users’ profiles and data in the same place?
  Locality can be leveraged
  The number of machine connections per request is reduced
  User data can be pre-fetched
We can think of this as a graph partitioning problem…
  Partitions = machines
  Vertices = user profiles, including updates
  Edges = connections
  Objective: minimize the number of edges that cross partitions
Example – graph partitioning
Many edges cross partitions
  Accessing a vertex’s neighbors requires accessing many partitions
  In a social network, requesting updates from followed users requires connecting to many machines
Far fewer edges cross partitions
  Accessing a vertex’s neighbors requires accessing few partitions
  In a social network, fewer connections are made and related user data can be pre-fetched
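The objective on these two slides can be made concrete by counting the edges that cross partitions. The sketch below is only an illustration: the five users, their connections, and the two placements are hypothetical and do not come from the slides.

```python
def edge_cut(edges, partition_of):
    """Count the edges whose two endpoints are placed on different partitions."""
    return sum(1 for u, v in edges if partition_of[u] != partition_of[v])


# Hypothetical five-user network (illustrative names and edges only).
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "erin"),
    ("carol", "dave"),
]

# A placement that ignores connections: most edges cross machines.
scattered = {"alice": 0, "bob": 1, "carol": 0, "dave": 1, "erin": 0}

# A placement that keeps connected users together: one edge crosses.
clustered = {"alice": 0, "bob": 0, "carol": 0, "dave": 1, "erin": 1}

print(edge_cut(edges, scattered), edge_cut(edges, clustered))
```

A lower edge cut means that reading a user together with their neighbors touches fewer machines, which is the quantity the partitioning is meant to minimize.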
Key-Key-Value Stores
Our proposed approach: extend the key-value model
  Data can be stored as key-values
    User profiles
  Data can also be stored as key-key-values
    User connections
    “Alice follows Bob”
Use key-key-values to compute locality
On-line graph partitioning algorithm
  Assign keys to grid locations based on connections
  Each grid cell represents a data host
  Keys that are related are kept together
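One way to picture the extended model is a store that holds ordinary key-values next to key-key-value triples. The class below is a minimal in-memory sketch under our own assumptions: the name KeyKeyValueStore and its methods are hypothetical and are not the API of the proposed system.

```python
from collections import defaultdict


class KeyKeyValueStore:
    """Toy in-memory store for key-values and key-key-values (illustrative only)."""

    def __init__(self):
        self.kv = {}                 # key -> value, e.g. a user profile
        self.kkv = defaultdict(set)  # key -> set of (key1, key2, value) triples

    def put_kv(self, key, value):
        self.kv[key] = value

    def put_kkv(self, key1, key2, value):
        # Index the same triple under both keys so either endpoint can find it.
        triple = (key1, key2, value)
        self.kkv[key1].add(triple)
        self.kkv[key2].add(triple)

    def neighbors(self, key):
        """Keys connected to `key` by at least one key-key-value."""
        return {k2 if k1 == key else k1 for k1, k2, _ in self.kkv.get(key, ())}


store = KeyKeyValueStore()
store.put_kv("alice", {"name": "Alice"})
store.put_kv("bob", {"name": "Bob"})
store.put_kkv("alice", "bob", "follows")  # "Alice follows Bob"
print(store.neighbors("bob"))
```

The triple keeps its direction (alice first, bob second), while the per-key index gives the connection information a partitioner needs to judge which keys are related.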
Outline
Introduction
  Data in Social Networks
  Leveraging Locality
  Key-Key-Value Stores
System Model
  Client API
  Adding a Key-Key-Value
  Load Management
On-line Partitioning Algorithm
Simulation
  Parameters
  Results
Conclusion
[Figure: layered architecture with physical hosts, virtual hosts, the address table, and application sessions.]
Physical Layer
  Physical machines can be added or removed dynamically as demands change
Logical Layer
  Virtual machines, organized as a square grid
  Run the KKV store software
  Manage replication
  Can be moved between physical machines as needed
Address Table (mapping store)
  A transactional, distributed hash table that maps keys to virtual machines
Application Layer
  Client API
  Maintains client sessions and cached data
Client API and Sessions
Clients use a simple API that includes the get, put and sync commands
Data is pulled from the logical layer in blocks
  Blocks are groups of related keys
  The client API keeps data in an in-memory cache
Data is pushed out asynchronously to virtual nodes in blocks
Push/pull can be done synchronously if requested by the client
  Offers stronger consistency at the cost of performance
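The session behaviour described here (block pulls into a cache, deferred pushes, an explicit sync) might look roughly like the following. This is a sketch under stated assumptions: the slides name only get, put and sync, so the class name, the signatures, the block_of callback, and the plain dict standing in for the logical layer are all hypothetical. Asynchronous pushing is reduced to "deferred until sync" to keep the example single-threaded.

```python
class ClientSession:
    """Toy client session with a write-back cache (illustrative only)."""

    def __init__(self, backend, block_of):
        self.backend = backend    # dict standing in for the logical layer
        self.block_of = block_of  # hypothetical: key -> keys fetched together
        self.cache = {}           # in-memory cache of pulled or written data
        self.dirty = set()        # keys written locally but not yet pushed

    def get(self, key):
        if key not in self.cache:
            # Pull the whole block of related keys, not just the one requested.
            for k in self.block_of(key):
                if k in self.backend and k not in self.cache:
                    self.cache[k] = self.backend[k]
        return self.cache.get(key)

    def put(self, key, value):
        # Writes land in the cache first; they reach the backend on sync().
        self.cache[key] = value
        self.dirty.add(key)

    def sync(self):
        # Push every pending write to the backend in one batch.
        for k in self.dirty:
            self.backend[k] = self.cache[k]
        self.dirty.clear()


backend = {"alice": "profile A", "bob": "profile B"}
session = ClientSession(backend, block_of=lambda key: ["alice", "bob"])

session.get("alice")                  # also caches bob's data (same block)
session.put("alice", "profile A v2")  # backend still holds the old value
session.sync()                        # now the backend is updated
```

Calling sync after every put gives the stronger-consistency mode mentioned on the slide, at the cost of one push per write.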
Adding a key-key-value
[Figure: a client issues put(alice, bob, follows); the address table points the keys alice and bob at virtual hosts 1,1 and 8,8; each host holds its kv(alice,...) or kv(bob,...) entries plus kkv(alice, bob, follows).]
Two users: Alice and Bob
  Use the address table to determine the virtual machine (node) that hosts Alice’s data
  Write the data to that node
  Use the address table to determine the node that hosts Bob’s data
  Write the same data to that node
The on-line partitioning algorithm moves Alice’s data to Bob’s node because they are connected
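The write path on this slide can be sketched with two dictionaries. Everything here is an assumption made for illustration: the helper names put_kkv and move_key are hypothetical, and we assume alice starts on node 1,1 and bob on node 8,8, since the flattened figure does not make that assignment unambiguous.

```python
# Assumed starting placement (not confirmed by the slide).
address_table = {"alice": (1, 1), "bob": (8, 8)}

# node -> set of (key1, key2, value) triples stored on that node
nodes = {(1, 1): set(), (8, 8): set()}


def put_kkv(key1, key2, value):
    """Write the same key-key-value to the node of each of its two keys."""
    triple = (key1, key2, value)
    for key in (key1, key2):
        nodes[address_table[key]].add(triple)
    return triple


def move_key(key, target):
    """Relocate a key's key-key-values to `target` and update the address table."""
    source = address_table[key]
    address_table[key] = target
    for triple in list(nodes[source]):
        if key in triple[:2]:
            nodes[target].add(triple)
            # Keep the source copy only if the other endpoint still lives there.
            if all(address_table[k] != source for k in triple[:2]):
                nodes[source].discard(triple)


triple = put_kkv("alice", "bob", "follows")  # stored on both 1,1 and 8,8
move_key("alice", (8, 8))                    # what the partitioner would do
```

After the move both keys resolve to the same node, so a request for Alice and the users she follows needs a single connection.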
Splitting a Node
[Figure: grid of virtual hosts.]
If one node becomes overloaded, it can initiate a split
To maintain the grid structure, nodes in the same row and column must also split
Once the split is complete, new physical machines can be turned on
Virtual nodes can be transferred to these new machines
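The slide does not spell out the geometry of a split, so the following is one possible reading, stated as an assumption: the overloaded cell becomes four cells, every other cell in its row or column becomes two, and an n x n grid grows to (n+1) x (n+1). The function name split_grid and the 1-based coordinates are our own choices.

```python
def split_grid(n, row, col):
    """Map each cell of an n x n grid to its cells after splitting at (row, col).

    Assumed interpretation: row `row` becomes two rows and column `col` becomes
    two columns, so the grid stays square. Coordinates are 1-based.
    """
    def expand(index, split_at):
        if index < split_at:
            return [index]             # before the split line: unchanged
        if index == split_at:
            return [index, index + 1]  # on the split line: doubled
        return [index + 1]             # after the split line: shifted by one

    return {
        (r, c): [(nr, nc) for nr in expand(r, row) for nc in expand(c, col)]
        for r in range(1, n + 1)
        for c in range(1, n + 1)
    }


mapping = split_grid(2, 1, 1)
for old_cell, new_cells in mapping.items():
    print(old_cell, "->", new_cells)
```

Under this reading only the cells in the affected row and column have to divide their keys; all other cells are merely renumbered.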
On-line Partitioning Algorithm
Runs periodically in parallel on each virtual node
  Also after a split or merge
For each key stored on a node
  Determine the number of connections (key-key-values) with keys on other nodes
    Can also be the sum of edge weights
  Find the node that has the most connections
  If that node is different from the current node
    If the number of connections to that node is greater than the number of connections to the current node
      If this margin is greater than some threshold
        Move the key to the other node
        Update the address table
Designed to work in a distributed, dynamic setting
  NOT a replacement for off-line algorithms in static settings
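The move rule above can be written as a single pass over the keys of one node. This is a sketch under assumptions, not the authors' implementation: the function name repartition_node, the (key1, key2, weight) edge format, and the key names are hypothetical. The example data is chosen so that key x on node 1,1 has edge sums of 0, 2 and 1 towards nodes 1,1, 2,1 and 1,2.

```python
from collections import Counter


def repartition_node(node, address_table, edges, threshold=0):
    """Apply the move rule once to every key currently placed on `node`.

    `edges` is a list of (key1, key2, weight) tuples; with weight 1 the sums
    are plain connection counts. Returns the moves made as {key: new node}.
    """
    moves = {}
    for key in [k for k, n in address_table.items() if n == node]:
        # Sum the edge weights between this key and the keys on each node.
        counts = Counter()
        for k1, k2, weight in edges:
            if k1 == key:
                counts[address_table[k2]] += weight
            elif k2 == key:
                counts[address_table[k1]] += weight
        if not counts:
            continue  # a key without connections stays where it is
        best, best_sum = max(counts.items(), key=lambda item: item[1])
        # Move only to a different node, and only if the margin over the
        # current node exceeds the threshold.
        if best != node and best_sum - counts[node] > threshold:
            address_table[key] = best  # update the address table
            moves[key] = best
    return moves


address_table = {"x": (1, 1), "a": (2, 1), "b": (2, 1), "c": (1, 2)}
edges = [("x", "a", 1), ("x", "b", 1), ("x", "c", 1)]

moves = repartition_node((1, 1), address_table, edges)
print(moves)
```

Raising the threshold makes keys stickier: with threshold=2 the same key would stay on node 1,1, because a margin of 2 is not strictly greater than 2.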
Partitioning Example
[Figures: graph drawn over nodes 1,1, 2,1 and 1,2.]
Node   Sum(Edges)
1,1    0
2,1    2
1,2    1
Experimental Parameters
Parameter                         Values
No. Vertices (V)
Branching Factor (b)              10%-100% of V
Distribution of b                 Zipf
alpha                             1.5
Partitioning Algorithms           On-line, Kernighan-Lin
On-line Workload                  Random, pre-generated history of edge inserts
On-line algorithm run frequency   Every V/10 inserts
On-line threshold                 Improvement > 0
Trials                            3 per graph size
Partitioning Quality Results
On-line partitions as well as Kernighan-Lin
[Figure: plot with axes “% Edges in partition” and “Vertices in graph”.]
Partitioning Performance Results
On-line partitions 2x faster than Kernighan-Lin!
[Figure: plot with axes “Vertices moved” and “Vertices in graph”.]
Conclusions
Contributions:
  A novel model for scalable graph data stores that extends the key-value model
    Key-key-value store
  A high-level system design
  A novel on-line partitioning algorithm
  Preliminary experimental results
Our proposed algorithm shows promise in the distributed, dynamic setting
What’s Ahead?
Prototype system implementation
  Java, PostgreSQL
Performance analysis
  Against MongoDB, Cassandra
Sensitivity analysis
Cloud deployment
Thank You!
Acknowledgments
  Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder
  ADMT Lab, CS Department, Pitt
  GPSA, Pitt A&S GSO, Pitt A&S PBC