Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis
Advanced Data Management Technologies Laboratory, Department of Computer Science, University of Pittsburgh

Data in Social Networks
- A social network manages user profiles, updates and connections.
- How can this data be managed in a scalable way? Key-value stores offer performance under high load.
- Some observations about social networks:
  - A profile view usually includes data from a user's friends (spatial locality).
  - A friend's profile is often visited next (temporal locality).
  - Requests might ask for updates from several users, and web pages might include pieces of several user profiles, so a single request requires connecting to many machines.

Connections in a Social Network
[Figure: example connection graph centered on the user Alice.]

Leveraging Locality
- Can we take advantage of the connections? What if we stored connected users' profiles and data in the same place?
- Locality can be leveraged: the number of connections to other machines is reduced, and user data can be pre-fetched.
- We can think of this as a graph partitioning problem (see the sketch after this list):
  - Partitions = machines
  - Vertices = user profiles, including updates
  - Edges = connections
  - Objective: minimize the number of edges that cross partitions
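To make the objective concrete, here is a minimal sketch (illustrative only, not taken from the paper) of how the cut-edge count for a given key-to-machine assignment can be computed:

```python
# Minimal sketch of the partitioning objective (illustrative only).
# 'assignment' maps each vertex (user key) to a partition (machine),
# 'edges' is the list of connections (key-key-values).

def cut_edges(assignment, edges):
    """Count edges whose endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Example: Alice and Bob on the same machine, Carol on another.
assignment = {"alice": 0, "bob": 0, "carol": 1}
edges = [("alice", "bob"), ("alice", "carol")]
print(cut_edges(assignment, edges))  # 1 edge crosses partitions
```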

Example: Graph Partitioning
- Poor partitioning: many edges cross partitions, so accessing a vertex's neighbors requires accessing many partitions. In a social network, requesting updates from followed users requires connecting to many machines.
- Good partitioning: far fewer edges cross partitions, so accessing a vertex's neighbors requires accessing few partitions. In a social network, fewer connections are made and related user data can be pre-fetched.

Key-Key-Value Stores
- Our proposed approach: extend the key-value model.
  - Data can be stored as key-values (e.g., user profiles).
  - Data can also be stored as key-key-values (e.g., user connections: "Alice follows Bob").
- Use key-key-values to compute locality:
  - An on-line graph partitioning algorithm assigns keys to grid locations based on connections.
  - Each grid cell represents a data host.
  - Keys that are related are kept together.
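As an illustration of the data model (a sketch, not the authors' schema; field names are assumptions), key-values and key-key-values can be pictured as two kinds of records:

```python
# Illustrative sketch of the two record types in a key-key-value store.
# The field names are assumptions for the example, not the paper's schema.

kv_records = {
    "alice": {"name": "Alice", "updates": ["Hello world"]},  # key-value: a user profile
    "bob":   {"name": "Bob",   "updates": []},
}

kkv_records = [
    # key-key-value: a relationship between two keys, e.g. "Alice follows Bob"
    {"key1": "alice", "key2": "bob", "value": "follows"},
]
```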

Outline
- Introduction: Data in Social Networks, Leveraging Locality
- Key-Key-Value Stores: System Model, Client API, Adding a Key-Key-Value, Load Management, On-line Partitioning Algorithm
- Simulation: Parameters, Results
- Conclusion

System Model
- Physical Layer (physical hosts): physical machines can be added or removed dynamically as demands change.
- Logical Layer (virtual hosts): virtual machines organized as a square grid; they run the KKV store software, manage replication, and can be moved between physical machines as needed.
- Address Table: a transactional, distributed hash table that maps keys to virtual machines.
- Application Layer: the client API maintains client sessions and cached data.
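As a rough illustration of the address table's role, here is a minimal single-process sketch; the real table is transactional and distributed, and the hash-based default placement below is an assumption for the example (the partitioning algorithm later reassigns keys based on their connections):

```python
# Illustrative sketch of an address table mapping keys to virtual hosts
# arranged in a square grid. Hash-based default placement is an assumption.

import hashlib

GRID_SIZE = 8            # e.g., an 8x8 grid of virtual hosts
address_table = {}       # key -> (row, col) of the virtual host storing it

def locate(key):
    """Return the grid cell for a key, defaulting to a hash-based cell."""
    if key not in address_table:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        address_table[key] = (h % GRID_SIZE, (h // GRID_SIZE) % GRID_SIZE)
    return address_table[key]

print(locate("alice"))   # the virtual host (grid cell) holding Alice's data
```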

Client API and Sessions
- Clients use a simple API that includes the get, put and sync commands.
- Data is pulled from the logical layer in blocks (groups of related keys).
- The client API keeps data in an in-memory cache.
- Data is pushed out asynchronously to virtual nodes in blocks.
- Push/pull can be done synchronously if requested by the client, which offers stronger consistency at the cost of performance.
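A hypothetical client-side view of this API (class and method names are assumptions for illustration, not the actual interface):

```python
# Hypothetical sketch of the client API described on this slide.

class KKVClient:
    def __init__(self):
        self.cache = {}   # in-memory cache of blocks of related keys
        self.dirty = {}   # writes waiting to be pushed asynchronously

    def get(self, key):
        """Return a value, pulling its block from the logical layer on a miss."""
        if key not in self.cache:
            self.cache.update(self._pull_block(key))
        return self.cache.get(key)

    def put(self, key, value):
        """Buffer a write; it is pushed out asynchronously in blocks."""
        self.cache[key] = value
        self.dirty[key] = value

    def sync(self):
        """Synchronously flush pending writes for stronger consistency."""
        self._push_block(self.dirty)
        self.dirty.clear()

    def _pull_block(self, key):
        return {}   # placeholder: fetch the block of related keys from a virtual node

    def _push_block(self, block):
        pass        # placeholder: send the block to the responsible virtual nodes
```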

Adding a Key-Key-Value
[Figure: put(alice, bob, follows) — the address table maps Alice and Bob to virtual hosts at grid cells 1,1 and 8,8; kkv(alice, bob, follows) is written next to kv(alice, ...) on one node and kv(bob, ...) on the other.]
- Two users: Alice and Bob.
- Use the address table to determine the virtual machine (node) that hosts Alice's data, and write the data to that node.
- Use the address table to determine the node that hosts Bob's data, and write the same data to that node.
- The on-line partitioning algorithm later moves Alice's data to Bob's node because they are connected.
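The write path above can be sketched as follows (helper names such as locate and write_to_node are assumptions for illustration, not the authors' API):

```python
# Illustrative sketch of the write path for a key-key-value. The same edge
# record is written to the node hosting each endpoint, so either side can
# find the connection locally.

def put_kkv(key1, key2, value, locate, write_to_node):
    """Store the key-key-value (key1, key2, value) on both endpoints' nodes."""
    record = {"key1": key1, "key2": key2, "value": value}
    for key in (key1, key2):
        node = locate(key)            # look up the hosting node in the address table
        write_to_node(node, record)   # append the edge record on that node

# Example: "Alice follows Bob" is written to Alice's node and to Bob's node.
# put_kkv("alice", "bob", "follows", locate, write_to_node)
```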

Splitting a Node
- If one virtual node becomes overloaded, it can initiate a split.
- To maintain the grid structure, nodes in the same row and column must also split.
- Once the split is complete, new physical machines can be turned on and virtual nodes can be transferred to them.

Outline
- Introduction: Data in Social Networks, Leveraging Locality
- Key-Key-Value Stores: System Model, Client API, Adding a Key-Key-Value, Load Management, On-line Partitioning Algorithm
- Simulation: Parameters, Results
- Conclusion

On-line Partitioning Algorithm
- Runs periodically in parallel on each virtual node, and also after a split or merge.
- For each key stored on a node:
  - Determine the number of connections (key-key-values) it has with keys on each other node; this can also be the sum of edge weights.
  - Find the node that has the most connections.
  - If that node is different from the current node, and its connection count exceeds the current node's by more than some threshold, move the key to that node and update the address table.
- Designed to work in a distributed, dynamic setting; NOT a replacement for off-line algorithms in static settings.
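The slide's description can be written out as the following sketch (Python for illustration; the prototype mentioned later is Java/PostgreSQL, and the helper names here are assumptions):

```python
# Sketch of one periodic pass of the on-line partitioning algorithm,
# as described on this slide. 'neighbors', 'owner', and the data structures
# are assumed helpers, not the authors' code.

def repartition_step(node, keys_on_node, neighbors, owner, address_table, threshold=0):
    """Consider moving each local key to the node holding most of its neighbors."""
    for key in list(keys_on_node):
        # Count connections from this key to each node (could be edge-weight sums).
        connections = {}
        for other_key in neighbors(key):
            target = owner(other_key)          # node currently hosting the neighbor
            connections[target] = connections.get(target, 0) + 1

        if not connections:
            continue
        best_node = max(connections, key=connections.get)
        here = connections.get(node, 0)

        # Move the key only if another node is a strictly better home
        # by more than the configured threshold.
        if best_node != node and connections[best_node] - here > threshold:
            address_table[key] = best_node     # update the address table
            keys_on_node.remove(key)           # hand the key over to best_node
```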

Partitioning Example
[Figure: a key hosted on node 1,1 with connections to keys on nodes 2,1 and 1,2.]

Node   Sum(Edges)
1,1    0
2,1    2
1,2    1

Sum(Edges) counts the key's connections to each node; since node 2,1 holds the most connections (2 > 0), the key is moved from node 1,1 to node 2,1.
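A quick self-contained check of this example (illustrative Python):

```python
# Worked check of this slide's example.
connections = {(1, 1): 0, (2, 1): 2, (1, 2): 1}   # Sum(Edges) per node
current = (1, 1)
best = max(connections, key=connections.get)
if best != current and connections[best] - connections[current] > 0:
    print("move key to node", best)               # -> move key to node (2, 1)
```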

Experimental Parameters

Parameter                          Values
No. Vertices (V)
Branching Factor (b)               10%-100% of V
Distribution of b                  Zipf, alpha = 1.5
Partitioning Algorithms            On-line, Kernighan-Lin
On-line Workload                   Random, pre-generated history of edge inserts
On-line algorithm run frequency    Every V/10 inserts
On-line threshold                  Improvement > 0
Trials                             3 per graph size

Partitioning Quality Results
[Figure: % of edges within a partition vs. vertices in graph.]
Takeaway: the on-line algorithm partitions as well as Kernighan-Lin.

Partitioning Performance Results
[Figure: vertices moved vs. vertices in graph.]
Takeaway: the on-line algorithm partitions 2x faster than Kernighan-Lin.

Conclusions
Contributions:
- A novel model for scalable graph data stores that extends the key-value model: the key-key-value store
- A high-level system design
- A novel on-line partitioning algorithm
- Preliminary experimental results: our proposed algorithm shows promise in the distributed, dynamic setting

What's Ahead?
- Prototype system implementation (Java, PostgreSQL)
- Performance analysis against MongoDB and Cassandra
- Sensitivity analysis
- Cloud deployment

Thank You!
Acknowledgments: Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder; ADMT Lab, CS Department, Pitt; GPSA; Pitt A&S GSO; Pitt A&S PBC