Published by Kalle Tuominen. Modified over 5 years ago.
1
A Case Study of the Use of NoSQL Databases by Some Companies
April Song and Sarah Graupman
2
Apollo
Facebook is trying to address latency problems by switching to Apollo, a NoSQL database that it built internally in C++. Apollo uses Raft, a consensus protocol that ensures all replicas agree on each state transition. Storage is mostly handled by RocksDB. The read() and write() methods are atomic: each operation either completes in full or, if any part of it fails, has no effect. The fault-tolerant state machines keep the system running even if one of the nodes dies.
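To make the agreement step concrete, here is a minimal sketch of the majority rule a Raft leader uses to decide that a log entry is committed. It is not Apollo's code: the function name and its arguments are hypothetical, and leader election, log repair, and networking are left out.

```python
def majority_commit_index(match_index, log_terms, current_term, commit_index):
    """Return the new commit index for a Raft leader (simplified sketch).

    match_index:  highest log index known to be stored on each server,
                  including the leader itself (log indexes start at 1).
    log_terms:    log_terms[i - 1] is the term of the leader's entry at index i.
    current_term: the leader's current term.
    commit_index: the commit index before this call.
    """
    majority = len(match_index) // 2 + 1
    # Walk down from the newest entry and stop at the first one that is stored
    # on a majority of servers and was created in the leader's current term.
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n
    return commit_index


# Five servers: entry 3 is on three of them, entries 4 and 5 on only two.
new_commit = majority_commit_index(
    match_index=[5, 5, 3, 2, 1],
    log_terms=[1, 1, 2, 2, 2],
    current_term=2,
    commit_index=1,
)
```

Once an entry is committed this way, every replica applies it to its state machine in the same order, which is what lets the system survive the loss of a minority of nodes.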
3
Apache Cassandra
Apache Cassandra is a NoSQL database originally created at Facebook to power inbox search. Its design goals were high availability, eventual consistency, and incremental scalability. A write can be sent to any node in the cluster. Cassandra is currently used by companies including Comcast, eBay, GitHub, Hulu, Instagram, Netflix, Reddit, The Weather Channel, and Apple.
4
Cassandra (Continued)
Read and write throughput increases linearly as the number of machines increases. In experiments at the University of Toronto (Rabl et al., 2012), Cassandra scaled better than the other NoSQL databases tested. Cassandra's read latency stays roughly constant regardless of the number of nodes.
5
Others
Facebook uses a distributed system called Scribe to transport all of its data. It then uses the processing systems Puma, Swift, and Stylus, which allow computation and analysis of the data in Java, Python, and C++, respectively. Facebook also uses data stores such as Laser (a key-value store built on RocksDB), Scuba, and Hive. This range of tools lets Facebook adapt to the many different needs of a large company. The strategy has a cost, though: there is significant overhead in maintaining all of these systems and keeping them compatible with each other.
6
Dynamo
Amazon focuses on the reliability of its data because even a brief outage can have large financial and customer-relationship consequences. To achieve this, it manages its data through multiple instances of Dynamo in multiple data centers around the world. Dynamo is designed so that the data store is always writeable, ensuring that customers can always add and remove items from their shopping carts, even during network or server failures.
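An always-writeable store can end up with divergent copies of the same cart when replicas accept writes during a failure. The sketch below shows one way an application could reconcile them. The function name and the merge policy (union of items, largest quantity wins) are assumptions made for illustration, not Amazon's implementation.

```python
def merge_carts(versions):
    """Merge divergent shopping-cart versions into one cart.

    Each version maps item name -> quantity. The merged cart contains every
    item seen in any version, so an "add to cart" is never lost.
    """
    merged = {}
    for cart in versions:
        for item, quantity in cart.items():
            merged[item] = max(quantity, merged.get(item, 0))
    return merged


# Two replicas accepted different writes while they could not reach each other.
replica_a = {"book": 1, "pen": 2}
replica_b = {"book": 1, "lamp": 1}
cart = merge_carts([replica_a, replica_b])
```

The trade-off of a union-style merge is that an item removed on one replica can reappear after reconciliation, a behavior the Dynamo paper acknowledges for its shopping cart.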
7
Dynamo (Continued)
Amazon DynamoDB, the cloud database service that builds on Dynamo's design, supports both document and key-value models. Being a cloud database makes it a good fit for web, gaming, and IoT workloads. It reduces latency with Amazon DynamoDB Accelerator (DAX), which is a cache. A cache shortens retrieval time whenever the requested data is already in the cache.
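The effect of a cache in front of a data store can be sketched with a small read-through cache. This is a generic illustration and does not use the DAX client API. The class name, the `load_from_store` callback, and the five-minute time-to-live are all assumptions.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the store on a miss."""

    def __init__(self, load_from_store, ttl_seconds=300.0, clock=time.monotonic):
        self._load = load_from_store  # hypothetical callback: key -> value
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1  # fast path: no round trip to the store
            return entry[0]
        self.misses += 1  # slow path: read the store, then remember the value
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value


# A dict stands in for the database table in this sketch.
table = {"user#1": {"name": "Ada"}}
store_reads = []


def load(key):
    store_reads.append(key)
    return table[key]


cache = ReadThroughCache(load)
first = cache.get("user#1")   # miss: goes to the store
second = cache.get("user#1")  # hit: served from memory
```

Only the first read reaches the store. Every later read of the same key within the time-to-live is answered from memory, which is where the latency saving comes from.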
9
NoSQL data store: Voldemort
Key-value store, developed at LinkedIn in 2008 and open sourced in 2009
Highly available, low-latency distributed data store
Based on Amazon's Dynamo
Built as an alternative to LinkedIn's Oracle system for features like "Who viewed my profile"
Currently about 10 clusters, over 100 nodes, and several hundred stores
Key-value storage is hosted in a distributed hash table. Each store maps to a single cluster, with the store partitioned across all nodes.
Each store has a replication factor, a required number of nodes that must participate in read and write operations, and a schema.
Querying is done by primary key. Values can be lists, so denormalization is key.
Keys for the same store are placed on a hash ring.
Conflicts? Every replica can accept writes. Conflicts are resolved during reads, using vector clocks.
Of the 10 clusters, 9 serve read-write traffic across multiple data centers, while one serves traffic from a custom read-only storage engine.
The largest cluster is about 60% reads and 40% writes, and handles ~10K queries per second at peak with a latency of 3 ms.
Stores range from 8 KB to 2.9 TB.
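The idea of resolving conflicts at read time with vector clocks can be sketched as follows. This is a minimal illustration, not Voldemort's API: the function names are hypothetical, and a clock is modeled as a plain dict from node name to counter.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify clock a relative to clock b."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"


def resolve_on_read(versions):
    """Drop versions that are strictly older than another version.

    versions is a list of (clock, value) pairs. Whatever survives is either a
    single winner or a set of concurrent versions the application must merge.
    """
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and compare(other, clock) == "after"
            for j, (other, _) in enumerate(versions)
        )
        if not dominated:
            survivors.append((clock, value))
    return survivors


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "A")   # later write through node A
v3 = increment(v1, "B")   # write through node B that never saw v2
remaining = resolve_on_read([(v1, "x"), (v2, "y"), (v3, "z")])
```

Here `v1` is discarded because `v2` supersedes it, while `v2` and `v3` are concurrent and both come back to the reader, which then decides how to merge them.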
10
Document store: Espresso
Document store, developed in 2011
Bridges the gap between Voldemort and relational database management systems (RDBMS)
Tables have schemas represented in JSON
Like MongoDB, each table is a container of documents
Master/slave relationship
11
Databus
Strong timeline consistency
User-space processing
Support for long look-back queries
Low latency
Components: Relay, Bootstrap Server, and Client Library
Databus is a change data capture system used to enable complex online computation with strict latency bounds. It is a pipeline for transporting events from databases to applications. It has adapters for Oracle and MySQL, and an API for getting data from multiple data stores.
User-space processing is the ability to perform computations triggered by data changes outside the database server.
Relay: captures changes in the data source, serializes and buffers them, and serves them to the client library and the bootstrap server.
Bootstrap server: handles long look-back queries, isolating the source database from having to serve them.
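The division of labor between the relay, the bootstrap server, and the client library can be sketched in a few lines. This is a simplified illustration and not the Databus API: the class and function names are hypothetical, events carry a plain integer sequence number, and the bootstrap server here simply keeps the whole change log.

```python
from collections import deque


class Relay:
    """In-memory buffer of the most recent change events."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)  # oldest events fall off the end

    def capture(self, seq, change):
        self._buffer.append((seq, change))

    def oldest_seq(self):
        return self._buffer[0][0] if self._buffer else None

    def events_since(self, last_seq):
        return [(s, c) for s, c in self._buffer if s > last_seq]


class BootstrapServer:
    """Keeps older changes so look-back queries never reach the source database."""

    def __init__(self):
        self._log = []

    def capture(self, seq, change):
        self._log.append((seq, change))

    def events_since(self, last_seq):
        return [(s, c) for s, c in self._log if s > last_seq]


def pull(relay, bootstrap, last_seq):
    """Client-library logic: use the relay if it still holds the next event."""
    oldest = relay.oldest_seq()
    if oldest is not None and oldest <= last_seq + 1:
        return "relay", relay.events_since(last_seq)
    return "bootstrap", bootstrap.events_since(last_seq)


relay = Relay(capacity=3)
bootstrap = BootstrapServer()
for seq in range(1, 6):
    change = {"row": seq}
    relay.capture(seq, change)      # relay keeps only events 3, 4, and 5
    bootstrap.capture(seq, change)  # bootstrap keeps all five

recent_source, recent_events = pull(relay, bootstrap, last_seq=3)
old_source, old_events = pull(relay, bootstrap, last_seq=0)
```

A client that is nearly caught up is served from the relay's buffer, while a client that has fallen far behind is routed to the bootstrap server instead of the source database.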
12
https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale
Hadoop: multiple clusters with over 500 PB of data
Manhattan: key-value store (stores tweets, DMs, accounts)
Graph: a Gizzard cluster holds the Flock social graphs
FlockDB: stores graphs as sets of edges between nodes identified by integers. The nodes here are user IDs and tweet IDs, kept in a MySQL database.
As of 2010, FlockDB stored over 13 billion edges and handled 20K writes/second and 100K reads/second.
Data is retrieved through an app server called flapp and partitioned through a library called Gizzard.
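Storing a graph as integer edges in partitioned relational tables can be sketched as below. This is an illustration under stated assumptions, not FlockDB's schema or Gizzard's routing: in-memory SQLite databases stand in for the MySQL shards, the table and function names are hypothetical, and the modulo rule is a simplification of how a partitioning library would map IDs to shards.

```python
import sqlite3


def open_shards(count):
    """Create `count` in-memory databases, each holding one slice of the edges."""
    shards = []
    for _ in range(count):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE edges ("
            "source_id INTEGER NOT NULL, "
            "destination_id INTEGER NOT NULL, "
            "PRIMARY KEY (source_id, destination_id))"
        )
        shards.append(conn)
    return shards


def shard_for(shards, source_id):
    # Assumed routing rule: all edges leaving one node live on the same shard.
    return shards[source_id % len(shards)]


def add_edge(shards, source_id, destination_id):
    conn = shard_for(shards, source_id)
    # The primary key makes the edge set duplicate-free.
    conn.execute(
        "INSERT OR IGNORE INTO edges VALUES (?, ?)", (source_id, destination_id)
    )
    conn.commit()


def destinations(shards, source_id):
    """Return the IDs that `source_id` points to, for example the accounts it follows."""
    conn = shard_for(shards, source_id)
    rows = conn.execute(
        "SELECT destination_id FROM edges WHERE source_id = ? ORDER BY destination_id",
        (source_id,),
    )
    return [row[0] for row in rows]


shards = open_shards(2)
add_edge(shards, 1, 20)
add_edge(shards, 1, 10)
add_edge(shards, 1, 10)  # duplicate edge, ignored
add_edge(shards, 2, 1)
following = destinations(shards, 1)
```

Because every edge that leaves a node is stored on one shard, a query such as "whom does user 1 follow" touches a single database.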
13
Overall trends
Focus on low latency
Maintain both MySQL and NoSQL databases
Large enterprises are developing their own database systems and releasing them to the public
14
References
Auradkar, A., Botev, C., & Das, S. (2012). Data Infrastructure at LinkedIn. IEEE 28th International Conference on Data Engineering (ICDE).
Hashemi, M. (n.d.). The Infrastructure Behind Twitter: Scale | Twitter Blogs. Retrieved May 02, 2017, from https://blog.twitter.com/2017/the-infrastructure-behind-twitter-scale
Introducing FlockDB | Twitter Blogs. (2010, May 03). Retrieved May 02, 2017.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., & Vogels, W. (2007). Dynamo: Amazon's Highly Available Key-Value Store. ACM.
Rabl, T., Sadoghi, M., Jacobsen, H.-A., Gomez-Villamor, S., Muntes-Mulero, V., & Mankovskii, S. (2012). Solving Big Data Challenges for Enterprise Application Performance Management. VLDB Endowment.