MySQL to NoSQL: Data Modeling Challenges in Supporting Scalability
Harokopio University - Department of Informatics and Telematics
MSc Program "Informatics and Telematics"
Nikos Samartzopoulos (itp12406)
1. Introduction and Background
Large data sets are driving adoption of NoSQL technologies.
Transitioning from relational persistence to NoSQL persistence is non-trivial.
The challenges of making this transition arise in many areas: software architecture, data modeling, deployment, developer skill sets, system operations, etc.
The authors report on making this transition within the context of a large research enterprise: Project EPIC.
The focus of the paper is on data modeling issues, but it touches on other issues as well.
Project EPIC
Investigates the use of social media to collaborate and coordinate during times of disaster.
The project team consists of researchers with skills in human-centered computing, software engineering, natural language processing, and network privacy and policy.
The focus is on collecting Twitter data, because Twitter is where people turn to ask for help, or to report on what they can do to help, during these events.
Software Engineering Challenges
Crisis informatics places significant demands on software engineering:
▫ Quality: colleagues require high-quality data sets for their research; collection for an event must run 24/7.
▫ Robustness: given the quality constraint, the data collection infrastructure must be robust in the face of network disconnects, system failures, rate limiting, etc.
▫ Scalability: a single event can generate millions of data points (tweets).
In its two years of deployment, the infrastructure has collected over 2B disaster-related status messages covering numerous mass emergency events, while maintaining 99% uptime.
Goal
Design, develop, and deploy a system capable of
▫ collecting,
▫ packaging, and
▫ analyzing research-quality data sets
▫ in real time and on demand, since natural disasters and similar events can occur at any moment.
Software Architecture, version 1
Persistence architecture based on relational technologies
Highly decoupled four-tier architecture
▫ applications, services, persistence, database
Production-class software
▫ Hibernate, Spring, Spring MVC
Infrastructure components
▫ Tomcat, MySQL and Lucene
Adopted MySQL because of its “one size fits all” nature
▫ Familiar data model plus great tool support
▫ Easy integration with Hibernate and Lucene
System Architecture (V.1)
The Problem
Relational databases are great when starting out
▫ Lots of tool support; well-understood technology
▫ Can be made to scale (mostly through $)
However, a relational-only approach to storage was not meeting their needs
▫ RDBMSs aren’t flexible; schema updates are painful
▫ Availability is less than ideal: single point of failure
▫ Data replication isn’t automatic; table scans are painful
(This does not mean that relational databases cannot be made to scale; they can, through sharding of data, memory caches, and a good data center: $$$)
2. NoSQL
NoSQL: Not Only SQL
NoSQL (Not Only SQL) technologies
▫ Models based on Google’s BigTable and Amazon’s Dynamo
▫ Storage of “big data” sets across clusters of machines
▫ Enable analysis on large data sets via, e.g., the MapReduce framework (Hadoop)
Enable flexibility, availability, and scalability
▫ Flexibility via no enforced schema
▫ Availability via replication of data across the cluster
▫ Scalability via the ability to add machines to the cluster
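As a rough illustration of how replication and "add a machine" scaling can fit together, here is a minimal sketch of a Dynamo-style hash ring. It is not Cassandra's actual implementation: the `Ring` class, the node names, and the choice of MD5 as the token function are all assumptions made for this example.

```python
import bisect
import hashlib


def _token(value):
    # Map a string to a position on the ring (MD5 is an assumption of this sketch).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each row key is stored on `replication_factor` nodes."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self._ring = sorted((_token(node), node) for node in nodes)

    def add_node(self, node):
        # Scaling out: a new machine takes over one slice of the ring.
        bisect.insort(self._ring, (_token(node), node))

    def replicas(self, row_key):
        # Walk clockwise from the key's position and collect distinct nodes.
        tokens = [token for token, _ in self._ring]
        start = bisect.bisect_right(tokens, _token(row_key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
print(ring.replicas("some-event"))   # two distinct nodes hold this row
ring.add_node("node-d")              # more capacity, no schema or client change
print(ring.replicas("some-event"))
```

Because every row key lives on more than one node, losing a single machine does not make the row unavailable, and adding a node only moves the keys that fall into its slice of the ring.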
Version 2: Moving to NoSQL
Added Apache Cassandra to their persistence tier
▫ Addressed the storage problems they encountered in version 1
Analytics can now occur in a variety of ways
▫ Hadoop; Lucene; SQL
Challenges
▫ Moving from a relational schema to a schema-less model is non-trivial
▫ Little to no tool support
▫ Largely undocumented frameworks and APIs
▫ Distributed-system expertise required; sysadmin skills a plus!
System Architecture (V.2)
Hybrid Persistence Architecture: Relational + NoSQL
3. Data Model: Before the Transition
Original Data Model (Simplified)
With ORM technology (Hibernate) this model is easily supported
The ORM framework handles all interactions with the database, so the client interacts only with objects and their relationships
▫ But it runs into problems when the tweets returned across the associations number in the hundreds of millions (the system runs out of memory)
The benefit, however, is flexible queries
4. Making the Transition
NoSQL technologies require “queries up front”
▫ Data retrieval is fast because query results are co-located
▫ This is by design; you have to write your data the way you want it to be retrieved
▫ Need to know up front what questions we are answering
Queries
▫ Find all tweets associated with an event
▫ Find all tweets associated with a user
▫ Find all tweets associated with an event in a given time range
Cassandra Data Model
To ensure that their new architecture could answer these queries, they needed to store data in a form that maps to Cassandra’s data model
A column family consists of rows, each identified by a row key that points to many columns
Each column has a column name and a column value
The design of the row key (built from, e.g., strings, dates and numbers) is critical: it allows the client to index into the column family and retrieve columns
Each row can have a different set of columns (no schema)
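The column-family model described on this slide can be pictured as a map from row keys to maps of column names to column values. The sketch below is an in-memory stand-in for illustration only; the `ColumnFamily` class and the sample keys and values are hypothetical and are not part of Cassandra or of the Project EPIC code.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy model: row key -> {column name: column value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column_name, column_value):
        # Any row may receive any column name: there is no enforced schema.
        self._rows[row_key][column_name] = column_value

    def get_row(self, row_key):
        # Columns are returned sorted by column name.
        return sorted(self._rows.get(row_key, {}).items())


cf = ColumnFamily()
cf.insert("row-1", "name", "first row")
cf.insert("row-1", "lang", "en")
cf.insert("row-2", "name", "second row")   # row-2 has no "lang" column

print(cf.get_row("row-1"))
print(cf.get_row("row-2"))
```

The row key is the only entry point into the structure, which is why its design determines which questions can be answered cheaply.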
First Query: Find all tweets associated with an event
The event name acts as the row key; each tweet's unique id is used as the column name within the event's row
▫ Each column stores the full JSON representation of the tweet, so no information about the tweet is lost
Rows potentially contain millions of tweets
▫ Problematic when Cassandra replicates keys and their associated data (columns) around a cluster of machines: all of a key’s data is replicated as a unit, causing long delays or timeouts when adding nodes to the cluster
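A minimal sketch of this first layout, using a plain dictionary in place of a Cassandra column family. The function names, the event name, and the sample tweets are hypothetical; only the shape (event name as row key, tweet id as column name, JSON as column value) comes from the slide.

```python
import json

# row key (event name) -> {column name (tweet id): column value (tweet as JSON)}
events = {}


def store_tweet(event_name, tweet):
    row = events.setdefault(event_name, {})
    row[tweet["id"]] = json.dumps(tweet)   # the full tweet is kept, nothing is dropped


def tweets_for_event(event_name):
    row = events.get(event_name, {})
    return [json.loads(row[tweet_id]) for tweet_id in sorted(row)]


store_tweet("sample-event", {"id": 2, "text": "second", "screen_name": "jsmith"})
store_tweet("sample-event", {"id": 1, "text": "first", "screen_name": "adoe"})

print(len(events["sample-event"]))          # every tweet for the event sits in one row
print(tweets_for_event("sample-event"))
```

The single row per event is what makes the query a one-key lookup, and it is also what makes the row grow without bound as an event unfolds.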
Second Query: Find all tweets associated with a user
Uses the secondary indexing feature provided by Cassandra
Secondary indexes allow the client to execute very simple queries against column values that Cassandra can index
▫ Ex. screen_name = ‘jsmith’
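Conceptually, a secondary index is an extra lookup structure from an indexed column value back to the rows that contain it. The sketch below maintains such an index by hand to show the idea; in Cassandra the database maintains it for you, and the row layout used here (one row per tweet) is an assumption, not the layout reported by the authors.

```python
from collections import defaultdict

tweets = {}                          # row key (tweet id) -> columns
by_screen_name = defaultdict(set)    # indexed column value -> row keys


def insert_tweet(tweet_id, columns):
    tweets[tweet_id] = columns
    # Keep the index in step with every write.
    by_screen_name[columns["screen_name"]].add(tweet_id)


def find_by_screen_name(screen_name):
    # Equality lookup only, mirroring the "very simple queries" restriction.
    return [tweets[tweet_id] for tweet_id in sorted(by_screen_name.get(screen_name, ()))]


insert_tweet(1, {"screen_name": "jsmith", "text": "need help"})
insert_tweet(2, {"screen_name": "adoe", "text": "can offer shelter"})
insert_tweet(3, {"screen_name": "jsmith", "text": "update"})

print(find_by_screen_name("jsmith"))
```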
Third Query: Find all tweets associated with an event in a given date range
Uses composite (string) row keys (event name : day) to store tweets in chunks of time
Decreasing the number of columns stored with each key also decreases the amount of data that must be moved when the key is replicated across the cluster, which improves replication speed
Data reads may also be more efficient: the client can now specify exactly the data it is interested in, instead of requesting all the data available
This approach replaces the one used for the first query
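The composite-key idea can be sketched as follows: tweets are written into one row per event per day, and a date-range query simply enumerates the day keys in the range. The exact key format (`event:YYYY-MM-DD`), the function names, and the sample data are assumptions for illustration.

```python
from datetime import date, datetime, timedelta

rows = {}   # composite row key "event:day" -> {tweet id: tweet body}


def row_key(event_name, day):
    return f"{event_name}:{day.isoformat()}"


def store_tweet(event_name, tweet_id, created_at, body):
    key = row_key(event_name, created_at.date())
    rows.setdefault(key, {})[tweet_id] = body


def tweets_in_range(event_name, start_day, end_day):
    # Read only the day buckets inside the requested range (inclusive).
    result = []
    day = start_day
    while day <= end_day:
        bucket = rows.get(row_key(event_name, day), {})
        result.extend(bucket[tweet_id] for tweet_id in sorted(bucket))
        day += timedelta(days=1)
    return result


store_tweet("sample-event", 1, datetime(2012, 7, 27, 9, 30), "day one")
store_tweet("sample-event", 2, datetime(2012, 7, 28, 18, 0), "day two")
store_tweet("sample-event", 3, datetime(2012, 7, 30, 7, 15), "day four")

print(tweets_in_range("sample-event", date(2012, 7, 27), date(2012, 7, 28)))
```

Querying the whole event is still possible by passing the event's full date span, which is why this layout can stand in for the single-row layout of the first query.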
5. After the Transition
[Figure; recoverable labels: Hector, CassandraTwitterStatusService]
Lessons Learned
Cassandra meets their needs but is non-trivial to implement
Flexibility
▫ Immunity to changes in tweet metadata by Twitter (no software change is needed each time Twitter changes the metadata)
Availability
▫ Always writeable
Scalability
▫ Need more storage? Add another node
Performance
The Twitter Streaming API can deliver tweets 24/7 (5M / day)
Version 1 architecture struggled to keep up with collection
Version 2 architecture (Cassandra) has no need to hold tweets in a queue while they wait for the persistence mechanism to update its records
▫ It can now handle 100+ tweets per second; ~8.6M a day (April 2011 Japan Earthquake)
▫ It easily handled collection for the 2012 Summer Olympics (712 users and keywords; 40M tweets (98.2GB) after two weeks)
The Project EPIC software infrastructure can be deployed in a wide variety of configurations: from a single researcher storing Twitter data in JSON files, to a research group running the infrastructure on a single powerful server, to an even larger research group running a hybrid persistence architecture on a large cluster of machines (as Project EPIC does today).
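The throughput figures on this slide can be cross-checked with some quick arithmetic: 100 tweets per second sustained for a day is 100 × 86,400 = 8,640,000 tweets, which matches the quoted ~8.6M a day. The sketch below repeats that calculation and derives two extra figures from the Olympics numbers; it assumes "two weeks" means exactly 14 days and that GB means 10^9 bytes, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tweets_per_day(tweets_per_second):
    return tweets_per_second * SECONDS_PER_DAY


def average_rate(total_tweets, days):
    # Mean tweets per second over the whole collection period.
    return total_tweets / (days * SECONDS_PER_DAY)


def bytes_per_tweet(total_bytes, total_tweets):
    return total_bytes / total_tweets


daily_capacity = tweets_per_day(100)                      # 100+ tweets/s claim
olympics_rate = average_rate(40_000_000, 14)              # assumption: 14 days
olympics_tweet_size = bytes_per_tweet(98.2e9, 40_000_000) # assumption: decimal GB

print(f"Tweets per day at 100/s: {daily_capacity:,}")
print(f"Average Olympics rate: {olympics_rate:.1f} tweets/s")
print(f"Average stored size: {olympics_tweet_size:.0f} bytes per tweet")
```

Under these assumptions the Olympics collection averaged well below the 100 tweets per second the version 2 architecture is reported to handle, which is consistent with the slide's "easily handled" remark.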
6. Conclusions
Challenges
▫ NoSQL is added alongside relational technologies; it does not replace them (they are not saying that MySQL can’t scale)
▫ Data modeling is hard; it is difficult to change queries
▫ Skills gap: SQL is familiar; NoSQL is unfamiliar
▫ New skills needed: system administration; cluster management
Despite the challenges, it is possible to incorporate NoSQL into existing systems
▫ Requires good software architecture and software engineering practices
Possible but not trivial
▫ Determine whether your application requires the combination of flexibility, availability and scalability offered by these technologies. If not, look at other storage technologies and figure out which ones meet your needs.
THANK YOU!!!