CSE 482 Lecture 5: NoSQL
Outline of Today’s Lecture The previous lecture covered relational databases and SQL Today’s lecture focuses on NoSQL
NoSQL Not only SQL (does not mean No SQL) Supports distributed data storage and processing across multiple servers Motivation Lots of new applications that require large data storage Traditional database systems provide many functionalities (e.g., powerful query languages, concurrency control) that are overly complex and not needed by the applications The structured data model used by traditional database systems is also too restrictive for the new applications E.g., Schema is often fixed and not flexible enough
Traditional versus New Applications Traditional: Bank, grocery, and credit card transactions, etc. Lots of read, write, and update operations Fixed set of columns and data format Consistency is important New: Facebook, Gmail or Yahoo mail, Flickr, etc. Mostly read or write (few update operations) Variable set of columns and data format Availability is important
Required Characteristics of NoSQL Scalability Store data in a cluster of machines Can easily expand storage by adding more nodes to the cluster Availability Data is replicated over multiple nodes to improve availability However, writes become slower because every update must be applied to every copy of the replicated data item NoSQL systems therefore assume eventual consistency, i.e., all replicas will eventually be consistent (instead of guaranteeing consistency at all times)
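To make eventual consistency concrete, here is a minimal Python sketch of a toy replicated store. It is not how any particular NoSQL system is implemented; the class name ReplicatedStore and its methods are hypothetical and exist only for illustration. A write is acknowledged after updating one replica, so a read from another replica can be stale until the queued updates are applied.

```python
class ReplicatedStore:
    """Toy key-value store with several replicas and lazy replication."""

    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # queued updates: (replica_index, key, value)

    def write(self, key, value):
        # Acknowledge the write after updating one replica; queue the others.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica=0):
        # A read may hit any replica, so it can return stale data.
        return self.replicas[replica].get(key)

    def sync(self):
        # Apply queued updates in order; afterwards all replicas agree.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = ReplicatedStore()
store.write("status", "online")
stale = store.read("status", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("status", replica=2)  # replicas have now converged
print(stale, fresh)
```

The write returns quickly because it touches a single replica, which is the trade-off the slide describes: availability and write speed are favored over consistency at all times.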
CAP Theorem For distributed database systems, we want Consistency: all replicated copies are consistent Availability: each read/write request must receive a response Partition tolerance: the system must continue to operate even when a fault partitions the nodes in a network CAP theorem: it is not possible to guarantee all three at the same time NoSQL systems typically settle for weaker consistency levels
Types of NoSQL Systems Document-based systems: store data in the form of documents using well-known formats such as JSON Example: MongoDB Key-value systems: Use key-value pairs for fast access to data items; value can be a record, an object, a document, or a complex data structure Example: Amazon’s DynamoDB, Facebook’s Cassandra Column-based systems: Partition a table by column into column families, where each column family is stored in its own files Example: Google’s BigTable Graph-based systems: Data is represented as graphs, and related nodes are found by traversing the edges using path expressions Example: GraphBase
MongoDB An open-source, document database Stores data as collections of documents in binary JSON (BSON) format Each document in a given collection has a unique id (key) MongoDB database Collection 1 Collection 2 Set of JSON Documents Set of JSON Documents
CRUD Operations Create: insert one or more new documents into a collection db.<collection_name>.insert(<document(s)>) Read: find documents in the collection db.<collection_name>.find(<condition>) Update: update a document in the collection db.<collection_name>.update(<condition>, <update>) Delete: remove a document from the collection db.<collection_name>.remove(<condition>) http://api.mongodb.com/python/current/tutorial.html
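The same four operations can be sketched from Python. This is a minimal sketch, assuming a pymongo-style collection object is passed in. Recent pymongo releases expose the operations as insert_one, find_one, update_one, and delete_one rather than the shell's insert, update, and remove. The function name crud_round_trip and the sample document are illustrative assumptions.

```python
def crud_round_trip(collection):
    """Run one create/read/update/delete cycle on a pymongo-style collection."""
    # Create: insert a single document.
    collection.insert_one({"Name": "bob", "City": "Detroit"})
    # Read: fetch the first document that matches the condition.
    found = collection.find_one({"Name": "bob"})
    # Update: change one field of the matching document with $set.
    collection.update_one({"Name": "bob"}, {"$set": {"City": "Lansing"}})
    updated = collection.find_one({"Name": "bob"})
    # Delete: remove the matching document.
    collection.delete_one({"Name": "bob"})
    remaining = collection.find_one({"Name": "bob"})
    return found, updated, remaining
```

With a running server, the collection would typically come from something like MongoClient()["test"]["profiles"], where the database and collection names are again assumptions.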
Obtaining and Installing MongoDB You can download MongoDB from https://www.mongodb.org/downloads#production After installation: Create a data directory to store the data files prompt> md <data_directory> Launch the server by executing mongod.exe prompt> mongod.exe --dbpath <data_directory> Launch the client instance by executing mongo.exe prompt> mongo.exe
Launching the Server MongoDB is ready to accept new commands
Launching the Client MongoDB is ready to accept new commands
Some Useful Commands use <database_name>: If database_name exists, switch to the named database Otherwise, create a new database with the given name db: To check the name of the current database show dbs: To display all the databases available show collections: To display all the collections created under the current database
Collections To create a collection of documents: db.createCollection(collection_name, collection_options) Example: db.createCollection("posts", {capped: true, size: 1310720, max: 500}) Specifies that the collection has upper limits on its storage space (size, in bytes) and number of documents (max)
Collections Capped versus uncapped collection Capped collection: Documents are stored in a fixed-size circular queue Documents are stored according to insertion order If the number of documents exceeds max, the oldest document is removed Fast, especially when a large number of inserts is needed Does not require an index for insertion order You cannot delete documents from a capped collection. [Figure: with max = 3, a collection holding doc1, doc2, doc3 becomes doc2, doc3, doc4 after doc4 is inserted]
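The circular-queue behavior can be imitated in plain Python with a bounded deque. This is only an analogy for the eviction rule described above, not MongoDB code. The variable name capped is an assumption.

```python
from collections import deque

# A deque with maxlen=3 drops its oldest element when a fourth is appended,
# mirroring a capped collection created with max: 3.
capped = deque(maxlen=3)
for doc in ["doc1", "doc2", "doc3", "doc4"]:
    capped.append(doc)

print(list(capped))  # doc1 has been evicted; insertion order is preserved
```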
Collections To create a collection: db.createCollection("collection_name", {capped: true, size: 1310720, max: 500, autoIndexId: true}) capped: true/false. If true, you must specify the size parameter. size: If it is less than or equal to 4096, the collection will have a cap of 4096 bytes. Otherwise, the size is raised to an integer multiple of 256. max: maximum number of documents allowed in the collection autoIndexId: true/false. If true, automatically create an index on the _id field. The _id index is created by default, so this option is normally omitted.
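The size rule can be written out as a small calculation. The helper below simply restates the rule from the slide in Python; it is not MongoDB's own code, and the function name effective_capped_size is hypothetical.

```python
def effective_capped_size(requested_bytes):
    """Apply the slide's rule: minimum 4096 bytes, else round up to a multiple of 256."""
    if requested_bytes <= 4096:
        return 4096
    # Ceiling division by 256, then scale back up.
    return -(-requested_bytes // 256) * 256


print(effective_capped_size(1000))     # below the minimum
print(effective_capped_size(5000))     # rounded up to the next multiple of 256
print(effective_capped_size(1310720))  # already a multiple of 256
```

The value 1310720 used in the slide's example is already a multiple of 256 (256 × 5120), so it is left unchanged.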
Collections To check whether a collection is capped: db.collection_name.isCapped() To drop an existing collection: db.collection_name.drop()
Insert Syntax: db.collection_name.insert(document) Can be used to insert one or more documents If collection_name does not exist, it will be created automatically
Insert
Example Suppose we want to create a collection of social media profiles { Name: 'bob', City: 'Detroit', Interests: [ 'sports', 'outdoor' ] } { Name: 'mary', City: 'Chicago', Interests: [ 'science', 'art' ] } { Name: 'john', City: 'Lansing', Interests: [ 'politics', 'music' ] }
Example
Example Find the users who live in Lansing Find the users who are older than 30
Example Find Lansing users who are older than 23 Find the users who like outdoors or travel Find the users who don’t belong to any groups
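The queries on the last two slides can be written as MongoDB filter documents. The sketch below expresses each one as a Python dictionary, in the form pymongo's find accepts. To show what each filter selects without a server, it evaluates them with a tiny in-memory matcher. The matcher supports only equality, $gt, $in, and $exists and is not a substitute for MongoDB's query engine. The Age and Groups fields, their values, and the reading of "no groups" as a missing Groups field are assumptions, since the profile documents shown earlier do not include them.

```python
# Sample profiles; Age and Groups are hypothetical fields added for illustration.
profiles = [
    {"Name": "bob", "City": "Detroit", "Age": 35,
     "Interests": ["sports", "outdoor"], "Groups": ["hiking"]},
    {"Name": "mary", "City": "Chicago", "Age": 28,
     "Interests": ["science", "art"]},
    {"Name": "john", "City": "Lansing", "Age": 25,
     "Interests": ["politics", "music"], "Groups": ["debate"]},
]


def matches(doc, query):
    """Evaluate a small subset of MongoDB query syntax against one document."""
    for field, cond in query.items():
        present = field in doc
        value = doc.get(field)
        # An array field matches if any of its elements matches.
        values = value if isinstance(value, list) else [value]
        if isinstance(cond, dict):
            for op, arg in cond.items():
                if op == "$exists":
                    ok = present == arg
                elif op == "$gt":
                    ok = present and any(v > arg for v in values)
                elif op == "$in":
                    ok = present and any(v in arg for v in values)
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
        elif not (present and cond in values):
            return False
    return True


def find_names(query):
    return [doc["Name"] for doc in profiles if matches(doc, query)]


in_lansing = {"City": "Lansing"}
older_than_30 = {"Age": {"$gt": 30}}
lansing_over_23 = {"City": "Lansing", "Age": {"$gt": 23}}  # conditions are ANDed
outdoor_or_travel = {"Interests": {"$in": ["outdoor", "travel"]}}
no_groups = {"Groups": {"$exists": False}}

for query in (in_lansing, older_than_30, lansing_over_23, outdoor_or_travel, no_groups):
    print(query, find_names(query))
```

Against a live server, the same dictionaries would be passed unchanged, for example collection.find(lansing_over_23).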
Other Query Operators
Update Syntax: db.collection_name.update(selection_condition, update) Example: Change Bob’s interests from music and outdoor to music and art
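A possible Python version of this update, assuming a pymongo-style collection is passed in. The $set operator changes only the named field; without an update operator, the legacy shell update replaces the whole matching document, which is rarely what is wanted. The function name set_interests is hypothetical.

```python
def set_interests(collection, name, interests):
    """Overwrite the Interests array of the profile called `name`."""
    selection = {"Name": name}                  # which document to update
    update = {"$set": {"Interests": interests}}  # which field to change
    collection.update_one(selection, update)
    return selection, update
```

For the slide's example, the call would be set_interests(collection, "bob", ["music", "art"]).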
Remove Syntax: db.collection_name.remove(selection_condition) Example: Remove all users who are older than 30
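In pymongo the corresponding call is delete_many, which removes every document matching the condition and reports how many were deleted. This is a sketch under the assumption that profiles carry a numeric Age field, which the sample documents shown earlier do not include. The function name remove_older_than is hypothetical.

```python
def remove_older_than(collection, age):
    """Delete every profile whose Age exceeds `age`; return the number removed."""
    result = collection.delete_many({"Age": {"$gt": age}})
    return result.deleted_count
```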
Bulk Import of JSON file You can import the file directly using mongoimport on the command prompt: C:> mongoimport -d <database> -c <collectionName> --file <filename> Suppose you have a JSON file named users.json: C:> mongoimport -d test -c profiles --file users.json This will create a collection named profiles in the test database to store information about the 4 users
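By default, mongoimport reads a file containing one JSON document per line. A file in that layout can be produced from Python as sketched below. The function name write_import_file is hypothetical, and only the three profiles shown earlier are included because the fourth user's details are not given in the slides.

```python
import json

# The three profiles from the earlier example.
users = [
    {"Name": "bob", "City": "Detroit", "Interests": ["sports", "outdoor"]},
    {"Name": "mary", "City": "Chicago", "Interests": ["science", "art"]},
    {"Name": "john", "City": "Lansing", "Interests": ["politics", "music"]},
]


def write_import_file(path, documents):
    """Write one JSON document per line, the default layout mongoimport expects."""
    with open(path, "w", encoding="utf-8") as handle:
        for document in documents:
            handle.write(json.dumps(document) + "\n")
```

Calling write_import_file("users.json", users) would create the file used in the mongoimport command above.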
Aggregate Function Syntax: db.collection_name.aggregate(aggregate_operation) For more examples: https://www.mkyong.com/mongodb/mongodb-aggregate-and-group-example/
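As an illustration of an aggregate operation, the pipeline below groups profiles by city and counts the documents in each group. It is a sketch, not taken from the slides. The plain-Python function count_by_city shows what the $group stage computes, and the fourth user (alice) is a made-up record added so that one city has more than one profile.

```python
from collections import Counter

# Aggregation pipeline in the form pymongo's collection.aggregate() accepts.
pipeline = [
    {"$group": {"_id": "$City", "count": {"$sum": 1}}},  # one group per city
    {"$sort": {"count": -1}},                            # largest groups first
]

profiles = [
    {"Name": "bob", "City": "Detroit"},
    {"Name": "mary", "City": "Chicago"},
    {"Name": "john", "City": "Lansing"},
    {"Name": "alice", "City": "Lansing"},  # hypothetical extra user
]


def count_by_city(docs):
    """Plain-Python equivalent of the $group stage above."""
    return Counter(doc["City"] for doc in docs)


print(count_by_city(profiles).most_common())
```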
Accessing MongoDB using Python First install the pymongo library package: conda install pymongo After installing, launch the server (see slide 13)
Using MongoDB to store tweets Launch MongoDB server Python script to download tweets and store in MongoDB Use tweepy to download tweets from CDCgov Use pymongo to Open a connection to MongoDB server Store the json tweets Query MongoDB to retrieve the tweets
Using MongoDB to store tweets Step 1: Use tweepy to retrieve tweets
Using MongoDB to store tweets Step 2: Connect to MongoDB and store tweets Selected database The tweets are stored in a collection named twitter
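Step 2 might look roughly like the following. This is a sketch, not the script shown on the slide: the function name store_tweets, the database name test, and the host and port are assumptions. Only the collection name twitter comes from the slide.

```python
def store_tweets(collection, tweets):
    """Insert tweet JSON objects (dicts) into a pymongo-style collection."""
    if not tweets:
        return 0  # pymongo's insert_many rejects an empty list
    result = collection.insert_many(tweets)
    return len(result.inserted_ids)


# Typical wiring (requires a running mongod and the pymongo package):
#   from pymongo import MongoClient
#   client = MongoClient("localhost", 27017)   # host and port are assumptions
#   collection = client["test"]["twitter"]     # database name is an assumption
#   store_tweets(collection, tweets)
```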
Using MongoDB to store tweets Step 3: Query MongoDB to retrieve tweets
Using MongoDB to store tweets Step 3b: Using regular expression to find tweets about Zika
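A regular-expression query is an ordinary filter document that uses the $regex operator. The sketch below builds such a filter and, so it can run without a server, applies the same pattern locally with Python's re module. The sample tweets are made up, the helper filter_locally is hypothetical, and the filter assumes the tweet body is stored in a field named text.

```python
import re

# Case-insensitive match on the tweet body.
zika_filter = {"text": {"$regex": "zika", "$options": "i"}}


def filter_locally(tweets, query):
    """Apply the text regex from `query` in Python, for illustration only."""
    pattern = query["text"]["$regex"]
    flags = re.IGNORECASE if "i" in query["text"].get("$options", "") else 0
    return [tweet for tweet in tweets
            if re.search(pattern, tweet.get("text", ""), flags)]


# Made-up sample tweets, not real CDCgov data.
sample = [
    {"text": "Learn how to protect yourself from Zika virus."},
    {"text": "Flu season is here. Get vaccinated."},
]

print(filter_locally(sample, zika_filter))
```

With a live connection, the same dictionary would be passed to the server, for example collection.find(zika_filter).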
Summary Goals of this lecture: To introduce NoSQL and explain how it differs from SQL To introduce MongoDB To give examples of how to interact with MongoDB using Python Next lecture: Data preprocessing