MongoDB Connection in Husky
CSCI5570 Large Scale Data Processing Systems, Lab 3

Deep Dive into the Connection
1. Understand the MongoDB distributed architecture
2. Consider how to read all the data correctly
3. Get familiar with the APIs
4. Design the InputFormat for Husky

MongoDB - Introduction
- Document database
- Documents are stored as BSON (a binary JSON format)
- Data hierarchy: Field -> Record (Document) -> Collection -> Database

MongoDB - Sharding
MongoDB - Sharding
- shard: Each shard contains a subset of the sharded data.
- config servers: Config servers store metadata and configuration settings for the cluster.
- mongos: The mongos acts as a query router, providing an interface between client applications and the sharded cluster.

MongoDB - Sharding
Chunk: a range of shard-key values (default size 64 MB)
Two operations act on chunks:
- Split: a chunk that grows beyond the chunk size is split into two.
- Migration: the balancer moves chunks among shards.

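To make the range-based idea concrete, here is a minimal sketch of how a key could be mapped to the chunk (and shard) that owns it. The `ChunkTable` type and `find_shard` function are hypothetical names for illustration, and keys are compared as plain strings, which only approximates MongoDB's real BSON ordering.

```cpp
#include <map>
#include <string>

// Hypothetical routing table: maps each chunk's lower bound (min) to the
// shard that owns the chunk. A chunk covers [its min, the next chunk's min).
using ChunkTable = std::map<std::string, std::string>;

// Return the shard that owns `key`, or an empty string if the key sorts
// before the lower bound of the first chunk.
std::string find_shard(const ChunkTable& table, const std::string& key) {
    auto it = table.upper_bound(key);  // first chunk whose min is > key
    if (it == table.begin()) {
        return "";                     // key is below every chunk
    }
    --it;                              // chunk with the largest min <= key
    return it->second;
}
```

Because the ranges are half-open and do not overlap, every key resolves to at most one chunk, which is what lets a reader treat chunks as independent units of work.
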
MongoDB - Mongo Shell
Use the tools provided by MongoDB:
    $ export PATH=$PATH:/data/opt/mongo-tools/bin
Connect to the mongo shell:
    $ mongo proj5:20001

MongoDB - Mongo Shell
Show databases:
    mongos> show databases
Show collections:
    mongos> use hdb
    mongos> show collections

MongoDB - Mongo Shell
Use the `hdb` database:
    mongos> use hdb
Show how the data is distributed across shards:
    mongos> db.printShardingStatus()

MongoDB - Mongo Shell
Count the documents in collection `enwiki` of the `hdb` database:
    mongos> db.enwiki.count()

MongoDB - Mongo Shell
Browse the documents in collection `enwiki` of the `hdb` database:
    mongos> db.enwiki.find()    // shows 20 documents at a time; type `it` for more

MongoDB - Split
Husky relies on two characteristics of the data distribution:
- Each chunk is one split in Husky.
- Finishing reading all chunks = finishing reading the collection.
Together these ensure that the data is read completely and without repetition.

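The "complete and without repetition" property holds only if the chunk ranges tile the key space exactly. A minimal sketch of such a check follows; the `Range` struct and `covers_without_overlap` function are hypothetical names, and bounds are compared as plain strings rather than BSON values.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical half-open key range [min, max) of one chunk.
struct Range {
    std::string min;
    std::string max;
};

// Return true if the chunks, taken together, cover [lo, hi) exactly once:
// no gap (missing data) and no overlap (repeated data).
bool covers_without_overlap(std::vector<Range> chunks,
                            const std::string& lo, const std::string& hi) {
    if (chunks.empty()) {
        return false;
    }
    std::sort(chunks.begin(), chunks.end(),
              [](const Range& a, const Range& b) { return a.min < b.min; });
    std::string expected = lo;  // where the next chunk must start
    for (const Range& c : chunks) {
        if (c.min != expected) return false;  // gap or overlap
        if (!(c.min < c.max)) return false;   // empty or inverted range
        expected = c.max;
    }
    return expected == hi;
}
```

Note that the check describes a single snapshot of the chunk list; chunks that split or migrate while a job is reading are outside what this sketch covers.
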
MongoDB - Split
Basic information of one split (a chunk, in MongoDB terms):
    class MongoDBSplit {
        std::string input_uri;  // location of the shard
        std::string max;        // upper bound of the range (exclusive)
        std::string min;        // lower bound of the range (inclusive): [min, max)
        std::string ns;         // namespace: database.collection
    };

MongoDB - Assigner
The assigner runs on the Husky Master side:
1. Get the information of all shards.
2. Get the information of all chunks.
3. Build the chunk list.
4. Each worker asks the master for a chunk to read.
5. Repeat until all chunks have been read.

MongoDB - Assigner
Use the `config` database:
    mongos> use config
Check the status of the shards:
    mongos> db.shards.find()

MongoDB - Assigner
Check the status of the chunks:
    mongos> db.chunks.find()

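Combining the two queries gives the split list: each chunk names the shard that holds it, and the shard list gives that shard's address. The sketch below works on plain structs instead of real driver results; `ChunkInfo`, `build_splits`, and the field layout are assumptions for illustration, not Husky's actual code.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Same fields as the MongoDBSplit class shown earlier, as a plain struct.
struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // [min, max)
    std::string ns;         // database.collection
};

// Hypothetical flattened view of one chunk document.
struct ChunkInfo {
    std::string ns;
    std::string min;
    std::string max;
    std::string shard;  // shard name, e.g. "shard0004"
};

// Join the chunk list with a shard-name -> host map to get one split per
// chunk of the requested namespace.
std::vector<MongoDBSplit> build_splits(
        const std::map<std::string, std::string>& shard_hosts,
        const std::vector<ChunkInfo>& chunks, const std::string& ns) {
    std::vector<MongoDBSplit> splits;
    for (const ChunkInfo& c : chunks) {
        if (c.ns != ns) continue;  // chunk of another collection
        auto host = shard_hosts.find(c.shard);
        if (host == shard_hosts.end()) {
            // Skipping the chunk silently would lose data, so fail loudly.
            throw std::runtime_error("unknown shard: " + c.shard);
        }
        splits.push_back({host->second, c.max, c.min, c.ns});
    }
    return splits;
}
```
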
MongoDB - Assigner
The assigner keeps the list of chunks:
    class MongoSplitAssigner {
        std::vector<MongoDBSplit> splits;
    };
Split assignment:
- An idle worker asks for an unread split to read.
- The assigned split is then erased from the vector.
- This repeats until the vector is empty.

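The assignment loop can be sketched as a small class that hands out each split exactly once. This is an illustrative version, not Husky's implementation: the `next_split` method, the constructor, and the mutex (which assumes several workers may ask at the same time) are all assumptions.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // [min, max)
    std::string ns;         // database.collection
};

class MongoSplitAssigner {
public:
    explicit MongoSplitAssigner(std::vector<MongoDBSplit> splits)
        : splits_(std::move(splits)) {}

    // Copy one unread split into *out and erase it from the list.
    // Returns false when every split has already been handed out.
    bool next_split(MongoDBSplit* out) {
        std::lock_guard<std::mutex> lock(mu_);
        if (splits_.empty()) {
            return false;
        }
        *out = std::move(splits_.back());
        splits_.pop_back();  // erasing guarantees no split is read twice
        return true;
    }

private:
    std::mutex mu_;
    std::vector<MongoDBSplit> splits_;
};
```

A worker would call `next_split` in a loop and stop at the first `false`. Taking splits from the back keeps the erase cheap; the order in which chunks are read does not matter for correctness.
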
MongoDB - InputFormat
After getting the chunk information (shard location, database.collection, max, min):
- Access the shard and read the chunk directly.
- Obtain all the records in that chunk.
- Yield each record to the parse function.

MongoDB - InputFormat
Get the location and range of one chunk from the assigner:
    shard: shard0004 -> 192.168.50.10:20000 (proj10)
    ns:    hdb.enwiki
    min:   {md5: "cd1c88c44f2e99dcd6fa3378bbb18137"}
    max:   {md5: "cfdf8f1b10cff07317c5de6247149a2e"}

MongoDB - InputFormat
Access shard0004 directly:
    $ mongo proj10:20000
Use hdb:
    mongos> use hdb
Find all records in [min, max):
    mongos> db.enwiki.find({"md5":{$gte:"cd1c88c44f2e99dcd6fa3378bbb18137",$lt:"cfdf8f1b10cff07317c5de6247149a2e"}})

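The filter in the query above follows directly from the split's bounds: `$gte` for the inclusive min and `$lt` for the exclusive max. A minimal sketch of building that filter as a JSON string follows; `range_filter` is a hypothetical helper, and it assumes string-valued bounds that need no escaping. A real reader would build the query with the driver's BSON types and would also have to handle the open-ended first and last chunks.

```cpp
#include <string>

// Build the JSON filter {key: {$gte: min, $lt: max}} for one chunk.
// Assumes min and max are plain strings without quotes or backslashes.
std::string range_filter(const std::string& key, const std::string& min,
                         const std::string& max) {
    return "{\"" + key + "\": {\"$gte\": \"" + min +
           "\", \"$lt\": \"" + max + "\"}}";
}
```

Using `$lt` rather than `$lte` on the upper bound matters: the max of one chunk is the min of the next, so an inclusive upper bound would read boundary records twice.
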
InputFormat Example
Husky uses the MongoDB C++ driver to carry out the procedure described above.
In a C++ Husky application, add the following:
    husky::io::MongoDBInputFormat infmt;
    infmt.set_server("proj5:20001");
    infmt.set_ns("hdb", "enwiki");
    husky::load(infmt, parse_lambda);

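For a word count, the parse function receives one record at a time and tokenizes it. The standalone sketch below shows only that tokenizing step, assuming each record arrives as a `std::string`; `count_words` is a hypothetical helper, and a real Husky job would emit the words into its own objects instead of a local map.

```cpp
#include <map>
#include <sstream>
#include <string>

// Split one record on whitespace and add each word to the running counts.
void count_words(const std::string& record,
                 std::map<std::string, int>& counts) {
    std::istringstream in(record);
    std::string word;
    while (in >> word) {
        ++counts[word];
    }
}
```
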
InputFormat Example
Build WordCountMongo:
    $ cd build
    $ make WordCountMongo
Set 20 threads for each worker in the configuration file, then run:
    $ ./Master --conf default.cfg
    $ ./exec.sh WordCountMongo --conf default.cfg

More
- How could MongoDBInputFormat be improved?
- MongoDBInputFormat only reads; how would you write?
- Is it the same as HDFSLineInputFormat?
- How could the access time be optimized?
- What about other data storage systems?

Thank you