1
MongoDB Connection in Husky
CSCI5570 Large Scale Data Processing Systems Lab 3
2
Deep in Connection
1. Understand MongoDB's distributed architecture
2. Consider the correctness of reading all data
3. Get familiar with the APIs
4. Design the InputFormat for Husky
3
MongoDB - Introduction
Document database
BSON (binary JSON format)
Field -> Record (Document) -> Collection -> Database
4
MongoDB - Sharding
5
MongoDB - Sharding
shard: Each shard contains a subset of the sharded data.
config servers: Config servers store metadata and configuration settings for the cluster.
mongos: The mongos acts as a query router, providing an interface between client applications and the sharded cluster.
6
MongoDB - Sharding
Chunk: range based (default size 64MB)
Two operations:
A chunk is split into two when it grows beyond the chunk size
Chunks are migrated among shards by the balancer
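The two chunk operations can be pictured with a small in-memory model. This is only a sketch: the `Chunk` struct and the functions `find_chunk`, `split_chunk`, and `migrate_chunk` are hypothetical names, and shard keys are assumed to compare as plain strings (real MongoDB compares BSON values).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a range-based chunk holding keys in [min, max).
struct Chunk {
    std::string min;    // inclusive lower bound
    std::string max;    // exclusive upper bound
    std::string shard;  // shard that currently owns the chunk
};

// Return the index of the chunk whose range contains key, or -1 if none does.
int find_chunk(const std::vector<Chunk>& chunks, const std::string& key) {
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].min <= key && key < chunks[i].max) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Split chunk i at split_key into [min, split_key) and [split_key, max).
// Both halves stay on the same shard; only the ranges change.
void split_chunk(std::vector<Chunk>& chunks, std::size_t i,
                 const std::string& split_key) {
    assert(i < chunks.size());
    assert(chunks[i].min < split_key && split_key < chunks[i].max);
    Chunk upper{split_key, chunks[i].max, chunks[i].shard};
    chunks[i].max = split_key;
    chunks.insert(chunks.begin() + static_cast<std::ptrdiff_t>(i + 1), upper);
}

// Migrate chunk i to another shard; the key range is unchanged.
void migrate_chunk(std::vector<Chunk>& chunks, std::size_t i,
                   const std::string& target_shard) {
    assert(i < chunks.size());
    chunks[i].shard = target_shard;
}
```

In this model a split never moves data between shards and a migration never changes a range, which is why a reader that tracks chunks by range and shard can follow both operations.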
7
MongoDB – Mongo Shell
Use tools provided by MongoDB:
$ export PATH=$PATH:/data/opt/mongo-tools/bin
Connect to mongo shell:
$ mongo proj5:20001
8
MongoDB – Mongo Shell
Show databases:
mongos> show databases
Show collections:
mongos> use hdb
mongos> show collections
9
MongoDB – Mongo Shell
Use `hdb` database:
mongos> use hdb
Show shards distribution:
mongos> db.printShardingStatus()
10
MongoDB – Mongo Shell
Count the documents in collection `enwiki` in the `hdb` database:
mongos> db.enwiki.count()
11
MongoDB – Mongo Shell
List the documents in collection `enwiki` in the `hdb` database:
mongos> db.enwiki.find()   // shows 20 documents at a time
12
MongoDB - Split
Data distribution has these characteristics:
Each chunk is a split in Husky
Finishing reading all chunks = finishing reading a collection
Reading each chunk exactly once ensures data integrity and no repetition
13
MongoDB - Split
Basic information of one split (a chunk in MongoDB terms):
class MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // [min, max)
    std::string ns;         // database.collection
};
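The half-open [min, max) convention is what allows a set of splits to cover a collection with no gaps and no repeats: each split's max should equal the next split's min. A minimal sketch of that check follows. The function `covers_exactly_once` is a hypothetical helper, not part of Husky, and it assumes the bounds compare as plain strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Mirrors the fields on the slide; a struct is used here so the fields are public.
struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // range is [min, max)
    std::string ns;         // database.collection
};

// Return true if the splits, ordered by min, tile one contiguous key range:
// every split is non-empty and each max equals the following min.
// Assumption: bounds are comparable as plain strings.
bool covers_exactly_once(std::vector<MongoDBSplit> splits) {
    if (splits.empty()) return false;
    std::sort(splits.begin(), splits.end(),
              [](const MongoDBSplit& a, const MongoDBSplit& b) {
                  return a.min < b.min;
              });
    for (std::size_t i = 0; i < splits.size(); ++i) {
        if (!(splits[i].min < splits[i].max)) return false;  // empty or inverted range
        if (i > 0 && splits[i - 1].max != splits[i].min) return false;  // gap or overlap
    }
    return true;
}
```

A gap would mean some records are never read; an overlap would mean some records are read twice.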
14
MongoDB – Assigner
Program on the Husky Master side:
Get all the shards information
Get all the chunks information
Obtain the chunk list
Each worker asks for a chunk to read
Repeat until all the chunks have been read
15
MongoDB – Assigner
Use `config` database:
mongos> use config
Check shards status:
mongos> db.shards.find()
16
MongoDB – Assigner
Check chunks status:
mongos> db.chunks.find()
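Combining the two queries gives the chunk list: each chunk document names the shard that owns it, and each shard document gives that shard's host. The sketch below does this join in memory without the MongoDB driver. `ShardInfo`, `ChunkInfo`, and `build_splits` are hypothetical names, and the field layout (`_id` and `host` for shards; `ns`, `min`, `max`, `shard` for chunks) is an assumption about the documents these queries return on this cluster.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Assumed shape of a document from config.shards.
struct ShardInfo {
    std::string id;    // e.g. "shard0004"
    std::string host;  // e.g. "proj10:20000"
};

// Assumed shape of a document from config.chunks (bounds kept as strings here).
struct ChunkInfo {
    std::string ns, min, max, shard;
};

struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // range is [min, max)
    std::string ns;         // database.collection
};

// Build one split per chunk of the requested namespace by looking up
// the host of the shard that owns the chunk.
std::vector<MongoDBSplit> build_splits(const std::vector<ShardInfo>& shards,
                                       const std::vector<ChunkInfo>& chunks,
                                       const std::string& ns) {
    std::unordered_map<std::string, std::string> host_of;
    for (const auto& s : shards) host_of[s.id] = s.host;

    std::vector<MongoDBSplit> splits;
    for (const auto& c : chunks) {
        if (c.ns != ns) continue;           // chunk belongs to another collection
        auto it = host_of.find(c.shard);
        if (it == host_of.end()) continue;  // unknown shard: skipped in this sketch
        splits.push_back(MongoDBSplit{it->second, c.max, c.min, c.ns});
    }
    return splits;
}
```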
17
MongoDB – Assigner
The assigner keeps the chunk list:
class MongoSplitAssigner {
    std::vector<MongoDBSplit> splits;
};
Splits assignment:
An idle worker asks for an unread split to read.
The assigned split is then erased from the vector.
This repeats until the vector is empty.
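The assignment loop can be sketched as follows. This is not Husky's actual implementation: the method names `next_split` and `remaining` are hypothetical, and the mutex is an assumption added so that several workers can ask at the same time in this in-process model.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // range is [min, max)
    std::string ns;         // database.collection
};

class MongoSplitAssigner {
public:
    explicit MongoSplitAssigner(std::vector<MongoDBSplit> splits)
        : splits_(std::move(splits)) {}

    // Hand one unread split to an idle worker and erase it from the list.
    // Returns false when every split has already been handed out.
    bool next_split(MongoDBSplit& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (splits_.empty()) return false;
        out = std::move(splits_.back());
        splits_.pop_back();  // erasing from the back avoids shifting elements
        return true;
    }

    // Number of splits not yet handed out.
    std::size_t remaining() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return splits_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<MongoDBSplit> splits_;
};
```

A worker would call `next_split` in a loop and stop once it returns false. Because each split is erased as it is handed out, no chunk is read twice.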
18
MongoDB - InputFormat
After getting the chunk information (shard location, database.collection, max, min):
Access the shard and read the chunk directly
Obtain all the records in the specific chunk
Yield each record to the parse function
19
MongoDB - InputFormat
Get one chunk location and range from the assigner:
shard: shard0004 -> :20000(proj10)
ns: hdb.enwiki
min: {md5: "cd1c88c44f2e99dcd6fa3378bbb18137"}
max: {md5: "cfdf8f1b10cff07317c5de a2e"}
20
MongoDB - InputFormat
Access shard0004 directly:
$ mongo proj10:20000
Use hdb:
mongos> use hdb
Find all records in [min, max):
mongos> db.enwiki.find({"md5":{$gte:"cd1c88c44f2e99dcd6fa3378bbb18137",$lt:"cfdf8f1b10cff07317c5de a2e"}})
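The range query above follows directly from the split's [min, max) bounds: `$gte` for the inclusive lower bound and `$lt` for the exclusive upper bound. A minimal sketch of building that filter as a JSON string is shown below. `range_filter` is a hypothetical helper, and it assumes the field name and bounds contain no characters that need JSON escaping.

```cpp
#include <string>

// Build the filter {"<field>": {"$gte": "<min>", "$lt": "<max>"}} for one split.
// Assumption: field, min and max need no JSON escaping (true for hex md5 strings).
std::string range_filter(const std::string& field,
                         const std::string& min,
                         const std::string& max) {
    return "{\"" + field + "\": {\"$gte\": \"" + min +
           "\", \"$lt\": \"" + max + "\"}}";
}
```

For example, `range_filter("md5", "a", "b")` yields `{"md5": {"$gte": "a", "$lt": "b"}}`. A record whose key equals max is left to the next chunk, so adjacent chunks never return the same record.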
21
MongoDB - InputFormat
22
InputFormat Example
Husky uses the MongoDB C++ Driver to carry out the procedure above.
In C++ Husky, add the following to your application:
husky::io::MongoDBInputFormat infmt;
infmt.set_server("proj5:20001");
infmt.set_ns("hdb", "enwiki");
husky::load(infmt, parse_lambda);
23
InputFormat Example
Build WordCountMongo:
$ cd build
$ make WordCountMongo
Set 20 threads for each worker in the configuration file, then run:
$ ./Master --conf default.cfg
$ ./exec.sh WordCountMongo --conf default.cfg
24
More
Any improvement for MongoDBInputFormat?
MongoDBInputFormat only reads; how to write?
Is it the same as HDFSLineInputFormat?
Access time optimization?
Other data storage systems?
25
Thank you