MongoDB Connection in Husky

MongoDB Connection in Husky CSCI5570 Large Scale Data Processing Systems Lab 3

Deep in Connection
1. Understand MongoDB's distributed architecture
2. Consider how to read all the data correctly
3. Get familiar with the APIs
4. Design the InputFormat for Husky

MongoDB - Introduction
Document database
BSON (binary JSON format)
Field -> Record (Document) -> Collection -> Database

MongoDB - Sharding

MongoDB - Sharding
shard: Each shard contains a subset of the sharded data.
config servers: Config servers store metadata and configuration settings for the cluster.
mongos: The mongos acts as a query router, providing an interface between client applications and the sharded cluster.

MongoDB - Sharding
Chunk: range based (default size 64MB)
Two operations:
A chunk is split into two when it grows beyond the chunk size
Chunks are migrated among shards by the balancer
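To make range-based chunking concrete, here is a minimal sketch of a chunk table with lookup, split and migration. The names `ChunkTable`, `find_shard`, `split_chunk` and `migrate_chunk` are hypothetical and not part of MongoDB or Husky. The sketch also assumes string shard keys and a caller-chosen split point, whereas a real cluster splits according to chunk size.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Hypothetical model: each entry maps a chunk's inclusive lower bound to the
// shard that owns it. A chunk extends up to (but not including) the next key.
using ChunkTable = std::map<std::string, std::string>;

// Return the shard owning `key`: the chunk with the greatest lower bound <= key.
std::string find_shard(const ChunkTable& chunks, const std::string& key) {
    assert(!chunks.empty());
    auto it = chunks.upper_bound(key);  // first chunk starting strictly after key
    assert(it != chunks.begin());       // key must not precede the first chunk
    return std::prev(it)->second;
}

// Split the chunk containing `split_key` into two at `split_key`.
// Both halves stay on the same shard until a migration moves one of them.
void split_chunk(ChunkTable& chunks, const std::string& split_key) {
    chunks.emplace(split_key, find_shard(chunks, split_key));
}

// Move the chunk whose lower bound is `lower_bound` to another shard,
// which is the kind of change a balancer makes.
void migrate_chunk(ChunkTable& chunks, const std::string& lower_bound,
                   const std::string& to_shard) {
    chunks.at(lower_bound) = to_shard;
}
```

Because only the chunk boundaries and owners change, a reader that always asks "which chunk holds this key?" sees the same data before and after a split or a migration.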

MongoDB – Mongo Shell
Use tools provided by MongoDB:
    $ export PATH=$PATH:/data/opt/mongo-tools/bin
Connect to mongo shell:
    $ mongo proj5:20001

MongoDB – Mongo Shell
Show databases:
    mongos> show databases
Show collections:
    mongos> use hdb
    mongos> show collections

MongoDB – Mongo Shell
Use `hdb` database:
    mongos> use hdb
Show shards distribution:
    mongos> db.printShardingStatus()

MongoDB – Mongo Shell
Count the documents in collection `enwiki` in `hdb` database:
    mongos> db.enwiki.count()

MongoDB – Mongo Shell
List the documents in collection `enwiki` in `hdb` database:
    mongos> db.enwiki.find() # shows 20 documents at a time

MongoDB - Split
Data distribution has these two characteristics:
Each chunk is a split in Husky
Finishing reading all chunks = finishing reading the collection
Reading each chunk exactly once ensures data integrity and no repetition
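The "no record missed, no record repeated" property depends on the chunk ranges tiling the key space. One way to check it is sketched below. `Range` and `covers_exactly_once` are hypothetical names, and plain strings stand in for the real minimum and maximum key sentinels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Half-open key range [min, max), as used for chunks.
struct Range {
    std::string min;
    std::string max;
};

// True when the ranges, once sorted, run from `lowest` to `highest` with each
// range ending exactly where the next begins: no gap and no overlap.
bool covers_exactly_once(std::vector<Range> ranges, const std::string& lowest,
                         const std::string& highest) {
    if (ranges.empty()) return false;
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.min < b.min; });
    if (ranges.front().min != lowest || ranges.back().max != highest) return false;
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (!(ranges[i].min < ranges[i].max)) return false;  // empty or inverted range
        if (i + 1 < ranges.size() && ranges[i].max != ranges[i + 1].min) return false;
    }
    return true;
}
```

A gap between two ranges would mean some records are never read. An overlap would mean some records are read twice.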

MongoDB - Split
Basic information of one split (a chunk, in MongoDB terms):
    class MongoDBSplit {
        std::string input_uri;  // location of the shard
        std::string max;
        std::string min;        // range is [min, max)
        std::string ns;         // database.collection
    };

MongoDB – Assigner
Program on the Husky Master side:
Get all the shard information
Get all the chunk information
Obtain the chunk list
Each worker asks for a chunk to read
Until all the chunks have been read

MongoDB – Assigner
Use `config` database:
    mongos> use config
Check shards status:
    mongos> db.shards.find()

MongoDB – Assigner
Check chunks status:
    mongos> db.chunks.find()
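The two queries above give the master what it needs to build the split list: shard ids with their hosts, and chunks with their namespace, range and owning shard. The join could look roughly like the sketch below. `ShardInfo`, `ChunkInfo` and `build_splits` are hypothetical names. The field layout is an assumption about what `config.shards` and `config.chunks` return, so check it against the server version in use. `MongoDBSplit` is written as a struct here so that its fields are accessible.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // range is [min, max)
    std::string ns;         // database.collection
};

// Assumed shape of one config.shards entry: shard id and host address.
struct ShardInfo {
    std::string id, host;
};

// Assumed shape of one config.chunks entry.
struct ChunkInfo {
    std::string ns, min, max, shard;
};

// Build one split per chunk of `ns`, resolving each chunk's shard id to a host.
std::vector<MongoDBSplit> build_splits(const std::vector<ShardInfo>& shards,
                                       const std::vector<ChunkInfo>& chunks,
                                       const std::string& ns) {
    assert(ns.find('.') != std::string::npos);  // expects database.collection
    std::map<std::string, std::string> host_of;
    for (const auto& s : shards) host_of[s.id] = s.host;

    std::vector<MongoDBSplit> splits;
    for (const auto& c : chunks) {
        if (c.ns != ns) continue;  // chunk belongs to another collection
        auto it = host_of.find(c.shard);
        if (it == host_of.end()) {
            // Skipping silently would lose data, so fail loudly instead.
            throw std::runtime_error("chunk refers to unknown shard " + c.shard);
        }
        splits.push_back({it->second, c.max, c.min, c.ns});
    }
    return splits;
}
```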

MongoDB – Assigner
The assigner keeps the chunk list:
    class MongoSplitAssigner {
        std::vector<MongoDBSplit> splits;
    };
Splits assignment:
An idle worker asks for an unread split to read.
That split is then erased from the vector.
This repeats until the vector is empty.
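The assignment loop can be sketched as follows. The `next_split` method and the mutex are assumptions added for illustration. The slides only state that the assigner holds a vector of splits and erases each split once it has been handed out.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct MongoDBSplit {
    std::string input_uri;  // location of the shard
    std::string max;
    std::string min;        // range is [min, max)
    std::string ns;         // database.collection
};

class MongoSplitAssigner {
public:
    explicit MongoSplitAssigner(std::vector<MongoDBSplit> splits)
        : splits_(std::move(splits)) {}

    // Hand one unread split to the caller and erase it from the list.
    // Returns false once every split has been assigned.
    bool next_split(MongoDBSplit& out) {
        std::lock_guard<std::mutex> lock(mutex_);  // in case requests arrive concurrently
        if (splits_.empty()) return false;
        out = std::move(splits_.back());
        splits_.pop_back();
        return true;
    }

private:
    std::mutex mutex_;
    std::vector<MongoDBSplit> splits_;
};
```

Each worker would call `next_split` in a loop and stop when it returns false. Since a split leaves the vector as soon as it is handed out, no chunk is read twice.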

MongoDB - InputFormat
After getting the chunk information (shard location, database.collection, max, min):
Access the shard and read the chunk directly
Obtain all the records in the specific chunk
Yield each record to the parse function

MongoDB - InputFormat
Get one chunk's location and range from the assigner:
    shard: shard0004 -> 192.168.50.10:20000 (proj10)
    ns: hdb.enwiki
    min: {md5: "cd1c88c44f2e99dcd6fa3378bbb18137"}
    max: {md5: "cfdf8f1b10cff07317c5de6247149a2e"}

MongoDB - InputFormat
Access shard0004 directly:
    $ mongo proj10:20000
Use hdb:
    mongos> use hdb
Find all records in [min, max):
    mongos> db.enwiki.find({"md5":{$gte:"cd1c88c44f2e99dcd6fa3378bbb18137",$lt:"cfdf8f1b10cff07317c5de6247149a2e"}})
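The range query above has a fixed shape: `$gte` on the chunk's min and `$lt` on its max. A small helper that produces the same filter as text is sketched below. `make_range_filter` is a hypothetical name, and the sketch assumes string shard-key values that need no JSON escaping. With a real driver the filter would normally be built as a BSON document rather than a string.

```cpp
#include <cassert>
#include <string>

// Build the filter {"<field>":{"$gte":"<min>","$lt":"<max>"}} for the
// half-open range [min, max). Assumes values contain no quotes or backslashes.
std::string make_range_filter(const std::string& field, const std::string& min,
                              const std::string& max) {
    assert(min < max);  // an empty or inverted range would match nothing
    return "{\"" + field + "\":{\"$gte\":\"" + min + "\",\"$lt\":\"" + max + "\"}}";
}
```

Using `$gte` with `$lt` (and not `$lte`) matters: a key equal to a chunk's max belongs to the next chunk, so an inclusive upper bound would read that record twice.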

MongoDB - InputFormat

InputFormat Example
Husky uses the MongoDB C++ Driver to carry out the procedure above.
In C++ Husky, add the following to your application:
    husky::io::MongoDBInputFormat infmt;
    infmt.set_server("proj5:20001");
    infmt.set_ns("hdb", "enwiki");
    husky::load(infmt, parse_lambda);

InputFormat Example
Build WordCountMongo:
    $ cd build
    $ make WordCountMongo
Set 20 threads for each worker in the configuration file, then run:
    $ ./Master --conf default.cfg
    $ ./exec.sh WordCountMongo --conf default.cfg
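What the parse function does with each record is up to the application. For a word count it might tokenise a text field and update per-word counts, roughly as in the sketch below. `count_words` is a hypothetical standalone helper, not code taken from WordCountMongo, and it does not use any Husky API.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Split `text` into lower-cased alphanumeric words and add them to `counts`.
void count_words(const std::string& text, std::map<std::string, int>& counts) {
    std::string word;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else if (!word.empty()) {
            ++counts[word];  // a non-alphanumeric character ends the current word
            word.clear();
        }
    }
    if (!word.empty()) ++counts[word];  // flush the last word
}
```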

More
Any improvement for MongoDBInputFormat?
MongoDBInputFormat only reads; how would writing work?
Is it the same as HDFSLineInputFormat?
Access time optimization?
Other data storage systems?

Thank you