Vasileios Zois (Student ID: 4183)
University of Patras, Department of Computer Engineering & Informatics
Diploma Thesis

 Distributed Systems
 Hadoop Distributed File System (HDFS)
 Distributed Database (HBase)
 MapReduce Programming Model
 Study of B & B+ Trees
 Building Trees on HBase
 Range Queries on B+ & B Trees
 Experiments in the Construction of Trees
 Analysis of Results
 Conclusions

 Open-source implementation of GFS (Google File System), the distributed file system used by Google
 A distributed file system providing:
▫ Management of large amounts of data
▫ Failure detection & automatic recovery
▫ Scalability
 Written in Java
▫ Independent of the operating system
▫ Runs on computers with different hardware

 HBase: open-source implementation of BigTable
 A NoSQL system in the column-family store category
▫ Data organized in tables
▫ Tables divided into column families
 Architecture similar to HDFS; runs on top of HDFS

 Distributed programming model for data-intensive applications
 Distributed computation on a cluster of machines
 Rooted in functional programming: a Map function and a Reduce function
 Operation
▫ Data structured as (key, value) pairs
▫ Input processed in parallel (Mapper)
▫ Intermediate results processed (Reducer)
 Map(k1, v1) → List(k2, v2)
 Reduce(k2, List(v2)) → List(v3)
(a minimal sketch of the model follows below)
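To make the two signatures concrete, here is a minimal word-count job in Hadoop's Java MapReduce API. This is an illustrative sketch, not the thesis code; the class and field names are my own.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map(k1, v1) -> List(k2, v2): emit (word, 1) for every token of the line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce(k2, List(v2)) -> List(v3): sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(key, new IntWritable(sum));
    }
  }
}
```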

 Mapper
▫ Processes the input data
▫ Emits pairs of the form (key, value)
 Custom Partitioner
▫ Clusters the data so that each reducer receives a specific range of values (see the sketch below)
 Reducer
▫ Builds the tree (BulkInsert, BulkLoading)
▫ Keeps some data in memory during processing
▫ On cleanup, writes the tree to an HBase table
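A custom partitioner of the kind described above could look like the following sketch. The split points are hypothetical (in practice they would be derived from the key distribution); only the routing logic matters here.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Range partitioner: each reducer receives one contiguous key range,
// so every reducer can build an independent tree over its range.
public class RangePartitioner extends Partitioner<LongWritable, Text> {
  // Hypothetical split points; real ones would be sampled from the input.
  private static final long[] SPLITS = {250_000L, 500_000L, 750_000L};

  @Override
  public int getPartition(LongWritable key, Text value, int numReducers) {
    for (int i = 0; i < SPLITS.length && i < numReducers - 1; i++) {
      if (key.get() < SPLITS[i]) return i;
    }
    return Math.min(SPLITS.length, numReducers - 1); // keys above the last split
  }
}
```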

 More efficient
▫ Lower physical memory requirements
▫ Completes in fewer steps, O(n/B)
▫ Relatively easy implementation
 Execution steps (sketched in code below)
▫ Keys arrive sorted from the map phase
▫ Divide them into leaves
▫ Save the separator information for the next level
▫ Write the created nodes when the buffer is full
▫ Repeat the procedure until the root is reached
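The steps above amount to a bottom-up, level-by-level construction. The following sketch shows one plausible shape for it, assuming string keys already sorted by the shuffle and a writeNode stub standing in for the buffered HBase writes:

```java
import java.util.ArrayList;
import java.util.List;

// Bottom-up bulk loading sketch: pack sorted keys into nodes of order B,
// keep the last key of each node as the separator for the next level,
// and repeat until a single root remains.
public class BulkLoader {
  static void buildTree(List<String> sortedKeys, int B) {
    List<String> level = sortedKeys;        // level 0: the sorted leaf keys
    while (level.size() > 1) {
      level = buildLevel(level, B);         // each pass creates one tree level
    }
  }

  static List<String> buildLevel(List<String> keys, int B) {
    List<String> separators = new ArrayList<>();
    for (int i = 0; i < keys.size(); i += B) {
      List<String> node = keys.subList(i, Math.min(i + B, keys.size()));
      writeNode(node);                      // flushed when the write buffer fills
      separators.add(node.get(node.size() - 1));
    }
    return separators;                      // becomes the next (parent) level
  }

  static void writeNode(List<String> keys) {
    // stub: in the thesis pipeline this would buffer Put objects for HBase
  }
}
```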

 Tree node = row in the HBase table
 A column family is defined for the node's contents
 Row key
▫ Internal nodes: the last key of the respective node
▫ Leaves: a special tag prepended to the last node key, so that leaves sort together in lexicographic order
(see the row-key sketch below)
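A sketch of this row-key scheme using the HBase client API; the column family name "node", the qualifier "keys", and the "L#" leaf tag are illustrative assumptions, not the thesis' actual identifiers:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NodeRows {
  private static final byte[] CF = Bytes.toBytes("node");    // assumed family name
  private static final byte[] QUAL = Bytes.toBytes("keys");  // assumed qualifier

  // Internal node: the row key is simply the last key contained in the node.
  static Put internalNode(String lastKey, String serializedNode) {
    Put p = new Put(Bytes.toBytes(lastKey));
    p.addColumn(CF, QUAL, Bytes.toBytes(serializedNode));
    return p;
  }

  // Leaf: a tag ("L#" here, hypothetical) is prepended to the last key,
  // so that all leaf rows sort together lexicographically.
  static Put leafNode(String lastKey, String serializedNode) {
    Put p = new Put(Bytes.toBytes("L#" + lastKey));
    p.addColumn(CF, QUAL, Bytes.toBytes(serializedNode));
    return p;
  }
}
```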

 Check the range covered by each tree
 Find the leaves
▫ The leaf containing the left end of the range
▫ The leaf containing the right end of the range
 HBase table
▫ Scan to find the keys, using the row key of each bounding leaf as a scan bound (sketched below)
 Complexity
▫ T trees, E keys per tree, B tree order
▫ O(2 * (T + log_B(E)))
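Once the two bounding leaves are known, the scan over the leaf rows could look like this sketch, reusing the assumed "L#" row-key layout from above and assuming an HBase client version that provides withStartRow/withStopRow:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LeafRangeScan {
  // Scan all leaf rows between the leaf holding the left end of the
  // range and the leaf holding the right end (inclusive).
  static void scanLeaves(Table table, String leftLeafKey, String rightLeafKey)
      throws IOException {
    Scan scan = new Scan()
        .withStartRow(Bytes.toBytes("L#" + leftLeafKey))
        .withStopRow(Bytes.toBytes("L#" + rightLeafKey), true);
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result leafRow : scanner) {
        // filter the keys stored in each leaf against the query range
      }
    }
  }
}
```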

 As with B+ trees, find the trees that cover the required range
 Pinpoint the individual trees from start to end
 Execute a depth-first search on each tree
▫ DFS also retrieves the keys stored in internal nodes
 Complexity
▫ DFS on one tree: O(|V| + |E|); over T trees: O(|V| + |E|) * T
(traversal sketched below)
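An iterative form of that traversal might look as follows. Node and NodeStore are hypothetical in-memory views of rows fetched from the HBase table, introduced only to keep the sketch self-contained:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class BTreeRangeQuery {
  interface Node {
    List<String> keys();           // keys stored in this node (internal or leaf)
    List<String> childRowKeys();   // row keys of the children; empty for leaves
  }

  interface NodeStore {
    Node fetch(String rowKey);     // e.g. one HBase Get per visited node
  }

  // Visits every node of one tree, O(|V| + |E|); repeating it for each of the
  // T trees overlapping the query range gives the O(|V| + |E|) * T bound.
  static void collectRange(NodeStore store, Node root,
                           String lo, String hi, List<String> out) {
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      for (String key : node.keys()) {  // unlike B+ trees, internal nodes hold data keys
        if (key.compareTo(lo) >= 0 && key.compareTo(hi) <= 0) out.add(key);
      }
      for (String childRowKey : node.childRowKeys()) {
        stack.push(store.fetch(childRowKey));
      }
    }
  }
}
```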

 Hadoop & HBase
▫ Hadoop version
▫ HBase version
 Operating system: Debian-based
 Machines: 4, on the Okeanos cloud
▫ 4 virtual CPUs per machine
▫ 2048 MB RAM per machine
▫ 40 GB HDD per machine
 Data
▫ TPC-H, Orders table (cust_id, order_id)

 Quantities observed in the experiments
▫ Tree order
▫ Execution time
▫ Necessary storage space
▫ Physical memory
▫ Number of reducers

 Comparison of trees with orders 5 & 101
▫ Increased execution time, caused by the rebalancing operation
▫ Physical memory & HDD space consumed by the information needed for the tree structure
 Conclusion
▫ Scalability problem: large physical memory requirements and increased execution time

Tree Order 5                      B+Tree      B-Tree
Input Data Size                   230 MB      230 MB
Output Tree Size                  2.2 GB      1.4 GB
Execution Time (sec)
Median Map Time (sec)             56.29       55
Median Shuffle Time (sec)         28          28.75
Median Reduce Time (sec)          125.5       88.25
Number of Reducers                8           8
Physical Memory Allocated         19525 MB    15222 MB

Tree Order 101                    B+Tree      B-Tree
Input Data Size                   230 MB      230 MB
Output Tree Size                  598.2 MB    256 MB
Execution Time (sec)
Median Map Time (sec)             52          49.86
Median Shuffle Time (sec)         28.63       29.75
Median Reduce Time (sec)          68.25       66.25
Number of Reducers                8           8
Physical Memory Allocated         9501 MB     9286 MB

 BulkLoading vs BulkInsert
▫ Shorter execution time
▫ Lower physical memory requirements
▫ Less space required on the HDD
 Testing buffer variation
▫ Buffer sizes of 128 and 512
▫ Shorter execution time
▫ Adjustable physical memory requirements

Tree Order 101, Buffer 128        B+Tree      B-Tree
Input Data Size                   230 MB      230 MB
Output Tree Size                  267.1 MB    256 MB
Execution Time (sec)
Median Map Time (sec)             51.14       53.57
Median Reduce Time (sec)          43.5        37.75
Number of Reducers                8           8
Buffer Size (Put objects)         128         128
Physical Memory Allocated         6517 MB     6165 MB

Tree Order 101, Buffer 512        B+Tree      B-Tree
Input Data Size                   230 MB      230 MB
Output Tree Size                  267.1 MB    256 MB
Execution Time (sec)
Median Map Time (sec)             52          55.14
Median Reduce Time (sec)          33          30.63
Number of Reducers                8           8
Buffer Size (Put objects)         512         512
Physical Memory Allocated         6613 MB     6678 MB

 Comparing the building techniques
▫ BulkInsert
   - Precise choice of tree order
   - Increased execution time for small-order trees, due to constant rebalancing
   - High physical memory requirements
   - Not very scalable
▫ BulkLoading
   - The created tree is full (the next insert could cause a tree rebalancing)
   - Shorter execution time
   - Adjustable physical memory requirements
   - More complicated implementation
 Why use B & B+ trees
▫ In combination with pre-warming techniques
▫ Less burden on the master
▫ Communication between slaves