Practical Hadoop: do’s and don’ts by example Kacper Surdy, Zbigniew Baranowski.

Goal
- forewarn of common mistakes and pitfalls
- show good practices
- point out limitations of the Hadoop ecosystem
Based on real-life cases.

Outline
- Number of files in HDFS
- Accessing HDFS files directly
- HBase table scanning

HDFS metadata in namenode memory
The namenode keeps the entire HDFS metadata in memory. Stored objects: directories, files, blocks. The memory footprint can become an issue, in particular when storing many small files.

Memory size estimation
Estimates for Hadoop (figures for version 0.15). Typical estimate: 150 bytes / object. On our clusters: at least 350 bytes / object.

Object       size estimate (bytes)     typical size (bytes, 0.15)
File         … + fileName.length       125
Directory    … + fileName.length       155
Block        … × replication           184

File size vs. cluster capacity (1)
Assumptions:
- 20 GB of namenode heap used for HDFS objects
- 256 MB block size
- 350 bytes used to store each object
- flat directory structure (#files >> #directories)
Using the simplified formula: (#files + #directories + #blocks) × object_size
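The simplified formula above can be sketched in a few lines of plain Java. This is an illustrative calculation only: the class and method names are made up, and the constants are the ones assumed on the slide (20 GB heap, 256 MB blocks, 350 bytes per metadata object, flat directory structure, so each file costs one file object plus one object per block).

```java
// Sketch: how much data can a namenode heap track, for a given average file size?
// Uses the slide's simplified formula: (#files + #directories + #blocks) * object_size,
// with #directories neglected (flat structure, #files >> #directories).
public class NamenodeCapacity {
    static final long OBJECT_SIZE = 350;       // bytes per metadata object (our clusters)
    static final long HEAP_BYTES = 20L << 30;  // 20 GB of namenode heap
    static final long BLOCK_SIZE = 256L << 20; // 256 MB HDFS block size

    // Usable cluster capacity in bytes for a given average file size in bytes.
    static long capacityBytes(long avgFileSize) {
        long blocksPerFile = Math.max(1, (avgFileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
        long objectsPerFile = 1 + blocksPerFile; // one file entry + its blocks
        long maxFiles = HEAP_BYTES / (OBJECT_SIZE * objectsPerFile);
        return maxFiles * avgFileSize;
    }

    public static void main(String[] args) {
        // The small-file effect: same heap, wildly different usable capacity.
        System.out.println("1 MB files   -> " + capacityBytes(1L << 20) / (1L << 40) + " TB");
        System.out.println("256 MB files -> " + capacityBytes(256L << 20) / (1L << 40) + " TB");
    }
}
```

Running this shows the small-file effect clearly: with 1 MB average files the 20 GB heap caps the cluster at roughly 29 TB of data, while with 256 MB files the same heap tracks about 7489 TB.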

File size vs. cluster capacity (2)
[chart: cluster capacity (TB) as a function of average file size (MB)]

Misunderstandings of the Hadoop API (1)
A distributed file system ≠ parallelized computation. Single-client HDFS operations are usually not parallelized, even though they operate on a distributed file system. To perform parallel computations you need a computing framework: MapReduce, Spark, Impala, etc.

Misunderstandings of the Hadoop API (2)
Operations will be single-threaded when using:
- command line tools like hdfs dfs -put ... and hdfs dfs -get ...
- the HDFS Java API, e.g. org.apache.hadoop.fs.FileSystem.append(...)
Parallel execution will be performed by:
- specialized tools like hadoop distcp ... and sqoop-import ..., because they submit a MapReduce job
- your own jobs submitted using MapReduce, Spark, etc.

HBase table scanning
My HBase table: 1 million rows, size 3 GB, with a generic (meaningless) rowkey. Let's scan the table with a filter (where my_id = 'zbaranow'). Execution time = 15 s. What is so slow? Can we do better?

What is slow? Instrument your code!

startTime = System.currentTimeMillis();
HTable table = new HTable(config, args[0]);
endTime = System.currentTimeMillis();
System.out.println("Opening table :" + (endTime - startTime) + " ms");

What is slow? Instrument your code!

TOTAL time :15923 ms
Loading configuration :262 ms
Opening table :1229 ms
Setting up scan time :2 ms
Setting up filter time :1 ms
Scanner creation time :558 ms
Data scanning time :13871 ms => 220 MB/s
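The phase-by-phase timing above can be factored into a small reusable helper. This is a generic sketch: the PhaseTimer class is hypothetical (not part of the HBase client API), and the sleeps merely simulate work so the slow phase stands out in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal phase-timing helper: wrap each step of the client in time(name, work)
// and print a per-phase report, so the dominant cost (here the simulated
// "Data scanning" phase) is obvious at a glance.
public class PhaseTimer {
    private final Map<String, Long> timings = new LinkedHashMap<>(); // keeps phase order

    public void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        timings.put(phase, (System.nanoTime() - start) / 1_000_000); // elapsed ms
    }

    public Map<String, Long> report() { return timings; }

    public static void main(String[] args) {
        PhaseTimer t = new PhaseTimer();
        t.time("Loading configuration", () -> sleep(5));   // would be HBaseConfiguration.create()
        t.time("Opening table",         () -> sleep(10));  // would be new HTable(config, name)
        t.time("Data scanning",         () -> sleep(50));  // would be iterating the ResultScanner
        t.report().forEach((phase, ms) -> System.out.println(phase + " :" + ms + " ms"));
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```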

Can we do better? Is 220 MB/s really "big data"? Oracle on NAS can do it faster. What is going on with HBase and my table? Check the HBase master page: my table has only one region. Is that good?

My table has only one region
Reading is done by a single region server only (top shows one busy hbase process: VIRT 3.3g, RES 25m). The region server page shows that a single thread is reading the data: one reader (id=3) in use out of the 10 available.

Do we use cache?

Let's split the table into more regions

hbase> split ‘mytable2’

Nothing changed with 2 regions: scanning time :13106 ms

hbase> move ‘e2f265a372ab aae a3’

Even worse: scanning time :14623 ms (the data are not fully local)

Let's split the table into more regions
- 16 regions, each on a separate server: scanning time :13602 ms, cache hit ratio < 80%
- 16 regions after manual reshuffling for optimal cache utilisation: scanning time :12783 ms, cache hit ratio > 80%
Conclusion: splitting a table into more regions does not by itself improve scanning performance; it can only improve the cache hit ratio, and that requires some manual work.

Why is it slow? Scanning a single region (225 MB) takes ~1 s, but by default HBase scans regions serially (one after another), in order to return results sorted by rowkey.

How to scan HBase in parallel?
- Multi-threaded subrange scanning: simple and fast, 1.6 seconds to scan the data
- MapReduce: slow for small tables
- Coprocessors: fast but not simple
- Impala with an external table interface: simple but suboptimal, 3-4 s to scan the table
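The multi-threaded subrange option can be sketched as follows. This is a minimal, self-contained illustration: the in-memory TreeMap stands in for the real table, and the class and method names are invented. With real HBase each scanRange call would be a Scan restricted to a [startRow, stopRow) range (via the scan's start/stop row settings), executed on its own thread.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of multi-threaded subrange scanning: split the rowkey space at chosen
// split points, scan each subrange concurrently, then concatenate the results
// in range order. A sorted in-memory map stands in for the HBase table.
public class ParallelSubrangeScan {
    static final SortedMap<String, String> table = new TreeMap<>();

    // Stand-in for an HBase Scan over the half-open range [start, stop).
    static List<String> scanRange(String start, String stop) {
        return new ArrayList<>(table.subMap(start, stop).values());
    }

    static List<String> parallelScan(List<String> splitPoints) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(splitPoints.size() - 1);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i + 1 < splitPoints.size(); i++) {
            final String start = splitPoints.get(i), stop = splitPoints.get(i + 1);
            futures.add(pool.submit(() -> scanRange(start, stop))); // one thread per subrange
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) results.addAll(f.get()); // range order
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        for (char c = 'a'; c <= 'z'; c++) table.put("row-" + c, "value-" + c);
        List<String> all = parallelScan(Arrays.asList("row-a", "row-h", "row-p", "row-z~"));
        System.out.println(all.size() + " rows scanned in parallel");
    }
}
```

Note that collecting the futures in range order keeps the overall result sorted by rowkey, the same guarantee the default serial region-by-region scan provides.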

Conclusions
- Full HBase table scans should be avoided: HBase was not designed for this.
  - They do not scale by default; however, parallel scanning can make the process scalable.
  - You cannot rely on the HBase cache when scanning.
  - Other technologies can still do it better.
- There is not much profiling instrumentation available to an HBase user.
  - Instrumenting the client code is important.
  - The HBase monitoring pages are useful for understanding the topology of the data and potential bottlenecks.
  - Unit testing in an isolated environment is key to understanding your data flow.