Practical Hadoop: do’s and don’ts by example Kacper Surdy, Zbigniew Baranowski.

Goal
- forewarn of common mistakes and pitfalls
- show good practices
- point out limitations of the Hadoop ecosystem
Based on real-life cases.

Outline
- Number of files in HDFS
- Accessing HDFS files directly
- HBase table scanning

HDFS metadata in namenode memory
The namenode keeps the entire HDFS metadata in memory. Stored objects: directories, files, blocks. The memory footprint can become an issue, in particular when storing many small files.

Memory size estimation
Estimates for Hadoop (figures for version 0.15). Typical estimate: 150 bytes / object. On our clusters: at least 350 bytes / object.

Object       size estimate (bytes)     typical size (bytes, 0.15)
File         … + fileName.length       125
Directory    … + fileName.length       155
Block        … × replication           184

File size vs. cluster capacity (1)
Assumptions:
- 20 GB of namenode heap used for HDFS objects
- 256 MB block size
- 350 bytes used to store each object
- flat directory structure (#files >> #directories)
Using the simplified formula: (#files + #directories + #blocks) × object_size
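The simplified formula above can be sketched in a few lines of plain Java. This is an illustrative calculation only: the class and method names are made up, and the constants are the ones assumed on the slide (20 GB heap, 256 MB blocks, 350 bytes per metadata object, flat directory structure, so each file costs one file object plus one object per block).

```java
// Sketch: how much data can a namenode heap track, for a given average file size?
// Uses the slide's simplified formula: (#files + #directories + #blocks) * object_size,
// with #directories neglected (flat structure, #files >> #directories).
public class NamenodeCapacity {
    static final long OBJECT_SIZE = 350;       // bytes per metadata object (our clusters)
    static final long HEAP_BYTES = 20L << 30;  // 20 GB of namenode heap
    static final long BLOCK_SIZE = 256L << 20; // 256 MB HDFS block size

    // Usable cluster capacity in bytes for a given average file size in bytes.
    static long capacityBytes(long avgFileSize) {
        long blocksPerFile = Math.max(1, (avgFileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
        long objectsPerFile = 1 + blocksPerFile; // one file entry + its blocks
        long maxFiles = HEAP_BYTES / (OBJECT_SIZE * objectsPerFile);
        return maxFiles * avgFileSize;
    }

    public static void main(String[] args) {
        // The small-file effect: same heap, wildly different usable capacity.
        System.out.println("1 MB files   -> " + capacityBytes(1L << 20) / (1L << 40) + " TB");
        System.out.println("256 MB files -> " + capacityBytes(256L << 20) / (1L << 40) + " TB");
    }
}
```

Running this shows the small-file effect clearly: with 1 MB average files the 20 GB heap caps the cluster at roughly 29 TB of data, while with 256 MB files the same heap tracks about 7489 TB.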

File size vs. cluster capacity (2)
[chart: cluster capacity (TB) as a function of average file size (MB)]

Misunderstandings of the Hadoop API (1)
A distributed file system ≠ parallelized computation. Single-client HDFS operations are usually not parallelized, even though they operate on a distributed file system. To perform parallel computations you need a computing framework: MapReduce, Spark, Impala, etc.

Misunderstandings of the Hadoop API (2)
Operations will be single-threaded when using:
- command line tools like hdfs dfs -put ... and hdfs dfs -get ...
- the HDFS Java API, e.g. org.apache.hadoop.fs.FileSystem.append(...)
Parallel execution will be performed by:
- specialized tools like hadoop distcp ... and sqoop-import ..., because they submit a MapReduce job
- your own jobs submitted using MapReduce, Spark, etc.

HBase table scanning
My HBase table: 1 million rows, size 3 GB, with a generic (meaningless) rowkey. Let's scan the table with a filter (where my_id = 'zbaranow'). Execution time = 15 s. What is so slow? Can we do better?

What is slow? Instrument your code!

startTime = System.currentTimeMillis();
HTable table = new HTable(config, args[0]);
endTime = System.currentTimeMillis();
System.out.println("Opening table :" + (endTime - startTime) + " ms");

What is slow? Instrument your code!

TOTAL time :15923 ms
Loading configuration :262 ms
Opening table :1229 ms
Setting up scan time :2 ms
Setting up filter time :1 ms
Scanner creation time :558 ms
Data scanning time :13871 ms => 220 MB/s
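The phase-by-phase timing above can be factored into a small reusable helper. This is a generic sketch: the PhaseTimer class is hypothetical (not part of the HBase client API), and the sleeps merely simulate work so the slow phase stands out in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal phase-timing helper: wrap each step of the client in time(name, work)
// and print a per-phase report, so the dominant cost (here the simulated
// "Data scanning" phase) is obvious at a glance.
public class PhaseTimer {
    private final Map<String, Long> timings = new LinkedHashMap<>(); // keeps phase order

    public void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        timings.put(phase, (System.nanoTime() - start) / 1_000_000); // elapsed ms
    }

    public Map<String, Long> report() { return timings; }

    public static void main(String[] args) {
        PhaseTimer t = new PhaseTimer();
        t.time("Loading configuration", () -> sleep(5));   // would be HBaseConfiguration.create()
        t.time("Opening table",         () -> sleep(10));  // would be new HTable(config, name)
        t.time("Data scanning",         () -> sleep(50));  // would be iterating the ResultScanner
        t.report().forEach((phase, ms) -> System.out.println(phase + " :" + ms + " ms"));
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```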

Can we do better? Is 220 MB/s really "big data"? Oracle on NAS can do it faster. What is going on with HBase and my table? Check the HBase master page: my table has only one region. Is that good?

My table has only one region
Reading is done by a single region server only (top shows one busy hbase process: VIRT 3.3g, RES 25m). The region server page shows that a single thread is reading the data: one reader (id=3) in use out of the 10 available.

Do we use cache?

Let's split the table into more regions

hbase> split ‘mytable2’

Nothing changed with 2 regions: scanning time :13106 ms

hbase> move ‘e2f265a372ab aae a3’

Even worse: scanning time :14623 ms (the data are not fully local)

Let's split the table into more regions
- 16 regions, each on a separate server: scanning time :13602 ms, cache hit ratio < 80%
- 16 regions after manual reshuffling for optimal cache utilisation: scanning time :12783 ms, cache hit ratio > 80%
Conclusion: splitting a table into more regions does not by itself improve scanning performance; it can only improve the cache hit ratio, and that requires some manual work.

Why is it slow? Scanning a single region (225 MB) takes ~1 s, but by default HBase scans regions serially (one after another), in order to return results sorted by rowkey.

How to scan HBase in parallel?
- Multi-threaded subrange scanning: simple and fast, 1.6 seconds to scan the data
- MapReduce: slow for small tables
- Coprocessors: fast but not simple
- Impala with an external table interface: simple but suboptimal, 3-4 s to scan the table
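The multi-threaded subrange option can be sketched as follows. This is a minimal, self-contained illustration: the in-memory TreeMap stands in for the real table, and the class and method names are invented. With real HBase each scanRange call would be a Scan restricted to a [startRow, stopRow) range (via the scan's start/stop row settings), executed on its own thread.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of multi-threaded subrange scanning: split the rowkey space at chosen
// split points, scan each subrange concurrently, then concatenate the results
// in range order. A sorted in-memory map stands in for the HBase table.
public class ParallelSubrangeScan {
    static final SortedMap<String, String> table = new TreeMap<>();

    // Stand-in for an HBase Scan over the half-open range [start, stop).
    static List<String> scanRange(String start, String stop) {
        return new ArrayList<>(table.subMap(start, stop).values());
    }

    static List<String> parallelScan(List<String> splitPoints) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(splitPoints.size() - 1);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i + 1 < splitPoints.size(); i++) {
            final String start = splitPoints.get(i), stop = splitPoints.get(i + 1);
            futures.add(pool.submit(() -> scanRange(start, stop))); // one thread per subrange
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) results.addAll(f.get()); // range order
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        for (char c = 'a'; c <= 'z'; c++) table.put("row-" + c, "value-" + c);
        List<String> all = parallelScan(Arrays.asList("row-a", "row-h", "row-p", "row-z~"));
        System.out.println(all.size() + " rows scanned in parallel");
    }
}
```

Note that collecting the futures in range order keeps the overall result sorted by rowkey, the same guarantee the default serial region-by-region scan provides.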

Conclusions
- Full HBase table scans should be avoided: HBase was not designed for this.
  - They do not scale by default; however, parallel scanning can make the process scalable.
  - You cannot rely on the HBase cache when scanning.
  - Other technologies can still do it better.
- There is not much profiling instrumentation available to an HBase user.
  - Instrumenting the client code is important.
  - The HBase monitoring pages are useful for understanding the topology of the data and potential bottlenecks.
  - Unit testing in an isolated environment is key to understanding your data flow.