1 Practical Hadoop: do’s and don’ts by example
Kacper Surdy, Zbigniew Baranowski

2 Goal
- forewarn of the common mistakes/pitfalls
- show good practices
- point out limitations of the Hadoop ecosystem
Based on real-life cases

3 Outline
- Number of files in HDFS
- Accessing HDFS files directly
- HBase table scanning

4 HDFS metadata in namenode memory
- The namenode keeps the entire HDFS metadata in memory
- Stored objects: directories, files, blocks
- The memory footprint might be an issue, in particular when storing small files

5 Memory size estimation
- Estimates for Hadoop 0.13.1 and 0.15
- Typical estimate: 150 bytes / object
- On our clusters: at least 350 bytes / object

Object    | size estimate (bytes), 0.13.1 | typical size (bytes), 0.13.1 | size estimate (bytes), 0.15 | typical size (bytes), 0.15
File      | 224 + 2 * fileName.length     | 250                          | 112 + fileName.length       | 125
Directory | 264 + 2 * fileName.length     | 290                          | 144 + fileName.length       | 155
Block     | 152 + 72 * replication        | 368                          | 112 + 24 * replication      | 184

6 File size vs. cluster capacity (1)
Assumptions:
- 20 GB of namenode heap used for HDFS objects
- 256 MB block size
- 350 bytes used to store each object
- Using the simplified formula: (#files + #directories + #blocks) * object_size
- Flat directory structure (#files >> #directories)

7 File size vs. cluster capacity (2)

avg file size (MB) | capacity (TB)
0.1                |      2.93
1                  |     29.3
10                 |    293
100                |   2926
1000               |  11703
10000              |  14272
100000             |  14927
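These figures follow from the simplified formula on the previous slide. A minimal Java sketch that reproduces them, under the same assumptions (20 GiB of namenode heap, 350 bytes per object, 256 MB blocks, flat directory structure):

// Rough namenode capacity estimate: how much data fits before the heap budget
// for HDFS objects is exhausted, as a function of the average file size.
public class NamenodeCapacity {
    public static void main(String[] args) {
        final double heapBytes = 20.0 * 1024 * 1024 * 1024;    // 20 GiB reserved for HDFS objects
        final double bytesPerObject = 350;                     // observed footprint per object
        final double blockMB = 256;                            // HDFS block size
        final double maxObjects = heapBytes / bytesPerObject;  // ~61 million objects

        for (double fileMB : new double[]{0.1, 1, 10, 100, 1000, 10000, 100000}) {
            double blocksPerFile = Math.ceil(fileMB / blockMB);  // every file has at least one block
            double objectsPerFile = 1 + blocksPerFile;           // the file entry plus its blocks
            double files = maxObjects / objectsPerFile;          // directories ignored (flat layout)
            double capacityTB = files * fileMB / (1024 * 1024);
            System.out.printf("avg file size %8.1f MB -> capacity %6.0f TB%n", fileMB, capacityTB);
        }
    }
}

Below the block size the capacity grows linearly with the average file size (namenode metadata is the bottleneck); above it, each additional block costs another namenode object, so the capacity levels off at roughly maxObjects * block size, about 15 PB here.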

8 Misunderstandings of the Hadoop API (1)
- A distributed file system ≠ parallelized computation
- Single-client HDFS operations are usually not parallelized, even though they operate on a distributed file system
- To perform parallel computations you need a computing framework: MapReduce, Spark, Impala, etc. (see the sketch below)
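To make the last bullet concrete, a minimal sketch of letting a framework do the parallel work (Spark's Java API; the HDFS path is hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: Spark splits the HDFS file into partitions (roughly one per block)
// and counts the lines in parallel across the cluster's executors.
public class ParallelLineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParallelLineCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        long lines = sc.textFile("hdfs:///data/big_file.csv").count();  // hypothetical path
        System.out.println("lines: " + lines);
        sc.stop();
    }
}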

9 Misunderstandings of the Hadoop API (2)
- Operations will be single-threaded when using:
  - command line tools like hdfs dfs -put ... and hdfs dfs -get ...
  - the HDFS Java API, e.g. hadoop.fs.FileSystem.append(...) (see the sketch below)
- Parallel execution will be performed by some specialized tools:
  - hadoop distcp ..., sqoop-import ... (because they submit a MapReduce job)
  - your jobs submitted using MapReduce, Spark, etc.
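As an illustration of the single-client case (the path and the payload are hypothetical): however many datanodes the cluster has, the append below is streamed by the one JVM that runs it, through a single write pipeline.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleClientAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Single-client operation: the data goes through this one process,
        // so adding cluster nodes does not make it any faster.
        FSDataOutputStream out = fs.append(new Path("/tmp/example.log"));  // hypothetical path
        out.writeBytes("one more line\n");
        out.close();
        fs.close();
    }
}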

10 HBase table scanning
My HBase table:
- 1 million rows, size 3 GB
- has a generic rowkey (meaningless)
Let's scan the table with a filter (where my_id = 'zbaranow')
- Process execution time = 15 s
What is so slow? Can we do better?

11 What is slow? Instrument your code

startTime = System.currentTimeMillis();
HTable table = new HTable(config, args[0]);   // opening the table is timed separately
endTime = System.currentTimeMillis();
System.out.println("Opening table :" + (endTime - startTime) + " ms");
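Expanded into a fuller sketch of the timed scan (a sketch only: the column family cf and qualifier my_id are assumptions, and the old HTable-style client API shown on the slide is kept):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimedScan {
    public static void main(String[] args) throws IOException {
        long t0 = System.currentTimeMillis();
        Configuration config = HBaseConfiguration.create();
        System.out.println("Loading configuration :" + (System.currentTimeMillis() - t0) + " ms");

        long t1 = System.currentTimeMillis();
        HTable table = new HTable(config, args[0]);
        System.out.println("Opening table :" + (System.currentTimeMillis() - t1) + " ms");

        // Filter on a (hypothetical) column cf:my_id; the comparison runs server-side,
        // but every row of every region still has to be read from disk or cache.
        Scan scan = new Scan();
        scan.setCaching(1000);   // fetch rows from the region server in larger batches
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("my_id"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("zbaranow")));

        long t2 = System.currentTimeMillis();
        ResultScanner scanner = table.getScanner(scan);
        System.out.println("Scanner creation time :" + (System.currentTimeMillis() - t2) + " ms");

        long t3 = System.currentTimeMillis();
        long rows = 0;
        for (Result r : scanner) {
            rows++;   // stand-in for real row processing
        }
        scanner.close();
        table.close();
        System.out.println("Data scanning time :" + (System.currentTimeMillis() - t3) + " ms (" + rows + " rows)");
    }
}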

12 What is slow? Instrument your code!
- TOTAL time :15923 ms
- Loading configuration :262 ms
- Opening table :1229 ms
- Setting up scan time :2 ms
- Setting up filter time :1 ms
- Scanner creation time :558 ms
- Data scanning time :13871 ms => 220 MB/s

13 Can we do better?
- 220 MB/s - big data? Oracle on NAS can do it faster
- What is going on with HBase and my table?
- Check the HBase master page: https://hbase-master:60010
- My table has only one region. Is that good?

14 My table has only one region
- Reading is done by a single region server:
  PID   USER   PR  NI  VIRT   RES   SHR  S  %CPU   %MEM
  21524 hbase  20  0   3869m  3.3g  25m  S  203.7  5.2
- Region server page: http://rserverName:60030
- A single thread is reading the data: one reader (id=3) in use out of 10 available

15 Do we use cache?

16 Let's split the table into more regions
hbase> split 'mytable2'
- Nothing has changed with 2 regions - scanning time :13106 ms
hbase> move 'e2f265a372ab21635544aae8595256a3'
- Even worse: scanning time :14623 ms (data are not fully local)

17 Let's split the table into more regions
- 16 regions, each on a separate server: scanning time :13602 ms, cache hit ratio <80%
- 16 regions, after reshuffling (manually) for optimal cache utilisation: scanning time :12783 ms, cache hit ratio >80%
Conclusion: splitting the table into more regions does not give a real scanning performance improvement; it can only improve the cache hit ratio, and that requires some manual work

18 Why is it slow?
- Scanning a single region (225 MB) takes ~1 s
- By default HBase scans regions serially (one after another); this is to return results sorted by rowkey

19 How to scan HBase in parallel?
- Multi-threaded subrange scanning - simple and fast: 1.6 seconds to scan the data (see the sketch below)
- MapReduce - slow for small tables
- Coprocessors - fast but not simple
- Impala with an external table interface - simple but suboptimal: 3-4 s to scan the table
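A sketch of the first approach, multi-threaded subrange scanning. Assumptions not on the slide: one thread per region, subranges taken from the region boundaries, and simple row counting standing in for the real processing:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ParallelScan {

    // Scan one [start, stop) rowkey range with its own table handle and count the rows.
    static Callable<Long> rangeScanner(final Configuration conf, final String tableName,
                                       final byte[] start, final byte[] stop) {
        return new Callable<Long>() {
            public Long call() throws IOException {
                HTable table = new HTable(conf, tableName);
                Scan scan = new Scan(start, stop);
                scan.setCaching(1000);
                ResultScanner scanner = table.getScanner(scan);
                long rows = 0;
                for (Result r : scanner) rows++;
                scanner.close();
                table.close();
                return rows;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, args[0]);
        byte[][] startKeys = table.getStartKeys();   // region boundaries define the subranges
        byte[][] endKeys = table.getEndKeys();
        table.close();

        ExecutorService pool = Executors.newFixedThreadPool(startKeys.length);
        List<Future<Long>> parts = new ArrayList<Future<Long>>();
        for (int i = 0; i < startKeys.length; i++) {
            parts.add(pool.submit(rangeScanner(conf, args[0], startKeys[i], endKeys[i])));
        }
        long total = 0;
        for (Future<Long> part : parts) total += part.get();
        pool.shutdown();
        System.out.println("rows scanned: " + total);
    }
}

Because the regions are scanned concurrently, the results are no longer globally sorted by rowkey; that is the trade-off that makes this approach fast.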

20 Conclusions
- Full HBase table scanning should be avoided - HBase was not designed for this
  - It does not scale by default; however, parallel scanning can make the process scalable
  - You cannot rely on the HBase cache when scanning
  - Other technologies can still do it better
- There isn't a lot of profiling instrumentation for an HBase user
  - Instrumentation of the client code is important
  - The HBase monitoring pages are useful for understanding the topology of the data and potential bottlenecks
- Unit testing on an isolated environment is key to understanding your data flow

