Hadoop tutorials

Today's agenda
- Hadoop Introduction and Architecture
- Hadoop Distributed File System
- MapReduce
- Spark

Cloudera image for hands-on
- Installation instructions: https://cern.ch/test-zbaranow/CVM.txt

Hadoop Introduction

What is Hadoop? (1)
- A framework for large-scale data processing
  - Volume
  - Variety
  - Velocity

What is Hadoop? (2)
- Solution for big data processing
  - Sequential data access (a brute-force approach)
  - Simplified data structures (no relational model)
  - Ideal for ad-hoc data analytics
- Instead of clever data lookups with indexing etc., which require:
  - Data analytics cases to be known beforehand
  - Complex data design

What is Hadoop? (3)
- Data locality (shared nothing): scales out
- [Diagram: Node 1 to Node X, each with its own CPU, memory and disks, connected by an interconnect network]
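The shared-nothing idea can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the `partitions` data and the `local_partial` and `merge` helpers are hypothetical names. Each node computes a small partial result over the data on its own disks, and only those partials cross the interconnect network.

```python
from typing import Iterable, List, Tuple

# Hypothetical data: each inner list stands for the partition
# stored on one node's local disks.
partitions: List[List[float]] = [
    [4.0, 8.0, 6.0],        # node 1
    [10.0, 2.0],            # node 2
    [5.0, 7.0, 9.0, 1.0],   # node 3
]


def local_partial(values: Iterable[float]) -> Tuple[float, int]:
    """Runs where the data lives: returns (sum, count) for one partition."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def merge(partials: List[Tuple[float, int]]) -> float:
    """Combines the small per-node results into a global mean."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


# One partial per node; adding a node adds CPU, memory and disk bandwidth together.
partials = [local_partial(p) for p in partitions]
global_mean = merge(partials)
print(f"global mean = {global_mean:.3f}")
```

Only one `(sum, count)` pair per node is moved, regardless of how large each partition is, which is why this style of processing scales out.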

What is Hadoop? (4)
- Optimized storage access (for HDD)
  - Big data blocks (>= 128 MB)
  - Sequential IO instead of random IO
- HDD at 7200 rpm:
  - Sequential IO: ~120 MB/s
  - Random IO: 0.5 - 50 MB/s
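As a back-of-the-envelope check of these figures, the sketch below estimates how long a single disk needs to read 1 TB at the throughputs quoted on the slide. The 1 TB dataset size, the decimal units (1 TB = 1,000,000 MB) and the function name are assumptions made for illustration.

```python
import math

DATASET_MB = 1_000_000   # assumed dataset size: 1 TB in decimal megabytes
BLOCK_MB = 128           # minimum block size quoted on the slide


def read_time_hours(size_mb: float, throughput_mb_s: float) -> float:
    """Time to read size_mb at a constant throughput, in hours."""
    return size_mb / throughput_mb_s / 3600


sequential_h = read_time_hours(DATASET_MB, 120)    # ~120 MB/s sequential
random_best_h = read_time_hours(DATASET_MB, 50)    # best-case random IO
random_worst_h = read_time_hours(DATASET_MB, 0.5)  # worst-case random IO

# Number of 128 MB blocks needed to hold the dataset.
blocks = math.ceil(DATASET_MB / BLOCK_MB)

print(f"sequential:   {sequential_h:.1f} h")
print(f"random best:  {random_best_h:.1f} h")
print(f"random worst: {random_worst_h:.1f} h")
print(f"128 MB blocks: {blocks}")
```

Under these assumptions the sequential read takes roughly 2.3 hours, while random IO ranges from about 5.6 hours to well over 500 hours, which is the motivation for large blocks and sequential scans.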

Hadoop ecosystem
- HDFS: Hadoop Distributed File System
- HBase: NoSQL columnar store
- YARN: cluster resource manager
- MapReduce
- Hive: SQL
- Pig: scripting
- Flume: log data collector
- Sqoop: data exchange with RDBMS
- Oozie: workflow manager
- Mahout: machine learning
- ZooKeeper: coordination
- Impala: SQL
- Spark: large-scale data processing

Hadoop cluster architecture
- Master and slaves approach
- Node 1 to Node X are connected by an interconnect network
- Every node runs:
  - HDFS DataNode
  - YARN NodeManager
  - Various component agents and daemons
- Master services run on selected nodes:
  - HDFS NameNode
  - YARN ResourceManager
  - Hive metastore

What not to use Hadoop for?
- Online Transaction Processing (OLTP) systems
  - No transactions
  - No locks
  - No data updates (only appends and overwrites)
  - Response time in seconds rather than milliseconds
- Not good for systems with relational data
- Interactive applications
- Accounting systems
- Etc.

What to use Hadoop for?
- For big data!
  - Storing
  - Analysis
- Write once, read many
- Scale-out system (CPU, IO, RAM), transparent to the users (data placement, data analysis)
- Good for data exploration in a batch fashion: statistics, aggregations, correlations
- Data warehouses
- Logs
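The write-once, read-many pattern can be illustrated with a small Python sketch. It is not Hadoop code: the `LOG` records and both analysis functions are hypothetical. The point is that the stored data is never updated in place, and each ad-hoc analysis is a full batch scan over the same records.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log records: (service, level, latency in ms).
# A tuple is used to stress that the data is written once and not modified.
Record = Tuple[str, str, int]
LOG: Tuple[Record, ...] = (
    ("hdfs", "INFO", 12),
    ("hdfs", "ERROR", 250),
    ("yarn", "INFO", 30),
    ("yarn", "INFO", 50),
    ("hive", "WARN", 90),
)


def count_by_level(records: Iterable[Record]) -> Counter:
    """First analysis: how many records were logged at each level."""
    return Counter(level for _, level, _ in records)


def mean_latency_by_service(records: Iterable[Record]) -> Dict[str, float]:
    """Second analysis: average latency per service, from the same data."""
    sums: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for service, _, latency in records:
        sums[service] += latency
        counts[service] += 1
    return {service: sums[service] / counts[service] for service in sums}


levels = count_by_level(LOG)
latencies = mean_latency_by_service(LOG)
print(dict(levels))
print(latencies)
```

Neither analysis had to be planned when the records were written; a new question only needs a new scan.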

Hadoop @CERN
- 4 main clusters (provided by IT)
  - 16-20 machines each
  - 24 GB to 256 GB of RAM
- Main users
  - ATLAS (EventIndex, PandaMon, Rucio)
  - CASTOR logs
  - WLCG Dashboards
  - IT Monitoring
  - Computer Security
  - …
- Available services: HDFS, YARN (MR), HBase, Hive, Pig, Spark, Impala (upcoming)
- Contact (SNOW): https://cern.service-now.com/service-portal/report-ticket.do?name=request&se=Hadoop-Service

Summary
- Hadoop is a solution for massive data processing
  - Designed to scale out
  - Runs on commodity hardware
  - Optimized for sequential reads
- Hadoop architecture
  - HDFS is the core
  - Many components with multiple functionalities, distributed across cluster nodes

Questions?
