Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.

Presentation transcript:

Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook

Agenda Advanced HDFS Features Apache Cassandra Cluster Planning

ADVANCED HDFS FEATURES

Highly Available NameNode The Highly Available NameNode feature eliminates the NameNode as a single point of failure (SPOF) Requires two NameNodes and some extra configuration – Active/Passive: one active NameNode and one standby – Clients only contact the active NameNode – DataNodes report in and heartbeat with both NameNodes – Active NameNode writes metadata to a quorum of JournalNodes – Standby NameNode reads the JournalNodes to stay in sync There is no CheckpointNode (SecondaryNameNode) – The passive NameNode performs checkpoint operations
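The "quorum of JournalNodes" idea can be sketched with a little arithmetic. This is a minimal illustration, not Hadoop code: the function names are hypothetical, and the only fact assumed is that an edit counts as written once a majority of JournalNodes acknowledge it.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest majority of a JournalNode set."""
    return journal_nodes // 2 + 1


def edit_is_durable(acks: int, journal_nodes: int) -> bool:
    """An edit counts as written once a majority of JournalNodes acknowledge it."""
    return acks >= quorum_size(journal_nodes)


def tolerated_failures(journal_nodes: int) -> int:
    """How many JournalNodes can be down while writes still reach a majority."""
    return journal_nodes - quorum_size(journal_nodes)


for n in (3, 5):
    print(n, "JournalNodes: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

With three JournalNodes the quorum is two, so one JournalNode can fail; this is why an odd number of JournalNodes is the usual choice.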

HA NameNode Failover There are two failover scenarios – Graceful – Performed by an administrator for maintenance – Automated – Active NameNode fails Failed NameNode must be fenced – Eliminates the 'split brain syndrome' Two fencing methods are available – sshfence – connects over SSH and kills the NameNode daemon – shell – runs a shell script that can, for example, disable access to the NameNode, shut down the network switch port, or power off the failed NameNode There is no 'default' fencing method
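Because there is no default, fencing methods are configured as an ordered list and tried until one reports success. The sketch below models only that ordering rule; the callables `ssh_kill` and `power_off_script` and the host name are hypothetical stand-ins, not Hadoop APIs.

```python
from typing import Callable, List, Tuple

# A fencing method is a (name, callable) pair; the callable returns True on success.
FencingMethod = Tuple[str, Callable[[str], bool]]


def fence(target: str, methods: List[FencingMethod]) -> str:
    """Try each configured method in order; return the name of the first success."""
    if not methods:
        raise ValueError("no fencing method configured")
    for name, attempt in methods:
        if attempt(target):
            return name
    raise RuntimeError(f"could not fence {target}; failover must not proceed")


def ssh_kill(host: str) -> bool:
    """Hypothetical stand-in: pretend the failed host is unreachable over SSH."""
    return False


def power_off_script(host: str) -> bool:
    """Hypothetical stand-in: pretend a power-off script succeeded."""
    return True


methods = [("sshfence", ssh_kill), ("shell", power_off_script)]
print(fence("nn1.example.com", methods))  # prints "shell"
```

Listing a script-based method after sshfence matters because sshfence cannot succeed when the failed machine is unreachable.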

[Diagram: automatic NameNode failover. Components: active NameNode and standby NameNode, each with a ZKFC; ZooKeeper; DataNode; shared NameNode state (NFS or QJM). Labeled steps: release lock, lock released, create lock, lock created, fence NN, become active.]

HDFS Federation Useful for: – Isolation/multi-tenancy – Horizontal scalability of HDFS namespace – Performance Allows for multiple independent NameNodes using the same collection of DataNodes DataNodes store blocks from all NameNode pools

Federated NameNodes File-system namespace scalable beyond heap size NameNode performance no longer a bottleneck NameNode failure/degradation is isolated – Only data managed by the failed NameNode is unavailable Each NameNode can be made Highly Available
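The failure-isolation point can be illustrated with a toy model. It assumes a client-side mount table that maps path prefixes to NameNodes; the table contents and NameNode names below are hypothetical.

```python
# Hypothetical mount table: each path prefix is owned by one independent NameNode.
MOUNT_TABLE = {"/user": "nn1", "/data": "nn2", "/tmp": "nn3"}


def namenode_for(path: str, mount_table=MOUNT_TABLE) -> str:
    """Return the NameNode owning the longest mount prefix that matches path."""
    matches = [p for p in mount_table if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(f"no mount point covers {path}")
    return mount_table[max(matches, key=len)]


def is_available(path: str, failed_namenodes: set) -> bool:
    """A path is reachable as long as its own NameNode has not failed."""
    return namenode_for(path) not in failed_namenodes


failed = {"nn2"}
for path in ("/user/alice/report.csv", "/data/logs/day1", "/tmp/scratch"):
    print(path, "available" if is_available(path, failed) else "unavailable")
```

Only paths under `/data` become unavailable when `nn2` fails; the other namespaces keep serving requests.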

Hadoop Security Hadoop's original design – web crawler and indexing – Not designed for processing of confidential data – Small number of trusted users Access to cluster controlled by providing user accounts – Little / no control on what a user could do once logged in HDFS permissions were added in the Hadoop 0.16 release – Similar to basic UNIX file permissions – HDFS permissions can be disabled via dfs.permissions – Basically for protection against user-induced accidents – Did not protect from attacks Authentication is accomplished on the client side – Easily subverted via a simple configuration parameter
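A UNIX-style permission check of the kind described above can be sketched as follows. This is a simplified model with hypothetical names (`Inode`, `is_allowed`), not HDFS code. Note that the `user` argument is whatever the client claims, which is why such checks guard against accidents and not against attacks.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o640


def is_allowed(inode: Inode, user: str, groups: set, wanted: int) -> bool:
    """Pick the owner, group, or other bits, then test the requested access."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 7
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 7
    else:
        bits = inode.mode & 7
    return (bits & wanted) == wanted


report = Inode(owner="alice", group="analysts", mode=0o640)
print(is_allowed(report, "alice", set(), READ | WRITE))  # owner: True
print(is_allowed(report, "bob", {"analysts"}, WRITE))    # group: False
print(is_allowed(report, "eve", set(), READ))            # other: False
```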

Kerberos Kerberos support introduced in the Hadoop release – Developed at MIT / freely available – Not a Hadoop-specific feature – Not included in Hadoop releases Works on the basis of 'tickets' – Allow communicating nodes to securely identify each other across unsecure networks Primarily a client/server model implementing mutual authentication – The user and the server verify each other's identity

How Kerberos Works
Client forwards the username to the KDC
A. KDC sends the Client/TGS Session Key, encrypted with the user's password
B. KDC issues a TGT, encrypted with the TGS's key
C. Client sends B and the service ID to the TGS
D. Client sends an authenticator encrypted with A
E. TGS issues the CTS ticket, encrypted with the SS key
F. TGS issues the CSS, encrypted with A
G. Client sends a new authenticator encrypted with F
H. SS returns the timestamp found in G, plus 1
KDC – Key Distribution Center TGS – Ticket Granting Service TGT – Ticket Granting Ticket CTS – Client-to-Server Ticket CSS – Client Server Session Key SS – Service Server

Kerberos Services Authentication Server – Authenticates client – Gives client enough information to authenticate with Service Server Service Server – Authenticates client – Authenticates itself to client – Provides services to client

Kerberos Limitations Single point of failure – Must use multiple servers – Implement fallback authentication mechanisms Strict time requirements – 'tickets' are time stamped – Clocks on all hosts must be carefully synchronized All authentication is controlled by the KDC – Compromise of this infrastructure will allow attackers to impersonate any user Each network service requiring a different host name must have its own set of Kerberos keys – Complicates virtual hosting of clusters
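The strict time requirement can be made concrete with a small check. This is a sketch, not Kerberos library code: the function names are hypothetical, and the five-minute tolerance is an assumption based on a commonly used default.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a five-minute tolerance, a commonly used default.
MAX_SKEW = timedelta(minutes=5)


def authenticator_is_fresh(client_time, server_time, max_skew=MAX_SKEW) -> bool:
    """Reject authenticators whose timestamp is too far from the server clock."""
    return abs(server_time - client_time) <= max_skew


def ticket_is_valid(start, lifetime, now) -> bool:
    """A ticket is usable only inside its validity window."""
    return start <= now <= start + lifetime


server_now = datetime(2016, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
synced_client = server_now - timedelta(seconds=30)
drifted_client = server_now - timedelta(minutes=12)

print(authenticator_is_fresh(synced_client, server_now))   # True
print(authenticator_is_fresh(drifted_client, server_now))  # False
```

A host whose clock has drifted by twelve minutes fails authentication even with valid credentials, which is why cluster nodes normally run a time-synchronization service.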

APACHE CASSANDRA

In a couple dozen words... Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database with a lot of adjectives

Overview Originally created by Facebook and open sourced in 2008 Based on Google Bigtable & Amazon Dynamo Massively Scalable Easy to use No relation to Hadoop – Specifically, data is not stored on HDFS

Distributed and Decentralized Distributed – Can run on multiple machines Decentralized – No single point of failure A peer-to-peer architecture (gossip protocol, specifically) avoids master / slave issues Can run across geographic datacenters

Elastic Scalability Scales horizontally Adding nodes linearly increases performance Decreasing and increasing node counts happens seamlessly

Highly Available and Fault Tolerant Multiple networked computers in a cluster Facility for recognizing node failures Failed requests are forwarded to another part of the system

Tunable Consistency Choice between strong and eventual consistency Adjustable for read and write operations separately Conflicts are solved during reads
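The trade-off between strong and eventual consistency is commonly expressed as a rule on replica counts: if the replicas read (R) plus the replicas written (W) exceed the replication factor (N), every read overlaps the latest write. The helper names below are hypothetical; ONE, QUORUM, and ALL are Cassandra consistency-level names.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas, as used by the QUORUM consistency level."""
    return replication_factor // 2 + 1


def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


n = 3
levels = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}
for write_name, w in levels.items():
    for read_name, r in levels.items():
        kind = "strong" if is_strongly_consistent(n, w, r) else "eventual"
        print(f"write={write_name:<6} read={read_name:<6} -> {kind}")
```

With a replication factor of three, QUORUM writes paired with QUORUM reads are strongly consistent, while ONE paired with ONE is only eventually consistent.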

Column-Oriented Stored in sparse multidimensional hash tables A row can have multiple columns, and not necessarily the same number of columns as other rows Each row has a unique key used for partitioning
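A sparse, keyed layout like this can be modeled with nested dictionaries. The sketch is a deliberate simplification: the row data and node names are made up, and hashing modulo the node count stands in for Cassandra's token ring, which it does not reproduce.

```python
import hashlib

# Sparse rows: each row key maps to only the columns that row actually has.
rows = {
    "song-1": {"title": "La Grange", "artist": "ZZ Top", "album": "Tres Hombres"},
    "song-2": {"title": "Untitled demo"},  # no artist or album column at all
}


def node_for(row_key: str, nodes: list) -> str:
    """Map a row key to a node by hashing it (simplified stand-in for a token ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
for key, columns in rows.items():
    print(key, "->", node_for(key, nodes), "columns:", sorted(columns))
```

Because placement depends only on the row key, any node can compute where a row lives without asking a master.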

Query with CQL Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modeling CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob, tags set<text> ); INSERT INTO songs (id, title, artist, album, tags) VALUES ( 'a3e648f...', 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'}); SELECT * FROM songs WHERE id = 'a3e648f...';

When should I use this? Key features to complement a Hadoop system: Geographical distribution Large deployments of structured data

CLUSTER PLANNING

Workload Considerations Balanced workloads – Jobs are distributed across various job types CPU bound Disk I/O bound Network I/O bound Compute intensive workloads - Data Analytics – CPU bound workloads require: Large numbers of CPUs Large amounts of memory to store in-process data I/O intensive workloads - Sorting – I/O bound workloads require: Larger number of spindles ( disks ) per node Not sure…go with a balanced workload configuration

Hardware Topology Hadoop uses a master / slave topology Master Nodes include: – NameNode - maintains system metadata – Backup NN - performs checkpoint operations and acts as the standby – ResourceManager - manages task assignment Slave Nodes include: – DataNode - stores HDFS blocks / manages read and write requests Preferably co-located with the NodeManager – NodeManager - performs map / reduce tasks

Sizing The Cluster Remember... Scaling is a relatively simple task – Start with a moderate sized cluster – Grow the cluster as requirements dictate – Develop a scaling strategy As simple as scaling is…adding new nodes takes time and resources Don't want to be adding new nodes each week Amount of data typically defines initial cluster size – rate at which the volume of data increases Drivers for determining when to grow your cluster – Storage requirements – Processing requirements – Memory requirements

Storage Reqs Drive Cluster Growth Data volume increases at a rate of 1TB / week 3TB of storage are required to store the data alone – Remember block replication Consider additional overhead - typically 30% – Remember files that are stored on a node's local disk If DataNodes incorporate 4 - 1TB drives – 1 new node per week is required – 2 years of data - roughly 100TB will require 100 new nodes
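The arithmetic on this slide can be written out as a small helper. The function name and parameters are hypothetical; the default values (3x replication, 30% overhead, four 1TB drives per node) come from the slide.

```python
import math


def nodes_needed(raw_tb, replication=3, overhead_pct=30,
                 drives_per_node=4, drive_tb=1):
    """DataNodes required to hold raw_tb of new data after replication and overhead."""
    stored_tb = raw_tb * replication * (100 + overhead_pct) / 100
    return math.ceil(stored_tb / (drives_per_node * drive_tb))


print(nodes_needed(1))    # one week of data: 1 node
print(nodes_needed(100))  # roughly two years of data: 98 nodes
```

One week of data becomes 1TB x 3 x 1.3 = 3.9TB on disk, which fills a four-drive node; 100TB works out to 98 nodes, in line with the slide's round figure of 100.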

Things Break Things are going to break – This assumption is a core premise of Hadoop – If a disk fails, the infrastructure must accommodate – If a DataNode fails, the NameNode must manage this – If a task fails, the ApplicationMaster must manage this failure Master nodes are typically a SPOF unless using a Highly Available configuration – NameNode goes down, HDFS is inaccessible Use NameNode HA – ResourceManager goes down, can't run any jobs Use RM HA (in development)

Cluster Nodes Cluster nodes should be commodity hardware – Buy more nodes... Not more expensive nodes Workload patterns and cluster size drive CPU choice – Small cluster - 50 nodes or less Quad core / medium clock speed is usually sufficient – Large cluster Dual 8-core CPUs with a medium clock speed is sufficient – Compute intensive workloads might require higher clock speeds – General guideline is to buy more hardware instead of faster hardware Lots of memory - 48GB / 64GB / 128GB / 256GB – Each map / reduce task consumes 1GB to 3GB of memory – OS / Daemons consume memory as well
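The memory figures above translate into a rough limit on concurrent tasks per node. In this sketch the function name is hypothetical and the 8GB reserved for the OS and daemons is an assumed value, not a number from the slides.

```python
def max_concurrent_tasks(node_ram_gb: int, gb_per_task: int,
                         reserved_gb: int = 8) -> int:
    """Tasks that fit in memory after setting some aside for the OS and daemons."""
    usable_gb = node_ram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return usable_gb // gb_per_task


for ram in (48, 64, 128, 256):
    light = max_concurrent_tasks(ram, gb_per_task=1)
    heavy = max_concurrent_tasks(ram, gb_per_task=3)
    print(f"{ram}GB node: {heavy} to {light} concurrent tasks")
```

The range per node is wide, which is one reason to measure actual task memory use before settling on a hardware profile.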

Cluster Storage 4 to 12 drives of 1TB / 2TB capacity - up to 24TB / node – 3TB drives work, at the cost of a network performance penalty if a node fails – 7200 rpm SATA drives are sufficient Slightly above average MTBF is advantageous – JBOD configuration RAID is slow RAID is not required due to block replication More, smaller disks are preferred over fewer, larger disks – Increased parallelism for DataNodes Slaves should never use virtual memory

Master Nodes Still commodity hardware, but... better Redundant everything – Power supplies – Dual Ethernet cards 16 to 24 CPU cores on NameNodes – NameNodes and their clients are very chatty and need more cores to handle messaging traffic – Medium clock speeds should be sufficient

Master Nodes HDFS namespace is limited to the amount of memory on the NameNode RAID and NFS storage on NameNode – Typically RAID5 with hot spare – Second remote directory such as NFS Quorum Journal Manager for HA

Network Considerations Hadoop is bandwidth intensive – This can be a significant bottleneck – Use dedicated switches 10Gb Ethernet is pretty good for large clusters

Which Operating System? Choose an OS that you are comfortable and familiar with – Consider your admin resources / experience RedHat Enterprise Linux – Includes support contract CentOS – No support but the price is right Many other possibilities – SuSE Enterprise Linux – Ubuntu – Fedora

Which Java Virtual Machine? Oracle Java is the only “supported” JVM – Runs on OpenJDK, but use at your own risk Hadoop 1.0 requires Java JDK 1.6 or higher Hadoop 2.x requires Java JDK 1.7
