BigData@polito - Inter-departmental Lab

Presentation transcript:

BigData@polito - Inter-departmental Lab Idilio Drago / Marco Mellia

Outline: Introduction to the BigData@Polito lab; the Big Data cluster; hardware; software; basic benchmark; how to access and use the cluster; examples of current usage of the cluster.

BigData@Polito Lab – Why? When the data is such that processing it becomes part of the challenge: volume, velocity, variety, etc. Extract some useful knowledge: data mining, machine learning, clustering, … Big data cluster: open, flexible, scalable; based on open-source software; for experimental activities in research and teaching.

Big data vs HPC. Big data: focus on storage; simple operations on large data; embarrassingly parallel tasks; divide and conquer principle; move code to where the data is located; scale measured in PB. HPC: focus on fast computing (cores, RAM, GHz, …); message passing etc.; move superfast little data to superfast CPUs; scale measured in TFLOPS.

BigData@Polito Lab. Involved departments: DET, DAUIN, DISMA, DIGEP. Physical cluster location: Auta T – Ing. del Cinema. Scientific committee members: Mellia Marco (Telecommunication Networks Group, DET); Baralis Elena (Database and Data Mining Group, DAUIN); Paolucci Emilio, Neirotti Paolo (DIGEP); Mauro Gasparini, Vaccarino Francesco (DISMA); Michiardi Pietro (Distributed Systems Group, EURECOM, France).

History

Key ideas of big data frameworks. Data locality principle: move algorithms to the data, not data to the algorithms. Failures are the norm, not the exception: the framework takes care of splitting data, synchronizing tasks, and recovering from failures of a task or a server. Data-intensive workloads: MapReduce is a batch processing framework designed to perform full reads of the input, thus avoiding random access. Horizontal scalability based on commodity servers: e.g., doubling the number of servers, halving processing time.

Map Reduce – Toy example. How often does a word appear in a collection of documents?
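The toy example can be sketched in plain Python to show the three MapReduce phases. This is a single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this sketch.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical input: two tiny "documents".
documents = ["big data is big", "data moves to code"]
counts = word_count(documents)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network; here everything runs in one process for clarity.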

BigData@Polito – The Hardware
- 4 Switches N3048
- 18 Workers DELL R720XD: 2 x Intel E5-2630v2 (6 cores); Memory 96 GB; 12 HDs 3TB (JBOD); 4+1 GbE Network
- 12 Workers SuperMicro: 1 x Intel Xeon (6 cores); Memory 64 / 32 GB; 5 HDs 2TB (JBOD); 2+1 GbE Network
- Workers in total: 576 logical cores (with HT); +2TB RAM; 276 HDs; 768 TB of storage; ~45 GB/s “nominal” disk read speed (dd)
- 3 Masters DELL R620: 2 x Intel E5-2630v2 (6 cores); Memory 128 GB; 3 HDs 600GB in RAID; 4+1 GbE Network
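The worker totals quoted on this slide can be checked from the per-node figures. The sketch below is only an arithmetic check: the `WorkerGroup` class and its field names are invented for illustration, two hardware threads per core (HT) is assumed, and the per-disk read speed is simply the nominal 45 GB/s divided evenly over all disks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerGroup:
    nodes: int
    cpus_per_node: int
    cores_per_cpu: int
    disks_per_node: int
    disk_size_tb: int


# Figures taken from the slide: DELL R720XD workers, then SuperMicro workers.
GROUPS = [
    WorkerGroup(nodes=18, cpus_per_node=2, cores_per_cpu=6, disks_per_node=12, disk_size_tb=3),
    WorkerGroup(nodes=12, cpus_per_node=1, cores_per_cpu=6, disks_per_node=5, disk_size_tb=2),
]
THREADS_PER_CORE = 2  # assumption: Hyper-Threading doubles the logical core count

logical_cores = sum(
    g.nodes * g.cpus_per_node * g.cores_per_cpu * THREADS_PER_CORE for g in GROUPS
)
total_disks = sum(g.nodes * g.disks_per_node for g in GROUPS)
raw_storage_tb = sum(g.nodes * g.disks_per_node * g.disk_size_tb for g in GROUPS)

# Nominal aggregate read speed spread evenly over all disks (MB/s per disk).
per_disk_mb_s = 45_000 / total_disks

print(logical_cores, total_disks, raw_storage_tb, round(per_disk_mb_s))
```

The computed values agree with the slide: 576 logical cores, 276 disks and 768 TB of raw storage.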

BigData@Polito – Logic Setup. Link aggregation with bonding (balance-alb): all machines are connected to both switches in their racks. P2P communication is limited to 1 Gbps.

The Software Based on the Cloudera platform

Architecture HDFS – Hadoop Distributed File System YARN – Yet Another Resource Negotiator Applications : MapReduce, Spark etc

HDFS: What is the usable disk capacity? Replication is set to 3: the client writes blocks to its own node first, then the other rack is used for a second and a third copy. Therefore our cluster's actual capacity is 256 TB (768 TB of raw storage divided by 3). Replicas guarantee resilience to disk failures (and we have had some already). They also give flexibility in the allocation of executors.
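The capacity figure follows directly from the replication factor, as the short calculation below shows. The function name `usable_capacity_tb` is made up for this sketch, and it ignores any space reserved for non-HDFS use, which the slides do not mention.

```python
def usable_capacity_tb(raw_tb, replication):
    """Every block is stored `replication` times, so usable space shrinks by that factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


RAW_STORAGE_TB = 768  # raw worker storage quoted on the hardware slide
REPLICATION = 3       # HDFS replication factor quoted on this slide

usable_tb = usable_capacity_tb(RAW_STORAGE_TB, REPLICATION)
print(f"{usable_tb:.0f} TB usable")
```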

YARN: How are the resources shared? Scheduling Policy Preemption

YARN: How are the resources shared? Dominant Resource Fairness equalizes the “dominant share” of users. Example: host <9 CPU, 18 GB>; each task of User 1 needs <1 CPU, 4 GB> (dominant resource: memory); each task of User 2 needs <3 CPU, 1 GB> (dominant resource: CPU). Preemption occurs after 2 min: it is normal to wait some time before a job starts running, and it is normal to see containers being killed.
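The Dominant Resource Fairness example can be worked through with a small progressive-filling loop. This is a simplified sketch, not YARN's scheduler code: the function `drf_allocate` and the user labels are hypothetical, ties are broken by user name, and a user whose next task no longer fits is simply dropped from further consideration.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Give one task at a time to the user with the smallest dominant share."""
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource currently held by this user.
        return max(
            Fraction(tasks[user] * need, cap)
            for need, cap in zip(demands[user], capacity)
        )

    active = set(demands)
    while active:
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(u + n <= c for u, n, c in zip(used, need, capacity)):
            used = [u + n for u, n in zip(used, need)]
            tasks[user] += 1
        else:
            active.remove(user)  # this user's next task does not fit any more
    return tasks


# Host and per-task demands from the slide, as (CPUs, GB of memory).
capacity = (9, 18)
demands = {"user1": (1, 4), "user2": (3, 1)}

allocation = drf_allocate(capacity, demands)
print(allocation)
```

With these numbers User 1 ends up with 3 tasks (12 of 18 GB) and User 2 with 2 tasks (6 of 9 CPUs), so both hold a dominant share of 2/3.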

Spark applications

MLlib algorithms

Example – Spark execution overview. The application creates a driver process. The application gets its executor processes. It sends the code and tasks to the executors. Our current setup allows applications to have more than 500 executors (500+ threads reading and processing the data in parallel).

Raw HDFS read speed. Because of overhead, the cluster can read up to 13 GB/s (without any processing), well below the ~45 GB/s “nominal” disk read speed.

Terasort. Roughly, this cluster can sort 1 TB in ~10 min (mapred).
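To put the two benchmark figures side by side, the rough calculation below converts them into comparable numbers. It assumes decimal units (1 TB = 1000 GB) and treats the approximate values from the slides (~10 min, up to 13 GB/s) as exact, so the results are only indicative.

```python
GB_PER_TB = 1000              # assumption: decimal units

SORT_SECONDS = 10 * 60        # Terasort: 1 TB in roughly 10 minutes
RAW_READ_GB_PER_S = 13        # raw HDFS read speed without processing

# Effective end-to-end sort throughput in GB/s.
sort_rate_gb_s = GB_PER_TB / SORT_SECONDS

# Time to merely read 1 TB at the raw HDFS read speed, in seconds.
scan_seconds = GB_PER_TB / RAW_READ_GB_PER_S

print(f"sort throughput: {sort_rate_gb_s:.2f} GB/s")
print(f"plain 1 TB scan: {scan_seconds:.0f} s")
```

Under these assumptions, sorting runs at well under 2 GB/s, while a plain scan of the same data would take a little over a minute.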

How do I request a user account? First: is this cluster/framework the best solution? The cluster has an independent LDAP/Kerberos system controlling access and HDFS permissions. Contact the person responsible in your department (DET: Marco Mellia, Maurizio Munafò, Idilio Drago, …; DAUIN: Elena Baralis, Paolo Garza, …; …). Fill in the form available at http://bigdata.polito.it/contact

How do I use the cluster? Go to http://bigdata.polito.it/content/access-instructions

Research. Scope: new algorithms and data science. TRANSPORT LAYER: analysis of network traffic in real time; traffic classification, engineering; network security (e.g., malware detection). APPLICATION LAYER: analysis of OSN contents; user and community profiling; recommendation systems.

Teaching. Computer Engineering MS, current offering: Data Mining; Artificial Intelligence; Big Data Management. New track on Data Science: Data Modeling + Data Engineering + Software Engineering + Data Mining & Analytics.

Questions?