Data science laboratory (DSLAB)

Slides:



Advertisements
Similar presentations
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
Advertisements

Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark 2.
Hadoop Ecosystem Overview
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
Introduction to Hadoop 趨勢科技研發實驗室. Copyright Trend Micro Inc. Outline Introduction to Hadoop project HDFS (Hadoop Distributed File System) overview.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark Cluster Monitoring 2.
Introduction to Hadoop and HDFS
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Data and SQL on Hadoop. Cloudera Image for hands-on Installation instruction – 2.
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
Data Science Hadoop YARN Rodney Nielsen. Rodney Nielsen, Human Intelligence & Language Technologies Lab Outline Classical Hadoop What’s it all about Hadoop.
Redmond Protocols Plugfest 2016 Casey Karst PolyBase in SQL Server 2016.
Microsoft Partner since 2011
CCD-410 Cloudera Certified Developer for Apache Hadoop (CCDH) Cloudera.
Page 1 © Hortonworks Inc – All Rights Reserved Apache Hadoop - Virtualization Winter 2015 Version 1.4 Hortonworks. We do Hadoop.
BI 202 Data in the Cloud Creating SharePoint 2013 BI Solutions using Azure 6/20/2014 SharePoint Fest NYC.
Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.
Hadoop Big Data Usability Tools and Methods. On the subject of massive data analytics, usability is simply as crucial as performance. Right here are three.
The Big Data Network (phase 2) Cloud Hadoop system
Petr Škoda, Jakub Koza Astronomical Institute Academy of Sciences
OMOP CDM on Hadoop Reference Architecture
Best IT Training Institute in Hyderabad
Connected Infrastructure
PROTECT | OPTIMIZE | TRANSFORM
How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Xueyan Li (Qunar) & Chunming Li (Garena)
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
Smart Building Solution
Hadoop.
Introduction to Distributed Platforms
Hadoop and Analytics at CERN IT
By Chris immanuel, Heym Kumar, Sai janani, Susmitha
Software Systems Development
INTRODUCTION TO BIGDATA & HADOOP
An Open Source Project Commonly Used for Processing Big Data Sets
Microsoft Machine Learning & Data Science Summit
Chapter 10 Data Analytics for IoT
Large-scale file systems and Map-Reduce
Running virtualized Hadoop, does it make sense?
Spark Presentation.
Smart Building Solution
Connected Infrastructure
Data Platform and Analytics Foundational Training
Hadoop Clusters Tess Fulkerson.
Software Engineering Introduction to Apache Hadoop Map Reduce
Central Florida Business Intelligence User Group
Lab #2 - Create a movies dataset
Ministry of Higher Education
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Designed for Big Data Visual Analytics, Zoomdata Allows Business Users to Quickly Connect, Stream, and Visualize Data in the Microsoft Azure Platform MICROSOFT.
CS6604 Digital Libraries IDEAL Webpages Presented by
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Hadoop Basics.
CS110: Discussion about Spark
Ch 4. The Evolution of Analytic Scalability
Introduction to Apache
Overview of big data tools
Lecture 16 (Intro to MapReduce and Hadoop)
CS 345A Data Mining MapReduce This presentation has been altered.
Charles Tappert Seidenberg School of CSIS, Pace University
Data science laboratory (DSLAB)
Data science laboratory (DSLAB)
Data science laboratory (DSLAB)
Big-Data Analytics with Azure HDInsight
Oracle 1z0-928 Oracle Cloud Platform Big Data Management 2018 Associate.
Presentation transcript:

Data science laboratory (DSLAB) Eric Bouillet Swiss Data Science Center EPFL & ETH Zurich Data Science Lab – Spring 2018

Start your engines! Head over to the course webpage https://dslab2018.github.io/ Configure & Run Oracle VirtualBox Download, configure & start the DSLab VM Ubuntu linux Anaconda (Python 3), git, Docker/ pySpark, pyCharm {username: student, password: student} If you want to try the big data platform at home Download, configure & start the HDP (not HDF) Sandbox https://hortonworks.com/downloads/#sandbox (registration required) You can choose between the VM for VirtualBox or Docker There are office hours on Fridays for the graded projects 974789e900b5ad11be069b30f7f770b4bb0e1c35  HDP_2.6.4_virtualbox_01_02_2018_1428.ova

Today’s lab week #5 Hadoop Weeks 1 & 2 Week 4 Week 3

This Module Review Big Data concepts Experiment with Big Data technologies Distributed storage Distributed computing Getting familiar with SQL on big data

Big Data Landscape Evolution – A Moving Target 2014 2016 2017 Google Trends Big Data Machine Learning Data Science (Sources: Matt Turck - http://mattturck.com ; trends.google.ch)

Today’s Objectives Get acquainted with the Big Data ecosystems Find your way in the Big Data jungle, explore it more efficiently

Addressing the Big Data Challenges cost capability scale up scale out Bigger machines: scale up (a.k.a vertical scaling) … the High Performance Computing way (HPC) RAM Break the CPU/RAM/Disk bounds More machines: scale out (a.k.a horizontal scaling) … the commodity hardware (cloud) way (…) DISK Scale UP (vertically), Scale Out (horizontally) Scale out => many machines => distributed computation and distributed storage => framework to manage the distributed systems. CPU

Addressing the Big Data Challenge Horizontal scaling (scaling out) entails distributed computing across a large number of compute servers Challenges of distributed computing are: The same code should transparently work on 1, 10, or 10,000 servers Assume problem can be broken down into chunks and each chunk calculated locally The data must be accessible from anywhere High availability: the systems must survive one or more server failures with no impact on operations Resource utilization must be optimized Minimize hot-spots with a good load-balancing strategy Bring compute to data The systems should support elastic scaling Add/remove machines without requiring maintenance down-times

Addressing the Big Data Challenge - Solutions Distributed Computing Moving data around Spark, MapReduce(2), Storm, TEZ, Flink, … Kafka, Flume, Scoop, Nifi Distributed Storage Workflow orchestration HDFS, Alluxio(Tachyon), Ceph, S3, GPFS, … Oozie, … Distributed data warehouses Resource Management Hive, SparkSQL, Hbase, Accumulo, Druid, Impala, Cassandra, MongoDB, CouchDB, Redis, Phoenix, JanusGraph, Neo4j, … YARN, mesos, … Provisioning, monitoring, management, security End user interfaces Ambari, Zookeeper, Ranger, Knox, … Zeppelin, Jupyter Notebook, … Data governance Falcon, Atlas

Hadoop Distributed File Systems (HDFS) Essentials What you need to know NameNode: (master node) manage namespace, must have at least one, preferably two for high availability DataNode: (worker node) serve the data, one per server Write-once, you cannot modify a file in place (it must be replaced) Data blocks are in units of 64MB (default) Not for small files, use HDFS for large files only Redundancy, all blocks replicated x3 (default) Redundancy against failures Statistically easier to move computation next to the data and load-balance the CPU usage HDFS command line, with POSIX like interface (Hadoop2): hdfs dfs [–help] [1] http://comphadoop.weebly.com/

Popular Big Data Formats Big data components (Hive, Spark, etc) give you several formatting options for exchanging data, the most popular are Plain text (csv, json, …) Parquet: column-oriented integrated compression ORC: column-oriented in collections of rows, splittable by row collections indexed Avro: row-oriented, splittable and support block compression index ( -

MapReduce in a Nutshell Intermediate files on local file systems or in-memory (key-value pairs) 1 1 1 2 4 1 1 4 2 1 3 3 1 2 2 Output 1 1 2 Input data (on HDFS) Split 2 1 Shuffle (key,value) pairs Reduce Map Notes: (1) In HDFS the data is already ”split” into blocks and ”Mapping” can be done on all worker nodes where data blocks are located. However, beware of compressed data! not all compression algorithm are “splittable” in blocks, and MapReduce will not chunk it (splittable: bzip2, LZA, not splittable: gzip, snappy). (2) shuffling is the network bottleneck of MapReduce operations, because placement of “reducers” cannot be optimized based on data locality.

The Clash: should I Stream, or should I Batch? You want an updated answer as more information becomes available? streams AKA: Data in motion, or Fast data Continuous computation that never stop, process infinite amount of data on the fly Designed to keep size of in-memory state bounded, regardless of how much data is processed Update the answer as more data becomes available Operate on small time windows Can wait until all information is available for a more accurate answer? batch AKA: Data at rest Operates on finite size data sets, and terminate when all data has been processed

Big Data Illustrated – A Popular Real World Pattern Data Stream (Kafka) Prediction (Storm) Real-time Bus GPS Data Bus ETA Streams (continuous processing) Model Historical Data DBMS (Hive) Model Training (Spark) Batch (e.g. learn a new model once a month) Archives Import (Scoop)

Hadoop MapReduce Distributions On-Premise (self-hosted) Big Data Distributions Hortonworks Hadoop Distribution (used in this lab) Cloudera Hadoop Distribution MapR Hadoop Distribution IBM Open Platform with Apache Hadoop DC/OS, Mesosphere … Cloud-based (all-purposes) Big Data Platforms as a Service EPFL IC Cluster (used in this lab) Amazon EMR Microsoft Azure HDInsight Alibaba Cloud E-MapReduce ... They are like linux distributions, apache big data open sources with vendor-specific distinctive features added to it

This week’s exercises Before you start confirm that you are listed in, and contact us if you are in the red groups: https://git-dslab.epfl.ch/dslab2018/homework1- checkpoint/blob/master/homework1_report_21_03_2018.pdf Get the DSLab Week 5 exercise instructions: https://github.com/dslab2018/dslab2018.github.io/tree/master/labs/we ek5 Communications: https://mattermost-dslab.epfl.ch Remember we have office hours on Fridays for the graded projects