Presentation transcript:

Image taken from: http://www.slideshare.net

Selected Apache Projects
Xinsheng, Parth, Jung, Xilun, Shengyu, Hans, Yash, Maolin, Sicong
[Title slide shows project logos: Hadoop, Hive, Spark, Mahout, Cassandra, HBase, CouchDB, Pig, Kafka]

Selected Apache Projects


Hadoop Architecture
Hadoop = HDFS + MapReduce
- HDFS (Hadoop Distributed File System): a reliable distributed file system that provides high-throughput access to data
- MapReduce: a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm

Components of HDFS
- NameNode: the master of the system; maintains the namespace (file system metadata)
- DataNodes: slaves that provide the actual storage
- Secondary NameNode: takes periodic checkpoints of the NameNode's metadata
[Diagram: a master NameNode, a Secondary NameNode performing periodic checkpoints, and a row of DataNode slaves]

Components of MapReduce
- JobTracker: the master; manages the jobs and resources in the cluster (the TaskTrackers)
- TaskTrackers: slaves responsible for running map and reduce tasks
[Diagram: a master JobTracker coordinating a row of TaskTracker slaves]

Hadoop example: Word count
Goal: Given a set of documents, count how often each word occurs
Input: Key-value pairs (document:lineNumber, text)
Output: Key-value pairs (word, #occurrences)
What should be the intermediate key-value pairs? One (word, 1) pair per word the mappers see:
map(String key, String value) {
  // key: document name, line no
  // value: contents of line
  for each word w in value: emit(w, 1);
}
reduce(String key, Iterator values) {
  // key: a word; values: all the 1s emitted for that word
  int sum = 0;
  for each v in values: sum += v;
  emit(key, sum);
}

Simple example: Word count [1]
[Diagram: eight input lines (1, "the apple"), (2, "is an apple"), (3, "not an orange"), (4, "because the"), (5, "orange"), (6, "unlike the apple"), (7, "is orange"), (8, "not green") flow through four mappers (key ranges 1-2, 3-4, 5-6, 7-8), are shuffled to four reducers (key ranges A-G, H-N, O-U, V-Z), and come out as (an, 2), (apple, 3), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1)]
1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by a mapper is sent to the reducer that is responsible for its key
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time
[1] www.cis.upenn.edu/~nets212/slides/08-MapReduceIntro.pptx
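The five numbered steps can be collapsed into a runnable single-process sketch in plain Python (no Hadoop involved): map_fn emits (word, 1) pairs, an in-memory shuffle groups them by key, and reduce_fn sums each group. The input lines and the resulting counts are the ones from the diagram.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # step 2: process one KV-pair, emitting (word, 1) for every word
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # step 5: process one group, summing the 1s for this word
    yield (word, sum(counts))

def run_mapreduce(inputs):
    # steps 1-3: run the mappers and route each output pair to its group
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # step 4: sort by key; step 5: reduce one group at a time
    result = {}
    for key in sorted(groups):
        for k, v in reduce_fn(key, groups[key]):
            result[k] = v
    return result

lines = [(1, "the apple"), (2, "is an apple"), (3, "not an orange"),
         (4, "because the"), (5, "orange"), (6, "unlike the apple"),
         (7, "is orange"), (8, "not green")]
print(run_mapreduce(lines))
```

In real Hadoop the shuffle happens over the network between mapper and reducer machines; here it is just an in-memory dict.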

Hadoop: Pros and Cons
Pros:
- Parallel batch processing cuts overall job time
- Fault tolerance
- Very large clusters can be deployed (Yahoo has run a 42,000-node Hadoop cluster [1])
Cons:
- The query workload should consist of batch jobs
- Execution overhead (e.g. MapReduce job initialization) makes it unwieldy for smaller data sets
- Not suitable for low-latency queries (e.g. real-time analytics, websites)
[1] http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html

Selected Apache Projects

Overview
The main drawback of Hadoop is disk-based processing: intermediate results are written back to disk between jobs. Spark instead keeps working sets cached in memory across operations.
[Diagram: Spark cluster overview] http://spark.apache.org/docs/2.0.1/cluster-overview.html
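A toy illustration of the difference (plain Python, not Spark; load_from_disk is an invented stand-in for reading an input file from HDFS): a Hadoop-style iterative job re-reads its input on every pass, while a Spark-style job loads once, caches the data in memory, and iterates over the cached copy.

```python
reads = {"disk": 0}

def load_from_disk():
    # invented stand-in for an HDFS read; counts how often "disk" is touched
    reads["disk"] += 1
    return list(range(5))

# Hadoop-style iterative job: each pass re-reads the input from disk
for _ in range(3):
    total = sum(load_from_disk())

# Spark-style: load once, cache in memory, iterate over the cached working set
cached = load_from_disk()
for _ in range(3):
    total = sum(cached)

print(reads["disk"])  # 3 reads for the first loop, 1 for the cached load
```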

Hadoop vs Spark: Which one is better?
Points of comparison:
- Memory requirement
- Cache usage
- Flexibility
- Support for libraries
Spark runs on any cluster manager, reads from any data source, and supports several languages: Java, Scala, Python.
It all boils down to what kind of query processing you want to do.

Selected Apache Projects

Apache Hive: Definition
Data warehouse software for managing large datasets in Hadoop; queries are compiled into MapReduce plans.
Data model:
- Tables with basic types (int, float, boolean, string) and complex types (List / Map)
SQL-based query language:
- Create/Alter/Drop Table; Select with GROUP BY (aggregation via an in-memory hash table); Join; Insert (but not Update or Delete)
- No indexes: data is always scanned in parallel
Ref: https://hive.apache.org/
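As a hedged illustration of the query shapes Hive supports, here is the same Create Table / Insert / Select-with-GROUP-BY pattern run against Python's stdlib sqlite3 (a stand-in only: the page_views table and its rows are invented, and real Hive would compile the SELECT into MapReduce jobs rather than execute it locally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create Table and Insert are supported in HiveQL (Update/Delete are not)
cur.execute("CREATE TABLE page_views (user TEXT, url TEXT, hits INTEGER)")
cur.executemany("INSERT INTO page_views VALUES (?, ?, ?)",
                [("alice", "/home", 3), ("bob", "/home", 1),
                 ("alice", "/about", 2)])

# Select with GROUP BY aggregation, as in HiveQL
cur.execute("SELECT url, SUM(hits) FROM page_views GROUP BY url ORDER BY url")
print(cur.fetchall())  # [('/about', 2), ('/home', 4)]
```

On Hive the same GROUP BY would run as a parallel scan over HDFS files, with the aggregation done in reducers.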

Apache Hive: Example http://www.slideshare.net/InderajRajBains/using-apache-hive-with-high-performance

Apache Hive: MySQL vs Hive http://www.slideshare.net/SagarJauhari/improving-mysql-performance-with-hadoop

Apache Hive: Ingestion using external Table ** ORC: Optimized Row Columnar http://www.slideshare.net/InderajRajBains/using-apache-hive-with-high-performance

ORC (Optimized Row Columnar) File structure

Apache Hive: Pros and Cons
Pros:
- An easy way to process large-scale data
- Supports SQL-based queries
- Interoperability with other databases
- Programmability
- Efficient execution plans for performance
Cons:
- No easy way to append data: files in HDFS are immutable
sunset.usc.edu/classes/cs572_2010/LTang.ppt

Selected Apache Projects

Apache Mahout
Goal: build an environment for quickly creating scalable, performant machine learning applications.
Major features:
- Simple and extensible programming environment and framework
- Wide variety of premade algorithms
- R-like syntax for a vector math experimentation environment that works at scale
Ref: https://mahout.apache.org/

Apache Mahout: Machine Learning Algorithms
- Collaborative filtering: user-/item-based collaborative filtering, (weighted) matrix factorization with ALS, etc.
- Classification: logistic regression, naive Bayes, HMM, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality reduction: SVD, PCA, etc.
- Topic models: LDA
Ref: https://mahout.apache.org/
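Mahout's versions of these algorithms run distributed on Hadoop; as a minimal single-machine sketch of one of them, here is Lloyd's k-means in plain Python (the sample points and the naive "first k points" initialization are assumptions for illustration, not Mahout's seeding strategy):

```python
def kmeans(points, k, iters=10):
    # naive initialization: the first k points (real systems seed more carefully)
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
print(sorted(kmeans(pts, 2)))  # one center near (0.1, 0.1), one near (5.0, 5.03)
```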

Apache Mahout: Workflow (Tweet Classification)
1. Start from a raw training file (e.g. raw tweets)
2. Convert it into a SequenceFile and upload it to HDFS
3. Mahout: convert the sequence data to sparse TF-IDF vectors
4. Mahout: train a naive Bayes classifier
A SequenceFile is a flat file consisting of binary key/value pairs; it is used extensively in MapReduce as an input/output format.
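The "sequence to sparse TF-IDF vectors" step can be sketched in plain Python. This uses a simplified weighting (raw term count × log(N/df)); Mahout's vectorizer offers more options (normalization, n-grams), and the sample tweets are invented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight dicts)."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # sparse vector: only terms present in this document get an entry
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

tweets = [["hadoop", "is", "great"],
          ["spark", "is", "fast"],
          ["hadoop", "and", "spark"]]
vecs = tfidf(tweets)
print(vecs[0])  # "great" (rare) outweighs "is" and "hadoop" (common)
```

The resulting sparse vectors are what the naive Bayes trainer in the next step would consume.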