Presentation transcript:

Image taken from: http://www.slideshare.net

Selected Apache Projects
Xinsheng, Parth, Jung, Xilun, Shengyu, Hans, Yash, Maolin, Sicong
[Title slide shows project logos: Hadoop, Hive, Spark, Mahout, Cassandra, HBase, CouchDB, Pig, Kafka]

Selected Apache Projects


Hadoop Architecture
Hadoop = HDFS + MapReduce
- HDFS (Hadoop Distributed File System): a reliable distributed file system that provides high-throughput access to data
- MapReduce: a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm

Components of HDFS
- NameNode: the master of the system; maintains the namespace (file system metadata)
- DataNodes: slaves that provide the actual storage
- Secondary NameNode: takes periodic checkpoints of the NameNode's metadata
[Diagram: a master NameNode, a Secondary NameNode performing periodic checkpoints, and a row of DataNode slaves]

Components of MapReduce
- JobTracker: the master; manages the jobs and resources in the cluster (the TaskTrackers)
- TaskTrackers: slaves responsible for running map and reduce tasks
[Diagram: a master JobTracker coordinating a row of TaskTracker slaves]

Hadoop example: Word count
Goal: Given a set of documents, count how often each word occurs
Input: Key-value pairs (document:lineNumber, text)
Output: Key-value pairs (word, #occurrences)
What should be the intermediate key-value pairs? One (word, 1) pair per word the mappers see:
map(String key, String value) {
  // key: document name, line no
  // value: contents of line
  for each word w in value: emit(w, 1);
}
reduce(String key, Iterator values) {
  // key: a word; values: all the 1s emitted for that word
  int sum = 0;
  for each v in values: sum += v;
  emit(key, sum);
}

Simple example: Word count [1]
[Diagram: eight input lines (1, "the apple"), (2, "is an apple"), (3, "not an orange"), (4, "because the"), (5, "orange"), (6, "unlike the apple"), (7, "is orange"), (8, "not green") flow through four mappers (key ranges 1-2, 3-4, 5-6, 7-8), are shuffled to four reducers (key ranges A-G, H-N, O-U, V-Z), and come out as (an, 2), (apple, 3), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1)]
1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by a mapper is sent to the reducer that is responsible for its key
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time
[1] www.cis.upenn.edu/~nets212/slides/08-MapReduceIntro.pptx
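The five numbered steps can be collapsed into a runnable single-process sketch in plain Python (no Hadoop involved): map_fn emits (word, 1) pairs, an in-memory shuffle groups them by key, and reduce_fn sums each group. The input lines and the resulting counts are the ones from the diagram.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # step 2: process one KV-pair, emitting (word, 1) for every word
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # step 5: process one group, summing the 1s for this word
    yield (word, sum(counts))

def run_mapreduce(inputs):
    # steps 1-3: run the mappers and route each output pair to its group
    groups = defaultdict(list)
    for doc_id, text in inputs:
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # step 4: sort by key; step 5: reduce one group at a time
    result = {}
    for key in sorted(groups):
        for k, v in reduce_fn(key, groups[key]):
            result[k] = v
    return result

lines = [(1, "the apple"), (2, "is an apple"), (3, "not an orange"),
         (4, "because the"), (5, "orange"), (6, "unlike the apple"),
         (7, "is orange"), (8, "not green")]
print(run_mapreduce(lines))
```

In real Hadoop the shuffle happens over the network between mapper and reducer machines; here it is just an in-memory dict.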

Hadoop: Pros and Cons
Pros:
- Parallel batch processing cuts overall job time
- Fault tolerance
- Very large clusters can be deployed (Yahoo has run a 42,000-node Hadoop cluster [1])
Cons:
- The query workload should consist of batch jobs
- Execution overhead (e.g. MapReduce job initialization) makes it unwieldy for smaller data sets
- Not suitable for low-latency queries (e.g. real-time analytics, websites)
[1] http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html

Selected Apache Projects

Overview
The main drawback of Hadoop is disk-based processing: intermediate results are written back to disk between jobs. Spark instead keeps working sets cached in memory across operations.
[Diagram: Spark cluster overview] http://spark.apache.org/docs/2.0.1/cluster-overview.html
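A toy illustration of the difference (plain Python, not Spark; load_from_disk is an invented stand-in for reading an input file from HDFS): a Hadoop-style iterative job re-reads its input on every pass, while a Spark-style job loads once, caches the data in memory, and iterates over the cached copy.

```python
reads = {"disk": 0}

def load_from_disk():
    # invented stand-in for an HDFS read; counts how often "disk" is touched
    reads["disk"] += 1
    return list(range(5))

# Hadoop-style iterative job: each pass re-reads the input from disk
for _ in range(3):
    total = sum(load_from_disk())

# Spark-style: load once, cache in memory, iterate over the cached working set
cached = load_from_disk()
for _ in range(3):
    total = sum(cached)

print(reads["disk"])  # 3 reads for the first loop, 1 for the cached load
```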

Hadoop vs Spark: Which one is better?
Points of comparison:
- Memory requirement
- Cache usage
- Flexibility
- Support for libraries
Spark runs on any cluster manager, reads from any data source, and supports several languages: Java, Scala, Python.
It all boils down to what kind of query processing you want to do.

Selected Apache Projects

Apache Hive: Definition
Data warehouse software for managing large datasets in Hadoop; queries are compiled into MapReduce plans.
Data model:
- Tables with basic types (int, float, boolean, string) and complex types (List / Map)
SQL-based query language:
- Create/Alter/Drop Table; Select with GROUP BY (aggregation via an in-memory hash table); Join; Insert (but not Update or Delete)
- No indexes: data is always scanned in parallel
Ref: https://hive.apache.org/
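As a hedged illustration of the query shapes Hive supports, here is the same Create Table / Insert / Select-with-GROUP-BY pattern run against Python's stdlib sqlite3 (a stand-in only: the page_views table and its rows are invented, and real Hive would compile the SELECT into MapReduce jobs rather than execute it locally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create Table and Insert are supported in HiveQL (Update/Delete are not)
cur.execute("CREATE TABLE page_views (user TEXT, url TEXT, hits INTEGER)")
cur.executemany("INSERT INTO page_views VALUES (?, ?, ?)",
                [("alice", "/home", 3), ("bob", "/home", 1),
                 ("alice", "/about", 2)])

# Select with GROUP BY aggregation, as in HiveQL
cur.execute("SELECT url, SUM(hits) FROM page_views GROUP BY url ORDER BY url")
print(cur.fetchall())  # [('/about', 2), ('/home', 4)]
```

On Hive the same GROUP BY would run as a parallel scan over HDFS files, with the aggregation done in reducers.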

Apache Hive: Example http://www.slideshare.net/InderajRajBains/using-apache-hive-with-high-performance

Apache Hive: MySQL vs Hive http://www.slideshare.net/SagarJauhari/improving-mysql-performance-with-hadoop

Apache Hive: Ingestion using external Table ** ORC: Optimized Row Columnar http://www.slideshare.net/InderajRajBains/using-apache-hive-with-high-performance

ORC (Optimized Row Columnar) File structure

Apache Hive: Pros and Cons
Pros:
- An easy way to process large-scale data
- Supports SQL-based queries
- Interoperability with other databases
- Programmability
- Efficient execution plans for performance
Cons:
- No easy way to append data: files in HDFS are immutable
sunset.usc.edu/classes/cs572_2010/LTang.ppt

Selected Apache Projects

Apache Mahout
Goal: build an environment for quickly creating scalable, performant machine learning applications.
Major features:
- Simple and extensible programming environment and framework
- Wide variety of premade algorithms
- R-like syntax for a vector math experimentation environment that works at scale
Ref: https://mahout.apache.org/

Apache Mahout: Machine Learning Algorithms
- Collaborative filtering: user-/item-based collaborative filtering, (weighted) matrix factorization with ALS, etc.
- Classification: logistic regression, naive Bayes, HMM, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality reduction: SVD, PCA, etc.
- Topic models: LDA
Ref: https://mahout.apache.org/
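Mahout's versions of these algorithms run distributed on Hadoop; as a minimal single-machine sketch of one of them, here is Lloyd's k-means in plain Python (the sample points and the naive "first k points" initialization are assumptions for illustration, not Mahout's seeding strategy):

```python
def kmeans(points, k, iters=10):
    # naive initialization: the first k points (real systems seed more carefully)
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
print(sorted(kmeans(pts, 2)))  # one center near (0.1, 0.1), one near (5.0, 5.03)
```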

Apache Mahout: Workflow (Tweet Classification)
1. Start from a raw training file (e.g. raw tweets)
2. Convert it into a SequenceFile and upload it to HDFS
3. Mahout: convert the sequence data to sparse TF-IDF vectors
4. Mahout: train a naive Bayes classifier
A SequenceFile is a flat file consisting of binary key/value pairs; it is used extensively in MapReduce as an input/output format.
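The "sequence to sparse TF-IDF vectors" step can be sketched in plain Python. This uses a simplified weighting (raw term count × log(N/df)); Mahout's vectorizer offers more options (normalization, n-grams), and the sample tweets are invented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight dicts)."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # sparse vector: only terms present in this document get an entry
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

tweets = [["hadoop", "is", "great"],
          ["spark", "is", "fast"],
          ["hadoop", "and", "spark"]]
vecs = tfidf(tweets)
print(vecs[0])  # "great" (rare) outweighs "is" and "hadoop" (common)
```

The resulting sparse vectors are what the naive Bayes trainer in the next step would consume.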