Image taken from: slideshare

Selected Apache Projects
Xinsheng Parth Jung Xilun Shengyu Hans Yash Maolin Sicong

Hadoop Architecture
Hadoop = HDFS + MapReduce
HDFS (Hadoop Distributed File System): a reliable distributed file system that provides high-throughput access to data.
MapReduce: a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm.

Components of HDFS
NameNode: the master of the system, which maintains the file system namespace.
DataNode: the slaves, which provide the actual storage.
Secondary NameNode: takes periodic checkpoints.
Diagram: a master NameNode, with a Secondary NameNode taking periodic checkpoints, above a row of slave DataNodes.
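The split between metadata (NameNode) and storage (DataNodes) can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the class name ToyNameNode, the round-robin replica placement, and the file name and sizes are all assumptions made for illustration. Real HDFS uses a rack-aware placement policy; only the replication factor of 3 mirrors the HDFS default.

```python
class ToyNameNode:
    """Holds metadata only: which DataNodes store each block of each file."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        # Cannot keep more replicas than there are DataNodes.
        self.replication = min(replication, len(self.datanodes))
        self.block_map = {}  # filename -> list of (block_id, [datanode names])
        self._next = 0       # round-robin cursor (assumed placement policy)

    def add_file(self, filename, size, block_size):
        """Split a file of `size` bytes into blocks and assign replicas."""
        n_blocks = max(1, -(-size // block_size))  # ceiling division
        blocks = []
        for b in range(n_blocks):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            self._next += 1
            blocks.append((f"{filename}#blk{b}", replicas))
        self.block_map[filename] = blocks
        return blocks


# Hypothetical cluster with four DataNodes and a 300-byte file in 128-byte blocks.
namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = namenode.add_file("logs.txt", size=300, block_size=128)
for block_id, replicas in blocks:
    print(block_id, "->", replicas)
```

The NameNode object never touches file contents; it only records where blocks live, which is why losing it (and why checkpointing it) matters so much in the real system.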

Components of MapReduce
JobTracker: the master, which manages the jobs and resources in the cluster (the TaskTrackers).
TaskTracker: the slaves, responsible for running map and reduce tasks.
Diagram: a master JobTracker above a row of slave TaskTrackers.

Hadoop example: Word count
Goal: Given a set of documents, count how often each word occurs.
Input: key-value pairs (document:lineNumber, text)
Output: key-value pairs (word, #occurrences)
What should the intermediate key-value pairs be?

map(String key, String value) {
  // key: document name, line no
  // value: contents of line
}

reduce(String key, Iterator values) {
}

Simple example: Word count [1]
Figure: four mappers (responsible for input keys 1-2, 3-4, 5-6, 7-8) feed four reducers (responsible for the key ranges A-G, H-N, O-U, V-Z).
Input KV-pairs: (1, the apple) (2, is an apple) (3, not an orange) (4, because the) (5, orange) (6, unlike the apple) (7, is orange) (8, not green)
Mapper output: one (word, 1) pair per word occurrence, e.g. (apple, 1), (an, 1), (the, 1)
Reducer input after grouping: (an, {1, 1}) (apple, {1, 1, 1}) (because, {1}) (green, {1}) (is, {1, 1}) (not, {1, 1}) (orange, {1, 1, 1}) (the, {1, 1, 1}) (unlike, {1})
Reducer output: (an, 2) (apple, 3) (because, 1) (green, 1) (is, 2) (not, 2) (orange, 3) (the, 3) (unlike, 1)
1. Each mapper receives some of the KV-pairs as input.
2. The mappers process the KV-pairs one by one.
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it.
4. The reducers sort their input by key and group it.
5. The reducers process their input one group at a time.
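The five steps above can be simulated on a single machine. The sketch below is plain Python, not the Hadoop API: map_fn, reduce_fn and run_word_count are hypothetical names, and the shuffle is modelled with an in-memory dictionary rather than a network transfer between mappers and reducers. The input is the eight lines from the slide, and the intermediate pairs are (word, 1).

```python
from collections import defaultdict

# Input from the slide: (line number, text) pairs.
LINES = [
    (1, "the apple"),
    (2, "is an apple"),
    (3, "not an orange"),
    (4, "because the"),
    (5, "orange"),
    (6, "unlike the apple"),
    (7, "is orange"),
    (8, "not green"),
]


def map_fn(key, value):
    """Emit the intermediate pair (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def run_word_count(records):
    # Steps 1-3: run the mapper on each record and route each pair by key.
    groups = defaultdict(list)
    for key, value in records:
        for word, count in map_fn(key, value):
            groups[word].append(count)
    # Steps 4-5: sort by key, then reduce one group at a time.
    return dict(reduce_fn(word, groups[word]) for word in sorted(groups))


counts = run_word_count(LINES)
for word, total in counts.items():
    print(word, total)
```

In a real cluster the grouping dictionary is replaced by partitioning on the key, so that each reducer only sees the key range it is responsible for.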

Hadoop: Pros and Cons
Pros:
- Processing time of parallel batch jobs can be reduced
- Fault tolerance
- Large clusters can be deployed (e.g. Yahoo's Hadoop cluster [1])
Cons:
- Query workload should consist of batch jobs
- Execution overhead (e.g. initialization of MapReduce) makes it unwieldy for smaller data sets
- Not suitable for low-latency queries (e.g. real-time analytics, websites, etc.) [1]

Overview Drawback of Hadoop: Disk-based processing

Hadoop vs Spark: Which one is better?
Points of comparison: memory requirement, cache usage, flexibility, support for libraries.
Any cluster manager. Any data source. Different language support: Java, Scala, Python.
It all boils down to what kind of query processing you want to do.

Apache Hive: Definition
Data warehouse software for managing large datasets in Hadoop; queries are executed as MapReduce plans.
Data model: tables
  Basic types: int, float, boolean, string
  Complex types: List / Map
SQL-based queries: Create/Alter/Drop Table, Select with GROUP BY (aggregation using an in-memory hash table), Join, Insert (but not Update or Delete)
No indexes; data is always scanned in parallel.
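The GROUP BY aggregation with an in-memory hash table can be sketched in a few lines. This is an illustration of the general hash-aggregation technique, not Hive's implementation: the function name hash_group_by and the employee rows are hypothetical, and the query it mimics is roughly SELECT dept, COUNT(*), SUM(salary) ... GROUP BY dept over an assumed table.

```python
def hash_group_by(rows, key_col, value_col):
    """Single scan over the rows, keeping one (count, sum) entry per group key."""
    table = {}  # the in-memory hash table: group key -> (count, sum)
    for row in rows:
        key = row[key_col]
        count, total = table.get(key, (0, 0))
        table[key] = (count + 1, total + row[value_col])
    return table


# Hypothetical rows standing in for a scanned table.
employees = [
    {"dept": "sales", "salary": 100},
    {"dept": "eng", "salary": 150},
    {"dept": "sales", "salary": 120},
    {"dept": "eng", "salary": 130},
    {"dept": "ops", "salary": 90},
]

result = hash_group_by(employees, "dept", "salary")
for dept, (count, total) in sorted(result.items()):
    print(dept, count, total)
```

Because every row is visited exactly once and no index is consulted, this matches the slide's point that data is always scanned; in a parallel setting each worker would build a partial table that is merged afterwards.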

Apache Hive: Example

Apache Hive: MySQL vs Hive

Apache Hive: Ingestion using external Table
** ORC: Optimized Row Columnar

ORC (Optimized Row Columnar)
File structure

Apache Hive: Pros and Cons
A easy way to process large scale data Support SQL-based queries Interoperability with other database Programmability Efficient execution plans for performance Cons No easy way to append data Files in HDFS are immutable sunset.usc.edu/classes/cs572_2010/LTang.ppt

Apache Mahout
Goal: build an environment for quickly creating scalable, performant machine learning applications.
Major features:
- Simple and extensible programming environment and framework
- Wide variety of premade algorithms
- R-like syntax vector math experimentation environment, which works at scale

Apache Mahout Machine Learning Algorithms
Collaborative Filtering: User-/Item-Based Collaborative Filtering, (Weighted) Matrix Factorization with ALS, etc.
Classification: Logistic Regression, Naive Bayes, HMM, etc.
Clustering: K-means, Spectral Clustering, etc.
Dimensionality Reduction: SVD, PCA, etc.
Topic Models: LDA

Apache Mahout Workflow: Tweets Classification
1. Raw training file (e.g. raw tweets)
2. Convert into SequenceFile
3. Upload to HDFS
4. Mahout: sequence to sparse TF-IDF vectors
5. Mahout: train Naïve Bayes classifier
SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
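To make the "sparse TF-IDF vectors" step concrete, here is a minimal single-machine sketch. It is not Mahout's code: the function name tfidf_vectors, the whitespace tokenizer, the example tweets, and the weighting tf * ln(N / df) are assumptions chosen for illustration, and Mahout's own weighting and tokenization may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, sparse vectors) where each vector maps
    a term index to its TF-IDF weight; zero weights are not stored."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vector = {}
        for word, freq in tf.items():
            weight = freq * math.log(n_docs / df[word])
            if weight > 0:  # keep the vector sparse
                vector[index[word]] = weight
        vectors.append(vector)
    return vocab, vectors


# Hypothetical raw tweets.
tweets = ["good movie", "bad movie"]
vocab, vectors = tfidf_vectors(tweets)
print(vocab)
print(vectors)
```

A term that occurs in every document (here "movie") gets weight zero and drops out of the sparse vector, which is the behaviour that makes TF-IDF useful as classifier input.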
