Image taken from: slideshare


1 Image taken from: http://www. slideshare

2 Selected Apache Projects
Xinsheng Parth Jung Xilun Shengyu Hans Yash Maolin Sicong

3 Selected Apache Projects

4 Selected Apache Projects

5 Hadoop Architecture
Hadoop = HDFS + MapReduce
HDFS (Hadoop Distributed File System): a reliable distributed file system that provides high-throughput access to data.
MapReduce: a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm.

6 Components of HDFS
NameNode: the master of the system, which maintains the file system namespace.
DataNode: the slaves, which provide the actual storage.
Secondary NameNode: performs periodic checkpoints.
[Diagram: the master NameNode, with a Secondary NameNode taking periodic checkpoints, above the slave DataNodes.]

7 Components of MapReduce
JobTracker: the master, which manages the jobs and resources (TaskTrackers) in the cluster.
TaskTracker: the slaves, which are responsible for running map and reduce tasks.
[Diagram: the master JobTracker above the slave TaskTrackers.]

8 Hadoop example: Word count
Goal: Given a set of documents, count how often each word occurs.
Input: Key-value pairs (document:lineNumber, text)
Output: Key-value pairs (word, #occurrences)
What should be the intermediate key-value pairs?
map(String key, String value) {
  // key: document name, line no
  // value: contents of line
}
reduce(String key, Iterator values) {
}
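One possible answer to the question above is to emit (word, 1) as the intermediate pair. The following is a minimal single-process sketch in Python, not Hadoop's Java API; the names `map_fn`, `reduce_fn`, and `run_word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """key: 'document:lineNumber' (unused here); value: contents of the line.

    Emits one intermediate pair (word, 1) per word occurrence.
    """
    for word in value.lower().split():
        yield (word, 1)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """key: a word; values: all the 1s emitted for that word."""
    return (key, sum(values))


def run_word_count(records: List[Tuple[str, str]]) -> Dict[str, int]:
    """Hypothetical local driver that stands in for the framework."""
    grouped = defaultdict(list)
    for key, value in records:
        for word, one in map_fn(key, value):
            grouped[word].append(one)  # group intermediate values by word
    return dict(reduce_fn(word, ones) for word, ones in grouped.items())


if __name__ == "__main__":
    print(run_word_count([("doc:1", "the apple"), ("doc:2", "is an apple")]))
```

In a real cluster the grouping step is done by the framework between the map and reduce phases; the driver here only imitates it in memory.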

9 Simple example: Word count[1]
Input KV-pairs (line number, text) by mapper:
Mapper (1-2): (1, the apple), (2, is an apple)
Mapper (3-4): (3, not an orange), (4, because the)
Mapper (5-6): (5, orange), (6, unlike the apple)
Mapper (7-8): (7, is orange), (8, not green)
Reducers, labeled with the key range each node is responsible for:
Reducer (A-G): (an, {1, 1}) -> (an, 2); (apple, {1, 1, 1}) -> (apple, 3); (because, {1}) -> (because, 1); (green, {1}) -> (green, 1)
Reducer (H-N): (is, {1, 1}) -> (is, 2); (not, {1, 1}) -> (not, 2)
Reducer (O-U): (orange, {1, 1, 1}) -> (orange, 3); (the, {1, 1, 1}) -> (the, 3); (unlike, {1}) -> (unlike, 1)
Reducer (V-Z): no keys in this example
Steps:
1 Each mapper receives some of the KV-pairs as input
2 The mappers process the KV-pairs one by one
3 Each KV-pair output by the mapper is sent to the reducer that is responsible for it
4 The reducers sort their input by key and group it
5 The reducers process their input one group at a time
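The five steps on this slide can be traced in a small simulation. This is a sketch in plain Python under stated assumptions: the input lines and the mapper and reducer ranges are taken from the slide, while `reducer_for` and `simulate` are hypothetical helper names, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict

# Input KV-pairs (line number, text) from the slide.
LINES = {
    1: "the apple", 2: "is an apple", 3: "not an orange", 4: "because the",
    5: "orange", 6: "unlike the apple", 7: "is orange", 8: "not green",
}
MAPPER_RANGES = [(1, 2), (3, 4), (5, 6), (7, 8)]  # line ranges per mapper
REDUCER_RANGES = [("a", "g"), ("h", "n"), ("o", "u"), ("v", "z")]  # key ranges


def reducer_for(word):
    """Return the index of the reducer responsible for this word."""
    first = word[0].lower()
    for index, (low, high) in enumerate(REDUCER_RANGES):
        if low <= first <= high:
            return index
    raise ValueError("no reducer covers key: " + word)


def simulate():
    inboxes = [[] for _ in REDUCER_RANGES]
    # Steps 1-3: each mapper reads its lines one by one and sends every
    # (word, 1) pair to the reducer responsible for that word.
    for low, high in MAPPER_RANGES:
        for line_no in range(low, high + 1):
            for word in LINES[line_no].split():
                inboxes[reducer_for(word)].append((word, 1))
    # Steps 4-5: each reducer sorts its input by key, groups it,
    # and processes one group at a time.
    results = []
    for inbox in inboxes:
        groups = defaultdict(list)
        for word, one in sorted(inbox):
            groups[word].append(one)
        results.append({word: sum(ones) for word, ones in groups.items()})
    return results


if __name__ == "__main__":
    for (low, high), counts in zip(REDUCER_RANGES, simulate()):
        print("Reducer (%s-%s): %s" % (low.upper(), high.upper(), counts))
```

Partitioning by first letter mirrors the slide's key ranges; it is only one possible partitioning scheme.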

10 Hadoop: Pros and Cons
Pros:
Processing time of parallel batch jobs can be reduced
Fault tolerance
Large clusters can be deployed
Yahoo has a node Hadoop cluster [1]
Cons:
Query workload should consist of batch jobs
Execution overhead (e.g. initialization of MapReduce) makes it unwieldy for smaller data sets
Not suitable for low-latency queries (e.g. real-time analytics, websites, etc.) [1]

11 Selected Apache Projects

12 Overview Drawback of Hadoop: Disk-based processing

13 Hadoop vs Spark: Which one is better?
Memory requirement
Cache usage
Flexibility
Support for libraries
Any cluster manager
Any data source
Different language support: Java, Scala, Python
It all boils down to what kind of query processing you want to do.

14 Selected Apache Projects

15 Apache Hive: Definition
Data warehouse software for managing large datasets in Hadoop; queries run as MapReduce plans.
Data model:
Table
Basic types: int, float, boolean, string
Complex types: List / Map
SQL-based queries:
Create/Alter/Drop Table
Select with GROUP BY (aggregation using an in-memory hash table), Join
Insert (not Update or Delete)
No indexes; data is always scanned in parallel.
Ref:
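The idea behind GROUP BY aggregation with an in-memory hash table, combined with a full scan and no index, can be illustrated with a short Python sketch. This is not Hive's implementation; the function `group_by_sum` and the column names `dept` and `sales` are hypothetical.

```python
from collections import defaultdict


def group_by_sum(rows, key_col, value_col):
    """Aggregate value_col per distinct key_col using a hash table.

    Every row is scanned once; no index is consulted.
    """
    totals = defaultdict(int)  # the in-memory hash table
    for row in rows:
        totals[row[key_col]] += row[value_col]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical table rows, for illustration only.
    rows = [
        {"dept": "a", "sales": 3},
        {"dept": "b", "sales": 5},
        {"dept": "a", "sales": 4},
    ]
    print(group_by_sum(rows, "dept", "sales"))
```

This corresponds roughly to a query of the form SELECT dept, SUM(sales) ... GROUP BY dept, evaluated over one partition of the data.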

16 Apache Hive: Example

17 Apache Hive: MySQL vs Hive

18 Apache Hive: Ingestion using external Table
** ORC: Optimized Row Columnar

19 ORC (Optimized Row Columnar)
File structure

20 Apache Hive: Pros and Cons
Pros:
An easy way to process large-scale data
Supports SQL-based queries
Interoperability with other databases
Programmability
Efficient execution plans for performance
Cons:
No easy way to append data
Files in HDFS are immutable
sunset.usc.edu/classes/cs572_2010/LTang.ppt

21 Selected Apache Projects

22 Apache Mahout
Build an environment for quickly creating scalable, performant machine learning applications.
Major features:
Simple and extensible programming environment and framework
Wide variety of premade algorithms
R-like syntax vector math experimentation environment, which works at scale
Ref:

23 Apache Mahout Machine Learning Algorithms
Collaborative Filtering: User-/Item-Based Collaborative Filtering, (Weighted) Matrix Factorization with ALS, etc.
Classification: Logistic Regression, Naive Bayes, HMM, etc.
Clustering: K-means, Spectral Clustering, etc.
Dimensionality Reduction: SVD, PCA, etc.
Topic Models: LDA
Ref:

24 Apache Mahout Workflow: Tweets Classification
1 Raw training file (e.g. raw tweets)
2 Convert into SequenceFile
3 Upload to HDFS
4 Mahout: sequence to sparse TF-IDF vectors
5 Mahout: train Naïve Bayes classifier
SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
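To make the "sparse TF-IDF vectors" step more concrete, here is a minimal Python sketch that turns raw text into sparse term-weight dictionaries. It assumes the textbook weighting tf * log(N / df), which is not necessarily the exact formula Mahout applies, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse vector (dict: term -> weight) per document.

    Assumed weighting: tf * log(N / df), where N is the number of
    documents and df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


if __name__ == "__main__":
    # Hypothetical tweets, for illustration only.
    for vector in tfidf_vectors(["good day", "bad day"]):
        print(vector)
```

Terms that occur in every document receive weight zero under this formula, so only distinguishing words contribute to the vectors passed to the classifier.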

