Presentation is loading. Please wait.

Presentation is loading. Please wait.

巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統

Similar presentations


Presentation on theme: "巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統"— Presentation transcript:

1 巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統
Tamkang University Tamkang University Big Data Mining 巨量資料探勘 巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem) 1042DM02 MI4 (M2244) (3094) Tue, 3, 4 (10:10-12:00) (B216) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University 淡江大學 資訊管理學系 tku.edu.tw/myday/

2 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) /02/16 巨量資料探勘課程介紹 (Course Orientation for Big Data Mining) /02/23 巨量資料基礎:MapReduce典範、Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem) /03/01 關連分析 (Association Analysis) /03/08 分類與預測 (Classification and Prediction) /03/15 分群分析 (Cluster Analysis) /03/22 個案分析與實作一 (SAS EM 分群分析): Case Study 1 (Cluster Analysis – K-Means using SAS EM) /03/29 個案分析與實作二 (SAS EM 關連分析): Case Study 2 (Association Analysis using SAS EM)

3 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) /04/05 教學行政觀摩日 (Off-campus study) /04/12 期中報告 (Midterm Project Presentation) /04/19 期中考試週 (Midterm Exam) /04/26 個案分析與實作三 (SAS EM 決策樹、模型評估): Case Study 3 (Decision Tree, Model Evaluation using SAS EM) /05/03 個案分析與實作四 (SAS EM 迴歸分析、類神經網路): Case Study 4 (Regression Analysis, Artificial Neural Network using SAS EM) /05/10 Google TensorFlow 深度學習 (Deep Learning with Google TensorFlow) /05/17 期末報告 (Final Project Presentation) /05/24 畢業班考試 (Final Exam)

4 2016/02/ 巨量資料基礎: MapReduce典範、 Hadoop與Spark生態系統 (Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem)

5 Architecture of Big Data Analytics
Big Data Analytics Applications Big Data Sources Big Data Transformation Big Data Platforms & Tools * Internal * External * Multiple formats * Multiple locations * Multiple applications Middleware Hadoop MapReduce Pig Hive Jaql Zookeeper Hbase Cassandra Oozie Avro Mahout Others Queries Transformed Data Big Data Analytics Raw Data Extract Transform Load Reports Data Warehouse OLAP Traditional Format CSV, Tables Data Mining Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications

6 Architecture of Big Data Analytics
Big Data Analytics Applications Big Data Sources Big Data Transformation Big Data Platforms & Tools * Internal * External * Multiple formats * Multiple locations * Multiple applications Data Mining Big Data Analytics Applications Middleware Hadoop MapReduce Pig Hive Jaql Zookeeper Hbase Cassandra Oozie Avro Mahout Others Queries Transformed Data Big Data Analytics Raw Data Extract Transform Load Reports Data Warehouse OLAP Traditional Format CSV, Tables Data Mining Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications

7 Architecture for Social Big Data Mining (Hiroshi Ishikawa, 2015)
Enabling Technologies Analysts Integrated analysis Model Construction Explanation by Model Integrated analysis model Conceptual Layer Construction and confirmation of individual hypothesis Description and execution of application-specific task Natural Language Processing Information Extraction Anomaly Detection Discovery of relationships among heterogeneous data Large-scale visualization Data Mining Application specific task Multivariate analysis Logical Layer Parallel distrusted processing Social Data Software Hardware Physical Layer Source: Hiroshi Ishikawa (2015), Social Big Data Mining, CRC Press

8 Business Intelligence (BI) Infrastructure
Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson.

9 Data Warehouse Data Mining and Business Intelligence
Increasing potential to support business decisions End User Decision Making Data Presentation Business Analyst Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses DBA Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier

10 The Evolution of BI Capabilities
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

11 Data Science and Business Intelligence
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

12 Data Mining at the Intersection of Many Disciplines
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

13 Knowledge Discovery (KDD) Process
Data mining: core of knowledge discovery process Pattern Evaluation Data Mining Task-relevant Data Data Warehouse Selection Data Cleaning Data Integration Databases Source: Han & Kamber (2006)

14 A Taxonomy for Data Mining Tasks
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

15 Deep Learning Intelligence from Big Data
Source:

16 Big Data Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

17 Big Data Growth is increasingly unstructured
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

18 Typical Analytic Architecture
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

19 Data Evolution and the Rise of Big Data Sources
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

20 Emerging Big Data Ecosystem
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

21 Key Roles for the New Big Data Ecosystem
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

22 Profile of a Data Scientist
Quantitative mathematics or statistics Technical software engineering, machine learning, and programming skills Skeptical mind-set and critical thinking Curious and creative Communicative and collaborative Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

23 Source: https://www. thalesgroup

24 MapReduce Paradigm Map Reduce Big Data Map0 Map1 Map2 Map3 Reduce0
MapReduce Data Output Data

25 The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. Source:

26 Processing MapReduce HDFS Storage Source:

27 Big Data with Hadoop Architecture
Source:

28 Big Data with Hadoop Architecture Logical Architecture Processing: MapReduce
Source:

29 Big Data with Hadoop Architecture Logical Architecture Storage: HDFS
Source:

30 Big Data with Hadoop Architecture Process Flow
Source:

31 Big Data with Hadoop Architecture Hadoop Cluster
Source:

32 Hadoop Ecosystem Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

33 Traditional ETL Architecture
Source:

34 Offload ETL with Hadoop (Big Data Architecture)
Source:

35 Source: http://www.newera-technologies.com/big-data-solution.html
Big Data Solution Source:

36 HDP A Complete Enterprise Hadoop Data Platform
Source:

37 Lightning-fast cluster computing
Apache Spark is a fast and general engine for large-scale data processing. Source:

38 Logistic regression in Hadoop and Spark
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Source:

39 Ease of Use Write applications quickly in Java, Scala, Python, R.
Source:

40 Word count in Spark's Python API
text_file = spark.textFile("hdfs://...") text_file.flatMap(lambda line: line.split()) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a+b) Source:

41 Spark and Hadoop Source:

42 Spark Ecosystem Source:

43 Mllib (machine learning)
Spark Ecosystem Spark Spark Streaming Mllib (machine learning) Spark SQL GraphX (graph) Kafka Flume H2O Hive Titan HBase Cassandra HDFS Source: Mike Frampton (2015), Mastering Apache Spark, Packt Publishing

44 Python for Big Data Analytics
(The column on the left is the 2015 ranking; the column on the right is the 2014 ranking for comparison 2015 2014 Source:

45 Source: http://www. kdnuggets

46 SAS & Hadoop

47 Modern Reality Infrastructure Data Analytics Commoditization
Architectures Scale Infrastructure New Complex Streams Perishable Considerations Cost Data New Category of Business Problems Analytical Algorithms Operationalization Analytics Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics, Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

48 Copyright © 2011, SAS Institute Inc. All rights reserved.
ADVANCED ANALYTICS TEXT ANALYTICS Finding treasures in unstructured data like social media or survey tools that could uncover insights about consumer sentiment FORECASTING Leveraging historical data to drive better insight into decision-making for the future INFORMATION MANAGEMENT OPTIMIZATION DATA MINING Analyze massive amounts of data in order to accurately identify areas likely to produce the most profitable results Currently, most organizations use one or more of these techniques to solve a departmental specific business problem. As such, even though valuable, from an enterprise standpoint, the value is somewhat marginalized or not optimized. What we see is that departments, such as Marketing, Finance or Risk are using one technique to solve a problem but not multiple techniques. So, for instance, in a bank, the risk department will use econometric forecasting to manage the treasury portfolio and overall risk…so they think. But they are not proactively looking at text ( s and chat), to see where the bank could be exposed. Or a retailer uses forecasting to predict what will sell, but they are not looking at on line sentiment and customer segmentation to optimize what product they should offer what customer and how to optimize distribution all as part of the same process. These are examples of why these techniques should be pulled together for the greater good. Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics, Mine transaction databases for data of spending patterns that indicate a stolen card.. STATISTICS Copyright © 2011, SAS Institute Inc. All rights reserved. Source: Deepak Ramanathan (2014) Copyright © 2011, SAS Institute Inc. All rights reserved.

49 Trends in Analytics Complex Business Problems Are Driving Analytics Innovation Speed Will Be Of Essence Leverage Analytics To Unlock The Information Contained In Unstructured Data Operationalizing Analytics Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

50 Architectures of Big Data Analytics

51 Traditional Analytics
BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Unstructured, Semi-structured and Streaming data (i.e. sensor data) handled often outside the Warehouse flow Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

52 Hadoop as a “new data” Store
BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

53 Hadoop as an additional input to the EDW
BI and Analytics Data Mart Operational Data Sources Data Mart EDW Analytic Mart Analytic Mart Analytic Mart Data Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

54 Operational Data Sources
Hadoop Data Platform As a “staging Layer” as part of a “data Lake” – Downstream stores could be Hadoop, data appliances or an RDBMS BI and Analytics Operational Data Sources EDW Data Mart Data Mart Analytic Mart Analytic Mart Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

55 SAS Big data Strategy – SAS areas
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

56 SAS Big data Strategy – SAS areas
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

57 SAS® Within the HADOOP ECOSYSTEM
EG EM VA SAS® Enterprise Guide® SAS® Data Integration SAS® Enterprise Miner™ SAS® Visual Analytics SAS® In-Memory Statistics for Haodop User Interface SAS Metadata Metadata Next-Gen SAS® User SAS® User Base SAS & SAS/ACCESS® to Hadoop™ In-Memory Data Access Data Access Impala Pig Hive SAS Embedded Process Accelerators SAS® LASR™ Analytic Server SAS® High- Performance Analytic Procedures Data Processing Map Reduce MPI Based HDFS File System Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

58 SAS enables the entire lifecycle around HADOOP
Done using either the Data Preparation, Data Exploration or Build Model Tools SAS Visual Analytics Decision Manager IDENTIFY / FORMULATE PROBLEM DATA PREPARATION EXPLORATION TRANSFORM & SELECT BUILD MODEL VALIDATE DEPLOY EVALUATE / MONITOR RESULTS SAS Visual Analytics SAS Visual Statistics SAS In-Memory Statistics for Hadoop SAS Scoring Accelerator for Hadoop SAS Code Accelerator for Hadoop Done using either the Data Preparation, Data Exploration or Build Model Tools Decision Manager SAS High Performance Analytics Offerings supported by relevant clients like SAS Enterprise Miner, SAS/STAT etc. Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

59 SAS® VISUAL ANALYTICS A Single solution for Data Discovery, Visualization, analytics and reporting
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

60 Visualization Powered by SAS Analytics
SAS® VISUAL ANALYTICS Example: text analysis gives you insight to customer experience and opinion Example: Applying text analysis to twitter streams, or to customer comments in call logs, can give quick insight into the “hot topics” discussed. It’s more than a simple world=cloud that shows which topics are being discussed, but the analytics applied behind the scenes determine which words are used most frequently –so you can determine which topics are the most “important” and warrant further understanding/exploration. Visualization Powered by SAS Analytics Analytics applied to text provides real MEANING Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

61 Visualization Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

62 References EMC Education Services (2015), Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing Mike Frampton (2015), Mastering Apache Spark, Packt Publishing Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics,


Download ppt "巨量資料基礎: MapReduce典範、Hadoop與Spark生態系統"

Similar presentations


Ads by Google