Hadoop & MapReduce
Zhangxi Lin
CAABI, Texas Tech University
FIFE, Southwestern University of Finance & Economics
Cellphone: 18610660375, QQ/WeChat: 155970
http://zlin.ba.ttu.edu
Zhangxi.lin@ttu.edu
2015-06-16
CAABI, Texas Tech University
◦ Center for Advanced Analytics and Business Intelligence, initially started in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business
Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE), SWUFE
◦ One of two key labs in finance, founded in 2008 and sponsored by the Sichuan Provincial Government
◦ Underpinned by two areas in SWUFE: Information and Finance
Know Big Data One More Step
When we talk about big data, we must know what Hadoop is.
When we plan data warehousing, we must know what HDFS and NoSQL are.
When we say data mining, we must know what Mahout and H2O are.
Do you know that Hadoop data warehousing does not need dimensional modeling?
Do you know how Hadoop stores heterogeneous data?
Do you know what Hadoop's "Achilles' heel" is?
Do you know you can install a Hadoop system on your laptop?
Do you know Alibaba retired its last minicomputer in 2014?
So, let's talk about Hadoop.
After this lecture you will
Understand the challenges in big data management
Understand how Hadoop and MapReduce work
Get familiar with the Hadoop ecology
Be able to install Hadoop on your laptop
Be able to install a handy big data tool on your laptop to visualize and mine data
Outlines
Apache Hadoop
Hadoop Data Warehousing
Hadoop ETL
Hadoop Data Mining
Data Visualization with Hadoop
MapReduce Algorithm
Setting up Your Hadoop
Appendixes
◦ The Hadoop Ecological System
◦ Matrix calculation with MapReduce
A Traditional Business Intelligence System (figure: MS SQL Server with SSMS, SSIS, SSAS, SSRS and BIDS, plus SAS EM and SAS EG)
Hadoop ecosystem (figure)
What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is not a replacement for a traditional RDBMS, but a supplement for handling and processing large datasets. It achieves two tasks: 1. massive data storage; 2. faster processing. For such workloads, Hadoop is typically cheaper and faster than scaling up a traditional system.
Hadoop 2: Big data's big leap forward
The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop's job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
Hadoop 1.0 vs Hadoop 2.0
Features of Hadoop 2.0 over Hadoop 1.0:
Horizontal scalability of the NameNode.
The NameNode is no longer a single point of failure.
Ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph.
The two major functionalities of the overburdened JobTracker (resource management and job scheduling/monitoring) are split into two separate daemons.
Apache Spark
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley.
Spark's in-memory processing provides performance up to 100 times faster for certain applications.
Spark is well suited for machine learning algorithms.
Spark requires a cluster manager and a distributed storage system; it supports Hadoop YARN.
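To make the cluster-computing model concrete, here is a minimal PySpark word count. This is a sketch, not anything from the slides: the input path, app name, and local master setting are illustrative placeholders.

```python
from pyspark import SparkConf, SparkContext

# Connect to a cluster manager; "local[*]" uses all local cores for testing.
conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Load the text, split lines into words, and count each word in parallel.
counts = (sc.textFile("hdfs:///tmp/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```

Because intermediate results can stay in memory between stages, iterative workloads such as machine learning loops avoid the per-step disk writes of classic MapReduce, which is where the quoted speedup for certain applications comes from.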
MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
MapReduce 2.0 – YARN (Yet Another Resource Negotiator) (figure)
How Hadoop Operates (figure)
Hadoop Ecosystem (figure)
Hadoop Topics
1. Data warehousing: HDFS, HBase, Hive, Kylin, NoSQL/NewSQL, Solr
2. Publicly available big data services: Hortonworks, Cloudera, HaaS, EC2
3. MapReduce & data mining: Mahout, H2O, R, Python
4. Big data ETL: Kettle, Flume, Sqoop, Impala, Chukwa, Dremel, Pig
5. Big data platform management: Oozie, ZooKeeper, Ambari, Loom, Ganglia
6. Application development platform: Tomcat, Neo4j, Pig, Hue
7. Tools & visualization: Pentaho, Tableau, Saiku, Mondrian, Gephi
8. Streaming data processing: Spark, Storm, Kafka, Avro
HADOOP DATA WAREHOUSING
Comparing the RDBMS and Hadoop data warehousing stack
Storage layer: database tables (RDBMS) vs. the HDFS file system (Hadoop). HDFS is purpose-built for extreme IO speeds.
Metadata layer: system tables (RDBMS) vs. HCatalog (Hadoop). All clients can use HCatalog to read files.
Query layer: a single SQL query engine (RDBMS) vs. multiple engines, SQL and non-SQL (Hadoop). Multiple query engines such as Hive or Impala are available.
HDFS (Hadoop Distributed File System)
The Hadoop ecosystem consists of many components and libraries for varied tasks. The storage part of Hadoop is HDFS and the processing part is MapReduce. HDFS is a Java-based distributed file system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the cluster.
HDFS Architecture & Design
HDFS has a master/slave architecture, consisting of a single NameNode and a number of DataNodes in a cluster. In HDFS, files are split into one or more blocks, which are stored in a set of DataNodes. HDFS exposes a file system namespace and allows user data to be stored in files. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
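Since HDFS exposes a file system namespace, day-to-day interaction usually goes through the `hdfs dfs` command-line client. The sketch below wraps a few of those commands from Python; the paths are placeholders, and a configured Hadoop client is assumed on the machine.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and return its output (assumes the
    Hadoop client is installed and configured for the cluster)."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a directory in the HDFS namespace, upload a local file
# (HDFS splits it into blocks and replicates them across DataNodes),
# then list the directory. All paths here are illustrative.
hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "local_data.csv", "/user/demo/data.csv")
print(hdfs("-ls", "/user/demo"))
```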
What is NoSQL?
Stands for "Not Only SQL". NoSQL is a non-relational database management system, different from traditional relational database management systems in some significant ways. NoSQL is designed for distributed data stores where very large-scale data storage is needed (for example at Google or Facebook, which collect terabytes of data every day from their users). Such data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.
NoSQL (figure)
What is NewSQL?
A modern RDBMS that seeks to provide the scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system.
SQL as the primary interface
Non-locking concurrency control
High per-node performance
The H-Store parallel database system is the first known NewSQL system.
Classification of NoSQL and NewSQL (figure)
Taxonomy of Big Data Stores (figure)
Features of OldSQL vs NoSQL vs NewSQL (figure)
HBase
HBase is a non-relational, distributed database.
It is a column-oriented DBMS.
It is an implementation of Google's Bigtable.
HBase is built on top of the Hadoop Distributed File System (HDFS).
Differences between HBase and a relational database
HBase is a column-oriented database, while a relational database is row-oriented.
HBase is highly scalable, while an RDBMS is hard to scale.
HBase has a flexible schema, while an RDBMS has a fixed schema.
HBase holds denormalized data, while data in a relational database is normalized.
HBase performs well on large volumes of unstructured data, where a relational database performs poorly.
HBase does not use a query language, while a relational database uses SQL to retrieve data.
HBase Data Model (figure)
HBase: Keys and Column Families (figure: each record is divided into column families)
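As a concrete illustration of the data model, here is a hedged sketch using happybase, a third-party Python client that talks to HBase through its Thrift gateway. The host, table name, and column family are hypothetical, and the table is assumed to already exist with a 'cf' column family.

```python
import happybase  # third-party client; talks to an HBase Thrift server

# Assumes an HBase Thrift server is running on this (placeholder) host.
connection = happybase.Connection("hbase-host", port=9090)

# Rows are keyed byte strings; columns live inside column families.
table = connection.table("users")
table.put(b"row-001", {b"cf:name": b"Alice", b"cf:city": b"Lubbock"})

# Reading a row returns only the columns that exist for it: the schema
# is flexible per row, unlike a fixed RDBMS schema.
row = table.row(b"row-001")
print(row[b"cf:name"])
connection.close()
```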
What is Apache Hive?
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
It is built on top of Apache Hadoop and provides tools for easy data extract/transform/load (ETL).
It supports analysis of large datasets stored in Hadoop's HDFS.
It supports an SQL-like language called HQL, as well as big data analytics with the help of MapReduce.
What is HQL?
HQL: Hive Query Language.
Doesn't conform to any ANSI standard; very close to the MySQL dialect, but with some differences.
SQL-to-HQL cheat sheet: http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
HQL doesn't support transactions, so don't expect RDBMS behavior from it.
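A hedged example of what running HQL from a client might look like, using the third-party PyHive library; the host, database, table, and columns are hypothetical placeholders.

```python
from pyhive import hive  # third-party client for HiveServer2

# Connect to HiveServer2 (host and database are placeholders).
conn = hive.Connection(host="hive-host", port=10000, database="default")
cursor = conn.cursor()

# HQL reads almost like MySQL; behind the scenes this aggregate is
# compiled into MapReduce jobs. The table and columns are hypothetical.
cursor.execute("""
    SELECT category, COUNT(*) AS n, AVG(price) AS avg_price
    FROM products
    GROUP BY category
    ORDER BY n DESC
    LIMIT 10
""")
for category, n, avg_price in cursor.fetchall():
    print(category, n, avg_price)
conn.close()
```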
HADOOP ETL
List of Tools
Sqoop
Flume
Impala
Chukwa
Kettle
ETL (figure)
Sqoop
Sqoop is short for "SQL to Hadoop".
Used to move data back and forth between an RDBMS and HDFS, for performing analysis using BI tools.
A simple command-line tool (Sqoop 2 brings a web interface as well).
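A typical Sqoop 1 import is a single command line. The sketch below invokes it from Python; the JDBC URL, credentials, table, and target directory are placeholders for illustration.

```python
import subprocess

# Pull a table from MySQL into HDFS using 4 parallel mappers.
# Connection string, credentials, table, and target directory are
# all placeholders.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",
    "--username", "etl_user",
    "--password", "secret",          # in practice prefer --password-file
    "--table", "orders",
    "--target-dir", "/user/demo/orders",
    "-m", "4",                       # number of parallel map tasks
], check=True)
```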
How Sqoop Works (figure: the dataset is split into slices, each processed by its own mapper)
Sqoop 1 & Sqoop 2
Connectors for all major RDBMS: Supported in Sqoop 1. Not supported in Sqoop 2; workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL and Oracle and should work on any other JDBC-compliant database, though performance may not be comparable to that of the specialized connectors in Sqoop 1.
Encryption of stored passwords: Not supported in Sqoop 1, with no workaround. Supported in Sqoop 2 using Derby's on-disk encryption (disclaimer: although expected to work in the current version of Sqoop 2, this configuration has not been verified).
Data transfer from RDBMS to Hive or HBase: Supported in Sqoop 1. Not supported in Sqoop 2; workaround: (1) import the data from the RDBMS into HDFS, then (2) load it into Hive or HBase manually using appropriate tools and commands, such as Hive's LOAD DATA statement.
Data transfer from Hive or HBase to RDBMS: Not supported in either version; workaround: (1) extract the data from Hive or HBase into HDFS (as a text or Avro file), then (2) use Sqoop to export the output of the previous step to the RDBMS.
Sqoop 1 & Sqoop 2 Architecture (figure)
For more on the differences: https://www.youtube.com/watch?v=xzU3HL4ZYI0
What is Flume?
Flume is a distributed, reliable service used for gathering, aggregating and transporting large amounts of streaming event data for analysis.
Event data: streaming log data (website/application logs, used to analyze user activity) or other streaming data (e.g. social media feeds to analyze an event, or stock prices to analyze a stock's performance).
Architecture and Working (figure)
Impala – an open-source SQL query engine
Developed by Cloudera and fully open source, hosted on GitHub.
Released as a beta in October 2012; version 1.0 became available in May 2013.
About Impala (figure)
What is Chukwa?
Chukwa is an open-source data collection system for monitoring large distributed systems, used for log collection and analysis. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is not a streaming database, and not a real-time system.
Why do we need Chukwa?
Data monitoring and analysis:
◦ To collect system metrics and log files
◦ To store data in Hadoop clusters
Uses MapReduce to analyze data:
◦ Robust
◦ Scalable
◦ Rapid data processing
How it Works (figure)
Data Analysis (figure)
Comparison of ETL tools
Sqoop – Features: bulk import, direct input, data interaction, data export. Advantages: parallel data transfer, efficient data analysis. Disadvantages: not easy to manage installations and configurations.
Flume – Features: fan-out, fan-in, processors, auto-batching of events, multiplexing channels for data mining. Advantages: reliable, scalable, manageable, customizable, high performance, feature-rich and fully extensible, contextual routing. Disadvantages: some delivery guarantees have to be weakened.
Kettle – Features: migrating data between applications or databases, exporting data from databases to flat files, loading data massively into databases, data cleansing, integrating applications. Advantages: higher level than code, well-tested full suite of components, data analysis tools, free. Disadvantages: does not run fast, takes some time to install.
Building a data warehouse in Hadoop using ETL tools
Copy data into HDFS with an ETL tool (e.g. Informatica), Sqoop or Flume as standard HDFS files (write once). This registers the metadata with HCatalog.
Declare the query schema in Hive or Impala, which requires no data copying or re-loading, thanks to the schema-on-read advantage of Hadoop compared with the schema-on-write constraint in an RDBMS.
Explore with SQL queries and launch BI tools (e.g. Tableau, BusinessObjects) for exploratory analytics.
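The "declare the query schema" step might look like the following hedged sketch: a Hive EXTERNAL table simply overlays a schema on files already landed in HDFS, so no data is copied or re-loaded. The table, columns, and path are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="hive-host", port=10000)
cursor = conn.cursor()

# Schema-on-read: declare a schema over files that Sqoop or Flume
# already landed in HDFS. No data is moved or converted; Hive parses
# the files at query time. Names and the location are placeholders.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE,
        order_ts STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/demo/orders'
""")
```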
HADOOP DATA MINING
What is Mahout?
Meaning: a mahout is a person who keeps and drives an elephant (an Indian term).
Mahout is a scalable open-source machine learning library hosted by Apache.
Mahout's core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.
Mahout's position (figure)
MapReduce flow in Mahout (figure)
What is H2O?
H2O scales statistics, machine learning and math over big data. H2O is extensible, and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts and experts can explore, merge, model and score datasets using a range of simple to advanced algorithms. H2O makes it fast and easy to derive insights from data through faster and better predictive modeling. H2O has a vision of online scoring and modeling in a single platform.
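A minimal sketch of the H2O Python workflow, assuming a local H2O cluster and a hypothetical CSV of customer data; the file name and column names are placeholders.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

# Start (or connect to) a local H2O cluster.
h2o.init()

# Import and parse the data in parallel; the file is a placeholder.
frame = h2o.import_file("customers.csv")
frame["churned"] = frame["churned"].asfactor()   # treat target as a class
train, test = frame.split_frame(ratios=[0.8])

# Fit a gradient boosting model; H2O distributes the work across the
# cluster's nodes, and no Java code is required from the user.
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["age", "income", "tenure"], y="churned",
            training_frame=train)

print(model.model_performance(test).auc())
```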
How is H2O different from Mahout?
Interfaces: H2O can be used from R, REST/JSON, a GUI (browser), Java or Scala; Mahout is used from Java.
Breadth: H2O is a GUI product with fewer algorithms; Mahout has more algorithms, but they require knowledge of Java.
Speed: H2O algorithms are typically 100x faster than the current MapReduce-based Mahout algorithms.
Skills: knowledge of Java is not required to develop a prediction model in H2O, but is required in Mahout.
Latency: H2O is real time; Mahout is not.
Users of H2O
Predictive modeling factories – better marketing with H2O
Advertising technology – better conversions with H2O
Risk & fraud analysis – better detection with H2O
Customer intelligence – better sales with H2O
MAP/REDUCE ALGORITHM
How to write a MapReduce program
Parallelization is the key. The algorithm is different from a single-server application:
◦ Map function
◦ Reduce function
Considerations:
◦ Load balance
◦ Efficiency
◦ Memory management
MapReduce Executes (figure)
Schematic of a map-reduce computation (figure)
Example: counting the number of occurrences of each word in a collection of documents
The input file is a repository of documents, and each document is an element. The Map function for this example uses keys of type String (the words) and integer values. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs: (w1, 1), (w2, 1), ..., (wn, 1).
Map Task
A single Map task will typically process many documents, so its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, ..., vn]), where (k, v1), (k, v2), ..., (k, vn) are all the key-value pairs with key k coming from all the Map tasks.
Reduce Task
The application of the Reduce function to a single key and its associated list of values is referred to as a reducer. Here the Reduce function simply adds up all the values, so the output of a reducer consists of the word and the sum. The output of the Reduce function is a sequence of zero or more key-value pairs; thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
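The word-count logic of the last three slides translates directly into a small Hadoop Streaming job. The sketch below puts the mapper and the reducer in one Python file; the script name and paths are illustrative.

```python
#!/usr/bin/env python3
"""Word count as described above, written for Hadoop Streaming: the
mapper emits (word, 1) pairs, the framework sorts by key, and the
reducer sums the values for each word."""
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so all pairs for
    # a given word arrive consecutively and groupby can sum them.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as: wordcount.py map   or   wordcount.py reduce
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```

With Hadoop Streaming this would be launched along the lines of `hadoop jar hadoop-streaming.jar -files wordcount.py -input docs/ -output counts/ -mapper "wordcount.py map" -reducer "wordcount.py reduce"`, with the exact streaming jar path depending on the installation.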
Big Data Visualization and Tools
Big Data Visualization and Tools
Tools: Tableau, Pentaho, Mondrian, Saiku, Spotfire, Gephi
What is Tableau?
Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag-and-drop operations.
Tableau Alliance Partners (figure)
Tableau (figure)
What is Pentaho?
Pentaho is commercial open-source software for business intelligence (BI), developed since 2004 in Orlando, Florida. Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining and a BI platform.
It is built on the Java platform and runs well on various platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.).
It offers a complete package covering reporting, ETL for data warehouse management, an OLAP server, data mining and dashboards.
The BI platform supports Pentaho's end-to-end business intelligence capabilities and provides central access to your business information, with back-end security, integration, scheduling, auditing and more.
Designed to meet the needs of any size of organization.
A few facts (figure)
Analyzer (figure)
Reports (figure)
Overall Features (figure)
HADOOP IN YOUR LAPTOP
Hortonworks Background
Hortonworks is a business computer software company based in Palo Alto, California.
Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers. It is a sponsor of the Apache Software Foundation.
Founded in June 2011 by Yahoo and Benchmark Capital as an independent company; it went public in December 2014.
Companies that have collaborated with Hortonworks:
Microsoft (October 2011) to develop for Azure and Windows Server
Informatica (November 2011) to develop HParser
Teradata (February 2012) to develop the Aster data system
SAP AG (September 2012) announced it would resell the Hortonworks distribution
They do Hadoop using HDP (figure)
Hortonworks Data Platform
Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. It includes Apache projects such as HDFS, MapReduce, Pig, Hive, HBase and ZooKeeper, among other components.
Why was it developed? It was developed with one aim: to make Apache Hadoop ready for the enterprise.
What does it do? It takes the big data components of Apache Hadoop and makes them ready for prime-time use in an enterprise environment.
HDP Functional Areas (figure)
Certified Technology Program
One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP. The Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform (HDP).
Certifications: YARN Ready, Operations Ready, Security Ready, Governance Ready.
More details: http://hortonworks.com/partners/certified/
How to get HDP?
HDP is architected, developed, and built completely in the open. Anyone can download it for free from http://hortonworks.com/hdp/downloads/. It comes in different versions that can be used as needed:
HDP 2.2 on Sandbox – runs on VirtualBox or VMware
Automated (Ambari) – RHEL/Ubuntu/CentOS/SLES
Manual – RHEL/Ubuntu/CentOS/SLES
Windows – Windows Server 2008 & 2012
Installing HDP (figure: the sandbox displays the IP address used to log in from a browser)
DEMO-HDP
Below are the steps we will be performing in HDP:
Starting HDP
Upload a source file
Load the file into HCatalog
Pig basics tutorial
About Cloudera
Cloudera is "The commercial Hadoop company".
Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo.
Provides consulting and training services for Hadoop users.
Staff includes several committers to Hadoop projects.
Who uses Cloudera? (figure)
Cloudera Software (All Open Source)
Cloudera's Distribution including Apache Hadoop (CDH):
– A single, easy-to-install package from the Apache Hadoop core repository
– Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
Components:
– Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper
– Flume, Hue, Oozie, and Sqoop
CDH and Enterprise Ecosystem (figure)
Beyond Hadoop
Hadoop is incapable of handling OLTP tasks because of its latency. Alibaba has developed its own distributed system instead of using Hadoop. Currently, it takes Alipay's system 20 ms to process a payment transaction, but 200 ms for fraud detection.
◦ "What was the transaction volume on Singles' Day 2014? While everyone was still sound asleep, the Singles' Day frenzy was getting under way. In the early hours of November 11, the spectacular Tmall Singles' Day shopping festival opened; this year, successful payments peaked at 790,000 transactions per minute, compared with 200,000 per minute last year, a four-fold increase."
12306.cn has replaced its old system with the VMware vFabric GemFire in-memory database system. This makes its services stable and robust.
HaaS (Hadoop as a Service)
HaaS examples
Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system.
Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD and the Windows Azure HDInsight Service are the primary HaaS services offered by the global IT giants.
APPENDIX 1: HADOOP ECOLOGICAL SYSTEM
Choosing the right Hadoop architecture
Application dependent
Too many solution providers
Too many choices
Teradata Big Data Platform (figure)
Dell's Hadoop ecosystem (figure)
Nokia's Big Data Architecture (figure)
Cloudera's Hadoop System (figure)
Intel (figure)
Comparison of Two Generations of Hadoop (figure)
Different Components of Hadoop (figure)
APPENDIX 2: MATRIX CALCULATION
Map/Reduce Matrix Multiplication (figure)
Map/Reduce – Scheme 1, Step 1 (figure)
Map/Reduce – Scheme 1, Step 2 (figure)
Map/Reduce – Scheme 2, Oneshot (figure)
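The schemes above are shown as figures; as a concrete companion, here is a hedged in-memory simulation of the standard one-pass ("oneshot") construction, not code from the slides: the Map phase sends each matrix element to every output cell that needs it, and each Reduce key is one cell of the product.

```python
from collections import defaultdict

def mapreduce_matmul(M, N):
    """One-pass MapReduce matrix multiplication, simulated in memory.
    M is I x K, N is K x J. Each map emission is replicated to every
    output cell that needs it, which is why this scheme's replication
    rate is high."""
    I, K, J = len(M), len(N), len(N[0])

    # Map phase: key = output cell (i, j); value = (matrix, k, element).
    groups = defaultdict(list)
    for i in range(I):
        for k in range(K):
            for j in range(J):                 # replicate M[i][k] J times
                groups[(i, j)].append(("M", k, M[i][k]))
    for k in range(K):
        for j in range(J):
            for i in range(I):                 # replicate N[k][j] I times
                groups[(i, j)].append(("N", k, N[k][j]))

    # Reduce phase: for each cell, pair up entries with equal k and sum.
    P = [[0] * J for _ in range(I)]
    for (i, j), values in groups.items():
        m = {k: v for tag, k, v in values if tag == "M"}
        n = {k: v for tag, k, v in values if tag == "N"}
        P[i][j] = sum(m[k] * n[k] for k in m)
    return P

print(mapreduce_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```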
Communication Cost
The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing that algorithm. Besides the time to execute a task, it includes the time for moving data into memory.
◦ The algorithm executed by each task tends to be very simple, often linear in the size of its input.
◦ The typical interconnect speed for a computing cluster is one gigabit per second.
◦ The time taken to move the data from a chunk into main memory may exceed the time needed to operate on the data.
Reducer size
The upper bound on the number of values that are allowed to appear in the list associated with a single key. The reducer size can be selected with at least two goals in mind:
◦ By making the reducer size small, we force there to be many reducers, across which the problem input is divided by the Map tasks.
◦ We can choose a reducer size sufficiently small that the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located. The running time is greatly reduced if we can avoid moving data repeatedly between main memory and disk.
Replication rate
The number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. That is, the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input.
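In symbols, and applied to the one-pass matrix multiplication scheme above (a standard back-of-the-envelope sketch, not a figure from the slides):

```latex
% Replication rate: average Map-to-Reduce traffic per input.
r = \frac{\text{total key-value pairs emitted by all Map tasks}}{\text{number of inputs}}

% One-pass multiplication of two $n \times n$ matrices: each of the
% $2n^{2}$ input elements is sent to the $n$ reducers (output cells)
% that need it, so the Map tasks emit $2n^{3}$ pairs in total and
r = \frac{2n^{3}}{2n^{2}} = n .
```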
Segmenting Matrix to Reduce the Cost (figure)
Map/Reduce – Scheme 3 (figure)
Map/Reduce – Scheme 4, Step 1 (figure)
Map/Reduce – Scheme 4, Step 2 (figure)