1 Hadoop & MapReduce. Zhangxi Lin, CAABI, Texas Tech University; FIFE, Southwestern University of Finance & Economics. Cellphone: 18610660375, QQ/WeChat: 155970. http://zlin.ba.ttu.edu, Zhangxi.lin@ttu.edu. 2015-06-16

2 CAABI, Texas Tech University ◦ The Center for Advanced Analytics and Business Intelligence, founded in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business. Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE), SWUFE ◦ One of two key labs in finance, founded in 2008 and sponsored by the Sichuan Provincial Government ◦ Underpinned by two areas of strength at SWUFE: information and finance

3 Know Big Data One More Step. When we talk about big data, we must know what Hadoop is. When we plan data warehousing, we must know what HDFS and NoSQL are. When we discuss data mining, we must know what Mahout and H2O are. Do you know that Hadoop data warehousing does not need dimensional modeling? Do you know how Hadoop stores heterogeneous data? Do you know what Hadoop's "Achilles' heel" is? Do you know you can install a Hadoop system on your laptop? Do you know Alibaba retired its last minicomputer in 2014? So, let's talk about Hadoop.

4 After this lecture you will: understand the challenges in big data management; understand how Hadoop and MapReduce work; be familiar with the Hadoop ecosystem; be able to install Hadoop on your laptop; be able to install a handy big data tool on your laptop to visualize and mine data.

5 Outline: Apache Hadoop; Hadoop data warehousing; Hadoop ETL; Hadoop data mining; data visualization with Hadoop; the MapReduce algorithm; setting up your Hadoop. Appendixes: ◦ The Hadoop ecological system ◦ Matrix calculation with MapReduce

6 A Traditional Business Intelligence System (diagram: SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG, MS SQL Server, BIDS)

7 Hadoop ecosystem (diagram)

8 (diagram)

9 What is Hadoop? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is not a replacement for a traditional RDBMS but a supplement for handling and processing large datasets. It achieves two tasks: 1. massive data storage; 2. faster processing. Using Hadoop is cheaper, faster, and better.

10 Hadoop 2: Big data's big leap forward. The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop's job handling: all jobs in Hadoop are run as batch processes through a single daemon called the JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework built from two daemons: the ResourceManager, which governs all jobs in the system, and the NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what is happening on that node.

11 Hadoop 1.0 vs. Hadoop 2.0. Features of Hadoop 2.0 over Hadoop 1.0: horizontal scalability of the NameNode; the NameNode is no longer a single point of failure; the ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph; the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) are split into two separate daemons.

12 Apache Spark. Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark's in-memory processing provides performance up to 100 times faster for certain applications. Spark is well suited to machine learning algorithms. Spark requires a cluster manager and a distributed storage system, and it supports Hadoop YARN.

13 MapReduce. MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.

14 MapReduce 2.0 – YARN (Yet Another Resource Negotiator) (diagram)

15 How Hadoop Operates (diagram)

16 Hadoop Ecosystem (diagram)

17 Hadoop Topics. 1. Data warehousing: HDFS, HBase, Hive, Kylin, NoSQL/NewSQL, Solr. 2. Publicly available big data services: Hortonworks, Cloudera, HaaS, EC2. 3. MapReduce & data mining: Mahout, H2O, R, Python. 4. Big data ETL: Kettle, Flume, Sqoop, Impala, Chukwa, Dremel, Pig. 5. Big data platform management: Oozie, ZooKeeper, Ambari, Loom, Ganglia. 6. Application development platform: Tomcat, Neo4j, Pig, Hue. 7. Tools & visualization: Pentaho, Tableau, Saiku, Mondrian, Gephi. 8. Streaming data processing: Spark, Storm, Kafka, Avro.

18 HADOOP DATA WAREHOUSING

19 Comparing the RDBMS and Hadoop data warehousing stacks. Storage layer: database tables (RDBMS) vs. the HDFS file system (Hadoop); HDFS is purpose-built for extreme I/O speeds. Metadata layer: system tables vs. HCatalog; all clients can use HCatalog to read files. Query layer: a single SQL query engine vs. multiple engines (SQL and non-SQL); multiple query engines such as Hive or Impala are available.

20 HDFS (Hadoop Distributed File System). The Hadoop ecosystem consists of many components and libraries for varied tasks. The storage part of Hadoop is HDFS and the processing part is MapReduce. HDFS is a Java-based distributed file system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the cluster.

21 HDFS Architecture & Design. HDFS has a master/slave architecture: a cluster consists of a single NameNode and a number of DataNodes. In HDFS, files are split into one or more blocks, which are stored in a set of DataNodes. HDFS exposes a file system namespace and allows user data to be stored in files. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
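Because HDFS exposes an ordinary file-system namespace, the quickest way to see this in action is the hdfs dfs shell. A minimal sketch; the directory and file names below are hypothetical:

```
# Create a directory in HDFS and copy a local file into it (write once)
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put weblog.txt /user/demo/input/

# List the directory, then show how the NameNode split the file into
# blocks and where the DataNodes hold the replicas
hdfs dfs -ls /user/demo/input
hdfs fsck /user/demo/input/weblog.txt -files -blocks -locations
```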

22 (diagram)

23 What is NoSQL? Stands for "Not Only SQL." NoSQL is a non-relational database management system, and it differs from traditional relational database management systems in some significant ways. NoSQL is designed for distributed data stores where very large-scale data storage is needed (for example, Google or Facebook, which collect terabytes of data every day about their users). Such data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.

24 NoSQL (diagram)

25 (diagram; credit: Praveen Ashokan)

26 What is NewSQL? Modern RDBMSs that seek to provide the scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system. SQL is the primary interface; non-locking concurrency control; high per-node performance. The H-Store parallel database system is the first known NewSQL system.

27 Classification of NoSQL and NewSQL (diagram)

28 Taxonomy of Big Data Stores (diagram)

29 Features of OldSQL vs. NoSQL vs. NewSQL (diagram)

30 (diagram)

31 HBase. HBase is a non-relational, distributed database. It is a column-oriented DBMS and an implementation of Google's BigTable. HBase is built on top of the Hadoop Distributed File System (HDFS).

32 Differences between HBase and a relational database. HBase is a column-oriented database while a relational database is row-oriented. HBase is highly scalable while an RDBMS is hard to scale. HBase has a flexible schema while an RDBMS has a fixed schema. HBase holds denormalized data while data in a relational database is normalized. HBase performs well on large volumes of unstructured data, where a relational database performs poorly. HBase has no built-in query language, while a relational database uses SQL to retrieve data.

33 HBase Data Model (diagram)

34 HBase: Keys and Column Families. Each record is divided into column families.
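To make the row-key/column-family model concrete, here is a minimal HBase shell sketch; the table, row keys, and column names are hypothetical:

```
create 'users', 'profile', 'activity'    # a table with two column families
put 'users', 'u001', 'profile:name', 'Alice'
put 'users', 'u001', 'profile:city', 'Lubbock'
put 'users', 'u001', 'activity:last_login', '2015-06-01'
get 'users', 'u001'                      # fetch one row by its key
scan 'users', {COLUMNS => 'profile'}     # scan a single column family
```

Note that the qualifiers inside a family (profile:name, profile:city) are created per row as they are written, which is what the earlier slide means by HBase's flexible schema.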

35 What is Apache Hive? Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. It is built on top of Apache Hadoop and provides tools for easy data extract/transform/load. It supports analysis of large datasets stored in Hadoop's HDFS, and it offers an SQL-like language called HQL as well as big data analytics with the help of MapReduce.

36 What is HQL? HQL: Hive Query Language. It doesn't conform to any ANSI standard and is very close to the MySQL dialect, but with some differences. SQL-to-HQL cheat sheet: http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf. HQL doesn't support transactions, so don't compare it with an RDBMS.
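A short illustration of how close HQL is to a familiar SQL dialect; the table and column names here are made up for the example:

```sql
-- Define a table over tab-delimited text files in HDFS
CREATE TABLE weblogs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar aggregation; Hive compiles it into MapReduce jobs
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status
ORDER BY hits DESC;
```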

37 HADOOP ETL

38 List of Tools: Sqoop, Flume, Impala, Chukwa, Kettle.

39 ETL (diagram)

40 Sqoop. Short for "SQL to Hadoop." Used to move data back and forth between an RDBMS and HDFS for performing analysis using BI tools. A simple command-line tool (Sqoop 2 adds a web interface as well); see the sketch below.
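A sketch of typical Sqoop 1 invocations; the connection string, credentials, table names, and paths are placeholders:

```
# Import an RDBMS table into HDFS, using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4

# Export analysis results from HDFS back to the RDBMS
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_summary \
  --export-dir /user/demo/order_summary
```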

41 How Sqoop Works (diagram: the dataset is split into slices, with one mapper per slice)

42 Sqoop 1 vs. Sqoop 2.
Connectors for all major RDBMSs: supported in Sqoop 1; not supported in Sqoop 2. Workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. This connector should work on any other JDBC-compliant database, but performance may not be comparable to that of the specialized connectors in Sqoop.
Encryption of stored passwords: not supported in Sqoop 1, with no workaround; supported in Sqoop 2 using Derby's on-disk encryption (disclaimer: although expected to work in the current version of Sqoop 2, this configuration has not been verified).
Data transfer from an RDBMS to Hive or HBase: supported in Sqoop 1; not supported in Sqoop 2. Workaround: (1) import the data from the RDBMS into HDFS; (2) load it into Hive or HBase manually using appropriate tools and commands, such as the LOAD DATA statement in Hive.
Data transfer from Hive or HBase to an RDBMS: not supported in either version. Workaround: (1) extract the data from Hive or HBase into HDFS (as a text or Avro file); (2) use Sqoop to export the output of the previous step to the RDBMS.

43 Sqoop 1 & Sqoop 2 Architecture (diagram). For more on the differences: https://www.youtube.com/watch?v=xzU3HL4ZYI0

44 What is Flume? Flume is a distributed, reliable service used for gathering, aggregating, and transporting large amounts of streaming event data for analysis. Event data: streaming log data (website/application logs, to analyze users' activity) or other streaming data (e.g., social media to analyze an event, or stock prices to analyze a stock's performance).
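A minimal Flume agent configuration, as a sketch: one exec source tailing an application log, a memory channel, and an HDFS sink. The agent name, log path, and HDFS path are hypothetical:

```
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail an application log as a stream of events
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/access.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write the events into HDFS for later analysis
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```

The agent would then be started with something like flume-ng agent --conf-file agent1.conf --name agent1.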

45 Architecture and Working (diagram)

46 Impala: an open-source SQL query engine. Developed by Cloudera and fully open source, hosted on GitHub. Released as a beta in October 2012; version 1.0 became available in May 2013.

47 About Impala (diagram)

48 What is Chukwa? Chukwa is an open-source data collection system for monitoring large distributed systems, used for log collection and analysis. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is not a streaming database and not a real-time system.

49 Why do we need Chukwa? Data monitoring and analysis: ◦ to collect system metrics and log files ◦ to store data in Hadoop clusters. It uses MapReduce to analyze data: ◦ robust ◦ scalable ◦ rapid data processing.

50 How It Works (diagram)

51 Data Analysis (diagram)

52 ETL tool comparison.
Sqoop. Features: bulk import, direct input, data interaction, data export. Advantages: parallel data transfer, efficient data analysis. Disadvantages: not easy to manage installations and configurations.
Flume. Features: fan-out, fan-in, processors, auto-batching of events, multiplexing channels for data mining. Advantages: reliable, scalable, manageable, customizable, high-performance; feature-rich and fully extensible; contextual routing. Disadvantages: some delivery guarantees have to be weakened.
Kettle. Features: migrating data between applications or databases, exporting data from databases to flat files, loading data massively into databases, data cleansing, integrating applications. Advantages: higher level than code, a well-tested full suite of components, data analysis tools, free. Disadvantages: not fast to run; takes some time to install.

53 Building a Data Warehouse in Hadoop using ETL Tools. Copy data into HDFS with an ETL tool (e.g., Informatica), Sqoop, or Flume as standard HDFS files (write once); this registers the metadata with HCatalog. Declare the query schema in Hive or Impala, which doesn't require copying or re-loading the data, thanks to the schema-on-read advantage of Hadoop compared with the schema-on-write constraint of an RDBMS. Explore with SQL queries and launch BI tools (e.g., Tableau, BusinessObjects) for exploratory analytics.
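To illustrate the schema-on-read step, a hedged Hive sketch: the files that Sqoop landed in HDFS in the earlier example are given a schema in place, with no copying or re-loading (the table layout and paths are assumptions):

```sql
-- Declare a schema over files already sitting in HDFS
CREATE EXTERNAL TABLE orders (
  order_id BIGINT,
  customer STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/orders';

-- The data is immediately queryable from Hive, Impala, or a BI tool
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer;
```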

54 HADOOP DATA MINING

55 What is Mahout? Meaning: a person who keeps and drives an elephant (an Indian term). Mahout is a scalable open-source machine learning library hosted by Apache. Mahout's core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.
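As a sketch of how Mahout drives chains of MapReduce jobs from the command line, a typical document-clustering flow looks roughly like this (exact flags vary across Mahout versions, and the paths are hypothetical):

```
# Raw text -> Hadoop SequenceFiles -> TF-IDF vectors
mahout seqdirectory -i /user/demo/docs -o /user/demo/seq
mahout seq2sparse -i /user/demo/seq -o /user/demo/vectors

# Distributed k-means over the vectors, run as MapReduce jobs
mahout kmeans \
  -i /user/demo/vectors/tfidf-vectors \
  -c /user/demo/initial-clusters \
  -o /user/demo/kmeans-output \
  -k 10 -x 20 -cl
```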

56 Mahout's position (diagram)

57 (diagram)

58 MapReduce flow in Mahout (diagram)

59 What is H2O? H2O scales statistics, machine learning, and math over big data. H2O is extensible, and users can build blocks using simple math "Legos" in the core. H2O keeps familiar interfaces like R, Excel, and JSON so that big data enthusiasts and experts can explore, merge, model, and score datasets using a range of simple to advanced algorithms. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling, and it has a vision of online scoring and modeling on a single platform.

60 How is H2O different from Mahout?
Interfaces: H2O can be used from R, REST/JSON, a GUI (browser), Java, or Scala; Mahout can be used from Java.
H2O is a GUI product with fewer algorithms; Mahout has more algorithms, which require knowledge of Java.
H2O's algorithms are typically 100x faster than the current MapReduce-based Mahout algorithms, which are slower by comparison.
Developing a prediction model with H2O does not require knowledge of Java; with Mahout it does.
H2O is real-time; Mahout is not.

61 Users of H2O: predictive modeling factories (better marketing with H2O); advertising technology (better conversions with H2O); risk & fraud analysis (better detection with H2O); customer intelligence (better sales with H2O).

62 MAP/REDUCE ALGORITHM

63 How to write a MapReduce program. Parallelization is the key, and the algorithm differs from a single-server application: ◦ a Map function ◦ a Reduce function. Considerations: ◦ load balance ◦ efficiency ◦ memory management.

64 MapReduce Executes (diagram)

65 Schematic of a MapReduce computation (diagram)

66 Example: counting the number of occurrences of each word in a collection of documents. The input file is a repository of documents, and each document is an element. The Map function for this example uses keys of type String (the words) and integer values. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs (w1, 1), (w2, 1), ..., (wn, 1).

67 Map Task. A single Map task will typically process many documents, so its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that task, then there will be m key-value pairs (w, 1) among its output. After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, ..., vn]), where (k, v1), (k, v2), ..., (k, vn) are all the key-value pairs with key k coming from all the Map tasks.

68 Reduce Task. The application of the Reduce function to a single key and its associated list of values is referred to as a reducer. The output of the Reduce function is a sequence of zero or more key-value pairs. In this example, the Reduce function simply adds up all the values, so the output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
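This word-count logic maps directly onto Hadoop's Java API; the sketch below is essentially the canonical WordCount example that ships with Hadoop (new org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (w, 1) for every word w in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // the pair (w, 1)
      }
    }
  }

  // Reduce: sum the list [1, 1, ...] for each word w -> (w, m)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner is safe here because addition is associative and commutative; it pre-aggregates the (w, 1) pairs on each mapper node and cuts the shuffle traffic to the reducers.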

69 Big Data Visualization and Tools

70 Big Data Visualization and Tools. Tools: Tableau, Pentaho, Mondrian, Saiku, Spotfire, Gephi.

71 What is Tableau? Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag-and-drop operations.

72 Tableau Alliance Partners (diagram)

73 Tableau (diagram)

74 What is Pentaho? Pentaho is commercial open-source software for business intelligence (BI), developed since 2004 in Orlando, Florida. Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining, and a BI platform. It is built on the Java platform and runs well on various operating systems (Windows, Linux, Macintosh, Solaris, Unix, etc.). It offers a complete package: reporting, ETL for data warehouse management, an OLAP server, data mining, and dashboards. The BI platform supports Pentaho's end-to-end business intelligence capabilities and provides central access to your business information, with back-end security, integration, scheduling, auditing, and more. It is designed to meet the needs of any size of organization.

75 A few facts (diagram)

76 (diagram)

77 Analyzer (diagram)

78 Reports (diagram)

79 Overall Features (diagram)

80 HADOOP ON YOUR LAPTOP

81 Hortonworks Background. Hortonworks is a business computer software company based in Palo Alto, California. Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers. The company is a sponsor of the Apache Software Foundation. Founded in June 2011 by Yahoo and Benchmark Capital as an independent company, it went public in December 2014. Companies that have collaborated with Hortonworks: Microsoft (October 2011), to develop Azure and Windows Server support; Informatica (November 2011), to develop HParser; Teradata (February 2012), to develop the Aster data system; SAP AG (September 2012), which announced it would resell the Hortonworks distribution.

82 They do Hadoop using HDP (diagram)

83 Hortonworks Data Platform. Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. It includes Apache projects such as HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper, among other components. Why was it developed? It was developed with one aim: to make Apache Hadoop ready for the enterprise. What does it do? It takes the big data components of Apache Hadoop and makes them ready for prime-time use in an enterprise environment.

84 HDP Functional Areas (diagram)

85 Certified Technology Program. One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP. The Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform (HDP). Certifications include YARN Ready, Operations Ready, Security Ready, and Governance Ready. More details: http://hortonworks.com/partners/certified/

86 How to get HDP? HDP is architected, developed, and built completely in the open. Anyone can download it for free from http://hortonworks.com/hdp/downloads/. It comes in different versions, which can be used as needed: HDP 2.2 on Sandbox, which runs on VirtualBox or VMware; Automated (Ambari), for RHEL/Ubuntu/CentOS/SLES; Manual, for RHEL/Ubuntu/CentOS/SLES; Windows, for Windows Server 2008 & 2012.

87 Installing HDP (screenshot: the IP address used to log in from the browser)

88 Demo: HDP. Below are the steps we will be performing in HDP: starting HDP; uploading a source file; loading the file into HCatalog; the Pig basics tutorial.

89 (diagram)

90 About Cloudera. Cloudera is "the commercial Hadoop company." It was founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo. It provides consulting and training services for Hadoop users, and its staff includes several committers to Hadoop projects.

91 Who uses Cloudera? (diagram)

92 Cloudera Software (all open source). Cloudera's Distribution including Apache Hadoop (CDH): a single, easy-to-install package from the Apache Hadoop core repository that includes a stable version of Hadoop plus critical bug fixes and solid new features from the development version. Components: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper, plus Flume, Hue, Oozie, and Sqoop.

93 CDH and the Enterprise Ecosystem (diagram)

94 Beyond Hadoop. Hadoop is incapable of handling OLTP tasks because of its latency. Alibaba has developed its own distributed system instead of using Hadoop. Currently, it takes Alipay's system 20 ms to process a payment transaction, but 200 ms for fraud detection. ◦ "What was the sales volume on Singles' Day 2014? While most people were still asleep, the Singles' Day frenzy was under way. In the early hours of November 11, Tmall's Singles' Day shopping festival opened; this year successful payments peaked at 790,000 transactions per minute, versus 200,000 per minute last year, a four-fold increase." 12306.cn has replaced its old system with the VMware vFabric GemFire in-memory database system, which makes its services stable and robust.

95 HaaS (Hadoop as a Service)

96 HaaS examples. Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system. Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD, and the Windows Azure HDInsight Service are the primary HaaS services from the global IT giants.

97 APPENDIX 1: HADOOP ECOLOGICAL SYSTEM

98 Choosing the right Hadoop architecture. It is application dependent; there are too many solution providers and too many choices.

99 Teradata Big Data Platform (diagram)

100 Dell's Hadoop ecosystem (diagram)

101 Nokia's Big Data Architecture (diagram)

102 Cloudera's Hadoop System (diagram)

103 (diagram)

104 (diagram)

105 Intel (diagram)

106 Comparison of Two Generations of Hadoop (diagram)

107 (diagram)

108 (diagram)

109 Different Components of Hadoop (diagram)

110 (diagram)

111 APPENDIX 2: MATRIX CALCULATION

112 Map/Reduce Matrix Multiplication (diagram)

113 Map/Reduce – Scheme 1, Step 1 (diagram)

114 Map/Reduce – Scheme 1, Step 2 (diagram)

115 Map/Reduce – Scheme 2, One-shot (diagram)
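Slides 112-115 are diagrams. In text form, the standard formulation of these two schemes for computing $P = MN$ (following Leskovec, Rajaraman & Ullman, Mining of Massive Datasets, on the assumption that this is what the diagrams depict) is:

```latex
% Scheme 1, Step 1 (Map): key on the shared middle index j
m_{ij} \;\mapsto\; \big(j,\ (M,\, i,\, m_{ij})\big), \qquad
n_{jk} \;\mapsto\; \big(j,\ (N,\, k,\, n_{jk})\big)

% Scheme 1, Step 1 (Reduce on j): emit one product per (i,k) pair
\big((i,k),\ m_{ij}\, n_{jk}\big)

% Scheme 1, Step 2 (Reduce on (i,k)): sum the products
p_{ik} = \sum_{j} m_{ij}\, n_{jk}

% Scheme 2, one-shot (Map): replicate each element to every key needing it
m_{ij} \;\mapsto\; \big((i,k),\ (M,\, j,\, m_{ij})\big) \ \text{for all } k, \qquad
n_{jk} \;\mapsto\; \big((i,k),\ (N,\, j,\, n_{jk})\big) \ \text{for all } i
% Reduce on (i,k): match the M and N entries on j, multiply, and sum.
```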

116 Communication Cost. The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing it. In addition to the time needed to execute a task, it includes the time for moving data into memory. ◦ The algorithm executed by each task tends to be very simple, often linear in the size of its input. ◦ The typical interconnect speed for a computing cluster is one gigabit per second. ◦ The time taken to move data from a chunk into main memory may exceed the time needed to operate on the data.

117 Reducer size. The upper bound on the number of values allowed to appear in the list associated with a single key. Reducer size can be selected with at least two goals in mind: ◦ by making the reducer size small, we force there to be many reducers, among which the problem input is divided by the Map tasks; ◦ we can choose a reducer size small enough that the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located. The running time will be greatly reduced if we can avoid moving data repeatedly between main memory and disk.

118 Replication rate. The number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs; that is, the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input.
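In symbols, and applied to the matrix-multiplication schemes above (a sketch following the same Mining of Massive Datasets analysis; the banded figures assume the segmentation slides that follow match that treatment):

```latex
r \;=\; \frac{\text{number of key-value pairs produced by all Map tasks}}
             {\text{number of inputs}}

% One-shot multiplication of n x n matrices: each element is replicated
% to n reducers, and reducer (i,k) receives row i of M plus column k of N:
r = n, \qquad q = 2n

% Segmenting the matrices into g bands of rows/columns (Schemes 3 and 4):
r = g, \qquad q = \frac{2n^2}{g}, \qquad \text{so}\quad q\,r = 2n^2
```

The product $qr = 2n^2$ captures the trade-off the segmentation slides illustrate: lowering the replication rate (communication) necessarily raises the reducer size (memory), and vice versa.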

119 Segmenting the Matrix to Reduce the Cost (diagram)

120 Map/Reduce – Scheme 3 (diagram)

121 Map/Reduce – Scheme 4, Step 1 (diagram)

122 Map/Reduce – Scheme 4, Step 2 (diagram)

