1 Hadoop & MapReduce. Zhangxi Lin, CAABI, Texas Tech University; FIFE, Southwestern University of Finance & Economics. Cellphone: 18610660375, QQ/WeChat: 155970. http://zlin.ba.ttu.edu, Zhangxi.lin@ttu.edu. 2015-06-16

2 CAABI, Texas Tech University ◦ The Center for Advanced Analytics and Business Intelligence, founded in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business. Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE), SWUFE ◦ One of two key labs in finance, founded in 2008 and sponsored by the Sichuan Provincial Government ◦ Underpinned by two areas of strength at SWUFE: information and finance

3 Know Big Data One More Step. When we talk about big data, we must know what Hadoop is. When we plan data warehousing, we must know what HDFS and NoSQL are. When we discuss data mining, we must know what Mahout and H2O are. Do you know that Hadoop data warehousing does not need dimensional modeling? Do you know how Hadoop stores heterogeneous data? Do you know what Hadoop's "Achilles' heel" is? Do you know you can install a Hadoop system on your laptop? Do you know Alibaba retired its last minicomputer in 2014? So, let's talk about Hadoop.

4 After this lecture you will: understand the challenges in big data management; understand how Hadoop and MapReduce work; be familiar with the Hadoop ecosystem; be able to install Hadoop on your laptop; be able to install a handy big data tool on your laptop to visualize and mine data.

5 Outline: Apache Hadoop; Hadoop data warehousing; Hadoop ETL; Hadoop data mining; data visualization with Hadoop; the MapReduce algorithm; setting up your Hadoop. Appendixes: ◦ The Hadoop ecological system ◦ Matrix calculation with MapReduce

6 A Traditional Business Intelligence System (diagram: SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG, MS SQL Server, BIDS)

7 Hadoop ecosystem (diagram)

8 (diagram)

9 What is Hadoop? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is not a replacement for a traditional RDBMS but a supplement for handling and processing large datasets. It achieves two tasks: 1. massive data storage; 2. faster processing. Using Hadoop is cheaper, faster, and better.

10 Hadoop 2: Big data's big leap forward. The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop's job handling: all jobs in Hadoop are run as batch processes through a single daemon called the JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework built from two daemons: the ResourceManager, which governs all jobs in the system, and the NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what is happening on that node.

11 Hadoop 1.0 vs. Hadoop 2.0. Features of Hadoop 2.0 over Hadoop 1.0: horizontal scalability of the NameNode; the NameNode is no longer a single point of failure; the ability to process terabytes and petabytes of data in HDFS using non-MapReduce applications such as MPI and Giraph; the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) are split into two separate daemons.

12 Apache Spark. Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark's in-memory processing provides performance up to 100 times faster for certain applications. Spark is well suited to machine learning algorithms. Spark requires a cluster manager and a distributed storage system, and it supports Hadoop YARN.

13 MapReduce. MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.

14 MapReduce 2.0 – YARN (Yet Another Resource Negotiator) (diagram)

15 How Hadoop Operates (diagram)

16 Hadoop Ecosystem (diagram)

17 Hadoop Topics. 1. Data warehousing: HDFS, HBase, Hive, Kylin, NoSQL/NewSQL, Solr. 2. Publicly available big data services: Hortonworks, Cloudera, HaaS, EC2. 3. MapReduce & data mining: Mahout, H2O, R, Python. 4. Big data ETL: Kettle, Flume, Sqoop, Impala, Chukwa, Dremel, Pig. 5. Big data platform management: Oozie, ZooKeeper, Ambari, Loom, Ganglia. 6. Application development platform: Tomcat, Neo4j, Pig, Hue. 7. Tools & visualization: Pentaho, Tableau, Saiku, Mondrian, Gephi. 8. Streaming data processing: Spark, Storm, Kafka, Avro.

18 HADOOP DATA WAREHOUSING

19 Comparing the RDBMS and Hadoop data warehousing stacks. Storage layer: database tables (RDBMS) vs. the HDFS file system (Hadoop); HDFS is purpose-built for extreme I/O speeds. Metadata layer: system tables vs. HCatalog; all clients can use HCatalog to read files. Query layer: a single SQL query engine vs. multiple engines (SQL and non-SQL); multiple query engines such as Hive or Impala are available.

20 HDFS (Hadoop Distributed File System). The Hadoop ecosystem consists of many components and libraries for varied tasks. The storage part of Hadoop is HDFS and the processing part is MapReduce. HDFS is a Java-based distributed file system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the cluster.

21 HDFS Architecture & Design. HDFS has a master/slave architecture: a cluster consists of a single NameNode and a number of DataNodes. In HDFS, files are split into one or more blocks, which are stored in a set of DataNodes. HDFS exposes a file system namespace and allows user data to be stored in files. DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
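Because HDFS exposes an ordinary file-system namespace, the quickest way to see this in action is the hdfs dfs shell. A minimal sketch; the directory and file names below are hypothetical:

```
# Create a directory in HDFS and copy a local file into it (write once)
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put weblog.txt /user/demo/input/

# List the directory, then show how the NameNode split the file into
# blocks and where the DataNodes hold the replicas
hdfs dfs -ls /user/demo/input
hdfs fsck /user/demo/input/weblog.txt -files -blocks -locations
```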

22 (diagram)

23 What is NoSQL? Stands for "Not Only SQL." NoSQL is a non-relational database management system, and it differs from traditional relational database management systems in some significant ways. NoSQL is designed for distributed data stores where very large-scale data storage is needed (for example, Google or Facebook, which collect terabytes of data every day about their users). Such data stores may not require a fixed schema, avoid join operations, and typically scale horizontally.

24 NoSQL (diagram)

25 (diagram; credit: Praveen Ashokan)

26 What is NewSQL? Modern RDBMSs that seek to provide the scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system. SQL is the primary interface; non-locking concurrency control; high per-node performance. The H-Store parallel database system is the first known NewSQL system.

27 Classification of NoSQL and NewSQL (diagram)

28 Taxonomy of Big Data Stores (diagram)

29 Features of OldSQL vs. NoSQL vs. NewSQL (diagram)

30 (diagram)

31 HBase. HBase is a non-relational, distributed database. It is a column-oriented DBMS and an implementation of Google's BigTable. HBase is built on top of the Hadoop Distributed File System (HDFS).

32 Differences between HBase and a relational database. HBase is a column-oriented database while a relational database is row-oriented. HBase is highly scalable while an RDBMS is hard to scale. HBase has a flexible schema while an RDBMS has a fixed schema. HBase holds denormalized data while data in a relational database is normalized. HBase performs well on large volumes of unstructured data, where a relational database performs poorly. HBase has no built-in query language, while a relational database uses SQL to retrieve data.

33 HBase Data Model (diagram)

34 HBase: Keys and Column Families. Each record is divided into column families.
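To make the row-key/column-family model concrete, here is a minimal HBase shell sketch; the table, row keys, and column names are hypothetical:

```
create 'users', 'profile', 'activity'    # a table with two column families
put 'users', 'u001', 'profile:name', 'Alice'
put 'users', 'u001', 'profile:city', 'Lubbock'
put 'users', 'u001', 'activity:last_login', '2015-06-01'
get 'users', 'u001'                      # fetch one row by its key
scan 'users', {COLUMNS => 'profile'}     # scan a single column family
```

Note that the qualifiers inside a family (profile:name, profile:city) are created per row as they are written, which is what the earlier slide means by HBase's flexible schema.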

35 What is Apache Hive? Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. It is built on top of Apache Hadoop and provides tools for easy data extract/transform/load. It supports analysis of large datasets stored in Hadoop's HDFS, and it offers an SQL-like language called HQL as well as big data analytics with the help of MapReduce.

36 What is HQL? HQL: Hive Query Language. It doesn't conform to any ANSI standard and is very close to the MySQL dialect, but with some differences. SQL-to-HQL cheat sheet: http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf. HQL doesn't support transactions, so don't compare it with an RDBMS.
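A short illustration of how close HQL is to a familiar SQL dialect; the table and column names here are made up for the example:

```sql
-- Define a table over tab-delimited text files in HDFS
CREATE TABLE weblogs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- A familiar aggregation; Hive compiles it into MapReduce jobs
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status
ORDER BY hits DESC;
```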

37 HADOOP ETL

38 List of Tools: Sqoop, Flume, Impala, Chukwa, Kettle.

39 ETL (diagram)

40 Sqoop. Short for "SQL to Hadoop." Used to move data back and forth between an RDBMS and HDFS for performing analysis using BI tools. A simple command-line tool (Sqoop 2 adds a web interface as well); see the sketch below.
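A sketch of typical Sqoop 1 invocations; the connection string, credentials, table names, and paths are placeholders:

```
# Import an RDBMS table into HDFS, using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4

# Export analysis results from HDFS back to the RDBMS
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_summary \
  --export-dir /user/demo/order_summary
```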

41 How Sqoop Works (diagram: the dataset is split into slices, with one mapper per slice)

42 Sqoop 1 vs. Sqoop 2.
Connectors for all major RDBMSs: supported in Sqoop 1; not supported in Sqoop 2. Workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. This connector should work on any other JDBC-compliant database, but performance may not be comparable to that of the specialized connectors in Sqoop.
Encryption of stored passwords: not supported in Sqoop 1, with no workaround; supported in Sqoop 2 using Derby's on-disk encryption (disclaimer: although expected to work in the current version of Sqoop 2, this configuration has not been verified).
Data transfer from an RDBMS to Hive or HBase: supported in Sqoop 1; not supported in Sqoop 2. Workaround: (1) import the data from the RDBMS into HDFS; (2) load it into Hive or HBase manually using appropriate tools and commands, such as the LOAD DATA statement in Hive.
Data transfer from Hive or HBase to an RDBMS: not supported in either version. Workaround: (1) extract the data from Hive or HBase into HDFS (as a text or Avro file); (2) use Sqoop to export the output of the previous step to the RDBMS.

43 Sqoop 1 & Sqoop 2 Architecture (diagram). For more on the differences: https://www.youtube.com/watch?v=xzU3HL4ZYI0

44 What is Flume? Flume is a distributed, reliable service used for gathering, aggregating, and transporting large amounts of streaming event data for analysis. Event data: streaming log data (website/application logs, to analyze users' activity) or other streaming data (e.g., social media to analyze an event, or stock prices to analyze a stock's performance).
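A minimal Flume agent configuration, as a sketch: one exec source tailing an application log, a memory channel, and an HDFS sink. The agent name, log path, and HDFS path are hypothetical:

```
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: tail an application log as a stream of events
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/access.log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write the events into HDFS for later analysis
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```

The agent would then be started with something like flume-ng agent --conf-file agent1.conf --name agent1.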

45 Architecture and Working (diagram)

46 Impala: an open-source SQL query engine. Developed by Cloudera and fully open source, hosted on GitHub. Released as a beta in October 2012; version 1.0 became available in May 2013.

47 About Impala (diagram)

48 What is Chukwa? Chukwa is an open-source data collection system for monitoring large distributed systems, used for log collection and analysis. It is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework. It is not a streaming database and not a real-time system.

49 Why do we need Chukwa? Data monitoring and analysis: ◦ to collect system metrics and log files ◦ to store data in Hadoop clusters. It uses MapReduce to analyze data: ◦ robust ◦ scalable ◦ rapid data processing.

50 How It Works (diagram)

51 Data Analysis (diagram)

52 ETL tool comparison.
Sqoop. Features: bulk import, direct input, data interaction, data export. Advantages: parallel data transfer, efficient data analysis. Disadvantages: not easy to manage installations and configurations.
Flume. Features: fan-out, fan-in, processors, auto-batching of events, multiplexing channels for data mining. Advantages: reliable, scalable, manageable, customizable, high-performance; feature-rich and fully extensible; contextual routing. Disadvantages: some delivery guarantees have to be weakened.
Kettle. Features: migrating data between applications or databases, exporting data from databases to flat files, loading data massively into databases, data cleansing, integrating applications. Advantages: higher level than code, a well-tested full suite of components, data analysis tools, free. Disadvantages: not fast to run; takes some time to install.

53 Building a Data Warehouse in Hadoop using ETL Tools. Copy data into HDFS with an ETL tool (e.g., Informatica), Sqoop, or Flume as standard HDFS files (write once); this registers the metadata with HCatalog. Declare the query schema in Hive or Impala, which doesn't require copying or re-loading the data, thanks to the schema-on-read advantage of Hadoop compared with the schema-on-write constraint of an RDBMS. Explore with SQL queries and launch BI tools (e.g., Tableau, BusinessObjects) for exploratory analytics.
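To illustrate the schema-on-read step, a hedged Hive sketch: the files that Sqoop landed in HDFS in the earlier example are given a schema in place, with no copying or re-loading (the table layout and paths are assumptions):

```sql
-- Declare a schema over files already sitting in HDFS
CREATE EXTERNAL TABLE orders (
  order_id BIGINT,
  customer STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/orders';

-- The data is immediately queryable from Hive, Impala, or a BI tool
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer;
```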

54 HADOOP DATA MINING

55 What is Mahout? Meaning: a person who keeps and drives an elephant (an Indian term). Mahout is a scalable open-source machine learning library hosted by Apache. Mahout's core algorithms are implemented on top of Apache Hadoop using the MapReduce paradigm.
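As a sketch of how Mahout drives chains of MapReduce jobs from the command line, a typical document-clustering flow looks roughly like this (exact flags vary across Mahout versions, and the paths are hypothetical):

```
# Raw text -> Hadoop SequenceFiles -> TF-IDF vectors
mahout seqdirectory -i /user/demo/docs -o /user/demo/seq
mahout seq2sparse -i /user/demo/seq -o /user/demo/vectors

# Distributed k-means over the vectors, run as MapReduce jobs
mahout kmeans \
  -i /user/demo/vectors/tfidf-vectors \
  -c /user/demo/initial-clusters \
  -o /user/demo/kmeans-output \
  -k 10 -x 20 -cl
```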

56 Mahout's position (diagram)

57 (diagram)

58 MapReduce flow in Mahout (diagram)

59 What is H2O? H2O scales statistics, machine learning, and math over big data. H2O is extensible, and users can build blocks using simple math "Legos" in the core. H2O keeps familiar interfaces like R, Excel, and JSON so that big data enthusiasts and experts can explore, merge, model, and score datasets using a range of simple to advanced algorithms. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling, and it has a vision of online scoring and modeling on a single platform.

60 How is H2O different from Mahout?
Interfaces: H2O can be used from R, REST/JSON, a GUI (browser), Java, or Scala; Mahout can be used from Java.
H2O is a GUI product with fewer algorithms; Mahout has more algorithms, which require knowledge of Java.
H2O's algorithms are typically 100x faster than the current MapReduce-based Mahout algorithms, which are slower by comparison.
Developing a prediction model with H2O does not require knowledge of Java; with Mahout it does.
H2O is real-time; Mahout is not.

61 Users of H2O: predictive modeling factories (better marketing with H2O); advertising technology (better conversions with H2O); risk & fraud analysis (better detection with H2O); customer intelligence (better sales with H2O).

62 MAP/REDUCE ALGORITHM

63 How to write a MapReduce program. Parallelization is the key, and the algorithm differs from a single-server application: ◦ a Map function ◦ a Reduce function. Considerations: ◦ load balance ◦ efficiency ◦ memory management.

64 MapReduce Executes (diagram)

65 Schematic of a MapReduce computation (diagram)

66 Example: counting the number of occurrences of each word in a collection of documents. The input file is a repository of documents, and each document is an element. The Map function for this example uses keys of type String (the words) and integer values. The Map task reads a document and breaks it into its sequence of words w1, w2, ..., wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs (w1, 1), (w2, 1), ..., (wn, 1).

67 Map Task. A single Map task will typically process many documents, so its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that task, then there will be m key-value pairs (w, 1) among its output. After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, ..., vn]), where (k, v1), (k, v2), ..., (k, vn) are all the key-value pairs with key k coming from all the Map tasks.

68 Reduce Task. The application of the Reduce function to a single key and its associated list of values is referred to as a reducer. The output of the Reduce function is a sequence of zero or more key-value pairs. In this example, the Reduce function simply adds up all the values, so the output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
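This word-count logic maps directly onto Hadoop's Java API; the sketch below is essentially the canonical WordCount example that ships with Hadoop (new org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (w, 1) for every word w in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // the pair (w, 1)
      }
    }
  }

  // Reduce: sum the list [1, 1, ...] for each word w -> (w, m)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner is safe here because addition is associative and commutative; it pre-aggregates the (w, 1) pairs on each mapper node and cuts the shuffle traffic to the reducers.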

69 Big Data Visualization and Tools

70 Big Data Visualization and Tools. Tools: Tableau, Pentaho, Mondrian, Saiku, Spotfire, Gephi.

71 What is Tableau? Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag-and-drop operations.

72 Tableau Alliance Partners (diagram)

73 Tableau (diagram)

74 What is Pentaho? Pentaho is commercial open-source software for business intelligence (BI), developed since 2004 in Orlando, Florida. Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining, and a BI platform. It is built on the Java platform and runs well on various operating systems (Windows, Linux, Macintosh, Solaris, Unix, etc.). It offers a complete package: reporting, ETL for data warehouse management, an OLAP server, data mining, and dashboards. The BI platform supports Pentaho's end-to-end business intelligence capabilities and provides central access to your business information, with back-end security, integration, scheduling, auditing, and more. It is designed to meet the needs of any size of organization.

75 A few facts (diagram)

76 (diagram)

77 Analyzer (diagram)

78 Reports (diagram)

79 Overall Features (diagram)

80 HADOOP ON YOUR LAPTOP

81 Hortonworks Background. Hortonworks is a business computer software company based in Palo Alto, California. Hortonworks supports and develops the Apache Hadoop framework, which allows distributed processing of large data sets across clusters of computers. The company is a sponsor of the Apache Software Foundation. Founded in June 2011 by Yahoo and Benchmark Capital as an independent company, it went public in December 2014. Companies that have collaborated with Hortonworks: Microsoft (October 2011), to develop Azure and Windows Server support; Informatica (November 2011), to develop HParser; Teradata (February 2012), to develop the Aster data system; SAP AG (September 2012), which announced it would resell the Hortonworks distribution.

82 They do Hadoop using HDP (diagram)

83 Hortonworks Data Platform. Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. It includes Apache projects such as HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper, among other components. Why was it developed? It was developed with one aim: to make Apache Hadoop ready for the enterprise. What does it do? It takes the big data components of Apache Hadoop and makes them ready for prime-time use in an enterprise environment.

84 HDP Functional Areas (diagram)

85 Certified Technology Program. One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP. The Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and the Hortonworks Data Platform (HDP). Certifications include YARN Ready, Operations Ready, Security Ready, and Governance Ready. More details: http://hortonworks.com/partners/certified/

86 How to get HDP? HDP is architected, developed, and built completely in the open. Anyone can download it for free from http://hortonworks.com/hdp/downloads/. It comes in different versions, which can be used as needed: HDP 2.2 on Sandbox, which runs on VirtualBox or VMware; Automated (Ambari), for RHEL/Ubuntu/CentOS/SLES; Manual, for RHEL/Ubuntu/CentOS/SLES; Windows, for Windows Server 2008 & 2012.

87 Installing HDP (screenshot: the IP address used to log in from the browser)

88 Demo: HDP. Below are the steps we will be performing in HDP: starting HDP; uploading a source file; loading the file into HCatalog; the Pig basics tutorial.

89 (diagram)

90 About Cloudera. Cloudera is "the commercial Hadoop company." It was founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo. It provides consulting and training services for Hadoop users, and its staff includes several committers to Hadoop projects.

91 Who uses Cloudera? (diagram)

92 Cloudera Software (all open source). Cloudera's Distribution including Apache Hadoop (CDH): a single, easy-to-install package from the Apache Hadoop core repository that includes a stable version of Hadoop plus critical bug fixes and solid new features from the development version. Components: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper, plus Flume, Hue, Oozie, and Sqoop.

93 CDH and the Enterprise Ecosystem (diagram)

94 Beyond Hadoop. Hadoop is incapable of handling OLTP tasks because of its latency. Alibaba has developed its own distributed system instead of using Hadoop. Currently, it takes Alipay's system 20 ms to process a payment transaction, but 200 ms for fraud detection. ◦ "What was the sales volume on Singles' Day 2014? While most people were still asleep, the Singles' Day frenzy was under way. In the early hours of November 11, Tmall's Singles' Day shopping festival opened; this year successful payments peaked at 790,000 transactions per minute, versus 200,000 per minute last year, a four-fold increase." 12306.cn has replaced its old system with the VMware vFabric GemFire in-memory database system, which makes its services stable and robust.

95 HaaS (Hadoop as a Service)

96 HaaS examples. Amazon Web Services (AWS): Amazon Elastic MapReduce (EMR) provides a Hadoop-based platform for data analysis, with S3 as the storage system and EC2 as the compute system. Microsoft HDInsight, Cloudera CDH3, IBM InfoSphere BigInsights, EMC Greenplum HD, and the Windows Azure HDInsight Service are the primary HaaS services from the global IT giants.

97 APPENDIX 1: HADOOP ECOLOGICAL SYSTEM

98 Choosing the right Hadoop architecture. It is application dependent; there are too many solution providers and too many choices.

99 Teradata Big Data Platform (diagram)

100 Dell's Hadoop ecosystem (diagram)

101 Nokia's Big Data Architecture (diagram)

102 Cloudera's Hadoop System (diagram)

103 (diagram)

104 (diagram)

105 Intel (diagram)

106 Comparison of Two Generations of Hadoop (diagram)

107 (diagram)

108 (diagram)

109 Different Components of Hadoop (diagram)

110 (diagram)

111 APPENDIX 2: MATRIX CALCULATION

112 Map/Reduce Matrix Multiplication (diagram)

113 Map/Reduce – Scheme 1, Step 1 (diagram)

114 Map/Reduce – Scheme 1, Step 2 (diagram)

115 Map/Reduce – Scheme 2, One-shot (diagram)
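Slides 112-115 are diagrams. In text form, the standard formulation of these two schemes for computing $P = MN$ (following Leskovec, Rajaraman & Ullman, Mining of Massive Datasets, on the assumption that this is what the diagrams depict) is:

```latex
% Scheme 1, Step 1 (Map): key on the shared middle index j
m_{ij} \;\mapsto\; \big(j,\ (M,\, i,\, m_{ij})\big), \qquad
n_{jk} \;\mapsto\; \big(j,\ (N,\, k,\, n_{jk})\big)

% Scheme 1, Step 1 (Reduce on j): emit one product per (i,k) pair
\big((i,k),\ m_{ij}\, n_{jk}\big)

% Scheme 1, Step 2 (Reduce on (i,k)): sum the products
p_{ik} = \sum_{j} m_{ij}\, n_{jk}

% Scheme 2, one-shot (Map): replicate each element to every key needing it
m_{ij} \;\mapsto\; \big((i,k),\ (M,\, j,\, m_{ij})\big) \ \text{for all } k, \qquad
n_{jk} \;\mapsto\; \big((i,k),\ (N,\, j,\, n_{jk})\big) \ \text{for all } i
% Reduce on (i,k): match the M and N entries on j, multiply, and sum.
```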

116 Communication Cost. The communication cost of an algorithm is the sum of the communication costs of all the tasks implementing it. In addition to the time needed to execute a task, it includes the time for moving data into memory. ◦ The algorithm executed by each task tends to be very simple, often linear in the size of its input. ◦ The typical interconnect speed for a computing cluster is one gigabit per second. ◦ The time taken to move data from a chunk into main memory may exceed the time needed to operate on the data.

117 Reducer size. The upper bound on the number of values allowed to appear in the list associated with a single key. Reducer size can be selected with at least two goals in mind: ◦ by making the reducer size small, we force there to be many reducers, among which the problem input is divided by the Map tasks; ◦ we can choose a reducer size small enough that the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located. The running time will be greatly reduced if we can avoid moving data repeatedly between main memory and disk.

118 Replication rate. The number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs; that is, the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input.
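In symbols, and applied to the matrix-multiplication schemes above (a sketch following the same Mining of Massive Datasets analysis; the banded figures assume the segmentation slides that follow match that treatment):

```latex
r \;=\; \frac{\text{number of key-value pairs produced by all Map tasks}}
             {\text{number of inputs}}

% One-shot multiplication of n x n matrices: each element is replicated
% to n reducers, and reducer (i,k) receives row i of M plus column k of N:
r = n, \qquad q = 2n

% Segmenting the matrices into g bands of rows/columns (Schemes 3 and 4):
r = g, \qquad q = \frac{2n^2}{g}, \qquad \text{so}\quad q\,r = 2n^2
```

The product $qr = 2n^2$ captures the trade-off the segmentation slides illustrate: lowering the replication rate (communication) necessarily raises the reducer size (memory), and vice versa.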

119 Segmenting the Matrix to Reduce the Cost (diagram)

120 Map/Reduce – Scheme 3 (diagram)

121 Map/Reduce – Scheme 4, Step 1 (diagram)

122 Map/Reduce – Scheme 4, Step 2 (diagram)

