


1 Data Science Hadoop YARN Rodney Nielsen

2 Rodney Nielsen, Human Intelligence & Language Technologies Lab
  Outline
    Classical Hadoop
      What's it all about
      Hadoop Distributed File System (HDFS)
      Hadoop Distributed Processing, MapReduce
      Scalability and other issues
    YARN
      What's it all about
      Architecture
        Resource Manager
        Client
        Application Master
        Node Manager
        Containers

3 Hadoop
  Perhaps the most widely used tool for processing Big Data
  Apache open source framework for:
    Distributed storage: Hadoop Distributed File System (HDFS)
    Distributed processing: MapReduce
  Robust to hardware failure
  Runs on commodity hardware
  Based on Google research papers on MapReduce and the Google File System
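To make the MapReduce processing model named on this slide more concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration, and everything runs in one process, whereas Hadoop would spread the map and reduce tasks across the cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The point of the split is that each map call sees only its own record and each reduce call sees only one key, which is what lets a framework run them in parallel on different machines.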

4 Hadoop-related Packages
  Apache Flume, Apache HBase, Apache Hive, Apache Oozie, Apache Phoenix, Apache Pig, Apache Spark, Apache Storm, Apache Sqoop, Apache ZooKeeper, Cloudera Impala, etc.

5 HDFS: NameNode and DataNodes
  [Figure: a NameNode and DataNodes holding file blocks; scale labels: GBs – TBs, 100+ PBs]
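The NameNode / DataNode / block relationship on this slide can be sketched as a toy model. This is an illustration only, under stated assumptions: the block size (128 MB) and replication factor (3) are assumed values that are configurable in a real cluster, the function names and the file name are hypothetical, and the round-robin placement is a deliberate simplification, not the placement policy HDFS actually uses.

```python
BLOCK_SIZE_MB = 128   # assumed block size; configurable in a real cluster
REPLICATION = 3       # assumed replication factor; also configurable

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def build_block_map(file_name, file_size_mb, data_nodes):
    """Toy NameNode metadata: block id -> DataNodes holding a replica.

    Round-robin placement is for illustration only; it yields distinct
    replicas only when there are at least REPLICATION DataNodes.
    """
    block_map = {}
    for i, _ in enumerate(split_into_blocks(file_size_mb)):
        replicas = [data_nodes[(i + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
        block_map[f"{file_name}#blk{i}"] = replicas
    return block_map

blocks = split_into_blocks(300)
block_map = build_block_map("logs.txt", 300, ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(block_map)
```

In this model the NameNode stores only the mapping (metadata), while the DataNodes store the block contents, which is the division of labor the slide's figure depicts.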

6 HDFS Rack Awareness
  [Figure: Rack 1, Rack 2, Rack 3, each with a rack switch and DataNodes; label: (A, B)]

7 Yahoo! HDFS Configuration, ~2008
  Facebook: 100 PB in June 2012, growing by 0.5 PB/day, so roughly 0.8 EB today
  By 2013, about half of the Fortune 50 used Hadoop
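The Facebook estimate on this slide is a linear extrapolation, and the arithmetic can be checked with a short sketch. The assumptions here are mine, not the slide's: growth stays constant at 0.5 PB/day, and 1 EB is taken as 1000 PB. The slide does not say what date "today" refers to, so the code only computes how many days of growth the 0.8 EB figure implies.

```python
START_PB = 100.0           # Facebook's reported size in June 2012 (from the slide)
GROWTH_PB_PER_DAY = 0.5    # reported growth rate (from the slide)
PB_PER_EB = 1000.0         # assumption: decimal units, 1 EB = 1000 PB

def projected_eb(days):
    """Projected size in EB after `days` of constant linear growth."""
    return (START_PB + GROWTH_PB_PER_DAY * days) / PB_PER_EB

def days_to_reach(target_eb):
    """Days of constant growth needed to reach `target_eb`."""
    return (target_eb * PB_PER_EB - START_PB) / GROWTH_PB_PER_DAY

days = days_to_reach(0.8)
years = days / 365.25
print(f"{days:.0f} days, about {years:.1f} years after June 2012")
```

Under these assumptions, reaching 0.8 EB takes (800 - 100) / 0.5 = 1400 days, a little under four years of growth.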

8 Hadoop Applications
  Log and/or clickstream analysis of various kinds
  Marketing analytics
  Machine learning and/or sophisticated data mining
  Image processing
  Processing of XML messages
  Web crawling and/or text processing
  General archiving, including of relational/tabular data

9 Hadoop MRv1

10 JobTracker Large MRv1 Cluster

11 Architecture of YARN

12 YARN Application Submission

13 Resource Negotiation
  ApplicationMaster requests a number of containers from the ResourceManager
    Container specifications: memory (MB) and CPU shares
    Preferred location: host, rack, or anywhere (*)
    Priority within the application
  ApplicationMaster monitors the progress of the application and its tasks
    Restarts failed tasks
    Reports progress to the client
  ResourceManager monitors the health of the ApplicationMaster
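The request side of the negotiation described on this slide can be sketched as a toy simulation. Nothing here is the YARN API: ContainerRequest, Node, and ToyResourceManager are hypothetical names, the rule that a lower priority number is served first is an assumption of this sketch, and the scheduler is a simple first-fit loop with strict locality, whereas a real scheduler is considerably more elaborate.

```python
from dataclasses import dataclass

@dataclass
class ContainerRequest:
    memory_mb: int        # container specification: memory in MB
    vcores: int           # container specification: CPU share
    location: str = "*"   # preferred host name, rack name, or "*" for anywhere
    priority: int = 0     # assumption of this sketch: lower number is served first

@dataclass
class Node:
    host: str
    rack: str
    free_mb: int
    free_vcores: int

class ToyResourceManager:
    """Hypothetical stand-in for a ResourceManager; first-fit allocation."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _matches(self, node, location):
        # A request matches a node by host, by rack, or via the "*" wildcard.
        return location in ("*", node.host, node.rack)

    def allocate(self, requests):
        """Grant requests in priority order; return (request, host) pairs."""
        granted = []
        for req in sorted(requests, key=lambda r: r.priority):
            for node in self.nodes:
                if (self._matches(node, req.location)
                        and node.free_mb >= req.memory_mb
                        and node.free_vcores >= req.vcores):
                    node.free_mb -= req.memory_mb
                    node.free_vcores -= req.vcores
                    granted.append((req, node.host))
                    break  # request satisfied; move on to the next one
        return granted

rm = ToyResourceManager([Node("host1", "rack1", 4096, 4),
                         Node("host2", "rack2", 2048, 2)])
requests = [
    ContainerRequest(1024, 1, location="host2", priority=1),
    ContainerRequest(2048, 2, location="rack1", priority=0),
    ContainerRequest(4096, 1, location="*", priority=2),
]
granted = rm.allocate(requests)
for req, host in granted:
    print(f"priority {req.priority}: {req.memory_mb} MB on {host}")
```

In this run the third request stays ungranted because no node has 4096 MB free after the first two allocations; in the sketch it is simply skipped, while an ApplicationMaster would typically keep the request outstanding until resources free up.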

14 MapReduce External Comments
  Why Cloudera is saying 'Goodbye, MapReduce' and 'Hello, Spark' (Fortune.com, Sept. 9, 2015)
  “The One Platform Initiative the company announced Wednesday lays out Cloudera’s plan to officially replace MapReduce with Apache Spark as the default processing engine for Hadoop.”
  “…should be done in about a year.”

15 MapReduce External Comments
  Bossie Awards 2015: The best open source big data tools (InfoWorld.com's top picks)
  “How many Apache projects can sit on a pile of big data? Fire up your Hadoop cluster, and you might be able to count them. Among this year's Bossies in big data, you'll find the fastest, widest, and deepest newfangled solutions for large-scale SQL, stream processing, sort-of stream processing, and in-memory analytics, not to mention our favorite maturing members of the Hadoop ecosystem. It seems everyone has a nail to drive into MapReduce's coffin.”
  “Spark: With hundreds of contributors, Spark is one of the most active and fastest-growing Apache projects, and with heavyweights like IBM throwing their weight behind the project and major corporations bringing applications into large-scale production, the momentum shows no signs of letting up. The sweet spot for Spark continues to be ML.”

