Zhangxi Lin Texas Tech University

Presentation transcript:

Big Data
Zhangxi Lin, Texas Tech University

Hadoop & Spark

Hadoop Cases
10 Cases: http://hadoopilluminated.com/hadoop_illuminated/Hadoop_Use_Cases.html
7 Cases: http://www.mrc-productivity.com/blog/2015/06/7-real-life-use-cases-of-hadoop/
A Case Study of Hadoop in Healthcare: http://www.bigdataeverywhere.com/files/chicago/BDE-LeadingaHealthcareCaseStudy-QURAISHI.pdf

Hadoop/Spark

Distributed business intelligence
Deal with big data – the open & distributed approach
LAMP: Linux, Apache, MySQL, PHP/Perl/Python
Hadoop: MapReduce, HDFS, NoSQL, ZooKeeper, Storm
ISQS 3358 BI

Videos of Hadoop
Challenges Created by Big Data. 8’51”. Published on Apr 10, 2013. This video explains the challenges created by big data that Hadoop addresses efficiently. You will learn why the traditional enterprise model fails to address the Variety, Volume, and Velocity challenges created by Big Data and why the creation of Hadoop was required. http://www.youtube.com/watch?v=cA2btTHKPMY
Hadoop Architecture. 14’27”. Published on Mar 24, 2013. http://www.youtube.com/watch?v=YewlBXJ3rv8
History Behind Creation of Hadoop. 6’29”. Published on Apr 5, 2013. This video talks about the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache. http://www.youtube.com/watch?v=jA7kYyHKeX8

Apache Hadoop
The Apache Hadoop framework is composed of the following modules:
Hadoop Common - contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) - a distributed file system that stores data across the machines in the cluster
Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Hadoop MapReduce - a programming model for large-scale data processing
ISQS 6339, Data Mgmt & BI
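To make the HDFS module a little more concrete, here is a minimal sketch of listing a directory through the Hadoop FileSystem API from Scala. It assumes the Hadoop client libraries are on the classpath; the namenode address and the /user path are placeholder assumptions, so adjust them for your cluster.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical namenode address; use your cluster's fs.defaultFS value.
    conf.set("fs.defaultFS", "hdfs://localhost:8020")
    val fs = FileSystem.get(conf)

    // List everything under /user and print each path with its size in bytes.
    fs.listStatus(new Path("/user")).foreach { status =>
      println(s"${status.getPath}  ${status.getLen} bytes")
    }
  }
}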

Hadoop – for BI in the cloud era
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes of data. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
ISQS 3358 BI

MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
ISQS 3358 BI
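As a quick illustration of the two phases, the following word-count sketch runs the same map and reduce steps on a plain Scala collection on a single machine; the input lines are made up, and the real framework distributes exactly these phases across the nodes of the cluster.

// A minimal, single-machine sketch of the map/reduce idea.
val lines = Seq("big data with hadoop", "hadoop and spark", "spark streaming")

// Map phase: emit a (word, 1) pair for every word in every input record.
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle + reduce phase: group the pairs by key and sum the counts.
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

counts.foreach(println)  // e.g. (hadoop,2), (spark,2), ...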

Hadoop 2: Big data's big leap forward The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop’s job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node. ISQS 6339, Data Mgmt & BI

MapReduce 2.0 – YARN (Yet Another Resource Negotiator)
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
ISQS 6339, Data Mgmt & BI

Comparison of Two Generations of Hadoop

Apache Spark
An open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
Spark requires a cluster manager and a distributed storage system. For the cluster manager, Spark supports standalone mode (a native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
In February 2014, Spark became an Apache Top-Level Project. Spark had over 465 contributors in 2014.
- Source: http://en.wikipedia.org/wiki/Apache_Spark

Apache Spark
Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
What is Spark? 25’27” https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch01.html
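To see the in-memory idea in code, here is a minimal sketch of a spark-shell session; sc is the SparkContext the shell provides automatically, and the HDFS path is a hypothetical example.

// Hypothetical input path; point this at a file that exists on your cluster.
val logs = sc.textFile("hdfs:///data/logs.txt")

// cache() keeps the filtered RDD in memory, so later actions reuse it instead
// of re-reading from disk -- the in-memory primitive behind Spark's speed-up
// over two-stage, disk-based MapReduce jobs.
val errors = logs.filter(_.contains("ERROR")).cache()

println(errors.count())                                 // first action loads and caches
println(errors.filter(_.contains("timeout")).count())   // runs against the cached data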

Components of Spark Ecosystem
Shark (SQL)
Spark Streaming (Streaming)
MLlib (Machine Learning)
GraphX (Graph Computation)
SparkR (R on Spark)
BlinkDB (Approximate SQL)
MLlib is an open-source library built by the people who built Spark, mainly inspired by the scikit-learn library. H2O is a free library built by the company H2O. It is actually a standalone library, which can be integrated with Spark through the 'Sparkling Water' connector.
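As a small taste of MLlib, the sketch below uses the Spark 1.x RDD-based API from a spark-shell session (where sc already exists) to compute the Pearson correlation between two short, made-up series.

import org.apache.spark.mllib.stat.Statistics

// Two small, invented series purely for illustration.
val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Seq(2.0, 4.1, 5.9, 8.2))

// Pearson correlation between the two series ("spearman" is also supported).
val corr = Statistics.corr(x, y, "pearson")
println(s"correlation = $corr")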

Scala
Scala is a general-purpose programming language. Scala has full support for functional programming and a very strong static type system. Designed to be concise, many of Scala's design decisions were inspired by criticism of the shortcomings of Java. The name Scala is a portmanteau of "scalable" and "language". Spark is written in Scala.
The design of Scala started in 2001 at the École Polytechnique Fédérale de Lausanne (EPFL) by Martin Odersky, following on from work on Funnel, a programming language combining ideas from functional programming and Petri nets. Odersky had previously worked on Generic Java and javac, Sun's Java compiler. After an internal release in late 2003, Scala was released publicly in early 2004 on the Java platform, and on the .NET platform in June 2004. A second version (v2.0) followed in March 2006. The .NET support was officially dropped in 2012.
On 17 January 2011 the Scala team won a five-year research grant of over €2.3 million from the European Research Council. On 12 May 2011, Odersky and collaborators launched Typesafe Inc., a company to provide commercial support, training, and services for Scala. Typesafe received a $3 million investment in 2011 from Greylock Partners.
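A short, self-contained Scala example of the functional style and static typing described above; the sales records are invented purely for illustration.

// Case classes give concise, immutable, statically typed records.
case class Sale(product: String, amount: Double)

object ScalaTaste {
  def main(args: Array[String]): Unit = {
    val sales: List[Sale] = List(Sale("laptop", 999.0), Sale("mouse", 25.0), Sale("laptop", 899.0))

    // Higher-order functions (groupBy, map, sum) plus type inference:
    // total revenue per product, with the Map type checked at compile time.
    val revenue: Map[String, Double] =
      sales.groupBy(_.product).map { case (p, items) => (p, items.map(_.amount).sum) }

    revenue.foreach { case (p, total) => println(s"$p: $total") }
  }
}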

Cloudera’s Hadoop System ISQS 3358 BI

Comparison between big data platform and traditional BI platform
Layers: Applications, Data management, ETL, Data source
ISQS 3358 BI

Comparison between big data platform and traditional BI platform
Applications: Hadoop/Spark - Pentaho, Tableau, QlikView; R, Scala, Python, Pig; Mahout, H2O, MLlib | Traditional DW - SAS, SPSS, SSRS
Data management: Hadoop/Spark - HBase/Hive, GraphX, Neo4J; HDFS, NoSQL, NewSQL | Traditional DW - SSAS, SSMS
ETL: Hadoop/Spark - Kettle, Flume, Sqoop, Impala | Traditional DW - SSIS
Data source: Traditional DW - flat files, OLE DB, Excel, mails, FTP, etc.
ISQS 3358 BI

Topics
1. Data warehousing - Focus: Hadoop data warehouse design. Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
2. Publicly available big data services - Focus: tools and free resources. Components: Hortonworks, Cloudera, HaaS, EC2, Spark
3. MapReduce & data mining - Focus: efficiency of distributed data/text mining. Components: Mahout, H2O, MLlib, R, Python
4. Big data ETL - Focus: heterogeneous data processing across platforms. Components: Kettle, Flume, Sqoop, Impala
5. System management - Focus: load balancing and system efficiency. Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia, Mesos
6. Application development platform - Focus: algorithms and innovative development environments. Components: Tomcat, Neo4J, Titan, GraphX, Pig, Hue
7. Tools & visualizations - Focus: features for big data visualization and data utilization. Components: Pentaho, Tableau, Qlik, Saiku, Mondrian, Gephi
8. Streaming data processing - Focus: efficiency and effectiveness of real-time data processing. Components: Spark, Storm, Kafka, Avro

Hadoop vs. Spark

Will Spark replace Hadoop?
Hadoop is not a single product; it is an ecosystem. The same is true for Spark.
MapReduce can be replaced with Spark Core. Yes, it can be replaced over time, and this replacement seems reasonable. But Spark is not yet mature enough to fully replace this technology, and no one will completely give up on MapReduce until all the tools that depend on it support an alternative execution engine.
Hive can be replaced with Spark SQL. Again true, but keep in mind that Spark SQL is even younger than Spark itself; the technology is less than a year old. At the moment it can only toy with the mature Hive technology; revisit this comparison in 1.5 to 2 years. As you remember, 2-3 years ago Impala was billed as the Hive killer, yet both technologies now live side by side and Impala still hasn't killed Hive.
Storm can be replaced with Spark Streaming. Yes, it can, but to be fair Storm is not a piece of the Hadoop ecosystem; it is a completely independent tool. They target somewhat different computational models, so I don't think Storm will disappear, but it will continue to live as a niche product.
Mahout can be replaced with MLlib. To be fair, Mahout is already losing the market, and over the last year it became obvious that this tool will soon be dropped. Here you can really say that Spark replaced something from the Hadoop ecosystem.

Install Hadoop/Spark

Platform Installation
Install VirtualBox 5.0.x: https://www.virtualbox.org/wiki/Downloads
Install Hadoop (needs 10GB+ disk space): Cloudera CDH 5.5 or Hortonworks Sandbox 2.4
Install Spark
- Windows: https://spark.apache.org/downloads.html
- Mac OS X: http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html

Hortonworks Data Platform
Install & Setup Hortonworks Download, Mar 25, 2015, 9’26”
Hortonworks Sandbox, May 2, 2013, 8’35”
Install Hortonworks Sandbox 2.0, Sep 1, 2014, 22’23”
Setup Hortonworks Sandbox with VirtualBox VM, Nov 20, 2013, 24’25”
Download & Install: http://hortonworks.com/hdp/downloads/

Debugging
Error after upgrade: VT-x is not available (VERR_VMX_NO_VMX). https://forums.virtualbox.org/viewtopic.php?f=6&t=58820&sid=a1f50f7a44da06187cf5468e43a656e5&start=30

Install Cloudera’s QuickStart

Install Spark - Windows
Download Spark from https://spark.apache.org/downloads.html (spark-1.6.0-bin-hadoop2.6.tgz).
Look for the installation steps in this video: https://www.youtube.com/watch?v=KvQto_b3sqw
Change the command prompt directory to the Downloads folder by typing cd C:\Users\......

Mac OS X Yosemite http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html

Install Spark – Mac OS X
Install Java and set JAVA_HOME:
- Download Oracle Java SE Development Kit 7 or 8 from the Oracle JDK downloads page.
- Double-click the .dmg file to start the installation.
- Open up the terminal.
- Type java -version; it should display the following:
  java version "1.7.0_71"
  Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
  Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
- Set JAVA_HOME:
  export JAVA_HOME=$(/usr/libexec/java_home)

Install Homebrew:
  ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install Scala:
  brew install scala
Set SCALA_HOME:
  export SCALA_HOME=/usr/local/bin/scala
  export PATH=$PATH:$SCALA_HOME/bin
Download Spark from https://spark.apache.org/downloads.html, then unpack it:
  tar -xvzf spark-1.1.1.tar
  cd spark-1.1.1
Fire up Spark:
- For the Scala shell: ./bin/spark-shell
- For the Python shell: ./bin/pyspark
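Once the Scala shell is up, a quick sanity check like the following (typed at the spark-shell prompt, where sc is the SparkContext the shell creates for you) confirms the installation works.

// Parallelize a small range and add it up -- should print 5050.
val nums = sc.parallelize(1 to 100)
println(nums.reduce(_ + _))

// Confirm which Spark version the shell is running.
println(sc.version)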

Run Examples
Calculate Pi:
  ./bin/run-example org.apache.spark.examples.SparkPi
MLlib Correlations example:
  ./bin/run-example org.apache.spark.examples.mllib.Correlations
MLlib Linear Regression example:
  ./bin/spark-submit --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-*/spark-*.jar data/mllib/sample_linear_regression_data.txt
References:
- How to install Spark on Mac OS X: http://ondrej-kvasnovsky.blogspot.com/2014/06/how-to-install-spark-on-mac-os-x.html
- How To Set $JAVA_HOME Environment Variable On Mac OS X: http://www.mkyong.com/java/how-to-set-java_home-environment-variable-on-mac-os-x/
- Homebrew, the missing package manager for OS X: http://brew.sh