ISQS 6339, Business Intelligence: Big Data Management
Zhangxi Lin, Texas Tech University
Outline
- Hadoop/Spark
- Data Mart Case
Hadoop/Spark
Hadoop – for BI in the Cloud
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
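The map/shuffle/reduce flow that MapReduce applies across thousands of nodes can be sketched on a single machine. This is a plain-Python illustration of the model only (the function names are made up for this sketch), not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    """The classic MapReduce word-count example, run locally."""
    return reduce_phase(shuffle(map_phase(lines)))

# word_count(["big data", "big deal"]) -> {"big": 2, "data": 1, "deal": 1}
```

In a real Hadoop job, the map and reduce phases run on different nodes and the shuffle moves data between them over the network; the logic per record is the same.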
Videos of Hadoop
- Hadoop Architecture. 6’13”. Published on Sep 12.
- History Behind Creation of Hadoop. 6’29”. Published on Apr 5, 2013. This video covers the brief history behind the creation of Hadoop: how Google invented the technology, how it went into Yahoo, how Doug Cutting and Michael Cafarella created Hadoop, and how it went to Apache.
Distributed business intelligence
Dealing with big data – the open & distributed approach:
- LAMP: Linux, Apache, MySQL, PHP/Perl/Python
- Hadoop: MapReduce (11’47”), HDFS (8’27”), NoSQL (5’35”), ZooKeeper, Storm, H2O (33’36”)
How Hadoop Operates
Hadoop 2: Big data's big leap forward
The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop's job handling: in Hadoop 1, all jobs run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework (YARN) built around two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
Comparison of Two Generations of Hadoop
Comparison between big data platform and traditional BI platform
[Layer diagram: Applications – Data management – ETL – Data Source]
Comparison between big data platform and traditional BI platform
Layer by layer, Hadoop/Spark stack vs. traditional DW stack:
- Applications: Pentaho, Tableau, QlikView; R, Scala, Python, Pig; Mahout, H2O, MLlib (vs. SAS, SPSS, SSRS)
- Data management: HBase/Hive, GraphX, Neo4j (vs. SSAS, SSMS)
- ETL: Kettle, Flume, Sqoop, Impala (vs. SSIS)
- Data source: HDFS, NoSQL, NewSQL (vs. flat files, OLE DB, Excel, mails, FTP, etc.)
Resolving the legacy problem – dual platform
Cloudera’s Hadoop System
Teradata Big Data Platform
Dell representation of the Hadoop ecosystem
Nokia’s Big Data Architecture
Apache Spark
Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Video: What is Spark? 25’27”
Components of Spark Ecosystem
- Shark (SQL)
- Spark Streaming (streaming)
- MLlib (machine learning)
- GraphX (graph computation)
- SparkR (R on Spark)
- BlinkDB (approximate SQL)
MLlib is an open source library built by the people who built Spark, mainly inspired by the scikit-learn library. H2O is a free library built by the company H2O. It is actually a standalone library, which can be integrated with Spark via the 'Sparkling Water' connector.
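As a flavor of the kind of routine MLlib provides (it includes a Pearson correlation statistic), here is the underlying computation in a plain-Python sketch. On a cluster MLlib evaluates this over distributed data; the function name below is just for this illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    assert len(xs) == len(ys) and len(xs) > 1
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance term and the two standard-deviation terms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# pearson([1, 2, 3], [2, 4, 6]) -> 1.0 (perfect positive correlation)
```

The value of a distributed library is not the formula itself but evaluating the sums over data too large for one machine.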
Scala
Scala is a general-purpose programming language with full support for functional programming and a very strong static type system. Designed to be concise, many of Scala's design decisions were inspired by criticism of the shortcomings of Java. The name Scala is a portmanteau of "scalable" and "language". Spark is written in Scala. The design of Scala started in 2001 at the École Polytechnique Fédérale de Lausanne (EPFL) by Martin Odersky, following on from work on Funnel, a programming language combining ideas from functional programming and Petri nets. Odersky had previously worked on Generic Java and javac, Sun's Java compiler. After an internal release in late 2003, Scala was released publicly in early 2004 on the Java platform, and on the .NET platform in June 2004. A second version (v2.0) followed in March 2006. The .NET support was officially dropped in 2012. On 17 January 2011 the Scala team won a five-year research grant of over €2.3 million from the European Research Council. On 12 May 2011, Odersky and collaborators launched Typesafe Inc., a company providing commercial support, training, and services for Scala. Typesafe received a $3 million investment in 2011 from Greylock Partners.
Big Data Topics
1. Data warehousing – Focus: Hadoop data warehouse design. Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
2. Publicly available big data services – Focus: tools and free resources. Components: Hortonworks, Cloudera, HaaS, EC2, Spark
3. MapReduce & data mining – Focus: efficiency of distributed data/text mining. Components: Mahout, H2O, MLlib, R, Python
4. Big data ETL – Focus: heterogeneous data processing across platforms. Components: Kettle, Flume, Sqoop, Impala
5. System management – Focus: load balancing and system efficiency. Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia, Mesos
6. Application development platform – Focus: algorithms and innovative development environments. Components: Tomcat, Neo4j, Titan, GraphX, Pig, Hue
7. Tools & visualizations – Focus: features for big data visualization and data utilization. Components: Pentaho, Tableau, Qlik, Saiku, Mondrian, Gephi
8. Streaming data processing – Focus: efficiency and effectiveness of real-time data processing. Components: Spark, Storm, Kafka, Avro
Will Spark replace Hadoop?
Hadoop is not a single product; it is an ecosystem. The same is true for Spark.
- MapReduce can be replaced with Spark Core. Yes, it can be replaced over time, and this replacement seems reasonable. But Spark is not yet mature enough to fully replace this technology, and no one will completely give up on MapReduce until all the tools that depend on it support an alternative execution engine.
- Hive can be replaced with Spark SQL. True again. But keep in mind that Spark SQL is even younger than Spark itself, less than a year old; for now it can only toy with the mature Hive technology. Check back in 1.5–2 years. As you may remember, 2–3 years ago Impala was billed as the Hive killer, yet both technologies still live together and Impala has not killed Hive.
- Storm can be replaced with Spark Streaming. Yes, it can, but to be fair Storm is not part of the Hadoop ecosystem; it is a completely independent tool. They target somewhat different computational models, so Storm is unlikely to disappear, but it will live on as a niche product.
- Mahout can be replaced with MLlib. To be fair, Mahout has been losing the market, and over the last year it became obvious that it will soon be dropped. Here you can really say that Spark has replaced something from the Hadoop ecosystem.
7 top tools for taming big data
- Jaspersoft: produces reports from database columns.
- Pentaho: a report-generating engine that makes it easier to absorb information from new sources.
- Karmasphere: makes it easier to create and run Hadoop jobs.
- Talend: offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop.
- Skytree: offers a bundle that performs many of the more sophisticated machine-learning algorithms.
- Tableau: a visualization tool to look at data in new ways, then slice it up and examine it from a different angle.
- Splunk: creates an index of data as if the data were a book or a block of text.
Open Source Software for Big Data
- Oracle VM VirtualBox
- Cloudera Hadoop
- Hortonworks Data Platform
- Hadoop on Google Cloud Platform
- MarkLogic (Hadoop & NoSQL)
Reading Assignments Find a BI application case from the web, and understand how it works. Find paper “CACM2011 Overview of BI.pdf” in the network drive under ~\Texts\Readings\. Read it carefully.
Appendix: Install Hadoop/Spark
Platform Installation
1. Install VirtualBox 5.0.x
2. Install Hadoop (needs 10GB+ of disk space): Cloudera CDH 5.5, or Hortonworks Sandbox 2.4
3. Install Spark (note: Apache Spark is included with CDH 5)
   - Windows:
   - Mac OS X:
Hortonworks Data Platform
Install & Setup Hortonworks
- Hortonworks Download, Mar 25, 2015, 9’26”
- Hortonworks Sandbox, May 2, 2013, 8’35”
- Install Hortonworks Sandbox 2.0, Sep 1, 2014, 22’23”
- Setup Hortonworks Sandbox with VirtualBox VM, Nov 20, 2013, 24’25”
- Download & Install
Debugging
Error after upgrade: "VT-x is not available (VERR_VMX_NO_VMX)". This VirtualBox error usually means hardware virtualization (VT-x) is disabled in the BIOS/UEFI, or another hypervisor (such as Hyper-V) is holding it.
Install Cloudera’s QuickStarts
Install Spark - Windows
1. Download Spark (e.g., spark-*-bin-hadoop2.6.tgz) from https://spark.apache.org/downloads.html
2. Follow the installation steps shown in the video.
3. Change the command prompt directory to the downloads folder, e.g.:
   cd C:\Users\
Mac OS X Yosemite
Install Spark – Mac OS X
1. Install Java
   - Download Oracle Java SE Development Kit 7 or 8 from the Oracle JDK downloads page.
   - Double-click the .dmg file to start the installation.
   - Open the terminal and type java -version; it should display something like:
     java version "1.7.0_71"
     Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
     Java HotSpot(TM) 64-Bit Server VM (build b01, mixed mode)
2. Set JAVA_HOME:
   export JAVA_HOME=$(/usr/libexec/java_home)
Download Spark from https://spark.apache.org/downloads.html
1. Install Homebrew:
   ruby -e "$(curl -fsSL ..."
2. Install Scala:
   brew install scala
3. Set SCALA_HOME:
   export SCALA_HOME=/usr/local/bin/scala
   export PATH=$PATH:$SCALA_HOME/bin
4. Download and unpack Spark:
   tar -xvzf spark-1.1.1.tgz
   cd spark-1.1.1
5. Fire up Spark:
   ./bin/spark-shell   (Scala shell)
   ./bin/pyspark       (Python shell)
Run Examples
Calculate Pi:
./bin/run-example org.apache.spark.examples.SparkPi
MLlib Correlations example:
./bin/run-example org.apache.spark.examples.mllib.Correlations
MLlib Linear Regression example:
./bin/spark-submit --class org.apache.spark.examples.mllib.LinearRegression examples/target/scala-*/spark-*.jar data/mllib/sample_linear_regression_data.txt
References:
- How to install Spark on Mac OS X: http://ondrej-kvasnovsky.blogspot.com/2014/06/how-to-install-spark-on-mac-os-x.html
- How To Set $JAVA_HOME Environment Variable On Mac OS X
- Homebrew – the missing package manager for OS X
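The SparkPi example estimates π by Monte Carlo sampling: draw random points in the unit square and count those landing inside the quarter circle. The core computation, in a plain single-machine Python sketch (the function name is illustrative; Spark distributes the sampling across executors):

```python
import random

def estimate_pi(samples, seed=42):
    """Monte Carlo estimate of pi, as in the SparkPi example."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        # Point falls inside the quarter circle of radius 1?
        if x * x + y * y <= 1.0:
            inside += 1
    # Area ratio: quarter circle / unit square = pi/4.
    return 4.0 * inside / samples
```

More samples give a better estimate, which is exactly why the distributed version is useful: each node samples independently and only the counts are combined.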