© 2015 IBM Corporation Partner Webinar: Hadoop and Spark – the value of IBM? October 21st, 2015. Ian Radmore / Nikolay Manchev – IBM Analytics.

Presentation transcript:

2 Agenda
• Spark
• The Big Data Challenge
• BigInsights
• IBM’s approach
• Spark – More Technical Overview
• Summary

3 What we hear from customers...
• Lots of potentially valuable data is dormant or discarded due to size/performance considerations
• Large volumes of unstructured or semi-structured data are not worth integrating fully (e.g. Tweets, logs, ...)
• Not clear what should be analyzed (exploratory, iterative)
• Information distributed across multiple systems and/or the Internet
• Some information has a short useful lifespan
• Volumes can be extremely high
• Analysis needed in the context of existing information (not standalone)

4 Big Data scenarios span many industries
• Providing LOBs with analytics solutions across the enterprise
• Cuts time needed to process billions of textual items from several days to 30 minutes
• Predict weather patterns to plan optimal wind turbine usage and placement
• Multi-channel customer sentiment and experience analysis
• Connected Car Services

5 The Market

6 HDFS, MapReduce

7 Very Scalable + IBM BigInsights = Very Flexible, High Performing, Cheap, Simple, Enterprise Ready

8 Why is IBM involved?
• Strong history of leadership in open source & standards
• Supports our commitment to open source currency in all future releases
• Accelerates our innovation within Hadoop & surrounding applications
ODP and Apache Software Foundation (ASF)
• ODP supports the ASF mission
• ASF provides a governance model around individual projects without looking at the ecosystem
• ODP aims to provide a vendor-led consistent packaging model for core Apache components as an ecosystem
All Standard Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
ODP initial spec
IBM Joins ODPi as Founding Platinum Member

9 New: IBM BigInsights v4 for Apache Hadoop
Commitment to Hadoop currency: “days, not months”
IBM BigInsights for Apache Hadoop
• IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; Multi-workload, Multi-tenant scheduling
• IBM BigInsights Data Scientist: Text Analytics, Machine Learning on Big R, Big R, Big SQL, BigSheets
• IBM BigInsights Analyst: Big SQL, BigSheets
• IBM Open Platform with Apache Hadoop – all open source: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene, Zookeeper, Oozie, Knox, Slider

10 BigInsights Has Built in-Hadoop Analytics
To empower rapid discovery of new insights
Strongest SQL on Hadoop offering
• Best ANSI SQL standard support means best compatibility for end-users and tools
• Fine-grained access control for enhanced security
• Sophisticated workload management
• Federation to other data sources
Text Analytics: Deep text analytic toolkit from IBM Research
• Unstructured text extraction engine
• Rich application development toolkit for analyzing text and managing the veracity of data (including social media)
Advanced Machine Learning: Scalable algorithms from IBM Research
• Set of scalable machine learning algorithms beyond R
• Optimized execution plans for highest levels of performance
Big R: Leverage R as a query language on Hadoop
• Use open source R models & scale out on Hadoop cluster
• Task parallelism
BigSheets: Browser-based Excel-like analytics interface
• Explore, visualize & transform unstructured & structured data
• Discover & cleanse data
• Export into common formats

11 IBM SQL on Hadoop Market Leadership
“Interactive SQL is the biggest battleground for enterprise Hadoop sales in 2015; IBM approaches this battle with the strongest offering that operates directly on HDFS (Hadoop Distributed File System)” – Ovum Research
“IBM also recently conducted an independently audited benchmark[1], which was reviewed by third-party InfoSizing, of three popular SQL-on-Hadoop implementations and the results showed that IBM's Big SQL was the only Hadoop solution tested that was able to run all 99 Hadoop-DS[2] queries. Of the three, IBM Big SQL was found to be the fastest, most scalable, and most reliable – with a 3.6 times performance advantage running queries on a 10 TB for the compared solutions.”

12 BigInsights Has Deep Integration with IBM Portfolio
• Security: Optim, Guardium
• Data Integration: Information Server, Data Click
• Analytics: Cognos, SPSS, Streams
• Master Data Mgmt: InfoSphere MDM, BigMatch
• Data Warehouse: DB2, Netezza
• Governance: Information Governance Catalog
• Data Replication: InfoSphere Replication (CDC)
• Watson Explorer
• Security Intelligence: QRadar
“IBM BigInsights is differentiated by the broad and deep InfoSphere data management and integration tooling portfolio that integrates with it and a deep portfolio of analytic tools, some of which are bundled with the platform.”

13 BigInsights Comparison
Distributions compared: BigInsights v4, Cloudera CDH 5.4, HortonWorks HDP 2.3, MapR V5, Pivotal HD 3.0
Key Capability / Technology Advantages
• Open Data Platform Member – Alignment on Apache Hadoop HDFS, YARN, Ambari
• SQL on Hadoop – Rich, high-performance ANSI SQL on native Hadoop files
• BigSheets – Spreadsheet-style visualization tool for business users
• Big R – Full R language integration with native R analytics on Hadoop with new IBM Research machine learning algorithms
• Text Analytics – Simplified development for text analytics
• Cognos BI – Powerful and scalable business intelligence and performance management on Hadoop
• InfoSphere Streams – Real-time streaming analytics into Hadoop
• Watson Explorer – Search, index and visualize information in and out of Hadoop
• InfoSphere DataClick – Self-service data integration for business users
• Information Governance Catalog – One-stop shop for business and technical metadata in Hadoop
• Data Studio – Simplify Big SQL database administration, accelerate development
• Platform Symphony – Multi-workload, multi-tenant scheduling
• POSIX Distributed Filesystem (IBM offers Spectrum Scale, formerly known as GPFS)
Capabilities that matter to the business and to I/T
• Open Data Platform – Assurance of 100% open source Apache Hadoop
• Big SQL – Strict ANSI language compliance ensures that customers can run the same SQL unmodified across multiple data sources and can protect investments in existing client-side applications and tools
• BigSheets – Without an easy-to-use visual tool for non-programmers, organizations may need to invest in third-party tools, adding cost and complexity to the deployment
• Big R and Advanced Machine Learning take R statistical analysis to a new level of performance and scalability, along with new high-performance machine learning algorithms from IBM Research
• Text Analytics – Enables analytics on unstructured text (call center, social media) for sentiment analysis, consumer behavior, illegal or suspicious activities
• IBM Analytics Portfolio software for Hadoop – including licenses for Cognos BI, Watson Explorer, InfoSphere Streams, Data Click, Information Governance Catalog, Data Studio
• Platform Symphony provides multi-instance support that lets administrators allocate and optimize resources to scale the environment for large numbers of users
• Spectrum Scale is a POSIX file system that helps reduce storage costs by allowing Hadoop and non-Hadoop applications to manipulate the same file directly, avoiding the need for data to be replicated in and out of HDFS

14 Jump-starting Big Data initiatives in partnership with world-class talent, tools and resources
• The Hartree Centre is a venture by the Science and Technology Facilities Council, one of Europe’s largest multi-disciplinary research organisations, and IBM
• It is established as the industrial gateway to the UK’s most advanced computing and data analytics capabilities, and to world-class scientists, engineers and data experts
• Combining the skills, technology and best practice of high performance computing (HPC) and data analysis, organisations from any sector can benefit
• Show clients how they can quickly extract value from data or apply HPC to be more competitive – reduce the cost and risk of research and get to market more quickly
• Hartree can help companies to deliver proof of concept and demonstrate the value in a project or business idea using the power of HPC, data analytics and visualisation
• UK Government and IBM have invested a further £313m, funding research into data-centric and cognitive computing
• Based at Daresbury in Cheshire – visualisation and computational demo facility in the north available for IBM and its clients to use
• The complete stack of IBM Big Data and Analytics technology from SPSS to BigInsights and Streams available for use AT SCALE
Examples:
• Predicting archaeology to reduce risk with large infrastructure projects (YouTube video)
• Streamlining healthcare costs (case study)
• Highlighting weather-related emergency black-spots (case study)

What is Spark and why is it exciting?

16 Hadoop MapReduce Challenges
• Ease of Development: Need deep Java skills; Few abstractions available for analysts
• In-Memory Performance: No in-memory framework; Application tasks write to disk with each cycle
• Combine Workflows: Only suitable for batch workloads; Rigid processing model

17 What is Spark
• Fast and general in-memory big data processing engine

18 Performance: WordCount-based benchmark performed by Databricks

19 The Combination: The Flexibility of Spark on a Stable Hadoop Platform
• Spark: In-Memory Performance, Ease of Development, Combine Workflows
• Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats

20 IBM Leads the Way with Spark
Announcing:
• Open Source System ML
• Educate One Million Data Professionals
• Establish Spark Technology Center
• Founding Member of AMPLab
• Contributing to the Core

21 IBM BigInsights Summary
• 100% Open Source Apache Hadoop distribution including Spark
• **New** free for production use, technical support subscription available
• BigInsights adds value with in-Hadoop analytics
  • Big SQL is the most mature SQL engine for Hadoop
  • BigSheets for data discovery and exploration
  • Big R adds unique value for Data Scientists on Hadoop
  • Text Analytics for Hadoop from IBM Research for sentiment analysis, etc.
• Complete BigInsights offering includes
  • Cognos BI, InfoSphere Streams, Watson Explorer, Information Governance Catalog, Data Studio, Data Click, Platform Symphony, Spectrum Scale
• Richest set of data management and integration tooling for Hadoop

22 IBM BigInsights for Apache Hadoop: your choice of infrastructure and deployment model
• IBM System z
• IBM Power
• Intel Servers
• On Cloud

23 BigInsights List Pricing – production licensing software (indicative only; please refer to your partner manager)
Typical minimum configuration is a 6-node cluster. There are many clusters with hundreds of servers and a handful with thousands.

24 Next Steps
• Download Quick Start offering
• Test drive the technologies
• Links all available from HadoopDev
• Invite your clients to our weekly webinars
  • 22nd October – Big Data Governance and Integration
  • 29th October – BigInsights and Spark Overview
  • 5th November – BigInsights on Power
  • Date tbc – Cybersecurity
  • Date tbc – Manufacturing and industrial
• Start the discussion and involve us
Everyone wants to know about Big Data?
Nick Ansell, Ian Radmore, Nikolay Manchev

Spark Technical Overview

26 Data that’s too big or too ugly to fit in a relational database. Processing that requires massively parallel software running on tens, hundreds, or even thousands of servers.

27 HDFS, MapReduce

28 MAP, REDUCE

29 Hadoop Advantages
• Unlimited Scale
• Enterprise Platform
• Wide Range of Data Formats
Supporting points from the slide: Reliability, Resiliency, Security; Multiple data sources, Multiple applications, Multiple users; Files, Semi-structured, Databases

30 Hadoop MapReduce Challenges
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts

Word Count Example

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for one word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the input path, args[1] the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For comparison, a word count as a shell pipeline:

cat filename | xargs -n1 | sort | uniq -c > newfilename
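To make the division of labour in the Hadoop word count easier to see, here is a minimal sketch of the three phases (map, shuffle, reduce) in plain Java on in-memory lists. It does not use the Hadoop API; the class name `MapReducePhases` and its methods are hypothetical, and the shuffle step stands in for the grouping that the framework performs between mapper and reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class MapReducePhases {

    // Map phase: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

In a real Hadoop job only the map and reduce functions are written by the developer; the sketch spells out the shuffle to show what happens between them.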

32 Hadoop MapReduce Challenges
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
MAP, REDUCE, MAP, REDUCE

34 Hadoop MapReduce Challenges
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
Only one supported pattern: map, reduce
What if we want something else: map, reduce, reduce
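As one possible reading of the "map, reduce, reduce" question, the sketch below runs a word count and then feeds its result into a second aggregation that counts how many distinct words share each frequency. It is plain Java on in-memory collections, not Hadoop or Spark API, and the class and method names (`MapReduceReduce`, `countWords`, `countFrequencies`) are hypothetical. With classic MapReduce, a pipeline like this would typically be expressed as two chained jobs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop or Spark.
class MapReduceReduce {

    // Map + first reduce: word -> number of occurrences.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Second reduce: occurrence count -> number of distinct words with that count.
    static Map<Integer, Integer> countFrequencies(Map<String, Integer> wordCounts) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int count : wordCounts.values()) {
            histogram.merge(count, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Arrays.asList("to be or not", "to be"));
        // prints {1=2, 2=2}: two words occur once, two words occur twice
        System.out.println(countFrequencies(counts));
    }
}
```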

36 What is Spark
• Fast and general in-memory big data processing engine

37 Performance: WordCount-based benchmark performed by Databricks

38 Ease of use
• Spark provides the following APIs
  • Java API
  • Python API
  • Scala API
• Scala
  • General purpose programming language
  • Strong static type system
  • Full support for functional programming

textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
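For readers more familiar with Java than Scala, the three steps of the one-liner (split lines into words, pair each word with 1, sum per word) can be mirrored with the standard `java.util.stream` API. This is only an analogy on an in-memory list, not Spark's Java API; the class name `StreamWordCount` is hypothetical, and the filter for empty strings is an extra step that the Scala one-liner does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; uses only the JDK, not Spark.
class StreamWordCount {

    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // like flatMap(line => line.split(" ")): one stream of words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // extra step (not in the Scala one-liner): skip empty tokens
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toMap(
                        word -> word,   // key: the word itself
                        word -> 1,      // like map(word => (word, 1))
                        Integer::sum,   // like reduceByKey((a, b) => a + b)
                        TreeMap::new)); // sorted map for stable output
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

The stream version runs on a single JVM; the point of the comparison is the shape of the pipeline, not its scalability.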

39 Spark Workflows
• Spark features an advanced Directed Acyclic Graph (DAG) engine supporting complex data flow

40 Spark Advantages
• In-Memory Performance: Resilient Distributed Datasets
• Ease of Development: Easier APIs (Python, Scala, Java)
• Combine Workflows: Unify processing (Batch, Interactive, Iterative algorithms, Micro-batch)

41 Spark on Hadoop
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Hadoop YARN
• Storage management: Apache Hadoop HDFS

42 The Combination: The Flexibility of Spark on a Stable Hadoop Platform
• Spark: In-Memory Performance, Ease of Development, Combine Workflows
• Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats

43 IBM Leads the Way with Spark
Announcing:
• Open Source System ML
• Educate One Million Data Professionals
• Establish Spark Technology Center
• Founding Member of AMPLab
• Contributing to the Core
