Partner Webinar: Hadoop and Spark – the value of IBM? October 21st, 2015. Ian Radmore / Nikolay Manchev – IBM Analytics. © 2015 IBM Corporation


1 Partner Webinar: Hadoop and Spark – the value of IBM? October 21st, 2015. Ian Radmore / Nikolay Manchev – IBM Analytics

2 Agenda
• Spark
• The Big Data Challenge
• BigInsights
• IBM's approach
• Spark – More Technical Overview
• Summary

3 What we hear from customers
• Lots of potentially valuable data is dormant or discarded due to size/performance considerations
• Large volumes of unstructured or semi-structured data are not worth integrating fully (e.g. Tweets, logs)
• Not clear what should be analyzed (exploratory, iterative)
• Information distributed across multiple systems and/or the Internet
• Some information has a short useful lifespan
• Volumes can be extremely high
• Analysis needed in the context of existing information (not stand-alone)

4 Big Data scenarios span many industries
• Providing LOBs with analytics solutions across the enterprise
• Cuts time needed to process billions of textual items from several days to 30 minutes
• Predict weather patterns to plan optimal wind turbine usage and placement
• Multi-channel customer sentiment and experience analysis
• Connected Car Services

5 The Market

6 HDFS / MapReduce

7 Very Scalable + IBM BigInsights = Very Flexible, High Performing, Cheap, Simple, Enterprise Ready

8 Why is IBM involved?
• Strong history of leadership in open source and standards
• Supports our commitment to open source currency in all future releases
• Accelerates our innovation within Hadoop and surrounding applications

ODP and Apache Software Foundation (ASF)
• ODP supports the ASF mission
• ASF provides a governance model around individual projects without looking at the ecosystem
• ODP aims to provide a vendor-led, consistent packaging model for core Apache components as an ecosystem

All Standard Apache Open Source Components: HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene
ODP initial spec
IBM Joins ODPi as Founding Platinum Member

9 New: IBM BigInsights v4 for Apache Hadoop
Commitment to Hadoop currency: "days, not months"

IBM BigInsights for Apache Hadoop
• IBM BigInsights Analyst: Big SQL, BigSheets
• IBM BigInsights Data Scientist: Big SQL, BigSheets, Big R, Machine Learning on Big R, Text Analytics
• IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; Multi-workload, Multi-tenant scheduling

IBM Open Platform with Apache Hadoop (all open source): HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene, Zookeeper, Oozie, Knox, Slider

10 BigInsights Has Built-in Hadoop Analytics
To empower rapid discovery of new insights

Strongest SQL on Hadoop offering
• Best ANSI SQL standard support means best compatibility for end-users and tools
• Fine-grained access control for enhanced security
• Sophisticated workload management
• Federation to other data sources

Text Analytics
• Deep text analytic toolkit from IBM Research
• Unstructured text extraction engine
• Rich application development toolkit for analyzing text and managing the veracity of data (including social media)

Advanced Machine Learning
• Scalable algorithms from IBM Research
• Set of scalable machine learning algorithms beyond R
• Optimized execution plans for highest levels of performance

Big R
• Leverage R as a query language on Hadoop
• Use open source R models and scale out on a Hadoop cluster
• Task parallelism

BigSheets
• Browser-based, Excel-like analytics interface
• Explore, visualize and transform unstructured and structured data
• Discover and cleanse data
• Export into common formats

11 IBM SQL on Hadoop Market Leadership

"Interactive SQL is the biggest battleground for enterprise Hadoop sales in 2015; IBM approaches this battle with the strongest offering that operates directly on HDFS (Hadoop Distributed File System)" (Ovum Research)

"IBM also recently conducted an independently audited benchmark[1], which was reviewed by third-party InfoSizing, of three popular SQL-on-Hadoop implementations and the results showed that IBM's Big SQL was the only Hadoop solution tested that was able to run all 99 Hadoop-DS[2] queries. Of the three, IBM Big SQL was found to be the fastest, most scalable, and most reliable – with a 3.6 times performance advantage running queries on a 10 TB for the compared solutions."

12 BigInsights Has Deep Integration with IBM Portfolio
• Security: Optim, Guardium
• Security Intelligence: QRadar
• Data Integration: Information Server, Data Click
• Analytics: Cognos, SPSS, Streams
• Master Data Mgmt: InfoSphere MDM, BigMatch
• Data Warehouse: DB2, Netezza
• Governance: Information Governance Catalog
• Data Replication: InfoSphere Replication (CDC)
• Watson Explorer

"IBM BigInsights is differentiated by the broad and deep InfoSphere data management and integration tooling portfolio that integrates with it and a deep portfolio of analytic tools, some of which are bundled with the platform."

13 BigInsights Comparison

Distributions compared: BigInsights v4, Cloudera CDH 5.4, HortonWorks HDP 2.3, MapR V5, Pivotal HD 3.0

Key Capability / Technology Advantages
• Open Data Platform Member: alignment on Apache Hadoop HDFS, YARN, Ambari
• SQL on Hadoop: rich, high-performance ANSI SQL on native Hadoop files
• BigSheets: spreadsheet-style visualization tool for business users
• Big R: full R language integration with native R analytics on Hadoop, with new IBM Research machine learning algorithms
• Text Analytics: simplified development for text analytics
• Cognos BI: powerful and scalable business intelligence and performance management on Hadoop
• InfoSphere Streams: real-time streaming analytics into Hadoop
• Watson Explorer: search, index and visualize information in and out of Hadoop
• InfoSphere DataClick: self-service data integration for business users
• Information Governance Catalog: one-stop shop for business and technical metadata in Hadoop
• Data Studio: simplify Big SQL database administration, accelerate development
• Platform Symphony: multi-workload, multi-tenant scheduling
• POSIX Distributed Filesystem (IBM offers Spectrum Scale, formerly known as GPFS)

Capabilities that matter to the business
• Open Data Platform: assurance of 100% open source Apache Hadoop
• Big SQL: strict ANSI language compliance ensures that customers can run the same SQL unmodified across multiple data sources and can protect investments in existing client-side applications and tools
• BigSheets: without an easy-to-use visual tool for non-programmers, organizations may need to invest in third-party tools, adding cost and complexity to the deployment
• Big R and Advanced Machine Learning take R statistical analysis to a new level of performance and scalability, along with new high-performance machine learning algorithms from IBM Research
• Text Analytics: enables analytics on unstructured text (call center, social media) for sentiment analysis, consumer behavior, illegal or suspicious activities
• IBM Analytics Portfolio software for Hadoop: including licenses for Cognos BI, Watson Explorer, InfoSphere Streams, Data Click, Information Governance Catalog, Data Studio
• Platform Symphony provides multi-instance support that lets administrators allocate and optimize resources to scale the environment for large numbers of users
• Spectrum Scale is a POSIX file system that helps reduce storage costs by allowing Hadoop and non-Hadoop applications to manipulate the same file directly, avoiding the need for data to be replicated in and out of HDFS

Capabilities that matter to I/T

14 Jump-starting Big Data initiatives in partnership with world-class talent, tools and resources
• The Hartree Centre is a venture by the Science and Technology Facilities Council, one of Europe's largest multi-disciplinary research organisations, and IBM
• It is established as the industrial gateway to the UK's most advanced computing and data analytics capabilities, and to world-class scientists, engineers and data experts
• Combining the skills, technology and best practice of high performance computing (HPC) and data analysis, organisations from any sector can benefit
• Show clients how they can quickly extract value from data or apply HPC to be more competitive: reduce the cost and risk of research and get to market more quickly
• Hartree can help companies to deliver proof of concept and demonstrate the value in a project or business idea using the power of HPC, data analytics and visualisation
• UK Government and IBM have invested a further £313m, funding research into data-centric and cognitive computing
• Based at Daresbury in Cheshire: visualisation and computational demo facility in the north available for IBM and its clients to use
• The complete stack of IBM Big Data and Analytics technology, from SPSS to BigInsights and Streams, available for use AT SCALE

Examples
• Predicting archaeology to reduce risk with large infrastructure projects (YouTube video)
• Streamlining healthcare costs (case study)
• Highlighting weather-related emergency black-spots (case study)

15 What is Spark and why is it exciting?

16 Hadoop MapReduce Challenges
• Ease of Development
  • Need deep Java skills
  • Few abstractions available for analysts
• In-Memory Performance
  • No in-memory framework
  • Application tasks write to disk with each cycle
• Combine Workflows
  • Only suitable for batch workloads
  • Rigid processing model

17 What is Spark
• Fast and general in-memory big data processing engine

18 Performance
WordCount-based benchmark performed by Databricks: https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html

19 The Combination: The Flexibility of Spark on a Stable Hadoop Platform
• Spark: In-Memory Performance, Ease of Development, Combine Workflows
• Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats

20 IBM Leads the Way with Spark
• Announcing:
  • Open Source System ML
  • Educate One Million Data Professionals
  • Establish Spark Technology Center
  • Founding Member of AMPLab
  • Contributing to the Core

21 IBM BigInsights Summary
• 100% Open Source Apache Hadoop distribution including Spark
  • **New** free for production use, technical support subscription available
• BigInsights adds value with in-Hadoop analytics
  • Big SQL is the most mature SQL engine for Hadoop
  • BigSheets for data discovery and exploration
  • Big R adds unique value for Data Scientists on Hadoop
  • Text Analytics for Hadoop from IBM Research for sentiment analysis, etc.
• Complete BigInsights offering includes
  • Cognos BI, InfoSphere Streams, Watson Explorer, Information Governance Catalog, Data Studio, Data Click, Platform Symphony, Spectrum Scale
  • Richest set of data management and integration tooling for Hadoop

22 IBM BigInsights for Apache Hadoop
Your choice of infrastructure and deployment model: IBM System z, IBM Power, Intel Servers, On Cloud

23 BigInsights List Pricing: production licensing software (indicative only; please refer to your partner manager)
Typical minimum configuration is a 6-node cluster. There are many clusters in the hundreds of servers and a handful in the thousands.

24 Next Steps
• Download Quick Start offering
• Test drive the technologies
• Links all available from HadoopDev: https://developer.ibm.com/hadoop/

Invite your clients to our weekly webinars
• 22nd October: Big Data Governance and Integration
• 29th October: BigInsights and Spark Overview
• 5th November: BigInsights on Power
• Date tbc: cybersecurity
• Date tbc: manufacturing and industrial

Start the discussion and involve us. Everyone wants to know about Big Data.
• Nick Ansell, Nickanse@uk.ibm.com, 07827 844920
• Ian Radmore, Ian.Radmore@uk.ibm.com, 07843 368078
• Nikolay Manchev, nmancev@uk.ibm.com, 07919 565747

25 Spark Technical Overview

26 Data that's too big or too ugly to fit in a relational database. Processing that requires massively parallel software running on tens, hundreds, or even thousands of servers.

27 HDFS / MapReduce

28 MAP REDUCE

29 Hadoop Advantages
• Unlimited Scale: multiple data sources, multiple applications, multiple users
• Enterprise Platform: reliability, resiliency, security
• Wide Range of Data Formats: files, semi-structured, databases

30 Hadoop MapReduce Challenges
• Ease of Development
  • Need deep Java skills
  • Few abstractions available for analysts

Word Count Example

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token of every input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as combiner): sums the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: args[0] is the input path, args[1] the output path.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same count on a single machine, as a shell pipeline:

    cat filename | xargs -n1 | sort | uniq -c > newfilename

31 Hadoop MapReduce Challenges
• Ease of Development
  • Need deep Java skills
  • Few abstractions available for analysts
• In-Memory Performance
  • No in-memory framework
  • Application tasks write to disk with each cycle
• Combine Workflows
  • Only suitable for batch workloads
  • Rigid processing model

32 Hadoop MapReduce Challenges
• In-Memory Performance
  • No in-memory framework
  • Application tasks write to disk with each cycle

MAP, REDUCE, MAP, REDUCE

33 Hadoop MapReduce Challenges
• Ease of Development
  • Need deep Java skills
  • Few abstractions available for analysts
• In-Memory Performance
  • No in-memory framework
  • Application tasks write to disk with each cycle
• Combine Workflows
  • Only suitable for batch workloads
  • Rigid processing model

34 Hadoop MapReduce Challenges
• Combine Workflows
  • Only suitable for batch workloads
  • Rigid processing model

Only one supported pattern: map, reduce
What if we want something else: map, reduce, reduce
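
To make the "map, reduce, reduce" shape concrete, here is a minimal sketch in plain Java collections. It uses neither the Hadoop nor the Spark API; the class name, method names and the "user,action" event format are all hypothetical and chosen only for illustration. The first stage maps each event to its user and counts events per user; the second stage reduces again over those counts. With a fixed map/reduce pattern, the second aggregation would typically have to be a separate job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceReduceSketch {

    // Stage 1 (map + first reduce): count events per user.
    // Each event is assumed to be a "user,action" string (illustrative format).
    public static Map<String, Long> eventsPerUser(List<String> events) {
        return events.stream()
                .map(event -> event.split(",")[0]) // map: keep only the user field
                .collect(Collectors.groupingBy(user -> user, TreeMap::new, Collectors.counting()));
    }

    // Stage 2 (second reduce): count how many users share each event count.
    public static Map<Long, Long> usersPerEventCount(Map<String, Long> perUser) {
        return perUser.values().stream()
                .collect(Collectors.groupingBy(count -> count, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
                "ann,login", "bob,login", "ann,search", "cy,login",
                "ann,logout", "bob,logout", "cy,logout");
        Map<String, Long> perUser = eventsPerUser(events);
        System.out.println(perUser);                     // {ann=3, bob=2, cy=2}
        System.out.println(usersPerEventCount(perUser)); // {2=2, 3=1}
    }
}
```

Both stages run in one program and the intermediate result stays in memory, which is the kind of chaining the slide asks for.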

35 Hadoop MapReduce Challenges
• Ease of Development
  • Need deep Java skills
  • Few abstractions available for analysts
• In-Memory Performance
  • No in-memory framework
  • Application tasks write to disk with each cycle
• Combine Workflows
  • Only suitable for batch workloads
  • Rigid processing model

36 What is Spark
• Fast and general in-memory big data processing engine

37 Performance
WordCount-based benchmark performed by Databricks: https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html

38 Ease of use
• Spark provides the following APIs
  • Java API
  • Python API
  • Scala API
• Scala
  • General purpose programming language
  • Strong static type system
  • Full support for functional programming

    textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
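
For readers coming from Java, the three steps of the Scala one-liner (split lines into words, pair each word with 1, sum the values per word) can be sketched on a local list with the standard Java streams library. This is not the Spark Java API, only an illustration of the same flatMap / map / reduce-by-key logic; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Local stand-in for textFile.flatMap(...).map(...).reduceByKey(...).
    public static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: turn every line into a stream of its words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map + reduceByKey: give each word the value 1 and sum values per word
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        System.out.println(countWords(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```

The sketch runs on one machine over an in-memory list; the Scala version on the slide expresses the same pipeline against a distributed dataset.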

39 Spark Workflows
• Spark features an advanced Directed Acyclic Graph (DAG) engine supporting complex data flow
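
As a rough illustration of what a DAG-shaped data flow means, the sketch below models stages and their inputs and derives an order in which every stage runs after the stages it reads from. It is not Spark's scheduler and uses no Spark classes; the class name, method names and stage names are hypothetical. Note that one stage ("clean") feeds two downstream stages, a branch that a strictly linear map-then-reduce job cannot express on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DagSketch {

    // Hypothetical model: stage name -> names of the stages it reads from.
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    public void addStage(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Returns the stages in an order where every stage follows all of its inputs.
    public List<String> executionOrder() {
        Set<String> done = new LinkedHashSet<>();
        for (String stage : dependencies.keySet()) {
            visit(stage, done, new HashSet<>());
        }
        return new ArrayList<>(done);
    }

    // Depth-first walk; "onPath" detects cycles, which a DAG must not contain.
    private void visit(String stage, Set<String> done, Set<String> onPath) {
        if (done.contains(stage)) {
            return;
        }
        if (!onPath.add(stage)) {
            throw new IllegalStateException("Cycle detected at stage: " + stage);
        }
        for (String input : dependencies.getOrDefault(stage, Collections.emptyList())) {
            visit(input, done, onPath);
        }
        onPath.remove(stage);
        done.add(stage);
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch();
        dag.addStage("report", "byUser", "byPage"); // joins two branches
        dag.addStage("byUser", "clean");
        dag.addStage("byPage", "clean");
        dag.addStage("clean", "load");
        dag.addStage("load");
        System.out.println(dag.executionOrder()); // [load, clean, byUser, byPage, report]
    }
}
```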

40 Spark Advantages
• In-Memory Performance: Resilient Distributed Datasets; unify processing
• Ease of Development: easier APIs; Python, Scala, Java
• Combine Workflows: batch, interactive, iterative algorithms, micro-batch

41 Spark on Hadoop
• Compute layer: Apache Spark (Spark SQL, Spark Streaming, GraphX, MLlib, SparkR)
• Resource management: Apache Hadoop YARN
• Storage management: Apache Hadoop HDFS

42 The Combination: The Flexibility of Spark on a Stable Hadoop Platform
• Spark: In-Memory Performance, Ease of Development, Combine Workflows
• Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats

43 IBM Leads the Way with Spark
• Announcing:
  • Open Source System ML
  • Educate One Million Data Professionals
  • Establish Spark Technology Center
  • Founding Member of AMPLab
  • Contributing to the Core

44 IBM big data. THINK

