THE BUSINESS CASE FOR AI, SPARK & MORE Sanjay Mathur, CEO sanjay@svds.com • @sanjaymathur • www.svds.com
Silicon Valley Data Science is a boutique consulting firm focused on transforming business through data science and engineering. We are a company of experienced software engineers, architects, data scientists, strategists, and designers who specialize in data-driven product development and experimentation. We work in blended teams trained in holistic, strategic, and agile approaches.
Come see us & say hello! Tuesday, 23 May Data 101: The Business Case for Deep Learning, Spark, and Friends with Sanjay Mathur Architecting a Data Platform with John Akred & Stephen O’Sullivan Developing a Modern Enterprise Data Strategy with John Akred & Scott Kurth Thursday, 25 May What’s Your Data Worth? with John Akred Ask Me Anything with John Akred, Scott Kurth & Stephen O’Sullivan
To view SVDS speakers and scheduling, or to receive a copy of our slides, go to: www.svds.com/StrataUK2017
… it’s really about agility BIG DATA … it’s really about agility
BUYING AGILITY Linear scale-out cost Opex vs capex Ease of purchase
Scale-out systems move us from managing scarcity to promoting utility
DEVELOPMENT AGILITY Architectural factors Schema on read Rapid deployment Mirror production setup Executes faster Programmer factors Fun to program Concision Easier to test Faster to write DEVELOPMENT AGILITY
The experimental enterprise Supports investigative work and builds a solid layer for production. Conducts experiments and responds to the changing environment. Makes foundational infrastructure readily accessible. http://www.forbes.com/sites/edddumbill/2014/02/05/the-experimental-enterprise/
Data Management Security, Operations, Data Quality, Meta Data Management and Data Lineage Analytics Low Latency Access Data Ingest Repository Persistence Offline Processing Real-Time Processing Batch Processing Data Services External Systems Data Acquisition Internal Data Platform
What is Docker? Container technology: bundles every part of an application Provides isolation for each application without the overhead of running a virtual machine Ships only the parts that are needed—leaves out the operating system Answers the question of “how do I get my data” VM – full isolation, guaranteed resources Docker – process isolation, and better resource sharing
WHY SHOULD BUSINESS CARE? Better use of server resource than virtual machines A fast and reliable way of deploying applications It’s the ideal packaging mechanism for scale-out distributed systems Easy for developers to work in an environment identical to production Sharing containers leads to innovation DevOps enabler with programmatic infrastructure components. Fast scale up and tear down.
What is Apache Spark? In-memory distributed computing platform Comes from Berkeley AMPlab In production with early adopters, now integral to every commercial Hadoop distribution Doesn’t need Hadoop, but runs easily on top
Use cases Managing a major retailer’s inventory across a diverse network of entities in near real time Managing and processing event streams for online gaming Supporting data science initiatives across massive data sets at a media analytics company
Why should business care? Enables use cases Hadoop didn’t provide, all in one platform streaming, interactive analytics, machine learning, graphs Fast Iteration time down, more productive Use existing cluster investment Sits on HDFS, can run under YARN (or use Amazon S3, or Cassandra)
Why should business care? (2) SparkSQL Use SQL skills and tools, e.g. Tableau Dataframes integrate external data sources into one context: RDBMS, Hive, JSON… Developer-friendly Concise and fluid to program Language integration: Scala, R, Python, Java
THE AI BOOM Image from https://devblogs.nvidia.com/parallelforall/deep-learning-for-computer-vision-with-matlab-and-cudnn/ Deep learning uses deep neural networks which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms and GPUs with thousands of cores can take advantage of this parallelism to dramatically reduce computation time needed for training deep learning networks.
THE AI BOOM A long history of under-delivered promise—why? Traditionally, an encoding of knowledge Expert systems Current excitement is based on statistical methods Machine learning—many methods Deep learning—neural networks
AI USE CASES Cognition at scale Pattern recognition Predictive analytics Deep learning takes this further than before Extraordinary levels of composition, eg. Image captioning, generation
AI, WHY CARE? Leverage data previously “invisible” e.g. search images by text Automate low-value cognition tasks Call center triage Natural interfaces X.ai — scheduling bot Amazon Alexa, Google Now, Cortana, etc. Amazing open source toolkits and cloud APIs Google Tensorflow, MSFT, IBM Watson
What are Notebooks? Interactive documents that contain a program and its output Long history: Mathematica Particularly successful with data science Projects to watch Jupyter https://jupyter.org/ Apache Zeppelin https://zeppelin.incubator.apache.org/ Answers the question of “how do I get my data”
Demo goes here
Why should business care? Easy collaboration and sharing of data science Think “Docker for analysis” Easy access to data and compute resource A building block for more self-service analytical capabilities Commercial version of Notebooks + Spark is the Databricks Cloud
Data is your business
SILICON VALLEY’S DATA MACHINE
The experimental enterprise Supports investigative work and builds a solid layer for production. Conducts experiments and responds to the changing environment. Makes foundational infrastructure readily accessible. http://www.forbes.com/sites/edddumbill/2014/02/05/the-experimental-enterprise/
BECOME DATA NATIVE Can only win with situational awareness New architectures offer new opportunities Creation of data-driven value requires new approach Create an Experimental Enterprise Business must lead, and understand the potential of the technology
Sanjay Mathur• sanjay@svds. com • @sanjaymathur www. svds Sanjay Mathur• sanjay@svds.com • @sanjaymathur www.svds.com/StrataCA2017
Come see us & say hello! Tuesday, 23 May Data 101: The Business Case for Deep Learning, Spark, and Friends with Sanjay Mathur Architecting a Data Platform with John Akred & Stephen O’Sullivan Developing a Modern Enterprise Data Strategy with John Akred & Scott Kurth Thursday, 25 May What’s Your Data Worth? with John Akred Ask Me Anything with John Akred, Scott Kurth & Stephen O’Sullivan
To view speakers and scheduling, or to receive a copy of our slides, go to: www.svds.com/StrataCA2017