Clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data 12.2.13 Vaibhav Nivargi.

Presentation transcript:

clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi

About ClearStory Data

Analysis in the New Data Landscape
New use cases seen in all industries.
  – Live situational analysis requiring fast-cycle analysis across internal data and sources of external data
  – Multi-source analysis with data refreshing on new insights, as data from sources evolves
  – Large-scale analysis of structured and unstructured data combined in integrated insights

Example: Interactive Multi-source Analysis
More data and more people change the analysis.
Facebook: Shares, Likes, Comments
News Coverage: Online, Print, Television
Twitter: Followers, Tweets, Retweets
Donations: New Members, Donations
Website Traffic: Traffic, Referrals, Content
Corporate Sponsors: Corporate Engagement, New Inquiries
Data Intelligence: Interactive analysis on diverse internal & external data

Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.

Why Spark and Shark?
RDDs
  – Low latency & scale
  – Iterative and interactive computation
Lineage and fault tolerance
  – Able to re-derive data
Expressive power of Scala and SQL
  – Operations beyond aggregations, joins, and statistical operators
  – Advanced: ML, data mining, segmentation, approximate queries, graphs …
Support for structured and semi-structured data
BDAS Stack & AMPLab
  – Tachyon, MLBase, BlinkDB, GraphX …
Community and adoption
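The lineage point above ("able to re-derive data") can be illustrated with a small toy model. This is a plain-Python sketch, not Spark's API: the `ToyDataset` class, its methods, and the `compute_count` counter are hypothetical names invented for illustration. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be recomputed instead of restored from a replica.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    It records how it was derived (its lineage) rather than only its data.
    """

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source        # set only for root datasets
        self._parent = parent        # lineage: dataset this one derives from
        self._transform = transform  # lineage: how to derive it from the parent
        self._keep = False           # whether to hold results in memory
        self._cached: Optional[List[int]] = None
        self.compute_count = 0       # number of times this dataset was computed

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self) -> "ToyDataset":
        self._keep = True
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._transform(self._parent.collect())
        if self._keep:
            self._cached = rows
        return rows

    def evict(self) -> None:
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cached = None


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()   # computed by walking the lineage
second = evens_squared.collect()  # served from the cached copy
evens_squared.evict()             # the cached copy is lost
third = evens_squared.collect()   # re-derived from lineage: [4, 16, 36]
```

The second `collect()` does no recomputation, which is the low-latency case for iterative and interactive work; the third shows that losing the cache costs time, not correctness.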

The ClearStory Solution
[Diagram: Data Sources, ClearStory Platform, ClearStory Application. Components: Data Inference & Profiling, Harmonization, Visualization, Collaboration, In-Memory Data Units]

Where do Spark & Shark fit?
[Diagram: data sources (Public, Premium, Web, RDBMS, Hadoop, Files); Data Source API; Data Access, Inference and Lineage; Spark Cluster + ClearStory IP; Harmonization Engine and Blended Data Processing; ClearStory API; User Application]

How we leverage Spark & Shark
User intent captured and translated to custom API
Harmonization-as-a-Service
Manages Spark and Shark query execution
Read cached data from HDFS
RESTful
Merges datasets (RDDs) on the fly – on user request
Support conversion of user actions to backend queries
Query optimizations
Performance optimizations
Mixed-mode execution (sql2rdd & spark native)
Caching
Pre-computation
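As a rough illustration of two items on this slide, converting a user action into a backend query and merging datasets on request with caching, here is a minimal plain-Python sketch. It does not use Spark or Shark, and it is not ClearStory's implementation: the `SOURCES` table, its sample values, the action format, and the function names are all made up for the example.

```python
from typing import Dict, List, Tuple

# Made-up sample data: two small "sources" keyed by date.
SOURCES: Dict[str, List[Tuple[str, int]]] = {
    "donations": [("2013-11-01", 1200), ("2013-11-02", 800)],
    "tweets": [("2013-11-01", 340), ("2013-11-02", 510), ("2013-11-03", 95)],
}

# Results of earlier merges, keyed by the pair of source names.
_result_cache: Dict[Tuple[str, str], List[Tuple[str, int, int]]] = {}


def merge_on_key(left_name: str, right_name: str) -> List[Tuple[str, int, int]]:
    """Inner-join two named sources on their first field, caching the result."""
    cache_key = (left_name, right_name)
    if cache_key in _result_cache:
        return _result_cache[cache_key]
    right_index = dict(SOURCES[right_name])
    merged = [(key, value, right_index[key])
              for key, value in SOURCES[left_name]
              if key in right_index]
    _result_cache[cache_key] = merged
    return merged


def handle_user_action(action: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Translate a (hypothetical) UI action into a backend operation."""
    if action.get("type") == "blend":
        return merge_on_key(action["left"], action["right"])
    raise ValueError(f"unsupported action: {action!r}")


request = {"type": "blend", "left": "donations", "right": "tweets"}
result = handle_user_action(request)        # merged on the fly
result_again = handle_user_action(request)  # served from the cache
```

In a real deployment the merge would be a distributed join over RDDs and the cache would live in cluster memory; the sketch only shows the control flow from user action to cached, merged result.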

How we leverage Spark & Shark
Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
Data updates automatically processed as source data changes
ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
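One simple way to read "cached/un-cached and materialized ... based on usage patterns" is a policy that keeps a result in memory only once it has been requested often enough, and drops it when the source changes. The sketch below is an assumption about what such a policy could look like, not a description of ClearStory's actual logic; `UsageBasedCache`, its `threshold` parameter, and the `daily_totals` name are hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict, List


class UsageBasedCache:
    """Materialize a result only after it has been requested `threshold` times."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.hits: Counter = Counter()
        self.materialized: Dict[str, Any] = {}

    def get(self, name: str, compute: Callable[[], Any]) -> Any:
        self.hits[name] += 1
        if name in self.materialized:
            return self.materialized[name]
        value = compute()
        if self.hits[name] >= self.threshold:
            self.materialized[name] = value  # popular enough to keep in memory
        return value

    def invalidate(self, name: str) -> None:
        """Drop a materialized result, e.g. when its source data changes."""
        self.materialized.pop(name, None)
        self.hits[name] = 0


calls: List[int] = []


def expensive_query() -> int:
    calls.append(1)  # record each real computation
    return sum(range(10))


cache = UsageBasedCache(threshold=2)
a = cache.get("daily_totals", expensive_query)  # computed, not kept yet
b = cache.get("daily_totals", expensive_query)  # computed again, now kept
c = cache.get("daily_totals", expensive_query)  # served from memory
```

Calling `cache.invalidate("daily_totals")` after a source update forces the next request to recompute, which mirrors the slide's point that data updates are processed as source data changes.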

Spark Developments – What We Like
Query cancellation, progress indication (0.8.1 and beyond)
More performance breakthroughs
Workload Management
BlinkDB
MLBase
Tachyon
GraphX

We’re Hiring!
Working with the community, giving back
Lots of exciting new developments
This is like the early days of Hadoop – massive momentum gathering
The First Spark Summit!
More Meet-ups!
