Big Data Use Cases in the cloud Peter Sirota, GM Elastic

Presentation transcript:


What is Big Data?

Computer generated data
• Application server logs (web sites, games)
• Sensor data (weather, water, smart grids)
• Images/videos (traffic, security cameras)

Human generated data
• Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
• Blogs/Reviews/Emails/Pictures
• Social graphs: Facebook, LinkedIn, contacts

Big Data is full of valuable, unanswered questions!

Why is Big Data Hard (and Getting Harder)?

Data Volume
• Unconstrained growth
• Current systems don’t scale

Data Structure
• Need to consolidate data from multiple data sources in multiple formats across multiple businesses

Changing Data Requirements
• Faster response time of fresher data
• Sampling is not good enough and history is important
• Increasing complexity of analytics
• Users demand inexpensive experimentation

We need tools built specifically for Big Data!

Innovation #1: Apache Hadoop
• The MapReduce computational paradigm
• Open source, scalable, fault-tolerant, distributed system
Hadoop lowers the cost of developing a distributed system for data processing
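To make the MapReduce paradigm concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not from the talk: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, chosen only to show the three stages a framework such as Hadoop distributes across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (token, 1) pair for every whitespace-separated token in one line."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three lines of an application server log.
logs = ["GET /home", "GET /cart", "POST /cart"]
counts = run_job(logs)
```

Because each call to `map_phase` sees only one line and each call to `reduce_phase` sees only one key, both stages can in principle run in parallel on many nodes; that independence is what the paradigm relies on.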

Innovation #2: Amazon Elastic Compute Cloud (EC2)
“provides resizable compute capacity in the cloud.”
Amazon EC2 lowers the cost of operating a distributed system for data processing

Amazon Elastic MapReduce = Amazon EC2 + Hadoop

Elastic MapReduce applications
• Targeted advertising / Clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / Recommendations
• Data warehousing / BI
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs, video encoding)
• Web indexing

Clickstream Analysis – Big Box Retailer came to Razorfish
• 3.5 billion records
• 71 million unique cookies
• 1.7 million targeted ads required per day
Problem: Improve Return on Ad Spend (ROAS)

Clickstream Analysis – Targeted Ad
Example: a user who recently purchased a sports movie and is now searching for video games (1.7 million targeted ads per day)

Clickstream Analysis – Lots of experimentation, but the final design:
• 100-node on-demand Elastic MapReduce cluster running Hadoop

Clickstream Analysis – Processing time dropped from 2+ days to 8 hours (with lots more data)

Clickstream Analysis – Increased Return On Ad Spend by 500%

World’s largest handmade marketplace
• 8.9 million items
• 1 billion page views per month
• $320MM 2010 GMS

Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes
(Diagram labels: Production DB snapshots; Web event logs; ETL – Step 1; ETL – Step 2; Job)

Recommendations: The Taste Test

Recommendations: Gift Ideas for Facebook Friends (etsy.com/gifts)

Yelp
Yelp generates close to 400GB of logs per day

MapReduce at Yelp
• Yelp does not have a physical MapReduce cluster
• Running 250 production clusters per week
• All of those run on Elastic MapReduce

Features driven by MapReduce

More MapReduce uses
• Analyze ad stats (reporting, billing, algorithm inputs)
• Analyze A/B test results
• Detect duplicate business listings
• Email bounce processing
• Identify bots based on traffic patterns

9/23/2011 Amazon EMR Strata Justin Moore Big foursquare

How do we use EMR?
• Map-Reduce
  – Run algorithms on our entire dataset
  – Streaming jobs, complex analyses
• Hive
  – Business intelligence
  – Exploratory analyses
  – Infographics!

How big is our data?
• Global reach (North Pole, Space)
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data

Our Stack

Computing venue-to-venue similarity
• Spin up 40 node cluster
• Submit Ruby streaming job
  – Invert User x Venue matrix
  – Grab Co-occurrences
  – Compute similarity
• Spin down cluster
• Load data to app server
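The three sub-steps of the streaming job can be sketched as follows. This is an illustration only: the slide says the real job was written in Ruby and run on a cluster, whereas this is single-process Python. The slide does not say which similarity measure was used, so cosine similarity over binary visit vectors is an assumption, and the function names and sample check-ins are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def invert(venues_by_user):
    """Invert the user x venue matrix: map each venue to the set of users seen there."""
    users_by_venue = defaultdict(set)
    for user, venues in venues_by_user.items():
        for venue in venues:
            users_by_venue[venue].add(user)
    return users_by_venue


def cooccurrences(users_by_venue):
    """Count, for every pair of venues, how many users checked in at both."""
    counts = {}
    for a, b in combinations(sorted(users_by_venue), 2):
        shared = len(users_by_venue[a] & users_by_venue[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def similarity(users_by_venue, counts):
    """Assumed measure: cosine similarity, shared users / sqrt(users_a * users_b)."""
    return {
        (a, b): n / sqrt(len(users_by_venue[a]) * len(users_by_venue[b]))
        for (a, b), n in counts.items()
    }


# Hypothetical check-in data: user -> set of venues visited.
venues_by_user = {
    "ann": {"cafe", "gym"},
    "bob": {"cafe", "gym", "bar"},
    "cy": {"bar"},
}
users_by_venue = invert(venues_by_user)
scores = similarity(users_by_venue, cooccurrences(users_by_venue))
```

Comparing every pair of venues is quadratic in the number of venues, which is workable for a toy example; at the scale described in the talk the pair counting would be distributed across the cluster rather than done in one loop.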

Who is checking in?

What are people doing?

Where are our users?

When do people go to a place?
(Chart panels: Thursday, Friday, Saturday, Sunday)

Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…

How can we leverage these insights?

Join us! foursquare is hiring Justin