Published by Hilary Mills. Modified over 9 years ago.
1
Big Data Use Cases in the Cloud. Peter Sirota, GM, Elastic MapReduce (@petersirota)
2
What is Big Data?
3
Computer generated data: application server logs (web sites, games); sensor data (weather, water, smart grids); images/videos (traffic, security cameras)
4
Human generated data: Twitter “Firehose” (50 million tweets/day, 1,400% growth per year); blogs, reviews, emails, pictures; social graphs (Facebook, LinkedIn, contacts)
5
Big Data is full of valuable, unanswered questions!
6
Why is Big Data Hard (and Getting Harder)?
7
Why is Big Data Hard (and Getting Harder)? Data Volume: unconstrained growth; current systems don’t scale
8
Data Structure: need to consolidate data from multiple data sources, in multiple formats, across multiple businesses
9
Why is Big Data Hard (and Getting Harder)? Changing Data Requirements: faster response times on fresher data; sampling is not good enough and history is important; increasing complexity of analytics; users demand inexpensive experimentation
10
We need tools built specifically for Big Data!
11
Innovation #1: Apache Hadoop. The MapReduce computational paradigm. Open source, scalable, fault-tolerant, distributed system. Hadoop lowers the cost of developing a distributed system for data processing
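As a rough illustration of the MapReduce paradigm named on this slide, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are made up for this example, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: three request lines from a web server log.
log_lines = ["GET /home", "GET /cart", "POST /cart"]
counts = run_job(log_lines)
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can in principle run in parallel, which is the property a distributed system such as Hadoop relies on.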
12
Innovation #2: Amazon Elastic Compute Cloud (EC2) “provides resizable compute capacity in the cloud.” Amazon EC2 lowers the cost of operating a distributed system for data processing
13
Amazon Elastic MapReduce = Amazon EC2 + Hadoop
14
Elastic MapReduce applications: targeted advertising / clickstream analysis; security (anti-virus, fraud detection, image recognition); pattern matching / recommendations; data warehousing / BI; bio-informatics (genome analysis); financial simulation (Monte Carlo simulation); file processing (resize JPEGs, video encoding); web indexing
15
Clickstream Analysis – Big Box Retailer came to Razorfish: 3.5 billion records; 71 million unique cookies; 1.7 million targeted ads required per day. Problem: improve Return on Ad Spend (ROAS)
16
Clickstream Analysis – Targeted Ad: user recently purchased a sports movie and is searching for video games (1.7 million per day)
17
Clickstream Analysis – Lots of experimentation, but the final design: a 100-node on-demand Elastic MapReduce cluster running Hadoop
18
Clickstream Analysis – Processing time dropped from 2+ days to 8 hours (with lots more data)
19
Clickstream Analysis – Increased Return On Ad Spend by 500%
20
World’s largest handmade marketplace: 8.9 million items; 1 billion page views per month; $320MM 2010 GMS
21
Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes. [Diagram labels: Production DB snapshots; Web event logs; ETL – Step 1; ETL – Step 2; Job]
22
Recommendations: The Taste Test http://www.etsy.com/tastetest
23
Recommendations: etsy.com/gifts, Gift Ideas for Facebook Friends
24
Yelp generates close to 400 GB of logs per day
25
MapReduce at Yelp: Yelp does not have a physical MapReduce cluster; it runs 250 production clusters per week, all of them on Elastic MapReduce
26
Features driven by MapReduce
28
More MapReduce uses: analyze ad stats (reporting, billing, algorithm inputs); analyze A/B test results; detect duplicate business listings; email bounce processing; identify bots based on traffic patterns
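One of the uses listed here, duplicate listing detection, maps naturally onto a group-by-key job. The sketch below shows that shape in plain Python. It is only an assumption about how such a job could look, not Yelp's method: the record fields (`id`, `name`, `zip`), the normalization rule, and the sample listings are all invented for illustration.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def map_listing(listing):
    """Map: emit (blocking key, listing id). Key = normalized name + zip."""
    return (normalize(listing["name"]), listing["zip"]), listing["id"]


def find_duplicates(listings):
    """Group ids by key (shuffle), then keep groups with more than one id (reduce)."""
    groups = defaultdict(list)
    for listing in listings:
        key, listing_id = map_listing(listing)
        groups[key].append(listing_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample records, not real data.
listings = [
    {"id": 1, "name": "Joe's Pizza", "zip": "94103"},
    {"id": 2, "name": "Joes Pizza", "zip": "94103"},
    {"id": 3, "name": "Joe's Pizza", "zip": "10001"},
]
duplicates = find_duplicates(listings)
print(duplicates)
```

Exact matching on a normalized key only catches near-identical spellings; a production job would likely need fuzzier matching, which this sketch does not attempt.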
29
9/23/2011 Amazon EMR Strata Justin Moore - @injust Big Data @ foursquare
30
How do we use EMR? Map-Reduce: run algorithms on our entire dataset; streaming jobs, complex analyses. Hive: business intelligence; exploratory analyses; infographics!
31
How big is our data? Global reach (North Pole, Space); native app for almost every smartphone, plus SMS, web, mobile-web; 10M+ users, 15M+ venues, ~1B check-ins; terabytes of log data
32
Our Stack
33
Computing venue-to-venue similarity: (1) spin up 40 node cluster; (2) submit Ruby streaming job: invert User x Venue matrix, grab co-occurrences, compute similarity; (3) spin down cluster; (4) load data to app server
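The three stages of the streaming job on this slide (invert, co-occurrences, similarity) can be sketched in a few lines. The slide says the real job was written in Ruby and run as a Hadoop streaming job; the version below is a single-process Python sketch instead. The slide does not say which similarity measure was used, so cosine similarity over visitor sets is an assumption here, and the function names and sample check-ins are invented.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def invert(checkins):
    """Turn (user, venue) check-in rows into user -> set of venues visited."""
    venues_by_user = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)
    return venues_by_user


def cooccurrences(venues_by_user):
    """Count visitors per venue and users shared by each venue pair."""
    visitors = defaultdict(int)
    pairs = defaultdict(int)
    for venues in venues_by_user.values():
        for venue in venues:
            visitors[venue] += 1
        # Sort so each unordered pair is always keyed the same way.
        for a, b in combinations(sorted(venues), 2):
            pairs[(a, b)] += 1
    return visitors, pairs


def similarity(visitors, pairs):
    """Cosine similarity on binary visit vectors (an assumed metric)."""
    return {
        (a, b): shared / sqrt(visitors[a] * visitors[b])
        for (a, b), shared in pairs.items()
    }


# Hypothetical check-ins: three users, two venues.
checkins = [
    ("u1", "cafe"), ("u1", "park"),
    ("u2", "cafe"), ("u2", "park"),
    ("u3", "cafe"),
]
visitors, pairs = cooccurrences(invert(checkins))
scores = similarity(visitors, pairs)
print(scores)
```

In a distributed run, the user would be the key for the first stage and the venue pair the key for the second, so each stage is a separate map and reduce pass.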
34
Who is checking in?
35
What are people doing?
36
Where are our users?
37
When do people go to a place? [Chart panels: Thursday, Friday, Saturday, Sunday]
38
Why are people checking in? Explore their city, discover new places; find friends, meet up; save with local deals; get insider tips on venues; personal analytics, diary; follow brands and celebrities; earn points, badges, gamification of life. The list grows…
39
How can we leverage these insights?
40
Join us! foursquare is hiring: www.foursquare.com/jobs. Justin Moore, @injust, justin@foursquare.com
41
http://aws.amazon.com/elasticmapreduce/