Media6°

Who We Are
Media6° is an Online Advertising Company Specializing in Social Graph Targeting
– Birds of a feather flock together!
– We build custom audiences for marketers composed of the existing customers of a brand and the consumers most closely connected to them via the social graph.
– We use non-personally identifiable data from across social media to deliver highly scalable audiences across the top comScore 1000 sites.

How We Do It
Gather Data, Build Models, Identify Targets, Show Ads
– sample browser visitation data from micro (social network) and macro (blog) user-generated content sites
– acquire browser visitation data from client assets
– correlate brand interest with UGC affinities to identify brand neighbors
– build out brand-specific audiences in ad exchanges
– purchase impressions on brand-neighbor browsers from the exchanges

Why Hadoop?
Business Intelligence needs:
– monitor data gathering for reach, value and data partner payment
– monitor campaign audience generation and expansion
– monitor app server activity levels and data quality
Previous experience:
– online advertising platform reporting
– relational databases, data warehousing and ETL processing
Needed a web-scale solution that:
– was affordable (free) and would run on available hardware
– could handle initial logs of 50 to 100 million lines/day
– could grow to handle an expected 1 billion log lines/day
Possibilities considered were a custom application and Hadoop:
– a custom application offered known capabilities and a relatively quick implementation, but would likely be outgrown
– Hadoop promised the proper foundation, but was an unknown, with a learning curve and the potential that it wouldn't meet our needs

Initial Implementation
Had legacy (2004) hardware inherited from a prior company:
– 3 slaves: dual 3GHz single-core Xeon, 4GB RAM, 120GB disk
– master: dual 3GHz single-core Xeon, 4GB RAM, 660GB disk
– running Linux CentOS 5, Java 1.6
Set-up of the development environment and cluster took about 3 days:
– master setup took a day
– slave setup took about an hour each
– cluster set-up took a couple of days ("Retrying connect to server...")
Developed a custom Java Map/Reduce application:
– mapper included 20 classes to parse log lines into fields and do counts (shape sketched below)
– combiner & reducer consisted of one class to aggregate counts
– development time was approximately 2 weeks for the initial prototype
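The deck does not include the code itself; the following is a minimal, hypothetical sketch of the shape of such a job against the old org.apache.hadoop.mapred API. The tab delimiter, field positions and class names are assumptions, not the actual Media6° implementation.

    // LogAggregation.java -- sketch of a parse-and-count job (old mapred API).
    // Delimiter, field layout and key choice are assumptions, not the real log format.
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class LogAggregation {

      // Stand-in for the ~20 parser classes: split a log line into fields
      // and emit one count per grouping key.
      public static class LogLineMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          String[] fields = line.toString().split("\t");
          if (fields.length < 4) {
            return;  // skip malformed lines
          }
          // Hypothetical grouping key built from the first few parsed fields.
          out.collect(new Text(fields[0] + "\t" + fields[1] + "\t" + fields[2]), ONE);
        }
      }

      // The single class used as both combiner and reducer: sum the counts per key.
      public static class CountReducer extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          long sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          out.collect(key, new LongWritable(sum));
        }
      }
    }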

v1.0
Production configuration:
– 5 legacy servers as slaves, 1 legacy server as master
– CentOS 5, Java 1.6, Hadoop 0.16 (the upgrade from 0.15 to 0.16 was seamless)
11 aggregation sets which group on 4 fields and have 10 counts (sketched below)
Maximum throughput of 6,000 lines/second:
– jobs consisted of up to 5 million lines in up to 300 files and took from 4 to more than 30 minutes
– processed an average of 160 million lines/day with peaks of 260 million
– no data was maintained in the Hadoop file system
Normal behavior was for the Hadoop cluster to run continuously, starting between 3 and 7 pm and finishing around 5 am.
We experienced no Hadoop-specific errors or unplanned downtime in 8 months of continuous operation, from May to December 2008.
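One way to picture an aggregation set that groups on 4 fields and carries 10 counts: the mapper emits a composite key of the 4 grouping fields and a value holding the 10 counts, and a shared combiner/reducer sums them element-wise. The sketch below assumes the counts travel as tab-separated text; the real implementation may well have used a custom Writable.

    // CountVectorReducer.java -- sums 10 parallel counts per 4-field key (assumed layout:
    // key = "f1\tf2\tf3\tf4", value = 10 tab-separated longs). Usable as combiner and reducer.
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class CountVectorReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      private static final int NUM_COUNTS = 10;

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        long[] totals = new long[NUM_COUNTS];
        while (values.hasNext()) {
          String[] counts = values.next().toString().split("\t");
          for (int i = 0; i < NUM_COUNTS && i < counts.length; i++) {
            totals[i] += Long.parseLong(counts[i]);   // element-wise sum of the 10 counts
          }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < NUM_COUNTS; i++) {
          if (i > 0) sb.append('\t');
          sb.append(totals[i]);
        }
        out.collect(key, new Text(sb.toString()));
      }
    }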

v2.0
Updated configuration:
– 6 slaves: dual 2.5GHz quad-core Xeon, 16GB RAM, 4TB disk
– 2 masters: fully fault tolerant (DRBD) with automatic failover, dual 2.5GHz quad-core Xeon, 16GB RAM, 1.4TB disk
– CentOS 5, Java 1.6, Hadoop 0.18 (the upgrade from 0.16 was seamless)
Currently have 16 aggregations plus jobs to gather browser-specific data and generate input for data models
– currently maintain 8.5TB of data with a replication factor of 2x (17TB total); the 2x replication factor is used to maximize available disk space (see the note below)
Experienced throughput of more than 22,500 lines/second, with an estimated capacity of more than 40,000 lines/second:
– jobs consist of up to 15 million lines in up to 1,000 files
– process 360 files/hr and between 6 and 30 million lines/hr
– average 450 million lines/day; the record was 771 million
Normal behavior is for Hadoop to be essentially idle 40% of the time
We have still experienced no Hadoop-specific errors
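For reference, replication like this is normally fixed cluster-wide with dfs.replication in the Hadoop configuration file of that era (hadoop-site.xml); the hedged sketch below shows the client-side equivalent, including lowering replication on files written before the change. The paths are hypothetical.

    // SetReplication.java -- illustrative only; in practice dfs.replication=2 goes in the
    // cluster config. Shown here via the client API with an assumed archive path.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // files written by this client get 2 copies
        FileSystem fs = FileSystem.get(conf);

        // Drop files written before the change (hypothetical path) to 2 copies.
        for (FileStatus stat : fs.listStatus(new Path("/archive/raw"))) {
          if (!stat.isDir()) {
            fs.setReplication(stat.getPath(), (short) 2);
          }
        }
        fs.close();
      }
    }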

Primary Processing Cycle
Every 10 minutes, 60 Tomcat servers generate gzipped log files of between 3MB and 100MB
A cron runs every 10 minutes to download the files to a to-do directory on the Hadoop master
4 additional crons run every 5 minutes to (one pass is sketched after this list):
– copy batches of files into an HDFS directory named with a job ID
– run an initial m/r job to generate aggregations and extract browser-specific event data
– copy the aggregated data to the local file system, move the raw input data to an archive directory within HDFS, and copy the browser-specific data into a secondary staging directory within HDFS
– load the aggregated data into MySQL tables
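A hypothetical sketch of one pass of that cycle, using the JobConf/JobClient and FileSystem APIs of the time. Directory names, the job-ID convention and the mapper/reducer classes (reused from the earlier sketch) are assumptions; the browser-data extraction and the MySQL load are only noted in comments.

    // ProcessBatch.java -- one pass of the 5-minute cron, sketched under assumed paths.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ProcessBatch {
      public static void main(String[] args) throws IOException {
        String jobId = Long.toString(System.currentTimeMillis());   // assumed job-ID scheme
        Path input   = new Path("/jobs/" + jobId + "/input");
        Path output  = new Path("/jobs/" + jobId + "/output");
        Path archive = new Path("/archive/raw/" + jobId);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 1. Copy a batch of downloaded log files from the master's local to-do area into HDFS.
        fs.copyFromLocalFile(new Path("/data/todo/" + jobId), input);

        // 2. Run the aggregation job (classes from the earlier sketch; the browser-specific
        //    extraction that runs alongside it is not shown here).
        JobConf job = new JobConf(conf, ProcessBatch.class);
        job.setJobName("aggregate-" + jobId);
        job.setMapperClass(LogAggregation.LogLineMapper.class);
        job.setCombinerClass(LogAggregation.CountReducer.class);
        job.setReducerClass(LogAggregation.CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        JobClient.runJob(job);

        // 3. Pull the aggregated output to the local file system and archive the raw input.
        fs.copyToLocalFile(output, new Path("/data/results/" + jobId));
        fs.mkdirs(new Path("/archive/raw"));
        fs.rename(input, archive);

        // 4. Load the local result files into MySQL (e.g. LOAD DATA INFILE) -- omitted here.
      }
    }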

Browser Data Processing
Every 30 minutes a cron runs to pull the latest browser-specific data from what has been extracted from the logs over the course of the day.
– on average, 1.25 million new browsers are added every hour, with an average of 30 million unique browsers with new data touched daily
Every morning at 2:30am, details of brand-specific browser activity accumulated the prior day are compiled using a map job with no reducer (sketched below).
– approximately 1.75 million (6%) of browsers have brand-specific activity
– from these browser records, approximately 20 million brand relational data points are identified
– the results are exported to the local file system and imported into MySQL tables which feed our data modeling
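A map-only job is a standard Hadoop pattern: with the reduce count set to zero there is no shuffle or sort, and the map output is written straight to HDFS. A minimal, hypothetical sketch; the filter logic, field layout and paths are assumptions.

    // CompileBrandActivity.java -- map-only job: mapper output goes directly to HDFS.
    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CompileBrandActivity {

      // Hypothetical filter: keep only browser records that show brand-specific activity.
      public static class BrandActivityMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String[] fields = line.toString().split("\t");    // delimiter is an assumption
          if (fields.length > 1 && fields[1].length() > 0) { // field 1 = brand id (assumed)
            out.collect(new Text(fields[0]), new Text(fields[1]));
          }
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(CompileBrandActivity.class);
        job.setJobName("brand-activity");
        job.setMapperClass(BrandActivityMapper.class);
        job.setNumReduceTasks(0);                            // no reducer: no shuffle or sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("/browser/staging"));       // assumed
        FileOutputFormat.setOutputPath(job, new Path("/browser/brand-activity")); // assumed
        JobClient.runJob(job);
      }
    }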

HDFS Layout/Maintenance
HDFS space is divided between work space, raw log archives and browser history data
– persistent file space utilization is limited to 70% to allow for work space and for redistribution of data if a slave fails
– raw logs are maintained for 14 days in the original m/r input directories
Browser history data is partitioned by date and divided into:
– 21 days of browser data extracted from raw logs
– 90 days of daily browser data
– 90 days of brand relational data
A cron runs once an hour and removes the oldest files when the utilization percentage is greater than 70 (sketched below).
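A hypothetical sketch of that hourly cleanup: list the dated partitions oldest-first and delete until utilization is back under the 70% cap. The utilization check is left behind a placeholder helper (in practice it might come from parsing "hadoop dfsadmin -report"); the paths are assumptions.

    // PurgeOldest.java -- hourly cleanup sketch: delete oldest partitions while HDFS is
    // over 70% full. Layout and the utilization source are assumptions.
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PurgeOldest {
      private static final double MAX_UTILIZATION = 0.70;

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Dated partitions under the browser-history root (assumed layout).
        FileStatus[] partitions = fs.listStatus(new Path("/browser/history"));
        Arrays.sort(partitions, new Comparator<FileStatus>() {
          public int compare(FileStatus a, FileStatus b) {
            return Long.signum(a.getModificationTime() - b.getModificationTime()); // oldest first
          }
        });

        for (FileStatus p : partitions) {
          if (utilization(fs) <= MAX_UTILIZATION) {
            break;                       // back under the cap, stop deleting
          }
          fs.delete(p.getPath(), true);  // recursive delete of the oldest partition
        }
        fs.close();
      }

      // Hypothetical helper: fraction of cluster capacity in use.
      // One option is parsing "hadoop dfsadmin -report"; details omitted here.
      private static double utilization(FileSystem fs) throws IOException {
        return 0.0;  // placeholder
      }
    }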

v3.0+
Reduce or eliminate dependence on MySQL for data set generation
– data set builds currently take 50 to 80 hours; the aim is to reduce that to 10% of the current time or less
Replace static MySQL data sets with a distributed cache with real-time updates
Potential for use of HBase
Cascading