
Hadoop at ContextWeb February 2009

ContextWeb: Traffic
- Traffic – up to 6 thousand ad requests per second.
- Comscore Trend Data: (chart not reproduced in the transcript)

ContextWeb Architecture Highlights
Pre-Hadoop aggregation framework:
- Logs are generated on each server and aggregated in memory into 15-minute chunks (see the sketch below).
- Logs from different servers are aggregated into one log.
- Load to DB.
- Multi-stage aggregation in the DB.
- About 20 different jobs end-to-end; processing through all stages could take ~2 hours.
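To make the 15-minute chunking concrete, here is a minimal sketch of in-memory bucketing by 15-minute window. The class and method names are illustrative only, not ContextWeb's actual framework.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: rolls per-server log events up into 15-minute buckets in memory,
// in the spirit of the pre-Hadoop aggregation framework described on this slide.
public class FifteenMinuteAggregator {
    private static final long WINDOW_MS = 15L * 60L * 1000L;

    // bucket start timestamp (ms) -> count of ad requests in that window
    private final Map<Long, Long> counts = new HashMap<Long, Long>();

    public void record(long eventTimestampMs) {
        long bucket = (eventTimestampMs / WINDOW_MS) * WINDOW_MS; // floor to the 15-minute boundary
        Long current = counts.get(bucket);
        counts.put(bucket, current == null ? 1L : current + 1L);
    }

    // A real implementation would periodically flush closed windows to a log file
    // that is then merged with the logs from other servers and loaded into the DB.
    public Map<Long, Long> snapshot() {
        return new HashMap<Long, Long>(counts);
    }
}
```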

Hadoop Data Set
- Up to 100 GB of raw log files per day (40 GB compressed).
- 40 different aggregated data sets.
- 15 TB total to cover 1 year, compressed (roughly 40 GB/day × 365 days ≈ 14.6 TB).
- Multiply by 3 replicas …

Architectural Challenges
- How to organize the data set to keep aggregated data sets fresh?
  o Logs are constantly appended to the main data set.
  o Reports and aggregated data sets should be refreshed every 15 minutes.
- Mix of .NET and Java applications (80%+ .NET, 20% Java).
  o How to make .NET applications write logs to Hadoop?
- Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application).
  o How to make 3rd-party or internal legacy applications read data from Hadoop?

Hadoop Cluster
Today:
- 26 nodes / 208 cores: DELL 2950, 1.8 TB per node; 43 TB total capacity.
- NameNode high availability using DRBD replication.
- Hadoop >
- In-house developed Java framework on top of hadoop.mapred.* (a minimal sketch of this API follows below).
- PIG and Perl streaming for ad-hoc reports.
- ~1,000 MapReduce jobs per day.
- Opswise scheduler.
- Exposing data to Windows: WebDAV server with WebDrive clients.
- Reporting application: QlikView.
- Cloudera support for Hadoop.
- Archival/backup: Amazon S3.
By end of 2009:
- ~50 nodes / 400 cores.
- ~85 TB total capacity.
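For readers unfamiliar with the old org.apache.hadoop.mapred.* API that the in-house framework wraps, here is a minimal stand-alone job sketch. The job name, input format assumption (tab-separated log lines whose first field is an hour key), and paths are illustrative, not ContextWeb's code.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Minimal job on the pre-0.20 org.apache.hadoop.mapred.* API:
// counts ad requests per hour from tab-separated log lines.
public class HourlyRequestCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split("\t");
            out.collect(new Text(fields[0]), ONE); // fields[0]: assumed hour key, e.g. "2009-02-15T13"
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text hour, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(hour, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(HourlyRequestCount.class);
        conf.setJobName("hourly-request-count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // e.g. raw logs in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // e.g. an hourly output directory
        JobClient.runJob(conf);
    }
}
```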

Internal Components
Disks:
o 2x 300 GB 15k RPM SAS.
o Hardware RAID 1 mirroring.
o SMART monitoring.
Network:
o Dual 1 Gbps on-board NICs.
o Linux bonding with LACP.

Redundant Network Architecture
Linux bonding:
o See bonding.txt from the Linux kernel docs.
o LACP, aka 802.3ad, aka mode=4.
o Must be supported by your switches.
o Throughput advantage: observed 1.76 Gb/s.
o Allows for failure of either NIC, instead of a single heartbeat connection via crossover.

The Data Flow

Partitioned Data Set: Approach
- Date/time is the dimension for partitioning.
- Results of MapReduce jobs are segregated into daily and hourly directories.
- Each daily/hourly directory is regenerated if the input to the MR job contains data for that day/hour.
- A revision number is used for each directory/file, so multi-stage jobs can overlap during processing (a sketch of this layout follows below).
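To illustrate the daily-directory-plus-revision-number idea, here is a small sketch that picks the next revision directory for a given day's output. The path layout (/data/agg/daily/YYYY-MM-DD/rNNN) and helper names are assumptions for illustration, not the actual ContextWeb layout.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: each regeneration of a daily partition gets a new revision
// directory (r001, r002, ...), so a downstream job can keep reading the previous
// revision while the next one is being written.
public class RevisionedDailyPath {

    public static Path nextRevision(FileSystem fs, Path dailyRoot) throws Exception {
        int maxRev = 0;
        if (fs.exists(dailyRoot)) {
            for (FileStatus status : fs.listStatus(dailyRoot)) {
                String name = status.getPath().getName(); // e.g. "r003"
                if (name.startsWith("r")) {
                    maxRev = Math.max(maxRev, Integer.parseInt(name.substring(1)));
                }
            }
        }
        return new Path(dailyRoot, String.format("r%03d", maxRev + 1));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        Path dailyRoot = new Path("/data/agg/daily/" + day); // assumed layout
        Path output = nextRevision(fs, dailyRoot);            // pass this to the MR job as its output dir
        System.out.println("Writing this run's output to " + output);
    }
}
```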

Partitioned Data Set: Processing Flow

Workflow: Opswise scheduler

Getting Data in and out
- Mix of .NET and Java applications (80%+ .NET, 20% Java).
  o How to make .NET applications write logs to Hadoop?
- Some 3rd-party applications consume the results of MapReduce jobs (e.g. a reporting application).
  o How to make 3rd-party or internal legacy applications read data from Hadoop?

Getting Data in and out: distcp
- Hadoop distcp: HDFS → /mnt/abc (network share).
- Easy to start – just allocate storage on the network share.
- But…
  o Difficult to maintain if there are more than 10 types of data to copy.
  o Needs extra storage, outside of HDFS (oxymoron!).
  o Extra step in processing.
  o Clean-up.

Getting Data in and out: WebDAV driver
- The WebDAV server is part of the Hadoop source code tree.
- Needed some minor clean-up; was co-developed with IponWeb and is available.
- There are multiple commercial Windows WebDAV clients you can use (we use WebDrive).
- Linux mount modules are also available.

Getting Data in and out: WebDAV

WebDAV and compression
But your results are compressed… Options:
- Decompress files on HDFS – an extra step again.
- Refactor your application to read compressed files…
  o Java – OK.
  o .NET – much more difficult; cannot decompress SequenceFiles.
  o 3rd party – not possible.

WebDAV and compression
Solution – extend WebDAV to support compressed SequenceFiles:
- The same driver can provide compressed and uncompressed files.
- If a file with the requested name foo.bar exists – return it as is.
- If foo.bar does not exist – check whether there is a compressed version foo.bar.seq; uncompress it on the fly and return it as if it were foo.bar (a sketch of this lookup follows below).
Outstanding issues:
- Temporary files are created on the Windows client side.
- There are no native Hadoop (de)compression codecs on Windows.
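Here is a minimal sketch of the fallback logic described above, assuming the SequenceFile values are Text lines. The method name and the way it would plug into the WebDAV server are illustrative, not the actual driver code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Illustrative sketch of the "foo.bar or foo.bar.seq" lookup: serve the plain file if it
// exists, otherwise decompress the SequenceFile on the fly and serve its values as text.
public class SequenceFileFallback {

    public static InputStream open(FileSystem fs, Path requested, Configuration conf)
            throws Exception {
        if (fs.exists(requested)) {
            return fs.open(requested); // plain file: return as is
        }
        Path compressed = requested.suffix(".seq");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, compressed, conf);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Text value = new Text(); // assumes the values are Text lines
            while (reader.next(key, value)) {
                buffer.write(value.toString().getBytes("UTF-8"));
                buffer.write('\n');
            }
        } finally {
            reader.close();
        }
        // A production driver would stream instead of buffering the whole file in memory.
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}
```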

QlikView Reporting Application
- Load from TXT files is supported.
- In-memory DB.
- AJAX support for integration into web portals.