Titel der Folie Datum | zanox Group Autor | Position Product Search and Reporting powered by Hadoop 2. March 2011 | Dr. Dragan Milosevic.

Slides:



Advertisements
Similar presentations
Meet Hadoop Doug Cutting & Eric Baldeschwieler Yahoo!
Advertisements

A Lightweight Platform for Integration of Mobile Devices into Pervasive Grids Stavros Isaiadis, Vladimir Getov University of Westminster, London {s.isaiadis,
Media6. Who We Are Media6° is an Online Advertising Company Specializing in Social Graph Targeting –Birds of a feather flock together! –We build.
Ada, Model Railroading, and Software Engineering Education John W. McCormick University of Northern Iowa.
Making the System Operational
MINAP: DATA HANDLING PROCEDURES & DATA ACCESS Data Management Group, 13 July 2009.
Tintu David Joy. Agenda Motivation Better Verification Through Symmetry-basic idea Structural Symmetry and Multiprocessor Systems Mur ϕ verification system.
Publish-Subscribe Approach to Social Annotation of News Top-k Publish-Subscribe for Social Annotation of News Joint work with: Maxim Gurevich (RelateIQ)
LABELING TURKISH NEWS STORIES WITH CRF Prof. Dr. Eşref Adalı ISTANBUL TECHNICAL UNIVERSITY COMPUTER ENGINEERING 1.
Data Freeway : Scaling Out to Realtime Author: Eric Hwang, Sam Rash Speaker : Haiping Wang
Configuration management
Software change management
Introduction to Hadoop Richard Holowczak Baruch College.
Effectively applying ISO9001:2000 clauses 6 and 7.
Database Performance Tuning and Query Optimization
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
Page 1 October 31, 2000 An Introduction to Large-Scale Software Development Steve Varnau Core HP-UX Operation October 31, 2000.
1 World Meteorological Organization Working together in weather, climate and water Sixteenth World Meteorological Congress Information Technology Support.
Lecture 1: Software Engineering: Introduction
BT Managed Compute Solution Overview April 2013 Chris Faulkner Head of Managed Hosting Rob Ellison Managed Hosting Architect.
ArrayExpress Query Interface Gonzalo Garc í a Lara January, / 24.
Lecture 8: Testing, Verification and Validation
Chapter 10 Software Testing
RED-PD: RED with Preferential Dropping Ratul Mahajan Sally Floyd David Wetherall.
Executional Architecture
Strategy Review Meeting Strategy Review Meeting
Dan Bassett, Jonathan Canfield December 13, 2011.
12 January 2009SDS batch generation, distribution and web interface 1 ExESS IT tool for SDS batch generation, distribution and web interface ExESS IT tool.
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Choosing an Order for Joins
Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc
Iowa Code and Rules Easy Navigation and Search Scope Analysis &Planning Phases Completed Request for Execution Funding.
Performance Considerations of Data Acquisition in Hadoop System
Application of Ensemble Models in Web Ranking
Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Business Intelligence Dr. Mahdi Esmaeili 1. Technical Infrastructure Evaluation Hardware Network Middleware Database Management Systems Tools and Standards.
Undergraduate Poster Presentation Match 31, 2015 Department of CSE, BUET, Dhaka, Bangladesh Wireless Sensor Network Integretion With Cloud Computing H.M.A.
The Chinese University of Hong Kong. Research on Private cloud : Eucalyptus Research on Hadoop MapReduce & HDFS.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Introduction to Hadoop and HDFS
O’Reilly – Hadoop: The Definitive Guide Ch.1 Meet Hadoop May 28 th, 2010 Taewhi Lee.
Semantic Interoperability Berlin, 25 March 2008 Semantically Enhanced Resource Allocator Marc de Palol Jorge Ejarque, Iñigo Goiri, Ferran Julià, Jordi.
CSE 548 Advanced Computer Network Security Document Search in MobiCloud using Hadoop Framework Sayan Cole Jaya Chakladar Group No: 1.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
RDFPath: Path Query Processing on Large RDF Graph with MapReduce Martin Przyjaciel-Zablocki et al. University of Freiburg ESWC May 2013 SNU IDB.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
Next Generation of Apache Hadoop MapReduce Owen
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Large-scale file systems and Map-Reduce
High Performance Computing on an IBM Cell Processor --- Bioinformatics
CS110: Discussion about Spark
Introduction to Apache
Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin |
Presentation transcript:

Titel der Folie Datum | zanox Group Autor | Position Product Search and Reporting powered by Hadoop 2. March 2011 | Dr. Dragan Milosevic

zanox2 Who am I? Senior Architect at zanox AG –Over the last three years I have been writing map-reduce jobs which help applications cope with millions of products and billions of clicks I have applied different machine-learning techniques mainly to optimise resource usage while performing distributed search during my PhD –See my book: Beyond Centralised Search Engines An Agent-Based Filtering Framework

zanox3 What is it about? Part I: Processing product and tracking data by Map-Reduce –Normalising and categorising product data –Joining and aggregating tracking data Part II: Lucene-powered distributed search and aggregation –Merger-based coordination of multiple searchers –Observer infrastructure to ensure robust and reliable services Part III: Technical details –Hardware, how much data, number of jobs, how many requests

zanox4 Part I: Product Normalisation Problem: Manufacturer names are not normalised in imported data –Single manufacturer has sometimes more than 50 different names –There are more than 1 million different names, which are too much for exhaustive comparison Solution: Divide-and-Conquer to make it suitable for Map-Reduce –Use fast clustering that puts together potentially identical names –Each Map task applies on cluster-level several distance computation algorithms: –Coding-based (Soundex) – code(samsung) = s525 –Edit-distance (Levenshtein) – d(gumbo,gambol) = 2 –N-gram-based – code(samsung)={sa,am,ms,su,un,ng} –Suffix-Tree-based (Longest-Common-Substring) d(megaphon importservice, import megaphon) = = 14

zanox5 Part I: Product Categorisation Every category (out of 600) has been assigned language specific-model to be used in categorisation process –Models are compact and suitable to be loaded in memory –They can be seen as collection of words and phrases together with heuristic-rules helping to correctly categorise –Models are semi-automatically updated to improve categorisation Compact models are loaded by Map tasks –Markov-Chain-based language detection of a product to select model –Appling rules to reduce the set of possible categories –Computing scores based on word and phrase belongingness

zanox6 Part I: Joining and Aggregating Tracking Data Custom Report Definition Custom Tracking Data Definition Custom Tracking Data Lucene Indexes zanox Tracking Data Search Engine Data = Map-Reduce Inputs Map-Reduce Outputs

zanox7 Part II: Distributed Search and Aggregation Problem: Indexes are so large that they cannot be handled by a single machine –Combined size of daily produced indexes is over 600 GB –Neither searching nor aggregation can be done by one machine Solution: Distributed search –Indexes are loaded by several Lucene searchers –Searchers are capable of finding matching documents, building facets, aggregating (reducing) selected data –Mergers select searchers to be used, adapt query to be sent to every searcher and aggregate results received from searchers –Observers control how searchers and mergers are performing

zanox8 Part II: Merger-Based Coordination of Searchers Report Sub-request I Sub-request II Sub-request III Sub-request IV Request Sub-report I Sub-report II Sub-report III Sub-report IV Each Report Searcher = Hadoop RPC Server + Adapted Lucene Index Searcher + Report Aggregator Report Merger = Hadoop RPC Server + Hadoop RPC Client + Report Aggregator Report Generator & Web Service = EJB + Hadoop RPC Client

zanox9 Part III: Technical Details - Hardware 3 machines = Single Core + 1GB RAM + 40GB HD 62 machines ( ) = 8 Core + 16GB RAM + 4 x 1TB HD ?

zanox10 Part III: Technical Details – Data, Jobs, Queries Data in HDFS –Data volume growing by 80 GB/day (30 million clicks, 600 million views and 2 million product updates) –600 GB Lucene indexes built on daily basis –Total data volume of 24 TB for 18 billion clicks, 240 billion views and 120 million products Jobs –More than 800 scheduled jobs per day Queries –10 queries per second and more than 30 million queries in the last 2 months

Thank you for your attention

zanox12 21 Different Samsung Manufacturers Samsung", "SAMSUNG - MONITORS", "SAMSUNG - PLASMA", "SAMSUNG - PRESENTATION", "Samsung (Electronics)", "SAMSUNG (SA)", "Samsung Books", "Samsung BW", "SAMSUNG by NORTEK", "SAMSUNG Compatible", "SAMSUNG COMPUTER", "SAMSUNG DEUTSCHLAND", "Samsung Music", "Samsung Notebook", "SAMSUNG, TELECOM "Samsung Opto-Electronics UK Ltd.", "SAMSUNG ORIGINAL", "SAMSUNG PLEOMAX", "SAMSUNG SEMICONDUCTOR", "SAMSUNG SGH-E390", "Samsung UK Ltd and "Samsung WW.