Search at Scale: Hadoop, Katta & Solr, the s'mores of Search


Issues
  Managing the raw data
  Building indexes
  Handling updates
  Reliable search
  Search latency

The Tools
  HDFS for raw data storage
  SolrIndexWriter for building indexes (SOLR-1301)
  Katta for search latency
  Katta for reliable search
  Brute-force Map/Reduce for index updates
  Near-real-time updates – Jason Rutherglen

HOW-TO
  SolrRecordWriter
  Solr config with schema
  An implementation of SolrDocumentConverter
  A Hadoop cluster you can trash; wrong tuning will crash your machines.
  ZipFile output – some compression, reduces the number of files in your HDFS, easy deployment. Use jar xf to unpack; zip will fail.

SolrRecordWriter and your Cluster
Each SolrRecordWriter instance uses substantial quantities of system resources:
  Processor – analyzing the input records
  Memory – buffering the processed index records
  I/O – optimize saturates storage devices
Be very careful about how many instances you have running per machine.
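A rough way to reason about "how many instances per machine" is to bound the count by memory and by CPU and take the smaller of the two. The sketch below is only an illustration under stated assumptions: the function name, its parameters, and the default reserve are hypothetical and come from neither the talk nor SOLR-1301. It also ignores I/O, which the slide singles out as the resource that optimize saturates, so the real safe number may be lower still.

```python
def max_writer_instances(cores, ram_gb, heap_per_writer_gb,
                         reserved_ram_gb=2.0, cores_per_writer=1.0):
    """Rough upper bound on concurrent SolrRecordWriter tasks per machine.

    All parameters are assumptions supplied by the operator:
      cores               - CPU cores on the machine
      ram_gb              - physical memory on the machine
      heap_per_writer_gb  - memory budgeted for each writer task
      reserved_ram_gb     - memory left for the OS and Hadoop daemons
      cores_per_writer    - cores budgeted for each writer task
    """
    if heap_per_writer_gb <= 0 or cores_per_writer <= 0:
        raise ValueError("per-writer requirements must be positive")
    # How many writers fit in the memory that is left after the reserve.
    by_memory = int((ram_gb - reserved_ram_gb) // heap_per_writer_gb)
    # How many writers the CPU budget allows.
    by_cpu = int(cores // cores_per_writer)
    # The scarcer resource decides; never report a negative count.
    return max(0, min(by_memory, by_cpu))


if __name__ == "__main__":
    # Hypothetical machine: 8 cores, 16 GB RAM, 4 GB per writer.
    print(max_writer_instances(cores=8, ram_gb=16, heap_per_writer_gb=4))
```

Treat the result as a ceiling to test against on a cluster you can afford to crash, not as a recommended setting.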

Katta
  Distributed search
  Replicated indexes
  Fault tolerant
  Direct deployment from HDFS

Katta Issues
  Solr is a pig; run few instances per machine.
  Large indexes can take time to copy in and start, consuming substantial I/O resources.
  Use hftp: to reference your indexes; it passes through firewalls and is independent of the HDFS version.
  Use one of the balancing distribution policies.
  Nodes don't handle Solr OOMs gracefully.

Search Latency
  Run as many replicas of your indexes as needed to ensure that your latency is low enough.
  Run as many Solr front ends as needed to manage latency.
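One simple way to put a number on "as many replicas as needed" is a throughput estimate: measure how many queries per second a single replica can serve while latency stays acceptable, then divide the expected peak load by that figure. The sketch below assumes that model; the function name and the example figures are hypothetical, not measurements from the talk.

```python
import math


def replicas_needed(peak_qps, qps_per_replica):
    """Estimate replicas per index from a simple throughput model.

    peak_qps        - expected peak queries per second (assumed input)
    qps_per_replica - queries per second one replica sustains while
                      latency stays within target (measured by you)
    """
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    # Round up, and always keep at least one replica deployed.
    return max(1, math.ceil(peak_qps / qps_per_replica))


if __name__ == "__main__":
    # Hypothetical load: 250 queries/s peak, 100 queries/s per replica.
    print(replicas_needed(250, 100))
```

The estimate covers latency only; replicas kept for fault tolerance would be added on top.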

Solr Issues
  Poorly chosen facets can cause OOMs; be careful.
  Solr is slow to start, so rolling new indexes in takes time.
  Solr is a black box to Katta, unlike Lucene, which is intimate.

Updates
  Brute force: rebuild the entire corpus and redeploy
  Distribute updates to deployed indexes (not implemented)
  Merge indexes (Jason Rutherglen)
  Distribute new indexes and handle the merge in the fronting Solr instances (not implemented)

Code and Configurations
  We run a 12-node Katta cluster, with 3 masters and 3 ZooKeeper machines, for a total of 18 machines.
  We give each Katta node JVM 4 GB of heap.
  I run 1-3 Solr front end instances with 6 GB of heap.
  Code and configurations will be on for members.
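The arithmetic behind that cluster description can be checked in a few lines. The sketch assumes one Katta node JVM per machine and that masters and ZooKeeper run on their own machines, which is what the total of 18 implies; the variable names are illustrative only.

```python
# Figures taken from the slide.
KATTA_NODES = 12
MASTERS = 3
ZOOKEEPERS = 3
KATTA_HEAP_GB = 4
SOLR_FRONT_END_HEAP_GB = 6
SOLR_FRONT_ENDS_MIN, SOLR_FRONT_ENDS_MAX = 1, 3

# Assumption: every role runs on its own machine.
total_machines = KATTA_NODES + MASTERS + ZOOKEEPERS

# Assumption: one Katta node JVM per Katta machine.
katta_heap_total_gb = KATTA_NODES * KATTA_HEAP_GB

# Heap committed to Solr front ends, at the low and high end of the range.
solr_heap_range_gb = (SOLR_FRONT_ENDS_MIN * SOLR_FRONT_END_HEAP_GB,
                      SOLR_FRONT_ENDS_MAX * SOLR_FRONT_END_HEAP_GB)

if __name__ == "__main__":
    print(total_machines, katta_heap_total_gb, solr_heap_range_gb)
```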