Distributed Indexing of Web Scale Datasets for the Cloud (presentation transcript, Computing Systems Laboratory, 8/9/2015)

Distributed Indexing of Web Scale Datasets for the Cloud
Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos, Nectarios Koziris
{ikons, eangelou, dtsouma,
Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Greece

Problem
Data explosion era: increasing data volumes (e-mail and web logs, historical data, click streams) push classic RDBMSs to their limits. Cheaper storage and bandwidth enable the growth of publicly available datasets:
– Internet Archive's Wayback Machine
– Wikipedia
– Amazon public datasets
Centralized indices are slow to create and do not scale.

Our contribution
A distributed processing framework to index, store and serve web-scale content under heavy request loads:
– Simple indexing rules
– A NoSQL and MapReduce combination: MapReduce jobs process the input to create the index, and both the index and the content are served through a NoSQL system

Goals 1/2
Support for almost any type of dataset
– Unstructured: plain text files
– Semi-structured: XML, HTML
– Fully structured: SQL databases
Near real-time query response times
– Query execution times should be in the order of milliseconds

Goals 2/2
Scalability (preferably elastic)
– Storage space
– Concurrent user requests
Ease of use
– Simple index rules
– Meaningful searches, e.g. find conferences whose title contains "cloud" and that were held in California

Architecture 1/2
The raw dataset is uploaded to HDFS. The dataset, together with the index rules, is fed to the Uploader, which creates the Content table.

Architecture 2/2
The Content table is fed to the Indexer, which extracts the Index table. The client API contacts the Index table to perform searches and the Content table to serve objects.

Index rules
Index rules specify record boundaries and attribute types:
– Record boundaries split the input into distinct entities that are used as processing units
– Attribute types are the record regions to index: a specific XML tag, an HTML table, a database column
Example record (from the slide figure): Ioannis Konstantinou 26/04/2010
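As an illustration of these two notions, here is a minimal Python sketch. The tag names, helper names and sample data are hypothetical and do not reflect the system's actual rule syntax: a record boundary splits the raw input, and an attribute type selects the region of each record to index.

```python
import re

# Hypothetical rule: records are delimited by a boundary tag, and one
# attribute type (an inner tag) is selected for indexing.
def split_records(text, boundary_tag):
    """Record boundaries: split the input into distinct record entities."""
    pattern = re.compile(r"<%s>(.*?)</%s>" % (boundary_tag, boundary_tag), re.S)
    return pattern.findall(text)

def extract_attribute(record, attr_tag):
    """Attribute types: pick the record region to index (here, an XML tag)."""
    m = re.search(r"<%s>(.*?)</%s>" % (attr_tag, attr_tag), record, re.S)
    return m.group(1) if m else None

data = ("<person><name>Ioannis</name><date>26/04/2010</date></person>"
        "<person><name>Evangelos</name><date>10/05/2009</date></person>")
records = split_records(data, "person")
names = [extract_attribute(r, "name") for r in records]
```

The same pair of rules would carry over to HTML tables or database columns; only the boundary and attribute extractors change.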

Uploader 1/2
A MapReduce class that processes the input data from HDFS to create the Content table. Mappers emit a key-value pair for each encountered record.
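A minimal Python sketch of the mapper's logic (the function name is hypothetical; the real Uploader is a Hadoop MapReduce class): each record is emitted as a key-value pair keyed by the MD5 hash of its content.

```python
import hashlib

def uploader_map(record):
    """Mapper sketch: emit one (MD5(content), content) pair per record."""
    key = hashlib.md5(record.encode("utf-8")).hexdigest()
    return key, record

key, value = uploader_map("Ioannis Konstantinou 26/04/2010")
```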

Uploader 2/2
Reducers lexicographically sort the incoming key-values according to the MD5 hash. Results are stored in HDFS in the HFile format, and HBase is informed about the new Content table.

Content table
Acts as a record hashmap: each row contains a single record item, and the row key is the MD5 hash of the record content. This allows fast random-access reads after successful index searches, and incremental content addition or removal.

Row key (MD5 hash)                 | Row value (record content)
2da0ae7cb598ac8e…a9c2f19fe         | Ioannis Konstantinou …
223c14b2a8c7bbe24ba0d6854dd6f3cc   | Evangelos Konstantinou …
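The hashmap behaviour described above can be sketched in a few lines of Python (the sample records are hypothetical; the real table lives in HBase):

```python
import hashlib

def md5_key(record):
    """Row key: the MD5 hash of the record content."""
    return hashlib.md5(record.encode("utf-8")).hexdigest()

# Hypothetical records; each row holds a single record item.
records = ["Ioannis Konstantinou", "Evangelos Konstantinou"]
content_table = {md5_key(r): r for r in records}

# A successful index search yields an MD5 key; serving the object is
# then a single fast random-access read.
hit = md5_key(records[0])
served = content_table[hit]
```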

Indexer 1/3
Creates an inverted list of index terms and document locations (the Index table). A MapReduce class that reads the Content table and creates the Index table.

Indexer 2/3
Mappers read the Content table and emit key-value pairs. Reducers lexicographically sort the incoming key-values according to the keyword value.

Indexer 3/3
Reducers aggregate the key-values into key–list-of-values pairs. Results are stored in HDFS in the HFile format, and HBase is informed about the new Index table.
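Putting the three Indexer slides together, a toy Python sketch of the index construction (the whitespace tokenizer and sample data are hypothetical simplifications of the real MapReduce job):

```python
import hashlib
from collections import defaultdict

def md5_key(record):
    return hashlib.md5(record.encode("utf-8")).hexdigest()

def build_index(content_table, attribute):
    """Aggregate (term_attribute -> list of record MD5 hashes), i.e. the
    inverted list that becomes the Index table."""
    index = defaultdict(list)
    for row_key, record in content_table.items():
        for term in record.split():  # toy whitespace tokenizer
            index["%s_%s" % (term, attribute)].append(row_key)
    return dict(index)

records = ["Ioannis Konstantinou", "Evangelos Konstantinou"]
content_table = {md5_key(r): r for r in records}
index_table = build_index(content_table, "name")
```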

Index table
The row key is an index term followed by the attribute type ("term"_"attribute type"); the row content is the list of MD5 hashes of the records that contain this specific keyword in a specific attribute.

Row key ("term"_"attribute type") | Row value (list of MD5 hashes)
…_date                            | 2da0ae7cb598ac8e…a9c2f19fe
…_date                            | 223c14b2a8c7bbe24ba0d6854dd6f3cc
Konstantinou_name                 | 2da0ae7cb598ac8e…a9c2f19fe, 223c14b2a8c7bbe24ba0d6854dd6f3cc

Client API 1/3
Built on top of the HBase client API. Provides basic search and get operations, plus Google-style free-text queries over multiple indexed attributes.

Client API 2/3
Search for a keyword using the Index table:
– Exact match (HBase Get)
– Prefix search (HBase Scan)
Retrieve an object from the Content table using a simple HBase Get operation.
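A Python sketch of the two lookup styles (row keys and the "h1".."h4" hashes are placeholders; the real operations are HBase Get and Scan calls): an exact match is a single-key lookup, while a prefix search walks the lexicographically sorted row keys, which is exactly what an HBase Scan exploits.

```python
import bisect

# Placeholder index rows: "h1".."h4" stand in for MD5 hashes.
index_table = {
    "Konstantinou_name": ["h1", "h2"],
    "Koziris_name": ["h3"],
    "cloud_title": ["h4"],
}
sorted_keys = sorted(index_table)  # HBase keeps rows sorted by row key

def get(term, attribute):
    """Exact match: one lookup, analogous to an HBase Get."""
    return index_table.get("%s_%s" % (term, attribute), [])

def prefix_scan(prefix):
    """Prefix search: walk the sorted keys, analogous to an HBase Scan."""
    hits = []
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        hits.extend(index_table[key])
    return hits
```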

Client API 3/3
Supports AND-ing and OR-ing of search terms through client-side processing.
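A sketch of this client-side processing in Python (placeholder hashes): AND-ing intersects the per-keyword result lists, OR-ing unions them.

```python
def and_query(result_lists):
    """AND: keep only records present in every per-keyword result list."""
    return set.intersection(*(set(r) for r in result_lists))

def or_query(result_lists):
    """OR: keep records present in at least one result list."""
    return set.union(*(set(r) for r in result_lists))

hits_a = ["h1", "h2"]  # placeholder hashes returned for keyword A
hits_b = ["h2", "h3"]  # placeholder hashes returned for keyword B
```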

Experimental Setup
11 worker nodes, each with:
– 8 CPUs at 2 GHz
– 8 GB RAM, with page swapping disabled
– 500 GB storage
A total of 88 CPUs, 88 GB of RAM and 5.5 TB of hard disk space.
Hadoop version …
HBase version …
– Contributed 3 bug fixes for this version

Hadoop-HBase Configuration
Hadoop and HBase were each given 1 GB of RAM. Each map/reduce task was given 512 MB of RAM. Each worker node could concurrently spawn 6 maps and 2 reduces, for a total of 66 maps and 22 reduces. Hadoop's speculative execution was disabled.

Datasets 1/2
Downloaded from Wikipedia's dump service and from Project Gutenberg's custom DVD creation service.
Structured: a 23 GB MySQL database dump of the latest English Wikipedia.

Datasets 2/2
Semi-structured (XML, HTML):
– 150 GB: the XML part of a 2.55 TB dump of every English wiki page along with its revisions up to May 2008
– 150 GB: HTML from a static Wikipedia version
Unstructured: a 20 GB full dump of all languages of Gutenberg's text document collection (… files).

Real-life Query Traffic
Publicly available AOL dataset:
– 20 M keywords
– 650 K users
– 3-month period
Clients created both point and prefix queries following a Zipfian pattern.
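A hedged Python sketch of how such a Zipfian query stream could be generated (the keyword list and seed are hypothetical; the experiments drew their keywords from the AOL dataset):

```python
import random

def zipf_weights(n, s=1.0):
    """Zipfian popularity: the k-th most popular keyword has weight 1/k^s."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]

def sample_queries(keywords, count, seed=42):
    """Draw a skewed query stream: a few keywords dominate the traffic."""
    rng = random.Random(seed)
    return rng.choices(keywords, weights=zipf_weights(len(keywords)), k=count)

queries = sample_queries(["cloud", "hadoop", "hbase", "index"], 1000)
```

Such skew is what makes HBase's block caching effective in the load experiments below: popular keywords hit cached index rows.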

Content table creation time
Time increases linearly with data size. The DB dataset is created faster than TXT, as less processing is required.

Index Table size 1/2
XML-HTML: index growth gets smaller as the dataset increases. For the same dataset size, the XML index is larger than the HTML index.

Index Table size 2/2
TXT-DB: the TXT index is bigger for the same dataset size, due to the diversity of terms in the Gutenberg TXT dataset.

Index Table creation time 1/2
XML is more demanding than HTML: a lot of HTML code gets stripped during indexing.

Index Table creation time 2/2
Time increases linearly with data size. The DB dataset is indexed faster than TXT, as less processing is required.

Indexing Scalability (23 GB SQL)
Indexing speed is proportional to the number of processing nodes, a typical cloud application requirement: extra nodes can be acquired from a cloud vendor easily and inexpensively.

Query Performance
3 types of queries:
– Point: keyword search in one attribute, e.g. keyword_attribute
– Any: keyword search in any attribute, e.g. keyword_*
– Range: prefix query in any attribute, e.g. keyword*
Traffic was created concurrently by 14 machines. We measured mean query response time for different:
– loads (queries/sec)
– dataset sizes
– numbers of indexed attribute types
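The three query types can be sketched in Python against a toy index table (keys and the "h1".."h3" hashes are placeholders; the real system issues HBase Gets and Scans):

```python
# Toy index table; "h1".."h3" are placeholder MD5 hashes.
toy_index = {
    "cloud_title": ["h1"],
    "cloud_abstract": ["h2"],
    "cluster_title": ["h3"],
}

def point_query(index, keyword, attribute):
    """Point: keyword search in one attribute (keyword_attribute)."""
    return index.get("%s_%s" % (keyword, attribute), [])

def any_query(index, keyword, attributes):
    """Any: keyword search in any indexed attribute (keyword_*)."""
    return [h for a in attributes for h in point_query(index, keyword, a)]

def range_query(index, prefix):
    """Range: prefix query in any attribute (keyword*)."""
    return [h for key in sorted(index) if key.startswith(prefix)
            for h in index[key]]
```

The cost ordering in the next slides follows directly: a point query is one row read, an any query probes one row per indexed attribute, and a range query must scan every row key sharing the prefix.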

Response time vs query load 1/2
Range query loads above 140 queries/sec failed. Response times for a load of 14 queries/sec:
– Point queries: 20 ms
– Any-attribute queries: 150 ms
– Range queries: 27 s

Response time vs query load 2/2
HBase caching takes effect due to the skewed workload:
– Up to 100 queries/sec there are enough client channels
– Between 100 and 1000 queries/sec caching is significant
– Beyond 1000 queries/sec response time increases exponentially

Response time vs dataset size
At an average load of 14 queries/sec, range queries are more expensive, while response times for exact queries remain constant.

Related Work 1/2
MapReduce-based data analysis frameworks: Yahoo's Pig, Facebook's Hive and HadoopDB. Analytical jobs are described in a declarative scripting language, translated into a chain of MapReduce steps and executed on Hadoop. Query response times are in the order of minutes to hours.

Related Work 2/2
Distributed indexing frameworks based on Hadoop:
– Ivory distributes only the index creation, but the created index is served through a simple DB
– HIndex distributes the indices through HBase, but the index creation is centralized
In our system, both index creation and index serving are done in a distributed way.

Questions