Geoffrey Architecture for real-time ad-hoc query on distributed filesystems.

Motivation
Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
Solve “I don’t know what I don’t know”
This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)

Scaling search with classic sharding

Classic “side system” approach
Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” (Merriam-Webster)
[Diagram: Hadoop, Search Cluster (×2), ?????]

Classic “search toolkit”
Built around fulltext use case
Inverted indexes optimized for on-the-fly ranking of results
  – TF-IDF
  – Okapi BM25
Yet never able to fully realize Google-style search capability
Issues:
  – Phrase detection
  – Pseudo synonymy
  – Open loop architecture

Big data ad-hoc query
Not typically a fulltext “document search” problem
Data is structured, mixed structured, and denormalized
  – Log lines
  – JSON records
  – CSV files
  – Hadoop native formats (SequenceFile)
Ranking is explicit (ORDER BY), not relevance based
Sometimes “needle in haystack” (support, debugging)
Sometimes “haystack in haystack” (summary analytics, segmentation)

Dremel MPP query execution tree

Finer points of Dremel architecture
MapReduce friendly
In-situ approach is DFS friendly
Excels at aggregation. Not so much for needle-in-haystack.
Column storage format accelerates MapReduce (less extraneous data pushed through)
But in some regards still a “side system”
Applications must explicitly store their data in a columnar format
“Massive” is both a benefit and a hazard
  – Complex (operationally and WRT query execution)
  – Queries can execute quickly… on huge clusters
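
To make the "less extraneous data pushed through" point concrete, here is a minimal sketch in Python. It is not Dremel's storage format; the sample records, field names, and helper functions are all made up for illustration. It only counts how many field values an aggregation has to touch in a row layout versus a column layout.

```python
import csv
import io

# Hypothetical sample data; the field names are assumptions for this sketch.
RAW = "id,city,status,bytes\n1,athens,200,512\n2,sparta,404,128\n3,athens,200,2048\n"

def to_rows(text):
    """Row layout: one dict per record, all fields kept together."""
    return list(csv.DictReader(io.StringIO(text)))

def to_columns(rows):
    """Column layout: one list per field, values stored contiguously."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = to_rows(RAW)
columns = to_columns(rows)

# Row layout: summing one field still pulls every field of every record.
row_values_read = sum(len(record) for record in rows)
total_from_rows = sum(int(record["bytes"]) for record in rows)

# Column layout: the same sum reads only the one column it needs.
col_values_read = len(columns["bytes"])
total_from_columns = sum(int(value) for value in columns["bytes"])

print(row_values_read, col_values_read, total_from_rows, total_from_columns)
```

Both layouts give the same total, but the column layout reads one value per record instead of every field. That is the effect the slide attributes to columnar storage.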

Crawled In-Situ Index Architecture
[Diagram: HDFS, MapReduce, Data, Crawl, In-situ Index, SimpleSearch, Application, Hadoop]

Benefits to crawled In-Situ index
No changes to application data format
  – CSV
  – JSON
  – SequenceFile
Clear “separation of concerns” between data and index
Indexes become “disposable”: easily built, easily thrown away
There is no “side system” that needs to be maintained
Use the MapReduce “hammer” to pound a nail

Architect for Elasticity
[Diagram: AWS S3, Elastic MapReduce, JetS3t, EC2 M1.large (×2), Application, Crawl, Index, HTTP]
Interesting: you don’t actually need to have Hadoop installed…

Declarative Crawl Indexing
[Diagram: HDFS, MapReduce, Data, Crawl, In-situ Index, SimpleSearch, Application, Hadoop]
Parse.json: { "filter":"column[4]==\"athens\"" }
Indexer reads declarative instructions from in-situ file
“Pull” vs. traditional “push” indexing approach
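
The sketch below shows one way a crawler could read a declarative instruction file like the one on this slide and apply it to CSV data in place. The slide does not define the filter language, so this handles only the single form shown (`column[N]=="value"`). It assumes a zero-based column index, and `compile_filter`, `crawl`, and the sample data are hypothetical names made up for illustration.

```python
import csv
import json
import re

# Assumed grammar: only the one expression form that appears on the slide.
FILTER_PATTERN = re.compile(r'^column\[(\d+)\]\s*==\s*"(.*)"$')

def compile_filter(expression):
    """Turn 'column[N]=="value"' into a predicate over a parsed CSV row."""
    match = FILTER_PATTERN.match(expression)
    if match is None:
        raise ValueError("unsupported filter expression: " + expression)
    index, expected = int(match.group(1)), match.group(2)

    def predicate(row):
        # Assumption: column indexes are zero-based.
        return len(row) > index and row[index] == expected

    return predicate

def crawl(instructions_text, data_text):
    """Return byte offsets of the data lines that satisfy the filter."""
    instructions = json.loads(instructions_text)
    keep = compile_filter(instructions["filter"])
    offsets = []
    offset = 0
    for line in data_text.splitlines(keepends=True):
        row = next(csv.reader([line]), [])
        if keep(row):
            offsets.append(offset)
        offset += len(line.encode("utf-8"))
    return offsets

# Instruction file content as shown on the slide; the data is made up.
PARSE_JSON = r'{ "filter":"column[4]==\"athens\"" }'
DATA = (
    "1,2013-01-01,GET,200,athens\n"
    "2,2013-01-01,GET,200,sparta\n"
    "3,2013-01-02,PUT,500,athens\n"
)

matching_offsets = crawl(PARSE_JSON, DATA)
print(matching_offsets)
```

The crawler pulls its instructions from a file stored next to the data, so the application never pushes records into a search system. The result is a list of positions in the original file, not copies of the rows.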

Thin index
Index size is small because data is a holistic part of the system
Data does not need to be “put into” the search system and replicated in the index.
[Diagram: HDFS, MapReduce, Data, Crawl, In-situ Index, Data, Index]
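
As a rough illustration of the thin-index idea, the sketch below keeps only byte offsets in the index and reads matching records back from the original data. It is not the presenter's implementation: an in-memory byte string stands in for a file on HDFS, and `build_thin_index` and `fetch` are hypothetical helpers.

```python
import io
from collections import defaultdict

# Stand-in for an in-situ data file; the contents are made up.
DATA = b"1,athens,200\n2,sparta,404\n3,athens,500\n"

def build_thin_index(data, column):
    """Map each value of one column to the byte offsets of its records."""
    index = defaultdict(list)
    offset = 0
    for line in io.BytesIO(data):
        fields = line.rstrip(b"\n").split(b",")
        index[fields[column]].append(offset)
        offset += len(line)
    return dict(index)

def fetch(data, offset):
    """Read one record from the original data at the given offset."""
    stream = io.BytesIO(data)
    stream.seek(offset)
    return stream.readline().rstrip(b"\n").decode("utf-8")

index = build_thin_index(DATA, column=1)
records = [fetch(DATA, offset) for offset in index[b"athens"]]
print(index[b"athens"], records)
```

The index holds only keys and offsets, and the record bodies stay in the data file. This keeps the index small and cheap to rebuild or throw away.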

Lazy data loading
[Diagram: HDFS, MapReduce, Data, Crawl, Execution Runtime (×2), Data, Index, LRU Index Cache (×2), Lazy Pull]
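
The slide gives only diagram labels, so the following is a guess at the behaviour behind "LRU Index Cache" and "Lazy Pull", not a description of the actual runtime. In this sketch an index segment is loaded only when a query first asks for it, and the least recently used segment is evicted once the cache is full. `LazyIndexCache` and `load_segment` are hypothetical names.

```python
from collections import OrderedDict

class LazyIndexCache:
    """LRU cache that pulls index segments on first use (illustrative only)."""

    def __init__(self, loader, capacity):
        self._loader = loader          # callable that fetches a segment by id
        self._capacity = capacity      # maximum number of cached segments
        self._segments = OrderedDict() # ordered oldest to most recently used
        self.loads = 0                 # how many times the loader was invoked

    def get(self, segment_id):
        if segment_id in self._segments:
            # Cache hit: mark the segment as most recently used.
            self._segments.move_to_end(segment_id)
            return self._segments[segment_id]
        # Cache miss: lazily pull the segment from storage.
        segment = self._loader(segment_id)
        self.loads += 1
        self._segments[segment_id] = segment
        if len(self._segments) > self._capacity:
            # Evict the least recently used segment.
            self._segments.popitem(last=False)
        return segment

    def cached_ids(self):
        return list(self._segments)

def load_segment(segment_id):
    # Placeholder for reading an index segment from the distributed filesystem.
    return {"segment": segment_id}

cache = LazyIndexCache(load_segment, capacity=2)
for segment_id in ["a", "b", "a", "c", "b"]:
    cache.get(segment_id)

print(cache.loads, cache.cached_ids())
```

With a capacity of two, the second request for "a" is served from the cache. The later request for "b" has to be pulled again because "c" pushed it out.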

Column Oriented Approach

Contact Info Private Beta