Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20

Presentation transcript:

Battle of the Giants: Apache Solr 4.0 vs ElasticSearch 0.20. Rafał Kuć, Sematext International (@kucrafal, @sematext, sematext.com)

Copyright 2012 Sematext Int’l. All rights reserved.
Who Am I: author of "Solr 3.1 Cookbook" (4.0 inc); Sematext consultant and engineer; Solr.pl co-founder; father and husband.

What Will I Talk About? Apache Solr vs ElasticSearch.

Under the Hood: ElasticSearch 0.20 runs on Apache Lucene 3.6.1; Apache Solr 4.0 runs on Apache Lucene 4.0.

Architecture. What we expect: scalability, fault tolerance, high availability, features. What we are also looking for: manageability, ease of installation, tools.

ElasticSearch Cluster Architecture: distributed; fault tolerant; only ElasticSearch nodes; single leader; automatic leader election.

SolrCloud Cluster Architecture: distributed; fault tolerant; Apache Solr plus a ZooKeeper ensemble; leader per shard; automatic leader election.

Collection vs Index: a collection is Solr's main logical index; an index is ElasticSearch's main logical structure. Collections and indices can be spread among different nodes in the cluster.

Multiple Document Types in Index: ElasticSearch allows multiple document types in a single index; Apache Solr allows multiple document types in a single collection, all sharing one schema.xml.

Shards and Replicas: an index / collection can have many shards; each shard can have 0 or more replicas; replicas are automatically updated; replicas can be promoted to leaders when a leader shard goes off-line.

Index and Query Routing: control where documents are going; control where queries are going; manual data distribution.

Querying Without Routing. [Diagram: Application, Collection / Index, Shard 1 to Shard 8]

Query With Routing. [Diagram: Application, Collection / Index, Shard 1 to Shard 8]

Routing Docs and Queries in Solr: requires some effort; defaults to a hash based on document identifiers; can be turned off using solr.NoOpDistributingUpdateProcessorFactory:
    <updateRequestProcessorChain>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
      <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Routing Docs and Queries in ElasticSearch: the routing parameter controls the target shard to which a document or query is forwarded; it defaults to the document identifier; it can be changed to any value.
    curl -XPUT 'localhost:9200/sematext/test/1?routing=1234' -d '{ "title" : "Test routing document" }'
    curl -XGET 'localhost:9200/sematext/test/_search/?q=*&routing=1234'
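The routing idea on the preceding slides can be sketched in a few lines of Python. This is a simplified illustration, not either server's implementation: shard_for, shards_to_query and NUM_SHARDS are hypothetical names, and zlib.crc32 is only a stand-in for whatever hash function the servers actually use.

```python
import zlib

NUM_SHARDS = 8  # assumption: matches the eight shards drawn on the routing slides


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard number (illustrative hash, not the real one)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shards_to_query(routing_value=None, num_shards: int = NUM_SHARDS):
    """Without a routing value every shard is queried; with one, a single shard is."""
    if routing_value is None:
        return list(range(num_shards))
    return [shard_for(routing_value, num_shards)]
```

Because the same routing value always hashes to the same shard, a document indexed with routing=1234 can later be found by a query that carries routing=1234 and touches only that shard.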

Apache Solr Index Structure: field types defined in schema.xml; fields defined in schema.xml; allows automatic value copying; allows dynamic fields; allows custom similarity definition.

ElasticSearch Index Structure: schema-less; analyzers and filters defined with the HTTP API; fields defined with an HTTP request; multi-field support; allows nested documents; allows parent-child relationships; allows structured data.

Index Structure Manipulation: possible to some extent in Solr as well as in ElasticSearch; ElasticSearch allows dynamic mapping updates (not always).

Aliasing. Solr: allows core aliasing. ElasticSearch: allows index aliasing; we can add a filter to an alias; we can add index routing; we can add search routing.

Server Configuration. Solr: static in solrconfig.xml; can be reloaded during runtime with a collection/core reload. ElasticSearch: static in elasticsearch.yml; properties can be changed during runtime (although not all) without reloading.

ElasticSearch Gateway Module: your data time machine; stores indices and metadata. Currently available: local, shared FS, Hadoop, S3.

Discovery: Apache Solr uses ZooKeeper; ElasticSearch uses Zen Discovery.

ElasticSearch Zen Discovery: allows automatic node discovery; provides multicast and unicast discovery methods; automatic master detection; two-way failure detection.

Apache Solr & Apache ZooKeeper: requires additional software; ZooKeeper ensemble with 1+ ZooKeeper instances; prevents split-brain situations; holds collection configurations; Solr needs to know the address of one of the ZooKeeper instances.
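Split-brain prevention rests on majority quorums: a partition of the ensemble may keep working only if it can see more than half of the instances. A minimal sketch of that arithmetic follows; the function names are hypothetical and nothing here talks to a real ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of instances that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(reachable: int, ensemble_size: int) -> bool:
    """True if a partition seeing `reachable` instances may keep operating."""
    return reachable >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)
```

Two disjoint partitions can never both hold a majority, which is why at most one side keeps accepting changes. A single-instance ensemble works but tolerates no failures, while three instances tolerate one.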

API: HTTP REST API in ElasticSearch, or a query string for simple queries; HTTP with a query string in Apache Solr. Both provide a specialized Java API: SolrJ for Apache Solr, with CloudSolrServer; ElasticSearch with TransportClient for remote connections.

Apache Solr and Query String: queries are built of request parameters; some degree of structuring is allowed (local params).
    curl 'http://localhost:8983/solr/select?q=text:weird&sort=date+desc'

ElasticSearch REST End-Points: simple queries are built of request parameters; structured queries are built as JSON objects.
    curl -XGET 'localhost:9200/sematext/test/_search/?q=_all:weird&sort=date:desc'
    curl -XGET 'localhost:9200/sematext/test/_search' -d '{ "query" : { "term" : { "_all" : "weird" } }, "sort" : { "date" : { "order" : "desc" } } }'
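Hand-written JSON bodies are easy to get wrong, since one missing brace changes the nesting. A hedged sketch of building the same structured request programmatically is shown here; build_search_body is a hypothetical helper, and the field names simply mirror the slide's example.

```python
import json


def build_search_body(field, value, sort_field, order="desc"):
    """Build a term query with a sort clause; "query" and "sort" are siblings."""
    return {
        "query": {"term": {field: value}},
        "sort": {sort_field: {"order": order}},
    }


body = build_search_body("_all", "weird", "date")
payload = json.dumps(body)  # string suitable for the -d argument of curl
```

Serializing a dictionary guarantees balanced braces, so the payload is always valid JSON even if the query itself is not what was intended.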

Data Handling. Solr: multiple formats allowed as input; can return results in multiple formats. ElasticSearch: JSON in / JSON out.

Single or Batch. Solr: single or multiple documents per request. ElasticSearch: single document with a standard indexing call; _bulk end-point exposed for batch indexing; _bulk UDP end-point can be exposed for low-latency batch indexing.

Partial Document Updates: not based on LUCENE-3837 proposed by Andrzej Białecki; the document is reindexed on the search server side; both servers use versioning to prevent changes from being overwritten; can lead to decreased network traffic in some cases.
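The versioning point can be illustrated with a small in-memory model of optimistic concurrency. This is a sketch under stated assumptions, not how either server stores documents: VersionedStore and VersionConflict are hypothetical names, and the "reindexing" is reduced to merging two dictionaries.

```python
class VersionConflict(Exception):
    """Raised when an update was prepared against a stale document version."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, document):
        """Store a whole document and bump its version."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, dict(document))
        return version

    def partial_update(self, doc_id, changes, expected_version):
        """Merge `changes` server-side, but only if nobody updated in between."""
        version, document = self._docs[doc_id]
        if version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {version}")
        self._docs[doc_id] = (version + 1, {**document, **changes})
        return version + 1
```

The client sends only the changed fields plus the version it last saw, which is where the reduced network traffic comes from; a second writer holding an outdated version gets a conflict instead of silently overwriting the first change.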

ElasticSearch Partial Doc Update: special end-point exposed, _update; supports parameters like routing, parent, replication, percolate, etc. (similar to the Index API); uses scripts to perform document updates.
    curl -XPOST 'localhost:9200/sematext/test/12345/_update' -d '{ "script" : "ctx._source.enabled = enabled", "params" : { "enabled" : true } }'

Apache Solr Partial Doc Update: sent to the standard update handler; requires the _version_ field to be present.
    curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[ { "id" : "12345", "enabled" : { "set" : true } } ]'

Solr Collections API: built on top of Core Admin. Allows: collection creation, collection reload, collection deletion.

ElasticSearch Indices REST API. Allows: index creation, index deletion, index closing and opening, index refreshing, existence checking.

Analysis Chain Definition. Solr: static in schema.xml; can be reloaded during runtime with a collection/core reload. ElasticSearch: static in elasticsearch.yml; defined during index/type creation with a REST call; possible to change with an update mapping call (not all changes allowed).

Multilingual Data Handling: both ElasticSearch and Apache Solr are built on top of Apache Lucene. Solr: analyzers defined per field in schema.xml. ElasticSearch: analyzer defined in mappings, but it can be set at query time or chosen on the basis of field values.

Results Grouping: available in Apache Solr only. Allows results grouping based on: field value, query, function query (not available during distributed searching).

Prospective Search: allows checking whether a document matches a stored query; not available in Apache Solr; available in ElasticSearch under the name Percolator.
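Prospective search inverts the usual flow: queries are stored first, and each incoming document is checked against them. A toy sketch of the idea, assuming stored queries are plain lists of required terms and tokenization is a whitespace split (both are simplifications; percolate here is a hypothetical function, not the ElasticSearch API):

```python
def percolate(document_text, stored_queries):
    """Return the names of stored queries whose terms all occur in the document."""
    tokens = set(document_text.lower().split())
    return sorted(
        name
        for name, terms in stored_queries.items()
        if {term.lower() for term in terms} <= tokens
    )


queries = {
    "es": ["elasticsearch"],
    "both": ["solr", "elasticsearch"],
    "lucene": ["lucene"],
}
matches = percolate("Solr and ElasticSearch are search servers", queries)
```

A typical use is alerting: the stored queries represent user subscriptions, and a match on a newly arrived document triggers a notification.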

Spellchecker: allows checking and correcting spelling mistakes; not currently available in ElasticSearch; multiple implementations available in Apache Solr: IndexBasedSpellChecker, WordBreakSolrSpellChecker, DirectSolrSpellChecker.

Full Text Search Capabilities: variety of queries; ability to control score calculation; different query parsers available; advanced Lucene queries (like SpanQueries) exposed.

Score Calculation: leverage Lucene scoring capabilities; control over document importance; control over query importance; control over term and phrase importance.

Apache Solr and Score Influence. Index time: document boosts, field boosts. Query time: term boosts, phrase boosts, function queries.

ElasticSearch and Score Influence. Index time: document and field boosts. Query time: different queries provide different boost controls; can calculate distributed term frequencies; negative and positive boosting queries; custom score filters. Scripts: control scoring with scripts.

Nested Objects: possible only in ElasticSearch; indexed as separate documents; stored in the same part of the index as the root document; hidden from standard queries and filters; need appropriate queries and filters (nested).

More Like This: lets us find similar documents. Solr: More Like This Component. ElasticSearch: More Like This Query; More Like This Field Query; _mlt REST end-point.

Solr Parent-Child Relationship: used at query time; multi-core joins possible.
    http://localhost:8983/solr/select?q={!join from=parent to=id}color:Yellow

ElasticSearch Parent-Child Handling: proper indexing required; indexed as separate documents; standard queries don't return child documents; in order to retrieve parent docs one should use appropriate queries and filters (has_child, has_parent, top_children).

Filters: used to narrow down query results; good candidates for caching and reuse; supported by ElasticSearch and Apache Solr; should be used for repeatable query elements.

Apache Solr Filter Queries: multiple filters per query; filters are additive; different query parsers can be used; local params can be used; narrow down faceting results.
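Why filters combine well with caching can be shown with a small model: each filter resolves to a set of matching document ids, the set is cached per filter, and several filters on one query are intersected. The sketch below is illustrative only; FilterCache and apply_filters are hypothetical names, and real servers cache bit sets rather than Python sets.

```python
class FilterCache:
    def __init__(self, documents):
        self._documents = documents  # doc_id -> dict of field values
        self._cache = {}             # (field, value) -> frozenset of doc ids
        self.misses = 0

    def doc_ids(self, field, value):
        """Return ids matching field == value, computing the set only once."""
        key = (field, value)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = frozenset(
                doc_id
                for doc_id, doc in self._documents.items()
                if doc.get(field) == value
            )
        return self._cache[key]


def apply_filters(candidates, filters, cache):
    """Each additional filter narrows the candidate set further."""
    result = set(candidates)
    for field, value in filters:
        result &= cache.doc_ids(field, value)
    return result
```

Because every filter is cached under its own key, a later query that reuses just one of the filters is answered from the cache, which is the reason to keep repeatable query elements in filters rather than in the main query.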

ElasticSearch Filtered Queries: can be defined using queries exposed by the Query DSL; can be used for custom score calculation (e.g., the custom filters score query); don't narrow down faceting results by default (facets have their own filters).

Filter Cache Control: both Solr and ElasticSearch let us control the cache for filters. Solr: using local params and the cache property. ElasticSearch: _cache property; _cache_key property.

Faceting. Both provide common facets: terms, range and query, terms statistics, spatial distance. Solr: pivot faceting. ElasticSearch: histograms.
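For readers new to the term, a terms facet is essentially a count of documents per distinct field value, returned next to the search results. A minimal in-memory sketch, with terms_facet as a hypothetical helper rather than either server's API:

```python
from collections import Counter


def terms_facet(documents, field, size=10):
    """Count documents per value of `field`, most frequent values first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return counts.most_common(size)


docs = [{"tag": "solr"}, {"tag": "es"}, {"tag": "solr"}, {"title": "untagged"}]
facet = terms_facet(docs, "tag")
```

Range and histogram facets follow the same pattern, except that values are first mapped to a bucket before counting.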

Real Time Or Not? Allows getting a document that is not yet indexed; doesn't need searcher reopening. ElasticSearch: separate Get and Multi Get APIs. Apache Solr: separate Realtime Get Handler; can be used as a search component.

Caches and Warming: ElasticSearch and Solr allow caching; both allow running warming queries; ElasticSearch by default doesn't limit cache sizes.

Solr Caches. Types: filter cache, query result cache, document cache. Implementation choices: LRUCache, FastLRUCache, LFUCache. Other configuration options: size, maximum size, autowarming count.
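The eviction policy behind an LRU cache with a maximum size can be sketched as follows. This illustrates the policy only; it is not Solr's LRUCache class, and the method names are assumptions chosen for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `max_size` entries, evicting the least recently used one."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._entries)
```

An LFU policy would instead track how often each entry is read and evict the least frequently used one, which suits workloads where a few entries stay hot for a long time.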

ElasticSearch Caches. Types: filter cache, field data cache. Implementation choices: resident, soft, weak. Other configuration options: max size (entries per segment), expiration time.

Cluster State Monitoring: Apache Solr exposes multiple MBeans through JMX; ElasticSearch exposes multiple REST end-points to get different statistics.

ElasticSearch Statistics API: health and state check; nodes information and statistics; cache statistics; index segments information; index information and statistics; mappings information.

Cluster Monitoring

Cluster Monitoring with SPM

Cluster Settings Update. ElasticSearch lets us: control rebalancing; control recovery; control allocation; change the above on the live cluster.

Custom Shard Allocation: possible in ElasticSearch.
Cluster level:
    curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent" : { "cluster.routing.allocation.exclude._ip" : "192.168.2.1" } }'
Index level:
    curl -XPUT localhost:9200/sematext/ -d '{ "index.routing.allocation.include.tag" : "nodeOne,nodeTwo" }'

Moving Shards and Replicas: possible in ElasticSearch, not available in Solr; allows moving shards and replicas to any node in the cluster on demand.
    curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ {"move" : {"index" : "sematext", "shard" : 0, "from_node" : "node1", "to_node" : "node2"}}, {"allocate" : {"index" : "sematext", "shard" : 1, "node" : "node3"}} ] }'
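When reroute requests are generated by a script rather than typed by hand, the command list can be assembled programmatically. The helpers below are hypothetical; they only reproduce the JSON shape shown on the slide and do not contact a cluster.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Describe moving one shard from one node to another."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def allocate_command(index, shard, node):
    """Describe allocating an unassigned shard to a node."""
    return {"allocate": {"index": index, "shard": shard, "node": node}}


def reroute_body(*commands):
    """Serialize commands into the body expected by the reroute end-point."""
    return json.dumps({"commands": list(commands)})


body = reroute_body(
    move_command("sematext", 0, "node1", "node2"),
    allocate_command("sematext", 1, "node3"),
)
```

The resulting string can be passed as the -d argument of the curl call above.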

And The Winner Is? The users.

How to Reach Us. Rafał Kuć: Twitter @kucrafal; e-mail rafal.kuc@sematext.com. Sematext: Twitter @sematext; website http://sematext.com. Solr vs ElasticSearch series: http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/

We Are Hiring! Dig search? Dig analytics? Dig big data? Dig performance? Dig working with and in open source? We're hiring world-wide! http://sematext.com/about/jobs.html