Rafał Kuć – Sematext sematext.com

Slides:



Advertisements
Similar presentations
Apache ZooKeeper By Patrick Hunt, Mahadev Konar
Advertisements

Amaze business, make your devs happy
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
1 Vic Hargrave |
Solr has a lot of extensive features Solr Integration and Enhancements Todd Hatcher.
Internet Networking Spring 2006 Tutorial 12 Web Caching Protocols ICP, CARP.
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
1 Spring Semester 2007, Dept. of Computer Science, Technion Internet Networking recitation #13 Web Caching Protocols ICP, CARP.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Internet Networking Spring 2002 Tutorial 13 Web Caching Protocols ICP, CARP.
Graph databases …the other end of the NoSQL spectrum. Material taken from NoSQL Distilled and Seven Databases in Seven Weeks.
Implementing search with free software An introduction to Solr By Mick England.
ECPRD seminar on the net IX”, Brussels, 2011 Faceted Search Some examples of applied faceted search on websites developed by the EP Jerry.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
Elasticsearch in Dashboard Data Management Applications David Tuckett IT/SDC 30 August 2013 (Appendix 11 November 2013)
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Revolutionizing enterprise web development Searching with Solr.
Open Search Office Web Services Database Doc Mgt Sys Pipeline Index Geospatial Analysis Text Search Faceting Caching Query parsing Clustering Synonyms.
A FIRST TOUCH ON NOSQL SERVERS: COUCHDB GENOVEVA VARGAS SOLAR, JAVIER ESPINOSA CNRS, LIG-LAFMIA, FRANCE
Copyright © 2012 UNICOM Systems, Inc. Confidential Information z/Ware Product Overview illustro Systems International A Division of UNICOM Global.
Copyright © 2006 Pilothouse Consulting Inc. All rights reserved. Search Overview Search Features: WSS and Office Search Architecture Content Sources and.
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.
Clusterpoint Margarita Sudņika ms RDBMS & NoSQL Databases & tables → Document stores Columns, rows → Schemaless documents Scales UP → Scales UP.
ASP-2-1 SERVER AND CLIENT SIDE SCRITPING Colorado Technical University IT420 Tim Peterson.
Introduction to MongoDB. Database compared.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
#SummitNow Inspecting Alfresco – Tools and Techniques Nathan McMinn Technical Consultant - Alfresco.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
Solr Performance Monitoring with Scalable Performance Monitoring SaaS Otis Gospodnetić – Sematext ◦ sematext.com sematext.com/spm.
A presentation on ElasticSearch
and Big Data Storage Systems
Jon Fancey Enterprise Integration with Logic Apps
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
z/Ware 2.0 Technical Overview
Searching and Indexing
Open Source distributed document DB for an enterprise
Experience in CMS with Analytic Tools for Data Flow and HLT Monitoring
Spark Presentation.
CSE-291 (Cloud Computing) Fall 2016
NOSQL.
Safe by default, optimized for efficiency
Custom search forms with Apache Solr David Hernández
Building Search Systems for Digital Library Collections
NOSQL databases and Big Data Storage Systems
APACHE HAWQ 2.X A Hadoop Native SQL Engine
Internet Networking recitation #12
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Search Techniques and Advanced tools for Researchers
CS6604 Digital Libraries IDEAL Webpages Presented by
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Ashutosh Rana Rahul Nori 7/17/2018
Overview of big data tools
another noSql customization for the HDB++ archiving system
Introduction of Week 11 Return assignment 9-1 Collect assignment 10-1
Lucene/Solr Architecture
Introduction to Elasticsearch with basics of Lucene May 2014 Meetup
Getting Started With Solr
Academic & More Group 4 谢知晖 王逸雄 郭嘉宋 程若愚.
敦群數位科技有限公司(vanGene Digital Inc.) 游家德(Jade Yu.)
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20
Factual Claim Validation Models
Indexing with ElasticSearch
Big Data tools for IT professionals supporting statisticians Hands-on : NoSQL DB Donato Summa THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED.
Using TLA+ for fun and profit in the development of Elasticsearch
Presentation transcript:

Rafał Kuć – Sematext Group, Inc. @kucrafal @sematext sematext.com Battle of the Giants Round 2 VS Rafał Kuć – Sematext Group, Inc. @kucrafal @sematext sematext.com

Copyright 2013 Sematext Group. Inc. All rights reserved Ich bin ein… Sematext consultant & engineer Solr Cookbook series author „ElasticSearch Server” author „Mastering ElasticSearch” author Solr.pl co-founder Father and husband  Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved VS Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Under the Hood Lucene 4.3 Lucene 4.3 Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Expectations Scalability Fault toleranance High availablity Features Manageability Ease of installation Tools Support Copyright 2013 Sematext Group. Inc. All rights reserved

Expectations vs Reality Solr + ZooKeeper Leader per shard Only ElasticSearch nodes Single leader Distributed Fault tolerant Automatic leader election Copyright 2013 Sematext Group. Inc. All rights reserved

All Time Top Committers Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Active Contributors Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved The Code Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved The Mailing Lists Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Trends Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Collection vs Index Collection – main logical index Index – main logical structure Collections and Indices can be spread among different nodes in the cluster Copyright 2013 Sematext Group. Inc. All rights reserved

Apache Solr Index Structure Field and types defined in schema Automatic value copying Dynamic fields Custom similarity Custom postings format Multiple document types require shared schema Can be read using API Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Index Structure Schema - less Fields and types defined with HTTP API Multi – field support Nested and parent – child documents Custom similarity Custom postings format Multiple document with different structure Can be read and written using API Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Shards and Replicas Many shards 0 or more replicas Replica can become leader Replicas can be created on live cluster Copyright 2013 Sematext Group. Inc. All rights reserved

Configuration Static in solrconfig.xml Can be reloaded with core reload Static in elasticsearch.yml Changable at runtime Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Discovery Apache Zookeeper Zen Discovery Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Solr & ZooKeeper Requires additional software Prevents split – brain situations Holds collections configurations ZooKeeper ensemble needed Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Zen Discovery Automatic node discovery Multicast and unicast discovery methods Automatic master detection Two - way failure detection Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved HTTP FTW HTTP REST API in ElasticSearch or Query String for simple queries HTTP with Query String in Apache Solr Both provide specialized Java API Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Results Grouping Group on: field value query result function query Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Prospective Search Called Percolator Matches documents to stored queries Copyright 2013 Sematext Group. Inc. All rights reserved

Full Text Search Capabilities Variety of queries Control score calculation Different query parsers Advanced Lucene queries Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Score Calculation Leverage Lucene scoring Control importance of: documents queries terms phrases Similiarity configuration Copyright 2013 Sematext Group. Inc. All rights reserved

Apache Solr and Score Influence Index - time boosting Query - time Term boosts Field boosts Phrases boost Function queries Sub-queries used for boosting Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch and Score Influence Index - time Query - time Different queries provide different boost controls Can calculate distributed term frequencies Negative and Positive boosting queries Custom score filters Scripts Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Query Rescore Reorders top N hits by using other query Executed on shards before results are returned to the node handling it Not executed with scan and count Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Nested Objects Indexed as separate documents Stored in the same part of index as root doc Hidden from standard queries and filters Need appropriate queries and filters (nested) Top level documents can be sorted on the basis of nested ones Copyright 2013 Sematext Group. Inc. All rights reserved

Solr Parent – Child Relationship Used at query time Multi core joins possible select?q={!join from=parent to=id}color:Yellow Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Parent – Child Proper indexing required Indexed as separate documents Standard queries don’t return child documents Retrieve parent docs using queries and filters (has_child, has_parent, top_children) Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Filters Used to narrown down query results Good candidates for caching and reuse Addictive Can use different query parsers Can use local params Narrows down faceting results Defined using Query DSL Can be used for score calculation Doesn’t narrow down faceting results Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Faceting Pivot Histograms Terms Range & query Terms statistics Spatial distance Copyright 2013 Sematext Group. Inc. All rights reserved

Real Time Or Not ? Get not yet indexed docs from transaction log Don’t need searcher reopening Separate Realtime Get Handler Separate Get and Multi Get API Copyright 2013 Sematext Group. Inc. All rights reserved

Data Handling Single and batch indexing supported Different formats allowed (XML, JSON, CSV, binary) JSON in / JSON out (and YAML) Single and batch indexing supported Copyright 2013 Sematext Group. Inc. All rights reserved

Partial Document Updates Not based on LUCENE-3837 Server-side doc reindexing Both servers use versioning Decreases network traffic Copyright 2013 Sematext Group. Inc. All rights reserved

Apache Solr Partial Doc Update Sent to the standard update handler Requires _version_ field curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[ { "id" : "12345", "enabled" : { "set" : true } } ]' Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Partial Doc Update Special end – point exposed - _update Supports parameters like routing, parent, replication, percolate, etc (similar to Index API) Uses scripts to perform document updates curl -XPOST 'localhost:9200/sematext/test/12345/_update' -d '{ "script" : "ctx._source.enabled = enabled", "params" : { "enabled" : true } }' Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Solr Collections API Collection creation reload deletion shards splitting Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Indices REST API Index creation deletion closing and opening refreshing existence checking Copyright 2013 Sematext Group. Inc. All rights reserved

Apache Solr Shard Splitting admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1 Copyright 2013 Sematext Group. Inc. All rights reserved

Cluster State Monitoring Multiple MBeans exposed by JMX Multiple REST end – points exposed to get different statistics Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Statistics API Health and state check Nodes information Cache statistics Segments information Index information Mappings information SPM – „One to rule them all” Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Cluster Settings Update Control rebalancing recovery allocation Change cluster configuration properties Copyright 2013 Sematext Group. Inc. All rights reserved

ElasticSearch Custom Shard Allocation Cluster level: Index level: curl -XPUT localhost:9200/_cluster/settings -d '{     "persistent" : {         "cluster.routing.allocation.exclude._ip" : "192.168.2.1"     } }' curl -XPUT localhost:9200/sematext/_settings/ -d '{     "index.routing.allocation.include.tag" : "nodeOne,nodeTwo" }' Copyright 2013 Sematext Group. Inc. All rights reserved

Moving Shards and Replicas Move shards between nodes on demand curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ {"move" : {"index" : "sematext", "shard" : 0, "from_node" : "node1", "to_node" : "node2"}}, {"allocate" : {"index" : "sematext", "shard" : 1, "node" : "node3"}} ] }' Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved The Verdict Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved And The Winner Is ? The Users Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved We Are Hiring ! Dig Search ? Dig Analytics ? Dig Big Data ? Dig Performance ? Dig working with and in open – source ? We’re hiring world – wide ! http://sematext.com/about/jobs.html Copyright 2013 Sematext Group. Inc. All rights reserved

Copyright 2013 Sematext Group. Inc. All rights reserved Thank You ! Rafał Kuć @kucrafal rafal.kuc@sematext.com Sematext @sematext http://sematext.com http://blog.sematext.com ElasticSearch Server 25% off: MREESS25 Copyright 2013 Sematext Group. Inc. All rights reserved