Introduction to Elasticsearch with basics of Lucene May 2014 Meetup

Introduction to Elasticsearch with basics of Lucene
May 2014 Meetup
Rahul Jain, @rahuldausa
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/

Who am I
- Software Engineer
- 7 years of software development experience
- Built a platform to search logs in near real time at a volume of 1 TB/day [#]
- Worked on Solr-search-based SEO/SEM software handling 40 billion records/month (topic of the next talk?)
- Areas of expertise/interest:
  - High-traffic web applications
  - Java/J2EE
  - Big data, NoSQL
  - Information retrieval, machine learning
[#] http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr

Agenda
- IR Overview
- Basic Concepts
- Lucene
- Elasticsearch
- Logstash & Kibana: short introduction
- Q&A

Information Retrieval (IR)
"Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing." - Wikipedia

Basic Concepts
- Term t: a noun or compound word used in a specific context.
- tf(t in d): term frequency, a measure of how often a term appears in a document; the number of times term t appears in the currently scored document d.
- idf(t): inverse document frequency, a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
- boost (index): boost of the field at index time.
- boost (query): boost of the field at query time.

Basic Concepts: TF-IDF
TF-IDF = Term Frequency × Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
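The tf and idf definitions from the Basic Concepts slides can be tried out on a tiny corpus. The following is a minimal sketch of the textbook formula only: the function names and the three sample documents are made up for illustration, and Lucene's real scoring adds normalization and boosts that are not modeled here.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: number of times `term` appears in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """log(total number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term missing from the index gets zero weight
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = term frequency x inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus: each document is already split into tokens.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]

common = tf_idf("the", corpus[0], corpus)  # appears in every document -> 0.0
rare = tf_idf("fox", corpus[0], corpus)    # appears in one document -> log(3)
```

A term that occurs in every document ("the") scores zero, while a term found in a single document ("fox") gets the highest weight, which is the behaviour the idf definition describes.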

Apache Lucene

Apache Lucene
- Fast, high-performance, scalable search/IR library
- Open source
- Initially developed by Doug Cutting (also the author of Hadoop)
- Indexing and searching
- Inverted index of documents
- Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching
- http://lucene.apache.org/

Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
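An inverted index maps each term to the set of documents that contain it, so a query looks up terms instead of scanning documents. Below is a minimal sketch of the idea, not Lucene's actual on-disk structure; the function names and the sample titles are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Return the sorted IDs of documents containing ALL given terms (AND query)."""
    if not terms:
        return []
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical documents keyed by ID.
docs = {
    1: "The Godfather",
    2: "The Godfather Part II",
    3: "Kill Bill",
}
index = build_inverted_index(docs)

one_term = search(index, "godfather")           # [1, 2]
two_terms = search(index, "godfather", "part")  # [2]
```

Each lookup touches only the posting sets of the query terms, which is why search cost depends on the terms queried rather than on the total number of documents.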

Lucene Internals (Contd.)
- Defines a document model
- An index contains documents.
- Each document consists of fields.
- Each field has attributes:
  - What is the data type (FieldType)?
  - How should the content be handled (Analyzers, Filters)?
  - Is it a stored field (stored="true") or an indexed field (indexed="true")?

Indexing Pipeline
- Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters).
- Each field can define an Analyzer for index time, for query time, or for both.
Credit: http://www.slideshare.net/otisg/lucene-introduction

Analysis Process - Tokenizer
WhitespaceAnalyzer: simplest built-in analyzer
Input:  The quick brown fox jumps over the lazy dog.
Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Analysis Process - Tokenizer
SimpleAnalyzer: lowercases, splits at non-letter boundaries
Input:  The quick brown fox jumps over the lazy dog.
Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
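The two analyzers above can be imitated as a tokenizer followed by optional token filters. This is a rough Python approximation of the behaviour shown on the slides, not Lucene code; all function names here are hypothetical.

```python
import re

def whitespace_tokenizer(text):
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()

def letter_tokenizer(text):
    """Split at non-letter boundaries, keeping runs of letters."""
    return re.findall(r"[^\W\d_]+", text)

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

def analyze(text, tokenizer, token_filters=()):
    """Analyzer = one tokenizer, then each token filter applied in order."""
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return list(tokens)

sentence = "The quick brown fox jumps over the lazy dog."

# Stand-in for WhitespaceAnalyzer: keeps case and the trailing "dog."
whitespace_tokens = analyze(sentence, whitespace_tokenizer)

# Stand-in for SimpleAnalyzer: letters only, lowercased
simple_tokens = analyze(sentence, letter_tokenizer, [lowercase_filter])
```

Because the same analysis usually has to run at index time and at query time, a query for "Dog" only matches the indexed token if both sides pass through the same chain.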

Elasticsearch

Introduction
- Enterprise search platform built on Apache Lucene
- Open source
- Highly reliable, scalable, fault tolerant
- Supports distributed indexing, replication, and load-balanced querying
- http://www.elasticsearch.org/

Elasticsearch - Features
- Distributed RESTful search server
- Document oriented
- Domain driven
- Schemaless
- RESTful
- Easy to scale horizontally

Elasticsearch - Features
- Highlighting
- Spelling suggestions
- Facets (group by)
- Query DSL based on JSON to define queries
- Automatic shard replication and routing
- Zen discovery
  - Unicast
  - Multicast
- Master election
  - Re-election if the master node fails
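The "automatic routing" item can be pictured as hashing. Below is a minimal sketch, assuming the usual scheme in which a document's routing value (its ID by default) is hashed and taken modulo the number of primary shards. The name `pick_shard` and the CRC32 hash are illustrative stand-ins, not the hash function Elasticsearch itself uses.

```python
import zlib

def pick_shard(routing_value, number_of_primary_shards):
    """Choose a primary shard by hashing the routing value.

    CRC32 is only a stand-in hash for this sketch.
    """
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards

# Hypothetical index with 5 primary shards.
shard_for_doc_1 = pick_shard("1", 5)
shard_again = pick_shard("1", 5)  # same input always maps to the same shard
```

Because the mapping is deterministic, any node can work out which shard holds a given document and forward the request there.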

APIs
- HTTP RESTful API
- Java API
- Clients: Perl, Python, PHP, Ruby, .NET, etc.
- All APIs perform automatic node operation rerouting.

How to start
It's this easy.

Operations

INDEX CREATION
URL pattern: http://localhost:9200/<index>/<type>/[<id>]

    curl -XPUT "http://localhost:9200/movies/movie/1" -d '
    {
        "title": "The Godfather",
        "director": "Francis Ford Coppola",
        "year": 1972
    }'

Credit: http://joelabrahamsson.com/elasticsearch-101/

INDEX CREATION RESPONSE
Credit: http://joelabrahamsson.com/elasticsearch-101/

UPDATE

    curl -XPUT "http://localhost:9200/movies/movie/1" -d '
    {
        "title": "The Godfather",
        "director": "Francis Ford Coppola",
        "year": 1972,
        "genres": ["Crime", "Drama"]
    }'

Slide callouts: "New field" (genres) and "Updated Version".
Credit: http://joelabrahamsson.com/elasticsearch-101/

GET

    curl -XGET "http://localhost:9200/movies/movie/1" -d ''

Credit: http://joelabrahamsson.com/elasticsearch-101/

DELETE

    curl -XDELETE "http://localhost:9200/movies/movie/1" -d ''

Credit: http://joelabrahamsson.com/elasticsearch-101/
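The index, get, and delete calls above all target the same /<index>/<type>/<id> URL and differ only in HTTP method and body. Below is a sketch that builds equivalent requests with Python's standard library. The helper names are hypothetical, and nothing is sent until a request is passed to `urllib.request.urlopen`, which needs a node listening on localhost:9200.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples

def doc_url(index, doc_type, doc_id):
    """Build the /<index>/<type>/<id> URL of a single document."""
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"

def index_request(index, doc_type, doc_id, document):
    """PUT the JSON document; repeating it with the same ID replaces the document."""
    return urllib.request.Request(
        doc_url(index, doc_type, doc_id),
        data=json.dumps(document).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

def get_request(index, doc_type, doc_id):
    """GET a document by ID."""
    return urllib.request.Request(doc_url(index, doc_type, doc_id), method="GET")

def delete_request(index, doc_type, doc_id):
    """DELETE a document by ID."""
    return urllib.request.Request(doc_url(index, doc_type, doc_id), method="DELETE")

movie = {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}
put_req = index_request("movies", "movie", 1, movie)
get_req = get_request("movies", "movie", 1)
del_req = delete_request("movies", "movie", 1)
```

To execute one of them against a running node, something like `urllib.request.urlopen(put_req).read()` would return the JSON response.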

SEARCH
- Search across all indexes and all types: http://localhost:9200/_search
- Search across all types in the movies index: http://localhost:9200/movies/_search
- Search explicitly for documents of type movie within the movies index: http://localhost:9200/movies/movie/_search

    curl -XPOST "http://localhost:9200/_search" -d '
    {
        "query": {
            "query_string": {
                "query": "kill"
            }
        }
    }'

Credit: http://joelabrahamsson.com/elasticsearch-101/
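The three search scopes differ only in how much of the path precedes _search, and the request body is plain JSON. Here is a small sketch of both pieces; `search_url` and `query_string_body` are hypothetical helper names, and the code only constructs the URL and body without contacting a server.

```python
import json

BASE_URL = "http://localhost:9200"  # same host and port as the curl example

def search_url(index=None, doc_type=None):
    """Build the _search URL for all indexes, one index, or one type in an index."""
    if doc_type and not index:
        raise ValueError("a type can only be searched within a named index")
    parts = [BASE_URL]
    if index:
        parts.append(index)
    if doc_type:
        parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)

def query_string_body(text):
    """Serialize a query_string query in the JSON Query DSL."""
    return json.dumps({"query": {"query_string": {"query": text}}})

everything = search_url()                  # http://localhost:9200/_search
one_index = search_url("movies")           # http://localhost:9200/movies/_search
one_type = search_url("movies", "movie")   # http://localhost:9200/movies/movie/_search
body = query_string_body("kill")
```

Building the body with `json.dumps` rather than by hand avoids unbalanced braces in nested queries.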

SEARCH RESPONSE
Credit: http://joelabrahamsson.com/elasticsearch-101/

Updating existing Mapping

    curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d '
    {
        "movie": {
            "properties": {
                "director": {
                    "type": "multi_field",
                    "fields": {
                        "director": {"type": "string"},
                        "original": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    }'

Credit: http://joelabrahamsson.com/elasticsearch-101/
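The mapping above indexes the director value twice: once analyzed (split into terms) and once as "original" with not_analyzed (kept as a single exact term). Here is a toy illustration of why that matters for exact matching. The lowercase-and-split analyzer is a simplified stand-in for the default analyzer, and all names are hypothetical.

```python
def analyzed_terms(value):
    """Stand-in for an analyzed string field: lowercase, split on whitespace."""
    return value.lower().split()

def not_analyzed_terms(value):
    """A not_analyzed field keeps the whole value as one term."""
    return [value]

director = "Francis Ford Coppola"

# Terms stored per field for this one document.
indexed_terms = {
    "director": analyzed_terms(director),
    "director.original": not_analyzed_terms(director),
}

def term_matches(field, term):
    """Exact term lookup against the terms stored for a field."""
    return term in indexed_terms[field]

partial_on_analyzed = term_matches("director", "ford")                          # True
partial_on_original = term_matches("director.original", "ford")                 # False
exact_on_original = term_matches("director.original", "Francis Ford Coppola")   # True
```

The analyzed field suits full-text search on parts of the name, while the not_analyzed copy suits exact filtering and grouping on the complete value.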

Cluster Architecture
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Index Request
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Search Request
Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Who is using it
- GitHub
- StumbleUpon
- SoundCloud
- Datadog
- Stack Overflow
- Many more...
http://www.elasticsearch.com/case-studies/

Logstash

Logstash
- Open source, Apache licensed
- Written in JRuby
- Part of the Elasticsearch family
- http://logstash.net/
- Current version: 1.4.0
- This talk uses 1.3.3

Logstash
- Multiple inputs / multiple outputs
- Centralize logs
- Collect
- Parse
- Forward/Store

Architecture
Source: http://www.infoq.com/articles/review-the-logstash-book

Logstash - life of an event
- Input → Filters → Output
- Filters are processed in the order of the config file
- Outputs are processed in the order of the config file
- Input: input stream
  - File input (tail)
  - Log4j
  - Redis
  - Syslog
  - and many more...
http://logstash.net/docs/1.3.3/

Logstash - life of an event
- Codecs: decoding log messages
  - JSON
  - Multiline
  - Netflow
  - and many more...
- Filters: processing messages
  - Date: date format
  - Grok: regular-expression-based extraction
  - Mutate: change data type
- Output: storing the structured message
  - Elasticsearch
  - MongoDB
  - Email
  - Nagios
http://logstash.net/docs/1.3.3/
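The input → filters → output flow can be mimicked in a few lines. This is a toy Python model of the idea, not Logstash itself: the log line format, the regular expression, and every function name are assumptions made for the example. It imitates a grok-style extraction, a mutate-style type change, and a date-style timestamp parse, applied in list order.

```python
import re
from datetime import datetime

# Hypothetical log format: <client> [<ISO timestamp>] <status> <bytes>
LINE_PATTERN = re.compile(
    r"(?P<client>\S+) \[(?P<timestamp>[^\]]+)\] (?P<status>\d{3}) (?P<bytes>\d+)"
)

def grok_filter(event):
    """Grok-like step: extract named fields from the raw message with a regex."""
    match = LINE_PATTERN.match(event["message"])
    if match:
        event.update(match.groupdict())
    else:
        event.setdefault("tags", []).append("_grokparsefailure")
    return event

def mutate_filter(event):
    """Mutate-like step: convert extracted strings to integers."""
    for field in ("status", "bytes"):
        if field in event:
            event[field] = int(event[field])
    return event

def date_filter(event):
    """Date-like step: parse the timestamp text into a datetime."""
    if "timestamp" in event:
        event["@timestamp"] = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return event

# Filters run in this order, like the order in a config file.
FILTERS = [grok_filter, mutate_filter, date_filter]

def process(line, outputs):
    """Input -> filters (in order) -> outputs (in order)."""
    event = {"message": line}
    for event_filter in FILTERS:
        event = event_filter(event)
    for output in outputs:
        output(event)
    return event

collected = []  # stands in for an output such as Elasticsearch
sample_line = "127.0.0.1 [2014-05-10T13:55:36] 200 2326"
event = process(sample_line, [collected.append])
```

Order matters here in the same way as in the config file: if the mutate step ran before the grok step, there would be no extracted fields for it to convert.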

Quick Start
Version 1.3.3 and earlier, basic-agent.conf:

    input {
      tcp {
        type => "apache"
        port => 3333
      }
    }
    output {
      stdout { debug => true }
      elasticsearch { embedded => true }
    }

Version 1.3.3 and earlier:

    java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web

Version 1.4:

    bin/logstash agent -f agent.conf
    bin/logstash –web

Kibana

Source: http://www. slideshare

Source: http://www. slideshare

Analytics
Analytics source: Kibana.org, based on Elasticsearch and Logstash
Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8

Thanks!
@rahuldausa on Twitter and SlideShare
http://www.linkedin.com/in/rahuldausa
Find this interesting? Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/