Introduction to Elasticsearch with basics of Lucene May 2014 Meetup Rahul Jain @rahuldausa @http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
Who am I
  Software Engineer
  7 years of software development experience
  Built a platform to search logs in near real time at a volume of 1 TB/day #
  Worked on Solr-based SEO/SEM search software handling 40 billion records/month (topic of next talk?)
  Areas of expertise/interest:
    High-traffic web applications
    Java/J2EE
    Big data, NoSQL
    Information retrieval, machine learning
  # http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
Agenda
  IR Overview
  Basic Concepts
  Lucene
  Elasticsearch
  Logstash & Kibana - Short Introduction
  Q&A
Information Retrieval (IR)
  “Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.” - Wikipedia
Basic Concepts
  Term t: a noun or compound word used in a specific context
  tf(t in d): term frequency, the number of times term t appears in the currently scored document d
  idf(t): inverse document frequency, a measure of whether the term is common or rare across all documents in the index; obtained by dividing the total number of documents by the number of documents containing the term, then taking the logarithm of that quotient
  boost(index): boost of the field at index time
  boost(query): boost of the field at query time
Basic Concepts: TF-IDF
  TF-IDF = Term Frequency × Inverse Document Frequency
  Credit: http://whatisgraphsearch.com/
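The definitions of tf and idf above can be tried out with a few lines of Python. This is a minimal sketch of the textbook formula only, assuming whitespace-split tokens and a tiny made-up corpus; the function names are illustrative, and Lucene's own scoring applies further damping, normalization, and boosts on top of this idea.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: how many times `term` occurs in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing term).

    Returns 0.0 when no document contains the term (an assumption made
    here to avoid dividing by zero).
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = term frequency x inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Illustrative corpus (not from the talk): each document is a token list.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox jumps over the dog".split(),
]

for term in ("the", "fox", "quick"):
    print(term, tf_idf(term, corpus[0], corpus))
```

A term such as "the", which occurs in every document, gets an idf of zero and so contributes nothing to the score, while a term found in only one document scores highest.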
Apache Lucene
Apache Lucene
  Fast, high-performance, scalable search/IR library
  Open source
  Initially developed by Doug Cutting (also the author of Hadoop)
  Indexing and searching
  Inverted index of documents
  Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching
  http://lucene.apache.org/
Lucene Internals - Inverted Index Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
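The idea behind an inverted index can be sketched in Python as a map from each term to the set of documents that contain it. This is a toy illustration under simplifying assumptions (lowercased whitespace tokens, in-memory sets, made-up documents), not Lucene's on-disk structure; `build_inverted_index` and `search` are hypothetical names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so intersections do not mutate the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Illustrative documents (not from the talk).
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick dog",
}
index = build_inverted_index(docs)
print(search(index, "quick"))
print(search(index, "quick dog"))
```

Looking a term up is a single dictionary access, which is why search cost depends on the number of matching documents rather than on the size of the whole collection.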
Lucene Internals (Contd.)
  Defines a document model
  An index contains documents.
  Each document consists of fields.
  Each field has attributes:
    What is the data type (FieldType)
    How to handle the content (Analyzers, Filters)
    Is it a stored field (stored="true") or an indexed field (indexed="true")
Indexing Pipeline
  Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
  Each field can define an Analyzer for index time, for query time, or for both.
  Credit: http://www.slideshare.net/otisg/lucene-introduction
Analysis Process - Tokenizer
  WhitespaceAnalyzer: simplest built-in analyzer
  Input: The quick brown fox jumps over the lazy dog.
  Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Analysis Process - Tokenizer
  SimpleAnalyzer: lowercases and splits at non-letter boundaries
  Input: The quick brown fox jumps over the lazy dog.
  Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
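The two analyzers can be imitated in a few lines of Python to see why their token streams differ. These functions are stand-ins that mimic the behaviour described on the slides; they are not Lucene's implementation, and the names `whitespace_analyze` and `simple_analyze` are made up for this sketch.

```python
def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyze(text):
    """Split at every non-letter character and lowercase each token."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            # A non-letter ends the token being collected.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

sentence = "The quick brown fox jumps over the lazy dog."
print(whitespace_analyze(sentence))
print(simple_analyze(sentence))
```

With the whitespace version, a query for "dog" would not match the indexed token "dog.", which is one reason the choice of analyzer matters at both index time and query time.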
Elasticsearch
Introduction
  Enterprise search platform built on Apache Lucene
  Open source
  Highly reliable, scalable, fault tolerant
  Supports distributed indexing, replication, and load-balanced querying
  http://www.elasticsearch.org/
Elasticsearch - Features
  Distributed RESTful search server
  Document oriented
  Domain driven
  Schemaless
  Easy to scale horizontally
Elasticsearch - Features
  Highlighting
  Spelling suggestions
  Facets (group by)
  Query DSL based on JSON to define queries
  Automatic shard replication and routing
  Zen discovery
    Unicast
    Multicast
  Master election
    Re-election if the master node fails
APIs
  HTTP RESTful API
  Java API
  Clients for Perl, Python, PHP, Ruby, .NET, etc.
  All APIs perform automatic node operation rerouting.
How to start It’s this Easy.
Operations
INDEX CREATION
  http://localhost:9200/<index>/<type>/[<id>]
  curl -XPUT "http://localhost:9200/movies/movie/1" -d'
  {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
  }'
  Credit: http://joelabrahamsson.com/elasticsearch-101/
INDEX CREATION RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
UPDATE
  curl -XPUT "http://localhost:9200/movies/movie/1" -d'
  {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
  }'
  The request adds a new field (genres); the response reports an updated version.
  Credit: http://joelabrahamsson.com/elasticsearch-101/
GET curl -XGET "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
DELETE curl -XDELETE "http://localhost:9200/movies/movie/1" -d'' Credit: http://joelabrahamsson.com/elasticsearch-101/
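The index, get, and delete calls above all target the same `<index>/<type>/<id>` URL and differ only in HTTP method and body. A minimal Python sketch of that pattern follows, using only the standard library. It assumes a node at http://localhost:9200 and the URL layout shown on the slides; `document_request` is a hypothetical helper, and the requests are built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node on the default port

def document_request(method, index, doc_type, doc_id, body=None):
    """Build (but do not send) an HTTP request for a single document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)

movie = {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
}

index_req = document_request("PUT", "movies", "movie", 1, movie)
get_req = document_request("GET", "movies", "movie", 1)
delete_req = document_request("DELETE", "movies", "movie", 1)

for req in (index_req, get_req, delete_req):
    print(req.get_method(), req.full_url)

# To actually send a request (requires a running node):
# with urllib.request.urlopen(index_req) as resp:
#     print(json.load(resp))
```

Sending the PUT a second time with a changed body corresponds to the UPDATE slide: the document is replaced as a whole.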
SEARCH
  Search across all indexes and all types:
    http://localhost:9200/_search
  Search across all types in the movies index:
    http://localhost:9200/movies/_search
  Search explicitly for documents of type movie within the movies index:
    http://localhost:9200/movies/movie/_search
  curl -XPOST "http://localhost:9200/_search" -d'
  {
    "query": {
      "query_string": {
        "query": "kill"
      }
    }
  }'
  Credit: http://joelabrahamsson.com/elasticsearch-101/
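The three search scopes differ only in how much of the path precedes `_search`. The Python sketch below makes that explicit and builds the matching query_string body. The base URL and the helper names `search_url` and `query_string_body` are assumptions for illustration, and the index/type layout follows the slides.

```python
import json

BASE_URL = "http://localhost:9200"  # assumption: local node on the default port

def search_url(index=None, doc_type=None):
    """Return the _search endpoint for the requested scope."""
    parts = [BASE_URL]
    if index:
        parts.append(index)
        if doc_type:
            # A type only makes sense inside a specific index.
            parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)

def query_string_body(text):
    """Serialize a query_string query as the JSON request body."""
    return json.dumps({"query": {"query_string": {"query": text}}})

print(search_url())                   # all indexes, all types
print(search_url("movies"))           # all types in the movies index
print(search_url("movies", "movie"))  # only type movie in the movies index
print(query_string_body("kill"))
```

Building the body from a dictionary rather than by hand keeps the braces balanced, which is easy to get wrong when typing JSON into a curl command.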
SEARCH RESPONSE Credit: http://joelabrahamsson.com/elasticsearch-101/
Updating existing Mapping
  curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
  {
    "movie": {
      "properties": {
        "director": {
          "type": "multi_field",
          "fields": {
            "director": {"type": "string"},
            "original": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }'
  Credit: http://joelabrahamsson.com/elasticsearch-101/
Cluster Architecture Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Index Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Search Request Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Who is using it
  GitHub
  StumbleUpon
  SoundCloud
  Datadog
  Stack Overflow
  Many more…
  http://www.elasticsearch.com/case-studies/
Logstash
Logstash
  Open source, Apache licensed
  Written in JRuby
  Part of the Elasticsearch family
  http://logstash.net/
  Current version: 1.4.0
  This talk uses 1.3.3
Logstash
  Multiple inputs / multiple outputs
  Centralize logs
  Collect
  Parse
  Forward/Store
Architecture Source: http://www.infoq.com/articles/review-the-logstash-book
Logstash – life of an event
  Input → Filters → Output
  Filters are processed in the order they appear in the config file
  Outputs are processed in the order they appear in the config file
  Input: input stream
    File input (tail)
    Log4j
    Redis
    Syslog
    and many more…
  http://logstash.net/docs/1.3.3/
Logstash – life of an event
  Codecs: decoding log messages
    Json
    Multiline
    Netflow
    and many more…
  Filters: processing messages
    Date – date format
    Grok – regular-expression-based extraction
    Mutate – change data type
  Output: storing the structured message
    Elasticsearch
    Mongodb
    Email
    Nagios
  http://logstash.net/docs/1.3.3/
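The input, filter, and output stages can be mimicked in Python to show how an event changes as it moves through the pipeline. This is a conceptual sketch only, not Logstash code: the log-line format, the regular expression, and the function names are made up for illustration, with `grok` standing in for regex-based extraction and `mutate` for a type conversion.

```python
import re

# Hypothetical pattern for a simplified access-log line: "<client> <verb> <path> <status>"
LINE = re.compile(r"(?P<client>\S+) (?P<verb>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})")

def grok(event):
    """Extract named fields from the raw message, or tag the event on failure."""
    match = LINE.match(event["message"])
    if match:
        event.update(match.groupdict())
    else:
        event.setdefault("tags", []).append("_grokparsefailure")
    return event

def mutate(event):
    """Convert the status field from string to integer when present."""
    if "status" in event:
        event["status"] = int(event["status"])
    return event

def run_pipeline(lines, filters, outputs):
    """Wrap each input line in an event, apply filters in order, then each output."""
    for line in lines:
        event = {"message": line}
        for apply_filter in filters:
            event = apply_filter(event)
        for emit in outputs:
            emit(event)

stored = []  # stands in for an output such as Elasticsearch
run_pipeline(
    ["127.0.0.1 GET /index.html 200", "garbage"],
    filters=[grok, mutate],
    outputs=[stored.append],
)
print(stored)
```

Filters run in list order here, mirroring the point above that filters and outputs are processed in the order they appear in the config file.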
Quick Start
  1.3.3 version and earlier, basic-agent.conf:
  input {
    tcp {
      type => "apache"
      port => 3333
    }
  }
  output {
    stdout { debug => true }
    elasticsearch { embedded => true }
  }
  1.3.3 version and earlier:
    java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web
  1.4 version:
    bin/logstash agent -f agent.conf
    bin/logstash –web
Kibana
Source: http://www. slideshare
Source: http://www. slideshare
Analytics
  Analytics source: Kibana.org, based on Elasticsearch and Logstash
  Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
Thanks!
  @rahuldausa on Twitter and SlideShare
  http://www.linkedin.com/in/rahuldausa
  Find this interesting? Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/