Published by Maren Thorstensen, modified over 5 years ago
Full text search in digital and scanned documents with Elasticsearch and Tesseract
Stamo Petkov
Sponsors Gold Sponsors Trusted Partner Innovation Sponsor PASS
Global Sponsor PASS Swag Sponsor
About me
Stamo Petkov
Head of Microsoft Technologies department, Information Services Plc.
Contact: @stamo_petkov
Agenda
- Distributed Document Management System
- Polyglot persistence
- Storing and indexing document content
- Elasticsearch: ingesting attachments, full text search
- Tesseract
- Demo application: MongoDB, RabbitMQ
Distributed Document Management System
- Distributed across a wide geographical area
- High-speed, low-latency service for uploading, searching, and downloading large amounts of digital and scanned documents
- All documents must be available globally
- Performance of resource-intensive tasks: optical character recognition (OCR), extracting and indexing digital content
- Fast and robust full text search across documents, regardless of their format and type
Polyglot persistence “…I'm confident to say that if you are starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. The relational option might be the right one - but you should seriously look at other alternatives…” Martin Fowler, 16 November 2011
Storing and indexing document content
Where to store files?
- Database or file system
- Relational or NoSQL
- Distributed or centralized
- Backup and replication
- Extracting and indexing document content
Elasticsearch
- Elasticsearch is a distributed, RESTful search and analytics engine
- It is fast: inverted indices with finite state transducers for full text querying, and BKD trees for storing numeric and geo data
- You talk to Elasticsearch running on a single node the same way you would in a 300-node cluster
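The inverted index at the heart of full text search can be sketched in a few lines of Python (a toy model: the real structure adds finite state transducers, postings compression, and relevance scoring):

```python
# Toy inverted index: maps each term to the set of document ids that
# contain it, which is what makes term lookup fast at query time.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["the quick brown fox", "the lazy dog", "fox and dog"]
index = build_inverted_index(docs)
print(sorted(index["fox"]))  # [0, 2] -- documents containing "fox"
print(sorted(index["dog"]))  # [1, 2]
```

A query for "fox" never scans documents; it jumps straight to the postings set for that term.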
Cross-cluster replication in Elasticsearch
- Disaster recovery (DR) / high availability (HA)
- Data locality: replicate data in Elasticsearch to get closer to the user or application server, reducing latency
- Centralized reporting: replicate data from a large number of smaller clusters back to a centralized reporting cluster
- CCR is a platinum-level feature
Elasticsearch Ingesting attachments
Ingest Attachments Plugin
- The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika
- The source field must be base64-encoded binary
- The plugin must be installed on every node in the cluster, and each node must be restarted after installation
Ingest Attachments Plugin
| Name | Required | Default | Description |
| field | Yes | - | base64-encoded field to ingest |
| target_field | No | attachment | field to store the attachment information in |
| indexed_chars | No | 100000 | number of chars used for extraction, to prevent huge fields; use -1 for no limit |
| indexed_chars_field | No | null | field name from which you can override the number of chars used for extraction; see indexed_chars |
| properties | No | all properties | array of properties to store: content, title, name, author, keywords, date, content_type, content_length, language |
| ignore_missing | No | false | if true and field does not exist, the processor quietly exits without modifying the document |
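Putting the table together, a pipeline definition might look like the following sketch. Only the JSON bodies are built here; the pipeline id (`attachment`), index name (`my_index`), and source field name (`data`) are illustrative, not from the slides:

```python
import base64
import json

# Ingest pipeline using the options from the table above (assumed names):
pipeline = {
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data",               # required: base64-encoded source field
                "target_field": "attachment",  # same as the default
                "indexed_chars": -1,           # no limit on extracted chars
                "ignore_missing": True,        # skip docs without the field
            }
        }
    ],
}

# The source field must be base64-encoded binary:
raw = b"%PDF-1.4 example file bytes"
doc = {"data": base64.b64encode(raw).decode("ascii")}

# These bodies would be sent as, e.g.:
#   PUT _ingest/pipeline/attachment             (pipeline body)
#   PUT my_index/_doc/1?pipeline=attachment     (document body)
print(json.dumps(pipeline, indent=2))
```

Note the encode/decode round trip: the plugin decodes the base64 back to the original bytes before handing them to Tika.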
Elasticsearch Full text search
Full text search in Elasticsearch
- The high-level full text queries are usually used for running full text queries on full text fields. They understand how the field being queried is analyzed and apply each field's analyzer (or search_analyzer) to the query string before executing
- Analysis is the process of converting text into tokens or terms that are added to the inverted index for searching. Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer defined per index
Full text search in Elasticsearch
Using the built-in english analyzer on the sentence "The QUICK brown foxes jumped over the lazy dog!":
- Convert the sentence into distinct tokens
- Lowercase each token
- Remove frequent stopwords ("the")
- Reduce the terms to their word stems: foxes → fox, jumped → jump, lazy → lazi
In the end, the following terms are added to the inverted index: [ quick, brown, fox, jump, over, lazi, dog ]
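The steps above can be simulated with a toy analyzer (the stemmer is a crude stand-in for the english analyzer's stemming, just enough for this sentence):

```python
import re

STOPWORDS = {"the"}

def toy_stem(token):
    # Crude stand-in for the english stemmer, sufficient for the example
    if token.endswith("es") or token.endswith("ed"):
        return token[:-2]
    if token.endswith("y"):
        return token[:-1] + "i"
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # tokenize
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    return [toy_stem(t) for t in tokens]                 # stem

print(analyze("The QUICK brown foxes jumped over the lazy dog!"))
# → ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```

The same pipeline runs at both index time and query time, which is why a search for "jumping foxes" can still hit this document.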
Full text search in Elasticsearch
- match query: the standard query for performing full text queries, including fuzzy matching and phrase or proximity queries
- match_phrase query: like the match query, but used for matching exact phrases or word-proximity matches
- match_phrase_prefix query: the poor man's search-as-you-type; like the match_phrase query, but does a wildcard search on the final word
- multi_match query: the multi-field version of the match query
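The four request bodies differ only in the query type; a sketch, with the field names (`content`, `title`) and search text as illustrative assumptions:

```python
# Standard full text query:
match = {"query": {"match": {"content": "quick brown fox"}}}

# Exact phrase / word proximity:
match_phrase = {"query": {"match_phrase": {"content": "quick brown fox"}}}

# Search-as-you-type: wildcard on the final word ("fo" -> fox, forest, ...):
match_phrase_prefix = {
    "query": {"match_phrase_prefix": {"content": "quick brown fo"}}
}

# Same match semantics, across several fields at once:
multi_match = {
    "query": {
        "multi_match": {
            "query": "quick brown fox",
            "fields": ["title", "content"],
        }
    }
}
```

Each body would be POSTed to the index's `_search` endpoint.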
Full text search in Elasticsearch
- common terms query: a more specialized query that gives more preference to uncommon words
- query_string query: supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string; for expert users only
- simple_query_string query: a simpler, more robust version of the query_string syntax, suitable for exposing directly to users
- intervals query: a full text query that allows fine-grained control of the ordering and proximity of matching terms
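A sketch of what "suitable for exposing directly to users" buys you: a simple_query_string body where the query text uses the compact operator syntax (`+` for must, `|` for or, `-` for must-not). The field names and boost are illustrative:

```python
# simple_query_string tolerates invalid user syntax instead of raising
# an error, which is why it is safe on a public search box:
simple_query = {
    "query": {
        "simple_query_string": {
            "query": '"fried eggs" +(eggplant | potato) -frittata',
            "fields": ["title^5", "body"],   # ^5 boosts title matches
            "default_operator": "and",
        }
    }
}
```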
Match query
- The match query is of type boolean: the text provided is analyzed, and the analysis process constructs a boolean query from the provided text
- The operator flag controls the boolean clauses (defaults to or)
- The minimum_should_match parameter sets the minimum number of optional should clauses that must match
- analyzer controls which analyzer performs the analysis process on the text
- The lenient parameter can be set to true to ignore exceptions caused by data-type mismatches, such as trying to query a numeric field with a text query string
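The parameters above combine into a single request body; a sketch, where the field name `content` and the parameter values are illustrative:

```python
# match query in its long form, so per-field options can be set:
query = {
    "query": {
        "match": {
            "content": {
                "query": "quick brown fox",
                "operator": "and",             # default is "or"
                "minimum_should_match": "75%",
                "analyzer": "english",
                "lenient": True,               # ignore type-mismatch errors
            }
        }
    }
}
```

With `"operator": "and"`, all analyzed terms must match; the short form `{"match": {"content": "quick brown fox"}}` leaves every option at its default.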
Match query
- Fuzziness: when querying text or keyword fields, fuzziness is interpreted as a Levenshtein edit distance, the number of one-character changes that need to be made to one string to make it the same as another
- The fuzziness parameter can be specified as a single digit (0, 1, 2, …), the maximum allowed Levenshtein edit distance (number of edits), or as AUTO:[min],[max], which generates an edit distance based on the length of the term
- fuzzy_transpositions: by default, fuzzy queries determine a match with the Damerau-Levenshtein distance, which supports transpositions; setting this to false switches to classic Levenshtein distance
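The classic Levenshtein distance that `fuzziness` counts can be computed with the standard dynamic-programming recurrence; a sketch:

```python
def levenshtein(a, b):
    # Classic edit distance: insertions, deletions, substitutions,
    # computed with a rolling DP row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("quick", "quack"))  # 1 -> matched at fuzziness 1
print(levenshtein("ab", "ba"))        # 2 under classic Levenshtein
# With fuzzy_transpositions left at its default (Damerau-Levenshtein),
# the swap "ab" -> "ba" counts as a single edit instead of two.
```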
Match query
- zero_terms_query: by default, when the analyzer removes all tokens in a query, no documents match at all; you can change this to match all documents by setting the option to all
- cutoff_frequency: specifies an absolute or relative document frequency above which high-frequency terms are moved into an optional subquery; they are scored only if at least one low-frequency (below-the-cutoff) term matches (with the or operator) or all low-frequency terms match (with the and operator)
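A sketch of the zero_terms_query case: if "to be or not to be" goes through the built-in stop analyzer, every token may be removed as a stopword, and `"zero_terms_query": "all"` then matches everything instead of nothing. The field name `content` is illustrative:

```python
# Without zero_terms_query: "all", this query could match no documents
# once the stop analyzer strips every term:
query = {
    "query": {
        "match": {
            "content": {
                "query": "to be or not to be",
                "analyzer": "stop",
                "zero_terms_query": "all",
            }
        }
    }
}
```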
Tesseract
Tesseract
- Originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co., Greeley, Colorado, between 1985 and 1994
- Open-sourced by HP in 2005; developed by Google since 2006
- Tesseract 4 adds a new neural net (LSTM) based OCR engine focused on line recognition, while still supporting the legacy Tesseract 3 OCR engine, which works by recognizing character patterns
Tesseract
- Unicode (UTF-8) support; can recognize more than 100 languages "out of the box"
- Can be trained to recognize other languages
- Supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV; the master branch also has experimental support for ALTO (XML) output
- Developers can use the libtesseract C or C++ API to build their own applications
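The output formats map onto the command line as config names after the output base (`tesseract <image> <outputbase> -l <lang> [config]`). A sketch that only builds the argument list; actually running it requires Tesseract to be installed, and the file names are illustrative:

```python
import subprocess  # noqa: F401 -- used only in the commented-out run below

def ocr_command(image_path, out_base, lang="eng", fmt="pdf"):
    # fmt maps to a Tesseract config name: "pdf", "hocr", "tsv", ...
    # Plain-text output is the default and needs no config argument.
    cmd = ["tesseract", image_path, out_base, "-l", lang]
    if fmt != "txt":
        cmd.append(fmt)
    return cmd

cmd = ocr_command("scan.png", "scan", lang="eng", fmt="pdf")
print(" ".join(cmd))  # tesseract scan.png scan -l eng pdf
# To actually run it (requires a Tesseract install):
# subprocess.run(cmd, check=True)   # writes scan.pdf
```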
Demo Application
- You can get the FullTextSearch demo application on GitHub
- Requires the .NET Core 2.2 SDK and Docker to be installed
- Precompiled Tesseract binaries for Linux containers are included in the solution
- All containers are configured with the additional plugins and libraries the application needs
- An NginX reverse proxy is used to access the containers from outside
Demo Application – used technologies
- MongoDB: used for file storage
- Elasticsearch: you know, for search…
- RabbitMQ: separates uploading from processing and indexing of documents
- Tesseract: optical character recognition
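The way the queue decouples uploading from processing can be sketched with an in-memory stand-in (a toy model, not the actual demo code: `queue.Queue`, `file_store`, `fake_ocr`, and `search_index` stand in for RabbitMQ, MongoDB, Tesseract, and Elasticsearch):

```python
import queue

tasks = queue.Queue()   # stands in for RabbitMQ
file_store = {}         # stands in for MongoDB
search_index = {}       # stands in for Elasticsearch

def upload(doc_id, content):
    # Fast path: store the raw file and enqueue a processing task;
    # the uploader never waits for OCR or indexing.
    file_store[doc_id] = content
    tasks.put(doc_id)

def fake_ocr(content):
    # Stand-in for Tesseract / ingest-attachment extraction
    return content.decode("utf-8", errors="ignore")

def worker():
    # Slow path: consume tasks, extract text, then index it
    while not tasks.empty():
        doc_id = tasks.get()
        search_index[doc_id] = fake_ocr(file_store[doc_id])

upload("doc1", b"scanned invoice text")
worker()
print(search_index["doc1"])
```

In the real application the worker runs in separate containers, so OCR throughput scales independently of upload throughput.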
Thank you!