Apache Lucene
Ioan Toma, based on slides from Aaron Bannert
© Copyright 2012 STI INNSBRUCK

What is Apache Lucene?
“Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.” - from

Features
Scalable, High-Performance Indexing
– over 95GB/hour on modern hardware
– small RAM requirements -- only 1MB heap
– incremental indexing as fast as batch indexing
– index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
– ranked searching -- best results returned first
– many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
– fielded searching (e.g., title, author, contents)
– date-range searching
– sorting by any field
– multiple-index searching with merged results
– allows simultaneous update and searching

Features
Cross-Platform Solution
– Available as Open Source software under the Apache License, which lets you use Lucene in both commercial and Open Source programs
– 100%-pure Java
– Implementations in other programming languages available that are index-compatible:
  CLucene - Lucene implementation in C++
  Lucene.Net - Lucene implementation in .NET
  Zend Search - Lucene implementation in the Zend Framework for PHP 5

Ranked Searching
1. Phrase Matching
2. Keyword Matching
– Prefer more unique terms first
– Scoring and ranking take into account the uniqueness of each term when determining a document’s relevance score
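
The preference for more unique terms can be illustrated with an inverse-document-frequency weight. The following is a minimal stand-alone sketch, not Lucene's actual scoring formula: the class name RankingSketch, the smoothed logarithm, and the representation of a document as a set of terms are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: documents matching rarer query terms receive higher scores. */
final class RankingSketch {

    /** Weight of a term: larger when fewer documents contain it (smoothed IDF, an assumption). */
    static double idf(String term, List<Set<String>> docs) {
        long containing = docs.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) (docs.size() + 1) / (containing + 1)) + 1.0;
    }

    /** Score of one document: sum of the weights of the query terms it contains. */
    static double score(Set<String> doc, List<String> queryTerms, List<Set<String>> docs) {
        double total = 0.0;
        for (String term : queryTerms) {
            if (doc.contains(term)) {
                total += idf(term, docs);
            }
        }
        return total;
    }

    /** Returns document positions ordered from highest to lowest score. */
    static List<Integer> rank(List<String> queryTerms, List<Set<String>> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(
                score(docs.get(b), queryTerms, docs),
                score(docs.get(a), queryTerms, docs)));
        return order;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                new HashSet<>(Arrays.asList("star", "trek")),
                new HashSet<>(Arrays.asList("star", "wars")),
                new HashSet<>(Arrays.asList("star", "dust")));
        // "wars" occurs in one document, "star" in all three, so "wars" carries more weight.
        System.out.println(rank(List.of("star", "wars"), docs));
    }
}
```

In this toy collection every document contains "star", so that term barely separates the documents; the match on the rarer "wars" is what decides the top result.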

Flexible Queries
Phrases: “star wars”
Wildcards: star*
Ranges: {star-stun} [ ]
Boolean Operators: star AND wars

Field-specific Queries
Field-specific queries can be used to target specific fields in the Document Index.
For example: title:"star wars" AND director:"George Lucas"

Sorting
Can sort by any field in a Document
– For example, by Price, Release Date, Amazon Sales Rank, etc…
By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.
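
The two orderings described above (relevance by default, any other field on request) can be sketched with plain Java comparators. This is an illustration only: Hit, its fields, and the comparator names are hypothetical and are not Lucene types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of result ordering; Hit and its fields are illustrative only. */
final class SortingSketch {

    static final class Hit {
        final String title;
        final double score;  // relevance score assigned by the search
        final double price;  // an example field value to sort on

        Hit(String title, double score, double price) {
            this.title = title;
            this.score = score;
            this.price = price;
        }
    }

    /** Default ordering: highest relevance score first. */
    static final Comparator<Hit> BY_RELEVANCE = (a, b) -> Double.compare(b.score, a.score);

    /** Alternative ordering: lowest price first, ties broken by relevance. */
    static final Comparator<Hit> BY_PRICE =
            Comparator.<Hit>comparingDouble(h -> h.price).thenComparing(BY_RELEVANCE);

    /** Returns the titles of the hits in the requested order, leaving the input untouched. */
    static List<String> titles(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        List<String> titles = new ArrayList<>();
        for (Hit h : sorted) {
            titles.add(h.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("A", 0.9, 30.0),
                new Hit("B", 0.5, 10.0),
                new Hit("C", 0.7, 20.0));
        System.out.println(titles(hits, BY_RELEVANCE));
        System.out.println(titles(hits, BY_PRICE));
    }
}
```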

LUCENE INTERNALS

Everything is a Document
A document can represent anything textual:
– Word Document
– DVD (the textual metadata only)
– Website Member (name, ID, etc…)
A Lucene Document need not refer to an actual file on a disk; it could also resemble a row in a relational database.
Developers are responsible for turning their own data sets into Lucene Documents.
A document is seen as a list of fields, where a field has a name and a value.

Indexes
The unit of indexing in Lucene is a term. A term is often a word.
Indexes track term frequencies.
Every term maps back to a Document.
Lucene uses an inverted index, which allows it to quickly locate every document currently associated with a given set of input search terms.
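
The inverted-index idea (each term maps back to the documents containing it, along with a term frequency) can be sketched in a few lines of stand-alone Java. This is a simplified illustration, not Lucene's on-disk index format; the class InvertedIndexSketch, its method names, and the naive whitespace/punctuation tokenization are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of an in-memory inverted index with term frequencies. */
final class InvertedIndexSketch {

    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Splits the text into lowercase terms and records each occurrence for the document. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    /** Looks up the ids of all documents containing the term, without scanning any document. */
    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of()).keySet();
    }

    /** Returns how often the term occurs in the given document (0 if absent). */
    int termFrequency(String term, int docId) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of()).getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Star Wars");
        index.add(2, "Star Trek: the star ship");
        System.out.println(index.docsContaining("star"));
        System.out.println(index.termFrequency("star", 2));
    }
}
```

A lookup touches only the entry for the queried term, which is why the cost depends on how many documents match rather than on how many are indexed.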

Basic Indexing
1. Parse different types of documents (HTML, PDF, Word, text files, etc.)
2. Extract tokens and related info (Lucene Analyzer)
3. Add the Document to an Index
Lucene provides a standard analyzer for English and Latin-based languages.
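
Step 2, extracting tokens, is the part an analyzer is responsible for. The sketch below shows the general shape of such a step (split, lowercase, drop very common words). It is a stand-alone illustration and not Lucene's analyzer implementation: the class AnalyzerSketch and its tiny stop-word list are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of an analysis step: text in, normalized tokens out. */
final class AnalyzerSketch {

    // Illustrative stop-word list; a real analyzer's list would differ.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    /** Lowercases the text, splits on anything that is not a letter or digit, drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Return of the Jedi"));
    }
}
```

The tokens produced here are what would be added to the index in step 3.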

Basic Searching
1. Create a Query (e.g., by parsing user input)
2. Open an Index
3. Search the Index
– Use the same Analyzer as before
4. Iterate through returned Documents
– Extract out needed results
– Extract out result scores (if needed)
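
Why the same analyzer must be used at index time and at query time can be seen in a small stand-alone sketch: if the index stores lowercase terms, a query for "STAR" only matches after it has been normalized the same way. SearchSketch and its methods are hypothetical, and the AND semantics chosen for multi-term queries is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: indexing and searching share one analysis function. */
final class SearchSketch {

    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** The single analysis step used for both documents and queries. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!raw.isEmpty()) {
                terms.add(raw);
            }
        }
        return terms;
    }

    /** Adds a document's analyzed terms to the index. */
    void index(int docId, String text) {
        for (String term : analyze(text)) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns ids of documents containing every term of the analyzed user input. */
    Set<Integer> search(String userInput) {
        Set<Integer> result = null;
        for (String term : analyze(userInput)) {
            Set<Integer> docs = index.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        SearchSketch search = new SearchSketch();
        search.index(1, "Star Wars");
        search.index(2, "Star Trek");
        // Iterate through the returned document ids.
        for (int docId : search.search("STAR wars")) {
            System.out.println("hit: " + docId);
        }
    }
}
```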

Lucene as SOA
1. Design an HTTP query syntax
– GET queries
– XML for results
2. Wrap Tomcat around core code
3. Write a Client Library
As it follows SOA principles, basic building blocks such as load balancers can be deployed to quickly scale up the capacity of the search subsystem.

Lucene as SOA Diagram
Single-Machine Architecture
A Lucene-based application includes three components:
1. Lucene Custom Client Library
2. Search Service
3. Custom Core Search Library

LUCENE SCALABILITY

Scalability Limits
3 main scalability factors:
– Query Rate
– Index Size
– Update Rate

Query Rate Scalability
Lucene is already fast
– Built-in caching
Easy solution for heavy workloads (gives near-linear scaling):
– Add more query servers behind a load balancer
– Can grow as your traffic grows
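
Spreading queries over several query servers can be as simple as round-robin selection. The sketch below is a stand-alone illustration of that one policy; RoundRobinSketch and the server names are hypothetical, and a real load balancer would typically add health checks and other policies.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: hand each incoming query to the next query server in turn. */
final class RoundRobinSketch {

    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSketch(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one query server is required");
        }
        this.servers = List.copyOf(servers);
    }

    /** Returns the server that should handle the next query; safe to call from many threads. */
    String next() {
        // floorMod keeps the position valid even after the counter overflows.
        int position = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(position);
    }

    public static void main(String[] args) {
        RoundRobinSketch balancer = new RoundRobinSketch(List.of("query-1", "query-2", "query-3"));
        for (int i = 0; i < 4; i++) {
            System.out.println(balancer.next());
        }
    }
}
```

Because each server receives roughly an equal share of the queries, adding a server adds a proportional amount of query capacity, which is the near-linear scaling mentioned above.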

Lucene as SOA Diagram
High-Scale Multi-Machine Architecture

Index Size Scalability
Can easily handle millions of Documents
Lucene is very commonly deployed into systems with tens of millions of Documents.
The main limits related to index size that one is likely to run into are disk capacity and disk I/O.
If you need bigger: built-in multi-machine capabilities
– Can merge multiple remote indexes at query-time.
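
Merging several indexes at query time amounts to combining each index's hits into one ranked list. The following stand-alone sketch shows only that merge step and is not Lucene's multi-index API; MergeSketch and Hit are hypothetical, and the sketch assumes the scores from different indexes are directly comparable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: combine per-index hits into a single top-k result list. */
final class MergeSketch {

    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns the k best hits across all indexes, highest score first. */
    static List<Hit> mergeTopK(List<List<Hit>> perIndexHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Double.compare(b.score, a.score));
        int limit = Math.min(Math.max(k, 0), all.size());
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Hit> indexA = List.of(new Hit("a1", 0.9), new Hit("a2", 0.4));
        List<Hit> indexB = List.of(new Hit("b1", 0.7), new Hit("b2", 0.6));
        for (Hit hit : mergeTopK(List.of(indexA, indexB), 3)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```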

Update Rate
Lucene is thread-safe
– Can update and query at the same time
I/O is the limiting factor
Strategies for achieving even higher update rates:
– Vertical Split – for big workloads (Centralized Index Building)
  1. Build indexes apart from query service
  2. Push updated indexes on intervals
– Horizontal Split – for huge workloads
  1. Split data into columns
  2. Merge columns for queries
  3. Columns only receive their own data for updates
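
The horizontal split can be sketched as a set of independent partitions (the "columns" above): each update is routed to exactly one partition, and a query merges the answers from all of them. This is a stand-alone illustration under assumptions: PartitionedIndexSketch, routing by hash of the document id, and the in-memory maps are all hypothetical, and re-indexing or deleting an existing document is not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a horizontally split index. */
final class PartitionedIndexSketch {

    // One small term -> document ids index per partition.
    private final List<Map<String, Set<String>>> partitions = new ArrayList<>();

    PartitionedIndexSketch(int partitionCount) {
        if (partitionCount < 1) {
            throw new IllegalArgumentException("at least one partition is required");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Routing rule (an assumption): a document always maps to the same partition. */
    int partitionFor(String docId) {
        return Math.floorMod(docId.hashCode(), partitions.size());
    }

    /** An update touches only the partition that owns the document. */
    void update(String docId, String text) {
        Map<String, Set<String>> partition = partitions.get(partitionFor(docId));
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                partition.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks up a term in a single partition. */
    Set<String> searchPartition(int partition, String term) {
        return partitions.get(partition).getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    /** A query asks every partition and merges the results. */
    Set<String> search(String term) {
        Set<String> merged = new TreeSet<>();
        for (int i = 0; i < partitions.size(); i++) {
            merged.addAll(searchPartition(i, term));
        }
        return merged;
    }

    public static void main(String[] args) {
        PartitionedIndexSketch index = new PartitionedIndexSketch(3);
        index.update("doc-a", "Star Wars");
        index.update("doc-b", "Star Trek");
        System.out.println(index.search("star"));
    }
}
```

Since each partition receives only its own share of the updates, the write load per machine shrinks as partitions are added, at the cost of fanning every query out to all partitions.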