

Apache Lucene in LexGrid

Lucene Overview
High-performance, full-featured text search engine library.
Written entirely in Java.
An open source project available for free download.

Lucene structure overview
Index : Contains a sequence of documents.
Document : Is a sequence of fields.
Field : Is a named sequence of terms.
Term : Is a string.
  – The same string can be assigned to different fields.
  – Lucene indexes only text (Strings).
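The containment hierarchy above can be pictured with a few plain Java types. This is only an illustrative model, not Lucene's API: ModelField, ModelDocument and ModelIndex are hypothetical names introduced here, and the counting method exists just to show that the same string may appear as a term in different fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model types for illustration only; these are not Lucene classes.
final class ModelField {
    final String name;         // field name, e.g. "propertyValue"
    final List<String> terms;  // a field is a named sequence of terms (strings)

    ModelField(String name, String... terms) {
        this.name = name;
        this.terms = Arrays.asList(terms);
    }
}

final class ModelDocument {
    // A document is a sequence of fields.
    final List<ModelField> fields = new ArrayList<>();

    ModelDocument add(ModelField field) {
        fields.add(field);
        return this;
    }
}

final class ModelIndex {
    // An index is a sequence of documents.
    final List<ModelDocument> documents = new ArrayList<>();

    // Count the documents in which the given field contains the given term.
    int countDocuments(String fieldName, String term) {
        int count = 0;
        for (ModelDocument doc : documents) {
            for (ModelField f : doc.fields) {
                if (f.name.equals(fieldName) && f.terms.contains(term)) {
                    count++;
                    break;  // count each document at most once
                }
            }
        }
        return count;
    }
}
```

Because a term is always looked up together with a field name, the string "en" in a "language" field and the string "en" in a "propertyValue" field are kept apart.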

Index or Store Fields
Index :
  – Will be used for searching.
  – Stores statistics about terms in order to make term-based search more efficient.
  – Inverted index : for a term, it can list all the documents that contain it.
Store :
  – Not used for searching.
  – Helpful for debugging.
  – Term is stored in the index literally.
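The difference between indexing and storing can be sketched with a toy inverted index. This is a simplified illustration under stated assumptions, not how Lucene is implemented: ToyInvertedIndex and its methods are hypothetical, and tokenization here is just lower-casing plus a whitespace split.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents that contain it.
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final Map<Integer, String> stored = new HashMap<>();

    // Index the terms of a document; optionally also keep the literal text as a stored value.
    void add(int docId, String text, boolean store) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.put(docId, text);
        }
    }

    // Look up the documents containing a term without scanning any document text.
    SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase());
        return hits == null ? Collections.<Integer>emptySortedSet() : hits;
    }

    // Return the literal stored text, or null if the document was indexed but not stored.
    String storedValue(int docId) {
        return stored.get(docId);
    }
}
```

A document added with store set to false is still found by term lookups, but its original text cannot be read back, which mirrors the indexed-versus-stored distinction on the slide.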

Analyzer
Responsible for breaking up the text in each of the document fields into individual tokens.
Tokens are the smallest piece of information that you can search.
You can also use a different analyzer for each field so they can be treated differently.
However, at search time your search analyzer must match your indexing analyzer in order to get good results.
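Why the search analyzer has to match the indexing analyzer can be seen in a small sketch. The two "analyzers" below are hypothetical stand-ins written as plain functions, not Lucene Analyzer classes: both split on whitespace, but only one lower-cases.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

final class AnalyzerMismatchDemo {
    // Two hypothetical analyzers: both split on whitespace, only one lower-cases.
    static final Function<String, List<String>> LOWERCASING =
            text -> tokenize(text.toLowerCase(Locale.ROOT));
    static final Function<String, List<String>> CASE_PRESERVING =
            AnalyzerMismatchDemo::tokenize;

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if every token of the analyzed query is among the tokens that were indexed.
    static boolean matches(String indexedText, Function<String, List<String>> indexAnalyzer,
                           String query, Function<String, List<String>> searchAnalyzer) {
        Set<String> indexed = new HashSet<>(indexAnalyzer.apply(indexedText));
        return indexed.containsAll(searchAnalyzer.apply(query));
    }
}
```

Indexing "Heart Attack" with the lower-casing analyzer produces the tokens "heart" and "attack". A query for "Heart" analyzed the same way matches, while the same query passed through the case-preserving analyzer looks for the token "Heart" and finds nothing.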

The Mapping
Our indexer code reads LexGrid data from the database. The reader code needs to assemble the concept information so that it can call this method:

    protected void addConcept(String codingSchemeName, String codingSchemeId,
            String conceptCode, String propertyType, String property,
            String propertyValue, Boolean isActive, String presentationFormat,
            String language, Boolean isPreferred, String conceptStatus,
            String propertyId, String degreeOfFidelity, Boolean matchIfNoContext,
            String representationalForm, String[] sources, String[] usageContexts,
            Qualifier[] qualifiers)

Every time this method is called, it creates a Lucene document out of this information.

The Mapping (Cont..)
The above method is called for every Presentation, Property, Definition, etc. in a concept code.
This is all of the information from LexGrid that is currently stored in the index.
When the Boolean parameters are indexed, they are stored as a 'T' or an 'F', if supplied.
When constructing a field, we have to decide whether it will be analyzed, stored, and indexed.

The Mapping (Cont..)
Here is the breakdown of the Lucene fields that we create:
  codingSchemeName -> “codingSchemeName” S
  codingSchemeId -> “codingSchemeId” S
  conceptCode -> “conceptCodeTokenized” T
  conceptCode -> “conceptCode” S
  conceptCode -> “conceptCodeLC” LC
  property -> “property” S
  language -> “language” S
propertyType is special: if it is not supplied, it is automatically set to textualPresentation, definition, comment, instruction, or property, depending on the value of the property variable.
  propertyType -> “propertyType” S
The following fields are optional and are only added if the provided values are non-null:
  propertyValue -> “propertyValue” ST
  propertyValue -> (lowercased) -> “untokenizedLCPropertyValue” LC
  *If normalization enabled* propertyValue -> “norm_propertyValue” T
  *If doubleMetaphone enabled* propertyValue -> “dm_propertyValue” T
  *If stemming enabled* propertyValue -> “stem_propertyValue” T
  isActive -> “isActive” S
  isPreferred -> “isPreferred” S
  presentationFormat -> “presentationFormat” S
  conceptStatus -> “conceptStatus” S
  propertyId -> “propertyId” S
  degreeOfFidelity -> “degreeOfFidelity” S
  representationalForm -> “representationalForm” S
  matchIfNoContext -> “matchIfNoContext” S
  sources -> “sources” S T*
  usageContexts -> “usageContexts” S T*
  qualifiers -> “qualifiers” S T**

The Mapping (Cont..)
Field “fields” :
  – Added to each document.
  – List of the fields present in this document.
  – Helps searching for documents that contain (or don't contain) a particular field.
Field “UNIQUE_DOCUMENT_IDENTIFIER_FIELD” :
  – Added to each document.
  – Populated by the “codingSchemeName” plus a hyphen and a document counter value.
  – Makes it easier to remove documents.
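The two bookkeeping fields can be sketched as follows. This is a minimal illustration, not the LexGrid indexer code: the BookkeepingFields class is hypothetical, a document is represented as a plain name-to-value map, and both the space-separated format of “fields” and the counter starting at 0 are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

final class BookkeepingFields {
    // Running document counter (starting at 0 is an assumption for this sketch).
    private final AtomicLong counter = new AtomicLong();

    // Returns a copy of the document with the two bookkeeping entries added.
    Map<String, String> decorate(String codingSchemeName, Map<String, String> document) {
        Map<String, String> result = new LinkedHashMap<>(document);
        // "fields": names of the fields present in this document
        // (space-separated here by assumption).
        result.put("fields", String.join(" ", document.keySet()));
        // Unique id: coding scheme name, a hyphen, and the document counter value.
        result.put("UNIQUE_DOCUMENT_IDENTIFIER_FIELD",
                codingSchemeName + "-" + counter.getAndIncrement());
        return result;
    }
}
```

With a field list on every document, a query such as "documents that have no isPreferred value" becomes an ordinary term lookup on “fields”, and the unique identifier gives each document a single term by which it can be deleted.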

The Mapping (Cont..)
Analyzers / Tokenizers
WhiteSpaceLowerCaseAnalyzer
  – The default analyzer.
  – Makes text lower case.
  – Splits the text into tokens on white space and on the following characters: '-', ';', '(', ')', '{', '}', '[', ']', ' ', '|'
  – Removes the following characters: ',', '.', '/', '\\', '`', '\'', '"', '+', '*', '=', '#', '$', '%', '^', '&', '?', '!'
  – Used on the “conceptCode” and “propertyValue” fields.
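The tokenization rules listed for WhiteSpaceLowerCaseAnalyzer can be approximated in plain Java. This is a sketch of the described behaviour only, not the LexGrid class itself: WhitespaceLowerCaseSketch is a hypothetical name, and it does not extend Lucene's Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WhitespaceLowerCaseSketch {
    // Characters that end the current token (in addition to white space).
    private static final String SPLIT_CHARS = "-;(){}[]|";
    // Characters that are dropped without ending the current token.
    private static final String REMOVED_CHARS = ",./\\`'\"+*=#$%^&?!";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isWhitespace(c) || SPLIT_CHARS.indexOf(c) >= 0) {
                flush(current, tokens);   // token boundary
            } else if (REMOVED_CHARS.indexOf(c) < 0) {
                current.append(c);        // ordinary character: keep it
            }
            // Characters in REMOVED_CHARS fall through and are silently dropped.
        }
        flush(current, tokens);
        return tokens;
    }

    // Emit the pending token, if any, and reset the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Under these rules the input "Heart-Attack (acute), U.S." yields the tokens heart, attack, acute and us: the hyphen and parentheses split tokens, while the comma and periods are removed in place.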

The Mapping (Cont..)
Analyzers / Tokenizers
NormAnalyzer
  – Used when normalization is enabled.
  – Uses the WhiteSpaceLowerCaseAnalyzer and LVG Norm. Ex : if the string “trees” is fed into the analyzer, Lucene will end up indexing “tree”.
  – Used on the “norm_propertyValue” field.
EncoderAnalyzer
  – Used when Double Metaphone indexing is enabled.
  – Uses the WhiteSpaceLowerCaseAnalyzer and the Apache Commons Codec Double Metaphone algorithm.

Index Usage in LexBIG
Restrictions on CodedNodeSets are turned into Lucene queries.
Supported Queries:
  – LuceneQuery
  – DoubleMetaphoneLuceneQuery
  – StemmedLuceneQuery
  – StartsWith
  – ExactMatch
  – Contains
  – RegExp

Index Usage in LexBIG (Cont..)
Simple Queries:
  – Queries are constructed from a default field, a term value and an Analyzer, based on the user-specified query.
  – For example, if you specify 'activeOnly', we add a section to the query which requires the “isActive” field to have a value of 'T'.
  – Nearly all of the untokenized fields are handled this way.
  – “startsWith” and “exactMatch” queries are also handled this way.

Index Usage in LexBIG (Cont..)
Complex Queries:
  – User queries containing boolean logic, embedded wild cards, etc.
  – Rely on the Lucene QueryParser, which is given the appropriate Analyzer and field depending on the type of search algorithm selected.
  – For example, for normalized search, we feed the matchText into a QueryParser with the NormAnalyzer, and the “norm_propertyValue” field set as the default field. For the “LuceneQuery” match algorithm, we provide the WhiteSpaceLowerCaseAnalyzer and the “propertyValue” field.
  – Wild card and fuzzy searches are supported.

Index Usage in LexBIG (Cont..)
Result from Lucene :
  – BitSet (an array of bits, each either 1 or 0) with one bit per Lucene document. Each bit is set to 1 if the document matched the query, or to 0 if it did not.
  – We take advantage of the boundary documents.
  – Combining the boundary bitSet with the user query bitSet gives all the matching unique concept code data.
  – Additional restrictions on a CodedNodeSet are resolved to a bitSet in the same way as above, and then the bitSets are AND'ed together.
  – The BitSet is resolved into a CodeHolder object (contains only the conceptCode, codingScheme, and version, plus the score if requested), which can be used for ‘union’, ‘intersection’ and ‘difference’.
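The AND step over per-restriction bit sets can be illustrated with java.util.BitSet from the standard library. This is a minimal sketch, not LexBIG code: BitSetCombiner is a hypothetical helper, and the bit positions used in the example are made up.

```java
import java.util.BitSet;

final class BitSetCombiner {
    // AND together the per-restriction bit sets: bit i stays 1 only if
    // document i satisfied every restriction.
    static BitSet intersect(BitSet first, BitSet... rest) {
        BitSet result = (BitSet) first.clone();  // do not mutate the caller's bit set
        for (BitSet b : rest) {
            result.and(b);
        }
        return result;
    }
}
```

For instance, if a text query matched documents 1, 3 and 4 and an 'activeOnly' restriction matched documents 0, 1, 4 and 5, the intersection has only bits 1 and 4 set. BitSet also offers or and andNot, which correspond to the union and difference operations mentioned above.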

Index Usage in LexBIG (Cont..)
  – If the entire result is to be returned at once : Each of the items in the CodeHolder is resolved into a ResolvedConceptReference. This is done through a series of SQL calls.
  – If asked for an iterator : An Iterator object is created which holds the CodeHolder. Individual ResolvedConceptReferences are resolved from the SQL Server as needed.
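The iterator variant, where references are resolved only as they are requested, can be sketched generically. This is an illustration of the lazy-resolution idea rather than the LexBIG iterator: LazyResolvingIterator is a hypothetical class, and the resolver function stands in for the SQL lookups.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Resolves held codes one at a time, calling the (possibly expensive)
// resolver only when the caller actually asks for the next element.
final class LazyResolvingIterator<C, R> implements Iterator<R> {
    private final Iterator<C> codes;
    private final Function<C, R> resolver;  // stand-in for the SQL lookups

    LazyResolvingIterator(List<C> heldCodes, Function<C, R> resolver) {
        this.codes = heldCodes.iterator();
        this.resolver = resolver;
    }

    @Override
    public boolean hasNext() {
        return codes.hasNext();
    }

    @Override
    public R next() {
        if (!codes.hasNext()) {
            throw new NoSuchElementException();
        }
        return resolver.apply(codes.next());
    }
}
```

Creating the iterator performs no lookups at all. A caller that reads only the first few results pays for only those resolutions, whereas the return-everything-at-once path resolves every held code up front.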

Questions ??