Richa Arora.  Tool Identified and Overview  Schema.xml  Tokenization, Stop words, and Synonym Handling  Indexing  Data Import Handler  Query format.

Presentation transcript:

Richa Arora

Tool Identified and Overview
Schema.xml
Tokenization, Stop words, and Synonym Handling
Indexing
Data Import Handler
Query format and Matching documents to query
Function Queries
Bibliography

SOLR - Open source enterprise search platform from the Apache Lucene project
Purpose
  ◦ To implement full-text search functionality in a web application
Commercial websites using SOLR
  ◦ One example site uses SOLR via Drupal for site search with highlighting and faceting

[Diagram labels: Web server; Database server; Web Application; SOLR; Document; Database]

Features
  ◦ Full text search
  ◦ Rich document handling (including MS Word, PDF, RTF, etc.)
  ◦ HTML administration interface
  ◦ Scalable
Technology
  ◦ Java programming language
  ◦ Lucene Java search library
  ◦ Runs as a search server within a servlet container such as Tomcat or Jetty

[Diagram labels: Documents; Browser-based web interface; Solr Server; Documents for indexing; Search Queries; Search Results; Indexing; Searching; Index; schema.xml; solrconfig.xml]

Documents form the basic unit of SOLR
Documents are composed of fields
Examples:
  ◦ Document for Person: Fields – name, height, age, etc.
  ◦ Document for Recipes: Fields – origin, ingredients, etc.
Documents are fed to SOLR
SOLR extracts the information from the fields in the documents and makes it searchable
Steps:
  ◦ Field Analysis
  ◦ Tokenization
  ◦ Filter application
  ◦ Indexing

schema.xml governs how SOLR builds indexes from input documents
It defines the field types and the specific fields that documents can contain
It describes how SOLR should handle those fields when adding documents to the index or when querying them

Analyzers examine the text of fields and generate a token stream
Indexing analyzers: the results of the analysis are added to an index and define the set of terms (with positions, sizes, etc.) for a field
Querying analyzers: the values being searched for are analyzed, and the resulting terms are matched against those stored in the field's index

Tokenizers split a stream of text into tokens
Tokens are subsequences of the characters in the text
A token carries metadata in addition to its text value, such as the location at which the token occurs in the field
Examples
  ◦ Standard Tokenizer: Treats whitespace and punctuation as delimiters
  ◦ N-Gram Tokenizer: Reads the field text and generates n-gram tokens of sizes in the given range (default minimum is 1 and maximum is 2)
    Input: “hello world”
    Output: “h”, “e”, “l”, “l”, “o”, “ ”, “w”, “o”, “r”, “l”, “d”, “he”, “el”, “ll”, “lo”, “o ”, “ w”, “wo”, “or”, “rl”, “ld”
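The n-gram behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's implementation: the function name `ngram_tokens` is hypothetical, and emitting all shorter n-grams before longer ones is an assumption chosen to mirror the ordering of the slide's output.

```python
def ngram_tokens(text, min_size=1, max_size=2):
    """Return every character n-gram of `text`, shortest sizes first."""
    tokens = []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` characters across the text.
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens


tokens = ngram_tokens("hello world")
print(tokens)
```

Note that whitespace is treated like any other character here, so n-grams containing the space between the two words are produced as well.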

 Filters take tokens as input from the Tokenizers and produce another stream of tokens as output  Multiple filters can be used one after the other  Example:

Stop Filter: Discards tokens that are on the given stop words list. A standard stop words list for English-language text, named stopwords.txt, is included in the SOLR config directory
Example: Using the standard stopwords.txt
  Tokenizer Input: “welcome to the world of Solr”
  Tokenizer Output/Filter Input: “welcome”(1), “to”(2), “the”(3), “world”(4), “of”(5), “Solr”(6)
  Filter Output: “welcome”(1), “world”(2), “Solr”(3)
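A rough Python sketch of the tokenizer-then-filter flow in this example follows. It is not how Solr implements the filter: the whitespace tokenizer, the three-word stop list (an assumed subset of stopwords.txt), and the helper names are all illustrative assumptions. Positions are renumbered after filtering, as in the slide's output.

```python
# Assumed subset of stopwords.txt, just enough for this example.
STOP_WORDS = {"to", "the", "of"}


def tokenize(text):
    """Stand-in tokenizer: split on whitespace only."""
    return text.split()


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is on the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


def with_positions(tokens):
    """Pair each token with its 1-based position in the stream."""
    return [(token, position) for position, token in enumerate(tokens, start=1)]


tokens = tokenize("welcome to the world of Solr")
filtered = stop_filter(tokens)
print(with_positions(tokens))
print(with_positions(filtered))
```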

Synonym Filter: Finds synonyms at indexing time as well as at query time. Tokens are looked up in the list of synonyms and, if a match is found, the synonyms are put in place of the token
Example: We can define the synonyms in a file (test_synonyms.txt) and use it for comparing the tokens
  ◦ home, dwelling, house
  ◦ shop => workshop, store
  ◦ teh => the
  Tokenizer Input: “teh home shop”
  Tokenizer Output/Filter Input: “teh”(1), “home”(2), “shop”(3)
  Filter Output: “the”(1), “home”(2), “dwelling”(2), “house”(2), “workshop”(3), “store”(3)
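The two rule styles in the synonym file can be mimicked with a small Python sketch. It is a simplified model and not Solr's parser: the function names are hypothetical, comma-separated groups are assumed to expand every member to the whole group, and all replacements for one input token are assumed to share that token's position.

```python
def parse_synonyms(lines):
    """Build a token -> replacement-list map from synonym rule lines."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right-hand terms.
            left, right = line.split("=>")
            targets = [word.strip() for word in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalence group: each member expands to the whole group.
            group = [word.strip() for word in line.split(",")]
            for word in group:
                mapping[word] = group
    return mapping


def synonym_filter(tokens, mapping):
    """Replace each token by its synonyms, keeping the input token's position."""
    output = []
    for position, token in enumerate(tokens, start=1):
        for replacement in mapping.get(token, [token]):
            output.append((replacement, position))
    return output


RULES = ["home, dwelling, house", "shop => workshop, store", "teh => the"]
result = synonym_filter("teh home shop".split(), parse_synonyms(RULES))
print(result)
```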

Indexing refers to adding content to a SOLR index to make that content searchable
Sources of data for indexing:
  ◦ XML
  ◦ CSV
  ◦ Rich text formats (PDF, MS Word, MS Excel, text, etc.)
  ◦ Data extracted from tables in a database

Uploading Data with SOLR Cell
  ◦ Using ExtractingRequestHandler
  ◦ With a POST
  ◦ With SOLR Cell and SOLRJ
Uploading Data with Index Handlers
  ◦ XMLUpdateRequestHandler for XML-formatted Data
  ◦ Using the CSVRequestHandler for CSV Content
  ◦ Indexing Using SOLRJ
Uploading Structured Data Store Data with the Data Import Handler
Content Streams

curl posts and retrieves data over HTTP, FTP, and many other protocols
The example below calls the Extraction Request Handler to upload the file tutorial.html and assign it the unique ID doc1
curl “ literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true” -F
literal.id provides a unique ID for the document uploaded to SOLR
commit=true makes the document searchable after indexing
The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files
The @ symbol instructs curl to upload the attached file
The argument needs a valid file path
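To make the parameter part of the curl command easier to read, the sketch below assembles the same query string in Python. The base URL is an assumption (the address is missing from the transcript), so treat `BASE_URL` as a placeholder for wherever the Extraction Request Handler is actually exposed.

```python
from urllib.parse import urlencode

# Assumed address of the Extraction Request Handler; not taken from the slides.
BASE_URL = "http://localhost:8983/solr/update/extract"

params = {
    "literal.id": "doc1",            # unique ID assigned to the uploaded document
    "uprefix": "attr_",              # parameter copied from the slide's command
    "fmap.content": "attr_content",  # parameter copied from the slide's command
    "commit": "true",                # make the document searchable after indexing
}

query_string = urlencode(params)
url = f"{BASE_URL}?{query_string}"
print(url)
```

The file itself would still be sent as multipart/form-data, which is what the -F flag does in the curl command.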

Order of operation:
1. Modify the schema.xml file to add any fields that do not already exist in it, for example: authors, dd, isbn, yearpub, publisher
2. Modify the schema.xml file to copy the newly created fields to the text field to make the search results viewable
3. Run the curl utility with the command for adding an XML document: curl -H "Content-Type: text/xml" --data-binary " doc26 Patrick Eagar Sports Collins "
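The XML markup of the posted document was lost in this transcript. As a hedged illustration, the sketch below builds a payload in Solr's standard `<add><doc><field name="…">` update format. The assignment of the four surviving values to field names is a guess: `id`, `authors`, and `publisher` follow the field list in step 1, while `subject` is a purely hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Serialize (name, value) pairs as a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Field names are assumptions; only the values appear in the transcript.
payload = build_add_xml([
    ("id", "doc26"),
    ("authors", "Patrick Eagar"),
    ("subject", "Sports"),
    ("publisher", "Collins"),
])
print(payload)
```

The resulting string is what would be passed to curl's --data-binary option in step 3.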

Data is often stored in relational databases
The Data Import Handler (DIH) provides a mechanism to import data from a database and index it
DIH can also index content from RSS and ATOM feeds, repositories, and structured XML

The handler is registered in the solrconfig.xml file: ${solr.config.dir:./solr/conf}/dataimporthandler/data-config.xml
There can be multiple configuration files

1. Create a database in SQL Server. The tables and the relationships in the database are shown below

3. Create an XML file called DIH_Test.xml for importing into SOLR 4. Modify solrconfig.xml file to instruct SOLR to import data as per the file DIH_Test.xml

5. Do a full-import of the DIH from the browser using: -import

7. Run queries on the newly indexed data from the database
8. Example: ipad2
The above query returns the result. Executing queries on the original database returns similar results

Request Handler, Query Parser, Index, Response Writer
qt: selects a Request Handler for a query using /select
defType: selects a query parser for the query
qf: selects which field to query in the index
start: specifies an offset into the query results where the returned response should begin
rows: specifies the number of rows to be displayed at run time
fq: filters the query by applying an additional query to the initial query’s results; caches the results
wt: selects a response writer for formatting the query response
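The way these parameters combine into one request can be sketched by assembling a /select query string in Python. The helper name `build_select_query` is hypothetical, and the values (the `ipad2` search term, the `popularity` filter, `json` output) are borrowed from other slides purely for illustration.

```python
from urllib.parse import urlencode, parse_qs


def build_select_query(q, **params):
    """Return a /select path with the query and any extra parameters URL-encoded."""
    pairs = [("q", q)] + sorted(params.items())
    return "/select?" + urlencode(pairs)


query = build_select_query(
    "ipad2",
    start=0,                    # offset of the first result to return
    rows=10,                    # number of results to return
    fq="popularity:{5 TO 7}",   # filter applied on top of the main query
    wt="json",                  # response writer
)
print(query)
```

Characters such as the colon, braces, and spaces in the filter are percent-encoded by `urlencode`, which is what a browser or HTTP client would also need to send.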

Standard query parser
Advantage – Enables the user to specify very precise queries
Disadvantage – Is less tolerant of syntax errors than the DisMax query parser
Parameters Supported
  ◦ Terms – Use of wildcard characters, Fuzzy Searches, Boosts and Ranges
  ◦ Fields – Identified by name followed by a colon
  ◦ Boolean Operators – AND, OR, NOT, &&, !, ||
  ◦ Common query parameters – debugQuery, defType, explainOther, fl, fq, omitHeader, rows, sort, start, timeAllowed
  ◦ Functions – abs, constant, div, fieldValue, log, linear, max, etc.
  ◦ Faceting
  ◦ Highlighting
  ◦ MoreLikeThis (mlt)

q – Defines a query using standard query syntax. This parameter is mandatory
q.op – Specifies the default operator for query expressions (this parameter’s value is defined in schema.xml). Possible values are “AND” or “OR”
df – Specifies a default field, overriding the definition of a default field in schema.xml
Default parameter values are specified in solrconfig.xml

 Query q=id:6H500F0&popularity=6

Fuzzy searches are based on the Levenshtein distance (edit distance)
E.g. tight~ will match terms like flight, slight, etc.
An optional numeric parameter specifies the required degree of similarity: tight~0.8 will match sight. The closer the value is to 1, the higher the similarity a term needs in order to match
If the numeric parameter is omitted, the default value is 0.5
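The edit distance behind fuzzy matching can be computed with the classic dynamic-programming algorithm, sketched below. The `similarity` helper is an assumption: it normalizes the distance by the shorter term's length, which is one plausible way to map a distance onto the 0 to 1 scale used by the `~` parameter and may not match the exact formula inside Lucene.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 minus distance over the shorter term's length."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1 - levenshtein(a, b) / shortest


for candidate in ("sight", "flight", "slight"):
    print(candidate, levenshtein("tight", candidate), similarity("tight", candidate))
```

Under this normalization, sight is one edit away from tight, while flight and slight are two edits away and therefore score lower.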

Range Searches
  ◦ Specify a range (with a lower and an upper bound) of values for a field
  ◦ Can be inclusive or exclusive of the lower and upper bounds: square brackets include the bounds, curly braces exclude them
Query: q=popularity:{5 TO 7}
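The difference between inclusive and exclusive bounds can be illustrated with a tiny Python matcher for numeric range expressions. This is a toy model under stated assumptions: the function names are hypothetical, only numeric bounds are handled, and mixed bracket styles are rejected.

```python
import re

# Matches "[lower TO upper]" or "{lower TO upper}".
RANGE_PATTERN = re.compile(r"^([\[{])\s*(\S+)\s+TO\s+(\S+)\s*([\]}])$")


def parse_range(expression):
    """Return (lower, upper, inclusive) for a numeric range expression."""
    match = RANGE_PATTERN.match(expression)
    if not match:
        raise ValueError(f"not a range expression: {expression!r}")
    opening, lower, upper, closing = match.groups()
    if (opening, closing) == ("[", "]"):
        inclusive = True
    elif (opening, closing) == ("{", "}"):
        inclusive = False
    else:
        raise ValueError("mixed bracket styles are not handled in this sketch")
    return float(lower), float(upper), inclusive


def matches(value, expression):
    """Check whether a numeric value falls inside the range expression."""
    lower, upper, inclusive = parse_range(expression)
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper


print(matches(6, "{5 TO 7}"), matches(5, "{5 TO 7}"), matches(5, "[5 TO 7]"))
```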

Parameter: Description
defType: Query parser to be used (DisMax or Standard Query Parser)
sort: Sorts the response to a query in asc or desc order based on the response’s score or another characteristic
start: Offset into the responses at which SOLR should begin displaying content
rows: Number of rows of responses displayed at a time
fq: Filter query for search results
fl: Limits responses to a listed set of fields

Parameter: Description
debugQuery: Includes debugging information
timeAllowed: Time allowed for a query to be processed. If the time elapses before the response is complete, partial information is returned
omitHeader: Excludes header information from returned results
wt: Specifies the response writer

Function queries are used to generate a relevancy score using the actual value of one or more numeric fields
Functions available for function queries
  ◦ abs – abs(x); abs(-5)
  ◦ constant – 1.5; _val_:1.5
  ◦ div – div(1,y); div(sum(x,100), max(y,1))
  ◦ linear – linear(x, m, c); linear(x, 2, 4) returns 2*x+4
  ◦ log – log(x); log(sum(x,100))
  ◦ …
Including a function query in a SOLR query
  ◦ With the _val_ keyword – e.g. _val_:myNumericField
  ◦ As a parameter with an explicit type of FunctionQuery (DisMax query parser’s bf parameter)
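To show what two of the listed functions compute, the sketch below re-creates `linear` and the nested `div(sum(x,100), max(y,1))` example as plain Python arithmetic. These are stand-ins for illustration only; in SOLR the arguments would be field values evaluated per document, and the sample numbers here are made up.

```python
def linear(x, m, c):
    """Equivalent of linear(x, m, c): m*x + c."""
    return m * x + c


def div(x, y):
    """Equivalent of div(x, y): x divided by y."""
    return x / y


def example_score(x, y):
    """Equivalent of div(sum(x,100), max(y,1)); max(y,1) guards against division by zero."""
    return div(x + 100, max(y, 1))


print(linear(3, 2, 4))       # 2*3 + 4
print(example_score(50, 0))  # (50 + 100) / max(0, 1)
```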

Response writers generate a formatted response for a search
The wt parameter sets the response writer
Response writers supported
  ◦ json
  ◦ php
  ◦ phps
  ◦ python
  ◦ ruby
  ◦ xml
  ◦ xslt

Lucid Works SOLR Reference Guide, load/certified/cdrg/lucidworks-solr-refguide-1.4.pdf (link last accessed on 04/25/2011)