For ITCS 6265. Professor: Wensheng Wu. Presented by TA: Xu Fei.


What is Lucene

"Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform."

Lucene is a high-performance, scalable Information Retrieval (IR) library. It is a project of the Apache Software Foundation: mature, free, open source, and implemented in Java.

Full-text indexing and searching

"In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user."

"Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval."

Lucene is popular

A number of ports or integrations exist for other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.

Installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, and others.

Lucene is just a hammer!

Lucene is NOT a ready-to-use search application like Google. It is a software library, a toolkit, shipped as a single compact JAR file (less than 1 MB!). A number of full-featured search applications have been built on top of Lucene.

What Lucene can do for you

Lucene lets you add search capabilities to your application. It can index and make searchable any data that you can extract text from. Lucene doesn't care about the source of the data, its format, or even its language, as long as you can derive text from it. You can even index data stored in your databases, indirectly!
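To illustrate what indexing database content "indirectly" can look like, here is a minimal sketch that does not use Lucene itself: rows are read from a database, each row is flattened into text, and that text is indexed. The `articles` table, its columns, and the helper `build_index_from_db` are hypothetical names chosen for this example.

```python
import sqlite3
from collections import defaultdict


def build_index_from_db(conn):
    """Map each lowercase word to the set of row ids whose text contains it."""
    index = defaultdict(set)
    # The database is only the source of text; the index is built outside it.
    for row_id, title, body in conn.execute("SELECT id, title, body FROM articles"):
        text = f"{title} {body}"
        for word in text.lower().split():
            index[word].add(row_id)
    return index


# Hypothetical in-memory table standing in for an application database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Lucene basics", "indexing and searching text"),
        (2, "Database notes", "searching rows with SQL"),
    ],
)

index = build_index_from_db(conn)
print(sorted(index["searching"]))  # [1, 2]
print(sorted(index["lucene"]))     # [1]
```

The search result is a set of row ids, so the application can go back to the database to fetch the full records.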

Search Application

Figure 1. Typical components of a search application; the shaded components show which parts Lucene handles.

Components for indexing: Acquire Content, Build Document, Analyze Document, Index Document.

Components for searching: Search User Interface, Build Query, Search Query, Render Results.

Others: Administration Interface, Analytics Interface, Scaleout.
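The indexing and searching components can be sketched as a chain of small functions, one per box in the figure. This is only an illustration of how the stages connect, not Lucene's API; every function name and the two sample files are assumptions made for the example.

```python
from collections import defaultdict


# --- Components for indexing ---
def acquire_content():
    """Acquire Content: stand-in for a crawler or file reader (hypothetical data)."""
    return {"a.txt": "Lucene is a search library", "b.txt": "Solr is built on Lucene"}


def build_document(name, raw_text):
    """Build Document: wrap raw content in named fields."""
    return {"name": name, "contents": raw_text}


def analyze(text):
    """Analyze Document: split text into lowercase tokens."""
    return text.lower().split()


def index_document(index, doc):
    """Index Document: record which documents contain each token."""
    for token in analyze(doc["contents"]):
        index[token].add(doc["name"])


# --- Components for searching ---
def build_query(user_input):
    """Build Query: analyze the user's input the same way as the documents."""
    return analyze(user_input)


def search(index, query_terms):
    """Search Query: return the documents that contain every query term."""
    if not query_terms:
        return set()
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings)


def render_results(hits):
    """Render Results: format the hits as a numbered list."""
    return [f"{rank}. {name}" for rank, name in enumerate(sorted(hits), start=1)]


index = defaultdict(set)
for name, raw_text in acquire_content().items():
    index_document(index, build_document(name, raw_text))

print(render_results(search(index, build_query("Lucene"))))
# ['1. a.txt', '2. b.txt']
```

Running the same analysis step on both documents and queries is what lets the query "Lucene" match the lowercase tokens stored in the index.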

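Ranking formula

$$
\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Bigl( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Bigr)
$$

The score is built on the tf–idf weight (term frequency–inverse document frequency).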
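A small numeric sketch can make the factors in the formula concrete. The slide does not define the individual factors, so the definitions below are assumptions taken from the defaults commonly documented for Lucene's classic TF-IDF similarity: tf is the square root of the term count, idf is 1 + ln(numDocs / (docFreq + 1)), coord is the fraction of query terms matched, queryNorm is 1 / sqrt(sum of squared query weights), and norm(D) is 1 / sqrt(number of tokens in D). Lucene also stores norms with reduced precision, which this sketch ignores. The sample documents are hypothetical.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count (assumed default)."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1)) (assumed default)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Evaluate coord * queryNorm * sum(tf * idf^2 * boost * norm) for one document."""
    boosts = boosts or {}
    terms = list(dict.fromkeys(query_terms))  # unique query terms, order kept
    if not terms or not doc_tokens:
        return 0.0
    # queryNorm(Q): makes scores comparable across queries, does not change ranking.
    weights = [idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0) for t in terms]
    sum_sq = sum(w * w for w in weights)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0
    # norm(D): shorter documents get a larger length norm.
    norm = 1.0 / math.sqrt(len(doc_tokens))
    matched = [t for t in terms if t in doc_tokens]
    # coord(Q,D): rewards documents that match more of the query terms.
    coord = len(matched) / len(terms)
    total = 0.0
    for t in matched:
        term_idf = idf(doc_freqs.get(t, 0), num_docs)
        total += tf(doc_tokens.count(t)) * term_idf ** 2 * boosts.get(t, 1.0) * norm
    return coord * query_norm * total


# Hypothetical three-document collection, already tokenized.
docs = {
    "d1": "penn state football football".split(),
    "d2": "football players state".split(),
    "d3": "basketball players".split(),
}
doc_freqs = {}
for tokens in docs.values():
    for t in set(tokens):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

query = ["penn", "football"]
scores = {name: score(query, tokens, doc_freqs, len(docs)) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Here d1 ranks first because it matches both query terms (coord = 1) and contains the rarer term "penn", while d3 matches nothing and scores zero.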

Key index files in Lucene

Segments file, Fields information file, Text information file, Frequency file, Position file.

Inverted Index Example

Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table

Posting id   word       doc     offset
1            football   Doc 1   3, …
                        Doc 2   1
2            penn       Doc 1   1
3            players    Doc 2   2
4            state      Doc 1   2
                        Doc 2   13
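The posting table can be reproduced with a short sketch that maps each word to the documents containing it and the 1-based word offsets inside each document. The two documents below are shortened versions of the slide's examples with the elided middle text left out, so the offset of "state" in Doc 2 is 3 here rather than 13. The function name `build_positional_index` is made up for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc id -> list of 1-based word offsets}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split(), start=1):
            index[word][doc_id].append(offset)
    return index


# Shortened documents (assumption: the elided "…" text is simply omitted).
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}

index = build_positional_index(docs)
for posting_id, word in enumerate(sorted(index), start=1):
    print(posting_id, word, dict(index[word]))
# 1 football {'Doc 1': [3], 'Doc 2': [1]}
# 2 penn {'Doc 1': [1]}
# 3 players {'Doc 2': [2]}
# 4 state {'Doc 1': [2], 'Doc 2': [3]}
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is the point of inverting the index.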

Demo

How to install Lucene and run the demo.

Boolean retrieval examples: apache -lucene; apache +lucene; apache lucene.

Luke:

An online demo (PHP + Lucene):
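The three Boolean query forms can be sketched over a toy inverted index. This is not Lucene's query parser; it only assumes the usual reading of its query syntax, where a `+` prefix marks a required term, a `-` prefix marks a prohibited term, and unprefixed terms are optional (combined with OR by default). The sample documents and the function `boolean_search` are hypothetical.

```python
from collections import defaultdict


def boolean_search(query, index):
    """Evaluate a query of +required, -prohibited and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)

    if required:
        # Every required term must be present; optional terms would only affect ranking.
        hits = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # With no required terms, any optional term is enough for a match.
        hits = set()
        for t in optional:
            hits |= index.get(t, set())

    # Prohibited terms remove documents from the result.
    for t in prohibited:
        hits -= index.get(t, set())
    return hits


# Hypothetical documents, indexed as word -> set of document ids.
docs = {1: "apache http server", 2: "apache lucene", 3: "lucene in action"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(boolean_search("apache -lucene", index)))  # [1]
print(sorted(boolean_search("apache +lucene", index)))  # [2, 3]
print(sorted(boolean_search("apache lucene", index)))   # [1, 2, 3]
```

A query made only of prohibited terms returns nothing in this sketch, since there is no positive clause to select documents from.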

References

Lucene:

Apache:

"Lucene in Action", Chapter 1 and code: Link

Lucene index: lucene/ rch/Similarity.html