Open Source IR Tools and Libraries CS-463 Information Retrieval Models Computer Science Department University of Crete
Outline Google Search API Lucene Terrier Lemur Dragon Groogle
Google Search API
Google Search API: Overview The API exposes the Google search engine to developers. You can write scripts that access Google search in real time. Google is no longer issuing new API keys for the SOAP Search API; instead, it provides an AJAX Search API, which lets you put Google Search in your web pages with JavaScript.
Google Search API: SOAP Based on Web Services technology: SOAP (the XML-based Simple Object Access Protocol). Developers write programs that connect remotely to the Google SOAP Search API service. Developers can issue search requests to Google's index of billions of web pages and receive results as structured data, access information in the Google cache, and check the spelling of words. Limitations: a default limit of 1,000 queries per day; can only query for 10 results at a time; can only access Google Web Search (not Google Images, Google Groups, and so on).
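A minimal sketch of calling the SOAP Search API from Java. The class and method names are from the client library (googleapi.jar) that shipped with the SOAP Search API; since the service is no longer available, treat them as approximate, and the license key is a placeholder.

  import com.google.soap.search.GoogleSearch;
  import com.google.soap.search.GoogleSearchResult;

  public class SoapSearchDemo {
      public static void main(String[] args) throws Exception {
          GoogleSearch search = new GoogleSearch();
          search.setKey("your-license-key");                 // placeholder SOAP API key
          search.setQueryString("information retrieval");
          GoogleSearchResult result = search.doSearch();     // up to 10 results per call
          System.out.println(result);                        // estimated hits, titles, URLs, snippets
      }
  }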
Google Search API: AJAX Lets you put Google Search in your web pages with JavaScript. Does not have a limit on the number of queries per day. Supports additional features like Video, News, Maps, and Blog search results.
Google Search API: AJAX Web Search Incorporates results from Web Search, News Search, and Blog Search. Local Search Provides access to local search results from Google Maps. Video Search Incorporates a simple search box, or dynamic, search-powered strips of video and book thumbnails.
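Besides the in-page JavaScript controls, the AJAX Search API exposes a REST-style JSON interface that can be called from server-side code. A minimal Java sketch, assuming the historical web-search endpoint (ajax.googleapis.com/ajax/services/search/web with v=1.0), which has since been retired:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;

  public class AjaxSearchDemo {
      public static void main(String[] args) throws Exception {
          String q = URLEncoder.encode("information retrieval", "UTF-8");
          URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=" + q);
          BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
          String line;
          while ((line = in.readLine()) != null) {
              System.out.println(line);   // raw JSON result; parse it with any JSON library
          }
          in.close();
      }
  }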
Google Search API: Demo
Google’s Solutions URL queue/list Cached source pages (compressed) Uses many features, e.g. font, layout, … Inverted index Hypertext structure
Google Search API: References Google SOAP Search API http://code.google.com/apis/soapsearch/ Google AJAX Search API http://code.google.com/apis/ajaxsearch/ Google AJAX Search API Developer Guide http://code.google.com/apis/ajaxsearch/documentation/ Google AJAX Search API Samples http://code.google.com/apis/ajaxsearch/samples.html
Lucene Apache Software Foundation
Lucene Cross-platform API Implemented in Java, ported to C++, C#, Perl, Python. Offers scalable, high-performance indexing: incremental indexing as fast as batch indexing, index size roughly 20-30% of the indexed text. Supports many powerful query types.
Lucene: Modules Analysis Tokenization, stop words, stemming, etc. Document A unique ID for each document; title, date modified, content, etc. Index Provides access to, and maintains, the indexes. Query Parser Search / search spans.
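A minimal sketch of the Analysis module: running text through an Analyzer and printing the resulting tokens. The classes below follow recent Lucene releases; the version contemporary with these slides used a slightly different TokenStream API, so treat the exact calls as approximate.

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class AnalysisDemo {
      public static void main(String[] args) throws Exception {
          Analyzer analyzer = new StandardAnalyzer();
          // Tokenize and lower-case the input (stop-word removal depends on the analyzer's configuration).
          TokenStream ts = analyzer.tokenStream("contents", "The Quick Brown Fox Jumped");
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
              System.out.println(term.toString());   // print each token the analyzer produces
          }
          ts.end();
          ts.close();
      }
  }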
Lucene: Indexing A Document is a collection of Fields. A Field holds free text, keywords, dates, etc. A Field can have several characteristics: indexed, tokenized, stored, term vectors. Apply an Analyzer to alter tokens during indexing: stemming, stop-word removal, phrase identification. (Diagram: a Document containing Field 1, Field 2, …, Field N.)
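A minimal indexing sketch: adding a Document with two Fields to an index. The field classes follow recent Lucene releases (older versions used a single Field class with Store/Index flags); the index path and field names are placeholders.

  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;

  public class IndexingDemo {
      public static void main(String[] args) throws Exception {
          IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
          IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config);
          Document doc = new Document();
          doc.add(new StringField("id", "doc-1", Field.Store.YES));        // stored, not tokenized
          doc.add(new TextField("contents", "Lucene is a Java library for indexing and search.",
                                Field.Store.NO));                          // tokenized by the analyzer
          writer.addDocument(doc);
          writer.close();
      }
  }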
Lucene: Query Parser Syntax
Terms Single terms and phrases.
Fields E.g. title:"Do it right" AND right
Wildcard Searches '?' for a single character, '*' for multiple characters.
Fuzzy Searches Based on the Levenshtein (edit) distance. Append the tilde "~" to a single-word term: roam~ finds terms like foam and roams. Starting with Lucene 1.9, an optional parameter specifies the required similarity, a value between 0 and 1 (the closer to 1, the higher the similarity required), e.g. roam~0.8; the default is 0.5.
Proximity Searches Find words within a specific distance of each other. Append the tilde "~" and a distance to a phrase: "jakarta apache"~10 finds "jakarta" and "apache" within 10 words of each other in a document.
Range Searches mod_date:[20020101 TO 20030101] title:{Aida TO Carmen}
Boosting a Term Lucene ranks matching documents by the terms found; to make a term more relevant, append the caret "^" and a boost factor: jakarta^4 apache makes documents containing "jakarta" appear more relevant. Phrases can be boosted too: "jakarta apache"^4 "Apache Lucene". The default boost factor is 1; it must be positive but can be less than 1 (e.g. 0.2).
Boolean Operators AND, OR, NOT, +, -.
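A minimal search sketch using the query parser syntax above: parse a query string and run it against an existing index. Class and package names follow recent Lucene releases (the query parser has moved packages over time); the index path and default field name are placeholders.

  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.store.FSDirectory;

  public class SearchDemo {
      public static void main(String[] args) throws Exception {
          IndexSearcher searcher = new IndexSearcher(
                  DirectoryReader.open(FSDirectory.open(Paths.get("index"))));
          QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
          Query query = parser.parse("jakarta^4 apache");            // boosted term, as on the slide
          for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) { // top 10 hits
              System.out.println("doc=" + hit.doc + "  score=" + hit.score);
          }
      }
  }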
Lucene: More Advanced Options
Relevance Feedback A technique that augments the user's query with terms from the best-matching documents, using Lucene's term vectors.
Manual The user selects which documents are relevant/non-relevant; the terms from each selected document's term vector are used to construct a new query.
Automatic The application assumes the top X documents are relevant and the bottom Y are non-relevant, and constructs a new query from the terms in those documents.
Span Queries Provide information about where a match took place within a document: a Spans object gives document and position information about the current match. Useful for phrase matching (using position distance instead of edit distance), question answering, and building an efficient result summarizer. The SpanNearQuery class provides functionality similar to PhraseQuery.
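A minimal span-query sketch, matching "jakarta" and "apache" within 10 positions of each other. The field name is a placeholder; the resulting query can be passed to IndexSearcher.search() like any other Query.

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  public class SpanDemo {
      public static void main(String[] args) {
          SpanQuery[] clauses = new SpanQuery[] {
              new SpanTermQuery(new Term("contents", "jakarta")),
              new SpanTermQuery(new Term("contents", "apache"))
          };
          // Slop of 10 positions; 'true' requires the clauses to match in order.
          SpanNearQuery near = new SpanNearQuery(clauses, 10, true);
          System.out.println(near);
      }
  }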
Lucene: Basic Demo The latest version can be obtained from http://www.apache.org/dyn/closer.cgi/lucene/java/ To build an index, type: java org.apache.lucene.demo.IndexFiles <dir> To search an index, type: java org.apache.lucene.demo.SearchFiles <index>
Terrier University of Glasgow
Terrier: Overview (1/2) Stands for TERabyte RetrIEveR. Open-source API (Mozilla Public Licence). Modular platform for the rapid development of large-scale IR applications. Written in Java (and Perl). Highly compressed disk data structures. Handles large-scale document collections. Standard evaluation of TREC ad-hoc and known-item search retrieval results. Based on a parameter-free probabilistic framework for IR, Divergence From Randomness (DFR), allowing adaptable term weighting functionalities.
Terrier: Overview (2/2) Includes state-of-the-art functionalities such as: hyperlink structure analysis, combination-of-evidence approaches, automatic query expansion/re-formulation techniques, query performance predictors, compression techniques. Deploys over 50 term weighting/matching functions. Has a robust and effective crawler, called Labrador. Allows large-scale experimentation that is robust, transparent, reproducible, modular, platform-independent, and free of constraints and parameters.
Terrier: Indexing Create your own Collection decoder and Document implementation. Centralized or distributed setting. The Indexer iterates through the collection and creates the following data structures: Direct Index, Document Index, Lexicon.
Terrier: Indexing Each document in the collection is tokenized and parsed; in this way, we build the direct and document indices. Temporary lexicons are also built to reduce the memory required during indexing. The inverted index is then built from the existing direct index, document index and lexicon.
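A rough sketch of driving this two-pass indexing process from Java. The class names (TRECCollection, BasicIndexer) and methods follow older Terrier releases and are an assumption here; the index path and prefix are placeholders, and the collection itself is configured via Terrier's collection.spec file.

  import org.terrier.indexing.Collection;
  import org.terrier.indexing.TRECCollection;
  import org.terrier.structures.indexing.classical.BasicIndexer;

  public class TerrierIndexingDemo {
      public static void main(String[] args) {
          Collection collection = new TRECCollection();                // documents listed in collection.spec
          BasicIndexer indexer = new BasicIndexer("var/index", "data");
          indexer.createDirectIndex(new Collection[] { collection });  // direct index, document index, lexicon
          indexer.createInvertedIndex();                               // inverted index built from the above
      }
  }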
Terrier: Retrieval The retrieval pipeline: Parsing, Pre-processing, Matching, Post Processing, Post Filtering. Query language: term1 term2, term1^2.3, +term1 -term2, "term1 term2"~n
Terrier: Retrieval Stop words are removed and stemming is applied to the query. Terrier automatically selects the optimal document weighting model. If query expansion is applied, an appropriate term weighting model is selected and the most informative terms from the top-ranked documents are added to the query. The aim of query expansion is to reduce the query/document mismatch by expanding the query with words or phrases that have a similar meaning, or some other statistical relation, to the set of relevant documents. This is even more important in spoken document retrieval, where the word-mismatch problem is heightened by errors in the automatic transcription of spoken documents.
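A rough sketch of running a query through Terrier's retrieval pipeline. The Manager/SearchRequest API shown here follows older (3.x-era) Terrier releases and should be treated as approximate; "PL2" is one of the DFR weighting models, and the query text is a placeholder.

  import org.terrier.matching.ResultSet;
  import org.terrier.querying.Manager;
  import org.terrier.querying.SearchRequest;
  import org.terrier.structures.Index;

  public class TerrierRetrievalDemo {
      public static void main(String[] args) {
          Index index = Index.createIndex();                 // load the default index from terrier.properties
          Manager manager = new Manager(index);              // drives parsing, matching, post-processing, filtering
          SearchRequest srq = manager.newSearchRequest("q1", "information retrieval");
          srq.addMatchingModel("Matching", "PL2");           // a DFR document weighting model
          manager.runSearchRequest(srq);
          ResultSet results = srq.getResultSet();
          System.out.println(results.getResultSize() + " documents retrieved");
      }
  }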
Terrier: Sample Applications TREC Terrier An application that allows Terrier to index and retrieve from standard TREC test collections. Instructions are available at http://ir.dcs.gla.ac.uk/terrier/doc/trec_terrier.html
Terrier: Sample Applications Desktop Search A Swing (graphical) application that can be used to index files from the local machine, and then perform queries on them. The scripts for running the desktop search application are: desktop_search.sh (Linux, Mac OS X) desktop_search.bat (Windows)
Terrier: Sample Applications Interactive Querying A console application for performing simple queries on an existing index and seeing which documents are returned. The scripts for running the console application are: interactive_terrier.sh (Linux, Mac OS X) interactive_terrier.bat (Windows)
Terrier: Demo
Lemur University of Massachusetts
Lemur: Overview Support for XML and structured document retrieval Interactive interfaces for Windows, Linux, and the Web Cross-platform, fast and modular code written in C++ Free and open-source software
Lemur: API Provides interfaces to Lemur classes that are grouped at three different levels: Utility level Common utilities, such as memory management, document parsing, etc. Indexer level Converts a raw text collection to data structures for efficient retrieval. Retrieval level Abstract classes for a general retrieval architecture and concrete classes for several specific information retrieval models.
Lemur: Indexing Multiple indexing methods for small, medium and large-scale (terabyte) collections. Built-in support for English, Chinese and Arabic text. Porter and Krovetz word stemming. Incremental indexing.
Lemur: Retrieval Supports major language modelling approaches such as Indri and KL-divergence, as well as vector space, tf-idf, Okapi and InQuery Relevance and pseudo-relevance feedback Wildcard term expansion (using Indri) Supports arbitrary document priors (e.g., PageRank, URL depth)
Lemur: Query Flow (Diagram: the user query passes through the query parser to the scoring nodes; runQuery() returns a sorted vector of scored extent results.)
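A minimal sketch of this flow through Lemur/Indri's Java (JNI) interface. The package name (lemurproject.indri) and method signatures are an assumption based on the Indri Java API and may differ between releases; the index path and query are placeholders.

  import lemurproject.indri.QueryEnvironment;
  import lemurproject.indri.ScoredExtentResult;

  public class IndriQueryDemo {
      public static void main(String[] args) throws Exception {
          QueryEnvironment env = new QueryEnvironment();
          env.addIndex("/path/to/indri/repository");           // a previously built Indri index
          // runQuery() parses the query into scoring nodes and returns a
          // sorted array of scored extent results (document, score, extent).
          ScoredExtentResult[] results = env.runQuery("#combine(information retrieval)", 10);
          for (ScoredExtentResult r : results) {
              System.out.println("doc=" + r.document + "  score=" + r.score);
          }
          env.close();
      }
  }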
The Dragon Toolkit Drexel University
Dragon: Overview (1/2) Highly scalable to large data sets Well-designed programming API and XML-based interface Various document representations, including words, multi-word phrases, ontology-based concepts, and concept pairs Various text retrieval models Text classification, clustering, summarization and topic modeling
Dragon: Overview (2/2) Provides built-in support for semantic-based IR and text mining (unlike Lucene and Lemur). Integrates a set of NLP tools, which enable the toolkit to index text collections with various representation schemes, including words, phrases, ontology-based concepts and relationships. It is specially designed for large-scale applications: the toolkit uses sparse matrices to implement text representations and does not have to load all data into memory at run time, so it can handle hundreds of thousands of documents with very limited memory.
Dragon Toolkit Demo
Groogle University of Crete
Groogle Wiki http://groogle.csd.uoc.gr/apache2-default/ Repository http://groogle.csd.uoc.gr/bzr/groogle-devel Bugzilla http://groogle.csd.uoc.gr/bugzilla/ Credits http://groogle.csd.uoc.gr:8080/groogle-2007/index.jsp?tID=credits
Questions?