Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll.

Presentation transcript:

Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll

Lucid Imagination, Inc.
The How Many Game
How many of you:
o Have taken a class in Information Retrieval (IR)?
o Are doing work/research in IR?
o Have heard of or are using Lucene?
o Have heard of or are using Solr?
o Are doing work on core IR algorithms such as compression techniques or scoring?
o Are doing UI/Application work/research as they relate to search?

Topics
Brief Bio
Search 101 (skip?)
What is:
o Apache Lucene
o Apache Solr
What can they do?
o Features and functionality
o Intangibles
What’s new in Lucene and Solr?
o How can they help my research/work/____?

Brief Bio
Apache Lucene/Solr Committer
Apache Mahout co-founder
o Scalable Machine Learning
Co-founder of Lucid Imagination
o Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
Co-Author of upcoming “Taming Text” (Manning Publications)

Search 101
Search tools are designed for dealing with fuzzy data/questions
o Works well with structured and unstructured data
o Performs well when dealing with large volumes of data
o Many apps don’t need the limits that databases place on content
o Search fits well alongside a DB too
Given a user’s information need (query), find and, optionally, score content relevant to that need
o Many different ways to solve this problem, each with tradeoffs
What’s “relevant” mean?

Search 101
Relevance
o Vector Space Model (VSM) for relevance
o Common across many search engines
o Apache Lucene is a highly optimized implementation of the VSM
Indexing
o Finds and maps terms and documents
o Conceptually similar to a book index
o At the heart of fast search/retrieve
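To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index (the "book index" mapping terms to documents) and a textbook TF-IDF cosine ranking over it. It is an illustration under simplifying assumptions, not Lucene's scoring code: the whitespace tokenizer, the idf formula, and the sample documents are all chosen for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive analysis (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def build_index(docs):
    # Inverted index: term -> {doc_id: term frequency}, much like a book index.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    # Weight each term by tf * idf and rank documents by cosine similarity.
    def idf(term):
        return math.log(num_docs / len(index[term])) + 1.0

    # Squared length of every document vector.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf(term)) ** 2

    q_weights = {t: tf * idf(t)
                 for t, tf in Counter(tokenize(query)).items() if t in index}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Dot product, touching only the postings of the query terms.
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_weight * tf * idf(term)

    ranked = [(doc_id, dot / (math.sqrt(norms[doc_id]) * q_norm))
              for doc_id, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents.
docs = {
    1: "apache lucene search library",
    2: "apache solr search server",
    3: "book index of terms",
}
index = build_index(docs)
results = search("lucene search", index, len(docs))
```

Only documents that share a term with the query are ever scored, which is why the index sits at the heart of fast retrieval.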

Apache Lucene in a Nutshell
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
o Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

Lucene Basics
Content is modeled via Documents and Fields
o Content can be text, integers, floats, dates, custom
o Analysis can be employed to alter content before indexing
Searches are supported through a wide range of Query options
o Keyword
o Terms
o Phrases
o Wildcards
o Many, many more
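As a rough illustration of the model above, the sketch below treats documents as dictionaries of fields and implements toy versions of term, phrase, and wildcard matching. It does not use Lucene's API; the function names, the whitespace analysis, and the sample documents are assumptions made for this example.

```python
from fnmatch import fnmatchcase


def analyze(text):
    # Toy analysis step: lowercase and split on whitespace.
    return str(text).lower().split()


def term_query(doc, field, term):
    # True if the analyzed field contains the exact term.
    return term.lower() in analyze(doc.get(field, ""))


def phrase_query(doc, field, phrase):
    # True if the phrase's terms appear consecutively, in order.
    tokens, words = analyze(doc.get(field, "")), analyze(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words
                         for i in range(len(tokens) - n + 1))


def wildcard_query(doc, field, pattern):
    # True if any term matches a shell-style pattern such as "search*".
    return any(fnmatchcase(token, pattern.lower())
               for token in analyze(doc.get(field, "")))


# Hypothetical documents with a text field and an integer field.
docs = [
    {"id": 1, "title": "Open Source Search", "pages": 120},
    {"id": 2, "title": "Searching Source Code", "pages": 45},
]

hits_term = [d["id"] for d in docs if term_query(d, "title", "search")]
hits_phrase = [d["id"] for d in docs if phrase_query(d, "title", "source search")]
hits_wildcard = [d["id"] for d in docs if wildcard_query(d, "title", "search*")]
```

A real engine answers these queries from the index rather than scanning every document; the scan here only keeps the matching rules easy to read.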

Apache Solr in a Nutshell
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
o Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are configuration tasks in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices
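Because Solr is reached over HTTP, a query is just a URL. The sketch below builds a select URL with faceting turned on, using standard Solr request parameters (`q`, `wt`, `rows`, `facet`, `facet.field`). The base URL and the field names `cat` and `manu` are assumptions for illustration; adjust them to your own installation and schema. Nothing is sent over the network here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url, query, facet_fields=(), rows=10):
    # Build the query string as ordered pairs so facet.field can repeat.
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "?" + urlencode(params)


# Assumed local endpoint and hypothetical field names.
url = solr_select_url(
    "http://localhost:8983/solr/select",
    "title:lucene",
    facet_fields=["cat", "manu"],
)
parsed = parse_qs(urlsplit(url).query)
```

Any HTTP client in any language can fetch such a URL, which is what makes the long list of client languages on the slide possible.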

A small sampling of Lucene/Solr-Powered Sites
Buy.com

Features and Functionality

Quick Solr/Lucene Demo
Pre-reqs:
o Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
o svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
o cd solr-trunk/solr/
o ant example
o cd example
o java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
o cd exampledocs; java -jar post.jar *.xml
Browse=true Browse=true

Other Features
Data Import Handler
o Database, Mail, RSS, etc.
Rich document support via Apache Tika
o PDF, MS Office, Images, etc.
Replication for high query volume
Distributed search for large indexes
o Production systems with 1B+ documents
Configurable Analysis chain and other extension points
o Total control over tokenization, stemming, etc.
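The idea of a configurable analysis chain can be sketched as a tokenizer followed by a list of interchangeable filter steps. This is a toy pipeline, not Solr's analyzer configuration: the stopword list is a tiny illustrative sample and the suffix-stripping "stemmer" is a deliberately crude stand-in for a real stemming algorithm.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to"}  # tiny illustrative list


def tokenize(text):
    # Split into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(tokens):
    # Hypothetical stemmer: strips a trailing "ing" or plural "s" only.
    stemmed = []
    for t in tokens:
        if t.endswith("ing") and len(t) > 5:
            t = t[:-3]
        elif t.endswith("s") and not t.endswith("ss") and len(t) > 3:
            t = t[:-1]
        stemmed.append(t)
    return stemmed


def analyze(text, chain=(lowercase, remove_stopwords, crude_stem)):
    # Run the tokenizer, then each filter in order; swap steps to reconfigure.
    tokens = tokenize(text)
    for step in chain:
        tokens = step(tokens)
    return tokens


terms = analyze("Indexing the Documents and Fields")
terms_no_stemming = analyze("Indexing the Documents and Fields",
                            chain=(lowercase, remove_stopwords))
```

The same chain has to be applied at index time and at query time, otherwise query terms will not line up with the indexed terms.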

Intangibles
Open Source
Flexible, non-restrictive license
o Apache License v2 – non-viral
o “Do what you want with the software, just don’t claim you wrote it”
Large community willing to help
o Great place to learn about real world IR systems
Many books and other documentation
o Lucene in Action by Hatcher, McCandless and Gospodnetic

What’s New?
HANGES.txt HANGES.txt NGES.txt NGES.txt
Codecs
o Pluggable Index Formats
o Provide different index compression techniques
Stats to enable alternate scoring approaches
o BM25, Lang. Modeling, etc. -- More work to be done here
Faster
o Java Strings are slow; convert to use byte arrays
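To show why alternate scoring needs extra index statistics, here is a small sketch of the standard Okapi BM25 formula, which depends on document frequency, document length, and average document length. It is a standalone illustration, not Lucene's implementation; the defaults `k1=1.2` and `b=0.75` are common textbook values, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    # Collection statistics that the index must supply.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq = Counter(term for tokens in tokenized.values()
                       for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical sample documents.
docs = {
    "a": "solr solr search",
    "b": "search server",
    "c": "machine learning",
}
solr_scores = bm25_scores(["solr"], docs)
search_scores = bm25_scores(["search"], docs)
```

With equal term frequency, the shorter document scores higher for "search", which is the length normalization at work.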

Other New Items
Many new Analyzers (tokenizers, etc.)
o Richer Language support (Hindi, Indonesian, Arabic, …)
Richer Geospatial (Local) Search capabilities
o Score, filter, sort by distance
Results Grouping
o Group Related Results
More Faceting Capabilities
o Pivot
o New underlying algorithms
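"Filter and sort by distance" can be sketched with the haversine great-circle formula. This is a brute-force illustration rather than how Solr's spatial support works internally; the place names and coordinates are hypothetical, and the Earth radius of 6371 km is the usual mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(origin, places, max_km):
    # Filter to places within max_km of origin, then sort nearest first.
    with_distance = [(name, haversine_km(origin[0], origin[1], lat, lon))
                     for name, (lat, lon) in places.items()]
    within = [(name, dist) for name, dist in with_distance if dist <= max_km]
    return sorted(within, key=lambda pair: pair[1])


# Hypothetical places on the equator, as (latitude, longitude).
places = {"p1": (0.0, 0.5), "p2": (0.0, 0.1), "p3": (0.0, 5.0)}
hits = nearby((0.0, 0.0), places, max_km=100)
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The distance could equally be folded into the relevance score instead of being used only as a filter and sort key.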

How can Lucene/Solr help me?
Everyone
o Fast indexing/search times means less time waiting for jobs to complete
o Completely Open (source, community)
o Free to use, modify, etc.
o Large community ready and willing to help
User Experience Researchers
o Rapid UI prototyping
o Total Control of results and facets
o Easy to set up and use with little to no programming required
IR Researchers
o Flexible Indexing models (trunk)
o Flexible Relevance models via functions and other mechanisms
o Extendable
Job Seekers
o Google Summer of Code
o Other Internships (see me)
o Real programming skills that are highly valued in industry
o Publicly visible, demonstrable skills

Job Trends

Other Things that Can Help
Nutch
o Crawling
Mahout
o Machine learning (clustering, classification, others)
OpenNLP
o Part of Speech, Parsers, Named Entity Recognition
Open Relevance Project
o Relevance Judgments

Resources