Searching with Lucene Chapter 2

For discussion
- Information retrieval
- What is Lucene?
- Code for an indexer using Lucene
- PageRank algorithm

Information Retrieval
Consider a collection of documents.
- You want to know which words occur in each document.
- Given a word, you want to know which documents it occurs in.
- You want to know how many times a word occurs in each document.
- You want to rank the documents according to that count.
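The four goals above can be sketched with a small in-memory inverted index. This is only an illustration in plain Java, not Lucene's implementation: the class name `TinyInvertedIndex`, its methods, and the naive tokenizer (lowercase, split on non-alphanumeric characters) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration only; it is not part of Lucene.
class TinyInvertedIndex {
    // word -> (document id -> number of times the word occurs in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Split the text into lowercase words and count each word for this document.
    public void add(String docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // How many times does the word occur in the given document?
    public int count(String word, String docId) {
        Map<String, Integer> hits =
                postings.getOrDefault(word.toLowerCase(), Collections.emptyMap());
        return hits.getOrDefault(docId, 0);
    }

    // Documents containing the word, highest count first (ties broken by id).
    public List<String> search(String word) {
        Map<String, Integer> hits =
                postings.getOrDefault(word.toLowerCase(), Collections.emptyMap());
        List<String> docs = new ArrayList<>(hits.keySet());
        docs.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a)); // descending count
            return byCount != 0 ? byCount : a.compareTo(b);           // then by id
        });
        return docs;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("doc1", "Lucene is a search library. Lucene indexes text.");
        index.add("doc2", "This document mentions Lucene once.");
        index.add("doc3", "This one is about something else.");
        System.out.println(index.search("lucene"));        // prints [doc1, doc2]
        System.out.println(index.count("lucene", "doc1")); // prints 2
    }
}
```

Ranking by raw count is the simplest scheme that matches the slide. Real search libraries typically use more refined scoring, so treat the ordering here as a teaching device.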

What is Lucene?
- Apache Lucene is a free, open-source information retrieval software library, originally written in Java by Doug Cutting.
- It is supported by the Apache Software Foundation and released under the Apache License.
- It is built for high-speed indexing.
- Experience with Lucene led to the development of Hadoop (also by Doug Cutting).

Why do we need to study it?
- But search is more than indexing: link analysis, click analysis, natural language processing, latent Dirichlet allocation (LDA), PageRank, ...
- We are interested in data-intensive computing: algorithms such as MapReduce and storage systems such as the Google File System.
- The algorithms we discuss in the context of Lucene could all be converted to data-intensive methods to improve performance and scalability.

PageRank algorithm
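The transcript gives only the slide title here. As a rough sketch of the commonly taught power-iteration formulation, the code below repeatedly passes each page's rank along its out-links. The class name `PageRankSketch`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions for illustration, not details taken from the slides.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; names and parameters are assumptions.
class PageRankSketch {
    // links[i] lists the indices of the pages that page i links to.
    public static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // total rank held by pages with no out-links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    // a page splits its rank equally among the pages it links to
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }
            for (int i = 0; i < n; i++) {
                // random-jump term plus rank received over links;
                // dangling rank is spread evenly over all pages
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
        int[][] links = { {1, 2}, {2}, {0} };
        double[] rank = pageRank(links, 0.85, 100);
        System.out.println(Arrays.toString(rank));
    }
}
```

In this three-page graph, page 2 ends up with the highest rank because both other pages link to it, page 0 comes next because it receives all of page 2's rank, and page 1 is last. The ranks always sum to 1.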