The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,

Presentation transcript:

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN

• An index is a data structure designed to make search (finding things) fast and efficient
• Text search usually relies on an inverted index
• "Inverted index" names a class of similar data structures
• It is inverted because documents are associated with words (each word points to the documents that contain it), rather than words being listed as part of each document

• Each index term or document feature is obtained during text transformation
• A document feature is a property of the document expressed as a number
• For example, a topical feature estimates the degree to which the document is about a particular topic
• Example quality features include the inlink count and the number of days since the page was last updated

• Regardless of the ranking function, the model below provides a roadmap to implementation

• Each index term is associated with an inverted list that may contain, among other things:
  ▪ A list of documents
  ▪ A list of word occurrences in documents
  ▪ Word counts
  ▪ Positional information regarding each word
  ▪ Metadata identifying fields (title, author, etc.)

• Each entry in an inverted list is called a posting
• The part of the posting that refers to a specific document or location is called a pointer
• Each document in the collection is given a unique number
• Lists are usually document-ordered
  ▪ Sorted by document number
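One reason document ordering matters is that two sorted lists can be combined in a single pass. The sketch below is only an illustration of that idea, not something taken from the slides: it assumes each posting is a bare document number, and the function name `intersect` and the sample lists are made up.

```python
def intersect(a, b):
    """Return the document numbers present in both posting lists.

    Both inputs are assumed to be sorted by document number, so one
    linear merge pass is enough.
    """
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document contains both terms
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # advance the list that is behind
        else:
            j += 1
    return result


# Hypothetical posting lists for two index terms
tropical = [1, 2, 3]
fish = [1, 2, 3, 4]
print(intersect(tropical, fish))  # [1, 2, 3]
```

If the lists were not document-ordered, each lookup would need a search or a hash set instead of a single merge.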

• Assume each sentence is a separate document

• Inverted index for documents S1, S2, S3, and S4
• Repeated occurrences of a word in a document are deduplicated into a single posting
• What does this data structure tell us?
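A minimal sketch of such an index, under stated assumptions: the four short documents below are invented stand-ins for S1 to S4 (the slide's own sentences are not in this transcript), and `tokenize` and `build_index` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Made-up stand-ins for documents S1..S4
DOCS = {
    1: "tropical fish live in warm water",
    2: "fish are kept in tanks",
    3: "warm water tanks need heaters",
    4: "tropical plants need warm air",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)  # the set deduplicates repeats
    return {term: sorted(ids) for term, ids in index.items()}


index = build_index(DOCS)
print(index["fish"])  # [1, 2]
print(index["warm"])  # [1, 3, 4]
```

This form of the index answers only "which documents contain the word?"; it says nothing about how often or where.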

• Inverted index with counts for documents S1, S2, S3, and S4
• What does this data structure tell us?
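As a rough sketch, adding counts turns each posting into a (document number, count) pair. The documents and the name `build_count_index` are assumptions made for illustration; tokenization is simplified to whitespace splitting.

```python
from collections import Counter

# Made-up documents; not the sentences from the slide
DOCS = {
    1: "tropical fish and more tropical fish",
    2: "fish tanks",
    3: "tropical plants",
}


def build_count_index(docs):
    """Map each term to a document-ordered list of (doc_id, count) postings."""
    index = {}
    for doc_id in sorted(docs):  # visiting documents in order keeps lists sorted
        counts = Counter(docs[doc_id].lower().split())
        for term, n in counts.items():
            index.setdefault(term, []).append((doc_id, n))
    return index


count_index = build_count_index(DOCS)
print(count_index["fish"])      # [(1, 2), (2, 1)]
print(count_index["tropical"])  # [(1, 2), (3, 1)]
```

The counts let a ranking function prefer documents that mention a term more often, which the plain document list cannot do.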

• Inverted index with word positions
• What does this data structure tell us?
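A positional index could be sketched as follows. The sample text, the 1-based position numbering, and the name `build_positional_index` are assumptions for illustration only.

```python
# Made-up documents; positions are counted from 1
DOCS = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish tanks",
}


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = {}
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        for position, term in enumerate(words, start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


pos_index = build_positional_index(DOCS)
print(pos_index["fish"])      # {1: [2, 4], 2: [1]}
print(pos_index["tropical"])  # {1: [1, 7]}
```

The word count for a document is still recoverable as the length of its position list, so this form carries strictly more information than the count index.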

• Proximity matching is a technique used to match multiword phrases
• Proximity matching is also used to match words within a window of size n
• e.g. "tropical" within five words of "fish" (n = 5) matches "tropical fish"
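Both kinds of proximity match can be checked from word positions alone. This is a simple sketch assuming the position lists for one document are already available; the function names and the sample positions are hypothetical, and the nested scan favors clarity over speed.

```python
def phrase_match(pos_a, pos_b):
    """True if some occurrence of word b directly follows word a."""
    later = set(pos_b)
    return any(p + 1 in later for p in pos_a)


def within_window(pos_a, pos_b, n):
    """True if some occurrence of a lies within n words of some occurrence of b."""
    return any(abs(p - q) <= n for p in pos_a for q in pos_b)


# Hypothetical positions of two words in a single document
tropical = [1, 7]
fish = [2, 4]

print(phrase_match(tropical, fish))      # True: positions 1 and 2 are adjacent
print(phrase_match(fish, tropical))      # False: "fish tropical" never occurs
print(within_window(tropical, fish, 5))  # True
```

Since position lists are sorted, a production version would more likely walk the two lists in step rather than compare every pair.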

• A document field is a section of a document with additional semantic meaning
  ▪ e.g. date, from:, to:, etc.
  ▪ e.g. title, author, copyright, publisher, isbn, etc.
• Implementation options:
  ▪ Use separate inverted lists for each field
  ▪ Add extra information about fields to postings
  ▪ Use extent lists

• An extent is a contiguous region of a document (typically with special meaning)
• We can represent extents using word positions
• An extent list is an inverted list that records all extents for a given field
• Example: "fish" occurs at word position 2 in document S1, and the title occurs at word positions 1 and 2, so "fish" appears in the title of S1
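The check described in the example can be sketched by comparing a term's positions against a field's extents. The data layout here is an assumption: extents are stored as inclusive (start, end) word positions, and `field_matches` plus all entries for document 2 are invented for illustration.

```python
def field_matches(term_postings, field_extents):
    """Return {doc_id: positions} where the term falls inside the field.

    term_postings: {doc_id: [word positions of the term]}
    field_extents: {doc_id: [(start, end), ...]}, both ends inclusive
    """
    matches = {}
    for doc_id, positions in term_postings.items():
        extents = field_extents.get(doc_id, [])
        inside = [p for p in positions
                  if any(start <= p <= end for start, end in extents)]
        if inside:
            matches[doc_id] = inside
    return matches


# Document 1 follows the slide's example; document 2 is made up
fish_postings = {1: [2, 4], 2: [5]}
title_extents = {1: [(1, 2)], 2: [(1, 3)]}

print(field_matches(fish_postings, title_extents))  # {1: [2]}
```

Keeping extents in their own list means the postings for ordinary terms stay unchanged, and a new field only requires adding one more extent list.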

• Read and study Chapter 5
• Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8