Evidence from Content INST 734 Module 2 Doug Oard.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
Multimedia Database Systems
Search and Ye Shall Find (maybe) Seminar on Emergent Information Technology August 20, 2007 Douglas W. Oard.
LBSC 796/INFM 718R: Week 3 Boolean and Vector Space Models Jimmy Lin College of Information Studies University of Maryland Monday, February 13, 2006.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda  Ranked retrieval Similarity-based ranking Probability-based ranking.
INFM 603: Information Technology and Organizational Context Jimmy Lin The iSchool University of Maryland Thursday, November 7, 2013 Session 10: Information.
Information Retrieval Review
ISP 433/533 Week 2 IR Models.
Full-Text Indexing Session 10 INFM 718N Web-Enabled Databases.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
The Vector Space Model LBSC 796/CMSC828o Session 3, February 9, 2004 Douglas W. Oard.
Text Retrieval and Spreadsheets Class 4 LBSC 690 Information Technology.
CS/Info 430: Information Retrieval
Advance Information Retrieval Topics Hassan Bashiri.
Chapter 14 The Second Component: The Database.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Evidence from Content LBSC 796/INFM 718R Session 2 September 7, 2011.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Indexing and Complexity. Agenda Inverted indexes Computational complexity.
Databases C HAPTER Chapter 10: Databases2 Databases and Structured Fields  A database is a collection of information –Typically stored as computer.
The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik.
IAT Text ______________________________________________________________________________________ SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT]
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Indexing LBSC 708A/CMSC 838L Session 7, October 23, 2001 Philip Resnik.
Spreadsheet Project.
Module 5 Planning for SQL Server® 2008 R2 Indexing.
DATA STRUCTURE & ALGORITHMS (BCS 1223) CHAPTER 8 : SEARCHING.
Discovering Computers Fundamentals Fifth Edition Chapter 9 Database Management.
Evaluation INST 734 Module 5 Doug Oard. Agenda  Evaluation fundamentals Test collections: evaluating sets Test collections: evaluating rankings Interleaving.
DAY 12: DATABASE CONCEPT Tazin Afrin September 26,
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Matching LBSC 708A/CMSC 828O March 8, 1999 Douglas W. Oard and Dagobert Soergel.
The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001.
Representing and Indexing Content INST 734 Lecture 2 February 5, 2014.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Ranked Retrieval LBSC 796/INFM 718R Session 3 February 16, 2011.
Database Concepts Track 3: Managing Information using Database.
Search Technology LBSC 708X/INFM 718X Week 5 Doug Oard.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Evidence from Content INST 734 Module 2 Doug Oard.
Text Retrieval and Spreadsheets Session 4 LBSC 690 Information Technology.
Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.
Description and exemplification use of a Data Dictionary. A data dictionary is a catalogue of all data items in a system. The data dictionary stores details.
PREPARED BY: PN. SITI HADIJAH BINTI NORSANI. LEARNING OUTCOMES: Upon completion of this course, students should be able to: 1. Understand the structure.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
Why indexing? For efficient searching of a document
Lecture 1: Introduction and the Boolean Model Information Retrieval
Text Based Information Retrieval
Ch. 8 File Structures Sequential files. Text files. Indexed files.
Chapter 11: File System Implementation
LBSC 708X/INFM 718X Week 5 Doug Oard
CSCE 561 Information Retrieval System Models
IR Theory: IR Basics & Vector Space Model
Chapter 11: File System Implementation
2016 REPORT.
The ultimate in data organization
Chapter 11: File System Implementation
Information Retrieval and Web Design
201X REPORT.
IR Theory: IR Basics & Vector Space Model
2016 REPORT.
Presentation transcript:

Evidence from Content INST 734 Module 2 Doug Oard

Agenda Character sets Terms as units of meaning  Boolean Retrieval Building an index

Where Indexing Fits Source Selection Search Query Selection Ranked List Examination Document Delivery Document Query Formulation IR System Indexing Index Acquisition Collection

Desirable Index Characteristics Very rapid search –Less than ~100ms is typically imperceivable Reasonable hardware requirements –Processor speed, disk size, main memory size “Fast enough” creation and updates –Every couple of weeks may suffice for the Web –Every couple of minutes is needed for news

McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … 16 × said 14 × McDonalds 12 × fat 11 × fries 8 × new 6 × company, french, nutrition 5 × food, oil, percent, reduce, taste, Tuesday … “Bag of Words”

“Bag of Terms” Representation Bag = a “set” that can contain duplicates  “The quick brown fox jumped over the lazy dog’s back”  {back, brown, dog, fox, jump, lazy, over, quick, the, the} Vector = values recorded in any consistent order  {back, brown, dog, fox, jump, lazy, over, quick, the, the}  [ ]

Why Does “Bag of Terms” Work? Words alone tell us a lot about content It is relatively easy to come up with words that describe an information need Random: beating takes points falling another Dow 355 Alphabetical: 355 another beating Dow falling points Actual: Dow takes another beating, falling 355 points

Bag of Terms Example The quick brown fox jumped over the lazy dog’s back. Document 1 Document 2 Now is the time for all good men to come to the aid of their party. the quick brown fox over lazy dog back now is time for all good men to come jump aid of their party Term Document 1Document 2 Stopword List

Boolean “Free Text” Retrieval Limit the bag of words to “absent” and “present” –“Boolean” values, represented as 0 and 1 Represent terms as a “bag of documents” –Same representation, but rows rather than columns Combine the rows using “Boolean operators” –AND, OR, NOT Result set: every document with a 1 remaining

AND/OR/NOT AB All documents

Boolean Operators A OR B A AND BA NOT B A B A B A B B NOT B (= A AND NOT B)

Boolean View of a Collection Each column represents the view of a particular document: What terms are contained in this document? Each row represents the view of a particular term: What documents contain this term? To execute a query, pick out rows corresponding to query terms and then apply logic table of corresponding Boolean operator

Sample Queries fox dog Term Doc 1 Doc 2Doc 3Doc 4 Doc 5Doc 6Doc 7Doc 8 dog  fox dog  fox dog  fox fox  dog dog AND fox  Doc 3, Doc 5 dog OR fox  Doc 3, Doc 5, Doc 7 dog NOT fox  empty fox NOT dog  Doc 7 good party g  p g  p  o good AND party  Doc 6, Doc 8 over good AND party NOT over  Doc 6 Term Doc 1 Doc 2Doc 3Doc 4 Doc 5Doc 6Doc 7Doc 8

Why Boolean Retrieval Works Boolean operators approximate natural language –Find documents about a good party that is not over AND can discover relationships between concepts –good party OR can discover alternate terminology –excellent party NOT can discover alternate meanings –Democratic party

Proximity Operators More precise versions of AND –“NEAR n” allows at most n-1 intervening terms –“WITH” requires terms to be adjacent and in order Easy to implement, but less efficient –Store a list of positions for each word in each doc Warning: stopwords become important! –Perform normal Boolean computations Treat WITH and NEAR like AND with an extra constraint

Proximity Operator Example time AND come –Doc 2 time (NEAR 2) come –Empty quick (NEAR 2) fox –Doc 1 quick WITH fox –Empty quick brown fox over lazy dog back now time all good men come jump aid their party 01 (9) Term 1 (13) 1 (6) 1 (7) 1 (8) 1 (16) 1 (1) 1 (2) 1 (15) 1 (4) (5) 1 (9) 1 (3) 1 (4) 1 (8) 1 (6) 1 (10) Doc 1Doc 2

What’s in the Postings File? Boolean retrieval –Just the document number Proximity operators –Word offsets for each occurrence of the term Example: Doc 3 (t17, t36), Doc 13 (t3, t45) Ranked Retrieval –Document number and term weight

Agenda Character sets Terms as units of meaning Boolean retrieval  Building an index