1 Chap 14 Ranking Algorithms
Advisor: Dr. 黃三益
Students: 吳金山, 鄭菲菲

2 Outline
- Introduction
- Ranking models
- Selecting ranking techniques
- Data structures and algorithms
- The creation of an inverted file
- Searching the inverted file
- Stemmed and unstemmed query terms
- A Boolean system with ranking
- Pruning

3 Introduction
- Boolean systems
  - Provide powerful on-line search capabilities for librarians and other trained intermediaries
  - Provide very poor service for end-users who use the system infrequently
- The ranking approach
  - The user inputs a natural language query without Boolean syntax
  - The system produces a list of ranked records that "answer" the query
  - More oriented toward end-users

4 Introduction (cont.)
- The natural language/ranking approach is more effective for end-users
  - Results are ranked by the co-occurrence of query terms, modified by statistical term-weighting
  - It eliminates the often-wrong Boolean syntax used by end-users
  - It provides some results even if a query term is incorrect

5 Figure 14.1 Statistical ranking
Query: human factors in information retrieval systems

Terms:  factors  information  help  human  operation  retrieval  systems
Query:     1          1         0     1        0          1         1
Rec1:      1          1         0     1        0          1         0    (human, factors, information, retrieval)
Rec2:      1          0         1     1        0          0         1    (human, factors, help, systems)
Rec3:      1          0         0     0        1          0         1    (factors, operation, systems)

The figure also assigns term weights to each record; these are used for the weighted match on the next slide.

6 Figure 14.1 Statistical ranking
- Simple match (binary vectors)
  - Query (1 1 0 1 0 1 1) · Rec1 (1 1 0 1 0 1 0) = 4
  - Query (1 1 0 1 0 1 1) · Rec2 (1 0 1 1 0 0 1) = 3
  - Query (1 1 0 1 0 1 1) · Rec3 (1 0 0 0 1 0 1) = 2
- Weighted match
  - The same dot product computed against the term-weighted record vectors gives Rec1 = 13, Rec2 = 8, Rec3 = 3
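
A minimal Python sketch of the two matches above, using the binary vectors derived from the term lists in Figure 14.1 (the weighted record vectors themselves are not reproduced here):

```python
# Simple vs. weighted match as dot products.
# Term order: factors, information, help, human, operation, retrieval, systems.

def dot(query, record):
    """Dot product of a query vector and a record vector."""
    return sum(q * r for q, r in zip(query, record))

query = [1, 1, 0, 1, 0, 1, 1]       # human factors in information retrieval systems
records = {
    "Rec1": [1, 1, 0, 1, 0, 1, 0],  # human, factors, information, retrieval
    "Rec2": [1, 0, 1, 1, 0, 0, 1],  # human, factors, help, systems
    "Rec3": [1, 0, 0, 0, 1, 0, 1],  # factors, operation, systems
}

# Simple match on binary vectors: Rec1 = 4, Rec2 = 3, Rec3 = 2.
for name, vector in records.items():
    print(name, dot(query, vector))

# The weighted match of Figure 14.1 is the same dot product computed
# against term-weighted record vectors instead of binary ones.
```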

7 Ranking models
- Two types of ranking models
  - Ranking the query against individual documents
    - Vector space model
    - Probabilistic model
  - Ranking the query against entire sets of related documents

8 Ranking models (cont.)
- Vector space model
  - Uses the cosine correlation to compute similarity
  - Early experiments: the SMART system (overlap similarity function)
    - Results: within-document frequency weighting outperformed no term weighting; cosine correlation with frequency term weighting outperformed the overlap similarity function
  - Salton and Yang (1973): relying on term importance within an entire collection
    - Result: significant performance improvement from combining within-document frequency weighting with the inverse document frequency (IDF)
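
As an illustration of cosine correlation over frequency-and-IDF-weighted vectors, here is a small Python sketch; it is not the SMART implementation, and the terms and IDF values are invented for the example:

```python
import math
from collections import Counter

def tfidf_vector(term_freqs, idf):
    """Within-document frequency weighting scaled by IDF (sparse dict)."""
    return {t: f * idf.get(t, 0.0) for t, f in term_freqs.items()}

def cosine(u, v):
    """Cosine correlation between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical IDF values, assumed to be computed over the whole collection.
idf = {"human": 1.2, "factors": 0.9, "retrieval": 1.5, "systems": 0.4}
doc = tfidf_vector(Counter(["human", "factors", "retrieval", "retrieval"]), idf)
qry = tfidf_vector(Counter(["human", "retrieval", "systems"]), idf)
print(cosine(qry, doc))
```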

9 Ranking models (cont.)
- Probabilistic model
  - Terms appearing in previously retrieved relevant documents are given a higher weight
  - Croft and Harper (1979)
    - Probabilistic indexing without any relevance information
    - Assumes all query terms have equal probability
    - Derives a term-weighting formula

10 Ranking models (cont.)
- Probabilistic model
  - Croft (1983)
    - Incorporates within-document frequency weights
    - Uses a tuning factor K
    - Result: significant improvement over both the IDF weighting alone and the combination weighting
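
A sketch of this kind of combined weight, in the commonly presented form where the within-document frequency is normalized by the document's maximum term frequency and blended through the tuning factor K before being scaled by IDF; the exact formula and the sample value of K are assumptions, not quoted from Croft (1983):

```python
def combined_weight(freq, max_freq, idf, K=0.3):
    """Normalized within-document frequency blended with a tuning
    factor K, then scaled by the term's IDF (assumed form)."""
    ntf = K + (1.0 - K) * (freq / max_freq)
    return ntf * idf

# Example: a term occurring 3 times in a document whose most frequent
# term occurs 10 times, with IDF 2.0.
print(combined_weight(3, 10, 2.0))   # 1.02 with K = 0.3
```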

11 Other experiments involving ranking
- Direct comparison of similarity measures and term-weighting schemes
  - Four types of term frequency weightings (Sparck Jones, 1973)
    - Term frequency within a document
    - Term frequency within a collection
    - Term postings within a document (a binary measure)
    - Term postings within a collection
  - Indexing was taken from manually extracted keywords
  - Results
    - Using term frequency (or postings) within a collection always improved performance
    - Using term frequency (or postings) within a document improved performance only for some collections

12 Other experiments involving ranking (cont.)
- Harman (1986): four term-weighting factors
  - (a) The number of matches between a document and a query
  - (b) The distribution of a term within the document collection (IDF and noise measures)
  - (c) The frequency of a term within a document
  - (d) The length of the document
- Results
  - Using the single measures alone, the distribution of a term within the collection (b) gave about twice the improvement of the within-document frequency (c)
  - Combining the within-document frequency with either the IDF or the noise measure gave about twice the improvement of using the IDF or noise measure alone

13 Other experiments involving ranking (cont.)
- Ranking based on document structure
  - Not only using weights based on term importance within an entire collection and within a given document (Bernstein and Williamson, 1984)
  - But also using the structural position of the term (summary versus text paragraphs)
  - In SIBRIS, term weights are increased for terms in titles of documents and decreased for terms added to a query from a thesaurus

14 Selecting ranking techniques
- Term-weighting based on the distribution of a term within a collection always improves performance
- Combining the within-document frequency with the IDF weight often provides even more improvement
- The within-document frequency can be combined with the IDF measure in several ways
- Additional weight can be added for document structure, e.g., higher weights for terms appearing in the title or abstract versus those appearing only in the text (see the sketch below)
- Relevance weighting (Chap. 11)
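
A small, purely illustrative sketch of the document-structure idea; the field names and boost values are invented, not taken from the chapter:

```python
def structural_weight(base_weight, field):
    """Boost a term's weight when it occurs in the title or abstract
    rather than only in the body text (illustrative multipliers)."""
    boost = {"title": 2.0, "abstract": 1.5, "body": 1.0}
    return base_weight * boost.get(field, 1.0)

print(structural_weight(0.8, "title"))   # 1.6
```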

15 The creation of an inverted file
- Implications for the supporting inverted file structures
  - Only the record id has to be stored (smaller index)
  - Strategies that increase recall at the expense of precision can be used
- The inverted file is usually split into two pieces for searching
  - The dictionary, containing each term along with statistics about that term (such as the number of postings and the IDF) and a pointer to the location of that term's postings list
  - The postings file, containing the record ids and the weights for all occurrences of the term
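
A minimal Python sketch of that two-piece layout, with in-memory dictionaries standing in for the on-disk dictionary and postings file; the field names and the log-based IDF are assumptions for illustration:

```python
import math
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {record_id: [term, ...]} -> (dictionary, postings).
    The dictionary holds per-term statistics plus a 'pointer'
    (here simply the term itself) into the postings structure."""
    postings = defaultdict(list)            # term -> [(record id, within-doc freq)]
    for rec_id, terms in docs.items():
        counts = defaultdict(int)
        for term in terms:
            counts[term] += 1
        for term, freq in counts.items():
            postings[term].append((rec_id, freq))

    n_docs = len(docs)
    dictionary = {
        term: {"num_postings": len(plist),
               "idf": math.log(n_docs / len(plist)),
               "postings_ptr": term}
        for term, plist in postings.items()
    }
    return dictionary, postings

dictionary, postings = build_inverted_file({
    1: ["human", "factors", "information", "retrieval"],
    2: ["human", "factors", "help", "systems"],
    3: ["factors", "operation", "systems"],
})
print(dictionary["factors"], postings["factors"])
```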

16 The creation of an inverted file (cont.)
- Four major options for storing weights in the postings file
  1. Store the raw frequency
     - Slowest search
     - Most flexible
  2. Store a normalized frequency
     - Not suitable for use with the cosine similarity function
     - Updating the collection would not change existing postings

17 The creation of an inverted file (cont.)
  3. Store the completely weighted term
     - Any of the combination weighting schemes can be used
     - Disadvantage: updating the collection requires changing all postings
  4. If no within-record weighting is used, the postings records do not have to store weights at all

18 Searching the inverted file
Figure 14.4: flowchart of the search engine
query parser → dictionary lookup → get weights → accumulator → sort by weight → ranked record numbers
(the dictionary lookup yields a dictionary entry per term; get weights yields record numbers on a per-term basis; the accumulator yields record numbers with their total weights)

19 Searching the inverted file (cont.)
- Inefficiencies of this technique
  - The I/O needs to be minimized
    - Use a single read for all the postings of a given term, then separate the buffer into record ids and weights
  - Time savings can be gained at the expense of some memory space
    - Direct access to memory rather than access through hashing
  - A final major bottleneck can be the sort step of the accumulators for large data sets
    - Even a fast sort of many thousands of records is very time consuming
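
A sketch of the accumulator-based search loop, using an array of accumulators addressed directly by record id as suggested above; the data layout reuses the dictionary/postings structures from the earlier inverted-file sketch and is an assumption, not the chapter's exact implementation:

```python
def rank(query_terms, dictionary, postings, num_records):
    """Accumulate per-record scores term by term, then sort by weight."""
    accumulators = [0.0] * (num_records + 1)    # direct access by record id
    for term in query_terms:
        entry = dictionary.get(term)
        if entry is None:
            continue                            # unknown term: contributes nothing
        idf = entry["idf"]
        # A single read brings in the whole postings list for the term;
        # each posting is (record id, within-document weight).
        for rec_id, weight in postings[term]:
            accumulators[rec_id] += weight * idf
    return sorted(
        ((rec_id, score) for rec_id, score in enumerate(accumulators) if score > 0),
        key=lambda pair: pair[1], reverse=True)

# Example, reusing `dictionary` and `postings` from the inverted-file sketch above:
# print(rank(["human", "factors", "retrieval"], dictionary, postings, 3))
```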

20 Stemmed and unstemmed query terms
- If query terms are automatically stemmed in a ranking system, users generally get better results (Frakes, 1984; Candela, 1990)
- In some cases, however, a stem is produced that leads to improper results
  - The original record terms are not stored in the inverted file; only their stems are used

21 Stemmed and unstemmed query terms (cont.)
- Harman and Candela (1990)
  - Two separate inverted files could be created and stored
    - Stemmed terms: used for normal queries
    - Unstemmed terms: used for "don't stem" queries
  - Hybrid inverted file
    - Saves no space in the dictionary part
    - Saves considerable storage compared with keeping two versions of the postings
    - At the expense of some additional search time

22 A Boolean system with ranking
- SIRE system
  - Full Boolean capability plus a variation of the basic search process
  - Accepts queries that are either Boolean logic strings or natural language queries (implicit OR)
  - Major modification to the basic search process: postings from the query terms are merged before ranking is done
  - Performance
    - Faster response time for Boolean queries
    - No increase in response time for natural language queries

23 Pruning
- A major time bottleneck in the basic search process: the sort of the accumulators for large data sets
- Changed search algorithm with pruning:
  1. Sort all query terms (stems) by decreasing IDF value
  2. Do a binary search for the first term (i.e., the one with the highest IDF) and get the address of the postings list for that term
  3. Read the entire postings list for that term into a buffer and add the term weights for each record id into the contents of the unique accumulator for that record id

24 Pruning (cont.)
  4. Check the IDF of the next query term. If the IDF is at least 1/3 of the maximum IDF of any term in the data set, repeat steps 2, 3, and 4; otherwise repeat steps 2, 3, and 4, but do not add weights to zero-weight accumulators
  5. Sort the accumulators with nonzero weights to produce the final ranked record list
  6. If a query has only high-frequency terms, pruning cannot be done.
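
A hedged Python sketch of the pruned search following the steps above; the dictionary/postings layout matches the earlier sketches, and any detail beyond the listed steps is an assumption:

```python
def pruned_rank(query_terms, dictionary, postings):
    """Pruned search: process terms in decreasing IDF order; once a term's
    IDF falls below one-third of the maximum IDF in the data set, stop
    creating new accumulators and only update existing ones."""
    terms = sorted((t for t in query_terms if t in dictionary),
                   key=lambda t: dictionary[t]["idf"], reverse=True)
    if not terms:
        return []
    max_idf = max(entry["idf"] for entry in dictionary.values())
    accumulators = {}                      # record id -> total weight
    for term in terms:
        idf = dictionary[term]["idf"]
        prune = idf < max_idf / 3.0
        for rec_id, weight in postings[term]:
            if prune and rec_id not in accumulators:
                continue                   # pruning: do not start new accumulators
            accumulators[rec_id] = accumulators.get(rec_id, 0.0) + weight * idf
    # Only accumulators with nonzero weights are sorted into the final ranking.
    return sorted(accumulators.items(), key=lambda pair: pair[1], reverse=True)
```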

25 Thanks