1 Basic Text Processing and Indexing

2 Document Processing Steps
– Lexical analysis (tokenizing)
– Stopword removal
– Stemming
– Selection of indexing terms from the word collection
– Construction of the indexing system

3 Simple Tokenizing
Separate text into a sequence of discrete tokens (words).
Sometimes punctuation, numbers (1999), and case (“China” vs. “china”) can be a meaningful part of a token; frequently, however, they are not.
The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens.
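
A minimal sketch of this simplest approach in Java, assuming a purely alphabetic token pattern (the class and method names are illustrative, not from the slides):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {
    // Lower-case the text and split on runs of non-alphabetic characters,
    // so numbers and punctuation are discarded entirely.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [china, exported, widgets]: "1,999" and "!" vanish.
        System.out.println(tokenize("China exported 1,999 widgets!"));
    }
}
```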

4 Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
– Words appearing in URLs.
– Words appearing in the “meta text” of images.
– Words in meta-tags.
The simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization, but this may lose important information.
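
A sketch of this exclude-everything-between-tags approach; the regular expression is deliberately naive and ignores edge cases such as comments and script blocks:

```java
public class HtmlStripper {
    // Replace each tag (everything between "<" and ">") with a space so
    // that words adjacent to tags are not merged together.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        // Prints "Basic  text  processing" (with extra internal spaces).
        System.out.println(stripTags("<p>Basic <b>text</b> processing</p>").trim());
    }
}
```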

5 Stopword Removal
It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
Stopwords are language dependent, and one may find different sets of stopwords in different sources.
For efficiency, store the stopword strings in a hashtable so they can be recognized in constant time.
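
A minimal sketch of the constant-time stoplist check, using a hash-based set (the tiny word list here is illustrative; real stoplists run to hundreds of words):

```java
import java.util.Set;

public class StopwordFilter {
    // A hash-based set gives expected O(1) membership tests per token.
    private static final Set<String> STOPWORDS =
            Set.of("a", "the", "in", "to", "i", "he", "she", "it");

    public static boolean isStopword(String token) {
        return STOPWORDS.contains(token.toLowerCase());
    }
}
```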

6 Stemming
Reduce tokens to the “root” form of words to recognize morphological variation.
– “computer”, “computational”, “computation” all reduced to the same token, “compute”
Correct morphological analysis is language specific and can be complex.
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.

7 Four Types of Strategies
Affix removal:
– Based on an intuitive, heuristic approach. There are four different algorithms; Porter’s algorithm is the most popular one.
Table look-up:
– Look up the stem of a word in a table. Such tables would be too big and difficult to construct.
Successor variety:
– Based on determining morpheme boundaries; complicated.
N-grams:
– Based on term clustering rather than stemming.

8 Porter’s Algorithm
A simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words:
– “computer”, “computational”, “computation” all reduced to the same token, “comput”
May conflate (reduce to the same token) words that are actually distinct.
Does not recognize all morphological derivations.
Appendix A of our text has a description of the algorithm.

9 Porter’s Algorithm -- Basic Idea
Uses a suffix list to strip suffixes. The idea is to apply a set of rules to the suffixes of the words in the text.
For example, four rules remove the plural form; the rule with the longest matching suffix is selected:
– sses → ss
– ies → i
– ss → ss
– s → NULL
The first rule maps the word “stresses” to “stress”.
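
A sketch of just this plural-removal step (not the full Porter algorithm), with the rules ordered so that the longest matching suffix wins:

```java
public class PluralStemmer {
    public static String removePlural(String word) {
        if (word.endsWith("sses")) return word.substring(0, word.length() - 2); // sses -> ss
        if (word.endsWith("ies"))  return word.substring(0, word.length() - 2); // ies  -> i
        if (word.endsWith("ss"))   return word;                                 // ss   -> ss
        if (word.endsWith("s"))    return word.substring(0, word.length() - 1); // s    -> NULL
        return word;
    }

    public static void main(String[] args) {
        System.out.println(removePlural("stresses")); // stress  (sses -> ss)
        System.out.println(removePlural("ponies"));   // poni    (ies -> i)
    }
}
```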

10 Sparse Vectors
The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
However, most documents and queries do not contain most words, so the vectors are sparse (i.e. most entries are 0).
We need efficient methods for storing and computing with sparse vectors.

11 Sparse Vectors as Lists
Store vectors as linked lists of non-zero-weight tokens paired with a weight.
– Space is proportional to the number of unique tokens (n) in the document.
– Requires a linear search of the list to find (or change) the weight of a specific token.
– Requires quadratic time in the worst case to compute the vector for a document: 1 + 2 + … + n = n(n+1)/2 ∈ O(n^2).
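
A minimal sketch of this list representation (names are illustrative); the linear scan in increment() is exactly what makes construction quadratic in the worst case:

```java
import java.util.LinkedList;

public class ListVector {
    private static final class Entry {
        final String token;
        double weight;
        Entry(String token, double weight) { this.token = token; this.weight = weight; }
    }

    private final LinkedList<Entry> entries = new LinkedList<>();

    // O(n) scan per token; building an n-token vector is therefore O(n^2).
    public void increment(String token) {
        for (Entry e : entries) {
            if (e.token.equals(token)) { e.weight += 1.0; return; }
        }
        entries.add(new Entry(token, 1.0));
    }
}
```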

12 Sparse Vectors as Trees
Index the tokens in a document in a balanced binary tree or trie, with the weights stored with the tokens at the leaves.
[Figure: a balanced binary tree keyed on the tokens, with leaves storing the weights bit: 2, film: 1, memory: 1, variable: 2.]

13 Sparse Vectors as Trees (cont.)
Space overhead for the tree structure: ~2n nodes.
O(log n) time to find or update the weight of a specific token.
O(n log n) time to construct the vector.
Needs a software package to support such data structures.
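
In Java, the standard library’s red-black tree (TreeMap) is one such off-the-shelf package; a minimal sketch:

```java
import java.util.TreeMap;

public class TreeVector {
    // TreeMap keeps tokens in a balanced (red-black) tree.
    private final TreeMap<String, Double> weights = new TreeMap<>();

    public void increment(String token) {
        weights.merge(token, 1.0, Double::sum); // O(log n) per update
    }

    public double weight(String token) {
        return weights.getOrDefault(token, 0.0); // O(log n) lookup
    }
}
```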

14 Sparse Vectors as Hashtables
Store the tokens in a hashtable, with the token string as the key and the weight as the value.
– Storage overhead for the hashtable: ~1.5n.
– The table must fit in main memory.
– Constant time to find or update the weight of a specific token (ignoring collisions).
– O(n) time to construct the vector (ignoring collisions).
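
The same sparse vector backed by a hashtable; only the map implementation changes, which is what buys the expected constant-time operations:

```java
import java.util.HashMap;

public class HashVector {
    private final HashMap<String, Double> weights = new HashMap<>();

    public void increment(String token) {
        weights.merge(token, 1.0, Double::sum); // expected O(1), ignoring collisions
    }

    public double weight(String token) {
        return weights.getOrDefault(token, 0.0); // expected O(1)
    }
}
```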

15 Implementation Based on Inverted Files In practice, document vectors are not stored directly; an inverted organization provides much better efficiency. The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). Critical issue is logarithmic or constant-time access to token information.

16 Inverted Index: Assumptions
Queries will happen frequently:
– Find all documents that contain term t.
Deletes will be rare:
– Delete document 52.
Updates will be rare:
– Correct the spelling of term t in document 52.
Adds will not happen too often:
– Add a new document.

17 Inverted Index: Basic Structure
Term list: a list of all terms.
Document node: a structure that contains information such as the document ID, the term frequency, and other fields.
Posting list: for each term, a list containing a document node for each document in which the term appears.

18 Inverted Index
[Figure: the term list holds index terms such as “computer”, “database”, “science”, and “system”, each with its document frequency (df); each term points to a postings list of (D_j, tf_j) nodes, e.g. (D_2, 4), (D_5, 2), (D_1, 3), (D_7, 4).]

19 Creating an Inverted Index
Create an empty index term list, I;
For each document, D, in the document set V:
    For each (non-zero) token, T, in D:
        If T is not already in I:
            Insert T into I;
        Find the location for T in I;
        If (T, D) is in the posting list for T:
            Increase its term frequency for T;
        Else:
            Create (T, D) and add it to the posting list for T;
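
A minimal Java sketch of this construction loop, assuming pre-tokenized documents and integer document IDs (all names here are illustrative, not from the slides):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // Term list I: term -> postings list, represented as docId -> term frequency.
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    public void addDocument(int docId, List<String> tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new LinkedHashMap<>()) // insert T into I if absent
                    .merge(docId, 1, Integer::sum);                 // create or bump tf of (T, D)
        }
    }

    public Map<Integer, Integer> postingsFor(String term) {
        return postings.getOrDefault(term, Map.of());
    }
}
```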

20 Computing IDF
Let N be the total number of documents;
For each token, T, in I:
    Determine the total number of documents, M, in which T occurs
    (the length of T’s posting list);
    Set the IDF for T to log(N/M);
Note that this requires a second pass through all the tokens after all documents have been indexed.
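
The second pass, written as a method that could be added to the InvertedIndex sketch above (N is passed in as the corpus size):

```java
// For each term: M = posting-list length, IDF = log(N/M).
public Map<String, Double> computeIdf(int numDocs) {
    Map<String, Double> idf = new HashMap<>();
    for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
        int m = e.getValue().size();
        idf.put(e.getKey(), Math.log((double) numDocs / m));
    }
    return idf;
}
```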

21 Document Vector Length
Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
Remember that the weight of a token is TF * IDF.
Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before the document lengths can be determined.
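
Restating the two reminders above in symbols:

$$|d| \;=\; \sqrt{\sum_{t \in d} \big(\mathrm{tf}_{t,d} \cdot \mathrm{idf}_t\big)^2}$$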

22 Computing Document Vector Lengths
Assume the lengths of all document vectors (stored in the DocumentReference) are initialized to 0.0;
For each token, T, in I:
    Let idf be the IDF weight of T;
    For each TokenOccurence of T in document D:
        Let C be the count of T in D;
        Increment the length of D by (idf * C)^2;
For each document, D, in the document set:
    Set the length of D to the square root of the currently stored length;
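
The same pass as a method on the InvertedIndex sketch, accumulating (idf * C)^2 per document and taking the square root at the end; the returned map plays the role of the lengths stored in the DocumentReference:

```java
// docId -> vector length, computed from the postings and a precomputed IDF map.
public Map<Integer, Double> computeLengths(Map<String, Double> idf) {
    Map<Integer, Double> length = new HashMap<>();
    for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
        double w = idf.get(e.getKey());                   // IDF weight of T
        for (Map.Entry<Integer, Integer> occ : e.getValue().entrySet()) {
            double inc = w * occ.getValue();              // idf * C
            length.merge(occ.getKey(), inc * inc, Double::sum);
        }
    }
    length.replaceAll((doc, sumSq) -> Math.sqrt(sumSq));  // final square root
    return length;
}
```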

23 Time Complexity of Indexing
The complexity of creating the vector and indexing a document of n tokens is O(n). So indexing m such documents is O(m n).
Computing the token IDFs for a vocabulary V is O(|V|).
Computing the vector lengths is also O(m n).
Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

24 Retrieval with an Inverted Index
Tokens that are not in both the query and the document do not affect cosine similarity.
– The product of the token weights is zero and does not contribute to the dot product.
Usually the query is fairly short, and therefore its vector is extremely sparse.
Use the inverted index to find the limited set of documents that contain at least one of the query words.

25 Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents:
    Q = q_1 q_2 … q_n, where each q_i retrieves documents D_{i1} … D_{iB}
Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all N documents, O(|V| N), because |Q| << |V| and B << N.

26 Processing the Query
Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where the DocumentReference is the key and the partial accumulated score is the value.

27 Inverted-Index Retrieval Algorithm
Create a vector, Q, for the query.
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let idf be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * idf;
    Let L be the list of TokenOccurences of T from I (the term list);
    For each TokenOccurence, O, in L:
        Let D be the document of O, and C be the count of O (the tf of T in D);
        If D is not already in R (D was not previously retrieved):
            Add D to R and initialize its score to 0.0;
        Increment D’s score by W * idf * C; (the product of T’s weight in Q and in D)

28 Retrieval Algorithm (cont.)
Compute the length, L, of the vector Q (the square root of the sum of the squares of its weights).
For each retrieved document, D, in R:
    Let S be the current accumulated score of D;
    (S is the dot product of D and Q)
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D’s final score to S / (L * Y);
Sort the retrieved documents in R by final score and return the results in an array.
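
A minimal sketch of slides 27 and 28 combined, as a method on the InvertedIndex sketch above; it assumes the computeIdf and computeLengths results are passed in, and that java.util.List and ArrayList are imported:

```java
// Returns docIds ranked by cosine similarity with the query.
public List<Integer> retrieve(List<String> queryTokens,
                              Map<String, Double> idf,
                              Map<Integer, Double> docLength) {
    // Build the sparse query vector Q: token -> W = K * idf.
    Map<String, Double> qWeight = new HashMap<>();
    for (String t : queryTokens)
        if (idf.containsKey(t))
            qWeight.merge(t, idf.get(t), Double::sum); // adds idf once per occurrence: K * idf

    // R: accumulate each document's dot product with Q.
    Map<Integer, Double> score = new HashMap<>();
    double qLenSq = 0.0;
    for (Map.Entry<String, Double> e : qWeight.entrySet()) {
        double w = e.getValue();           // W, the query weight of T
        double tIdf = idf.get(e.getKey());
        qLenSq += w * w;
        for (Map.Entry<Integer, Integer> occ : postingsFor(e.getKey()).entrySet())
            score.merge(occ.getKey(), w * tIdf * occ.getValue(), Double::sum); // W * idf * C
    }
    double qLen = Math.sqrt(qLenSq);       // L, the length of Q

    // Normalize each score S to S / (L * Y), then sort descending.
    score.replaceAll((d, s) -> s / (qLen * docLength.get(d)));
    List<Integer> ranked = new ArrayList<>(score.keySet());
    ranked.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
    return ranked;
}
```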

29 User Interface
Until the user terminates with an empty query:
    Prompt the user to type a query, Q;
    Compute the ranked array of retrievals, R, for Q;
    Print the names of the top N documents in R;
    Until the user terminates with an empty command:
        Prompt the user for a command for this query result:
            1) Show the next N retrievals;
            2) Show the Mth retrieved document;