Special Topics in Computer Science The Art of Information Retrieval Chapter 8: Indexing and Searching Alexander Gelbukh www.Gelbukh.com.



2 Previous Chapter: Conclusions
Text transformation: meaning instead of strings
  o Lexical analysis
  o Stopwords
  o Stemming
POS, WSD, syntax, semantics
Ontologies to collate similar stems
Text compression
  o Searchable (compress the query, then search)
  o Random access
  o Word-based statistical methods (Huffman)
Index compression

3 Previous Chapter: Research topics
All computational linguistics
  o Improved POS tagging
  o Improved WSD
Uses of thesaurus
  o for user navigation
  o for collating similar terms
Better compression methods
  o Searchable compression
  o Random access

5 Types of searching
Sequential
  o Small texts
  o Volatile, or space limited
Indexed
  o Semi-static
  o Space overhead
First, we discuss indexed searching, then sequential

6 Inverted files
Vocabulary: grows as about sqrt(n) (Heaps' law); about 5 MB for 1 GB of text
Occurrences: about 40% of n (stopwords not indexed)
  o point to positions (word or character), files, sections...
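
The two parts of an inverted file, vocabulary and occurrences, can be sketched as follows. This is a minimal illustration only: the function name `build_inverted_index` and the choice of character offsets as the stored positions are assumptions, not part of the slides.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lowercased word (the vocabulary) to the list of
    character offsets where it starts (the occurrences)."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    return dict(index)

index = build_inverted_index("to be or not to be")
print(sorted(index))   # the vocabulary
print(index["to"])     # the occurrences of one word
```

Because the text is scanned left to right, each occurrence list comes out already sorted, which is what later merging steps rely on.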

7 Compression: Block addressing
Block addressing: 5% overhead
  o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
  o Blocks of equal size (faster search) or logical sections (retrieval units)
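
A rough sketch of block addressing, assuming blocks of a fixed number of words (the slides also allow logical sections). The names `build_block_index` and `locate` are hypothetical. The index stores only block numbers, so exact positions have to be recovered by scanning the candidate blocks sequentially.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of blocks that contain it."""
    index = {}
    for pos, word in enumerate(words):
        block = pos // block_size
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:   # store each block once
            blocks.append(block)
    return index

def locate(words, index, block_size, query):
    """Find exact word positions: the index narrows the search to a
    few blocks, then each candidate block is scanned sequentially."""
    hits = []
    for block in index.get(query, []):
        start = block * block_size
        for offset, word in enumerate(words[start:start + block_size]):
            if word == query:
                hits.append(start + offset)
    return hits

words = "to be or not to be that is the question".split()
block_index = build_block_index(words, block_size=4)
print(block_index["be"], locate(words, block_index, 4, "be"))
```

The trade-off is visible here: the index entries are smaller, but every query pays for a sequential scan inside the blocks it touches.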

8 Searching in inverted files
Vocabulary search
  o Separate file
  o Many searching techniques
  o Lexicographic: log V (V = vocabulary size) = ½ log n (Heaps)
  o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation of occurrences: ~sqrt(n) (Heaps, Zipf)
  o Boolean operations. Context search
    Merging
    One list is shorter (Zipf's law)
Only inverted files allow both space and time to be sublinear
Suffix trees and signature files don't
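
The point about lexicographic search versus hashing can be illustrated with a small sketch. It assumes the vocabulary is kept as a sorted Python list and that no word contains the highest Unicode code point, which is used as a sentinel; `prefix_range` is a hypothetical helper name.

```python
from bisect import bisect_left

def prefix_range(vocabulary, prefix):
    """Return all words starting with `prefix` from a sorted vocabulary.
    Two binary searches (O(log V) each) delimit a contiguous range,
    which a hash table cannot offer because it does not keep order."""
    lo = bisect_left(vocabulary, prefix)
    # Sentinel: sorts after every word that starts with `prefix`.
    hi = bisect_left(vocabulary, prefix + chr(0x10FFFF))
    return vocabulary[lo:hi]

vocabulary = sorted(["cat", "care", "dog", "car", "card"])
print(prefix_range(vocabulary, "car"))
```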

9 Building inverted file: 1
Infinite memory?
Use trie to store vocabulary
  o append positions
O(n)

10 Building inverted file: 2
Finite memory?
Fill the memory
Write partial index; n/M pieces
Merge partial indices (hierarchically): n log(n/M)
Insertion: index the new text (size n'), then merge: n + n' log(n'/M)
Deletion: eliminate every occurrence: n
Very fast creation and maintenance
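
The finite-memory procedure can be sketched as below, with everything held in Python lists for clarity; a real implementation would write each partial index to disk. The names `partial_indices`, `merge_two`, `merge_hierarchically` and the parameter `memory_limit` (the M of the slide, counted in words) are assumptions made for this illustration.

```python
def partial_indices(words, memory_limit):
    """Index the text in chunks of at most `memory_limit` words,
    yielding about n/M partial indices."""
    partials = []
    for start in range(0, len(words), memory_limit):
        partial = {}
        for pos in range(start, min(start + memory_limit, len(words))):
            partial.setdefault(words[pos], []).append(pos)
        partials.append(partial)
    return partials

def merge_two(left, right):
    """Merge two partial indices. Every position in `right` is larger
    than every position in `left`, so concatenation keeps lists sorted."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        merged.setdefault(word, []).extend(positions)
    return merged

def merge_hierarchically(partials):
    """Merge adjacent pairs level by level: about log(n/M) levels."""
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])
        partials = next_level
    return partials[0] if partials else {}

words = "a b a c b a".split()
print(merge_hierarchically(partial_indices(words, memory_limit=2)))
```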

11 Suffix trees
Text as one long string. No words.
  o Genetic databases
  o Complex queries
  o Compacted trie structure
  o Problem: space
For text retrieval, inverted files are better

14 Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
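
A minimal sketch of a suffix array and its binary search. The construction shown (sorting all suffixes directly) is the simplest possible one and is only suitable for small texts; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Two binary searches delimit the suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array, find_occurrences(text, suffix_array, "ana"))
```

Since the array indexes every text position rather than word starts, the same search answers pattern, prefix, and phrase queries.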

15 Searching. Construction
Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but: space (m = query size)
Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
  o Large text: n² log(M)/M, more than for inverted files
  o Skip details
Addition: n n' log(M)/M
Deletion: n

16 Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of the word's pattern are set in the block signature
Sequential search for blocks
False drops!
  o Design of the hash function
  o Have to traverse the block
Good for searching ANDs or proximity queries
  o the bit patterns of the query words are ORed
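
The signature mechanism, including the handling of false drops, might look like the sketch below. The signature width, the number of bits per word, and the use of SHA-256 as the hash function are all assumptions chosen for illustration; the slides do not prescribe them.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_pattern(word):
    """Hash a word to a bit pattern with at most BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    pattern = 0
    for i in range(BITS_PER_WORD):
        pattern |= 1 << (digest[i] % SIGNATURE_BITS)
    return pattern

def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_pattern(word)
    return signature

def search(blocks, signatures, word):
    """Scan the signatures sequentially; traverse a block only when all
    bits of the word's pattern are set, which filters out false drops."""
    pattern = word_pattern(word)
    hits = []
    for number, signature in enumerate(signatures):
        if signature & pattern == pattern and word in blocks[number]:
            hits.append(number)
    return hits

blocks = [["to", "be", "or"], ["not", "to", "be"], ["that", "is"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "not"))
```

The signature test can only say "possibly present", never "definitely present", which is why the final `word in blocks[number]` check is needed.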

18 Boolean operations
Merging file (occurrence) lists
  o AND: find the entries repeated in both lists
According to query syntax tree
Complexity linear in intermediate results
  o Can be slow if they are huge
There are optimization techniques
  o E.g.: merge a small list with a big one by searching
  o This is the usual case (Zipf)
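
Both ways of computing an AND over two sorted occurrence lists can be sketched as follows; the function names are hypothetical. The first is the linear merge, the second is the optimization for a short list against a long one.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """AND by merging two sorted lists: O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """AND by binary-searching each element of the short list in the
    long one: O(len(short) * log(len(long)))."""
    result = []
    for value in short:
        k = bisect_left(long, value)
        if k < len(long) and long[k] == value:
            result.append(value)
    return result

print(intersect_merge([1, 3, 5, 7], [3, 4, 5, 8]))
print(intersect_by_search([5, 90], list(range(0, 100, 5))))
```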

19 Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
Knuth-Morris-Pratt: linear in the worst case, but the same on average
Boyer-Moore: n log(m) / m. Not all chars are examined!
  o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once
Shift-Or: uses logical operations on all 32 bits in parallel
BDM: automaton. Complexity same as Boyer-Moore
Combination of BDM with bit parallelism
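
As an illustration of bit parallelism, here is a sketch of Shift-Or. It uses Python's arbitrary-precision integers, so the pattern is not limited to the 32 bits of a machine word mentioned on the slide; the function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of `pattern`.
    Bit i of the state is 0 exactly when pattern[:i + 1] matches the
    text ending at the current position."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has a 0 bit at every position where the pattern has c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    hits = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all m prefix matches at once.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits

print(shift_or_search("abracadabra", "abra"))
```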

20 Approximate string matching
Match with k errors
Levenshtein distance
Dynamic programming: O(mn), O(kn)
Automaton: non-deterministic
  o Convert to deterministic: O(n), but huge structure
  o Bit-parallel: O(n), the fastest known
Filtering: sublinear!
  o k errors cannot alter all of k+1 segments
  o multipattern exact search; detect suspicious places
  o uses approximate algorithm only when needed
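
The O(mn) dynamic-programming search can be sketched as below. The function name and the choice to report end positions of matches are assumptions for this illustration. The only difference from computing plain Levenshtein distance is that row 0 stays at 0, so a match may start anywhere in the text.

```python
def approximate_search(text, pattern, k):
    """Return the text positions where an occurrence of `pattern`
    with at most k errors (Levenshtein distance) ends."""
    m = len(pattern)
    # column[i] = fewest errors matching pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        diagonal = column[0]            # always 0: a match can start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            best = min(diagonal + cost,   # match or substitution
                       column[i] + 1,     # extra character in the text
                       column[i - 1] + 1) # missing pattern character
            diagonal = column[i]
            column[i] = best
        if column[m] <= k:
            ends.append(pos)
    return ends

print(approximate_search("abcdef", "bxd", k=1))
```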

21 Regular expressions
  o Automaton: O(m 2^m) + O(n), bad for long patterns
  o Bit-parallel (simulates the non-deterministic automaton)
Using indices to search for words with errors
  o Inverted files: search in vocabulary, then each word
  o Suffix trees and suffix arrays: the same algorithms!

22 Structural queries
Ad-hoc index for structure
Indexing tags as words
  o Inverted files are good since they store occurrences in order

23 Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
  o Huffman compression, words as symbols, bytes (frequencies: most frequent words get shorter codes)
  o Look up each word in the vocabulary to get its code
  o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined
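
The "compress the query, then search" idea can be sketched with a word-based Huffman code. This is a simplification under stated assumptions: codes are bit strings kept in a Python list, one per word, so matching is automatically aligned on codeword boundaries; the byte-oriented codes mentioned on the slide would be scanned as raw bytes instead. All function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(words):
    """Build a Huffman code with whole words as symbols."""
    frequencies = Counter(words)
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    tiebreak = count()  # keeps heap entries comparable on equal frequency
    heap = [(f, next(tiebreak), {w: ""}) for w, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def compress(words, codes):
    """Replace every word by its code."""
    return [codes[word] for word in words]

def search_compressed(compressed, codes, query):
    """Compress the query word, then compare codes without decompressing."""
    code = codes.get(query)   # vocabulary lookup gives the code
    if code is None:          # word not in the vocabulary: no matches
        return []
    return [i for i, c in enumerate(compressed) if c == code]

words = "to be or not to be".split()
codes = huffman_codes(words)
print(search_compressed(compress(words, codes), codes, "be"))
```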

24 ...compression
Suffix trees can be compressed almost to the size of suffix arrays
Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
  o instead of Huffman, use a code that respects alphabetic order
  o almost the same compression
Signature files are sparse, so can be compressed
  o ratios up to 70%

26 Research topics
Perhaps, new details in integration of compression and search
Linguistic indexing: allowing linguistic variations
  o Search in plural or only singular
  o Search with or without synonyms

27 Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
  o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
  o Many methods to improve sequential searching
Compression can be integrated with search

28 Thank you! Till compensation lecture?