INFO624 -- Week 7 Indexing and Searching Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University.



Effective Information Retrieval
- Data structures
- Knowledge representation
- User interface and user interaction

Data Structures
- Describe how text, attributes of text, and indexes are stored in memory, files, or databases.
- Describe the nature of the relationships among information elements.

Data Models
- Logical data model
  - How the user views the data
  - How to represent or capture the semantic (logical) relationships of data
  - How to present the relationships of data to the user
  - Independent of physical implementation and systems
- Physical data model
  - How data are actually stored in the computer
  - Techniques for improving the efficiency of data storage and access

Logical Data Models
- Linear sequential model
  - Records are arranged in a defined order
  - Advantage: fast access
  - Disadvantage: data must be moved around when sorting
- Linked sequential model
  - Data are arranged in the order they are inserted
  - Each element has a link pointer to the next element
  - Advantage: no need to move data around
  - Disadvantage: additional space for links

Hierarchical (tree) model
- Has a unique root
- Each element (except the root) has one and only one parent
- Advantage: relationships are precisely defined
- Disadvantage: describes only one type of relationship

Poly-hierarchical model
- Allows an element in the tree structure to have more than one parent
- Computationally complex
- Advantage: represents more complex relationships
- Disadvantage: possible infinite loops

Network model
- Hypertext model
- Emphasis on links and nodes
- Less formalism
- A node can have any number of links
- Nodes can be freely defined (they don't have to be of the same type)
- Advantage: flexible
- Disadvantage: lack of control; lack of theory

Space model
- Basics of the physical space:
  - Dimensions, axes, coordinates
  - Geography, physics, and rules of law
- Semantic space
  - Giving meaning to place
  - Searching for features of the space

Vector space model
- Each indexing term is an axis
- Each document is a vector

Physical Data Structure
- How data are actually stored in the computer
- Technical implementation of data storage
- Record structures
  - Fixed-length
  - Variable-length
  - Tradeoff between speed and space

Examples:

Fixed-length record:
  SMITH,JON 1287 MAPLE AVE,AKRON OH, SMIT
  Field start positions: name -- 1, address -- 21, telephone -- 61, date; next record -- 97

Variable-length record:
  SMITH,JON|18287 MAPLE AVE,AKRON OH, 44444| |900315#SMIT
  Fields: 1 -- name, 2 -- address, 3 -- telephone, 4 -- date

Variable-length record (fixed header):
  00075|021|030|63|73* SMITH,JON1287 MAPLE AVE,AKRON OH, #000
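The two record layouts above can be compared with a minimal sketch. The field widths, the `|` and `#` delimiters as used here, and the telephone value are illustrative assumptions, not the exact layout on the slide; the function names are hypothetical.

```python
# Hypothetical field widths (in characters) for the fixed-length layout.
FIELDS = [("name", 20), ("address", 40), ("telephone", 12), ("date", 6)]

def pack_fixed(record):
    """Pad every field to its width, so each record has the same length."""
    return "".join(record[name].ljust(width)[:width] for name, width in FIELDS)

def unpack_fixed(line):
    """Slice a fixed-length record at known offsets; no scanning needed."""
    out, pos = {}, 0
    for name, width in FIELDS:
        out[name] = line[pos:pos + width].rstrip()
        pos += width
    return out

def pack_variable(record):
    """Join fields with '|' and end the record with '#'."""
    return "|".join(record[name] for name, _ in FIELDS) + "#"

def unpack_variable(data):
    """Scan for '#' to split records, then '|' to split fields."""
    names = [name for name, _ in FIELDS]
    return [dict(zip(names, chunk.split("|"))) for chunk in data.split("#") if chunk]

# The telephone number below is made up for the example.
record = {"name": "SMITH,JON", "address": "1287 MAPLE AVE,AKRON OH",
          "telephone": "555-0100", "date": "900315"}
fixed = pack_fixed(record)
variable = pack_variable(record)
print(len(fixed), len(variable))
```

The fixed-length form wastes space on padding but lets a program jump straight to record i at offset i times the record length; the variable-length form is shorter but has to be scanned. That is the speed/space tradeoff named on the previous slide.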

File Structures
- The focus of data structures in IR is file structures
  - A collection of documents is called a file.
  - Each document is called a record.
- The key to file structures is the different search techniques or models for the files and the indexes of files.

Index Structures
- A main file and several index files
- The main file is sequential and unsorted
- The index files are sorted and point into the main file
  - Inverted files
  - How large are the inverted index files?

Sizes of Inverted Indexing Files

Index                   Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
Addressing words        45%    73%                36%    64%                   35%    63%
Addressing documents    19%    26%                18%    32%                   26%    47%
Addressing 64 blocks    27%    41%                18%    32%                   5%     9%
Addressing 256 blocks   18%    25%                1.7%   2.4%                  0.5%   0.7%

For each collection, the first column is without stop words and the second column is with stop words.
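As a rough illustration of why the addressing granularity changes the index size, the sketch below builds two inverted indexes over the same text: one that addresses word positions and one that addresses whole documents. The function name and the sample documents are assumptions made for the example.

```python
def build_indexes(documents):
    """Return (word_index, doc_index) for a list of document strings.

    word_index maps a term to every (doc_id, word_position) occurrence.
    doc_index maps a term to the list of doc ids that contain it.
    """
    word_index = {}
    doc_index = {}
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.lower().split()):
            word_index.setdefault(term, []).append((doc_id, position))
            docs = doc_index.setdefault(term, [])
            if not docs or docs[-1] != doc_id:   # record each document once
                docs.append(doc_id)
    return word_index, doc_index

docs = ["data base management system", "data structures store data"]
word_index, doc_index = build_indexes(docs)
print(word_index["data"])
print(doc_index["data"])
```

The document-level index stores one entry per document rather than one per occurrence, so it is never larger than the word-level index; the price is that a phrase or proximity query must then scan the document text itself.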

Searching
- Going through the list of words in the inverted index file sequentially takes a long time, even for a computer.
- Data structures need to be created to speed up the search:
  - Trees
  - Hash tables
  - Signature files

Trees
- Binary tree
  - Each node contains a key
  - The left subtree stores all keys smaller than the parent key
  - The right subtree stores all keys larger than the parent key
- Balanced trees
  - Every parent has balanced left and right subtrees
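The ordering rule above (smaller keys to the left, larger keys to the right) can be sketched as follows. This is a minimal, unbalanced binary search tree; the class and function names are chosen for the example.

```python
class Node:
    """One tree node: a key plus left and right subtrees."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key below root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored

def search(root, key):
    """Return (found, number_of_nodes_visited)."""
    steps = 0
    node = root
    while node is not None:
        steps += 1
        if key == node.key:
            return True, steps
        node = node.left if key < node.key else node.right
    return False, steps

root = None
for k in ["M", "F", "T", "B", "H"]:
    root = insert(root, k)
print(search(root, "H"))
```

Each comparison discards one whole subtree, which is why a balanced tree finds a key in a number of steps proportional to the tree depth rather than to the number of keys.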

B-tree
- Each node can have more than one key
- If a node has m keys, it has m+1 child branches
- All keys in branch i-1 are smaller than key i
- All leaves are at the same depth

B+ trees
- A B-tree that stores all data in the leaves

Example:
- A B-tree of 10,000,000 keys with 50 keys per node never needs to retrieve more than 4 nodes to find any key.
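The 4-node figure in the example can be checked with a short calculation. The sketch below assumes that "50 keys per node" means every node except the root holds at least 50 keys (and so has at least 51 children), while the root holds at least one key; that reading is an assumption, and the function names are hypothetical.

```python
def min_keys_for_height(height, min_keys_per_node=50):
    """Fewest keys a B-tree with this many levels can hold."""
    total = 1   # the root holds at least one key
    nodes = 2   # a one-key root has two children
    for _ in range(height - 1):
        total += nodes * min_keys_per_node
        nodes *= min_keys_per_node + 1
    return total

def max_height(num_keys, min_keys_per_node=50):
    """Largest number of levels a tree with num_keys keys can have."""
    height = 1
    while min_keys_for_height(height + 1, min_keys_per_node) <= num_keys:
        height += 1
    return height

def max_keys_for_height(height, max_keys_per_node=50):
    """Most keys a tree can hold if 50 is instead the per-node maximum."""
    return (max_keys_per_node + 1) ** height - 1

print(min_keys_for_height(5))       # keys needed before a 5th level can exist
print(max_height(10_000_000))       # worst-case nodes read per lookup
print(max_keys_for_height(4))
```

Under the minimum-occupancy reading, a fifth level cannot exist until the tree holds 13,530,401 keys, so 10,000,000 keys fit in at most 4 levels. If 50 were instead the maximum per node, 4 levels would hold at most 6,765,200 keys and a fifth level would be needed.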

Procedures for Constructing Balanced Trees
1. Check whether the original tree is balanced
   - Check whether the left child is balanced
   - If it is not balanced, go to step 2
   - Check whether the right child is balanced
   - If it is not balanced, go to step 2

2. Rotate the unbalanced tree:
   - If the left branch is deeper
     - Move the left child of the root up to become the new root, and move the right branch of the new root to become the left branch of the old root
     - Make the old root the right child of the new root
   - If the right branch is deeper
     - … …
3. Go back to step 1 to check whether the new tree is balanced
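The rotation described for the "left branch is deeper" case can be sketched as below. Only that case is shown, since it is the one the slide spells out; the `Node` class and the helper names are assumptions made for the example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(node):
    """Number of levels in the subtree rooted at node (0 for an empty tree)."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def rotate_right(old_root):
    """Rotate a left-heavy tree: the left child becomes the new root."""
    new_root = old_root.left
    old_root.left = new_root.right   # new root's right branch moves under the old root
    new_root.right = old_root        # old root becomes the right child
    return new_root

# A left-heavy chain: 3 -> 2 -> 1
tree = Node(3, left=Node(2, left=Node(1)))
before = height(tree)
tree = rotate_right(tree)
after = height(tree)
print(before, after, tree.key, tree.left.key, tree.right.key)
```

The rotation keeps the search-tree ordering intact, because everything in the moved branch is larger than the new root and smaller than the old root.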

B+ tree:

                  (F, M)
  (Ap, Bs, E)   (Gr, H, L)   (P, Ru, T)

Direct-Access Structures
- Hashing
  - Evenly distributes a long list into a short list using a hashing function
  - The remainder after division by a prime number is a common hashing function
  - Example:
    - Hashing function: H(k) = k mod 7
    - Put the following numbers into the hash table: 5, 22, 25, 89, 50, 71, 99
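The example above can be worked through with a small sketch. The slide does not say how collisions are handled; keeping a list per slot (chaining) is an assumption made here, and the function name is hypothetical.

```python
def build_hash_table(keys, size=7):
    """Place each key in slot H(k) = k mod size, chaining on collisions."""
    table = [[] for _ in range(size)]
    for key in keys:
        table[key % size].append(key)
    return table

table = build_hash_table([5, 22, 25, 89, 50, 71, 99])
for slot, bucket in enumerate(table):
    print(slot, bucket)
```

With these seven keys, four of them land in slot 1 and two in slot 5, so the distribution is far from even; a real table would pick its size and hash function with the expected keys in mind.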

Signature Files

  Word          Signature
  data
  base
  management
  system
  Block signature

Which of the following blocks contain the term “database”?
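One common way to build such signatures is superimposed coding: each word sets a few bits, and the block signature is the bitwise OR of its word signatures. The sketch below assumes a 16-bit signature with up to 3 bits per word, derived from an MD5 digest; those parameters and the function names are illustrative choices, not the values used on the slide.

```python
import hashlib

SIGNATURE_BITS = 16   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word (fewer if positions coincide)

def word_signature(word):
    """Set up to BITS_PER_WORD bit positions chosen from the word's hash."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(words):
    """OR together the signatures of every word in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def may_contain(block_sig, word):
    """True if every bit of the word's signature is set in the block signature."""
    sig = word_signature(word)
    return (block_sig & sig) == sig

block = block_signature(["data", "base", "management", "system"])
print(format(block, "016b"), may_contain(block, "data"))
```

A block that contains the word always passes the test. A block that does not contain it may still pass when other words happen to set the same bits, so every match has to be confirmed against the text itself.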

Document Similarity
- Documents
  - D1 = {t11, t12, t13, …, t1n}
  - D2 = {t21, t22, t23, …, t2n}
  - Each tik is either 0 or 1.
- Simple measurement of difference/similarity:
  - w = the number of times t1k = 1, t2k = 1
  - x = the number of times t1k = 1, t2k = 0
  - y = the number of times t1k = 0, t2k = 1
  - z = the number of times t1k = 0, t2k = 0

Similarity Measure
- Cosine coefficient:
    Cosine(D1, D2) = sum_k(t1k * t2k) / sqrt(sum_k(t1k^2) * sum_k(t2k^2))
- The same as:
    Cosine(D1, D2) = w / sqrt((w+x)(w+y))

- D1's terms: n1 = w+x (the number of times t1k = 1)
- D2's terms: n2 = w+y (the number of times t2k = 1)
- Sameness count: sc = (w+z)/(n1+n2)
- Difference count: dc = (x+y)/(n1+n2)
- Rectangular distance: rd = max(n1, n2)
- Conditional probability: cp = min(n1, n2)
- Mean: mean = (n1+n2)/2

Similarity Measure
- Dice's coefficient:
  - Dice(D1, D2) = 2w/(n1+n2)
  - where w is the number of terms that D1 and D2 have in common, and n1 and n2 are the numbers of terms in D1 and D2.
- Jaccard coefficient:
  - Jaccard(D1, D2) = w/(N-z) = w/(n1+n2-w)
  - where N = w+x+y+z is the total number of terms.
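The counts w, x, y, z and the coefficients built from them can be computed with a short sketch. The two binary vectors are made-up sample data; the cosine line uses the standard cosine formula specialised to 0/1 vectors, w / sqrt(n1 * n2).

```python
from math import sqrt

def contingency(d1, d2):
    """Return (w, x, y, z) for two equal-length 0/1 term vectors."""
    pairs = list(zip(d1, d2))
    w = pairs.count((1, 1))   # term in both documents
    x = pairs.count((1, 0))   # term in D1 only
    y = pairs.count((0, 1))   # term in D2 only
    z = pairs.count((0, 0))   # term in neither
    return w, x, y, z

def dice(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return 2 * w / ((w + x) + (w + y))

def jaccard(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return w / (w + x + y)    # equals w / (n1 + n2 - w)

def cosine(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return w / sqrt((w + x) * (w + y))

d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 1]
print(contingency(d1, d2), dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```

None of the three coefficients uses z, so terms that are absent from both documents do not make the documents look more alike.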

Similarity Metric
- A metric has three defining properties:
  - Its values are non-negative
  - It is symmetric
  - It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|

Lp Metrics
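For reference, the usual definition of the Lp (Minkowski) distance between two vectors a and b is (sum over k of |a_k - b_k|^p)^(1/p). A minimal sketch follows; the function name and the sample points are assumptions made for the example.

```python
def lp_distance(a, b, p):
    """Lp (Minkowski) distance between two equal-length numeric vectors."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return max(diffs)                    # limit case: largest coordinate difference
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(lp_distance(a, b, 1))              # city-block distance
print(lp_distance(a, b, 2))              # Euclidean distance
print(lp_distance(a, b, float("inf")))   # maximum-coordinate distance
```

For p >= 1 these distances satisfy the three metric properties listed on the previous slide.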

Similarity Matrix
- Pairwise coupling of similarities among a group of documents

  S11 S12 S13 S14 S15 S16 S17 S18
  S21 S22 S23 S24 S25 S26 S27 S28
  S31 S32 S33 S34 S35 S36 S37 S38
  S41 S42 S43 S44 S45 S46 S47 S48
  S51 S52 S53 S54 S55 S56 S57 S58
  S61 S62 S63 S64 S65 S66 S67 S68
  S71 S72 S73 S74 S75 S76 S77 S78
  S81 S82 S83 S84 S85 S86 S87 S88

Document Clustering
- Grouping similar documents into different sets
  - Create a similarity matrix
  - Apply a hierarchical clustering algorithm:
    1. Identify the two closest documents and combine them into a cluster
    2. Identify the next two closest documents or clusters and combine them into a cluster
    3. If more than one cluster remains, return to step 1
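The clustering steps above can be sketched as a simple agglomerative loop over a similarity matrix. The slide does not say how the similarity between two clusters is measured; taking the highest pairwise similarity (single link) is an assumption, as are the `target_clusters` parameter and the sample matrix.

```python
def cluster(similarity, target_clusters=1):
    """Merge the most similar pair of clusters until target_clusters remain.

    similarity is a symmetric matrix (list of lists) of document similarities.
    Returns (clusters, merges), where merges records each merge in order.
    """
    clusters = [[i] for i in range(len(similarity))]
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the closest pair of members
                s = max(similarity[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

S = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
clusters, merges = cluster(S, target_clusters=2)
print(sorted(clusters))
```

Running the loop all the way down to one cluster and keeping the merge order gives the hierarchy; stopping early, as in the usage above, gives a flat grouping.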

Application of Document Clustering
- Vivisimo
  - Clusters search results on the fly
  - Hierarchical categories for drill-down capability
- AltaVista
  - Refine search:
  - Clusters related words into different groups based on their co-occurrence rates in documents.

AltaVista

ViVisimo Cluster Search Engine

Clusty.com

Concept Clusters
- Use terms' co-occurrence frequencies
  - to predict semantic relationships
  - to build concept clusters
  - to suggest search terms
- Visualization of term relationships
  - Link displays
  - Map displays
  - Drag-and-drop interface for searching
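Counting term co-occurrences, the starting point for the concept clusters above, can be sketched as follows. Treating two terms as co-occurring when they appear in the same document is an assumption, and the function names and sample documents are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each pair of terms, the documents in which both appear."""
    counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):   # pairs come out in alphabetical order
            counts[pair] += 1
    return counts

def suggest_terms(counts, term, top_n=3):
    """Return the terms that most often co-occur with the given term."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return [t for t, _ in related.most_common(top_n)]

docs = ["index search tree", "index search", "tree hashing"]
counts = cooccurrence_counts(docs)
print(counts[("index", "search")], suggest_terms(counts, "index", top_n=1))
```

The same counts can feed a term-by-term similarity matrix, which can then be clustered in the same way as documents.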