INFO Week 7 Indexing and Searching Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University
Effective Information Retrieval
- Data Structures
- Knowledge Representation
- User Interface and User Interaction
Data Structures
- Describes how text, attributes of text, and indexes are stored in memory, files, or databases.
- Describes the nature of relationships among information elements.
Data Models
- Logical data model
  - How the user views the data
  - How to represent or capture semantic (logical) relationships of data
  - How to present the relationships of data to the user
  - Independent of physical implementation and systems
- Physical data model
  - How data are actually stored in the computer
  - Techniques for improving the efficiency of data storage and access
Logical Data Models
- Linear sequential model
  - Records are arranged in a defined order.
  - Advantage: fast access
  - Disadvantage: data must be moved around when sorting
- Linkage sequential model
  - Data are arranged in the order they are inserted.
  - Each element has a link pointer to the next element.
  - Advantage: no need to move data around
  - Disadvantage: additional space for links
Hierarchical (tree) model
- Has a unique root
- Each element (except the root) has one and only one parent
- Advantage: relationships are precisely defined
- Disadvantage: describes only one type of relationship
Poly-hierarchical model
- Allows each element in the tree structure to have more than one parent
- Computationally complex
- Advantage: represents more complex relationships
- Disadvantage: possible infinite loops
Network model
- Hypertext model
- Emphasizes links and nodes
- Less formalism
- A node can have any number of links
- A node can be freely defined (nodes do not have to be of the same type)
- Advantage: flexible
- Disadvantage: lack of controls; lack of theories
Space model
- Basics of the physical space: dimensions, axes, coordinates
- Geography, physics, and rules of law
- Semantic space
  - Giving meaning to place
  - Searching for features of the space
Vector space model
- Each indexing term is an axis.
- Each document is a vector.
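To make the vector space idea concrete, here is a minimal sketch that turns a few documents into binary vectors, one axis per indexing term. The function name `build_term_vectors`, the whitespace tokenization, and the sample documents are assumptions made for illustration; they do not come from the slides.

```python
def build_term_vectors(docs):
    """Return (terms, vectors): one axis per indexing term, one binary vector per document."""
    tokenized = [set(text.lower().split()) for text in docs]
    # Every distinct word becomes one axis of the space (an assumption:
    # a real system would apply stop-word removal and other indexing rules).
    terms = sorted(set().union(*tokenized))
    vectors = [[1 if term in words else 0 for term in terms] for words in tokenized]
    return terms, vectors


# Hypothetical sample documents.
docs = ["index files point to the main file", "hashing speeds up search"]
terms, vectors = build_term_vectors(docs)
for text, vector in zip(docs, vectors):
    print(vector, text)
```

Each vector has the same length as the term list, so two documents can be compared coordinate by coordinate.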
Physical Data Structure
- How data are actually stored in the computer
- Technical implementation of data storage
- Record structures
  - Fixed-length
  - Variable-length
  - Tradeoff between speed and space
Examples: Fixed-length record
  SMITH,JON 1287 MAPLE AVE,AKRON OH, SMIT
  Field start positions: name -- 1, address -- 21, telephone -- 61, date; next record -- 97
Variable-length record:
  SMITH,JON|18287 MAPLE AVE,AKRON OH, 44444| |900315#SMIT
  Fields: 1 -- name, 2 -- address, 3 -- telephone, 4 -- date
Variable-length record (fixed header):
  00075|021|030|63|73* SMITH,JON1287 MAPLE AVE,AKRON OH, #000
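The speed/space tradeoff between the two record structures shows up in how they are read. The sketch below slices a fixed-length record by position and splits a delimited variable-length record on its separators. The widths follow the positions given in the fixed-length example (address at 21, telephone at 61, next record at 97); the `|` and `#` delimiters follow the variable-length example. The function names and the padded sample record are assumptions for illustration only.

```python
NAME_WIDTH = 20      # name occupies positions 1-20 (address starts at 21)
ADDRESS_WIDTH = 40   # address occupies positions 21-60 (telephone starts at 61)
RECORD_LENGTH = 96   # the next record starts at position 97


def parse_fixed(record):
    """Slice a fixed-length record by position; no delimiters are stored."""
    tail_start = NAME_WIDTH + ADDRESS_WIDTH
    return {
        "name": record[:NAME_WIDTH].rstrip(),
        "address": record[NAME_WIDTH:tail_start].rstrip(),
        # The date offset is not given in the example, so the tail is kept whole.
        "telephone_and_date": record[tail_start:RECORD_LENGTH].rstrip(),
    }


def parse_variable(data, field_sep="|", record_end="#"):
    """Split delimited variable-length records into lists of stripped fields."""
    records = [r for r in data.split(record_end) if r]
    return [[field.strip() for field in r.split(field_sep)] for r in records]


# Hypothetical fixed-length record, padded with blanks to the full length.
fixed = "SMITH,JON".ljust(20) + "1287 MAPLE AVE,AKRON OH".ljust(40) + "900315".ljust(36)
variable = "SMITH,JON|18287 MAPLE AVE,AKRON OH, 44444| |900315#"

print(parse_fixed(fixed))
print(parse_variable(variable))
```

The fixed-length version can jump straight to record n at offset n × 96 but stores the padding; the variable-length version stores no padding but has to scan for delimiters.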
File structures
- The focus of data structures in IR is file structures.
  - A collection of documents is called a file.
  - Each document is called a record.
- The key to file structures is the different search techniques or models for the files and the indexes of files.
Index structures
- A main file and several indexing files
- The main file is sequential and unsorted
- The indexing files are sorted and point to the main file
- Inverted files
  - How large are the inverted indexing files?
Sizes of Inverted Indexing Files

Index                   Small collection (1 MB)   Medium collection (200 MB)   Large collection (2 GB)
Addressing words        45%     73%               36%     64%                  35%     63%
Addressing documents    19%     26%               18%     32%                  26%     47%
Addressing 64 blocks    27%     41%               18%     32%                  5%      9%
Addressing 256 blocks   18%     25%               1.7%    2.4%                 0.5%    0.7%

For each collection, the first column is without stop words and the second column is with stop words.
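As an illustration of the main-file / index-file arrangement, the following sketch builds a small inverted file that addresses documents: each word maps to the record numbers that contain it. The name `build_inverted_index`, the whitespace tokenization, and the three sample records are assumptions for this example.

```python
from collections import defaultdict


def build_inverted_index(main_file):
    """Map each word to the sorted list of record numbers that contain it."""
    postings = defaultdict(set)
    for record_no, text in enumerate(main_file):
        for word in text.lower().split():
            postings[word].add(record_no)
    # The index is sorted by word; each entry points back into the unsorted main file.
    return {word: sorted(postings[word]) for word in sorted(postings)}


# Hypothetical main file: records stay in insertion order.
main_file = [
    "data base management system",
    "signature files for text search",
    "data structures for search",
]
index = build_inverted_index(main_file)
print(index["search"])  # record numbers of the documents containing "search"
```

Addressing word positions instead of whole documents would need one entry per occurrence, which is consistent with the "addressing words" row being the largest in the table.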
Searching
- Going through the list of words in the inverted indexing file sequentially takes a long time, even for a computer.
- Data structures need to be created to speed up the search:
  - Trees
  - Hash tables
  - Signature files
Trees
- Binary tree
  - Each node contains a key.
  - The left subtree stores all keys smaller than the parent key.
  - The right subtree stores all keys larger than the parent key.
- Balanced trees
  - Every parent has balanced left and right subtrees.
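The binary tree rules above can be sketched as follows. The `Node`, `insert`, and `search` names and the sample words are assumptions for illustration; the search also counts how many nodes it visits, which is the cost that balancing tries to keep small.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None   # subtree with keys smaller than self.key
        self.right = None  # subtree with keys larger than self.key


def insert(root, key):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def search(root, key):
    """Return (found, nodes_visited)."""
    visited = 0
    while root is not None:
        visited += 1
        if key == root.key:
            return True, visited
        root = root.left if key < root.key else root.right
    return False, visited


root = None
for word in ["hashing", "data", "signature", "base", "index", "tree"]:
    root = insert(root, word)

print(search(root, "index"))  # (True, 3)
print(search(root, "file"))   # (False, 2)
```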
- B-tree
  - Each node can have more than one key.
  - If a node has m keys, it has m+1 child branches.
  - All keys in branch i-1 are smaller than key i.
  - All leaves are at the same depth.
- B+ trees
  - A B-tree that stores all data in the leaves.
- Example: a B-tree of 10,000,000 keys with 50 keys per node never needs to retrieve more than 4 nodes to find any key.
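The figure of 4 nodes in the example can be checked with a short calculation. The sketch below reads "50 keys per node" as the minimum occupancy of every node except the root, which is an assumption; the slide does not say whether 50 is a minimum or a maximum. The function names are hypothetical.

```python
def min_keys(height, min_keys_per_node):
    """Fewest keys a B-tree of the given height can hold.

    Assumes the root holds at least 1 key and every other node holds at
    least min_keys_per_node keys (so it has at least that many + 1 children).
    """
    return 2 * (min_keys_per_node + 1) ** (height - 1) - 1


def max_height(n_keys, min_keys_per_node):
    """Largest height a B-tree holding n_keys keys can have."""
    height = 1
    while min_keys(height + 1, min_keys_per_node) <= n_keys:
        height += 1
    return height


print(min_keys(5, 50))             # 13530401
print(max_height(10_000_000, 50))  # 4
```

A fifth level would require at least 13,530,401 keys, so a tree with 10,000,000 keys has at most 4 levels, and a lookup reads one node per level. Under the other reading (at most 50 keys per node), four full levels hold only 51^4 - 1 = 6,765,200 keys, so the claim appears to depend on the minimum-occupancy reading.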
Procedures for Constructing Balanced Trees
1. Check if the original tree is balanced
   - Check if the left child is balanced
   - If it is not balanced, go to step 2
   - Check if the right child is balanced
   - If it is not balanced, go to step 2
2. Rotate the unbalanced tree:
   - If the left branch is deeper:
     - Move the left child of the root to become the new root
     - Move the right branch of the new root to become the left branch of the old root
     - Make the old root the right child of the new root
   - If the right branch is deeper: …
3. Go back to step 1 to check whether the new tree is balanced
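The rotation for the "left branch is deeper" case can be sketched directly from the three moves listed in step 2. The `Node` class, `height`, and `rotate_right` are names assumed for this example; the right-deeper case is left out because the slide elides it.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Number of levels in the subtree rooted at node (0 for an empty tree)."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def rotate_right(old_root):
    """Left branch is deeper: the left child of the root becomes the new root."""
    new_root = old_root.left
    old_root.left = new_root.right  # right branch of new root -> left branch of old root
    new_root.right = old_root       # old root -> right child of new root
    return new_root


# A left-leaning chain 3 -> 2 -> 1 with three levels.
unbalanced = Node(3, left=Node(2, left=Node(1)))
print(height(unbalanced))  # 3
balanced = rotate_right(unbalanced)
print(balanced.key, balanced.left.key, balanced.right.key, height(balanced))  # 2 1 3 2
```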
B+ tree:
  Root:   (F, M)
  Leaves: (Ap, Bs, E) (Gr, H, L) (P, Ru, T)
Direct-Access Structures
- Hashing
  - Evenly distributes a long list into a short list using a hashing function
  - The remainder after division by a prime number is a common hashing function
- Example:
  - Hashing function: H(k) = k mod 7
  - Put the following numbers into the hash table: 5, 22, 25, 89, 50, 71, 99
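The hashing exercise can be worked through with a few lines of code. Collisions are handled here by chaining (each slot holds a list), which is an assumption; the slide does not say how collisions are resolved. `build_hash_table` is a hypothetical name.

```python
def build_hash_table(keys, size=7):
    """Place each key in slot H(k) = k mod size, chaining keys that collide."""
    table = [[] for _ in range(size)]
    for k in keys:
        table[k % size].append(k)
    return table


table = build_hash_table([5, 22, 25, 89, 50, 71, 99])
for slot, bucket in enumerate(table):
    print(slot, bucket)
# Slot 1 holds [22, 50, 71, 99], slot 4 holds [25], slot 5 holds [5, 89];
# the other slots stay empty.
```

Four of the seven keys land in slot 1, so this particular key set is not spread evenly by H(k) = k mod 7.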
Signature Files
[Table: word signatures for "data", "base", "management", "system", and the resulting block signature]
Which of the following blocks contain the term “database”?
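A signature-file lookup of this kind can be sketched with superimposed coding: each word sets a few bits, and a block signature is the bitwise OR of its word signatures. The signature width, the number of bits per word, and the use of SHA-256 to pick bit positions are all assumptions for illustration; the actual signatures on the slide are not available in the text.

```python
import hashlib

SIGNATURE_BITS = 16  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bit positions chosen per word


def word_signature(word):
    """Set up to BITS_PER_WORD bits chosen by hashing the word."""
    signature = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{word}:{i}".encode()).digest()
        signature |= 1 << (digest[0] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """Superimpose (bitwise OR) the signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def may_contain(block_sig, word):
    """True if every bit of the word signature is set in the block signature.

    A True result can be a false match and must be verified against the
    text; a False result means the word is definitely not in the block.
    """
    word_sig = word_signature(word)
    return (block_sig & word_sig) == word_sig


block = block_signature(["data", "base", "management", "system"])
print(format(block, "016b"))
print(may_contain(block, "data"))
```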
Document Similarity
- Documents
  - D1 = {t11, t12, t13, …, t1n}
  - D2 = {t21, t22, t23, …, t2n}
  - tik is either 0 or 1.
- Simple measurement of difference/similarity:
  - w = the number of times t1k = 1, t2k = 1
  - x = the number of times t1k = 1, t2k = 0
  - y = the number of times t1k = 0, t2k = 1
  - z = the number of times t1k = 0, t2k = 0
Similarity Measure
- Cosine coefficient: Cosine(D1, D2) = Σk (t1k · t2k) / sqrt(Σk t1k² · Σk t2k²)
- The same as: Cosine(D1, D2) = w / sqrt((w+x)(w+y)) when every tik is 0 or 1
- D1's terms only: n1 = w+x (the number of times t1k = 1)
- D2's terms only: n2 = w+y (the number of times t2k = 1)
- Sameness count: sc = (w+z)/(n1+n2)
- Difference count: dc = (x+y)/(n1+n2)
- Rectangular distance: rd = max(n1, n2)
- Conditional probability: cp = min(n1, n2)
- Mean: mean = (n1+n2)/2
Similarity Measure
- Dice's Coefficient: Dice(D1, D2) = 2w/(n1+n2), where w is the number of terms that D1 and D2 have in common, and n1, n2 are the numbers of terms in D1 and D2.
- Jaccard Coefficient: Jaccard(D1, D2) = w/(N-z) = w/(n1+n2-w), where N = w+x+y+z is the total number of terms.
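The counts and coefficients above can be computed for a pair of binary term vectors as in the following sketch. The function names and the two sample vectors are assumptions for illustration; the cosine function uses the standard binary form w / sqrt(n1 · n2).

```python
from math import sqrt


def contingency(d1, d2):
    """Return (w, x, y, z) for two equal-length binary term vectors."""
    pairs = list(zip(d1, d2))
    w = sum(1 for a, b in pairs if a == 1 and b == 1)
    x = sum(1 for a, b in pairs if a == 1 and b == 0)
    y = sum(1 for a, b in pairs if a == 0 and b == 1)
    z = sum(1 for a, b in pairs if a == 0 and b == 0)
    return w, x, y, z


def dice(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return 2 * w / ((w + x) + (w + y))   # 2w / (n1 + n2)


def jaccard(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return w / (w + x + y)               # w / (n1 + n2 - w) = w / (N - z)


def cosine(d1, d2):
    w, x, y, _ = contingency(d1, d2)
    return w / sqrt((w + x) * (w + y))   # w / sqrt(n1 * n2)


# Hypothetical documents over six indexing terms.
d1 = [1, 1, 0, 1, 0, 0]
d2 = [1, 0, 1, 1, 0, 1]
print(contingency(d1, d2))  # (2, 1, 2, 1)
print(dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```

Here n1 = 3 and n2 = 4, so Dice is 4/7, Jaccard is 2/5, and the cosine is 2/sqrt(12).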
Similarity Metric
- A metric has three defining properties:
  - Its values are non-negative.
  - It is symmetric.
  - It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|
Lp Metrics
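The formula for this slide is not present in the text. Assuming the standard definition of the Lp (Minkowski) family, Lp(A, B) = (Σk |ak - bk|^p)^(1/p), a minimal sketch looks like this; the function names and sample points are hypothetical.

```python
def lp_distance(a, b, p):
    """Lp distance between two equal-length vectors, for p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def l_inf_distance(a, b):
    """Limit of the Lp distance as p grows: the largest coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


a, b = [0, 0], [3, 4]
print(lp_distance(a, b, 1))   # 7.0 (city-block distance)
print(lp_distance(a, b, 2))   # 5.0 (Euclidean distance)
print(l_inf_distance(a, b))   # 4
```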
Similarity Matrix
- Pairwise coupling of similarities among a group of documents

  S11 S12 S13 S14 S15 S16 S17 S18
  S21 S22 S23 S24 S25 S26 S27 S28
  S31 S32 S33 S34 S35 S36 S37 S38
  S41 S42 S43 S44 S45 S46 S47 S48
  S51 S52 S53 S54 S55 S56 S57 S58
  S61 S62 S63 S64 S65 S66 S67 S68
  S71 S72 S73 S74 S75 S76 S77 S78
  S81 S82 S83 S84 S85 S86 S87 S88
Document clustering
- Grouping similar documents into different sets
  - Create a similarity matrix
  - Apply a hierarchical clustering algorithm:
    1. Identify the two closest documents and combine them into a cluster
    2. Identify the next two closest documents or clusters and combine them into a cluster
    3. If more than one cluster remains, return to step 1
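The clustering steps above can be sketched as a simple agglomerative loop over a similarity matrix. How the similarity between two clusters is measured is not stated on the slide; this sketch assumes single link (the similarity of the closest pair of members). The function name and the 4 × 4 matrix are made up for illustration.

```python
def hierarchical_cluster(sim):
    """Repeatedly merge the two most similar clusters; return the merge history."""
    clusters = [[i] for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link (an assumption): use the most similar pair of members.
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
for left, right, s in hierarchical_cluster(sim):
    print(left, "+", right, "at similarity", s)
```

Documents 0 and 1 merge first, then 2 and 3, and finally the two clusters merge at similarity 0.3; the merge history is the hierarchy that a drill-down display would show.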
Application of Document Clustering
- Vivisimo
  - Clusters search results on the fly
  - Hierarchical categories for drill-down capability
- AltaVista
  - Refine search: clusters related words into different groups based on their co-occurrence rates in documents
AltaVista
ViVisimo Cluster Search Engine
Clusty.com
Concept Clusters
- Use terms' co-occurrence frequencies
  - to predict semantic relationships
  - to build concept clusters
  - to suggest search terms
- Visualization of term relationships
  - Link displays
  - Map displays
  - Drag-and-drop interface for searching
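A minimal sketch of using co-occurrence frequencies to suggest search terms: count how often each pair of terms appears in the same document, then rank the partners of a query term. The function names, the whitespace tokenization, and the sample documents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each pair of terms, the number of documents containing both."""
    counts = Counter()
    for text in docs:
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


def suggest_terms(term, counts, top=3):
    """Return the terms that co-occur most often with the given term."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return [t for t, _ in related.most_common(top)]


# Hypothetical documents.
docs = [
    "inverted index search",
    "index search tree",
    "hashing search",
]
counts = cooccurrence_counts(docs)
print(counts[("index", "search")])             # 2
print(suggest_terms("search", counts, top=1))  # ['index']
```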