1
Introduction to Digital Libraries: Information Retrieval
2
Sample Statistics of Text Collections
– Dialog: claims to have >12 terabytes of data in >600 databases, >800 million unique records
– LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers
3
Information Retrieval
Motivation:
– the larger the holdings of the archive, the more useful it is
– however, the larger the holdings, the harder it is to find what you want
4
Simple IR Model
[diagram: the User issues a Query; Pre-Processing (Stemming, Thesaurus; Boolean/Vector query forms) feeds Searching (Boolean, Vector, Signature) against Storage (Flat Files, Inverted Files, Signature Files, PAT Trees) built from the Collection by processing (Stemming, Stoplist); Post-Processing (Ranking, Clustering, Weighting, Feedback) produces the Results]
5
The IR Problem
In libraries, a document has external attributes and an internal attribute (its content), e.g.:
– ISBN: 0-201-12227-8
– Author: Salton, Gerard
– Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
– Publisher: Addison-Wesley
– Date: 1989
Search by external attributes = search in a DB
IR: search by content
6
Basic Concepts
– A document is described by a set of representative keywords (index terms)
– Keywords may have binary weights or weights calculated from statistics of their frequency in text
– Retrieval is a 'matching' process between document keywords and words in queries
7
IR Outline
– Index Storage: flat files, inverted files, signature files, PAT trees
– Processing: stemming, stop-words
– Searching & Queries: Boolean, vector (including ranking, weighting, feedback)
– Results: clustering
8
Flat Files Index
– Simple files; no additional processing or storage needed
– Worst-case keyword search time: O(DW), where D = # of documents and W = # of words per document (linear search)
– Clearly only acceptable for small collections
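A minimal sketch of the linear scan this implies (the collection and all names here are illustrative, not from the slides):

```python
def flat_file_search(documents, keyword):
    """Naive flat-file search: visits every word of every document, O(D*W)."""
    matches = []
    for doc_id, text in enumerate(documents):
        if keyword in text.split():   # scans all W words of this document
            matches.append(doc_id)
    return matches

docs = ["the quick brown fox", "information retrieval systems", "fox hunting"]
print(flat_file_search(docs, "fox"))  # [0, 2]
```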
9
Inverted Files
– All input files are read, and a list of which words appear in which documents (records) is made
– Extra space required can be up to 100% of the original input files
– Worst-case keyword search time is now O(log(DW))
– Almost all indexing systems in popular use rely on inverted files
10
Sample Inverted File [figure]
11
Structure of an Inverted Index
The index may store a hierarchical set of addresses, e.g. word number within sentence number within paragraph number within chapter number within volume number within document number. Each occurrence can be considered a vector (d, v, c, p, s, w).
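As a sketch, such an address could be modeled as a tuple type; the field names follow the slide's (d, v, c, p, s, w) vector, but the class itself is illustrative:

```python
from typing import NamedTuple

class Occurrence(NamedTuple):
    """Hierarchical word address within a collection."""
    d: int  # document number
    v: int  # volume number within document
    c: int  # chapter number within volume
    p: int  # paragraph number within chapter
    s: int  # sentence number within paragraph
    w: int  # word number within sentence

# Tuples compare lexicographically, so occurrences sort in reading order.
occs = sorted([Occurrence(2, 1, 3, 4, 1, 7), Occurrence(1, 1, 1, 1, 1, 1)])
```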
12
Inverted File Index
Stores the appearance of terms in documents (like the index of a book), as (document-ID, position in the doc) postings:

alphabet        (15,42); (26,186); (31,86)
database        (41,10)
index           (15,76); (51,164); (76,641); (81,64)
information     (16,76)
retrieval       (16,88)
semistructured  (5,61); (15,174); (25,41)
XML             (1,108); (2,65); (15,741); (21,421)
XPath           (5,90); (21,301)

Answers queries like "xml AND index" or "information NEAR retrieval".
But: not suitable for evaluating path expressions.
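A minimal sketch of how such an index could be built and queried (the sample documents, helper names, and the NEAR window k are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def and_query(index, t1, t2):
    """Documents containing both terms: intersect the doc-id sets."""
    docs1 = {d for d, _ in index.get(t1, [])}
    docs2 = {d for d, _ in index.get(t2, [])}
    return docs1 & docs2

def near_query(index, t1, t2, k=3):
    """Documents where the two terms occur within k positions of each other."""
    hits = set()
    for d1, p1 in index.get(t1, []):
        for d2, p2 in index.get(t2, []):
            if d1 == d2 and abs(p1 - p2) <= k:
                hits.add(d1)
    return hits

docs = {15: "xml index structures for xml retrieval",
        16: "information retrieval basics"}
idx = build_inverted_index(docs)
print(and_query(idx, "xml", "index"))               # {15}
print(near_query(idx, "information", "retrieval"))  # {16}
```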
13
An Inverted File [figure: searches for "databases" and "microsoft"]
14
Other Indexing Structures
Signature files:
– Each document has an associated signature, generated by hashing each term it contains
– Leads to possible matches; further processing is needed to resolve them
Bitmaps:
– One-to-one hash function; each distinct term in the collection has a bit vector with one bit per document
– A special case of signature files; storage-expensive
15
Signature Files
– Signature size: the number of bits in a signature, F
– Word signature: a bit pattern of size F with exactly m bits set to 1 and the others 0
– Block: a sequence of text that contains D distinct words
– Block signature: the logical OR of all the word signatures in a block of text
16
Signature File
– Each document is divided into "logical blocks": pieces of text that contain a constant number D of distinct, non-common words
– Each word yields a "word signature", a bit pattern of size F with m bits set to 1 and the rest to 0
– F and m are design parameters
17
Sample Signature File [figure: D=2, F=12, m=4]
18
Example (D=4, F=20, m=1):

data             0000 0000 0000 0010 0000
base             0000 0001 0000 0000 0000
management       0000 1000 0000 0000 0000
system           0000 0000 0000 0000 1000
----------------------------------------
block signature  0000 1001 0000 0010 1000
19
Signature File
Searching:
– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1
– False drop: the probability that the signature test "fails"; a word signature may match the block signature although the word is not in the block (a "false hit" or "false drop")
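A sketch of how word and block signatures might be generated and tested. The MD5-based bit selection and all helper names are assumptions for illustration; the slides do not specify a hashing scheme:

```python
import hashlib

F, m = 20, 1  # design parameters (slide example: F = 20 bits, m = 1 bit per word)

def word_signature(word):
    """Hash a word to a bit pattern of size F with exactly m bits set to 1."""
    bits, i = set(), 0
    while len(bits) < m:               # rehash until m distinct positions found
        digest = hashlib.md5(f"{word}:{i}".encode()).hexdigest()
        bits.add(int(digest, 16) % F)
        i += 1
    sig = 0
    for b in bits:
        sig |= 1 << b
    return sig

def block_signature(words):
    """Block signature = logical OR of all word signatures in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig, word):
    """Signature test: every 1-bit of the word signature must be set in the
    block signature. True may still be a false drop, so the actual block
    text must be scanned to confirm."""
    w = word_signature(word)
    return block_sig & w == w

bsig = block_signature(["data", "base", "management", "system"])
print(maybe_contains(bsig, "data"))   # True
print(maybe_contains(bsig, "zebra"))  # likely False, but could be a false drop
```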
20
Sistrings
Original text: "The traditional approach for searching a regular expression…"
Sistrings (one per starting position):
1. "The traditional approach for searching …"
2. "he traditional approach for searching a…"
3. "e traditional approach for searching a …"
…
12. "onal approach for searching a regular …"
21
Sistrings
"Once upon a time, in a far away land…"
– sistring 1: Once upon a time…
– sistring 2: nce upon a time…
– sistring 8: on a time, in a…
– sistring 11: a time, in a far…
– sistring 22: a far away land…
22
PAT Trees
A PAT (suffix) tree of a string S is a compacted trie that represents all substrings of S, i.e. all its semi-infinite strings (sistrings). A PAT tree is a Patricia tree constructed over all the possible sistrings of a document:
– bits of the key decide branching: 0 branches to the left subtree, 1 branches to the right subtree
– each internal node records which bit of the key to inspect
– at a leaf node, check any skipped bits against the actual text
23
PATRICIA TREE
A particular type of "trie". [figure: a plain trie and the corresponding Patricia tree for the keys '010', '011', and '101']
24
PAT Tree
[figure: PAT tree over the text 01100100010111… (positions 1, 2, 3, …), with sistrings 1–8 already indexed; each leaf gives the sistring position to check; example query: 00101]
25
Try to build the Patricia tree for the following keys (each code is the letter's position in the alphabet, in 5-bit binary):

A 00001
S 10011
E 00101
R 10010
C 00011
H 01000
I 01001
N 01110
G 00111
X 11000
M 01101
P 10000
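For intuition, here is an uncompressed binary trie over these 5-bit keys; a real Patricia/PAT tree additionally collapses single-child chains and stores skip counters, which this sketch deliberately omits:

```python
class Node:
    def __init__(self):
        self.children = [None, None]  # index 0 = left subtree, 1 = right subtree
        self.key = None               # set at leaves

def insert(root, key, bits):
    """Insert a key by branching on successive bits (0 = left, 1 = right)."""
    node = root
    for b in bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.key = key

def lookup(root, bits):
    """Follow the bits of the query; return the stored key or None."""
    node = root
    for b in bits:
        node = node.children[int(b)]
        if node is None:
            return None
    return node.key

keys = {"A": "00001", "S": "10011", "E": "00101", "R": "10010",
        "C": "00011", "H": "01000", "I": "01001", "N": "01110",
        "G": "00111", "X": "11000", "M": "01101", "P": "10000"}

root = Node()
for k, bits in keys.items():
    insert(root, k, bits)
print(lookup(root, "00101"))  # E
```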
26
PAT Tree
[figure: the resulting Patricia tree over the keys A, E, S, R, X, C, H, G, I, N, M, P]
27
Example
Text:       01100100010111 …
sistring 1: 01100100010111 …
sistring 2: 1100100010111 …
sistring 3: 100100010111 …
sistring 4: 00100010111 …
sistring 5: 0100010111 …
sistring 6: 100010111 …
sistring 7: 00010111 …
sistring 8: 0010111 …
[figure: PAT tree over sistrings 1–8; external nodes hold the sistring as an integer displacement into the text; internal nodes hold a skip counter & pointer giving the total displacement of the bit to be inspected]
28
SISTRING
The bit level is too abstract and application-dependent; we rarely apply this at the bit level. The character level is a better idea. E.g. for 'CUHK', the corresponding sistrings are:
– CUHK000…
– UHK000…
– HK000…
– K000…
We require each sistring to be at least 4 characters long; that is why we pad with 0/NULL at the end.
29
SISTRING (USAGE)
We may instead store the sistrings of 'CUHK', which requires O(n²) storage:
– CUHK ← represents C, CU, CUH, CUHK at the same time
– UHK0 ← represents U, UH, UHK at the same time
– HK00 ← represents H, HK at the same time
– K000 ← represents K only
A prefix match on sistrings is equivalent to an exact match on substrings.
Conclusion: sistrings are a better representation for storing substring information.
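A small sketch of character-level sistrings and the prefix-match equivalence; the NUL padding follows the slides, while the function names are illustrative:

```python
def sistrings(text, min_len=4, pad="\0"):
    """All semi-infinite strings of `text`, padded with NULs to at least min_len."""
    out = []
    for i in range(len(text)):
        s = text[i:]
        out.append(s + pad * max(0, min_len - len(s)))
    return out

# Prefix matching on sistrings = exact matching on substrings:
# "UH" is a substring of "CUHK" iff some sistring starts with "UH".
sis = sistrings("CUHK")  # ['CUHK', 'UHK\x00', 'HK\x00\x00', 'K\x00\x00\x00']
print(any(s.startswith("UH") for s in sis))  # True
```

Storing the padded sistrings explicitly is exactly the O(n²) representation above; a PAT tree reaches O(n) by keeping only pointers into the text.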
30
PAT Tree (Example)
By digitizing the string, we can visualize by hand what the PAT tree looks like. [figure: the actual bit patterns of the four sistrings]
31
PAT Tree (Example)
This works! BUT we still need O(n²) memory for storing those sistrings. We may reduce the memory to O(n) by storing pointers into the text instead.
32
Space/Time Tradeoffs
[figure: space vs. time plot comparing flat files, signature files, inverted files, and PAT trees]
33
Stemming
Reason: different word forms may bear similar meaning (e.g. search, searching); create a "standard" representation for them.
Stemming: removing some endings of words, e.g.
computer, compute, computes, computing, computed, computation → comput
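A toy suffix-stripping stemmer that reproduces the "comput" example; it is only a stand-in for a real stemmer such as Porter's algorithm, which the slides do not specify:

```python
SUFFIXES = ["ation", "ing", "ers", "er", "es", "ed", "s", "e"]

def naive_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 letters."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:len(word) - len(suf)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: naive_stem(w) for w in words})  # all map to 'comput'
```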
34
Inverted File, Stemmed [figure]
35
Stemming
– am, are, is → be
– car, cars, car's, cars' → car
– "the boy's cars are different colors" → "the boy car be differ color"
36
Stemming
– Manual or automatic
– Can reduce index files by up to 50%
– Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall
37
Stopwords
Stopwords exist in stoplists or negative dictionaries.
Idea: remove words with low semantic content; the index should only contain "important stuff".
What not to index is domain-dependent, but often includes:
– "small" words: a, and, the, but, of, an, very, etc.
– case (removed by case folding)
– punctuation
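A possible preprocessing pass combining case folding, punctuation removal, and a stoplist; the stoplist contents and function name are illustrative, not from the slides:

```python
import string

STOPLIST = {"a", "an", "and", "the", "but", "of", "very"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stoplist words before indexing."""
    tokens = text.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return [t for t in tokens if t not in STOPLIST]

print(preprocess("The boy's cars are very different colors!"))
# ['boys', 'cars', 'are', 'different', 'colors']
```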
38
Stop Words
Very common words that have no discriminatory power (e.g., in Arabic: في 'in', من 'from', إلى 'to', …)
39
Normalization
Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
– U.S.A vs USA
– anti-discriminatory vs antidiscriminatory
– car vs automobile?
40
Capitalization/Case Folding
Good for:
– allowing instances of "Automobile" at the beginning of a sentence to match a query for "automobile"
– helping a search engine when most users type "ferrari" while interested in a Ferrari car
Bad for:
– proper names vs common nouns, e.g. General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of a sentence; otherwise determine true casing via machine learning.
41
Performance of Search
Three major classes of performance measures:
– precision / recall: see the TREC conference series, http://trec.nist.gov/
– space / time: see Esler & Nelson, JNCA, for an example, http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability: probably the most important measure, but largely ignored
42
Precision and Recall
Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved)
Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the database)
43
Standard Evaluation Measures
Start with a CONTINGENCY table:

                 relevant    not relevant
retrieved        w           x              n1 = w + x
not retrieved    y           z
                 n2 = w + y                 N
44
Precision and Recall
Recall = w / (w + y): from all the documents that are relevant out there, how many did the IR system retrieve?
Precision = w / (w + x): from all the documents that are retrieved by the IR system, how many are relevant?
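Both measures follow directly from the retrieved and relevant document sets; a small illustrative sketch (the sample sets are made up):

```python
def precision_recall(retrieved, relevant):
    """Compute precision = w/(w+x) and recall = w/(w+y) from the
    contingency table, where w = |retrieved ∩ relevant|."""
    w = len(retrieved & relevant)
    precision = w / len(retrieved) if retrieved else 0.0
    recall = w / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}       # documents the system returned
relevant = {2, 4, 5, 6, 7}     # documents actually relevant in the database
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```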
45
User-Centered IR Evaluation
– More user-oriented measures: satisfaction, informativeness
– Other types of measures: time, cost-benefit, error rate, task analysis
– Evaluation of user characteristics
– Evaluation of the interface
– Evaluation of the process or interaction