
1 Introduction to Digital Libraries Information Retrieval

2 Sample Statistics of Text Collections
Dialog: claims >12 terabytes of data in >600 databases, >800 million unique records
LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers

3 Information Retrieval
Motivation
– the larger the holdings of the archive, the more useful it is
– however, it is also harder to find what you want

4 Simple IR Model
[Diagram: a user issues a Query, which passes through Pre-Processing, Searching (against Storage), and Post-Processing to produce Results; the Collection is processed into Storage. Technique labels in the figure: Boolean, Vector, Stemming, Thesaurus, Signature, Ranking, Clustering, Weighting, Feedback, Flat Files, Inverted Files, Signature Files, PAT Trees, Stoplist]

5 IR Problem
In libraries:
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
A document has external attributes and an internal attribute (content)
Search by external attributes = search in a DB; IR = search by content

6 Basic Concepts
A document is described by a set of representative keywords (index terms)
Keywords may have binary weights or weights calculated from statistics of their frequency in the text
Retrieval is a 'matching' process between document keywords and words in queries
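A rough sketch of these concepts (the document text, tokenizer, and scoring function are all invented for illustration): keyword weights can be binary or derived from term-frequency statistics, and retrieval reduces to matching query terms against weighted document keywords.

from collections import Counter

# Sketch: reduce a document to index terms, then weight them.
def keywords(text):
    return [w.strip(".,").lower() for w in text.split()]

doc = "Retrieval is a matching process between document keywords and query keywords"
counts = Counter(keywords(doc))
total = sum(counts.values())

binary_weights = {term: 1 for term in counts}                  # binary weighting
tf_weights = {term: c / total for term, c in counts.items()}   # frequency-based weighting

query = {"retrieval", "keywords"}
score = sum(tf_weights.get(t, 0.0) for t in query)             # naive query-document match
print(round(score, 3))   # -> 0.273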

7 IR Outline
Index Storage
– flat files, inverted files, signature files, PAT trees
Processing
– stemming, stop-words
Searching & Queries
– Boolean, vector (including ranking, weighting, feedback)
Results
– clustering

8 Flat Files Index
Simple files; no additional processing or storage needed
Worst-case keyword search time: O(DW)
– D = # of documents
– W = # of words per document
– linear search
Clearly only acceptable for small collections
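A minimal sketch of this linear scan (the two-document collection is invented): with no index, every word of every document is examined, giving the O(D·W) bound above.

# Linear scan over a flat-file collection: O(D * W), no index at all.
def flat_file_search(docs, keyword):
    hits = []
    for doc_id, text in enumerate(docs):     # D documents
        for word in text.lower().split():    # W words per document
            if word == keyword:
                hits.append(doc_id)
                break                        # one hit per document is enough
    return hits

docs = ["signature files hash terms", "inverted files list postings"]
print(flat_file_search(docs, "files"))   # -> [0, 1]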

9 Inverted Files
All input files are read, and a list of which words appear in which documents (records) is made
Extra space required can be up to 100% of the original input files
Worst-case keyword search time is now O(log(DW))
Almost all indexing systems in popular use employ inverted files

10 Sample Inverted File

11 Structure of Inverted Index
May be a hierarchical set of addresses, e.g. word number within sentence number, within paragraph number, within chapter number, within volume number, within document number
Consider it as a vector (d, v, c, p, s, w): document, volume, chapter, paragraph, sentence, word number

12 Inverted File Index
Stores appearances of terms in documents (like the index of a book), as (document-ID, position in the doc) pairs:
alphabet        (15,42); (26,186); (31,86)
database        (41,10)
index           (15,76); (51,164); (76,641); (81,64)
information     (16,76)
retrieval       (16,88)
semistructured  (5,61); (15,174); (25,41)
XML             (1,108); (2,65); (15,741); (21,421)
XPath           (5,90); (21,301)
Answers queries like "xml AND index" or "information NEAR retrieval"
But: not suitable for evaluating path expressions
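A minimal sketch of a positional inverted index like the one above, able to answer AND and NEAR queries (the two-document collection and the distance threshold k are illustrative assumptions):

from collections import defaultdict

# Positional inverted index: term -> list of (document-ID, position) pairs.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].append((doc_id, pos))
    return index

def and_query(index, t1, t2):
    d1 = {d for d, _ in index.get(t1, [])}
    d2 = {d for d, _ in index.get(t2, [])}
    return d1 & d2

def near_query(index, t1, t2, k=3):
    # documents where t1 and t2 occur within k positions of each other
    return {d1 for d1, p1 in index.get(t1, [])
               for d2, p2 in index.get(t2, [])
               if d1 == d2 and abs(p1 - p2) <= k}

docs = ["xml index structures", "information storage and retrieval"]
idx = build_index(docs)
print(and_query(idx, "xml", "index"))               # -> {1}
print(near_query(idx, "information", "retrieval"))  # -> {2}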

13 An Inverted File
Search for:
– "databases"
– "microsoft"

14 Other Indexing Structures
Signature files
– each document has an associated signature, generated by hashing each term it contains
– leads to possible matches; further processing is needed to resolve them
Bitmaps
– one-to-one hash function; each distinct term in the collection has a bit vector with one bit per document
– a special case of signature file; storage expensive
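A sketch of the bitmap special case (the collection is hypothetical): each distinct term gets one bit per document, here packed into Python integers, so a Boolean AND over terms is a single bitwise AND.

# Bitmap index: one bit vector per distinct term, one bit per document.
def build_bitmaps(docs):
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

docs = ["database systems", "database indexing", "signature files"]
bm = build_bitmaps(docs)
both = bm["database"] & bm.get("indexing", 0)   # Boolean AND is a bitwise AND
print([d for d in range(len(docs)) if both & (1 << d)])   # -> [1]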

15 Signature Files
Signature size: the number of bits in a signature, F
Word signature: a bit pattern of size F with exactly m bits set to 1 and the others 0
Block: a sequence of text that contains D distinct words
Block signature: the logical OR of all the word signatures in a block of text

16 Signature File
Each document is divided into "logical blocks": pieces of text that contain a constant number D of distinct, non-common words
Each word yields a "word signature", a bit pattern of size F with m bits set to 1 and the rest to 0
– F and m are design parameters

17 Sample Signature File
[Figure: D=2, F=12, m=4]

18 Signature File Example
data            0000 0000 0000 0010 0000
base            0000 0001 0000 0000 0000
management      0000 1000 0000 0000 0000
system          0000 0000 0000 0000 1000
----------------------------------------
block signature 0000 1001 0000 0010 1000
Figure: D=4, F=20, m=1

19 Signature File
Searching
– examine each block signature for 1's in those bit positions where the signature of the search word has a 1
False drop
– the probability that the signature test will "fail", creating a "false hit" or "false drop"
– a word signature may match the block signature even though the word is not in the block; this is a false hit
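A sketch of the scheme from slides 15-19, assuming arbitrary values for the design parameters F and m and an md5-based way of deriving bit positions: word signatures are OR-ed into a block signature, and a matching query signature only yields a candidate block, because of false drops.

import hashlib

F, m = 20, 2   # design parameters (illustrative values)

# Derive m bit positions for a word from hashes of the word.
def word_signature(word):
    sig = 0
    for i in range(m):
        h = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % F)
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)   # logical OR of all word signatures
    return sig

block = ["data", "base", "management", "system"]
bsig = block_signature(block)

for query in ["data", "retrieval"]:
    qsig = word_signature(query)
    if qsig & bsig == qsig:   # all query bits set: candidate only (may be a false drop)
        print(query, "-> possible match, verify by scanning the block")
    else:
        print(query, "-> definitely not in the block")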

20 Sistrings
Original text: "The traditional approach for searching a regular expression…"
Sistrings:
1. "The traditional approach for searching …"
2. "he traditional approach for searching a…"
3. "e traditional approach for searching a …"
12. "onal approach for searching a regular …"

21 Sistrings
Once upon a time, in a far away land...
– sistring 1: Once upon a time...
– sistring 2: nce upon a time...
– sistring 8: on a time, in a...
– sistring 11: a time, in a far...
– sistring 22: a far away land...

22 PAT Trees
PAT tree: a Patricia tree constructed over all the possible sistrings of a document
– bits of the key decide branching: 0 branches to the left subtree, 1 branches to the right subtree
– an internal node decides which bit of the key to use
– at a leaf node, check any skipped bits
The PAT (suffix) tree of a string S is a compacted trie that represents all substrings of S as semi-infinite strings (sistrings).

23 PATRICIA TREE
A particular type of "trie"
Example: a trie and a Patricia tree with content '010', '011', and '101'

24 PAT Tree
[Figure: a PAT tree over the text 01100100010111... (positions 1 2 3 4 5 6 7 8 9 ...), with sistrings 1-8 already indexed; leaves give the sistring, i.e. the position to check. Query: 00101]

25 Try to Build the Patricia Tree
A 00001    S 10011    E 00101    R 10010
C 00011    H 01000    I 01001    N 01110
G 00111    X 11000    M 01101    P 10000
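A simplified sketch of building a Patricia trie over these 5-bit codes (not the full PAT tree with explicit skip counters): each internal node stores the bit index on which its subtrees first differ, 0 branching left and 1 branching right, and a search verifies the skipped bits at the leaf.

# Simplified binary Patricia trie over fixed-length, distinct bit strings.
def build(keys, bit=0):
    if len(keys) == 1:
        return keys[0]                            # leaf: the key itself
    while all(k[bit] == keys[0][bit] for k in keys):
        bit += 1                                  # skip bits where all keys agree
    left  = [k for k in keys if k[bit] == "0"]
    right = [k for k in keys if k[bit] == "1"]
    return (bit, build(left, bit + 1), build(right, bit + 1))

def search(node, key):
    while isinstance(node, tuple):                # internal node: (bit, left, right)
        bit, left, right = node
        node = left if key[bit] == "0" else right
    return node == key                            # leaf: verify any skipped bits

codes = {"A": "00001", "S": "10011", "E": "00101", "R": "10010",
         "C": "00011", "H": "01000", "I": "01001", "N": "01110",
         "G": "00111", "X": "11000", "M": "01101", "P": "10000"}

tree = build(sorted(codes.values()))
print(search(tree, codes["H"]))   # -> True
print(search(tree, "11111"))      # -> False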

26 PAT Tree
[Figure: the resulting Patricia tree, with leaves A, E, S, R, X, C, H, G, I, N, M, P]

27 Example
Text:       01100100010111 …
sistring 1: 01100100010111 …
sistring 2: 1100100010111 …
sistring 3: 100100010111 …
sistring 4: 00100010111 …
sistring 5: 0100010111 …
sistring 6: 100010111 …
sistring 7: 00010111 …
sistring 8: 0010111 …
[Figure: the PAT tree grown step by step as sistrings 1-8 are inserted; an external node holds a sistring (integer displacement), an internal node holds a skip counter & pointer giving the total displacement of the bit to be inspected]

28 SISTRING
The bit level is too abstract and application dependent; we rarely work at the bit level. The character level is a better idea!
– e.g. CUHK
– the corresponding sistrings would be CUHK000… UHK000… HK000… K000…
– we require that each be at least 4 characters long
– (why do we pad 0/NULL at the end of a sistring?)

29 SISTRING (USAGE)
We may instead store the sistrings of 'CUHK', which requires O(n^2) storage:
– CUHK <- represents C, CU, CUH, CUHK at the same time
– UHK0 <- represents U, UH, UHK at the same time
– HK00 <- represents H, HK at the same time
– K000 <- represents K only
A prefix match on sistrings is equivalent to an exact match on the substrings.
Conclusion: sistrings are a better representation for storing substring information.
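A sketch of the equivalence claimed here, using the 'CUHK' example from the slide: every substring of the text is a prefix of some padded sistring, so a substring test becomes a prefix test.

# Sistrings: every substring of the text is a prefix of some sistring.
def sistrings(text, min_len=4):
    out = []
    for i in range(len(text)):
        s = text[i:]
        out.append(s + "0" * (min_len - len(s)))   # pad short tails with 0/NULL
    return out

def contains(text, pattern):
    # exact substring match == prefix match on some sistring
    return any(s.startswith(pattern) for s in sistrings(text))

print(sistrings("CUHK"))        # -> ['CUHK', 'UHK0', 'HK00', 'K000']
print(contains("CUHK", "UH"))   # -> True
print(contains("CUHK", "HU"))   # -> False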

30 PAT Tree (Example)
By digitizing the string, we can manually visualize what the PAT tree would look like. Following is the actual bit pattern of the four sistrings.

31 PAT Tree (Example)
This works! BUT…
– we still need O(n^2) memory to store those sistrings
We may reduce the memory to O(n) by making use of pointers.

32 Space/Time Tradeoffs
[Figure: a space vs. time plot placing flat files, signature files, inverted files, and PAT trees along the tradeoff]

33 Stemming
Reason:
– different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
Stemming:
– removing some endings of words
computer, compute, computes, computing, computed, computation → comput

34 Inverted File, Stemmed

35 Stemming
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be differ color
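A toy conflation sketch in the spirit of these examples (not the Porter algorithm or any standard stemmer): a small irregular-forms table for am/are/is, plus naive possessive handling and suffix stripping.

# Toy stemmer: irregular forms + naive suffix stripping (illustration only).
IRREGULAR = {"am": "be", "are": "be", "is": "be"}

def stem(word):
    w = word.lower().rstrip("'")          # cars' -> cars
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("'s"):
        w = w[:-2]                        # car's -> car
    for suf in ("ation", "ing", "ers", "er", "ed", "es", "s"):
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[:-len(suf)]          # keep a stem of at least 3 letters
    return w

print([stem(w) for w in ["cars", "car's", "cars'", "computing", "computation", "is"]])
# -> ['car', 'car', 'car', 'comput', 'comput', 'be']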

36 Stemming
Manual or automatic
Can reduce index files by up to 50%
Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect on both precision and recall

37 Stopwords
Stopwords are kept in stoplists or negative dictionaries
Idea: remove words of low semantic content; the index should only contain "important stuff"
What not to index is domain dependent, but often includes:
– "small" words: a, and, the, but, of, an, very, etc.
– case (removed by folding)
– punctuation

38 Stop Words
Very common words that have no discriminatory power (e.g., in Arabic: في 'in', من 'from', إلى 'to', ...)

39 Normalization
Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A vs USA
– anti-discriminatory vs antidiscriminatory
– car vs automobile?

40 Capitalization / Case Folding
Good for:
– allowing instances of Automobile at the beginning of a sentence to match a query for automobile
– helping a search engine when most users type ferrari while interested in a Ferrari car
Bad for:
– proper names vs common nouns: General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of a sentence; true casing via machine learning
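A sketch combining the preprocessing steps of slides 37-40, with an invented example sentence and a tiny illustrative stoplist: punctuation stripping, period removal (U.S.A -> usa), case folding, and stopword removal. Note that folding General Motors to lowercase illustrates exactly the "bad for proper names" caveat above.

import string

# Toy preprocessing pipeline: normalize, case-fold, remove stopwords.
STOPLIST = {"a", "an", "and", "the", "but", "of", "very"}   # tiny illustrative stoplist

def preprocess(text):
    tokens = []
    for raw in text.split():
        tok = raw.strip(string.punctuation).lower()   # drop edge punctuation, fold case
        tok = tok.replace(".", "")                    # U.S.A -> usa
        if tok and tok not in STOPLIST:
            tokens.append(tok)
    return tokens

print(preprocess("The U.S.A. and General Motors, a very big company."))
# -> ['usa', 'general', 'motors', 'big', 'company']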

41 Performance of Search
3 major classes of performance measures:
– precision / recall: see the TREC conference series, http://trec.nist.gov/
– space / time: see Esler & Nelson, JNCA, for an example: http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability: probably the most important measure, but largely ignored

42 Precision and Recall
Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved)
Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the database)

43 Standard Evaluation Measures
Starts with a CONTINGENCY table:

                relevant   not relevant
retrieved          w            x          n1 = w + x
not retrieved      y            z
                n2 = w + y                 N

44 Precision and Recall
Recall = w / (w + y): from all the documents that are relevant out there, how many did the IR system retrieve?
Precision = w / (w + x): from all the documents that are retrieved by the IR system, how many are relevant?
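A direct sketch of these two formulas computed from the contingency-table counts (the values of w, x, y, z are invented for illustration):

# Precision and recall from the contingency table counts.
def precision(w, x):
    return w / (w + x)   # relevant retrieved / all retrieved

def recall(w, y):
    return w / (w + y)   # relevant retrieved / all relevant

w, x, y, z = 30, 20, 10, 940   # illustrative counts
print(f"precision = {precision(w, x):.2f}, recall = {recall(w, y):.2f}")
# -> precision = 0.60, recall = 0.75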

45 User-Centered IR Evaluation
More user-oriented measures
– satisfaction, informativeness
Other types of measures
– time, cost-benefit, error rate, task analysis
Evaluation of user characteristics
Evaluation of the interface
Evaluation of the process or interaction

