Multimedia Databases Text - part I Slides by C. Faloutsos, CMU
Multi DB and D.M.Copyright: C. Faloutsos (2001)2 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining
Multi DB and D.M.Copyright: C. Faloutsos (2001)3 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text multimedia...
Multi DB and D.M.Copyright: C. Faloutsos (2001)4 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI
Multi DB and D.M.Copyright: C. Faloutsos (2001)5 Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications:
Multi DB and D.M.Copyright: C. Faloutsos (2001)6 Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications: –Web –law + patent offices –digital libraries –information filtering
Multi DB and D.M.Copyright: C. Faloutsos (2001)7 Problem - Motivation Types of queries: –boolean (‘data’ AND ‘retrieval’ AND NOT...)
Multi DB and D.M.Copyright: C. Faloutsos (2001)8 Problem - Motivation Types of queries: –boolean (‘data’ AND ‘retrieval’ AND NOT...) –additional features (‘data’ ADJACENT ‘retrieval’) –keyword queries (‘data’, ‘retrieval’) How to search a large collection of documents?
Multi DB and D.M.Copyright: C. Faloutsos (2001)9 Full-text scanning Build a FSA; scan c a t
Multi DB and D.M.Copyright: C. Faloutsos (2001)10 Full-text scanning for single term: –(naive: O(N*M)) ABRACADABRAtext CAB pattern
Multi DB and D.M.Copyright: C. Faloutsos (2001)11 Full-text scanning for single term: –(naive: O(N*M)) –Knuth Morris and Pratt (‘77) build a small FSA; visit every text letter once only, by carefully shifting more than one step ABRACADABRAtext CAB pattern
Multi DB and D.M.Copyright: C. Faloutsos (2001)12 Full-text scanning ABRACADABRAtext CAB pattern CAB...
Multi DB and D.M.Copyright: C. Faloutsos (2001)13 Full-text scanning for single term: –(naive: O(N*M)) –Knuth Morris and Pratt (‘77) –Boyer and Moore (‘77) preprocess pattern; start from right to left & skip! ABRACADABRAtext CAB pattern
Multi DB and D.M.Copyright: C. Faloutsos (2001)14 Full-text scanning ABRACADABRAtext CAB pattern CAB
Multi DB and D.M.Copyright: C. Faloutsos (2001)15 Full-text scanning ABRACADABRAtext OMINOUS pattern OMINOUS Boyer+Moore: fastest, in practice Sunday (‘90): some improvements
Multi DB and D.M.Copyright: C. Faloutsos (2001)16 Full-text scanning For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75) –again, build a simplified FSA in O(M) time Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92]
Multi DB and D.M.Copyright: C. Faloutsos (2001)17 Full-text scanning Approximate matching - string editing distance: d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second SURVEY SURGERY
Multi DB and D.M.Copyright: C. Faloutsos (2001)18 Full-text scanning string editing distance - how to compute? A:
Multi DB and D.M.Copyright: C. Faloutsos (2001)19 Full-text scanning string editing distance - how to compute? A: dynamic programming cost( i, j ) = cost to match prefix of length i of first string s with prefix of length j of second string t
Multi DB and D.M.Copyright: C. Faloutsos (2001)20 Full-text scanning if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1) else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )
Multi DB and D.M.Copyright: C. Faloutsos (2001)21 Full-text scanning Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)
Multi DB and D.M.Copyright: C. Faloutsos (2001)22 Full-text scanning Conclusions: Full text scanning needs no space overhead, but is slow for large datasets
Multi DB and D.M.Copyright: C. Faloutsos (2001)23 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI
Multi DB and D.M.Copyright: C. Faloutsos (2001)24 Text - Inversion
Multi DB and D.M.Copyright: C. Faloutsos (2001)25 Text - Inversion Q: space overhead?
Multi DB and D.M.Copyright: C. Faloutsos (2001)26 Text - Inversion A: mainly, the postings lists
Multi DB and D.M.Copyright: C. Faloutsos (2001)27 Text - Inversion how to organize dictionary? stemming – Y/N? insertions?
Multi DB and D.M.Copyright: C. Faloutsos (2001)28 Text - Inversion how to organize dictionary? –B-tree, hashing, TRIEs, PATRICIA trees,... stemming – Y/N? insertions?
Multi DB and D.M.Copyright: C. Faloutsos (2001)29 Text – Inversion newer topics: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+] –‘zipf’ distributions Approximate searching (‘glimpse’ [Wu+])
Multi DB and D.M.Copyright: C. Faloutsos (2001)30 postings list – more Zipf distr.: eg., rank- frequency plot of ‘Bible’ log(rank) log(freq) Text - Inversion freq ~ 1/rank / ln(1.78V)
Multi DB and D.M.Copyright: C. Faloutsos (2001)31 Text - Inversion postings lists –Cutting+Pedersen (keep first 4 in B-tree leaves) –how to allocate space: [Faloutsos+92] geometric progression –compression (Elias codes) [Zobel+] – down to 2% overhead!
Multi DB and D.M.Copyright: C. Faloutsos (2001)32 Conclusions Conclusions: needs space overhead (2%- 300%), but it is the fastest
Multi DB and D.M.Copyright: C. Faloutsos (2001)33 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI
Multi DB and D.M.Copyright: C. Faloutsos (2001)34 Signature files idea: ‘quick & dirty’ filter
Multi DB and D.M.Copyright: C. Faloutsos (2001)35 Signature files idea: ‘quick & dirty’ filter then, do seq. scan on sign. file and discard ‘false alarms’ Adv.: easy insertions; faster than seq. scan Disadv.: O(N) search (with small constant) Q: how to extract signatures?
Multi DB and D.M.Copyright: C. Faloutsos (2001)36 Signature files A: superimposed coding!! [Mooers49],... m (=4 bits/word) F (=12 bits sign. size)
Multi DB and D.M.Copyright: C. Faloutsos (2001)37 Signature files A: superimposed coding!! [Mooers49],... data actual match
Multi DB and D.M.Copyright: C. Faloutsos (2001)38 Signature files A: superimposed coding!! [Mooers49],... retrieval actual dismissal
Multi DB and D.M.Copyright: C. Faloutsos (2001)39 Signature files A: superimposed coding!! [Mooers49],... nucleotic false alarm (‘false drop’)
Multi DB and D.M.Copyright: C. Faloutsos (2001)40 Signature files A: superimposed coding!! [Mooers49],... ‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’
Multi DB and D.M.Copyright: C. Faloutsos (2001)41 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
Multi DB and D.M.Copyright: C. Faloutsos (2001)42 Signature files Q1: How to choose F and m ? m (=4 bits/word) F (=12 bits sign. size)
Multi DB and D.M.Copyright: C. Faloutsos (2001)43 Signature files Q1: How to choose F and m ? A: so that doc. signature is 50% full m (=4 bits/word) F (=12 bits sign. size)
Multi DB and D.M.Copyright: C. Faloutsos (2001)44 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
Multi DB and D.M.Copyright: C. Faloutsos (2001)45 Signature files Q2: Why is it called ‘false drop’? Old, but fascinating story [1949] –how to find qualifying books (by title word, and/or author, and/or keyword) –in O(1) time? –without computers
Multi DB and D.M.Copyright: C. Faloutsos (2001)46 Signature files Solution: edge-notched cards each title word is mapped to m numbers(how?) and the corresponding holes are cut out:
Multi DB and D.M.Copyright: C. Faloutsos (2001)47 Signature files Solution: edge-notched cards data ‘data’ -> #1, #39
Multi DB and D.M.Copyright: C. Faloutsos (2001)48 Signature files Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards! data ‘data’ -> #1, #39
Multi DB and D.M.Copyright: C. Faloutsos (2001)49 Signature files Also known as ‘zatocoding’, from ‘Zator’ company.
Multi DB and D.M.Copyright: C. Faloutsos (2001)50 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
Multi DB and D.M.Copyright: C. Faloutsos (2001)51 Signature files Q3: other apps of signature files? A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document?
Multi DB and D.M.Copyright: C. Faloutsos (2001)52 Signature files UNIX’s early ‘spell’ system [McIlroy] Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99] differential files [Severance+Lohman]
Multi DB and D.M.Copyright: C. Faloutsos (2001)53 Signature files - conclusions easy insertions; slower than inversion brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non- qualifying elements, and focus on the rest.
Multi DB and D.M.Copyright: C. Faloutsos (2001)54 References Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.
Multi DB and D.M.Copyright: C. Faloutsos (2001)55 References - cont’d Faloutsos, C. and H. V. Jagadish (Aug , 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia. Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2):
Multi DB and D.M.Copyright: C. Faloutsos (2001)56 References - cont’d Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan. Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf. McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1):
Multi DB and D.M.Copyright: C. Faloutsos (2001)57 References - cont’d Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR. Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.
Multi DB and D.M.Copyright: C. Faloutsos (2001)58 References - cont’d Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS. Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.
Multi DB and D.M.Copyright: C. Faloutsos (2001)59 References - cont’d Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool.". Zobel, J., A. Moffat, et al. (Aug , 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.
Multi DB and D.M.Copyright: C. Faloutsos (2001)60