CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #15: Text - part II C. Faloutsos.

Slides:



Advertisements
Similar presentations
Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Advertisements

1 A B C
Variations of the Turing Machine
AP STUDY SESSION 2.
1
Select from the most commonly used minutes below.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Processes and Operating Systems
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
Addition and Subtraction Equations
David Burdett May 11, 2004 Package Binding for WS CDL.
Local Customization Chapter 2. Local Customization 2-2 Objectives Customization Considerations Types of Data Elements Location for Locally Defined Data.
Create an Application Title 1Y - Youth Chapter 5.
Process a Customer Chapter 2. Process a Customer 2-2 Objectives Understand what defines a Customer Learn how to check for an existing Customer Learn how.
Custom Statutory Programs Chapter 3. Customary Statutory Programs and Titles 3-2 Objectives Add Local Statutory Programs Create Customer Application For.
CALENDAR.
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Photo Slideshow Instructions (delete before presenting or this page will show when slideshow loops) 1.Set PowerPoint to work in Outline. View/Normal click.
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Chapter 7: Steady-State Errors 1 ©2000, John Wiley & Sons, Inc. Nise/Control Systems Engineering, 3/e Chapter 7 Steady-State Errors.
Break Time Remaining 10:00.
The basics for simulations
Turing Machines.
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
PP Test Review Sections 6-1 to 6-6
MM4A6c: Apply the law of sines and the law of cosines.
Briana B. Morrison Adapted from William Collins
Operating Systems Operating Systems - Winter 2012 Chapter 4 – Memory Management Vrije Universiteit Amsterdam.
Operating Systems Operating Systems - Winter 2010 Chapter 3 – Input/Output Vrije Universiteit Amsterdam.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
BEEF & VEAL MARKET SITUATION "Single CMO" Management Committee 22 November 2012.
Chapter 20 Network Layer: Internet Protocol
TESOL International Convention Presentation- ESL Instruction: Developing Your Skills to Become a Master Conductor by Beth Clifton Crumpler by.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - IV Grid files, dim. curse C. Faloutsos.
Sample Service Screenshots Enterprise Cloud Service 11.3.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
GIS Lecture 8 Spatial Data Processing.
 Copyright I/O International, 2013 Visit us at: A Feature Within from Item Class User Friendly Maintenance  Copyright.
1..
1 TV Viewing Trends Rivière-du-Loup EM - Diary Updated Spring 2014.
Adding Up In Chunks.
SLP – Endless Possibilities What can SLP do for your school? Everything you need to know about SLP – past, present and future.
MaK_Full ahead loaded 1 Alarm Page Directory (F11)
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
1 Termination and shape-shifting heaps Byron Cook Microsoft Research, Cambridge Joint work with Josh Berdine, Dino Distefano, and.
Artificial Intelligence
Before Between After.
: 3 00.
5 minutes.
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
Speak Up for Safety Dr. Susan Strauss Harassment & Bullying Consultant November 9, 2012.
Essential Cell Biology
Converting a Fraction to %
Clock will move after 1 minute
CMU SCS : Multimedia Databases and Data Mining Lecture#1: Introduction Christos Faloutsos CMU
PSSA Preparation.
Physics for Scientists & Engineers, 3rd Edition
Select a time to count down from the clock above
Copyright Tim Morris/St Stephen's School
1 Dr. Scott Schaefer Least Squares Curves, Rational Representations, Splines and Continuity.
1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases Text - part I Slides by C. Faloutsos, CMU.
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.
15-826: Multimedia Databases and Data Mining
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Lecture #15: Text - part II C. Faloutsos

CMU SCS Copyright: C. Faloutsos (2012)2 Must-read Material MM Textbook, Chapter 6

CMU SCS Copyright: C. Faloutsos (2012)3 Optional (but terrific to read) McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1): Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3):

CMU SCS Copyright: C. Faloutsos (2012)4 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

CMU SCS Copyright: C. Faloutsos (2012)5 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text multimedia...

CMU SCS Copyright: C. Faloutsos (2012)6 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI

CMU SCS Copyright: C. Faloutsos (2012)7 Text - Inversion … Aaron zoo … … DictionaryPostings listsText file

CMU SCS Copyright: C. Faloutsos (2012)8 Text - Inversion Q: space overhead? … Aaron zoo … … DictionaryPostings listsText file

CMU SCS Copyright: C. Faloutsos (2012)9 Text - Inversion A: mainly, the postings lists … Aaron zoo … … DictionaryPostings listsText file

CMU SCS Copyright: C. Faloutsos (2012)10 Text - Inversion how to organize dictionary? stemming – Y/N? insertions?

CMU SCS Copyright: C. Faloutsos (2012)11 Text - Inversion how to organize dictionary? –B-tree, hashing, TRIEs, PATRICIA trees,... stemming – Y/N? insertions?

CMU SCS Copyright: C. Faloutsos (2012)12 Text - Inversion variations: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+] –‘zipf’ distributions Approximate searching (‘glimpse’ [Wu+])

CMU SCS Copyright: C. Faloutsos (2012)13 postings list – more Zipf distr.: eg., rank- frequency plot of ‘Bible’ log(rank) log(freq) Text - Inversion

CMU SCS Copyright: C. Faloutsos (2012)14 Text - Inversion postings lists –Cutting+Pedersen (keep first 4 in B-tree leaves) –how to allocate space: [Faloutsos+92] geometric progression –compression (Elias codes) [Zobel+] – down to 2% overhead! –Compression and doc reordering [Blandford+2002]

CMU SCS Integer coding: small integers - > few bits numberbinarySelf-delimiting Copyright: C. Faloutsos (2012)15 O (log (i) ) bits for integer i can drop middle ‘1’ can apply recursively, to length

CMU SCS Document Reordering Copyright: C. Faloutsos (2012)16 Aaron zoo … Doc1 Doc2 Doc3 … …

CMU SCS Document Reordering Copyright: C. Faloutsos (2012)17 Aaron zoo … Doc1 Doc3 … Doc … Shorter runs; easier to compress

CMU SCS Copyright: C. Faloutsos (2012)18 Conclusions Conclusions: needs space overhead (2%- 300%), but it is the fastest

CMU SCS Copyright: C. Faloutsos (2012)19 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI

CMU SCS Copyright: C. Faloutsos (2012)20 Signature files idea: ‘quick & dirty’ filter..JoSm.. … John Smith … Signature fileText file

CMU SCS Copyright: C. Faloutsos (2012)21 Signature files idea: ‘quick & dirty’ filter then, do seq. scan on sign. file and discard ‘false alarms’ Adv.: easy insertions; faster than seq. scan Disadv.: O(N) search (with small constant) Q: how to extract signatures?

CMU SCS Copyright: C. Faloutsos (2012)22 Signature files A: superimposed coding!! [Mooers49],... m (=4 bits/word) F (=12 bits sign. size) WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)23 Signature files A: superimposed coding!! [Mooers49],... data actual match WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)24 Signature files A: superimposed coding!! [Mooers49],... retrieval actual dismissal WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)25 Signature files A: superimposed coding!! [Mooers49],... nucleotic false alarm (‘false drop’) WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)26 Signature files A: superimposed coding!! [Mooers49],... ‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’ WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)27 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

CMU SCS Copyright: C. Faloutsos (2012)28 Signature files Q1: How to choose F and m ? m (=4 bits/word) F (=12 bits sign. size) WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)29 Signature files Q1: How to choose F and m ? A: so that doc. signature is 50% full m (=4 bits/word) F (=12 bits sign. size) WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)30 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

CMU SCS Copyright: C. Faloutsos (2012)31 Signature files Q2: Why is it called ‘false drop’? Old, but fascinating story [1949] –how to find qualifying books (by title word, and/or author, and/or keyword) –in O(1) time? –without computers (1949…)

CMU SCS Copyright: C. Faloutsos (2012)32 Signature files ‘State of the art’: cards – but > O(1) one copy alpha by author one by title …

CMU SCS Copyright: C. Faloutsos (2012)33 Signature files Solution: edge-notched cards each title word is mapped to m numbers (how?) and the corresponding holes are cut out:

CMU SCS Copyright: C. Faloutsos (2012)34 Signature files Solution: edge-notched cards data ‘data’ -> #1, #39

CMU SCS Copyright: C. Faloutsos (2012)35 Signature files Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards! data ‘data’ -> #1, #39

CMU SCS Copyright: C. Faloutsos (2012)36 Signature files Also known as ‘zatocoding’, from ‘Zator’ company.

CMU SCS Copyright: C. Faloutsos (2012)37 Signature files Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

CMU SCS Copyright: C. Faloutsos (2012)38 Signature files Q3: other apps of signature files? A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document? WordSignature data base doc.signature

CMU SCS Copyright: C. Faloutsos (2012)39 Signature files UNIX’s early ‘spell’ system [McIlroy] Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99] differential files [Severance + Lohman] Eric Riedel Doug McIlroy Guy Lohman

CMU SCS Copyright: C. Faloutsos (2012)40 App#1: Unix’s spell aaron apple … zoo Dictionary electroencephalogram ? What to do if the dictionary does not fit in memory (~1980)? ~30,000 words

CMU SCS Copyright: C. Faloutsos (2012)41 App#1: Unix’s spell A: allow for a few typos! And use huge bit string (2**27) Hash each dictionary word to a bit Compress the string aaron apple … zoo 2**27 bits

CMU SCS Copyright: C. Faloutsos (2012)42 App#1: Unix’s spell Sub-questions: Q1: How often do we allow typos? Q2: Will the (compressed) bit string fit in memory? Q3: How to compress the bit string?

CMU SCS Copyright: C. Faloutsos (2012)43 App#2: Bloom-joins ABC … 1 12 AEF … R join S

CMU SCS Copyright: C. Faloutsos (2012)44 App#2: Bloom-joins ABC … 1 12 AEF … Idea: reduce transmission cost: ‘R semijoin S’

CMU SCS Copyright: C. Faloutsos (2012)45 App#2: Bloom-joins Idea: reduce transmission cost: ‘R semijoin S’ That is, -‘S’ ships its unique values of ‘A’ - ‘R’ deletes non-matching tuples - (and they both send their tuples to PIT) Q: what if we want to send at most, say 100 bytes NY -> Chicago?

CMU SCS Copyright: C. Faloutsos (2012)46 App#2: Bloom-joins Q: what if we want to send at most, say 100 bytes NY -> Chicago? A: Bloom-join! Send a bloom filter of the S.A values

CMU SCS Copyright: C. Faloutsos (2012)47 App#3: Differential files Problem definition: -A large file (eg., with EMPLOYEE records), nicely packed and organized (eg., B-tree) -A few insertions/deletions, that we would like to keep separate, and merge, at night -How to search, eg., for employee #123?

CMU SCS Copyright: C. Faloutsos (2012)48 App#3: Differential files ssnname… … flagssnname,.. i123 i55 d17 d33 Differential file

CMU SCS Copyright: C. Faloutsos (2012)49 App#3: Differential files Q: How to search, eg., for employee #123? A: bloom-filter, for keys of diff. file Q: What are the advantages of differential files? A:

CMU SCS Copyright: C. Faloutsos (2012)50 App#4: Virus-scan Q: How to search, for thousands of patterns? A: Bloom filter: Exact Pattern Matching with Feed-Forward Bloom Filters, J. Moraru, and D. Andersen, (ALENEX11), Jan 2011

CMU SCS Copyright: C. Faloutsos (2012)51 Signature files - conclusions easy insertions; slower than inversion brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non_qualifying elements, and focus on the rest.

CMU SCS Copyright: C. Faloutsos (2012)52 References Blandford, D. and Blelloch, G Index Compression through Document Reordering. Data Compression Conference (DCC '02) (April , 2002). Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. EDBT conference, Cambridge, U.K., Springer Verlag.

CMU SCS Copyright: C. Faloutsos (2012)53 References - cont’d Faloutsos, C. and H. V. Jagadish (Aug , 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia. Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2):

CMU SCS Copyright: C. Faloutsos (2012)54 References - cont’d Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan. Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf. McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1):

CMU SCS Copyright: C. Faloutsos (2012)55 References - cont’d Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR. Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.

CMU SCS Copyright: C. Faloutsos (2012)56 References - cont’d Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS. Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.

CMU SCS Copyright: C. Faloutsos (2012)57 References - cont’d Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool.". Zobel, J., A. Moffat, et al. (Aug , 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.