Text Search and Fuzzy Matching

Presentation transcript:

Text Search and Fuzzy Matching Presented by Andre Dovgal, Sunaptic Solutions andredovgal@hotmail.com April 20, 2017 © Copyright 2004 Sunaptic Solutions. All rights reserved.

Focus of the Presentation
- Text search in big databases
- Data cleansing in ETL
- Word matching
- Usage of different matching algorithms
Real-world data is "dirty" because of misspellings, truncations, missing or inserted tokens, null fields, unexpected abbreviations, and other irregularities.

Scenarios
[Diagram: three scenarios. Labels: Scenario 1; Scenario 2 (ETL); Scenario 3; User Interface; Other Systems; "clean" request for search; "dirty" request for search; "Dirty" Data; "Clean" Data.]

Text Search Challenges
- Improving search speed: searching for a substring in a string, regardless of the nature of the substring.
- Improving relevance of results: searching for words of a human language. Domain dependence.
See examples: http://en.wikipedia.org/wiki/String_searching_algorithm
- Rabin-Karp algorithm
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- more …

Word Matching Approaches
- Exact matching
- Partial matching (pattern matching)
- Grammatical algorithms: stemming matching and synonym matching (semantics)
- Phonetic matching
- Fuzzy matching
Exact and partial matching: match text strings, regardless of language, grammar, domain, sentences, etc.
Grammatical algorithms: stemming matching uses grammar and morphology, and matches words rather than just text strings. Domain does not matter. Synonym matching works with semantics and is domain dependent.
Phonetic matching: works with words. Most of the algorithms were designed for specific domains, e.g. names.
Fuzzy matching: based on the probability of typographic errors.
What else?

Exact Matching
- No additional challenge except speed.
- Domain does not really matter.
- Example: searching a file with the Notepad program.
- Example (SQL): SELECT field FROM table WHERE field = 'string'
- MS SQL Server: proper indexing improves speed.

Partial Matching
- Domain does not really matter.
- Example: wildcards, search patterns.
- Example (SQL): SELECT field FROM table WHERE field LIKE 'string%'
- Example (SQL): SELECT field FROM table WHERE field BETWEEN 'string1' AND 'string2'
- MS SQL Server: proper indexing improves speed.

Full-text Search in MS SQL Server
- Needs MS Search Service (for SQL Server 2000).
- Included in MS SQL Server 2005 as the SQL Server Full Text Search service.
CONTAINS predicate
- Unlike LIKE, CONTAINS matches words.
- Can search for a word inflectionally generated from another (stemming matching).
- Can search for a word near another word.
- SQL Server discards noise words from the search criteria.
FREETEXT predicate
- Matches a word or phrase close to the search word or phrase.
Needs additional space on disk.
http://msdn2.microsoft.com/ms142541.aspx -- diagram from the slide #9 + description
Examples
Prefix search:
    SELECT * FROM Production.ProductDescription WHERE CONTAINS(Description, ' "bik*" ')
Inflectional search:
    SELECT * FROM Production.ProductDescription WHERE CONTAINS(Description, 'FORMSOF(INFLECTIONAL, "ride")')
Search for a word near another word:
    SELECT * FROM Production.ProductDescription WHERE CONTAINS(Description, 'bike NEAR performance')
Size: 1 MB for the Production.ProductDescription table. The table itself is nvarchar(400) * 762 records, < 300K.
Size: 6.14 MB for FuzzyTest. The table is 2 varchar(50) columns * 280,000 records, < 30,000,000 B. Not bad.
More examples:
    SELECT ProductName FROM Products WHERE CONTAINS(ProductName, ' "choc*" ')
    SELECT ProductName FROM Products WHERE CONTAINS(ProductName, ' FORMSOF (INFLECTIONAL, dry) ')
    SELECT ProductName FROM Products WHERE CONTAINS(ProductName, 'spread NEAR Boysenberry')
    SELECT CategoryName FROM Categories WHERE FREETEXT (Description, 'sweetest candy bread and dry meat' )

Full-text Search Architecture in MS SQL 2000
- MS Search Service is separate from MS SQL Server.
- Full-text catalog files are separate from the DB and need a separate backup procedure. In MS SQL Server 2005 they are included in the backup.
- Indexes must be created first. To create a full-text index on a table, the table must have a single, unique, non-null column.
- Indexes are sets of tokens indicating the position of the words in the column.

Full-text Search Architecture in MS SQL 2005
- CREATE/ALTER FULLTEXT INDEX is available in T-SQL for MS SQL Server 2005.
- Accent sensitivity.
- Thesaurus: only synonyms. The thesaurus files are XML files and ship empty; you must add the words yourself.

Grammatical Algorithms
Stemming match
- We already saw SQL Server full-text search.
- Google example: "cutting and paste".
- Why a dictionary is needed: to determine the stem.
- MS Search Service provides only inflectional, not derivational, word generation. Therefore, only words in the same family (noun, verb, adjective, adverb, and so forth) are generated. For example, gerunds are not generated from nouns. Stemming "swim" generates swim, swam, swum, since these are all verbs, but not nouns such as swimmer and swimmers.
Synonym match
Most grammatical algorithms are based on dictionaries.
Quasi-stemming match
- Can be developed without a main dictionary (using a quasi-endings tree).
- Relatively low relevance.
Grammatical algorithms need dictionaries and grammar rules.

Phonetic Matching
- Phonetic matching algorithms (also called phonetic encoding or "sounds alike" algorithms)
- Language dependent
- Domain dependent

Phonetic Matching Algorithms
The original SoundEx algorithm
- Has been used for US census data since the late 1890s.
- Was patented by Margaret K. Odell and Robert C. Russell in 1918.
- Improvements: Phonix (1988), Editex (phonetic distance measuring, circa 2000), etc.
Metaphone and Double Metaphone algorithms
- Author: Lawrence Philips, 1990 and 2000.

SoundEx Algorithm
1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with trailing blanks as needed during each procedure step.
2. Retain the first letter of the word.
3. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4. Change letters from the following sets into the digit given:
   1 = 'B', 'F', 'P', 'V'
   2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
   3 = 'D', 'T'
   4 = 'L'
   5 = 'M', 'N'
   6 = 'R'
5. In the string that results from step 4, collapse each pair of identical digits that occur beside each other into a single digit.
6. Remove all zeros (placed there in step 3) from the string that results from step 5.
7. Pad the string that results from step 6 with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
Examples:
    select soundex('smyth')
    select soundex('smoothie')
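The seven steps above can be sketched in Python as follows. This is a minimal illustration of the steps as listed on the slide, not the implementation used by any DBMS; the function name soundex and the CODES table are chosen here for illustration. It assumes the first letter's own digit takes part in the collapsing of step 5, which the slide does not state explicitly. Real implementations such as SQL Server's SOUNDEX may differ in details (for example, in how H and W are treated), so results are not guaranteed to agree for every input.

```python
# Digit classes from step 4; the letters from step 3 map to '0'.
CODES = {}
for digit, letters in {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Encode a word following the seven steps listed on the slide."""
    # Step 1: uppercase and drop everything that is not a letter A-Z.
    letters = [ch for ch in word.upper() if ch in CODES]
    if not letters:
        return ""
    # Steps 3-4: replace every letter by its digit.
    digits = [CODES[ch] for ch in letters]
    # Step 5: collapse runs of identical adjacent digits.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 2: keep the first letter itself instead of its digit.
    # Step 6: drop the zeros from the remainder.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    # Step 7: pad with trailing zeros and keep four positions.
    return (letters[0] + tail + "000")[:4]


if __name__ == "__main__":
    for name in ("smyth", "smoothie", "Robert"):
        print(name, soundex(name))
```

Under these assumptions "smyth" and "smoothie" receive the same code, which is why a SOUNDEX comparison treats them as a match.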

More About SoundEx
- Example (SQL)
- DIFFERENCE
- Oracle SOUNDEX is slightly different from SQL Server SOUNDEX.
- It seems that the major DBMSs (SQL Server, Oracle, DB2) don't have a better phonetic matching function.
- Enhancements: replace DG with G, etc. Phonix algorithm.
Examples:
    select FirstName, LastName from Person.Contact where soundex(LastName) = soundex('pauluk')
    select FirstName, LastName from Person.Contact where soundex(LastName) = soundex('damis')
    select FirstName, LastName from Person.Contact where soundex(LastName) = soundex('smoothie')
    select FirstName, LastName from FuzzySearchTest where soundex(LastName) = soundex('pauluk')
    select FirstName, LastName from FuzzySearchTest where soundex(LastName) = soundex('smoothie')
    select FirstName, LastName, difference (LastName, 'smoothie') from FuzzySearchTest where soundex(LastName) = soundex('smoothie')

SoundEx Limitations
- SoundEx is only usable in applications that can tolerate high false positives (when words that don't match the sound of the inquiry are returned) and high false negatives (when words that match the sound of the inquiry are NOT returned).
- In many instances, unreliable interfaces are used as a foundation upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible.
- SQL: the word can't start with a space.
- A mistake in the first letter results in a 100% mismatch.
SoundEx acts as a bridge between the fuzzy and inexact process of human vocal interaction and the concise true/false processes at the foundation of computer communication. As such, SoundEx is an inherently unreliable interface.
Only 33% of the matches that would be returned by Soundex would be correct. Even more significant was the finding that 25% of correct matches would fail to be discovered by Soundex. (Alan Stanier, September 1990, Computers in Genealogy, Vol. 3, No. 7)
Only 36.37% of Soundex returns were correct, while more than 60% of correct names were never returned by Soundex. (A.J. Lait and B. Randell, 1996)
This limitation is true even of the best SoundEx improvement techniques available. As long as you accept and honor this limitation, SoundEx and its derivatives can be a very useful tool in helping to improve the quality and usefulness of databases.

Metaphone and Double Metaphone
- Metaphone: an algorithm to code English words phonetically by reducing them to 16 consonant sounds.
- Double Metaphone: an algorithm to code English words (and foreign words often heard in the United States) phonetically by reducing them to 12 consonant sounds.
- Author: Lawrence Philips, 1990 and 2000.
- Metaphone description and demo: http://www.wbrogden.com/phonetic/
SQL example:
    select FirstName, LastName from Person.Contact where dbo.DoubleMetaPhone(LastName) = dbo.DoubleMetaPhone('damis')

Double Metaphone Advantages and Limitations
- Free, efficient, and easy to use
- Provides better results compared to SOUNDEX
- Returns two possible matches (a primary and an alternate encoding)
- Works best with proper names
- May fail to match misspelled words
- Much slower than SOUNDEX
Though it works as a general-purpose phonetic search algorithm, Double Metaphone was designed for, and works best with, searching lists of proper names rather than large fields of generic text.
Double Metaphone provides minimal ranking ability, apart from the three match levels described elsewhere in the series. This limits the ability to tune search results.
Being a phonetic matching algorithm (as opposed to fuzzy matching algorithms such as q-grams and edit distances), Double Metaphone may fail to match misspelled words when the misspelling substantively alters the phonetic structure of a word.
Even bearing these limitations in mind, Double Metaphone is free, efficient, easy to use, and adaptable to a number of scenarios. Ultimately, only the designer of a particular system can decide if Double Metaphone is appropriate to his/her particular problem space.

Fuzzy Matching
What is fuzzy matching?
- Originally meant "not exact matching".
- Fuzzy queries in Index Server are simple prefix matching (for example, dog* returns dogmatic and doghouse) plus stem matching.
- Web search engines.
Edit distance based algorithms
- Simple: Hamming distance algorithms.
- Most popular: Levenshtein distance algorithm.
Q-gram based algorithms
Both types of algorithms are language and domain independent.
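As a small illustration of the simplest edit-distance measure named above, the sketch below computes the Hamming distance, which counts the positions at which two strings of equal length differ. The function name hamming_distance is chosen here for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Hamming distance only allows substitutions, so lengths must agree.
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming_distance("smith", "smyth"))  # one substituted letter
    print(hamming_distance("john", "jonh"))    # a transposition counts twice
```

Because it cannot handle inserted or deleted characters, this measure is of limited use for typing errors, which is one reason Levenshtein distance is the more popular choice.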

Levenshtein Distance
- Developed in 1965.
- LD is a measure of the similarity between two strings: it is the smallest number of insertions, deletions, and substitutions required to change one string into another.
- Language and domain independent.
- Demo: http://www.merriampark.com/ld.htm
- More demos: http://www.cut-the-knot.org/do_you_know/Strings.shtml#induction
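The definition above can be sketched with the usual dynamic-programming table, kept here to two rows. This is a minimal illustration with unit cost for each insertion, deletion, and substitution; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb, or keep it
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("john smith", "jonh smith"))
```

Note that a transposed pair of letters, as in "jonh", costs two operations under this definition.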

Q-Grams
- Q-grams are obtained by sliding a window of size Q over the characters of a given string.
- Example: the 2-grams of "john smith" are $j jo oh hn n_ _s sm mi it th h#
- Idea: if strings match, they have many common q-grams.
- Example: "john smith" and "jonh smith" have 8 common 2-grams.
- Language and domain independent.
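The sliding window can be sketched as follows. This is a minimal illustration that assumes the padding convention of the slide's example ("$" before the string, "#" after it, "_" for a space); the function names qgrams and common_qgrams are chosen here for illustration. For the padded 2-grams listed above, this sketch counts 8 q-grams shared by "john smith" and "jonh smith".

```python
from collections import Counter


def qgrams(text: str, q: int = 2, start: str = "$", end: str = "#") -> list:
    """Return the padded q-grams of text, using '_' to show spaces."""
    padded = start * (q - 1) + text.replace(" ", "_") + end * (q - 1)
    # Slide a window of size q over the padded string.
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def common_qgrams(a: str, b: str, q: int = 2) -> int:
    """Count q-grams shared by both strings (multiset intersection)."""
    overlap = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(overlap.values())


if __name__ == "__main__":
    print(" ".join(qgrams("john smith")))
    print(common_qgrams("john smith", "jonh smith"))
```

A multiset intersection is used so that a q-gram repeated in both strings is counted as many times as it is shared.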

"Fuzzy" SSIS
- Designed for data cleanup.
- Fuzzy Lookup matches input records with clean, standardized records in a reference table.
- Fuzzy Grouping identifies groups of records in a table where each record in the group potentially corresponds to the same real-world entity.
- Based on q-grams and Levenshtein distance (?).
Fuzzy search databases can be amassed that compile common misspellings (or variants) of specific words, which can then be substituted during the cleansing process.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp

Design a Simple SSIS Fuzzy Lookup Package
- Setting up
- String data types (DT_STR and DT_WSTR)
- ETI (Error-Tolerant Index), tokens, delimiters
- Tokens are not q-grams
- Similarity threshold
- Number of matches
The lower the similarity threshold, the more likely it is to find a match, but the search could take longer.
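The effect of the similarity threshold and the number of matches can be illustrated with a toy lookup. This sketch is not the SSIS Fuzzy Lookup algorithm and does not use an error-tolerant index; it assumes a similarity score of 1 minus the edit distance divided by the longer string's length, and every name in it (edit_distance, similarity, fuzzy_lookup, the sample reference list) is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score in [0, 1]: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_lookup(value, reference, threshold=0.8, max_matches=1):
    """Return up to max_matches (score, reference value) pairs at or above threshold."""
    scored = [(similarity(value.lower(), ref.lower()), ref) for ref in reference]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return hits[:max_matches]


if __name__ == "__main__":
    clean = ["John Smith", "Jane Smith", "Joan Smythe"]
    print(fuzzy_lookup("Jon Smith", clean, threshold=0.85, max_matches=2))
    print(fuzzy_lookup("Jon Smith", clean, threshold=0.5, max_matches=3))
```

Lowering the threshold in the second call lets weaker candidates through, so more reference rows are returned for the same input.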

Can Fuzzy Lookup Be Accessed From C# Code?
- Not yet.
- Develop your own implementation of the algorithm.

Conclusions
- Language and domain knowledge is important.
- No implementations? Develop them yourself!
Questions?