Approximate string matching Evlogi Hristov Telerik Corporation Student at Telerik Academy.



1. Levenshtein distance
2. Bitap overview
3. Bitap exact search
4. Bitap fuzzy search
5. Additional information

Edit distance

Edit distance: the number of primitive operations necessary to convert one string into an exact match of another. The primitive operations are:
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost

Example:
1. Set n to be the length of s = "GUMBO" and m to be the length of t = "GAMBOL". If n = 0, return m and exit. If m = 0, return n and exit.

[Matrix figure: columns labeled with the characters of s (G, U, M, B, O), rows labeled with the characters of t (G, A, M, B, O, L); the first column is initialized to 1 through 6.]

2. Initialize the matrix M[m + 1, n + 1]: row 0 holds 0..n and column 0 holds 0..m.
3. Examine each character of s (i from 1 to n).
4. Examine each character of t (j from 1 to m).
5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.
6. Set cell M[j, i] equal to the minimum of:
   a. The cell immediately above plus 1: M[j-1, i] + 1
   b. The cell immediately to the left plus 1: M[j, i-1] + 1
   c. The cell diagonally above and to the left plus the cost: M[j-1, i-1] + cost
7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the bottom-right cell M[m, n]. For "GUMBO" and "GAMBOL" the distance is 2.

private int Levenshtein(string source, string target)
{
    if (string.IsNullOrEmpty(source))
    {
        if (!string.IsNullOrEmpty(target))
        {
            return target.Length;
        }
        return 0;
    }
    if (string.IsNullOrEmpty(target))
    {
        if (!string.IsNullOrEmpty(source))
        {
            return source.Length;
        }
        return 0;
    }

    int[,] dist = new int[source.Length + 1, target.Length + 1];
    int min1, min2, min3, cost;

    for (int i = 0; i < dist.GetLength(0); i += 1)
    {
        dist[i, 0] = i;
    }
    for (int i = 0; i < dist.GetLength(1); i += 1)
    {
        dist[0, i] = i;
    }

    for (int i = 1; i < dist.GetLength(0); i++)
    {
        for (int j = 1; j < dist.GetLength(1); j++)
        {
            cost = Convert.ToInt32(!(source[i - 1] == target[j - 1]));
            min1 = dist[i - 1, j] + 1;
            min2 = dist[i, j - 1] + 1;
            min3 = dist[i - 1, j - 1] + cost;
            dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
        }
    }
    return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
}
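Each cell of the matrix depends only on the row above it and the cell to its left, so the full table is not strictly needed. The following is a minimal sketch of a two-row variant, not the code from the slides; the class name `LevenshteinSketch` and the method name `Distance` are made up for illustration.

```csharp
using System;

public static class LevenshteinSketch
{
    // Two-row variant (illustrative): keeps only the previous and the current row
    // of the distance matrix instead of the whole table.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: turning an empty source into the first j target characters costs j insertions.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // deleting the first i source characters
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The current row becomes the previous row for the next source character.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }
}
```

With this sketch, `LevenshteinSketch.Distance("GUMBO", "GAMBOL")` gives 2, and each of the pairs cot/coat and coat/cost gives 1. Memory use drops from (n + 1)(m + 1) cells to two rows of m + 1 cells.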

Bitap algorithm (shift-or / shift-and)

- Bitap is also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
- It is an approximate string matching algorithm.
- Approximate equality is defined in terms of Levenshtein distance.
- It is often used for fuzzy search without indexing.
- It does most of the work with bitwise operations.
- It runs in O(mn) operations (m is the pattern length, n is the text length), no matter the structure of the text or the pattern.

// Requires: using System.Collections.Generic;
public static List<int> ExactMatch(string text, string pattern)
{
    // One bit mask per ASCII character (0 - 127): bit i is set if pattern[i] is that character.
    // The pattern can be at most 62 characters long so that the match bit fits in a positive long.
    long[] alphabet = new long[128];
    for (int i = 0; i < pattern.Length; ++i)
    {
        int letter = (int)pattern[i];
        alphabet[letter] = alphabet[letter] | (1L << i);
    }

    long result = 1; // bit 0 is always set: the empty prefix matches everywhere
    List<int> indexes = new List<int>();
    for (int index = 0; index < text.Length; index++)
    {
        // Keep only the prefixes that the current character extends;
        // if it extends none of them, result becomes 0.
        result &= alphabet[text[index]];
        result = (result << 1) + 1;
        if ((result & (1L << pattern.Length)) > 0)
        {
            indexes.Add(index - pattern.Length + 1);
        }
    }
    return indexes;
}

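Example: text = cbdabababc, pattern = ababc

Bit masks (the pattern is shown reversed, cbaba, so that bit 0 is the rightmost character):
- alphabet[a] = 00101 = 5
- alphabet[b] = 01010 = 10
- alphabet[c] = 10000 = 16
- alphabet[d] = 00000 = 0

[Figure: step-by-step trace of res for each text[i], starting from res = 1 and reading the text prefixes c, cb, cbd, cbda, cbdab and then the five-character windows bdaba, dabab, ababa, babab, ababc.]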
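The mask values and the state trace of this example can be recomputed with a short sketch. It assumes the same update rule as the exact search shown earlier (AND with the character mask, then shift left and add 1); the names `BitapMaskSketch`, `BuildMasks`, `ToBits` and `Trace` are hypothetical helpers introduced only for this illustration, and the input is assumed to be ASCII.

```csharp
using System;
using System.Collections.Generic;

public static class BitapMaskSketch
{
    // Builds one mask per ASCII character: bit i is set when pattern[i] is that character.
    public static long[] BuildMasks(string pattern)
    {
        long[] masks = new long[128];
        for (int i = 0; i < pattern.Length; i++)
        {
            masks[pattern[i]] |= 1L << i;
        }
        return masks;
    }

    // Formats a mask as a fixed-width binary string, lowest bit on the right.
    public static string ToBits(long value, int width)
    {
        return Convert.ToString(value, 2).PadLeft(width, '0');
    }

    // Returns the value of the state word after each text character has been processed.
    public static List<long> Trace(string text, string pattern)
    {
        long[] masks = BuildMasks(pattern);
        var states = new List<long>();
        long result = 1;
        foreach (char c in text)
        {
            result &= masks[c];
            result = (result << 1) + 1;
            states.Add(result);
        }
        return states;
    }
}
```

For the pattern ababc, `ToBits(masks['a'], 5)` yields 00101. For the text cbdabababc, `Trace` yields the states 1, 1, 1, 3, 5, 11, 21, 11, 21, 33. The last state has bit 5 set (33 = 100001 in binary), which signals a match ending at the final character, that is, starting at index 5.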

...
long[] result = new long[k + 1];
for (int i = 0; i <= k; i++)
{
    result[i] = 1;
}
...
for (int j = 1; j <= k; ++j)
{
    // Three operations of the Levenshtein distance
    long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
    long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
    // As written on the slide, substitution is the same expression as deletion.
    long substitution = (previous | (result[j] & patternMask[text[i]])) << 1;
    current = result[j];
    result[j] = substitution | insertion | deletion | 1;
    previous = result[j];
}
...

Instead of having a single bit array result that changes over the length of the text, we now have k distinct bit arrays result[1..k], one for each allowed number of errors.
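Because the fragment above elides its surrounding loop, here is a self-contained sketch of fuzzy search with up to k errors. It follows the textbook shift-and recurrence usually attributed to Wu and Manber rather than the slide's exact code, and it uses a slightly different bit convention: bit i of `result[d]` means "the first i + 1 pattern characters match a suffix of the text read so far with at most d errors". The names `FuzzyBitapSketch` and `EndPositions` are hypothetical, and the pattern is assumed to be ASCII with at most 63 characters.

```csharp
using System;
using System.Collections.Generic;

public static class FuzzyBitapSketch
{
    // Returns every text index at which a match with at most k errors ends.
    // End positions are reported because the start of an approximate match is ambiguous.
    public static List<int> EndPositions(string text, string pattern, int k)
    {
        if (string.IsNullOrEmpty(pattern) || pattern.Length > 63)
            throw new ArgumentException("Pattern length must be between 1 and 63.");
        if (k < 0 || k >= pattern.Length)
            throw new ArgumentException("k must satisfy 0 <= k < pattern.Length.");
        text = text ?? string.Empty;

        long[] patternMask = new long[128];
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("Pattern must be ASCII.");
            patternMask[pattern[i]] |= 1L << i;
        }

        // Before any text is read, a prefix of length <= d matches with d deletions.
        long[] result = new long[k + 1];
        for (int d = 0; d <= k; d++)
        {
            result[d] = (1L << d) - 1;
        }

        long matchBit = 1L << (pattern.Length - 1);
        var ends = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long mask = text[i] < 128 ? patternMask[text[i]] : 0L;

            long oldPrev = result[0];                  // result[d - 1] before this character
            result[0] = ((result[0] << 1) | 1) & mask; // exact-match row

            for (int d = 1; d <= k; d++)
            {
                long oldCur = result[d];
                long match = ((oldCur << 1) | 1) & mask;  // character matches
                long insertion = oldPrev;                 // extra character in the text
                long substitution = (oldPrev << 1) | 1;   // character replaced
                long deletion = (result[d - 1] << 1) | 1; // pattern character missing in the text
                result[d] = match | insertion | substitution | deletion;
                oldPrev = oldCur;
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }
        return ends;
    }
}
```

As a usage example, searching the text "coat" for the pattern "cot" with k = 1 reports the end positions 1, 2 and 3: "co" (one deletion), "coa" (one substitution) and "coat" (one insertion). With k = 0 the same call reports nothing, and searching cbdabababc for ababc with k = 0 reports the single end position 9.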

Shift-and:
- Uses bitwise & and 1s for matches
- More intuitive and easier to understand
- Needs an explicit result |= 1 after each shift

Shift-or:
- Uses bitwise | and 0s for matches
- A bit faster, because the left shift already brings in the 0 that marks a match
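To make the comparison concrete, here is a minimal sketch of exact search in the shift-or form, where a 0 bit marks a match. It is an illustration and not code from the slides: the names `ShiftOrSketch` and `ExactMatch` are hypothetical, the pattern is assumed to be ASCII with 1 to 63 characters, and bit i here refers to the pattern prefix of length i + 1.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftOrSketch
{
    // Exact search with inverted logic: a 0 bit at position i means
    // "the first i + 1 pattern characters match the text ending here".
    public static List<int> ExactMatch(string text, string pattern)
    {
        if (string.IsNullOrEmpty(pattern) || pattern.Length > 63)
            throw new ArgumentException("Pattern length must be between 1 and 63.");

        // Every mask starts as all ones; bit i is cleared where pattern[i] is that character.
        long[] masks = new long[128];
        for (int c = 0; c < masks.Length; c++)
        {
            masks[c] = ~0L;
        }
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("Pattern must be ASCII.");
            masks[pattern[i]] &= ~(1L << i);
        }

        long state = ~0L; // all ones: nothing matched yet
        long matchBit = 1L << (pattern.Length - 1);
        var indexes = new List<int>();

        for (int index = 0; index < text.Length; index++)
        {
            long mask = text[index] < 128 ? masks[text[index]] : ~0L;

            // The shift brings a 0 into bit 0, so no extra "| 1" style step is needed.
            state = (state << 1) | mask;

            if ((state & matchBit) == 0)
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }
        return indexes;
    }
}
```

On the earlier example, `ShiftOrSketch.ExactMatch("cbdabababc", "ababc")` reports the single start index 5, and searching "aaaa" for "aa" reports 0, 1 and 2. The inner loop performs one shift and one OR per character, compared with an AND, a shift and an addition in the shift-and version.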


Additional information:
- Original paper of Baeza-Yates and Gonnet: _f08/Artikler/09/Baeza92.pdf
- Google implementation using bitap
- Levenshtein algorithm: memory-efficient-Levenshtein-algorithm and /Strings/Levenshtein_distance
