Approximate string matching
Evlogi Hristov
Telerik Corporation
Student at Telerik Academy
1. Levenshtein distance.
2. Bitap overview.
3. Bitap Exact search.
4. Bitap Fuzzy search.
5. Additional information.
Edit distance
Edit distance: the number of primitive operations necessary to convert one string into an exact match of the other.
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost

Example:
1. Set n to be the length of s = "GUMBO"
   Set m to be the length of t = "GAMBOL"
   If n = 0, return m and exit
   If m = 0, return n and exit
2. Initialize matrix M [m + 1, n + 1]: the first row holds 0..n and the first column holds 0..m
3. Examine each character of s ( i from 1 to n )
4. Examine each character of t ( j from 1 to m )
5. If s[i] equals t[j], the cost is 0
   If s[i] is not equal to t[j], the cost is 1
6. Set cell M[j, i] equal to the minimum of:
   a. The cell immediately above plus 1: M [j-1, i] + 1
   b. The cell immediately to the left plus 1: M [j, i-1] + 1
   c. The cell diagonally above and to the left plus the cost: M [j-1, i-1] + cost
7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell M [m, n]

Filled matrix (columns: s = GUMBO, rows: t = GAMBOL):

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1  2  3  4
   A  2  1  1  2  3  4
   M  3  2  2  1  2  3
   B  4  3  3  2  1  2
   O  5  4  4  3  2  1
   L  6  5  5  4  3  2

The distance is M [6, 5] = 2.
    private int Levenshtein(string source, string target)
    {
        // An empty (or null) string is as far from the other string as that string is long.
        if (string.IsNullOrEmpty(source))
        {
            return string.IsNullOrEmpty(target) ? 0 : target.Length;
        }

        if (string.IsNullOrEmpty(target))
        {
            return source.Length;
        }

        int[,] dist = new int[source.Length + 1, target.Length + 1];
        int min1, min2, min3, cost;

        // First column and first row: distance to the empty prefix.
        for (int i = 0; i < dist.GetLength(0); i++)
        {
            dist[i, 0] = i;
        }

        for (int i = 0; i < dist.GetLength(1); i++)
        {
            dist[0, i] = i;
        }

        for (int i = 1; i < dist.GetLength(0); i++)
        {
            for (int j = 1; j < dist.GetLength(1); j++)
            {
                cost = source[i - 1] == target[j - 1] ? 0 : 1;
                min1 = dist[i - 1, j] + 1;        // cell above
                min2 = dist[i, j - 1] + 1;        // cell to the left
                min3 = dist[i - 1, j - 1] + cost; // diagonal cell
                dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        // The bottom-right cell holds the distance.
        return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
    }
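Each cell of the matrix depends only on the row above it and on the cell to its left, so the full matrix is not strictly needed. The sketch below is not from the slides; it is a two-row variant of the same recurrence, and the names LevenshteinTwoRows and Distance are made up for illustration.

```csharp
using System;

public static class LevenshteinTwoRows
{
    // Same recurrence as the full-matrix version, but only the previous and
    // the current row are kept, so memory is proportional to target.Length.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: the empty prefix becomes target[0..j) with j insertions.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // i deletions turn source[0..i) into the empty string
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // Swap the rows; the old "previous" row is overwritten next round.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Distance("GUMBO", "GAMBOL")); // 2
    }
}
```

The trade-off is that the matrix itself is gone, so this variant can report the distance but not the sequence of edits.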
Bitap algorithm: shift-or/shift-and
Also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
An approximate string matching algorithm.
Approximate equality is defined in terms of Levenshtein distance.
Often used for fuzzy search without indexing.
Does most of the work with bitwise operations.
Runs in O(mn) operations for a pattern of length m and a text of length n, no matter the structure of the text or the pattern.
    public static List<int> ExactMatch(string text, string pattern)
    {
        // One bit mask per character. Assumes ASCII input (0 - 127)
        // and a pattern shorter than 64 characters.
        long[] alphabet = new long[128];
        for (int i = 0; i < pattern.Length; ++i)
        {
            int letter = (int)pattern[i];
            alphabet[letter] = alphabet[letter] | (1L << i); // 1L: shift as long, not as int
        }

        long result = 1; // 0000 0001
        List<int> indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            result &= alphabet[text[index]]; // keep only the prefixes that text[index] extends
            result = (result << 1) + 1;
            if ((result & (1L << pattern.Length)) != 0) // the whole pattern has matched
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }

        return indexes;
    }
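Example: text = cbdabababc, pattern = ababc

   pattern index:  0 1 2 3 4
   pattern:        a b a b c

   bit index:      43210
   pattern letter: cbaba
   alphabet[a] =   00101 = 5
   alphabet[b] =   01010 = 10
   alphabet[c] =   10000 = 16
   alphabet[d] =   00000 = 0

   start res:      00001

   text read       res
   c               00000
   cb              00000
   cbd             00000
   cbda            00001
   cbdab           00010
   bdaba           00101
   dabab           01010
   ababa           00101
   babab           01010
   ababc           10000

res is shown right after the & with alphabet[text[i]]. In the last row bit 4 is set, so the whole pattern has matched: the match ends at index 9 and starts at index 5.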
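The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming ASCII input; BitapTrace, BuildMasks and Bits are illustrative names that do not appear in the slides.

```csharp
using System;

public static class BitapTrace
{
    // Builds one bit mask per ASCII character: bit i is set when pattern[i] == c.
    public static long[] BuildMasks(string pattern)
    {
        long[] masks = new long[128];
        for (int i = 0; i < pattern.Length; i++)
        {
            masks[pattern[i]] |= 1L << i;
        }

        return masks;
    }

    // Formats the low bits of value as a binary string, highest bit first,
    // left-padded with zeros to the given width.
    public static string Bits(long value, int width)
    {
        return Convert.ToString(value, 2).PadLeft(width, '0');
    }

    public static void Main()
    {
        string text = "cbdabababc";
        string pattern = "ababc";
        long[] masks = BuildMasks(pattern);

        long result = 1;
        for (int index = 0; index < text.Length; index++)
        {
            // Same two steps as in the exact search: mask, print, then shift.
            result &= masks[text[index]];
            Console.WriteLine("{0} {1}", text[index], Bits(result, pattern.Length));
            result = (result << 1) | 1;
        }
    }
}
```

Each printed line holds the character just read and the state after masking, which corresponds to the res column of the example.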
    ...
    long[] result = new long[k + 1];
    for (int i = 0; i <= k; i++)
    {
        result[i] = 1;
    }
    ...
    for (int j = 1; j <= k; ++j)
    {
        // Three operations of the Levenshtein distance.
        // As the loop maintains them, current is result[j - 1] from before text[i]
        // was processed and previous is result[j - 1] after it was processed.
        long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
        long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
        long substitution = (current | (result[j] & patternMask[text[i]])) << 1;
        current = result[j];
        result[j] = substitution | insertion | deletion | 1;
        previous = result[j];
    }
    ...

Instead of a single bit array result that changes over the length of the text, we now keep k + 1 of them, result[0..k]. result[0] is the exact-match state, and result[j] tracks the pattern prefixes that match with at most j errors.
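To see the fragment in context, here is a self-contained sketch of a fuzzy search built around the same update step. It is an illustration under stated assumptions, not the presenter's full code: the name FuzzyMatchEnds is hypothetical, input is assumed to be ASCII, the pattern is assumed to be at most 62 characters long, and result[j] starts with its j + 1 low bits set (rather than 1) so that matches needing deletions at the very start of the text are also found. It reports the index where each approximate match ends, since the start of a fuzzy match is not unique.

```csharp
using System;
using System.Collections.Generic;

public static class BitapFuzzy
{
    // Returns every index in text at which some substring ending there is
    // within Levenshtein distance k of pattern.
    public static List<int> FuzzyMatchEnds(string text, string pattern, int k)
    {
        int m = pattern.Length;
        long[] patternMask = new long[128];
        for (int i = 0; i < m; i++)
        {
            patternMask[pattern[i]] |= 1L << i;
        }

        // result[j]: bit p set means pattern[0..p) matches a suffix of the text
        // read so far with at most j errors. Before any text is read, a prefix
        // of length p <= j is reachable by deleting its p characters.
        long[] result = new long[k + 1];
        for (int j = 0; j <= k; j++)
        {
            result[j] = (1L << (j + 1)) - 1;
        }

        List<int> ends = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            long mask = patternMask[text[i]];

            long current = result[0];                  // level j - 1 before text[i]
            result[0] = ((result[0] & mask) << 1) | 1; // exact-match level
            long previous = result[0];                 // level j - 1 after text[i]

            for (int j = 1; j <= k; j++)
            {
                long matched = result[j] & mask;
                long insertion = current | (matched << 1);
                long substitution = (current | matched) << 1;
                long deletion = (previous | matched) << 1;
                current = result[j];
                result[j] = insertion | substitution | deletion | 1;
                previous = result[j];
            }

            // Bit m set at level k: the whole pattern matched with <= k errors.
            if ((result[k] & (1L << m)) != 0)
            {
                ends.Add(i);
            }
        }

        return ends;
    }

    public static void Main()
    {
        // "axc" differs from "abc" by one substitution.
        Console.WriteLine(string.Join(", ", FuzzyMatchEnds("axc", "abc", 1))); // 2
    }
}
```

With k = 0 the inner loop never runs and the function degenerates to the exact search.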
Shift-and:
Uses bitwise & and 1s for matches
More intuitive and easier to understand
Needs an extra result |= 1 after each shift
Shift-or:
Uses bitwise | and 0s for matches
A bit faster, because the extra |= 1 is not needed
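For comparison with the shift-and code, the exact search can be sketched in its shift-or form. This is not taken from the slides: the class name BitapShiftOr is hypothetical, and the sketch assumes ASCII input and a pattern of 1 to 64 characters. Every bit is inverted relative to shift-and, so a 0 marks a match and the shift itself brings in the 0 that shift-and has to add by hand.

```csharp
using System;
using System.Collections.Generic;

public static class BitapShiftOr
{
    // Returns the start index of every exact occurrence of pattern in text.
    public static List<int> ExactMatch(string text, string pattern)
    {
        int m = pattern.Length;

        // All ones by default; bit i is cleared when pattern[i] == c.
        long[] mask = new long[128];
        for (int c = 0; c < mask.Length; c++)
        {
            mask[c] = ~0L;
        }

        for (int i = 0; i < m; i++)
        {
            mask[pattern[i]] &= ~(1L << i);
        }

        long state = ~0L;              // all ones: nothing matched yet
        long matchBit = 1L << (m - 1); // a zero here means a full match
        List<int> indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            // The left shift brings in a 0 at bit 0 (the "empty prefix"),
            // and the OR sets every bit that text[index] does not extend.
            state = (state << 1) | mask[text[index]];
            if ((state & matchBit) == 0)
            {
                indexes.Add(index - m + 1);
            }
        }

        return indexes;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ExactMatch("cbdabababc", "ababc"))); // 5
    }
}
```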
Additional information
Original paper of Baeza-Yates and Gonnet: http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f08/Artikler/09/Baeza92.pdf
Google implementation using bitap: https://code.google.com/p/google-diff-match-patch
Levenshtein algorithm:
http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance