Approximate string matching
Evlogi Hristov
Telerik Corporation
Student at Telerik Academy
1. Levenshtein distance
2. Bitap overview
3. Bitap exact search
4. Bitap fuzzy search
5. Additional information

Edit distance
Edit distance: the minimum number of primitive operations needed to convert one string into an exact match of another. The primitive operations are:
  insertion: cot → coat
  deletion: coat → cot
  substitution: coat → cost

Example:
1. Set n to be the length of s = "GUMBO" and m to be the length of t = "GAMBOL". If n = 0, return m and exit. If m = 0, return n and exit.
2. Initialize matrix M[m + 1, n + 1]. Columns 1..n correspond to the characters of s (G, U, M, B, O) and rows 1..m to the characters of t (G, A, M, B, O, L). Fill row 0 with 0..n and column 0 with 0..m.
3. Examine each character of s (i from 1 to n).
4. Examine each character of t (j from 1 to m).
5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.
6. Set cell M[j, i] equal to the minimum of:
   a. The cell immediately above plus 1: M[j-1, i] + 1
   b. The cell immediately to the left plus 1: M[j, i-1] + 1
   c. The cell diagonally above and to the left plus the cost: M[j-1, i-1] + cost
7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the bottom-right cell M[m, n]. For "GUMBO" and "GAMBOL" it is 2 (substitute U → A, then append L).
// Requires: using System;
private int Levenshtein(string source, string target)
{
    // If one string is empty, the distance is the length of the other one.
    if (string.IsNullOrEmpty(source))
    {
        return string.IsNullOrEmpty(target) ? 0 : target.Length;
    }

    if (string.IsNullOrEmpty(target))
    {
        return source.Length;
    }

    // dist[i, j] = distance between the first i characters of source
    // and the first j characters of target.
    int[,] dist = new int[source.Length + 1, target.Length + 1];
    int min1, min2, min3, cost;

    // First column: i deletions turn a prefix of source into the empty string.
    for (int i = 0; i < dist.GetLength(0); i += 1)
    {
        dist[i, 0] = i;
    }

    // First row: i insertions turn the empty string into a prefix of target.
    for (int i = 0; i < dist.GetLength(1); i += 1)
    {
        dist[0, i] = i;
    }

    for (int i = 1; i < dist.GetLength(0); i++)
    {
        for (int j = 1; j < dist.GetLength(1); j++)
        {
            cost = Convert.ToInt32(!(source[i - 1] == target[j - 1]));
            min1 = dist[i - 1, j] + 1;        // deletion
            min2 = dist[i, j - 1] + 1;        // insertion
            min3 = dist[i - 1, j - 1] + cost; // substitution or match
            dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
        }
    }

    // The bottom-right cell holds the distance between the full strings.
    return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
}
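Each cell of the matrix depends only on the row above it and the cell to its left, so the full matrix is not strictly needed. The following is a minimal sketch of that idea, not the author's code: the class name LevenshteinSketch and the method name Distance are made up for illustration, and it keeps two rows instead of the whole matrix.

```csharp
using System;

public static class LevenshteinSketch
{
    // Hypothetical helper (not from the slides): same recurrence as the
    // full-matrix version, but only the previous and current rows are stored.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: j insertions turn the empty string into the first j characters of target.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // i deletions turn the first i characters of source into ""
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // The finished row becomes the "row above" for the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Distance("GUMBO", "GAMBOL")); // 2
        Console.WriteLine(Distance("cot", "coat"));     // 1 (insertion)
        Console.WriteLine(Distance("coat", "cost"));    // 1 (substitution)
    }
}
```

Memory drops from (n + 1)(m + 1) cells to 2(m + 1), at the cost of no longer being able to walk the matrix back to recover the actual edit sequence.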
Bitap algorithm (shift-or / shift-and)
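Bitap is also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
It is an approximate string matching algorithm.
Approximate equality is defined in terms of Levenshtein distance.
It is often used for fuzzy search without indexing.
It does most of the work with bitwise operations.
It runs in O(mn) operations (m = pattern length, n = text length), no matter the structure of the text or the pattern. When the pattern fits in one machine word, each text character is handled with a constant number of word operations.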
// Requires: using System.Collections.Generic;
public static List<int> ExactMatch(string text, string pattern)
{
    // One bit mask per character: bit i is set when pattern[i] is that character.
    // Assumes an ASCII pattern shorter than 63 characters, so the state fits in a long.
    long[] alphabet = new long[128]; // ASCII range (0-127)
    for (int i = 0; i < pattern.Length; ++i)
    {
        int letter = (int)pattern[i];
        alphabet[letter] = alphabet[letter] | (1L << i);
    }

    long result = 1; // bit 0: the empty prefix always matches
    List<int> indexes = new List<int>();
    for (int index = 0; index < text.Length; index++)
    {
        // Keep only the prefixes that the current character extends.
        // Characters outside the ASCII table cannot match anything.
        long mask = text[index] < alphabet.Length ? alphabet[text[index]] : 0;
        result &= mask; // if no prefix is extended => result = 0
        result = (result << 1) + 1;

        // Bit pattern.Length set means the whole pattern ends at this index.
        if ((result & (1L << pattern.Length)) > 0)
        {
            indexes.Add(index - pattern.Length + 1);
        }
    }

    return indexes;
}
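Example: text = "cbdabababc", pattern = "ababc"

Pattern masks (bit i is set when pattern[i] is the character; bits are written highest first, so the columns read c b a b a):
  alphabet[a] = 00101 = 5
  alphabet[b] = 01010 = 10
  alphabet[c] = 10000 = 16
  alphabet[d] = 00000 = 0

State after each text character (res starts at 000001):
  i  text[i]  last characters read  res
  0  c        c                     000001
  1  b        cb                    000001
  2  d        cbd                   000001
  3  a        cbda                  000011
  4  b        cbdab                 000101
  5  a        bdaba                 001011
  6  b        dabab                 010101
  7  a        ababa                 001011
  8  b        babab                 010101
  9  c        ababc                 100001

At i = 9 bit 5 (the pattern length) is set, so the pattern ends there and the match starts at index 9 - 5 + 1 = 5.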
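The mask values and the state sequence of this example can be reproduced with a few lines of code. The sketch below is an illustration only: BitapTrace, BuildMasks and States are hypothetical names, and a Dictionary replaces the 128-entry ASCII table so that any character can be looked up.

```csharp
using System;
using System.Collections.Generic;

public static class BitapTrace
{
    // Builds one bit mask per pattern character: bit i is set when pattern[i] == c.
    public static Dictionary<char, long> BuildMasks(string pattern)
    {
        var masks = new Dictionary<char, long>();
        for (int i = 0; i < pattern.Length; i++)
        {
            // TryGetValue leaves mask at 0 when the character has no entry yet.
            masks.TryGetValue(pattern[i], out long mask);
            masks[pattern[i]] = mask | (1L << i);
        }

        return masks;
    }

    // Returns the shift-and state after each text character.
    public static List<long> States(string text, string pattern)
    {
        Dictionary<char, long> masks = BuildMasks(pattern);
        var states = new List<long>();
        long result = 1; // bit 0: the empty prefix always matches
        foreach (char c in text)
        {
            masks.TryGetValue(c, out long mask); // 0 for characters not in the pattern
            result = ((result & mask) << 1) | 1;
            states.Add(result);
        }

        return states;
    }

    public static void Main()
    {
        // Prints one 6-bit state per text character, ending with 100001.
        foreach (long state in States("cbdabababc", "ababc"))
        {
            Console.WriteLine(Convert.ToString(state, 2).PadLeft(6, '0'));
        }
    }
}
```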
...
long[] result = new long[k + 1];
for (int i = 0; i <= k; i++)
{
    result[i] = 1;
}
...
for (int j = 1; j <= k; ++j)
{
    // Three operations of the Levenshtein distance.
    // current  holds result[j - 1] as it was before this text character,
    // previous holds result[j - 1] as updated for this text character.
    long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
    long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
    // Substitution consumes the text character, so it must start from the
    // old state (current); using previous here would repeat the deletion.
    long substitution = (current | (result[j] & patternMask[text[i]])) << 1;
    current = result[j];
    result[j] = substitution | insertion | deletion | 1;
    previous = result[j];
}
...

Instead of a single bit vector result that changes over the length of the text, we now keep k + 1 bit vectors result[0..k], where result[j] tracks the pattern prefixes that match with at most j errors.
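Since the fragment above elides the surrounding loop, here is a complete, self-contained sketch of a shift-and fuzzy search in the same bit convention. It is one possible way to fill in the missing parts, not the author's code: the names BitapFuzzy and Search are hypothetical, a Dictionary stands in for patternMask, and it reports the index where each match ends, because with errors allowed the start index is ambiguous. It also initializes result[j] with bits 0..j set (a prefix of length j matches the empty text with j deletions), which differs from the all-ones-in-bit-0 start used above.

```csharp
using System;
using System.Collections.Generic;

public static class BitapFuzzy
{
    // Hypothetical sketch: returns the indexes in text where a substring ends
    // that matches pattern with at most k errors (Levenshtein distance).
    public static List<int> Search(string text, string pattern, int k)
    {
        int m = pattern.Length;
        if (m == 0 || m > 62)
        {
            throw new ArgumentException("Pattern length must be between 1 and 62.");
        }

        if (k < 0 || k >= m)
        {
            throw new ArgumentOutOfRangeException(nameof(k), "Requires 0 <= k < pattern length.");
        }

        // Bit i of a mask is set when pattern[i] is that character.
        var masks = new Dictionary<char, long>();
        for (int i = 0; i < m; i++)
        {
            masks.TryGetValue(pattern[i], out long bits);
            masks[pattern[i]] = bits | (1L << i);
        }

        // result[j]: bit i is set when a prefix of length i matches with <= j errors.
        long[] result = new long[k + 1];
        for (int j = 0; j <= k; j++)
        {
            result[j] = (1L << (j + 1)) - 1;
        }

        long matchBit = 1L << m;
        var ends = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            masks.TryGetValue(text[i], out long mask); // 0 if not in the pattern

            long oldLower = result[0];                 // level j - 1 before this character
            result[0] = ((result[0] & mask) << 1) | 1; // exact matching level
            long newLower = result[0];                 // level j - 1 after this character

            for (int j = 1; j <= k; j++)
            {
                long match = (result[j] & mask) << 1; // character matches, no new error
                long insertion = oldLower;            // extra character in the text
                long substitution = oldLower << 1;    // text character replaces a pattern character
                long deletion = newLower << 1;        // pattern character missing from the text
                oldLower = result[j];
                result[j] = match | insertion | substitution | deletion | 1;
                newLower = result[j];
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }

        return ends;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Search("coat", "cot", 1)));       // 1,2,3
        Console.WriteLine(string.Join(",", Search("cat", "cot", 1)));        // 2
        Console.WriteLine(string.Join(",", Search("cbdabababc", "ababc", 0))); // 9
    }
}
```

For "coat" and the pattern "cot" with k = 1, the three reported end positions correspond to "co" (one deletion), "coa" (one substitution) and "coat" (one insertion).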
Shift-and:
  Uses bitwise & and 1s for matches.
  More intuitive and easier to understand.
  Needs an extra result |= 1 on every step.
Shift-or:
  Uses bitwise | and 0s for matches.
  A bit faster.
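For comparison with the shift-and exact search, a shift-or version might look like the sketch below. It is an illustration under assumptions, not code from the slides: ShiftOrSketch is a hypothetical name, a Dictionary replaces the ASCII table, and all bits are inverted, so a 0 bit means "this prefix matches". The left shift brings in a 0 at bit 0 on its own, which is why no extra | 1 step is needed.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftOrSketch
{
    // Hypothetical sketch: exact search where a 0 bit marks a matching prefix.
    public static List<int> ExactMatch(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0 || m > 64)
        {
            throw new ArgumentException("Pattern length must be between 1 and 64.");
        }

        // Bit i of a mask is 0 when pattern[i] is that character, 1 otherwise.
        var masks = new Dictionary<char, long>();
        for (int i = 0; i < m; i++)
        {
            if (!masks.TryGetValue(pattern[i], out long mask))
            {
                mask = ~0L; // no position matches yet
            }

            masks[pattern[i]] = mask & ~(1L << i);
        }

        long state = ~0L;              // all ones: nothing matched so far
        long matchBit = 1L << (m - 1); // bit m - 1 cleared means a full match
        var indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            if (!masks.TryGetValue(text[index], out long mask))
            {
                mask = ~0L; // character does not occur in the pattern
            }

            // The shift inserts a 0 at bit 0 (empty prefix); the OR clears progress on mismatch.
            state = (state << 1) | mask;
            if ((state & matchBit) == 0)
            {
                indexes.Add(index - m + 1);
            }
        }

        return indexes;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", ExactMatch("cbdabababc", "ababc"))); // 5
        Console.WriteLine(string.Join(",", ExactMatch("aaa", "aa")));           // 0,1
    }
}
```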
Additional information
Original paper of Baeza-Yates and Gonnet: _f08/Artikler/09/Baeza92.pdf
Google implementation using bitap:
Levenshtein algorithm: memory-efficient-Levenshtein-algorithm
/Strings/Levenshtein_distance