Some string optimization tips. Haiyang Yu. 2016-1-31. Outline: Background; Tips for dealing with strings.


1 Some string optimization tips Haiyang Yu

2 2016-1-31 http://datamining.xmu.edu.cn Outline: Background; Tips for dealing with strings

3 Background: Find all pairs from two sets that are similar. Applications: data cleaning, query relaxation, spellchecking. Examples: “PO BOX 23, Main St.” vs. “P.O. Box 23, Main St”; “information” vs. “imformation”.

4 Background: Find similar pairs. We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}. First find some candidate pairs, then verify these pairs. [Figure: candidate pairs {…}, each verified as Yes or No]
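The candidate-then-verify flow can be sketched as below. This is only an illustration under assumptions the slides do not state: similarity is taken to be edit distance at most `tau`, and the candidate step is a plain length filter standing in for whatever filter the real algorithm uses. The names `edit_distance` and `similar_pairs` are hypothetical.

```python
from itertools import product

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def similar_pairs(left, right, tau):
    """Candidate step (assumed): lengths differ by at most tau.
    Verify step (assumed): edit distance is at most tau."""
    candidates = [(r, s) for r, s in product(left, right)
                  if abs(len(r) - len(s)) <= tau]
    return [(r, s) for r, s in candidates if edit_distance(r, s) <= tau]

# The two sets from the slide, truncated to the strings it shows.
pairs = similar_pairs(["vldb", "sigmod"], ["pvldb", "icde"], tau=1)
```

The cheap filter discards pairs that cannot match, so the expensive verification only runs on the survivors.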

5 Optimization tips: Do whatever you can to improve your algorithm’s time performance. Some tips seem simple, but they are important.

6 Optimization tips: Inverted index. Suppose we have some strings. [Figure: the example strings and their inverted index]

7 Optimization tips: How do we get “kau”? Sub3_1 = S3.subString(0,3), then map it to S3, so we now have a map entry Sub3_1 -> 3. Record the position information, then calculate the hash code: Hash(“kau”) = (((‘k’*131+‘a’)*131+‘u’)*…). This is too expensive.
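A minimal sketch of an inverted index keyed by the hash of a segment, computed from its position rather than from a copied substring. The multiplier 131 comes from the slide; the 64-bit mask, the function names, and the first two example strings are assumptions made for illustration.

```python
from collections import defaultdict

BASE = 131             # multiplier used on the slide
MASK = (1 << 64) - 1   # assumption: emulate 64-bit unsigned overflow

def segment_hash(s, start, length):
    """Polynomial hash of s[start:start+length], read in place by index."""
    h = 0
    for i in range(start, start + length):
        h = (h * BASE + ord(s[i])) & MASK
    return h

def build_inverted_index(strings, start, length):
    """Map the hash of each string's segment to the 1-based ids holding it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        if len(s) >= start + length:
            index[segment_hash(s, start, length)].append(sid)
    return index

# Only the third string appears on the slides; the first two are made up.
strings = ["vankatesh", "avataresha", "kaushic chaduri"]
index = build_inverted_index(strings, start=0, length=3)
ids_for_kau = index[segment_hash("kau", 0, 3)]
```

Two different segments can share a hash value, so ids found this way are still only candidates and need verification.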

8 Optimization tips: Length information. When we are dealing with s3 = “kaushic chaduri”, we split it into several segments whose lengths are |s3|/(tau+1) or |s3|/(tau+1)+1 (integer division). Then we get the substrings {“kau”, “shic”, “_cha”, “duri”}, where “_” marks the space.
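The split can be sketched as follows. Two things are assumptions read off the slide's example rather than stated in it: tau = 3 (four segments), and the longer segments are placed at the end. The function name `partition` is hypothetical.

```python
def partition(length, tau):
    """Return (start, length) pairs for tau + 1 contiguous segments.

    Each segment has length // (tau + 1) characters, and the last
    length % (tau + 1) segments get one extra character.
    """
    parts = tau + 1
    base, extra = divmod(length, parts)
    layout = []
    start = 0
    for i in range(parts):
        seg_len = base + (1 if i >= parts - extra else 0)
        layout.append((start, seg_len))
        start += seg_len
    return layout

s3 = "kaushic chaduri"
layout = partition(len(s3), tau=3)
segments = [s3[start:start + n] for start, n in layout]
```

For the 15-character string this yields one segment of length 3 followed by three of length 4, matching the substrings listed on the slide.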

9 Optimization tips: Length information. So do we calculate |si|/(tau+1) every time we use it? No. Even though it does not seem that expensive, we should do our best to improve time performance if RAM allows. We store the position information instead. Let L[length][part] store it: L[15][0].start = 0, L[15][0].length = 3, …
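One way to precompute such a table is sketched below. The table name `L` and the fields `start` and `length` follow the slide; `build_table`, the maximum length of 32, tau = 3, and the placement of the longer segments at the end are assumptions.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    start: int
    length: int

def build_table(max_len, tau):
    """Precompute the segment layout for every string length up to max_len."""
    parts = tau + 1
    table = []
    for n in range(max_len + 1):
        base, extra = divmod(n, parts)
        row, start = [], 0
        for i in range(parts):
            seg_len = base + (1 if i >= parts - extra else 0)
            row.append(Segment(start, seg_len))
            start += seg_len
        table.append(row)
    return table

# Built once up front; afterwards a layout lookup is two index operations.
L = build_table(max_len=32, tau=3)
first = L[15][0]
```

The table trades a small amount of memory for skipping the division on every access.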

10 Optimization tips: Repetitive sequence. Some algorithms split a string into overlapping sequences. For example, q-grams split S = “kaushic” into {“kaush”, “aushi”, “ushic”}. If you use the substring function, you have to load from RAM three times to get the substrings. But if you use the position information and hash code, you can load it just once.

11 Optimization tips: Repetitive sequence. So we calculate Hash(“kaush”) = ((((‘k’*131+‘a’)*131+‘u’)*…). When we calculate the next hash code Hash(“aushi”), we need not recalculate the part for “aush” because we have calculated it before, so Hash(“aushi”) = (Hash(“kaush”) - ‘k’*131^4)*131 + ‘i’
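The update rule can be sketched as a rolling hash over all q-grams. The multiplier 131 is from the slide; the 64-bit mask and the function names are assumptions. Note that the term removed at each step is the leading character times 131^(q-1).

```python
BASE = 131             # multiplier used on the slide
MASK = (1 << 64) - 1   # assumption: emulate 64-bit unsigned overflow

def direct_hash(s):
    """Polynomial hash computed from scratch, for comparison."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) & MASK
    return h

def rolling_qgram_hashes(s, q):
    """Hash every q-gram of s, reusing the previous hash at each step."""
    if len(s) < q:
        return []
    top = pow(BASE, q - 1, MASK + 1)   # weight of the leading character
    h = direct_hash(s[:q])
    hashes = [h]
    for i in range(q, len(s)):
        # Remove the outgoing character, shift, then add the incoming one.
        h = ((h - ord(s[i - q]) * top) * BASE + ord(s[i])) & MASK
        hashes.append(h)
    return hashes

hashes = rolling_qgram_hashes("kaushic", q=5)
```

Each step costs a constant number of operations regardless of q, whereas hashing every q-gram from scratch costs q operations per q-gram.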

12 Optimization tips: Sometimes you have done whatever you can to improve your code, but you still cannot beat the original code written by the authors. Why? Maybe you need to read the experiment section closely, for example to see which compiler flags were used.

13 Optimization tips: What does the “-O3 flag” mean? It is an optimization level for the compiler. The levels go O0 -> O1 -> O2 -> O3, where O3 is the highest of these optimization levels.

14 Email: yhycai@gmail.com

