Efficient Approximate Entity Extraction with Edit Distance Constraints Wei Wang 1, Chuan Xiao 1, Xuemin Lin 1 and Chengqi Zhang 2 1 University of New South.

Presentation transcript:

Efficient Approximate Entity Extraction with Edit Distance Constraints Wei Wang 1, Chuan Xiao 1, Xuemin Lin 1 and Chengqi Zhang 2 1 University of New South Wales and NICTA 2 University of Technology, Sydney

2 Named Entity Recognition
- Dictionary-based NER
- Dictionary of Entities: Isaac Newton, Sigmund Freud, English, Austrian, physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...
- Documents
  1. Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
  2. Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.

3 Approximate Entity Extraction
- What if data are not cleaned or standardized, due to typos, multiple representations, etc.?
- Example (multiple representations that should match the same entity): al qaeda, al qaida, al-qaeda, al-qa'ida
- Using similarity measures
  - token-based measures, e.g. Jaccard
  - x = {al, qaeda}, y = {al, qaida}: J(x, y) = 1/3 = 0.33
  - If we set the threshold to 0.33, it works well for entities with several tokens, but {al, qaeda} will also match {al, gore}!
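The Jaccard numbers on this slide can be reproduced with a minimal Python sketch. The function name `jaccard` and the variable names are illustrative and do not come from the slides.

```python
def jaccard(x, y):
    """Jaccard similarity of two token collections: |x & y| / |x | y|."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


# Two spellings of the same entity share one token out of three.
same_entity = jaccard({"al", "qaeda"}, {"al", "qaida"})
# An unrelated entity also shares one token out of three.
different_entity = jaccard({"al", "qaeda"}, {"al", "gore"})
print(same_entity, different_entity)
```

Both pairs score the same, so no token-level threshold can accept the first pair while rejecting the second.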

4 Using Edit Distance Constraints
- Using string-based measures: edit distance
- Problem Definition: Given a document R and a dictionary E of entities, the task of approximate entity extraction with edit distance threshold d is to find all substrings of R that are within edit distance d of one of the entities in E:
  $\{ (R[i..j], E_k) \mid E_k \in E,\ ed(R[i..j], E_k) \le d \}$
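As a point of reference, the problem definition can be implemented directly by brute force. The sketch below is not the algorithm proposed in the talk. It is a naive baseline that assumes unit-cost insertions, deletions and substitutions, and the names `edit_distance` and `extract_naive` are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def extract_naive(document, entities, d):
    """Return all (i, j, entity) with ed(document[i:j], entity) <= d."""
    results = []
    for i in range(len(document)):
        for e in entities:
            # A match can differ in length from the entity by at most d.
            for length in range(max(1, len(e) - d), len(e) + d + 1):
                j = i + length
                if j > len(document):
                    break
                if edit_distance(document[i:j], e) <= d:
                    results.append((i, j, e))
    return results


matches = extract_naive("the kaepital of", ["capital"], 2)
```

Here `matches` contains `(4, 12, "capital")` for the substring `kaepital`, along with overlapping substrings that are also within distance 2.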

5 Previous Approaches
- q-gram based method
  - count filtering: at least LB(s,t) common q-grams, where LB(s,t) = max(|s|, |t|) - q + 1 - q*d (at most q*d q-grams are destroyed)
  - position filtering: positions of common q-grams should be within d
  - length filtering: | len(s) - len(t) | ≤ d
- Steps
  - index the q-grams for the entities
  - probe the index for the q-grams of each substring (query) of the document to form candidates
  - verify the candidates
- Example, q = 3: the q-grams of Rhode_Island are {Rho, hod, ode, de_, e_I, _Is, Isl, sla, lan, and}
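The count and length filters above can be sketched in a few lines of Python. The helper names (`qgrams`, `count_lower_bound`, `length_filter`) are illustrative. The lower bound is the formula from the slide, and q-grams are taken without padding, matching the Rhode_Island example.

```python
def qgrams(s, q):
    """All overlapping substrings of length q, in order (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s, t, q, d):
    """LB(s,t) = max(|s|, |t|) - q + 1 - q*d, as stated on the slide."""
    return max(len(s), len(t)) - q + 1 - q * d


def length_filter(s, t, d):
    """Strings within edit distance d differ in length by at most d."""
    return abs(len(s) - len(t)) <= d


grams = qgrams("Rhode_Island", 3)
lb_d2 = count_lower_bound("physicist", "physicist", 3, 2)
lb_d3 = count_lower_bound("physicist", "physicist", 3, 3)
```

For a 9-character entity with q = 3, the bound is 1 at d = 2 and already negative at d = 3, at which point count filtering prunes nothing.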

6 Drawbacks of q-gram Based Methods
- Entities are short: we have to use small q to ensure the lower bound of matching q-grams is positive
- Short q-grams result in poor performance
  - short q-grams are frequent, which gives long inverted lists
  - the lower bound is low for short entities, which gives a large candidate size
- It has to try all the queries with length from L_min - d to L_max + d at every starting position.
- Document
  1. Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
- Dictionary (L_min = 9, L_max = 43)
  1. physicist
  2. mathematician
  3. Philosophiæ Naturalis Principia Mathematica

7 FastSS Algorithm [T. Bocek et al. 2007]
- Basic Idea (Neighborhood Generation): generate the variants for each entity and query by enumerating edit operations at any possible position
- Steps
  - enumerate at most d deletions for each entity; the resulting strings are called the d-variant family and are inserted into an inverted index
  - generate the d-variant family for each query, probe the index to form candidates, and then verify them
- Example, d = 1
  - e = qaeda, q = qaida
  - V_e = {qaeda, aeda, qeda, qada, qaea, qaed}
  - V_q = {qaida, aida, qida, qada, qaia, qaid}
- Problem: the size of the d-variant family for each entity (query) is O(|s|^d); too many variants when entities are long or d is large!
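The deletion neighborhood in the example can be generated with a short Python sketch. It follows the description on this slide (at most d deletions at any positions) and is not taken from the FastSS implementation; the name `variant_family` is illustrative.

```python
from itertools import combinations


def variant_family(s, d):
    """All strings obtained from s by deleting at most d characters."""
    variants = set()
    for k in range(min(d, len(s)) + 1):
        # Choose which positions to keep; the rest are deleted.
        for keep in combinations(range(len(s)), len(s) - k):
            variants.add("".join(s[i] for i in keep))
    return variants


v_e = variant_family("qaeda", 1)
v_q = variant_family("qaida", 1)
shared = v_e & v_q
```

The two families intersect in `qada`, so the pair becomes a candidate. A shared variant only signals a possible match, which is why the verification step is still needed.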

8 Partitioning Scheme
- How to reduce the number of variants?
  - immediate solution: divide an entity (query) into several partitions
  - generate d-variants within each partition only; this is guaranteed not to miss any result
- Still too many variants?
  - pigeon-hole principle: divide each entity (query) into k = ceil[(d+1)/2] partitions
  - if we consider shifting and scaling, there exists an entity partition and a query partition such that their edit distance is within 1
  - generate the 1-variant family for each partition
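The counting step behind the pigeon-hole argument can be spelled out as follows. This is only a sketch of the counting, assuming each of the at most d edit operations is charged to exactly one of the k partitions; the shifting and scaling that align entity and query partitions are not modeled. The symbols c_1, ..., c_k are introduced here for illustration.

```latex
Let $c_i \ge 0$ be the number of edit operations charged to partition $i$,
with $\sum_{i=1}^{k} c_i \le d$ and $k = \lceil (d+1)/2 \rceil$.
Suppose every partition had $c_i \ge 2$. Then
\[
  \sum_{i=1}^{k} c_i \;\ge\; 2k
  \;=\; 2\left\lceil \frac{d+1}{2} \right\rceil
  \;\ge\; d + 1 \;>\; d ,
\]
a contradiction. Hence at least one partition has $c_i \le 1$.
```

For instance, d = 3 gives k = 2 and d = 2 also gives k = 2.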

Partitioning Scheme
- divide each entity (query) into k = ceil[(d+1)/2] partitions
- shift within the range of [-d, d]
- scale within the range of [-2, 2] (it can be proved that 2 is enough)
- shifting and scaling are only needed on entities
- special cases
  - first partition (always starts from the first character): only need to consider scaling within [-2, 2]
  - last partition (always ends with the last character): only need to consider the same amount of shifting and scaling within [-d, d]

10 Partitioning Scheme - Example  Example, d = 3 e = abcdefgh q = axxbcdefgyh  Partitioning k = 2 P e = { ;, ; ; ; ; ; ; ; ; ; } P q = { ; }  Generating 1-variants V {defgh} and V {defgyh} share a common variant ‘defgh’, so this candidate will be identified represented in the form of

11 Prefix Pruning
- What if a partition is still quite long?
  - still many 1-variants
  - solution: generate the 1-variant family on the prefix only!
- Prefix Pruning: If a partition is longer than a threshold l, we only generate the 1-variant family on its l-prefix.
- Example, l = 5
  - P = abcdefg; generate the 1-variant family on its 5-prefix
  - P[1..5] = abcde
  - V_P[1..5] = {abcde, bcde, acde, abde, abce, abcd}
- Space complexity (# of variants generated)
  - FastSS: O(|s|^d)
  - after partitioning and prefix pruning: O(l * d^2)
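Prefix pruning for a single partition can be sketched as below. The 1-variant family is the prefix itself plus one string per deleted character. The function names are illustrative and not from the slides.

```python
def one_variants(s):
    """s itself plus every string obtained by deleting one character."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def prefix_variants(partition, l):
    """1-variant family of the l-prefix; shorter partitions are used whole."""
    return one_variants(partition[:l])


family = prefix_variants("abcdefg", 5)
print(sorted(family))
```

With l = 5 the family has six members regardless of how long the partition is, which is what bounds the index size per partition.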

12 NGPP Algorithm
- Neighborhood Generation + Partitioning + Prefix Pruning
- Balance between variant size and selectivity: different schemes to deal with short and long entities
- Index short and long entities
  - short: for entities which are shorter than k*l+d, we index the d-variant family on the l-prefix (prefix pruning only)
  - long: for entities which are no shorter than k*l, we first divide them into k partitions, and index the 1-variant family on the l-prefix of each partition (partitioning + prefix pruning)
- Scan documents
  - scan each starting position
  - enumerate the query length from L_min - d to l
  - generate its d-variant family, search for short entities
  - generate its 1-variant family, search for long entities

13 NGPP Example
- d = 2, l = 4, so short = 8
- Entities: e_1 = 'Providence' (long), e_2 = 'capital' (short)
- Document: Prowidnce is the kaepital of Rhode Island.
- Index
  - generate 1-variant family (long entity e_1), partitions: pr, pro, prov, provi, provid; vidence, idence, dence, ence, nce
  - generate d-variant family (short entity e_2): capital
- Figure: query prefixes Prow, rowi, owid, ... yield a 1-variant match with e_1 Providence; kaep yields a d-variant match with e_2 capital

14 Experiment Settings
- Algorithms: NGPP, FastSS, q-gram based method
- Measures: number of variants, candidate size, running time
- Datasets

  dataset   part                                 # of records   avg. string length
  DBLP      DICT (author)                        108k           14.5
  DBLP      DOC (author, title)                  87k            104.7
  GENE      DICT (gene/protein name)             381k           22.4
  GENE      DOC (author, title, abstract)        10k            870.0
  CONLL     DICT (person, location)              8k             12.6
  CONLL     DOC (news article)                   19k            819.0

15 Experiment Results
- NGPP vs FastSS (DBLP; d = 2)

  algorithm       # of variants   candidate size   running time
  FastSS          7500M           2.1M             2643s
  NGPP (l = 10)   150M            11M              40s

Experiment Results
- NGPP vs q-gram based method (DBLP; d = 1, 2, 3)
- [Figure panels: Candidate Size; Running Time]

Conclusion
- Contributions
  - an efficient algorithm for approximate entity extraction with edit distance constraints, based on neighborhood generation
  - two techniques to reduce the number of variants generated, as well as running time: partitioning and prefix pruning
- Future work
  - approximate multiple pattern matching
  - other similarity measures, e.g., the function used in DNA/protein sequence alignment

18 Thank you! Questions?

19 Related Work
- neighborhood generation approaches
  - E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345-374.
  - T. Bocek, E. Hunt, B. Stiller. Fast Similarity Search in Large Dictionaries. Technical Report ifi, Department of Informatics, University of Zurich, April 2007.
- q-gram based approaches
  - L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB.
  - C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933-944.
- alternative: use vgrams instead of q-grams
  - C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB.
  - X. Yang, B. Wang, and C. Li. Cost-based variable length gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.