EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad.


INTRODUCTION
AME (approximate member extraction): how to efficiently extract the substrings of a text document that approximately match some string in a given dictionary.
Applications: named entity recognition, data cleaning.
Two steps:
Filtration: filter out the dictionary strings that are very different from the substring.
Verification: verify each remaining candidate string to decide whether the substring should be extracted.

INTRODUCTION: AN EXAMPLE
A dictionary of strings we are interested in, e.g. conference names, author names, etc.
We are going to locate their "approximate appearances" in a series of documents.

PROBLEM DEFINITION
Given a dictionary R of strings and a similarity threshold δ ∈ [0,1], a query M is submitted. Here, M is a relatively long string (e.g. a text file).
The task of AME is to extract every substring m of M such that there exists some r ∈ R satisfying Sim(m,r) ≥ δ.
r is a piece of evidence for m.
Sim() is a function measuring the similarity of two strings.
An example of a similarity measure: Jaccard similarity.
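To make Sim() concrete, here is a minimal sketch of Jaccard similarity over token sets. The unweighted form is the standard |A ∩ B| / |A ∪ B|; the weighted form replaces set sizes with summed token weights, which is an assumption about how the slides combine Jaccard with wt(). The function name `jaccard`, the whitespace tokenization, and the reuse of the token weights from the dictionary example in these slides are illustrative choices.

```python
def jaccard(m_tokens, r_tokens, weights=None):
    """Jaccard similarity of two token sets, optionally weighted per token."""
    m, r = set(m_tokens), set(r_tokens)
    if not m and not r:
        return 1.0  # convention chosen here: two empty sets count as identical

    def wt(tokens):
        # Unweighted case: every token weighs 1, so wt is just the set size.
        if weights is None:
            return float(len(tokens))
        return float(sum(weights[t] for t in tokens))

    return wt(m & r) / wt(m | r)


# Whitespace tokenization is an assumption made for this sketch.
m = "canon eos digital camera".split()
r = "canon eos 5d digital camera".split()
idf = {"5d": 9, "eos": 7, "canon": 2, "camera": 1, "digital": 1}

unweighted = jaccard(m, r)      # 4 shared tokens out of 5 distinct tokens
weighted = jaccard(m, r, idf)   # shared weight 11 out of total weight 20
```

With a threshold δ = 0.8, the unweighted measure would accept this pair while the weighted one would not, because the missing token "5d" carries most of the weight.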

APPROACH
When the input is given, we need to decide whether a substring m should be extracted.
Simply verifying m against every dictionary string may be inefficient.
Pre-pruning followed by post-verification is beneficial.
But should the pruning be running-speed-oriented or filtering-power-oriented? Less time or fewer survivors?
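For reference, the "simple verification on all dictionary strings" baseline can be sketched as below. This is an illustrative brute-force loop, not the method from the slides: the function name `extract_brute_force`, the `max_len` window limit, the unweighted Jaccard measure, and the sample document are all assumptions.

```python
def jaccard(a, b):
    """Unweighted Jaccard similarity of two non-empty token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def extract_brute_force(doc_tokens, dictionary, delta, max_len):
    """Return (start, end, rid) for every token window similar to some r."""
    matches = []
    for start in range(len(doc_tokens)):
        last = min(start + max_len, len(doc_tokens))
        for end in range(start + 1, last + 1):
            window = doc_tokens[start:end]
            # No filtration: every window is verified against every string.
            for rid, r_tokens in dictionary.items():
                if jaccard(window, r_tokens) >= delta:
                    matches.append((start, end, rid))
    return matches


R = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr".split(),
}
doc = "buy canon eos digital camera lens".split()
matches = extract_brute_force(doc, R, delta=0.8, max_len=6)
```

The work grows with (number of windows) × (dictionary size), which is the cost that the filtration step is meant to cut down.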

FILTRATION-VERIFICATION
[Diagram: the input query M and the dictionary R go into Filtration, which outputs potential matches; Verification separates these into true matches and wrong matches.]

FILTRATION-VERIFICATION (CONT’D)
We need to balance the two stages:
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time.
Overall performance = Tf + Tv?

TECHNIQUES
If Sim(m,r) ≥ δ, what do we have?
Existing techniques: wt(Sig(m) ∩ Sig(r)) ≥ τ(m)
Technique used: wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
where Sig(m) is a prefix signature set of string m, and τ(m) = wt(Sig(m)) − (1 − δ)·wt(m).
So the threshold does not remain constant.
Inverted lists are used to count the overlap of signature tokens.
Token weights are IDF (inverse document frequency) weights.
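The threshold and the filter condition above can be sketched as follows. The slides do not spell out how Sig() is built, so this sketch assumes that the signature is the first k tokens of a string in a global order from heaviest to lightest; the names `signature`, `tau`, `passes_filter` and the parameter `k` are hypothetical, and the token weights are the ones from the dictionary example in these slides.

```python
# Token weights in the global order assumed here (heaviest first).
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2,
           "camera": 1, "digital": 1}
RANK = {tok: i for i, tok in enumerate(WEIGHTS)}


def wt(tokens):
    """Total weight of a collection of distinct tokens."""
    return sum(WEIGHTS[t] for t in tokens)


def signature(tokens, k):
    """Assumed definition: the first k tokens in the global order."""
    return set(sorted(set(tokens), key=RANK.__getitem__)[:k])


def tau(tokens, delta, k):
    """tau(m) = wt(Sig(m)) - (1 - delta) * wt(m)."""
    return wt(signature(tokens, k)) - (1 - delta) * wt(set(tokens))


def passes_filter(m, r, delta, k):
    """Check wt(Sig(m) & Sig(r)) >= min(tau(m), tau(r))."""
    overlap = wt(signature(m, k) & signature(r, k))
    return overlap >= min(tau(m, delta, k), tau(r, delta, k))


r1 = "canon eos 5d digital camera".split()
m = "canon eos digital camera".split()
survives = passes_filter(m, r1, delta=0.8, k=3)
```

Because τ depends on the weights of the particular string, each pair (m, r) is compared against its own threshold rather than one fixed value.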

SIGNATURE-BASED INVERTED LISTS (SIL)
Lists indexed by sig-tokens.
Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “Nikon digital slr camera”, r3 = “canon slr” }.
wt(5d, eos, slr, Nikon, canon, camera, digital) = (9, 7, 2, 2, 2, 1, 1)

SIL (CONT’D)
Signature sets of R’s strings:
rid | String | Signature set
1 | “canon eos 5d digital camera” | {“canon”, “eos”, “5d”}
2 | “Nikon digital slr camera” | {“nikon”, “slr”, “camera”}
3 | “canon slr” | {“canon”, “slr”}
SIL:
Signature | String rids
“5d” | (1)
“canon” | (1), (3)
“camera” | (2)
“eos” | (1)
“Nikon” | (2)
“slr” | (2), (3)
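A small sketch of how the SIL above could be built from the signature sets. It assumes signatures are the first k = 3 tokens in the weight order listed for the example dictionary, which reproduces the signature sets shown in the table; `build_sil` and the lowercase tokenization are hypothetical choices.

```python
from collections import defaultdict

# Token weights in the order given for the example dictionary.
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2,
           "camera": 1, "digital": 1}
RANK = {tok: i for i, tok in enumerate(WEIGHTS)}


def signature(tokens, k=3):
    """Assumed definition: the first k tokens in the global weight order."""
    return set(sorted(set(tokens), key=RANK.__getitem__)[:k])


def build_sil(dictionary, k=3):
    """Map each sig-token to the ids of the strings whose signature has it."""
    sil = defaultdict(list)
    for rid in sorted(dictionary):            # ascending ids keep lists sorted
        for tok in signature(dictionary[rid], k):
            sil[tok].append(rid)
    return dict(sil)


R = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr".split(),
}
sil = build_sil(R)
```

Tokens that never appear in a signature (here "digital") get no list at all, which keeps the index smaller than a full inverted index.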

EvSCAN ALGORITHM BY SIL
Compute the overlapping signature weight wt(Sig(m) ∩ Sig(r)).
The best-matched strings are the ones that satisfy the condition wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}.
E.g. m = “canon eos digital camera”, δ = …
[Table: rid | wt(Sig(m) ∩ Sig(r)) | min{τ(m), τ(r)}]
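The scan described above might look like the following sketch. It is not the presenters' code: the δ value of the slide example is not available here, so δ = 0.8 is an assumed value, signatures are assumed to be the first k = 3 tokens in the weight order, every query token is assumed to have a known weight, and `ev_scan` is a hypothetical name.

```python
from collections import defaultdict

WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2,
           "camera": 1, "digital": 1}
RANK = {tok: i for i, tok in enumerate(WEIGHTS)}


def wt(tokens):
    return sum(WEIGHTS[t] for t in tokens)


def signature(tokens, k=3):
    """Assumed definition: the first k tokens in the global weight order."""
    return set(sorted(set(tokens), key=RANK.__getitem__)[:k])


def tau(tokens, delta, k=3):
    return wt(signature(tokens, k)) - (1 - delta) * wt(set(tokens))


def ev_scan(m_tokens, dictionary, sil, delta, k=3):
    """Return the ids of dictionary strings that survive the filter for m."""
    overlap = defaultdict(int)
    for tok in signature(m_tokens, k):
        for rid in sil.get(tok, []):          # walk one inverted list
            overlap[rid] += WEIGHTS[tok]      # accumulate shared sig weight
    tau_m = tau(m_tokens, delta, k)
    # Strings with zero overlap are never visited; this is safe as long as
    # the thresholds are positive, which holds for the data below.
    return sorted(rid for rid, w in overlap.items()
                  if w >= min(tau_m, tau(dictionary[rid], delta, k)))


R = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr".split(),
}
sil = defaultdict(list)
for rid in sorted(R):
    for tok in signature(R[rid]):
        sil[tok].append(rid)

candidates = ev_scan("canon eos digital camera".split(), R, sil, delta=0.8)
```

Only the survivors returned by `ev_scan` would then go on to the verification step with the full similarity function.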

EvITER ALGORITHM: PROGRESSIVE COMPUTATION
Recall that we are checking all substrings.
Some of them are quite similar, which means that they share duplicate computation.
If m has potential evidence r, then m t (m extended by the next token t) is very likely to match r as well.
Formally, let ES(m) be the set of “potential evidence” for m, and let list[t] = { s | s is a dictionary string that contains token t }.
We have ES(m t) ⊆ ES(m) ∪ list[t], where
ES(m) = { r ∈ R | wt(m ∩ Sig(r)) ≥ min{ δ·wt(m), τ(r) } }
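The containment above suggests an incremental update: when m grows by one token t, only the strings in ES(m) ∪ list[t] need to be rechecked. The sketch below illustrates this on the small example dictionary and compares each step with a direct computation of ES. The values δ = 0.8 and k = 3, the top-k signature definition, and the names `es_direct` and `es_extend` are assumptions for illustration.

```python
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2, "canon": 2,
           "camera": 1, "digital": 1}
RANK = {tok: i for i, tok in enumerate(WEIGHTS)}
DELTA, K = 0.8, 3   # assumed threshold and signature length

R = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr".split(),
}


def wt(tokens):
    return sum(WEIGHTS[t] for t in tokens)


# Signature and threshold of every dictionary string, computed once.
SIG = {rid: set(sorted(set(toks), key=RANK.__getitem__)[:K])
       for rid, toks in R.items()}
TAU = {rid: wt(SIG[rid]) - (1 - DELTA) * wt(set(toks))
       for rid, toks in R.items()}

# list[t]: ids of the dictionary strings that contain token t.
LISTS = {}
for rid, toks in R.items():
    for t in toks:
        LISTS.setdefault(t, set()).add(rid)


def is_evidence(m, rid):
    """Membership test from the definition of ES(m); m is a token set."""
    return wt(m & SIG[rid]) >= min(DELTA * wt(m), TAU[rid])


def es_direct(m):
    """ES(m) computed from scratch over the whole dictionary."""
    return {rid for rid in R if is_evidence(m, rid)}


def es_extend(es_m, m, t):
    """ES(m t) computed by rechecking only ES(m) | list[t]."""
    candidates = es_m | LISTS.get(t, set())
    return {rid for rid in candidates if is_evidence(m | {t}, rid)}


doc = "canon eos digital camera".split()
m = {doc[0]}
es = es_direct(m)
trace = [sorted(es)]
for t in doc[1:]:
    es = es_extend(es, m, t)
    m = m | {t}
    trace.append(sorted(es))
```

On a dictionary this small the saving is negligible; the point is that each extension touches ES(m) plus one inverted list instead of all of R.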

EXAMPLE
Document M: “… cannon eos digital camera lens …”, with m followed by the token t = “lens”.
ES(m) = {r1}
list[t] (lens, 3.0): 22, 53
We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”.

FLOW OF EVIDENCE
EvITER stands for “Evidence ITERATION”.

THE STATIC THRESHOLD PROBLEM
How does this index work so far?
-“Get ready for δ = 0.8 please.”
-“Please wait 30min for index generation…”
-“Ready!”
-“Document M1, δ = 0.8. Go!”
-“…Extraction complete.”
-“Document M2, and I want δ = 0.9…”
-“Sorry, please wait another 30min for index regeneration…”

THE STATIC THRESHOLD PROBLEM
This one seems better:
-“Get ready for δ ≥ 0.8 please.”
-“Please wait 30min for index generation…”
-“Ready!”
-“Document M1, δ = 0.8. Go!”
-“…Extraction complete.”
-“Document M2, and I want δ = 0.9…”
-“…Extraction complete.”

EXPERIMENTAL DATASETS
Paper titles from the DBLP website.
Author names from the DBLP website.

RESULTS
Fig. Performance under different k (δ = 0.85)

PERFORMANCE
Fig. Performance under different thresholds (k = 3)

CONCLUSION
The method causes no false negatives.
It achieves a good balance between the two phases, filtration and verification.
The authors proposed EvITER to eliminate duplicate computation.
It is both effective and efficient.

THANK YOU!