2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition.

Slides:



Advertisements
Similar presentations
Language Technologies Reality and Promise in AKT Yorick Wilks and Fabio Ciravegna Department of Computer Science, University of Sheffield.
Advertisements

Lindsey Bleimes Charlie Garrod Adam Meyerson
Chapter 5: Introduction to Information Retrieval
Michele Samorani Manuel Laguna. PROBLEM : In meta-heuristic methods that are based on neighborhood search, whenever a local optimum is encountered, the.
Iterative Optimization and Simplification of Hierarchical Clusterings Doug Fisher Department of Computer Science, Vanderbilt University Journal of Artificial.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Applications Chapter 9, Cimiano Ontology Learning Textbook Presented by Aaron Stewart.
CSE 574 – Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Using Search in Problem Solving
Maximizing Classifier Utility when Training Data is Costly Gary M. Weiss Ye Tian Fordham University.
1 On Compressing Web Graphs Michael Mitzenmacher, Harvard Micah Adler, Univ. of Massachusetts.
Statistical Relational Learning for Link Prediction Alexandrin Popescul and Lyle H. Unger Presented by Ron Bjarnason 11 November 2003.
WebMiningResearchASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Revised.
The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.
Elementary graph algorithms Chapter 22
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
To Trust of Not To Trust? Predicting Online Trusts using Trust Antecedent Framework Viet-An Nguyen 1, Ee-Peng Lim 1, Aixin Sun 2, Jing Jiang 1, Hwee-Hoon.
Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.
A Randomized Approach to Robot Path Planning Based on Lazy Evaluation Robert Bohlin, Lydia E. Kavraki (2001) Presented by: Robbie Paolini.
Webpage Understanding: an Integrated Approach
Issues with Data Mining
Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.
Name : Emad Zargoun Id number : EASTERN MEDITERRANEAN UNIVERSITY DEPARTMENT OF Computing and technology “ITEC547- text mining“ Prof.Dr. Nazife Dimiriler.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru
2008 International Workshop on Web and Databases (WebDB) Efficient Web-Based Linkage of Short to Long Forms Yee Fan Tan 1, Ergin Elmacioglu 2, Min-Yen.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
1 A Feature Selection and Evaluation Scheme for Computer Virus Detection Olivier Henchiri and Nathalie Japkowicz School of Information Technology and Engineering.
Treatment Learning: Implementation and Application Ying Hu Electrical & Computer Engineering University of British Columbia.
COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.
A Language Independent Method for Question Classification COLING 2004.
Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011 Under the guidance of A/P Min-Yen Kan 10/23/
Domain-Specific Iterative Readability Computation Jin Zhao 13/05/2011.
An Efficient Algorithm for Enumerating Pseudo Cliques Dec/18/2007 ISAAC, Sendai Takeaki Uno National Institute of Informatics & The Graduate University.
Confidence-Aware Graph Regularization with Heterogeneous Pairwise Features Yuan FangUniversity of Illinois at Urbana-Champaign Bo-June (Paul) HsuMicrosoft.
“Artificial Intelligence” in my research Seung-won Hwang Department of CSE POSTECH.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Evaluating Semantic Metadata without the Presence of a Gold Standard Yuangui Lei, Andriy Nikolov, Victoria Uren, Enrico Motta Knowledge Media Institute,
Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy August 30, 2008 Speaker : Sahana Chiwane.
Guided Learning for Role Discovery (GLRD) Presented by Rui Liu Gilpin, Sean, Tina Eliassi-Rad, and Ian Davidson. "Guided learning for role discovery (glrd):
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Institute of Computing Technology, Chinese Academy of Sciences 1 A Unified Framework of Recommending Diverse and Relevant Queries Speaker: Xiaofei Zhu.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
Search Worms, ACM Workshop on Recurring Malcode (WORM) 2006 N Provos, J McClain, K Wang Dhruv Sharma
CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.
Speeding Up Enumeration Algorithms with Amortized Analysis Takeaki Uno (National Institute of Informatics, JAPAN)
Consensus Group Stable Feature Selection
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Post-Ranking query suggestion by diversifying search Chao Wang.
More Than Relevance: High Utility Query Recommendation By Mining Users' Search Behaviors Xiaofei Zhu, Jiafeng Guo, Xueqi Cheng, Yanyan Lan Institute of.
Name Disambiguation in Digital Libraries Tan Yee Fan 2005 October 19 WING Group Meeting.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Scalable Methods for Estimating Document Frequencies of Collocations in Databases Tan Yee Fan 2006 December 15 WING Group Meeting.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Matching References to Headers in PDF Papers Tan Yee Fan 2007 December 19 WING Group Meeting.
1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)
Greedy & Heuristic algorithms in Influence Maximization
Proof technique (pigeonhole principle)
Computer Science cpsc322, Lecture 14
Random Sampling over Joins Revisited
Elementary graph algorithms Chapter 22
Web Mining Research: A Survey
Elementary graph algorithms Chapter 22
Presentation transcript:

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Yee Fan Tan and Min-Yen Kan School of Computing National University of Singapore

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 2 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contents Background and Motivation Record matching using Web resources Issues with existing work Cost-sensitive Record Acquisition Framework Algorithm for Record Matching Problems Evaluation Conclusion

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 3 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Record Matching using Web Resources KDD PAKDD DKMD KDID Knowledge Discovery and Data Mining Pacific-Asia Conference on Knowledge Discovery and Data Mining Research Issues on Data Mining and Knowledge Discovery Knowledge Discovery in Inductive Databases ? Record linkage of short forms to long forms Conference or workshop websites, etc. Disambiguation of author names in publication lists Publication pages of authors and institutions, etc. Cooccurrence counts Hit counts Webpage similarity URL information Snippet similarity AB WI-IAT Web Intelligence

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 4 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Record Matching using Web Resources Typical scheme for comparing records a and b Query search engine with queries of the form: a, b, and/or a  b (Optionally, append additional terms or tokens) Compute similarity metric(s) and/or construct test instance Examples (IR, NLP, IE, data mining, linkage, etc.) Mihalcea and Moldovan (1999), Cimiano et al. (2005), Tan et al. (2006), Bollegala et al. (2007), Elmacioglu et al., (2007), Oh and Isahara (2008), Kalashnikov et al. (2008)

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 5 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Cost of Acquiring Web Resources Acquiring web resources are slow and may incur other access costs Google SOAP Search takes a day or more to query 1000 records individually, even longer for pairwise queries For large datasets, not feasible to query and download everything Must acquire only selected web resources

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 6 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Basically ignored in existing work Including work on cost-sensitive attribute value acquisition, e.g., Ling et al. (2006), Saar-Tsechansky et al. (2009) Same resource may be used to extract different types of attribute values Different instances can share attribute values Hierarchical Dependencies search(a) webpage(a) term-freq(a) hitcount(a)

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 7 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contents Background and Motivation Cost-sensitive Record Acquisition Framework Resource dependency graph Resource acquisition problem Observations and challenges for record matching problems Algorithm for Record Matching Problems Evaluation Conclusion

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 8 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Resource Dependency Graph Directed acyclic graph G = (V, E) Vertex types T with |T| = 7

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 9 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Feasible vertex set V’  V V’ can be acquired as-is without violating acquisition dependencies Resource Acquisition Graph V’V’V’V’

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 10 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Cost and Benefit Functions Acquisition cost function search(.) cost 10 each webpage(.) cost 100 each All other cost 1 each Benefit function Positive benefit for vertices that are attribute values in test instances Zero benefit otherwise Zero benefit Positive benefit

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 11 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Resource Acquisition Problem Given Resource dependency graph G = (V, E) Acquisition cost function cost: V → R +  {0} Benefit function benefit: F(G) → R Vertex type function type: V → T Budget budget Acquire feasible vertex set V’ with cost(V’) ≤ budget to maximize obj(V’) = benefit(V’) – cost(V’) Can be applied on many problems and variants

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 12 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Observations for Record Matching Depth of 2 to 4 is typical Usually possible to flatten graph to two levels Root level: expensive search engine results and web page downloads Leaf level: cheap attribute values

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 13 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Challenges for Record Matching Heuristic search required Not feasible to enumerate all possible V’ to acquire Many local maxima V’ = φ is a local maxima Easy for states to be revisited No guidance as to which root and child vertices to acquire Each search(.) has the same objective value Vertex out-degree can be more than max(|A|, |B|)

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 14 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contents Background and Motivation Cost-sensitive Record Acquisition Framework Algorithm for Record Matching Problems Application of Tabu search More intelligent legal moves Propagation of benefit values Evaluation Conclusion

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 15 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Tabu Search (Glover, 1990) Simple Hill Climbing Set V’ = φ While termination condition not satisfied Let V* be neighbouring state of V’ with best obj(V*) generated by legal moves Set V’ = V* Simple Tabu Search Set V’ = φ Initialize empty Tabu list While termination condition not satisfied Let V* be neighbouring state of V’ with best obj(V*) generated by legal moves not in Tabu list Add move(s) that undo effect of selected move into Tabu list, and keep them in Tabu list for k iterations Set V’ = V* Tabu list = List of prohibited moves 1.Avoids getting stuck in local maxima 2.Better exploration of state space

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 16 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Let D(V’) be the set of children from any vertex in V’ that are not in V’ If V’ is empty, then set D(V’) = V Add(v): Add v from D(V’) and its ancestors into V’ Remove(v): Remove v and its descendants from V’ AddType(t): Add a maximal set of vertices (within acquisition budget) of type t from D(V’) into V’ Legal Moves V’V’ D(V’)

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 17 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Propagated Benefit v u v where Benefit propagation from leaf to root: vertices → edges → vertices → edges → …

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 18 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Current state V’ and we consider adding V’’ Surrogate benefit function Surrogate objective function surr-obj(V’, V’’) = surr-benefit(V’, V’’) – cost(V’’) Surrogate Benefit and Objective V’V’ V’’

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 19 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Benefit Function for SVM We use SVM to classify test instances Apply cost-sensitive attribute value acquisition work in Tan and Kan (2010) to compute benefit function

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 20 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contents Background and Motivation Cost-sensitive Record Acquisition Framework Algorithm for Record Matching Problems Evaluation Experimental setup Results for linking short forms to long forms Results for author name disambiguation Conclusion

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 21 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Experimental Setup Algorithms Baselines Random Least cost Best benefit Best cost-benefit ratio Best type Our algorithm Manual (with access to solution set) Evaluation Methodology Start with no vertices acquired For each iteration Run acquisition algorithm with some budget Record: total acquisition cost, total misclassification cost Run for some iterations

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 22 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Datasets Linking short forms to long forms SL-GENOMES: Human genomes SL-DBLP: Publication venues Disambiguating author names in publication lists AUTHOR-DBLP: 352 ambiguous author names

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 23 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Results SL-GENOMES Average improvement of difference with Manual over second best algorithm: 49%

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 24 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Results AUTHOR-DBLP Average improvement of difference with Manual over second best algorithm: 74%

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 25 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Search Engine Queries Saved 90% of acquisitions of search(.) vertices for test instances correctly classified by our algorithm Mainly acquired search(a) or search(b) vertices, which services many test instances Typically, no need to acquire search(a  b) for correct classification

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 26 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contents Background and Motivation Cost-sensitive Record Acquisition Framework Algorithm for Record Matching Problems Evaluation Conclusion Contributions

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 27 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Contributions Model using resource dependency graph Versatile, applicable to many problems Acquisition algorithm for record matching problems Effective on record matching problems of different domains Thank you for your attention! Framework for cost-sensitive acquisitions of web resources with hierarchical dependencies