Exploiting Context Analysis for Combining Multiple Entity Resolution Systems - Ramu Bandaru, Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra.


Exploiting Context Analysis for Combining Multiple Entity Resolution Systems - Presented by Ramu Bandaru. Authors: Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra

Table of Contents Introduction Related Work Problem Definition Context-Based Framework Experimental Evaluation Conclusion

Introduction This paper proposes a new Entity Resolution (ER) Ensemble framework. Real-world raw datasets are often imperfect: they contain various types of issues, including errors and uncertainty, which complicate analysis. The goal of ER is to identify and group references that co-refer. The output of an ER system is a clustering of references, where each cluster is supposed to represent one distinct entity. The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution, with the goal of increasing the quality of ER.

Diverse ER approaches have been proposed to address the entity resolution challenge, and different ER solutions perform better in different contexts: no single solution is the best in terms of quality. This motivates the problem of creating a single combined ER solution, an ER ensemble. The ER ensemble actively uses the notion of the correct clustering A+ for dataset D. Each base-level system S_i is applied to the dataset being processed to produce its clustering A^(i), which corresponds to what system S_i believes A+ should be.

The goal of the ER ensemble is to build a clustering A* of the highest quality, i.e., as close to A+ as possible. Difference between the two ensemble problems: the objective of a cluster ensemble is to find a clustering that agrees the most with all n input clusterings, including the n-1 poor-quality ones; the goal of the ER ensemble is instead to simply output the one good clustering A^(n) as its A* and completely ignore the rest of the clusterings. Intuition for adapting to the data: let K be the true number of clusters; if K is large, use a high merge threshold; if K is small, use a low threshold.
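The threshold intuition on this slide can be sketched as a tiny rule. The 0.5 cutoff and the concrete threshold values below are assumptions for illustration, not values from the paper:

```python
def pick_threshold(k_true, num_refs, low=0.3, high=0.7):
    """Slide intuition: when the true number of clusters K is large
    relative to the data, references rarely co-refer, so merge
    conservatively (high similarity threshold); when K is small,
    merge aggressively (low threshold).
    The num_refs/2 cutoff is an assumed heuristic."""
    return high if k_true > num_refs / 2 else low
```

In practice K is unknown, which is exactly why the framework later estimates it via regression as a context feature.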

Contributions: two novel ER ensemble approaches; a predictive way of selecting context features that relies on estimating the unknown parameters of the data being processed; and an extensive empirical evaluation of the proposed solution.

Related Work Entity Resolution: many approaches have focused on exploiting various similarity metrics, relational similarities, and probabilistic models. Clustering Ensembles: the final clustering is usually the one that agrees the most with the given clusterings. Combining Classifiers: majority voting is the most popular rule; bagging and boosting combine classifiers of the same type; stacking performs cross-validation on the base-level dataset and creates a meta-level dataset from the predictions of the base-level classifiers.
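The stacking idea mentioned above can be sketched generically: train the base learners on all folds but one, and record their predictions on the held-out fold as meta-level features. This is a minimal sketch of generic stacking, not the paper's exact procedure; the factory-function representation of learners is an assumption:

```python
def stack_meta_dataset(base_learners, folds):
    """Build a meta-level dataset via cross-validation.
    base_learners: list of factories train(examples) -> predict(x).
    folds: list of folds, each a list of (x, label) pairs.
    Returns rows ([base-level predictions for x], label)."""
    meta = []
    for i, held_out in enumerate(folds):
        # Train every base learner on all folds except the held-out one.
        train_set = [ex for j, f in enumerate(folds) if j != i for ex in f]
        models = [train(train_set) for train in base_learners]
        # Held-out predictions become meta-level features.
        for x, label in held_out:
            meta.append(([m(x) for m in models], label))
    return meta
```

A meta-level classifier is then trained on these rows instead of the raw inputs.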

Problem Definition Entity Resolution: two references r_i and r_j are said to co-refer, denoted r_i ~ r_j, if they refer to the same object, that is, if o_{r_i} = o_{r_j}. The co-reference set C_r for r is the set of all references that co-refer with r, i.e., C_r = {q in R : q ~ r}. The output of an ER system S applied to dataset D is a set of clusters A = {A_1, A_2, ..., A_|A|}. To assess the quality of the resulting clusterings, we evaluate how different the clustering A is from the ground truth A+.

ER Ensemble Each of the n base-level systems can be applied to dataset D to cluster the set of references R. Each system S_i produces its set of clusters A^(i) = {A^(i)_1, A^(i)_2, ..., A^(i)_|A^(i)|}. The goal is to combine the resulting clusterings A^(1), A^(2), ..., A^(n) into a single clustering A* such that the quality of A* is higher than that of every A^(i), for i = 1, 2, ..., n. The task of the ER ensemble is, for each edge e_j in E, to provide a mapping d_j -> a*_j. Here a*_j in {-1, 1} is the prediction of the combined algorithm for edge e_j: a*_j = 1 if the overall algorithm believes a+_j = 1, and a*_j = -1 otherwise.
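The per-edge decision vector d_j is derived from the base-level clusterings: system S_i votes +1 on an edge if it put both endpoints in the same cluster. A minimal sketch, assuming a clustering is represented as a reference-to-cluster-id map:

```python
def edge_decisions(clustering, edges):
    """Translate one base-level clustering A^(i) into per-edge votes:
    +1 if both references of edge e_j fall in the same cluster of
    A^(i), -1 otherwise. The dict representation is an assumption."""
    return [1 if clustering[u] == clustering[v] else -1 for (u, v) in edges]
```

Stacking the vote lists of all n systems, edge by edge, yields the decision vectors d_j fed to the ensemble.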

Problem Definition Using Blocking to Improve Efficiency: references are first partitioned into Virtual Connected Subgraphs (VCSs). Blocking is important for improving both the efficiency and the quality of the algorithms. Naive solution: voting, which simply counts the base-level predictions. A more advanced scheme is Weighted Voting (WV). The problem with voting is that it is a static, non-adaptive scheme with respect to the context.
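Weighted Voting can be sketched in a few lines; the concrete weights below are illustrative, and the tie-breaking choice is an assumption:

```python
def weighted_vote(edge_votes, weights):
    """Weighted Voting (WV) baseline: each base-level system's +1/-1
    decision on an edge is scaled by a fixed per-system weight; the sign
    of the sum is the combined prediction. Ties break toward +1 here
    (an assumed convention). The weights are static, which is exactly
    the limitation the slide points out: they cannot adapt to context."""
    score = sum(w * v for w, v in zip(weights, edge_votes))
    return 1 if score >= 0 else -1
```

The context-based framework replaces these fixed weights with context-dependent predictions of each system's accuracy.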

Context-Based Framework Context features should satisfy two criteria: effectiveness and generality. Number of clusters: derived from the expected number of clusters per VCS as well as the number of clusters generated by each base-level system. Node fanout: exploits the nodes adjacent to the edges, at the edge level. For m features, the overall context feature vector f_j is computed as a linear combination of the individual feature vectors. Adding more context features can increase the robustness and accuracy of the system.
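The linear combination of the m feature vectors into the overall context vector f_j can be written directly; the coefficients are assumed tuning parameters, since the slide does not give their values:

```python
def combine_context_features(feature_vectors, alphas):
    """Overall context feature vector f_j as a linear combination
    f_j = sum_k alpha_k * f_j^(k) of the m per-feature vectors
    (e.g., number-of-clusters and node-fanout features).
    All vectors are assumed to have the same dimension."""
    dim = len(feature_vectors[0])
    return [sum(a * fv[i] for a, fv in zip(alphas, feature_vectors))
            for i in range(dim)]
```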

Overview of the Approach

Meta-level Classification Two algorithms for meta-level classification. Context-Extended Classification: uses regression to estimate K* and N*_v, then computes f_j. It then becomes possible to learn a mapping from the extended features d_j x f_j to edge predictions a*_j by training a classifier on the ground-truth labels a+_j.
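The "extended features" step amounts to concatenating each edge's base-level decision vector d_j with its context feature vector f_j (plus, at training time, the ground-truth label a+_j). A minimal sketch of building those training rows; the flat-row representation is an assumption:

```python
def context_extended_rows(decisions, features, labels=None):
    """Context-Extended Classification, training-data view: extend each
    edge's decision vector d_j with its context feature vector f_j, and
    append the ground-truth label a+_j when available. A standard
    classifier is then trained on these rows."""
    rows = []
    for j, (d, f) in enumerate(zip(decisions, features)):
        row = list(d) + list(f)
        if labels is not None:
            row.append(labels[j])
        rows.append(row)
    return rows
```

At prediction time the same rows are built without labels and fed to the trained classifier.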

Context-Weighted Classification: the idea of this approach is to learn, for each base-level system S_i, a model M_i that predicts how accurate S_i tends to be in a given context.

Creating Final Clusters Correlation Clustering (CC) generates a partitioning of the data points that optimizes a score function. CC was specifically designed for situations where there are conflicts in the edge labeling. Input: a graph whose nodes correspond to references and whose edges are labeled, in the simplest case, +1 if there is evidence that the endpoints u and v co-refer, and -1 otherwise. CC finds a clustering that agrees the most with the edge labeling, thereby limiting the conflicts as much as possible.
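Exact correlation clustering is NP-hard, so it is typically approximated. Below is a simple greedy sketch, an assumed stand-in for whatever CC algorithm the paper actually uses: scan nodes in order and attach each one to the existing cluster with the highest net +1/-1 agreement, or start a new cluster if no agreement is positive:

```python
def greedy_correlation_clustering(nodes, labels):
    """Greedy approximation of correlation clustering (illustrative,
    not the paper's exact CC step). labels[(u, v)] is +1 or -1 for
    the edge between u and v; missing pairs contribute 0."""
    clusters = []
    for u in nodes:
        best, best_score = None, 0
        for c in clusters:
            # Net agreement between u and cluster c under either key order.
            score = sum(labels.get((u, v), labels.get((v, u), 0)) for v in c)
            if score > best_score:
                best, best_score = c, score
        if best is None:
            clusters.append([u])  # No positively-agreeing cluster: start new.
        else:
            best.append(u)
    return clusters
```

The greedy pass never violates a -1 edge unless outweighed by +1 edges, which is the "limit the conflicts" behavior the slide describes.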

Experimental Evaluation Setup: dataset, evaluation metrics, baseline methods, and classifiers. Experiments on the Web domain: base-level algorithms and a comparison of various combination algorithms. The main differences among these algorithms are (a) the similarity metrics chosen to compare pairs of references and (b) the clustering algorithms that generate the final clusters from the similarities.

Efficiency: the running time of the ER ensemble algorithm consists of several parts: running the base-level ER systems and loading their decisions on the edges, running the two regression classifiers to derive the context features, applying meta-level classification to predict the edge classes, and creating the final clusters. Effects of Blocking: blocking is important for the efficiency of the algorithms, and it also has a great impact on the quality of the ER ensemble, since the VCS-level context features rely on the blocking.
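The blocking step partitions references into Virtual Connected Subgraphs, i.e., connected components of the graph whose edges connect pairs that at least one base-level system considers possibly co-referent; only edges inside a VCS then need meta-level classification. A union-find sketch of that partitioning (the implementation details are assumptions):

```python
def virtual_connected_subgraphs(refs, candidate_edges):
    """Group references into VCSs: connected components of the
    candidate-edge graph, via union-find with path halving."""
    parent = {r: r for r in refs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in candidate_edges:
        parent[find(u)] = find(v)  # union the two components

    comps = {}
    for r in refs:
        comps.setdefault(find(r), []).append(r)
    return list(comps.values())
```

Because context features such as the expected number of clusters are computed per VCS, the granularity of this partitioning directly affects ensemble quality, as the slide notes.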

Conclusion This paper develops a novel ER ensemble framework for combining multiple base-level entity resolution systems. It also proposes the Context-Extended and Context-Weighted Ensemble approaches for meta-level classification, which enable the framework to adapt to the dataset being processed based on the local context. The study demonstrates the superiority of the Context-Weighted Ensemble approach.

Future Work of the Paper To explore principally different ways of creating context features, and a different classification scheme that trades quality for efficiency.

Thank You!!!