Saisai Gong, Wei Hu, Yuzhong Qu


Leveraging Distributed Human Computation and Consensus Partition for Entity Coreference
Saisai Gong, Wei Hu, Yuzhong Qu
Websoft Research Group, Nanjing University, China

Contents: Introduction; Overview of approach; Consensus partition; Ensemble learning; Evaluation; Conclusion

Introduction: Human knowledge is valuable for entity coreference. Crowdsourcing and other systems are used to acquire human contributions to coreference. The quality of user-judged results is a problem: mistakes and outliers, omissions, …

Introduction: In this paper, we propose coCoref, which (1) acquires human contributions from web browsing activities; (2) improves quality by aggregating individual results into more robust and comprehensive ones; and (3) reduces user involvement by automatically identifying coreferent entities that users have not judged.

Overview of approach

Example: Input partition 1: {NewYorkCity, NY, TheBigApple}, {Manhattan, NewYorkCounty}. Input partition 2: {NewYorkCity, TheBigApple, NewYorkCounty}, {NY, Manhattan}. Input partition 3: {NewYorkCity, TheBigApple}, {NY, Manhattan}.

Example (consensus partition): Input partition 1: {NewYorkCity, NY, TheBigApple}, {Manhattan, NewYorkCounty}. Input partition 2: {NewYorkCity, TheBigApple, NewYorkCounty}, {NY, Manhattan}. Input partition 3: {NewYorkCity, TheBigApple}, {NY, Manhattan}. Consensus partition: {NewYorkCounty}, {NewYorkCity, TheBigApple}, {NY, Manhattan}.

Consensus partition: Compute a consensus partition that minimizes the number of disagreements with the input partitions, in order to obtain more robust and comprehensive coreference results. Disagreement between partitions is measured using the symmetric difference distance. The problem is NP-complete.
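The symmetric difference distance mentioned above can be sketched as follows. This is a minimal illustration, assuming both partitions cover the same entity set and that the distance counts entity pairs grouped together in exactly one of the two partitions; the function names and the toy entities `a` to `d` are hypothetical and not taken from the slides.

```python
from itertools import combinations


def co_clustered_pairs(partition):
    """Return the set of unordered entity pairs that share a block."""
    pairs = set()
    for block in partition:
        # Sorting makes each pair a canonical (smaller, larger) tuple.
        for a, b in combinations(sorted(block), 2):
            pairs.add((a, b))
    return pairs


def symmetric_difference_distance(p1, p2):
    """Count pairs grouped together in exactly one of the two partitions."""
    return len(co_clustered_pairs(p1) ^ co_clustered_pairs(p2))


def total_disagreement(candidate, inputs):
    """Sum of distances from a candidate partition to every input partition."""
    return sum(symmetric_difference_distance(candidate, p) for p in inputs)


# Hypothetical toy partitions over the entities a, b, c, d.
p1 = [{"a", "b", "c"}, {"d"}]
p2 = [{"a", "b"}, {"c", "d"}]
distance = symmetric_difference_distance(p1, p2)
```

A consensus partition in this sense is a candidate that makes `total_disagreement` as small as possible over the input partitions.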

Consensus partition Approximation Algorithm 3-approximation algorithm time complexity:
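The slide does not name its approximation algorithm, so the sketch below is a stand-in rather than the authors' method: a simple pivot-style heuristic that groups a pivot entity with every entity that a strict majority of input partitions place in the same block. The deterministic pivot order used here carries no approximation guarantee. The input data assumes one reading of the three partitions in the New York example, and the function names are hypothetical.

```python
def together_count(a, b, partitions):
    """Number of input partitions that put a and b in the same block."""
    return sum(1 for p in partitions for block in p if a in block and b in block)


def pivot_consensus(partitions):
    """Greedy pivot heuristic: majority vote on co-membership with the pivot."""
    entities = sorted({e for p in partitions for block in p for e in block})
    unassigned = list(entities)
    consensus = []
    while unassigned:
        pivot = unassigned.pop(0)  # first remaining entity in sorted order
        block = {pivot}
        for other in list(unassigned):
            # Strict majority of all input partitions must agree.
            if 2 * together_count(pivot, other, partitions) > len(partitions):
                block.add(other)
                unassigned.remove(other)
        consensus.append(block)
    return consensus


# Assumed reading of the three input partitions from the example slide.
inputs = [
    [{"NewYorkCity", "NY", "TheBigApple"}, {"Manhattan", "NewYorkCounty"}],
    [{"NewYorkCity", "TheBigApple", "NewYorkCounty"}, {"NY", "Manhattan"}],
    [{"NewYorkCity", "TheBigApple"}, {"NY", "Manhattan"}],
]
consensus = pivot_consensus(inputs)
```

Under that reading of the inputs, the heuristic groups NewYorkCity with TheBigApple and NY with Manhattan, and leaves NewYorkCounty on its own.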

Ensemble learning: Given training examples, learn a classifier that identifies whether two entities are coreferent. Training examples come from the consensus partition. Positive examples are entity pairs in the same equivalence class; negative examples are entity pairs across different equivalence classes. Features are property-value similarities, and the feature vector covers all property pairs.
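The construction of training examples can be sketched as follows. This is only an illustration: the token-based Jaccard similarity is an assumed stand-in for whatever property-value similarity is actually used, and the property names `label` and `name`, the entity descriptions, and the function names are all hypothetical.

```python
from itertools import combinations


def training_pairs(partition):
    """Label every entity pair: 1 if same equivalence class, else 0."""
    block_of = {e: i for i, block in enumerate(partition) for e in block}
    examples = []
    for a, b in combinations(sorted(block_of), 2):
        examples.append(((a, b), 1 if block_of[a] == block_of[b] else 0))
    return examples


def token_jaccard(v1, v2):
    """Assumed similarity: Jaccard overlap of lowercased whitespace tokens."""
    t1, t2 = set(v1.lower().split()), set(v2.lower().split())
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0


def feature_vector(desc_a, desc_b, property_pairs):
    """One similarity per property pair; a missing property yields 0.0."""
    return [
        token_jaccard(desc_a[p], desc_b[q]) if p in desc_a and q in desc_b else 0.0
        for p, q in property_pairs
    ]


# Consensus partition from the example slide.
consensus = [{"NewYorkCounty"}, {"NewYorkCity", "TheBigApple"}, {"NY", "Manhattan"}]
examples = training_pairs(consensus)

# Hypothetical entity descriptions and property pairs.
features = feature_vector(
    {"label": "New York City"},
    {"name": "New York"},
    [("label", "name"), ("label", "population")],
)
```

Five entities give ten pairs, of which only the two pairs inside a shared equivalence class are positive, so negatives dominate even in this small example.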

Ensemble learning: The learning algorithm must remain suitable when the training set is relatively small, when a small number of mistakes or outliers still exist, and when potentially important property pairs are not characterized by the training data. We use an ensemble learning model with decision trees as base learners: random forest and bagging. Ensembles are more suitable for relatively small training data, and the randomness of input features means that potentially important property pairs can also be used.
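The bagging idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the model used in the paper: one-level decision stumps stand in for full decision trees, the per-split feature sampling of a random forest is omitted, and the data, function names, and parameters are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y):
    """Pick the (feature, threshold) whose rule 'x[f] >= t -> 1' fits best."""
    best = (-1, 0, 0.0)  # (correct count, feature index, threshold)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            correct = sum((row[f] >= t) == bool(label) for row, label in zip(X, y))
            if correct > best[0]:
                best = (correct, f, t)
    return best[1], best[2]


def train_bagging(X, y, n_learners=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners


def predict(learners, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x[f] >= t) for f, t in learners)
    return votes.most_common(1)[0][0]


# Hypothetical data: feature 0 is an informative similarity,
# feature 1 is an uninformative property pair with a constant value.
X = [[0.0, 0.5], [0.1, 0.5], [0.2, 0.5], [0.8, 0.5], [0.9, 0.5], [1.0, 0.5]] * 2
y = [0, 0, 0, 1, 1, 1] * 2
model = train_bagging(X, y)
high = predict(model, [1.0, 0.5])
low = predict(model, [0.0, 0.5])
```

Because every learner sees a different bootstrap sample, a single mislabeled pair affects only some of the voters, which is one reason ensembles tolerate a small number of mistakes in the training data.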

Evaluation: Integration in SView. SView finds candidates using owl:sameAs and web services. User interaction is lightweight: the user just accepts or rejects a candidate. The data of coreferent entities is integrated into browsing immediately. Visit SView at http://ws.nju.edu.cn/sview/

Evaluation on SView dataset: Target: evaluate how coCoref improves result quality using the consensus partition. The evaluation is based on the SView dataset (Oct. to Dec. 2013): 36 users' individual partitions covering 1,489 entities from 76 distributed data sources (URI namespaces). A reference partition is built for comparison, using Precision, Recall, F-measure, Rand Index, and NMI (normalized mutual information). Baseline: the average measure values of the 36 users' individual partitions. The consensus partition is also compared with the best individual ones.
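The slides do not state which definitions of these measures were used. The sketch below assumes a common pairwise formulation in which every entity pair counts as a true or false positive or negative depending on whether the predicted and reference partitions group it together. NMI is omitted, and the toy partitions and function name are hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted, reference):
    """Pairwise precision, recall, F-measure and Rand index of two partitions."""
    entities = sorted({e for block in reference for e in block})
    pred_of = {e: i for i, block in enumerate(predicted) for e in block}
    ref_of = {e: i for i, block in enumerate(reference) for e in block}
    tp = fp = fn = tn = 0
    for a, b in combinations(entities, 2):
        # Entities missing from the prediction are treated as not grouped.
        same_pred = a in pred_of and b in pred_of and pred_of[a] == pred_of[b]
        same_ref = ref_of[a] == ref_of[b]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_ref:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    total = tp + fp + fn + tn
    rand_index = (tp + tn) / total if total else 0.0
    return precision, recall, f_measure, rand_index


# Hypothetical toy partitions over the entities a to e.
reference = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c"}, {"d", "e"}]
precision, recall, f_measure, rand_index = pairwise_scores(predicted, reference)
```

In this toy case the prediction splits one reference class, so it keeps perfect precision but loses recall, which is the kind of omission the consensus partition is meant to reduce.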

Evaluation on SView dataset: Results. The consensus partition is more comprehensive: it has the highest values for Recall, F-measure, Rand Index, and NMI. The consensus partition is more robust: it improves the low average accuracy of the 36 users' results by 21% (0.665 to 0.807) and achieves precision close to the best (0.863).

Evaluation on OAEI NYT: Target: evaluate the effectiveness of the consensus partition and the ensemble learning algorithm on a large scale. The evaluation is based on the OAEI New York Times benchmark and leverages the coreference results of ObjectCoref, Zhishi.links, and KnoFuss, with 33,914 entities in total. The coreferent entities in each tool's dataset are clustered to construct its partition.

Evaluation on OAEI NYT: Results (1). The consensus partition achieves the highest accuracy.

Evaluation on OAEI NYT: Ensemble learning based on the consensus partition. Use 10% of training data; 10-fold cross-validation. Compare ensemble learning with the consensus partition.

Evaluation on OAEI NYT: Results (2). Ensemble learning identifies a similar number of coreferent entities as the consensus partition while using only 10% of the training data.

Conclusion: Our contributions are (1) consensus partition for improving the accuracy of user-judged results; (2) ensemble learning to alleviate user involvement; and (3) integration with browsing activities.

Thank you for your attention! Supported in part by the National Natural Science Foundation of China under Grant Nos. 61370019 and 61170068, and in part by the Natural Science Foundation of Jiangsu Province under Grant Nos. BK2011189 and BK2012723