Web Scale Taxonomy Cleansing Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang VLDB 2011.

Slides:



Advertisements
Similar presentations
Google News Personalization: Scalable Online Collaborative Filtering
Advertisements

Understanding Tables on the Web Jingjing Wang. Problem to Solve A wealth of information in the World Wide Web Not easy to access or process by machine.
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
Multi-Document Person Name Resolution Michael Ben Fleischman (MIT), Eduard Hovy (USC) From Proceedings of ACL-42 Reference Resolution workshop 2004.
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
CrowdER - Crowdsourcing Entity Resolution
Probase: Understanding Data on the Web Haixun Wang Microsoft Research Asia.
BY ANISH D. SARMA, XIN DONG, ALON HALEVY, PROCEEDINGS OF SIGMOD'08, VANCOUVER, BRITISH COLUMBIA, CANADA, JUNE 2008 Bootstrapping Pay-As-You-Go Data Integration.
Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.
Large-Scale Entity-Based Online Social Network Profile Linkage.
NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park.
A Probabilistic Framework for Semi-Supervised Clustering
Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao.
WISE: Large Scale Content-Based Web Image Search Michael Isard Joint with: Qifa Ke, Jian Sun, Zhong Wu Microsoft Research Silicon Valley 1.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
Attribute Extraction and Scoring: A Probabilistic Approach Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang Microsoft Research Asia Speaker: Bo.
 Person Name Disambiguation by Bootstrapping SIGIR’10 Yoshida M., Ikeda M., Ono S., Sato I., Hiroshi N. Supervisor: Koh Jia-Ling Presenter: Nonhlanhla.
Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra Progressive Approach to Relational Entity Resolution.
A Two Tier Framework for Context-Aware Service Organization & Discovery Wei Zhang 1, Jian Su 2, Bin Chen 2,WentingWang 2, Zhiqiang Toh 2, Yanchuan Sim.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Wong Cheuk Fun Presentation on Keyword Search. Head, Modifier, and Constraint Detection in Short Texts Zhongyuan Wang, Haixun Wang, Zhirui Hu.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Ground Truth Free Evaluation of Segment Based Maps Rolf Lakaemper Temple University, Philadelphia,PA,USA.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
BioSnowball: Automated Population of Wikis (KDD ‘10) Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/11/30 1.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING CIKM’10 (DINGDING WANG, TAO LI) Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe.
Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.
Approximate Inference: Decomposition Methods with Applications to Computer Vision Kyomin Jung ( KAIST ) Joint work with Pushmeet Kohli (Microsoft Research)
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Inference Protocols for Coreference Resolution Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Nick Rizzolo, Mark Sammons, and Dan Roth This research.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Relational Duality: Unsupervised Extraction of Semantic Relations between Entities on the Web Danushka Bollegala Yutaka Matsuo Mitsuru Ishizuka International.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
Discovering Relations among Named Entities from Large Corpora Takaaki Hasegawa *, Satoshi Sekine 1, Ralph Grishman 1 ACL 2004 * Cyberspace Laboratories.
Concept-based Short Text Classification and Ranking
1 Friends and Neighbors on the Web Presentation for Web Information Retrieval Bruno Lepri.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana Constraint-Based Entity Matching.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Vincent Granville, Ph.D. Co-Founder, DSC
Research at Open Systems Lab IIIT Bangalore
Presentation 王睿.
Extracting Semantic Concept Relations
A Graph-Based Approach to Learn Semantic Descriptions of Data Sources
Family History Technology Workshop
Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
ProBase: common Sense Concept KB and Short Text Understanding
Presentation transcript:

Web Scale Taxonomy Cleansing Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang VLDB 2011

Knowledge Taxonomy – Manually built e.g. Freebase – Automatically generated e.g. Yago, Probase – Probase: a project at MSR

Freebase A knowledge base built by community contributions Unique ID for each real world entity Rich entity level information

Probase, the Web Scale Taxonomy Automatically generated from Web data Rich hierarchy of millions of concepts (categories) Probabilistic knowledge base politicians people presidents George W. Bush, Bill Clinton, George H. W. Bush, Hillary Clinton, Bill Clinton, George H. W. Bush, George W. Bush, 0.019

Characteristics of Automatically Harvested Taxonomies Web scale – Probase has 2.7 millions categories – Probase has 16 million entities and Freebase has 13 million entities. Noises and inconsistencies – US Presidents: {…George H. W. Bush, George W. Bush, Dubya, President Bush, G. W. Bush Jr.,…} Little context – Probase didn’t have attribute values for entities – E.g. we have no information such as birthday, political party, religious belief for “George H. W. Bush”

Leverage Freebase to Clean and Enrich Probase FreebaseProbase How is it built?manualautomatic Data modeldeterministicProbabilistic Taxonomy topologyMostly treeDAG # of concepts12.7 thousand2 million # of entities (instances)13 million16 million Information about entityrichSparse adoptionWidely usedNew

Why we do not bootstrap from freebase Probase try to capture concepts in our mental world – Freebase only has 12.7 thousand concepts/categories – For those concepts in Freebase we can also easily acquire – The key challenge is how can we capture tail concepts automatically with high precision Probase try to quantify uncertainty – Freebase treats facts as black or white – A large part of Freebase instances(over two million instances) are distributed in a few very popular concepts like “track” and “book” How can we get better one? – Cleansing and enriching Probase by mapping its instances to Freebase!

How Can We Attack the Problem? Entity resolution on the union of Probase & Freebase Simple String Similarity? George W. Bush George H. W. Bush – Jaccard Similarity = ¾ George W. Bush Dubya – Jaccard Similarity = 0 Many other learnable string similarity measures are also not free from these kind of problems

How Can We Attack the Problem? Attribute-based Entity Resolution Naïve Relational Entity Resolution Collective Entity Resolution  Unfortunately, Probase did not have rich entity level information OK. Let’s use external knowledge

Positive Evidence To handle nickname case – Ex) George W. Bush – Dubya – Synonym pairs from known sources – [Silviu Cucerzan, EMNLP-CoNLL 2007, …] Wikipedia Redirects Wikipedia Internal Links – [[ George W. Bush | Dubya ]] Wikipedia Disambiguation Page Patterns such as ‘whose nickname is’, ‘also known as’

What If Just Positive Evidence? Wikipedia does not have everything. – typo/misspelling Hillery Clinton – Too many possible variations Former President George W. Bush

What If Just Positive Evidence? (George W. Bush = President Bush) + (President Bush = George H. W. Bush) = ? George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush Entity Similarity Graph

Negative Evidence What if we know they are different? It is possible to… – stop transitivity at the right place – safely utilize string similarity – remove false positive evidence ‘Clean data has no duplicates’ Principle – The same entity would not be mentioned in a list of instances several times in different forms – We assume these are clean: ‘List of ***’, ‘Table of ***’ such as … Freebase George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush Different!

Two Types of Evidence in Action Instance Pair Space Correct Matching Pairs [Evidence] Wikilinks [Evidence] String Similarity [Evidence] Wikipedia Redirect Page [Evidence] Wikipedia Disambiguation Page Negative Evidence

Scalability Issue Large number of entities Simple string similarity (Jaro-Winkler) implemented in C# can process 100K pairs in one second  More than 60 years for all pairs! FreebaseProbaseFreebase * Probase # of concepts12.7 thousand2 million25.4 * 10 9 # of entities (instances)13 million16 million208 * 10 12

Scalability Issue ‘Birds of a Feather’ Principle – Michael Jordan the professor - Michael Jordan the basketball player may have only few friends in common – We only need to compare instances in a pair of related concepts; High # of overlapping instances High # of shared attributes Computer Scientists Basketball players Athletes

With a Pair of Concepts Building a graph of entities – Quantifying positive evidence for edge weight – Noisy-or model Using All kinds of positive evidence including string similarity George W. Bush Dubya Bush President Bush George H. W. Bush President Bill Clinton Hillary Clinton Bill Clinton

Multi-Multiway Cut Problem Multi-Multiway Cut problem on a graph of entities – A variant of s-t cut – No vertices with the same label may belong to one connected component (cluster) {George W. Bush, President Bill Clinton, George H. W. Bush}, {George W. Bush, Bill Clinton, Hillary Clinton}, … – The cost function: the sum of removed edge weights George W. Bush Dubya Bush President Bush George H. W. Bush President Bill Clinton Hillary Clinton Bill Clinton George W. Bush Dubya Bush President Bush George H. W. Bush President Bill Clinton Hillary Clinton Bill Clinton Multiway cutOur case

Our method Monte Carlo heuristics algorithm: – Step 1: Randomly insert one edge at a time with probability proportional to the weight of edge – Step 2: Skip an edge if it violates any piece of negative evidence We repeat the random process several times, and then choose the best one that minimize the cost

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

Experimental Setting Probase: 16M instances / 2M concepts Freebase ( dump): 13M instances / 12.7K concepts Positive Evidence – String similarity: Weighted Jaccard Similarity – Wikipedia Links: 12,662,226 – Wikipedia Redirect: 4,260,412 – Disambiguation Pages: 223,171 Negative Evidence (# name bags) – Wikipedia List: 122,615 – Wikipedia Table: 102,731 – Freebase

Experimental Setting Baseline #1 – Positive evidence (without string similarity) based method. – Maps two instances with the strongest positive evidence. If there is no evidence, not mapped. Baseline #2 – String similarity based method. – Maps two instances if the string similarity, here Jaro-Winkler distance, is above a threshold (0.9 to get comparable precision). Baseline #3 – Markov Clustering with our similarity graphs – The best method among the scalable clustering methods for entity resolution introduced in [Hassanzadeh, VLDB2009]

Some Examples of Mapped Probase Instances Freebase InstanceBaseline #1Baseline #2Baseline #3Our Method George W. BushGeorge W. Bush, President George W. Bush, Dubya George W. BushGeorge W. Bush, President George W. Bush, George H. W. Bush, George Bush Sr., Dubya George W. Bush, President George W. Bush, Dubya, Former President George W. Bush American AirlinesAmerican Airlines, American Airline, AA, Hawaiian Airlines American Airlines, American Airline American Airlines, American Airline, AA, North American Airlines American Airlines, American Airline, AA John KerryJohn Kerry, Sen. John Kerry, Senator Kerry John KerryJohn Kerry, Senator Kerry John Kerry, Sen. John Kerry, Senator Kerry, Massachusetts Sen. John Kerry, Sen. John Kerry of Massachusetts

More Examples * Baseline #3 (Markov Clustering) clustered and mapped all related Barack Obama instances to Ricardo Mangue Obama Nfubea by wrong transitivity of clustering Freebase InstanceBaseline #1Baseline #2Baseline #3Our Method Barack ObamaBarack Obama, Barrack Obama, Senator Barack Obama Barack Obama, Barrack Obama None * Barack Obama, Barrack Obama, Senator Barack Obama, US President Barack Obama, Mr Obama mp3mp3, mp3s, mp3 files, mp3 format mp3, mp3smp3, mp3 files, mp3 format mp3, mp3s, mp3 files, mp3 format, high-quality mp3

Precision / Recall Probase ClassPoliticiansForamtsSystemsAirline Freebase Type /government /politician /computer /file_format /computer /operating_system /aviation /airline PrecisionRecallPrecisionRecallPrecisionRecallPrecisionRecall Baseline # Baseline # Baseline # Our method

More Results

Scalability Our method (in C#) is compared with Baseline #3 (Markov Clustering, in C) Category|V||E| Companies662,029110,313 Locations964,665118,405 Persons1,847,90781,464 Books2,443,573205,513

Summary & Conclusion A novel way to extract and use negative evidence is introduced Highly effective and efficient entity resolution method for large automatically generated taxonomy is introduced The method is applied and tested on two large knowledge bases for entity resolution and merger

References [Hassanzadeh, VLDB2009] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller. Framework for evaluating clustering algorithms in duplicate detection. PVLDB, pages 1282–1293, [Silviu Cucerzan, EMNLP-CoNLL 2007, …] Large Scale Named Entity Disambiguation Based on Wikipedia Data, EMNLP-CoNLL Joint Conference, Prague, 2007 R. Bunescu. Using encyclopedic knowledge for named entity disambiguation, In EACL, pages 9-16, 2006 X. Han and J. Zhao. Named entity disambiguation by leveraging wikipedia semantic knowledge. In CIKM, pages , 2009 [Sugato Basu, KDD04] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In SIGKDD, pages 59–68, [Dahlhaus, E., SIAM J. Comput’94] E. Dahlhaus, D. S. Johnson, C. H. Papadimitriou, P. D. Seymour, and M. Yannakakis. The complexity of multiterminal cuts. SIAM J. Comput., 23:864–894, For more references, please refer to the paper. Thanks.

Get more information from: