Yang Yang*, Yizhou Sun+, Jie Tang*, Bo Ma#, and Juanzi Li*. Entity Matching across Heterogeneous Sources. *Tsinghua University, +Northeastern University.

Presentation transcript:

1 Entity Matching across Heterogeneous Sources Yang Yang*, Yizhou Sun+, Jie Tang*, Bo Ma#, and Juanzi Li* *Tsinghua University +Northeastern University #Carnegie Mellon University Data & Code available at:

2 Motivation Given an entity in a source domain, we aim to find its matched entities in the target domain. Example: product-patent matching.

3 Problem Input 1: Dual source corpus $\{C_1, C_2\}$, where $C_t = \{d_1, d_2, \ldots, d_n\}$ is a collection of entities. Input 2: Matching relation matrix $L$, where $L_{ij} = 1$ if $d_i$ and $d_j$ are matched, $L_{ij} = 0$ if they are not matched, and $L_{ij} = ?$ if the relation is unknown.
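The two inputs on this slide can be pictured with a small data-structure sketch. This is only an illustration, not the authors' released code: the toy corpora, the use of `None` for the unknown ("?") case, and the helper names `known_pairs` and `unknown_pairs` are all assumptions made for the example.

```python
# Hypothetical toy corpora: each entity is represented as a list of tokens.
C1 = [["phone", "screen", "touch"], ["laptop", "battery"]]
C2 = [["touchscreen", "display", "apparatus"],
      ["power", "cell", "device"],
      ["engine", "valve"]]

# Matching relation matrix: L[i][j] is 1 (matched), 0 (not matched),
# or None (unknown, the "?" case). Rows index C1, columns index C2.
L = [[None] * len(C2) for _ in C1]
L[0][0] = 1   # entity 0 of C1 is matched with entity 0 of C2
L[0][2] = 0   # entity 0 of C1 is known not to match entity 2 of C2
L[1][1] = 1   # entity 1 of C1 is matched with entity 1 of C2


def known_pairs(matrix):
    """Return (i, j, label) for every pair whose relation is known."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v is not None]


def unknown_pairs(matrix):
    """Return (i, j) for every pair whose relation still has to be predicted."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v is None]
```

The known pairs play the role of supervision, while the unknown pairs are the ones a matching model would have to score.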

4 Challenges 1. The two domains have little or no overlap in content: daily expression vs. professional expression.

5 Challenges 1. The two domains have little or no overlap in content. 2. How to model the topic-level relevance probability?

6 Cross-Source Topic Model

7 Basic Assumption For entities from different sources, their matching relations and hidden topics are influenced by each other. How to leverage the known matching relations to help link hidden topic spaces of two sources?

8 Cross-Sampling $d_1$ and $d_2$ are matched …

9 Cross-Sampling Sample a new term $w_1$ for d. Toss a coin $c$; if $c = 0$, sample $w_1$'s topic according to $d_1$.

10 Cross-Sampling Sample a new term $w_1$ for d. Otherwise, sample $w_1$'s topic according to $d_2$.
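The coin-toss step described on the cross-sampling slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cross_sample_topic`, the parameter `coin_bias` (taken here to be the probability that c = 0), and the representation of each document's topic distribution as a plain list of probabilities are all hypothetical.

```python
import random


def cross_sample_topic(theta_d1, theta_d2, coin_bias, rng):
    """Pick a topic for a new term, given two matched documents.

    theta_d1, theta_d2: topic distributions of d1 and d2 (lists summing to 1).
    coin_bias: assumed probability that the coin shows c = 0.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (c, topic_index).
    """
    # Toss the coin: c = 0 means "use d1's topics", c = 1 means "use d2's".
    c = 0 if rng.random() < coin_bias else 1
    theta = theta_d1 if c == 0 else theta_d2
    # Draw one topic index in proportion to the chosen topic distribution.
    topic = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    return c, topic


if __name__ == "__main__":
    rng = random.Random(42)
    theta_d1 = [0.7, 0.2, 0.1]
    theta_d2 = [0.1, 0.2, 0.7]
    draws = [cross_sample_topic(theta_d1, theta_d2, 0.5, rng) for _ in range(5)]
    print(draws)
```

Because the topic of a term in one document can be drawn from its matched partner's distribution, matched documents are pushed toward a shared topic space, which is the intuition the slides describe.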

11 Cross-Source Topic Model Step 1: Step 2:

12 Model Learning Variational EM –Model parameters: –Variational parameters: –E-step: –M-step:

13 Experiments Task I: Product-patent matching. Task II: Cross-lingual matching.

14 Task I: Product-Patent Matching Given a Wiki article describing a product, find all patents relevant to the product. Data set: –13,085 Wiki articles; –15,00 patents from USPTO; –1,060 matching relations in total.

15 Experimental Results Training: 30% of the matching relations, randomly chosen. Content Similarity based on LDA (CS+LDA): cosine similarity between two articles' topic distributions extracted by LDA. Random Walk based on LDA (RW+LDA): random walk on a graph whose edges indicate the topic similarity between articles. Relational Topic Model (RTM): generally used to model links between documents. Random Walk based on CST (RW+CST): the same as RW+LDA, but uses CST instead of LDA.
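The content-similarity baseline compares topic distributions with cosine similarity. A minimal sketch of that scoring step is given below; it assumes the topic distributions have already been produced by some topic model, and the function names `cosine` and `rank_candidates` are hypothetical, chosen only for this illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    # Return 0.0 for an all-zero vector instead of dividing by zero.
    return dot / norm if norm > 0 else 0.0


def rank_candidates(query_theta, candidate_thetas):
    """Return candidate indices sorted by decreasing similarity to the query."""
    scores = [(cosine(query_theta, t), j) for j, t in enumerate(candidate_thetas)]
    return [j for _, j in sorted(scores, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    product = [0.8, 0.1, 0.1]                 # topic distribution of one article
    patents = [[0.1, 0.8, 0.1],
               [0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]]
    print(rank_candidates(product, patents))  # most similar candidate first
```

When the two sources share little vocabulary, independently learned topic spaces may not line up, which is the limitation the challenges slides attribute to content-only comparison.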

16 Task II: Cross-lingual Matching Given an English Wiki article, we aim to find a Chinese article reporting the same content. Data set: –2,000 English articles from Wikipedia; –2,000 Chinese articles from Baidu Baike; –Each English article corresponds to one Chinese article.

17 Experimental Results Training: 3-fold cross validation. Title Only: only considers the (translated) titles of articles. SVM-S: a well-known cross-lingual Wikipedia matching toolkit. LFG [1]: mainly considers the structural information of Wiki articles. LFG+LDA: adds a content feature (topic distributions) to LFG by employing LDA. LFG+CST: adds a content feature to LFG by employing CST. [1] Zhichun Wang, Juanzi Li, Zhigang Wang, and Jie Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. WWW'12. pp
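For readers unfamiliar with the evaluation protocol, 3-fold cross validation over the known matching pairs can be sketched as below. The slides do not say how the folds were built, so the shuffling, the seed, and the function name `k_fold_splits` are assumptions made for illustration.

```python
import random


def k_fold_splits(pairs, k=3, seed=0):
    """Yield (train, test) lists so that each pair is tested exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled pairs into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical (english_id, chinese_id) matching pairs.
    pairs = [(i, i) for i in range(9)]
    for train, test in k_fold_splits(pairs, k=3):
        print(len(train), len(test))
```

Each round trains on two thirds of the known matches and evaluates on the held-out third, and the reported score is typically the average over the three rounds.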

18 Parameter Analysis (a) Number of topics K (b) Ratio (c) Precision (d) Convergence analysis

19 Apple vs. Samsung Topics highly relevant to both Apple and Samsung found by CST. (Topic titles are hand-labeled)

20 Conclusion Study the problem of entity matching across heterogeneous sources. Propose the cross-source topic model, which integrates the topic extraction and entity matching into a unified framework. Conduct two experimental tasks to demonstrate the effectiveness of CST.

21 Thank You!

22 Problem Given an entity in a source domain, we aim to find its matched entities in the target domain. –Given a textual description of a product, find related patents in a patent database. –Given an English Wiki page, find related Chinese Wiki pages. –Given a specific disease, find all related drugs.