Sentence Similarity Based on Semantic Nets and Corpus Statistics

Slides:



Advertisements
Similar presentations
Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.
Advertisements

Improved TF-IDF Ranker
Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.
Intelligent Database Systems Lab Presenter: WU, JHEN-WEI Authors: Jorge Gorricha, Victor Lobo CG Improvements on the visualization of clusters in.
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: WSDM’12 Date: 1 November,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology A novel document similarity measure based on earth mover’s.
1 Statistical correlation analysis in image retrieval Reporter : Erica Li 2004/9/30.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Guided Conversational Agents and Knowledge Trees for Natural Language Interfaces to Relational Databases Mr. Majdi Owda, Dr. Zuhair Bandar, Dr. Keeley.
Semantic Video Classification Based on Subtitles and Domain Terminologies Polyxeni Katsiouli, Vassileios Tsetsos, Stathes Hadjiefthymiades P ervasive C.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Shaohua Jiang, Yanzhong Dang Institute of.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Efficient Concept-Based Mining Model for Enhancing.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words.
Word Processors, Databases, Spreadsheets, and Data Problems.
Intelligent Database Systems Lab Presenter : BEI-YI JIANG Authors : UNIVERSIT´E CATHOLIQUE DE LOUVAIN, BELGIUM ASSOCIATION FOR COMPUTING MACHINERY.
Multi-Prototype Vector Space Models of Word Meaning __________________________________________________________________________________________________.
Intelligent Database Systems Lab Presenter : YAN-SHOU SIE Authors : JEROEN DE KNIJFF, FLAVIUS FRASINCAR, FREDERIK HOGENBOOM DKE Data & Knowledge.
Intelligent Database Systems Lab Presenter : WU, MIN-CONG Authors : Jorge Villalon and Rafael A. Calvo 2011, EST Concept Maps as Cognitive Visualizations.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology A Taxonomy of Similarity Mechanisms for Case-Based Reasoning.
WORD SENSE DISAMBIGUATION STUDY ON WORD NET ONTOLOGY Akilan Velmurugan Computer Networks – CS 790G.
10/22/2015ACM WIDM'20051 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis Voutsakis.
2014 EMNLP Xinxiong Chen, Zhiyuan Liu, Maosong Sun State Key Laboratory of Intelligent Technology and Systems Tsinghua National Laboratory for Information.
Intelligent Database Systems Lab Presenter : JHOU, YU-LIANG Authors :Shady Shehata, Fakhri Karray, Mohamed S. Kamel, Fellow 2012, IEEE An Efficient Concept-Based.
Intelligent Database Systems Lab Presenter : YAN-SHOU SIE Authors Mohamed Ali Hadj Taieb *, Mohamed Ben Aouicha, Abdelmajid Ben Hamadou KBS Computing.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Sheng-Tun Li a,b,*, Fu-Ching Tsai a 2013, KBS A fuzzy conceptualization model for.
1 K-Means+ID3 A Novel Method for Supervised Anomaly Detection by Cascading K-Means Clustering and ID3 Decision Tree Learning Methods Author : Shekhar R.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. An IPC-based vector space model for patent retrieval Presenter: Jun-Yi Wu Authors: Yen-Liang Chen, Yu-Ting.
1 Mining the Web to Determine Similarity Between Words, Objects, and Communities Author : Mehran Sahami Reporter : Tse Ho Lin 2007/9/10 FLAIRS, 2006.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Word sense disambiguation of WordNet glosses Presenter: Chun-Ping Wu Author: Dan Moldovan, Adrian Novischi.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Adaptation of the Vector-Space Model for Ontology-Based.
Evgeniy Gabrilovich and Shaul Markovitch
Intelligent Database Systems Lab Presenter : Chang,Chun-Chih Authors : CHRISTOS BOURAS, VASSILIS TSOGKAS 2012, KBS A clustering technique for news articles.
Intelligent Database Systems Lab Presenter : Chang,Chun-Chih Authors : David Milne *, Ian H. Witten 2012, AI An open-source toolkit for mining Wikipedia.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Juan D.Velasquez Richard Weber Hiroshi Yasuda 國立雲林科技大學 National.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Extending the Growing Hierarchal SOM for Clustering Documents.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Mining massive document collections by the WEBSOM method Presenter : Yu-hui Huang Authors :Krista Lagus,
INFORMATION RETRIEVAL PROJECT Creation of clusters of concepts that represent a domain corpus.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF,
Intelligent Database Systems Lab Presenter: CHANG, SHIH-JIE Authors: Longzhuang Li, Yi Shang, Wei Zhang 2002.ACM. Improvement of HITS-based Algorithms.
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 1 Mining knowledge from natural language texts using fuzzy associated concept mapping Presenter : Wu,
Intelligent Database Systems Lab Presenter : WU, MIN-CONG Authors : STEPHEN T. O’ROURKE, RAFAEL A. CALVO and Danielle S. McNamara 2011, EST Visualizing.
2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Community self-Organizing Map and its Application to Data Extraction Presenter: Chun-Ping Wu Authors:
Intelligent Database Systems Lab Presenter: CHANG, SHIH-JIE Authors: Tao Liu, Zheng Chen, Benyu Zhang, Wei-ying Ma, Gongyi Wu 2004.ICDM. Improving Text.
1 Visualizing Multi-dimensional Clusters, Trends, and Outliers using Star Coordinates Author : Eser Kandogan Reporter : Tze Ho-Lin 2007/5/9 SIGKDD, 2001.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Visualizing social network concepts Presenter : Chun-Ping Wu Authors :Bin Zhu, Stephanie Watts, Hsinchun.
Intelligent Database Systems Lab Presenter : YU-TING LU Authors : Hsin-Chang Yang, Han-Wei Hsiao, Chung-Hong Lee IPM Multilingual document mining.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Enhancing Text Clustering by Leveraging Wikipedia Semantics.
Evaluating Translation Memory Software Francie Gow MA Translation, University of Ottawa Translator, Translation Bureau, Government of Canada
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Short Text Similarity with Word Embedding Date: 2016/03/28 Author: Tom Kenter, Maarten de Rijke Source: CIKM’15 Advisor: Jia-Ling Koh Speaker: Chih-Hsuan.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics Semantic distance between two words.
Intelligent Database Systems Lab Presenter : BEI-YI JIANG Authors : JAMAL A. NASIR, IRAKLIS VARLAMIS, ASIM KARIM, GEORGE TSATSARONIS KNOWLEDGE-BASED.
An Adaptive Learning with an Application to Chinese Homophone Disambiguation from Yue-shi Lee International Journal of Computer Processing of Oriental.
Intelligent Database Systems Lab Presenter: YU-TING LU Authors: Yong-Bin Kang, Pari Delir Haghighi, Frada Burstein ESA CFinder: An intelligent key.
Bridging Domains Using World Wide Knowledge for Transfer Learning
Arabic Text Categorization Based on Arabic Wikipedia
Using lexical chains for keyword extraction
Associative Query Answering via Query Feature Similarity
Waikato Environment for Knowledge Analysis
Chapter 1: Computer Systems
Presentation transcript:

Sentence Similarity Based on Semantic Nets and Corpus Statistics Author : Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crockett Reporter : Tze Ho-Lin 2006/1/3 TKDE, 2006

Outline Motivation Objectives Methodology Evaluation Conclusion Personal Comments

Motivation Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently Inefficient Require human input Not adaptable to some application domains

Objectives Using lexical database to enable our method to model human common sense knowledge. Incorporation of corpus statistics allows our method to be adaptable to different domains.

Methodology (WordNet) (Brown Corpus) 最左邊是輸入,也就是兩個句子。而最右邊是兩句子的相似度輸出 一個相似度是如何計算得來的呢,看圖可知相似度是由語意的相似和字與字之間順序的相似度

Evaluation Similarity correlation (Pearson) R&G No. Word Pair Human Means Algorithm Measure Sentence Pair 59 Cock rooster 0.86 0.999885 A cock is an adult male chicken. A rooster is an adult male chicken. 29 Bird woodland 0.01 0.334691 A bird is a creature with feathers and wings, female birds lay eggs and most can fly. Woodland is land with a lot of trees. 56 Coast shore 0.59 0.759004 The coast is an area of land that is next to the sea. The shores or shore of a sea, lake or wide river is the land along the edge of it. Similarity correlation (Pearson) 由於目前在句子的評估上並無適當的標準資料集,所以本研究的作者就讓本研究的方法和一些參與人員的相似度進行比較

Conclusion A method for measuring the semantic similarity based on semantic and word order information. The proposed method provides similarity measures that are fairly consistent with human knowledge.

Personal Comments Application Advantage Disadvantage Categorical data similarity measure. Knowledge Retrieval Advantage This function has a simple, easy-to-implement formula. Disadvantage This technique compares words on a word-by-word basis. The proposed method does not currently conduct word sense disambiguation for polysemous words.

Method: Semantic Similarity between Words 兩字的相似度分成了路徑長度和深度兩種方式來計算 由下圖舉例

Method: Semantic Similarity between Sentences One element Is a word in the joint word set. 求cosine夾角 那s1中的數字如何求出來?就是經由下面的Si計算 由I函數,我們可以得到corpus的權重 Is its associated word in the sentence. n is the frequency of the word w in the corpus N is the total number of words in the corpus

Method: Word Order Similarity between Sentences T1: RAM keeps things being worked with. T2: The CPU uses RAM as a short-term memory store. T={RAM keeps things being worked with The CPU uses as a short-term memory store}. 如果沒有字,我們找出句子中最相近的字的index來放

Process for Deriving the Semantic Vector (preset threshold: 0.2) T1: RAM keeps things being worked with. T2: The CPU uses RAM as a short-term memory store. T={RAM keeps things being worked with The CPU uses as a short-term memory store}.

Overall sentence similarity 公式中有一個thida,用來控制語意和字順序的權重