
Normalized Web Distance and Word Similarity 施文娴

Introduction
Some objects can be given literally:
◦ The literal four-letter genome of a mouse
◦ The literal text of War and Peace by Tolstoy
Objects can also be given by name:
◦ "The four-letter genome of a mouse"
◦ "The text of War and Peace by Tolstoy"
Some objects can only be given by name:
◦ Home
◦ Red
Named objects acquire their meaning through semantic relations!

Some Methods for Word Similarity

1. Association Measures: PMI-IR
2. Attributes
For example, for the noun "horse":
◦ Sentence: "the rider rides the horse"
◦ Triple: (rider, rides, horse)
◦ The pair (rider, rides) is an attribute of the noun "horse"
Attribute-based methods are coupled with an appropriate word similarity measure.

Some Methods for Word Similarity
1. Association Measures: PMI-IR
2. Attributes
3. Relational Word Similarity
Relational similarity is correspondence between relations. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous.
For example, the word pair mason:stone is analogous to the pair carpenter:wood.

Some Methods for Word Similarity

1. Association Measures: PMI-IR
2. Attributes
3. Relational Word Similarity
4. LSA (Latent Semantic Analysis)
5. NWD (Normalized Web Distance)
◦ Extracts semantic relations
◦ Feature-free and parameter-free
◦ Uses the set of web pages returned by the query concerned

Brief Introduction to Kolmogorov Complexity
The Kolmogorov complexity of an object, such as a piece of text, is a measure of the computational resources needed to specify the object.
For example:
◦ String 1: abababababababababababababababab
◦ String 2: 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7
String 1 has a short English-language description, namely "ab 16 times", which consists of 11 characters. String 2 has no description obviously simpler than writing down the string itself, which has 32 characters.
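Kolmogorov complexity itself is uncomputable, but the output length of any real compressor is a computable upper bound on it. A minimal sketch using Python's zlib on the two example strings:

```python
import zlib

s1 = b"abababababababababababababababab"  # highly regular
s2 = b"4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"  # looks random

# len(zlib.compress(s)) is a computable upper bound on K(s),
# up to the constant overhead of the zlib format.
c1 = len(zlib.compress(s1, 9))
c2 = len(zlib.compress(s2, 9))

print(c1, c2)  # the regular string compresses to far fewer bytes
```

The exact byte counts depend on the compressor, but the regular string always comes out much shorter, mirroring the gap between the 11-character description and the 32-character string.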

Brief Introduction to Kolmogorov Complexity
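This slide presumably showed the formal definition. For a fixed universal Turing machine U, the standard definition is:

```latex
K(x) = \min \{\, |p| : U(p) = x \,\}
```

the length of a shortest program p that outputs x. The choice of U changes K(x) only by an additive constant, and K is uncomputable.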

Information Distance (Occam’s razor)
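The slide likely displayed the information distance of Bennett et al., which the later slides normalize. Up to an additive logarithmic term it equals the length of the shortest program that transforms x into y and y into x; in the spirit of Occam's razor, that shortest conversion program captures exactly what the two objects share:

```latex
E(x,y) = \max\{\, K(x \mid y),\; K(y \mid x) \,\}
```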

Normalized Information Distance
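Presumably this slide showed the standard normalization: dividing by the complexity of the more complex object yields a relative distance between 0 and 1:

```latex
\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}
```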

Normalized Compression Distance
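The standard formula, presumably shown here, replaces the uncomputable K with C(s), the length of s when compressed by a real compressor (gzip, bzip2, PPMZ, ...), where C(xy) is the compressed length of the concatenation of x and y:

```latex
\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```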

Normalized Web Distance
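The standard formula, presumably shown here, derives the complexities from search-engine statistics: f(x) is the number of pages containing term x, f(x, y) the number of pages containing both terms, and N the total number of indexed pages (or a suitable multiple of it):

```latex
\mathrm{NWD}(x,y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x),\, \log f(y)\}}
```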

a) The code-word length involved is the one that is, on average, shortest for the probability involved.
b) The statistics involved are hit counts on Internet pages, not genuine probabilities. For example, if every page containing term x also contains term y and vice versa, then the NWD between x and y is 0, even though x and y may be different terms (like "yes" and "no").
The consequence is that NWD takes values primarily (but not exclusively) in [0, 1] and is not a metric: while "name1" may be semantically close to "name2", and "name2" semantically close to "name3", "name1" can still be semantically very different from "name3" (the triangle inequality can fail).
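A minimal sketch of computing NWD from hit counts. Real values of f would come from a search-engine API; the counts below are made up purely to exercise the formula:

```python
from math import log

def nwd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
    """Normalized Web Distance from page-hit counts.

    f_x, f_y -- number of pages containing term x / term y
    f_xy     -- number of pages containing both terms
    n        -- total number of indexed pages
    """
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Hypothetical counts for two moderately related terms:
print(nwd(f_x=1_000_000, f_y=800_000, f_xy=600_000, n=10**10))

# Terms that always co-occur (f_x == f_y == f_xy) get distance 0,
# even if the terms themselves differ (the "yes"/"no" caveat above):
print(nwd(500, 500, 500, n=10**10))  # → 0.0
```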

Applications 1. Hierarchical Clustering 2. Classification 3. Matching the Meaning

Hierarchical Clustering
1. Calculate a distance matrix using the NWDs among all pairs of terms in the input list.
2. Calculate a best-matching unrooted ternary tree for that distance matrix.
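Step 1 can be sketched offline by substituting the compression-based NCD for NWD (NWD needs live search-engine counts). The short texts standing in for each term are invented for illustration; step 2 (the quartet-method tree search) is omitted, and any hierarchical clustering routine could be run on the matrix instead:

```python
import zlib

def clen(s: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)

# Compressors behave poorly on very short inputs, so each term is
# represented by a longer (made-up) snippet rather than the bare word.
texts = {
    "red":    b"red is a color, the color of blood and fire " * 4,
    "yellow": b"yellow is a color, the color of the sun and lemons " * 4,
    "horse":  b"a horse is an animal that a rider rides " * 4,
}
names = list(texts)
# Step 1: the pairwise distance matrix.
matrix = {(a, b): ncd(texts[a], texts[b]) for a in names for b in names}
for a in names:
    print(a, ["%.2f" % matrix[(a, b)] for b in names])
```

Related terms should come out with smaller distances, since the compressor can reuse their shared substrings when compressing the concatenation.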

Hierarchical Clustering

Classification

Matching the Meaning
1. Assume that there are five words that appear in two different matched sentences, but the permutation associating the English and Spanish words is, as yet, undetermined.
2. For example: plant, car, dance, speak, friend versus bailar, hablar, amigo, coche, planta.
3. A preexisting vocabulary of eight English words with their matched Spanish translations is available: tooth/diente, joy/alegria, tree/arbol, electricity/electricidad, table/tabla, money/dinero, sound/sonido, music/musica.
4. Among all permutations, choose the one under which the pairwise NWDs of the Spanish words agree best with the pairwise NWDs of the corresponding English words.
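The permutation step can be sketched as follows. The distance values are made up for illustration (real ones would be NWDs from search-engine counts), only three word pairs are used to keep it short, and the score here is a simple sum of squared differences rather than the correlation score one might use in practice:

```python
from itertools import permutations

english = ["plant", "car", "dance"]
spanish = ["coche", "planta", "bailar"]  # alignment to `english` unknown

# Hypothetical pairwise NWDs within each language:
d_en = {("plant", "car"): 0.6, ("plant", "dance"): 0.8, ("car", "dance"): 0.7}
d_es = {("planta", "coche"): 0.6, ("planta", "bailar"): 0.8, ("coche", "bailar"): 0.7}

def dist(d, a, b):
    return d.get((a, b), d.get((b, a)))

def mismatch(perm):
    """How badly the Spanish distances under this alignment disagree
    with the English distances (sum of squared differences)."""
    return sum(
        (dist(d_en, english[i], english[j]) - dist(d_es, perm[i], perm[j])) ** 2
        for i in range(len(english))
        for j in range(i + 1, len(english))
    )

best = min(permutations(spanish), key=mismatch)
print(dict(zip(english, best)))  # → {'plant': 'planta', 'car': 'coche', 'dance': 'bailar'}
```

Because all three illustrative distances differ, only the correct alignment makes the two distance matrices agree exactly.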

Matching the Meaning

Summary The information distance between two objects can be measured by the size of the shortest description that transforms each object into the other. This idea is most naturally expressed mathematically using Kolmogorov complexity. Normalizing the information metric yields a relative similarity between 0 and 1. (UNCOMPUTABLE!) We approximate the Kolmogorov complexity parts by off-the-shelf compression programs (in the case of the normalized compression distance) or by readily available statistics from the Internet (in the case of the normalized web distance).

Future work The relation with your homework? Compressed length and probability