Google Similarity Distance. Presented by: Akshay Kumar, Pankaj Prateek.

Are these similar?
– Number '1' vs. the color 'red'
– Number '1' vs. 'small'
– 'Horse' vs. 'Rider'
– 'True' vs. 'False'
– 'Mona Lisa' vs. 'Virgin of the Rocks'
We need a universal similarity measure!

Information Distance
E(x, y): Given two strings x and y, the information distance between them is the length of the shortest binary program, in the reference universal computing system, that computes output y from input x and also output x from input y. Up to a negligible logarithmic additive term,
E(x, y) = max{ K(x|y), K(y|x) },
where K(·|·) denotes conditional Kolmogorov complexity.

Information Distance
– E(x, y) minorizes every other admissible distance, so it discovers the dominant feature in which the two strings are similar.
– It is still not a good similarity measure, because it is absolute: if two short strings differ by an information distance that is large relative to their lengths, the strings are very different; yet two very long strings differing by the same absolute distance are, relatively speaking, very similar.

Normalized Information Distance
To make the distance express relative similarity between two strings, normalize it by the complexity of the strings:
NID(x, y) = max{ K(x|y), K(y|x) } / max{ K(x), K(y) }
Problem: Kolmogorov complexity is uncomputable!

Normalized Compression Distance
To counter the uncomputability of K(x), replace it by C(x), the length of x compressed by some real-world compressor C (gzip, bzip2, etc.):
NCD(x, y) = ( C(xy) − min{ C(x), C(y) } ) / max{ C(x), C(y) }
where xy denotes the concatenation of x and y.
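A minimal sketch of NCD in Python, with zlib as the stand-in compressor C (the compressor choice is ours for illustration; any real-world compressor works):

```python
import zlib

def C(data: bytes) -> int:
    """Approximate K(data) by the length of its zlib compression."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

# Similar strings compress well together, so their NCD is smaller.
print(ncd(b"the quick brown fox" * 10, b"the quick brown dog" * 10))
print(ncd(b"the quick brown fox" * 10, b"9f8e7d6c5b4a3f2e1d0c" * 10))
```

The strings are repeated tenfold because NCD is noisy on inputs shorter than the compressor's overhead.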

Normalized Google Distance
NGD is a way of expressing NCD with a search engine in place of a compressor. Premise:
– The number of web pages indexed by Google is so vast that the page count of a search term approximates its true relative frequency of use.
– The measure is invariant to the growing size of the Google database.
NGD uses these term probabilities to define a similarity distance.

Normalized Google Distance
NGD(x, y) = ( max{ log f(x), log f(y) } − log f(x, y) ) / ( log N − min{ log f(x), log f(y) } )
where f(x) is the number of pages containing the term x, f(x, y) is the number of pages containing both x and y, and N is the total number of pages indexed.
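The formula translates directly into code. A small sketch, assuming the raw page counts have already been fetched from a search engine (the API query itself is not shown):

```python
from math import log

def ngd(f_x: float, f_y: float, f_xy: float, N: float) -> float:
    """Normalized Google Distance from raw page counts.

    f_x, f_y : number of pages containing each term
    f_xy     : number of pages containing both terms
    N        : total number of pages indexed
    """
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(N) - min(lx, ly))
```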

Example
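Using the ngd function above with illustrative page counts (the "horse"/"rider" numbers are of the order reported by Cilibrasi and Vitányi; the unrelated pair is hypothetical):

```python
N = 8e9  # assumed index size

# Terms that co-occur often get a small NGD ...
print(ngd(f_x=4.7e7, f_y=1.2e7, f_xy=2.6e6, N=N))  # "horse" vs "rider": ~0.44

# ... and terms that rarely co-occur get a large one (here even > 1,
# which already hints at the normalization problem noted below).
print(ngd(f_x=4.7e7, f_y=1.2e7, f_xy=5e3, N=N))    # ~1.41
```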

Errors/Shortcomings
– Not applicable to small data sets.
– The definition of N is problematic: the NGD can exceed 1, and the underlying "Google distribution" violates Kraft's inequality, so it is not a true probability distribution.
– Ignores the number of occurrences of a term within a page.
– Page-count analysis ignores the position of a word in a page: even if two words appear on the same page, they may not be related.
– The page count of a polysemous word conflates all of its senses, e.g. 'Apple' (fruit vs. company).
– Every page is implicitly treated as equally probable, which it is not (it depends on the page's rank).
– Given the scale and noise of the WWW, some words may co-occur on a page purely by chance.

Suggestions
– Use the snippets returned along with web searches to determine the similarity of the words. Only the snippets of the top-ranking pages can be processed efficiently, however, and there is no guarantee that all the information needed to measure semantic similarity is contained in them.
– Combine information from web searches with information from other resources (Wikipedia, WordNet, etc.) to compute the similarity measure.

SVM – NGD Learning
– Given a set of n-dimensional data points together with their class labels, i.e. pairs (x_i, y_i); only two classes are considered here for brevity.
– Divide this set into two parts: a training set and a testing set.
– Learn a dividing surface that separates the points of the two classes, using the training set.
– Validate the result by testing on the testing set.
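A minimal scikit-learn sketch of this train/validate workflow on toy 2-D data (the library and the data are our choices for illustration):

```python
from sklearn import svm
from sklearn.model_selection import train_test_split

# Toy two-class, 2-D data set.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Split into training and testing sets (stratified so both classes appear).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Learn the dividing surface on the training set, validate on the test set.
clf = svm.SVC(kernel="linear").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```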

SVM – NGD Learning
– Take a set of training words and a set of anchor words of cardinality n (much smaller than the training set).
– Convert each training word into an n-dimensional vector whose i-th component is the NGD between that word and the i-th anchor word.
– Train an SVM on these vectors, as in the sketch below.
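A sketch of that pipeline, with a small hard-coded table of hypothetical page counts standing in for live search-engine queries:

```python
from math import log
from sklearn import svm

N = 8e9  # assumed index size

def ngd(fx, fy, fxy):
    return (max(log(fx), log(fy)) - log(fxy)) / (log(N) - min(log(fx), log(fy)))

# Hypothetical page counts; in a real system counts[w] and counts[(w, a)]
# would come from a search-engine API.
counts = {
    "horse": 4.7e7, "truck": 3.0e7, "animal": 9.0e7, "machine": 6.0e7,
    ("horse", "animal"): 8.0e6, ("horse", "machine"): 4.0e5,
    ("truck", "animal"): 3.0e5, ("truck", "machine"): 5.0e6,
}

anchors = ["animal", "machine"]  # anchor words

def featurize(word):
    # i-th component = NGD between the word and the i-th anchor word
    return [ngd(counts[word], counts[a], counts[(word, a)]) for a in anchors]

X = [featurize("horse"), featurize("truck")]
y = [0, 1]  # class 0 = living thing, class 1 = machine
clf = svm.SVC(kernel="linear").fit(X, y)
print(clf.predict([featurize("horse")]))  # -> [0]
```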

NGD Translation
NGD can also drive automatic translation: given a set of English words and the unordered set of their Spanish counterparts, the correct word-to-word assignment is the permutation whose pairwise NGD matrix among the Spanish words best matches the pairwise NGD matrix among the English words, as sketched below.
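A sketch of that permutation-matching idea, with a toy stand-in for the NGD function (the paper scores candidate permutations via correlation; squared mismatch is used here for simplicity):

```python
from itertools import permutations

def ngd_matrix(words, ngd):
    """Pairwise NGD matrix over a list of words."""
    return [[ngd(a, b) for b in words] for a in words]

def best_translation(source, candidates, ngd):
    """Order the candidates so their NGD matrix best matches the source's."""
    E = ngd_matrix(source, ngd)
    n = len(source)
    def mismatch(perm):
        S = ngd_matrix(list(perm), ngd)
        return sum((E[i][j] - S[i][j]) ** 2 for i in range(n) for j in range(n))
    return min(permutations(candidates), key=mismatch)

# Toy stand-in for NGD: hand-picked distances that mirror each other across
# the two languages (a real system would compute them from page counts).
D = {frozenset(p): d for p, d in [
    (("horse", "rider"), 0.3), (("horse", "blue"), 0.8), (("rider", "blue"), 0.7),
    (("caballo", "jinete"), 0.3), (("caballo", "azul"), 0.8), (("jinete", "azul"), 0.7),
]}
def toy_ngd(a, b):
    return 0.0 if a == b else D[frozenset((a, b))]

print(best_translation(["horse", "rider", "blue"],
                       ["azul", "caballo", "jinete"], toy_ngd))
# -> ('caballo', 'jinete', 'azul')
```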

References
– Cilibrasi, Rudi L., and Paul M. B. Vitányi. "The Google Similarity Distance." IEEE Transactions on Knowledge and Data Engineering 19.3 (2007): 370-383.
– Bollegala, Danushka, Yutaka Matsuo, and Mitsuru Ishizuka. "Measuring Semantic Similarity between Words Using Web Search Engines." Proceedings of WWW 2007: 757-766.