The Google Similarity Distance

• We’ve been talking about natural language parsing.
• Understanding the meaning of a sentence requires knowing relationships between words, e.g.
  house -> square
  house -> home
  house -> rooms
• There are many of these in our language!

• There are ongoing attempts to build databases of these relationships, but they are time- and labour-intensive.
• The Web is the largest text database on Earth. It contains low-grade information in abundance.
• Knowledge can be obtained about two kinds of objects: actual objects (a graph) and names of objects (“a graph”).
• Actual objects can be compared for similarity through their features.
• Names of objects can be compared for similarity through ‘Google semantics’, i.e. how they occur together on the Web.

The Idea:
• Define a new kind of semantics understandable by a computer.
• Google semantics: the content of the pages returned for a query on a word.
• For a pair of words: the pages returned after querying the words singly, and then together.
• Semantics is the context in which the words appear. Links from the pages to additional context are ignored.
• This only identifies associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.

The method:
Count how many pages Google returns for “monkey”, “president” and “monkey president”.
• Monkey: 74,200,000
• President: 363,000,000
• Monkey president: 2,230,000

The Google Distribution:
• The pages returned for a word x form the event x.
• The pages returned for the words x and y together form the event x∩y.
• Probability L of monkey is 74,200,000 / total number of pages (8×10^9) = 0.009275
• Probability L of president is 363,000,000 / total number of pages = 0.045375
• Probability L of monkey∩president is 2,230,000 / total number of pages ≈ 0.000279
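The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the page counts and the 8×10^9 total are the figures quoted on the slides, while the names `PAGE_COUNTS`, `TOTAL_PAGES` and `google_probability` are made up here for illustration.

```python
# Page counts quoted on the slides (hits reported by Google at the time).
PAGE_COUNTS = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}

# Total number of indexed pages assumed on the slide (8 x 10^9).
TOTAL_PAGES = 8e9


def google_probability(hits, total_pages=TOTAL_PAGES):
    """Return the fraction of all indexed pages that match a query."""
    return hits / total_pages


# Probability L for each single term and for the pair queried together.
probabilities = {
    query: google_probability(hits) for query, hits in PAGE_COUNTS.items()
}

for query, p in probabilities.items():
    print(f"L({query}) = {p:.6g}")
```

Each probability is simply a hit count divided by the assumed size of the index, so the results change if a different total is used.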

Normalisation:
• The values are normalised to produce a normalised Google distance (NGD).
• N = the sum of the three sets: 74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
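As a rough sketch of where the normalisation leads, the code below computes N as the sum of the three counts, as this slide does, and then evaluates the NGD formula as it is commonly stated in Cilibrasi and Vitányi's paper. The slide text here does not give that formula, so its use is an assumption, and the function name `normalised_google_distance` is hypothetical.

```python
import math

# Page counts quoted on the slides.
F_MONKEY = 74_200_000
F_PRESIDENT = 363_000_000
F_BOTH = 2_230_000

# N as defined on this slide: the sum of the three counts.
N = F_MONKEY + F_PRESIDENT + F_BOTH


def normalised_google_distance(fx, fy, fxy, n):
    """NGD as commonly stated (assumed here, not taken from the slide):

    NGD(x, y) = (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))
    """
    if fxy == 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


ngd = normalised_google_distance(F_MONKEY, F_PRESIDENT, F_BOTH, N)
print(f"N = {N:,}")
print(f"NGD(monkey, president) = {ngd:.3f}")
```

The function takes N as a parameter because the choice of N affects the result: the sum of the three counts used here gives a different value from the 8×10^9 page total used for the probabilities.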