The Google Similarity Distance


1 The Google Similarity Distance
• We’ve been talking about natural language parsing.
• Understanding the meaning of a sentence requires knowing the relationships between words, e.g. house -> square, house -> home, house -> rooms.
• There are many of these in our language!

2
• There are ongoing attempts to build databases of these relationships, but they are time- and labour-intensive.
• The Web is the largest text database on Earth. It contains low-grade information in abundance.
• There are two kinds of objects about which knowledge can be obtained: actual objects (a graph) and names of objects (“a graph”).
• Actual objects can be compared for similarity through their features.
• Names of objects can be compared for similarity through ‘Google semantics’, i.e. how they occur together on the Web.

3 The Idea:
• Define a new kind of semantics that a computer can understand.
• Google semantics of a word: the content of the pages returned for a query on that word.
• For a pair of words: the pages returned when querying each word singly, and then both together.
• Semantics is the context in which the words appear. Links from the pages to additional context are ignored.
• This only identifies associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.

4 The method:
Count how many pages Google returns for “monkey”, “president” and “monkey president”.
Monkey: 74,200,000
President: 363,000,000
Monkey president: 2,230,000
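The counting step can be tried out without a search engine. The following is a minimal sketch that assumes a tiny in-memory list of "pages" in place of the Web; the helper `page_count` and the sample pages are hypothetical and exist only for illustration.

```python
def page_count(pages, *terms):
    """Count the pages that contain every given term (case-insensitive, whole words)."""
    wanted = {t.lower() for t in terms}
    return sum(1 for page in pages if wanted <= set(page.lower().split()))


# Hypothetical toy corpus standing in for the pages a search engine indexes.
pages = [
    "the monkey climbed the tree",
    "the president gave a speech",
    "a monkey met the president",
    "the tree fell",
]

f_monkey = page_count(pages, "monkey")              # pages containing "monkey"
f_president = page_count(pages, "president")        # pages containing "president"
f_both = page_count(pages, "monkey", "president")   # pages containing both words

print(f_monkey, f_president, f_both)
```

A real experiment would replace `page_count` with the hit counts reported by a search engine, as on the slide.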

5 The Google Distribution:
The set of pages returned for a word x is the event x. The set of pages returned for words x and y together is the event x∩y.
Probability L of monkey: 74,200,000 / total number of pages (8 × 10^9) = 0.009275
Probability L of president: 363,000,000 / total number of pages = 0.045375
Probability L of monkey∩president: 2,230,000 / total number of pages = 0.00027875
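The arithmetic on this slide can be reproduced in a few lines. This is a small sketch using the page counts and the total of 8 × 10^9 pages quoted above; the names `TOTAL_PAGES`, `counts` and `probs` are chosen here for illustration.

```python
# Total number of indexed pages, as quoted on the slide.
TOTAL_PAGES = 8_000_000_000

# Page counts quoted on the slide.
counts = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}

# Probability of each event = page count / total number of pages.
probs = {query: hits / TOTAL_PAGES for query, hits in counts.items()}

for query, p in probs.items():
    print(f"L({query}) = {p:.8f}")
```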

6 Normalisation:
• The values are normalised to produce a normalised Google distance (NGD).
• N = the sum of the sizes of the three sets: 74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
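The slides stop before giving the NGD formula itself. The sketch below assumes the definition published by Cilibrasi and Vitányi, NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f is a page count. The function name `ngd` is hypothetical, and the choice of N is an assumption: the example passes the 8 × 10^9 total from the previous slide, but the N computed on this slide could be passed instead.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalised Google distance from page counts.

    fx, fy -- pages containing each word on its own
    fxy    -- pages containing both words
    n      -- normalising constant (assumed here: total number of pages)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Page counts from the slides; N is an assumption (see the lead-in).
distance = ngd(74_200_000, 363_000_000, 2_230_000, 8_000_000_000)
print(f"NGD(monkey, president) = {distance:.3f}")
```

Two words that always occur together get a distance of 0, and the distance grows as the words co-occur less often relative to how often each occurs alone.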

