The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
From Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN 978-0-596-52932-1
A cluster is a group of related things.
Automatic detection of clusters is a powerful data discovery tool:
▪ Detect similar user interests, buying patterns, clickthrough patterns, etc.
▪ Also applicable to the sciences; in computational biology, find groups (or clusters) of genes that exhibit similar behavior
Data clustering is an example of an unsupervised learning algorithm: an AI technique for discovering structure within one or more datasets.
The key goal is to find the distinct group(s) that exist within a given dataset.
We don’t know what we’ll find.
We first need to identify a common set of numerical attributes that we can compare to measure how similar two items are. Can we do anything with word frequencies?
If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are...
...similar in terms of blog content
...similar in terms of writing style
...of interest for searching, cataloging, etc.
A feed is a simple XML document containing information about a blog and its entries.
Reader apps enable users to read multiple blogs in a single window.
Check out these feeds:
http://blogs.abcnews.com/theblotter/index.rdf
http://www.wired.com/rss/index.xml
http://www.tmz.com/rss.xml
http://scienceblogs.com/sample/combined.xml
http://www.neilgaiman.com/journal/feed/rss.xml
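As a concrete starting point, here is a minimal sketch of turning one of these feeds into a word-frequency table. It assumes the third-party feedparser library; the get_word_counts helper name and the tag-stripping regex are illustrative choices, not part of the slides:

```python
import re
import feedparser

def get_word_counts(url):
    """Fetch an RSS/Atom feed and tally word frequencies across its entries."""
    counts = {}
    feed = feedparser.parse(url)
    for entry in feed.entries:
        # Prefer the entry summary; fall back to the title if absent
        text = entry.get('summary', entry.get('title', ''))
        # Strip any HTML tags, then split on non-alphabetic characters
        text = re.sub(r'<[^>]+>', ' ', text)
        for word in re.split(r'[^A-Za-z]+', text.lower()):
            if word:
                counts[word] = counts.get(word, 0) + 1
    return feed.feed.get('title', url), counts

# e.g. title, wc = get_word_counts('http://www.wired.com/rss/index.xml')
```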
Techniques for avoiding stop words:
Ignore words on a predefined stop list
Select words from within a predefined range of occurrence percentages
▪ Lower bound of 10%
▪ Upper bound of 50%
▪ Tune as necessary
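A minimal sketch of the percentage-based filter, assuming per-blog word counts have already been gathered (the blog_word_counts mapping and the select_words name are illustrative):

```python
def select_words(blog_word_counts, lower=0.10, upper=0.50):
    """Keep only words appearing in between 10% and 50% of the blogs.

    blog_word_counts maps blog title -> {word: count}. Words below the
    lower bound are too rare to compare on; words above the upper bound
    behave like stop words ('the', 'and', ...).
    """
    # Count how many blogs each word appears in at least once
    appearances = {}
    for counts in blog_word_counts.values():
        for word, count in counts.items():
            if count > 0:
                appearances[word] = appearances.get(word, 0) + 1

    n_blogs = len(blog_word_counts)
    return [w for w, n in appearances.items()
            if lower <= n / n_blogs <= upper]
```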
Study the resulting blog data and identify any patterns:
Which blogs are very similar?
Which blogs are very different?
How can these techniques be applied to other types of search? Web search? Enterprise search?
Hierarchical clustering is an algorithm that groups similar items together.
At each iteration, the two most similar items (or groups) are merged.
For example, given five items A–E:
[diagram: five separate items A, B, C, D, E]
Calculate the distances between all items.
Group the two items that are closest.
Repeat!
[diagram: A and B, the closest pair, merge into group AB]
How do we compare group AB to other items?
▪ Use the midpoint of items A and B
[diagram: group AB represented by the midpoint (x) of A and B; subsequent merges form ABC and DE]
When do we stop?
▪ When we have a top-level group that includes all items
[diagram: ABC and DE merge into the top-level group ABCDE]
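Putting these steps together, here is a compact sketch of the merge loop, assuming items are numeric vectors and a distance function (such as those defined on the later slides) is passed in; the hcluster name and the (vector, members) representation are illustrative choices:

```python
def hcluster(vectors, distance):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Each cluster is a (vector, members) pair; a merged cluster's vector is
    the midpoint of its children, and its members record the merge order.
    """
    clusters = [(v, i) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        # Find the pair of clusters whose vectors are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (v1, m1), (v2, m2) = clusters[i], clusters[j]
        # The merged group is represented by the midpoint of its children
        merged = ([(a + b) / 2 for a, b in zip(v1, v2)], (m1, m2))
        del clusters[j]   # delete the higher index first
        del clusters[i]
        clusters.append(merged)
    # The nested members tuple encodes the discovery order of the clusters
    return clusters[0][1]
```

For the five-item example, the returned tuple would look like (((0, 1), 2), (3, 4)), mirroring the AB, ABC, DE, ABCDE merges on the slides.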
The hierarchical part is based on the discovery order of clusters.
This diagram is called a dendrogram...
[dendrogram: leaves A, B, C, D, E joined through internal nodes AB, DE, ABC, and ABCDE]
A dendrogram is a graph (or tree).
Distances between nodes of the dendrogram show how similar items (or groups) are: AB is closer to A and B than DE is to D and E, so A and B are more similar than D and E.
How can we define closeness?
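To read the merge order at a glance, the nested tuple returned by the hcluster sketch above can be printed as an indented tree, a text-only stand-in for the dendrogram (the function name is illustrative):

```python
def print_dendrogram(node, labels, depth=0):
    """Print a nested merge tuple as an indented tree (a text dendrogram)."""
    if isinstance(node, tuple):      # internal node: a merge of two clusters
        print('  ' * depth + '+')
        for child in node:
            print_dendrogram(child, labels, depth + 1)
    else:                            # leaf: an index into the original items
        print('  ' * depth + labels[node])

# e.g. print_dendrogram((((0, 1), 2), (3, 4)), list('ABCDE'))
```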
A similarity score compares two distinct elements from a given set.
To measure closeness, we need to calculate a similarity score for each pair of items in the set. Options include:
▪ The Euclidean distance score, which is based on the distance formula in two-dimensional geometry
▪ The Pearson correlation score, which is based on fitting data points to a line
To find the Euclidean distance between two data points, use the distance formula:

$\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

The larger the distance between two items, the less similar they are. So use the reciprocal of the distance as a measure of similarity (but be careful of division by zero).
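A minimal sketch of turning that distance into a similarity score, generalized from two coordinates to full vectors; the 1/(1 + d) form is one common way to sidestep division by zero, and the function name is illustrative:

```python
from math import sqrt

def euclidean_similarity(v1, v2):
    """Return a similarity in (0, 1]: 1.0 for identical vectors,
    approaching 0 as the vectors grow farther apart."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return 1 / (1 + d)  # adding 1 avoids dividing by zero when d == 0
```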
The Pearson correlation score is derived by determining the best-fit line for a given set of data points.
The best-fit line, on average, comes as close as possible to each item.
The Pearson correlation score is a coefficient measuring the degree to which items lie on the best-fit line.
[scatter plot: data points on axes v1 and v2 with a best-fit line]
The Pearson correlation score tells us how closely items are correlated to one another: 1.0 is a perfect match; ~0.0 is no relationship.
[scatter plots on axes v1 and v2: a looser point cloud with correlation score 0.4 and a tighter one with correlation score 0.8]
The algorithm is:
Calculate sum(v1) and sum(v2)
Calculate the sum of the squares of v1 and v2
▪ Call them sum1Sq and sum2Sq
Calculate the sum of the products of v1 and v2
▪ (v1[0] * v2[0]) + (v1[1] * v2[1]) + ... + (v1[n-1] * v2[n-1])
▪ Call this pSum
Calculate the Pearson score:

$r = \dfrac{pSum - \frac{sum(v1) \cdot sum(v2)}{n}}{\sqrt{\left(sum1Sq - \frac{sum(v1)^2}{n}\right)\left(sum2Sq - \frac{sum(v2)^2}{n}\right)}}$

Much more complex, but often better than the Euclidean distance score.
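Translating the two slides above directly into code, a minimal sketch (the pearson name is illustrative, and equal-length vectors are assumed):

```python
from math import sqrt

def pearson(v1, v2):
    """Pearson correlation of two equal-length vectors, built from the
    sums described above: 1.0 is a perfect match, ~0.0 is no relationship."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(a * a for a in v1)             # sum1Sq
    sum2_sq = sum(b * b for b in v2)             # sum2Sq
    p_sum = sum(a * b for a, b in zip(v1, v2))   # pSum

    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    return num / den if den != 0 else 0.0  # guard flat (zero-variance) vectors
```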
Review the blog-data dendrograms and identify any patterns:
Which blogs are very similar?
Which blogs are very different?
How can these techniques be applied to other types of search? Web search? Enterprise search?