The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Programming Collective Intelligence by Toby Segaran.

Presentation transcript:

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Programming Collective Intelligence by Toby Segaran, O’Reilly Media, 2007, ISBN

 A cluster is a group of related things  Automatic detection of clusters is a powerful data discovery tool  Detect similar user interests, buying patterns, clickthrough patterns, etc.  Also applicable to the sciences ▪ In computational biology, find groups (or clusters) of genes that exhibit similar behavior

 Data clustering is an example of an unsupervised learning algorithm... ...which is an AI technique for discovering structure within one or more datasets  The key goal is to find the distinct group(s) that exist within a given dataset  We don’t know what we’ll find

We first need to identify a common set of numerical attributes that we can compare to measure how similar two items are. Can we do anything with word frequencies?
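As a concrete sketch of turning text into comparable numbers, here is a minimal Python word-frequency counter (the sample sentence and the function name are illustrative, not from the slides):

```python
import re
from collections import Counter

def word_frequencies(text):
    # Lowercase the text, split it into alphabetic words, and count each word
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

freqs = word_frequencies("Clustering groups similar blogs. Similar blogs cluster together.")
# freqs["similar"] == 2 and freqs["blogs"] == 2
```

Each blog then becomes a vector of such counts, giving us the common numerical attributes to compare.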

 If we cluster blogs based on their word frequencies, maybe we can identify groups of blogs that are... ...similar in terms of blog content ...similar in terms of writing style ...of interest for searching, cataloging, etc.

 A feed is a simple XML document containing information about a blog and its entries  Reader apps enable users to read multiple blogs in a single window ▪ Click below to check out the Google Reader blog:
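For illustration, a minimal RSS-style feed can be parsed with Python's standard library (the feed content below is made up; real reader apps and production code typically use a dedicated feed-parsing library that also handles Atom and malformed feeds):

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed: channel metadata plus two entries
rss = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><description>Hello clustering world</description></item>
  <item><title>Second post</title><description>More on clustering</description></item>
</channel></rss>"""

root = ET.fromstring(rss)
blog_title = root.findtext("channel/title")
entry_titles = [item.findtext("title") for item in root.iter("item")]
# blog_title == "Example Blog"; entry_titles == ["First post", "Second post"]
```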

 Check out these feeds:     

 Techniques for avoiding stop words:  Ignore words on a predefined stop list  Select words from within a predefined range of occurrence percentages ▪ Lower bound of 10% ▪ Upper bound of 50% ▪ Tune as necessary
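The percentage-bounds technique can be sketched in Python (the function name and sample counts are illustrative; the bounds are the tunable parameters from the slide):

```python
def select_words(blog_counts, num_blogs, lower=0.10, upper=0.50):
    # Keep only words whose fraction of blogs falls strictly between the bounds;
    # very common words (stop words) and very rare words are both dropped
    return [word for word, count in blog_counts.items()
            if lower < count / num_blogs < upper]

# blog_counts maps each word to the number of blogs it appears in
counts = {"the": 95, "python": 30, "rare": 2}
selected = select_words(counts, num_blogs=100)
# selected == ["python"]  ("the" is too common, "rare" too infrequent)
```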

 Study the resulting blog data  Identify any patterns in the data  Which blogs are very similar?  Which blogs are very different?  How can these techniques be applied to other types of search?  Web search?  Enterprise search?

 Hierarchical clustering is an algorithm that groups similar items together  At each iteration, the two most similar items (or groups) are merged  For example, given five items A–E: (diagram: five separate items A, B, C, D, E)

 Calculate the distances between all items  Group the two items that are closest:  Repeat! (diagram: A and B merged into group AB; C, D, and E unchanged)

 How do we compare group AB to other items? ▪ Use the midpoint of items A and B (diagram: group AB marked at the midpoint of A and B)

 When do we stop? ▪ When we have a top-level group that includes all items (diagram: merges continue through AB, DE, and ABC until ABCDE includes every item)
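The merge loop described above can be sketched in Python; the points, labels, and helper names are illustrative, and merged groups are represented by their midpoint as the slides suggest:

```python
import math

def hierarchical_cluster(items, distance):
    # Agglomerative clustering: repeatedly merge the two closest clusters.
    # Each cluster is a (label, point) pair; a merged cluster concatenates
    # the labels and uses the midpoint of the two points.
    clusters = list(items)
    while len(clusters) > 1:
        best = None  # (distance, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (la, pa), (lb, pb) = clusters[i], clusters[j]
        merged = (la + lb, tuple((a + b) / 2 for a, b in zip(pa, pb)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]  # the single top-level group

def euclidean(p, q):
    return math.dist(p, q)

items = [("A", (0, 0)), ("B", (1, 0)), ("C", (0, 3)), ("D", (8, 8)), ("E", (9, 8))]
root = hierarchical_cluster(items, euclidean)
# The top-level group's label contains all five items
```

With these sample points the merges happen in the order of the slides' example: A+B first, then D+E, then C joins AB, and finally everything merges into one group.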

 The hierarchical part is based on the discovery order of clusters  This diagram is called a dendrogram... (dendrogram: A, B, C, D, E merging into AB, DE, ABC, and finally ABCDE)

 A dendrogram is a graph (or tree)  Distances between nodes of the dendrogram show how similar items (or groups) are  AB is closer (to A and B) than DE is (to D and E), so A and B are more similar than D and E  How can we define closeness?

 A similarity score compares two distinct elements from a given set  To measure closeness, we need to calculate a similarity score for each pair of items in the set  Options include: ▪ The Euclidean distance score, which is based on the distance formula in two-dimensional geometry ▪ The Pearson correlation score, which is based on fitting data points to a line

 To find the Euclidean distance between two data points, use the distance formula: distance = √((x₂ – x₁)² + (y₂ – y₁)²)  The larger the distance between two items, the less similar they are  So use the reciprocal of the distance as a measure of similarity (but be careful of division by zero)
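A minimal Python version, using 1/(1 + distance) as one common way to avoid dividing by zero when two items are identical (the exact transformation is a design choice, not mandated by the slides):

```python
import math

def euclidean_similarity(v1, v2):
    # Euclidean distance between the two vectors (any number of dimensions)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    # Invert so that larger values mean "more similar"; identical vectors score 1.0
    return 1 / (1 + d)

# euclidean_similarity((0, 0), (3, 4)) == 1/6   (the distance is 5)
# euclidean_similarity((1, 2), (1, 2)) == 1.0
```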

 The Pearson correlation score is derived by determining the best-fit line for a given set of data points  The best-fit line, on average, comes as close as possible to each item  The Pearson correlation score is a coefficient measuring the degree to which items fall on the best-fit line (diagram: data points plotted against axes v1 and v2 with a best-fit line)

 The Pearson correlation score tells us how closely items are correlated to one another  1.0 is a perfect match; ~0.0 is no relationship (diagrams: two scatter plots over axes v1 and v2, with correlation scores 0.4 and 0.8)

 The algorithm is:  Calculate sum(v1) and sum(v2)  Calculate the sum of the squares of v1 and v2 ▪ Call them sum1Sq and sum2Sq  Calculate the sum of the products of v1 and v2 ▪ (v1[0] * v2[0]) + (v1[1] * v2[1]) + … + (v1[n-1] * v2[n-1]) ▪ Call this pSum

 Calculate the Pearson score: r = (pSum – (sum(v1) × sum(v2)) / n) / √((sum1Sq – sum(v1)² / n) × (sum2Sq – sum(v2)² / n))  Much more complex, but often better than the Euclidean distance score
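Putting the summation steps and the formula together, a straightforward Python implementation might look like this (returning 0 when one vector has no variance is an assumption to keep the function total; the slides do not specify that case):

```python
from math import sqrt

def pearson(v1, v2):
    # Pearson correlation score, following the summation steps above
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(x ** 2 for x in v1)       # sum of squares of v1
    sum2Sq = sum(x ** 2 for x in v2)       # sum of squares of v2
    pSum = sum(a * b for a, b in zip(v1, v2))  # sum of products
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0  # one vector is constant, so no correlation is defined
    return num / den

# pearson([1, 2, 3], [2, 4, 6]) == 1.0   (perfectly correlated)
# pearson([1, 2, 3], [6, 4, 2]) == -1.0  (perfectly anti-correlated)
```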

 Review the blog-data dendrograms  Identify any patterns in the data  Which blogs are very similar?  Which blogs are very different?  How can these techniques be applied to other types of search?  Web search?  Enterprise search?