Ranking by Odds Ratio: A Probability Model Approach

Let R be a Boolean random variable: R = 1 if document d is relevant to query q, and R = 0 otherwise. Consider document d as a binary term vector. Rank documents by the odds ratio for relevance:

Pr(R=1 | d, q) / Pr(R=0 | d, q)
Ranking by Odds Ratio (contd.)

Assume that term occurrences are independent given the query and the value of R. Let x_t indicate whether term t appears in document d or not. Then the odds ratio factors into a product over terms:

Pr(R=1 | d, q) / Pr(R=0 | d, q) = [Pr(R=1 | q) / Pr(R=0 | q)] × ∏_t [Pr(x_t | R=1, q) / Pr(x_t | R=0, q)]
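Under the independence assumption, the log odds ratio reduces to a sum of per-term log-likelihood ratios. A minimal sketch in Python, assuming hypothetical per-term probabilities p[t] = Pr(x_t=1 | R=1) and q[t] = Pr(x_t=1 | R=0); the toy vocabulary and the probability values are invented for illustration:

```python
import math

def odds_ratio_score(doc_terms, vocab, p, q):
    """Log odds ratio log[Pr(R=1|d,q)/Pr(R=0|d,q)] under term independence.

    doc_terms: set of terms occurring in document d
    p[t]: assumed Pr(term t present | relevant)
    q[t]: assumed Pr(term t present | not relevant)
    (The constant prior-odds term is dropped; it does not affect ranking.)
    """
    score = 0.0
    for t in vocab:
        if t in doc_terms:
            score += math.log(p[t] / q[t])
        else:
            score += math.log((1 - p[t]) / (1 - q[t]))
    return score

# Toy example with made-up probabilities
vocab = ["star", "planet", "movie"]
p = {"star": 0.9, "planet": 0.8, "movie": 0.1}
q = {"star": 0.3, "planet": 0.2, "movie": 0.4}

d1 = {"star", "planet"}  # astronomy-like document
d2 = {"star", "movie"}   # entertainment-like document
```

With these numbers, d1 scores higher than d2, i.e. it is ranked as more likely relevant to an astronomy-oriented query.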
Bayesian Inferencing

Bayesian inference network for relevance ranking: a document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief to the node corresponding to the query. The mappings between terms and approximate concepts are specified manually.
Similarity and Clustering
Motivation

Problem 1: A query word can be ambiguous. E.g., the query "star" retrieves documents about astronomy, plants, animals, etc.
Solution: Visualization — cluster the documents retrieved for a query along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies is expensive.
Solution: Preliminary clustering of large samples of web documents.
Motivation (contd.)

Problem 3: Similarity search is slow over a whole collection.
Solution: Restrict the search for documents similar to a query to the most representative cluster(s).
Example Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering

Task: Devise measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than similarity across clusters.
Cluster hypothesis: Given a 'suitable' clustering of a collection, if the user is interested in a document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Collaborative filtering: clustering of two or more kinds of objects that have a bipartite relationship.
Clustering (contd.)

Two important paradigms:
- Bottom-up agglomerative clustering
- Top-down partitioning

Visualization techniques: embedding of the corpus in a low-dimensional space (e.g., via SVD).

Characterizing the entities:
- Internally: vector space model, probabilistic models
- Externally: a measure of similarity/dissimilarity between pairs

Learning: supplement stock algorithms with experience with the data.
Clustering Problem

Given a database D = {t_1, t_2, …, t_n} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each t_i is assigned to one cluster K_j, 1 ≤ j ≤ k. A cluster K_j contains precisely those tuples mapped to it. Unlike the classification problem, the clusters are not known a priori.
Clustering: Parameters

- Similarity measure (e.g., cosine similarity)
- Distance measure (e.g., Euclidean distance)
- Number k of clusters

Issues:
- Large number of noisy dimensions
- The notion of noise is application dependent
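The two example measures named above can be sketched directly; these are plain-Python versions over dense vectors, not tuned for sparse document representations:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def euclidean_distance(x, y):
    """Straight-line distance between points x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note the duality: cosine similarity is largest (1.0) for identical directions, while Euclidean distance is smallest (0.0) for identical points; clustering algorithms must be told which convention a measure follows.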
Cluster Parameters
Clustering: Formal Specification

Partitioning approaches:
- Bottom-up clustering
- Top-down clustering

Geometric embedding approaches:
- Self-organizing map
- Multidimensional scaling
- Latent semantic indexing

Generative models and probabilistic approaches:
- Single topic per document
- Documents correspond to mixtures of multiple topics
Clustering Houses: size based vs. geographic-distance based (figure).
Clustering vs. Classification

Clustering assumes no prior knowledge of:
- the number of clusters
- the meaning of the clusters

It is unsupervised learning.
Clustering Issues
- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
Impact of Outliers on Clustering
Distance Between Clusters
- Single link: smallest distance between points in the two clusters
- Complete link: largest distance between points in the two clusters
- Average link: average distance between points in the two clusters
- Centroid: distance between the cluster centroids
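The four inter-cluster distances can be sketched as functions taking two clusters (lists of points) and a point-to-point distance function; `centroid_link` additionally assumes points are numeric tuples:

```python
def single_link(c1, c2, dist):
    """Smallest point-to-point distance between clusters c1 and c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, dist):
    """Largest point-to-point distance between clusters c1 and c2."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2, dist):
    """Average over all point pairs across the two clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2, dist):
    """Distance between the component-wise means of the two clusters."""
    def centroid(c):
        dims = len(c[0])
        return tuple(sum(p[i] for p in c) / len(c) for i in range(dims))
    return dist(centroid(c1), centroid(c2))
```

For example, with 1-D clusters c1 = [(0,), (1,)] and c2 = [(3,), (5,)] and absolute difference as the distance, single link gives 2, complete link gives 5, and both average and centroid link give 3.5.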
Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative (bottom-up):
- Initially each item is in its own cluster
- Clusters are iteratively merged together

Divisive (top-down):
- Initially all items are in one cluster
- Large clusters are successively divided
Hierarchical Algorithms
- Single link
- MST single link
- Complete link
- Average link
Dendrogram

A dendrogram is a tree data structure that illustrates hierarchical clustering techniques. Each level shows the clusters for that level.
- Leaves: individual (singleton) clusters
- Root: one all-inclusive cluster
- A cluster at level i is the union of its child clusters at level i+1.
Levels of Clustering
Agglomerative Example

Distance matrix over five items A–E:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

As the distance threshold is raised, clusters are merged level by level (dendrogram figure omitted).
MST Example

For the same distance matrix, the minimum spanning tree connects A–B and C–D (distance 1), joins the two pairs (distance 2), and finally attaches E (distance 3). (Figure omitted.)
Agglomerative Algorithm
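The agglomerative procedure can be sketched as repeatedly merging the two closest clusters until one remains; this naive O(n³) sketch (an assumption, not the slide's own pseudocode) uses the single-link inter-cluster distance and the 5-point distance matrix from the example above:

```python
def single_link_agglomerative(labels, dist):
    """Repeatedly merge the two closest clusters (single-link distance).

    Returns the merge history as (cluster1, cluster2, distance) triples.
    """
    clusters = [frozenset([l]) for l in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Distance matrix from the agglomerative example
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}

merges = single_link_agglomerative("ABCDE", dist)
```

On this matrix the merges happen at distances 1, 1, 2, 3: first {A,B} and {C,D}, then the two pairs, then E joins last, matching the level-by-level picture of the example.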
Single Link

View all items as a graph with links (distances) between them. The algorithm finds maximal connected components in this graph: two clusters are merged if there is at least one edge that connects them. It uses a threshold distance at each level, and can be run agglomeratively or divisively.
MST Single Link Algorithm
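Single link is closely tied to minimum spanning trees: build the MST of the item graph, then deleting the k−1 heaviest tree edges leaves exactly k single-link clusters. A sketch using Kruskal's algorithm with a union-find helper (the helper is an assumption, not from the slides), reusing the example distance matrix:

```python
def mst_edges(nodes, dist):
    """Kruskal's algorithm: return MST edges in order of increasing weight."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (dist[u][v], u, v)
        for i, u in enumerate(nodes) for v in nodes[i + 1:]
    )
    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:            # keep the edge only if it joins two components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Distance matrix from the MST example
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}

tree = mst_edges(list("ABCDE"), dist)
```

For this matrix the MST has four edges with weights 1, 1, 2, 3; cutting the heaviest edge (weight 3) leaves the two single-link clusters {A,B,C,D} and {E}.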
Clustering Results