Ranking by Odds Ratio: A Probability Model Approach

Let R be a Boolean random variable: R = 1 if document d is relevant to query q, and R = 0 otherwise. Consider document d as a binary term vector. Rank documents by the odds ratio for relevance:

Pr(R=1 | d, q) / Pr(R=0 | d, q)
Ranking by Odds Ratio (contd.)

Assume that term occurrences are independent given the query and the value of R. Let x_t indicate whether term t appears in document d or not. Then the odds ratio factors into a product over terms:

Pr(R=1 | d, q) / Pr(R=0 | d, q) = [Pr(R=1 | q) / Pr(R=0 | q)] × ∏_t [Pr(x_t | R=1, q) / Pr(x_t | R=0, q)]
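Under the independence assumption, the log odds ratio reduces to a sum of per-term log-likelihood ratios. A minimal sketch in Python, assuming hypothetical per-term probabilities p[t] = Pr(x_t=1 | R=1) and q[t] = Pr(x_t=1 | R=0); the toy vocabulary and the probability values are invented for illustration:

```python
import math

def odds_ratio_score(doc_terms, vocab, p, q):
    """Log odds ratio log[Pr(R=1|d,q)/Pr(R=0|d,q)] under term independence.

    doc_terms: set of terms occurring in document d
    p[t]: assumed Pr(term t present | relevant)
    q[t]: assumed Pr(term t present | not relevant)
    (The constant prior-odds term is dropped; it does not affect ranking.)
    """
    score = 0.0
    for t in vocab:
        if t in doc_terms:
            score += math.log(p[t] / q[t])
        else:
            score += math.log((1 - p[t]) / (1 - q[t]))
    return score

# Toy example with made-up probabilities
vocab = ["star", "planet", "movie"]
p = {"star": 0.9, "planet": 0.8, "movie": 0.1}
q = {"star": 0.3, "planet": 0.2, "movie": 0.4}

d1 = {"star", "planet"}  # astronomy-like document
d2 = {"star", "movie"}   # entertainment-like document
```

With these numbers, d1 scores higher than d2, i.e. it is ranked as more likely relevant to an astronomy-oriented query.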
Bayesian Inferencing

Bayesian inference network for relevance ranking: a document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief to the node corresponding to the query. The mappings between terms and approximate concepts are specified manually.
Similarity and Clustering
Motivation

Problem 1: A query word can be ambiguous. E.g., the query "star" retrieves documents about astronomy, plants, animals, etc.
Solution: Visualization — cluster the documents retrieved for a query along lines of different topics.

Problem 2: Manual construction of topic hierarchies and taxonomies is expensive.
Solution: Preliminary clustering of large samples of web documents.
Motivation (contd.)

Problem 3: Similarity search is slow over a whole collection.
Solution: Restrict the search for documents similar to a query to the most representative cluster(s).
Example Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)
Clustering

Task: Devise measures of similarity to cluster a collection of documents/terms into groups, such that similarity within a cluster is larger than similarity across clusters.
Cluster hypothesis: Given a 'suitable' clustering of a collection, if the user is interested in a document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Collaborative filtering: clustering of two or more kinds of objects that have a bipartite relationship.
Clustering (contd.)

Two important paradigms:
- Bottom-up agglomerative clustering
- Top-down partitioning

Visualization techniques: embedding of the corpus in a low-dimensional space (e.g., via SVD).

Characterizing the entities:
- Internally: vector space model, probabilistic models
- Externally: a measure of similarity/dissimilarity between pairs

Learning: supplement stock algorithms with experience with the data.
Clustering Problem

Given a database D = {t_1, t_2, …, t_n} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each t_i is assigned to one cluster K_j, 1 ≤ j ≤ k. A cluster K_j contains precisely those tuples mapped to it. Unlike the classification problem, the clusters are not known a priori.
Clustering: Parameters

- Similarity measure (e.g., cosine similarity)
- Distance measure (e.g., Euclidean distance)
- Number k of clusters

Issues:
- Large number of noisy dimensions
- The notion of noise is application dependent
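The two example measures named above can be sketched directly; these are plain-Python versions over dense vectors, not tuned for sparse document representations:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def euclidean_distance(x, y):
    """Straight-line distance between points x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note the duality: cosine similarity is largest (1.0) for identical directions, while Euclidean distance is smallest (0.0) for identical points; clustering algorithms must be told which convention a measure follows.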
Cluster Parameters
Clustering: Formal Specification

Partitioning approaches:
- Bottom-up clustering
- Top-down clustering

Geometric embedding approaches:
- Self-organizing map
- Multidimensional scaling
- Latent semantic indexing

Generative models and probabilistic approaches:
- Single topic per document
- Documents correspond to mixtures of multiple topics
Clustering Houses: size based vs. geographic-distance based (figure).
Clustering vs. Classification

Clustering assumes no prior knowledge of:
- the number of clusters
- the meaning of the clusters

It is unsupervised learning.
Clustering Issues
- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
Impact of Outliers on Clustering
Distance Between Clusters
- Single link: smallest distance between points in the two clusters
- Complete link: largest distance between points in the two clusters
- Average link: average distance between points in the two clusters
- Centroid: distance between the cluster centroids
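The four inter-cluster distances can be sketched as functions taking two clusters (lists of points) and a point-to-point distance function; `centroid_link` additionally assumes points are numeric tuples:

```python
def single_link(c1, c2, dist):
    """Smallest point-to-point distance between clusters c1 and c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, dist):
    """Largest point-to-point distance between clusters c1 and c2."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2, dist):
    """Average over all point pairs across the two clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_link(c1, c2, dist):
    """Distance between the component-wise means of the two clusters."""
    def centroid(c):
        dims = len(c[0])
        return tuple(sum(p[i] for p in c) / len(c) for i in range(dims))
    return dist(centroid(c1), centroid(c2))
```

For example, with 1-D clusters c1 = [(0,), (1,)] and c2 = [(3,), (5,)] and absolute difference as the distance, single link gives 2, complete link gives 5, and both average and centroid link give 3.5.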
Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative (bottom-up):
- Initially each item is in its own cluster
- Clusters are iteratively merged together

Divisive (top-down):
- Initially all items are in one cluster
- Large clusters are successively divided
Hierarchical Algorithms
- Single link
- MST single link
- Complete link
- Average link
Dendrogram

A dendrogram is a tree data structure that illustrates hierarchical clustering techniques. Each level shows the clusters for that level.
- Leaves: individual (singleton) clusters
- Root: one all-inclusive cluster
- A cluster at level i is the union of its child clusters at level i+1.
Levels of Clustering
Agglomerative Example

Distance matrix over five items A–E:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

As the distance threshold is raised, clusters are merged level by level (dendrogram figure omitted).
MST Example

For the same distance matrix, the minimum spanning tree connects A–B and C–D (distance 1), joins the two pairs (distance 2), and finally attaches E (distance 3). (Figure omitted.)
Agglomerative Algorithm
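The agglomerative procedure can be sketched as repeatedly merging the two closest clusters until one remains; this naive O(n³) sketch (an assumption, not the slide's own pseudocode) uses the single-link inter-cluster distance and the 5-point distance matrix from the example above:

```python
def single_link_agglomerative(labels, dist):
    """Repeatedly merge the two closest clusters (single-link distance).

    Returns the merge history as (cluster1, cluster2, distance) triples.
    """
    clusters = [frozenset([l]) for l in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Distance matrix from the agglomerative example
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}

merges = single_link_agglomerative("ABCDE", dist)
```

On this matrix the merges happen at distances 1, 1, 2, 3: first {A,B} and {C,D}, then the two pairs, then E joins last, matching the level-by-level picture of the example.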
Single Link

View all items as a graph with links (distances) between them. The algorithm finds maximal connected components in this graph: two clusters are merged if there is at least one edge that connects them. It uses a threshold distance at each level, and can be run agglomeratively or divisively.
MST Single Link Algorithm
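Single link is closely tied to minimum spanning trees: build the MST of the item graph, then deleting the k−1 heaviest tree edges leaves exactly k single-link clusters. A sketch using Kruskal's algorithm with a union-find helper (the helper is an assumption, not from the slides), reusing the example distance matrix:

```python
def mst_edges(nodes, dist):
    """Kruskal's algorithm: return MST edges in order of increasing weight."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (dist[u][v], u, v)
        for i, u in enumerate(nodes) for v in nodes[i + 1:]
    )
    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:            # keep the edge only if it joins two components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Distance matrix from the MST example
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}

tree = mst_edges(list("ABCDE"), dist)
```

For this matrix the MST has four edges with weights 1, 1, 2, 3; cutting the heaviest edge (weight 3) leaves the two single-link clusters {A,B,C,D} and {E}.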
Clustering Results