Clustering: Specific Issues related to Project 2

Reducing dimensionality
  – Lowering the number of dimensions makes the problem more manageable
      Less memory
      Less time
      Less noise
  – Doesn't have to be particularly sophisticated
Get rid of noise
  – Superfluous terms
  – Stop-list
Identify important terms
  – White list
  – Term weighting?
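A minimal sketch of the stop-list and white-list ideas, assuming documents are already tokenized on whitespace. The names `STOP_LIST` and `reduce_terms` and the sample sentence are hypothetical and not part of Project 2.

```python
from collections import Counter

# Hypothetical stop-list; a real one would be much longer.
STOP_LIST = {"the", "a", "of", "and", "is", "in", "to"}


def reduce_terms(tokens, stop_list=STOP_LIST, white_list=None):
    """Count terms after dropping stop-list entries.

    If a white list is given, only terms on it are kept, which
    shrinks the number of dimensions even further.
    """
    kept = [t.lower() for t in tokens if t.lower() not in stop_list]
    if white_list is not None:
        kept = [t for t in kept if t in white_list]
    return Counter(kept)


tokens = "The morphology of the verb is marked in the gloss".split()
counts = reduce_terms(tokens)
focused = reduce_terms(tokens, white_list={"verb", "gloss"})
```

Each distinct term left in the counter is one dimension of the document vector, so every term removed here is one fewer dimension to store and compare.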

Project 2 Clustering
  – Goal: Cluster
  – Include linguistic documents
  – Exclude non-linguistic documents
  – Given:
      We can represent documents as vectors in multi-dimensional space
      – Vectors composed of words, etc. drawn from documents
      We have a mechanism for measuring the distance between vectors
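A minimal sketch of the two givens, assuming raw term counts as vector components and cosine as the distance mechanism. The helper names `to_vector`, `cosine_similarity`, and `cosine_distance` are illustrative only; the project may use a different representation or measure.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a document as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Turn similarity into a distance: 0 for same direction, 1 for no shared terms."""
    return 1.0 - cosine_similarity(u, v)


d1 = to_vector("noun verb gloss")
d2 = to_vector("noun verb gloss")
d3 = to_vector("stock market prices")
```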

Project 2 Clustering
  – Multiple methods
  – Two main methods:
      Hierarchical
      Non-hierarchical (partitional)
  – Hierarchical methods can be:
      Agglomerative (bottom-up)
      Divisive (top-down)

Clustering Methods
Agglomerative (bottom-up)
  1. Assume all vectors are in separate clusters
  2. Calculate distances between all pairs of vectors, and put them in an ordered list
  3. Iteratively and progressively cluster based on these distances
  4. Closest get clustered first, etc.
  5. Start again from 2 until all are clustered (or some threshold is reached)
Divisive (top-down)
  – Assume all vectors are in one cluster
  – Calculate distances
  – Separate based on least coherence, splitting most distant vectors
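The agglomerative steps above can be sketched as follows. This is a simplified illustration under stated assumptions: vectors are tuples of numbers, distance is Euclidean, cluster-to-cluster distance is taken as the closest pair (single link), and all distances are recomputed on every pass instead of being kept in an ordered list. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_pair_distance(c1, c2, dist):
    """Cluster distance as the distance of the two closest members (assumption)."""
    return min(dist(u, v) for u in c1 for v in c2)


def agglomerative(vectors, dist=euclidean, threshold=float("inf")):
    # Step 1: every vector starts in its own cluster.
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        # Step 2: distances between all pairs of current clusters.
        (i, j), d = min(
            (((i, j), closest_pair_distance(clusters[i], clusters[j], dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        # Step 5: stop once the closest pair is further apart than the threshold.
        if d > threshold:
            break
        # Steps 3-4: merge the closest pair; i < j, so popping j leaves i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
two_groups = agglomerative(points, threshold=2.0)
one_group = agglomerative(points)
```

With the threshold set to 2.0 the two nearby pairs are merged and the loop stops; without a threshold everything ends up in a single cluster.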

Clustering Methods
Suppose we have vectors {a}, {b}, {c}, {d}
We use the agglomerative method, and get the following: {a}, {b}, {c, d}
In the next iteration, how do we calculate the distance between {a} and {c, d}?
We can measure the distance between two vectors (e.g. cosine), but between clusters?

Cluster Distance
Methods for measuring distance between clusters
  – Single Link
  – Complete Link
  – Average Link
      Average distance between all vectors in two clusters
      Can be computationally expensive
      – In the worst case, requires calculating the distance between each vector in one cluster and each in the other (O(n²))
  – Centroid distance
      Measure the similarity between the centroids of the two clusters

Cluster Distance
Methods for measuring distance between clusters
  – Single Link: Similarity between two clusters is the similarity of their two closest members
      “Long and straggly” clusters
  – Complete Link: Similarity between two clusters is the similarity of their two most dissimilar members
      “Tighter” clusters
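A small sketch of both definitions, phrased in terms of distance (smallest distance corresponds to greatest similarity). The one-dimensional values for a, c, and d are made up for illustration, and `dist` stands in for whatever vector distance the project uses.

```python
def single_link(c1, c2, dist):
    """Distance between the two closest members of the clusters."""
    return min(dist(u, v) for u in c1 for v in c2)


def complete_link(c1, c2, dist):
    """Distance between the two most distant members of the clusters."""
    return max(dist(u, v) for u in c1 for v in c2)


def dist(u, v):
    """Toy distance on plain numbers (assumption for the example)."""
    return abs(u - v)


a, c, d = 0.0, 3.0, 5.0
nearest = single_link([a], [c, d], dist)     # distance from a to c
farthest = complete_link([a], [c, d], dist)  # distance from a to d
```

For the clusters {a} and {c, d}, single link reports the distance to the nearer member and complete link the distance to the farther one, which is why complete link tends to produce tighter clusters.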

Cluster Distance
Methods for measuring cluster distance
  – Average link clustering
      Measure the average distance across clusters
  – Problem:
      Average sounds good on the surface, but…
      Can be computationally expensive (O(n²) to O(n³))
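A minimal sketch of the average-link measure, assuming it means the mean of all cross-cluster pairwise distances. The function name and the toy numbers are illustrative only.

```python
def average_link(c1, c2, dist):
    """Mean distance over every pair with one member from each cluster.

    This makes len(c1) * len(c2) distance calls, which is where the
    quadratic cost for a single cluster comparison comes from.
    """
    total = sum(dist(u, v) for u in c1 for v in c2)
    return total / (len(c1) * len(c2))


def dist(u, v):
    """Toy distance on plain numbers (assumption for the example)."""
    return abs(u - v)


avg = average_link([0.0], [3.0, 5.0], dist)
```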

Centroid
Centroid of a cluster c containing M points:
  $$\vec{\mu} = \frac{1}{M} \sum_{\vec{x} \in c} \vec{x}$$
Effectively: each component of $\vec{\mu}$ is the average of the values for that component for the M points in c.
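A short sketch of the centroid as a component-wise average, assuming dense vectors stored as equal-length tuples. The names `centroid` and `centroid_distance` are hypothetical.

```python
import math


def centroid(cluster):
    """Average each component over the M points in the cluster."""
    m = len(cluster)
    return tuple(sum(component) / m for component in zip(*cluster))


def centroid_distance(c1, c2):
    """Euclidean distance between the two cluster centroids (one possible choice)."""
    mu1, mu2 = centroid(c1), centroid(c2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu1, mu2)))


mu = centroid([(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)])
gap = centroid_distance([(0.0, 0.0), (0.0, 2.0)], [(4.0, 1.0)])
```

Comparing centroids needs only one distance calculation per pair of clusters, once each centroid has been computed.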

Remaining Problems
Polysemy/homography
  – Alternate meanings of a term can have a negative effect on clustering
  – May cause clustering when we don’t want it
Synonymy
  – Terms that mean essentially the same thing (esp. in a given context and across documents) won’t help clustering

Cluster Reading
Parts of Ch 17 in J&M
Ch 14 of M&S, esp. first couple of sections
Jain & Murty 1999
  – Data Clustering: A Review
  – http://citeseer.ist.psu.edu/jain99data.html
Sparck Jones & Willett (eds) 1997
  – Readings in Information Retrieval
  – In library
