Independent Components in Text. Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson. Presented by Yuan Zhijian
Vector Space Representations Indexing: forming a term set of all words occurring in the database. -- Form term set -- Documents -- Term-document matrix
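The indexing step above can be sketched as follows; the toy corpus and variable names are illustrative, not from the paper:

```python
# Minimal sketch of indexing: build a term set and a term-document
# count matrix from a toy corpus.
import numpy as np

docs = [
    "ica separates independent sources",
    "lsi rotates the vector space basis",
    "ica and lsi model text documents",
]

# Form the term set: all words occurring in the database.
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: X[i, j] = count of term i in document j.
X = np.zeros((len(terms), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in d.split():
        X[terms.index(w), j] += 1

print(X.shape)  # (number of terms, number of documents)
```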
Vector Space Representations Weighting: Determine the values of the weights Similarity measure: based on inner product of weight vectors or other metrics
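A sketch of one common weighting scheme (tf-idf) and the inner-product similarity measure; the paper does not fix this particular choice, so treat the weights here as an assumption:

```python
# Weight the term-document counts with tf-idf, then compare two
# documents by cosine similarity (normalized inner product).
import numpy as np

X = np.array([[2, 0, 1],
              [0, 1, 0],
              [1, 1, 1]], dtype=float)  # toy term-document counts

df = (X > 0).sum(axis=1)        # document frequency of each term
idf = np.log(X.shape[1] / df)   # inverse document frequency
W = X * idf[:, None]            # tf-idf weighted matrix

# Cosine similarity of documents 0 and 1.
a, b = W[:, 0], W[:, 1]
sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim)
```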
LSI-PCA Model The main objective is to uncover hidden linear relations between term histograms by rotating the vector space basis. Simplify by keeping only the k largest singular values
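The LSI step can be sketched with a truncated SVD; the matrix sizes and k are illustrative:

```python
# LSI sketch: rotate the vector-space basis with an SVD and keep
# only the k largest singular values.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 8))          # toy term-document matrix: 20 terms, 8 docs

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_k = U[:, :k] * s[:k] @ Vt[:k]  # rank-k approximation of X

# k-dimensional coordinates of the documents in the latent (LSI) space:
docs_lsi = (s[:k, None] * Vt[:k]).T   # shape (8, 3)
print(docs_lsi.shape)
```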
ICA—Noisy Separation Model: X = AS + U. Assumptions: -- i.i.d. sources -- i.i.d. Gaussian noise with variance σ² -- Non-Gaussian source distribution
ICA—Noisy Separation (cont.) Known mixing parameters, e.g. A and σ² -- Bayes' formula: P(S|X) ∝ P(X|S)P(S) -- Maximize it w.r.t. S -- Solution: for low noise levels the estimate approaches the pseudo-inverse solution S ≈ A⁺X
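The low-noise limit can be illustrated numerically. Assuming a known mixing matrix A and weak Gaussian noise, maximizing P(S|X) ∝ P(X|S)P(S) over S reduces to the pseudo-inverse (least-squares) solution; the data below are a toy simulation, not the paper's estimator:

```python
# Low-noise separation with a known mixing matrix: S_hat = pinv(A) @ X.
import numpy as np

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 500))               # sparse i.i.d. sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                   # known mixing matrix
X = A @ S + 0.01 * rng.normal(size=S.shape)  # low-level Gaussian noise

S_hat = np.linalg.pinv(A) @ X                # low-noise MAP/ML solution
print(np.abs(S_hat - S).mean())              # small reconstruction error
```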
ICA (cont.) Text representations in the LSI space -- Document classification -- Key words -- Back projection of documents to the original vector-histogram space
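Key-word extraction by back projection can be sketched as follows: a component direction in LSI space is mapped back through the SVD basis to a weight per original term, and the largest weights name the component. The matrices and the mixing matrix here are illustrative assumptions:

```python
# Back-project ICA component directions from LSI space to the
# original term-histogram space and read off the top-weighted terms.
import numpy as np

rng = np.random.default_rng(2)
terms = [f"term{i}" for i in range(30)]
X = rng.random((30, 10))           # toy term-document matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
A = rng.random((k, k))             # toy ICA mixing matrix in LSI space

# Back projection: component directions in the original term space.
components_terms = U[:, :k] @ A    # shape (30, k)
for c in range(k):
    top = np.argsort(-np.abs(components_terms[:, c]))[:3]
    print([terms[i] for i in top])  # key words for component c
```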
ICA (cont.) Generalisation error -- Principal tool for model selection. Bias-variance dilemma: -- Too few components, leading to high error (bias) -- Too many components, leading to overfitting (variance)
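A toy simulation of this bias-variance tradeoff: a subspace is learned on training documents and used to reconstruct held-out noisy documents, with error measured against the clean signal. Too few components give high bias, too many fit training noise. All numbers are illustrative:

```python
# Held-out error vs. number of components k: a U-shaped curve.
import numpy as np

rng = np.random.default_rng(3)
true_rank, n_terms, n_docs = 4, 50, 40
B = rng.normal(size=(n_terms, true_rank))
signal = B @ rng.normal(size=(true_rank, 2 * n_docs))
noisy = signal + 0.8 * rng.normal(size=signal.shape)

train, test = noisy[:, :n_docs], noisy[:, n_docs:]
clean_test = signal[:, n_docs:]

U, _, _ = np.linalg.svd(train, full_matrices=False)
errors = {}
for k in range(1, 11):
    P = U[:, :k] @ U[:, :k].T      # projector onto the k-dim train subspace
    errors[k] = np.linalg.norm(clean_test - P @ test)

best_k = min(errors, key=errors.get)
print(best_k)  # tends to land near the true rank
```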
Examples MED data set -- 124 abstracts, 5 groups, 1159 terms Results: -- ICA is successful in recognizing and ”explaining” the group structure.
Examples CRAN data set -- 5 classes, 138 documents, 1115 terms Results: -- ICA identified some group structure but not as convincingly as in the MED data
Conclusion ICA is an effective unsupervised tool for uncovering group structure in text. Independence of the sources may or may not be well aligned with a manual labeling