Independent Components in Text

Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson
Yuan Zhijian

Vector Space Representations
Indexing: forming a term set of all words occurring in the database
-- Term set
-- Document
-- Term-document matrix
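The indexing step can be sketched as follows. This is a minimal illustration, not the paper's preprocessing: it assumes whitespace tokenisation and lowercasing, and the function name `build_term_document_matrix` and the three toy documents are hypothetical.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: split on whitespace and lowercase; real systems also
    # remove stop words and apply stemming.
    tokenized = [doc.lower().split() for doc in documents]
    # The term set is every distinct word in the collection, sorted for
    # a reproducible row order.
    terms = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per term, one column per document (a term histogram per document).
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical toy collection.
docs = ["heart blood pressure", "blood flow", "wing flow pressure flow"]
terms, X = build_term_document_matrix(docs)
for term, row in zip(terms, X):
    print(term, row)
```

Each column of `X` is the word histogram of one document, which is the representation the later slides operate on.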

Vector Space Representations
Weighting: determine the values of the weights
Similarity measure: based on the inner product of weight vectors, or on other metrics
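As an illustration of weighting and an inner-product similarity, the sketch below applies tf-idf weights and compares documents by cosine similarity. The slide does not say which weighting scheme is used, so tf-idf is an assumption here, and the function names and the small count matrix are hypothetical.

```python
import math


def tfidf_weights(counts):
    """Weight a term-document count matrix (rows = terms, columns = documents)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        # Document frequency: in how many documents the term occurs.
        doc_freq = sum(1 for c in row if c > 0)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weighted.append([c * idf for c in row])
    return weighted


def column(matrix, j):
    """Extract document j as a weight vector."""
    return [row[j] for row in matrix]


def cosine_similarity(u, v):
    """Normalised inner product of two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical counts: 5 terms (rows) in 3 documents (columns).
counts = [
    [1, 1, 0],  # blood
    [0, 1, 2],  # flow
    [1, 0, 0],  # heart
    [1, 0, 1],  # pressure
    [0, 0, 1],  # wing
]
W = tfidf_weights(counts)
sim_01 = cosine_similarity(column(W, 0), column(W, 1))
sim_02 = cosine_similarity(column(W, 0), column(W, 2))
print(sim_01, sim_02)
```

Normalising the inner product by the vector lengths keeps long documents from dominating the comparison; an unnormalised inner product is the other common choice.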

LSI-PCA Model
The main objective is to uncover hidden linear relations between histograms by rotating the vector space basis.
Simplify by keeping only the k largest singular values.
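In standard LSI the rotation and truncation are written with the singular value decomposition of the term-document matrix. The symbols below (U, Λ, V and the subscript k) are assumptions, since the transcript does not preserve the slide's own notation:

```latex
X = U \Lambda V^{\top}, \qquad
X \approx X_k = U_k \Lambda_k V_k^{\top}, \qquad
\tilde{X} = U_k^{\top} X = \Lambda_k V_k^{\top}
```

Here U_k and V_k hold the first k left and right singular vectors, Λ_k is the diagonal matrix of the k largest singular values, and the columns of X̃ are the documents expressed in the k-dimensional LSI space.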

ICA: Noisy Separation
Model: X = AS + U
Assumptions:
-- I.i.d. sources
-- I.i.d. Gaussian noise with variance [symbol not preserved]
-- Source distribution: [formula not preserved]

ICA: Noisy Separation (cont.)
Known mixing parameters, e.g. A
-- Bayes formula: P(S|X) ∝ P(X|S)P(S)
-- Maximize it w.r.t. S
-- Solution: [formula not preserved]
-- For low noise level: [formula not preserved]
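The maximisation step can be written out under the stated Gaussian-noise assumption. This is a generic sketch rather than the slide's own formulas, which the transcript does not preserve. The noise variance symbol σ² and the full-column-rank condition on A are assumptions, and the source prior P(S) is left unspecified:

```latex
\hat{S} = \arg\max_{S} \left[ \log P(X \mid S) + \log P(S) \right],
\qquad
\log P(X \mid S) = -\frac{1}{2\sigma^{2}} \lVert X - AS \rVert_F^{2} + \text{const}

\sigma^{2} \to 0: \quad
\hat{S} \to (A^{\top} A)^{-1} A^{\top} X
\quad \left( = A^{-1} X \text{ when } A \text{ is square and invertible} \right)
```

As the noise level falls, the likelihood term dominates the prior, so the estimate approaches the least-squares solution, which is a linear function of the data.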

ICA (cont.)
Text representations on the LSI space
Document classification
Key words
-- Back projection of documents to the original vector histogram space
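One way to read the back-projection step: map each independent component from the LSI space back to term space and report the highest-scoring terms as its key words. The sketch below assumes that reading. The matrices `U_k` (terms by LSI dimensions) and `A` (LSI dimensions by components), their values, and the function names are all hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def top_keywords(terms, term_loadings, component, n=2):
    """Return the n terms with the largest loading on one component."""
    scored = [(row[component], term) for term, row in zip(terms, term_loadings)]
    scored.sort(reverse=True)
    return [term for _, term in scored[:n]]


terms = ["blood", "flow", "heart", "pressure", "wing"]

# Hypothetical LSI basis: 5 terms x 2 retained dimensions.
U_k = [
    [0.6, 0.0],
    [0.1, 0.7],
    [0.7, 0.0],
    [0.3, 0.3],
    [0.0, 0.6],
]
# Hypothetical ICA mixing matrix: 2 LSI dimensions x 2 components.
A = [
    [0.9, 0.1],
    [0.2, 0.8],
]

# Back projection: each column is one component expressed in term space.
term_loadings = matmul(U_k, A)
keywords_0 = top_keywords(terms, term_loadings, 0)
keywords_1 = top_keywords(terms, term_loadings, 1)
print(keywords_0, keywords_1)
```

In this toy setup the first component picks out "heart" and "blood" and the second picks out "flow" and "wing", which is the kind of per-group vocabulary the key words are meant to expose.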

ICA (cont.)
Generalisation error
-- Principal tool for model selection
Bias-variance dilemma:
-- Too few components lead to high error
-- Too many components lead to "overfit"

Examples
MED data set
-- 124 abstracts, 5 groups, 1159 terms
Results: ICA is successful in recognizing and "explaining" the group structure.

Examples
CRAN data set
-- 5 classes, 138 documents, 1115 terms
Results: ICA identified some group structure, but not as convincingly as in the MED data.

Conclusion
ICA recovers group structure in text, clearly on the MED data and less convincingly on the CRAN data.
Independence of the sources may or may not be well aligned with a manual labeling.
