Independent Components in Text


1 Independent Components in Text
Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson
Yuan Zhijian

2 Vector Space Representations
Indexing: form a term set of all words occurring in the database
-- Form term set
-- Documents
-- Term-document matrix

3 Vector Space Representations
Weighting: determine the values of the weights
Similarity measure: based on the inner product of weight vectors, or on other metrics
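The indexing, weighting and similarity steps can be sketched in a few lines. This is only an illustration under stated assumptions: whitespace tokenisation, raw term counts as the weights, and cosine similarity (a normalised inner product) as the metric. The function names and the three toy documents are hypothetical and do not come from the paper.

```python
from collections import Counter
import math


def build_term_document_matrix(docs):
    """Return (terms, matrix) with one row per term and one column per document."""
    # Assumption: lower-cased whitespace tokenisation, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Assumption: the weight of a term in a document is its raw count.
    matrix = [[count[term] for count in counts] for term in terms]
    return terms, matrix


def column(matrix, j):
    """Extract the weight vector (histogram) of document j."""
    return [row[j] for row in matrix]


def cosine_similarity(a, b):
    """Inner product of two weight vectors, normalised by their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Hypothetical toy documents, for illustration only.
docs = ["heart blood pressure", "blood pressure drug", "wing flow pressure"]
terms, matrix = build_term_document_matrix(docs)
sim_01 = cosine_similarity(column(matrix, 0), column(matrix, 1))
sim_02 = cosine_similarity(column(matrix, 0), column(matrix, 2))
```

Here the first two documents share two of their three terms, so they come out more similar to each other than the first and third, which share only one.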

4 LSI-PCA Model
The main objective is to uncover hidden linear relations between histograms by rotating the vector space basis.
Simplify by keeping only the k largest singular values.
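A minimal sketch of the truncation step, assuming the term-document matrix is decomposed with a singular value decomposition and only the k largest singular values are kept. The helper name `lsi_project` and the random count matrix are hypothetical stand-ins, not the paper's data or code.

```python
import numpy as np


def lsi_project(X, k):
    """Keep the k largest singular values of a term-document matrix X.

    Returns the rotated term basis U_k (terms x k), the documents expressed
    in that basis (k x documents), and the rank-k approximation of X.
    """
    # numpy returns singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    docs_k = s_k[:, None] * Vt_k      # document coordinates in the LSI space
    X_k = U_k @ docs_k                # rank-k reconstruction of X
    return U_k, docs_k, X_k


# Hypothetical data: 20 terms, 8 documents, random counts.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 8)).astype(float)
U_k, docs_k, X_k = lsi_project(X, k=3)
```

Because the columns of `U_k` are orthonormal, projecting the original histograms with `U_k.T @ X` gives the same coordinates as `docs_k`.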

5 ICA: Noisy Separation
Model: X = AS + U
Assumptions:
-- i.i.d. sources
-- i.i.d. Gaussian noise with variance σ²
-- Source distribution:

6 ICA: Noisy Separation (cont.)
Known mixing parameters, e.g. A
-- Bayes formula: P(S|X) ∝ P(X|S)P(S)
-- Maximizing it w.r.t. S
-- Solution:
-- For low noise level:
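The low-noise case can be illustrated numerically. As the noise variance shrinks, the likelihood P(X|S) dominates the posterior, so the maximiser approaches the least-squares estimate of S given a known A. The sketch below simulates X = AS + U and recovers S with a pseudo-inverse. The dimensions, the Laplace source distribution and the noise level are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 3 sources mixed into 5 observed features, 200 samples.
n_sources, n_features, n_samples = 3, 5, 200

A = rng.normal(size=(n_features, n_sources))      # known mixing matrix
S = rng.laplace(size=(n_sources, n_samples))      # assumed heavy-tailed i.i.d. sources
sigma = 1e-3                                      # assumed small noise standard deviation
U = sigma * rng.normal(size=(n_features, n_samples))  # i.i.d. Gaussian noise

X = A @ S + U                                     # noisy mixing model

# Low-noise limit: least-squares estimate of the sources given A.
S_hat = np.linalg.pinv(A) @ X
max_error = np.max(np.abs(S_hat - S))
```

With larger noise the prior P(S) matters and the pseudo-inverse estimate is no longer a good stand-in for the posterior maximiser.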

7 ICA (cont.)
Text representations on the LSI space
Document classification
Key words
-- Back projection of documents to the original vector histogram space

8 ICA (cont.)
Generalisation error
-- Principal tool for model selection
Bias-variance dilemma:
-- Too few components, leading to high error
-- Too many components, leading to "overfit"

9 Examples
MED data set
-- 124 abstracts, 5 groups, 1159 terms
Results: ICA is successful in recognizing and "explaining" the group structure.

10 Examples
CRAN data set
-- 5 classes, 138 documents, 1115 terms
Results: ICA identified some group structure, but not as convincingly as in the MED data.

11 Conclusion
ICA is quite fine.
Independence of the sources may or may not be well aligned with a manual labeling.

