
1 Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University). WWW 2008. NLG Seminar, 2008/12/31. Reporter: Kai-Jie Ko

2 Motivation: Many classification tasks that work with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy because the data are sparse.

3 Previous work to overcome data sparseness: Employ search engines to expand and enrich the context of the data.

4 Previous work to overcome data sparseness: Employ search engines to expand and enrich the context of the data. Time consuming!

5 Previous work to overcome data sparseness: Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources.

6 Previous work to overcome data sparseness: Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources. These approaches use only the user-defined categories and concepts in those repositories, which is not general enough.

7 General framework

8 (a) Choose a universal dataset. It must be large and rich enough to cover the words and concepts related to the classification problem. Wikipedia & MEDLINE are chosen in this paper.

9 (a) Choose a universal dataset. Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4: ◦240MB ◦71,968 documents ◦882,376 paragraphs ◦60,649 vocabulary terms ◦30,492,305 words
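The slide only names the crawl parameters; a minimal depth-limited crawl might look like the sketch below. It assumes the requests and beautifulsoup4 packages, and the seed URLs are hypothetical stand-ins for the paper's topic-oriented keywords; this is not the authors' actual crawler.

```python
# Minimal sketch of a depth-limited Wikipedia crawl (illustrative only).
# Seed pages are hypothetical examples of topic-oriented keywords.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEEDS = ["https://en.wikipedia.org/wiki/Business",
         "https://en.wikipedia.org/wiki/Computer"]
MAX_DEPTH = 4  # maximum hyperlink depth used on the slide

def crawl(seeds, max_depth):
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages[url] = html                      # keep raw page for later cleaning
        if depth == max_depth:
            continue
        # Follow internal /wiki/ links only; a real crawl would also filter
        # special pages and cap the total number of documents.
        for a in BeautifulSoup(html, "html.parser").select("a[href^='/wiki/']"):
            nxt = urljoin(url, a["href"])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return pages
```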

10 (a) Choose a universal dataset. Ohsumed: a test collection of medical journal abstracts to assist IR research ◦156MB ◦233,442 abstracts

11 (b) Doing topic analysis for the universal dataset

12 (b) Doing topic analysis for the universal dataset. Using GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling. The number of topics ranges from 10, 20, ... to 100, 150, and 200. The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively.
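As a rough illustration of this step, the sketch below estimates an LDA model with the same hyperparameters using gensim's LdaModel (online variational Bayes) as a stand-in for GibbsLDA++; tokenized_docs is an assumed list of preprocessed token lists from the universal dataset.

```python
# Sketch of the topic-estimation step, using gensim instead of GibbsLDA++.
# tokenized_docs is assumed: one list of tokens per paragraph/document.
from gensim import corpora, models

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Hyperparameters follow the slide: alpha = 0.5, beta (eta in gensim) = 0.1.
lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=200,      # the paper tries 10 up to 200 topics
                      alpha=0.5, eta=0.1,
                      passes=10, random_state=0)

print(lda.print_topics(num_topics=5, num_words=10))  # inspect a few topics
```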

13 Hidden topics analysis for Wikipedia data

14 Hidden topics analysis for the Ohsumed-MEDLINE data

15 (c) Building a moderate-size labeled training dataset. Words/terms in this dataset should be relevant to as many hidden topics as possible.

16 (d) Doing topic inference for training and future data. The goal is to transform the original data into a set of hidden topics.
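A hedged sketch of what inference and integration could look like, reusing the lda model and dictionary from the previous sketch; the probability threshold and the topic_<id> pseudo-word scheme are illustrative choices, not the paper's exact settings.

```python
# Sketch of topic inference and integration for a short text.
# Assumes `lda` and `dictionary` from the estimation sketch above.
def topics_for(text, threshold=0.05):
    bow = dictionary.doc2bow(text.lower().split())
    # get_document_topics returns (topic_id, probability) pairs
    return [t for t, p in lda.get_document_topics(bow) if p >= threshold]

def integrate(text):
    # Append pseudo-words "topic_<id>" so a classifier sees both the
    # original words and the inferred hidden topics.
    return text.lower().split() + [f"topic_{t}" for t in topics_for(text)]

print(integrate("uefa champions league final manchester united"))
```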

17 Sample Google search snippets

18 Snippet word co-occurrence. This shows the sparseness of web snippets: only a small fraction of words are shared by two or three different snippets.

19 Shared topics among snippets after inference. After doing inference and integration, the snippets are more closely related at the semantic level.

20 (e) Building the classifier ◦Choose from different learning methods ◦Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique ◦Train the classifier on the integrated training data
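To make step (e) concrete, the sketch below trains scikit-learn's LogisticRegression (a maximum-entropy style classifier) on the topic-integrated texts; train_texts and train_labels are assumed to hold the labeled snippets and their domain labels, and the bag-of-words encoding is just one possible representation.

```python
# Sketch of step (e): train a classifier on topic-integrated texts.
# Assumes integrate() from the inference sketch, plus train_texts / train_labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [" ".join(integrate(t)) for t in train_texts]   # words + topic pseudo-words
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(docs, train_labels)

# At prediction time, apply the same topic inference before classifying.
print(clf.predict([" ".join(integrate("cheap flights and hotel deals"))]))
```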

21 Evaluation: Domain disambiguation for Web search results ◦To classify Google search snippets into different domains, such as Business, Computers, Health, etc. Disease classification for medical abstracts ◦To classify each MEDLINE medical abstract into one of five disease categories, related to neoplasms, the digestive system, etc.

22 Domain disambiguation for Web search results: Google snippets are collected as training and test data; the search phrases used for the two sets are completely disjoint.

23 Domain disambiguation for Web search results: results of 5-fold cross-validation on the training data. The error is reduced by 19% on average.

24 Domain disambiguation for Web search results

25 Domain disambiguation for Web search results

26 Disease Classification for Medical Abstracts with MEDLINE Topics. The proposed method requires only 4,500 training examples to reach the accuracy of the baseline, which uses 22,500 training examples!

27 Conclusion. Advantages of the proposed framework: ◦A good method for classifying sparse and previously unseen data, by utilizing the large universal dataset ◦Expanded classifier coverage: topics coming from the external data cover many terms/words that do not exist in the training dataset ◦Easy to implement: only a small set of labeled training examples needs to be prepared to attain high accuracy

