Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections


Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections
Xuan-Hieu Phan (GSIS, Tohoku University), Le-Minh Nguyen (GSIS, JAIST), Susumu Horiguchi (GSIS, Tohoku University)
WWW 2008
NLG Seminar, 2008/12/31
Reporter: Kai-Jie Ko

Motivation
Many classification tasks that work with short segments of text and Web content, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness.

Previous works to overcome data sparseness
Employ search engines to expand and enrich the context of the data.
Drawback: time consuming.

Previous works to overcome data sparseness
Utilize online data repositories, such as Wikipedia or the Open Directory Project, as external knowledge sources.
Drawback: these approaches use only the user-defined categories and concepts in those repositories, which is not general enough.

General framework

(a) Choose a universal dataset
The dataset must be large and rich enough to cover the words and concepts that are related to the classification problem. Wikipedia and MEDLINE are chosen in this paper.

Wikipedia: topic-oriented keywords are used to crawl Wikipedia with a maximum hyperlink depth of 4.
◦ 240 MB
◦ 71,968 documents
◦ 882,376 paragraphs
◦ 60,649 vocabulary entries
◦ 30,492,305 words

Ohsumed: a test collection of medical journal abstracts to assist IR research.
◦ 156 MB
◦ 233,442 abstracts
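
The depth-limited crawl described above can be pictured as a breadth-first traversal that stops following links once a page sits at the maximum depth. The sketch below is only an illustration under stated assumptions: the link graph is a hypothetical in-memory dictionary rather than live Wikipedia pages, and the function name `crawl` is made up for this example.

```python
from collections import deque


def crawl(link_graph, seeds, max_depth=4):
    """Breadth-first traversal that does not follow links past max_depth.

    link_graph is a hypothetical dict mapping a page to the pages it links to.
    Returns a dict mapping each collected page to its hyperlink depth.
    """
    depth = {page: 0 for page in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for target in link_graph.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical toy link graph; the page names are illustrative only.
toy_graph = {
    "Health": ["Medicine", "Nutrition"],
    "Medicine": ["Surgery"],
    "Surgery": ["Anesthesia"],
    "Anesthesia": ["Pharmacology"],
    "Pharmacology": ["Chemistry"],
}
pages = crawl(toy_graph, ["Health"], max_depth=4)
```

In this toy graph, "Chemistry" would sit at depth 5, so it is never collected.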

(b) Do topic analysis for the universal dataset
Topic analysis uses GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling. The number of topics ranges from 10 to 100, and then 150 and 200. The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively.
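
To make the role of alpha and beta concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python. It is a sketch of the general technique, not GibbsLDA++ code: the function name `gibbs_lda`, the toy corpus, and the iteration count are assumptions made for illustration, and only the default values alpha = 0.5 and beta = 0.1 come from the slide.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # count of topic k in doc d
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # words assigned to topic k
    assign = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional for each topic, up to a constant factor.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of document-topic (theta) and topic-word (phi) distributions.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus with a 6-word vocabulary (word ids 0..5).
toy_docs = [[0, 1, 2, 0], [1, 2, 0, 1], [3, 4, 5, 3], [4, 5, 3, 4]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is a topic distribution for one document and each row of `phi` is a word distribution for one topic, so every row sums to 1.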

Hidden topics analysis for Wikipedia data

Hidden topics analysis for the Ohsumed-MEDLINE data

(c) Build a moderate-size labeled training dataset
Words/terms in this dataset should be relevant to as many hidden topics as possible.

(d) Do topic inference for training and future data
Inference transforms the original data into a set of topics.
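
One simple way to picture this step: keep the topic-word distributions learned from the universal dataset fixed, and sample topic assignments only for the words of the new snippet. The sketch below is a simplification under stated assumptions: the function name `infer_topics` and the hand-written matrix `toy_phi` are hypothetical, and holding the topic-word probabilities fixed is a shortcut chosen for illustration, not necessarily how the inference mode of GibbsLDA++ updates its counts.

```python
import random


def infer_topics(doc, phi, alpha=0.5, iters=100, seed=0):
    """Estimate the topic distribution of one new document.

    doc is a list of word ids; phi[k][w] is the fixed probability of
    word w under topic k (every entry is assumed to be positive).
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    counts = [0] * num_topics
    z = []
    for _ in doc:
        k = rng.randrange(num_topics)
        z.append(k)
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            # Only the document-side counts change; phi stays fixed.
            weights = [(counts[k] + alpha) * phi[k][w] for k in range(num_topics)]
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[z[i]] += 1
    return [(counts[k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)]


# Hypothetical topic-word matrix: 2 topics over a 4-word vocabulary.
toy_phi = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
snippet_theta = infer_topics([0, 1, 0], toy_phi)
```

The result is a short dense vector, one probability per topic, which is what the later steps attach to the sparse snippet.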

Sample Google search snippets

Snippet word co-occurrence
This shows the sparseness of Web snippets: only a small fraction of words is shared by two or three different snippets.

Shared topics among snippets after inference
After inference and integration, the snippets are more closely related semantically.
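
A small overlap measurement illustrates the contrast between the two slides above. The snippets and the topic token `topic:7` below are invented for illustration; the point is only that two snippets with no words in common gain a shared feature once both are tagged with the same hidden topic.

```python
def jaccard(a, b):
    """Jaccard overlap between two token lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical snippets on a related subject that share no words.
snippet_1 = ["online", "stock", "trading", "broker"]
snippet_2 = ["investment", "bank", "shares", "market"]

overlap_words_only = jaccard(snippet_1, snippet_2)

# Assume topic inference assigned both snippets the same hidden topic.
overlap_with_topics = jaccard(snippet_1 + ["topic:7"], snippet_2 + ["topic:7"])
```

Here the word-only overlap is zero, while the overlap with topic tokens is one shared feature out of nine distinct ones.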

(e) Build the classifier
◦ Choose from different learning methods
◦ Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique
◦ Train the classifier on the integrated training data
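
For a learner that takes bag-of-words input, one way to integrate hidden topics is to append topic tokens to the snippet's words, repeating each token in proportion to the topic's probability. The sketch below assumes this representation; the cut-off and scale values, the `topic:k` token format, and the function name `add_topic_features` are illustrative assumptions and are not taken from the slides.

```python
def add_topic_features(words, theta, cutoff=0.05, scale=8):
    """Append topic tokens to a bag of words.

    theta[k] is the inferred probability of topic k for this snippet.
    Topics below the cut-off are ignored; the others are added as
    "topic:k" tokens repeated round(theta[k] * scale) times.
    """
    features = list(words)
    for k, p in enumerate(theta):
        if p >= cutoff:
            features.extend([f"topic:{k}"] * round(p * scale))
    return features


# Hypothetical snippet and topic distribution.
snippet_words = ["stock", "market", "news"]
snippet_topics = [0.75, 0.25, 0.0]
features = add_topic_features(snippet_words, snippet_topics)
```

The original words are kept, so the classifier sees both the sparse surface words and the denser topic evidence.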

Evaluation
◦ Domain disambiguation for Web search results: classify Google search snippets into different domains, such as Business, Computers, Health, etc.
◦ Disease classification for medical abstracts: classify each MEDLINE medical abstract into one of five disease categories that are related to neoplasms, digestive system, etc.

Domain disambiguation for Web search results
Google snippets are collected as training and test data; the search phrases used for the two sets are totally exclusive.

Domain disambiguation for Web search results
Results of 5-fold cross-validation on the training data: the error is reduced by 19% on average.
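
The evaluation procedure can be sketched as follows, assuming that "reduced by 19%" refers to relative error reduction, which is one common reading of such a figure. The helper names and the accuracy values in the example are hypothetical and are not results from the paper.

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline error removed by the new method."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


splits = list(k_fold_indices(10, k=5))

# Hypothetical accuracies, chosen only to show the arithmetic.
example_reduction = relative_error_reduction(0.80, 0.90)
```

With these made-up accuracies, the error drops from 0.20 to 0.10, a relative reduction of one half.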


Disease classification for medical abstracts with MEDLINE topics
The proposed method requires only 4,500 training examples to reach the accuracy of the baseline, which uses more training data.

Conclusion
Advantages of the proposed framework:
◦ A good method for classifying sparse and previously unseen data: it utilizes the large universal dataset
◦ Expanded coverage of the classifier: topics coming from external data cover many terms/words that do not exist in the training dataset
◦ Easy to implement: only a small set of labeled training examples has to be prepared to attain high accuracy