Concept-based Short Text Classification and Ranking

Presentation transcript:

Concept-based Short Text Classification and Ranking
Date: 2015/05/21
Author: Fang Wang, Zhongyuan Wang, Zhoujun Li, Ji-Rong Wen
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE

Outline Introduction Method Experiment Conclusion

Introduction
Most existing approaches to text classification represent texts as vectors of words, the so-called "Bag-of-Words" representation.
This representation results in a very high-dimensional feature space and frequently suffers from surface mismatching.

Introduction
Goal: use a "Bag-of-Concepts" representation for short texts, aiming to avoid surface mismatching and to handle the synonymy and polysemy problems.
Example (figure): the words "Jeep" and "Honda" in the bag of words both map to the concept "Car" in the bag of concepts.

Introduction
Goal: short text classification based on "Bag-of-Concepts".
Example (figure): short texts such as "Beyonce named People's most beautiful woman" and "Lady Gaga Responds to Concert Band" are classified, with Music shown as the target class.

Outline Introduction Method Experiment Conclusion

Framework

Entity Recognition
Documents are first split into sentences.
All instances in Probase are used as the matching dictionary for detecting entities in each sentence.
Stemming is performed to assist the matching process.
Extracted entities are merged together and weighted by idf based on the different classes.
Example: "Beyonce named People's most beautiful woman" gives Set = {beyonce}, Idf(Beyonce) = 2.
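The class-level idf weighting can be illustrated with a small sketch. The slide does not spell out the idf formula or the log base, so the usual log(N / df) over classes is assumed here; the function name `class_level_idf` and the toy entity lists are hypothetical.

```python
import math
from collections import defaultdict


def class_level_idf(entities_by_class):
    """Assumed formula: idf(e) = log(N / df(e)), where N is the number of
    classes and df(e) is the number of classes in which entity e occurs."""
    n_classes = len(entities_by_class)
    class_freq = defaultdict(int)
    for entities in entities_by_class.values():
        for entity in set(entities):  # count each class at most once
            class_freq[entity] += 1
    return {e: math.log(n_classes / df) for e, df in class_freq.items()}


# Toy data (hypothetical): entities already extracted per class.
entities_by_class = {
    "Music": ["beyonce", "lady gaga", "concert"],
    "Movie": ["trailer", "concert"],
    "Money": ["stock"],
    "TV": ["episode", "concert"],
}
idf = class_level_idf(entities_by_class)
```

Under this assumption an entity seen in only one class ("beyonce") receives a higher weight than one spread over several classes ("concert").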

Candidates Generation
Given an entity $e_j$, select its top $N_t$ concepts ranked by typicality $P(c \mid e)$.
Merge all the typical concepts into the primary candidate set.
Compute the idf value of each concept at the class level.
Remove stop concepts, which tend to be too general to represent a class.
Pipeline (figure): $e_j$ → top concepts $c_1, c_2, \ldots, c_{20}$ → merge over all $e_j$ into $c_1, c_2, \ldots, c_n$ → compute idf → remove stop concepts.
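A minimal sketch of the candidate generation steps, assuming typicality scores are available as a nested dictionary and that the stop concepts have already been identified (the slide derives them from class-level idf, which is not reproduced here). All names and numbers are hypothetical.

```python
def generate_candidates(entities, typicality, n_t, stop_concepts):
    """Select the top n_t concepts per entity by P(c|e), merge them,
    and drop concepts that are too general to represent a class."""
    candidates = set()
    for entity in entities:
        concepts = typicality.get(entity, {})
        top = sorted(concepts, key=concepts.get, reverse=True)[:n_t]
        candidates.update(top)  # merge step: union over all entities
    return candidates - set(stop_concepts)


# Toy typicality table P(c|e); values are illustrative only.
typicality = {
    "jeep": {"car": 0.6, "brand": 0.3, "thing": 0.1},
    "honda": {"car": 0.5, "company": 0.4, "thing": 0.1},
}
candidates = generate_candidates(
    ["jeep", "honda"], typicality, n_t=3, stop_concepts={"thing"}
)
```

Here "thing" survives the top-3 cut for both entities but is removed as a stop concept, leaving car, brand, and company.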

Concept Weighting
The top $N_t$ concepts still contain noise. For example, given the entity "python" in the class Technique, the mapping method yields a top-$N_t$ concept list that includes "animal".
The candidates are therefore weighted to measure their representative strength for each class.

Typicality
Use a probabilistic measure of the Is-A relation between an instance e and a concept c, for example "penguin is-a bird".
Probase is taken as the knowledge base in this paper; terms in Probase are connected by a variety of relationships.
Probase record format:
<concept>\t<entity>\t<frequency>\t<popularity>\t<ConceptFrequency>\t<ConceptSize>\t<ConceptVagueness>\t<Zipf_Slope>\t<Zipf_Pearson_Coefficient>\t<EntityFrequency>\t<EntitySize>

Typicality
Example: penguin is-a bird.
$n(e, c)$ denotes the co-occurrence frequency of $e$ and $c$; $n(e)$ is the frequency of $e$.
Probase fields used: <concept>\t<entity>\t<frequency>\t<EntityFrequency>, for example <bird>\t<penguin>\t<50>\t<100>.
$P(\text{bird} \mid \text{penguin}) = \frac{n(\text{penguin}, \text{bird})}{n(\text{penguin})}$
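The typicality ratio can be computed directly from such a record. The sketch below assumes the simplified four-field row shown on the slide (concept, entity, frequency, EntityFrequency) with the angle brackets dropped; the function name is hypothetical.

```python
def typicality_from_row(row):
    """Parse a simplified tab-separated row and return (concept, entity, P(c|e)),
    where P(c|e) = n(e, c) / n(e)."""
    concept, entity, freq, entity_freq = row.rstrip("\n").split("\t")
    return concept, entity, int(freq) / int(entity_freq)


# The slide's example row: n(penguin, bird) = 50, n(penguin) = 100.
concept, entity, p = typicality_from_row("bird\tpenguin\t50\t100")
```

With the slide's numbers this gives P(bird | penguin) = 50 / 100 = 0.5.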

Framework

Short Text Conceptualization
Short text conceptualization aims to abstract a set of the most representative concepts that best describe the short text.
Example (figure): "apple ipad" → ?

Short Text Conceptualization
Entity Detection: detect all possible entities and then remove those contained in others. Given the short text "windows phone app", the recognized entity set is {"windows phone", "phone app"}, while "windows", "phone", and "app" are removed. This yields the entity list $E_{st_i} = \{e_j,\ j = 1, 2, \ldots, M\}$ for a short text $st_i$.
Sense Detection: detect the different senses of each entity in $E_{st_i}$, so as to determine whether the entity is ambiguous.
Disambiguation: disambiguate each vague entity by leveraging its unambiguous context entities.
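The entity detection step (match everything, then drop entities contained in longer ones) can be sketched as follows. The dictionary here is a toy stand-in for the Probase instance list, and `detect_entities` and `max_len` are hypothetical names; stemming is omitted.

```python
def detect_entities(text, dictionary, max_len=3):
    """Find all dictionary phrases in the text, then keep only those
    whose token span is not contained in another matched span."""
    tokens = text.lower().split()
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if " ".join(tokens[start:end]) in dictionary:
                spans.append((start, end))
    kept = [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]
    return [" ".join(tokens[a:b]) for a, b in kept]


dictionary = {"windows", "phone", "app", "windows phone", "phone app"}
entities = detect_entities("windows phone app", dictionary)
```

For the slide's example this keeps "windows phone" and "phone app" and discards the three single-word matches.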

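Sense Detection
Denote $C_{e_j} = \{c_k,\ k = 1, 2, \ldots, N_t\}$ as $e_j$'s typical concept list.
Denote $CCl_{e_j} = \{ccl_m,\ m = 1, 2, \ldots\}$ as $e_j$'s concept cluster set.
Figure: for $e_j$ = Beyonce, the typical concepts $c_k$ are singer, lyricist, model, and fashion designer; the concept clusters $ccl_m$ are performing arts and design.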

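Sense Detection
The higher the entropy, the more ambiguous the meaning of $e_j$; the lower the entropy, the clearer the meaning of $e_j$.
Figure: for $e_j$ = Beyonce, the typical concepts singer, lyricist, and model each have probability 0.3 and belong to the cluster performing arts, so $P(\text{performing arts} \mid \text{Beyonce}) = 0.3 + 0.3 + 0.3$; fashion designer has probability 0.1 and belongs to the cluster design.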
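A small sketch of the entropy computation over concept clusters, assuming Shannon entropy with base 2 (the slide does not state the base) and using the probabilities as read from the figure. The function names are hypothetical, and any threshold for calling an entity "vague" would be a further assumption.

```python
import math


def cluster_distribution(typicality, cluster_of):
    """Sum P(c|e) over the concepts of each cluster to get P(ccl|e)."""
    dist = {}
    for concept, p in typicality.items():
        cluster = cluster_of[concept]
        dist[cluster] = dist.get(cluster, 0.0) + p
    return dist


def entropy(dist):
    """Shannon entropy (base 2, an assumption) of a cluster distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Values as read from the slide's Beyonce figure.
typicality = {"singer": 0.3, "lyricist": 0.3, "model": 0.3, "fashion designer": 0.1}
cluster_of = {
    "singer": "performing arts",
    "lyricist": "performing arts",
    "model": "performing arts",
    "fashion designer": "design",
}
dist = cluster_distribution(typicality, cluster_of)
h = entropy(dist)
```

With a 0.9 / 0.1 split the entropy is low (about 0.47 bits), whereas an even 0.5 / 0.5 split would give the maximum of 1 bit for two clusters.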

Disambiguation
Denote a vague entity as $e_i^v$ and an unambiguous entity as $e_j^u$.

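Disambiguation
Worked example for the short text "Beyonce music and songs", where Beyonce is the vague entity and "music" and "songs" are unambiguous context entities.
Concept clusters: $ccl_m$ = {design, performing arts}, $ccl_n$ = {musicology}.
Figure labels: P(performing arts | Beyonce) = 0.5, P(design | Beyonce) = 0.5, P(musicology | music) = 1, P(musicology | songs) = 1.
$P'(\text{performing arts} \mid \text{Beyonce}) = P(\text{performing arts} \mid \text{Beyonce}) \cdot P(\text{musicology} \mid \text{music}) \cdot CS(\text{performing arts}, \text{musicology}) + P(\text{performing arts} \mid \text{Beyonce}) \cdot P(\text{musicology} \mid \text{songs}) \cdot CS(\text{performing arts}, \text{musicology}) = 0.2 \cdot 0.9 \cdot 0.9 + 0.2 \cdot 0.9 \cdot 0.9 = 0.324$
$P'(\text{design} \mid \text{Beyonce}) = P(\text{design} \mid \text{Beyonce}) \cdot P(\text{musicology} \mid \text{music}) \cdot CS(\text{design}, \text{musicology}) + P(\text{design} \mid \text{Beyonce}) \cdot P(\text{musicology} \mid \text{songs}) \cdot CS(\text{design}, \text{musicology}) = 0.2 \cdot 0.9 \cdot 0.1 + 0.2 \cdot 0.9 \cdot 0.1 = 0.036$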

Disambiguation
Result for the short text "Beyonce music and songs":
Before disambiguation: P(performing arts | Beyonce) = 0.5, P(design | Beyonce) = 0.5.
After disambiguation: $P'(\text{performing arts} \mid \text{Beyonce}) = 0.324$, $P'(\text{design} \mid \text{Beyonce}) = 0.036$.
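The rescoring in the worked example can be reproduced with a short sketch. It uses the numbers that appear in the slide's numeric substitution (0.2, 0.9, and cluster similarities 0.9 and 0.1) and assumes, for simplicity, that each unambiguous context entity contributes a single dominant cluster. The function name `rescore` and the data layout are hypothetical.

```python
def rescore(p_vague, context, cs):
    """P'(ccl_m | e_v) = sum over context entities of
    P(ccl_m | e_v) * P(ccl_n | e_u) * CS(ccl_m, ccl_n).

    p_vague: cluster -> P(cluster | vague entity)
    context: list of (cluster, P(cluster | unambiguous entity))
    cs:      (cluster_m, cluster_n) -> concept cluster similarity
    """
    return {
        m: sum(p_m * p_n * cs[(m, n)] for n, p_n in context)
        for m, p_m in p_vague.items()
    }


# Numbers taken from the slide's substitution step.
p_vague = {"performing arts": 0.2, "design": 0.2}
context = [("musicology", 0.9), ("musicology", 0.9)]  # "music", "songs"
cs = {("performing arts", "musicology"): 0.9, ("design", "musicology"): 0.1}
scores = rescore(p_vague, context, cs)
```

The performing arts sense ends up with a score of about 0.324 against about 0.036 for design, so the context entities pull Beyonce toward the performing arts cluster.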

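Disambiguation
$CS(ccl_m, ccl_n)$ denotes the concept cluster similarity.
Figure: concepts such as ethnomusicology, systematic musicology, and historical musicology (cluster: musicology) and folk singer and country singer (cluster: performing arts), shown together with their entity sets $e_i, e_{i+1}, \ldots, e_k$ and $e_j, e_{j+1}, \ldots, e_l$.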

Framework

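Classification
Classify the short text $st_i$ into the class $CL_l$ that is most similar to $st_i$.
$st_i$'s concept expression is $C_{st_i} = \{C_j,\ j = 1, 2, \ldots, M\}$.
Figure: for "Beyonce music and songs", $C_{st_i}$ = {performing arts, musicology}, which is compared with $CM_l$, the concept representation of class $CL_l$ (concepts C1, C2, C3, ...).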
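One way to picture the "most similar class" decision is a similarity between weighted concept vectors. The slide does not name the similarity measure, so cosine similarity is an assumption here, and the class models and weights are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept-weight dictionaries."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(text_concepts, class_models):
    """Return the class whose concept model is most similar to the text."""
    scores = {label: cosine(text_concepts, model)
              for label, model in class_models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical class concept models and a conceptualized short text.
class_models = {
    "Music": {"performing arts": 0.7, "musicology": 0.6},
    "Money": {"finance": 0.9, "company": 0.4},
}
label, score = classify({"performing arts": 0.324, "musicology": 0.9}, class_models)
```

The returned score can be kept alongside the label, since the ranking step reuses it.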

Ranking
Ranking by Similarity: each short text $st_i$ assigned to $CL_l$ has a similarity score, so the short texts can be ranked directly by their scores.
Ranking with Diversity: diversify the short texts by subtopic proportionality (PM-2) [12].
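The two ranking modes can be contrasted in a sketch. The diversity part follows PM-2 as it is commonly described (a Sainte-Laguë style quotient per subtopic, then a document choice that favours the selected subtopic); it is an illustration under that reading, not the authors' implementation, and all names, weights, and the default `lam=0.5` are assumptions.

```python
def rank_by_similarity(scores):
    """Rank items by their similarity score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


def pm2_rank(docs, topic_weights, rel, k, lam=0.5):
    """docs: item ids; topic_weights: subtopic -> desired share;
    rel: item -> {subtopic: relevance}; k: number of items to return."""
    seats = {t: 0.0 for t in topic_weights}
    remaining, ranked = list(docs), []
    while remaining and len(ranked) < k:
        # Quotient: subtopics that already hold many "seats" are down-weighted.
        quotient = {t: w / (2 * seats[t] + 1) for t, w in topic_weights.items()}
        best = max(quotient, key=quotient.get)

        def score(d):
            own = quotient[best] * rel[d].get(best, 0.0)
            others = sum(quotient[t] * rel[d].get(t, 0.0)
                         for t in topic_weights if t != best)
            return lam * own + (1 - lam) * others

        chosen = max(remaining, key=score)
        ranked.append(chosen)
        remaining.remove(chosen)
        # Credit the chosen item's seat to subtopics in proportion to relevance.
        total = sum(rel[chosen].get(t, 0.0) for t in topic_weights)
        if total > 0:
            for t in topic_weights:
                seats[t] += rel[chosen].get(t, 0.0) / total
    return ranked


# Toy queries: "a" and "b" cover subtopic x, "c" covers subtopic y.
rel = {"a": {"x": 0.9}, "b": {"x": 0.8}, "c": {"y": 0.7}}
by_similarity = rank_by_similarity({"a": 0.9, "b": 0.8, "c": 0.7})
by_diversity = pm2_rank(["a", "b", "c"], {"x": 0.5, "y": 0.5}, rel, k=3)
```

In this toy case the similarity ranking places both subtopic-x queries first, while the diversified ranking promotes "c" to second place so that both subtopics appear near the top.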

Outline Introduction Method Experiment Conclusion

Experiment
Evaluate the performance of BocSTC (Bag-of-Concepts Short Text Classification) on a real application: channel-based query recommendation.
Figure: query recommendation for the channel Living.

Experiment
Four commonly used channels are selected as target channels: Money, Movie, Music, and TV.
Training dataset: 6,000 documents are randomly selected for each channel. The titles are used as training data for BocSTC.

Experiment
Test dataset: 841 labeled queries, from which 200 are randomly selected for verification and 600 for testing.

Experiment Performance on query classification

Experiment
Precision performance on each channel

Experiment
Diversity performance on each channel: the top 20 queries are manually annotated according to three guideline labels: Unrelated, Related but Uninteresting, and Related and Interesting.

Outline Introduction Method Experiment Conclusion

Conclusion
A novel framework is proposed for short text classification and ranking applications.
It measures the semantic similarity between short texts from the angle of concepts, so as to avoid surface mismatch.

Thanks for listening.