Short Text Understanding Through Lexical-Semantic Analysis


Short Text Understanding Through Lexical-Semantic Analysis
Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou
ICDE 2015
21 April 2015
Hyewon Lim

Outline
- Introduction
- Problem Statement
- Methodology
- Experiment
- Conclusion

Introduction
Characteristics of short texts
- Do not always observe the syntax of a written language
  - Traditional NLP techniques cannot always be applied
- Have limited context
  - Most search queries contain fewer than 5 words
  - Tweets are limited to 140 characters
  - Do not possess sufficient signals to support statistical text processing techniques

Introduction
Challenges of short text understanding
- Segmentation ambiguity
  - Incorrect segmentation of short texts leads to incorrect semantic similarity
  - "april in paris lyrics" vs. "vacation april in paris"
    - Candidate segmentations: {april paris lyrics}, {april in paris lyrics}; {vacation april paris}, {vacation april in paris}
  - "book hotel california" vs. "hotel california eagles"

Introduction
- Type ambiguity
  - Traditional approaches to POS tagging consider only lexical features
  - Surface features are insufficient to determine types of terms in short texts
  - "pink songs" (pink: instance) vs. "pink shoes" (pink: adjective)
  - "watch free movie" (watch: verb) vs. "watch omega" (watch: concept)

Introduction
- Entity ambiguity
  - "watch harry potter" vs. "read harry potter"
  - "hotel california eagles" vs. "jaguar cars"

Problem Statement
Problem definition
- Does the query “book Disneyland hotel california” mean that the user is searching for hotels close to Disneyland Theme Park in California?
- 1) Detect all candidate terms: {“book”, “disneyland”, “hotel california”, “hotel”, “california”}
- 2) Two possible segmentations: {book, disneyland, hotel, california} and {book, disneyland, hotel california}
- Typed terms: book[v] disneyland[e] hotel[c] california[e]
- “Disneyland” has multiple senses: theme park and company
- Result: book[v] disneyland[e](park) hotel[c] california[e](state)

Problem Statement
Short text understanding = semantic labeling
- Text segmentation
  - Divide text into a sequence of terms in the vocabulary
- Type detection
  - Determine the best type of each term
- Concept labeling
  - Infer the best concept of each entity within context

Problem Statement
Framework

Methodology
Online inference
- Text segmentation
  - How to obtain a coherent segmentation from the set of candidate terms?
  - Mutual exclusion
  - Mutual reinforcement
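The two constraints above can be illustrated with a small sketch. It is not the method from the talk (the conclusion mentions a randomized approximation algorithm); it simply enumerates every segmentation exhaustively. Mutual exclusion is enforced by assigning each token to exactly one term, and mutual reinforcement is approximated by the average pairwise affinity of the chosen terms. The function names, the vocabulary, and all affinity values are hypothetical.

```python
from itertools import combinations


def segmentations(tokens, vocabulary):
    """Yield every split of tokens into consecutive vocabulary terms.

    Each token ends up in exactly one term, so overlapping candidates
    such as "hotel" and "hotel california" never co-occur (mutual exclusion).
    """
    if not tokens:
        yield []
        return
    for end in range(1, len(tokens) + 1):
        term = " ".join(tokens[:end])
        if term in vocabulary:
            for rest in segmentations(tokens[end:], vocabulary):
                yield [term] + rest


def coherence(terms, affinity):
    """Average pairwise affinity of the terms (a stand-in for mutual reinforcement)."""
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    return sum(affinity.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)


def best_segmentation(text, vocabulary, affinity):
    """Return the most coherent segmentation (raises ValueError if none exists)."""
    tokens = text.lower().split()
    return max(segmentations(tokens, vocabulary), key=lambda s: coherence(s, affinity))


# Hypothetical vocabulary and affinity scores, for illustration only.
VOCAB = {"book", "disneyland", "hotel california", "hotel", "california"}
AFFINITY = {
    frozenset({"book", "hotel"}): 0.6,
    frozenset({"disneyland", "hotel"}): 0.6,
    frozenset({"disneyland", "california"}): 0.7,
    frozenset({"hotel", "california"}): 0.5,
    frozenset({"book", "hotel california"}): 0.2,
    frozenset({"disneyland", "hotel california"}): 0.1,
}

print(best_segmentation("book Disneyland hotel california", VOCAB, AFFINITY))
```

Exhaustive enumeration is only practical because short texts contain a handful of tokens; an approximate search would be needed for a larger candidate set.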

Methodology
Online inference (cont.)
- Type detection
  - Chain Model
    - Consider relatedness between consecutive terms
    - Maximize total score of consecutive terms
  - Pairwise Model
    - Most related terms might not always be adjacent
    - Find the best type for each term so that the Maximum Spanning Tree of the resulting sub-graph between typed-terms has the largest weight
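As a rough illustration of the Chain Model idea, the sketch below picks one type per term so that the summed score of consecutive typed-terms is maximal, using Viterbi-style dynamic programming. The slides do not state how the maximization is carried out, so the algorithm choice, the function name `chain_model`, and the scores are assumptions. The Pairwise Model is not covered here.

```python
def chain_model(terms, candidate_types, pair_score):
    """Choose one type per term, maximizing the total score of consecutive typed-terms.

    terms: non-empty list of terms in text order.
    candidate_types: dict mapping each term to its list of possible types.
    pair_score: dict mapping ((term, type), (next_term, next_type)) to a score;
                missing pairs count as 0.0.
    """
    # best[t] = (score of the best typing ending with the current term typed t, types so far)
    best = {t: (0.0, [t]) for t in candidate_types[terms[0]]}
    for prev, cur in zip(terms, terms[1:]):
        step = {}
        for cur_type in candidate_types[cur]:
            step[cur_type] = max(
                (
                    (total + pair_score.get(((prev, prev_type), (cur, cur_type)), 0.0),
                     path + [cur_type])
                    for prev_type, (total, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = step
    total, path = max(best.values(), key=lambda item: item[0])
    return list(zip(terms, path)), total


# Hypothetical candidate types and scores, for illustration only.
TYPES = {"watch": ["verb", "concept"], "free": ["adjective"], "movie": ["concept"]}
SCORES = {
    (("watch", "verb"), ("free", "adjective")): 0.4,
    (("watch", "concept"), ("free", "adjective")): 0.1,
    (("free", "adjective"), ("movie", "concept")): 0.5,
}

typed, total = chain_model(["watch", "free", "movie"], TYPES, SCORES)
print(typed, total)
```

Because only adjacent pairs are scored, the relation between "watch" and "movie" is invisible to this sketch, which is the limitation the Pairwise Model is meant to address.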

Methodology
Online inference (cont.)
- Instance disambiguation
  - Infer the best concept of each entity within context
  - Filtering/re-ranking of the original concept cluster vector
  - Weighted-Vote (WV)
    - The final score of each concept cluster is a combination of its original score and the support from other terms
  - Example: "hotel california eagles"
    - eagles (original): <animal, 0.2379> <band, 0.1277> <bird, 0.1101> <celebrity, 0.0463> …
    - hotel california: <singer, 0.0237> <band, 0.0181> <celebrity, 0.0137> <album, 0.0132> …
    - eagles (after WV and normalization): <band, 0.4562> <celebrity, 0.1583> <animal, 0.1317> <singer, 0.0911> …
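The re-ranking idea can be sketched as follows. The slide only says that the final score combines the original score with support from other terms; the linear mixture, the `alpha` weight, and the concept-to-concept affinity values below are assumptions chosen for illustration, so the output is not expected to reproduce the numbers on the slide.

```python
def normalize(vector):
    """Scale scores so they sum to 1 (left unchanged if they sum to 0)."""
    total = sum(vector.values())
    return {k: v / total for k, v in vector.items()} if total else dict(vector)


def weighted_vote(original, context, affinity, alpha=0.5):
    """Re-rank an entity's concept clusters using support from a context term.

    original: concept -> score for the ambiguous entity.
    context:  concept -> score for another term in the same text.
    affinity: frozenset({c1, c2}) -> relatedness of two distinct concepts (hypothetical).
    alpha:    weight of the original score in the mixture (an assumption).
    """
    support = {
        c: sum(
            weight * (1.0 if c == other else affinity.get(frozenset({c, other}), 0.0))
            for other, weight in context.items()
        )
        for c in original
    }
    orig_n, supp_n = normalize(original), normalize(support)
    combined = {c: alpha * orig_n[c] + (1 - alpha) * supp_n[c] for c in original}
    return sorted(normalize(combined).items(), key=lambda kv: kv[1], reverse=True)


# Concept vectors taken from the slide (truncated there with "…").
EAGLES = {"animal": 0.2379, "band": 0.1277, "bird": 0.1101, "celebrity": 0.0463}
HOTEL_CALIFORNIA = {"singer": 0.0237, "band": 0.0181, "celebrity": 0.0137, "album": 0.0132}

# Hypothetical relatedness between concept clusters.
CONCEPT_AFFINITY = {
    frozenset({"band", "singer"}): 0.8,
    frozenset({"band", "album"}): 0.8,
    frozenset({"celebrity", "singer"}): 0.6,
    frozenset({"celebrity", "band"}): 0.5,
}

for concept, score in weighted_vote(EAGLES, HOTEL_CALIFORNIA, CONCEPT_AFFINITY):
    print(f"{concept}: {score:.4f}")
```

With these made-up affinities, "band" overtakes "animal" for "eagles", which mirrors the direction of the change shown on the slide.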

Methodology
Offline knowledge acquisition
- Harvesting IS-A network from Probase
  - http://research.microsoft.com/en-us/projects/probase/browser.aspx

Methodology
Offline knowledge acquisition (cont.)
- Constructing co-occurrence network
  - Between typed-terms; common terms are penalized
  - Compress network
    - Reduce cardinality
    - Improve inference accuracy

Methodology
Offline knowledge acquisition (cont.)
- Concept clustering by k-Medoids
  - Cluster similar concepts contained in Probase
  - Represent the semantics of an instance in a more compact manner
  - Reduce the size of the original co-occurrence network
  - Example: Disneyland
    - Before: <theme park, 0.0351>, <amusement park, 0.0336>, <company, 0.0179>, <park, 0.0178>, <big company, 0.0178>
    - After: <{theme park, amusement park, park}, 0.0865>, <{company, big company}, 0.0357>
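The Disneyland numbers on this slide are consistent with summing the scores of the concepts in each cluster (0.0351 + 0.0336 + 0.0178 = 0.0865 and 0.0179 + 0.0178 = 0.0357). The sketch below shows only that aggregation step; the clusters are taken as given rather than computed with k-Medoids, and the function name `cluster_vector` is hypothetical.

```python
def cluster_vector(concept_scores, clusters):
    """Collapse a concept vector into a concept-cluster vector by summing member scores."""
    result = []
    for cluster in clusters:
        total = sum(concept_scores.get(concept, 0.0) for concept in cluster)
        # Round to the four decimals used on the slide.
        result.append((cluster, round(total, 4)))
    return sorted(result, key=lambda item: item[1], reverse=True)


# Concept vector for "Disneyland" as shown on the slide.
DISNEYLAND = {
    "theme park": 0.0351,
    "amusement park": 0.0336,
    "company": 0.0179,
    "park": 0.0178,
    "big company": 0.0178,
}

# Clusters as shown on the slide (assumed to come from the k-Medoids step).
CLUSTERS = [
    ("theme park", "amusement park", "park"),
    ("company", "big company"),
]

print(cluster_vector(DISNEYLAND, CLUSTERS))
```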

Methodology
Offline knowledge acquisition (cont.)
- Scoring semantic coherence
  - Affinity Score
    - Measure semantic coherence between typed-terms
    - Two types of coherence: similarity, relatedness (co-occurrence)

Experiment
Benchmark
- Manually picked 11 terms
  - april in paris, hotel california, watch, book, pink, blue, orange, population, birthday, apple, fox
- Randomly selected 1,100 queries containing one of the above terms from one day’s query log
- Randomly sampled another 400 queries without any restriction
- Invited 15 colleagues

Experiment
- Effectiveness of text segmentation
- Effectiveness of type detection
- Effectiveness of short text understanding
- Verb, adjective, …
- Attribute, concept and instance

Experiment
- Accuracy of concept labeling
  - AC: adjacent context; WV: weighted-vote
- Efficiency of short text understanding

Conclusion
- Short text understanding
  - Text segmentation: a randomized approximation algorithm
  - Type detection: a Chain Model and a Pairwise Model
  - Concept labeling: a Weighted-Vote algorithm
- A framework with feedback
  - The three steps of short text understanding are related to each other
  - Quality of text segmentation → quality of other steps
  - Disambiguation → accuracy of measuring semantic coherence → performance of text segmentation and type detection