Query Segmentation Using Conditional Random Fields. Xiaohui Yu and Huxia Shi, York University. KEYS’09 (SIGMOD Workshop). 2009. 07. 16. Presented by Jaehui Park

Presentation transcript:

Query Segmentation Using Conditional Random Fields
Xiaohui Yu and Huxia Shi, York University
KEYS’09 (SIGMOD Workshop)
Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright © 2008 by CEBT

INTRODUCTION
• Effective search of text information in relational databases
  – Keyword search
    – One of the challenges
  – Assembling keyword-matching tuples from different tables into one view
• Exponential search space (w.r.t. the number of keywords)

Product
id    name    color    manufac.   size
id1   T41     black    cid1       12
id2   T60     Silver   cid1       7
id3   Mini    Blue     cid3       4
id4   vaio    gray     cid2       7
id5   vaiop   red      cid2       2
id6   xnote   black    cid4       7

Company
id     name      headquarter
cid1   IBM       China
cid2   SONY      Japan
cid3   DELL      USA
cid4   samsung   Korea

Joined view (Product, Company)
id1   T41   Black   cid1   12   |   cid1   IBM   China

Query keywords: T41, IBM, …

INTRODUCTION
• Example: “Green Mile Tom Hanks”
  – The query is split into segments
    – Segment-matching (not keyword-matching)
  – Reduces the search space
• Conditional Random Fields (CRF)
  – Probabilistic model to segment and label sequence data
    – Normalized model for combining multiple feature functions
  – Outperforms Hidden Markov Models and Maximum-Entropy Markov Models in real-world labeling tasks
    – Alleviates the independence assumption
    – Avoids the label bias problem
  – Models a conditional probability distribution over a label sequence given a keyword sequence
    – For example, given “Green Mile Tom Hanks”

PROBLEM DEFINITION
• Query
  – An ordered keyword sequence, denoted x
• Segment
  – A subsequence of keywords in the query, denoted S
  – Valid if this subsequence appears at least once in the database D
• Segmentation
  – A sequence of non-overlapping segments that completely cover all keywords in the query, denoted Š
  – Valid iff all S ∈ Š are valid
  – Example: valid segmentations of the query “Star Wars Clone”
• Goal: find the optimal segmentation
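The definitions above can be sketched in a few lines of Python. This is only an illustration: the set VALID_SEGMENTS is a hypothetical stand-in for "appears at least once in the database D", and segments are assumed to be contiguous runs of keywords.

```python
from typing import Iterator, List, Set, Tuple

Segment = Tuple[str, ...]


def valid_segmentations(query: List[str], valid: Set[Segment]) -> Iterator[List[Segment]]:
    """Yield every segmentation of `query` whose segments are all in `valid`."""
    if not query:
        # The empty query has exactly one segmentation: the empty one.
        yield []
        return
    for end in range(1, len(query) + 1):
        head = tuple(query[:end])
        if head in valid:
            # Extend each valid segmentation of the remaining keywords.
            for rest in valid_segmentations(query[end:], valid):
                yield [head] + rest


# Hypothetical contents; a real system would look segments up in the database D.
VALID_SEGMENTS = {("star",), ("wars",), ("clone",), ("star", "wars")}

segmentations = list(valid_segmentations(["star", "wars", "clone"], VALID_SEGMENTS))
for segmentation in segmentations:
    print(segmentation)
```

With these assumed database contents the query has two valid segmentations. The number of candidates grows quickly with query length, which is why an optimal segmentation has to be selected rather than enumerated by hand.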

NON-STATISTICAL ALGORITHMS
• Greedy search
  – Start with the first keyword in the given query
  – Keep including the next keyword in the current segment until adding the new keyword would make the segment no longer valid
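A minimal sketch of the greedy search, under two assumptions that the slide does not state: a new segment is started as soon as the current one cannot be extended, and a single keyword always opens a segment even if it is not valid by itself. The validity check is_valid and the toy segment set are hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, ...]


def greedy_segment(query: List[str], is_valid: Callable[[Segment], bool]) -> List[Segment]:
    """Grow the current segment keyword by keyword while it stays valid."""
    segments: List[Segment] = []
    current: List[str] = []
    for keyword in query:
        candidate = current + [keyword]
        if not current or is_valid(tuple(candidate)):
            # Either the segment is just starting or the extension is still valid.
            current = candidate
        else:
            # Extending would make the segment invalid: close it and start a new one.
            segments.append(tuple(current))
            current = [keyword]
    if current:
        segments.append(tuple(current))
    return segments


# Hypothetical valid segments standing in for a database lookup.
TOY_VALID = {("green",), ("mile",), ("tom",), ("hanks",), ("green", "mile"), ("tom", "hanks")}

result = greedy_segment(["green", "mile", "tom", "hanks"], lambda seg: seg in TOY_VALID)
print(result)
```

The search makes one pass over the query and never revisits an earlier boundary, so it can commit to a segment that a later keyword would have split differently.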

NON-STATISTICAL ALGORITHMS
• Keyword Query Cleaning [VLDB 2008]
  – Dynamic programming
  – Expands each keyword to a set of similar tokens
  – Scoring function
    – TF-IDF (in the IR sense)
    – Favors longer segments
    – Penalizes spelling corrections
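The dynamic-programming idea can be illustrated as follows. This sketch is not the Keyword Query Cleaning algorithm itself: token expansion and spelling correction are left out, and toy_score is a hypothetical scoring function (squared segment length for valid segments) chosen only because it favors longer segments.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[str, ...]


def best_segmentation(
    query: List[str], score: Callable[[Segment], Optional[float]]
) -> Optional[List[Segment]]:
    """Return the segmentation with the highest total score, or None if none exists."""
    n = len(query)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for query[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            s = score(tuple(query[start:end]))
            if s is None:  # segment not allowed
                continue
            total = best[start] + s
            if total > best[end]:
                best[end] = total
                back[end] = start
    if best[n] == float("-inf"):
        return None
    # Walk the back-pointers from the end of the query to recover the segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(query[start:end]))
        end = start
    segments.reverse()
    return segments


# Hypothetical valid segments and scoring function.
TOY_VALID = {("green",), ("mile",), ("tom",), ("hanks",), ("green", "mile"), ("tom", "hanks")}


def toy_score(segment: Segment) -> Optional[float]:
    return float(len(segment) ** 2) if segment in TOY_VALID else None


dp_result = best_segmentation(["green", "mile", "tom", "hanks"], toy_score)
print(dp_result)
```

Each prefix of the query is solved once and reused, so the number of scored segments is quadratic in the query length instead of exponential.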

QUERY SEGMENTATION USING CRF
• Compute the label for each keyword in a given query
  – Adjacent keywords with the same label are grouped into the same segment
    – Conditional probability of the label sequence y given the keyword sequence x
    – Best label sequence y’: the label sequence with the highest conditional probability
  – Training set
    – Database D = {(x_k, y_k)}, k = 1, …, N
  – Obtained from query logs
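The grouping step can be sketched as below. The label names ("title", "actor") are hypothetical, and the labels are taken as given; the sketch does not train or run a CRF.

```python
from itertools import groupby
from typing import List, Tuple


def segments_from_labels(keywords: List[str], labels: List[str]) -> List[Tuple[str, List[str]]]:
    """Group adjacent keywords that share a label into one labeled segment."""
    if len(keywords) != len(labels):
        raise ValueError("keywords and labels must have the same length")
    pairs = zip(keywords, labels)
    return [
        (label, [keyword for keyword, _ in group])
        for label, group in groupby(pairs, key=lambda pair: pair[1])
    ]


# Hypothetical labels for the example query.
labeled = segments_from_labels(
    ["green", "mile", "tom", "hanks"],
    ["title", "title", "actor", "actor"],
)
print(labeled)
```

Note that two adjacent values with the same label are merged into a single segment: with every keyword of “Johny Depp Orlando Bloom” labeled "actor", this grouping returns one four-keyword segment.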

QUERY SEGMENTATION USING CRF
• The query is segmented based on those labels
  – Invalid segments are further broken down into valid segments
    – “Green Mile Tom Hanks”
    – “Johny Depp Orlando Bloom”
  – MaxScore and MaxTerm algorithms
  – Computing the optimal segmentation S’
    – The optimal segmentation for each invalid segment is computed through the tree search procedure
  – MaxTerm: the valid segment of maximum length
  – From the finest segment to MaxTerm

[Figure: search tree over the keywords J, D, O, B]

QUERY SEGMENTATION USING CRF
• Enhanced CRF model
  – Column-position pairing
    – Exact position of a keyword in a segment
    – Example: “Green Mile Tom Hanks”
• Start position
• Segmentation boundary
• Other position
• Adapting user preferences

EXPERIMENTS
• Dataset
  – IMDB ( tuples), FoodMark ( tuples)
  – The training set and test queries are generated by random sampling
    – 10-fold cross-validation
  – Segmentation accuracy
    – A_x = 1 − (|S_x − S’_x| + |S’_x − S_x|) / |S_x|
    – S_x: true segment set
    – S’_x: predicted segment set
• Accuracy
  – The CRF model is not sensitive to query length
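The accuracy measure translates directly into set operations. In this sketch, segments are assumed to be compared as whole tuples of keywords, and the example segment sets are hypothetical.

```python
from typing import Iterable, Tuple

Segment = Tuple[str, ...]


def segmentation_accuracy(true_segments: Iterable[Segment], predicted_segments: Iterable[Segment]) -> float:
    """A_x = 1 - (|S_x - S'_x| + |S'_x - S_x|) / |S_x|, using set differences."""
    true_set = set(true_segments)
    predicted_set = set(predicted_segments)
    if not true_set:
        raise ValueError("the true segment set must not be empty")
    missed = len(true_set - predicted_set)    # true segments that were not predicted
    spurious = len(predicted_set - true_set)  # predicted segments that are not true
    return 1 - (missed + spurious) / len(true_set)


truth = [("green", "mile"), ("tom", "hanks")]
perfect = segmentation_accuracy(truth, truth)
half = segmentation_accuracy(truth, [("green", "mile")])
split = segmentation_accuracy(truth, [("green", "mile"), ("tom",), ("hanks",)])
print(perfect, half, split)
```

Because spurious segments are counted as well as missed ones, the value as defined on the slide can fall below zero when a prediction contains many wrong segments.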

EXPERIMENTS
• Ambiguous connection
  – Deciding which segment a keyword should belong to when that keyword can form valid segments with both the preceding and the following keywords
  – Ambiguity level
    – The number of ambiguous connections
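Counting ambiguous connections can be sketched as follows. The sketch assumes that only two-keyword segments are checked on each side, and the valid pairs are hypothetical; the paper's exact counting rule may differ.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, ...]


def ambiguity_level(query: List[str], is_valid: Callable[[Segment], bool]) -> int:
    """Count keywords that form a valid segment with both neighbours."""
    count = 0
    for i in range(1, len(query) - 1):
        with_previous = is_valid((query[i - 1], query[i]))
        with_next = is_valid((query[i], query[i + 1]))
        if with_previous and with_next:
            count += 1
    return count


# Hypothetical valid two-keyword segments.
TOY_PAIRS = {("new", "york"), ("york", "times")}

level = ambiguity_level(["new", "york", "times"], lambda seg: seg in TOY_PAIRS)
print(level)
```

Here "york" can attach to either neighbour, so the query has one ambiguous connection.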

EXPERIMENTS
• Efficient query segmentation
  – Two or three orders of magnitude improvement over keyword query cleaning
  – Less than 0.02 seconds to segment one medium query (except keyword query cleaning)

CONCLUSION
• CRF-based models for query segmentation
• Experiments have demonstrated the effectiveness of the proposed approach
• Future work
  – Accommodating spelling errors
  – Online segmentation of queries in a streaming fashion

DISCUSSION
• A subsequence of a valid segment is also valid
• Strong assumptions
  – Prefix as a subsequence of keywords
  – Considering only adjacent keywords in the same segment
• Not clear
  – The sum of all segments is always constant (MaxScore algorithm)
    – It cannot solve the problem of two terms in a segment
  – The tree merging algorithm is hard to follow (MaxTermSearch algorithm)