
Evaluation of N-grams Conflation Approach in Text-based Information Retrieval Serge Kosinov University of Alberta, Computing Science Department, Edmonton, AB, Canada

N-gram Conflation Method
Goal: study a conflation method based on the n-gram approach, with some enhancements, and evaluate its performance in textual information retrieval.
What is conflation good for? Matching non-identical words that refer to the same principal concept.
Why is it important? It avoids a strong dependence of retrieval results on the exact wording of a user's query and accounts for the richness and redundancy of natural language.

N-grams Method: Basic Idea
* Subdivide words into N-grams - the set of overlapping substrings of length N
  Example: N=2: (radio) → (ra - ad - di - io); N=3: (radio) → (rad - adi - dio)
* Treat as similar, and group together, words with identical N-gram structure
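A minimal sketch of the subdivision step in Python (the helper name char_ngrams is ours, not from the slides):

```python
def char_ngrams(word, n=2):
    """Overlapping character n-grams of a word.
    char_ngrams('radio', 2) -> ['ra', 'ad', 'di', 'io']
    char_ngrams('radio', 3) -> ['rad', 'adi', 'dio']
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Words whose n-gram sets largely coincide (for example, morphological variants of the same stem) are the ones the method groups together in the conflation step.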

N-grams Method: Basic Idea (Continued)

Experiment Implementation
* Pre-process the text collections (remove stop words, punctuation, special characters, etc.)
* Find the set of unique terms and compute their similarity matrix
* Cluster this data and compute IDF-like correction multipliers for each N-gram
* Process queries by replacing the terms that fall into the obtained clusters with the cluster ID, both in the document collection and in the queries, then pick the best match via the standard vector-space model representation
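A rough sketch of the first two steps; the tokenization rule and the stop-word list below are illustrative assumptions, not the ones used in the original experiments:

```python
import re

# Illustrative stop-word list; the original experiments used their own set.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def unique_terms(documents):
    """The set of unique terms across the pre-processed collection."""
    return sorted({t for doc in documents for t in preprocess(doc)})
```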

Computing Similarity Matrix
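The similarity measure itself is not preserved in this transcript; a common choice in n-gram conflation is Dice's coefficient over the words' shared bigrams, used below purely as an assumption (char_ngrams is the helper sketched earlier):

```python
import numpy as np

def dice_similarity(word_a, word_b, n=2):
    """Dice coefficient over n-gram sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(char_ngrams(word_a, n)), set(char_ngrams(word_b, n))
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def similarity_matrix(terms, n=2):
    """Pairwise term-term similarity over the unique-term vocabulary."""
    m = len(terms)
    sim = np.eye(m)
    for i in range(m):
        for j in range(i + 1, m):
            sim[i, j] = sim[j, i] = dice_similarity(terms[i], terms[j], n)
    return sim
```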

Clustering Data and Adjusting IDF
* Clustering technique: complete-link agglomerative clustering (a.k.a. HCA)
  Example: C325 = {computer, computing, computer-based}
* IDF-like adjustment of weights: w_ij = bf_ij * log(N / n), where
  w_ij  - weight of bigram B_j in term cluster C_i
  bf_ij - frequency of bigram B_j in term cluster C_i
  N     - number of term clusters
  n     - number of term clusters where bigram B_j occurs at least once
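A sketch of these two steps using SciPy's complete-link clustering and the weight formula above; the 1 - similarity distance conversion and the cut-off threshold are illustrative assumptions, and char_ngrams is the helper sketched earlier:

```python
import math
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_terms(terms, sim, threshold=0.7):
    """Complete-link (HCA) clustering of terms from their similarity matrix."""
    dist = 1.0 - sim                       # assumed similarity-to-distance conversion
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(Z, t=threshold, criterion="distance")
    clusters = {}
    for term, label in zip(terms, labels):
        clusters.setdefault(label, set()).add(term)
    return clusters                        # e.g. {325: {'computer', 'computing', ...}, ...}

def idf_like_weights(clusters, n=2):
    """w_ij = bf_ij * log(N / n): bigram frequency within cluster i, down-weighted
    by the number of the N clusters in which the bigram occurs."""
    N = len(clusters)
    bf = {cid: Counter(g for t in terms for g in char_ngrams(t, n))
          for cid, terms in clusters.items()}
    df = Counter(g for counts in bf.values() for g in counts)   # clusters containing each bigram
    return {cid: {g: f * math.log(N / df[g]) for g, f in counts.items()}
            for cid, counts in bf.items()}
```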

Processing Queries: Example
Best match: the cosine similarity coefficient between the document vector (..., C325, C487, Torvalds, ...) and the query vector (..., C325, Torvalds, ...)
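A small sketch of this matching step: terms covered by a cluster are replaced with the cluster ID on both sides, and documents are ranked by the cosine similarity of the resulting count vectors (the mapping and helper names below are illustrative):

```python
import math
from collections import Counter

def to_vector(tokens, term_to_cluster):
    """Replace clustered terms with their cluster ID, then count occurrences."""
    return Counter(term_to_cluster.get(t, t) for t in tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# 'computing' in the document and 'computer' in the query both map to C325,
# so the query matches even though the exact wording differs.
mapping = {"computer": "C325", "computing": "C325", "unix": "C487"}
doc = to_vector(["computing", "unix", "torvalds"], mapping)
query = to_vector(["computer", "torvalds"], mapping)
score = cosine(doc, query)
```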

Experimental Results
Text collections used:
Results: 3-point precision average (at 20, 50, 80% recall)

Inverse Frequency Weights Effect
Association of unseen query terms with clusters:
* With IDF-like correction: consolidation → {console}; editing → {editor, edition}
* Without IDF-like correction: consolidation → {condensation}; editing → {accrediting}

Individual Query Analysis Example
* Other examples in which N-gram conflation outperforms other methods: criteria-criterion, exchange-interchange, system-subsystem, etc.

Conclusions and Directions for Further Study
Advantages of N-gram conflation:
* it is a language-independent approach
* it handles misprints and orthographic errors well
* the best gains are obtained for special-form and compound words
* enhanced with the IDF-like correction, it performs better than traditional stemming (Porter, etc.)
Disadvantages:
* clusters contain homophone noise
* straightforward HCA is impractical on large-scale datasets
Prospects:
* apply the method to more highly inflected languages
* combine N-grams and Porter stemming
* enhance the clustering routines