Multi-Abstraction Concern Localization


Multi-Abstraction Concern Localization
Tien-Duy B. Le, Shaowei Wang, and David Lo
School of Information Systems, Singapore Management University

Motivation
Concern localization is the process of locating code units that match a particular textual description, such as a bug report or a feature request. Recent concern localization techniques compare documents at a single level of abstraction (i.e., words or topics). However, a word can be abstracted at multiple levels. For example, Eindhoven can be abstracted to North Brabant, the Netherlands, Western Europe, the European continent, the Earth, and so on. In multi-abstraction concern localization, we represent documents at multiple abstraction levels by leveraging multiple topic models.

Overall Framework
The method corpus and the concerns are first preprocessed. A hierarchy creation step then builds an abstraction hierarchy (Level 1, Level 2, ..., Level N). Finally, a standard retrieval technique is combined with multi-abstraction retrieval to produce a ranked list of methods per concern.

Text Preprocessing
We remove Java keywords, punctuation marks, and special symbols, and break identifiers into tokens based on the Camel-case convention. Finally, we apply the Porter stemming algorithm to reduce English words to their root forms.

Hierarchy Creation
We apply Latent Dirichlet Allocation (LDA) a number of times, with different numbers of topics, to construct an abstraction hierarchy. Each application of LDA creates a topic model, which corresponds to one abstraction level. We refer to the number of topic models contained in a hierarchy as the height of the hierarchy.

Multi-Abstraction Retrieval
We propose the multi-abstraction Vector Space Model (VSM_MA) by combining VSM with our abstraction hierarchy. In multi-abstraction VSM, document vectors are extended by adding elements corresponding to the topics in the hierarchy.
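The text-preprocessing step described earlier (keyword and punctuation removal, Camel-case splitting, stemming) could be sketched as follows. This is a minimal illustration, not the authors' implementation: the Java keyword list is a small subset, and `naive_stem` is a crude stand-in for the real Porter stemmer.

```python
import re

# Hypothetical subset of Java keywords; a real implementation would use the full list.
JAVA_KEYWORDS = {"public", "static", "void", "int", "class",
                 "return", "new", "if", "else", "for", "while"}

def split_camel_case(identifier):
    """Break an identifier into tokens on Camel-case boundaries."""
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", identifier)

def naive_stem(token):
    """Crude stand-in for the Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(source):
    """Drop punctuation/symbols and Java keywords, split identifiers, stem tokens."""
    raw_tokens = re.findall(r"[A-Za-z]\w*", source)  # discards punctuation and symbols
    tokens = []
    for tok in raw_tokens:
        if tok in JAVA_KEYWORDS:
            continue
        for part in split_camel_case(tok):
            tokens.append(naive_stem(part.lower()))
    return tokens

print(preprocess("public int getItemCount() { return itemCounters.size(); }"))
# → ['get', 'item', 'count', 'item', 'counter', 'size']
```

The resulting token lists are what the hierarchy-creation step would feed to LDA.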
Given a query q and a document d in corpus D, the similarity between q and d is calculated in VSM_MA as the cosine similarity of the extended vectors:

sim_VSM_MA(q, d) = cos(q_ext, d_ext)

where the extended vector of a document d is

d_ext = ( tf-idf(w_1, d, D), ..., tf-idf(w_V, d, D),
          θ_1(t_1, d), ..., θ_1(t_|H_1|, d),
          ...,
          θ_L(t_1, d), ..., θ_L(t_|H_L|, d) )

where:
- V is the size of the original document vector
- w_i is the i-th word in d
- L is the height of the abstraction hierarchy H
- H_k is the k-th abstraction level in the hierarchy
- θ_k(t_i, d) is the probability of topic t_i appearing in d, as assigned by the k-th topic model in abstraction hierarchy H
- tf-idf(w, d, D) is the term frequency-inverse document frequency of word w in document d given corpus D

Effectiveness of Multi-Abstraction VSM

Hierarchy        Numbers of Topics    MAP       Improvement
Baseline (VSM)   —                    0.0669    N/A
H1               …                    …         …
H2               50, …                …         …
H3               50, 100, …           …         …
H4               50, 100, 150, …      …         19.36%

The MAP improvement of H4 over the baseline is 19.36%. MAP improves as the height of the abstraction hierarchy is increased.

Future Work
Extend the experiments with combinations of:
- different numbers of topics in each level of the hierarchy
- different hierarchy heights
- different topic models (Pachinko Allocation Model, Syntactic Topic Model, Hierarchical LDA)
Experiment with Panichella et al.'s method [1] to infer good LDA configurations for our approach.

[1] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia. How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. ICSE 2013.
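Under the definitions above, VSM_MA similarity amounts to ordinary cosine similarity computed over tf-idf vectors concatenated with the per-level topic distributions. A toy Python sketch with made-up numbers (a 3-word vocabulary and a hierarchy of height 2, with a 2-topic and a 3-topic model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extend_vector(tfidf_vec, topic_dists):
    """Append, level by level, the topic probabilities assigned by each
    topic model in the abstraction hierarchy to the tf-idf vector."""
    extended = list(tfidf_vec)
    for level in topic_dists:  # level k = distribution from the k-th topic model
        extended.extend(level)
    return extended

def sim_vsm_ma(q_tfidf, d_tfidf, q_topics, d_topics):
    """VSM_MA similarity: cosine over the extended vectors."""
    return cosine(extend_vector(q_tfidf, q_topics),
                  extend_vector(d_tfidf, d_topics))

score = sim_vsm_ma([1.0, 0.0, 0.5], [0.8, 0.2, 0.4],
                   [[0.7, 0.3], [0.2, 0.5, 0.3]],
                   [[0.6, 0.4], [0.1, 0.6, 0.3]])
print(round(score, 3))  # → 0.972
```

Because the topic probabilities occupy extra dimensions, two documents with disjoint word vectors can still receive a non-zero similarity when their topic distributions overlap, which is the point of retrieving at multiple abstraction levels.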