
A code-centric cluster-based approach for searching online support forums for programmers Christopher Scaffidi, Christopher Chambers, Sheela Surisetty

("how to use * in labview" OR "how do you * in labview" OR "how can I * in labview") AND site:forums.ni.com

How do programmers cope? They rely on code.
- Code to aid understanding: "I can research the individual parts myself, but an 'assembled' explanation from a veteran would be very helpful."
- Code to clarify text: answers with code samples are 54% more likely to be flagged as accepted solutions than answers without source code.
- Code as a primary source of information: LabVIEW attachments average 92 KB, compared to only 474 bytes for the text of posts.

Toward a better search engine… Ideally, we could match the user's keywords to a marked solution that also has a code attachment, then return a one-sentence summary of the key idea along with code demonstrating how to do it.
[Mock-up: for the query "How do you do X?", the engine returns "You need to use the xxxxxxx to perform yyyyyyyy." with "View detail" and "Download" links.]

Our first step toward a solution:
- Clusters of code
- Searching for code
- Evaluation
- Ideas for the future

First step: Making sense of code

Relationships among code
- Code, like text, can be summarized as an N-dimensional vector, with one dimension per distinct primitive, which lets us cluster code according to structure.
- Hypothesis #1: Structurally similar code tends to have a similar topic.
- Adequate quality for use as a search result: if a piece of code X was marked as a solution for one question, and another piece of code Y is very similar to X, then perhaps Y could be a good answer, too.
- Hypothesis #2: Search results will improve if we use code similarity as a proxy for quality.

Features used for code clustering
Each piece of LabVIEW code is called a "VI". If M(v, j) indicates the number of times that VI v uses operation j, then each VI can be represented by a weighting of the standard TF-IDF form

    W(v, j) = M(v, j) · log( |V| / |{v′ ∈ V : M(v′, j) > 0}| )

where V is the set of all VIs. This is essentially the same TF-IDF vector used to classify web pages, tweets, and other textual documents, and it is amenable to clustering with k-means on the vector dot product.
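To make this concrete, here is a minimal sketch of the weighting and clustering step. The counts matrix, the tiny cluster count, and the library choice (NumPy plus scikit-learn's KMeans) are illustrative assumptions, not the authors' implementation; the study used 1000 clusters.

```python
# Sketch: TF-IDF weighting of VI operation counts, then k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

# M[v, j] = number of times VI v uses operation j (toy data).
M = np.array([
    [4, 0, 1, 0],   # VI 0
    [3, 0, 2, 0],   # VI 1: structurally similar to VI 0
    [0, 5, 0, 2],   # VI 2
    [0, 4, 0, 3],   # VI 3: structurally similar to VI 2
], dtype=float)

n_vis = M.shape[0]
df = (M > 0).sum(axis=0)                 # how many VIs use each operation
idf = np.log(n_vis / np.maximum(df, 1))  # inverse document frequency
W = M * idf                              # TF-IDF feature vector per VI

# Normalize rows to unit length so Euclidean k-means behaves like
# clustering on the vector dot product (cosine similarity).
W = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(W)
print(clusters)   # e.g. [0 0 1 1]: structurally similar VIs land together
```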

Clustering code: sample data set
- 150,323 discussion threads
- 818,945 posts in total
- 71,968 VIs that could be parsed and clustered
- Placed into 1000 clusters; 966 contained more than one attachment
- Informally reviewed some clusters and verified that they generally "made sense"

Hypothesis #1: Structurally similar code tends to have a similar topic
Obtaining data for the analysis: for each cluster, we randomly chose a VI from that cluster, a second VI from the same cluster, and a VI from another cluster, then retrieved the forum text associated with each VI.
Statistical paired t-tests:
- Do posts within a cluster tend to have more words in common than posts in different clusters?
- Do posts within a cluster tend to have a higher dot product (in a TF-IDF "word space") than posts in different clusters?
Results: both tests were statistically significant. Conclusion: Hypothesis #1 is probably true.
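A sketch of this paired comparison, assuming hypothetical per-cluster similarity scores; in the study, each score would be the word-space TF-IDF dot product (or shared-word count) of the forum posts attached to a VI pair.

```python
# Sketch of the paired t-test behind Hypothesis #1 (synthetic scores).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# One pair of scores per cluster: within[i] compares two VIs drawn from
# cluster i; between[i] compares a VI from cluster i with a VI from a
# randomly chosen other cluster. (Illustrative numbers only.)
within = rng.normal(0.6, 0.1, size=966)
between = rng.normal(0.3, 0.1, size=966)

t_stat, p_value = ttest_rel(within, between)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
# A small p with within > between supports Hypothesis #1.
```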

Hypothesis #2: Search results will improve if we use code similarity as a proxy for quality.
We constructed a search engine:
- Given a query, find "primary" search results with a traditional keyword-match method (vector in word space), restricted to explicit solutions that have code.
- For each VI in the primary results, find secondary results: posts in the same cluster whose text also mentions query words.
- Heuristically merge the primary and secondary result lists.

Search algorithm
Start with primary search results generated from a query on the text of forum posts. If post p contains a VI, let N(p, i) indicate the number of times that the text of p mentions word i (where i ranges over the user's query words, discarding stop words), and let W_p be the TF-IDF-style vector formed by weighting each N(p, i) by that word's inverse document frequency. Retrieve the W_p vectors of posts that are marked as solutions, in order of decreasing keyword-match score.
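A minimal sketch of this primary retrieval step, assuming a simple in-memory corpus; the field names (text, is_solution, has_vi), the stop-word list, and the use of the magnitude of W_p as the ranking score are illustrative assumptions.

```python
# Sketch: primary results = solution posts with code, ranked by a
# TF-IDF score computed over the query words only.
import math
import re

STOP_WORDS = {"how", "do", "i", "in", "a", "the", "to"}

def primary_results(query, posts, k=5):
    terms = [w for w in re.findall(r"\w+", query.lower())
             if w not in STOP_WORDS]
    n = len(posts)
    # Document frequency of each query term over all posts.
    df = {t: sum(1 for p in posts if t in p["text"].lower()) for t in terms}
    scored = []
    for p in posts:
        if not (p["is_solution"] and p["has_vi"]):
            continue  # restrict to explicit solutions that carry code
        words = p["text"].lower().split()
        # Component for word i: N(p, i) * idf(i).
        w = [words.count(t) * math.log(n / df[t]) for t in terms if df[t] > 0]
        score = math.sqrt(sum(x * x for x in w))  # magnitude of W_p
        if score > 0:
            scored.append((score, p))
    scored.sort(key=lambda s: -s[0])
    return [p for _, p in scored[:k]]
```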

Search algorithm, continued
Insight: if one relevant VI v1 has been explicitly marked as a solution, then perhaps a similar and relevant VI v2 might also be a useful solution. We therefore use clusters to retrieve similar code even from posts that aren't explicitly marked as solutions.
For each post p containing an attachment in the same cluster as any attachment in the primary results, let score′_p combine p's keyword-match score with S_p, and sort these secondary results by decreasing score′_p. Here S_p is a heuristic estimate of the likelihood that p is a solution: a linear function of the number of kudos, the author's activity, the length of the post, the position of the post in the thread, and a binary variable indicating whether the post is a self-reply.
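A sketch of the secondary scoring and merge. The coefficients in S_p and the multiplicative combination for score′_p are assumptions for illustration; the slide names the features of the linear function but not its weights or the exact combination.

```python
# Sketch: secondary results via code clusters (hypothetical field names).
def solution_likelihood(p):
    # S_p: heuristic linear estimate that post p is a solution.
    # Coefficients are illustrative, not from the paper.
    return (0.3 * p["kudos"]
            + 0.2 * p["author_activity"]
            + 0.1 * p["post_length"]
            - 0.1 * p["position_in_thread"]
            - 0.5 * p["is_self_reply"])

def secondary_results(primary, posts_by_cluster, keyword_score, k=5):
    """Posts whose attachments share a cluster with any primary result."""
    clusters = {p["cluster"] for p in primary}
    candidates = []
    for c in clusters:
        for p in posts_by_cluster[c]:
            s = keyword_score(p)   # magnitude of W_p, as in the primary step
            if s > 0 and p not in primary:
                # Assumed combination: score'_p = keyword score * S_p.
                candidates.append((s * solution_likelihood(p), p))
    candidates.sort(key=lambda x: -x[0])
    return [p for _, p in candidates[:k]]

# Heuristic merge: append secondary results after the primary ones,
# dropping duplicates, e.g. merged = primary + secondary_results(...).
```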

Test data for evaluating Hypothesis #2
- Test queries: 10 sample user queries from posts in the biggest topics identified during a prior forum study.
- Up to 5 search results from this search engine, plus up to 5 from the existing LabVIEW forum search engine.
- An intermediate LabVIEW user rated all search results, with results from the different engines randomly mixed together.
- Rating scheme: 0 = off-topic; 1 = on-topic but unrelated to the specific question; 2 = related to the question but doesn't actually answer it; 3 = partial answer to the question; 4 = fully answers the question.

Study results

    Metric                                                    Existing search | New search
    % of queries yielding a non-empty result set              80%             | 100%
    Average # of results received per query                   …               | …
    % of results rated as an answer                           7%              | 40%
    % of queries with at least one result rated as an answer  10%             | 50%
    Overall average rating of results                         …               | …

The difference in rating was statistically significant. Hypothesis #2 is probably true: search results improve when we use code similarity as a proxy for quality.

Implications for designers of Q&A systems
Code can be grouped in meaningful ways using clustering. Consider this for use in designing new features:
- Search engines similar to our prototype
- Features to recommend "similar examples"
- Features, inside IDEs, for retrieving code examples from a repository that are similar to what the programmer is currently creating

Substantial room for improvement
Still, only 40% of results were rated as an answer:
- We need a better method of filtering out non-answers.
- We need to integrate answers from outside the forum; this is particularly a problem for topics that are not code-centric, principally (in this study) hardware I/O.
Future work:
- Help users understand relationships to other resources
- Lead users to resources other than code
- Provide summaries of code

What are your ideas? Time for Q&A. Thank you to National Instruments for funding, to ICMLA for this chance to present, and to the audience for suggestions and feedback.