Published by Corey Williams. Modified over 8 years ago.
1
A code-centric cluster-based approach for searching online support forums for programmers
Christopher Scaffidi, Christopher Chambers, Sheela Surisetty
2
("how to use * in labview" OR "how do you * in labview" OR "how can I * in labview") AND site:forums.ni.com
3
How do programmers cope? Rely on code.
  Code to aid understanding
    "I can research the individual parts myself, but an ‘assembled’ explanation from a veteran would be very helpful."
  Code to clarify text
    Answers with code samples are 54% more likely to be flagged as accepted solutions than answers without source code
  Code as a primary source of information
    LabVIEW attachments average 92 KB, compared to only 474 bytes for the text of posts
4
Toward a better search engine…
  Ideally, we could match the user's keywords to a marked solution that also has a code attachment
    Return a one-sentence summary of the key idea
    And code demonstrating how to do it
  Example result for the query "How do you do X?"
    "You need to use the xxxxxxx to perform yyyyyyyy." [View detail] [Download]
5
Our first step toward a solution
  Clusters of code
  Searching for code
  Evaluation
  Ideas for the future
6
First step: Making sense of code
7
Relationships among code
  Code, like text, can be summarized as an N-dimensional vector, with one dimension per distinct primitive
  Clustering code according to structure
    Hypothesis #1: Structurally similar code tends to have a similar topic
  Adequate quality for use as a search result
    If a piece of code X was marked as a solution for one question, and another piece of code Y is very similar to X, then perhaps Y could be a good answer, too.
    Hypothesis #2: Search results will improve if we use code similarity as a proxy for quality.
8
Features used for code clustering
  Each piece of LabVIEW code is called a "VI"
  If M(v, j) indicates the number of times that VI v uses operation j, then [the slide's formula for the feature vector of v was not captured in this transcript]
  Essentially the same TF-IDF vector used to classify web pages, tweets, and other textual documents
  Amenable to clustering with k-means on vector dot product
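Because the slide's formula is missing, the following is only a minimal sketch of a standard TF-IDF weighting applied to operation counts, not necessarily the exact weighting the authors used. The function names, the log(n / df) form of the inverse document frequency, the unit-length normalization, and the toy VIs are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(vis):
    """Build sparse TF-IDF vectors from {vi_name: [operation, ...]}.

    Assumed weighting: M(v, j) * log(n / df(j)), then scaled to unit length.
    """
    n = len(vis)
    counts = {v: Counter(ops) for v, ops in vis.items()}  # M(v, j)
    df = Counter()  # number of VIs that use operation j at least once
    for c in counts.values():
        df.update(c.keys())
    vectors = {}
    for v, c in counts.items():
        vec = {j: m * math.log(n / df[j]) for j, m in c.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors[v] = {j: x / norm for j, x in vec.items()} if norm else vec
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(x * b.get(j, 0.0) for j, x in a.items())

# Hypothetical VIs described by the primitive operations they contain.
example = {
    "vi_a": ["add", "add", "while_loop", "file_write"],
    "vi_b": ["add", "while_loop", "file_write"],
    "vi_c": ["daq_read", "while_loop", "graph"],
}
vecs = tfidf_vectors(example)
sim_ab = dot(vecs["vi_a"], vecs["vi_b"])  # structurally similar VIs
sim_ac = dot(vecs["vi_a"], vecs["vi_c"])  # share only the ubiquitous loop
```

In this toy data `while_loop` appears in every VI, so its weight is zero and it contributes nothing to similarity, which is the usual reason for preferring TF-IDF over raw counts.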
9
Clustering code
  Sample data set
    150,323 discussion threads
    818,945 posts in total
    71,968 VIs that could be parsed and clustered
  Placed into 1000 clusters
    966 contained more than one attachment
  Informally reviewed some clusters and verified that they generally "made sense"
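The deck says only that the vectors were clustered with k-means on the vector dot product. One common way to do that is spherical k-means, sketched below under several assumptions: dense list vectors, random initialization from the data with a fixed seed, and a fixed number of iterations. None of these details come from the slides.

```python
import math
import random

def normalize(v):
    """Scale a dense vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Assign each vector to the centroid with the highest dot product."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for every point.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Hypothetical two-dimensional feature vectors forming two obvious groups.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = spherical_kmeans(toy_vectors, k=2)
```

With the 71,968 VIs and 1000 clusters reported on the slide, a library implementation would be a more practical choice than this pure-Python loop.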
10
Hypothesis #1: Structurally similar code tends to have a similar topic
  Obtaining data for the analysis
    Randomly chose a VI from each cluster
    For each cluster
      Randomly chose another cluster's VI
      Randomly chose a second VI in the same cluster
    Retrieved the forum text associated with each VI
  Statistical paired t-tests:
    Do posts within a cluster tend to have more words in common than those in different clusters?
    Do posts within a cluster tend to have a higher dot product (in a TF-IDF "word space") than those in different clusters?
  Results: Both significant at p<0.000001
  Conclusion: Hypothesis #1 is probably true
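The first of the two tests can be sketched as follows. The sample texts are invented, the whitespace tokenization is an assumption, and only the t statistic is computed; a p-value could be obtained from a statistics package such as SciPy's `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev

def words_in_common(text_a, text_b):
    """Number of distinct words shared by two posts (whitespace tokenization)."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

def paired_t_statistic(first, second):
    """t statistic of a paired t-test on the differences first[i] - second[i]."""
    diffs = [a - b for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical triples, one per cluster:
# (text for the chosen VI, text for a second VI in the same cluster,
#  text for a VI from another cluster)
samples = [
    ("read serial port data", "serial port read timeout", "plot waveform graph"),
    ("write array to file", "save array to text file", "configure daq trigger"),
    ("plot waveform on graph", "update waveform graph", "read serial port"),
    ("read data from file", "read file into array", "plot data on graph"),
]
within = [words_in_common(a, same) for a, same, _ in samples]
across = [words_in_common(a, other) for a, _, other in samples]
t_stat = paired_t_statistic(within, across)
```

A large positive `t_stat` indicates that posts attached to structurally similar code share more vocabulary than posts drawn from different clusters, which is the pattern the slide reports.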
11
Hypothesis #2: Search results will improve if we use code similarity as a proxy for quality.
  Constructed search engine
    Given query, find "primary" search results
      Traditional keyword match method (vector in word-space)
      Restrict to explicit solutions that have code
    For each VI in primary results, find secondary results
      Those in same cluster whose posts also mention query words
    Heuristically merge primary and secondary result lists
12
Search algorithm
  Start with primary search results generated from a query on the text of forum posts
  If post p contains a VI, let N(p, i) indicate the number of times that the text of p mentions word i (which ranges over the user query, discarding stop words), and [the slide's definition of the vector W_p was not captured in this transcript]
  Retrieve W_p vectors that are marked as solutions, in order of decreasing [ranking quantity not captured in this transcript]
13
Search algorithm, cont.
  Insight: if one relevant VI v_1 has been explicitly marked as a solution, then perhaps a similar and relevant VI v_2 might also be a useful solution
  Use clusters to retrieve similar code even from posts that aren't explicitly marked as solutions
  For each post p containing an attachment in the same cluster as any attachment in the primary results, let [the slide's definition of score'_p was not captured in this transcript]
  Sort these secondary results by decreasing score'_p
  S_p is a heuristic estimate of the likelihood that p is a solution (a linear function of # kudos, author's activity, length of post, position of post in thread, and a binary variable indicating whether the post is a self-reply)
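The overall flow of the two-stage search can be sketched as below. The scoring formulas on the slides did not survive, so everything numeric here is a stand-in: the primary score is a plain count of query-word mentions, the secondary score multiplies that count by a per-post solution likelihood (playing the role of S_p), and the merge simply lists primary results before secondary ones. The `Post` fields, the stop-word list, and the sample posts are hypothetical.

```python
from dataclasses import dataclass

STOP_WORDS = {"how", "to", "a", "the", "in", "do", "i", "can"}  # assumed list

@dataclass
class Post:
    post_id: int
    text: str
    cluster: int                # cluster of the VI attached to the post
    is_solution: bool           # explicitly marked as a solution
    solution_likelihood: float  # stand-in for the heuristic S_p

def keyword_score(post, query_words):
    """Stand-in for the word-space match: count query-word mentions."""
    words = post.text.lower().split()
    return sum(words.count(w) for w in query_words)

def search(posts, query, limit=5):
    query_words = [w for w in query.lower().split() if w not in STOP_WORDS]
    # Primary results: marked solutions with code that mention the query words.
    primary = [p for p in posts
               if p.is_solution and keyword_score(p, query_words) > 0]
    primary.sort(key=lambda p: keyword_score(p, query_words), reverse=True)
    # Secondary results: other posts whose VI is in the same cluster as a
    # primary result and whose text also mentions the query words.
    clusters = {p.cluster for p in primary}
    primary_ids = {p.post_id for p in primary}
    secondary = [p for p in posts
                 if p.cluster in clusters
                 and p.post_id not in primary_ids
                 and keyword_score(p, query_words) > 0]
    secondary.sort(
        key=lambda p: keyword_score(p, query_words) * p.solution_likelihood,
        reverse=True)
    # Assumed merge heuristic: primary results first, then secondary.
    return (primary + secondary)[:limit]

posts = [
    Post(1, "use visa read to read the serial port", 7, True, 0.9),
    Post(2, "this vi polls the serial port in a loop", 7, False, 0.6),
    Post(3, "serial port example attached", 7, False, 0.2),
    Post(4, "serial cable wiring question", 3, False, 0.8),
    Post(5, "plot a waveform graph", 7, False, 0.9),
]
results = search(posts, "how to read serial port")
result_ids = [p.post_id for p in results]
```

Post 4 mentions a query word but is excluded because its VI is in a different cluster, and post 5 shares the cluster but is excluded because its text does not mention the query, which mirrors the two conditions stated on the slides.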
14
Test data for evaluating hypothesis #2
  Test queries
    10 sample user queries from posts in the biggest topics identified during a prior forum study
  Up to 5 search results from this search engine, plus up to 5 from the existing LabVIEW forum search engine
  Intermediate LabVIEW user rated all search results
    Randomly mixed together results from different engines
  Rating scheme:
    0 = off-topic
    1 = on-topic but unrelated to specific question
    2 = related to question but doesn't actually answer it
    3 = partial answer to the question
    4 = fully answers the question
15
Study results
                                                                                 Existing search  New search
  % of queries for which a non-empty result set was obtained                     80%              100%
  Average # results received per query                                           2.9              3.5
  % of results rated as an answer                                                7%               40%
  % of queries for which results include at least one result rated as an answer  10%              50%
  Overall average rating of results                                              0.76             1.83
  Difference in rating significant at p<0.0001
  Hypothesis #2 is probably true. Search results will improve if we use code similarity as a proxy for quality.
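The summary rows of this table can be derived from per-query rating lists, as in the sketch below. Two points are assumptions rather than facts from the slides: that "rated as an answer" means a rating of 3 or 4, and that the average number of results is taken over all queries, including those with empty result sets. The ratings shown are invented, not the study's data.

```python
def summarize(ratings_by_query, answer_threshold=3):
    """Compute the table's summary rows from {query: [rating, ...]}."""
    rating_lists = list(ratings_by_query.values())
    all_ratings = [r for ratings in rating_lists for r in ratings]
    n_queries = len(rating_lists)
    n_results = len(all_ratings)
    n_answers = sum(r >= answer_threshold for r in all_ratings)
    return {
        "pct_nonempty": 100 * sum(1 for rs in rating_lists if rs) / n_queries,
        "avg_results": n_results / n_queries,
        "pct_answers": 100 * n_answers / n_results if n_results else 0.0,
        "pct_queries_with_answer": 100 * sum(
            any(r >= answer_threshold for r in rs) for rs in rating_lists
        ) / n_queries,
        "avg_rating": sum(all_ratings) / n_results if n_results else 0.0,
    }

# Hypothetical ratings on the 0-4 scale for four queries.
toy_ratings = {"q1": [4, 2, 0], "q2": [], "q3": [3, 1], "q4": [1, 1, 0]}
summary = summarize(toy_ratings)
```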
16
Implications for designers of Q&A systems
  Code can be grouped in meaningful ways using clustering
  Consider for use in designing new features
    Such as search engines similar to our prototype
    Such as features to recommend “similar examples”
    Such as features, inside IDEs, for retrieving code examples from a repository that are similar to what the programmer is currently creating
17
Substantial room for improvement
  Still, only 40% of results were rated as an answer
    Need a better method of filtering out non-answers
    Need to integrate answers from outside the forum
      Particularly a problem for topics that are not code-centric, principally (in this study) Hardware I/O
  Future work
    Help users understand relationships to other resources
    Lead users to resources other than code
    Provide summaries of code
18
What are your ideas? Time for Q&A
  Thank you to National Instruments for funding
  Thank you to ICMLA for this chance to present
  Thank you to you for suggestions and feedback