Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers Raphael Hoffmann, James Fogarty, Daniel S. Weld University of Washington, Seattle UIST 2007 I am Raphael Hoffmann and this is joint work with James Fogarty and Dan Weld at the University of Washington. My talk is about Assieme, a new Web search interface for programmers.
Programmers Use Search To identify an API To seek information about an API To find examples on how to use an API Example Task: “Programmatically output an Acrobat PDF file in Java.” This talk will extend this list. It is about search as performed by programmers. As we confirmed in interviews with programmers, they frequently search the Web to identify an API that they can use in their project, to seek more information about an API (such as documentation pages), and to find examples of how to use an API (many pages contain short code snippets that are very valuable to programmers).
Example: General Web Search Interface Let’s first look again at our example of outputting an Acrobat PDF file in Java. We could use a general Web search interface and search for “output acrobat”. This query is too general, so let’s add the keyword “java”. Still nothing relevant. So let’s modify the query to “output pdf java”. OK, the first two hits seem very relevant. The first one is a long article on using an API for generating PDF output in Java. The article also contains some code snippets. It says the first step is to create a Document object. However, the code sample is incomplete. It doesn’t say which package contains the class Document, and we also cannot look up the documentation of Document, for example to choose a different constructor that better suits our needs. We could do a new search (here we added the class name to our last query; that didn’t work so well, perhaps we could try the name of the library). However, looking carefully at our article again, we might also find a link to more information about the library. Navigating through another 4 pages finally brings us to the information we are looking for. In summary: a general Web search engine certainly gets us the information we need, but it might take many page visits and searches. And so far we have only located a small piece of information about a single API that we might use.
Example: Code-Specific Web Search Interface There now also exist numerous code-specific search engines on the Web. Let’s try one of them. Here we search for “output acrobat” and restrict the results to Java code by adding “lang:java”. We get some results, but the page summaries are confusing, so let’s click on one. We see a long copyright notice and a little bit of code that is totally irrelevant. Let’s again change our query to “output pdf” and restrict the results by adding “lang:java”. Again we get confusing summaries, copyright notices, and irrelevant code. We stop at this point, but note that it is very difficult to obtain the information we need using existing code-specific search engines. It is far easier with general Web search engines, because much of the information we are looking for already exists on Web pages that have been manually crafted by humans. …
Problems Information is dispersed: tutorials, API itself, documentation, pages with samples Difficult and time-consuming to … locate required pieces, get an overview of alternatives, judge relevance and quality of results, understand dependencies. Many page visits required Unfortunately, obtaining such information is not always straightforward. One problem is that the information is dispersed: there exist tutorials, the API itself in source or binary format, documentation pages in Javadoc format, and pages with code examples, such as articles or forum messages. It is therefore often difficult and time-consuming to … Programmers rarely find all information on one page: they must visit many pages and perform multiple searches.
With Assieme we … Designed a new Web search interface Developed needed inference In this talk we present Assieme, a new Web search interface designed to overcome these limitations. Assieme attempts to display all required pieces of information in a single view, and offers powerful capabilities for browsing code-related information on the Web. To do this it needs to perform some interesting inference, which I will talk about later.
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion How did we come up with this interface? We were initially interested in finding out about information needs of programmers.
Six Learning Barriers faced by Programmers (Ko et al. 04) Design barriers — What to do? Selection barriers — What to use? Coordination barriers — How to combine? Use barriers — How to use? Understanding barriers — What is wrong? Information barriers — How to check? A work that tries to answer this question is Andy Ko and Brad Myers’ paper on six learning barriers faced by programmers. There are design barriers, when programmers do not know what to do such as to conceive of an appropriate algorithm, Selection barriers, what to use, … For at least three kinds of barriers, programmers can do Web search – and these are exactly those related to APIs and those that we are addressing with Assieme.
Examining Programmer Web Queries Objective See what programmers search for Dataset 15 million queries and click-through data Random sample of MSN queries in 05/06 Procedure Extract query sessions containing ‘java’ – 2,529 Manually inspect queries and define regex filters Informal taxonomy of query sessions Next, we wanted to find out whether this is consistent with the queries programmers perform on Web search engines. We used a dataset of 15 million queries and … and filtered all query sessions containing at least one query with the keyword ‘java’ …
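As a rough illustration of the session-filtering step, here is a minimal sketch, not the scripts actually used in the study. The class name `QuerySessionFilter`, the representation of a session as a list of query strings, and the choice to match ‘java’ only as a whole word (so that “javascript” is not counted) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps query sessions that contain at least one
// query mentioning the keyword "java".
class QuerySessionFilter {
    // Assumption: match "java" as a whole word, case-insensitively,
    // so "javascript" does not count but "Java" and "java.util" do.
    private static final Pattern JAVA =
            Pattern.compile("\\bjava\\b", Pattern.CASE_INSENSITIVE);

    // True if any query in the session mentions "java".
    static boolean mentionsJava(List<String> session) {
        for (String query : session) {
            if (JAVA.matcher(query).find()) {
                return true;
            }
        }
        return false;
    }

    // Returns only the sessions with at least one "java" query.
    static List<List<String>> filter(List<List<String>> sessions) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> session : sessions) {
            if (mentionsJava(session)) {
                kept.add(session);
            }
        }
        return kept;
    }
}
```

A session such as “output pdf” followed by “output pdf java” is kept as a whole, because one of its queries contains the keyword.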
Examining Programmer Web Queries These are the results we got. The sizes of the circles correspond to the relative number of queries. Indeed, the largest category is API-related queries (followed by troubleshooting, e.g. error messages).
Examining Programmer Web Queries Descriptive (64.1 %): “java JSP current date” (Selection barrier) Contain package, type or member name (35.9 %): “java SimpleDateFormat” (Use barrier) Contain terms like “example”, “using”, “sample code” (17.9 %): “using currentdate in jsp” (Coordination barrier) Looking more closely at API-related queries, we found that 64% contained merely descriptive keywords … presumably intended to identify an appropriate API, … Of the complete set of API-related queries, 18% contained terms like “example”, “using”, …
Assieme relevance indicated by # uses documentation example code required libraries Summaries show referenced types links to related info Finally, let’s look at Assieme. We search for “output acrobat”, and the system returns some pages and also a few API packages/types/members that might be related to those keywords. We click on one, which filters our set of pages to only those containing code examples that use that API. All required information is now visible: links to Javadoc, required libraries, example code, hovering. Example counts tell us relevance.
Challenges How to put the right information on the interface? Get all programming-related data Interpret data and infer relationships
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion Let’s now talk about the Assieme Search Engine in more detail.
Assieme’s Data … is crawled using existing search engines Pages with code examples (~2,360,000): Queried Google on “java ±import ±class …” JAR files (~79,000): Downloaded library files for all projects on,,, JavaDoc pages (~480,000): Queried Google on “overview-tree.html …” Finally, let me briefly talk about Assieme’s data, which is crawled using existing search engines, separately for pages with code examples, JAR files, and JavaDoc pages.
The Assieme Search Engine … infers 2 kinds of implicit references Uses of packages, types and members (pages with code examples to JAR files) Matches of packages, types and members (JAR files to JavaDoc pages) To power its interface, Assieme infers two kinds of implicit references. The first kind is uses of packages/types/members in code examples on web pages; those packages/types/members are contained in JAR files. The second kind is matches between packages/types/members in JAR files and their respective Javadoc documentation pages. It turns out that inferring the second kind of reference is relatively easy: Javadoc pages are automatically generated from source code, so it is not difficult to parse them and re-create the matching to the contents of JAR files. The first kind is more involved, and this is what I will focus on. To find uses in code examples, we must first extract the code examples and then resolve references.
Extracting Code Samples unclear segmentation code in a different language (C++) distracting terms ‘…’ in code line numbers Extracting code examples from web pages is not trivial.
Extracting Code Samples
remove HTML commands, but preserve line breaks
<html> <head><title></title></head> <body> A simple example:<br><br> 1: import java.util.*; <br>2: class c {<br>3: HashMap m = new HashMap();<br>4: void f() { m.clear(); }<br>5: }<br><br> <a href="index.html">back</a> </body> </html>
A simple example:
1: import java.util.*;
2: class c {
3: HashMap m = new HashMap();
4: void f() { m.clear(); }
5: }
back
remove some distracters by heuristics
A simple example:
import java.util.*;
class c {
HashMap m = new HashMap();
void f() { m.clear(); }
}
back
launch (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences of statements)
Assieme extracts code examples by first removing all HTML commands from a page, while preserving line breaks. It then uses some heuristics to remove distracters. Finally, it launches a Java parser at every line break and separately attempts to parse for types, methods, and sequences of statements. The end of a code snippet is determined by tracking the state of the parser. There are more details about this in the paper.
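The first two extraction steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not Assieme’s implementation: the class name `SnippetCleaner`, the regular expressions, and the choice of line-number prefixes such as “3: ” as the only distracter handled are all hypothetical, and the error-tolerant Java parser of the third step is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of steps 1 and 2: strip HTML while preserving
// line breaks, then drop one kind of distracter (line-number prefixes).
class SnippetCleaner {
    static List<String> toLines(String html) {
        String text = html
                // <br> and common block-closing tags become line breaks
                .replaceAll("(?i)<br\\s*/?>", "\n")
                .replaceAll("(?i)</(p|div|pre|li|tr|h[1-6])>", "\n")
                // every remaining tag is removed
                .replaceAll("<[^>]*>", "")
                // decode a few common entities; &amp; last to avoid double decoding
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");

        List<String> lines = new ArrayList<>();
        for (String line : text.split("\n")) {
            // Heuristic (assumption): a leading "12: " is a line number, not code.
            String cleaned = line.replaceFirst("^\\s*\\d+\\s*:\\s*", "").trim();
            if (!cleaned.isEmpty()) {
                lines.add(cleaned);
            }
        }
        return lines;
    }
}
```

Applied to the example page from the slide, this yields the seven lines shown after the heuristic step; a parser would then be started at each of those line breaks to decide where a snippet begins and ends.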
Resolving External Code References Naïve approach of finding term matches does not work: 1 import java.util.*; 2 class c { 3 HashMap m = new HashMap(); 4 void f() { m.clear(); } 5 } Reference java.util.HashMap.clear() on line 4 only detectable by considering several lines Use compiler to identify unresolved names After it has extracted code examples, Assieme needs to resolve references to external APIs. A naïve approach one could try is to search for pure term matches. Unfortunately, this doesn’t work. In this small example, line 4 contains a reference to java.util.HashMap.clear(), but this is only detectable by combining information from several lines. We therefore use a compiler to identify unresolved names.
Resolving External Code References Index packages/types/members in JAR files (java.util.HashMap, java.util.HashMap.clear(), …) Compile & lookup: compile, collect unresolved names, index lookup, put JAR files on classpath Utility function: # covered references (and JAR popularity) Greedily pick best JARs More specifically, we first index all packages/types/members contained in the JAR files in Assieme’s data repository. Then, when we resolve external references, we first compile the code snippets, which gives us a set of unresolved names. We then do an index lookup, put the JAR files that contain the required objects onto the classpath, and attempt a re-compilation. We repeat this until we make no further progress. However, an object with a given name is often contained in many different JAR files (e.g. different versions). Assieme therefore greedily picks the best JARs, using a utility function based on the number of covered references (and JAR popularity).
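The greedy selection can be sketched roughly as below. This is a simplified illustration under stated assumptions, not Assieme’s code: the class `GreedyJarPicker` and its index format are hypothetical, the utility counts only newly covered references (JAR popularity is ignored), ties are broken by JAR name purely for determinism, and the compile and re-compile steps are replaced by a fixed set of unresolved names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: repeatedly add the JAR that covers the most
// still-unresolved names, until no JAR makes further progress.
class GreedyJarPicker {
    // jarIndex maps a JAR name to the fully qualified names it contains.
    static List<String> pick(Set<String> unresolved,
                             Map<String, Set<String>> jarIndex) {
        Set<String> remaining = new HashSet<>(unresolved);
        List<String> classpath = new ArrayList<>();

        while (!remaining.isEmpty()) {
            String best = null;
            int bestCovered = 0;
            for (Map.Entry<String, Set<String>> entry : jarIndex.entrySet()) {
                // Utility (assumption): number of remaining names this JAR covers.
                Set<String> covered = new HashSet<>(remaining);
                covered.retainAll(entry.getValue());
                int n = covered.size();
                boolean better = n > bestCovered
                        || (n > 0 && n == bestCovered
                            && entry.getKey().compareTo(best) < 0);
                if (better) {
                    best = entry.getKey();
                    bestCovered = n;
                }
            }
            if (best == null) {
                break; // no JAR covers anything more: no further progress
            }
            classpath.add(best);
            remaining.removeAll(jarIndex.get(best));
        }
        return classpath;
    }
}
```

Names that no indexed JAR contains simply stay unresolved, which corresponds to stopping once re-compilation makes no further progress.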
Scoring Existing techniques … Docs modeled as weighted term frequencies Hypertext link analysis (PageRank) … do not work well for code, because: JAR files (binary code) provide no context Source code contains few relevant keywords Structure in code important for relevance We now discuss how Assieme makes use of the implicit references it infers. Existing techniques for scoring documents (such as modeling documents as vectors of weighted term frequencies, or differentially weighting important documents by hypertext link analysis) do not work well for code, because JAR files contain few keywords and therefore lack context, source code contains few relevant keywords, and structure in code (e.g. the number of uses of objects) is important for relevance.
Using Implicit References to Improve Scoring Assieme exploits structure on Web pages (HTML hyperlinks) and structure in code (code references) Assieme tries to exploit structure on Web pages (here we see a graph of Web documents and the hyperlinks between them) and structure in code (documents on the Web can be APIs or Web pages with code samples, and there exist implicit references between them).
Scoring APIs (packages/types/members) Web pages Assieme actually contains two scoring functions: one for APIs and one for Web pages.
Scoring APIs Use text on doc pages and on pages with code samples that reference API (~ anchor text) Weight APIs by #incoming refs (~ PageRank) Web Pages Use fully qualified references (java.util.HashMap) and adjust term weights Filter pages by references Favor pages with accompanying text For scoring APIs, Assieme uses the text that appears on documentation pages and also the text on pages with code samples that use an API. This is similar to the technique of anchor text scoring (the difference being that Assieme uses implicit references rather than hyperlinks). Also, Assieme weights APIs by the number of incoming references (this is similar to PageRank, but again using implicit references rather than hyperlinks). For scoring Web pages, Assieme uses not only the terms on the page, but also fully qualified references, with weights adjusted to their frequency. Assieme also allows filtering Web pages by implicit references, and it favors pages with accompanying text over pure code.
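To make the idea of weighting APIs by incoming references concrete, here is a minimal sketch. It is not Assieme’s scoring function: the class `ApiRefCounter`, the page-to-references map, and ranking purely by the raw count of referencing pages (with alphabetical tie-breaking) are assumptions chosen for illustration, and the text-based part of the score is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: count, for each API, how many pages contain a
// code example that implicitly references it, then rank by that count.
class ApiRefCounter {
    // pageRefs maps a page URL to the fully qualified API names
    // referenced by the code examples on that page.
    static Map<String, Integer> incomingRefs(Map<String, Set<String>> pageRefs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> apis : pageRefs.values()) {
            for (String api : apis) {
                counts.merge(api, 1, Integer::sum);
            }
        }
        return counts;
    }

    // APIs ordered by descending number of incoming references;
    // ties are broken alphabetically (an arbitrary, deterministic choice).
    static List<String> rank(Map<String, Set<String>> pageRefs) {
        Map<String, Integer> counts = incomingRefs(pageRefs);
        List<String> apis = new ArrayList<>(counts.keySet());
        apis.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return apis;
    }
}
```

In a fuller scorer this count would be combined with a text match over the documentation and referencing pages, in the way PageRank is combined with anchor text for hyperlinks.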
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Evaluating Code Extraction and Reference Resolution … on 350 hand-labeled pages from Assieme’s data Code Extraction Recall 96.9%, Precision 50.1% (after filtering pages without refs: precision 76.7%) False positives: C, C#, JavaScript, PHP, FishEye/diff Reference Resolution Recall 89.6%, Precision 86.5% False positives: FishEye and diff pages False negatives: incomplete code samples To evaluate Assieme, we first analyzed the effectiveness of Assieme’s inference components and then performed a user study. We hand-labeled 350 pages from Assieme’s data. For code extraction, Assieme reaches a recall of 96.9% at a precision of 50.1%. While recall is important, precision is of less concern here, because Assieme later filters pages without refs, which increases precision to 76.7%. For reference resolution, Assieme reaches a recall of 89.6% at a precision of 86.5%.
User Study Assieme vs. Google vs. Google Code Search Design 40 search tasks based on queries in logs: query “socket java”, task “Write a basic server that communicates using Sockets” Find code samples (and required libraries) 4 blocks of 10 tasks: 1 for training + 1 per interface Participants 9 (under-)graduate students in Computer Science In our user study, we compared Assieme to Google and Google Code Search. We developed 40 search tasks based on queries found in the query logs discussed earlier. For example, from the query “socket java” we developed the search task “Write a basic server that communicates using Sockets”. Other tasks included loading a JPEG image and parsing an XML file.
User Study – Task Time * significant difference: F(1,258)=5.74, p ≈ .017
User Study – Solution Quality Rating scale: 0 = seriously flawed, .5 = generally good but fell short in critical regard, 1 = fairly complete * significant differences: F(1,258)=55.5, p < .0001; F(1,258)=6.29, p ≈ .013
User Study – # Queries Issued * significant differences: F(1,259)=6.85, p ≈ .001; F(1,259)=9.77, p ≈ .002
Outline Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion
Discussion & Conclusion Assieme – a novel web search interface Programmers obtain better solutions, using fewer queries, in the same amount of time Using Google subjects visited 3.3 pages/task, using Assieme only 0.27 pages, but 4.3 previews Ability to quickly view code samples changed participants’ strategies In this talk we presented Assieme, a novel Web search interface. We showed that using Assieme, programmers obtain better solutions, using fewer queries, in the same amount of time. We expected that programmers would need fewer queries, because Assieme combines much information. But it is interesting that programmers also obtained better solutions. Looking at click-through data, we found that subjects using Google visited 3.3 pages per task, while subjects using Assieme visited only 0.27 pages. However, Assieme shows previews of the code snippets on a page, and when we count the number of previews they saw, they actually looked at 4.3 previews per task. It thus seems that the ability to view code examples very quickly changed participants’ strategies. Using Google, they often took the first code example and prepared a solution. Using Assieme, the ease of viewing many examples encouraged them to continue exploring to find the best one.
Thank You Raphael Hoffmann Computer Science & Engineering University of Washington James Fogarty Computer Science & Engineering University of Washington Daniel S. Weld Computer Science & Engineering University of Washington This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-1-0147, SRI International under CALO grant 03-000225 and the Washington Research Foundation / TJ Cable Professorship.
Search is fundamental in modern User Interfaces Visualizing search results [Paek et al. 04] Finding personal information [Cutrell et al. 06] Augmenting structured sites [Huynh et al. 06] Summarizing search sessions [Dontcheva et al. 06] Invoking commands in programs [Little et al. 06] Let me start my talk by saying that search is now fundamental in modern user interface software. We have seen numerous ideas on search at UIST and CHI in recent years, among them work on visualizing search results (for example the WaveLens project), finding personal information (the Phlat project), augmenting structured web sites with filtering and sorting capabilities on the client side, summarizing search sessions, and invoking keyword commands in desktop applications.
User Study - Feedback