Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers

Presentation transcript:

Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers. Raphael Hoffmann, James Fogarty, Daniel S. Weld, University of Washington, Seattle. UIST 2007. I am Raphael Hoffmann, and this is joint work with James Fogarty and Dan Weld at the University of Washington. My talk is about Assieme, a new Web search interface for programmers.

Programmers Use Search: to identify an API; to seek information about an API; to find examples of how to use an API. Example task: “Programmatically output an Acrobat PDF file in Java.” This talk will extend this list. It is about search as performed by programmers. As we confirmed in interviews with programmers, they frequently search the Web to identify an API that they can use in their project, to seek more information about an API (such as documentation pages), and to find examples of how to use an API (many pages contain short code snippets that are very valuable to programmers).

Example: General Web Search Interface. Let’s first look again at our example of outputting an Acrobat PDF file in Java. We could use a general Web search interface and search for “output acrobat”. This query is too general, so let’s add the keyword “java”. Still nothing relevant. So let’s modify the query to “output pdf java”. OK, the first two hits seem very relevant. The first one is a long article on using an API for generating PDF output in Java. The article also contains some code snippets. It says the first step is to create a Document object. However, the code sample is incomplete: it doesn’t say which package contains the class Document, and we also cannot look up the documentation of Document, for example to choose a different constructor that better suits our needs. We could do a new search (here we added the class name to our last query; that didn’t work so well, but perhaps we could try the name of the library). However, looking carefully at our article again, we might also find a link to more information about the library. Navigating through another 4 pages finally brings us to the information we are looking for. In summary: a general Web search engine certainly gets us the information we need, but it might take many page visits and searches. And so far we have only located a small piece of information about a single API that we might use.

Example: Code-Specific Web Search Interface. There are now also numerous code-specific search engines on the Web. Let’s try one of them. Here we search for “output acrobat” and restrict the results to Java code by adding “lang:java”. We get some results, but the page summaries are confusing, so let’s click on one: a long copyright notice and a little bit of code that is totally irrelevant. Let’s again change our query to “output pdf” and restrict the results by adding “lang:java”. Again we get confusing summaries, copyright notices, and irrelevant code. We stop at this point and conclude that it is very difficult to obtain the information we need using existing code-specific search engines. It is far easier with general Web search engines, because much of the information we are looking for already exists on Web pages that have been manually crafted by humans. …

Problems. Information is dispersed: tutorials, the API itself, documentation, pages with samples. It is difficult and time-consuming to locate required pieces, get an overview of alternatives, judge relevance and quality of results, and understand dependencies. Many page visits are required. Unfortunately, obtaining such information is not always straightforward. One problem is that the information is dispersed: there exist tutorials, the API itself in source or binary format, documentation pages in Javadoc format, or simply pages with code examples, such as articles or forum messages. It is therefore often difficult and time-consuming to locate the required pieces, get an overview of the alternatives, judge the relevance and quality of results, and understand dependencies. Programmers rarely find all the information on one page: they must visit many pages and perform multiple searches.

With Assieme we … designed a new Web search interface and developed the needed inference. In this talk we present Assieme, a new Web search interface designed to overcome these limitations. Assieme attempts to display all required pieces of information in a single view, and offers powerful capabilities for browsing code-related information on the Web. To do this it needs to perform some interesting inference, which I will talk about later.

Outline: Motivation; What Programmers Search For; The Assieme Search Engine; Inferring Implicit References; Using Implicit References for Scoring; Evaluation of Inference & User Study; Discussion & Conclusion. How did we come up with this interface? We were initially interested in finding out about the information needs of programmers.

Six Learning Barriers Faced by Programmers (Ko et al. 04). Design barriers: What to do? Selection barriers: What to use? Coordination barriers: How to combine? Use barriers: How to use? Understanding barriers: What is wrong? Information barriers: How to check? One work that tries to answer this question is Andy Ko and Brad Myers’ paper on six learning barriers faced by programmers. There are design barriers, when programmers do not know what to do, such as how to conceive of an appropriate algorithm; selection barriers, what to use; … For at least three kinds of barriers, programmers can use Web search, and these are exactly the ones related to APIs and the ones that we are addressing with Assieme.

Examining Programmer Web Queries. Objective: see what programmers search for. Dataset: 15 million queries and click-through data, a random sample of MSN queries in 05/06. Procedure: extract query sessions containing ‘java’ (2,529); manually inspect queries and define regex filters; build an informal taxonomy of query sessions. Next, we wanted to find out whether this is consistent with the queries performed by programmers on Web search engines. We used a dataset of 15 million queries and click-through data, and filtered all query sessions containing at least one query with the keyword ‘java’ …
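The session-filtering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not describe the log format, so the representation of a session as a list of query strings, the helper name `sessions_mentioning`, and the choice to match ‘java’ as a whole word (which excludes “javascript”) are all hypothetical.

```python
import re

# Assumption: 'java' must appear as a whole word, so "javascript" does not count.
JAVA_TERM = re.compile(r"\bjava\b", re.IGNORECASE)


def sessions_mentioning(sessions, pattern=JAVA_TERM):
    """Keep every session in which at least one query matches the pattern."""
    return [s for s in sessions if any(pattern.search(q) for q in s)]


# Hypothetical sessions: each one is a list of the queries a user issued.
sessions = [
    ["output acrobat", "output pdf java"],
    ["cheap flights seattle"],
    ["Java SimpleDateFormat"],
    ["javascript alert box"],
]
java_sessions = sessions_mentioning(sessions)
```

A whole session is kept as soon as one of its queries mentions the keyword, which matches the “at least one query” wording in the notes.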

Examining Programmer Web Queries. These are the results we got. The sizes of the circles correspond to the relative number of queries. Indeed, the largest category is API-related queries, followed by troubleshooting (e.g. error messages).

Examining Programmer Web Queries. API-related queries: 64.1% are descriptive, e.g. “java JSP current date” (selection barrier); 35.9% contain a package, type or member name, e.g. “java SimpleDateFormat” (use barrier); 17.9% contain terms like “example”, “using”, “sample code”, e.g. “using currentdate in jsp” (coordination barrier). Looking more closely at API-related queries, we found that 64% contained merely descriptive keywords … presumably intended to identify an appropriate API, … Of the complete set of API-related queries, 18% contained terms like “example”, “using”, …
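A rough sketch of what such regex filters might look like follows. The actual filters used in the study are not given in the talk, so both patterns and the function name `describe_query` are assumptions made purely for illustration: an identifier is approximated as a dotted name or a CamelCase word, and example-seeking queries are detected by the terms quoted on the slide.

```python
import re

# Assumption: terms that signal a request for example code (from the slide).
EXAMPLE_TERMS = re.compile(r"\b(?:examples?|using|sample code)\b", re.IGNORECASE)

# Assumption: an API name looks like a dotted name (java.util.HashMap)
# or a CamelCase word with at least two parts (SimpleDateFormat).
IDENTIFIER = re.compile(
    r"\b(?:[a-z_]\w*\.)+[A-Za-z_]\w*\b"
    r"|\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"
)


def describe_query(query):
    """Label a query with the two independent properties from the slide."""
    names_api = bool(IDENTIFIER.search(query))
    return {
        "kind": "names API element" if names_api else "descriptive",
        "asks_for_example": bool(EXAMPLE_TERMS.search(query)),
    }
```

The two labels are kept independent because the percentages on the slide overlap: a query can be descriptive and ask for an example at the same time.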

Assieme. Interface callouts: relevance indicated by # uses; documentation; example code; required libraries; summaries show referenced types; links to related info. Finally, let’s look at Assieme. We search for “output acrobat”, and the system returns some pages and also a few API packages/types/members that might be related to those keywords. We click on one, which filters our set of pages to only those containing code examples that use that API. All required information is now visible: links to Javadoc, required libraries, example code, and hovering; example counts tell us relevance.

Challenges: How to put the right information on the interface? Get all programming-related data. Interpret data and infer relationships.

Outline: Motivation; What Programmers Search For; The Assieme Search Engine; Inferring Implicit References; Using Implicit References for Scoring; Evaluation of Inference & User Study; Discussion & Conclusion. Let’s now talk about the Assieme Search Engine in more detail.

Assieme’s Data … is crawled using existing search engines. Pages with code examples (~2,360,000): queried Google on “java ±import ±class …”. JAR files (~79,000): downloaded library files for all projects on Sun.com, Apache.org, Java.net, SourceForge.net. JavaDoc pages (~480,000): queried Google on “overview-tree.html …”. Finally, let me briefly talk about Assieme’s data, which is crawled using existing search engines, separately for pages with code examples, JAR files, and JavaDoc pages.

The Assieme Search Engine … infers 2 kinds of implicit references. Uses of packages, types and members: from pages with code examples to JAR files. Matches of packages, types and members: between JAR files and JavaDoc pages. To power its interface, Assieme infers two kinds of implicit references. One kind is uses of packages/types/members in code examples on Web pages; those packages/types/members are contained in JAR files. The other kind is matches between packages/types/members in JAR files and their respective Javadoc documentation pages. It turns out that inferring the second kind of reference is relatively easy: Javadoc pages are automatically generated from source code, so it is not difficult to parse them and re-create the matching to the contents of JAR files. The first kind is more involved, and it is what I will focus on. To find uses in code examples, we must first extract the code examples and then resolve references.

Extracting Code Samples. Difficulties: unclear segmentation; code in a different language (C++); distracting terms; ‘…’ in code; line numbers. Extracting code examples from Web pages is not trivial.

Extracting Code Samples. Step 1: remove HTML commands, but preserve line breaks. Step 2: remove some distracters by heuristics. Step 3: launch an (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences of statements).
Original page:
    <html> <head><title></title></head> <body> A simple example:<br><br> 1: import java.util.*; <br>2: class c {<br>3: HashMap m = new HashMap();<br>4: void f() { m.clear(); }<br>5: }<br><br> <a href="index.html">back</a> </body> </html>
After removing HTML commands:
    A simple example:
    1: import java.util.*;
    2: class c {
    3: HashMap m = new HashMap();
    4: void f() { m.clear(); }
    5: }
    back
After removing distracters:
    A simple example:
    import java.util.*;
    class c {
    HashMap m = new HashMap();
    void f() { m.clear(); }
    }
    back
Assieme extracts code examples by first removing all HTML commands from a page, while preserving line breaks. It then uses some heuristics to remove distracters. Finally, it launches a Java parser at every line break and separately attempts to parse for types, methods, and sequences of statements. The end of a code snippet is determined by tracking the state of the parser. There are more details about this in the paper.
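The first two steps can be approximated with a short Python sketch. This is not Assieme’s implementation: the regular expressions, the function names `html_to_lines` and `strip_line_numbers`, and the majority rule for deciding whether leading numbers are line numbers are assumptions chosen to reproduce the slide’s example. The third step (the error-tolerant Java parser) is not shown.

```python
import html
import re


def html_to_lines(page):
    """Drop markup but keep the line structure implied by <br> and block tags."""
    # Remove elements whose content is never visible text.
    text = re.sub(r"(?is)<(script|style|head)\b.*?</\1>", "", page)
    # Turn line-breaking tags into real newlines before stripping the rest.
    text = re.sub(r"(?i)<br\s*/?>|</(p|div|pre|li|tr|h[1-6])>", "\n", text)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    return [line.strip() for line in text.splitlines() if line.strip()]


# A leading number, optionally followed by ':', '.' or ')', then whitespace.
LINE_NUMBER = re.compile(r"^\d+\s*[:.)]?\s+")


def strip_line_numbers(lines):
    """Heuristic: strip leading numbers only if most lines carry one."""
    numbered = sum(1 for line in lines if LINE_NUMBER.match(line))
    if numbered * 2 <= len(lines):
        return lines
    return [LINE_NUMBER.sub("", line) for line in lines]


page = (
    "<html> <head><title></title></head> <body> A simple example:<br><br> "
    "1: import java.util.*; <br>2: class c {<br>3: HashMap m = new HashMap();"
    "<br>4: void f() { m.clear(); }<br>5: }<br><br> "
    '<a href="index.html">back</a> </body> </html>'
)
cleaned = strip_line_numbers(html_to_lines(page))
```

On the slide’s page, `cleaned` holds the same seven lines shown after distracter removal; the surrounding prose lines (“A simple example:”, “back”) remain and would be discarded only by the parsing step.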

Resolving External Code References. The naïve approach of finding term matches does not work:
    1 import java.util.*;
    2 class c {
    3   HashMap m = new HashMap();
    4   void f() { m.clear(); }
    5 }
The reference java.util.HashMap.clear() on line 4 is only detectable by considering several lines. Use a compiler to identify unresolved names. After it has extracted code examples, Assieme needs to resolve references to external APIs. A naïve approach one could try is to search for pure term matches. Unfortunately, this doesn’t work. In this small example, line 4 contains a reference to java.util.HashMap.clear(), but this is only detectable by combining information from several lines. We therefore use a compiler to identify unresolved names.
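To see why several lines are needed, consider the toy resolver below. It is emphatically not what Assieme does (Assieme relies on a compiler); the regex-based matching, the one-entry `API_INDEX`, and the function name `resolve_calls` are all assumptions for illustration. It combines the wildcard import (line 1), the declared type of `m` (line 3), and the call (line 4) to recover the fully qualified reference.

```python
import re

# Toy index of known members, assumed for this example only.
API_INDEX = {"java.util.HashMap": {"clear", "put", "get"}}


def resolve_calls(code, index=API_INDEX):
    """Resolve var.method() calls via wildcard imports and declared types."""
    packages = re.findall(r"import\s+([\w.]+)\.\*\s*;", code)
    # Declarations of the form "<Type> <name> =" or "<Type> <name>;".
    var_types = {
        name: typ
        for typ, name in re.findall(r"\b([A-Z]\w*)\s+(\w+)\s*[=;]", code)
    }
    resolved = set()
    for var, method in re.findall(r"\b(\w+)\.(\w+)\s*\(", code):
        simple = var_types.get(var)
        if simple is None:
            continue  # type of the receiver is unknown
        for pkg in packages:
            qualified = f"{pkg}.{simple}"
            if method in index.get(qualified, ()):
                resolved.add(f"{qualified}.{method}()")
    return resolved


snippet = """import java.util.*;
class c {
  HashMap m = new HashMap();
  void f() { m.clear(); }
}"""
references = resolve_calls(snippet)
naive_hit = "java.util.HashMap.clear" in snippet
```

The naive substring test finds nothing because the fully qualified name never appears in the snippet, while the multi-line resolver recovers it. A real compiler handles scoping, inheritance, and overloads that this sketch ignores.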

Resolving External Code References. Index packages/types/members in JAR files. Compile & lookup: compile the snippet to get unresolved names (e.g. java.util.HashMap, java.util.HashMap.clear()), look them up in the index, put the matching JAR files on the classpath, and compile again. Greedily pick the best JARs; utility function: # covered references (and JAR popularity). More specifically, we first index all packages/types/members contained in the JAR files in Assieme’s data repository. Then, when we resolve external references, we first compile the code snippets, which gives us a set of unresolved names. Then we do an index lookup, put the JAR files that contain the required objects onto the classpath, and attempt a re-compilation. We repeat this until we make no further progress. However, an object with a given name is often contained in many different JAR files (e.g. different versions), so Assieme greedily picks the best JARs, using a utility function based on the number of covered references and JAR popularity.
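The greedy choice of JAR files can be sketched as a set-cover loop. The talk only states that the utility is the number of covered references together with JAR popularity; how the two are combined is not specified, so using popularity as a tie-breaker is an assumption here, as are the function name `pick_jars`, the data layout, and the made-up JAR and class names. The compile and re-compile cycle is replaced by a plain index lookup.

```python
def pick_jars(unresolved, jar_index, popularity=None):
    """Greedily add the JAR that covers the most still-unresolved names.

    jar_index maps a JAR name to the set of names it defines.
    Popularity only breaks ties (an assumption, not from the talk).
    """
    popularity = popularity or {}
    remaining = set(unresolved)
    chosen = []
    while remaining:
        candidates = [jar for jar in jar_index if jar not in chosen]
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda jar: (len(jar_index[jar] & remaining), popularity.get(jar, 0)),
        )
        if not jar_index[best] & remaining:
            break  # no further progress is possible
        chosen.append(best)
        remaining -= jar_index[best]
    return chosen, remaining


# Hypothetical JARs and names, for illustration only.
jar_index = {
    "pdf-lib-1.0.jar": {"org.example.pdf.Document", "org.example.pdf.Writer"},
    "pdf-lib-0.9.jar": {"org.example.pdf.Document"},
    "xml-lib.jar": {"org.example.xml.Parser"},
}
popularity = {"pdf-lib-1.0.jar": 10, "pdf-lib-0.9.jar": 3, "xml-lib.jar": 5}
unresolved = {
    "org.example.pdf.Document",
    "org.example.pdf.Writer",
    "org.example.xml.Parser",
    "org.example.missing.Thing",
}
chosen, still_unresolved = pick_jars(unresolved, jar_index, popularity)
```

The loop stops when no remaining JAR covers a new name, which mirrors the “repeat until we make no further progress” rule in the notes; names that no JAR defines are returned as still unresolved.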

Scoring. Existing techniques (documents modeled as weighted term frequencies; hypertext link analysis such as PageRank) do not work well for code, because: JAR files (binary code) provide no context; source code contains few relevant keywords; structure in code is important for relevance. We now discuss how Assieme makes use of the implicit references it infers. Existing techniques for scoring documents (such as modeling documents as vectors of weighted term frequencies, or differentially weighting important documents by hypertext link analysis) do not work well for code, because JAR files contain few keywords and therefore lack context, source code contains few relevant keywords, and structure in code (e.g. the number of uses of objects) is important for relevance.

Using Implicit References to Improve Scoring. Assieme exploits structure on Web pages (HTML hyperlinks) and structure in code (code references). Assieme tries to exploit structure on Web pages (here we see a graph of Web documents and the hyperlinks between them) and structure in code (documents on the Web can be APIs or Web pages with code samples, and there exist implicit references between them).

Scoring: APIs (packages/types/members) and Web pages. Assieme actually contains two scoring functions: one for APIs and one for Web pages.

Scoring. APIs: use text on doc pages and on pages with code samples that reference the API (~ anchor text); weight APIs by # incoming refs (~ PageRank). Web pages: use fully qualified references (java.util.HashMap) and adjust term weights; filter pages by references; favor pages with accompanying text. For scoring APIs, Assieme uses the text that appears on documentation pages and also the text on pages with code samples that use an API. This is similar to the technique of anchor text scoring (the difference being that Assieme uses implicit references rather than hyperlinks). Also, Assieme weights APIs by the number of incoming references (this is similar to PageRank, but again using implicit references rather than hyperlinks). For scoring Web pages, Assieme uses not only the terms on the page, but also fully qualified references, with weights adjusted to their frequency. Assieme also allows filtering Web pages by implicit references, and it favors pages with accompanying text over pure code.
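The idea behind the API score can be illustrated with a deliberately simple sketch. The talk does not give Assieme’s scoring formula, so everything below is an assumption: the function name `score_api`, the raw term-count overlap, and the logarithmic boost for incoming references are stand-ins that only show how text from referencing pages and the reference count can both feed into one score.

```python
import math
from collections import Counter


def score_api(query, doc_text, referencing_pages):
    """Toy score: query-term overlap with pooled text, boosted by incoming refs.

    doc_text is the API's documentation text; referencing_pages holds the text
    of pages whose code samples use the API (playing the role of anchor text).
    """
    pooled = (doc_text + " " + " ".join(referencing_pages)).lower().split()
    counts = Counter(pooled)
    overlap = sum(counts[term] for term in query.lower().split())
    # More incoming implicit references give a larger, slowly growing boost.
    return overlap * math.log(2 + len(referencing_pages))


well_used = score_api(
    "output pdf",
    "Writes a document",
    ["how to output a pdf file", "pdf output example"],
)
unused = score_api("output pdf", "pdf helper", [])
```

Here the first API’s own documentation never mentions the query terms, yet it outranks the second because pages that use it do, which is the effect the anchor-text analogy describes.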

Evaluating Code Extraction and Reference Resolution … on 350 hand-labeled pages from Assieme’s data. Code extraction: recall 96.9%, precision 50.1% (76.7% after filtering pages without refs); false positives: C, C#, JavaScript, PHP, FishEye/diff. Reference resolution: recall 89.6%, precision 86.5%; false positives: FishEye and diff pages; false negatives: incomplete code samples. To evaluate Assieme, we first analyzed the effectiveness of Assieme’s inference components and then performed a user study. We hand-labeled 350 pages from Assieme’s data. For code extraction, Assieme reaches a recall of 96.9% at a precision of 50.1%. While recall is important, precision is of less concern here, because Assieme later filters out pages without refs, which increases precision to 76.7%. For reference resolution, Assieme reaches a recall of 89.6% at a precision of 86.5%.

User Study: Assieme vs. Google vs. Google Code Search. Design: 40 search tasks based on queries in logs, e.g. the query “socket java” became “Write a basic server that communicates using Sockets”; find code samples (and required libraries); 4 blocks of 10 tasks: 1 for training + 1 per interface. Participants: 9 (under-)graduate students in Computer Science. In our user study, we compared Assieme to Google and Google Code Search. We developed 40 search tasks based on queries found in the query logs discussed earlier. For example, from the query “socket java” we developed the search task “Write a basic server that communicates using Sockets”. Other tasks included loading a JPEG image and parsing an XML file.

User Study – Task Time. [Chart not reproduced.] Significant difference (*): F(1,258)=5.74, p ≈ .017.

User Study – Solution Quality. Rating scale: 0 = seriously flawed; .5 = generally good but fell short in a critical regard; 1 = fairly complete. [Chart not reproduced.] Significant differences (*): F(1,258)=55.5, p < .0001; F(1,258)=6.29, p ≈ .013.

User Study – # Queries Issued. [Chart not reproduced.] Significant differences (*): F(1,259)=6.85, p ≈ .001; F(1,259)=9.77, p ≈ .002.

Discussion & Conclusion. Assieme: a novel Web search interface. Programmers obtain better solutions, using fewer queries, in the same amount of time. Using Google, subjects visited 3.3 pages/task; using Assieme, only 0.27 pages, but 4.3 previews. The ability to quickly view code samples changed participants’ strategies. In this talk we presented Assieme, a novel Web search interface. We showed that using Assieme, programmers obtain better solutions, using fewer queries, in the same amount of time. We expected that programmers would need fewer queries because Assieme combines much information. But it is interesting that programmers also obtained better solutions. Looking at click-through data, we found that using Google, subjects visited 3.3 pages per task, while using Assieme they visited only 0.27 pages. However, Assieme shows previews of the code snippets on a page, and when we count the number of previews they saw, they actually looked at 4.3 previews per task. It thus seems that the ability to view code examples very quickly changed participants’ strategies. Using Google, they often took the first code example and prepared a solution. Using Assieme, the ease of viewing many examples encouraged them to continue exploring to find the best one.

Thank You. Raphael Hoffmann, Computer Science & Engineering, University of Washington, raphaelh@cs.washington.edu. James Fogarty, Computer Science & Engineering, University of Washington, jfogarty@cs.washington.edu. Daniel S. Weld, Computer Science & Engineering, University of Washington, weld@cs.washington.edu. This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-1-0147, SRI International under CALO grant 03-000225 and the Washington Research Foundation / TJ Cable Professorship.

Search is fundamental in modern User Interfaces. Visualizing search results [Paek et al. 04]; finding personal information [Cutrell et al. 06]; augmenting structured sites [Huynh et al. 06]; summarizing search sessions [Dontcheva et al. 06]; invoking commands in programs [Little et al. 06]. Let me start my talk by saying that search is now fundamental in modern user interface software. We have seen numerous ideas on search at UIST and CHI in recent years, among them work on visualizing search results (for example the WaveLens project), finding personal information (the Phlat project), augmenting structured Web sites with client-side filtering and sorting capabilities, summarizing search sessions, and invoking keyword commands in desktop applications.

User Study - Feedback