Improving Software Package Search Quality Dan Fingal and Jamie Nicolson.

The Problem
Search engines for software packages typically perform poorly:
- They tend to search only the project name and short blurb
- They don't take quality metrics into account
- The result is a poor user experience

Improvements in Indexing
- Index more text relating to each package
- Every package is associated with a website that contains much more detailed information about it
- We spidered those sites and stripped away the HTML for indexing
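The slides don't show the spidering code; as a rough sketch of the HTML-stripping step (the class and function names here are hypothetical, and the real system's parser may have differed), Python's standard-library `HTMLParser` can pull out just the visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, skipping tags,
    scripts, and stylesheets."""
    def __init__(self):
        super().__init__()
        self._skip = 0          # depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def strip_html(page: str) -> str:
    """Return the indexable text of a fetched package homepage."""
    parser = TextExtractor()
    parser.feed(page)
    return " ".join(parser.chunks)
```

The stripped text would then be fed to the indexer alongside the package name and blurb.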

Improvements in Quality Metrics
Software download sites hold additional data in their repositories:
- Popularity: frequency of download
- Rating: user-supplied feedback
- Vitality: how active development is
- Version: how stable the package is
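These metrics arrive on very different scales (download counts vs. star ratings), so before they can be compared or weighted they need rescaling. A minimal sketch, assuming simple min-max normalization (the slides don't say which normalization the system used):

```python
def normalize(values):
    """Rescale raw metric values (e.g. download counts) to [0, 1]
    via min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```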

System Architecture
- Pulled data from Freshmeat.net and Gentoo.org into a MySQL database
- Spidered the associated homepages and extracted their text
- Indexed the text with Lucene, using Porter stemming and a stop-word list
- Used similarity metrics as presented in CS276A
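The system used Lucene's analyzers for this step; purely to illustrate the idea, here is a toy Python analyzer with a stop-word filter and a crude suffix-stripping stemmer (the suffix rules are only a rough stand-in for the real Porter algorithm, and the stop-word list is abbreviated):

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is"}

def crude_stem(word: str) -> str:
    """Very rough stand-in for Porter stemming: strip a few common suffixes,
    keeping at least a three-letter stem."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text: str) -> list:
    """Lowercase, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Both queries and documents pass through the same analyzer, so "searching" in a query matches "searched" in a package description.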

Ranking Function
- Lucene returns an ordered list of documents (packages)
- We copy this list and re-order each copy by one of the quality metrics
- Scaled footrule aggregation combines the lists into a single results list
- Both the scaled-footrule weights and the Lucene field parameters can be varied

Scaled Footrule Aggregation
- The scaled footrule distance between two rankings of an item is the difference in its normalized position. E.g., one list puts an item 30% of the way down and another puts it 70% down: the scaled footrule is |0.3 − 0.7| = 0.4
- The scaled footrule distance between two lists is the sum of these differences over the items they have in common
- The cost of a candidate aggregate list is the sum of its distances to each of the input rankings
- A minimum-cost maximum matching on the bipartite graph between items and rank positions finds the optimal aggregate list
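The definitions above can be sketched directly in Python. For clarity this sketch brute-forces the optimal aggregate over all permutations, which is only feasible for tiny lists; the slide's bipartite min-cost matching (e.g. the Hungarian algorithm) is what makes the real optimization tractable:

```python
from itertools import permutations

def scaled_positions(ranking):
    """Map each item to its normalized position in (0, 1]."""
    n = len(ranking)
    return {item: (i + 1) / n for i, item in enumerate(ranking)}

def footrule_distance(candidate, rankings):
    """Sum over input rankings of the per-item differences in
    normalized position, counting only items the lists share."""
    cand = scaled_positions(candidate)
    total = 0.0
    for r in rankings:
        pos = scaled_positions(r)
        total += sum(abs(cand[x] - pos[x]) for x in cand if x in pos)
    return total

def aggregate(items, rankings):
    """Brute-force the footrule-optimal ordering (tiny lists only;
    the real system would use min-cost bipartite matching instead)."""
    return min(permutations(items),
               key=lambda c: footrule_distance(list(c), rankings))
```

For example, aggregating `["a","b","c"]` and `["a","c","b"]` keeps "a" in first place, since both input rankings agree on it.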

Measuring Success
- Created a gold corpus of 50 queries mapped to relevant packages
- One "best" answer per query
- Measured recall within the first 5 results (0 or 1 for each query)
- Compared results with search on packages.gentoo.org, freshmeat.net, and google.com
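With one gold answer per query, recall within the first five results reduces to a hit/miss count. A minimal sketch of that metric (the data-structure shapes here are assumptions, not the project's actual evaluation code):

```python
def recall_at_5(gold, results_by_query):
    """gold: query -> single best package.
    results_by_query: query -> ranked list of returned packages.
    Each query scores 1 if its gold answer appears in the top
    five results, else 0; return the mean over all queries."""
    hits = sum(1 for q, best in gold.items()
               if best in results_by_query.get(q, [])[:5])
    return hits / len(gold)
```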

Sample Gold Queries
- "C compiler": GCC
- "financial accounting": GnuCash
- "wireless sniffer": Kismet
- "videoconferencing software": GnomeMeeting
- "wiki system": Tiki CMS/Groupware

Training Ranking Parameters
- Boils down to choosing an optimal weight for each individual ranking metric
- Finding an analytic solution seemed too hairy, so we used local hill climbing with random restarts
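Hill climbing with random restarts can be sketched as follows, assuming a black-box `score` function that evaluates a weight vector on the training queries (the step size, restart count, and non-negativity constraint are illustrative choices, not details from the slides):

```python
import random

def hill_climb(score, n_params, restarts=10, steps=200, step_size=0.1, seed=0):
    """Local search over metric weights: perturb one weight at a time,
    keep the move only if the training-set score improves, and restart
    from fresh random points to escape local maxima."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(restarts):
        w = [rng.random() for _ in range(n_params)]
        s = score(w)
        for _ in range(steps):
            i = rng.randrange(n_params)
            cand = list(w)
            cand[i] = max(0.0, cand[i] + rng.uniform(-step_size, step_size))
            cand_s = score(cand)
            if cand_s > s:
                w, s = cand, cand_s
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

With only five weights to tune, each restart converges quickly, which is what makes this simple approach workable here.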

Training Results
- Uniform weights gave a recall of 29/50 (58%)
- The best hill-climbing weights gave a recall of 37/50 (74%)
- We are testing on the training set, but with only five parameters the risk of overfitting seems low
- Popularity was the highest-weighted feature

Comparing to Google
Google is a general search engine, not specific to software search.
- How to help Google find software packages? Append "Linux Software" to the queries
- How to judge the results? Does the package name appear in the page? Does it appear in the search results?

Room for Improvement
Gold corpus:
- Are these queries really representative?
- Are our answers really the best?
- Did we choose samples we knew would work well with our method?
Training process:
- We could probably come up with better training algorithms

Any questions?