Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University MUDABlue: An Automatic.

Presentation transcript:

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

MUDABlue: An Automatic Categorization System for Open Source Repositories
Shinji Kawaguchi†, Pankaj K. Garg††, Makoto Matsushita†, Katsuro Inoue†
† Osaka University, Japan
†† Zee Source, USA

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University 2004/12/01 APSEC

Software Repository
A "software repository" archives many software systems together with their source code.
Such repositories have become very common in recent years.
In the open source community:
  They provide platforms for many open source projects.
  E.g. SourceForge
In an industrial context:
  They archive the software systems created in a company.
  They share information about projects that exist (or existed) in the company.
  They are especially useful for large, distributed organizations.
  E.g. Corporate Source*, Progressive Open Source**
* J. Dinkelacker and P. Garg. "Corporate Source: Applying Open Source Concepts to a Corporate Environment (Position Paper)". In Proceedings of the 1st ICSE International Workshop on Open Source Software Engineering, May 15, 2001, Toronto, Canada.
** J. Dinkelacker, P. Garg, D. Nelson, and R. Miller. "Progressive Open Source". In Proceedings of the International Conference on Software Engineering, Orlando, Florida, 2002.

Background
A software repository is also used for:
  finding a software system that meets a particular need;
  finding source code related to products currently under development.
A repository generally holds many software systems.
  SourceForge hosted nearly 100,000 projects.
Categorization is essential for finding software.
At present, software systems are categorized manually.
  A repository manager defines a hierarchical category structure.
  A software developer chooses a suitable category for each software system.

Problem
Inflexible and exclusive classification
  Software systems are generally categorized by their use.
  Classification by the libraries or architecture a system depends on is also valuable to users.
  A software system has various aspects.
Building a hierarchical category structure requires a huge amount of work.
  Improving it requires comprehensive knowledge of various libraries and architectures.
  The load on the repository manager becomes high.

Nonexclusive categorization
[Figure: four software systems (Software 1 to Software 4), each labelled with several aspects drawn from: Editor, Spreadsheet, GUI (MFC), GUI (GTK), support for regular expression. The resulting nonexclusive categories are Editor, Spreadsheet, MFC, GTK, and regexp.]
If you do not have knowledge about these libraries and architectures, you cannot prepare such categories.

Research Aim
MUDABlue: an automatic categorization system for software repositories
  Nonexclusive categorization that takes account of the various aspects of a software system.
  Identifies the libraries and architecture a system depends on, and classifies software systems automatically.
  Uses only source code.
  MUDABlue does not require comprehensive knowledge about the software systems.

Classification by identifiers
Identifiers imply the behavior of source code.
  Statements that contain the identifier "window" are likely to be related to some kind of GUI operation.
Group identifiers that are highly related and treat each group as one category.
[Figure: Software 1 (Editor, GUI (MFC)) and Software 3 (Spreadsheet, GUI (MFC)) contain identifiers such as window, cmdButton, and menuBar, which are grouped into the category MFC.]

Latent Semantic Analysis (LSA)
We employ Latent Semantic Analysis (LSA) to calculate the similarity between identifiers.
LSA:
  was proposed for calculating the similarity between documents or terms in natural language;
  is based on the vector space model;
  can detect similarity between documents that share only highly related (but not identical) words.
    The original vector space model cannot detect such relationships.

Example of LSA
Make a word-by-document matrix.
Similarities between words (or documents) are represented by the cosine of two vectors.
[Figure: documents Doc1 to Doc6 containing words A to H are converted into a word-by-document matrix with document vectors and word vectors, which is then processed by LSA.]
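As a rough illustration of the cosine measure mentioned on this slide, the following Python sketch compares word vectors taken from a tiny word-by-document matrix. The matrix values and the word names are invented for illustration; they are not the data shown in the slide's figure.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical word-by-document matrix: one row per word, one column per document.
word_vectors = {
    "window": [2, 0, 1],
    "menuBar": [4, 0, 2],
    "yyparse": [0, 3, 0],
}

sim_gui = cosine(word_vectors["window"], word_vectors["menuBar"])
sim_other = cosine(word_vectors["window"], word_vectors["yyparse"])
```

Words that occur in the same documents in the same proportions get a cosine close to 1, while words that never share a document get 0.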

Singular Value Decomposition
SVD reduces the dimensions of the matrix with minimum mean square error.
Reducing the dimensions of high-dimensional data:
  reduces the data size;
  merges similar data into one dimension.
[Figure: 2-dimensional data (a, b) reduced to 1-dimensional data (l).]
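The reduction from 2-dimensional data (a, b) to a single coordinate l can be sketched in Python as follows. This is only a minimal illustration under the assumption that the data are projected onto the best-fitting line through the origin, which is what a rank-1 SVD of an n-by-2 matrix does; the function names and sample points are hypothetical.

```python
import math

def best_direction(points):
    """Angle of the line through the origin that minimises the squared
    distance to the 2-D points (the first right singular vector)."""
    sxx = sum(a * a for a, _ in points)
    syy = sum(b * b for _, b in points)
    sxy = sum(a * b for a, b in points)
    return 0.5 * math.atan2(2 * sxy, sxx - syy)

def reduce_to_1d(points):
    """Return the 1-D coordinates l and their 2-D approximations."""
    theta = best_direction(points)
    c, s = math.cos(theta), math.sin(theta)
    coords = [a * c + b * s for a, b in points]   # 1-D coordinate l of each point
    approx = [(l * c, l * s) for l in coords]     # the same points mapped back to 2-D
    return coords, approx

# Sample data lying exactly on the line b = 2a, so nothing is lost.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
coords, approx = reduce_to_1d(points)
```

When the points do not lie exactly on a line, the approximation differs from the input, and the chosen direction is the one that keeps the squared error smallest.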

Effect of LSA
Documents that are only indirectly related show high similarity.
LSA makes the trends among documents clear.
[Figure: similarities between all pairs of documents, before LSA and after LSA.]

Proposed Method (1/2): Preparing the Matrix
1. Extract identifiers
2. Make identifier-by-software matrix
3. Remove stand-off identifiers and common identifiers
[Figure: identifiers are extracted from Soft1 to Soft6 and arranged in a matrix; the matrix covers identifiers A to J before step 3 and identifiers A to H after it.]

Proposed Method (2/2): Making Clusters
4. LSA
5. Calculate identifier similarity and perform cluster analysis
6. Make software clusters
7. Make cluster titles
[Figure: identifiers A to H are grouped into identifier clusters, from which software clusters with titles (ClusterTitle1, ClusterTitle2) are produced.]

MUDABlue System
Categorization System
  Components: parser, matrix generator, outlier remover, LSA program, cluster analysis program, software cluster generator, category title generator, RDB converter.
  Supports C programs.
  Written in Perl, C and shell script.
DBMS (PostgreSQL)
User Interface System
  Views: category hierarchy view, keyword search, UCM view, detailed information display.
  Web-based application, used through a Web browser.
  Written in PHP, JavaScript and Java applet.
[Figure: Soft1 to Soft6 are processed by the Categorization System, stored in the DBMS, and presented under category titles (CategoryTitle1, CategoryTitle2) by the User Interface System.]

Case study
Through the case study, we show:
  how MUDABlue presents the categories;
  an evaluation of the retrieved categories:
    a summary of the retrieved categories;
    a precision and recall comparison with automatic exclusive categorization methods.
Test data
  We chose 6 genres from SourceForge at random: boardgames, compilers, database, editor, videoconversion, xterm.
  We retrieved all C programs from these 6 genres: 41 software systems, 164,102 identifiers.
  We removed stand-off and common identifiers; 22,048 identifiers remain.

Demonstration (1/4)

Demonstration (2/4)

Demonstration (3/4)

Demonstration (4/4)

The result of case study
Our system returned 40 categories.

Category type                          Count
Clusters same as existing categories   18
New categories                         11
Other categories                       11

Details of new categories

Category             Description
GTK (2 clusters)     GUI library
win32 (3 clusters)   Windows32 API
yacc                 Library for syntactic analysis
SSL                  Library for SSL communication
regexp               Library for regular expressions
getopt               Library for parsing arguments
JNI                  Java Native Interface
Python/C             Architecture for extending the Python interpreter

Precision and Recall
GURU
  Uses IR methods.
  Applied to Unix man pages.
Ugurel et al.'s method
  Uses the support vector machine (SVM) method.
  Applied to the documentation of software systems.
The figure indicates that MUDABlue achieves about the same level of accuracy as these studies.
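For readers unfamiliar with the two measures, the Python sketch below computes precision and recall for one generated cluster against one reference category, using the standard information retrieval definitions. The slides do not state exactly how the measures were computed for MUDABlue's clusters, so this is an assumption, and the project names in the example are purely illustrative.

```python
def precision_recall(retrieved, relevant):
    """Standard IR precision and recall for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical generated cluster and hypothetical reference category.
cluster = {"gedit", "gnotepad", "sjeng"}
category = {"gedit", "gnotepad", "gmas", "peacock"}
p, r = precision_recall(cluster, category)
```

Here two of the three clustered systems belong to the category (precision 2/3), and two of the four category members were found (recall 1/2).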

Discussion
The accuracy of MUDABlue's categories compares favorably with that of other research.
Our method found categories based on libraries and architectures without any prior knowledge.
  Categorization by many aspects of software systems without human knowledge (existing research needs a predefined category set).
  Categorization without detailed, consistent documentation.
  Categorization in a nonexclusive way.

Conclusion and Future Work
We proposed MUDABlue, an automatic categorization system for software repositories.
We showed that the MUDABlue method can find new categories without any knowledge about the software systems.
Future work
  Reducing the "other" categories
    Improving the identifier deletion process would reduce the other categories.
  Improving the understandability of category titles
    Some titles are easy to understand, and some are not.
    Categories that correspond to a single library tend to have understandable titles.
  Granularity of categories
    The generated categories tend to be too fine-grained.

Extract Identifier
Extract all identifiers:
  variable names
  constant names
  function names
  type names
[Figure: step 1, Extract Identifier: identifiers A to J are extracted from Soft1 to Soft6.]
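The extraction step can be approximated with a short Python sketch. MUDABlue itself uses a parser; the regular-expression approach below is only a crude stand-in chosen for brevity, and the function name and sample source are hypothetical. It ignores the preprocessor and does not distinguish variable, constant, function, and type names.

```python
import re
from collections import Counter

C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

def extract_identifiers(source):
    """Count identifier occurrences in C source text (rough approximation)."""
    # Blank out comments and string/character literals before tokenizing.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    source = re.sub(r"//[^\n]*", " ", source)
    source = re.sub(r'"(\\.|[^"\\])*"', " ", source)
    source = re.sub(r"'(\\.|[^'\\])*'", " ", source)
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return Counter(n for n in names if n not in C_KEYWORDS)

sample = '''
/* open the main window */
static int open_window(GtkWidget *widget) {
    char *title = "window title";
    gtk_widget_show(widget);
    return 0;
}
'''
counts = extract_identifiers(sample)
```

Words that appear only in comments or string literals (such as "window" in the sample) are not counted, and C keywords are dropped.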

Make Identifier-by-Software Matrix
Identifier-by-software matrix:
  A row represents a software system.
  A column represents an identifier.
  A cell holds the number of times the identifier appears in the software system.
[Figure: step 2, Make Identifier-by-Software Matrix: the identifiers extracted from Soft1 to Soft6 are arranged in a matrix over identifiers A to J.]
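A minimal Python sketch of this step, following the slide's layout (one row per software system, one column per identifier), might look as follows. The input format, the function name, and the sample counts are assumptions made for illustration.

```python
from collections import Counter

def build_matrix(identifier_counts):
    """identifier_counts maps a software name to a Counter of identifiers.

    Returns the row labels, the column labels, and the count matrix.
    """
    software = sorted(identifier_counts)
    identifiers = sorted({name for c in identifier_counts.values() for name in c})
    # A Counter returns 0 for identifiers a system does not contain.
    matrix = [[identifier_counts[s][name] for name in identifiers]
              for s in software]
    return software, identifiers, matrix

# Hypothetical identifier counts for three software systems.
counts = {
    "Soft1": Counter({"A": 1, "B": 1}),
    "Soft2": Counter({"B": 2, "F": 1}),
    "Soft3": Counter({"C": 3, "H": 1}),
}
software, identifiers, matrix = build_matrix(counts)
```

With 41 systems and tens of thousands of identifiers, as in the case study, a sparse representation would be more practical than nested lists.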

Remove Stand-off Identifiers and Common Identifiers
We remove stand-off identifiers and common identifiers because they are useless for categorization.
  Stand-off identifier: an identifier that appears in only one software system.
  Common identifier: an identifier that appears in more than half of the software systems.
[Figure: step 3, Remove Stand-off Identifiers and Common Identifiers: the matrix over identifiers A to J is reduced to a matrix over identifiers A to H.]
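The two removal rules translate directly into a filter. The Python sketch below applies them to a mapping from each identifier to the set of software systems in which it appears; the data structure, the function name, and the sample data are assumptions made for illustration.

```python
def filter_identifiers(occurrences, n_software):
    """Drop stand-off and common identifiers.

    occurrences maps an identifier to the set of software systems using it.
    """
    kept = {}
    for name, systems in occurrences.items():
        if len(systems) <= 1:
            continue  # stand-off: appears in only one software system
        if len(systems) > n_software / 2:
            continue  # common: appears in more than half of the systems
        kept[name] = systems
    return kept

# Hypothetical occurrences across six software systems.
occurrences = {
    "I": {"Soft1"},                               # stand-off
    "J": {"Soft1", "Soft2", "Soft3", "Soft4"},    # common (4 of 6)
    "A": {"Soft1", "Soft2"},
    "G": {"Soft4", "Soft5", "Soft6"},             # exactly half, so it is kept
}
kept = filter_identifiers(occurrences, 6)
```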

LSA
We apply LSA to the matrix from which stand-off identifiers and common identifiers have been removed.
Applying LSA lets us retrieve indirect relationships.
[Figure: step 4, LSA, applied to the matrix over identifiers A to H.]

Cluster Identifiers
Calculate the similarities between all pairs of identifiers using the result of LSA.
Apply cluster analysis based on the similarities.
We call each resulting cluster an "identifier cluster".
[Figure: step 5, Cluster Identifiers: identifiers A to H are grouped into identifier clusters.]
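The slides do not say which cluster analysis algorithm is used, so the following Python sketch substitutes a simple one as an assumption: identifiers are linked whenever their cosine similarity reaches a threshold, and each connected group becomes an identifier cluster. The vectors, the threshold, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def cluster_identifiers(vectors, threshold):
    """Group identifiers whose similarity reaches the threshold (union-find)."""
    names = sorted(vectors)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())

# Hypothetical identifier vectors, standing in for the result of LSA.
vectors = {"A": [1.0, 0.1], "B": [0.9, 0.0], "G": [0.0, 1.0], "F": [0.1, 0.9]}
clusters = cluster_identifiers(vectors, 0.8)
```

A hierarchical clustering method would serve the same purpose; the threshold then corresponds to the level at which the hierarchy is cut.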

Make Software Cluster
From each identifier cluster, we make a software cluster.
A software cluster is the union of the software systems that contain a token included in an identifier cluster.
[Figure: step 6, Make software cluster: each identifier cluster is mapped to the set of software systems (Soft1 to Soft6) that contain its identifiers.]
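The union described on this slide can be sketched in a few lines of Python. The mapping from identifiers to software systems and the function name are assumptions made for illustration.

```python
def software_cluster(identifier_cluster, occurrences):
    """Union of the software systems that contain any identifier of the cluster.

    occurrences maps an identifier to the set of software systems using it.
    """
    members = set()
    for name in identifier_cluster:
        members |= occurrences.get(name, set())
    return members

# Hypothetical occurrences of three identifiers.
occurrences = {
    "A": {"Soft1", "Soft2"},
    "B": {"Soft2", "Soft3"},
    "G": {"Soft5"},
}
result = software_cluster(["A", "B"], occurrences)
```

Because a software system can contain identifiers from several identifier clusters, it can belong to several software clusters, which is what makes the categorization nonexclusive.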

Make Cluster's Titles
For each software cluster, we make a title that indicates what kind of software systems it contains.
1. Get all the software vectors included in the software cluster.
2. Sum them.
3. From the summed vector, choose the tokens with the highest values and use them as the title of the cluster.
[Figure: step 7, Make Cluster's Titles: each software cluster receives a title (ClusterTitle1, ClusterTitle2).]
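The three steps above can be sketched in Python as follows. The slide does not say whether the software vectors hold raw counts or LSA-transformed values, nor how many tokens are chosen; the sketch assumes raw counts and a hypothetical parameter k, and the sample data are invented.

```python
from collections import Counter

def cluster_title(cluster, software_vectors, k=3):
    """Pick the k tokens with the highest summed value as the cluster title."""
    total = Counter()
    for software in cluster:
        total.update(software_vectors[software])  # steps 1 and 2: collect and sum
    # Step 3: highest values first; ties are broken alphabetically.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

# Hypothetical software vectors (identifier counts).
software_vectors = {
    "gedit": Counter({"GtkWidget": 40, "gchar": 25, "buffer": 5}),
    "gnotepad": Counter({"GtkWidget": 30, "gchar": 10, "gint": 20}),
}
title = cluster_title(["gedit", "gnotepad"], software_vectors, k=3)
```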

The result of case study (subset)
Columns: Title, Software, NoI

Title: AOP, emitcode, IC_RESULT, IC_LEFT, aop, aopGet, IC_RIGHT, pic14_emitcode, iCode, etype
Software: compilers/gbdk, compilers/sdcc
NoI: 8597

Title: CASE_IGNORE, CASE_GROUND_STATE, screen, CASE_PRINT, CASE_BYP_STATE, Widget, TScreen, CASE_IGNORE_STATE, CASE_PLT_VEC, CASE_PT_POINT
Software: xterm/R6.3, xterm/R

Title: YY_BREAK, yyvsp, yyval, DATA, yy_current_buffer, tuple, yy_current_state, yy_c_buf_p, yy_cp, uint32
Software: compilers/gbdk, database/mysql , database/postgresql
Note: software systems using YACC

Title: AVI, cinfo, OUTLONG, avi_t, AVI_errno, hdrl_data, OUT4CC, nhb, ERR_EXIT, str2ulong
Software: videoconversion/dv2jpg-1.1, videoconversion/libcu30-1.0, videoconversion/mjpgTools
NoI: 177

Title: board, num_moves, ply, pawn_file, npiece, pawns, moves, white_to_move, move_s, promoted
Software: boardgame/Sjeng-10.0, boardgame/cinag-1.1.4, boardgame/faile_1_4_4
NoI: 154

Title: GtkWidget, gchar, gpointer, gint, widget, gtk_widget_show, N_, g_free, dialog, g_return_if_fail
Software: boardgame/gbatnav-1.0.4, editor/gedit , editor/gmas-1.1.0, editor/gnotepad , editor/peacock
Note: software systems using GTK library

Legend: New category; Same category as SourceForge

Naive LSA approach for categorization
Apply LSA to software similarity:
  Software corresponds to Document.
  Identifier (variable, function, type) corresponds to Word.
Calculate similarities from the result of LSA.
Apply cluster analysis using the similarities between software systems calculated above.
  Cluster analysis divides a set into groups using the similarities between items.

Problem of naive approach
Each strong relationship has its own reason.
Cluster analysis based on simple software similarity is not adequate.
[Figure: Software 1 to Software 4, each labelled with several aspects drawn from: Editor, Spreadsheet, GUI (MFC), GUI (GTK), support for regular expression.]

(demonstration)

Case study
We applied the proposed method to real software systems using the implemented prototype.
  We chose 6 genres from SourceForge at random: boardgames, compilers, database, editor, videoconversion, xterm.
  We retrieved all C programs from these 6 genres: 41 software systems, 164,102 identifiers.
  We removed stand-off and common identifiers; 22,048 identifiers remain.