
1 Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Automatic Categorization Tool for Open Software Repositories Shinji Kawaguchi †, Pankaj K. Garg ††, Makoto Matsushita †, Katsuro Inoue † † Osaka University, Japan †† Zee Source, USA

2 Outline (OSIC'03, 2003/10/26): Background and research aim; Latent Semantic Analysis (LSA); Problem with naive LSA approach; Proposed automatic categorization method; Case study; Discussions and conclusions

3 Software Repository
- A "software repository" archives many software systems together with their source code. Such repositories have become very common in recent years.
- In the open source community
  - Provides a platform for many open source projects
  - E.g. SourceForge (http://sourceforge.net/)
- In an industrial context
  - Archives software systems created within a company
  - Shares information about projects that exist (or existed) in the company
  - Especially useful for large and distributed organizations
  - E.g. Corporate Source*, Progressive Open Source**
* J. Dinkelacker and P. Garg. "Corporate Source: Applying Open Source Concepts to a Corporate Environment (Position Paper)". In Proceedings of the 1st ICSE International Workshop on Open Source Software Engineering, May 15, 2001, Toronto, Canada.
** J. Dinkelacker, P. Garg, D. Nelson, and R. Miller. "Progressive Open Source". In Proceedings of the International Conference on Software Engineering, Orlando, Florida, 2002.

4 Background
- A software repository is also used for
  - finding a software system that meets a demand
  - finding source code related to products currently under development
- A repository generally contains many software systems.
  - SourceForge hosted 69,677 projects as of Oct. 24, 2003.
- Categorization is essential for finding software.
- At present, software systems are categorized manually.
  - A repository manager builds a hierarchical category structure.
  - A software developer chooses an adequate category for each software system.

5 Problem
- Inflexible and exclusive classification
  - Software systems are generally categorized by their use.
  - Classification by the libraries or architecture a system depends on is also valuable.
  - A software system has various aspects.
- Making a hierarchical category structure requires a huge amount of work.
  - Improving it requires comprehensive knowledge of various libraries and architectures.
  - The load on the repository manager is high.

6 Nonexclusive classification
[Figure: Software 1 to Software 4, each labeled with several features (Editor or Spreadsheet; GUI (MFC) or GUI (GTK); support for regular expression), grouped into the overlapping categories Editor, Spreadsheet, MFC, GTK, and regexp.]
- Without knowledge of these libraries and architectures, you cannot prepare such categories.

7 Research Aim
- An automatic categorization method for open source software
  - Nonexclusive categorization that takes various aspects of a software system into account
  - Identifies the libraries and architecture a system depends on and classifies software systems automatically
- Uses only source code
  - Does not require comprehensive knowledge about the software systems

8 Outline: Background and research aim; Latent Semantic Analysis (LSA); Problem with naive LSA approach; Proposed automatic categorization method; Case study; Discussions and conclusions

9 LSA - Latent Semantic Analysis
- LSA was proposed for calculating the similarity between documents or between terms in natural language.
- LSA is based on the Vector Space Model.
- LSA can detect similarity between documents that share only highly related (but not identical) words.
  - The original vector space model cannot detect such relationships.

10 Example of LSA
- Make a word-by-document matrix.

Word-by-document matrix (rows: Doc1 to Doc6, columns: words A to H):
        A  B  C  D  E  F  G  H
  Doc1  1  2  0  0  0  1  0  0
  Doc2  1  1  1  1  1  0  0  0
  Doc3  0  1  3  1  0  0  0  0
  Doc4  0  0  0  0  0  0  2  0
  Doc5  0  0  0  0  0  1  1  2
  Doc6  0  0  0  0  1  0  1  1

Matrix after LSA (values listed in the order they survive in this transcript; rows 1 to 4 are missing cells):
  Doc1  0.3  0.7  0.9  0.4  0.3  0.2  0.3
  Doc2  0.4  1.0  1.4  0.6  0.3  0.2  0.1
  Doc3  0.6  1.5  2.3  1.0  0.4  0.2  -0.2
  Doc4  0.1  -0.2  0.0  0.2  0.4  0.9
  Doc5  0.1  0.2  -0.2  0.0  0.4  0.6  1.5  1.4
  Doc6  0.1  0.2  -0.1  0.0  0.3  0.4  1.0  0.9

- Each row is a document vector and each column is a term vector.
- The similarity between documents, or between terms, is represented by the cosine of the two vectors.
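
The LSA step on the example matrix can be sketched as a truncated singular value decomposition. This is a minimal illustration, assuming NumPy and a rank of 2; the slide does not state the rank or the weighting the authors used, so the printed values are not expected to match the slide exactly. The name `lsa_reduce` is hypothetical.

```python
import numpy as np

# Word-by-document matrix from the slide: rows are Doc1..Doc6, columns are words A..H.
A = np.array([
    [1, 2, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1],
], dtype=float)


def lsa_reduce(matrix, k):
    """Return the rank-k approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]


# k=2 is an assumption made for this sketch, not a value taken from the talk.
A2 = lsa_reduce(A, k=2)
print(np.round(A2, 1))
```

Cells that are zero in the original matrix generally become non-zero in the reduced matrix, which is how documents without shared words can still end up with related vectors.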

11 Effect of LSA
- Documents that have an indirect relationship show high similarities.
- LSA makes the tendencies of documents clear.

Similarities between documents, before LSA:
         1     2     3     4     5     6
   1   1.0   0.2  -0.1  -0.3  -0.3  -0.5
   2   0.2   1.0   0.5  -0.5  -0.9  -0.5
   3  -0.1   0.5   1.0  -0.2  -0.4  -0.5
   4  -0.3  -0.5  -0.2   1.0   0.3   0.5
   5  -0.3  -0.9  -0.4   0.3   1.0   0.5
   6  -0.5  -0.5  -0.5   0.5   0.5   1.0

Similarities between documents, after LSA:
         1     2     3     4     5     6
   1   1.0   1.0   0.9  -0.6  -0.6  -0.5
   2   1.0   1.0   1.0  -0.8  -0.8  -0.7
   3   0.9   1.0   1.0  -0.8  -0.8  -0.8
   4  -0.6  -0.8  -0.8   1.0   1.0   1.0
   5  -0.6  -0.8  -0.8   1.0   1.0   1.0
   6  -0.5  -0.7  -0.8   1.0   1.0   1.0
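
The cosine measure behind these similarity tables can be sketched in a few lines. This is an illustration only: plain cosine on non-negative counts can never be negative, so the negative values on the slide indicate that the talk used some additional weighting or centering that the transcript does not specify. The function names are hypothetical, and the document vectors are the raw rows of the example matrix on slide 10.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Raw document vectors (Doc1..Doc6 over words A..H) from the example on slide 10.
docs = [
    [1, 2, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1],
]

sims = similarity_matrix(docs)
for row in sims:
    print(" ".join(f"{value:5.2f}" for value in row))
```

Applying the same function to the rows of an LSA-reduced matrix gives the "after LSA" counterpart; Doc1 and Doc4 share no words, so their raw cosine is exactly 0.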

12 Outline: Background and research aim; Latent Semantic Analysis (LSA); Problem with naive LSA approach; Proposed automatic categorization method; Case study; Discussions and conclusions

13 Naive LSA approach for categorization
- Apply LSA to calculate software similarity, using the correspondence
  - Software corresponds to Document
  - Identifier (variable, function, type) corresponds to Word
- Calculate similarities between software systems from the result of LSA.
- Apply cluster analysis using these similarities.
  - Cluster analysis divides a set into groups based on the similarities between its items.

14 Problem of naive approach
- Each high similarity has its own reason.
- Cluster analysis based on a single overall software similarity is not adequate.
[Figure: the same four software systems as on slide 6 (editors and spreadsheets using MFC or GTK, some with support for regular expression); pairs of systems are similar for different reasons.]

15 Outline: Background and research aim; Latent Semantic Analysis (LSA); Problem with naive LSA approach; Proposed automatic categorization method; Case study; Discussions and conclusions

16 Classification by identifiers
- Identifiers imply the behavior of source code.
  - Statements containing the identifier "window" are related to some kind of GUI operation.
- Group identifiers that are highly related and consider each group as one category.
[Figure: Software 1 (Editor, GUI (MFC)) and Software 3 (Spreadsheet, GUI (MFC)) contain identifiers such as window, cmdButton, and menuBar, which together form the category MFC.]

17 1. Extract Identifier
- Extract all identifiers:
  - variable names
  - constant names
  - function names
  - type names
[Figure: Soft1 to Soft6, each shown as a bag of identifiers drawn from A to J.]
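
Identifier extraction can be approximated with regular expressions. This is a rough sketch, not the tool's method: slide 24 states that the actual token extractor is written in C using YACC, whereas this version only strips comments, literals, and `#include` lines and then drops C keywords. The names `extract_identifiers`, `STRIP_RE`, `IDENT_RE`, and the sample program are hypothetical.

```python
import re

# C89 keywords, which are not counted as identifiers.
C_KEYWORDS = frozenset("""
auto break case char const continue default do double else enum extern
float for goto if int long register return short signed sizeof static
struct switch typedef union unsigned void volatile while
""".split())

# Text that must not contribute identifiers.
STRIP_RE = re.compile(
    r"^[ \t]*#[ \t]*include[^\n]*"   # #include lines
    r"|/\*.*?\*/"                    # block comments
    r"|//[^\n]*"                     # line comments
    r'|"(?:\\.|[^"\\])*"'            # string literals
    r"|'(?:\\.|[^'\\])*'",           # character literals
    re.DOTALL | re.MULTILINE,
)

# An identifier must not start in the middle of a number such as 0x1F.
IDENT_RE = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*")


def extract_identifiers(source):
    """Return every identifier occurrence in C source text, in order of appearance."""
    code = STRIP_RE.sub(" ", source)
    return [tok for tok in IDENT_RE.findall(code) if tok not in C_KEYWORDS]


SAMPLE = """
#include <stdio.h>
/* open a window */
int main(void) {
    Window *window = create_window("main window", 0x1F);
    return show(window);
}
"""

print(extract_identifiers(SAMPLE))
```

Occurrences are kept rather than deduplicated because the next step counts how often each identifier appears in a software system.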

18 2. Make Identifier-by-Software Matrix
- Each row represents a software system.
- Each column represents an identifier.
- Each cell holds the number of times the identifier appears in the software system.

Identifier-by-software matrix (rows: Soft1 to Soft6, columns: identifiers A to J):
         A  B  C  D  E  F  G  H  I  J
  Soft1  1  2  0  0  0  1  0  0  0  1
  Soft2  1  1  1  1  1  0  0  0  0  0
  Soft3  0  1  3  1  0  0  0  0  0  0
  Soft4  0  0  0  0  0  0  2  0  1  1
  Soft5  0  0  0  0  0  1  1  2  0  1
  Soft6  0  0  0  0  1  0  1  1  0  1
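
Building the matrix from per-software identifier lists amounts to counting occurrences. The sketch below assumes the identifiers have already been extracted; the function name `build_matrix` is hypothetical, and the example bags of identifiers are read back from the slide's matrix.

```python
from collections import Counter


def build_matrix(identifiers_by_software):
    """Build an identifier-by-software count matrix.

    identifiers_by_software maps a software name to the list of identifier
    occurrences found in it. Returns (names, vocabulary, matrix) where
    matrix[i][j] is how often vocabulary[j] occurs in names[i].
    """
    names = list(identifiers_by_software)
    vocabulary = sorted({ident
                         for idents in identifiers_by_software.values()
                         for ident in idents})
    matrix = []
    for name in names:
        counts = Counter(identifiers_by_software[name])
        matrix.append([counts[ident] for ident in vocabulary])
    return names, vocabulary, matrix


# Identifier occurrences matching the example matrix on the slide.
example = {
    "Soft1": list("ABBFJ"),
    "Soft2": list("ABCDE"),
    "Soft3": list("BCCCD"),
    "Soft4": list("GGIJ"),
    "Soft5": list("FGHHJ"),
    "Soft6": list("EGHJ"),
}

names, vocabulary, matrix = build_matrix(example)
for name, row in zip(names, matrix):
    print(name, row)
```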

19 3. Remove Stand-off Identifiers and Common Identifiers
- Stand-off identifiers and common identifiers are removed because they are useless for categorization.
  - Stand-off identifier: an identifier that appears in only one software system.
  - Common identifier: an identifier that appears in more than half of the software systems.
[Figure: columns I (stand-off, appears only in software 4) and J (common, appears in 4 of 6 software systems) are removed from the 6x10 identifier-by-software matrix of slide 18, leaving the 6x8 matrix over identifiers A to H.]
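
The two removal rules translate directly into a column filter. This is a minimal sketch under the definitions on the slide (appears in only one software system; appears in more than half of them); the function name `remove_uninformative` is hypothetical.

```python
def remove_uninformative(vocabulary, matrix):
    """Drop stand-off and common identifiers from an identifier-by-software matrix."""
    n_software = len(matrix)
    keep = []
    for col in range(len(vocabulary)):
        # Number of software systems in which this identifier appears at all.
        present_in = sum(1 for row in matrix if row[col] > 0)
        stand_off = present_in <= 1            # only one software system (or none)
        common = present_in > n_software / 2   # more than half of the systems
        if not stand_off and not common:
            keep.append(col)
    kept_vocabulary = [vocabulary[col] for col in keep]
    kept_matrix = [[row[col] for col in keep] for row in matrix]
    return kept_vocabulary, kept_matrix


vocabulary = list("ABCDEFGHIJ")
matrix = [
    [1, 2, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 2, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
]

kept_vocabulary, kept_matrix = remove_uninformative(vocabulary, matrix)
print(kept_vocabulary)
```

Identifiers B and G each appear in exactly three of the six systems, which is not more than half, so they are kept.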

20 4. LSA
- LSA is applied to the matrix from which stand-off and common identifiers have been removed.
- Applying LSA retrieves indirect relationships.
[Figure: the 6x8 identifier-by-software matrix before and after LSA, with the same values as the example on slide 10.]

21 5. Cluster Identifiers
- Calculate similarities between all pairs of identifiers using the result of LSA.
- Apply cluster analysis based on the similarities.
- Each resulting cluster is called an "identifier cluster".
[Figure: the matrix after LSA and the identifiers A to H grouped into identifier clusters.]
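
One simple way to turn pairwise similarities into clusters is to link every pair whose similarity reaches a threshold and take the connected groups. The talk does not say which cluster analysis or threshold was used, so the single-linkage rule, the threshold of 0.7, and the similarity values below are all assumptions for illustration; only the identifier names come from the slides.

```python
def cluster_by_threshold(items, similarity, threshold):
    """Group items so that any pair with similarity >= threshold shares a cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return list(clusters.values())


# Hypothetical similarity values; unlisted pairs are treated as unrelated.
PAIR_SIMILARITY = {
    frozenset(("window", "menuBar")): 0.9,
    frozenset(("window", "cmdButton")): 0.8,
    frozenset(("yyval", "yyvsp")): 0.95,
}


def lookup(a, b):
    return PAIR_SIMILARITY.get(frozenset((a, b)), 0.0)


identifiers = ["window", "menuBar", "cmdButton", "yyval", "yyvsp"]
identifier_clusters = cluster_by_threshold(identifiers, lookup, threshold=0.7)
print(identifier_clusters)
```

Note that menuBar and cmdButton land in the same cluster although they are not directly similar here; they are connected through window.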

22 6. Make Software Cluster
- A software cluster is made from each identifier cluster.
- A software cluster is the union of the software systems that contain a token included in the identifier cluster.
[Figure: each identifier cluster is mapped to the software systems (Soft1 to Soft6) that contain its identifiers.]
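
The union rule can be sketched as a lookup over the identifier-by-software matrix. The function name `software_cluster` and the three identifier clusters used below are hypothetical; the matrix is the filtered example matrix from the earlier slides.

```python
def software_cluster(identifier_cluster, names, vocabulary, matrix):
    """Return the software systems that contain at least one identifier of the cluster."""
    columns = [vocabulary.index(ident)
               for ident in identifier_cluster if ident in vocabulary]
    return [name for name, row in zip(names, matrix)
            if any(row[col] > 0 for col in columns)]


names = ["Soft1", "Soft2", "Soft3", "Soft4", "Soft5", "Soft6"]
vocabulary = list("ABCDEFGH")
matrix = [
    [1, 2, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1],
]

# Hypothetical identifier clusters, chosen only to show the mechanics.
for cluster in (["A", "B"], ["G", "H"], ["E"]):
    print(cluster, "->", software_cluster(cluster, names, vocabulary, matrix))
```

Soft2 appears in both the A/B cluster and the E cluster, which is the nonexclusive behavior the method aims for.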

23 7. Make Cluster's Titles
- For each software cluster, make a title that represents what kind of software systems it contains.
  1. Get the software vectors of all software systems in the cluster.
  2. Sum them up.
  3. From the summed vector, choose some tokens with high values and use them as the title of the cluster.
[Figure: software clusters labeled ClusterTitle1 and ClusterTitle2.]
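
The three titling steps can be sketched as a vector sum followed by a top-N selection. The slide does not say whether the software vectors are raw counts or LSA-reduced values, nor how many tokens form a title, so this sketch assumes raw counts and a caller-chosen `top_n`; the function name `cluster_title` is hypothetical.

```python
def cluster_title(cluster_names, names, vocabulary, matrix, top_n=3):
    """Pick the identifiers with the highest summed counts over a software cluster."""
    members = set(cluster_names)
    total = [0] * len(vocabulary)
    for name, row in zip(names, matrix):
        if name in members:
            total = [t + value for t, value in zip(total, row)]
    # Highest total first; ties are broken alphabetically to keep the result stable.
    ranked = sorted(range(len(vocabulary)),
                    key=lambda col: (-total[col], vocabulary[col]))
    return [vocabulary[col] for col in ranked[:top_n] if total[col] > 0]


names = ["Soft1", "Soft2", "Soft3", "Soft4", "Soft5", "Soft6"]
vocabulary = list("ABCDEFGH")
matrix = [
    [1, 2, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 3, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1, 1],
]

title = cluster_title(["Soft4", "Soft5", "Soft6"], names, vocabulary, matrix, top_n=2)
print(title)
```

For the cluster of Soft4, Soft5, and Soft6 the summed vector is highest for G (4) and H (3), so those two identifiers form the title.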

24 Automatic Categorization System
- Target: programs written in C
- Implemented in Perl
  - The token extractor, however, is written in C using YACC.
- Uses the SVDPACKC program for the LSA calculation
- About 4,000 lines in total

25 Outline: Background and research aim; Latent Semantic Analysis (LSA); Problem with naive LSA approach; Proposed automatic categorization method; Case study; Discussions and conclusions

26 Case study
- We applied the proposed method to real software systems using the implemented prototype.
- We chose 6 genres from SourceForge at random:
  - boardgames, compilers, database, editor, videoconversion, xterm
- We retrieved all C programs from these 6 genres.
  - 41 software systems
  - 164,102 identifiers
- After removing stand-off and common identifiers, 22,048 identifiers remained.

27 The result of case study (subset)
Each entry gives the cluster's title, the software systems in the cluster, and NoI.
1. Title: AOP, emitcode, IC_RESULT, IC_LEFT, aop, aopGet, IC_RIGHT, pic14_emitcode, iCode, etype
   Software: compilers/gbdk, compilers/sdcc
   NoI: 8597
2. Title: CASE_IGNORE, CASE_GROUND_STATE, screen, CASE_PRINT, CASE_BYP_STATE, Widget, TScreen, CASE_IGNORE_STATE, CASE_PLT_VEC, CASE_PT_POINT
   Software: xterm/R6.3, xterm/R6.4
   NoI: 2160
3. Title: YY_BREAK, yyvsp, yyval, DATA, yy_current_buffer, tuple, yy_current_state, yy_c_buf_p, yy_cp, uint32
   Software: compilers/gbdk, database/mysql-3.23.49, database/postgresql-7.2.1
   NoI: 223
4. Title: AVI, cinfo, OUTLONG, avi_t, AVI_errno, hdrl_data, OUT4CC, nhb, ERR_EXIT, str2ulong
   Software: videoconversion/dv2jpg-1.1, videoconversion/libcu30-1.0, videoconversion/mjpgTools
   NoI: 177
5. Title: board, num_moves, ply, pawn_file, npiece, pawns, moves, white_to_move, move_s, promoted
   Software: boardgame/Sjeng-10.0, boardgame/cinag-1.1.4, boardgame/faile_1_4_4
   NoI: 154
6. Title: GtkWidget, gchar, gpointer, gint, widget, gtk_widget_show, N_, g_free, dialog, g_return_if_fail
   Software: boardgame/gbatnav-1.0.4, editor/gedit-1.120.0, editor/gmas-1.1.0, editor/gnotepad+-1.3.3, editor/peacock-0.4
   NoI: 104
Callouts on the slide: "Software systems using YACC" and "Software systems using GTK library" mark new categories (the YY_BREAK and GtkWidget clusters); the remaining clusters are marked "Same category as SourceForge".

28 The result of case study
- Our system returned 40 clusters.
  - Clusters identical to existing categories: 18
  - New clusters: 8
- Details of new clusters
  - GTK (2 clusters): GUI library
  - yacc (2 clusters): library for syntactic analysis
  - regexp: library for regular expressions
  - getopt: library for parsing arguments
  - JNI: Java Native Interface
  - Python/C: architecture for extending the Python interpreter

29 Discussion
- Our method found categorizations by library and by architecture without any prior knowledge.
  - Categorization by many aspects of software systems
  - Categorization without human knowledge
- Cluster titles
  - Some titles are easy to understand and some are not.
  - Clusters for a single library tend to have understandable titles.

30 Conclusion and Future Work
- We proposed an automatic categorization method for open software systems.
- We showed that our method could find new categorizations without any knowledge about the software systems.
- Future work
  - Improve the understandability of cluster titles
  - Large-scale experimentation

31 Similarity calculation
[Figure: kinds of similarity arranged by unit (function; module, component; software; team) and abstraction level (lexical level, semantic level, metrics level). Entries: by lexical similarity; by programming language; by the number of developers, CMM level, etc.; by developer, LoC, cyclomatic number, etc.; by usage; by library or architecture.]

32 Usage of Software Search
[Figure: uses of software search arranged by unit (function; module, component; software; team) and abstraction level (lexical level, semantic level, metrics level). Entries: reuse implementation; refer to design; refer to development process; estimate metrics.]

33 Product Search System
[Figure: a company source repository holding software developed in division A, software developed in division B, and software imported from an open source repository; development divisions A and B search it for products.]

35 Proposed Method (1/2)
[Figure: overview of steps 1 to 3 (1. Extract Identifier; 2. Make Identifier-by-Software Matrix; 3. Remove Stand-off Identifiers and Common Identifiers), repeating the figures of slides 17 to 19.]

36 Proposed Method (2/2)
[Figure: overview of steps 4 to 7 (4. LSA; 5. Calculate Identifier Similarity and Cluster Analysis; 6. Make Software Clusters; 7. Make Cluster's Titles), repeating the figures of slides 20 to 23.]

