Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University Automatic Categorization.

Presentation transcript:

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
Automatic Categorization Algorithm for Evolvable Software Archive
Shinji Kawaguchi†, Pankaj K. Garg††, Makoto Matsushita† and Katsuro Inoue†
† Graduate School of Information Science and Technology, Osaka University
†† Zee Source

2003/09/02, IWPSE
Background
Software archive systems (SourceForge, ibiblio, etc.) have recently become very common.
They are used for:
- finding software that fills a demand
- finding source code related to products currently under development
These archives are very large and evolving, so the archived software needs to be categorized.

Research Aim
Present: manual categorization
- hard work: a software archive is large and evolving
- less flexible: categorization depends strongly on a pre-defined category set
Automatic categorization is important
- lower cost
- adaptable: an automatic categorization method generates the category set
We are researching automatic categorization methods.

Related Works on Software Clustering
These works divide one software system into clusters for software understanding. They calculate a "similarity" between all pairs of units and categorize the units based on those similarities.
- grouping files using the similarity of their names*
- grouping functions using call relationships among functions**
- grouping functions using their identifiers***
Similarity: they retrieve information from source code.
Difference: their work focuses on intra-software relationships; our research focuses on inter-software relationships.
* N. Anquetil and T. Lethbridge. Extracting concepts from file names; a new file clustering criterion. In Proc. 20th Intl. Conf. Software Engineering, May
** G. A. Di Lucca, A. R. Fasolino, F. Pace, P. Tramontana, U. De Carlini. Comprehending Web Applications by a Clustering Based Approach. 10th International Workshop on Program Comprehension (IWPC'02)
*** Jonathan I. Maletic and Andrian Marcus. Supporting Program Comprehension Using Semantic and Structural Information. In Proceedings of the 23rd IEEE International Conference on Software Engineering (ICSE 2001)

Three Approaches
We experimented with the following three approaches to automatic categorization.
1. SMAT, a similarity measurement tool based on code-clone detection
2. Decision tree approach
3. Latent Semantic Analysis (LSA) approach

1st Approach - SMAT
SMAT: software similarity measurement tool
SMAT calculates software similarity as the ratio of "similar lines".
Similar lines are determined by the code-clone detection tool "CCFinder" and the line-based comparison tool "diff".
The similarity of two software systems S1 and S2 is defined as follows: [formula image]
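
To make the "ratio of similar lines" idea concrete, here is a minimal sketch. It is not SMAT's definition: it assumes that only exact line matches count as similar, uses Python's `difflib` as a stand-in for "diff", and leaves out clone detection entirely. The function name `similar_line_ratio` and the sample inputs are hypothetical.

```python
from difflib import SequenceMatcher


def similar_line_ratio(lines_a, lines_b):
    """Share of lines, over both inputs, that have a matching line in the other input.

    Assumption: only exact line matches count as "similar"; no clone detection.
    """
    matcher = SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(lines_a) + len(lines_b)
    # Each matched pair covers one line on each side, hence the factor 2.
    return 0.0 if total == 0 else 2 * matched / total


old_version = ["int x;", "x = 1;", "return x;"]
new_version = ["int x;", "x = 2;", "return x;"]
ratio = similar_line_ratio(old_version, new_version)
```

In this sketch two identical inputs give 1.0 and two inputs with no common line give 0.0.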

Result of SMAT
The result is in table form.
Each row and each column represents one software system.
Each cell holds the similarity value between two software systems.

2nd Approach - Decision Tree
A machine learning approach to automatic classification.
A decision tree is generated from an example data set.
Each example in the data set contains some data and one answer.
C4.5 is a common decision tree generator.
Input: example data set (data, answer) -> C4.5 -> Output: decision tree
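
As an illustration of how a tree generator chooses which attribute to test, the sketch below computes plain information gain for a True/False attribute. C4.5 itself ranks attributes by gain ratio, a normalized variant, so this is a simplification and not the tool's exact criterion. The row layout (a feature dictionary plus an answer) and the sample rows are assumptions for the example.

```python
from collections import Counter
from math import log2


def entropy(answers):
    """Shannon entropy (in bits) of a non-empty list of category labels."""
    total = len(answers)
    return -sum((n / total) * log2(n / total) for n in Counter(answers).values())


def information_gain(rows, attribute):
    """rows: list of (features, answer), where features maps attribute name -> bool."""
    answers = [answer for _, answer in rows]
    gain = entropy(answers)
    for value in (True, False):
        subset = [answer for features, answer in rows if features[attribute] == value]
        if subset:
            gain -= len(subset) / len(rows) * entropy(subset)
    return gain


# Hypothetical examples: the data is two T/F attributes, the answer is a category.
rows = [
    ({"mpe": True, "win": False}, "videoconversion"),
    ({"mpe": True, "win": True}, "videoconversion"),
    ({"mpe": False, "win": True}, "compilers"),
    ({"mpe": False, "win": False}, "compilers"),
]
gain_mpe = information_gain(rows, "mpe")
gain_win = information_gain(rows, "win")
```

Here `mpe` separates the two categories perfectly while `win` tells nothing, so a generator working this way would test `mpe` first.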

Result of Decision Tree Approach
Application to software categorization
Enumerate all 3-grams of the *.c and *.h filenames in the sample data, and use them as data.
Each cell is "T" or "F" depending on whether the software has that 3-gram in its filenames.
For each sample software system, the category information is given.
[Figure: generated decision tree. Test nodes: tyx, _fu, mpe, ops, alo, win, tin, Lib. Leaves: boardgame, compilers, database, editor, videoconversion, xterm. Branches are labelled True and False.]

3rd Approach - LSA
LSA (Latent Semantic Analysis)* was originally proposed for calculating the similarity of documents written in natural language.
The method builds a word-by-document matrix, in which each document is represented by a vector.
Similarity is represented by the cosine of two document vectors.
The method extracts co-occurrence between words by applying SVD (Singular Value Decomposition) to the matrix.
LSA can therefore detect similarity between software systems that share only highly related (but not exactly the same) words.
* Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25,
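
The steps on this slide (matrix, SVD, cosine) can be sketched as follows. This is a minimal illustration, assuming NumPy is available and that the number of retained dimensions `k` is chosen by hand; the function name, the toy matrix, and `k = 2` are assumptions for the example, not values from the talk.

```python
import numpy as np


def lsa_document_similarity(word_by_document, k):
    """Cosine similarity between documents after keeping the k largest singular values."""
    a = np.asarray(word_by_document, dtype=float)
    _, s, vt = np.linalg.svd(a, full_matrices=False)
    # Document coordinates in the reduced space: one row per document.
    docs = (np.diag(s[:k]) @ vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty documents
    unit = docs / norms
    return unit @ unit.T


# Toy matrix: rows are words, columns are documents d0, d1, d2.
toy = [
    [1, 0, 0],  # "open"  (d0 only)
    [0, 1, 0],  # "read"  (d1 only)
    [1, 1, 0],  # "file"  (d0 and d1)
    [0, 0, 1],  # "pixel" (d2 only)
    [0, 0, 1],  # "frame" (d2 only)
]
similarity = lsa_document_similarity(toy, k=2)
```

In the toy matrix d0 and d1 share only one of their two words, so their raw cosine is 0.5. After truncation to two dimensions they land on the same axis, while d2 stays orthogonal to both, which is the effect described above for related but not identical vocabularies.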

Result of LSA method
Application to software categorization
Extract identifiers (variable names, function names, etc.) from source code and treat them as words.
We calculate the similarities between all pairs of software systems.
[Figure: a part of Figure 4, "Similarity of Software System by LSA"]
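
A rough sketch of the identifier extraction step, assuming C source code. It uses regular expressions in place of a real parser, so it is deliberately crude: comments and string literals are blanked out, C89 keywords are dropped, and everything else that looks like a name is kept. The function name and the sample snippet are hypothetical, and the authors' actual extractor may differ.

```python
import re

# C89 keywords; anything else that looks like a name is treated as an identifier.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def extract_identifiers(source):
    """Return the set of identifier-like names in a piece of C source text."""
    # Blank out block comments, line comments, and string literals first.
    source = re.sub(r"/\*.*?\*/|//[^\n]*", " ", source, flags=re.S)
    source = re.sub(r'"(?:\\.|[^"\\])*"', " ", source)
    return {name for name in IDENTIFIER.findall(source) if name not in C_KEYWORDS}


snippet = 'int board_size = 8; /* player count */ char *msg = "hello world"; return move(board_size);'
words = extract_identifiers(snippet)
```

The resulting sets play the role of the "words" of each software system in the word-by-document matrix.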

Comparison of three methods
How to decide
  SMAT: similarity (ratio of lines with code clones)
  Decision Tree: decision tree
  LSA: similarity (cosine of vectors)
Input
  SMAT: source code only
  Decision Tree: source code and category set
  LSA: source code only
Result, in different categories
  SMAT: similarities are all 0
  Decision Tree: no misses if the example input is small
  LSA: high value if the software systems use the same library
Result, in the same category
  SMAT: very low value or 0
  Decision Tree: no misses if the example input is small
  LSA: some categories show a very high relationship
Scalability
  SMAT: yes
  Decision Tree: no (the generated decision tree has many errors if the example set is large)
  LSA: yes

Conclusion
We have reported some preliminary work on automatic categorization of an evolvable software archive.
In each case, we had limited success with the parameters that we chose.
- Software functionality is a highly abstract concept.
- Software has several aspects.
We are actively pursuing this research direction.
Non-exclusive categorization is much better suited to software categorization.


Application for software categorization
Software  fil  cmd  …  mpe  Category
Soft1     T    T    …  F    Printing
Soft2     F    T    …  F    Editor
…
SoftM     T    F    …  T    Database
Enumerate all *.c and *.h files in the sample data, and use the 3-grams of their filenames.
Each cell is "T" or "F" depending on whether the software has that 3-gram in its filenames.
For each input software system, the category information is given.
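
The table above could be produced along these lines. This is a sketch under assumptions: 3-grams are taken from the file name without its directory and extension, and matching is case-sensitive. The slides do not state either detail, and the function names and sample archive are hypothetical.

```python
import os


def filename_trigrams(filenames):
    """Set of 3-character substrings of the *.c and *.h file names (extension stripped)."""
    grams = set()
    for name in filenames:
        stem, ext = os.path.splitext(os.path.basename(name))
        if ext not in (".c", ".h"):
            continue
        grams.update(stem[i:i + 3] for i in range(len(stem) - 2))
    return grams


def build_feature_table(archive):
    """archive: {software name: list of file paths}.

    Returns (sorted list of 3-grams, {software name: list of True/False per 3-gram}).
    """
    per_software = {name: filename_trigrams(files) for name, files in archive.items()}
    vocabulary = sorted(set().union(*per_software.values()))
    table = {
        name: [gram in grams for gram in vocabulary]
        for name, grams in per_software.items()
    }
    return vocabulary, table


archive = {
    "Soft1": ["src/file.c", "cmd.h"],
    "Soft2": ["cmd.c", "README"],
}
vocabulary, table = build_feature_table(archive)
```

Each row of `table`, together with the known category of that software system, forms one example for the decision tree generator.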

Result of Decision Tree Approach
High ratio of error with large input (57.6%).
This approach requires a set of categories.
tyx = t: xterm (2.0)
tyx = f:
| _fu = t: database (6.0)
| _fu = f:
| | mpe = t: videoconversion (3.0)
| | mpe = f:
| | | alo = t: editor (4.0)
| | | alo = f:
| | | | ops = t: database (2.0/1.0)
| | | | ops = f:
| | | | | win = t: compilers (6.0)
| | | | | win = f:
| | | | | | tin = t: compilers (2.0)
| | | | | | tin = f:
| | | | | | | Lib = t: compilers (2.0)
| | | | | | | Lib = f: boardgame (14.0/1.0)
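
Because every "t" branch of the tree above ends in a leaf, the tree reads as an ordered list of tests. The sketch below transcribes it that way; the names `TREE_TESTS`, `DEFAULT_CATEGORY`, and `classify` are introduced here for illustration only.

```python
# (3-gram, category) pairs in the order the tree tests them.
TREE_TESTS = [
    ("tyx", "xterm"),
    ("_fu", "database"),
    ("mpe", "videoconversion"),
    ("alo", "editor"),
    ("ops", "database"),
    ("win", "compilers"),
    ("tin", "compilers"),
    ("Lib", "compilers"),
]
DEFAULT_CATEGORY = "boardgame"  # reached when every test is false


def classify(trigrams):
    """Category for a software system, given the set of 3-grams in its file names."""
    for gram, category in TREE_TESTS:
        if gram in trigrams:
            return category
    return DEFAULT_CATEGORY


example = classify({"win", "mai"})
```

Written this way, it is easy to see that a system with none of the eight 3-grams falls through to boardgame, whatever it actually does.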

Result of Decision Tree Approach
Application to software categorization
Enumerate all *.c and *.h files in the sample data, and use the 3-grams of their filenames.
Each cell is "T" or "F" depending on whether the software has that 3-gram in its filenames.
For each input software system, the category information is given.
Three problems
- Overfitting to the test data
- High ratio of error with large input (57.6%)
- This approach requires a set of categories
[Figure: generated decision tree. Test nodes: tyx, _fu, mpe, ops, alo, win, tin, Lib. Leaves: boardgame, compilers, database, editor, videoconversion, xterm. Branches are labelled True and False.]

Experimentation
Test data: 41 software systems from SourceForge, classified into 6 genres at SourceForge.
Extract identifiers (variable names, function names, etc.) from source code.
Omit unnecessary identifiers:
- identifiers that appear in only one software system
- identifiers that appear in many (more than half) of the software systems
Apply LSA to the matrix of the 41 software systems and the remaining identifiers.
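
The filtering rule on this slide can be sketched as below. The input layout (a mapping from software name to its identifiers) and the function name are assumptions for the example; the thresholds follow the slide: drop identifiers found in only one system or in more than half of the systems.

```python
from collections import Counter


def filter_identifiers(identifiers_by_software):
    """Keep identifiers that occur in at least two systems and in at most half of them."""
    total = len(identifiers_by_software)
    systems_containing = Counter()
    for identifiers in identifiers_by_software.values():
        systems_containing.update(set(identifiers))
    kept = {
        name
        for name, count in systems_containing.items()
        if count > 1 and count <= total / 2
    }
    return {
        software: set(identifiers) & kept
        for software, identifiers in identifiers_by_software.items()
    }


sample = {
    "game1": {"main", "board", "player", "render"},
    "game2": {"main", "board", "render"},
    "editor1": {"main", "buffer", "render"},
    "db1": {"main", "query"},
}
filtered = filter_identifiers(sample)
```

In the sample, `main` (in all four systems) and `render` (in three of four) are too common, `player`, `buffer`, and `query` occur only once, and `board` is the only identifier that survives.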

Result of LSA method (1/3)
This table shows the similarities between the software systems.
boardgame
- few common concepts in board games (board, player)
compilers
- includes many kinds of software: compilers for new programming languages, code generators (compiler-compilers), etc.

Result of LSA method (2/3)
database
- different implementations: full-featured DB, simple text-based DB
editor, videoconversion, xterm
- very high similarity

Result of LSA method (3/3)
Some software systems have high similarity though they are in different categories.
They use the same libraries, e.g. GTK, a GUI library.

Comparison of three methods
SMAT
- generally, very low similarity values
Decision Tree
- needs a pre-defined category set
- overfits the test data
- not applicable to large data
Latent Semantic Analysis
- high similarity values in some categories
- software in different categories that uses the same library sometimes shows high similarity

LSA – sample document
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

LSA – word by document matrix
[Figure: word-by-document matrix, with axes labelled "document" and "word"]
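
Since the matrix itself is shown only as a figure, here is a sketch of how a word-by-document matrix could be built from the nine sample titles. Two choices in it are assumptions made for this example: a small hand-picked stop-word list, and keeping only words that occur in at least two titles. The names `TITLES`, `STOP_WORDS`, and `word_by_document` are hypothetical.

```python
import re

TITLES = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}  # assumed list


def word_by_document(titles, stop_words=STOP_WORDS, min_docs=2):
    """Return (words, documents, matrix); matrix[i][j] counts words[i] in documents[j]."""
    tokens = {doc: re.findall(r"[a-z]+", text.lower()) for doc, text in titles.items()}
    candidates = {word for doc_tokens in tokens.values() for word in doc_tokens}
    words = sorted(
        word
        for word in candidates
        if word not in stop_words
        and sum(word in doc_tokens for doc_tokens in tokens.values()) >= min_docs
    )
    documents = list(titles)
    matrix = [[tokens[doc].count(word) for doc in documents] for word in words]
    return words, documents, matrix


words, documents, matrix = word_by_document(TITLES)
```

Each column of `matrix` is the vector for one title, so the cosine between two columns is the similarity of the corresponding documents before any SVD is applied.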