College Scorecard & Software Defect Prediction Prepared by: Meetkumar Patel, Srivats Srinivasan Guidance by: Prof. Meiliu Lu
Agenda Data Warehousing Project: Background, Introduction, Technologies Explored, Implementation Steps, Future Scope. Data Mining Project: Objective, Algorithm Applied, Demo, Learning Experience, References.
Background Source websites: www.data.gov, http://promise.site.uottawa.ca/SERepository/datasets-page.html Two datasets: College Scorecard and Software Defect Prediction. College Scorecard dataset: data from 2009-2013; 17 attributes, 37835 entries. Software Defect Prediction dataset: 22 attributes, 1100 entries.
Introduction The primary objective of our project is to design a data mart. We used a star schema to build it. This data mart answers questions related to US universities. The primary users of the data mart would be high school students.
Technologies Explored Data Preprocessing: Microsoft Excel spreadsheet, MySQL Server. Data Mart: MS SQL Server, Java. OLAP Operations: SQL Server queries.
Implementation Steps Data Cleaning and Preprocessing; Data Mart; Querying Tool.
Data Cleaning and Preprocessing The original data had 80,000 rows and 1700 columns; we trimmed it to 37835 rows and 17 relevant columns. Missing values were filled in using a SQL script. Because the data covers five years, we added a year column to separate the records by year.
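The trimming and year-tagging step can be sketched as follows. This is only an illustration, not the script used in the project: the column names (UNITID, INSTNM, STABBR, EXTRA), the sample rows, and the helper trim_and_tag are assumptions, since the slides do not list the 17 columns that were kept.

```python
import csv
import io

# Hypothetical subset of columns to keep; the slides do not name the real 17.
KEEP = ["UNITID", "INSTNM", "STABBR"]

def trim_and_tag(csv_text, year, keep=KEEP):
    """Keep only the selected columns and add a YEAR column to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        trimmed = {name: record.get(name, "") for name in keep}
        trimmed["YEAR"] = year  # lets the yearly files be merged and still told apart
        rows.append(trimmed)
    return rows

# Two made-up yearly extracts with one unwanted column (EXTRA).
sample_2009 = "UNITID,INSTNM,STABBR,EXTRA\n1,Alpha College,CA,x\n"
sample_2010 = "UNITID,INSTNM,STABBR,EXTRA\n1,Alpha College,CA,y\n"

combined = trim_and_tag(sample_2009, 2009) + trim_and_tag(sample_2010, 2010)
for row in combined:
    print(row)
```

Tagging each row with its year before merging keeps the five yearly files distinguishable inside a single table.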
Data Mart The data mart is implemented on a star schema. It provides users with university details based on the following attributes: University ID, Programs, Type of Degree, SAT & AWT scores, Region, State.
Star Schema
Fact Table: University_ID, State_ID, Degree_ID, PDegree_ID, Region_ID, Program_ID, Scores
University: University_ID, University_name, Zip, Website
State: State_ID, State_Name
Highest Degree: Degree_ID, Degree_Name
Predominant Degree: PDegree_ID, PDegree_Name
Region: Region_ID, Region_Name
Program: Program_ID, Program_Name
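A reduced version of this star schema can be sketched with Python's built-in sqlite3 module. This is an assumption-laden illustration: the project used MS SQL Server, only two of the dimension tables are shown, the fact table is named Fact_Table here, and all rows and the helper universities_in_state are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE University (
    University_ID INTEGER PRIMARY KEY,
    University_name TEXT, Zip TEXT, Website TEXT
);
CREATE TABLE State (
    State_ID INTEGER PRIMARY KEY,
    State_Name TEXT
);
-- The fact table holds foreign keys to the dimensions plus the measure.
CREATE TABLE Fact_Table (
    University_ID INTEGER REFERENCES University(University_ID),
    State_ID INTEGER REFERENCES State(State_ID),
    Scores REAL
);
-- Made-up sample rows for illustration only.
INSERT INTO University VALUES (1, 'Alpha College', '00001', 'alpha.example');
INSERT INTO University VALUES (2, 'Beta University', '00002', 'beta.example');
INSERT INTO State VALUES (10, 'California');
INSERT INTO State VALUES (20, 'Nevada');
INSERT INTO Fact_Table VALUES (1, 10, 1200);
INSERT INTO Fact_Table VALUES (2, 20, 1100);
""")

def universities_in_state(state_name):
    """Join the fact table to two dimensions and filter by state name."""
    query = """
        SELECT u.University_name, f.Scores
        FROM Fact_Table AS f
        JOIN University AS u ON u.University_ID = f.University_ID
        JOIN State AS s ON s.State_ID = f.State_ID
        WHERE s.State_Name = ?
        ORDER BY u.University_name
    """
    return conn.execute(query, (state_name,)).fetchall()

print(universities_in_state("California"))
```

Every user question becomes a join from the central fact table out to whichever dimensions the question filters on, which is the main appeal of the star layout.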
Future Scope Allow privileged users to insert new records. Integrate Google Maps for location and directions. Develop web-based and mobile-based environments.
Objective Mine the available data to extract knowledge. Analyze the behavior of different data mining tools. This project focuses on high-performance fault/error predictors based on data mining techniques such as Random Forests and on algorithms based on a new computational intelligence approach.
Data Mining Tools used: Weka, Rapid Miner. Classification algorithms: J48, Random Tree, Logistic.
Data Mining We use attributes such as cyclomatic complexity, essential complexity, design complexity, total number of operators, total number of operands, volume, program length, difficulty, intelligence, effort, and line count. We mine these attributes to study how they affect the quality of the software produced. The final result is a prediction of whether a module is defective or not: {true, false}.
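Several of these attributes (volume, program length, difficulty, intelligence, effort) appear to be Halstead measures derived from operator and operand counts. Assuming the standard Halstead definitions apply to this dataset, they can be computed as in the sketch below; the function name halstead and the example counts are hypothetical.

```python
import math

def halstead(n1, n2, N1, N2):
    """Standard Halstead measures.

    n1, n2: number of distinct operators and distinct operands
    N1, N2: total number of operators and total number of operands
    """
    vocabulary = n1 + n2
    length = N1 + N2                           # program length
    volume = length * math.log2(vocabulary)    # V = N * log2(n)
    difficulty = (n1 / 2) * (N2 / n2)          # D = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume               # E = D * V
    intelligence = volume / difficulty         # I = V / D
    return {
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "intelligence": intelligence,
    }

# Made-up counts for a small module.
metrics = halstead(n1=4, n2=4, N1=10, N2=6)
for name, value in metrics.items():
    print(name, value)
```

Because effort and intelligence are computed from volume and difficulty, these attributes are strongly correlated, which is worth keeping in mind when judging how much each one contributes to a classifier.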
Data Mining Pre-processing data: The collected data were noisy, inconsistent, and missing useful information. The first step was therefore data preparation: checking the data distribution and outliers, handling empty or missing values, enriching the data, and transforming it into analyzable formats, in order to improve data quality and thus enable effective data mining.
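Two of the preparation steps named above, handling missing values and checking for outliers, might look like the following sketch. The median imputation and the 1.5 x IQR rule are common choices assumed here for illustration; the slides do not say which methods were actually used, and the sample values are made up.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the present values."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up line counts with one missing entry and one extreme value.
line_counts = [10, None, 12, 11, 13, 200]

cleaned = impute_median(line_counts)
suspicious = iqr_outliers(cleaned)
print(cleaned)
print(suspicious)
```

Flagged values are only candidates for review: in defect data an unusually large module may be a genuine and informative record rather than noise.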
Data Mining Algorithm Implementation First, each algorithm was run in WEKA to obtain its root mean squared error (RMSE); Rapid Miner was then used to produce the graphical output. The lower the RMSE, the better the algorithm performs on the given data set.
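For reference, the root mean squared error is the square root of the average squared difference between predicted and actual values. The sketch below computes it by hand and picks the candidate with the lowest value; the labels, the predicted probabilities, and the names model_a and model_b are made up and are not results from the project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# 1 = defective, 0 = not defective (made-up labels).
actual = [1, 0, 1, 0]

# Hypothetical predicted defect probabilities from two candidate models.
candidates = {
    "model_a": [0.5, 0.5, 0.5, 0.5],
    "model_b": [0.9, 0.1, 0.8, 0.3],
}

scores = {name: rmse(actual, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)  # lowest RMSE wins
print(scores)
print(best)
```

A model that always answers 0.5 scores exactly 0.5 on binary labels, so a useful classifier should land clearly below that.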
Data Mining (WEKA) J48
Data Mining (WEKA) Naïve Bayesian
Data Mining (WEKA) Random Tree
Data Mining (Rapid Miner)
Data Mining
Learning Experience Analytical processing. Learned different data mining tools such as Weka and Rapid Miner. Learned about real-time applications of different data mining algorithms.
DEMO
Conclusion Weka reported the root mean squared error, on the basis of which a few algorithms were shortlisted. However, Weka could not present the results graphically in a clear way, so we turned to Rapid Miner, which let us simplify the graphs and predict the probability of a defect with ease.
References
http://www.sciencedirect.com/science/article/pii/S0020025508005173
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4031804&tag=1
http://promise.site.uottawa.ca/SERepository/datasets-page.html
http://recommender-systems.readthedocs.org/en/latest/datamining.html