1
Data mining methodology in Weka
Renata Benda Prokeinova, Department of Statistics and Operations Research
2
Logistic regression - theory
Logistic regression is in many ways similar to ordinary regression: it models the relationship between a dependent variable and one or more independent variables, and it allows us to assess the fit of the model as well as the significance of the relationships (between dependent and independent variables) that we are modelling. However, the underlying principle of binomial logistic regression, and its statistical calculation, are quite different from those of ordinary linear regression. While ordinary regression uses ordinary least squares to find a best-fitting line and produces coefficients that predict the change in the dependent variable for a one-unit change in the independent variable, logistic regression estimates the probability of an event occurring. What we want to predict from the relevant independent variables is not a precise numerical value of the dependent variable, but the probability (p) that it is 1 (event occurring) rather than 0 (event not occurring). This means that, while linear regression assumes a linear relationship between the dependent and independent variables, logistic regression makes no such assumption.
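Formally, the model passes a weighted sum of the inputs through the logistic function to obtain this probability: p = 1 / (1 + e^-(b0 + b1x1 + ... + bkxk)), where b0 is the intercept and b1 ... bk are the coefficients of the independent variables x1 ... xk.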
3
Logistic regression - output
First section of the report:

Coefficients...
                     Class
Variable               yes
==========================
outlook=sunny
outlook=overcast
outlook=rainy
temperature
humidity
windy
Intercept

The coefficients are the weights applied to each attribute before the values are added together; the weighted sum is then passed through the logistic function, and the result is the probability that the new instance belongs to class yes (> 0.5 means yes).
4
Logistic regression - output
Odds Ratios...
                     Class
Variable               yes
==========================
outlook=sunny
outlook=overcast
outlook=rainy
temperature
humidity
windy

The odds ratios indicate how large an influence a change in that attribute's value will have on the prediction. The value for outlook=overcast is so large because, if the outlook is overcast, the odds are very good that play will equal yes.
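Each odds ratio is the exponential of the corresponding coefficient from the previous slide (odds ratio = e^coefficient), so a coefficient of 0 corresponds to an odds ratio of 1, i.e. no influence on the predicted odds.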
5
Logistic regression - output
=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 1 4 | b = no

The confusion matrix simply shows how many of the test data points are correctly and incorrectly classified. In this example, 7 instances of class a (yes) were correctly classified as a, whereas 2 instances of a were misclassified as b.
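As a minimal sketch (plain Python rather than Weka, with our own variable names), the usual summary statistics can be read straight off this matrix:

```python
# The counts are the ones reported above; everything else is illustrative.
matrix = [[7, 2],   # actual yes: 7 classified as yes, 2 classified as no
          [1, 4]]   # actual no:  1 classified as yes, 4 classified as no

correct = matrix[0][0] + matrix[1][1]        # diagonal = correct predictions
total = sum(sum(row) for row in matrix)      # all test instances
accuracy = correct / total                   # (7 + 4) / 14 = 0.786

precision_yes = matrix[0][0] / (matrix[0][0] + matrix[1][0])  # 7 / 8 = 0.875
recall_yes = matrix[0][0] / (matrix[0][0] + matrix[0][1])     # 7 / 9 = 0.778

print(f"accuracy={accuracy:.3f} "
      f"precision(yes)={precision_yes:.3f} recall(yes)={recall_yes:.3f}")
```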
6
Datasets
7
Market basket analysis
Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal, you are more likely to buy crisps (US: chips) at the same time than somebody who didn't buy beer.
8
Market basket analysis: basic terminology
Items are the objects that we are identifying associations between. Transactions are instances of groups of items co-occurring together. The support of an item or item set is the fraction of transactions in our data set that contain that item or item set. In general, it is desirable to identify rules with high support, as these will be applicable to a large number of transactions. For supermarket retailers, this is likely to involve basic products that are popular across the entire user base (e.g. bread, milk). A printer-cartridge retailer, by contrast, may have no products with high support, because each customer only buys cartridges that are specific to his or her own printer.
9
Market basket analysis: basic terminology (contd.)
The confidence of a rule is the likelihood that it holds for a new transaction containing the items on the LHS of the rule:

confidence(im ⇒ in) = support(im ∪ in) / support(im)

The lift of a rule is the ratio of the support of the LHS co-occurring with the RHS to the support that would be expected if the two were independent:

lift(im ⇒ in) = support(im ∪ in) / (support(im) × support(in))
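A minimal Python sketch (plain Python rather than Weka's Apriori, with a made-up five-transaction data set) showing how the three quantities are computed:

```python
# Hypothetical transactions, for illustration only.
transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "peanuts"},
    {"beer"},
    {"wine", "crisps"},
    {"wine"},
]
n = len(transactions)

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / n

lhs, rhs = {"beer"}, {"crisps"}
confidence = support(lhs | rhs) / support(lhs)             # 0.40 / 0.60 = 0.67
lift = support(lhs | rhs) / (support(lhs) * support(rhs))  # 0.40 / 0.36 = 1.11

print(f"support={support(lhs | rhs):.2f} "
      f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 (as here) means buying beer raises the chance of buying crisps relative to what independence would predict.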
10
Factor and Cluster Analysis
Chapter Twenty-One: Factor and Cluster Analysis
11
Factor Analysis
Combines questions or variables to create new factors; combines objects to create new groups.
Uses in data analysis:
To identify underlying constructs in the data from the groupings of variables that emerge
To reduce the number of variables to a more manageable set
12
Factor Analysis (Contd.)
Methodology:
Principal Component Analysis summarizes the information in a larger set of variables in a smaller set of factors.
Common Factor Analysis uncovers underlying dimensions surrounding the original variables.
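A minimal scikit-learn sketch (not the Weka workflow; the data here is randomly generated as a stand-in for the export data set introduced on the next slides) contrasting the two extraction methods:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Hypothetical stand-in data: 50 firms x 6 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))

# Principal component analysis: summarizes the total variance.
pca = PCA(n_components=2).fit(X)
print("PCA variance explained:", pca.explained_variance_ratio_)

# Common factor analysis: models only the shared (common) variance.
fa = FactorAnalysis(n_components=2).fit(X)
print("Factor loadings:\n", fa.components_.T)  # rows = variables, columns = factors
```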
13
Factor Analysis - Example
14
Export Data Set - Illustration
[Data table: one row per responding firm, with columns Respid, Will (Y1), Govt (Y2), Train (X5), Size (X1), Exp (X6), Rev (X2), Years (X3), Prod (X4)]
15
Description of Variables
Variable Description                                   | Name in Output | Scale Values
Willingness to Export (Y1)                             | Will           | 1 (definitely not interested) to 5 (definitely interested)
Level of Interest in Seeking Govt Assistance (Y2)      | Govt           |
Employee Size (X1)                                     | Size           | Greater than zero
Firm Revenue (X2)                                      | Rev            | In millions of dollars
Years of Operation in the Domestic Market (X3)         | Years          | Actual number of years
Number of Products Currently Produced by the Firm (X4) | Prod           | Actual number
Training of Employees (X5)                             | Train          | 0 (no formal program) or 1 (existence of a formal program)
Management Experience in International Operation (X6)  | Exp            | 0 (no experience) or 1 (presence of experience)
16
Factors and the Eigenvalue Criterion
Factor: a variable or construct that is not directly observable but must be inferred from the input variables. All included factors (prior to rotation) must explain at least as much variance as an "average variable".
Eigenvalue criterion: the eigenvalue represents the amount of variance in the original variables that is associated with a factor; the sum of the squared factor loadings of each variable on a factor is that factor's eigenvalue. Only factors with eigenvalues greater than 1.0 are retained.
17
How Many Factors - Criteria
Scree Plot Criterion: a plot of the eigenvalues against the number of factors, in order of extraction. The shape of the plot is used to determine the number of factors: the point where the curve flattens out marks the cut-off.
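A minimal matplotlib sketch of a scree plot, with made-up eigenvalues for illustration:

```python
import matplotlib.pyplot as plt

# Made-up eigenvalues, largest first (in practice, take them from the
# eigendecomposition in the previous sketch).
eigenvalues = [2.8, 1.4, 0.7, 0.5, 0.35, 0.25]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")  # Kaiser cut-off, for comparison
plt.xlabel("Factor number (in order of extraction)")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```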
18
How Many Factors: Criteria (Contd.)
Percentage of Variance Criterion: the number of factors extracted is determined so that the cumulative percentage of variance extracted by the factors reaches a satisfactory level.
Significance Test Criterion: the statistical significance of the separate eigenvalues is determined, and only those factors that are statistically significant are retained.
19
Extraction using Principal Component Method - Unrotated
[Results table: factor loadings and factor score coefficients for each variable]
20
Extraction using Principal Component Method - Factor Rotation
The rotated factor loadings are not significantly different from the unrotated values.
21
Common Factor Analysis
The factor extraction procedure is similar to that of principal component analysis, except for the input correlation matrix: communalities (shared variance) are inserted in the diagonal instead of unities. The total amount of variance that can be explained by all the factors in common factor analysis is the sum of the diagonal elements of this matrix, so the output of common factor analysis depends on the amount of shared variance.
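A minimal numpy sketch of that single difference, with made-up communality estimates:

```python
import numpy as np

# Correlation matrix of the observed variables (made-up numbers).
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])

# Made-up communality estimates (each variable's shared variance).
communalities = np.array([0.55, 0.62, 0.30])

# Common factor analysis replaces the unities on the diagonal with the
# communalities before the factors are extracted.
reduced = corr.copy()
np.fill_diagonal(reduced, communalities)

# Total variance explainable by the common factors = sum of the diagonal.
print("Explainable common variance:", reduced.trace())  # 1.47 rather than 3.0
```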
22
Common Factor Analysis - Results
23
Common Factor Analysis - Results (Contd.)
24
Common Factor Analysis – Results (Contd.)
25
Cluster Analysis
A technique for grouping individuals or objects into previously unknown groups. The typical criterion used in cluster analysis is the distance between clusters or the error sum of squares.
The inputs are any valid measure of similarity between objects, such as correlations, distance measures (e.g. Euclidean distance), or association coefficients, together with the number of clusters or the level of clustering.
26
Steps in Cluster Analysis
1. Define the problem
2. Decide on the appropriate similarity measure
3. Decide how to group the objects
4. Decide on the number of clusters
5. Interpret, describe, and validate the clusters
27
Cluster Analysis (Contd.)
Hierarchical Clustering
Can start with all objects in one cluster and divide and subdivide them until each object is in its own single-object cluster ('top-down' or divisive approach).
Can start with each object in its own single-object cluster and systematically combine clusters until all objects are in one cluster ('bottom-up' or agglomerative approach).
Non-hierarchical Clustering
Permits objects to leave one cluster and join another as clusters are being formed.
A cluster center is initially selected, and all objects within a pre-specified threshold distance are included in that cluster.
28
Hierarchical Clustering
Single Linkage: clustering criterion based on the shortest distance between clusters.
Complete Linkage: clustering criterion based on the longest distance between clusters.
29
Hierarchical Clustering (Contd.)
Average Linkage: clustering criterion based on the average distance between clusters.
Ward's Method: based on the loss of information resulting from grouping the objects into clusters (minimizes within-cluster variation).
30
Hierarchical Clustering (Contd.)
Centroid Method: based on the distance between the group centroids (the point whose coordinates are the means of all the observations in the cluster).
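A minimal scipy sketch (with made-up data rather than the bank data used on the following slides) running all five linkage criteria:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Made-up data: two loose groups of five observations each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])

# Each method name corresponds to one of the criteria above.
for method in ["single", "complete", "average", "ward", "centroid"]:
    Z = linkage(X, method=method)                    # Euclidean distance by default
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, labels)

# dendrogram(Z) draws the kind of tree shown on the next slide.
```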
31
Hierarchical Cluster Analysis - Example
32
Hierarchical Cluster Analysis (Contd.)
A dendrogram for hierarchical clustering of bank data
33
Hierarchical Cluster Analysis (Contd.)
34
Criteria for Determining the Number of Clusters
The number of clusters is specified by the analyst for theoretical or practical reasons.
The level of clustering with respect to the clustering criterion is specified.
The number of clusters is determined from the pattern of clusters generated: the distances between clusters or an error-variability measure at successive steps can be used to decide the number of clusters (from a plot of the error sum of squares against the number of clusters).
The ratio of total within-group variance to between-group variance is plotted against the number of clusters; the point at which an elbow occurs indicates the number of clusters (see the sketch below).
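A minimal scikit-learn sketch of the elbow heuristic, using the within-cluster error sum of squares (inertia) on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data with three natural groups.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])

# Within-cluster error sum of squares for k = 1..6; the value drops
# sharply until k reaches the true number of clusters, then flattens.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```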
35
Assumptions and Limitations of Cluster Analysis
Assumptions
The basic measure of similarity on which the clustering is based is a valid measure of the similarity between the objects.
There is theoretical justification for structuring the objects into clusters.
Limitations
It is difficult to evaluate the quality of the clustering.
It is difficult to know exactly which clusters are very similar and which objects are difficult to assign.
It is difficult to select a clustering criterion and program on any basis other than availability.