Chapter 7: Transformations
Attribute Selection
Adding irrelevant attributes confuses learning algorithms, so such attributes should be avoided. Both divide-and-conquer and separate-and-conquer algorithms suffer from this; Naïve Bayes does not. So first choose the attributes to be considered, then proceed: this is dimensionality reduction. Scheme-independent selection: keep just enough attributes to divide up the instance space in a way that separates all the training instances. For example, in Table 1, if we were to drop outlook, instances 1 and 4 would become inseparable, which is not acceptable. Searching for such a subset is a very tedious procedure. A minimal separability check is sketched below.
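The following is a minimal sketch of the separability idea, using a small weather-style dataset made up for illustration (it is not the chapter's Table 1):

```python
def separable(instances, labels):
    """Return False if two instances share all attribute values but differ in class."""
    seen = {}
    for inst, label in zip(instances, labels):
        key = tuple(inst)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

data = [
    ("sunny", "hot", "high", "false"),
    ("sunny", "hot", "high", "true"),
    ("rainy", "hot", "high", "false"),
]
labels = ["no", "no", "yes"]

reduced = [inst[1:] for inst in data]   # drop the first attribute (outlook)
print(separable(data, labels))          # True: all instances distinguishable
print(separable(reduced, labels))       # False: rows 1 and 3 now collide
```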
Using Machine Learning Algorithms for Attribute Selection
Decision trees: apply a decision tree learner to all attributes and select only those that are actually used in the decisions; the selected attributes can then be used with another chosen learning algorithm.
Linear SVM: use a linear SVM, which ranks attributes by the weights it assigns them, to choose attributes; repeatedly discarding the lowest-ranked ones is recursive feature elimination.
Instance-based learning methods (a sketch follows this list):
- Sample instances randomly from the training set.
- Check neighboring records of the same and of different classes (near hits and near misses).
- If a near hit has a different value for a certain attribute, that attribute appears to be irrelevant: reduce its weight.
- If a near miss has a different value, the attribute appears to be relevant: increase its weight.
- After repeating this procedure many times, selection takes place: only attributes with positive weights are chosen.
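A sketch of this near-hit/near-miss weighting (in the style of the Relief algorithm), assuming NumPy; the sample count and the Manhattan distance are illustrative assumptions:

```python
import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    """Accumulate attribute weights from near hits and near misses."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to every instance
        dists[i] = np.inf                       # exclude the instance itself
        same = np.flatnonzero(y == y[i])
        diff = np.flatnonzero(y != y[i])
        same = same[same != i]
        if same.size == 0 or diff.size == 0:
            continue
        hit = same[np.argmin(dists[same])]      # nearest neighbor of the same class
        miss = diff[np.argmin(dists[diff])]     # nearest neighbor of another class
        # An attribute that differs on a near hit looks irrelevant (weight down);
        # one that differs on a near miss looks relevant (weight up).
        w -= np.abs(X[i] - X[hit])
        w += np.abs(X[i] - X[miss])
    return w

# Keep only attributes whose accumulated weight is positive:
# selected = np.flatnonzero(relief_weights(X, y) > 0)
```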
Searching the Attribute Space
(Fig. 7.1) Forward selection: start with the empty set and keep adding attributes. Backward elimination: start with all attributes and eliminate them one by one. Bidirectional search: a combination of the two.
Scheme-specific selection: cross-validation is used to measure the effectiveness of a subset of attributes with the learning scheme that will actually be used; a greedy forward-selection sketch is shown below.
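A sketch of scheme-specific forward selection, assuming scikit-learn; the decision-tree base scheme and the stopping rule are illustrative choices, not the chapter's:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_select(X, y, cv=5):
    """Greedily add the attribute that most improves cross-validated accuracy."""
    remaining = list(range(X.shape[1]))
    chosen, best_score = [], 0.0
    while remaining:
        scores = {}
        for a in remaining:
            cols = chosen + [a]
            scores[a] = cross_val_score(
                DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=cv
            ).mean()
        a = max(scores, key=scores.get)
        if scores[a] <= best_score:   # stop when no attribute improves the score
            break
        chosen.append(a)
        remaining.remove(a)
        best_score = scores[a]
    return chosen
```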
Discretizing Numeric Attributes
Global discretization (used in the 1R learning scheme): sort the instances by the attribute's value and partition the values into ranges at the points where the class value changes, subject to a minimum-coverage criterion on the number of instances per range.
Local discretization (used in decision trees): when a specific attribute is used to split a node, a decision is made about the value at which the break takes place. A numeric attribute can also be transformed into k binary variables.
Unsupervised discretization ignores the classes of the training set and breaks the value range into intervals, e.g., equal-interval binning or equal-frequency binning; this runs the risk of destroying distinctions within an interval or bin.
Supervised discretization takes the classes into account while forming intervals.
Proportional k-interval discretization: equal-frequency binning with the number of bins chosen as the square root of the number of instances has been found to work excellently. (Both unsupervised binning schemes are sketched below.)
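A sketch of the two unsupervised binning schemes named above, assuming NumPy; the temperature-like values are made up for illustration:

```python
import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85], float)
k = int(np.sqrt(len(values)))  # bin count used by proportional k-interval discretization

# Equal-interval (equal-width) binning: split the range into k equal spans.
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-frequency binning: each bin gets roughly the same number of instances.
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

print(np.digitize(values, width_edges[1:-1]))  # bin index for each value
print(np.digitize(values, freq_edges[1:-1]))
```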
Entropy-based Discretization
One example: order the values of the attribute and, for each possible break point, determine the information (entropy) of the resulting split; over all break points, find the point A where this value is smallest, i.e., where the information gain is largest, and split there. Repeat this procedure recursively for each of the parts formed by breaking at A, until a stopping criterion is met. A recursive sketch with a simplified stopping rule follows.
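A sketch of the recursion, assuming NumPy. The chapter stops splitting via an MDL criterion; this sketch substitutes a simple minimum-partition-size rule for brevity:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_points(values, y, min_size=3):
    """Return cut points found by recursive entropy-based discretization."""
    order = np.argsort(values)
    values, y = values[order], y[order]
    cuts = []

    def recurse(lo, hi):
        # Simplified stopping rule: partition too small, or already pure.
        if hi - lo < 2 * min_size or entropy(y[lo:hi]) == 0.0:
            return
        best, best_info = None, None
        for i in range(lo + min_size, hi - min_size + 1):
            # Weighted entropy of the two halves = information value of the split.
            info = ((i - lo) * entropy(y[lo:i]) + (hi - i) * entropy(y[i:hi])) / (hi - lo)
            if best_info is None or info < best_info:
                best, best_info = i, info
        cuts.append((values[best - 1] + values[best]) / 2)
        recurse(lo, best)     # repeat on each part formed by the break
        recurse(best, hi)

    recurse(0, len(values))
    return sorted(cuts)
```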
Some Useful Transformations
Examples: subtracting one date attribute from another to obtain a new age attribute; converting two attributes A and B into A/B, a new attribute representing their ratio; reducing several nominal attributes to one by concatenating their values, producing a single attribute with k1 × k2 values.
Principal component analysis: use a special coordinate system that depends on the given cloud of points, as follows. Place the first axis in the direction of greatest variance of the points, maximizing the variance along that axis; place the second axis perpendicular to it (in the multi-dimensional case, choose the perpendicular direction that maximizes the remaining variance), and so on. Finally, keep the axes that contribute the most variance: the principal components. A minimal sketch follows.
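A minimal PCA sketch via the singular value decomposition, assuming NumPy; the data and component count are made up:

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)            # center the cloud of points
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are orthogonal directions of decreasing variance.
    return Xc @ Vt[:n_components].T    # project onto the principal components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, 2).shape)                 # (100, 2)
```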
Random Projections
Since PCA is expensive (cubic in the number of dimensions), an alternative is to apply a random projection of the data into a subspace with a predetermined number of dimensions, as sketched below.
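A sketch assuming NumPy; the dimensions and the Gaussian scaling are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))             # high-dimensional data
k = 20                                      # predetermined number of dimensions
R = rng.normal(size=(500, k)) / np.sqrt(k)  # random projection matrix
X_low = X @ R
print(X_low.shape)                          # (100, 20)
```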
Text to Attribute Vectors
Convert a document into a vector over the words that occur in it; the components can be the frequencies of the words or just the absence/presence of each word. In other words, a document is characterized by the words that appear often in it. A bag-of-words sketch follows.
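A bag-of-words sketch with two made-up documents; only the Python standard library is assumed:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]
tokens = [d.split() for d in docs]
vocab = sorted(set(w for t in tokens for w in t))

vectors = [[Counter(t)[w] for w in vocab] for t in tokens]
print(vocab)
for v in vectors:
    print(v)  # frequency vector; use int(count > 0) for presence/absence instead
```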
Time Series
Sometimes we may replace an attribute's values by the differences between successive values, and so on; this is the time-series setting. A one-line differencing sketch follows.
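A differencing sketch, assuming NumPy and made-up values:

```python
import numpy as np

series = np.array([10.0, 12.0, 15.0, 14.0, 18.0])  # illustrative values
print(np.diff(series))                             # [ 2.  3. -1.  4.]
```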
Automatic Data Cleansing
Data mining techniques themselves can sometimes help solve the problem of cleansing corrupted data. Decision trees induced from data can be improved by discarding misclassified instances from the training set, relearning, and then repeating until there are no more misclassified instances (sketched below). Robust regression: linear regression is improved by removing outliers.
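A sketch of the discard-and-relearn loop, assuming scikit-learn and NumPy arrays; the depth limit is an assumption that keeps the tree from simply memorizing the training set (which would end the loop immediately):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def clean_by_relearning(X, y, max_rounds=10):
    """Drop misclassified instances and relearn until none remain."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_rounds):
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[keep], y[keep])
        wrong = keep & (tree.predict(X) != y)  # misclassified among kept instances
        if not wrong.any():                    # no misclassified instances remain
            break
        keep &= ~wrong                         # discard them and relearn
    return X[keep], y[keep]
```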
Combining Multiple Models
Bagging, boosting, and stacking are prominent methods for combining multiple models.
Bagging: models receive equal weight; the combined output is, for example, the majority value of the models' outputs.
Boosting: similar to bagging, except that it assigns different weights to the outputs of different models.
Option trees are shown in Fig. 7.10 (negative values mean play = yes; positive values mean play = no). A minimal bagging sketch follows.
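A minimal bagging sketch, assuming scikit-learn and NumPy arrays; the base learner, model count, and voting details are illustrative choices:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=11, seed=0):
    """Train equal-weight models on bootstrap samples; combine by majority vote."""
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        idx = rng.integers(len(y_train), size=len(y_train))  # bootstrap sample
        model = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    # Every model gets an equal vote; the output is the majority value.
    stacked = np.stack(all_preds)
    return np.array([Counter(col).most_common(1)[0][0] for col in stacked.T])
```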