1
Data Mining ICCM
2
Knowledge Discovery Process
Databases → Data Integration / Data Cleaning → Preprocessed Data → Selection / Data Transformations → Task-relevant Data → Data Mining → Interpretation
3
Remember: Domain Expertise
Strong understanding of the business problem. Understands subtle relationships. Helps in reducing data dimensions.
4
Cleaning and Preparing Data
What you do with missing values depends on how many there are, and whether they're missing randomly or systematically. When in doubt, assume that missing values are missing systematically.
Appropriate data transformations can make the data easier to understand and easier to model.
Normalization and rescaling are important when relative changes are more important than absolute ones.
Data provenance records help reduce errors as you iterate over data collection, data treatment, and modeling.
(Zumel and Mount, Practical Data Science with R, 2014, Manning)
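A minimal sketch of these ideas in Python, assuming pandas is available; the DataFrame and its "age"/"income" columns are hypothetical examples, not data from the slides:

```python
# Sketch of missing-value handling and rescaling with pandas.
import pandas as pd

df = pd.DataFrame({"age": [23, 35, None, 41],
                   "income": [42000, None, 58000, 61000]})

# Record which rows were missing before imputing, so the fact of
# "missingness" (random vs. systematic) stays visible to the model.
df["income_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

# Min-max rescaling: useful when relative changes matter more than absolute ones.
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```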
5
Data Mining Objective: Fit data to a model
Potential Result: Higher-level meta-information that may not be obvious when looking at raw data.
Similar terms: exploratory data analysis, data-driven discovery, deductive learning.
6
Query Examples: Database vs. Data Mining
Database:
Find all credit card applications with the last name "Smith".
Find customers who have purchased milk.
Data Mining:
Find all credit card applications that are a poor risk (classification).
Find all items that are frequently purchased with milk.
7
Data Mining Models and Tasks
8
Machine Learning Algorithms
Some machine learning algorithms use training, which requires sample (training) sets of the data; later predictions are based on how the model was trained. Unsupervised learning does not require prior labelling of the data. Supervised learning requires that the data be labelled prior to training (e.g., records of people who purchased and did not purchase a product).
9
Basic Data Mining Tasks
Classification maps data into predefined groups or classes (supervised learning; pattern recognition; prediction).
Regression is used to map a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters (unsupervised learning; segmentation; partitioning).
10
Basic Data Mining Tasks (cont’d)
Summarization maps data into subsets with associated simple descriptions (characterization; generalization).
Link analysis uncovers relationships among data (affinity analysis; association rules).
Sequential analysis determines sequential patterns.
12
Linear regression: In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable (X) and the other the dependent (outcome) variable (Y).
13
What is "Linear"? Remember this: Y = mX + B, where m is the slope and B is the intercept.
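A small sketch of fitting Y = mX + B by least squares with NumPy; the data is synthetic (true slope 2.5 and intercept 1.0 are made up) just to show the slope and intercept being recovered:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # true m = 2.5, B = 1.0 plus noise

m, b = np.polyfit(x, y, deg=1)   # degree-1 polynomial = straight line
print(f"estimated slope m = {m:.2f}, intercept B = {b:.2f}")
```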
14
Linear Correlation: scatter plots of Y versus X contrasting strong relationships with weak relationships.
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
15
Support Vector Machine (SVM)
Max-Margin Classifier: formalizes the notion of the best linear separator.
Lagrangian Multipliers: a way to convert a constrained optimization problem into one that is easier to solve.
Kernels: projecting data into a higher-dimensional space can make it linearly separable.
Complexity: depends only on the number of training examples, not on the dimensionality of the kernel space!
16
Linear Separators Which of the linear separators is optimal?
17
Tennis example: scatter plot of Temperature versus Humidity, with points labelled "play tennis" or "do not play tennis".
18
Linear Support Vector Machines
Data: (xi, yi), i = 1, ..., l, where xi ∈ R^d and yi ∈ {−1, +1} (shown as +1 and −1 points in the x1–x2 plane).
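A minimal sketch of a linear SVM on toy 2-D data with labels in {−1, +1}, matching the slide's notation; it assumes scikit-learn is available, and the points themselves are invented:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],     # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("weights w:", clf.coef_, "bias b:", clf.intercept_)
print("support vectors:", clf.support_vectors_)
```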
19
Non-linear SVMs
Datasets that are linearly separable with some noise are OK. But what are we going to do if the dataset is not linearly separable at all? How about mapping the data to a higher-dimensional space, e.g., x → x²?
20
Kernel Trick (Raise to higher dimension)
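One way to illustrate the kernel-trick idea is with an explicit feature map instead of a kernel function: 1-D points that cannot be separated on the line become separable after lifting x → (x, x²). A sketch, assuming scikit-learn; the data is made up:

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])      # +1 far from the origin, -1 near it

phi = np.column_stack([x, x ** 2])          # explicit lift: x -> (x, x^2)
clf = SVC(kernel="linear").fit(phi, y)      # a straight line now separates the classes

x_new = np.array([1.0, -2.8])
print(clf.predict(np.column_stack([x_new, x_new ** 2])))   # expected: [-1  1]
```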
21
Confusion Matrix Measures the performance of a classification model.
Type I error: false positives (FP).
Type II error: false negatives (FN).
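A short sketch of reading the FP and FN counts off a 2×2 confusion matrix, assuming scikit-learn; the labels and predictions are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} (Type I error) FN={fn} (Type II error)")
```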
22
Clustering: Partitioning Clustering Approach
A typical clustering approach: iteratively partition the training data set to learn a partition of the given data space.
Learning a partition on a data set produces several non-empty clusters (usually the number of clusters is given in advance).
In principle, the optimal partition is achieved by minimizing the sum of squared distances from each point to the "representative object" of its cluster (e.g., using Euclidean distance).
23
Illustrating Clustering
Intracluster distances are minimized; intercluster distances are maximized.
24
K-means Clustering
The user sets the number of clusters they'd like (e.g., K = 5).
Randomly guess K cluster centre locations.
Each data point finds out which centre it's closest to (thus each centre "owns" a set of data points).
Each centre finds the centroid of the points it owns and jumps there.
Repeat until terminated!
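A minimal sketch of this loop in plain NumPy; the choice of K, the random initialisation, and the synthetic data are all illustrative assumptions, not the only way to implement K-means:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial centre guesses
    for _ in range(n_iter):
        # each data point finds the centre it is closest to (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # each centre jumps to the centroid of the points it owns
        # (a fuller version would also handle clusters that end up empty)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                 # stop when centres no longer move
            break
        centres = new_centres
    return centres, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
centres, labels = kmeans(X, k=3)
print(centres)
```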
25
Dendrogram: a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples.
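A small sketch of producing a dendrogram from hierarchical (agglomerative) clustering, assuming SciPy and Matplotlib are available; the "gene/sample" vectors here are random placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(12, 3))        # synthetic sample vectors
Z = linkage(X, method="average", metric="euclidean")     # agglomerative merge tree

dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()
```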
26
There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
Given: a database of transactions, where each transaction contains a set of items.
Example: when a customer buys bread and butter, they buy milk 85% of the time.
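A tiny sketch of how a figure like "85% of the time" can be measured as rule confidence; the transactions and the rule {bread, butter} → {milk} are made-up examples:

```python
# Support = fraction of all transactions containing antecedent and consequent.
# Confidence = fraction of antecedent-containing transactions that also contain the consequent.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

antecedent, consequent = {"bread", "butter"}, {"milk"}
has_antecedent = [t for t in transactions if antecedent <= t]
has_both = [t for t in has_antecedent if consequent <= t]

support = len(has_both) / len(transactions)
confidence = len(has_both) / len(has_antecedent)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```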
27
Market Basket (Association Rules)
Where should detergents be placed in the store to maximize their sales?
Are window cleaning products purchased when detergents and orange juice are bought together?
Is soda typically purchased with bananas? Does the brand of soda make a difference?
How are the demographics of the neighborhood affecting what customers are buying?