1
Data Mining 101 with Scikit-Learn
An informal introduction to data mining
Shuhan Yuan
2
What is data mining?
Data mining is the computing process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems.
Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. (Data Mining: Concepts and Techniques)
4
What is data mining?
A naïve view of data mining: Data → Data Mining → Knowledge (Models), that is, knowledge discovery from data.
5
Six common classes of tasks
Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as "legitimate" or as "spam".
Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
Regression: attempts to find a function that models the data with the least error, that is, to estimate the relationships among data or data sets.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modelling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
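The supermarket example can be sketched in a few lines of plain Python. This is only a toy illustration, not a full association rule algorithm: the transactions are made up, and the `pair_support` helper is a hypothetical name that simply measures how often two products appear in the same basket.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort the items so each pair is counted under one canonical key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)

# Print pairs from most to least frequently bought together.
for pair, s in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 2))
```

Pairs with high support are candidates for "frequently bought together"; real association rule mining would also look at larger item sets and at measures such as confidence.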
7
Classification: Supervised Learning
8
Regression: Supervised Learning
9
Clustering: Unsupervised Learning (Clustering Algorithms)
10
Anomaly Detection
11
Association Rule: Market Basket Analysis
12
Summarization: Know your data
13
Pipeline for Data Mining
Data Preprocessing → Feature Engineering → Model Training → Testing → Prediction
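A minimal sketch of these stages in scikit-learn might look like the following. The data set (iris), the scaler, and the logistic regression model are assumptions chosen for illustration; the slides do not prescribe any of them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data: the iris data set bundled with scikit-learn (an assumption for this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Preprocessing (feature scaling) followed by a model, chained in one object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)                   # model training
test_accuracy = pipe.score(X_test, y_test)   # testing on held-out data
predictions = pipe.predict(X_test[:5])       # prediction for a few samples
print(test_accuracy, predictions)
```

Wrapping the steps in a `Pipeline` means the scaler is fitted on the training data only and then reused unchanged at test and prediction time.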
14
Linus Torvalds: “Talk is cheap. Show me the code.”
15
Python Ecosystem
16
Jupyter Notebook
A notebook contains both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.).
17
Scikit-Learn
18
http://peekaboo-vision. blogspot
19
Like this graph? More here:
20
Scikit-learn: simple and consistent API
Instantiate the model: m = Model()
Fit the model: m.fit(train_data) (supervised models also take the training labels: m.fit(train_data, train_y))
Predict: predict_y = m.predict(test_data)
Evaluate: m.score(test_data, target_y)
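As a concrete, hedged instance of the four steps, here is a sketch with one real estimator. `GaussianNB` and the iris data are arbitrary choices made for this example; `train_y` is a name introduced here for the training labels. Note that in scikit-learn, `score` takes the test features and the true labels, and computes the predictions internally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed example data: iris, split into training and test portions.
X, y = load_iris(return_X_y=True)
train_data, test_data, train_y, target_y = train_test_split(
    X, y, test_size=0.25, random_state=42
)

m = GaussianNB()                          # instantiate the model
m.fit(train_data, train_y)                # fit the model on training data
predict_y = m.predict(test_data)          # predict labels for unseen data
accuracy = m.score(test_data, target_y)   # evaluate: mean accuracy on the test set
print(accuracy)
```

Swapping `GaussianNB` for another classifier leaves the remaining lines unchanged, which is the point of the consistent API.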
21
Classification: k-nearest neighbors (K-NN)
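A short k-NN sketch, assuming the iris data and k = 5 (both are illustrative choices, not taken from the slides). A new sample is classified by a vote among the labels of its k closest training samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k = 5 is an arbitrary choice for this sketch.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)

# Inspect the 5 training samples closest to the first test sample.
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]
print(knn_accuracy, neighbor_labels)
```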
22
Decision tree
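A decision tree can be tried the same way. The sketch below assumes the iris data and a depth limit of 2 so that the learned rules stay readable; neither choice comes from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2) keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the learned if/else splits as plain text.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```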
23
Clustering: k-means
Given a data set where each observed example has a set of features, but no labels
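A minimal k-means sketch on unlabeled data. The two synthetic blobs of points are made up for this example, and the number of clusters (2) is assumed to be known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two hypothetical blobs of unlabeled 2-D points.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Only features are passed to the model; there are no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index assigned to each point
centers = kmeans.cluster_centers_    # one centroid per cluster
print(centers)
```

The cluster indices are arbitrary: which blob is called cluster 0 or cluster 1 carries no meaning.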