Data Mining 101 with Scikit-Learn An informal introduction to data mining Shuhan Yuan sy005@uark.edu
What is data mining? Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. (https://en.wikipedia.org/wiki/Data_mining) Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. (Data Mining: Concepts and Techniques)
What is data mining? [Figure: a naïve view of data mining: Data → Data Mining → Knowledge (Models); also called knowledge discovery from data] http://hanj.cs.illinois.edu/bk1/
Six common classes of tasks
Classification: generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Clustering: discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
Regression: finding a function that models the data with the least error, that is, estimating the relationships among data or datasets.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
Anomaly detection (outlier/change/deviation detection): identifying unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modelling): searching for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
https://en.wikipedia.org/wiki/Data_mining
Classification Supervised Learning https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Regression Supervised Learning https://quantdare.com/machine-learning-a-brief-breakdown/ https://medium.com/simple-ai/linear-regression-intro-to-machine-learning-6-6e320dbdaf06
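As a minimal sketch of regression as supervised learning, the example below fits a straight line with scikit-learn's LinearRegression. The data set is synthetic and is an assumption made purely for illustration (y = 3x + 2 plus a little noise); it does not come from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 3x + 2 plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # one feature, 100 examples
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=100)

# Fit the function that minimizes squared error on the training data.
reg = LinearRegression()
reg.fit(X, y)

slope = reg.coef_[0]
intercept = reg.intercept_
print("slope:", slope, "intercept:", intercept)
```

The recovered slope and intercept should land close to the values used to generate the data.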
Clustering Unsupervised Learning Clustering Algorithms https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/ https://apandre.wordpress.com/visible-data/cluster-analysis/
Anomaly Detection http://machine-learning-class-notes.readthedocs.io/en/latest/lecture16.html http://amid.fish/anomaly-detection-with-k-means-clustering
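One simple way to illustrate anomaly detection is a z-score rule: flag any record that lies more than a few standard deviations from the mean. This is a rough sketch, not the method from the linked pages; the data, the injected outlier, and the threshold of 3 are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed data: 200 "normal" measurements plus one injected unusual record.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=200)
values = np.append(values, 10.0)               # the injected anomaly, at index 200

# Distance of each record from the mean, in units of standard deviation.
z_scores = np.abs(values - values.mean()) / values.std()

# Flag records beyond the (assumed) threshold of 3 standard deviations.
anomalies = np.flatnonzero(z_scores > 3.0)
print("indices flagged as anomalous:", anomalies)
```

The rule assumes roughly bell-shaped data in one dimension; it needs enough records that a single outlier does not dominate the standard deviation.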
Association Rule Market Basket Analysis http://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html https://blogs.adobe.com/digitalmarketing/analytics/shopping-for-kpis-market-basket-analysis-for-web-analytics-data/
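Market basket analysis rests on two counts: how often items appear together (support) and how often a rule "A implies B" holds when A is present (confidence). The sketch below computes both in plain Python for item pairs. The five baskets are made up for illustration, and the helper names support and confidence are hypothetical, not part of any library.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets (an assumption for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

n = len(transactions)
item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    # Sort so each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))

def support(a, b):
    """Fraction of baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / n

def confidence(antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

print(support("diapers", "beer"))      # 3 of 5 baskets
print(confidence("diapers", "beer"))   # 3 of the 4 diaper baskets
print(confidence("beer", "diapers"))   # all 3 beer baskets
```

Algorithms such as Apriori extend this counting to larger item sets while pruning combinations whose support is too low.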
Summarization Know your data https://generalassemb.ly/blog/the-best-topical-data-visualizations-of-2015/
Pipeline for Data Mining Data Preprocessing Feature Engineering Model Training Testing Prediction
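The stages on this slide can be sketched end to end with scikit-learn. The choices below are assumptions made for illustration: the Iris data set stands in for "Data", StandardScaler stands in for the preprocessing and feature step, and logistic regression is the model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Preprocessing and model chained together, so the scaler is fit on training data only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model training.
pipe.fit(X_train, y_train)

# Testing: accuracy on the held-out examples.
test_accuracy = pipe.score(X_test, y_test)
print("test accuracy:", test_accuracy)

# Prediction on a new, unlabeled example (measurements in cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = pipe.predict(new_flower)
print("predicted class:", predicted[0])
```

Wrapping the steps in one pipeline object keeps the test data out of the preprocessing fit, which is an easy mistake to make when the steps are run by hand.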
Linus Torvalds: “Talk is cheap. Show me the code.” http://www.skilledup.com/articles/become-software-engineer
Python Ecosystem
Jupyter Notebook Notebooks contain both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.).
Scikit-Learn http://scikit-learn.org/stable/
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Like this graph? More here: https://unsupervisedmethods.com/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6 http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
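Scikit-learn Simple and consistent API
Instantiate the model: m = Model()
Fit the model: m.fit(train_data, train_target) (unsupervised models take only train_data)
Predict: predict_y = m.predict(test_data)
Evaluate: m.score(test_data, target_y) (score takes the test inputs and the true labels, and runs the prediction itself)
https://medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6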
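To make the "consistent API" point concrete, the sketch below runs the same four calls on two unrelated classifiers. The data set (Iris) and the two models (GaussianNB and SVC) are arbitrary choices made for illustration; any scikit-learn classifier could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
train_data, test_data, train_target, target_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for m in (GaussianNB(), SVC()):                 # instantiate the model
    m.fit(train_data, train_target)             # fit on training inputs and labels
    predict_y = m.predict(test_data)            # predict labels for unseen inputs
    # score() takes the test inputs and the true labels; for classifiers it is accuracy.
    scores[type(m).__name__] = m.score(test_data, target_y)

print(scores)
```

Only the line that constructs the model changes between algorithms, which makes it cheap to try several models on the same data.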
Classification: k-nearest neighbors (K-NN) http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/
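A minimal k-NN sketch on handwritten digits, using scikit-learn's bundled 8x8 digits data set. The value k = 3 and the 75/25 split are assumptions for illustration, not tuned settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Each example is an 8x8 grayscale image flattened to 64 features; labels are digits 0-9.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# k-NN stores the training set; a new image gets the majority label
# among its k closest training images (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

knn_accuracy = knn.score(X_test, y_test)
print("test accuracy:", knn_accuracy)
```

Because k-NN does no real work at fit time, prediction cost grows with the size of the training set.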
Decision tree
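A decision tree classifies an example by following a sequence of threshold tests on its features. The sketch below trains a shallow tree on the Iris data set and prints the learned rules as text; the data set and the depth limit of 2 are assumptions chosen to keep the printed tree small.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the learned rules stay short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the tree as nested if/else rules over the named features.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)

train_accuracy = tree.score(iris.data, iris.target)
print("training accuracy:", train_accuracy)
```

The printed rules are the model itself, which is why decision trees are often chosen when the result has to be explained to people.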
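Clustering: k-means Given a data set where each observed example has a set of features but no labels, k-means groups the examples into k clusters, assigning each example to the cluster with the nearest centroid. http://stanford.edu/~cpiech/cs221/handouts/kmeans.html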
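A short k-means sketch on unlabeled points. The data are synthetic blobs generated around three centers picked for illustration, and k = 3 is assumed to be known in advance; in practice k has to be chosen.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated groups (assumed for illustration). The generated
# labels are discarded: clustering only sees the features.
true_centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X, _ = make_blobs(n_samples=300, centers=true_centers, cluster_std=0.5, random_state=0)

# k-means alternates between assigning points to the nearest centroid
# and moving each centroid to the mean of its assigned points.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
centers = km.cluster_centers_

print("cluster sizes:", np.bincount(labels))
print("centroids:\n", centers)
```

The cluster numbers themselves are arbitrary: cluster 0 in one run may be cluster 2 in another, so compare centroids rather than label values.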