Prepared by Kimberly Sayre and Jinbo Bi
Why Python? Programming languages like R and Python really shine when it comes to working with large amounts of data. They’re fast and efficient. R has more to offer in terms of data analysis and algorithm selection, but some say it has a steeper learning curve. Python is a popular scientific language and a “rising star” for machine learning. It has plenty of easy-to-use tools and frameworks and an active community. Image taken from: http://machinelearningmastery.com/best-programming-language-for-machine-learning/
What is Scikit? Scikit-learn is a framework that provides simple and efficient tools for data mining and data analysis. It is accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib (Python libraries), and it is open source and commercially usable. [http://scikit-learn.org/] Scikit-learn has an AMAZING website with TONS of documentation. There are many examples of full-length scripts that are easy to understand (even if you are not familiar with Python). It also has a very large library of different algorithms. You can do classification, regression, clustering, dimensionality reduction, model selection and preprocessing.
Installing Scikit Learn Scikit-learn requires: Python (>= 2.6 or >= 3.3), NumPy (>= 1.6.1), SciPy (>= 0.9). The Scikit website has very good installation instructions with different options. Check out the website [http://scikit-learn.org/stable/install.html] to find an option that works best for you. For this PowerPoint, we’ll be using an interactive environment called IPython, which is included, along with Scikit learn, in a third-party distribution called Anaconda [https://www.continuum.io/downloads]. Anaconda ships with a recent version of scikit-learn, in addition to a large set of scientific Python libraries, for Windows, Mac OS X and Linux. *NOTE* You can download Anaconda with either Python 2 or Python 3. The two versions have some syntactic differences. This PowerPoint uses Python 2.
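Once the install is done, a quick way to confirm that everything is in place is to print the installed versions from a notebook cell. This is a minimal sketch, not part of the original slides; the `versions` dictionary is just an illustrative name, and the code is written so that it should run under either Python 2 or Python 3.

```python
import sys

import numpy
import scipy
import sklearn

# Collect the version strings of Python and the three core libraries.
versions = {
    "python": sys.version.split()[0],
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

# Print them in a stable order so they are easy to compare
# against the minimum versions listed on the Scikit install page.
for name in sorted(versions):
    print(name + " " + versions[name])
```

If any of the imports fails, that library is missing from the environment the notebook is running in.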
IPython, also known as Jupyter, is a powerful interactive shell. It has support for interactive data visualization and it is very flexible. It is easy to use for analytics and high performance computing. IPython notebook is included in the Anaconda package and it runs in your browser. *NOTE* A command prompt will open when you are using IPython notebook. Keep this open. It acts as the notebook’s server on your computer while you’re using IPython.
IPython Sample: Good Each block is known as a cell, in which you can write one or more lines of Python code. You can then click run or hit ctrl+enter to execute the cell. If there is output, it will be displayed below that cell. If there is a grey border around the cell then you are in “command mode”. This allows you to create new cells above or below the current cell by hitting a (above) or b (below). You can navigate between cells using the arrow keys. Hit enter to return to “edit mode” (shown by a green border around the cell) and you can type code within the cell again. You go back into “command mode” by hitting esc. You can save your notebook and then come back to it at a later time. You can also download other notebooks from places like GitHub.
IPython Sample: Bad Indenting is important in Python!!! *NOTE* Indenting is VERY important in Python. IPython notebook is nice enough to highlight stuff in red if you don’t indent properly.
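To make the indentation point concrete, here is a small sketch that is not taken from the original slides. The function `describe` and the string `bad_source` are made-up names for illustration. The badly indented code is kept inside a string and passed to the built-in `compile`, so the cell still runs and simply records the error Python raises.

```python
# Correctly indented: each nested block is indented one level further.
def describe(numbers):
    labels = []
    for n in numbers:
        if n % 2 == 0:
            labels.append("even")
        else:
            labels.append("odd")
    return labels

result = describe([1, 2, 3])
print(result)

# Badly indented: the body of the for loop is not indented at all.
bad_source = "for n in [1, 2, 3]:\nprint(n)\n"

try:
    compile(bad_source, "<bad cell>", "exec")
    indent_error = None
except IndentationError as err:
    # Python refuses to run the code and reports an IndentationError.
    indent_error = err

print(type(indent_error).__name__)
```

Typing the contents of `bad_source` directly into a cell and running it would stop with the same kind of error.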
Machine Learning Example: Iris Dataset Let’s take a look at a machine learning example with Scikit using an available dataset. We’re going to use the iris dataset. Each iris species shows a strong relationship between the length and width of its petals and sepals (the leaf-like parts that sit beneath the petals). The dataset includes petal length, petal width, sepal length, sepal width and the species name. You can use machine learning to predict the species of an iris from these attributes, which makes it a good fit for an easy supervised learning task. http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
Loading Data With Scikit, you import individual modules rather than an entire package. Here we’re importing the iris dataset and then viewing the data.
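The loading step probably looks something like the sketch below. The slide’s own code was a screenshot and is not reproduced here, so the variable name `iris` and the choice to print only the first five rows are assumptions.

```python
from sklearn.datasets import load_iris

# load_iris returns a container object holding the data and its metadata.
iris = load_iris()

# Each row is one flower; each column is one measurement in centimetres.
print(iris.data[:5])
```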
Here we’re looking at the different attributes of the iris dataset
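A sketch of how these attributes can be inspected, assuming the dataset was loaded with `load_iris` into a variable called `iris` (the name is an assumption, since the slide’s code is not reproduced here):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measurement columns.
print(iris.feature_names)

# Species are stored as integer codes; target_names maps code -> species name.
print(iris.target_names)

# One integer code per flower.
print(iris.target)

# Number of rows (flowers) and columns (measurements).
print(iris.data.shape)
```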
Data Requirements with Scikit (1) Features and response are separate objects. (2) Features and response should be numeric. (3) Features and response should be NumPy arrays. (4) Features and response should have specific shapes.
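These requirements can be checked directly in a cell. The sketch below uses the iris data as the example; the variable names (`X`, `y`, `is_array`, `is_numeric`, `shapes_match`) are illustrative, and reading “specific shapes” as a 2-D feature array plus a 1-D response with one entry per row is an assumption based on how Scikit estimators are normally used.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Requirement 1: features and response live in separate objects.
X = iris.data
y = iris.target

# Requirement 3: both are NumPy arrays.
is_array = isinstance(X, np.ndarray) and isinstance(y, np.ndarray)

# Requirement 2: both hold numeric values.
is_numeric = bool(np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number))

# Requirement 4: X is 2-D (rows x features), y is 1-D with one entry per row.
shapes_match = X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]

print(is_array)
print(is_numeric)
print(shapes_match)
```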
Here we’re creating our feature matrix and response vector, then loading the KNN class and instantiating a KNN object.
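A minimal sketch of this step is shown below. The names `X`, `y` and `knn` follow common Scikit convention but are assumptions here, and `n_neighbors=5` is an illustrative value; the setting used on the original slide is not known.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
X = iris.data

# Response vector: one species code per flower.
y = iris.target

# Instantiate the estimator; n_neighbors is the "K" in KNN (value assumed here).
knn = KNeighborsClassifier(n_neighbors=5)
print(knn)
```

Nothing has been learned yet at this point. The object only stores its settings until it is fitted.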
Fitting our model with the iris data. Feel free to play around and explore! There are plenty of different aspects you can tune and try. If you’d like to see the script in its entirety, check out this link [http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html]
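Fitting and predicting could look roughly like this. It is a sketch under assumptions: `n_neighbors=5` and the two flowers in `new_flowers` are made-up illustrative values, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data
y = iris.target

# Fit the model: for KNN this stores the training data for later lookups.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Two hypothetical flowers: sepal length, sepal width, petal length, petal width.
new_flowers = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]

# predict returns one species code per input row.
predicted = knn.predict(new_flowers)

# Translate the integer codes back into species names.
predicted_names = [str(iris.target_names[code]) for code in predicted]
print(predicted_names)
```

Changing `n_neighbors` and re-running the cell is an easy way to start exploring how the model behaves.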
Plotting with Scikit Scikit’s website has really good examples. If you want to take a look into plotting with KNN (or plotting with Scikit in general), check out their page, which provides sample plots and completed scripts with descriptive comments. The link on the previous slide gives an example of this.
Uploading Your Own Data You’re able to use your own datasets with Scikit as well. Let’s say we have the CSV file shown above (yes, this is a CSV file for the iris data). *NOTE* All values are numeric.
You can easily import your own data, but you need to make sure you follow the requirements from slide 11, which is done in the script above. From there you can do as you like with your data.
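One way to do this is sketched below, assuming a CSV with four numeric feature columns followed by a numeric species code. The slide’s own script and file are not reproduced here, so the rows in `csv_text` are illustrative sample values, and an in-memory string stands in for a real file so that the example is self-contained. With a file on disk, you would pass its path to `np.loadtxt` instead.

```python
import io

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative CSV contents: four measurements, then a numeric species code.
csv_text = u"""5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.4,3.2,4.5,1.5,1
6.3,3.3,6.0,2.5,2
5.8,2.7,5.1,1.9,2
"""

# Read everything into one numeric NumPy array.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Split into separate feature and response objects with the expected shapes.
X = table[:, :4]
y = table[:, 4].astype(int)

# From here the data can be used like the built-in iris dataset.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
prediction = knn.predict([[5.0, 3.4, 1.5, 0.2]])
print(prediction)
```

`n_neighbors=1` is used only because the sample has so few rows. A real dataset would usually call for a larger value.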
Resources IPython/Scikit tutorial (an amazing tutorial by a guy who reminds me of Sheldon from Big Bang Theory): https://www.youtube.com/watch?v=IsXXlYVBt1M Python tutorial: https://www.dataquest.io/ Anaconda: https://www.continuum.io/downloads Scikit-learn website: http://scikit-learn.org/stable/