In the last post, we discussed the basic theory behind machine learning. In this article, we will look at how to get started with machine learning using the Python programming language and the scikit-learn library.
Python
Python is one of the most popular programming languages in machine learning. It is an interpreted, high-level, general-purpose, object-oriented programming language. It was created by Guido van Rossum and first released in 1991. Python is dynamically typed, meaning we do not have to specify data types explicitly.
There are tons of great resources available for free on the internet to get started with Python. One I find interesting is the YouTube channel Socratica, whose video series on Python is really good.
Scikit-Learn
Scikit-learn is a machine learning library for the Python programming language. A library is a collection of frequently used modules and functions that helps us write code faster and with fewer errors. Scikit-learn features various machine learning algorithms for both supervised and unsupervised problems. It also ships with some datasets on which we can practice implementing these algorithms. Apart from the algorithms themselves, scikit-learn comes with utilities for transforming data into the required form and modules for model evaluation. It is a complete package for those who want to get started with machine learning without diving into too many details.
Getting Started with scikit-learn
To install scikit-learn, follow the instructions on the installation page of the scikit-learn website.
Type the following command in the command prompt:
pip install -U scikit-learn
OR
python -m pip install -U scikit-learn
We can also run this command from within a Jupyter Notebook itself:
!pip install -U scikit-learn
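To confirm that the installation worked, a quick check like the one below can help. It is only a minimal sketch: it imports the package and prints its version, and the variable name installed_version is chosen here for illustration.

```python
# Sanity check: the import fails with ImportError if scikit-learn is not installed.
import sklearn

# __version__ holds the installed release as a string.
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import raises an error, the package was most likely installed into a different Python environment than the one running the code.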
Using scikit-learn for Machine Learning
In this tutorial, we will use the famous iris dataset to learn how we can use scikit-learn in different stages of machine learning. We will be building a classification model.
Step 1: Loading the Required Dataset
Data can be loaded from various sources such as a database, a CSV file, or a REST API. For this tutorial, we will use the datasets module provided by the scikit-learn library. The datasets module contains some famous datasets, which are useful for those who are just starting to learn how to use the library.
from sklearn.datasets import load_iris
iris = load_iris()
X=iris.data
y=iris.target
In the code block above, we loaded the iris dataset into a variable called ‘iris’, then stored the feature vectors in the X variable and the targets (labels) in the y variable. If we check the first five values in X and y, we get:
X[:5]
=> array([[5.1, 3.5, 1.4, 0.2],
          [4.9, 3. , 1.4, 0.2],
          [4.7, 3.2, 1.3, 0.2],
          [4.6, 3.1, 1.5, 0.2],
          [5. , 3.6, 1.4, 0.2]])
y[:5]
=> array([0, 0, 0, 0, 0])
As we can see, X is a two-dimensional array in which each row holds four values. According to the dataset description, they are:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
The y variable contains three distinct numbers: 0, 1, and 2. Each of these numbers corresponds to a type of iris plant:
- 0 = Iris-Setosa
- 1 = Iris-Versicolor
- 2 = Iris-Virginica
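The same information can be read directly from the loaded dataset object. The sketch below prints the shape of the feature array, the feature names, and the class names, and then maps the first five numeric labels to their names. The variable first_labels is introduced here only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 150 samples, each described by 4 features.
print(X.shape)

# Names of the four feature columns and the three classes.
print(iris.feature_names)
print(iris.target_names)

# Translate the numeric labels of the first five samples into class names.
first_labels = [str(iris.target_names[label]) for label in y[:5]]
print(first_labels)
```

Note that scikit-learn stores the class names in lowercase and without the "Iris-" prefix.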
Step 2: Make Required Transformations in the Dataset
The data that we get may not be in a form that is suitable for loading into the model. We may have to perform various actions such as handling NULL values, normalizing the data, converting categorical text data into numbers, and so on. The specific process depends on factors like the type of problem being solved, the requirements of the machine learning algorithm being used, and the characteristics of the data. For now, we will just normalize the data. Normalization, as scikit-learn uses the term here, is the process of scaling each individual sample (row) to have unit norm; by default the L2 norm is used.
from sklearn import preprocessing
X = preprocessing.normalize(X)
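To see what this normalization does, we can compute the length (L2 norm) of every row after the transformation. The sketch below assumes the default settings of preprocessing.normalize; the names X_normalized and row_norms are introduced here for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X = load_iris().data

# By default, normalize rescales each row (sample) to unit L2 norm.
X_normalized = preprocessing.normalize(X)

# Euclidean length of every row after normalization.
row_norms = np.linalg.norm(X_normalized, axis=1)
print(row_norms[:5])
```

Every row norm should come out as 1 (up to floating-point rounding), while the shape of the data stays the same.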
Step 3: Split the Dataset into Train and Test set
In this step, the dataset is split into two sets: one for training the model (the train set) and one for evaluating it (the test set). Evaluation compares the actual labels in the test set with the predictions made by the trained model.
There is no hard and fast rule for splitting the dataset into train and test sets; however, 70 to 80 percent of the data is typically used for training the model and 20 to 30 percent for testing. We will use 80 percent of the data for training and 20 percent for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)
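A quick way to check the split is to look at the shapes of the resulting arrays. The sketch below repeats the split on the raw iris data (skipping normalization, since it does not change the shapes) and prints the sizes of both sets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 20 percent of the 150 samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=10
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 samples, this gives 120 training samples and 30 test samples. Fixing random_state makes the split reproducible between runs.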
Step 4: Build the Model
We will be building a decision tree model. The decision tree is a supervised machine learning algorithm. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on a feature, each branch represents the decision taken on the basis of the test and each leaf node denotes the output.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=2)
Here, we have set the parameter “max_depth” to 2. Parameters like “max_depth” are called hyperparameters. There is no specific rule for setting these values; however, there is an entire domain called hyperparameter tuning, which consists of various ways to find the optimum hyperparameter values that make our model robust.
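One common tuning approach is a grid search with cross-validation, which scikit-learn offers as GridSearchCV. The sketch below is only an illustration, not the method used in the rest of this tutorial: the candidate depths in param_grid, the five folds, and the fixed random_state are assumptions chosen for the example, and normalization is skipped for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=10
)

# Candidate values for max_depth (illustrative, not a recommendation).
param_grid = {"max_depth": [1, 2, 3, 4, 5]}

# Try every candidate with 5-fold cross-validation on the training data only.
search = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)  # depth with the best mean cross-validation accuracy
print(search.best_score_)   # that mean accuracy
```

The test set is deliberately left out of the search so that it can still serve as unseen data for the final evaluation.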
Step 5: Train the Model
Training the model means learning a set of rules from the training data. These rules are then used to predict the output for newly provided inputs (X values).
model.fit(X_train,y_train)
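Once fitted, the model can predict the class of a single new flower. The sketch below is a simplified illustration: it trains on the full normalized dataset, fixes random_state, and reuses the measurements of the first iris row as a stand-in for a new sample (the name new_flower is hypothetical). The new sample has to go through the same normalization as the training data.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = preprocessing.normalize(iris.data)

model = DecisionTreeClassifier(max_depth=2, random_state=10)
model.fit(X, iris.target)

# One sample: sepal length, sepal width, petal length, petal width (in cm).
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# Apply the same transformation that was applied to the training data.
new_flower = preprocessing.normalize(new_flower)

predicted_label = model.predict(new_flower)[0]
print(predicted_label, iris.target_names[predicted_label])
```

predict expects a two-dimensional array, which is why the single sample is wrapped in an extra pair of brackets.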
Since we are using the decision tree algorithm we can visualize the rules created by our model.
from sklearn import tree
import matplotlib.pyplot as plt
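# list(...) passes the class names as a plain list, which some scikit-learn versions require
tree.plot_tree(model, class_names=list(iris.target_names),
               feature_names=iris.feature_names)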
plt.show()
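When a plot is not convenient, for example in a terminal session, the same rules can be printed as text with export_text from sklearn.tree. The sketch below trains on the full, unnormalized iris data and fixes random_state purely to keep the example short and reproducible, so the thresholds it prints will differ from those in the plot above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

model = DecisionTreeClassifier(max_depth=2, random_state=10)
model.fit(iris.data, iris.target)

# Render the learned decision rules as an indented text tree.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a test on one feature, and the lines starting with "class:" are the leaf nodes holding the predicted label.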
Step 6: Model Evaluation
The idea behind machine learning is to create a model that learns a generalized rule from old data (the train data) and uses it to make predictions on new data. To evaluate our model, scikit-learn provides various metrics. For classification, we can check the accuracy score using the test data.
The test data was not seen by the model during the training phase. We pass the test data to the trained model to make predictions and then compare those predictions against the true labels.
# Making Prediction on the Test Data
y_predict = model.predict(X_test)
# Comparing the predicted result against the test data
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test,y_predict)
print("The Accuracy of the Model is: {:3g} %".format(score*100))
>> The Accuracy of the Model is: 86.6667 %
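Accuracy is a single number, so it does not show which classes get confused with each other. As a possible next step, the sketch below rebuilds the pipeline from the earlier steps and prints a confusion matrix and a per-class report. The explicit labels list and the variable name cm are additions made for this example.

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = preprocessing.normalize(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.2, random_state=10
)

model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
labels = [0, 1, 2]
cm = confusion_matrix(y_test, y_predict, labels=labels)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(
    y_test, y_predict, labels=labels, target_names=list(iris.target_names)
))
```

The diagonal of the matrix counts the correct predictions; any off-diagonal entry is a flower of one class that was predicted as another.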
Step 7: Model Deployment
Once we are happy with the evaluation results, the next step is to deploy the model in the production environment.
We will discuss how to deploy our model in upcoming articles.
In this article, we discussed how one can get started with hands-on machine learning using the Python programming language and the scikit-learn library.