Machine Learning Using Python


Introduction:

Machine Learning is the art (or technique, really) of making a “computational contraption”, that is, a computer, perform actions based on trends found in several sets of relevant, processed data fed into it.

There are two primary ways of performing machine learning – supervised learning and unsupervised learning. Here we are going to focus mostly on supervised learning, but we will also go through a bit of unsupervised learning for the sake of completeness. There is actually a third option nowadays, semi-supervised learning, and we will glance over that as well in the course of this blog.

In the course of this blog, I will also discuss some of the mathematical basis of machine learning. Engineering and science students, or people with similar backgrounds, will be able to understand the concepts quite easily. For others, I will describe these things as best as I can, but a bit of undergraduate-level math would be very useful here.

Machine Learning is nowadays used in almost every aspect of life. We may not perceive it, as most of it works behind the scenes. For example, consider the ads placed by Google or Amazon. When you log in to your account, specific ads are selected by a program that keeps track of your habits and interests (e.g., if you buy a paperback thriller from Amazon, it will subsequently start showing you books of the same genre the next time you log in). Various law enforcement agencies (such as the police, paramilitary and military signals departments) use machine learning to find inappropriate content in a suspect's Facebook or Twitter accounts, as well as in voice messages exchanged between the suspect and the person they called. This kind of information helps them foil sabotage and attack plans and keeps us safe from the harm criminals could otherwise have done.

Supervised Learning:

In some cases, a set of data input into the system forms the basis of the “decision” the system makes when another set of one or more data points is fed into it, after the initial set of data has been appropriately processed. We call such methods “supervised learning”, and the initial set of data is called the “training set”. In such a case, the system is programmed to find the coefficients of a function that best fits the data given to it. Once this function is established, the system can provide the value of the dependent variable for values of the independent variable(s) that lie within the range covered by the training set, as well as beyond its last set of coordinate values. These types of functions (and algorithms based on them) are excellent at finding values by either interpolation or extrapolation of the fitted function. We will see some examples of these algorithms later in this article, and find out how efficient they are compared to other algorithms. Some of these algorithms are called “Regression Methods”, but supervised learning also applies very well to classification problems and certain other types of problems (clustering, by contrast, is usually treated as an unsupervised task).

As I wrote earlier, I am assuming 12th-standard knowledge of mathematics and a little familiarity with partial differentiation on the part of the readers. If you think you need to brush up on that, please feel free to do so and come back to this blog once you are confident with the topics mentioned above.

We will start by looking at some of the mathematical basis of Machine Learning, which will involve the sort of math I indicated in the previous paragraph. In this scheme of things, we will also cover a bit of “deep learning” (a term popularized by Geoffrey Hinton). Specifically, we will look at face recognition systems using some available libraries, and we will also point out the major pros and cons of those libraries. (By the way, the term “machine learning” was coined by Arthur Samuel, a leading expert in the fields of computer gaming and AI.)

Unsupervised Learning:

In the case of unsupervised learning, the system is fed a set of (unlabelled) data, and it has to compute something like the density of a given type of data, find some pattern in the dataset, and so on. This type of approach is good when you have a lot of data and the data itself “says” something about its own properties. We will check this out only briefly here, as unsupervised learning is a vast topic and it won't be possible to go through all of it in this article.
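Just to give a flavour of what “the data speaking about its own properties” might look like, here is a minimal, illustrative sketch using scikit-learn's KMeans to group some made-up two-dimensional points; the points and the number of clusters are assumptions chosen purely for this example, not something we will build on later:

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two rough groups
points = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 7.7], [8.3, 8.1]])

# No labels are given; KMeans groups the points purely by how close they are to each other
kmeans = KMeans(n_clusters=2, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster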

Where does Python fit into this scheme of things?

Well, as it turns out, we can do a lot of machine learning tasks using various Python modules that serve diverse aspects of machine learning. In this blog we will specifically look at the following modules/libraries: Dlib, TensorFlow and Multi-task Cascaded Convolutional Networks (MTCNN). There are a lot of other approaches to be considered, but we won't be able to touch them in this short space. Each of the following deserves a blog write-up by itself, so I will possibly take them up one by one later:

  1. Decision Tree Learning
  2. Association Rule Learning
  3. Artificial Neural Network
  4. Support Vector Machines (SVM)
  5. Clustering
  6. Bayesian Networks
  7. Genetic Algorithms
  8. Rule Based Machine Learning

The modules we will be using and taking a look at are i) Scikit-learn/SciPy/NumPy, ii) TensorFlow, iii) Keras, iv) PyTorch, and v) Theano.

The problem we will be solving here is the prediction of cumulative stock prices, and we will do a bare-bones implementation of it using each of the libraries above. We will take our data from the Yahoo Finance historical data for Nifty, for the period December 01, 2017 to November 30, 2018. To train our model, we will use the data from December 01, 2017 to September 30, 2018. We will then use this model to test whether we can get close to the figures in the data provided by Yahoo. A copy of the data in CSV format can be found at the end of this blog.

Another problem we will be solving here is a classification problem on the data in the file named “Crimes – 2001_to_present.csv”. We will take a chunk of records as the training dataset and treat it as crime data. The rest of the data will then be matched against it, so that we can figure out which individuals frequented which places after a crime has been committed.

The above data might well reveal a pattern, but we shall see that later. In order to copy the data from one file to another, we open the file in MS Excel (or OpenOffice Calc) and cut the data from a location that contains approximately 25% of the records. Once we paste the collected data into another file, the first part of the task is complete.

The second part involves converting the data into a form that is readable by humans.
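If you would rather not do the cut-and-paste by hand, here is a small, hedged sketch of how the same split could be done with pandas; the output file names and the 25%/75% split are assumptions made for illustration, so adjust them to your own files:

import pandas as pd

# Read the original file (replace the name with your own CSV if it differs)
df = pd.read_csv("Crimes - 2001_to_present.csv")

# Use roughly the first 25% of the records as the training chunk
# and keep the remaining 75% aside as the data to be matched later
cutoff = int(len(df) * 0.25)
df.iloc[:cutoff].to_csv("crimes_training.csv", index=False)
df.iloc[cutoff:].to_csv("crimes_rest.csv", index=False)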

By the way, Python provides a wide array of modules that help implement machine learning – some of them are “numpy”, “pandas”, “scipy”, “scikit-learn”, “matplotlib”, etc. Scikit-learn is the core machine learning module here, while the others support it by preparing the data and converting it into a form that scikit-learn can use. In order to do machine learning in Python application development, we need to install these modules (if you haven't done so already). You can use pip to install all of them, and once they have been installed you can use “pip freeze” to check the versions of the installed modules. I strongly suggest that you do all of this in a Python virtual environment, since if something goes wrong all you need to do is delete the virtual environment and start afresh. You can create a virtual environment in a directory of your choice by running the following commands:

$ pip install virtualenv
$ cd /path/to/the/directory/where/you/intend/to/create/the/virtualenv
$ virtualenv mlenv   # Here mlenv is our machine learning virtual environment.

We have named our virtual environment “mlenv”, but you can go for any other name you like.

The next thing that you need to do is activate the virtual environment. This is done in the following way:

$ source mlenv/bin/activate

Once you run the above command, your command prompt will reflect the name of the virtualenv you created. At this point you may start installing the necessary modules (mentioned above) using “pip install ...”. Once you have installed all the modules, run “pip freeze” to check their versions.
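For example (the package list below simply mirrors the modules mentioned above; the exact versions you see will depend on when you run this):

$ pip install numpy pandas scipy scikit-learn matplotlib
$ pip freeze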

The real fun will start now.

Let us get the hang of the numpy module first. This is the module that will hold the processed data, and it will also be used to extract the results after the data has been processed by the machine learning program. Let's start with a program named “helloml.py”. Please put the following import statements inside the program. We will get back to it in a bit; first we need to get an understanding of how math helps us do the work we are about to perform.

import numpy as np
import pandas
import matplotlib as matplt
import scipy
import sklearn   # the scikit-learn package is imported under the name "sklearn"
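Before we move on to the math, here is a tiny, illustrative taste of what numpy gives us – n-dimensional arrays and fast element-wise operations, which is exactly what we will use to hold our training data later. The numbers below are made up purely for this example:

import numpy as np

prices = np.array([10550.0, 10612.5, 10580.25])  # made-up closing prices
print(prices.mean())   # average of the values
print(prices * 2)      # element-wise arithmetic on the whole array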

The Math Behind Machine Learning and AI

There are two types of problems that we normally try to solve using machine learning. One is the regression problem, where the algorithm is supposed to find the value of a dependent variable using the trend it shows for various values of the independent variable(s); the other is the classification problem, which we will come to later. For example, take a look at the following equation:

hθ(x) = θ0 + θ1x

In the above equation, hθ(x) is the dependent variable, whereas x is the independent variable; θ0 and θ1 are the parameters. Clearly, the above equation is a straight line, and hence it defines a linear regression. The number of independent variables may be more than one (indeed, in any practical scenario it IS more than one), and we call that multivariate regression. You don't need to memorize these terms, but just so you know, multivariate is a fancy term for more than one independent variable. Our goal is to find the values of the parameters θ0 and θ1 so that the line we fit matches the training data well.

Of course, we need not restrict ourselves to straight lines. We may consider quadratic or cubic functions too, if they serve our problem better. But the main purpose of our activity is to optimize the parameters so that the function we are using is capable of giving us reasonably accurate predictions. It would not be possible to go through the underlying math of each of these models within the scope of this blog, so I am leaving it here. I will come up with another blog (maybe quite soon) that discusses only the math behind machine learning algorithms. (Beware: it will be a very lengthy blog.)
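To make the idea of “optimizing the parameters” a little more concrete, here is a minimal, illustrative sketch of fitting θ0 and θ1 to a handful of made-up (x, y) points by gradient descent on the usual squared-error cost. The data, learning rate and iteration count are all assumptions chosen only for this example:

import numpy as np

# Made-up training data that roughly follows y = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

theta0, theta1 = 0.0, 0.0   # start with arbitrary parameter values
alpha = 0.01                # learning rate
m = len(x)

for _ in range(5000):
    predictions = theta0 + theta1 * x
    error = predictions - y
    # Partial derivatives of the mean squared error cost w.r.t. each parameter
    theta0 -= alpha * error.sum() / m
    theta1 -= alpha * (error * x).sum() / m

print(theta0, theta1)   # should come out close to 2 and 3 for this data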

In the next section, we will be taking a look at some python code to train with our dataset. Later on, we will use this to make predictions and validate those predictions with the Nifty data that we have from the Yahoo finance website.

Next, we will take a look at the UCI data and create a classification script to predict whether the “stabf” field is stable or unstable.

Let's see some Python code for Linear Regression:

The following code takes the Nifty data for the past one year (retrieved from Yahoo Finance). We will try to predict the “Volume” field in the data. We have taken approximately 180 records out of the 245 records as the training set and considered the rest of the data as test data. A point to note here is that there is almost no correlation between the other attributes and “Volume”; we have taken this data simply to demonstrate how linear regression should be implemented. Anyway, here is the code for it:

import os, sys
import pandas as pd
import numpy as np
from sklearn import linear_model

# Get current directory
curdir = os.getcwd()
# We will have the training_nifty.csv file to read
training_file = curdir + os.path.sep + "training_nifty.csv"
# We also have the test data that we will need to verify our prediction results.
test_file = curdir + os.path.sep + "test_nifty.csv"

if __name__ == "__main__":
    training_df = pd.read_csv(training_file)
    training_df.fillna(0.0, inplace=True)  # Fill NaN values with 0
    label = training_df['Volume']
    features = training_df.drop(["Date", "Volume"], axis=1)
    features.fillna(0.0, inplace=True)
    training_regr = linear_model.LinearRegression()
    try:
        training_regr.fit(features, label)
    except Exception:
        print("LABEL:", label, "Error: %s" % str(sys.exc_info()[1]))
        sys.exit()
    np.set_printoptions(precision=2)
    # Get the test data in which we will predict the 'Volume' attribute
    test_df = pd.read_csv(test_file)
    test_features = test_df.drop(["Date", "Volume"], axis=1)
    test_features.fillna(0.0, inplace=True)
    print(training_regr.predict(test_features))

As you can probably see above, we basically do a “fit” to get the data to fit an equation, so that we can use it to predict the outcome when we have similar data. Other than that, the code is fairly straightforward. You need a good understanding of numpy and pandas to follow what has been done here. Since you are interested in machine learning, I am assuming that you have already glanced at those tools; if not, now is an excellent time to do so.
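If you want a rough feel for how far off these predictions are, one way (not shown in the script above) is to compare them against the actual “Volume” column of the test file with a standard metric. Here is a small, hedged sketch along those lines, reusing the variables from the script above:

from sklearn.metrics import mean_absolute_error

actual = test_df["Volume"].fillna(0.0)
predicted = training_regr.predict(test_features)
print(mean_absolute_error(actual, predicted))  # average absolute error in the predicted Volume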

Next, we will look at a program that classifies data – in this case, the UCI data. We have used 7,000 out of the 10,000 records as the training set; the remaining 3,000 records are our test data. We have used a basic yet reasonably accurate algorithm, Gaussian Naive Bayes, and it has worked well with our data. The accuracy is over 90%, so we may call it quite effective. Here is the code for it:

import os
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Get current directory
curdir = os.getcwd()
# We will have the UCI_training.csv file to read
training_file = curdir + os.path.sep + "UCI_training.csv"
# We also have the test data that we will need to verify our prediction results.
test_file = curdir + os.path.sep + "UCI_test.csv"

if __name__ == "__main__":
    training_df = pd.read_csv(training_file)
    training_df.fillna(0.0, inplace=True)  # Fill NaN values with 0
    label = training_df["stabf"]
    training_features = training_df.drop("stabf", axis=1)
    gnb = GaussianNB()  # Using Gaussian Naive Bayes. Will try other algorithms later.
    model = gnb.fit(training_features, label)
    test_df = pd.read_csv(test_file)
    test_features = test_df.drop("stabf", axis=1)
    preds = gnb.predict(test_features)
    test_data_id = 2  # The line from which the test data starts; the first line holds the headers.
    for p in preds:
        print("Entity #%s: %s" % (test_data_id, p))
        test_data_id += 1
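The 90%-plus accuracy figure mentioned above can be checked with a couple of extra lines. This is a hedged sketch that assumes the test CSV also carries the true “stabf” values (which it does, since the script above drops that column before predicting):

from sklearn.metrics import accuracy_score

true_labels = test_df["stabf"]
print("Accuracy: %.2f%%" % (accuracy_score(true_labels, preds) * 100))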

Again, there is not much in the code that you won't understand if you read it line by line. I do not want to spoon-feed these things, as that will make you dependent on someone else for such tasks, and that is not a good thing in the software industry. Take your time to understand it, but try to do it entirely by yourself (with help from Google, of course).

You may use any other algorithm to solve this problem – maybe random forests or k-nearest neighbours, or some other method – but your overall construct will remain the same. It would also be good if you tried other algorithms and played around with them, as sketched below.
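For instance, keeping the same construct and only swapping the model, a k-nearest-neighbours version might look like the following; the choice of 5 neighbours is an assumption, not something tuned for this dataset:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(training_features, label)        # same training features and label as in the script above
knn_preds = knn.predict(test_features)   # same test features as in the script above
print("KNN accuracy: %.2f%%" % (accuracy_score(test_df["stabf"], knn_preds) * 100))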

When I started writing this blog, I thought I would touch upon as much material as possible, but after six pages it seems that to do so I would need at least sixty more. Machine Learning is a vast topic (and I am primarily interested in the math and algorithm design), and writing a blog on just that topic would require about 20-25 pages.

However, I will definitely come back to the topic some time in the future and write a part 2 of this blog, which will delve deeper into its theoretical aspects. It is a fascinating and addictive topic.

In my work experience, I have implemented ML systems for various law enforcement agencies in this country (face detection, and number-plate detection for cars travelling faster than the maximum permissible limit on the stretch of road in question), and part 2 of this blog will be dedicated to those aspects.

This blog is just an initiation document. It is by no means authoritative material; I have simply tried to arouse your interest in learning more about machine learning and AI. I think the best way to start off is to take a basic ML course from Coursera or Udemy, and then read up on specific topics from various sources on the internet. I wish you all the best on this journey.