Machine learning is the technique of making a computer (a “computational contraption”, if you like) perform actions based on the trends it finds in sets of related, processed input data. There are two primary ways to perform machine learning: supervised learning and unsupervised learning.
Here we are going to focus mostly on supervised learning, but we will also go through a bit of unsupervised learning, for the sake of completeness. There is a third option nowadays, and that is semi-supervised learning. We will glance over that as well in the course of this blog.
In the course of this blog, I will also discuss some of the mathematical basis of machine learning. Engineering and science students, or people with similar backgrounds, will be able to understand the concepts quite easily. For others, I will describe these things as clearly as I can, but some understanding of undergraduate-level math would be very useful here.
Machine learning is nowadays used in almost every aspect of life. We might not notice these uses, as most of them work behind the scenes. For example, consider the ads placed on Google or Amazon. When you log in to your account, specific ads are selected by a program that keeps track of your habits and interests (e.g., if you buy a paperback thriller from Amazon, it will start showing you books of the same genre the next time you log in).
Various law enforcement agencies (like the police, the paramilitary and the military signals department) use machine learning to find inappropriate content in a suspect's Facebook or Twitter accounts, as well as in voice messages exchanged between the suspect and the person who received the call. This kind of information helps them foil sabotage and attack plans and keeps us safe from the harm criminals could otherwise have done.
In some cases, a set of data that is input into the system forms the basis of the “decision” the system makes when further data is input later, after the initial set has been appropriately processed. We call such methods “supervised learning”, and the initial set of data is called the “training set”. In such a case, the system is programmed to find the coefficients of a function that best fits the data given to it. Once this function is established, the system can easily provide the values of the dependent variable for inputs that lie within the range covered by the training set, as well as for inputs beyond the last coordinate values provided in the training set.
These types of functions (and the algorithms based on them) are excellent at finding values that have to be obtained by either interpolation or extrapolation of the given function. We will see some examples of these algorithms later in this article, and find out how efficient they are in comparison to other algorithms. Some of these algorithms are called “Regression Methods”. Supervised learning also applies very well to classification problems and certain other types of problems (clustering, by contrast, is usually treated as an unsupervised problem).
As I wrote earlier, I am assuming that readers have 12th standard knowledge of mathematics and a little knowledge of partial differentiation. If you think you need to brush up on that knowledge, please feel free to do so, and come back to this blog once you are confident with the above-mentioned topics.
We will start by looking at some of the mathematical basis of machine learning. It will involve the sort of math I indicated in the previous paragraph. In this scheme of things, we will also cover a bit of “deep learning” (a term popularized by Geoffrey Hinton). Specifically, we will look at a face recognition system built with some available libraries, and we will also point out the major pros and cons of those libraries. (By the way, the term “machine learning” was coined by Arthur Samuel, a leading expert in the field of computer game development and AI.)
In the case of unsupervised learning, the system is given a set of unlabeled data, and it has to compute something like the density of a given type of data or find some pattern in the dataset. This type of approach is good when you have a lot of data and the data itself “tells” you about its properties. We will check this out only briefly here, as unsupervised learning is a vast topic, and it won't be possible to go through all of it in this article.
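To make the idea of “finding some pattern” a little more concrete, here is a minimal sketch of k-means clustering on one-dimensional data, written in plain Python. The function name kmeans_1d, the way the starting centers are chosen, and the sample numbers are all made up for illustration; real projects would normally use a library implementation.

```python
def kmeans_1d(values, k, iterations=20):
    """Group 1-D values into k clusters with plain k-means (illustrative only)."""
    # Assumption: start from k evenly spaced points of the sorted data.
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centers = [ordered[round(i * step)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # nothing moved, so the grouping is stable
        centers = new_centers
    return centers, clusters


# Made-up, unlabeled measurements that visibly form three groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]
centers, clusters = kmeans_1d(data, k=3)
print("centers:", centers)
print("clusters:", clusters)
```

Note that no labels are supplied anywhere: the groups come out of the data alone, which is what separates this from the supervised methods discussed above.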
Where does Python fit into this scheme of things?
Well, as it turns out, we can do a lot of machine learning tasks using various Python modules that serve diverse aspects of machine learning. In this blog, we will specifically look at the following modules/libraries: Dlib, TensorFlow and Multi-task Cascaded Convolutional Networks (MTCNN). There are a lot of functional approaches to be considered, but we won't be able to touch on them in this short space. Each of the following deserves a blog writeup by itself, so I will possibly take them up one by one at a later time.
- Decision Tree Learning
- Association Rule Learning
- Artificial Neural Network
- Support Vector Machines (SVM)
- Bayesian Networks
- Genetic Algorithms
- Rule Based Machine Learning
The modules we will be using and taking a look at are i) Scikit-learn/SciPy/NumPy, ii) TensorFlow, iii) Keras, iv) PyTorch, and v) Theano.
The problem we will be solving here is the prediction of cumulative stock prices, and we will do a barebones implementation of it using each of the libraries above. We will take our data from the Yahoo Finance historical data for Nifty, from December 01, 2017, to November 30, 2018. To train our model, we will use the data from December 01, 2017, to September 30, 2018. Then we will use this model on the remaining data and see if we can get close to the figures provided by Yahoo. A copy of the data in CSV format can be found at the end of this blog.
Another problem we will solve here is a classification problem, using the data in the file named “Crimes – 2001_to_present.csv”. We will take a chunk of the records as the training dataset, and consider this data as crime data. The rest of the data will be matched against it so that we can figure out which individuals frequented which places after a crime was committed.
The above data might well provide us with a pattern, but we shall see that later. To copy the data from one file to another, we open the file in spreadsheet software such as MS Excel (or OpenOffice Calc) and cut a block that contains approximately 25% of the data. We then paste the collected data into another file, and the first part of the task is complete.
The second part involves converting the data into a form that can be read by humans.
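The date-based split described above can be sketched with nothing but the standard library. The column names (“Date”, “Close”, “Volume”), the YYYY-MM-DD date format and the four sample rows are assumptions with placeholder numbers, not real Nifty figures; the actual file may be laid out differently.

```python
import csv
import io
from datetime import date, datetime

# Placeholder rows standing in for the real CSV file (values are made up).
SAMPLE_CSV = """Date,Close,Volume
2018-09-27,100.0,1000
2018-09-28,101.0,1100
2018-10-01,102.0,1200
2018-10-03,103.0,1300
"""

TRAIN_END = date(2018, 9, 30)  # last training day named in the text


def split_by_date(csv_text, train_end):
    """Return (train_rows, test_rows), split on the Date column."""
    train, test = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["Date"], "%Y-%m-%d").date()
        # Rows up to and including train_end train the model; later rows test it.
        (train if day <= train_end else test).append(row)
    return train, test


train_rows, test_rows = split_by_date(SAMPLE_CSV, TRAIN_END)
print(len(train_rows), "training rows;", len(test_rows), "test rows")
```

Splitting by date rather than at random keeps the test data strictly later than the training data, which matches how a price model would be used in practice.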
By the way, Python offers a wide range of modules to help implement machine learning. Some of them are “numpy”, “pandas”, “scipy”, “scikit-learn” and “matplotlib”. Scikit-learn is the core machine learning module, while the others help by loading the data, converting it into a form that scikit-learn can use, and plotting the results. To do machine learning in Python, we need to download these modules (if you haven't done that already).
You can use pip to install all the modules, and once they have been installed, you can use “pip freeze” to check the versions of the installed modules. I strongly suggest that you do all this in a Python virtual environment, since if something goes wrong, all you need to do is delete the virtual environment and start afresh. You can create a virtual environment in a directory of your choice with a command such as “python3 -m venv mlenv” (or “virtualenv mlenv”, if you have the virtualenv package installed).
We have named our virtual environment “mlenv”, but you can go for any other name. Time to show your whims.
The next thing that you need to do is activate the virtual environment. On Linux or macOS this is typically done with “source mlenv/bin/activate”, and on Windows with “mlenv\Scripts\activate”.
Once you run the above-mentioned command, your command prompt will show the name of the virtual environment you created. At this point, you may start installing the necessary modules (mentioned above) using “pip install ...”. Once you have installed all the modules, run “pip freeze” to check the versions of everything you have installed.
The real fun will start now.
Let us get the hang of the NumPy module first. It is the module that will hold the processed data, and we will possibly also use it to extract the results after the data has been processed by the machine learning program. Let's start with a program named “helloml.py”. Please put the following import statements inside the program. We will get back to it in a bit. First, we need to get an understanding of how math helps us do the work that we are going to perform.
The Math Behind Machine Learning and AI
There are two types of problems that we normally try to solve using machine learning. One is the regression problem, where the algorithm is supposed to find the value of a dependent variable using the trend that the dependent variable shows for various values of the independent variable. The other is the classification problem, where the algorithm has to assign each input to one of a fixed set of categories. As an example of regression, take a look at the following equation:
hθ(x) = θ0 + θ1x
In the above equation, hθ(x) is the dependent variable, whereas x is the independent variable. θ0 and θ1 are the parameters. The above equation is a straight line, and hence it defines a linear regression. The number of independent variables may be more than one (indeed, in any practical scenario, it IS more than one), and we call that multivariate regression. You don't need to memorize these terms, but just so you know, multivariate is a fancy term for more than one independent variable. Our goal is to find the values of the parameters θ0 and θ1 so that the line we plot fits the training data well.
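For a single independent variable, the best-fitting θ0 and θ1 (in the least-squares sense) can be computed directly. The sketch below uses the standard closed-form formulas in plain Python; the function names fit_line and h and the four training points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (theta0, theta1) minimising squared error for h(x) = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of co-deviations of x and y) / (sum of squared deviations of x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def h(theta0, theta1, x):
    """The hypothesis: predicted value of the dependent variable at x."""
    return theta0 + theta1 * x


# Made-up training points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = fit_line(xs, ys)
prediction = h(theta0, theta1, 4.0)  # x = 4 lies beyond the training range
print("theta0 =", theta0, "theta1 =", theta1, "h(4) =", prediction)
```

Because the sample points sit exactly on a line, the fit recovers that line; with real, noisy data the same formulas give the line with the smallest total squared error.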
Of course, we need not restrict ourselves to straight lines. We may consider quadratic functions or cubic functions too if they serve us better for our problems. But the main purpose of our activity is to optimize the parameters so that the function we are using is capable of providing us with near accurate predictions. It would not be possible to go through the underlying math of each of these models in the scope of this blog, so I am leaving it here. I will come up with another blog (maybe quite soon), that will discuss only the math part of the machine learning algorithms. (Beware: It will be a very lengthy blog).
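One common way to optimize the parameters is gradient descent, which is where the partial differentiation mentioned earlier comes in. The following is a minimal sketch for the straight-line case, assuming the usual mean squared error cost J = (1/2n) Σ (hθ(x) − y)²; the learning rate, the number of steps and the data are illustrative choices, not tuned values.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit h(x) = theta0 + theta1 * x by repeatedly stepping down the cost gradient."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training point under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to theta0 and theta1.
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Update both parameters together, moving against the gradient.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Made-up training points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

gd_theta0, gd_theta1 = gradient_descent(xs, ys)
print("theta0 ~", round(gd_theta0, 4), "theta1 ~", round(gd_theta1, 4))
```

The same loop carries over to quadratic or cubic functions: only the expression for the prediction and the list of partial derivatives change.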
In the next section, we will take a look at some Python code that trains a model on our dataset. Later on, we will use this model to make predictions and validate those predictions against the Nifty data that we have from the Yahoo Finance website.
Next, we will take a look at the UCI data and create a classification script to predict whether the “stabf” field is stable or unstable.
Let's see some Python code for Linear Regression
The following code takes the data for Nifty for the past year (retrieved from Yahoo Finance). We will try to predict the “Volume” field in the data. We have taken approximately 180 records from the list of 245 records as a training set and we considered the rest of the data as test data. Now, the point to note here is that there is almost no correlation between the attributes and the “Volume”.
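A quick way to check how strongly an attribute is related to the target is the Pearson correlation coefficient, which ranges from -1 to 1 and is close to 0 when there is no linear relationship. Here is a small plain-Python sketch; the function name pearson and the three short columns are made up purely to show the two extremes, and are not taken from the Nifty file.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up columns: one target follows the feature exactly, the other does not.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
related = [2.0, 4.0, 6.0, 8.0, 10.0]
unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]

r_related = pearson(feature, related)
r_unrelated = pearson(feature, unrelated)
print("related:", r_related, "unrelated:", r_unrelated)
```

A value near zero for every attribute, as is the case here, is a warning that a linear model will not predict that target well, however carefully it is fitted.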
We have taken this data simply to display the method by which linear regression should be implemented. Anyway, here is the code for it:
As you can probably notice above, we do a “fit” to get the data to fit in an equation, so that we can use it to predict the outcomes should we have similar data. Other than that, the code is fairly straightforward. You need to have a good understanding of NumPy and pandas to understand what has been done here. Since you are interested in machine learning, I am assuming that you have already glanced at those tools. If not, now is an excellent time to do so.
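Once a model has been fitted, its predictions on the held-out test records need to be scored in some way. Two common measures are the mean absolute error and the R² score; the sketch below computes both in plain Python. The function names and the numbers are illustrative assumptions, not results from the Nifty data.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variation in the actual values that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up test targets and the values a fitted model might predict for them.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print("MAE:", mae, "R^2:", r2)
```

An R² close to 1 means the predictions track the test data closely, while a value near 0 (or below it) means the model does no better than always predicting the average.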
Next, we will display a program that classifies data. The data in question is the UCI data. We have used 7000 out of the 10000 records to serve as a training set. The remaining 3000 records are our test data. We have used a basic yet reasonably accurate algorithm named the “Gaussian Naive Bayes” algorithm, and it has worked well with our data. The percentage of accuracy is over 90%, and hence we may call it quite effective. Here is the code for it:
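To show what Gaussian Naive Bayes does under the hood, here is a minimal plain-Python sketch with the familiar fit/predict shape. It is not the scikit-learn implementation and not the original script: the class name TinyGaussianNB, the two numeric features and the ten sample records (seven for training, three for testing) are made up for illustration, with only the “stable”/“unstable” labels borrowed from the “stabf” field.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes for numeric features (illustrative only)."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.stats = {}
        for label, members in grouped.items():
            columns = list(zip(*members))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps every variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                for col, m in zip(columns, means)
            ]
            prior = len(members) / len(rows)
            self.stats[label] = (prior, means, variances)
        return self

    def _log_posterior(self, row, label):
        prior, means, variances = self.stats[label]
        score = math.log(prior)
        # "Naive" assumption: features are independent, so log-densities add up.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    def predict(self, rows):
        return [
            max(self.stats, key=lambda lab: self._log_posterior(row, lab))
            for row in rows
        ]


# Made-up records: 7 for training and 3 for testing (a 70/30 split).
train_X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1],
           [4.0, 5.0], [4.2, 4.8], [3.9, 5.1]]
train_y = ["stable"] * 4 + ["unstable"] * 3
test_X = [[1.0, 2.0], [4.1, 5.0], [0.9, 1.9]]
test_y = ["stable", "unstable", "stable"]

model = TinyGaussianNB().fit(train_X, train_y)
predictions = model.predict(test_X)
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print("predictions:", predictions, "accuracy:", accuracy)
```

The toy data is deliberately easy to separate, so the accuracy here says nothing about the figure quoted for the real dataset; the point is only the sequence of steps: split, fit, predict, compare.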
Again, there is not much in the code that you won't understand if you read it line by line. I do not want to spoon-feed these things, as that will make you dependent on a second person for purposes like these, and that is not a good thing in the software industry. Take your time to understand it, but try to do it entirely by yourself (and of course, with help from Google).
You may use any other algorithm to solve this problem, maybe using random forests or k-nearest neighbor, or yet some other method. But your major construct will remain the same. Also, it would be good if you can try other algorithms and play around with them.
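As an illustration of how the construct stays the same when the algorithm changes, here is a minimal plain-Python sketch of a k-nearest-neighbor classifier with the same fit/predict shape. The class name TinyKNN, the choice of k = 3 and the sample records are assumptions made for this example only.

```python
import math
from collections import Counter


class TinyKNN:
    """Minimal k-nearest-neighbor classifier (illustrative only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, rows, labels):
        # k-NN does no real training: it simply remembers the training set.
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict(self, rows):
        results = []
        for row in rows:
            # Rank the training points by Euclidean distance to the query row.
            ranked = sorted(
                zip(self.rows, self.labels),
                key=lambda pair: math.dist(pair[0], row),
            )
            # Majority vote among the k closest training points.
            votes = Counter(label for _, label in ranked[: self.k])
            results.append(votes.most_common(1)[0][0])
        return results


# Made-up records with two numeric features each.
train_X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1],
           [4.0, 5.0], [4.2, 4.8], [3.9, 5.1]]
train_y = ["stable"] * 4 + ["unstable"] * 3
test_X = [[1.0, 2.0], [4.1, 5.0], [0.9, 1.9]]

knn_predictions = TinyKNN(k=3).fit(train_X, train_y).predict(test_X)
print(knn_predictions)
```

Only the class being instantiated changes; the splitting of the data, the call to fit, the call to predict and the comparison with the test labels remain exactly as before.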
When I started writing this blog, I thought I would at least touch upon as much material as is possible, but after 6 pages, it seems to me that to do so, I need to have at least 60 more pages. Machine Learning is a vast topic (and I am primarily interested in math and algo design), but writing a blog on just that topic will require about 20-25 pages.
However, I will come back to the topic sometime in the future and write part 2 of this blog, which will delve deep into the theoretical aspect of it. It is a fascinating and addictive topic.
In my work experience, I have implemented ML systems for various law enforcement agencies in this country (face detection, and number plate detection for cars traveling faster than the maximum permissible speed for the stretch of road in question), and part 2 of this blog will be dedicated to those aspects.
This blog is just an initiation document. It is by no means authoritative material, and I have tried to arouse interest in you to learn more about machine learning and AI. I think the best way to start with Machine Learning and AI is to take a basic ML course from Coursera or Udemy, and then read up specific topics from various sources on the internet. I wish you all the best in this journey.