Image or Video Manipulation, Face Detection and Recognition Using Python


In the last few years, we have seen an explosive growth of Python in several fields: web application development, REST API creation, automation of various processes (both technical and non-technical), financial asset management and so on. Apart from these, Python has also become a de facto standard for creating applications with some type of pattern recognition capability. Two examples of this type of application are vehicle number plate detection/reading and face detection/recognition. Technology in these fields has come a long way, and even though I wouldn't call it absolutely mature, it is certainly good enough to yield high-quality results under the right conditions. In this article, I will evaluate a few such libraries and then conclude with a working example of facial recognition in Python.

The primary libraries that come to my mind when I consider the domain of facial detection and recognition in Python are OpenCV, face_recognition and facenet. While all of them use some form of deep learning under the hood (OpenCV's DNN module, for example, can load TensorFlow models), the easiest to use is face_recognition. Of course, the list certainly doesn't end here, but in my experience, quite a large number of face recognition solutions are based on OpenCV (or one of its variants).


Capabilities of OpenCV and its Python bindings:


So, let us consider OpenCV. This library has been around for a long time. Intel Research launched it in 1999 for some of its own projects and made it open source so that a community could grow around it and help advance the initiative further. It is essentially a C++ library, and most programmers who used it in its early days were strong C++ developers. That is no longer the case: Python bindings are now available (for both 2.x and 3.x), and creating and manipulating images is almost child's play with them (as long as the 'child' knows Python). For example, consider breaking a video up into its individual frames. Here is a little Python script that does just that:


import cv2

def FrameCapture(path):
    # Path to video file
    vidObj = cv2.VideoCapture(path)
    count = 0
    success = True
    frameslist = []
    while success:
        success, image = vidObj.read()
        if not success:
            break
        # Save the frame to disk and keep it in memory as well
        cv2.imwrite("frame%d.jpg" % count, image)
        frameslist.append(image)
        count += 1
    vidObj.release()
    return frameslist

if __name__ == '__main__':
    all_frames = FrameCapture("/home/supriyo/work/stream_frames/VID-20180912-WA0008.mp4")
    print("Extracted %d frames" % len(all_frames))

To run the above program, replace the path to the video file I am using in the script with the path to an mp4 file of your own. You also need to install the OpenCV library (on Ubuntu or Debian you can install the distribution's OpenCV packages with apt-get; on CentOS, use yum). Then install the Python bindings for OpenCV, which you should be able to do with pip (the package is called opencv-python).


The next program I am going to demonstrate is a bit more complex: it identifies faces using the face_recognition module. Firstly, you need to install the module named "face_recognition" using the command 'pip install face_recognition'. I would strongly suggest that you do this in a Python virtual environment. The 'face_recognition' module uses the dlib library, which is quite decent as far as recognition accuracy is concerned. OpenCV's accuracy is not as good, and in my experience you need to ensure that lighting conditions and image quality are good if you want to get a reliable match using OpenCV.
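
To give you a feel for how little code basic detection takes, here is a minimal sketch using face_recognition's face_locations function; the file name is a placeholder you would replace with a photograph of your own.

import face_recognition

# Hypothetical file name; replace it with a photograph of your own
image = face_recognition.load_image_file("group_photo.jpg")

# Each tuple is (top, right, bottom, left) in pixel coordinates
face_locations = face_recognition.face_locations(image)

print("Found %d face(s) in this photograph." % len(face_locations))
for top, right, bottom, left in face_locations:
    print("Face at top=%d, right=%d, bottom=%d, left=%d" % (top, right, bottom, left))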

How Face Recognition Works:

Before we get into the code, let us go through the basics of how facial recognition is done using dlib. We first need to train the system with some reference faces of the person(s) we are trying to identify. How does it work behind the scenes? When you supply the program with sample photographs, it derives 128 measurements (often loosely described as 128 points) from the face of each individual whose photograph has been provided as a sample. When you provide the program with a face that has to be matched against one of the sample images, it derives the same 128 measurements from the face in the test photograph. These measurements are called encodings, and they are stored as numpy arrays. Next, the program computes the Euclidean distance between the encodings of the two images (the sample and the test photograph), which yields a value between 0 and 1. A value of '0' means an exact match, whereas '1' means no match at all; lower values are better matches. This distance is compared against a threshold value, and since the computation yields a single number, it is up to the person trying to match images to decide whether to consider the value a match. In my experience, a distance below 0.4 may be considered a probable match. However, one point to note here is that this method also yields (quite) a few false positives. To help the person operating such a system, the program may be written in such a way that matches with distances greater than 0.4 are not listed. This allows the operator to concentrate on a small subset of images that are likely to match, thereby making the task easier.
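
To make this concrete, here is a minimal sketch of that comparison using the face_recognition API; the file names are placeholders, and face_distance returns the Euclidean distance described above.

import face_recognition

# Hypothetical file names; replace with your own sample and test photographs
sample_image = face_recognition.load_image_file("sample.jpg")
test_image = face_recognition.load_image_file("test.jpg")

# Encodings of the first face found in each image
sample_encoding = face_recognition.face_encodings(sample_image)[0]
test_encoding = face_recognition.face_encodings(test_image)[0]

# Euclidean distance between the two 128-dimensional encodings
distance = face_recognition.face_distance([sample_encoding], test_encoding)[0]

# 0.4 is the threshold suggested above; lower distances are better matches
if distance < 0.4:
    print("Probable match (distance = %.3f)" % distance)
else:
    print("Probably not a match (distance = %.3f)" % distance)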


Enough talk; let us look at some code. We will go through the important lines in detail, so don't worry if something is not immediately clear. The implementations of some functions are not shown here for brevity, but you should be able to understand what they do from their names and the explanations I provide.


import sys
import json
import operator
import numpy as np
import scipy.spatial.distance

# Reads a pickled file containing image encoding data as well as
# some metadata about the image itself.
unpickled_data = read_pickled_data()
all_known_encodings = []
image_to_encoding_dict = {}
sorted_score_img_list = []

for data_list in unpickled_data:
    image_name = data_list[0]      # This contains the path of the image.
    known_encoding = data_list[1]  # The face encoding stored for this image.
    cat_name = data_list[2]        # The category/subcategory of the image.
    if cat_name.lower() == category.lower():
        image_to_encoding_dict[image_name] = known_encoding
        try:
            all_known_encodings.append(known_encoding[0])
        except:
            pass

if len(all_known_encodings) == 0:
    face_found = False

try:
    # Distance between the requested encoding and every known encoding.
    sss = scipy.spatial.distance.cdist([requested_encoding[0]], all_known_encodings)
    wh = np.where(sss < float(threshold))
    indices = wh[1]
    if indices.size > 0:
        img_to_score_dict = {}
        j = 0
        while j < indices.size:
            index = indices[j]
            score = sss[0][int(index)]
            imgname = unpickled_data[int(index)][0]
            img_to_score_dict[imgname] = score
            j += 1
            if j >= 5:  # We will show the top 5 matches only
                break
        sorted_score_img_list = sorted(img_to_score_dict.items(), key=operator.itemgetter(1))
        face_found = True
except:
    face_found = False

imglistasstr = ""
list_of_dict_data = []
for elem in sorted_score_img_list:
    if len(elem) > 0:
        data_dict = {}
        imgfilepath = elem[0]
        imgfilepathparts = imgfilepath.split("/")
        imgname = imgfilepathparts[len(imgfilepathparts) - 1]
        imgfilepath = image_hosting_host + imgname
        imglistasstr += imgfilepath + ","
        data_dict['image'] = imgfilepath
        data_dict['id'] = nextid  # nextid is defined in code not shown here
        try:
            data_dict['score'] = elem[1]
        except:
            data_dict['score'] = "Unavailable - Error: '%s'" % sys.exc_info()[1].__str__()
    else:
        pass
    list_of_dict_data.append(data_dict)

matched_data = json.dumps(list_of_dict_data)

In the above code we store the face parameters in pickled files (which isn't a good idea if you want the system to scale at some point; the idea here is to describe the process, and I have used pickle simply because it is a convenient format for demonstration; a sketch of how such a pickle file might be produced appears at the end of this section). We pick up the unpickled data from the function "read_pickled_data" and iterate through the entries in it, each of which contains some metadata (like the image name and the category/subcategory to which the image belongs) as well as the encoding of the actual image in question. The main part comes after this. We append the encodings to a list named "all_known_encodings", which is then tested against the target "requested_encoding[0]" (defined earlier in code that is not shown here for brevity). The line


sss = scipy.spatial.distance.cdist([requested_encoding[0]],all_known_encodings)

is the crux of the program: it computes how closely the requested image's encoding matches each of the encodings in "all_known_encodings". To understand the following lines, you need some understanding of how numpy works. In the subsequent lines, a list of "indices" (the positions where the distance falls below the threshold) is extracted with np.where, the indices are iterated over, and every image at those positions is given a score based on the similarity of the facial features. We consider only the top 5 images here, since images beyond that are unlikely to have any significant similarity.
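
If the numpy/scipy mechanics are unfamiliar, here is a small, self-contained illustration with made-up 3-dimensional vectors (the real encodings are 128-dimensional):

import numpy as np
import scipy.spatial.distance

# A made-up "requested" encoding and three made-up "known" encodings
requested = np.array([0.1, 0.2, 0.3])
known = np.array([[0.1, 0.2, 0.3],    # identical  -> distance 0.0
                  [0.1, 0.2, 0.6],    # close      -> small distance
                  [0.9, 0.8, 0.7]])   # far away   -> large distance

# cdist returns a 1 x 3 matrix of pairwise Euclidean distances
sss = scipy.spatial.distance.cdist([requested], known)
print(sss)               # roughly [[0.0  0.3  1.08]]

# np.where gives the (row, column) positions where the distance is below the threshold
threshold = 0.4
wh = np.where(sss < threshold)
indices = wh[1]          # column indices of the matching known encodings
print(indices)           # [0 1]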


Finally, we do some bookkeeping and store the matches as a JSON string in the variable "matched_data".


I would suggest that you take a look at the numpy documentation to understand what I described above, since without that knowledge, it probably won't make much sense. I intend to write a blog on numpy in the near future and I hope it will give you an idea as to how numpy works and why we need it in cases such as the one mentioned above.
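
One piece I have not shown is how the pickled encoding data gets created in the first place. Here is a rough sketch of how such a file could be produced with face_recognition and pickle; the file names, the sample list and the category labels are my own assumptions for illustration, not code from the system above.

import pickle
import face_recognition

# Hypothetical list of (image path, category) pairs used as reference faces
sample_images = [
    ("/path/to/samples/person1.jpg", "employees"),
    ("/path/to/samples/person2.jpg", "employees"),
]

pickled_data = []
for image_path, category in sample_images:
    image = face_recognition.load_image_file(image_path)
    # face_encodings returns a list with one 128-dimensional encoding per detected face
    encodings = face_recognition.face_encodings(image)
    # Store the same [image name, encoding, category] triples the main program expects
    pickled_data.append([image_path, encodings, category])

with open("face_encodings.pkl", "wb") as f:
    pickle.dump(pickled_data, f)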

Pitfalls of this 128 Point Scheme

Well, first of all, this 128 point scheme of face recognition has its own issues. The primary one is that it yields a lot of "false positives": it will match images that may look very different to the human eye (and hence are not a match at all), but because the data fits the conditions imposed by the algorithm, they show up as matches. This is a serious concern for government law enforcement agencies, which collect images of thousands of individuals and try to match them against a set that may contain a similar number of images. It becomes next to impossible to look at every match report and figure out whether the results are correct. Of course, this is still easier than manually comparing every image against the target set, but it remains a difficult job. Hence, over the years, several refinements have been made. Notably, keeping the threshold value at 0.2 (or somewhere near it) does help in this scenario, at the cost of missing some genuine matches.


Personally, I have a hypothesis, though I haven't tested it yet. What if we took more points at certain parts of the face (the eyes, ears, nose and lips, for example), computed the distances between them, and then compared the ratios of those distances with the corresponding ratios on the target faces? It has been established scientifically that the distance between two points on one of those facial features does change with time, but the ratios would remain the same (perhaps not exactly the same, but within about 0.1%, which is fair enough). I will explain this in much greater detail in a subsequent post, when I have some data and code to prove (or disprove) the point.
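
As a starting point for experimenting with this idea, here is a rough sketch using face_recognition's face_landmarks function; the choice of landmarks, the ratio computed and the file names are my own assumptions for illustration, not an established method.

import math
import face_recognition

def point_distance(p1, p2):
    # Euclidean distance between two (x, y) landmark points
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def eye_nose_ratio(image_path):
    image = face_recognition.load_image_file(image_path)
    landmarks = face_recognition.face_landmarks(image)[0]  # landmarks of the first face found
    # Distance between the inner corners of the eyes (hypothetical choice of points)
    eye_dist = point_distance(landmarks["left_eye"][3], landmarks["right_eye"][0])
    # Length of the nose bridge
    nose_dist = point_distance(landmarks["nose_bridge"][0], landmarks["nose_bridge"][-1])
    return eye_dist / nose_dist

# Compare the ratio across two photographs of (possibly) the same person
ratio1 = eye_nose_ratio("person_photo_2010.jpg")   # hypothetical file names
ratio2 = eye_nose_ratio("person_photo_2020.jpg")
print("Ratios: %.3f vs %.3f" % (ratio1, ratio2))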

Conclusion:

We are moving towards an age when we will be watched almost 24 hours a day and our activities will be logged on a server somewhere in the world. While that would possibly make the world a more secure place, privacy will be compromised. But there is a silver lining: the entities looking at our data will mostly be machines, unless we go and rob a bank or harm a fellow human being, using whatever technique and for whatever reason. Such culprits will be easier to find and trace, and that is a huge bonus for the human race in general. To get something, we need to give something up; in this case, we lose privacy but gain security. Personally, I feel it is a good trade, but I am sure a lot of people think otherwise. However, that time has not yet arrived, and until then let's just enjoy the remaining days of our freedom from the watchful eyes of those machines.