Computer vision has long been a key driver of artificial intelligence. But despite decades of work and proliferating applications, machines still have a lot to learn from the human brain.
Teaching machines to see hasn’t been easy. Though machine learning methods have recently brought a significant improvement in computer vision, the problem of developing machines that can understand images has not been fully solved. Machines remain inferior to humans when it comes to many visual interpretation skills. While in some ways machines have significant advantages — such as the recognition of hidden patterns and the manipulation of large amounts of data — recognizing human faces in particular remains a difficult task.1
It’s not really clear where the next improvements in computer-vision development will come from. Today the biggest breakthroughs stem from the development of multilayer neural networks. But these advances, while exciting, have their own limitations: Neural-network approaches remain empirical and nontransparent, which makes further improvement difficult. Deep neural networks provide some understanding of images but do not help us master what’s been called the understanding of understanding.
Consider mobile apps like Google Goggles or Samsung’s Bixby, which provide information and feedback on whatever a smartphone camera is “viewing.” A multilayered system drives these apps. First, they distinguish among different objects that appear in the image — a process called segmentation. Second, they separate objects and examine each of them. Third is the core of the recognition process, a database of images and three-dimensional models, which are mathematical representations of 3-D surfaces. Partial 3-D models gleaned from the image are compared with models in the database. Then database and statistical analysis may be able to tell us the theme of the image and produce a text description: Is it a busy road or a natural landscape?
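The control flow of such a pipeline can be sketched schematically. The helper functions below are hypothetical placeholders rather than any vendor's actual API, and each would hide substantial machinery in a real system.

```python
# Schematic sketch of the multilayered pipeline described above. The helper
# functions are hypothetical placeholders, not any vendor's actual API.

def segment(image):
    """Step 1: split the image into regions that may contain objects."""
    return [image]  # placeholder: treat the whole image as a single segment

def recognize(region):
    """Steps 2-3: match a region against a database of images and 3-D models."""
    return {"label": "unknown object", "confidence": 0.0}  # placeholder result

def describe(recognitions):
    """Final step: turn the recognized objects into a text description."""
    return "Scene containing: " + ", ".join(r["label"] for r in recognitions)

def understand(image):
    regions = segment(image)
    return describe(recognize(r) for r in regions)

# A tiny stand-in "image" (a 2x2 grid of pixel intensities).
print(understand([[0, 255], [255, 0]]))
```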
Most of the information humans possess comes from their visual senses: their ability to see, process visual information, store and retrieve it from their memories, and understand what they are seeing. Not surprisingly, developing the means for machines to see like humans was one of the earliest goals of artificial intelligence (AI). In fact, the very development of AI, particularly machine learning methods such as neural networks and deep learning, has occurred with computer vision in mind. Yet for all of the high-profile successes of recent machine learning programs, which have mastered complex board games like Go and chess, a robot equipped with computer vision still struggles to compete with a three-year-old human at recognizing and understanding what’s in front of its nose.
Nonetheless, computer-vision applications are proliferating. Computer vision is being used not only to identify and analyze information from images or 3-D objects but also to understand content and context. Today you can find computer-vision algorithms in facial-recognition systems for passport control and security, object recognition for optical and electron microscopes, motion tracking, human emotion recognition, medical diagnosis, autonomous vehicles and drones.
Many of these applications are tied to the development of the internet of things, which connects sensors in devices that can send and receive data across the internet. Computer vision is a metafield that makes use of various technologies to provide a major component of intelligent machines.
Machine learning plays an important role in current computer-vision technologies. A fundamental hurdle of computer vision has been identifying objects in an image.2,3 Machine learning algorithms are quite good at categorizing and classifying objects, and they can recognize more than single objects in isolation. Effective computer vision requires making a clear distinction between background and foreground. This can be done using clustering methods, which are among the basic tools of machine learning. A higher level involves analyzing an image by its parts, treating it as a whole and creating a description of it. The starting point might be a vector that encodes the objects and the relations among them, such as position, orientation, time (if it’s a video) and other features that provide hints about context. This process resembles human behavior: First, our brain identifies the objects in context; then it theorizes about what’s happening, drawing on experience from similar situations.
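To make the clustering idea concrete, here is a minimal sketch that separates foreground from background by grouping pixel intensities into two clusters with k-means. The image is synthetic, and the approach is illustrative rather than a production segmentation method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic grayscale image: a dark background with a brighter square "object".
rng = np.random.default_rng(0)
image = rng.normal(loc=50, scale=10, size=(64, 64))
image[20:44, 20:44] += 120  # brighter foreground region

# Cluster every pixel intensity into two groups (background vs. foreground).
pixels = image.reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
mask = labels.reshape(image.shape)

# Make sure label 1 corresponds to the brighter (foreground) cluster.
if image[mask == 1].mean() < image[mask == 0].mean():
    mask = 1 - mask

print("Foreground pixels found:", int(mask.sum()))  # roughly 24 * 24 = 576
```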
Machine learning offers effective methods for image classification, image recognition and search, facial detection and alignment, and motion monitoring, while also providing tools for synthesizing new vision algorithms.
Basics of Computer Vision
Image processing uses a family of algorithms dedicated to transforming images (pictures or video). The simplest computer-vision algorithms make photo zooming and rotation applications possible; more sophisticated transformations, such as edge sharpening, rely on techniques like optimal filtering, frequency analysis, autocorrelation analysis and other signal-processing algorithms. (A typical image-processing task is choosing an Instagram filter for a photo.) From a mathematical perspective, most of these algorithms deal with an image represented as a matrix of pixel intensity values. If an image is colored, we have three matrices that correspond to red, green and blue intensities (other color encodings exist). For three-dimensional images, we deal with 3-D matrices (cubes) and their minimal elements, voxels. Each 3-D model is a mathematical representation of a 3-D surface, built from images of objects taken from different perspectives. A characteristic of human vision is the ability to construct a 3-D view from 2-D perspectives.
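To make the matrix view concrete, the sketch below treats a synthetic color image as three intensity matrices and applies a simple sharpening kernel by convolution, one of the basic filtering operations mentioned above.

```python
import numpy as np
from scipy.ndimage import convolve

# A color image is three matrices of pixel intensities (red, green, blue).
rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(128, 128, 3)).astype(float)
print("Shape (height, width, channels):", image.shape)

# A common 3x3 sharpening kernel: boost the center pixel, subtract its neighbors.
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)

# Apply the kernel to each color channel independently.
sharpened = np.stack(
    [convolve(image[:, :, c], kernel, mode="nearest") for c in range(3)],
    axis=-1,
)
sharpened = np.clip(sharpened, 0, 255)
print("Sharpened image shape:", sharpened.shape)
```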
Computer vision is a collection of methods for understanding images. If we think of a computer-vision algorithm as a black box, it takes an image as an input and makes decisions that produce an output, which may aid in facial recognition, microchip-defect detection or the diagnostic reading of X-ray images. Most of these tasks can be considered classification problems, such as “Whose face is in the photo?” or “What is the defect in this microchip?” We can think about the classification tasks of image comprehension in the following way:
1. Is a target object present in the image? For example, is it a photo of a cat?
2. Which target object is present in the image? Is the animal in the photo a cat or a dog? In this case, we need an algorithm that can distinguish between the two classes of images.
These may seem like rudimentary tasks, but they are challenging to machine learning algorithms. The first task can be more difficult; we may have lots of images of cats, but we don’t exactly know how to define “not a cat.” This so-called noncompact class requires special approaches to describe or distinguish its members. The second task is easier for machine learning training because both sets of key images — dog and cat — represent well-defined classes. There will be lots of cat and dog photos for training. These are known as compact classes. Drawing a distinction between them is relatively straightforward.
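The distinction can be illustrated with scikit-learn, assuming the images have already been reduced to feature vectors (the vectors below are synthetic stand-ins). The noncompact “not a cat” class calls for a one-class model trained on cats alone, while the compact cat-versus-dog problem is an ordinary two-class classification.

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(2)

# Hypothetical feature vectors (e.g., texture and shape descriptors).
cats = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
dogs = rng.normal(loc=3.0, scale=1.0, size=(200, 5))

# Task 1: "Is a cat present?" -- the 'not a cat' class is noncompact,
# so we train a one-class model on cats only and flag everything else.
cat_detector = OneClassSVM(nu=0.05, kernel="rbf").fit(cats)
print("Cat sample flagged as cat?", cat_detector.predict(cats[:1])[0] == 1)
print("Dog sample flagged as cat?", cat_detector.predict(dogs[:1])[0] == 1)

# Task 2: "Cat or dog?" -- both classes are compact, so a standard
# two-class SVM can be trained on labeled examples of each.
X = np.vstack([cats, dogs])
y = np.array([0] * len(cats) + [1] * len(dogs))  # 0 = cat, 1 = dog
classifier = SVC(kernel="rbf").fit(X, y)
print("Prediction for a dog-like sample:", classifier.predict(dogs[:1])[0])
```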
A Brief History of Computer Vision
Work on computer vision began in the 1960s with the earliest attempts at robotic development.4 One of the first practical applications was to teach a machine to read numbers and letters (this was particularly applicable to post office needs). Other early applications involved X-ray image processing and three-dimensional reconstruction of images from computer tomography and magnetic resonance imaging.
In 1975, Japanese electrical engineering professor Kunihiko Fukushima introduced his cognitron, a mathematical model that included layers of connected artificial neurons of two types: excitatory cells and inhibitory cells, just like in a human brain. This model is known as a self-organizing multilayered artificial neural network (ANN). The next significant step was to connect computer vision and machine learning. In 1980, Fukushima introduced a model of human vision known as the neocognitron; it was based on an ANN, with a special architecture designed to recognize objects that had shifted in an image.5 His system had several layers of artificial neurons that helped to recognize handwritten characters.
The neocognitron algorithm groups neurons into two types of clusters: S-cells and C-cells. S-cells are responsible for recognizing image patterns like lines, corners, combinations of corners and full characters. C-cells are responsible for locating these patterns and thus enable their identification no matter how they’ve shifted within the image. In other words, the neocognitron will not be confused if the character appears not in a corner but in the center of the image.
This model still has great influence on modern computer-vision research, with many achievements in image recognition based on its multilayer neural networks.
An important breakthrough occurred in 1988, when Yann LeCun from AT&T Bell Laboratories suggested applying a learning algorithm known as backpropagation of error to the neocognitron’s architecture. Backpropagation is a step-by-step, layer-by-layer correction of neuron weights in a neural network, based on the network’s current output error. LeCun’s backpropagation neural network is known as the convolutional neural network (CNN) and remains one of the most popular tools for advanced automated image recognition. This is what we mean by training a neural network; typically, when we talk about deep learning, we’re referring to this and similar approaches. The word “deep” refers to the depth of the layer hierarchy.
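A minimal sketch of the idea, written in modern PyTorch rather than LeCun’s original implementation and trained on random stand-in data, shows how convolutional layers and backpropagation fit together:

```python
import torch
import torch.nn as nn

# A tiny convolutional network: two convolution/pooling stages, then a classifier.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),  # 10 output classes, e.g., handwritten digits
)

# Random stand-in data: 32 grayscale 28x28 "images" and their labels.
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step: a forward pass, then backpropagation of the output error
# layer by layer to correct the weights.
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
print("Loss after one step:", float(loss))
```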
CNN is powerful, but it has a number of disadvantages. Primarily, there’s a lack of theory. Unlike theory-based machine learning methods such as AdaBoost or the support vector machine (SVM), with neural networks we usually have no idea how a CNN will react if we change the underlying architecture, that is, the number and size of the neuron layers and the connections among neurons. There is no rule saying what architecture works optimally for a specific task. Moreover, there is still no evidence that multilayer (deep) networks are really more efficient than, say, two- or three-layer networks, and this makes discussions about deep learning somewhat speculative.
Another issue with neural networks is a lack of transparency, or interpretability. A neural network can identify two pictures of the same person, but we can’t see what properties or features of the image led the network to make that choice. So we deal with a black box that applies huge numbers of computations to every pixel in an image, and we have no idea what happens inside the neural layers. Of course, that resembles the situation with the human brain’s opaque internal processes, something we perhaps should have expected of AI. But the lack of interpretability makes it difficult to correct or improve the output.
Computer Vision in Practice
In practice, automated image understanding usually consists of two steps.4
The first is to split an image into segments that represent distinct objects, such as parts of the human body in a selfie or buildings in a photo of a town. There is no unique recipe for image segmentation. The solution may depend on many factors, including image quality, texture density and colors. Image segmentation uses tools such as gray scaling (gradations from white to black), image filtration, histogram analysis (examining the distribution of pixel-intensity values) and edge detection.
The second step is to detect the desired objects among the segments. In this sense, a human face in a photo is, roughly, a round, homogeneous segment of skin that contains two segments for eyes, one for a nose and one for a mouth, positioned properly beneath the nose. Features of segments might include size, shape, texture, color and so on. If there is enough information, it also makes sense to use features such as pixel-intensity gradients; the difference in intensity among the left, middle and right parts of an object can be an input that aids its classification. Object detection and classification take image recognition to the point where machine learning methods can be applied.
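The sketch below walks through both steps on a synthetic image: gray-level thresholding to obtain segments, then a handful of simple features (size, aspect ratio and left/middle/right intensities) for each segment. It relies on generic NumPy and SciPy operations and is meant only as an illustration of the workflow.

```python
import numpy as np
from scipy import ndimage

# Synthetic grayscale image with two bright "objects" on a dark background.
rng = np.random.default_rng(3)
image = rng.normal(60, 8, size=(100, 100))
image[10:30, 15:45] += 120   # object 1
image[60:90, 55:75] += 120   # object 2

# Step 1: segmentation by a simple global threshold, then connected components.
mask = image > image.mean() + image.std()
segments, n_segments = ndimage.label(mask)
print("Segments found:", n_segments)

# Step 2: describe each segment with a few simple features.
for seg_id in range(1, n_segments + 1):
    rows, cols = np.where(segments == seg_id)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    patch = image[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    thirds = np.array_split(patch, 3, axis=1)  # left, middle, right parts
    features = {
        "area": len(rows),
        "aspect_ratio": round(width / height, 2),
        "left_mid_right_intensity": [round(float(t.mean()), 1) for t in thirds],
    }
    print(f"Segment {seg_id}:", features)
```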
After the images have been segmented, the choice of learning method depends on how much data is available; different amounts of data call for different strategies.
LACK OF TRAINING DATA
1. Unsupervised learning. Sometimes examples of images for training are missing or in short supply. This may happen if we are trying to detect unknown objects or anomalies — this is common in astronomy, for example. For this purpose, we can use machine learning methods without a training set. These are usually referred to as unsupervised machine learning methods.
2. Cluster analysis. Generally, this technique segregates images into homogeneous clusters by defining some distance measure between objects, and then describes each cluster with some rules, relying on the common properties of images within the same cluster. To be efficient, unsupervised learning methods require some a priori information — features we can rely on to decide whether images are close enough to one another to appear in the same cluster.
3. Class definition. In this approach, we derive a rule that says whether a certain object belongs to a specific class. Again we have to rely on some a priori knowledge. There are plenty of formal methods for deriving such rules. The first is simple logic with a set of rules of the form “if-then,” otherwise known as Boolean logic. Another, more flexible approach is fuzzy logic, introduced by the University of California, Berkeley’s Lotfi Zadeh in the 1960s. Unlike Boolean logic, in which each statement is either true (1) or false (0), fuzzy logic works with the entire gray scale between 0 and 1: it allows us to use statements we are not completely sure of, such as something that is, say, 63 percent true. A third approach is probabilistic. Here the probability of an object belonging to some class may be derived from a priori (or nonstatistical) information. The derivation is based on Bayes’ formula, which connects a priori and a posteriori probabilities, as illustrated in the sketch that follows this list.
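As a toy illustration of the probabilistic approach, with numbers invented purely for the example, Bayes’ formula turns an a priori class probability and per-class feature likelihoods into an a posteriori probability:

```python
# Toy illustration of the probabilistic (Bayesian) approach.
# All numbers are hypothetical and chosen only for the example.

# A priori probability that a segment is a face, before looking at features.
p_face = 0.10
p_not_face = 1.0 - p_face

# Likelihood of observing the feature "roughly round, skin-colored segment"
# for each class (assumed from prior knowledge, not learned from data).
p_feature_given_face = 0.80
p_feature_given_not_face = 0.15

# Bayes' formula: P(face | feature) = P(feature | face) * P(face) / P(feature)
p_feature = (p_feature_given_face * p_face
             + p_feature_given_not_face * p_not_face)
p_face_given_feature = p_feature_given_face * p_face / p_feature

print(f"A posteriori probability of a face: {p_face_given_feature:.2f}")
# With these numbers the a priori 10 percent rises to roughly 37 percent.
```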
AVERAGE AMOUNT OF DATA
If you have an average amount of data (approximately 100 to 2,000 images per class in the training set), it means you have enough for learning and for time-efficient processing. Here the researcher’s main responsibility is to establish limits on overfitting, that is, on seeing patterns in the data that are more apparent than real; methods like SVM and AdaBoost will help you do this. SVM is one of the most successful machine learning methods because of its strong theoretical base and very efficient overfitting controls. AdaBoost is simpler, with lower computational complexity. It is more useful when we have an exact classification rule for images. Of course, success will depend on good feature extraction: a process for reducing data and eliminating redundancies.
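With a moderate training set already reduced to feature vectors, both methods are available off the shelf. The sketch below, using synthetic stand-in features, fits an SVM and an AdaBoost classifier and evaluates them on a held-out split:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for extracted image features: ~500 samples per class.
rng = np.random.default_rng(4)
class_a = rng.normal(0.0, 1.0, size=(500, 10))
class_b = rng.normal(1.0, 1.0, size=(500, 10))
X = np.vstack([class_a, class_b])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# SVM: the regularization parameter C helps limit overfitting.
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("SVM test accuracy:", round(svm.score(X_test, y_test), 3))

# AdaBoost: an ensemble of simple weak learners, cheaper to compute.
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("AdaBoost test accuracy:", round(boost.score(X_test, y_test), 3))
```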
BIG DATA
“Big Data” is a very popular phrase these days and reflects a whole new direction in machine learning development. There is a science to extracting and storing very large amounts of data. For computer vision, it may require huge datasets of images: thousands, even millions of images from databases of human faces, website images and YouTube videos. Time and resource consumption becomes more important, and methods that allow parallel computation may be necessary; these include CNN, AdaBoost and Random Forest, a learning method that constructs and combines many decision trees. The features chosen are usually not sophisticated: low-level indicators such as pixel intensity or the histogram of oriented gradients (HOG), an image-recognition technique that “describes” an image with histograms of local gradients. These features do not require complex algorithms to extract, and they often don’t even require image segmentation.
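A common large-scale combination is low-level HOG features fed into a parallelizable learner such as Random Forest. The sketch below illustrates the idea on small synthetic images, using scikit-image for the HOG descriptor and scikit-learn for the forest:

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)

def make_image(bright_square):
    """Synthetic 32x32 grayscale image; half of them contain a bright square."""
    img = rng.normal(0.3, 0.05, size=(32, 32))
    if bright_square:
        img[8:24, 8:24] += 0.5
    return np.clip(img, 0, 1)

# Build a small synthetic dataset and describe each image with HOG features.
images = [make_image(i % 2 == 0) for i in range(200)]
labels = np.array([i % 2 == 0 for i in range(200)], dtype=int)
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

# Random Forest: an ensemble of decision trees that trains well in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(features[:150], labels[:150])
print("Held-out accuracy:", forest.score(features[150:], labels[150:]))
```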
The choice of computation algorithms is a function of data scale (Figure 1). If you have quite limited amounts of sample data, image segmentation and complex feature extraction are usually the pragmatic approach. However, if you have large amounts of data, simple feature extraction combined with CNN, AdaBoost or Random Forest techniques usually works best.
The Future of Computer Vision
As we suggested at the start of this article, machines remain inferior to humans in many aspects of vision, particularly the understanding or interpretation of images, though they do have some very significant advantages. We can anticipate the development of computer vision unfolding in three ways. One could occur by increasing the computational power for existing algorithm architectures, with a step-by-step improvement in quality. A second could involve deeper research into nontransparent algorithms to make them more predictable and controllable. This would require theoretical work we currently lack. A third is the introduction of completely new concepts of image understanding — although it’s hard to say where they will come from.
And there’s always the scientific wild card. The field of brain studies could give us fresh insights into how the brain functions, leading to new machine learning techniques. We suspect that the human brain, with its remarkable ability to absorb and understand visual information, still has a lot of secrets to reveal.
Michael Kozlov is a Senior Executive Research Director at WorldQuant and has a PhD in theoretical particle physics from Tel Aviv University.
Avraham Reani is a Senior Executive Research Director at WorldQuant and has a PhD in electrical engineering from Technion – Israel Institute of Technology.