Let’s take a look at the technology illuminating vast aspects of our everyday lives.
Broadly speaking, computer vision is all about enabling machines to “see” images as humans do: to process and identify objects and even activities within a still frame or a video sequence. Ideally, the computer, empowered by AI, would understand what it is seeing; assess the situation and identify suspect or out-of-the-ordinary sequences; and advise users of needed actions.
The specifics include capabilities like object detection, which combines object classification (“it’s a truck”) or object identification (“it’s a Volkswagen”) with object location (where the object is in the image). Other techniques include motion analysis, to determine where and how objects are moving in the video; 3D scene reconstruction, to create 3D models based on one or more 2D images acquired from different viewpoints around the scene; and image segmentation, through which algorithms partition the image into semantic regions (e.g. cars, pedestrians, buildings, roads and background). A capability receiving increased attention is object tracking, in which the computer picks up an entity and follows it through the scene and across multiple cameras.
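To make the idea of detection as “class plus location” concrete, here is a minimal Python sketch. It assumes a hypothetical Detection record holding a label and a pixel bounding box, and it links detections from one frame to the next by box overlap (intersection over union). The names Detection, iou and match_to_previous, as well as the 0.3 threshold, are illustrative assumptions rather than any particular library’s API, and real trackers are considerably more sophisticated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record: what the object is and where it is."""
    label: str  # classification result, e.g. "truck"
    box: Box    # location of the object in the image


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (0.0 = disjoint, 1.0 = identical)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_to_previous(previous: List[Detection],
                      current: List[Detection],
                      threshold: float = 0.3) -> List[Optional[int]]:
    """For each current detection, return the index of the best-overlapping
    previous detection with the same label, or None if nothing overlaps enough.

    Simplification: two current detections may claim the same previous one.
    """
    matches: List[Optional[int]] = []
    for det in current:
        best_idx, best_score = None, threshold
        for i, prev in enumerate(previous):
            if prev.label != det.label:
                continue
            score = iou(prev.box, det.box)
            if score >= best_score:
                best_idx, best_score = i, score
        matches.append(best_idx)
    return matches


# Two consecutive frames: the truck moves slightly, the pedestrian jumps far away.
frame1 = [Detection("truck", (10, 10, 50, 40)), Detection("pedestrian", (100, 20, 120, 80))]
frame2 = [Detection("truck", (14, 10, 54, 40)), Detection("pedestrian", (300, 20, 320, 80))]
print(match_to_previous(frame1, frame2))  # [0, None]
```

The truck in the second frame is linked to the truck in the first because the boxes overlap heavily, while the pedestrian is treated as a new entity because its box no longer overlaps the earlier one at all.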
To support all this, you traditionally needed sophisticated hardware: advanced cameras, lenses, sensors, and processing chips to glean the visual input. However, the cost and complexity of getting started today are dropping dramatically. Then the AI takes it deeper. Numerous complex algorithms come into play, processed through neural networks that attempt to mimic some of the functions of the brain’s visual system.
Under the hood, a convolutional neural network parses data through multiple layers of kernels with increasing complexity. The key is that the actual numeric values of these kernels are learned from the data during the training process, as opposed to being manually engineered. Recurrent neural networks help machines learn temporal dependencies across an ordered set of images by keeping track of the content of previously seen images. This enables computer vision systems to exploit temporal context and learn about object actions, behaviors, trajectories and other temporal patterns.
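The idea that kernel values are learned rather than hand-engineered can be illustrated with a toy example. The sketch below, in plain NumPy, assumes a single 3x3 kernel, a random 16x16 “image” and a squared-error loss; the target output is produced by a hand-written edge-detection kernel, and gradient descent then recovers that kernel starting from zeros. The function name correlate2d_valid and all the numeric settings are illustrative assumptions. A real convolutional network stacks many such kernels with nonlinearities and is trained with a deep learning framework.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products.

    This is the sliding-window operation that CNN layers call "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))  # stand-in for real pixel data

# A hand-engineered vertical-edge (Sobel) kernel, used here only to create a target.
true_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])
target = correlate2d_valid(image, true_kernel)

# Learn the kernel from data: start at zero and follow the gradient of the
# mean squared error between the current output and the target output.
kernel = np.zeros((3, 3))
learning_rate = 0.1
for step in range(500):
    error = correlate2d_valid(image, kernel) - target
    # d(loss)/d(kernel[u, v]) = mean over positions of error[i, j] * image[i + u, j + v]
    gradient = correlate2d_valid(image, error) / error.size
    kernel -= learning_rate * gradient

print(np.round(kernel, 2))  # should be close to true_kernel
```

Nothing in the training loop knows what an edge is; the kernel values emerge purely from reducing the error on the data, which is the point the paragraph above makes about learned kernels.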
With so many moving parts, it isn’t surprising that we see a broad range of industry players engaged in the development of computer vision.
Companies like Qualcomm are working on the optics and camera hardware side, adding depth perception and other interpretive tools to make cameras themselves more intelligent and capable. On the processing side, Nvidia, Amazon, Microsoft, Facebook, and Google are all pushing the algorithmic envelope. Academia is in the game too, with the Rochester Institute of Technology among the leading players. Stanford University, the University of Toronto, MIT, Berkeley and the University of Montreal also boast strong programs.
The University of Michigan is also deeply engaged, with computer vision a natural outgrowth of the school’s efforts to untangle the complexities around self-driving cars, which will depend heavily on computer vision for their safe operation. The electrical engineering lab there is developing algorithms that will enable machines to perform real-world visual tasks such as autonomous navigation, visual surveillance, and content-based image and video indexing.
Another capability of computer vision systems, and an advantage over the human visual system, is that they can operate on image modalities beyond the visible spectrum. Algorithms can perform object recognition in the infrared bands, or in other modalities like radar-based imaging, such as synthetic aperture radar (SAR). This allows them, for instance, to detect objects at nighttime or to “see” through clouds.
These nuts and bolts help us to understand how computer vision might work: Smart processing enables computers to ingest and interpret complex data from sophisticated visual sensors. That’s how it works. But what can it do? Here’s where the fun starts.