Smart image and video search
Although the brain isn’t fast in comparison to the computer hardware we have today, it has several key abilities that computers currently lack. The brain processes information in a massively parallel fashion, which means we (unlike current computer systems) can recognize complex patterns within images in mere seconds. Our brain and visual systems recognize shapes, colors, situational context and extremely fine details better than almost any other cognitive process we perform. Moreover, our minds provide us with imagination, which allows us to forecast what it is we think we are seeing. Today, computers don’t do those things very well without our direct intervention. Hence the need to add metadata to source images or video, so that our existing search technology can actually help us find what we are looking for.
However, the need to manually add tags to data in order to address the limitations of our current search technology is rapidly changing as we learn more about how the human brain actually works. Without necessarily reproducing exact copies of neurons and synapses in silicon and software, our cognitive scientists today are getting better at building models that simulate certain aspects of the human brain. That has resulted in improved business systems, such as image search, that work more like the human brain with greater context and awareness.
Currently there are dozens, if not hundreds, of projects worldwide to create models of the human brain. Those projects don’t seek to reproduce the entire brain. Instead, they attempt to emulate different regions and aspects of the brain by building simple simulations of nerves, neurons, dendrites, ganglia and synapses, and with more complicated models of the cerebellum, the cerebral cortex, the auditory system and the visual cortex and retinas. Those efforts continue to grow almost exponentially, adding a vast amount of fine detail to what we already understand about the human brain and how it can be applied to computer systems.
One area of significant progress is a rapidly improving understanding of human vision. Two of the key contributors to the growing body of information in this area are Frank S. Werblin, Ph.D., a professor of molecular and cell biology at the University of California, Berkeley, and Botond Roska, M.D./Ph.D., of the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland. They have been working together to present some of the first real understanding of how human vision actually works.
Their groundbreaking research explains that human vision systems work in a fragmented fashion. Different parts of our visual processing systems, including the retina, the visual cortex and other brain components, work together to collect 12 separate streams of relatively sparse information from the visual data entering our eyes. Werblin and Roska refer to those streams of video as the “movies in our eyes.” Some of the components function to see color, some shape and others edges and backgrounds. The brain then takes those various sparse information feeds and weaves them together to form the rich visual images we perceive.
Understanding in detail how our brain and visual capabilities work gives us the ability to model and emulate those systems with ever-improving accuracy. The quest is to make computers see and understand the contents of images and videos, so that they can work on our behalf and bring to us search results that are contextually relevant and an appropriate response to our search queries. We now have an emerging generation of software built with off-the-shelf, standards-based computer hardware and software that aims to operate in a fashion that parallels how humans see images.
PiXlogic—content-based image and video search
Providing computers with the ability to see like humans is an extremely challenging problem. Enter piXlogic, a small company in Los Altos, Calif., that has released a next-generation image search engine, called piXserve, to search image and video content without the need for manually entered descriptive, textual metadata. As far as I am concerned, piXlogic has dragged image-based search into the 21st century and the world of Web 2.0.
PiXlogic is the brainchild of Joseph Santucci, a nuclear engineer and technology entrepreneur. Santucci, who is president and CEO of piXlogic, says of the company’s accomplishments, “It has been a team effort, and we have been blessed with truly remarkable people at every stage of our work.” One of those people, now retired, is Dr. Shelia Guberman, who worked on handwriting recognition in the late 1980s and early 1990s, and developed the underlying principles that were incorporated first by Apple on the Newton, and later by Microsoft in the Windows CE platform. In that work, he showed the importance of “asking the right question” in the field of pattern recognition.
At the time, common approaches to handwriting recognition were based on matching patterns of pixels. Guberman reformulated the problem and defined it in terms of the motion that the hand makes to create handwritten characters. That turned out to be a much more robust way to solve the problem.
This story is a great example of how looking at a problem from a different perspective can sometimes yield quite interesting results. Looking at things from a different angle is what has given Santucci the edge. He believes that not being a classically trained computer scientist has allowed him to ask the right kinds of questions to solve the challenging problems of helping computers “see” like humans do. As a result, piXserve is currently the most advanced commercially available enterprise-class search engine for images and videos, one that is based on automatic indexing of the contents of the image, without the need for any manually input textual metadata whatsoever.
PiXserve incorporates algorithms that give the software the ability to automatically “see” almost anything in an image, and in many cases understand the context and the content of that image. To accomplish that, piXlogic tackled several fundamental challenges: image segmentation, comparisons of identical or similar objects in different images and the cognitive ability to fill in information missing from an image.
Image segmentation: A major breakthrough developed by piXlogic relates to methods for automatically segmenting an image. PiXserve can do this without a priori (before the fact) knowledge of the image’s contents, and in such a way that the resulting segments can correspond to what humans would consider to be “logical visual objects.” Simply put, piXserve is able to identify and catalog all the separate objects in an image, thus enabling search on individual components of the image.
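To make the idea of segmentation concrete, here is a minimal sketch in Python. It is not piXlogic's method, which the article does not describe. It uses a textbook technique, connected-component labeling, on a tiny hypothetical "image" in which each character stands for a pixel color. The function name label_regions and the sample image are assumptions made purely for illustration.

```python
from collections import deque

def label_regions(grid):
    """Group 4-connected pixels of equal value into numbered regions.

    Returns (labels, count): labels[y][x] is the region number (1-based)
    of each pixel, and count is the number of regions found.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue  # pixel already belongs to a region
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill over same-valued neighbors.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and grid[ny][nx] == grid[y][x]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# Hypothetical 3x6 image: '.' background, 'R' a red patch, 'Y' a yellow patch.
image = [
    "....YY",
    ".RR.YY",
    ".RR...",
]
labels, count = label_regions(image)
print(count)
```

Each region found this way can be cataloged and searched on its own. Real photographs need far more than exact color equality to yield objects a person would recognize, which is where the difficulty lies.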
Comparisons: Another technical challenge addressed by piXlogic stems from the first breakthrough, and relates to how the software compares visual objects from different images. Sometimes those objects appear from the same camera perspective, and the geometry is quite simple. More often than not, however, the geometry is more complex: the logical entity (for example, “a person”) can be the same in two images while its geometric appearance is quite different, as in a picture of a person with arms folded vs. one with arms spread. PiXlogic has developed a means of normalizing the image so that two or more similar, but not identical, images can be rationally compared.
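The article does not say how piXlogic normalizes images, so the following Python sketch shows only the simplest form of the idea. It assumes two outlines given as lists of corresponding points in the same order, and removes differences in position and size before comparing them. The names normalize and shape_distance are hypothetical, and this approach does nothing for harder cases such as changes in pose.

```python
import math

def normalize(points):
    """Center a point set on its centroid and scale it to unit RMS radius.

    Assumes the points are not all identical (the scale would be zero).
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered) / n)
    return [(x / scale, y / scale) for x, y in centered]

def shape_distance(a, b):
    """Largest gap between corresponding points after normalization."""
    return max(math.dist(p, q) for p, q in zip(normalize(a), normalize(b)))

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
moved_and_enlarged = [(10, 5), (16, 5), (16, 11), (10, 11)]  # same shape
wide_rectangle = [(0, 0), (8, 0), (8, 2), (0, 2)]            # different shape

print(shape_distance(square, moved_and_enlarged))  # very close to zero
print(shape_distance(square, wide_rectangle))      # clearly larger
```

After normalization, the square and its shifted, enlarged copy are nearly indistinguishable, while the rectangle still stands apart.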
Filling in the blanks: Being able to see and compare objects in a picture might be enough to consider the challenge of visual search sufficiently solved. However, according to Santucci, “This is definitely not enough because our minds see and understand a lot more than what is in the image. Humans unconsciously use context to understand what is in the image, rationalize its contents and deal with the missing or incomplete information implied by an image. We have an imagination and so can fill in the blanks, which is something that software doesn’t do too well.”
To have the software solve that problem, piXlogic has incorporated an idea it calls “notions.” Notions are interpreted understandings about the context of the image and the objects in it. PiXserve started out with just a handful of fundamental notions, but as the software has evolved, piXlogic has enhanced the existing notions and added more to the mix to create a contextually rich and accurate search environment.
The piXserve software is complicated underneath, but at the user interface level, it is refreshingly simple. To catalog a repository of images or videos, the user points piXserve to that repository (or employs a Web crawler to collect images), and it automatically indexes the contents of those files. Through a Web browser interface, users can search using an image and/or point to one or more items in the image that are of interest to them, as shown in Figure 3 on page 8. The software can also see and recognize text that may appear anywhere in the field of view of the image, which can be quite useful in a range of applications (from scanning TV broadcasts, to recognizing license plates, to picking out names of restaurants, etc.).
Images as objects
Most image and video search technologies work by trying to match image signatures that are based on simple concepts such as color histograms, texture, edges and other such metrics. Those “image wide” measures are imprecise, leading to imprecise matches such as a yellow sign, a yellow shirt, a yellow sun. PiXserve doesn’t rely on those approaches, but instead literally sees the image as being composed of many objects. The software automatically creates a vectorized representation or set of lines describing the object, and stores it in a database.
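A small Python sketch shows why an image-wide color histogram is imprecise. The two tiny "images" below are hypothetical, with each character standing for one pixel color ('Y' for yellow, '.' for background), and color_histogram is an illustrative helper, not part of any product described here.

```python
from collections import Counter

def color_histogram(image):
    """Count how many pixels of each color an image contains."""
    return Counter(pixel for row in image for pixel in row)

# Two different pictures built from the same number of yellow pixels.
sign = [
    "YYYY",
    "YYYY",
    "....",
]
shirt = [
    "Y..Y",
    "YYYY",
    ".YY.",
]

same_histogram = color_histogram(sign) == color_histogram(shirt)
same_picture = sign == shirt
print(same_histogram, same_picture)
```

Because the two histograms are identical even though the pictures differ, a matcher that looks only at histograms cannot tell them apart. That is the ambiguity a description of the objects themselves is meant to avoid.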
For example, as shown in Figure 4, piXserve can see and create a description of a black arrow and use that description to compare the