Thursday, July 3, 2008

The Semantic Gap

There exists a gap in the world of Information Retrieval, and it is especially prevalent in the world of Computer Vision and Pattern Recognition.

Imagine wanting to retrieve a picture of a golf course :) Intuitively, we can picture a big green field with a grassy texture, dotted with flag poles marking each of its 18 holes. To us humans, this is a relatively painless task to envision. But what if you had to search for a golf course from a set of 1000 pictures? And to make matters worse (which is usually the case), these pictures are of mixed genres. There are snapshots of soccer fields, somebody's backyard, tennis courts, paddy fields, a few Wimbledon tennis courts etc. Now... would you be willing to wade through all 1000 pictures to pick out the golf courses from the nots? ... Hmmmm... I would not love this opportunity :P As a matter of fact, I'd run away if someone came to me with this sort of offer (unless there's a 1-million prize attached :P).

In short, we humans are GOOD at discriminating between conceptual semantics (in this case, the golf course)... but lack the consistency and 'energy' to search through a huge database (so to speak).

====

Ok. Now that we know human beings lack the long-term consistency, as well as the willpower, to perform certain tasks for extended periods, the next best thing would be to turn to computers :) Machines can be programmed to do tasks precisely for as long as their processors allow them. So, the obvious solution would be to ask the computer to look for the golf courses! :D Hah! Problem solved :)

BUT!!! Yes, of course there's a but :)

Computers are merely built from inorganic materials such as microchips, wires, nuts, bolts and the works. There is no living organism in a computer. As a result, computers cannot intuitively think or make decisions based on past experiences (well, not naturally that is). So, if we ask a computer to find the golf course pictures in the 1000-picture database, it would most probably have no problem 'looking' at all 1000 pictures... but how can it decide which one is which?

Computers are good at extracting data though. We can write a computer program to extract the necessary features from a set of pictures... For example, in our case here:

We write a computer program to extract color and texture features from the pictures. As a result, we might end up with a color histogram and a coarseness map for texture. These are the two features that we will use to search for our Golf Course pictures :)
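
Just to make this concrete, here is a minimal sketch (in Python, using the Pillow and NumPy libraries) of what such a feature extractor could look like. The bin count and the crude texture measure are my own illustrative choices, not any particular published method; a real system would probably use something like Tamura coarseness for the texture map.

import numpy as np
from PIL import Image

def extract_features(path, bins=8):
    # Load the image and work with raw RGB values.
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)

    # Color histogram: 'bins' bins per RGB channel, normalized so that
    # pictures of different sizes remain comparable.
    hist, _ = np.histogramdd(
        img.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.ravel() / hist.sum()

    # Very rough texture cue standing in for a proper coarseness map:
    # the average absolute difference between neighboring gray-level
    # pixels, horizontally and vertically.
    gray = img.mean(axis=2)
    dx = np.abs(np.diff(gray, axis=1)).mean()
    dy = np.abs(np.diff(gray, axis=0)).mean()

    # One feature vector per picture: bins*bins*bins color values + 2 texture numbers.
    return np.concatenate([hist, [dx, dy]])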

Traditionally, work in the past has relied on CONTENT-BASED IMAGE RETRIEVAL to retrieve pictures of interest, and most of the earlier experiments were done in controlled environments. One of the most popular techniques was "QUERY BY EXAMPLE", or QBE for short.

QBE is quite a powerful technique, whereby a human user feeds the computer program with a sample, or example, image. From there, the computer program looks for pictures in the database that are similar to the example image.

To check for similarity, some sort of DISTANCE measure is used (the most common one being the Euclidean distance), and these distances are computed over the features being chosen.

In our case, let's say we use the color histogram information to detect a dominant GREEN color, and from the coarseness information we define something that looks GRASSY. The program will then look for similarities within the database by matching these selected features, and once suitable matches are found, they can be returned to the user as results! :D
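
To give a rough feel for QBE, the sketch below reuses the hypothetical extract_features() from earlier, compares the example image's feature vector against every picture in an assumed pictures/ folder using the Euclidean distance, and returns the closest matches first.

import glob
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def query_by_example(query_path, database_glob="pictures/*.jpg", top_k=10):
    query = extract_features(query_path)
    scored = []
    for path in glob.glob(database_glob):
        scored.append((euclidean(query, extract_features(path)), path))
    scored.sort()  # smallest distance first = most similar first
    return scored[:top_k]

# e.g. query_by_example("example_golf_course.jpg") would return the ten
# database pictures whose color/texture features lie closest to the example.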

Computers are good and relatively precise at calculating and processing image features. But alas, when they return results, they do so blindly. Every numerical match is considered a match, whereas in the real world this is not the case.

The computer might mistakenly mark an indoor soccer field as being similar to a golf course. Also, it might return a seasoned snooker table in need of service. In the worst case, it might just return a short-haired-green-skinned sea monster (I know I am exaggerating here).

Anyway, the main point here is that computers are good at extracting, processing and comparing numeric data derived from features. But understanding what these numbers actually represent in the real world? That, they do not and, I believe, cannot do as well as we humans can. HENCE!!!

THE SEMANTIC GAP → the disparity between computer-extractable low-level features and the humanly perceived high-level semantic concepts that they represent.

Well, at least that’s one way of putting it :)

============

So, my main research question now is how to bridge this gap :) Or, maybe if I get really, really lucky, find a way to totally eliminate the gap! Hah!

Many works, past and present, have tried to tackle this problem, and various methods have been proposed. Earlier methods made use of heuristics OR rules to deduce domain concepts; others made use of statistical modeling. The trend now, however, is to ask the computers to learn... hence Machine Learning. Another approach, which is rapidly gaining popularity, is the use of Ontologies to represent the knowledge of certain domains.
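
Just to illustrate the Machine Learning flavor of this idea (and not any specific published method), here is a toy k-nearest-neighbor classifier: given feature vectors that a human has already labeled as 'golf course' or 'not golf course', it predicts the label of a new picture by a majority vote among its closest labeled neighbors.

import numpy as np

def knn_predict(query_vec, train_vecs, train_labels, k=5):
    # Euclidean distance from the query to every labeled training picture.
    dists = np.sqrt(((train_vecs - query_vec) ** 2).sum(axis=1))
    # Majority vote among the k nearest labeled neighbors.
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)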

Either way... something has to be done to bridge or close this gap, because the amount of digital information (namely VIDEO) is quickly getting out of hand. We need to devise a way so that the management and utilization of such resources can be right at our fingertips, God Willing :)
