Tuesday, July 29, 2008

Illegal Pseudo-Chef @ M01 Restu

To Eat... Makan (that's Malay)... Mangez du riz ('eat rice' in French, if I'm not mistaken), A kill (Arabic maybe?) and of course... MANGAN (eat in Javanese) and MADANG (eat rice, also in Javanese)...

Okes lah. Today I would like to talk a little about cooking :D My roommate and I have successfully and illegally brought some stuff into our hostel room at M01 Desasiswa Restu. The accessories range from a FABER rice cooker to... to... one big 14-kg cylindrical 'thinggy' that can provide some sort of 'fire' power :D We also have a fridge... but the small bar-type one (brand GLOBAL...)...

Anywayz, the main thing is that we can now start cooking in our room. Even though it's quite dangerous... and also due to the fact that the smoke detector lurks just above our heads... we still have to cook. If not, aiya-gazamborina-aida-rahim... costs will go up!!! At least if we cook, insya-Allaah our food expenditure would be so much less :D Which would then lead to great amounts of savings!!!
Annywayz...

That first night, Azrul attempted nasik-kurma-briyani Al-Jantan? (is that right, mate? does your rice even have a name :P)... DELICIOUS! MARBELES!!! The next night, it was my turn to try my hand at cooking --> Oyster-Chicken-With-Too-Much-Salt... a bit salty, but Alhamdulillaah, palatable :D

Annywayz... this morning's was the best one yet. Cuz we experimented with curry powder and oso some spices such as cinnamon, star anise, black n white pepper and so on :D This recipe is more Mr Azrul's than mine... and basically the steps are:

1. Sauté the garlic and shallots
2. Add the diced potatoes
3. Add the finely cut chicken pieces (btw, this chicken has already been left overnight in the fridge with a curry powder, salt n pepper marinade... oh, and some Arabic olive oil :D)
4. Add a few cherry tomatoes (we only put them in to finish them off)... quite sweet this type of tomato... highly recommended :D
5. Add some curry powder... to color and taste :D (is 'to color' even a thing?)... oh! And also salt to taste... :D
6. Stir stir and stir... stir until satisfied with what you're stirring (ok, that does not make sense... but it did to us!)
7. Add rice and water...
8. Leave to cook until the RICE COOKER button flips to 'KEEP WARM'
(Mate... correct me if I got the order of these ingredients wrong :D)
:D

Alhamdulillaahi-robbili'aalameen :D

The dish came out MARBELES (that's Azrul's term...)!!! For our taste at least...
Even tho a bit hot and spicy, it was not only palatable... but oso almost the same as what the Arabs are selling in front of the mosque!... and they're cashing in RM6 per pack for their stuff! :( Overpriced... overpriced...

Wannywayes... Let us enjoy the pictures of the:

Steamed Curry Rice with Curry Chicken Deluxe :D (btw, if anyone has any other easy recipes... do let us know ye :D)

Okes... thank you 4 reading. Assalaam aleykom WBT and Have A Good One...

The sautéing and chicken-cooking process, done
Adding rice and water
Another look before the lid is closed
The end result...
A little close-up :D

My packed lunch for today :D

Monday, July 28, 2008

Tovinkere - Detecting Semantic Events in Soccer Games - Towards A Complete Solution

Reference: Tovinkere, V. and R.J. Qian. Detecting Semantic Events in Soccer Games: Towards a Complete Solution. In IEEE International Conference on Multimedia and Expo 2001 (ICME 2001).

Objective: This paper puts forward a knowledge- and rule-based method to detect soccer events!

Methodology: From what I've read, the authors go down to a tee on detecting soccer events by encoding the domain knowledge of soccer using XML, and then using player and ball tracking (along with physics-related information, e.g. ball bounce angle) in a rule-based system. (The authors claim that other methods [such as Machine Learning perhaps?] can also be used besides rules...)

1. Firstly, after understanding the LAWS OF SOCCER and SOCCER GAME FLOW and IDENTIFYING ALL POSSIBLE SOCCER EVENTS, the authors conceptually model the domain knowledge of soccer using a hierarchical Entity Relationship model.

2. This model is then translated to XML (why didn't they model in XML straight away, eh?)

3. The system then takes the inputs below to detect events:
* Domain Knowledge (the XML)
* Player and Ball tracking information



PHASE 1 of DETECTION
- Compute the derived information from player motion and orientation to identify all sections of tracking data containing player-ball interactions
- Player-ball interactions are determined by getting rid of deflections that involve bouncing off ground or goal post


PHASE 2 of DETECTION
- Determines which rules (from Domain Knowledge) will be used
- These rules evaluate game situation and execute relevant rules
- The appropriate segments are then marked as VALID or INVALID events, depending on how the evaluation goes :)


=================


Main soccer events are detected by first detecting BASIC ACTIONS.

These BASIC ACTIONS are then used in combination (i.e. how they are represented in the XML schema) to detect the more COMPLEX EVENTS :)

e.g. Deflection (BASIC) is evaluated... according to XML (domain knowledge)... and in the end a Save (COMPLEX) event is detected.
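
To fix this BASIC-action-to-COMPLEX-event idea in my head, here's a toy Python sketch. The rule, the action field names and even the 'save' condition are all my own inventions for illustration (the paper encodes its rules against an XML domain model, not hard-coded like this):

```python
# Toy rule: a deflection by the goalkeeper near the goal is inferred as a "save".
# All field names (type, player_role, near_goal, time) are made up by me.

def detect_complex_events(basic_actions):
    """Scan detected BASIC ACTIONS and emit COMPLEX EVENTS via simple rules."""
    events = []
    for action in basic_actions:
        if (action["type"] == "deflection"
                and action["player_role"] == "goalkeeper"
                and action["near_goal"]):
            events.append({"event": "save", "time": action["time"]})
    return events

actions = [
    {"type": "kick", "player_role": "striker", "near_goal": True, "time": 10.2},
    {"type": "deflection", "player_role": "goalkeeper", "near_goal": True, "time": 10.9},
]
print(detect_complex_events(actions))  # one "save" event at t=10.9
```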

Wednesday, July 16, 2008

A Novel Framework for Semantic Annotation and Personalized Retrieval of Sports Video

Reference: Xu, C., et al., A Novel Framework for Semantic Annotation and Personalized Retrieval of Sports Video. IEEE Transactions on Multimedia, 2008. 10(3): p. 421-436.

Objectives: To detect important sporting events (experiments are shown in the soccer and basketball domains), create an index, and finally cater for personalized querying by users.

In a nutshell

The authors approach semantic indexing and retrieval of digital video (namely sports) in a rather different way. Instead of solely analyzing the low-level features of the aural and visual modalities, they make extensive use of OUTSIDE/EXTERNAL information. The external source here is text from websites (also referred to as web-casting text... or some call them weblogs?) - examples are those found on ESPN's http://www.soccernet.com (under the weblog link if I'm not mistaken).

Basically, their method is divided into three parts...

1. Text Analysis

The FIRST part (TEXT ANALYSIS) involves querying the web server at ESPN or BBC (for example), and then looking for text regions of interest (ROI). In short, look for text areas that describe the game at hand.

This is followed by keyword identification. This is done via matching the keywords found on the web-casting text website with the particular sports keyword(s) that the authors define.

Finally, the authors come up with TEXT EVENTS based on the matched event keywords. This involves keyword matching combined with STEMMING, PHONIC and FUZZY search...
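
As a note to self, the keyword-matching step could look something like this toy Python sketch. The keyword list and the crude suffix-stripping 'stemmer' are my own stand-ins (the paper uses proper stemming plus phonic and fuzzy matching, which this does not do):

```python
# Hypothetical soccer event keywords - not the authors' actual list.
EVENT_KEYWORDS = {"goal", "card", "foul", "save"}

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ed", "s", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_text_events(webcast_line):
    """Return the event keywords found in one line of web-casting text."""
    tokens = [stem(t.strip(".,!").lower()) for t in webcast_line.split()]
    return [t for t in tokens if t in EVENT_KEYWORDS]

print(extract_text_events("89' GOAL! Owen scores, keeper saves nothing"))
```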

2. Video Analysis

FIRSTLY, after Shot Boundary Detection (using M2-Edit Pro), they classify shots into FAR VIEW, MEDIUM VIEW and CLOSE-UP VIEW. These views are common in sports video, and serve to vary (where necessary) the viewers' attention... This process is done by:
  • Classify each frame within a shot into one of the aforementioned three views by analyzing COLOR, EDGE and MOTION features,

  • A simple Weighted Majority Vote over the FRAMES (within a shot boundary) is then taken to finally classify the shot.
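
A note to self on the voting step: a toy weighted majority vote over the per-frame labels might look like this (the weights here are made up by me; the paper derives its own):

```python
from collections import Counter

def classify_shot(frame_views, weights=None):
    """Weighted majority vote: each frame's view label votes for the shot class."""
    weights = weights or {}
    score = Counter()
    for view in frame_views:
        score[view] += weights.get(view, 1.0)  # default weight 1.0 per frame
    return score.most_common(1)[0][0]

frames = ["far"] * 6 + ["medium"] * 3 + ["close-up"] * 1
print(classify_shot(frames))  # "far" wins the vote
```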

SECONDLY, Replay Detection is done... enough said :)

THIRDLY, Video Event Modeling is done. Here, the authors structure the two previous steps' results as mid-level features to model the event.

i.e. EVENT = [Si, Si+1, ..., Sj], whereby Si = the beginning shot of the event, and Sj the ending shot. Each shot Sk (where i<=k<=j) is represented by a feature vector Fk = (SCk, Rk, Lk)

* SCk = Shot Class of shot Sk,

* Rk = Replay Detection Flag (1 or 0), indicates whether Sk is included in a replay,

* Lk = Length of Sk
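
Writing the shot representation down as a small Python data structure helps me remember it (the field names are my own paraphrase of SCk, Rk and Lk):

```python
from dataclasses import dataclass

@dataclass
class ShotFeature:
    shot_class: str   # SC_k: "far", "medium" or "close-up"
    in_replay: int    # R_k: 1 if the shot is included in a replay, else 0
    length: int       # L_k: length of the shot (e.g. in frames)

# An EVENT is then just the list [S_i, ..., S_j] of shot feature vectors:
event = [
    ShotFeature("far", 0, 180),
    ShotFeature("close-up", 0, 60),
    ShotFeature("close-up", 1, 75),  # a replayed close-up
]
print(len(event), event[2].in_replay)
```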


3. Text/Video Alignment

Basically detects the region where the game clock is present. Then they do their own OCR (which only recognizes the digits 0-9, which is neat btw).

The recognized clock time is then matched with the text event time... this time (on the video frame) will be the starting point for further EVENT BOUNDARY DETECTION (EBD)! (i.e. the start and finish of the event). EBD is done via MODELING THE VISUAL TRANSITION PATTERNS of the video events:-

i. Linking (or matching) of game clock time and text event time

ii. Events are modeled via HMM (trained using the mid-level features mentioned previously)

iii. The (candidate) SHOT containing the event is selected as the reference (Sref)

iv. The search range starts FROM THE FIRST FAR VIEW SHOT BEFORE Sref and ends at the FIRST FAR VIEW SHOT after Sref. (The authors say that the temporal transition patterns of the events will occur within this range)

i.e. Search Range (FarView-CloseUpView-FARVIEW(START HERE)-CloseUpView-SREF-MedView-CloseUpView-FARVIEW(END HERE)-MidView.......)

v. The trained HMMs are then used to calculate probability scores of all possible partitions within the search range (aligned by shot boundaries). The partition with the highest probability score is selected as the detected event candidate...
- The start and end boundaries are then the first and last shots within that partition.
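
To remind myself how the partition search works, here's a sketch where a stand-in scoring function replaces the trained HMM likelihood. Only the argmax over candidate partitions is the real idea; the scoring heuristic and shot fields are completely made up by me:

```python
def score(shots):
    """Stand-in for the HMM log-likelihood that this shot sequence is the event.
    Toy heuristic: reward starting on a far view and containing replays."""
    s = 0.0
    if shots and shots[0]["view"] == "far":
        s += 1.0
    s += sum(0.5 for sh in shots if sh["replay"])
    return s

def best_partition(search_range):
    """Score every contiguous span of shots and return the best (start, end)."""
    candidates = [
        (score(search_range[i:j + 1]), i, j)
        for i in range(len(search_range))
        for j in range(i, len(search_range))
    ]
    best = max(candidates)
    return best[1], best[2]  # indices of the first and last shot of the event

shots = [
    {"view": "far", "replay": False},
    {"view": "close-up", "replay": False},
    {"view": "close-up", "replay": True},
    {"view": "far", "replay": False},
]
print(best_partition(shots))  # boundaries of the highest-scoring span
```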

Friday, July 4, 2008

Explicit Semantic Events Detection and Development of Realistic Applications for Broadcasting Baseball Videos

Reference: Chu, W.-T. and J.-L. Wu, Explicit semantic events detection and development of realistic applications for broadcasting baseball videos. Multimedia Tools and Applications, 2008. 38(1): p. 27-50

=================
IN ALL
=================

Features:
i) Shot occurrences --> Obtained from the different video shots generated by their shot boundary detection algorithm... this algo, btw, is color-based... (using a color adjacency histogram to distinguish between in-field and out-field views, and horizontal and vertical projection profiles for the pitch shot view)

ii) Takes into consideration shot transitions, temporal duration and motion... for particular events, there's a combination of such features... and hence these features are used for E.D. (Event Detection)

Technique for E.D.:
i) K-NEAREST NEIGHBOR (KNN) - Neighbor = 8!!!

Results:
Good... at least 0.85 PRECISION and 0.90 RECALL!!!

=================


Objectives:



  • Detect events in baseball videos --> Only interested in this one...
  • Come up with practical user applications

Framework:

Starts with Shot Classification (there are a few classes of shots), then uses shot information as one of the inputs for event detection, finally creates applications.

My focus in this paper (Event Detection):

How do they do it? --> Rule-based + Model-based (when confusion occurs)

Rule Based (Domain Knowledge of Baseball) -->
1. (Caption) TEXT information extraction


  • a) Character pixels are determined first --> HIGH INTENSITY as compared to the BACKGROUND.
  • b1) Character template construction (1) --> The identified character region is represented by 13-dimension ZERNIKE moments
  • b2) Character template construction (2) --> For each digit (e.g. 4), a 30-sec vid. clip is used as training
  • b3) Character template construction (3) --> The character template for the digit 4 is constructed by averaging the ZERNIKE moments over all the frames!!!
  • c) Character Recognition --> Test vectors (unseen data, that is) are compared with ALL templates' vectors... look at the VECTOR ANGLE!!! (so Zernike can come up with angles?). The SMALLEST INCLUDED ANGLE with a particular digit's template is considered a character match!
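
So the recognition step is basically nearest-template by included angle. A toy version, with made-up 3-D vectors standing in for the 13-D Zernike moments:

```python
import math

def angle(u, v):
    """Included angle (radians) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

# Made-up per-digit templates (the paper averages real Zernike moments instead).
templates = {4: [0.9, 0.1, 0.3], 7: [0.2, 0.8, 0.5]}

def recognize(test_vec):
    """Pick the digit whose template has the SMALLEST included angle."""
    return min(templates, key=lambda d: angle(test_vec, templates[d]))

print(recognize([0.85, 0.15, 0.28]))  # closest in angle to the "4" template
```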
2. (Caption) SYMBOL information extraction


  • a) Uses the same intensity comparison with the background to determine symbols
  • b) Based on pre-indicated symbol regions, the BASE-OCCUPATION SITUATION is displayed according to whether the corresponding base is highlighted or not

  • b2) In the above case (image), this means FIRST BASE is occupied.... this is what I understand anyway :)
  • c) Then, in the duration between two PITCH SHOTS, look at changes in the number of outs, the score and the base-occupation situation to further come up with 'evidences' for event detection
  • d) A few other domain rules are followed based on the three criteria in italics (as above)...

3) All of the above are concatenated into one feature vector fi,i+1...


  • Then, another set of rules... determines whether the feature vector is LEGAL or ILLEGAL
  • Only LEGAL feature vectors are considered :)

4) Event Identification --> Determined at the leaves of a DECISION TREE!!!

  • Event identification is treated as a classification task into subsets of predefined event sets
  • Tree traversal is based on predefined rules on the OUTS, SCORE and (RUNNER) BASE-OCCUPATION SITUATION
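
The leaf-of-a-decision-tree idea could be sketched like this. These particular rules and event names are invented by me for illustration; the paper's tree uses its own baseball domain rules:

```python
def identify_event(d_outs, d_score, bases_cleared):
    """Traverse toy rules on changes between two pitch shots; leaves name events."""
    if d_score > 0:
        if bases_cleared:
            return "home run"    # leaf: score changed and the bases emptied
        return "run scored"      # leaf: score changed, runners remain
    if d_outs > 0:
        return "out"             # leaf: an out was recorded
    return "no event"            # leaf: nothing of interest happened

print(identify_event(d_outs=0, d_score=1, bases_cleared=True))  # "home run"
```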

Model Based

1) Some rules are common to several events... hence the event needs to be determined further by examining contextual (visual + temporal) information

2) Look at the combinational occurrences of shot types (e.g. pitch shots, field shots, close-up shots), differences in time between shots (i.e. pitch and pivot), field view duration (in frames) and also the motion of the pivot shot...

3) All these shot context features are normalized to [0,1]... and 20 training sequences are manually selected!

4) In the end, train and test using the K-NEAREST NEIGHBOUR algorithm... (the authors say it's because this algo is simple to use... that's it?) :P --> K is set to 8
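
And the KNN part from scratch, with K=8 as in the paper. The training data below is synthetic and two-dimensional just for the demo (the paper uses 20 manually selected training sequences of shot-context features):

```python
import math
from collections import Counter

def knn_predict(train, query, k=8):
    """Classify query by majority vote among its k nearest training samples."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# 20 synthetic (feature_vector, label) training samples, features in [0,1].
train = [([0.1 + 0.01 * i, 0.2], "single") for i in range(10)] + \
        [([0.8 - 0.01 * i, 0.9], "double") for i in range(10)]
print(knn_predict(train, [0.12, 0.25]))  # nearest neighbours are all "single"
```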

========================

My Thoughts About This Technique

  • Quite related to my idea of event and important segment detection :) The basic framework is quite similar, ... but I believe I can bring in novelty due to:
  1. Different domain (SOCCAH!!! ... ok. Soccer :D)
  2. Different technique to process context (Because the whole event and segment detection framework is different, might be able to make use of another context based classification technique... other than KNN, insya-Allaah)
  3. Different way/approach of processing caption text...
  4. Oh! And that DECISION TREE part is neat... might be able to use it in the analysis of my features... :) I will use rules also I reckon...

Insya-Allaah, let's just see...

BUT NOT ONLY SEE!!! MUST DO THE WORK!!!! Ameen :D

Thursday, July 3, 2008

The Semantic Gap

There exists a gap in the world of Information Retrieval, and this gap is prevalent especially in the world of Computer Vision and Pattern Recognition.

Imagine wanting to retrieve a picture of a golf course :) Intuitively, we can picture a big green field with a grassy texture, as well as flag poles at each of the 18 holes of a golf course. To us humans, this is a relatively painless task to envision. But what if you were to search for a golf course among a set of 1000 pictures? And to make matters worse (which is usually the case), these pictures are of mixed genres. There are snapshots of soccer fields, somebody's backyard, tennis courts, paddy fields, a few Wimbledon tennis courts etc. Now... would you be willing to wade through all these 1000 pictures to pick out the golf courses from the NOTs? ... Hmmmm... I would not love this opportunity :P As a matter of fact, I'd run away if someone came to me with this sort of offer (unless there's a 1-million prize tag attached :P).

In short, we humans are GOOD at discriminating between conceptual semantics (in this case, the golf course)... but lack the consistency and 'energy' to search through a huge database (so to speak).

====

Ok. Now that we know human beings lack the long-term consistency as well as 'will' power to perform certain tasks for extended periods, the next best thing would be to turn to computers :) Machines can be programmed to do tasks precisely for as long as their processors allow them. So, the obvious solution would be to ask the computer to look for the golf courses! :D Hah! Problem solved :)

BUT!!! Yes, of course there's a but :)

Computers are merely built from inorganic materials such as microchips, wires, nuts and bolts and the works. There is no living organism in a computer. As a result, computers cannot intuitively think or make decisions based on past experiences (well, not naturally that is). So, if we ask a computer to find the golf course pictures in the 1000-picture database, it would most probably have no problem 'looking' at all 1000 pictures... but how can it decide which one is which?

Computers are good at extracting data though. We can program a computer (or a computer program) to get some necessary features from a set of pictures... For example, in our case here:

We write a computer program to extract color and texture features from the pictures. As a result, we might end up with a color histogram and a coarseness map for texture. These are the two features that we will use to search for our Golf Course pictures :)

Normally, work in the past has relied on CONTENT BASED IMAGE RETRIEVAL to retrieve pictures of interest. And most of the earlier experiments were done in controlled environments. One of the most popular techniques was "QUERY BY EXAMPLE", or QBE for short.

QBE is quite a powerful technique, whereby a human user feeds the computer program a sample, or example, image. From there, the computer program looks for pictures in the database that are similar to the example image.

To check for similarity, some sort of DISTANCE measures are used (the most common one being the Euclidean distance). And these distances are measured based on the features being chosen.

In our case, let's say we use the color histogram information to determine dominant GREEN color. And from the coarseness information, we define something that looks GRASSY. The program will then attempt to look for similarities within the database, and match it with these selected features! And once suitable matches are found, it can then be returned to the user as results! :D
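
A toy version of QBE with Euclidean distance, using tiny made-up 3-bin color histograms (nothing like a real histogram, but it shows the ranking idea):

```python
import math

def euclidean(h1, h2):
    """Euclidean distance between two color histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

# Hypothetical database: image name -> 3-bin histogram (red, green, blue).
database = {
    "golf_course.jpg":  [0.1, 0.7, 0.2],  # mostly green
    "soccer_field.jpg": [0.1, 0.6, 0.3],  # also green-ish: a likely false match!
    "beach.jpg":        [0.6, 0.1, 0.3],
}

def query_by_example(example_hist, top_k=2):
    """Rank database images by distance to the example; nearest first."""
    ranked = sorted(database, key=lambda name: euclidean(example_hist, database[name]))
    return ranked[:top_k]

print(query_by_example([0.1, 0.72, 0.18]))  # green-dominant images rank first
```

Note how the soccer field ranks right behind the golf course - exactly the kind of "blind" numerical match described next.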

Computers are good and relatively precise at calculations and processing of image features. But alas, when it returns the results, it does so blindly. All numerical matches are considered matches, whereas in the real world this is not the case.

The computer might mistakenly mark an indoor soccer field as being similar to a golf course. Also, it might return a seasoned snooker table in need of service. In the worst case, it might just return a short-haired-green-skinned sea monster (I know I am exaggerating here).

Anyway, the main point here is that computers are good at extracting, processing as well as comparing numeric data derived from features. But what do these numbers actually represent in the real world? This, they do not (and, I believe, cannot) do as well as we humans can. HENCE!!!

THE SEMANTIC GAP --> The disparity between computer-extractable low-level features and the humanly perceived high-level semantic concepts that they represent.

Well, at least that’s one way of putting it :)

============

So, my main research question now is how to bridge this gap :) Or, maybe if I get really, really lucky, find a way to totally eliminate the gap! Hah!

Many works in the past and present, are trying to tackle this problem, and many have proposed various methods. Earlier methods made use of heuristics OR rules in order to deduce domain concepts. Others made use of statistical modeling. The trend now however, is to ask the computers to learn... hence Machine Learning. Another approach, which is gaining rapid popularity, is the use of Ontology in order to represent the knowledge of certain domains.

Either way... something has to be done to bridge or close this gap. This is because the amount of digital information (namely VIDEO) is quickly getting out of hand. And we need to devise a way so that the management and utilization of such resources could be at our fingertips, God Willing :)

Too Much SPORTS Video!

As many of you already know, digital video can be (digitally, of course :P) viewed from almost anywhere. You have your mobile-TVs, mobile-phones, laptops, PDAs, electronic billboards and obviously, your personal computer.

One advancement that has made viewing digital videos possible is the speed-wise improvement made to Internet connections. Nowadays, 1 Megabit per second is a no-brainer. At this speed or higher, users can seamlessly download and view any type of video in (almost) real-time!

In addition to that, storage devices such as Blu-ray discs, portable hard disks, flash memory, CDs and DVDs have become so affordable. This allows people to store their videos; the creation of personal repositories. In Malaysia, ~200 Gigabytes of storage can be bought for under 200 Malaysian Ringgit (around 60 to 64 US Dollars). USB memory sticks can now hold Gigabytes of data, making them ideal candidates for temporary repositories, managing the storage of up to 4 high-quality movies at a time.

Another technology that has made its way into the homes of consumers is the video capture card. Savvy computer users can plug one of these contraptions into their PC, attach it to a TV, and presto! You can record your video in no time. For the less savvy, DVD recorders are now available, making recording your favorite sitcom or football match a no-brainer. Furthermore, you'd have the recording on a DVD, which provides Gigabytes of storage.

=================

With all of this said, we can see that there is no problem for anyone (with means of course) to get a hold of their video of choice. Sports video is considered one of the popular genres due to its wide fan base (even some housewives LOVE soccer as much as their husbands). Besides that, there are the huge commercial benefits attached such as advertising.

TV broadcast companies, as well as individual users, can have video archives, consisting of hundreds (or even thousands) of Gigabytes of video documents. With all of this video in their hands, is it really possible to manage all of it? Can you really find what you want in this huge repository of information? ... ... ... Let us look at an example:

Xander and his team are asked to produce a 30-minute video showing the past exploits of Diego Armando Maradona. They are required to look for the BEST past footage of Maradona, which are his individual goals, his Hand-of-God goal, as well as all his previous teams/clubs.

The station stores ALL of Maradona's videos, and fortunately, all are in digital format :) The size of the archive is around 300 Gigabytes, consisting of documentaries, friendly games, international games, club games etc.

Now... the problem is... how can the team go about and look for the best footage of Maradona in hundreds and hundreds of hours of video, as well as in thousands of gigabytes worth of media? Should there be a division of team who could go and wade (and I mean wade...) through each video to get the best footage?

You and I know the answer to this... and it's a definite NO! NO-NO as a matter of fact. To MANUALLY go through archives of videos is indeed a problem. And if no proper annotation or indexing is done, the whole task is downright prohibitive....

And even if there's a computerized mechanism for all of this, how does the team search for a particular Maradona event? Is it possible to type in a query, for example: "Look for Maradona Hand-of-God goal", and get a good match from the archive/database? ... Hmmmm... MAYBE THERE IS?

How can this be done? ... Now, this is actually the MAIN part of my research... and it will attempt to tackle a popular issue/problem in the world of Computer Vision and Pattern Recognition. This problem is called the SEMANTIC GAP!

Introduction to my PhD Research

Hello. My name is Alfian Abdul Halin, and I am a PhD student at the School of Computer Sciences, Universiti Sains Malaysia. This is my first attempt at doing research by the way, because my bachelor's (obviously) and master's were both by coursework :) So, I hope I can find my niche in my PhD work, and hopefully graduate so that I could start (and hopefully JUMP start) my academic career, insya-Allaah (meaning God Willing). Ok! So let me introduce my research work :)

==========================

I am basically doing something that has already been done before, only that I hope to provide a new or improved solution. So basically, my research is under the category of OLD-PROBLEM NEW-SOLUTION group.

My interest is in sports, particularly soccer. Even though we call it football where I come from (Malaysia), let's just use the term soccer for the sake of not confusing it with American Football (why do they call it that? I have no idea, for not much FOOT action is used :P).

For my PhD research, I plan to look at the detection of important events in digitally RECORDED soccer videos (that means, LIVE TELECASTS are out of my domain). Besides events, I would also like to examine the structure of the video itself, where I could possibly identify important video segments.

For events, it's the normal stuff. I am now looking at a better way (or more intuitive way) to detect GOALS, YELLOW CARDS, RED CARDS, SUBSTITUTION, FREE KICKS etc. For important segments, I am looking to search for FIRST-HALF, SECOND-HALF, FIRST-HALF ANALYSIS, FULL-TIME or MATCH ANALYSIS, TEAM STARTING LINEUP, FORMATION etc.

Yes... yes... I know... the work has been done before. Among the good ones are such as listed under the BIBLIOGRAPHY label (as can be seen at my blog's sidebar).

Anywayz, my work, hopefully, will look at how to improve event detection by minimizing (or maybe optimizing) processing of video! The results should be...:

1. A better framework that will minimize the need for video processing,
2. Higher accuracy (in terms of PRECISION and RECALL) for event or segment detection.
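
For my own reference, this is how PRECISION and RECALL are computed from the detection counts (the example numbers are made up):

```python
def precision_recall(tp, fp, fn):
    """PRECISION = TP/(TP+FP); RECALL = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# e.g. 18 correctly detected goals, 2 false alarms, 2 missed goals:
p, r = precision_recall(tp=18, fp=2, fn=2)
print(round(p, 2), round(r, 2))  # 0.9 0.9
```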

And once detected, I will create a proper (semantic) index, so that soccer fans, as well as anyone interested, can browse a particular soccer game... or also possibly look for an event or video segment of interest (from a particular game) using intuitive or high(human)-level queries... :)

IN ALL, I hope this index can be built, and also... I hope to achieve a certain amount of novelty so that I could contribute to Semantic Video Analysis knowledge, as well as... of course... get my PhD degree :D

Ok... Cheerios! Will update on some more things later....

==========================