Wednesday, July 16, 2008

A Novel Framework for Semantic Annotation and Personalized Retrieval of Sports Video

Reference: Xu, C., et al., A Novel Framework for Semantic Annotation and Personalized Retrieval of Sports Video. IEEE Transactions on Multimedia, 2008. 10(3): p. 421-436.

Objectives: To detect important sporting events (experiments are shown in the soccer and basketball domains), create an index, and finally cater for personalized querying by users.

In a nutshell

The authors approach semantic indexing and retrieval of digital video (namely sports) in a rather different way. Instead of solely analyzing the low-level features of the aural and visual modalities, they make extensive use of OUTSIDE/EXTERNAL information. The external source here is text from websites (also referred to as web-casting text... or some call them weblogs?) - examples include those found on ESPN's http://www.soccernet.com (under the weblog link if I'm not mistaken).

Basically, their method is divided into three parts...

1. Text Analysis

The FIRST part (TEXT ANALYSIS) involves querying the web server at ESPN or BBC (for example), and then looking for text regions of interest (ROI). In short, look for text areas that describe the game at hand.

This is followed by keyword identification, done by matching the keywords found in the web-casting text against the particular sports keyword(s) that the authors define.

Finally, the authors come up with TEXT EVENTS based on the matched event keywords. This involves keyword matching combined with STEMMING, PHONIC and FUZZY search...
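Just to make this step a bit more concrete, here's a minimal Python sketch of keyword-based text event detection with stemming and fuzzy matching; the keyword list, NLTK's PorterStemmer and the fuzzy cutoff are my own assumptions (and phonic matching is left out for brevity):

# Rough sketch of text event detection: keyword matching with stemming
# and fuzzy matching. The keyword list and cutoff are assumptions, not
# taken from the paper.
import difflib
from nltk.stem import PorterStemmer

# Hypothetical soccer event keywords (the paper defines its own list per sport).
EVENT_KEYWORDS = ["goal", "penalty", "free kick", "corner", "yellow card", "red card"]

stemmer = PorterStemmer()

def matches_keyword(token, keyword, fuzzy_cutoff=0.85):
    """Return True if a token matches a keyword after stemming,
    or is close enough under fuzzy (edit-distance-like) matching."""
    t, k = stemmer.stem(token.lower()), stemmer.stem(keyword.lower())
    if t == k:
        return True
    return difflib.SequenceMatcher(None, t, k).ratio() >= fuzzy_cutoff

def detect_text_events(webcast_lines):
    """webcast_lines: list of (time_string, text) tuples scraped from the
    web-casting text ROI, e.g. ("23:41", "Goal scored by ...")."""
    events = []
    for time_str, text in webcast_lines:
        tokens = text.replace(",", " ").split()
        for kw in EVENT_KEYWORDS:
            # For multi-word keywords, only the first word is checked here (simplified).
            head = kw.split()[0]
            if any(matches_keyword(tok, head) for tok in tokens):
                events.append({"time": time_str, "event": kw, "text": text})
                break
    return events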

2. Video Analysis

FIRSTLY, after Shot Boundary Detection (using M2-Edit Pro), they classify shots into FAR VIEW, MEDIUM VIEW and CLOSE-UP VIEW. These views are common in sports video and serve to direct (where necessary) the viewer's attention... This process is done by:
  • Classifying each frame within a shot into one of the aforementioned three views by analyzing COLOR, EDGE and MOTION features,

  • Performing a simple Weighted Majority Voting of FRAMES (within the shot boundary) to finally classify the shot (see the sketch just after this list).
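A minimal sketch of that voting step, assuming each frame already carries a view label and a confidence weight (the weighting scheme is my assumption; the paper only says a simple weighted majority vote is used):

# Weighted majority voting of frame-level view labels to classify a shot.
# Frame weights are assumed here (e.g. classifier confidence).
from collections import defaultdict

def classify_shot(frame_labels):
    """frame_labels: list of (view_label, weight) pairs for the frames in one
    shot, where view_label is 'far', 'medium' or 'close-up'."""
    scores = defaultdict(float)
    for label, weight in frame_labels:
        scores[label] += weight
    # The shot takes the label with the highest accumulated weight.
    return max(scores, key=scores.get)

# Example: a shot dominated by far-view frames is classified as 'far'.
print(classify_shot([("far", 0.9), ("far", 0.8), ("close-up", 0.6)]))  # -> 'far'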

SECONDLY, Replay Detection is done... enough said :)

THIRDLY, Video Event Modeling is done. Here, the authors structure the two previous steps' results as mid-level features to model the event.

i.e. EVENT = [Si, Si+1, ..., Sj], where Si is the first shot of the event and Sj the last. Each shot Sk (where i <= k <= j) is represented by a feature vector Fk = (SCk, Rk, Lk), with components as follows (a small code sketch follows the list):

* SCk = Shot Class of shot k,

* Rk = Replay Detection Flag (1 or 0), indicates whether Sk is included in a replay,

* Lk = Length of Sk
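To make that representation concrete, here's a minimal Python sketch of how the shot feature vectors could be laid out; the field names and the numeric encoding of the shot classes are my own assumptions, not the paper's:

# Mid-level feature vector Fk = (SCk, Rk, Lk) for each shot in an event.
from dataclasses import dataclass
from typing import List

SHOT_CLASSES = {"far": 0, "medium": 1, "close-up": 2}  # assumed encoding

@dataclass
class Shot:
    shot_class: str   # 'far', 'medium' or 'close-up' (SCk)
    is_replay: bool   # replay detection flag (Rk)
    length: float     # shot length (Lk), e.g. in seconds or frames

def event_feature_sequence(shots: List[Shot]) -> List[list]:
    """EVENT = [Si, ..., Sj] -> list of feature vectors Fk = (SCk, Rk, Lk)."""
    return [[SHOT_CLASSES[s.shot_class], int(s.is_replay), s.length] for s in shots]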


3. Text/Video Alignment

Basically, they detect the region where the game clock is present. Then they do their own OCR (which only needs to recognize the digits 0-9, which is neat btw).
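As a toy illustration (purely my own sketch, not the authors' code), here's how the per-frame OCR'd clock string might be converted into seconds and turned into a clock-time to frame lookup, assuming an 'MM:SS' overlay:

# Turn per-frame OCR'd clock readings into seconds and build a lookup from
# game-clock time to video frame number. The 'MM:SS' layout is an assumption.
def clock_to_seconds(clock_str):
    """'45:32' -> 2732 seconds of game time."""
    minutes, seconds = clock_str.split(":")
    return int(minutes) * 60 + int(seconds)

def build_clock_index(frame_clock_readings):
    """frame_clock_readings: list of (frame_number, ocr_clock_string) pairs.
    Returns a dict mapping game-clock seconds to the first frame showing it."""
    index = {}
    for frame_no, clock_str in frame_clock_readings:
        secs = clock_to_seconds(clock_str)
        index.setdefault(secs, frame_no)
    return index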

The recognized clock time is then matched with the text event time... this time (on the video frame) will be the starting point for further EVENT BOUNDARY DETECTION (EBD)! (i.e. finding the start and finish of the event). EBD is done via MODELING THE VISUAL TRANSITION PATTERNS of the video events:

i. Linking (or matching) of game clock time and text event time

ii. Events are modeled via HMM (trained using the mid-level features mentioned previously)

iii. The (candidate) SHOT containing the event is selected as the reference (Sref)

iv. The search range starts FROM THE FIRST FAR VIEW SHOT BEFORE Sref and ends at THE FIRST FAR VIEW SHOT after Sref (the authors say that the temporal transition patterns of the event will occur within this range)

i.e. Search Range (FarView-CloseUpView-FARVIEW(START HERE)-CloseUpView-SREF-MedView-CloseUpView-FARVIEW(END HERE)-MedView.......)

v. The trained HMMs are then used to calculate probability scores of all possible partitions within the search range (aligned by shot boundaries). The partition with the highest probability score is selected as the detected event candidate (a rough sketch of this follows below)...
- The start and end boundaries of the event are then the first and last shots within the selected partition.
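Here's a rough sketch of what that partition scoring could look like, assuming a trained event model that exposes a log-likelihood score over a sequence of shot feature vectors (hmmlearn's GaussianHMM.score is one option); the requirement that a candidate partition contains Sref is my own assumption:

# Score all contiguous shot partitions within the search range with a trained
# event HMM and keep the best-scoring one as the detected event. Assumes a
# score_fn(feature_sequence) -> log-likelihood (e.g. a trained HMM's score method).
def detect_event_boundary(feature_vectors, search_start, search_end, ref_idx, score_fn):
    """feature_vectors: list of Fk vectors for all shots in the video.
    search_start/search_end: indices of the first FAR VIEW shots before/after Sref.
    ref_idx: index of the reference shot Sref.
    Returns (best_start, best_end) shot indices of the detected event."""
    best_score, best_span = float("-inf"), None
    for start in range(search_start, ref_idx + 1):
        for end in range(ref_idx, search_end + 1):
            # Candidate partition aligned to shot boundaries, containing Sref.
            candidate = feature_vectors[start:end + 1]
            score = score_fn(candidate)
            if score > best_score:
                best_score, best_span = score, (start, end)
    # The first and last shots of the winning partition are the event boundaries.
    return best_span

With hmmlearn, score_fn could be something like lambda seq: model.score(np.asarray(seq)) for a GaussianHMM trained on mid-level feature sequences of that event type (again, an assumption about tooling, not a claim about the authors' implementation).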
