Nowadays, a vast number of videos are recorded owing to the exponential growth of commercial devices capable of video recording. Sports are among the most common subjects of such videos, which users may record at, e.g., a public event or a professional match. These videos are usually long and contain redundant and uninteresting parts, so they are often stored and never reviewed again. The field of sports video summarization aims to automatically extract the highlights of the original video for quick review. Existing work in this field leverages domain-specific knowledge, e.g., the structure of games and editing conventions, which is commonly found in broadcast video. However, users' self-recorded videos normally lack any editing conventions, and the structure of the sport is often lost, rendering existing methods ineffective.
This thesis approaches the challenge of summarizing self-recorded sports video by drawing on the field of human action recognition (HAR). We hypothesize that players' actions can be recognized and used as a novel source of semantics for generating summaries. The greatest difficulty in HAR is coping with variability in the actors' anthropometry and the camera viewpoint. This can be alleviated by using depth information, obtainable from widely available commodity RGB-D sensors (e.g., Microsoft Kinect).
In this thesis, we first propose an HAR method with flexible learning that does not require a large number of training instances to perform recognition. We then propose a novel summarization method for user-generated sports video that acquires higher-level semantics by applying HAR to RGB-D video sequences. We evaluated our method using Kendo as an example sport, investigating the accuracy and quality of the generated summaries both objectively and subjectively.