3D Spatio-Temporal Volumes

3D spatio-temporal volumes (3D-STVs) are obtained by stacking the bounding boxes of a detected human across frames. The generated 3D-STV should be figure-centric, which demands accurate human detection and tracking. In this work, we propose using 3D-Gradients and PFF computed over a spatio-temporal volume as the low-level representation of the video. The underlying idea extends 3D-Gradients, previously used at the interest-point level in a Bag of Visual Words framework. Our work makes the following contributions:

  • Robust representation of motion in spatio-temporal volumes using 3D-Gradients and PFF.
  • Motion descriptors generated using PFF outperform 3D-Gradients in the 3D-STV framework.
  • A detailed set of experiments is conducted to understand the sensitivity of the proposed
    approach to scale, frame rate, and translation.
  • Experimental results on six datasets: UT-Tower, VIRAT (ground and aerial), IXMAS, KTH,
    and Weizmann.
  • Unlike 3D-Gradients, PFF does not require tracking a human to generate figure-centric
    bounding boxes.
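
The volume construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, and we assume grayscale frames with tracked bounding boxes of a fixed size, so the crops stack directly into a (T, H, W) volume on which 3D gradients are computed by central differences.

```python
import numpy as np

def build_stv(frames, boxes):
    """Stack per-frame bounding-box crops into a figure-centric 3D-STV.

    frames: list of 2D grayscale arrays (one per video frame).
    boxes:  list of (y0, y1, x0, x1) tracked boxes, assumed to have the
            same height and width in every frame (illustrative assumption).
    Returns a volume of shape (T, H, W).
    """
    crops = [f[y0:y1, x0:x1] for f, (y0, y1, x0, x1) in zip(frames, boxes)]
    return np.stack(crops, axis=0)

def gradients_3d(volume):
    """Central-difference gradients along t, y, x -- the raw 3D-Gradients cue."""
    gt, gy, gx = np.gradient(volume.astype(np.float64))
    return gt, gy, gx
```

In practice the per-frame crops vary in size and must be resampled to a common resolution before stacking; that resizing step is omitted here for brevity.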