Chennai Mathematical Institute


Data Science Colloquium Series
2 pm - 3 pm, NKN Hall
Recognizing Human Actions at a Distance in Video: From Bag-of-Words to Deep Neural Network Models

Snehasis Mukherjee
IIIT Sri City


Semantic analysis of human actions in video enables various vision-based intelligent systems for applications such as surveillance, robotics, and action-based human-computer interfaces. A method that automatically detects actions such as running, punching, pushing, or kicking in a video can be useful, especially when the performer is at a distance from the camera. In our proposed approach, an important precursor to action recognition is to decompose the video into smaller video elements (called units), within which a possible action may have occurred. The second step of our framework is a key-frame detection procedure, applied inside each unit to estimate the frame where a major change of content has happened. We detect the key-frame in each unit and generate a small video clip with the key-frame at its centre; this clip is assumed to capture one particular action, say running or walking. We propose an efficient person detection method to locate the performers in the action video clips and draw rectangular bounding boxes centred on the performers. The third step is to recognize the action of a single human performer in the video clip using the bag-of-words model, where human poses are taken as words and the video clips as documents. An action is decomposed into a set of poses, and the relations between the poses are represented by a pose graph in which each node represents a pose. If a video clip contains two performers, a separate approach is proposed to recognize the interaction between the two performers.
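The bag-of-words step above can be sketched as follows. This is a minimal, generic illustration, not the speaker's exact method: the per-frame pose descriptors, the codebook size, and the use of plain k-means to build the pose vocabulary are all assumptions for the sake of the example; the talk's pose graph and person detector are not reproduced here.

```python
import numpy as np

def build_pose_codebook(pose_descriptors, k, iters=20, seed=0):
    """Cluster per-frame pose descriptors (n, d) into k 'pose words'
    using plain k-means (an assumed choice of clustering method)."""
    rng = np.random.default_rng(seed)
    centres = pose_descriptors[rng.choice(len(pose_descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centre.
        d2 = ((pose_descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster goes empty.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = pose_descriptors[labels == j].mean(axis=0)
    return centres

def clip_histogram(clip_descriptors, codebook):
    """Represent a video clip (the 'document') as a normalised
    histogram over pose words, ready for any standard classifier."""
    d2 = ((clip_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

A classifier (e.g. an SVM) would then be trained on these clip histograms, one histogram per labelled action clip.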

Recently, the bag-of-words technique has been applied to recognize human actions in egocentric videos, where the camera is mounted on the forehead of the performer. We present a novel methodology for recognizing the activity in a given egocentric video, based on the assumption that some portions of the video are sufficient for recognizing the activity.

We have also applied deep neural networks to the problem of recognizing actions at a distance, proposing a novel scheme based on a 3-dimensional Convolutional Neural Network (3D-CNN) classifier. Conventionally, in human action recognition, every k-th frame of the video is used for training the CNN, where k is a small positive integer. This reduces the volume of training data and avoids over-fitting to some extent, thus increasing recognition accuracy. In the proposed sampling scheme, k consecutive frames are instead encoded into a single frame by computing a weighted summation of the k frames, with the weights drawn from a Gaussian distribution. This preserves the necessary temporal information better than the conventional methods, and is experimentally shown to perform well. The 3D-CNN is used to extract spatio-temporal features, and an LSTM is used as the classifier to recognize the human actions.
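The Gaussian-weighted frame encoding described above can be sketched directly. The window layout (non-overlapping windows of k frames) and the Gaussian spread `sigma = k/4` are assumptions for illustration; the talk specifies only that consecutive k frames are collapsed into one frame by a Gaussian-weighted sum.

```python
import numpy as np

def gaussian_weights(k, sigma=None):
    """Normalised weights from a Gaussian centred on the middle
    of the k-frame window. The spread is an assumed default;
    the talk does not specify one."""
    if sigma is None:
        sigma = k / 4.0
    t = np.arange(k) - (k - 1) / 2.0
    w = np.exp(-t ** 2 / (2 * sigma ** 2))
    return w / w.sum()

def encode_window(frames, sigma=None):
    """Collapse k consecutive frames (k, H, W[, C]) into a single
    frame via a Gaussian-weighted sum over the time axis."""
    frames = np.asarray(frames, dtype=float)
    w = gaussian_weights(len(frames), sigma)
    # Contract the weight vector against the time axis, broadcasting
    # over the spatial (and channel) dimensions.
    return np.tensordot(w, frames, axes=(0, 0))

def encode_video(video, k):
    """Split a video (T, H, W[, C]) into non-overlapping k-frame
    windows and encode each window into one frame."""
    T = (len(video) // k) * k
    return np.stack([encode_window(video[i:i + k]) for i in range(0, T, k)])
```

The encoded frames would then feed the 3D-CNN feature extractor; compared with keeping only every k-th frame, every input frame contributes to the training data, weighted by its distance from the window centre.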


Snehasis Mukherjee obtained his PhD in Computer Science from the Indian Statistical Institute in 2012. Before his doctoral study, he completed a Bachelor's degree in Mathematics at the University of Calcutta and a Master's degree in Computer Applications at Vidyasagar University. He carried out postdoctoral research at the National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA. He is currently an Assistant Professor at the Indian Institute of Information Technology Sri City. He has authored several peer-reviewed research papers in reputed journals and conferences, and is an active reviewer for several reputed journals, including IEEE Trans. CSVT, IEEE Trans. IP, IEEE Trans. HMS, IEEE Trans. Cyb., Pattern Recognition Letters, IET CV, and IET IP. His research areas include Computer Vision, Machine Learning, Image Processing, and Graphics.