Research results on human activity classification in video are described, based on initial human skeleton estimation in selected video frames. Simple, homogeneous activities are considered, limited to single-person actions and two-person interactions. The initial skeleton data is estimated in selected video frames by software tools such as “OpenPose” or “HRNet”. The main contributions of the presented work are the steps of “skeleton tracking and correction” and “relational feature extraction”. It is shown that this feature-engineering step significantly increases classification accuracy compared with processing raw skeleton data. For the final neural-network encoder-classifier, two different architectures are designed and evaluated. The first solution is a lightweight multilayer perceptron (MLP) network implementing the idea of a “mixture of pose experts”: several pose classifiers (experts) are trained on different time periods (snapshots) of visual actions/interactions, while the final classification is a time-related pooling of weighted expert classifications; all pose experts share a common deep encoding network. The second, middleweight solution is based on a “long short-term memory” (LSTM) network. Both solutions are trained and tested on the well-known NTU RGB+D dataset, although only 2D skeleton data are used. Our results show performance comparable with some of the best reported LSTM-, Graph Convolutional Network (GCN)-, and Convolutional Neural Network (CNN)-based classifiers for this dataset. We conclude that, by reducing the noise of skeleton data, highly successful lightweight and middleweight models for the recognition of brief activities in image sequences can be achieved.
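
The “mixture of pose experts” with shared encoding and time-related pooling can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: all dimensions, the single-layer encoder, the frame-to-expert routing, and the uniform pooling weights are assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes (illustrative only): frames, raw-feature dim, encoding dim, experts, classes
T, D_in, D_enc, K, C = 8, 50, 32, 4, 10

W_enc = rng.normal(size=(D_in, D_enc)) * 0.1   # shared encoder (one layer stands in for a deep net)
W_exp = rng.normal(size=(K, D_enc, C)) * 0.1   # one linear softmax head per pose expert

def classify_sequence(x):
    """x: (T, D_in) per-frame skeleton features -> (C,) class probabilities."""
    h = np.tanh(x @ W_enc)                     # common deep encoding shared by all experts
    # Route each frame to the expert responsible for its time snapshot of the action
    expert_of_frame = (np.arange(T) * K) // T
    frame_probs = np.stack([softmax(h[t] @ W_exp[expert_of_frame[t]]) for t in range(T)])
    w = np.ones(T) / T                         # time-related pooling weights (uniform here)
    return (w[:, None] * frame_probs).sum(axis=0)

p = classify_sequence(rng.normal(size=(T, D_in)))
```

Because each expert head outputs a proper distribution and the pooling weights sum to one, the pooled result `p` is again a distribution over the `C` activity classes.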