
Wearable IoT and Artificial Intelligence Techniques for Leveraging the Human Activity Analysis

15 June 2024
Introduction

Recently, the incorporation of wearable IoT devices with AI methods has transformed the landscape of human activity analysis. Advances in wearable IoT technology, combined with the power of AI algorithms, have opened new avenues for real-time monitoring, personalized healthcare, fitness tracking, and even the enhancement of daily life [1,2,3,4]. Wearable IoT devices such as smartwatches, fitness trackers, and smart clothing continuously collect massive volumes of data on various human activities, including physical movements, vital signs, and environmental factors. Once processed and analyzed using AI techniques, this data can provide a wealth of insights into an individual’s health, lifestyle, and overall well-being [5].

Wearable IoT devices typically incorporate a range of sensors, like accelerometers, gyroscopes, heart rate monitors, GPS units, and thermometers, to track diverse aspects of human behavior. These sensors collect data on activities such as walking, running, sitting, sleeping, and even more complex behaviors like cycling or specific sports-related movements. This data is then transmitted to cloud platforms or local devices where it can be stored, processed, and analyzed [6,7,8]. The growing availability of high-quality sensors and advanced communication technologies, such as 5G and Bluetooth, enables these devices to operate seamlessly in real-time, offering continuous monitoring without the need for user intervention.

The true power of wearable IoT devices emerges when they are integrated with AI techniques, particularly machine learning (ML) [9,10] and deep learning (DL), which can process and assess the vast amounts of data generated by these devices to uncover patterns, predict future activity, and offer actionable insights. ML models, spanning supervised and unsupervised learning, can be trained to classify varied types of human activity. DL algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) go beyond simple classification, enabling more sophisticated tasks such as activity recognition, gesture detection, and even the prediction of future health events or physical performance [11,12].

Despite the significant advancements in wearable IoT devices and AI, several challenges remain for existing human activity analysis algorithms. One major challenge is the complexity of processing vast amounts of sensor data in real time. The sheer volume of information generated by wearable devices can overwhelm traditional machine learning models, requiring advanced computational techniques to process efficiently. Additionally, the variability of human activities and individual behavior adds complexity to activity recognition, often leading to issues with accuracy and generalization. Another challenge is the limited ability of existing models to handle noisy or incomplete data. Furthermore, the integration of heterogeneous sensor data from different IoT devices and the need for continuous monitoring without draining battery life remain ongoing concerns.

To address these challenges, this work leverages the power of CNN and LSTM models. CNNs are particularly effective at extracting spatial features from sensor data, while LSTMs are suitable for handling sequential data and capturing temporal dependencies in activity patterns. Together, these models can more accurately recognize complex activities, predict future behaviors, and detect anomalies in real time.

Contribution of the Research

Motivated by the challenges in existing Human Activity Recognition (HAR) methods, this research makes the following key contributions:

The study introduces a comprehensive data preprocessing pipeline, addressing noise, missing values, and normalization challenges in the WISDM dataset. This ensures the input data is of high quality and suitable for accurate feature learning.

A hybrid DL approach is proposed, combining CNN for spatial feature extraction with LSTM for modeling temporal dependencies, effectively capturing complex activity patterns from time-series data.

The model demonstrates its effectiveness in classifying diverse human activities, validated with metrics such as precision, recall, and F1-score.

Structure of the Paper

The remainder of the paper is organized as follows: Section 2 discusses relevant studies conducted by various authors. Section 3 describes the dataset preparation, preprocessing, and the proposed methodology in detail. Section 4 presents the evaluation outcomes of the proposed approach along with comparative assessments. Section 5 concludes the paper and discusses future enhancements.

Related Works

Alarfaj et al. (2024) [13] introduced an innovative technique for human activity recognition using specialized CNNs developed for individual sensor types, including accelerometers, gyroscopes, and barometers. Their key contribution was developing sensor-specific CNN models capable of capturing the unique characteristics of each sensor type, combined with a late-fusion method to merge predictions from the different models. Evaluated against traditional support vector machine (SVM) classifiers, their approach demonstrated superior accuracy, significantly outperforming the SVM baseline. The research highlighted the importance of integrating multiple sensors with barometer data and using additional filtering algorithms for improved accuracy, though the complexity of managing multiple sensor-specific models may limit real-world implementation.

Alanazi et al. (2024) [14] conducted a comprehensive evaluation of machine learning algorithms for HAR utilizing smartphone sensor data. Their study systematically tested various algorithms under predefined settings using the KU-HAR dataset. The LightGBM classifier emerged as the superior performer across multiple metrics, including accuracy, F1-score, precision, and recall. While the gradient boosting approach demonstrated excellent performance, the study noted its lengthy training time as a significant limitation, though the other classifiers achieved acceptable training periods.

Raj and Kos (2023) [15] presented a sophisticated implementation of CNN for HAR tasks using the WISDM dataset. Their approach utilized a two-dimensional CNN model to capture both temporal and spatial information related to human activities. The proposed model achieved impressive accuracy, surpassing other ML techniques. Their study highlighted the potential of deep learning methods for expanding HAR applications, although the evaluation was limited to a single public dataset.

Nia et al. (2023) [16] demonstrated the effectiveness of combining Artificial Neural Networks (ANNs) for feature extraction with the collective decision-making power of Random Forest, resulting in enhanced accuracy for classifying human activities. By utilizing the advantages of both ANN and Random Forest (RF), the integrated model tackles the issues with real, imperfect data, resulting in a more resilient and precise classification system. The outcomes emphasize the efficiency of feature extraction through ANNs and stress the significance of including RF in HAR methods to achieve enhanced accuracy. While the results were promising, the computational complexity of maintaining both ANN and Random Forest components may present challenges in resource-constrained environments.

Li and Wang (2022) [17] focused on HAR for elderly populations, employing an integration of conventional ML and DL algorithms. Their study utilized smartphone-based gyroscope and accelerometer data to monitor activities in various environments. LSTM achieved the highest accuracy, while Support Vector Machine demonstrated efficient performance with 85.07% accuracy and a significantly lower computational time of 0.42 minutes under 10-fold cross-validation. The research particularly emphasized the importance of efficient activity recognition for elderly care applications. A drawback of this study is the reliance on smartphone sensors, which may not capture all relevant activity patterns or environmental factors.

Nazari et al. (2022) [18] proposed several models for predicting four activities using inertial sensors located in the ankle area of lower-leg assistive device users. Their innovative approach focused on sensors that could be integrated directly into the control unit of assistive devices rather than requiring skin attachment. The Artificial Neural Network-based models achieved promising average classification accuracy, though the limited scope of activities and sensor placement may restrict broader applications.

Uddin and Soylu (2021) [19] proposed an innovative time-sequential information-based deep Neural Structured Learning (NSL) system for HAR using body sensor data. Their approach incorporated kernel-based discriminant analysis (KDA) for feature optimization and LSTM for activity modeling, achieving enhanced recall rates on public datasets. However, the system’s reliance on specific body sensor data may restrict its applicability across different environments or diverse user populations.

Wang et al. (2020) [20] introduced a wearable sensor-based IoT system for continuous health monitoring of sports persons. Their research implemented various classifiers, with the Ensemble Bayesian deep classifier (EBDC) achieving the best performance, a minimum error rate of 0.25 and the maximum recall value, outperforming other approaches including neural networks, RNN, BDNN, and LSTM. However, these models may require substantial computational resources, limiting their scalability for large-scale deployments.

Mahmud et al. (2020) [21] introduced a multi-stage training technique for HAR that uniquely employed multiple transformations on time-series data to obtain diverse feature representations. Their method utilized an efficient deep CNN architecture trained on the different transformed spaces, achieving remarkable results across multiple datasets, including the UCI HAR, USC HAR, and SKODA databases. While the results were impressive, the computational overhead of managing multiple transformation stages and training processes could impact real-time applications.

Hou (2020) [22] conducted a comparative assessment of DL and traditional ML models for HAR, specifically examining their performance on datasets of varying sizes. The research revealed that traditional machine learning methods performed better on smaller datasets (achieving 87% accuracy), while DL approaches including CNN and LSTM were more effective with larger datasets (reaching 88% accuracy). This research offered valuable insights into the selection of appropriate methodologies based on dataset characteristics, though its findings may be limited to the specific datasets and architectures tested.

Proposed Methodology

The proposed HAR system, shown in Figure 1, is designed as a hybrid deep learning framework comprising four integral components: data collection, data preprocessing, feature extraction, and activity classification. The system utilizes the Wireless Sensor Data Mining (WISDM) dataset, which includes accelerometer and gyroscope data recorded during various human activities. The raw data undergoes rigorous preprocessing to ensure consistency and quality, addressing issues like noise, missing values, and normalization to prepare it for model training. For feature extraction, the system employs a hybrid architecture that combines CNN and LSTM networks. The CNNs extract intricate spatial features from the time-series data, while the LSTMs capture the temporal dependencies critical for recognizing activity sequences. These features are then transferred to fully connected dense layers for activity classification. This classification component leverages the information-rich features learned by the hybrid model to deliver precise predictions.

Figure 1:

Proposed Architecture

Materials and Methods

For this research, we utilized the WISDM (Wireless Sensor Data Mining) dataset [23], a high-quality time-series dataset specifically designed for HAR tasks. It consists of motion data collected from wearable accelerometers during six distinct physical activities: walking, jogging, sitting, standing, upstairs, and downstairs. The dataset contains 1,098,207 records, with a class distribution reflecting the natural duration and frequency of each activity; the activity percentages are shown in Figure 2. Data collection was performed at a sampling rate of 20 Hz (one sample every 50 milliseconds), capturing fine-grained details of user movements. The dataset is structured with six key attributes, covering user information, activity labels, timestamps, and acceleration data along three axes, as illustrated in Table 1; a minimal loading sketch follows the table. Notably, the acceleration values include gravitational components, making the dataset comprehensive for studying motion patterns.

Figure 2:

Activity Distribution in the WISDM Dataset

Table 1: Dataset Attribute Description

Attribute Type Description
User Nominal Identifier for participants, ranging from 1 to 36.
Activity Nominal The activity performed, classified into six categories: Walking, Jogging, Sitting, Standing, Upstairs, Downstairs.
Timestamp Numeric Device uptime in nanoseconds, representing the timing of recorded motion.
x-Acceleration Numeric Acceleration along the x-axis in m/s², including gravitational acceleration.
y-Acceleration Numeric Acceleration along the y-axis in m/s², including gravitational acceleration.
z-Acceleration Numeric Acceleration along the z-axis in m/s², including gravitational acceleration.
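As a hedged illustration, the raw WISDM v1.1 file can be loaded into pandas as sketched below. The file path is a placeholder, and the column names are our own labels following the attribute table above; the trailing semicolon handling reflects the formatting of the distributed raw file.

```python
import pandas as pd

# Hypothetical column names mirroring Table 1
COLUMNS = ["user", "activity", "timestamp", "x_accel", "y_accel", "z_accel"]

def load_wisdm(path="WISDM_ar_v1.1_raw.txt"):
    # Raw records are comma-separated; skip occasional malformed rows
    df = pd.read_csv(path, header=None, names=COLUMNS, on_bad_lines="skip")
    # Each raw line ends with a stray ';' attached to the z value
    df["z_accel"] = (df["z_accel"].astype(str)
                     .str.replace(";", "", regex=False))
    df["z_accel"] = pd.to_numeric(df["z_accel"], errors="coerce")
    return df
```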
Data Preprocessing

The WISDM dataset undergoes key preprocessing steps to ensure its readiness for model training, with a focus on handling missing values and normalizing the data. First, any missing values in the dataset are addressed by removing incomplete records to maintain the integrity of the data. Since missing values can introduce bias or distort model predictions, eliminating these records ensures that only complete and reliable data is used for training. This step guarantees consistency and prevents any gaps that could undermine the effectiveness of the ML techniques.

Next, the dataset is normalized to ensure that all input features, such as accelerometer values along the x, y, and z axes, are on a similar scale. Normalization is essential because accelerometer data can vary significantly across different users and activities, leading to discrepancies that could impact the algorithm’s ability to learn meaningful patterns. In this case, Z-score normalization is applied, which standardizes the data by subtracting the mean and dividing by the standard deviation for each feature. This ensures that the input features have a mean of 0 and a standard deviation of 1, enabling the model to train more efficiently and effectively by preventing any single feature from dominating the learning process.
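A minimal sketch of these two preprocessing steps, assuming the column names from the loading sketch above:

```python
import pandas as pd

AXES = ["x_accel", "y_accel", "z_accel"]  # hypothetical column names

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Remove incomplete records so only complete, reliable rows remain
    df = df.dropna(subset=AXES).reset_index(drop=True)
    # Z-score normalization: zero mean, unit standard deviation per axis
    for axis in AXES:
        df[axis] = (df[axis] - df[axis].mean()) / df[axis].std()
    return df
```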

Feature Extraction Module

The Feature Extraction Module in this research integrates the benefits of CNN and LSTM networks to examine the raw sensor information collected from wearable devices. CNNs are used to extract intricate spatial features from the accelerometer and gyroscope time-series data, ensuring the identification of activity-specific patterns. LSTMs complement this by capturing the temporal dependencies inherent in the sequential data, enabling the model to understand the progression and context of activities over time. This integrated approach ensures robust and comprehensive feature extraction, which is crucial for achieving high accuracy and reliability in human activity recognition (HAR) tasks.

Convolutional Neural Network (CNN): An Overview

A CNN is a DL approach designed to extract features from structured data, such as images and time-series signals. CNNs are especially powerful for tasks that require localized pattern recognition because of their capacity to self-learn spatial relationships among features. The CNN architecture comprises convolutional layers, pooling layers, and fully connected layers, each contributing to the overall feature extraction process.

Convolutional Layers

They are the primary components of a CNN. Their role is to apply convolution operations to the input data using filters (also known as kernels). These filters slide over the data, performing a mathematical operation that generates feature maps. These feature maps highlight key characteristics, such as edges or textures, in the input data. In time-series data, convolutional layers capture temporal dependencies and activity-specific features by moving along one dimension. By stacking multiple convolutional layers, a CNN is able to learn progressively more abstract and complex features, enabling the model to represent higher-level patterns and structures from the raw input.

Pooling Layers

Pooling layers are placed between convolutional layers to decrease the spatial dimensions of the feature maps. This reduction in size helps minimize computational complexity and prevents the model from becoming overly complex, which could lead to overfitting. The most common type of pooling used in CNNs is max-pooling, where the maximum value from each region of the feature map is chosen. This operation retains the most important features while discarding less relevant data. Pooling not only reduces computational load but also aids in generalizing the model by making it more invariant to slight changes in the input. As a result, pooling layers help to efficiently extract robust and high-level features essential for tasks like activity recognition or object detection.
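The mechanics of these two operations can be illustrated with a toy NumPy sketch; the signal and filter values below are made up purely for illustration. A filter slides along a 1-D signal to produce a feature map, and max-pooling then halves its length while keeping the strongest responses:

```python
import numpy as np

def conv1d(signal, kernel):
    # Valid 1-D convolution (as used in CNNs): slide the kernel
    # along the time axis and take a dot product at each position
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(n)])

def max_pool1d(x, size=2):
    # Keep the maximum of each non-overlapping window
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

signal = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.7])  # toy accelerometer trace
kernel = np.array([1.0, -1.0])                     # edge-detecting filter
features = conv1d(signal, kernel)                  # feature map (length 5)
pooled = max_pool1d(features)                      # downsampled map (length 2)
```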

Figure 3:

CNN Architecture

The raw time-series data from wearable devices is first processed through a series of convolutional layers. These layers capture intricate spatial patterns related to human activities, such as walking, running, or sitting. These features are then passed to the LSTM layers, which are adept at capturing temporal dependencies.
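A minimal Keras sketch of this hybrid pipeline is shown below. The window length, filter counts, and LSTM units are illustrative assumptions, not the exact hyperparameters of the trained model:

```python
import tensorflow as tf
from tensorflow.keras import layers

WINDOW = 80      # assumed segmentation length: 4 s of data at 20 Hz
CHANNELS = 3     # x, y, z acceleration
NUM_CLASSES = 6  # the six WISDM activities

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, CHANNELS)),
    # Conv1D layers slide filters along the time axis to extract
    # local, activity-specific patterns from the raw signal
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),   # downsample, keep strongest responses
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # LSTM consumes the CNN feature sequence and models temporal order
    layers.LSTM(100),
    # Dense head maps learned features to per-class probabilities
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```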

Long Short-Term Memory (LSTM)

LSTM networks, a type of RNN, are integral to the hybrid DL framework proposed for HAR in this system. LSTMs are highly efficient at capturing long-range temporal dependencies, which are essential for accurately recognizing activity sequences in time-series data from accelerometers and gyroscopes. By addressing the challenges of conventional RNNs, especially the vanishing gradient problem, LSTMs enable the model to retain essential information over time while discarding irrelevant data.

The LSTM architecture, illustrated in Figure 4, is composed of memory cells and gating mechanisms that control the flow of information within the network. These components allow LSTMs to effectively handle the sequential nature of human activity data, ensuring accurate recognition across varying time frames.

Figure 4:

LSTM Architecture

The fundamental parts of an LSTM unit are:

Cell State ($C_t$): This represents the memory of the network and carries important information over time.

Hidden State ($h_t$): This represents the output of the LSTM unit at time step $t$, which is passed to the next layer or time step.

Gates: These are mechanisms that control the flow of information within the memory cell. They are:

Forget Gate: Determines which information to remove from the cell state.

Input Gate: Determines which new information to save in the cell state.

Output Gate: Decides what the next hidden state will be.

Forget Gate ($f_t$)

It controls how much of the previous cell state ($C_{t-1}$) should be carried forward to the current cell state. It takes the previous hidden state ($h_{t-1}$) and the current input ($x_t$) as inputs and outputs a value between 0 and 1, which is then used to scale the previous cell state:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Here $f_t$ is the forget gate’s output, $W_f$ is the weight matrix for the forget gate, $b_f$ is the bias term, and $\sigma$ is the sigmoid activation function, which squashes the output between 0 and 1.

Input Gate ($i_t$)

It decides how much of the new information should be incorporated into the cell state. The input gate is computed as:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

The candidate cell state $\tilde{C}_t$ represents the potential new information that can be added to the cell state; it is computed from the same inputs through a tanh activation:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Here $i_t$ is the input gate’s output, $W_i$ and $W_C$ are the corresponding weight matrices, and $b_i$ and $b_C$ are the bias terms.

Updated Cell State ($C_t$)

The cell state is updated by merging the previous cell state ($C_{t-1}$) with the new candidate cell state ($\tilde{C}_t$), regulated by the forget gate and the input gate:

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$$

Here $C_t$ is the updated cell state, $f_t$ regulates the extent to which the previous cell state is preserved, and $i_t$ manages the degree to which the new candidate cell state is incorporated.

Output Gate ($o_t$) and Hidden State ($h_t$)

The output gate decides what the next hidden state ($h_t$) will be. The hidden state is derived from the updated cell state ($C_t$), which is squashed with the tanh function and scaled by the output gate:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \cdot \tanh(C_t)$$

Here $o_t$ is the output gate’s output, $h_t$ is the hidden state, $W_o$ is the weight matrix for the output gate, $b_o$ is the bias term, and $\tanh(C_t)$ squashes the cell state so that the hidden state lies between −1 and 1.

These components enable the LSTM to capture long-term dependencies in time-series data by maintaining and updating relevant information across time steps. This ability makes it particularly well-suited for recognizing complex and evolving activity patterns, essential for accurate human activity recognition.
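To make these equations concrete, the following minimal NumPy sketch computes a single LSTM time step. The weights here are randomly initialized for illustration only; they are not the trained model’s parameters, and the dimensions are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step, implementing the gate equations above."""
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

# Toy dimensions: 3 input features (x, y, z) and 4 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```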

Dense Classification Layers

The dense classification layers in the proposed human activity recognition (HAR) system serve as the final decision-making stage, where the features extracted by the CNN and LSTM layers are transformed into predictions for different activity classes. After the hybrid model captures spatial features through CNNs and temporal dependencies through LSTMs, the extracted feature vector is passed to the fully connected layers, or dense layers. These layers execute a linear transformation of the feature vector, followed by an activation function, which helps capture complex relationships between the learned features.

For multiclass classification, each neuron in the final dense layer corresponds to one of the activity classes. The output of these neurons represents the network’s confidence in each class, with the highest value indicating the predicted activity. The dense layers are followed by a softmax activation function, which is particularly suitable for multiclass problems as it normalizes the outputs into a probability distribution.

The softmax function is applied to the output of the dense layer to convert the raw outputs (logits) into a probability distribution over all the activity classes. These classification layers, equipped with softmax, ensure that the network provides accurate and interpretable predictions for the multiple activity classes in the WISDM dataset.
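Concretely, for raw outputs (logits) $z_1, \dots, z_K$ over the $K = 6$ activity classes, the softmax probability assigned to class $i$ is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$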

Results and Discussion

This section provides the evaluation of the proposed human activity recognition system, highlighting key performance metrics and discussing the performance of the hybrid CNN-LSTM in classifying activities from the WISDM dataset.

Implementation Details

The model was developed using Python 3, with libraries such as Matplotlib, NumPy, Pandas, Scikit-Learn, and Seaborn used for evaluating the proposed model. The experimentation was carried out on a PC workstation with an i7 CPU at 3.2 GHz operating frequency, 16 GB RAM, and an NVIDIA Tesla GPU.

Evaluation Metrics

Evaluation metrics including accuracy, precision, recall, specificity, and F1-score are measured and compared against other state-of-the-art DL models to show the effectiveness of the proposed model. These measures also highlight the model’s efficiency and low computational overhead. Table 2 displays the mathematical expressions used to calculate the performance metrics; a brief evaluation sketch follows the table. Early stopping is applied to avoid overfitting and generalization problems: training is terminated when the validation performance of the proposed model fails to improve over time.

Table 2: Evaluation Metrics utilized for assessment

SL.NO Performance Measure Expression
1 Accuracy $\frac{TP + TN}{TP + TN + FP + FN}$
2 Recall $\frac{TP}{TP + FN}$
3 Specificity $\frac{TN}{TN + FP}$
4 Precision $\frac{TP}{TP + FP}$
5 F1-Score $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

TP and TN denote True Positives and True Negatives; FP and FN denote False Positives and False Negatives.
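As a hedged illustration (the model, data splits, and variable names are placeholders, not the authors’ exact setup), early stopping and the per-class metrics in Table 2 can be obtained as follows:

```python
import numpy as np
from sklearn.metrics import classification_report
from tensorflow.keras.callbacks import EarlyStopping

# Early stopping: halt training once validation loss stops improving
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])

# Precision, recall, and F1-score per activity class on toy labels
y_true = np.array([0, 1, 2, 2, 1, 0])  # placeholder ground truth
y_pred = np.array([0, 1, 2, 1, 1, 0])  # placeholder predictions
print(classification_report(y_true, y_pred))
```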

Experimental Outcomes

Table 3 presents a comparative assessment of various DL algorithms for HAR based on key evaluation metrics: accuracy, sensitivity, specificity, precision, and F1-score. The proposed hybrid model outperforms the baseline models, including Artificial Neural Networks (ANN), Gated Recurrent Units (GRU), LSTM, and CNN. With an accuracy of 99.4%, the proposed model achieves high scores across all metrics, highlighting its robustness in recognizing activities from the WISDM dataset.

Table 3: Comparative Analysis between the Different Models in recognizing activities.

Algorithm Accuracy (%) Sensitivity (%) Specificity (%) Precision (%) F1-Score (%)
ANN 84.6% 85.8% 83.6% 83.4% 85.9%
GRU 86.1% 86.4% 85.6% 85.8% 85.8%
LSTM 88.3% 87.5% 86.7% 86.7% 87.8%
CNN 92% 91.3% 89.1% 90% 91.8%
Proposed Model 99.4% 98.6% 99.3% 99% 98.2%

Figure 5 illustrates the comparative assessment of testing accuracy among various deep learning methods. It highlights the consistent superiority of the proposed hybrid CNN-LSTM model over other architectures, with a significant margin in performance accuracy.

Figure 5:

Comparative Analysis with Varying DL models based on testing accuracy

Figure 6 shows the ROC curves for the different models, showcasing the ability of each model to differentiate between activity classes. The proposed model achieves a superior ROC curve, demonstrating excellent classification performance.

Figure 6:

ROC Curves for the Different Models in recognizing activities.

Figure 7 depicts the validation loss curve for the proposed approach during training. It highlights the convergence of the model, with the validation loss decreasing steadily and stabilizing, indicating the model’s ability to generalize well without overfitting. These results collectively emphasize the performance and reliability of the proposed hybrid model in human activity recognition tasks.

Figure 7:

Validation Loss Curve for the Proposed Model

Conclusions

This research underscores the effectiveness of integrating wearable IoT devices with advanced AI techniques to design a robust Human Activity Recognition (HAR) system. Wearable IoT devices, equipped with sensors like accelerometers and gyroscopes, continuously collect diverse data on human movements, offering a rich foundation for activity analysis. Using the WISDM dataset, the proposed hybrid deep learning framework, combining CNNs for spatial feature extraction and LSTMs for temporal sequence modeling, achieved improved classification accuracy alongside strong evaluation metrics such as precision, recall, specificity, and F1-score. The model successfully addresses challenges like data variability and noise, showcasing its reliability in real-time activity analysis. Its robust performance highlights its practical utility in applications such as personalized healthcare, fitness monitoring, and remote health systems. The study not only advances state-of-the-art HAR models but also demonstrates the feasibility of deploying such systems in real-world environments, setting a foundation for further innovation in wearable IoT-based activity analysis.

Future research can explore integrating additional sensors, such as physiological and environmental ones, to enhance activity recognition. Incorporating energy-efficient models and hardware optimization will further improve real-time applicability in resource-constrained settings. Expanding datasets to include diverse populations and activities will improve model generalization and inclusivity.