Article Category: Research Article
Published Online: Feb 24, 2025
Received: Dec 12, 2024
DOI: https://doi.org/10.2478/ijssis-2025-0008
© 2025 Archana Chaudhari et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Wireless sensor networks (WSNs) are widely used to collect and analyze data from the environment. Applications of WSNs include agriculture, military, and medicine. A WSN deploys sensor nodes equipped with sensors, such as temperature, humidity, and moisture sensors, to collect data. These data are then transmitted to the destination or base station [1–2]. A representative node consists of a battery, a sensor, memory, and a microcontroller. The battery supplies the power needed by the node for data transmission. The memory stores the routing algorithm along with the data sensed from the environment. The microcontroller executes the algorithm and controls data transmission. Figure 1 shows a WSN together with the architecture of a sensor node.

WSN with sensor node architecture. WSN, wireless sensor network.
The sensor nodes are often placed in remote locations and harsh environmental conditions, where they are frequently inaccessible. Each node depends on its battery for energy, and the lifetime of the WSN depends on the lifetime of its sensor nodes. Because of this battery limitation, data transmission must be routed so that energy is consumed optimally [3–4], which helps to extend the lifetime of the WSN.
The previous literature presents surveys of energy-efficient or energy-aware routing protocols [5,6,7,8,9]. Two types of approaches have been proposed: energy saving and energy balancing [10]. Many routing approaches aim to save energy during environmental monitoring and data transmission; these are known as energy-saving approaches. The other class comprises energy-balancing approaches. Both prove useful for enhancing the network lifetime.
Machine learning-based routing approaches have been used over the past decade. An overview of machine learning-based energy-efficient routing approaches is presented in Ref. [11].
Reinforcement learning (RL), a branch of machine learning, has been widely used in recent years for routing in WSNs. RL provides a basis for WSN nodes to learn from earlier interactions with the environment and to choose efficient actions in the future. Two approaches are possible: centralized and decentralized. In a centralized approach, the routing path is chosen based on information about the overall topology, while in a decentralized approach, local information from every sensor and its neighbors is used for route selection [12].
An energy-efficient RL routing approach can help to choose the ideal path and thereby enhance the network lifetime. In this work, Section 1 presents the introduction to routing in WSNs and the Q-learning approach. Section 2 discusses state-of-the-art energy-efficient RL-based routing methods. Section 3 presents the proposed energy-aware Q-learning-based routing method. Experiments conducted using the proposed method are presented, along with results and discussions, in Section 4. Finally, conclusions based on the work are presented in Section 5.
Many routing protocols choose the next forwarder node based on good quality of service (QoS); hence, the energy of nodes with good QoS is exhausted first. Jafarzadeh and Moghaddam [13] presented a QoS routing protocol for WSNs based on an energy-aware RL approach. Oddi et al. [14] proposed a Q-learning approach with energy-efficient routing that considers the residual battery levels and control overhead of the wireless sensors. An intelligent energy-based routing protocol was proposed by Kiani et al. [15] to reduce energy consumption; clustering is first applied to establish the network as a connected graph, and data are then transmitted using Q-values to improve the network lifetime. Li et al. [16] proposed energy-efficient Q-learning-based routing for underwater WSNs in which link quality and energy contribute to the Q-value. An energy-efficient routing approach based on RL was proposed by Guo et al. [17]; it explores the reward policy to compute the ideal route so as to enhance lifetime and energy efficiency. To balance energy and reduce the energy consumption of the WSN, Su et al. [12] selected the next forwarder based on the Q-values. A Q-learning algorithm based on data aggregation along with energy-efficiency considerations was proposed by Yun and Yoo [18]; the reward function exploits data aggregation at every node and two types of energy, the residual node energy and the communication energy, to choose the optimum routing path. Cluster-based energy-efficient routing using an RL approach was also presented by Yun and Yoo [18]. In that work, the initial energy of the nodes and the hop count are used to calculate the initial Q-value for the choice of the cluster head (CH), and, for routing decisions, the residual energy of the node along with the hop count is used to provide energy efficiency during data transmission [19].
For cognitive radio sensor networks, Joon and Tomar [20] proposed Ad hoc On-Demand Distance Vector (AODV) routing based on energy-aware Q-learning; Q-learning-based rewards are used to choose a cluster head, and the route is established using AODV [20]. Su et al. [21] presented a Q-learning-based approach that uses the energy of information transmission and Q-values to extend the lifetime of the network. Maivizhi and Yogesh [22] presented Q-learning-based routing for in-network aggregation in WSNs, in which Q-values are computed based on the node energy, link strength, and distance between nodes. Yun and Yoo [23] proposed Q-learning-based routing that exploits rewards in terms of sensor data aggregation, node residual energy, and communication energy.
Q-learning is a model-free type of RL in which the RL agent learns from past and current interactions with the environment. Q-learning operates on state-action pairs: to learn from the environment, a set of states is defined, and for every state a set of actions is defined. For each action, the agent receives a reward, which can be positive, negative, or neutral. Based on the reward, the Q-values are computed, and the computed Q-values in turn determine the action to be executed to reach the goal.
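For reference, the standard tabular Q-learning update, which this class of methods builds on, can be written as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $s_t$ is the current state, $a_t$ the chosen action, $r_{t+1}$ the reward received, $\alpha$ the learning rate, and $\gamma$ the discount factor.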
For routing in WSN, the objective of the proposed Q-learning-based routing approach is to decide the next forwarder node for the transmission of data from the source node to the destination. In the proposed method, the source is taken as the current or present node. From the present node, the goal of the agent is to route data to the sink node via the best next neighbor.
The best next forwarder node is chosen based on the Q-values of the nodes. For a node on the path to be the best next forwarder, the following conditions need to be met: the node must be a neighbor of the current node, that is, it must lie within the transmission range of the current node, and its Q-value must be the maximum among all the neighbor nodes of the present (current) node.
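With notation introduced here only for clarity (it is not taken from the original text), these two conditions can be written compactly as
$$n^{*} = \arg\max_{n \in \mathcal{N}(c)} Q(c, n),$$
where $c$ is the current node, $\mathcal{N}(c)$ its set of neighbor nodes within transmission range, and $n^{*}$ the best next forwarder.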
Based on the above conditions, the Q-learning agent and its update rule are defined in Eq. (1).
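A plausible form of Eq. (1), assuming the standard update given above with the current node as the state and the choice of a neighbor as the action, is
$$Q(c, n) \leftarrow (1-\alpha)\,Q(c, n) + \alpha \left[ R(n) + \gamma \max_{m \in \mathcal{N}(n)} Q(n, m) \right],$$
where $R(n)$ is the energy-based reward of neighbor $n$ from Eq. (2), and $\alpha$ and $\gamma$ are the learning rate and discount factor, whose values are not specified here.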
For the proposed method, the reward exploits the energy of the neighbor nodes. It is defined in Eq. (2).
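A plausible piecewise form of Eq. (2), based on the 0.1 J residual-energy threshold described below and with the reward constants $r_{\mathrm{high}} > r_{\mathrm{low}}$ assumed rather than taken from the original, is
$$R(n) = \begin{cases} r_{\mathrm{high}}, & E_{\mathrm{res}}(n) \ge 0.1\ \mathrm{J} \\ r_{\mathrm{low}}, & E_{\mathrm{res}}(n) < 0.1\ \mathrm{J} \end{cases}$$
where $E_{\mathrm{res}}(n)$ denotes the residual energy of neighbor node $n$.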
The neighbor nodes of the present node are defined based on the transmission range, which for the WSN scenario is set to 40 m. Eq. (3) is used to compute the neighbors of the current node.
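Assuming Euclidean distance between node coordinates, Eq. (3) can plausibly be written as
$$\mathcal{N}(c) = \left\{\, n : \sqrt{(x_c - x_n)^2 + (y_c - y_n)^2} \le 40\ \mathrm{m} \,\right\},$$
where $(x_c, y_c)$ and $(x_n, y_n)$ are the coordinates of the current node $c$ and the candidate node $n$, respectively.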
Each node updates its energy after every packet transmission using Eq. (5).
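A generic form of this update is
$$E_{\mathrm{res}}(n) \leftarrow E_{\mathrm{res}}(n) - E_{\mathrm{tx}},$$
where $E_{\mathrm{tx}}$ is the energy consumed to transmit one packet; the exact transmission-energy model of Eq. (5) is an assumption here.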
In the proposed method, from Eq. (2), if the energy of the neighbor node is greater than 0.1 J, the reward is higher, and if the energy of the node is less than 0.1 J, the reward is lower. This is because a node with energy below 0.1 J may not be able to transmit the packet to the next neighbor. The Q-value of the neighbor node is computed from this reward using Eq. (1). The best next forwarder is then chosen from the neighbor table of the current node as the node with the maximum Q-value.
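As an illustration only, and not the authors' implementation, the selection rule described above can be sketched in Python; the 40 m range and 0.1 J threshold follow the text, while the reward values and the function names are assumptions.

```python
import math

TX_RANGE = 40.0         # transmission range in meters (from the WSN scenario)
ENERGY_THRESHOLD = 0.1  # residual-energy threshold in joules

def neighbors(current, positions):
    """Nodes lying within the transmission range of the current node."""
    cx, cy = positions[current]
    return [n for n, (x, y) in positions.items()
            if n != current and math.hypot(x - cx, y - cy) <= TX_RANGE]

def reward(node, energy):
    """Higher reward for neighbors with sufficient residual energy (reward values assumed)."""
    return 1.0 if energy[node] >= ENERGY_THRESHOLD else -1.0

def best_next_forwarder(current, positions, q_table):
    """Neighbor of the current node with the maximum Q-value, or None if the node is isolated."""
    nbrs = neighbors(current, positions)
    if not nbrs:
        return None
    return max(nbrs, key=lambda n: q_table.get((current, n), 0.0))
```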
The proposed method can deliver packets from the source to the sink node. However, the nodes adjacent to the sink experience additional energy depletion because of congestion. Hence, after some time in the WSN scenario, no best next forwarder may exist, as the energy of the nodes depletes after repeated packet transmissions. At that point, the data packet cannot be transmitted further and is dropped.
Steps for the proposed energy-aware Q-learning-based routing algorithm are summarized as below:
1. Initialize the positions of the network nodes and the number of nodes.
2. Choose the source node as the current node.
3. Obtain the neighbor nodes of the current node using the transmission range.
4. If the sink node exists in the neighbor table of the current node, check the energy of the current node; if the energy of the current node is >0.1 J, forward the packet to the sink node, else drop the packet.
5. Compute the reward matrix using Eq. (2).
6. Compute the Q-values of the neighbor nodes using Eq. (1).
7. Choose the next forwarder node as the node with the maximum Q-value among the neighbor nodes of the current node.
8. Forward the packet to the next forwarder node and update the node energy of the current node.
9. If the packet has not reached the sink node, make the next forwarder node the current node and continue with Step 3 until the data packet reaches the sink.
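A minimal end-to-end sketch of these steps, reusing the helper functions from the previous sketch, is given below; the learning rate, discount factor, and per-packet transmission energy are assumed values, not taken from the paper.

```python
ALPHA, GAMMA = 0.5, 0.8  # learning rate and discount factor (assumed values)
E_TX = 0.01              # energy drawn per packet transmission, in joules (assumed)

def route_packet(source, sink, positions, energy, q_table):
    """Forward one packet hop by hop; returns the path taken, or None if the packet is dropped."""
    current, path = source, [source]
    for _ in range(len(positions)):            # hop limit guards against routing loops
        if current == sink:
            return path
        nbrs = neighbors(current, positions)   # Step 3: neighbors within transmission range
        if not nbrs:
            return None                        # no forwarder exists: drop the packet
        if sink in nbrs:                       # Step 4: sink is a direct neighbor
            if energy[current] <= ENERGY_THRESHOLD:
                return None                    # current node cannot transmit: drop the packet
            nxt = sink
        else:
            # Steps 5-6: compute the reward and update the Q-value of every neighbor,
            # then Step 7: pick the neighbor with the maximum Q-value.
            for n in nbrs:
                future = max((q_table.get((n, m), 0.0) for m in neighbors(n, positions)),
                             default=0.0)
                q_table[(current, n)] = ((1 - ALPHA) * q_table.get((current, n), 0.0)
                                         + ALPHA * (reward(n, energy) + GAMMA * future))
            nxt = best_next_forwarder(current, positions, q_table)
        energy[current] = max(0.0, energy[current] - E_TX)  # Step 8: update residual energy
        current = nxt
        path.append(current)
    return None
```

Here, positions maps node IDs to (x, y) coordinate tuples, energy maps node IDs to residual energy in joules, and q_table starts as an empty dictionary.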
The proposed method is implemented and analyzed experimentally using MATLAB 2022. Two source nodes are defined, and Constant Bit Rate (CBR) traffic is attached to the source nodes. The duration of the CBR traffic is set to 5 s. For the experimental analysis, the number of nodes (N) is varied from 20 to 100 while keeping the packet size at 20 bytes and the packet transmission rate at 20 packets/s. The proposed energy-aware Q-learning-based routing method is compared with the basic Q-learning-based routing algorithm [24].
The performance evaluation and comparison are conducted using the packet delivery ratio (PDR), packet loss ratio (PLR), throughput, and network lifetime. PDR is defined as the ratio of the number of packets received at the sink node to the number of packets sent from the source. PLR is the ratio of the number of packets lost during transmission from the source to the destination to the number of packets sent. Throughput is the total amount of data delivered per unit time. The network lifetime is assessed in terms of the total energy consumed by the network and the number of nodes that remain alive.
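Using the standard formulations, consistent with Table 1, in which PDR and PLR sum to one for each method:
$$\mathrm{PDR} = \frac{N_{\mathrm{received}}}{N_{\mathrm{sent}}}, \qquad \mathrm{PLR} = 1 - \mathrm{PDR} = \frac{N_{\mathrm{sent}} - N_{\mathrm{received}}}{N_{\mathrm{sent}}}, \qquad \mathrm{Throughput} = \frac{\text{data delivered}}{\text{time interval}}.$$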
Table 1 presents the comparison of the performance of the proposed energy-aware Q-learning-based routing algorithm with Q-learning-based routing.
Performance comparison of the proposed energy-aware Q-learning-based routing algorithm with Q-learning-based routing.
N | PLR (Q-learning) | PLR (proposed) | PDR (Q-learning) | PDR (proposed) | Throughput (Q-learning) | Throughput (proposed) | Network lifetime (Q-learning) | Network lifetime (proposed) |
20 | 0.3 | 0.58 | 0.7 | 0.42 | 0.175 | 21 | 0.2674 | 0.358 |
30 | 0.28 | 0.76 | 0.71 | 0.44 | 20.75 | 22 | 0.3155 | 0.344 |
40 | 0.28 | 0.51 | 0.725 | 0.49 | 23.25 | 25 | 0.3439 | 0.33 |
50 | 0.275 | 0.42 | 0.725 | 0.58 | 36.25 | 37 | 0.46233 | 0.33 |
60 | 0.17 | 0.01 | 0.725 | 0.99 | 41.35 | 49.5 | 0.5965 | 0.978 |
70 | 0.13 | 0.01 | 0.87 | 0.99 | 43.5 | 49.5 | 0.6865 | 1.29 |
80 | 0.13 | 0.01 | 0.87 | 0.99 | 43.5 | 49.5 | 0.8072 | 1.43 |
90 | 0.13 | 0.01 | 0.87 | 0.99 | 43.5 | 49.5 | 0.9189 | 1.47 |
100 | 0.13 | 0.01 | 0.87 | 0.99 | 43.5 | 49.5 | 1.032 | 1.596 |
PDR, packet delivery ratio; PLR, packet loss ratio.
Figure 2 presents the PLR of the proposed energy-aware Q-learning-based routing method against the number of nodes. From Figure 2, it is observed that, due to the energy constraint in the reward function, the PLR is lower than that of the basic Q-learning-based routing method.

PLR against number of nodes. PLR, packet loss ratio.
Figure 3 shows the PDR of the proposed method against the number of nodes. From Figure 3, the PDR is higher than that of the basic Q-learning-based routing method.

PDR against number of nodes. PDR, packet delivery ratio.
Figure 4 shows the throughput of the proposed method. The throughput of the proposed method is higher than that of the Q-learning method.

Throughput against number of nodes.
Figure 5 shows the network lifetime of the proposed method and illustrates the enhanced network lifetime achieved using the proposed method in comparison with the Q-learning algorithm.

Network lifetime against number of nodes.
Routing of data packets in WSNs should ensure that packets transmitted from the source reach the sink node without loss. In a WSN, the nodes near the sink experience congestion, as most of the packets are routed to the sink via these nodes; this depletes their energy and leads to packet loss. This work proposed an energy-aware Q-learning-based routing method that exploits the rewards in Q-learning in terms of node energy for a better choice of the next forwarder node toward the sink, together with an experimental analysis. As stated in the literature review, many other works exploit the hop count of the nodes to the sink to choose the next neighbor, and a few explore the hop count along with the residual energy of the node for better routing. Experimental analysis of the proposed work demonstrates enhanced PDR, throughput, and network lifetime and reduced PLR compared with basic Q-learning-based routing.