
Automatic Landing Control of Aircraft Based on Cognitive Load Theory and DDPG



Introduction

Research on pilot cognitive load can be traced back to the early 20th century [1-3]. Cognitive load refers to the psychological resources an individual expends to solve problems or complete tasks within a certain period of time [4,5]. When working memory capacity is overloaded by new information received directly or indirectly, the burden on the cognitive system increases and a cognitive load is formed [6]. Artificial intelligence technology is now widely applied in many fields, and its future development is difficult to estimate; these developments will profoundly transform the related fields. Machines that combine AI and automation technology, such as cars, already possess both automation capabilities (the mechanical and control systems that allow the vehicle to travel according to instructions) and autonomous capabilities (AI-driven environmental perception and path planning), and can complete more and more tasks without human participation. However, as artificial intelligence is increasingly applied in various fields, especially in automation control, human participation remains indispensable: the machine intelligence driven by AI opens up enormous space for automation applications, but it also requires the influence of human factors to be considered. Pilot information processing proceeds in three stages: first, the pilot acquires sensory and tactile information; second, the pilot makes decisions based on past experience when forming a plan; finally, the pilot carries out the operational behavior [7]. The amount of cognitive resources consumed by the pilot during this processing is the cognitive load. The pilot's autonomous willingness to react and operate runs through the human-machine-environment interaction, yet the pilot's subjective driving intention is rarely considered in current research, and the actual effectiveness of flight control under different cognitive loads needs to be emphasized.

Related theories
Cognitive load theory

Early measurement methods often relied on action-response detection, pen-and-paper questionnaires, form testing, and intelligence testing, owing to limitations in experimental equipment and environment. The validity of various subjective scales has been verified and revised by many researchers in order to obtain more accurate evaluation results. Because subjective evaluation methods such as paper-and-pencil questionnaires require participants to respond according to their own perception, the results are often subjective.

Because cognitive load cannot be observed and measured directly, objective measurement methods such as eye tracking [8,9] and the Index of Cognitive Activity (ICA) [10] have been applied in most human-computer interaction studies. In addition to objective methods, some scholars in multimedia learning research use subjective tools such as the Paas mental effort scale [11] and the NASA-TLX (National Aeronautics and Space Administration Task Load Index) scale [12] to evaluate cognitive load. The main methods currently used to measure cognitive load in human-computer interaction research are physiological measurement [13], task-performance-based measurement [14], and subjective self-assessment [15].

Cognitive load measurement methods

Type | Indirect mode | Specific method
Subjective measurement | Subjective assessment scale | NASA-TLX, WP scales, etc.
 | Task performance | Dual-task measurement
Objective measurement | Physiological measurement data | Electrocardiogram, eye movement, electroencephalogram, etc.

A literature analysis shows that human-computer interaction studies tend to use objective methods to obtain relatively reliable and valid data, and few studies measure cognitive load with a single subjective method (e.g., Clarke, Schuetzler, and Windle et al.), in order to avoid the influence of participants' personal characteristics on the experimental results [16]. The stimuli experienced by an individual indirectly affect changes in physiological data and reflect the level of psychological processing. The hypothesis that human physiological changes reflect, to some extent, an individual's psychological state underpins physiological methods for measuring cognitive load [17]. Indirect objective measures such as eye tracking [18], functional near-infrared spectroscopy (fNIRS) [19], galvanic skin response, and electroencephalography (EEG) [20] have been used to measure cognitive load in human-computer interaction research.

Heart rate variability (HRV) is an indicator derived from the electrocardiogram signal, referring to the variation between consecutive heartbeat cycles. Physiological functions that are not under conscious control, including heartbeat, respiration, blood pressure fluctuation, and digestion, are regulated by the autonomic nervous system. Because HRV is influenced by many factors such as hormones, sleep deprivation, and diet, there is no single optimal reference interval for it. However, the time- and frequency-domain indicators of heart rate variability provide a non-invasive, quantitative evaluation of the autonomic nervous system, so the electrocardiogram signal is selected as the measurement data for human factors.

Time domain indicators

Name | Unit | Description | Formula
MEAN | ms | Mean RR interval | $MEAN = \frac{1}{N}\sum_{i=1}^{N} RR_i$
SDNN | ms | Standard deviation of normal RR intervals | $SDNN = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(RR_i - \overline{RR}\right)^2}$
rMSSD | ms | Root mean square of successive RR interval differences | $rMSSD = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(RR_{i+1} - RR_i\right)^2}$
pNN50 | % | Proportion of successive RR interval differences greater than 50 ms | $pNN50 = \frac{NN50}{NN} \times 100\%$

Time-domain analysis is the simplest and most intuitive way to study HRV; its principle is the quantitative examination of statistical indicators such as MEAN and SDNN computed from the RR interval sequence. The table above lists the HRV time-domain indicators commonly used in analysis.
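As a concrete illustration, the sketch below computes these four indicators from an RR-interval sequence with NumPy; the function name and the use of the population standard deviation for SDNN are illustrative assumptions, not code from the paper.

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """Compute the HRV time-domain indicators listed above from an
    RR-interval sequence given in milliseconds (hypothetical helper)."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)                          # successive differences RR_{i+1} - RR_i
    mean_rr = rr.mean()                         # MEAN
    sdnn = rr.std()                             # SDNN (population standard deviation)
    rmssd = np.sqrt(np.mean(diff ** 2))         # rMSSD
    pnn50 = np.mean(np.abs(diff) > 50) * 100.0  # pNN50 in percent
    return {"MEAN": mean_rr, "SDNN": sdnn, "rMSSD": rmssd, "pNN50": pnn50}
```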

The area enclosed by the power spectrum curve over each frequency band is numerically equal to the power of the signal in that band. Therefore, the energy characteristics of each frequency band are extracted from the power spectrum to quantitatively analyze the frequency-domain characteristics of HRV, as shown in the figure.

Figure 1.

Power Spectrum

Meaning of each frequency

Name | Abbreviation | Meaning | Frequency range
Very low frequency | VLF | – | < 0.04 Hz
Low frequency | LF | Reflects sympathetic nervous activity | 0.04 ~ 0.15 Hz
High frequency | HF | Reflects parasympathetic nervous activity | 0.15 ~ 0.4 Hz

$LF_{norm} = \frac{LF}{TP - VLF} \times 100\%, \qquad HF_{norm} = \frac{HF}{TP - VLF} \times 100\%$

The HRV frequency-domain indicators on each leg of the five-sided flight pattern are obtained through frequency-domain analysis, including the normalized low-frequency power LFnorm, the normalized high-frequency power HFnorm, and the low-frequency to high-frequency power ratio LF/HF.
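For reference, the sketch below shows one way such band powers could be estimated from an evenly resampled RR tachogram with SciPy's Welch estimator; the resampling rate, the VLF lower edge, and the definition of total power as the sum of the three bands are assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.signal import welch

def hrv_frequency_domain(rr_resampled, fs=4.0):
    """Estimate LFnorm, HFnorm and LF/HF from an RR series that has been
    resampled to an evenly spaced tachogram at fs Hz (assumed preprocessing)."""
    f, psd = welch(rr_resampled, fs=fs, nperseg=min(256, len(rr_resampled)))

    def band_power(lo, hi):
        mask = (f >= lo) & (f < hi)
        return np.trapz(psd[mask], f[mask])     # area under the PSD = band power

    vlf = band_power(0.003, 0.04)
    lf = band_power(0.04, 0.15)
    hf = band_power(0.15, 0.40)
    tp = vlf + lf + hf                          # total power over the analysed bands
    lf_norm = lf / (tp - vlf) * 100.0           # LFnorm as defined above
    hf_norm = hf / (tp - vlf) * 100.0           # HFnorm as defined above
    return lf_norm, hf_norm, lf / hf
```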

Reinforcement Learning Theory

Reinforcement learning consists of three parts: the agent, the reward function, and the environment. As shown in the figure, the initial state of the environment is input to the agent, which selects an appropriate action based on the state; the action is applied to the environment, which returns the reward generated by the action together with the new state. Both are fed back to the agent, which corrects its policy according to the reward and outputs a new action for the new state, repeating the cycle. The goal of reinforcement learning is to learn a policy function π(x), a mapping from the state space x to the action space a. Structurally, reinforcement learning algorithms can be divided into three categories: actor-critic (A-C) structures, value-function-based methods, and policy-based methods.
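A minimal sketch of this interaction loop is given below; RandomAgent, run_episode, and the env interface (reset/step returning state, reward, done) are illustrative placeholders, not the implementation used in this work.

```python
import numpy as np

class RandomAgent:
    """Placeholder agent: samples actions uniformly; a DDPG agent would replace
    `act` with its actor network and `observe` with replay storage and updates."""
    def __init__(self, action_dim, action_bound=0.6):
        self.action_dim = action_dim
        self.action_bound = action_bound

    def act(self, state):
        # policy pi: state x -> action a (here: random exploration only)
        return np.random.uniform(-self.action_bound, self.action_bound, self.action_dim)

    def observe(self, state, action, reward, next_state, done):
        pass  # a learning agent would store the transition and update its policy here

def run_episode(env, agent, max_steps=400):
    """One pass of the agent-environment loop described above."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                     # agent selects action from state
        next_state, reward, done = env.step(action)   # environment returns reward and new state
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```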

Figure 2.

Reinforcement learning structure

The actor and critic represent the policy π and the value function V(s), respectively, and each is approximated by a neural network. The input of the actor is the current state of the aircraft, and its output is the change in speed, pitch angle, and altitude; the input of the critic is the state of the aircraft, including the cognitive load level, and its output is a state value function. After the action is applied, a new state is obtained and the real-time reward is computed from the reward function. The critic is updated iteratively in the direction that minimizes the temporal-difference error, while the actor is updated according to the gradient weighted by the temporal-difference error. In each iteration the critic is updated first and then the actor. Training stops when the cumulative reward reaches the target requirement or the number of training episodes reaches the set limit.

The core of DDPG is to split both the actor and the critic into two networks each: a current network and a target network. After the actor generates an action in the environment, samples (si, ai, si+1, ri) are generated and stored in the experience replay pool, as shown in Figure 3.

Figure 3.

Schematic diagram of DDPG

The function of the critic's current network is to update the parameters θQ and compute the current state-action value Q(si, ai), while the critic's target network computes the target value Q′(si+1, ai+1). The critic's current network is then updated according to the loss function
$Loss = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i\,|\,\theta^Q)\right)^2$

The current critic network periodically copies its weights θQ to the target network. The actor's current network receives the state si and selects the optimal action ai according to its weights θπ; the weights are updated according to the policy gradient
$\nabla_{\theta^\pi} J \approx \frac{1}{N}\sum_i \nabla_{a} Q(s_i, a\,|\,\theta^Q)\big|_{a=\pi(s_i|\theta^\pi)}\,\nabla_{\theta^\pi}\pi(s_i\,|\,\theta^\pi)$

The actor's target network selects the action ai+1 based on the state si+1 from the experience replay pool and the weights θπ′. The current networks periodically copy their weights softly to the target networks:
$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \pi'(s_{i+1}\,|\,\theta^{\pi'})\,\big|\,\theta^{Q'}\big)$
$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}$
$\theta^{\pi'} \leftarrow \tau\,\theta^{\pi} + (1-\tau)\,\theta^{\pi'}$
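The core DDPG computations described above can be summarized in the NumPy-style sketch below; ddpg_targets, critic_loss, and soft_update are hypothetical helper names, and the networks are represented by generic callables rather than the paper's TensorFlow models.

```python
import numpy as np

def ddpg_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1})) for a sampled minibatch."""
    next_actions = target_actor(next_states)          # pi'(s_{i+1} | theta_pi')
    return rewards + gamma * target_critic(next_states, next_actions)

def critic_loss(critic, states, actions, targets):
    """Mean-squared TD error: Loss = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    q_values = critic(states, actions)
    return np.mean((targets - q_values) ** 2)

def soft_update(target_weights, online_weights, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' for each parameter tensor."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_weights, target_weights)]
```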

Design of DDPG model improved based on cognitive load theory
Selection of Quantitative Indicators for Cognitive Load

Because short-term cognitive load must be characterized, the electrocardiogram time- and frequency-domain indicators MEAN, LFnorm, and HFnorm were selected to calculate the pilots' cognitive load:
$U_{rzfh} = c_1\frac{mean}{MEAN} + c_2\frac{lf_{norm}}{LF_{norm}} + c_3\frac{hf_{norm}}{HF_{norm}} + c_4$

Here MEAN, LFnorm, and HFnorm are the reference values of the physiological indicators measured in a sedentary state, while mean, lfnorm, and hfnorm are the values measured in the current window. To solve for the weights and the constant term, let dij be the relative value of the jth electrocardiogram indicator in the ith measurement window, obtained by comparing the measured value with the reference value. The matrix W = (dij)m×4 is the relative-value matrix of the indicators MEAN, LFnorm, and HFnorm (with a final column for the constant term), e = (c1, c2, c3, c4)T is the vector of weights and the constant term, and ω = (Urzfh1, Urzfh2, ..., Urzfhm)T is the cognitive load vector, where the total score of the NASA-TLX scale is used as the training value. The determination of the weights is therefore transformed into finding the optimal solution e0 of the equation W · e = ω such that ||W · e0 − ω|| ≤ ||W · e − ω|| holds for all e. According to the generalized inverse matrix theorem and its existence condition, e = W+ω, where W+ denotes the generalized inverse. Using 20 sets of experimental measurements, each providing equations from 4 measurement windows, and substituting into e = W+ω, the weight vector obtained is e = (73.76, −28.53, 17.96, −116.44)T. Substituting back into ω = W · e gives
$U_{rzfh} = 73.76\frac{mean}{MEAN} - 28.53\frac{lf_{norm}}{LF_{norm}} + 17.96\frac{hf_{norm}}{HF_{norm}} - 116.44$
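A least-squares fit of this kind can be reproduced with NumPy's pseudoinverse-based solver, as sketched below; fit_load_weights and cognitive_load are hypothetical helpers, and the layout of W (three columns of relative indicator values plus a column of ones for the constant term) is an assumption consistent with the description above.

```python
import numpy as np

def fit_load_weights(W, omega):
    """Least-squares solution of W @ e = omega via the pseudoinverse.
    W is the m x 4 matrix of relative ECG indicators (last column all ones
    for the constant term); omega holds the NASA-TLX training scores."""
    e, *_ = np.linalg.lstsq(W, omega, rcond=None)
    return e

def cognitive_load(mean_rel, lf_rel, hf_rel, e):
    """U_rzfh = c1*(mean/MEAN) + c2*(lfnorm/LFnorm) + c3*(hfnorm/HFnorm) + c4."""
    c1, c2, c3, c4 = e
    return c1 * mean_rel + c2 * lf_rel + c3 * hf_rel + c4
```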

Compared with calculating an average cognitive load from the time- and frequency-domain indicators over an entire period, fitting the short-term cognitive load with the above formula is more accurate and more real-time, and the result can then serve as one of the inputs to the reinforcement learning model.

Reinforcement Learning A-C Structure Neural Network Design

The neural network structures of the critic and the actor are shown in the figures. The hidden layer size is 100. The input layer of the critic receives the cognitive load state Urzfh and the aircraft state x = (z, ω, θ, v)T; its hidden part consists of five fully connected layers and three ReLU activation functions, and its output layer outputs the value of the state-action function, with a learning rate of 0.001. The input layer of the actor receives the state of the aircraft; its hidden part consists of four fully connected layers and three ReLU activation functions, and its output layer outputs the deflection angle and acceleration of the controller. The learning rate of the actor is 0.0001, and the gradient thresholds of both the actor and the critic are 1.
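The layer counts above could be realized, for example, with the Keras sketch below; the exact wiring of the action input into the critic, the tanh output scaling to the 0.6 rad saturation limit, and the default input dimensions are assumptions rather than the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_critic(state_dim=5, action_dim=2, hidden=100):
    """Critic sketch: state input (aircraft state plus cognitive load U_rzfh)
    and action input, five fully connected layers, three ReLU activations."""
    s_in = layers.Input(shape=(state_dim,))
    a_in = layers.Input(shape=(action_dim,))
    x = layers.Dense(hidden, activation="relu")(s_in)          # FC 1 (ReLU)
    x = layers.Concatenate()([x, layers.Dense(hidden)(a_in)])  # FC 2 (linear, action branch)
    x = layers.Dense(hidden, activation="relu")(x)             # FC 3 (ReLU)
    x = layers.Dense(hidden, activation="relu")(x)             # FC 4 (ReLU)
    q = layers.Dense(1)(x)                                     # FC 5: state-action value Q(s, a)
    return Model([s_in, a_in], q)

def build_actor(state_dim=4, action_dim=2, hidden=100, action_bound=0.6):
    """Actor sketch: four fully connected layers, three ReLU activations,
    output scaled to the assumed saturation limit of 0.6 rad."""
    s_in = layers.Input(shape=(state_dim,))
    x = layers.Dense(hidden, activation="relu")(s_in)          # FC 1 (ReLU)
    x = layers.Dense(hidden, activation="relu")(x)             # FC 2 (ReLU)
    x = layers.Dense(hidden, activation="relu")(x)             # FC 3 (ReLU)
    a = layers.Dense(action_dim, activation="tanh")(x)         # FC 4: bounded action
    a = layers.Lambda(lambda t: t * action_bound)(a)           # scale to [-0.6, 0.6] rad
    return Model(s_in, a)
```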

Figure 4.

Critic Network Structure

Figure 5.

Actor Network Structure

Reward Function Design

Assume that the observation input of the reinforcement learning agent is 4-dimensional, with the aircraft state x = (z, ω, θ, v)T. The output of the agent is used as the control action for the aircraft; the action signal is saturated before being applied, with a maximum deflection angle not exceeding 0.6 rad. Based on these standards, the training conditions for the aircraft are:
$0 \le z \le 600\ \text{m}, \qquad -1\ \text{rad} \le \theta \le 1\ \text{rad}$

Training is terminated when this range is exceeded. If the range is too small, the exploration space of the actions during training is limited and it takes a long time for the reward function to converge; if the range is too large, it does not conform to actual operation — once the aircraft pitch angle reaches, for example, 80°, it cannot be brought back to a stable state in real operation. Setting the training range therefore screens out useless training samples. The reward function is designed to guide the aircraft toward the expected operating state during training, and its design directly affects the control accuracy and robustness of the final controller. Based on experience and experimentation, the coefficients for the directly tracked state variables z and v are set to 0.3 and 0.5, the coefficient for ω is set to 0.04, and the previous control action δc is penalized with a coefficient of 0.05. The final reward function is:
$R = -0.3\,\Delta z^2 - 0.04\,\Delta\omega^2 - 0.5\,\Delta v^2 - 0.05\,\delta_c$
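A direct transcription of the reward and the training-range check might look as follows; treating the control-action term as a magnitude penalty is an assumption, since the text does not state whether δc enters with its sign, absolute value, or square.

```python
def landing_reward(dz, domega, dv, delta_c):
    """Sketch of the reward above: quadratic penalties on the altitude, pitch-rate
    and speed errors plus a penalty on the previous control action (assumed |delta_c|)."""
    return -0.3 * dz**2 - 0.04 * domega**2 - 0.5 * dv**2 - 0.05 * abs(delta_c)

def within_training_range(z, theta):
    """Termination check corresponding to 0 <= z <= 600 m and |theta| <= 1 rad."""
    return (0.0 <= z <= 600.0) and (-1.0 <= theta <= 1.0)
```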

Data collection and analysis
Collection equipment

The human-factors wireless physiological acquisition platform includes the ErgoLAB v1.0 signal acquisition device, a laptop, a high-definition camera, and other experimental equipment. The equipment is located in a space with dim artificial lighting maintained at a comfortable temperature. The electrocardiogram acquisition device is a wireless optical capacitive pulse sensor with a sampling frequency of 512 Hz; its technical parameters are listed in the table below. The ErgoLAB v1.0 wireless receiver is connected to a laptop and transmits the subject's electrocardiogram signal in real time over a local area network at a transmission frequency of 2.4 GHz.

Technical parameters of signal acquisition equipment

Name | Value range
Resolution (ECG) | ≥ 16 bit
Measurement range | −1500 μV ~ 1500 μV
Adjustable magnification | 1, 2, 3, 4, 5, 6, 7
Accuracy | 0.183 μV, 0.0915 μV, 0.061 μV, 0.046 μV, 0.037 μV, 0.026 μV
Number of wireless sensor channels | ≥ 1
Wireless transmission frequency | 2.4 GHz
Transmission distance | 10 m ~ 100 m
Battery operating time | ≥ 4 h
Data collection process

The flight simulation uses DCS World Steam Edition. Experiments were conducted with 8 skilled flight trainees and apprentices. The experimental setup mainly involves the aircraft model, airport, weather, date, and aircraft location, as shown in the figure. After entering the experimental course software, open the instructor console, select the "Five-sided Flight" experiment in "Create Task", and click the start button. Then enter the scenario settings interface, change the aircraft model to Su-25T, set the takeoff runway to Senaki, select the current experimental date and time, and ensure suitable meteorological conditions.

Figure 6.

Experimental setup map runway

The specific experimental operation process is as follows:

Set the weather in the flight simulator to clear, with a temperature of 20 degrees Celsius and a cloud base of 2500 m. Set the initial position of the aircraft at the runway threshold, aligned with the runway centerline, and have the subjects begin the pre-takeoff checks.

After completing the pre-takeoff checklist, the subjects fully advance the throttle and maintain stable acceleration until the airspeed indicator shows more than 55 knots, then pull back on the stick to lift the aircraft off and climb at a rate of 500 feet per minute.

Turn right 90° onto the second leg at the turning landmark, with a maximum bank angle of 20°, changing heading from 50° to 150°, and maintain a climb speed of 70 knots. Then turn onto the third leg at the second turning point.

The aircraft reaches the altitude of the takeoff and landing pattern with a stable airspeed of around 80 knots; maintain altitude, heading, and airspeed.

Reduce speed in advance toward the end of the third leg and perform the pre-landing checklist: check that the throttle valve is open, check the engine parameter table, the engine temperature, and the remaining fuel on the fuel gauge, check that the mixture lever is in the rich position, and check the effectiveness of the brakes. Then slightly retract the throttle and begin the descent, maintaining a descent rate of 500 feet per minute and an airspeed of around 70 knots.

Turn onto the fifth leg at the fourth turning point, with a maximum bank angle of 30°. Check that there are no obstacles on the runway, control the throttle as needed, close the throttle before touchdown, gently flare the aircraft and wait for touchdown. After touchdown, brake gently to a stop.

Experimental Results and Analysis

Baseline drift and other noise mixed into the electrocardiogram signal are removed with a low-pass filter; a notch filter is then used to eliminate power-frequency interference. A threshold method is applied to extract the R waves from the electrocardiogram waveform, and the HRV time-domain indicators are analyzed quantitatively. The main idea of the threshold method is to exploit the fact that the QRS complex is the most oscillatory segment of the electrocardiogram waveform: by setting different threshold ranges, the starting point of the QRS main wave is obtained, and the position of the R-wave peak is then determined using window and amplitude thresholds.
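A simplified version of this preprocessing chain is sketched below with SciPy; the band-pass cut-offs, notch frequency, threshold factor, and refractory window are assumed values standing in for the filters and thresholds described above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, find_peaks

FS = 512  # ECG sampling frequency of the sensor, Hz

def preprocess_ecg(ecg, fs=FS, powerline=50.0):
    """Remove baseline drift and high-frequency noise with a band-pass filter,
    then suppress power-line interference with a notch filter (assumed cut-offs)."""
    b, a = butter(4, [0.5 / (fs / 2), 40.0 / (fs / 2)], btype="band")
    ecg = filtfilt(b, a, ecg)
    b, a = iirnotch(powerline / (fs / 2), Q=30.0)
    return filtfilt(b, a, ecg)

def detect_r_peaks(ecg, fs=FS):
    """Threshold-style R-wave detection: peaks above an amplitude threshold
    with a 0.3 s refractory window; a simplification of the described method."""
    threshold = 0.6 * np.max(ecg)
    peaks, _ = find_peaks(ecg, height=threshold, distance=int(0.3 * fs))
    return peaks

def rr_intervals_ms(r_peaks, fs=FS):
    """Convert R-peak sample indices to RR intervals in milliseconds."""
    return np.diff(r_peaks) / fs * 1000.0
```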

Quantitative results of cognitive load

The original physiological signal obtained is shown in the figure. The horizontal axis represents the number of samples, in units of 10^4, and the vertical axis represents the amplitude in μV:

Figure 7.

Original electrocardiogram signal map

The image after denoising is shown in the figure:

Figure 8.

Electrocardiogram image after denoising

Using threshold method to extract R-waves from denoised electrocardiogram data, the results of R-wave extraction are shown in the figure:

Figure 9.

ECG image after R-wave extraction

The cognitive load Urzfh obtained through time-domain and frequency-domain analysis is shown as a curve in the figure; its trend roughly fits the subjective measurement, with a value range of (15, 35).

Figure 10.

Cognitive load value

As shown in the figure, during the five-sided flight the quantified cognitive load follows a trend of first rising and then falling. This is because cognitive load is influenced more significantly by psychological factors during the takeoff and landing stages.

Experiment environment and parameters

The server used in the experiment has a 256-core AMD EPYC 9654 CPU at 2.4 GHz, 128 GB of memory, and two A800 graphics cards, each with 80 GB of memory; the operating system is Ubuntu. The framework is TensorFlow on Python 3.6. The number of samples per batch is set to 512, the number of iterations to 1000, the learning rate to 0.01, the delay steps to 2, the experience pool size to 1000, the actor network learning rate to 0.0001, the critic network learning rate to 0.0002, and the exploration rate to 0.9. The closer the aircraft is to the expected state, the greater the reward; the training objective is set to achieve an average reward greater than 200 over five consecutive episodes.
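For convenience, the hyperparameters listed above can be gathered into a single configuration, as in the sketch below; the key names are illustrative only and do not come from the paper's code.

```python
# Training hyperparameters reported above, collected for reference.
ddpg_config = {
    "batch_size": 512,
    "iterations": 1000,
    "replay_buffer_size": 1000,
    "actor_learning_rate": 1e-4,
    "critic_learning_rate": 2e-4,
    "policy_delay_steps": 2,
    "exploration_rate": 0.9,
    "reward_target": 200,          # average reward over five consecutive episodes
    "target_window_episodes": 5,
}
```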

In subsequent testing it was found that, because of the training-range setting, the aircraft sometimes left the training range within 1 second and the episode was terminated; the cumulative reward of such an episode could nevertheless exceed 200 because of the small number of samples, and a controller obtained after five consecutive episodes terminated in this way cannot complete the aircraft stabilization task. The completion condition was therefore changed: the target is met only when each episode reaches 400 samples and the average reward over 5 consecutive episodes exceeds 200. Training in the simulation environment is terminated once the received reward meets these requirements. Obtaining a controller through reinforcement learning is a continuous process of adjustment and improvement with no single optimal result; based on the simulation results of each training run, the reward function and training requirements can be further adjusted so that the aircraft gradually reaches the expected operating state.
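The revised completion condition can be expressed as a simple check, sketched below with hypothetical argument names.

```python
def training_complete(episode_rewards, episode_lengths,
                      min_samples=400, window=5, reward_target=200.0):
    """Stopping rule described above: the last `window` episodes must each contain
    at least `min_samples` steps and their mean reward must exceed `reward_target`."""
    if len(episode_rewards) < window:
        return False
    recent_rewards = episode_rewards[-window:]
    recent_lengths = episode_lengths[-window:]
    return (all(n >= min_samples for n in recent_lengths)
            and sum(recent_rewards) / window > reward_target)
```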

Evaluation of landing experiment results

In Figures 11 to 14, during the initial training stage the aircraft is still exploring and accumulating experience, so the learning effect is not ideal. As the number of training episodes increases, the aircraft's experience becomes richer; after the initial trial-and-error learning, the cumulative return of the algorithm increases rapidly, and the reward value rises and stabilizes over the time steps, quickly reaching convergence. In addition, the return value gradually increases, the exploration variable rises to 0.8, and the model entropy drops below −2, indicating a good training effect.

Figure 11.

Reward value per episode

Figure 12.

Time steps per episode

Figure 13.

Return Value Change Curve

Figure 14.

Exploring Variables and Entropy Change Curve

In Figures 15 and 16, the fluctuations in the velocities and angular velocities in all directions decrease, the acceleration in the z-direction tends to 0, and the angular velocities in all directions also tend to 0, indicating that the aircraft has landed and entered the ground roll phase.

Figure 15.

Speed time curve

Figure 16.

Angular velocity time curve

Figures 17 and 18 show that the mean square error and standard deviation of the aircraft's final position and pitch angle gradually decrease with the number of iterations and tend to 0, indicating that the aircraft's landing is gradually stabilizing.

Figure 17.

Final position mean square deviation and standard deviation

Figure 18.

Final pitch angle mean square deviation and standard deviation

Conclusion

This article uses a linear fitting method to obtain the cognitive load curve from time-domain and frequency-domain analysis of heart rate variability data, yielding cognitive load data consistent with the frequency of the flight data. Based on cognitive load theory, the DDPG algorithm is improved by incorporating human factors into the closed loop of reinforcement learning. Training with the improved DDPG algorithm effectively reduces the number of ineffective explorations in the early stage, takes the influence of physiological changes into account in the human-computer interaction, and achieves good control performance.
