Research on Hierarchical Multi-core Scheduling Algorithm Based on Task Replication



Introduction

Multi-core processor technology integrates two or more processor cores on a single chip to enhance computing performance. Multi-core processors improve system performance by distributing the load across multiple CPU cores and by relying on high-speed on-chip interconnects and high-bandwidth paths to memory and input/output (I/O). Under the same conditions, multi-core processors can deliver greater performance and productivity than single-core processors. Research on scheduling algorithms for multi-core platforms is therefore an important development direction.

Multi-core processor task scheduling refers to allocating multiple tasks to multiple cores for parallel execution through a scheduling algorithm so as to minimize the total completion time. Multi-core task scheduling has long been proved to be an NP-hard problem [1], and it is difficult to find the optimal solution in polynomial time. The most common task scheduling algorithms are heuristic. Heuristic scheduling algorithms mainly include list scheduling algorithms based on the critical path [2,3,4,5], task duplication algorithms [6,7,8], processor allocation algorithms based on task duplication [9], improved multi-core scheduling algorithms based on task duplication [10], clustering algorithms [11,12,13], and so on.

Since the communication overhead between tasks on the same processor can be ignored, scheduling based on task duplication is an effective strategy for reducing communication overhead. The task duplication approach reduces inter-processor communication time by copying predecessor tasks that have a communication relationship onto the same core as their successors, thereby reducing the overall execution time of the system.

When reasonable and effective duplication rules and strategies are used, scheduling algorithms based on task duplication have been shown to achieve better scheduling results than other scheduling algorithms. However, these algorithms do not take load balancing into account. In a DAG, there are no dependencies between nodes in the same layer, so adjusting the scheduling order of same-layer nodes can reduce idle time, increase CPU utilization, and coordinate the load on each CPU to make it more balanced. This paper therefore proposes a hierarchical scheduling algorithm based on task duplication (TDLS) to address the load imbalance of traditional task-duplication-based scheduling algorithms.

Task scheduling model

The task scheduling problem is a kind of combinatorial optimization problem: an abstract task model of a computer application is established, and then, subject to the constraints of that model, a reasonable scheduling strategy generates a scheduling sequence and assigns the tasks to processing cores for execution, so as to minimize the total task execution time and exploit the parallel execution advantages of a multi-core system.

The task scheduling model consists of two parts: the system model and the task model. The system model is a mathematical abstraction of information such as the topology and computing capability of a multi-core system, while the task model is a mathematical abstraction of a computer application program, mainly covering the constraint relationships between tasks and the characteristics of the tasks themselves. The two parts are discussed in detail below.

System model

The system model is an abstraction of the actual computing system. The actual computing system in this article is a multi-core system, that is, a system composed of multiple processing cores, generally expressed as $P = \{p_1, p_2, \cdots, p_i, \cdots, p_n\}$.

Here, $P$ is the set of processing cores in the multi-core system, $p_i$ is the $i$-th processing core, and $n$ is the total number of cores in the system.

Task model

The relationships between multi-core tasks are generally represented by a directed acyclic graph (DAG); when there are dependencies between tasks, a weighted DAG is used (as shown in Figure 1).

Figure 1

DAG diagram

Its mathematical description is $G = \{T, E, t, c\}$.

Here, $T = \{T_1, T_2, \cdots, T_i, \cdots, T_n\}$ is the set of nodes in the graph, where $T_i$ is the $i$-th task.

$E = \{E_{ij}\}$ is the set of directed edges; an edge $E_{ij}$ indicates a communication relationship between task $T_i$ and task $T_j$, and tasks without such an edge cannot communicate directly. $t = \{t_1, t_2, \cdots, t_i, \cdots, t_n\}$ is the set of node weights, where $t_i$ is the execution time of task $i$. Finally, $c = \{c_{ij}\}$ is the set of weights of the directed edges, where $c_{ij}$ is the communication time between task $T_i$ and task $T_j$.

Since communication between tasks on different cores takes much longer than communication between tasks on the same core, the communication time between two tasks on the same processor is ignored, i.e. $c_{ij} = 0$ for two dependent tasks on the same processor.
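As an aside, this task model can be captured with a small graph structure. The sketch below is illustrative only: the class and method names are hypothetical, and the node and edge values mirror the T1/T2 example used later with Figures 2 and 3 rather than the full DAG of Figure 1.

```python
# Minimal sketch of the task model G = {T, E, t, c}; class and method names are
# hypothetical. Node and edge values mirror the later T1/T2 example, not Figure 1.

class TaskGraph:
    def __init__(self):
        self.exec_time = {}   # t: task id -> execution time t_i
        self.comm_time = {}   # c: (i, j) -> communication time c_ij
        self.succ = {}        # E: task id -> list of successor task ids

    def add_task(self, tid, t_i):
        self.exec_time[tid] = t_i
        self.succ.setdefault(tid, [])

    def add_edge(self, i, j, c_ij):
        self.succ[i].append(j)
        self.comm_time[(i, j)] = c_ij

    def comm(self, i, j, core_of):
        # c_ij = 0 when both tasks are mapped to the same core.
        return 0 if core_of[i] == core_of[j] else self.comm_time[(i, j)]

g = TaskGraph()
g.add_task(1, 2)        # T1 with execution time 2
g.add_task(2, 3)        # T2 with execution time 3
g.add_edge(1, 2, 4)     # edge T1 -> T2 with communication time 4
print(g.comm(1, 2, {1: "p1", 2: "p2"}))   # 4: different cores
print(g.comm(1, 2, {1: "p1", 2: "p1"}))   # 0: same core
```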

Definition:

The earliest start time $T^i_{begin}$ of task $i$: the smaller of the execution predecessor time of the task and the maximum associated predecessor time over the task's predecessor set, that is: $T^i_{begin} = \min\{T^i_{exe\_pre}, \max\{T^i_{re\_pre}\}\}$.

The completion time $T^i_{end}$ of task $i$: the time at which the task finishes executing on its processing core, equal to its start time plus its execution time, that is: $T^i_{end} = T^i_{begin} + t_i$.

Associated predecessor $j$ of task $i$: a task that must be completed before task $i$ can be executed, i.e. a task on which the execution of task $i$ depends. For example, in Figure 1 the predecessor tasks of $T_7$ form the associated predecessor set of $T_7$. The associated predecessor time $T^i_{re\_pre}$ of task $i$ is defined as follows: if the associated predecessor $j$ and task $i$ are on the same core, it is the completion time of $j$; if they are on different cores, it is the completion time of $j$ plus the communication time between task $i$ and $j$: $T^i_{re\_pre} = \begin{cases} T^j_{end}, & i, j \text{ on the same core} \\ T^j_{end} + c_{ij}, & i, j \text{ on different cores} \end{cases}$.

Where task j is a predecessor task of task i.

The running result $p^i_{end}$ of a processor: the total time spent by processor $i$ after all tasks assigned to it have been scheduled.

The running result $P_{end}$ of all processors: the final task scheduling result, namely the total time for all tasks to complete.

The successor task $next$ of task $i$: a task that depends on task $i$; task $i$ must have finished executing before it can start.

Execution predecessor $k$ of task $i$: the task $k$ executed immediately before task $i$ on the same core. The execution predecessor time $T^i_{exe\_pre}$ of task $i$ is the completion time of its execution predecessor $k$, that is: $T^i_{exe\_pre} = T^k_{end}$.

The idle time $T^i_{rest}$ of task $i$: the time slice wasted on the core before task $i$ starts executing, that is: $T^i_{rest} = T^i_{begin} - T^i_{exe\_pre}$.

The communication start time $T^i_{begin\_comm}$ between task $i$ and its successor task $next$: defined when task $i$ and its successor $next$ are not on the same processor; it is the time at which the inter-core communication towards the successor begins, after the execution of task $i$ has finished, and the successor $next$ can start only after this communication has completed.
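These definitions translate almost directly into code. The sketch below is a plain transcription of the formulas above under stated assumptions (dictionary-based bookkeeping and a `comm(j, i)` helper for $c_{ij}$); the function names are hypothetical and not the authors' implementation.

```python
# Plain transcription of the timing definitions above (a sketch, not the authors'
# code). end[j] stores T^j_end, core_of maps tasks to (tentative) cores, and
# comm(j, i) returns the communication time c_ij for the edge from j to i.

def related_pred_times(i, assoc_preds, end, core_of, comm):
    """T^i_re_pre for every associated predecessor j of task i."""
    times = []
    for j in assoc_preds:
        if core_of[j] == core_of[i]:
            times.append(end[j])               # same core: no communication
        else:
            times.append(end[j] + comm(j, i))  # different cores: add communication
    return times

def earliest_start(exe_pre, re_pre_times):
    """T^i_begin = min{T^i_exe_pre, max{T^i_re_pre}}, as defined in the text."""
    return min(exe_pre, max(re_pre_times)) if re_pre_times else exe_pre

def completion(begin, t_i):
    """T^i_end = T^i_begin + t_i."""
    return begin + t_i

def idle_time(begin, exe_pre):
    """T^i_rest: gap between the execution predecessor finishing and the task starting."""
    return begin - exe_pre
```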

Basic constraints of task scheduling

For a certain task, when it meets the following two necessary conditions, it can be executed on a specific processing core.

On the one hand, all the predecessor tasks on the same processing core on which it depends have finished executing, and the communication between this task and its predecessor tasks on other processing cores has also completed;

On the other hand, the time interval between the task's start time and end time does not conflict with other tasks on the processing core where the task is located.

For task $i$, its earliest start time must not be less than the completion time of all its predecessor tasks, that is: $T^i_{begin} \ge \max\{T^i_{re\_pre}\}$.

For a task graph, the final scheduling result is determined by the task that finishes executing last, i.e. by the processor with the longest scheduling length among all processors, that is: $P_{end} = \max\{p^i_{end}\}$.

The communication between a task and its successor can only start after the task has finished, because the relevant data can be provided to the successor only once the task's own execution is complete, that is: $T^i_{begin\_comm} \ge T^i_{end}$.
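As a sanity check, these constraints can be expressed as small predicates. The sketch below uses hypothetical helper names and is not part of the original algorithm; it only restates the three formulas above.

```python
# Small predicates restating the three constraints above (illustrative helper
# names; not part of the original algorithm).

def precedence_ok(begin_i, re_pre_times):
    """T^i_begin >= max{T^i_re_pre}: task i cannot start before its predecessors are ready."""
    return begin_i >= max(re_pre_times) if re_pre_times else True

def comm_order_ok(begin_comm_i, end_i):
    """T^i_begin_comm >= T^i_end: communication to a successor starts after task i ends."""
    return begin_comm_i >= end_i

def makespan(core_end_times):
    """P_end = max{p^i_end}: the final result is the longest per-core finish time."""
    return max(core_end_times)
```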

Task duplication scheduling algorithm

The idea of the task duplication scheduling algorithm is to generate copies of specific tasks through duplication. These copies are allocated to processing cores according to a certain strategy; when a successor of the copied task is allocated to the same processing core, the communication cost between the two tasks is eliminated and time is saved.

Task duplication can be divided into single-task duplication and multi-task duplication according to the number of duplicated tasks. Single-task duplication copies onto the processing core only the predecessor task that restricts the start time of the current task; multi-task duplication copies multiple predecessor tasks of the task onto the processing core.

Taking Figure 1 as an example, the successor tasks of task T1 are T2, T3, T4, and T5. When T1 and its successors T2, T3, T4, and T5 execute on different processing cores, inter-core communication is required. To save this inter-core communication time, a task duplication strategy is adopted: T1 is replicated and the copies are allocated to the processing cores where T2, T3, T4, and T5 are located. Through this task duplication, the final task scheduling result is improved.

As shown in Figure 2, if task T1 is not copied and tasks T2 and T1 are on different cores, the earliest start time of T2 equals the time $W_1$ spent waiting for the predecessor T1 to complete plus the inter-processor communication time $c_{1,2}$ between T1 and T2; the earliest start time of T2 is therefore 6, and the scheduling result of T2 is 9. If the task duplication strategy is adopted instead, a copy of T1 is placed on the processor where T2 resides; since the communication time of 4 is eliminated, the start time of T2 becomes 2 and the scheduling result of T2 is reduced to 5 (as shown in Figure 3). The optimization effect is significant and the scheduling efficiency is greatly improved.
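The arithmetic of this example can be replayed directly. The short sketch below only restates the numbers read from the text and Figures 2 and 3 (T1 finishes at time 2, $c_{1,2} = 4$, $t_2 = 3$); the variable names are illustrative.

```python
# Replaying the arithmetic of the T1/T2 example (values read from the text and
# Figures 2-3: T1 finishes at time 2, c_12 = 4, t_2 = 3).
t1_end = 2    # W1: time spent waiting for predecessor T1 to finish
c_12 = 4      # inter-core communication time between T1 and T2
t_2 = 3       # execution time of T2

# Without duplication: T2 waits for T1 and for the inter-core transfer.
start_no_dup = t1_end + c_12          # 6
finish_no_dup = start_no_dup + t_2    # 9

# With a copy of T1 on T2's core, the transfer cost disappears.
start_dup = t1_end                    # 2
finish_dup = start_dup + t_2          # 5

print(finish_no_dup, finish_dup)      # 9 5
```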

Figure 2

Scheduling result of T2 when T1 does not use the task duplication strategy

Figure 3

Scheduling result of T2 when T1 adopts the task duplication strategy

TDLS scheduling algorithm

In a DAG, tasks can be grouped by layer. Tasks in the same layer are independent: there are no dependencies between them, so if no priorities are imposed, there is no required execution order within a layer. The TDLS algorithm relies on this hierarchical structure, in which tasks on the same layer have no dependencies. By adjusting the scheduling order of tasks within a layer, tasks with smaller earliest start times can be scheduled as soon as a time slice becomes available, which reduces idle time and the number of cores required. At the same time, when the communication cost between dependent tasks in different layers is large enough to affect a task's completion time, the predecessor tasks that have the greatest impact on the task are adjusted onto the task's core to reduce the communication time between tasks, thereby increasing CPU utilization.

Steps of hierarchical multi-core scheduling algorithm based on task replication

Step 1: Calculate the in-degree of all tasks in the DAG graph;

Step 2: Find all tasks whose in-degree is 0 and put them into one layer; this yields layer k;

Step 3: Remove the already-layered tasks and their associated edges from the DAG to obtain a new DAG, set k = k + 1, and repeat Steps 1~3 until no tasks remain in the DAG, at which point the layering of the DAG is complete (a sketch of this procedure is given below).
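A minimal sketch of this layering procedure, assuming the DAG is given as a dictionary of successor lists, might look as follows; the example graph at the end only mimics the 4-layer shape of Figure 1 (its edges are assumed for illustration, not taken from the paper).

```python
# Minimal sketch of the layering procedure (Steps 1-3), assuming the DAG is a
# dictionary mapping each task id to its list of successors.

def layer_dag(succ):
    """Repeatedly peel off all tasks whose in-degree is zero; return the layers."""
    indeg = {v: 0 for v in succ}
    for v, outs in succ.items():          # Step 1: compute in-degrees
        for w in outs:
            indeg[w] = indeg.get(w, 0) + 1
    remaining = set(indeg)
    layers = []
    while remaining:
        layer = sorted(v for v in remaining if indeg[v] == 0)   # Step 2
        layers.append(layer)
        for v in layer:                   # Step 3: remove layered tasks and edges
            remaining.discard(v)
            for w in succ.get(v, []):
                indeg[w] -= 1
    return layers

# Hypothetical 9-task graph with the 4-layer shape of Figure 1 (edges assumed).
example = {1: [2, 3, 4, 5], 2: [6], 3: [7], 4: [7], 5: [8],
           6: [9], 7: [9], 8: [9], 9: []}
print(layer_dag(example))   # [[1], [2, 3, 4, 5], [6, 7, 8], [9]]
```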

The Steps of TDLS Algorithm

Step 1: According to the in-degrees of the tasks in the DAG, use the layering algorithm above to layer all tasks;

Step 2: Obtain an initial scheduling sequence from the layering result. Because tasks in the same layer have no mutual dependencies, the tasks of each layer can be scheduled in increasing order of their earliest start time $T^i_{begin}$, which reduces idle time and improves CPU utilization.

Step 3: Schedule each task in turn according to the adjusted task scheduling sequence. Polling tentatively assigns task $T_i$ to each core in turn, and its earliest start time $T^i_{begin}$ on each core is calculated and compared in order to find the core on which the task takes the least time; $T_i$ is then allocated to that core.

If the earliest start time $T^i_{begin}$ is the same on several cores, the core with the smaller idle time $T^i_{rest}$ is given priority. An idle time $T^i_{rest}$ greater than 0 indicates that the communication time between the task and one of its predecessors is relatively large: when the two are not on the same core, the earliest start time $T^i_{begin}$ of the task becomes relatively large. The predecessor that affects the task's start time is therefore tentatively adjusted onto the core where the task is located, and the results are compared: if the task's earliest start time stays the same or becomes smaller, the predecessor is adjusted onto the task's core; otherwise the adjustment is abandoned. The scheduling sequence obtained in this way is the optimized scheduling sequence (the algorithm flow chart is shown in Figure 4, and a simplified code sketch is given below).
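To make Steps 2 and 3 concrete, here is a simplified, hypothetical sketch of the layer-by-layer core polling. It omits the idle-time tie-break and the predecessor-duplication adjustment of Step 3, and it assumes the start time on a core is the later of the core's free time and the data-ready time; none of the names are the authors' own.

```python
# Simplified, hypothetical sketch of Steps 2-3 (layer-by-layer core polling).
# The idle-time tie-break and the predecessor-duplication adjustment are omitted.

def tdls_sketch(layers, exec_time, preds, comm, n_cores):
    """layers: output of the layering step; preds: task -> list of predecessors;
    comm: (pred, task) -> communication time; returns core assignment and end times."""
    core_free = [0] * n_cores          # time at which each core becomes available
    core_of, end = {}, {}

    def est(task, core):
        # Earliest start on `core`: later of the core's free time and the
        # data-ready time (predecessors on other cores add communication cost).
        ready = 0
        for j in preds.get(task, []):
            arrive = end[j] if core_of[j] == core else end[j] + comm[(j, task)]
            ready = max(ready, arrive)
        return max(ready, core_free[core])

    for layer in layers:
        # Step 2: order the tasks of a layer by their smallest achievable start time.
        layer = sorted(layer, key=lambda v: min(est(v, c) for c in range(n_cores)))
        for v in layer:
            # Step 3: poll every core and keep the one giving the earliest start.
            best = min(range(n_cores), key=lambda c: est(v, c))
            start = est(v, best)
            core_of[v], end[v] = best, start + exec_time[v]
            core_free[best] = end[v]
    return core_of, end
```

In the full algorithm, the duplication adjustment of Step 3 would be applied at the point where the best core is chosen, by also testing a copy of the most restrictive predecessor on that core and keeping the copy only if the earliest start time does not increase.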

Figure 4

TDLS algorithm flow chart

Analysis of the time complexity of TDLS algorithm

When selecting an algorithm for a task scheduling problem, it is usually necessary to evaluate its time complexity. Time complexity describes how the execution time required by an algorithm grows as the problem size grows, and is generally written in O notation. O(1) means the time complexity is constant: no matter how much the problem size increases, the time consumed by the algorithm stays the same. O(n^2) means the time complexity is quadratic: when the problem size doubles, the time required to solve the problem roughly quadruples. O(2^n) means the time complexity is exponential in the problem size: as soon as the problem size increases, the time consumed by the algorithm grows exponentially.

Assume the DAG task graph has n nodes. Traversing the task graph to schedule every node requires O(n) work. For each node of each layer, a trial scheduling is performed before the scheduling is confirmed, in order to determine whether it reduces idle time, reduces CPU waste, and improves CPU utilization; during this trial the scheduling sequence of the layer must be adjusted. Assuming a layer contains m nodes, this adjustment requires O(m) work. In the worst case, therefore, the time complexity of the TDLS algorithm does not exceed O(n^2).

Experimental results and analysis

This section uses the TDLS algorithm and two traditional task-replication-based scheduling algorithms (CPTD and TDMC) to schedule the specific DAG graph of Figure 1 in the same environment. The task scheduling distribution diagrams produced by the three algorithms are shown in Figures 5~7. The algorithms are compared in terms of two performance indicators, the number of processors used and the scheduling length, together with the algorithm time complexity; the comparison of the performance evaluation parameters of the three algorithms is shown in Table 1.

Figure 5

TDLS algorithm scheduling diagram

Figure 6

CPTD algorithm scheduling diagram

Figure 7

TDMC algorithm scheduling diagram

Table 1

Comparison of scheduling results of the TDLS, CPTD, and TDMC algorithms

Algorithm   Scheduling length   Number of processors   Algorithm time complexity
TDLS        23                  2                      O(n^2)
CPTD        23                  3                      O(n^2)
TDMC        24                  5                      O(n^2)
Performance evaluation parameters

The performance of the task scheduling algorithm can be evaluated with the following performance evaluation parameters:

The number of processors used to complete all tasks: after all tasks in the task graph have been scheduled, the fewer processors used, the more resources are saved.

Scheduling length: the total time used after all tasks in the task graph have been scheduled; the smaller the scheduling length, the better the resources are used and the better the algorithm.

Task diagram example analysis

For the specific task graph example of Figure 1, the TDLS algorithm, the CPTD algorithm, and the task-duplication-based scheduling algorithm of reference [14] (referred to as the TDMC algorithm) are applied in turn, their scheduling diagrams are drawn, and the performance indicators of the three algorithms are compared.

The DAG shown in Fig. 1 has 4 layers and contains 9 tasks, T1 ~ T9. Each circle in the figure represents a task node, and the labels inside a node give the task and the time required to execute it. The arrowed line segments represent communication dependencies between tasks: the tail of an arrow is the predecessor task node, the head is the successor task node, and the number on the line is the communication time between the two tasks (because the communication time between two tasks on the same CPU is much shorter than that between two tasks on different CPUs, the communication time between two dependent tasks on the same CPU is ignored).

Scheduling the DAG graph with the TDLS, CPTD, and TDMC algorithms yields the task scheduling and distribution diagrams shown in Figures 5~7, and the comparison of the performance evaluation parameters under the three algorithms is shown in Table 1.

As can be seen from Figures 5~7 and Table 1, when the TDMC algorithm is used to schedule the task graph, the total execution time is 24 and 5 processor cores are used. The main reason the CPUs are not fully utilized is that the communication time between nodes T7 and T9 is too long, which causes wasted time between T9 and T8 and increases the total scheduling length. When the CPTD algorithm is used, the total execution time is 23 and 3 processor cores are used. The algorithm first finds the critical path and then uses the early-completion strategy of the preceding node group to merge the scheduling sequences of nodes T7 and T8, so that the execution of T9 can be brought forward and the communication time is better controlled. The scheduling of T6 and T4 is completed through task replication, but the processors holding these two tasks do little besides scheduling T6 and T4, which greatly reduces CPU utilization and leaves the CPU load unbalanced.

The total execution time of the improved TDLS algorithm is 23, and it occupies only 2 processor cores. The gain comes mainly from the scheduling of the third layer: since T7 and T8 are unrelated nodes on the same layer, the scheduling sequence is adjusted so that T7 is moved ahead of T8, which shortens the time wasted by the overly long communication time, improves CPU utilization, and makes the load more balanced than under the previous algorithms.

Compared with the TDMC algorithm, the TDLS algorithm both shortens the total scheduling length of task execution (from 24 to 23) and reduces the number of CPUs required to complete all tasks (from 5 to 2); overall it reduces the waste of resources and improves CPU utilization. Compared with the CPTD algorithm, the TDLS algorithm reduces the number of redundant task copies (for example, fewer redundant copies of tasks T1 and T2) and also reduces the number of CPUs required to complete the tasks (from 3 to 2); it likewise reduces the time slices wasted between tasks, greatly improves the utilization of each CPU, and makes the load more balanced. The comparison and analysis of these two groups of experiments show that the TDLS algorithm has better scheduling performance than the CPTD and TDMC algorithms.

Conclusion

To address the load imbalance that traditional multi-core scheduling algorithms based on task duplication exhibit under certain circumstances, this paper proposes a hierarchical scheduling algorithm based on task replication, TDLS. TDLS exploits the fact that tasks in the same layer have no required scheduling order: by adjusting the scheduling sequence of same-layer tasks it shortens CPU idle time, shortens the execution time of the program, and thus improves the utilization of multiple CPUs. Comparative experiments show that, relative to the other two traditional scheduling algorithms based on task replication, the TDLS algorithm achieves a more balanced load, which is also of some significance for improving the scheduling performance of multi-core parallel computer systems.

The hierarchical task scheduling algorithm based on task replication proposed in this paper overcomes, in a simulated scheduling environment, some inherent shortcomings of traditional task-replication-based algorithms and improves task scheduling efficiency and CPU utilization. However, this study is based entirely on assumptions and a simulated environment, whereas real situations are more complicated and variable: for multi-core systems, load balancing, power consumption, and communication congestion between cores must all be considered under actual conditions. Future research therefore needs to broaden the scope of the study, apply the algorithm to real-time systems, and take the problems of practical settings into account more comprehensively.
