Frontiers in Control Engineering (Front. Control. Eng.), ISSN 2673-6268, Frontiers Media S.A. Article 721475, doi:10.3389/fcteg.2021.721475. Control Engineering, Original Research.
Leader-Following Multi-Agent Coordination Control Accompanied With Hierarchical Q(λ)-Learning for Pursuit
Zhu, Zhe-Yang; Liu, Cheng-Lin*
Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Institute of Automation, Jiangnan University, Wuxi, China
Edited by:Kim-Doang Nguyen, South Dakota State University, United States
Reviewed by:Alexander Von Moll, Air Force Research Laboratory, United States
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
In this paper, we investigate a pursuit problem with multi-pursuer and single evader in a two-dimensional grid space with obstacles. Taking a different approach to previous studies, this paper aims to address a pursuit problem in which only some pursuers can directly access the evader’s position. It also proposes using a hierarchical Q(λ)-learning with improved reward, with simulation results indicating that the proposed method outperforms Q-learning.
Keywords: hierarchical Q(λ)-learning, leader-following, multi-agent systems, reinforcement learning, pursuit problem
Funding: Foundation for Innovative Research Groups of the National Natural Science Foundation of China (10.13039/501100012659)
1 Introduction
Interest in pursuit problems and their applications has increased in recent years, facilitated by technological and computational advances. As a significant branch of the pursuit problem, multi-agent coordination pursuit has received much attention over the past decade for its broad applications in the military (Eklund et al., 2012), aerospace (Ye et al., 2020), autonomous vehicles (Vidal and Sastry, 2002), underwater vehicles (Qi and Cai, 2021), and artificial intelligence (Haynes and Sen, 2006).
In 1965, Isaacs first proposed the pursuit problem in a treatise on differential games (Isaacs, 1965). From a control perspective, the pursuit problem is an optimization problem whose goal is to minimize cost. In recent years, researchers have proposed many solutions to pursuit problems. When the agents' or environment's dynamics are known, pursuit problems have been addressed by finding analytical solutions (Shinar et al., 2009; Yan et al., 2013; Beke and Kumbasar, 2020; Casini et al., 2019; Mejia et al., 2019). However, in practical engineering applications, it is hard to obtain an analytical solution of the pursuit problem. Hence, many learning algorithms, especially reinforcement learning, have been introduced into pursuit problems. Ishiwaka et al. (2003) studied a pursuit problem with four pursuers and one evader and employed reinforcement learning to complete the capture. Bilgin and Kadioglu-Urtis (2015) solved a pursuit problem with two pursuers and one stationary evader by employing Q(λ)-learning and verified the influence of the learning rate and decay rate on the simulation results. A pursuit problem with one evader, whose goal is to reach its target while avoiding capture by the pursuers, was investigated in Selvakumar and Bakolas (2020) by utilizing Min-Max Q-learning and matrix game theory. Noro et al. (2014) proposed signal learning with messages based on reinforcement learning to deal with a multi-agent pursuit problem. Additionally, Yu et al. (2020) presented a fully decentralized multi-agent reinforcement learning approach and applied it successfully to the cooperative multi-robot pursuit problem.
Pursuit problems can be divided into two categories: those with a single evader and those with multiple evaders. This paper focuses on a pursuit problem with a single evader in the presence of obstacles. In a typical pursuit problem, agents are divided into evaders and pursuers. In most of the literature, researchers assume that every pursuer can obtain the position of the evader independently, so effective but expensive detectors are needed in practical applications. In this work, based on leader-following control, we divide the pursuers into leader pursuers and follower pursuers; only the leader pursuers can directly detect the position of the evader. We address the pursuit problem with reinforcement learning under the assumption that only some pursuers know the evader's position. The contributions of this paper are summarized as follows:
• The leader-following control concept is adopted to solve the pursuit problem, since it reduces the cost of detectors.
• We propose a hierarchical Q(λ)-learning with an improved reward, which shows good performance on the pursuit problem.
The remainder of this paper is organized as follows. Section 2 defines the pursuit problem, and Section 3 introduces Q(λ)-learning. The proposed algorithm is described in Section 4, and the simulation experiments and results are presented in Section 5. Finally, Section 6 concludes the study.
2 Our Pursuit Problem Setting
In this article, we mainly focus on a pursuit problem with a single evader. The environment is a bounded two-dimensional grid map with obstacles, which are randomly distributed over the map; each obstacle occupies one cell, and a cell containing an obstacle cannot be occupied by an agent.
Assume there are 2n pursuers on the map, comprising n leader pursuers and n follower pursuers. At each time step, each agent executes an action: it moves in one of the four main directions or remains in its current cell. Agents cannot cross the map boundaries. For example, as illustrated in Figure 1, an agent in cell 5 of the simple map can execute any of the 4 movement actions or remain in its current cell. Each cell can be occupied by at most one agent or one obstacle. When several agents try to move into the same cell at time step t + 1, the actions of these agents are canceled and the agents remain in the cells they occupied at time step t.
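As a concrete illustration, the movement rules above might be encoded as follows. This is a minimal sketch under our own naming (`step_all`, `ACTIONS`); it is not the authors' code, and the conflict rule simply cancels every move that targets a contested cell, as described in the text.

```python
# Four moves or "stay" on a (row, col) grid with row 0 at the top.
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1), "stay": (0, 0)}

def step_all(positions, actions, size, obstacles):
    """Apply one synchronous time step; return the new positions."""
    proposed = []
    for (r, c), a in zip(positions, actions):
        dr, dc = ACTIONS[a]
        nr, nc = r + dr, c + dc
        # Boundary crossings and obstacle cells leave the agent in place.
        if not (0 <= nr < size and 0 <= nc < size) or (nr, nc) in obstacles:
            nr, nc = r, c
        proposed.append((nr, nc))
    # Moves into a contested cell are canceled: those agents stay where they were.
    new_positions = []
    for i, p in enumerate(proposed):
        if proposed.count(p) > 1 and p != positions[i]:
            new_positions.append(positions[i])
        else:
            new_positions.append(p)
    return new_positions
```
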
Possible actions (black arrows) for an agent in cell 5 of a sample map (cell 8 is an obstacle).
This pursuit problem assumes that only leader pursuers can directly detect the position of the evader. Leader pursuers can communicate with each other, and follower pursuers can also communicate with each other. Each leader pursuer sends its real-time position to all follower pursuers, so each follower pursuer knows all leader pursuers’ positions. The communication of a pursuit problem with two leader pursuers and two follower pursuers is illustrated in Figure 2.
Communication of a pursuit problem with two leader pursuers and two follower pursuers.
2.1 Pursuit Groups
In this article, we assume that a leader pursuer can be followed by only one follower pursuer at a time, and a follower pursuer can choose only one leader pursuer as its target to follow. Therefore, as illustrated in Figure 3, the 2n pursuers can be divided into n pursuit groups, each consisting of one leader pursuer and one follower pursuer. Furthermore, in Figure 3, L_i (i ∈ {1, 2, …, n}) is the target that F_i follows.
Pursuit groups.
In nature, small carnivores hunt large prey collectively. They usually encircle the prey first to prevent the prey from escaping. Then, when a large number of companions arrive, they hunt the prey together to ensure success. Similarly, in this paper, to ensure the success of the pursuit, we divided the pursuit into two stages: encirclement and capture.
2.2 Definition of Encirclement
The evader is considered encircled when its northwest and southeast cells are occupied by pursuers simultaneously, as shown in Figure 4A, or when its northeast and southwest cells are occupied simultaneously, as in Figure 4B. In our pursuit problem, at least two pursuers are therefore required to encircle an evader. Once the evader is encircled, it always remains in its current cell.
Definition of encirclement (red circles represent the pursuers and the blue circle represents the evader).
2.3 Definition of Capture
The evader has been successfully captured when it has been encircled and at least two of its north, south, east, and west cells are occupied by pursuers. For example, in Figure 5, the evader in cell 5 has been encircled, and it has been captured because cells 4 and 8 are also occupied by pursuers. Therefore, in our pursuit problem, at least four pursuers are required to capture an evader.
A simple case of capture (the evader in cell 5 has been encircled and cell 1, cell 4, cell 8, and cell 9 are occupied by pursuers).
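The two definitions above can be sketched as simple predicates on (row, column) grid coordinates. The function names are our own, not the paper's:

```python
def is_encircled(evader, pursuers):
    """Encircled: both NW and SE cells, or both NE and SW cells, hold pursuers."""
    r, c = evader
    nw, se = (r - 1, c - 1), (r + 1, c + 1)
    ne, sw = (r - 1, c + 1), (r + 1, c - 1)
    p = set(pursuers)
    return (nw in p and se in p) or (ne in p and sw in p)

def is_captured(evader, pursuers):
    """Captured: encircled, and at least two of the N/S/E/W cells hold pursuers."""
    r, c = evader
    if not is_encircled(evader, pursuers):
        return False
    sides = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return sum(s in set(pursuers) for s in sides) >= 2
```
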
3 Q(λ)-Learning
Q(λ)-learning is an improved Q-learning algorithm. As the foundation of Q(λ)-learning, Q-learning was first proposed by Watkins and Dayan (1992) and is also known as single-step Q-learning. Owing to its simple structure, single-step Q-learning has become a popular topic in reinforcement learning; yet it converges slowly. To accelerate convergence, Peng and Williams (1996) proposed Q(λ)-learning.
To accelerate convergence, the eligibility trace is introduced into Q(λ)-learning. With the eligibility trace, Q(λ)-learning looks further back in time: if the current reward is good, Q(λ)-learning not only updates the current state but also assigns part of the reward to the previous states that drove the agent into the current state (Schwartz, 2014). The eligibility trace tracks when each state was last visited and assigns part of the current reward to recently visited states; a state that has not been visited for a long time is not eligible to receive any of the current reward. This mechanism greatly accelerates convergence.
The eligibility trace of each state s (s ∈ S, where s is the state of the agent and S is the agent's state space) at time step t is defined as e_t(s) and decays by the factor λγ (0 ≤ λ ≤ 1, 0 ≤ γ ≤ 1). For the state that has just been visited, the eligibility trace is increased by 1, so it is updated as (Schwartz, 2014)

e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s), & \text{if } s \neq s_t, \\ \gamma\lambda e_{t-1}(s) + 1, & \text{if } s = s_t. \end{cases} \quad (1)
In the Q(λ)-learning algorithm, the eligibility trace function becomes e(s, a), where a is the action executed by the agent. The eligibility trace of Q(λ)-learning is updated as

e_t(s, a) = \begin{cases} \gamma\lambda e_{t-1}(s, a), & \text{if } (s, a) \neq (s_t, a_t), \\ \gamma\lambda e_{t-1}(s, a) + 1, & \text{if } (s, a) = (s_t, a_t), \end{cases} \quad (2)

and the prediction error is defined as

\delta_t = R_{t+1} + \gamma \max_{a_{t+1} \in A_{s_{t+1}}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \quad (3)

where R_{t+1} is the immediate reward, γ is the discount factor, a_t ∈ A_{s_t}, and A_{s_t} is the action set of the agent at the given state. In conventional Q-learning, the evaluation of each action at a given state is called its Q-value, and the Q-values are stored in a state-action table termed the Q-table. In Eq. 3, when the agent executes action a_t at state s_t, the Q-value of a_t is Q(s_t, a_t). The iterative equation for the Q-value takes the form

Q(s_t, a_t) = Q(s_t, a_t) + \alpha \delta_t e_t(s_t, a_t), \quad (4)

where α (0 ≤ α ≤ 1) is the learning rate. The Q(λ)-learning algorithm is described in Algorithm 1 (Schwartz, 2014).
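One update step of the equations above can be sketched with a dictionary-based Q-table and accumulating eligibility traces. This is an illustrative toy under our own naming, not the authors' implementation:

```python
def qlambda_update(Q, E, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, lam=0.5):
    """Update Q and E in place after observing (s, a, r, s_next); return the TD error."""
    # Prediction error: one-step TD error with the greedy bootstrap (Eq. 3).
    delta = (r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
             - Q.get((s, a), 0.0))
    # Accumulating trace for the visited state-action pair (Eq. 2).
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Credit every eligible pair, then decay its trace (Eqs. 2 and 4).
    for sa, e in list(E.items()):
        Q[sa] = Q.get(sa, 0.0) + alpha * delta * e
        E[sa] = gamma * lam * e
    return delta
```

With α = 0.1, γ = 0.9, and λ = 0.5 (the values used later in Table 1), a single rewarded transition moves the visited Q-value by αδ and leaves a decayed trace of γλ behind it.
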
4 The Proposed Algorithm
4.1 Coordination Multi-Agent Pursuit
In our research, since only some pursuers can access the position of the evader, we assign tasks to the pursuers according to their abilities. The leader pursuers, who can access the position of the evader, are mainly responsible for encircling the evader to prevent it from escaping. The follower pursuers, who do not know the evader's position, are responsible for following the leader pursuers and assisting them in capturing the evader once it is encircled.
4.1.1 Pursuit Groups Creation
In our pursuit problem, at least two leader pursuers are required to encircle an evader, and at least four pursuers are required to capture it. To minimize the total cost of the pursuit, two leader pursuers and two follower pursuers are selected to form two pursuit groups that participate in the pursuit. The specific steps of pursuit group creation are as follows.
Step 1. Initialize the position of the evader, x_e.
Step 2. Initialize the positions of the leader pursuers, x_{l_i} (i ∈ {1, 2, …, n}), and of the follower pursuers, x_{f_j} (j ∈ {1, 2, …, n}).
Step 3. Calculate the Euclidean distance d_{l_i e} between each leader pursuer l_i and the evader, and form the distance matrix

D_{le} = \begin{bmatrix} d_{l_1 e} \\ d_{l_2 e} \\ \vdots \\ d_{l_n e} \end{bmatrix}.

Step 4. According to D_{le}, select the two leader pursuers closest to the evader as L_1 and L_2 to participate in the pursuit.
Step 5. Calculate the Euclidean distance d_{f_j L_k} (k ∈ {1, 2}) between each follower pursuer f_j and each selected leader pursuer L_k, and form the distance matrix

D_{fL} = \begin{bmatrix} d_{f_1 L_1} & d_{f_1 L_2} \\ d_{f_2 L_1} & d_{f_2 L_2} \\ \vdots & \vdots \\ d_{f_n L_1} & d_{f_n L_2} \end{bmatrix}.

Step 6. According to D_{fL}, select the follower pursuer closest to L_1 as F_1 and the follower pursuer closest to L_2 as F_2.
Step 7. Resolve any conflict in the follower-pursuer assignment through negotiation, for example, when F_1 and F_2 would be the same follower pursuer.
Step 8. The selected two leader pursuers and two follower pursuers form the two pursuit groups.
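Steps 1-8 might be sketched as follows. All names here are ours, and Step 7's "negotiation" is approximated by a simple greedy rule that skips followers that are already assigned:

```python
def euclid(p, q):
    """Euclidean distance between two grid positions."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def form_groups(x_e, leaders, followers):
    """Return [(leader_index, follower_index), ...] for the two pursuit groups."""
    # Steps 3-4: the two leaders closest to the evader.
    L = sorted(range(len(leaders)), key=lambda i: euclid(leaders[i], x_e))[:2]
    # Steps 5-7: nearest unassigned follower for each selected leader.
    groups, used = [], set()
    for li in L:
        fj = min((j for j in range(len(followers)) if j not in used),
                 key=lambda j: euclid(followers[j], leaders[li]))
        used.add(fj)
        groups.append((li, fj))
    return groups
```
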
4.1.2 Specific Task Assignment
The selected leader pursuers choose an encirclement pattern with the shortest distance from Figures 4A,B to encircle the evader and assign specific tasks through negotiation.
The follower pursuers need to select a target cell to realize the capture once the evader is encircled. According to the definition of capture, for each follower pursuer we define as a possible target cell any cell that shares a side with the follower pursuer's target leader pursuer and is closest to the other leader pursuer, under the assumption that the evader is encircled. For example, in Figure 6 the evader is encircled; if L_1 is in cell 9, only cells 4, 8, 10, and 14 share a side with L_1. For F_1, cells 8 and 14 are the possible target cells, since they are closest to the other leader pursuer. The follower pursuer selects the nearest possible target cell as its final target cell.
Possible target cells for the follower pursuer (red circles represent the leader pursuers, blue circle represents the evader and possible target cells are printed yellow).
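The target-cell rule can be sketched as follows (all names are ours): keep the side-adjacent cells of the follower's own leader that are closest to the other leader, then pick the one nearest the follower.

```python
def target_cell(follower, my_leader, other_leader):
    """Select the follower's final target cell per the rule described above."""
    r, c = my_leader
    adjacent = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]

    def d(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    # Possible target cells: adjacent cells closest to the other leader.
    best = min(d(cell, other_leader) for cell in adjacent)
    candidates = [cell for cell in adjacent if d(cell, other_leader) == best]
    # Final target cell: the candidate nearest the follower itself.
    return min(candidates, key=lambda cell: d(cell, follower))
```
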
4.2 Hierarchical Reinforcement Learning for Pursuit Problem
Hierarchical reinforcement learning is widely utilized to mitigate the "curse of dimensionality" (Botvinick, 2012). Its core idea is to decompose the whole team task into subtasks at different levels. Moreover, the policy learned by an agent in a subtask can also be utilized by other agents, so hierarchical reinforcement learning significantly accelerates convergence. Option learning (Sutton et al., 1999) is one of the most popular hierarchical learning algorithms.
In option learning, an option is a closed-loop policy for taking actions over a period of time. An option generalizes primitive actions and consists of three elements: a policy π, a termination condition β, and an initiation set ζ. An option ⟨π, β, ζ⟩ is available only if the current state s ∈ ζ. While an option is being executed, actions are chosen according to policy π until the option terminates; when the current option terminates, the agent can select another option (Sutton et al., 1999).
In our research, we apply option learning to leader pursuers. We abstract each leader pursuer’s task into two options: O_{1} and O_{2}. O_{1} and O_{2} are defined as follows.
• In O_{1}, approaching the evader quickly is the aim of the leader pursuer.
• In O_{2}, encircling the evader is the aim of the leader pursuer.
In this paper, there is only one evader in our pursuit problem, so the leader pursuers can share learning experiences and update the policy together during O_1. When the leader pursuers are in O_2, each leader pursuer learns its policy separately. The internal policies π of O_1 and O_2 are unknown, and the internal policy of each option is learned with reinforcement learning. By introducing hierarchical reinforcement learning, we greatly improve the learning efficiency of the leader pursuers.
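Schematically, the two-option hierarchy for a leader pursuer could look like the following sketch. The 5 × 5 nearness test mirrors the region used later in Section 5, and every name here is our own reading of the design, not the paper's code:

```python
def active_option(pursuer, evader):
    """O1: approach the evader; O2: encircle it once inside the 5x5 block around it."""
    near = abs(pursuer[0] - evader[0]) <= 2 and abs(pursuer[1] - evader[1]) <= 2
    return "O2" if near else "O1"

# One Q-table shared by all leaders during O1; one private table per leader for O2.
q_tables = {"O1": {}, "O2_leader1": {}, "O2_leader2": {}}

def table_for(option, leader_id):
    """Return the Q-table a given leader should learn with in a given option."""
    return q_tables["O1"] if option == "O1" else q_tables[f"O2_leader{leader_id}"]
```
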
4.3 Reward Improving
Q(λ)-learning and Q-learning are both typical reward-guided learning algorithms, so the definition of rewards and penalties is crucial. The sparse reward is one of the most popular reward designs owing to its simple structure. Generally, a sparse reward is defined as

R = \begin{cases} r_1, & \text{Situation 1}, \\ r_2, & \text{Situation 2}, \\ -r_3, & \text{Situation 3}, \\ \dots \end{cases} \quad (5)

where r_n ≥ 0 (n ∈ {1, 2, 3, …}) and all rewards are constants. When dealing with simple tasks, the sparse reward performs well. However, when the task is complex, an agent trained with a sparse reward may receive more penalties than rewards, which reduces the learning efficiency and can even prevent the agent from learning. To solve this problem, we alter the reward function by adding an additional reward. The proposed reward is defined as

R = r_c + r_a, \quad (6)

where r_c is the sparse reward,

r_c = \begin{cases} r_{c1}, & \text{Situation 1}, \\ r_{c2}, & \text{Situation 2}, \\ -r_{c3}, & \text{Situation 3}, \\ \dots \end{cases} \quad (7)

with r_{cn} ≥ 0 (n ∈ {1, 2, 3, …}) and all rewards constant. The term r_a in Eq. 6 represents the additional reward, given by

r_a = f(x), \quad (8)

where f(x) is a continuous function of the distance x between agents.
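A minimal sketch of the proposed reward R = r_c + r_a follows. The default shaping function used here, f(x) = −x / d_max, is only an example of ours; the paper requires only that f be continuous in the distance x.

```python
def improved_reward(r_c, x, d_max, f=None):
    """Return R = r_c + f(x): sparse reward plus a continuous distance-dependent term."""
    if f is None:
        # Example shaping (our choice): a mild penalty that grows with distance.
        f = lambda d: -d / d_max
    return r_c + f(x)
```
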
4.4 Complete Algorithm
The complete algorithm is described in Algorithm 2.
5 Simulation Experiments and Results
5.1 Configuration of the Simulation Experiment Environment
In our simulation experiment, we consider a bounded grid map of 15 × 15 cells with obstacles, which are randomly distributed in the map and each obstacle occupies a cell. There are 3 leader pursuers, 3 follower pursuers, and 1 evader on the map. The map is illustrated in Figure 7, where the red circles represent leader pursuers, the black stars represent follower pursuers, the blue circle represents the evader, and obstacles are printed grey.
Our pursuit problem environment.
5.2 Experimental Results and Analysis
5.2.1 Q-Learning and Hierarchical Q(λ)-Learning
In this section, Q-learning, Q(λ)-learning, hierarchical Q-learning, and hierarchical Q(λ)-learning are each utilized to solve our pursuit problem. In our experiment, Q-learning and Q(λ)-learning take the ϵ-greedy strategy as their action-selection strategy. The parameters are set as in Table 1.
Parameters setting.
α = 0.1
γ = 0.9
ϵ = 0.9
λ = 0.5
episode_max = 100,000
step_max = 500
For the leader pursuers, O_1 is defined as entering the 5 × 5 block of cells centered on the evader. For example, in Figure 7, with the evader in its current cell, entering the yellow area is defined as O_1. In this section, the reward is defined as

R = r_m + r_d + r_b, \quad (9)

where

r_m = \begin{cases} -3, & \text{the distance between the leader pursuer (follower pursuer) and the evader (its target leader pursuer) becomes longer}, \\ 0, & \text{otherwise}, \end{cases}

r_d = \begin{cases} 100, & \text{the evader is captured successfully}, \\ 50, & \text{the pursuer completes its own individual task}, \\ 0, & \text{otherwise}, \end{cases}

r_b = \begin{cases} -5, & \text{the pursuer attempts to cross the boundary or collide with an obstacle}, \\ 0, & \text{otherwise}. \end{cases}
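One possible encoding of this sparse reward R = r_m + r_d + r_b, with the event flags supplied by the environment, is the following sketch (ours, not the authors' code):

```python
def sparse_reward(dist_increased, captured, own_task_done, hit_boundary_or_obstacle):
    """R = r_m + r_d + r_b for one pursuer at one time step."""
    r_m = -3 if dist_increased else 0                       # moved away from the target
    r_d = 100 if captured else (50 if own_task_done else 0)  # capture / individual task
    r_b = -5 if hit_boundary_or_obstacle else 0              # illegal move attempt
    return r_m + r_d + r_b
```
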
Every 1,000 episodes, we record the average number of time steps the pursuers take to successfully capture the evader. The simulation results are illustrated in Figure 8, and Table 2 shows the average time steps over 100,000 episodes. By introducing the eligibility trace, the learning result at any time step can immediately influence the policy, which improves learning efficiency. Compared with Q-learning, Q(λ)-learning greatly accelerates convergence, yet it does not significantly improve the converged result. Because hierarchical reinforcement learning greatly reduces the algorithm's state set, the average time steps for the pursuers to capture the evader are significantly reduced. From Table 2, we conclude that, compared with Q-learning, hierarchical Q(λ)-learning saves 56.1% of the average time steps.
Average time steps of four learning algorithms.
Average time steps for 100,000 episodes.
Q-learning: 164.4
Q(λ)-learning: 150.2
Hierarchical Q-learning: 80.1
Hierarchical Q(λ)-learning: 72.2
5.2.2 The Improved Reward
In this section, we verify the effectiveness of the improved reward combined with the hierarchical Q(λ)-learning approach on our pursuit problem. Compared with the above simulation, we only replace r_m with r_a, defined as

r_a = \begin{cases} -3 e^{d_{le}^2 - d_{max}^2}, & \text{if the pursuer is a leader pursuer}, \\ -3 e^{d_{fl}^2 - d_{max}^2}, & \text{if the pursuer is a follower pursuer}, \end{cases} \quad (10)

where d_max is the theoretical maximum distance between two agents on our map, d_{le} (1 ≤ d_{le} ≤ d_{max}) is the distance between the leader pursuer and the evader, and d_{fl} (1 ≤ d_{fl} ≤ d_{max}) is the distance between the follower pursuer and its target leader pursuer.
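One reading of the additional reward above is r_a = −3 e^{d² − d_max²}, with d the relevant distance; under that assumption it is continuous, near 0 when the distance is small, and reaches −3 at d = d_max:

```python
import math

def r_a(d, d_max):
    """Distance-shaped additional reward (our reading of the formula above)."""
    return -3.0 * math.exp(d ** 2 - d_max ** 2)
```
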
We again record the average time steps every 1,000 episodes. Figure 9 shows the simulation results, and Table 3 gives the average time steps over 100,000 episodes. Compared with the sparse reward, the improved reward provides more reward signals for the pursuers during training and improves their learning efficiency. As Table 3 shows, the improved reward saves 17.5% of the average time steps in our pursuit problem.
Average time steps of hierarchical Q(λ)-learning and hierarchical Q(λ)-learning with improved reward.
Average time steps for 100,000 episodes.
Hierarchical Q(λ)-learning: 72.2
Hierarchical Q(λ)-learning with improved reward: 59.6
6 Conclusion
In this paper, we address a pursuit problem in a two-dimensional environment with obstacles. Different from the previous literature, in the present study only some of the pursuers know the evader's position. We combine leader-following control with reinforcement learning to address the pursuit problem and present a hierarchical Q(λ)-learning with an improved reward to accelerate convergence. The proposed method demonstrates better performance than Q-learning on the pursuit problem.
Data Availability Statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author Contributions
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.
Funding
This work was supported by the National Natural Science Foundation of China under Grants 61973139 and 61473138, and the Fundamental Research Funds for the Central Universities under Grant JUSRP22014.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Beke, A., and Kumbasar, T. (2020). Type-2 Fuzzy Logic-Based Linguistic Pursuing Strategy Design and its Deployment to a Real-World Pursuit Evasion Game.
Bilgin, A. T., and Kadioglu-Urtis, E. (2015). "An Approach to Multi-Agent Pursuit Evasion Games Using Reinforcement Learning," in International Conference on Advanced Robotics (ICAR), July 27-31, Turkey, 164-169. doi:10.1109/ICAR.2015.7251450
Botvinick, M. M. (2012). Hierarchical Reinforcement Learning and Decision Making.
Casini, M., Criscuoli, M., and Garulli, A. (2019). A Discrete-Time Pursuit-Evasion Game in Convex Polygonal Environments.
Eklund, J. M., Sprinkle, J., and Sastry, S. S. (2012). Switched and Symmetric Pursuit/Evasion Games Using Online Model Predictive Control with Application to Autonomous Aircraft.
Haynes, T., and Sen, S. (2006).
Isaacs, R. (1965).
Ishiwaka, Y., Sato, T., and Kakazu, Y. (2003). An Approach to the Pursuit Problem on a Heterogeneous Multiagent System Using Reinforcement Learning.
Lopez, V. G., Lewis, F. L., Wan, Y., Sanchez, E. N., and Fan, L. (2020). Solutions for Multiagent Pursuit-Evasion Games on Communication Graphs: Finite-Time Capture and Asymptotic Behaviors.
Noro, K., Tenmoto, H., and Kamiya, A. (2014). Signal Learning with Messages by Reinforcement Learning in Multi-Agent Pursuit Problem.
Peng, J., and Williams, R. J. (1996). Incremental Multi-Step Q-Learning.
Qi, X., and Cai, Z.-J. (2021). Cooperative Pursuit Control for Multiple Underactuated Underwater Vehicles with Time Delay in Three-Dimensional Space.
Schwartz, H. M. (2014).
Selvakumar, J., and Bakolas, E. (2020). Min-Max Q-Learning for Multi-Player Pursuit-Evasion Games. arXiv:2003.03727
Shinar, J., Glizer, V. Y., and Turetsky, V. (2009). A Pursuit-Evasion Game with Hybrid Pursuer Dynamics.
Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning.
Vidal, R., and Sastry, S. (2002). Vision-Based Detection of Autonomous Vehicles for Pursuit-Evasion Games.
Watkins, C. J. C. H., and Dayan, P. (1992). Technical Note: Q-Learning.
Yan, J., Guan, X.-P., Luo, X.-Y., and Chen, C.-L. (2013). A Cooperative Pursuit-Evasion Game in Wireless Sensor and Actor Networks.
Ye, D., Shi, M., and Sun, Z. (2020). Satellite Proximate Pursuit-Evasion Game with Different Thrust Configurations.
Yu, C., Yinzaho, D., Li, Y., and Chen, Y. (2020). Distributed Multi-Agent Deep Reinforcement Learning for Cooperative Multi-Robot Pursuit.