
Edited by: Raul Vicente, Max Planck Institute for Brain Research, Germany

Reviewed by: Chengyi Xia, Tianjin University of Technology, China; Francisco Martinez-Gil, University of Valencia, Spain

This article was submitted to Interdisciplinary Physics, a section of the journal Frontiers in Physics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Complex global behavior patterns can emerge from very simple local interactions between many agents. However, no local interaction rules have been identified that generate some patterns observed in nature, for example the rotating balls, rotating tornadoes and the full-core rotating mills observed in fish collectives. Here we show that locally interacting agents modeled with a minimal cognitive system can produce these collective patterns. We obtained this result by using recent advances in reinforcement learning to systematically solve the inverse modeling problem: given an observed collective behavior, we automatically find a policy generating it. Our agents are modeled as processing the information from neighboring agents with a neural network to choose actions, and they move in an environment of simulated physics. Even though every agent is equipped with its own neural network, all agents have the same network architecture and parameter values, ensuring in this way that a single policy is responsible for the emergence of a given pattern. We find the final policies by tuning the neural network weights until the produced collective behavior approaches the desired one. By using modular neural networks whose modules use a small number of inputs and outputs, we built an interpretable model of collective motion. This enabled us to analyse the policies obtained. We found a similar general structure for the four different collective patterns, not dissimilar to the one we have previously inferred from experimental zebrafish trajectories; but we also found consistent differences between policies generating the different collective patterns, for example repulsion in the vertical direction for the more three-dimensional structures of the sphere and tornado. Our results illustrate how new advances in artificial intelligence, and specifically in reinforcement learning, allow new approaches to the analysis and modeling of collective behavior.

Complex collective phenomena can emerge from simple local interactions of agents who lack the ability to understand or directly control the collective [

If, in one such system, we observe a particular collective configuration, how can we infer the local rules that produced it? Researchers have relied on the heuristic known as the modeling cycle [

Studies in collective behavior might benefit from a more systematic method to find local rules based on known global behavior. Previous work has considered several approaches. Several authors have started with simple parametric rules of local interactions and then tuned the parameters of the interaction rules via evolutionary algorithms based on task-specific cost functions [

These approaches have limitations. Using simple parametric rules based on a few basis functions produces models with limited expressive power. Tabular mappings have limited generalization ability. As an alternative that does not suffer from these problems, neural networks have been used as the function approximator [

Despite these difficulties, very recent work has applied inverse reinforcement learning techniques to find interaction rules in collectives [

Our approach includes the following technical ingredients. We encode the local rule as a sensorimotor transformation, mathematically expressed as a parametric policy, which maps the agent's local state into a probability distribution over an agent's actions. As we are looking for a single policy, all agents have the same parametric policy, with the same parameter values, identically updated to maximize a group level objective function (total reward during a simulated episode) representing the desired collective configuration. A configuration of high reward was searched for directly, without calculating a group-level value function and thus circumventing the problem of an exploding action space. For this search, we use a simple algorithm of the class of Evolution Strategies (ES), which are biologically-inspired algorithms for black-box optimization [

We applied this approach to find local rules for various experimentally observed schooling patterns in fish. Examples include the rotating ball, the rotating tornado [

We placed the problem of obtaining local rules of motion (policies) that generate the desired collective patterns in the reinforcement learning framework [

Framework to obtain an interaction model producing a desired collective behavior.

We model fish as point particles moving through a viscous three-dimensional environment. In this section, we explain how we update the state of each fish agent.

Let us define a global reference frame, with

In this reference frame, we consider a fish agent moving with a certain velocity. We describe this velocity as three numbers: the speed

where δ corresponds to the duration of a time step (see

Environment parameters described in the methods.

Parameter | Value
---|---
α (viscous drag) | 1
Δv_{max} (maximum active speed change) | 6.75 BL s^{−1}
Δϕ_{max} (maximum azimuth change) | … s^{−1}
θ_{max} (maximum elevation angle) | …

The elevation angle, azimuth angle change, and speed change are updated based on three outputs of the policy network, a_{1}, a_{2}, and a_{3}, each bounded between 0 and 1. The three outputs of the policy network are independently sampled at time

The azimuth, ϕ, is updated using the first output of the policy, a_{1}:

with Δϕ_{max} the maximum change in orientation per unit time and δ the time step duration.

The elevation angle, θ, is calculated based on the second output of the policy network, a_{2}, as

where θ_{max} is the maximum elevation angle.

Finally, the speed change is the sum of two components: a linear viscous drag component (with parameter α) and an active propulsive thrust determined by the third output of the policy network, a_{3},

The parameter Δv_{max} is the maximum active change of speed of a fish. This equation for the change in velocity captures the fact that deceleration in fish is achieved through the passive action of viscous forces [
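A minimal sketch of one update step follows. The drag parameter α and Δv_max are taken from the methods table; the exact functional forms of the azimuth and elevation updates, and the values of Δϕ_max, θ_max, and δ, are assumptions for illustration only, since the paper's equations are not reproduced here.

```python
import math

# Environment parameters. ALPHA and DV_MAX follow the methods table;
# the remaining values are placeholder assumptions.
ALPHA = 1.0              # viscous drag coefficient
DV_MAX = 6.75            # maximum active speed change, BL s^-1
DPHI_MAX = math.pi       # maximum azimuth change per second (assumed)
THETA_MAX = math.pi / 4  # maximum elevation angle (assumed)
DT = 0.1                 # time step duration, delta (assumed)

def step_agent(x, y, z, v, phi, a1, a2, a3):
    """Advance one agent by one time step given policy outputs a1, a2, a3 in [0, 1]."""
    # Azimuth: a1 = 0.5 means no turn; 0 and 1 are maximal turns.
    phi = phi + DPHI_MAX * DT * (2.0 * a1 - 1.0)
    # Elevation angle: set directly from a2, bounded by THETA_MAX.
    theta = THETA_MAX * (2.0 * a2 - 1.0)
    # Speed: passive viscous drag plus active thrust from a3.
    v = max(0.0, v + DT * (-ALPHA * v + DV_MAX * a3))
    # Position update from speed and heading.
    x += DT * v * math.cos(theta) * math.cos(phi)
    y += DT * v * math.cos(theta) * math.sin(phi)
    z += DT * v * math.sin(theta)
    return x, y, z, v, phi
```

With a3 = 0 the agent coasts passively and decelerates under drag, matching the passive-deceleration description above.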

At the beginning of each simulation, we initialize the positions and velocities of all fish randomly (see

In our simulations, the final behavior to which the group converges is determined by the reward function. We aim to model four different collective behaviors, all of which have been observed in nature. These behaviors are called the rotating ball [

At each time step, the configuration of agents allows us to compute an instantaneous group-level reward.

The first term is composed of collision avoidance rewards, R_{c}. It provides an additive negative reward for every pair of fish (i, j) at mutual distance d_{i,j}. Specifically, for each neighbor we use a step function that is zero if d_{i,j} > d_{c} and −1 otherwise. This term is meant to discourage the fish from moving too close to one another.

The second term is an attraction reward, R_{a}, which is negative and proportional to the sum of the cubed distances of all fish from the center of mass of the group. This attraction reward motivates the fish to stay as close to the center of mass as possible while avoiding mutual collisions due to the influence of the collision reward. Together with R_{c}, it promotes the emergence of a dense fish ball.

The third term in the instantaneous reward, R_{r}, is added to promote rotation. For each fish i, we calculate its angular rotation around the group's center. The rotation term, R_{r}, is the sum of beta distributions of that angular rotation across all fish.

The fourth and final term, R_{v}, penalizes slow configurations. It is a step function that is 0 if the mean speed is above v_{min} and −1 otherwise. v_{min} is small enough to have a negligible effect on the trained configuration, but large enough to prevent the agents from not moving. As such, this last term encourages the agents to explore the state-action space by preventing them from remaining still.
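Putting the four terms together, the instantaneous group reward for the rotating-ball task can be sketched as follows. The threshold values are placeholders, and the rotation density is passed in as a callable, since the beta-distribution parameters are not reproduced here.

```python
import math

D_C = 0.5    # collision distance threshold d_c, BL (assumed value)
V_MIN = 0.1  # minimal mean speed v_min, BL s^-1 (assumed value)

def group_reward(positions, velocities, angular_rotations, rotation_pdf):
    """Instantaneous group reward for the rotating-ball task (sketch).

    positions, velocities: lists of (x, y, z) tuples.
    angular_rotations: per-fish angular rotation around the group center.
    rotation_pdf: callable scoring a rotation value (the paper uses a
    beta distribution; any density works for this sketch).
    """
    n = len(positions)
    # R_c: -1 for every pair of fish closer than D_C.
    r_c = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= D_C:
                r_c -= 1.0
    # R_a: negative sum of cubed distances to the center of mass.
    cm = [sum(p[k] for p in positions) / n for k in range(3)]
    r_a = -sum(math.dist(p, cm) ** 3 for p in positions)
    # R_r: rotation density evaluated at each fish's angular rotation.
    r_r = sum(rotation_pdf(w) for w in angular_rotations)
    # R_v: -1 if the mean speed falls below V_MIN.
    mean_speed = sum(math.hypot(*v) for v in velocities) / n
    r_v = 0.0 if mean_speed > V_MIN else -1.0
    return r_c + r_a + r_r + r_v
```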

The reward functions designed to encourage the emergence of a rotating tornado and the rotating mills are described in the

Unlike previous work in which each agent is trying to maximize an internal reward function [

We parameterize our policy as a modular neural network with sigmoid activation functions,

Modular structure of the policy network. The pairwise-interaction module receives the focal fish variables, v_{y} and v_{z}, and the social variables, x_{i}, y_{i}, z_{i}, v_{x,i}, v_{y,i}, v_{z,i}, from a single neighbor i. The final output, a_{1}, a_{2}, a_{3}, determines the heading and speed of the focal fish in the next time step.

All the networks have the same weight values, but variability in the individual behaviors is still assured for two reasons. First, we use stochastic policies, which makes sense biologically, because the same animal can react differently to the same stimulus. In addition, a stochastic policy enables a better exploration of the state-action space [

At each time step, the input to the network is information about the agent's surroundings. For each focal fish, at every time step we consider an instantaneous frame of reference centered on the focal fish. For each neighbor i, we use x_{i}, y_{i}, z_{i} (the components of the neighbor position in the new frame of reference) and v_{x,i}, v_{y,i}, v_{z,i} (the components of the neighbor velocity in the new frame of reference). In addition, we also use v_{y} and v_{z} (the components of the focal fish velocity in the new frame of reference). Please note that the frame is centered on the focal fish, but it does not move nor rotate with it, so all velocities are the same as in the global frame of reference.
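Under this convention, building the observation for one neighbor reduces to translating positions, with velocities left untouched. A minimal sketch, where the ordering of the components is an assumption:

```python
def relative_observation(focal_pos, focal_vel, neighbor_pos, neighbor_vel):
    """Inputs to the pairwise module for one neighbor (sketch).

    The frame is centered on the focal fish but, per the text, does not
    move or rotate with it, so velocities equal their global values.
    """
    # Neighbor position relative to the focal fish.
    rel_pos = tuple(n - f for n, f in zip(neighbor_pos, focal_pos))
    # Neighbor velocity is unchanged; only v_y and v_z of the focal
    # fish are used, per the text.
    return rel_pos + tuple(neighbor_vel) + tuple(focal_vel[1:])
```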

The policy network outputs three numbers, a_{1}, a_{2}, and a_{3} (see the next section for details), that are then used to update the agent's azimuth, elevation angle and speed, respectively.

To enable interpretability, we chose a modular structure for the policy neural network. Similar to our previous work [

The first module, the pairwise-interaction module, contains 6 output neurons: the means μ_{1,i} (mean azimuth angle change, anti-symmetrized with respect to the left-right axis), μ_{2,i} (mean elevation change, anti-symmetrized with respect to the horizontal plane) and μ_{3,i} (mean speed change), together with the corresponding spreads.

The previous values,

For each neighbor i, we sample a_{1,i}, a_{2,i} and a_{3,i} independently from the respective distributions,

The second module, the aggregation module, has a single output: a positive weight, w_{i}, for each neighbor i.

The final output combines both modules,

where we combined a_{1,i}, a_{2,i}, and a_{3,i} as components of a vector a_{i}. The final outputs used to update the dynamics of the agent, a_{1}, a_{2}, a_{3}, are the components of the aggregated vector.
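One plausible end-to-end reading of the two modules is sketched below. The internals of `pair_module` and `agg_module`, and the weighted-average combination, are assumptions for illustration, not the paper's exact equations.

```python
import random

def modular_policy(neighbor_feats, own_feats, pair_module, agg_module):
    """Sketch of the modular policy: per-neighbor sampling then aggregation.

    pair_module(feats) -> (mu1, mu2, mu3, s1, s2, s3): means and spreads.
    agg_module(feats) -> positive weight for one neighbor.
    """
    samples, weights = [], []
    for nf in neighbor_feats:
        feats = own_feats + nf
        mu1, mu2, mu3, s1, s2, s3 = pair_module(feats)
        # Sample each per-neighbor action from a clipped Gaussian.
        a = [min(1.0, max(0.0, random.gauss(m, s)))
             for m, s in ((mu1, s1), (mu2, s2), (mu3, s3))]
        samples.append(a)
        weights.append(agg_module(feats))
    total = sum(weights)
    # Weighted average of per-neighbor samples gives the final outputs.
    return [sum(w * s[k] for w, s in zip(weights, samples)) / total
            for k in range(3)]
```

With zero spreads the sampling is deterministic, which makes the aggregation step easy to inspect in isolation.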

Everywhere in this paper, the set of neighbors considered,

Following previous work [

Let us denote by

where λ is the learning rate (see

We estimate the gradient numerically from the rewards of many simulations using policy networks with slightly different parameters. We first sample Gaussian perturbations, ε_{i}, of the policy parameters. Then, we use the rewards obtained with each perturbed policy to estimate the gradient.
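A minimal version of this estimator, using a standard isotropic-Gaussian ES update; the hyperparameter values are placeholders, not those used in the paper:

```python
import random

def es_update(theta, episode_reward, n_samples=8, sigma=0.1, lr=0.01, seed=0):
    """One Evolution Strategies update on parameter vector theta (sketch).

    episode_reward(params) -> scalar total reward of one simulated episode.
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(n_samples):
        # Sample a Gaussian perturbation epsilon of the parameters.
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        r = episode_reward([t + sigma * e for t, e in zip(theta, eps)])
        # Accumulate the reward-weighted perturbation.
        grad = [g + r * e for g, e in zip(grad, eps)]
    grad = [g / (n_samples * sigma) for g in grad]
    # Gradient ascent on the expected episode reward.
    return [t + lr * g for t, g in zip(theta, grad)]
```

Note that only episode rewards are needed, never derivatives of the simulation, which is what makes the search black-box.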

We refer to

Training makes the reward increase and the group behavior converge to the desired configuration. Reward as a function of the number of training epochs in an example training cycle for each of the four configurations. In each of the examples, we show two frames (agents shown as blue dots) from the generated trajectories, one early in the training process (100 epochs) and a second after the reward plateaus (8,000 epochs).

Evolution strategies algorithm.

As in previous work [

We obtained the attraction-repulsion and alignment scores from a centered and scaled version of

where we chose to only explicitly highlight its dependence on the relative neighbor orientation in the XY plane, ϕ_{i}. This relative neighbor orientation can be calculated as the difference between the azimuth angle of the neighbor and the azimuth angle of the focal fish.

The attraction-repulsion score is defined by averaging

We say there is attraction (repulsion) when the score is positive (negative).

The alignment score is defined as

As in [_{i}. Otherwise, it is in an attraction or repulsion area, depending on the sign of the attraction-repulsion score [

To simulate collective swimming, we equipped all fish with an identical neural network. At each time step, the neural network analyzes the surroundings of each fish and produces an action for that fish, dictating change in its speed and turning,

As in previous work [ ], we sample for each neighbor three values, a_{1,i}, a_{2,i}, a_{3,i}, from a clipped Gaussian distribution with the mean and variance given by the outputs of the first part,

An aggregation module outputs a single positive number expressing the importance carried by the signal of each neighbor. The final outputs, a_{1}, a_{2} and a_{3}, determine the motor command. We perform these computations for each agent, and use the outputs to determine the position and speed of each agent in the next time step (Equations 1–3).

We introduced a reward function measuring how similar the produced trajectories are to the desired group behavior (see section 2 for details). We used one of four different reward functions to encourage the emergence of one of four different collective configurations, all of which have been observed in natural groups of fish. These patterns are the rotating ball, the rotating tornado, the rotating hollow-core mill and the rotating full-core mill.

We used evolutionary strategies to gradually improve the performance of the neural network at the task of generating the desired collective configurations. The value of the reward function increased gradually during training for all four patterns,

Here, we use the low dimensionality of each module in terms of inputs and outputs to describe the policy with meaningful plots. We describe here the policy of the rotating ball (

Policy producing a rotating ball, as a function of neighbor relative location. Each output (three from the pair-interaction module, one from the aggregation module) is shown in a different column. All columns have three diagrams, with the neighbor 1 BL above the focal fish (top row), in the same plane (middle row), or 1 BL below (bottom row). For a_{1}, we explain the interaction using the approximate notions of alignment (gray), attraction (orange), and repulsion (purple) areas, as in [. Mean a_{2} parameter: blue areas indicate that the focal fish will move downwards (mean a_{2} < 0.5), while red areas indicate that the focal fish will move upwards (mean a_{2} > 0.5). Mean a_{3} parameter: darker areas (large mean a_{3}) indicate an increase in speed, and lighter areas indicate passive coasting.

The pairwise-interaction module outputs three parameters for each focal fish, all bounded between 0 and 1. The first one, a_{1}, determines the change in azimuth, that is, rotations in the XY plane (

The attraction areas give the neighbor positions in this XY plane which make a focal fish (located at x = y = 0) swim toward the neighbor, independently of the neighbor orientation (

The second parameter, a_{2}, determines the elevation angle. a_{2} is 0.5 on average, and thus the elevation angle is zero on average, when the neighbor is on the same XY plane as the focal fish (

The third parameter, a_{3}, determines the active speed change (

The aggregation module outputs a single positive number, determining the weight of each neighbor in the final aggregation. In the rotating ball policy, the neighbors that are weighted the most in the aggregation are the ones closer than 1 BL to the focal fish (

Note that the aggregation module is not constrained to produce a local spatial integration, since the network has access to every neighboring fish. However, we can observe how an aggregation module like the one shown for the rotating ball (

In the previous section, we described the policy we found to best generate a rotating ball. The policies we found that generate the other three configurations have many similarities and some consistent differences,

Policies producing different configurations. Each column corresponds to one of the four desired configurations. Each panel is the equivalent of the middle row in

The policy generating a tornado has an attraction-repulsion pattern somewhere in between the rotating ball and the full core milling (

The policy generating a full-core mill has an increased repulsion area, particularly in the frontal and frontal-lateral areas (

The highlighted differences between policies are robust (see

In the preceding section, the observations made by each agent were simple variables like the positions or velocities of neighbors. This simplification aided analysis, but animals do not receive external information in this form; they acquire it through their sensory organs.

We checked whether we could achieve the group configuration we have studied when the input to the policy for each agent is the activation of an artificial retina observing the other agents. The retina is modeled using a three-dimensional ray-tracing algorithm: from each agent, several equidistant rays project until they encounter a neighbor, or up to a maximum ray length
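A two-dimensional slice of such a ray-casting retina can be sketched as follows. The ray count, maximum ray length, and body radius are illustrative values, and the paper's retina is three-dimensional rather than planar.

```python
import math

def retina_activation(focal, heading_phi, neighbors, n_rays=16,
                      max_len=10.0, body_radius=0.5):
    """Cast equidistant rays from the focal fish; return, per ray, the
    distance to the closest intersected neighbor (max_len when the ray
    hits nothing). Neighbors are modeled as disks of body_radius."""
    fx, fy = focal
    activations = []
    for k in range(n_rays):
        ang = heading_phi + 2.0 * math.pi * k / n_rays
        dx, dy = math.cos(ang), math.sin(ang)
        closest = max_len
        for nx, ny in neighbors:
            # Project the neighbor center onto the ray direction.
            t = (nx - fx) * dx + (ny - fy) * dy
            if t <= 0.0 or t > max_len:
                continue
            # Perpendicular distance from the neighbor center to the ray.
            px, py = fx + t * dx, fy + t * dy
            if math.hypot(nx - px, ny - py) <= body_radius:
                closest = min(closest, t)
        activations.append(closest)
    return activations
```

The resulting activation vector replaces the hand-picked neighbor variables as input to the policy network.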

We approximated the policy using a single fully-connected network. Using the interaction and attention modules described in section 2.3.2 would not have added interpretability in this case, because the number of inputs is too large. By using the same evolutionary strategy, we were able to obtain a decision rule leading to the desired collective movement configurations (

Although these configurations were qualitatively similar to the ones we obtained with the modular network (

We have applied evolutionary strategies (ES) to automatically find local rules able to generate desired group level movement patterns. Namely, we found local rules that generate four complex collective motion patterns commonly observed in nature [

We used neural networks as approximators of the policy, the function mapping the local state to actions. The naive use of a neural network would produce a black-box model that can then be analyzed with different

We used a modular policy network, composed of two modules. Each module is an artificial neural network with thousands of parameters, and therefore it is a flexible universal function approximator. However, we can still obtain insight, because each module implements a function with a small number of inputs and outputs that we can plot [

To find the local rules generating the desired configurations, we used a systematic version of the collective behavior modeling cycle [

There are theoretical guarantees for convergence in tabular RL, or when linear approximators are used for the value functions [

The method we have proposed could have several other interesting applications. In cases where it is possible to record rich individual-level data sets of collective behavior, it may be possible to perform detailed comparisons between the rules discovered by our method and the ones observed in experiments [

Here we relied on an engineered reward function because the behaviors we were modeling have not yet been recorded in quantitative detail. In cases where trajectory data is available, detailed measures of similarity with observed trajectories can be used as a reward [

The present work may be used as a normative framework when the rewards used represent important biological functions. While prior work using analytic approaches has been successful for simple scenarios [

The datasets were generated using the software from

AL and GP devised the project. TC, AL, FH, and GP developed and verified analytical methods. TC wrote the software, made computations and plotted results with supervision from AL, FH, and GP. All authors discussed the results and contributed to the writing of the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We are grateful to Francisco Romero-Ferrero and Andreas Gerken for discussions.

The Supplementary Material for this article can be found online at:

Simulation of the agents trained to adopt a Rotating ball, with the modular deep networks. Here the number of agents is the same as the number of agents used in the training.

Simulation of the agents trained to adopt a Tornado, with the modular deep networks. Here the number of agents is the same as the number of agents used in the training.

Simulation of the agents trained to adopt a Full core milling, with the modular deep networks. Here the number of agents is the same as the number of agents used in the training.

Simulation of the agents trained to adopt a Hollow core milling, with the modular deep networks. Here the number of agents is the same as the number of agents used in the training.

Simulation of the agents trained to adopt a Rotating ball, with the modular deep networks. Here the number of agents is 70 while the number of agents used in training is 35.

Simulation of the agents trained to adopt a Tornado, with the modular deep networks. Here the number of agents is 70 while the number of agents used in training is 35.

Simulation of the agents trained to adopt a Full core milling, with the modular deep networks. Here the number of agents is 70 while the number of agents used in training is 25.

Simulation of the agents trained to adopt a Hollow core milling, with the modular deep networks. Here the number of agents is 70 while the number of agents used in training is 35.

Simulation of the agents trained to adopt a Rotating ball, when the network received the activation of a simulated retina as input.

Simulation of the agents trained to adopt a tornado, when the network received the activation of a simulated retina as input.

Simulation of the agents trained to adopt a full core milling, when the network received the activation of a simulated retina as input.

Simulation of the agents trained to adopt a hollow core milling, when the network received the activation of a simulated retina as input.