
Edited by: Florian Röhrbein, Technische Universität München, Germany

Reviewed by: Marco Mirolli, Istituto di Scienze e Tecnologie della Cognizione, Italy; Evangelos Theodorou, University of Washington, USA

*Correspondence: Ken Kinjo, Neural Computation Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Okinawa 904-0412, Japan. e-mail:

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.

Linearly solvable Markov decision process (LMDP) is a class of optimal control problems in which Bellman's equation can be converted into a linear equation by an exponential transformation of the state value function (Todorov,

When we want to design an autonomous robot that can act optimally in its environment, the robot must solve non-linear optimization problems in continuous state and action spaces. If a precise model of the environment is available, then both optimal control (Todorov,

However, a drawback is the difficulty of finding an optimal policy for continuous states and actions. Specifically, the non-linear Hamilton-Jacobi-Bellman (HJB) equation must be solved in order to derive an optimal policy. Dynamic programming solves the Bellman equation, which is a discrete-time version of the HJB equation, for problems with discrete states and actions. The Linear Quadratic Regulator (LQR) is one of the best-known optimal control methods for a linear dynamical system with a quadratic cost function. It can handle continuous states and actions, but it is not applicable to non-linear systems.

Recently, a new framework of linearly solvable Markov decision processes (LMDP) has been proposed, in which the non-linear Bellman equation for discrete and continuous systems is converted into a linear equation under certain assumptions on the action cost and on the effect of the action on the state dynamics (Doya,
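For a discrete-state system, the linearization can be sketched concretely: with passive dynamics p(x′|x) and state cost q(x), the exponentiated value function z(x) = exp(−v(x)) satisfies the linear equation z(x) = exp(−q(x)) Σ_{x′} p(x′|x) z(x′). The following sketch (a hypothetical 5-state chain, not one of the paper's tasks) solves this equation by fixed-point iteration and recovers the optimally controlled dynamics p*(x′|x) ∝ p(x′|x) z(x′):

```python
import numpy as np

# Hypothetical 5-state chain: states 0..4, state 4 is the goal (absorbing).
n = 5
P = np.zeros((n, n))            # passive dynamics p(x'|x): unbiased random walk
for s in range(n - 1):
    P[s, max(s - 1, 0)] += 0.5
    P[s, s + 1] += 0.5
P[n - 1, n - 1] = 1.0           # goal is absorbing

q = np.array([1.0, 1.0, 1.0, 1.0, 0.0])   # state cost, zero at the goal

# Linearized Bellman equation: z(x) = exp(-q(x)) * sum_x' p(x'|x) z(x')
z = np.ones(n)
for _ in range(1000):
    z_new = np.exp(-q) * (P @ z)
    z_new[n - 1] = 1.0          # boundary condition at the goal
    if np.max(np.abs(z_new - z)) < 1e-12:
        z = z_new
        break
    z = z_new

v = -np.log(z)                  # value function recovered from desirability

# Optimally controlled dynamics: p*(x'|x) is the passive dynamics
# reweighted by the desirability of the successor state.
Pstar = P * z[None, :]
Pstar /= Pstar.sum(axis=1, keepdims=True)
```

The iteration is a contraction because exp(−q(x)) < 1 away from the goal, and the resulting p* is visibly biased toward states of higher desirability.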

In order to apply the LMDP framework to real robot applications, the environmental dynamics should be estimated through interaction with the environment. This paper proposes a method that integrates model learning with the LMDP framework and investigates how the accuracy of the learned model affects that of the desirability function, the corresponding policy, and the task performance. Although Burdelis and Ikeda proposed a similar approach for systems with discrete states and actions (Burdelis and Ikeda,

First, we show how a non-linear Bellman equation can be made linear under the LMDP setting formulated by Todorov. Consider a system with state x ∈ ℝ^{N_x} and control u ∈ ℝ^{N_u}, where N_x and N_u are the dimensionalities of the state and action spaces, respectively. At time t, the state evolves according to the stochastic dynamics dx = a(x)dt + B(x)(u dt + σdω), where ω ∈ ℝ^{N_u} and σ denote Brownian noise and a scaling parameter for the noise, respectively. For the time-discretized system we write a_k = a(x_k) and B_k = B(x_k).

A control policy or controller π(x) maps a state to an action. We seek the optimal policy π^{*} that can lead the robot to the desired state x_g, where t_g represents an arrival time. The value function V^{π}(x) is the expected cumulative cost under π, where the cost consists of a state cost q(x) and a control cost measured by the KL divergence between the controlled and passive dynamics.^{1}

In the LMDP framework, the system dynamics (Equation 1) are assumed to be known in advance. When they are unknown, the dynamics must be estimated from samples collected under the passive dynamics. Many methods exist for estimating the system dynamics (Nguyen-Tuong and Peters,

Let us suppose that the deterministic part of the state transition is approximated by a weighted sum of N_φ basis functions φ_i(x, u), and that samples {x_1, u_1, …, x_{N_s}, u_{N_s}, x_{N_s+1}} are obtained under the passive dynamics. The objective function of model learning is given by the following sum-of-squares error function,

J(W) = (1/2) Σ_k ‖Δx_k − W φ(x_k, u_k)‖²,    Δx_k = x_{k+1} − x_k.

Setting ∂J/∂W = 0 yields the closed-form least-squares solution for the weight matrix W (Equation 14), computed from the samples (x_k, u_k) and the observed transitions Δx_k.
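As a concrete illustration of this least-squares step, the sketch below fits a linear feature model φ(x, u) = [x; u] to transitions generated by a hypothetical two-dimensional linear system; the matrices A_true and B_true and the noise level are invented for the example, not the paper's plant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth discretized dynamics (assumed for illustration):
#   x_{k+1} = x_k + A x_k + B u_k + noise
A_true = np.array([[0.0, 0.1], [-0.05, 0.0]])
B_true = np.array([[0.0], [0.1]])

# Collect samples under exploratory (passive-like) random actions
X, U, dX = [], [], []
x = np.zeros(2)
for _ in range(2000):
    u = rng.normal(size=1)
    dx = A_true @ x + B_true @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); dX.append(dx)
    x = x + dx

# Linear features phi(x, u) = [x; u]; minimize sum_k ||dx_k - W phi_k||^2
Phi = np.hstack([np.array(X), np.array(U)])        # (N, 3) feature matrix
dX = np.array(dX)                                  # (N, 2) observed transitions
W, *_ = np.linalg.lstsq(Phi, dX, rcond=None)       # least-squares solution
A_hat, B_hat = W[:2].T, W[2:].T                    # recover A, B estimates
```

The same machinery applies unchanged when φ contains non-linear basis functions of x, as in the linear-NRBF model used later.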

The desirability function is approximated by

z(x) ≈ Σ_{i=1}^{N_z} w_i f(x; m_i, S_i) = w^{T} f(x),    (15)

where w_i is a weight, w = [w_1, …, w_{N_z}]^{T}, f(x; m_i, S_i) is a basis function parameterized by m_i and S_i, and f(x) = [f(x; m_1, S_1), …, f(x; m_{N_z}, S_{N_z})]^{T}. We adopt an unnormalized Gaussian function as Todorov suggested (Todorov,

f(x; m_i, S_i) = exp(−(x − m_i)^{T} S_i (x − m_i)/2),

where m_i and S_i denote the center position and the precision matrix of the i-th basis function. We write f_i(x) = f(x; m_i, S_i) for brevity.

The desirability function (Equation 15) should satisfy the linearized Bellman equation (9). Therefore, in order to optimize the weight vector w, the approximation is required to satisfy the linearized Bellman equation at a set of collocation states {x_1, …, x_{N_c}}. This yields a linear system in w whose coefficients are N_c × N_z matrices; their (i, j)-th elements are obtained by evaluating the basis functions, and their expectations under the passive dynamics, at the collocation states.
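The collocation idea can be sketched in one dimension. Below, the passive dynamics are a Gaussian random walk, the goal is a small first-exit region around the origin where z = 1, and the expectation of the basis functions under the passive dynamics is estimated by Monte-Carlo sampling; all numerical values (cost, noise scale, basis centers) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian bases f_i(x) = exp(-s (x - m_i)^2 / 2) on a 1-D state space
m = np.linspace(-2.0, 2.0, 11)           # centers (illustrative)
s = 4.0                                   # shared precision

def features(x):
    x = np.atleast_1d(x)
    return np.exp(-0.5 * s * (x[:, None] - m[None, :]) ** 2)

q = lambda x: 0.5 * x ** 2                # state cost (assumed)
sigma = 0.3                               # passive-dynamics noise scale
goal = 0.25                               # first-exit region |x| < goal, z = 1 there

# Interior collocation states (outside the goal region)
xc = np.concatenate([np.linspace(-2, -goal, 20), np.linspace(goal, 2, 20)])

A = features(xc)                          # f(x_c)
B = np.zeros_like(A)
c = np.zeros(len(xc))
for j, x0 in enumerate(xc):
    xn = x0 + sigma * rng.normal(size=4000)      # samples x' ~ p(.|x_c)
    inside = np.abs(xn) < goal
    # E[z(x')] = P(x' in goal) * 1 + E[f(x')^T w ; x' outside goal]
    c[j] = np.exp(-q(x0)) * inside.mean()
    B[j] = np.exp(-q(x0)) * features(xn[~inside]).sum(axis=0) / len(xn)

# Linearized Bellman at collocations: A w = c + B w  ->  (A - B) w = c
w, *_ = np.linalg.lstsq(A - B, c, rcond=None)
z = lambda x: features(x) @ w
```

The resulting z(x) is largest near the goal and decays with distance, as the value-function interpretation v = −log z requires.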

In the LMDP framework, the optimal control policy is given in closed form by the gradient of the desirability function,

u^{*}(x) = σ² B(x)^{T} ∇z(x) / z(x),

so that the control drives the state in the direction in which the desirability increases most steeply.
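With the Gaussian basis expansion of z(x), the gradient needed for this policy is available in closed form, since ∇f_i(x) = −S_i (x − m_i) f_i(x). The sketch below assumes a two-dimensional state, a constant input gain B, and already-learned weights; all values are illustrative:

```python
import numpy as np

# Gaussian bases f_i(x) = exp(-(x - m_i)^T S (x - m_i)/2), shared precision S
m = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # centers (illustrative)
S = np.diag([4.0, 4.0])                               # precision matrix
w = np.array([1.0, 0.5, 0.5])                         # learned weights (assumed)

def z_and_grad(x):
    """Desirability z(x) = w^T f(x) and its analytic gradient."""
    d = x[None, :] - m                    # (N_z, n): x - m_i for each basis
    f = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, S, d))
    z = w @ f
    grad = -(w * f) @ (d @ S)             # sum_i w_i f_i(x) * (-S (x - m_i))
    return z, grad

sigma2 = 0.1                              # noise scale sigma^2 (assumed)
Bmat = np.array([[0.0], [1.0]])           # input gain B(x), assumed constant

def policy(x):
    """Optimal control u*(x) = sigma^2 B^T grad z(x) / z(x)."""
    z, gz = z_and_grad(x)
    return sigma2 * Bmat.T @ gz / z
```

Because both z and ∇z are analytic in the basis parameters, the policy can be evaluated at control rate without numerical differentiation.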

In this paper, we conduct two experiments to evaluate the LMDP framework with model learning. One is a swing-up pole task in simulation. The other is a visually-guided navigation task using a real robot.

To verify that an appropriate control policy can be derived from estimated dynamics, we conducted a computer simulation of the swing-up pole task. In the simulation, one end of the pole was fixed, and the pole could rotate in a plane around the fixed point, as shown in Figure

The pole dynamics were determined by its physical parameters (mass, length, and moment of inertia in kg·m^{2}) and the noise scaling parameter σ (in rad^{2}/s). The state equation was discretized in time with a fixed time step. The state cost was a quadratic function of the state with a diagonal weight matrix, diag(S_cost) = [0.1, 1.6].

As written in section 2.2, the weight matrix was estimated by Equation (14). In the sample acquisition phase we repeated the simulation a sufficient number of times, with each run starting from a different initial state to avoid unevenly distributed samples. As a result,

In this simulation, we prepared two types of basis functions φ_i(x, u) for model learning:

Linear model | [x^{⊤} u^{⊤}]^{⊤} |

Linear-NRBF model | [x^{⊤} ψ_1(x) ψ_2(x) … ψ_M(x) u^{⊤}]^{⊤} |

Here ψ_i(x) denotes a normalized radial basis function (NRBF). The centers, m_ψi, of the basis functions ψ_i(x) were distributed over the state space, and the covariance matrices Σ_ψi were determined experimentally and set to diag(Σ_ψi) = [π/4, π]. In the linear-NRBF model, N_ψ = 25 basis functions were used.
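The normalized RBF features used in the linear-NRBF model can be sketched as follows; the 5 × 5 grid of centers matches the N_ψ = 25 bases above, while the grid bounds and the diagonal precision values are our own illustrative choices:

```python
import numpy as np

# 5x5 grid of centers over the (angle, angular velocity) space;
# grid bounds are illustrative, not the paper's exact values.
th = np.linspace(-np.pi, np.pi, 5)
om = np.linspace(-8.0, 8.0, 5)
centers = np.array([[a, b] for a in th for b in om])        # M = 25 centers
prec = np.diag([1.0 / (np.pi / 4), 1.0 / np.pi])            # diag precision (assumed)

def nrbf(x):
    """Normalized radial basis functions psi_i(x), summing to 1."""
    d = x[None, :] - centers                                 # (M, 2)
    g = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, prec, d))
    return g / g.sum()                                       # normalization step

def phi(x, u):
    """Linear-NRBF feature vector [x, psi_1(x), ..., psi_M(x), u]."""
    return np.concatenate([x, nrbf(x), np.atleast_1d(u)])
```

The normalization makes the features a partition of unity over the state space, which keeps the learned dynamics bounded between the grid cells.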

The set of collocation states {x_1, …, x_{N_c}}, which were required to optimize the parameters of the desirability function, was distributed uniformly over the state space. The centers m_i of the basis functions f_i(x) were also placed uniformly, and the precision matrices S_i were determined empirically and set to diag([16, 1]). The optimal control policy u^{*}(x) was then computed from the estimated dynamics and the optimized desirability function.

To evaluate the performance of the optimal control policy derived from the estimated dynamics and the desirability function, we conducted a visual navigation task using a wheeled robot called the Spring Dog. Figure

Figure

To realize a visually-guided navigation task, image binarization was applied to each captured image in order to separate the battery pack with the green LED from the background. Several image features were calculated, as shown in the figure: the center position of the extracted pixels (i_cx, i_cy), the average of the absolute values around the center along the horizontal and vertical axes of the extracted pixels (i_ax, i_ay), and the current joint angles of the neck controlled by the visual servoing controller. The state and action were summarized as follows:

The goal state, x_g, was set to comprise both a posture and a location that allowed the Spring Dog to successfully capture the battery. The view from the USB camera allowed recognition of the desired proximity and posture, as shown in Figure

Two types of state-dependent cost functions were prepared: a quadratic cost q_1(x) and a Gaussian-based, non-quadratic cost q_2(x).

Next, we explain the procedure for estimating the visual-motor dynamics. First, the Spring Dog moved around under the fixed stochastic policy and collected data. In the experiment, the control cycle was required to keep

In this experiment, we used two types of basis functions for model learning:

Linear model | [(x − x_g)^{⊤} u^{⊤}]^{⊤} |

Bilinear model | [(x − x_g)^{⊤} u_left(x − x_g)^{⊤} u_right(x − x_g)^{⊤} u^{⊤}]^{⊤} |

As in the swing-up pole task, the collocation states {x_1, …, x_{N_c}} were distributed uniformly in the state space, and the covariance matrices Σ_i were determined by hand. Moreover, only the centers of the desirability basis functions were updated, while the covariance matrices were kept fixed throughout the experiment. The optimal control policy u^{*}(x) was computed from the estimated dynamics and the optimized desirability function. The centers m_i of the basis functions f_i(x), m_init = [m_1, …, m_{N_z}], were chosen from the data set of states by the following procedure:

Input: state data set D_x, threshold τ
Output: initial centers m_init

m_init ← ∅
while D_x ≠ ∅
    x ← pop(D_x)
    D_x ← D_x − {x}
    if max_i f_i(x; m_i) < τ or m_init = ∅
        m_init ← m_init ∪ {x}
return m_init
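A minimal implementation of this selection procedure might look as follows; the shared precision matrix and the threshold value are assumptions, since the paper's exact basis parameters are set per task:

```python
import numpy as np

def select_centers(states, tau, prec):
    """Greedy selection of basis-function centers m_init from state data.

    A state becomes a new center when no existing basis function is
    activated above the threshold tau at that state (a sketch of the
    procedure in the text; `prec` is a shared precision matrix).
    """
    centers = []
    for x in states:
        if not centers:
            centers.append(x)          # first state always becomes a center
            continue
        d = x[None, :] - np.array(centers)
        act = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, prec, d))
        if act.max() < tau:            # poorly covered state -> new center
            centers.append(x)
    return np.array(centers)
```

By construction, any two selected centers activate each other below τ, so the resulting basis set covers the visited states without redundancy.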

As already explained, the goal is to verify that the LMDP framework can be applied to a non-linear state transition system with a non-quadratic cost function, and that the obtained controller performs near-optimally. In the experiment we tested the following four conditions:

Linear model + quadratic state cost.

Bilinear model + quadratic state cost.

Linear model + Gaussian based state cost (non-quadratic).

Bilinear model + Gaussian based state cost (non-quadratic).

Note that LQR is applicable in the first condition. Therefore, LQR was also implemented so that the result of the LMDP framework could be compared with the ground truth obtained from LQR in that condition.

As described in section 2.5.1, we used the linear and the linear-NRBF models to approximate the environmental dynamics of the swing-up pole. To evaluate the accuracy of estimation with these models, we measured the estimation errors. We extracted _{k} denotes the elements of

Figure shows the mean squared errors (MSEs) of the estimated dynamics. The covariance matrix of the observation noise was diag(Σ) = [0, 0.04]. Since the covariance affects the MSE quadratically, the MSE between the true deterministic state transition and an observed state transition should be at least 0.04² = 1.6 × 10^{−3}. The MSE of the angular velocity component in the linear-NRBF model was also 1.6 × 10^{−3}, which suggests that most of the error was caused by noise. Consequently, this result indicates that the environmental dynamics were accurately approximated by the linear-NRBF model. The estimated input gain matrices were given by

^{T}.

The desirability function was optimized using the estimated dynamics, and the control policy was then derived from the obtained desirability function. The top panels of the figure show the obtained desirability function and the optimal control policy u^{*}(x).

To evaluate the performance in more detail, we measured the cumulative costs corresponding to each of the obtained policies. In this test simulation, the initial state was set to the bottom position of the pole with zero angular velocity. Figure

Figure

As described in section 2.5.2, we used the linear and bilinear models to approximate the environmental dynamics. After the data acquisition phase, we obtained

Figure shows that the components i_cx and θ_pan yielded larger MSEs in both models. The reason is that these components changed more significantly than the other components. During the sample acquisition phase, more movement occurred in the rotational direction than in the translational direction. As a result, the variation of i_cx, which was caused by rotational movement, was large, and the variation of θ_pan also became large because the visual servoing kept the battery in the center of the visual field.

Figure shows m_i, the center positions of the basis functions used to approximate the desirability function. Although the peak of the desirability function trained with the proposed method is broader than that of the LQR desirability due to function approximation, the obtained controllers show almost the same tendency.

The optimal control policies u^{*}_left(x) and u^{*}_right(x).

Next, to evaluate the performance of the obtained controllers, we tested the approaching behavior under each controller. In the test, the initial position of the robot was set at a distance of 1.5 m to the left of the target. The initial direction for each episode was selected randomly from a set of three directions: the target placed directly in front of the robot, at a 15° offset to the right of the robot's line of motion, or at a 15° offset to the left, as shown in Figure

Comparing the total cost among the three controllers using the quadratic state cost, as shown in Figure

Trajectories of θ_tilt and the immediate state cost under the quadratic state cost.

On the other hand, the controller using the bilinear model achieved a marginally worse result than the other controllers. One possible reason is that overfitting occurred in the bilinear model.

When comparing performance among all of the obtained controllers, we cannot use the total cost because of the difference in state costs. For this reason we calculated the L-1 norm^{2}

Although it has been reported that the LMDP framework can find an optimal policy faster than conventional reinforcement learning algorithms, the LMDP requires knowledge of the state transition probabilities in advance. In this paper, we demonstrated that the LMDP framework can be successfully used with environmental dynamics estimated by model learning. In addition, our study is the first attempt to apply the LMDP framework to real robot tasks. Our method can be regarded as a model-based reinforcement learning algorithm. Although many model-based methods include model learning (Deisenroth et al.,

In the swing-up pole task, the linear and linear-NRBF models were tested to approximate the pole dynamics. The policy derived from the linear model achieved the task of bringing the pole to the desired position even though that model cannot represent the dynamics correctly. In the visually-guided navigation task, we compared the desirability function and control policy of the LMDP with those of LQR when the environmental dynamics and the cost function were approximated by the linear model and the quadratic function, respectively. In this setting, the optimal state value function and the control policy can be calculated analytically by LQR, and therefore we obtained the optimal desirability function. The obtained desirability function and control policy were not exactly the same as those of LQR. However, we confirmed that the performance of the obtained control policy was comparable to that of LQR. Both models prepared in this experiment failed to approximate part of the state transition, such as the components i_cx and θ_pan. This means that the Spring Dog could not precisely predict the future position of the battery pack when it turned left or right. Nevertheless, the robot could approach the battery pack appropriately. This result suggests that LMDP with model learning is promising even when the estimated model is not very accurate. Fortunately, a control policy that brings the robot to the desired position could be obtained with a simple linear model in both experiments. We plan to evaluate the proposed method on non-linear control tasks such as learning walking and running behaviors.

As discussed in section 3, the quality of the obtained control policy depends on the accuracy of the estimated environmental model. For instance, the bilinear model used in the robot experiment did not improve the approximation accuracy, as shown in Figure. One promising extension is to estimate the state transition probability p^{u_k}(x_{k+1}|x_k) itself. There exist several methods for estimating a probability distribution from samples. For example, Gaussian processes are widely used to estimate environmental dynamics (Deisenroth et al.,

The other extension is to develop a model-free approach to learning desirability functions, in which the environmental dynamics are not estimated explicitly. Z-learning is a typical model-free reinforcement learning method that can learn a desirability function for discrete states and actions, and it was shown that the learning speed of Z-learning was faster than that of Q-learning in grid-world maze problems (Todorov,
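For a discrete grid world, Z-learning can be sketched in a few lines: each observed passive transition (x, x′) updates the desirability estimate as z(x) ← (1 − α) z(x) + α exp(−q(x)) z(x′), with no model of p(x′|x) ever built. The chain below is a toy example; the state space, costs, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D chain of 10 states; goal at the right end (absorbing, zero cost)
n = 10
q = np.ones(n); q[-1] = 0.0

def passive_step(s):
    """Unbiased random-walk passive dynamics with a reflecting left wall."""
    return min(max(s + rng.choice([-1, 1]), 0), n - 1)

z = np.ones(n)
alpha = 0.1
for episode in range(1500):
    s = rng.integers(0, n - 1)            # random non-goal start state
    for _ in range(100):
        s2 = passive_step(s)
        # Z-learning update: z(x) <- (1-a) z(x) + a exp(-q(x)) z(x')
        z[s] = (1 - alpha) * z[s] + alpha * np.exp(-q[s]) * z[s2]
        if s2 == n - 1:                   # goal reached, end episode
            break
        s = s2

v = -np.log(z)     # value estimate; decreases toward the goal
```

Because the update only needs samples from the passive dynamics, it sidesteps the model-learning step entirely, at the cost of requiring many more interaction samples.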

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by Grant-in-Aid for Scientific Research on Innovative Areas: Prediction and Decision Making (24120527).

^{1}The Kullback–Leibler (KL) divergence measures the difference between two distributions. If two distributions are the same, the KL divergence becomes 0. In the LMDP, the control cost is defined by how far a given control shifts the controlled state transition distribution away from the passive dynamics, measured by the KL divergence between the two distributions.

^{2}The L-1 norm of a vector x = (x_1, …, x_n)^{T} is the sum of the absolute values of its coordinates: ‖x‖_1 = ∑_i |x_i|.

When the cost function is non-negative, the value function is also non-negative, so the desirability function must satisfy 0 < z(x) ≤ 1. This can be enforced by constraining the weights to w_i ≥ 0 for all i; the approximated z(x) then remains non-negative and decays toward zero as x moves away from the basis centers m_i and the collocation states x_n (Todorov,