Edited by: Quan Zou, UnitedHealth Group, United States

Reviewed by: Eric Chen, Thomas Jefferson University, United States; Yanan Sun, Booz Allen Hamilton, United States

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

To realize human-like robot intelligence, a large-scale cognitive architecture is required for robots to understand their environment through the variety of sensors with which they are equipped. In this paper, we propose a novel framework named Serket that enables a large-scale generative model to be constructed, and its inference to be performed, easily by connecting sub-modules, allowing robots to acquire various capabilities through interaction with their environment and others. We consider that large-scale cognitive models can be constructed by connecting smaller fundamental models hierarchically while maintaining their programmatic independence. Moreover, the connected modules are dependent on each other, and their parameters must be optimized as a whole. Conventionally, the equations for parameter estimation have to be derived and implemented depending on the model; however, deriving and implementing the equations of large-scale models has become increasingly difficult. Thus, in this paper, we propose a parameter estimation method that communicates a minimal set of parameters between modules while maintaining their programmatic independence. Serket therefore makes it easy to construct large-scale models and estimate their parameters by connecting modules. Experimental results demonstrated that models can be constructed by connecting modules, that their parameters can be optimized as a whole, and that the resulting performance is comparable with that of the original models we have proposed.

To realize human-like robot intelligence, a large-scale cognitive architecture is required for robots to understand their environment through a variety of sensors with which they are equipped. In this paper, we propose a novel framework that enables the construction of a large-scale generative model and its inferences easily by connecting sub-modules in order for robots to acquire various capabilities through interactions with their environment and others. We consider it important for robots to understand the real world by learning from their environment and others, and have proposed a method that enables robots to acquire concepts and language (Nakamura et al., ^{1}

In the field of cognitive science, cognitive architectures (Laird,

One approach to develop a large-scale cognitive model is the use of probabilistic programming languages (PPLs), which make it easy to construct Bayesian models (Patil et al.,

Large-scale cognitive models can be constructed by connecting smaller fundamental models hierarchically; in fact, our proposed models have such a structure. In the proposed novel architecture Serket, large-scale models were constructed by hierarchically connecting smaller-scale Bayesian models (hereafter, each one is referred to as a

In this paper, we propose the Serket framework and implement models that we proposed by leveraging this framework. Experimental results demonstrated that the model can be constructed by connecting modules, the parameters can be optimized as a whole, and they are comparable with original models that we have proposed.

Recently, artificial intelligence has been said to surpass human intelligence in supervised learning, as typified by deep learning, at least on certain specific tasks (He et al.,

The symbol emergence system is based on the genetic epistemology proposed by Piaget (Piaget and Duckworth,

Symbol emergence system.

We have proposed models that enable robots to acquire concepts and language by considering its learning process as a symbol emergence system. The robots form concepts in a bottom-up manner, and acquire word meanings by connecting words and concepts. Simultaneously, words are shared with others, and their meanings are changed through communication with others. Therefore, such words affect concept formation in a top-down manner, and concepts are changed. Thus, we have considered that robots can acquire concepts and word meanings through loops of bottom-up and top-down effects.

There have been many attempts to develop intelligent systems. In the field of cognitive science, cognitive architectures (Laird,

Frameworks of deep neural networks (DNNs) such as TensorFlow (Abadi et al.,

Alternatively, PPLs that make it easy to construct Bayesian models have also been proposed (Patil et al.,

We believe that cognitive models make it possible to predict an output Y when the observations X_{1}, X_{2}, ⋯ are conditionally independent given

Overview of cognitive model by

Generalized hierarchical cognitive model:

Considering the modeling of various sensor data as observations X_{1}, X_{2}, ⋯ , it is not always true that all the observations satisfy the conditional independence assumption. In general, the information surrounding us has a hierarchical structure. Hence, a hierarchical model can be used to avoid this difficulty (Attamimi et al., ), where X_{*, *} are observations and z_{*, *} are latent variables, and the right side of Equation (1) corresponds to the following equation:
where z_{m} and

From the viewpoint of hierarchical models, many studies have proposed models that capture the hierarchical nature of the data (Li and McCallum,

In the past, studies on how the relationships between multimodal information are modeled have been conducted (Roy and Pentland,

There are also studies in which manifold learning was used for modeling a robot's multimodal information (Mangin and Oudeyer,

Recently, DNNs have made notable advances in many areas such as object recognition (He et al.,

To develop a cognitive model where robots learn autonomously, our group proposed several models for concept formation (Nakamura et al., ), where z^{O} and z^{M} denote an object and a motion concept, respectively, and their relationship is represented by a higher-level latent variable connected to z^{O} and z^{M}.

Graphical models for concept formation:

In these Bayesian models, the latent variables shown as the white nodes z, z^{O}, and z^{M} in Figure

In the proposed architecture, the parameters of each module are not learned independently but are learned based on their dependence on each other. To implement such learning, it is important to share latent variables between modules. For example, z^{O} and z^{M} are shared between two MLDAs in the model, respectively, as shown in Figure

Figure shows a generalized module with shared latent variables z_{m−1, *} and observations x_{m, n, *}, which are assumed to be generated from the latent variable z_{m, n} of a higher level. Modules with no shared latent variables or observations are also included in the generalized model. Moreover, the modules can have any internal structure as long as they have shared latent variables, observations, and higher-level latent variables. Based on this module, a larger model can be constructed by connecting the latent variables of module(

In each module with shared latent variables, the probability that latent variables are generated can be computed as

The module can send the following probability by leveraging one of the methods explained in the next section:

The module can determine z_{m, n} by using the following probability sent from module (

Terminal modules have no shared latent variables and only have observations.
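The module abstraction described above might be sketched in code as follows. This is a minimal hypothetical sketch in Python; the class and method names (`Module`, `connect`, `update`) are assumptions for illustration, not the actual Serket API.

```python
# Minimal sketch of the module abstraction described above. All names here
# (Module, connect, update) are hypothetical, not the actual Serket API.
class Module:
    """A model fragment that keeps its parameters and inference code private,
    exchanging only probabilities over shared latent variables."""

    def __init__(self, observations=None):
        self.observations = observations  # this module's own observations
        self.parent = None                # module supplying the higher-level latent
        self.children = []                # modules sharing lower-level latents

    def connect(self, child):
        # Connecting modules only declares which latent variables are shared;
        # each module's internal structure remains untouched.
        self.children.append(child)
        child.parent = self

    def update(self):
        # Re-estimate parameters from observations and received messages;
        # each concrete module implements its own inference here.
        raise NotImplementedError


class TerminalModule(Module):
    """A terminal module: no shared latent variables, only observations."""
```

The point of the design is that `connect` touches only the wiring between modules, so each module's implementation stays programmatically independent.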

In Serket, the modules affect each other, and the shared latent variables are determined through communication between the modules. The methods used to determine the latent variables fall into two types depending on their nature: one covers the case in which they are discrete and finite, and the other the case in which they are continuous or infinite.

In this section, we explain the parameter inference methods used for the composed models. We focus on batch algorithms for parameter inference, which make it easy to implement each module. Therefore, real-time application is beyond the scope of this paper, although we would like to realize it in the future. One of the inference methods used to estimate the parameters of complex models is based on variational Bayesian (VB) approximation (Minka and Lafferty,

In this section, we utilize three approaches according to the nature of the latent variables.

First, we consider the case in which the latent variables are discrete and finite. For example, in the model shown in Figure , z^{O} was generated from a multinomial distribution, which is represented by finite-dimensional parameters. Here, we consider the estimation of the latent variables in the simplified model shown in Figure : in module 2, the latent variable z_{1} was generated from z_{2}; and in module 1, the observation o_{1} was generated from z_{1}. The latent variable z_{1} is shared by modules 1 and 2, and is determined by the effect of these two modules as follows:
where P(o_{1}|z_{1}) and P(z_{1}|z_{2}) can be computed in modules 1 and 2, respectively. We assumed that the latent variable is discrete and finite, and P(z_{1}|z_{2}) is a multinomial distribution that can be represented by a finite-dimensional parameter whose dimension is the number of possible values of z_{1}. Therefore, P(z_{1}|z_{2}) can be sent from module 2 to module 1. Moreover, P(z_{1}|z_{2}) can be learned in module 2 by using P(z_{1}|o_{1}) sent from module 1.

In module 1, P(z_{1}|o_{1}) is computed.

P(z_{1}|o_{1}) is sent to module 2.

In module 2, the probability distribution P(z_{1}|z_{2}), which represents the relationship between z_{1} and z_{2}, is estimated using P(z_{1}|o_{1}).

P(z_{1}|z_{2}) is sent to module 1.

In module 1, the latent variable z_{1} is estimated using Equation (9), and the parameters of P(o_{1}|z_{1}) are updated.

Connecting two modules by

Thus, in the case when the latent variable is finite and discrete, the modules are learned by sending and receiving the parameters of a multinomial distribution over z_{1}. We call this the message passing (MP) approach because the model parameters can be optimized by communicating these messages.
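The MP steps above can be illustrated with a toy numerical example. All probabilities below are made up for illustration and do not come from the paper's models.

```python
# Toy illustration of the MP approach for a discrete, finite z_1 with three
# possible values. All probabilities below are made up for illustration.

def normalize(p):
    s = sum(p)
    return [v / s for v in p]

# Module 1 owns the observation model P(o_1 | z_1) for one observation o_1.
p_o1_given_z1 = [0.7, 0.2, 0.1]

# Module 1 computes P(z_1 | o_1) (here under a uniform prior) and sends this
# finite-dimensional multinomial parameter to module 2 as a message.
msg_to_module2 = normalize(p_o1_given_z1)

# Module 2 would learn P(z_1 | z_2) from such messages; here we simply fix
# it for illustration, and module 2 sends it back to module 1.
p_z1_given_z2 = [0.2, 0.5, 0.3]

# Module 1 fuses both effects: P(z_1 | o_1, z_2) ∝ P(o_1 | z_1) P(z_1 | z_2).
posterior = normalize([a * b for a, b in zip(p_o1_given_z1, p_z1_given_z2)])
```

Note that only the finite-dimensional multinomial parameters cross the module boundary; each module's internal estimation remains private.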

In the previous section, the latent variable was determined by communicating the parameters of multinomial distributions when the latent variables are discrete and finite. Otherwise, it can be difficult to communicate the parameters. For example, the number of parameters becomes infinite if the latent variables can take infinitely many values, and in the case of a complex probability distribution, it is difficult to represent it with a small number of parameters. In such cases, the model parameters are learned by approximation using sampling importance resampling (SIR). We again consider parameter estimation using the simplified model shown in Figure , where z_{1} is shared and its possible values are either infinitely many or continuous. Similar to the previous section, the latent variable is determined if the following equation can be computed:
Because z_{1} is infinite or continuous, module 2 cannot send P(z_{1}|z_{2}) to module 1. Therefore, samples z_{1}^{(l)} drawn from P(z_{1}|o_{1}) in module 1 are sent to module 2, where they are resampled with importance weights proportional to P(z_{1}|z_{2}):
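The SIR step can be sketched numerically as follows. The Gaussian densities are chosen purely for illustration and are not the paper's models.

```python
# Toy sketch of the SIR approach for a continuous z_1. Module 1 proposes
# samples from P(z_1 | o_1); module 2 reweights and resamples them using
# P(z_1 | z_2). The Gaussians below are illustrative only.
import math
import random

random.seed(0)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Module 1: draw samples z_1^{(l)} from P(z_1 | o_1) = N(0, 1).
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Module 2: importance weights proportional to P(z_1^{(l)} | z_2) = N(2, 1).
weights = [gaussian_pdf(z, 2.0, 1.0) for z in samples]

# Resampling with these weights approximates
# P(z_1 | o_1, z_2) ∝ P(z_1 | o_1) P(z_1 | z_2), here a Gaussian with mean 1.
resampled = random.choices(samples, weights=weights, k=2000)
posterior_mean = sum(resampled) / len(resampled)
```

Only the sample set and weights cross the module boundary, so neither module needs a closed-form parameterization of the other's distribution.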

We have presented two methods, but these are not the only ones available for parameter estimation; other methods are also applicable. For example, one applicable method is the Metropolis-Hastings (MH) approach. In the MH approach, samples are generated from a proposal distribution Q(z_{1}^{*}|z_{1}), where z_{1}^{*} is a candidate sample, which is accepted with probability A(z_{1}^{*}, z_{1}):

The model parameters in Figure can also be estimated in this manner by using P(z_{1}|o_{1}) as the proposal distribution and accepting its samples in module 2 according to P(z_{1}|z_{2}), so that the accepted samples follow P(z_{1}|o_{1}, z_{2}) ∝ P(o_{1}|z_{1})P(z_{1}|z_{2}).
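A minimal MH sketch for the same two-module setting follows. The densities are illustrative, and the proposal here is a simple symmetric random walk rather than a module's own distribution, so the acceptance probability reduces to the ratio of target densities.

```python
# Toy Metropolis-Hastings sketch. The target is
# P(z_1 | o_1, z_2) ∝ P(o_1 | z_1) P(z_1 | z_2), modeled here as a product
# of two illustrative Gaussians, N(0, 1) and N(2, 1).
import math
import random

random.seed(0)

def unnorm_target(z):
    return math.exp(-0.5 * z ** 2) * math.exp(-0.5 * (z - 2.0) ** 2)

z = 0.0
chain = []
for _ in range(5000):
    z_star = z + random.gauss(0.0, 0.5)  # candidate from the proposal Q(z* | z)
    accept_prob = min(1.0, unnorm_target(z_star) / unnorm_target(z))  # A(z*, z)
    if random.random() < accept_prob:
        z = z_star
    chain.append(z)

# After burn-in, the chain concentrates around the product's mean (1.0 here).
mh_mean = sum(chain[1000:]) / len(chain[1000:])
```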

Thus, various approaches can be utilized for parameter estimation, and it should be discussed which methods are most suitable. However, we will leave this for a future discussion because of limited space.

First, we show that a more complex model, mMLDA, can be constructed by combining simpler models based on Serket. By using the mMLDA, the object categories, motion categories, and integrated categories representing the relationships between them were formed from the visual, auditory, haptic, and motion information obtained by the robot. The information obtained by the robot is detailed in Appendix

The mMLDA shown in Figure is constructed as follows: object categories z^{O} can be formed from the multimodal information x^{v}, x^{a}, and x^{h} obtained from the objects, and motion categories z^{M} can be formed from joint angles obtained by observing a human's motion. Details of the information are explained in the Appendix. The top-level MLDA takes z^{O} and z^{M} as observations. In this model, the latent variables z^{O} and z^{M} are shared; therefore, the whole model's parameters are optimized in a mutually affecting manner. Figure

Implementation of mMLDA by connecting three MLDAs. The dashed arrows denote the conditional dependencies represented by Serket.

First, in the two MLDAs shown in Figures
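The composition of the three MLDAs described here might look as follows in code. This is a hypothetical sketch: the `MLDA` stub and `connect()` method are illustrative placeholders, not the actual implementation.

```python
# Hypothetical sketch of composing mMLDA from three MLDA modules. The class
# and its methods are placeholders for illustration only.
class MLDA:
    def __init__(self, observations=None):
        self.observations = observations or []
        self.children = []

    def connect(self, child):
        # Declares that the child's category variable (z^O or z^M) is shared
        # with this module as one of its observations.
        self.children.append(child)

object_mlda = MLDA(observations=["visual", "audio", "haptic"])  # forms z^O
motion_mlda = MLDA(observations=["joint_angles"])               # forms z^M
integration = MLDA()                                            # top-level MLDA
integration.connect(object_mlda)
integration.connect(motion_mlda)
```

Because each MLDA keeps its own inference code, swapping or adding a modality changes only the wiring, not the modules themselves.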

Thus, in the integrated concept model, category z^{m} represents all the information of modality m, and z_{−jmn} represents the set of latent variables excluding the latent variable assigned to the information of modality

Figure

Pseudo code of mMLDA.

Figure

Classification results of motion and object by

Furthermore, we conducted an experiment to investigate the efficiency of the original mMLDA, which was not divided into modules. The results in Figure

Table

Computational time of mMLDA.

| | Time |
|---|---|
| Independent model | 1.77 |
| Serket implementation | 21.4 |
| Original model | 64.1 |

In the original mMLDA, the structure of the model was fixed, and we derived the equations to estimate its parameters and then implemented them. However, by using Serket, we can flexibly change the structure of the model without deriving the equations for the parameter estimation. As one example, we changed the structure of mMLDA and constructed a deeper model, as shown in Figure . The parameters θ_{m} and θ_{z} were randomly generated, and we used a uniform distribution as P(z_{5}). This generative process was repeated 50 times, and 250 observations were made. The parameters were estimated by classifying these 250 observations through a Serket implementation and an independent model. Table

mMLDA that has five hierarchies.

Classification accuracies of mMLDA having five hierarchies.

| | z_{1, 1} (%) | z_{2, 1} (%) | z_{3, 1} (%) | z_{4, 1} (%) | z_{5, 1} (%) | Average (%) |
|---|---|---|---|---|---|---|
| Independent model | 70.0 | 66.0 | 74.0 | 76.0 | 66.0 | 70.4 |
| Serket implementation | 100 | 90.0 | 100 | 100 | 100 | 98.0 |

In Nakamura et al. (

Here, we reconsider the mutual learning model based on Serket. The model shown in Figure

Mutual learning model of concepts and language model.

First, using initial parameters, the concepts are formed from x^{v}, x^{a}, and x^{t} by utilizing Gibbs sampling.

In Figure , x_{1}, x_{2}, and x_{3} represent multimodal information obtained by the sensors on the robot, and x_{4}, which is an observation of the speech recognition model, represents the utterances given by the human user. Although the parameter estimation of the original model proposed in Nakamura et al. (

Pseudocode of mutual learning of concept model and language model.

We conducted an experiment where the concepts were formed using the aforementioned model to demonstrate the validity of Serket. We compared the following three methods.

(a) A method where speech recognition results

(b) A method where the concepts and language model are learned by a mutual learning model implemented based on Serket. (Proposed method)

(c) A method where the concepts and language model are learned by a mutual learning model implemented without Serket proposed in (Nakamura et al.,

In method (a), the following equation was used instead of Equation (30), and the parameter

Table

Accuracies of speech recognition, segmentation, and object classification.

| | Speech recognition | Precision | Recall | F-measure | Classification |
|---|---|---|---|---|---|
| (a) w/o mutual learning | 0.64 | 0.50 | 0.68 | 0.58 | 0.80 |
| (b) Serket implementation | 0.74 | 0.91 | 0.59 | 0.72 | 0.94 |
| (c) Original model | 0.77 | 0.95 | 0.59 | 0.73 | 0.94 |

In Table , N_{TP}, N_{FP}, and N_{FN} denote the number of points evaluated as TP, FP, and FN, respectively. Comparing the precision of methods (a) and (b) in Table

Evaluation of segmentation.

Correct segmentation: | A | / | B | C | / | D | |

Estimated segmentation: | A | / | A | / | C | D | |

Evaluation: | TN | TP | TN | FP | TN | FN | TN |
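Assuming the standard definitions that the N_{TP}, N_{FP}, and N_{FN} notation suggests, the three segmentation scores can be computed as follows.

```python
# Precision, recall, and F-measure from segmentation-point counts, using the
# standard definitions (assumed from the N_TP / N_FP / N_FN notation above).
def precision_recall_f(n_tp, n_fp, n_fn):
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# The worked example above contains one TP, one FP, and one FN.
p, r, f = precision_recall_f(1, 1, 1)
```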

Table

Computation time of mutual learning model.

| | Time |
|---|---|
| w/o mutual learning | 135 |
| Serket implementation | 2,640 |
| Original model | 2,637 |

In this paper, we proposed a novel architecture in which a cognitive model can be constructed by connecting modules, each of which maintains programmatic independence. Two approaches were used to connect these modules. One is the MP approach, in which the finite-dimensional parameters of a distribution are communicated between the modules. If the parameters of the distribution are infinite-dimensional or have a complex structure, the SIR approach is used to approximate them. In the experiments, we demonstrated two implementations based on Serket and their efficiency. The experimental results demonstrated that the implementations are comparable with the original models.

However, there is an issue regarding the convergence of the parameters. If a large number of samples can be obtained, the latent variables of each module can converge because the MP, SIR, and MH approaches are based on existing Markov chain Monte Carlo methods. But when various types of models are connected, it is not clear whether all latent parameters converge to a global optimum as a whole. We confirmed that the parameters converged in the models used in the experiments. Nonetheless, this remains a difficult and important issue, which will be examined in future work.

We believe that the models that can be connected by Serket are not limited to generative probabilistic models, although we focused on connecting generative probabilistic models in this paper. Neural networks or other methods can also serve as Serket modules, and we are planning to connect them. Furthermore, we believe that large-scale cognitive models can be constructed by connecting various types of modules, each of which represents a particular brain function. In so doing, we aim to realize our goal of artificial general intelligence. Serket can also contribute to developmental robotics (Asada et al.,

ToN, TaN and TT conceived of the presented idea. ToN developed the theory and performed the computations. ToN wrote the manuscript with support from TaN and TT. TaN and TT supervised the project. All authors discussed the results and contributed to the final manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by JST CREST Grant Number JPMJCR15E3.

The Supplementary Material for this article can be found online at:

^{1}Symbol emergence in robotics focuses on the real and noisy environment, and the