^{1}

^{*}

^{1}

^{*}

^{2}

^{*}

^{3}

^{1}

^{1}

^{1}

^{1}

^{1}

^{1}

^{2}

^{3}

Edited by: Davide Valeriani, Massachusetts Eye and Ear Infirmary, Harvard Medical School, United States

Reviewed by: Wei-Long Zheng, Massachusetts General Hospital and Harvard Medical School, United States; Matthias Treder, Cardiff University, United Kingdom

This article was submitted to Neural Technology, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Robust cross-subject emotion recognition based on multichannel EEG has always been hard work. In this work, we hypothesize that there exist default brain variables across subjects in emotional processes. Hence, the states of the latent variables that relate to emotional processing must contribute to building robust recognition models. Specifically, we propose to utilize an unsupervised deep generative model (e.g., variational autoencoder) to determine the latent factors from the multichannel EEG. Through a sequence modeling method, we examine the emotion recognition performance based on the learnt latent factors. The performance of the proposed methodology is verified on two public datasets (DEAP and SEED) and compared with traditional matrix factorization-based (ICA) and autoencoder-based approaches. Experimental results demonstrate that autoencoder-like neural networks are suitable for unsupervised EEG modeling, and our proposed emotion recognition framework achieves an inspiring performance. As far as we know, it is the first work that introduces variational autoencoder into multichannel EEG decoding for emotion recognition. We think the approach proposed in this work is not only feasible in emotion recognition but also promising in diagnosing depression, Alzheimer's disease, mild cognitive impairment, etc., whose specific latent processes may be altered or aberrant compared with the normal healthy control.

In recent years, affective computing has started to become an active research topic in fields of pattern recognition, signal processing, cognitive neuropsychology, etc. Its main objective is exploring effective computer-aided approaches in recognizing a person's emotions automatically by utilizing explicit or implicit body information, e.g., through facial expressions or voices. It has wide application prospects within the field of human-computer interaction (e.g., intelligent assistants and computer games) (O'Regan,

Considering that facial or vocal muscle activity can be deliberately controlled or suppressed, researchers are currently starting to explore this question through implicit neural activities, particularly through the multichannel EEG (electroencephalograph). The neural oscillations revealed by the EEG are highly correlated with various dynamic cognitive processes (Ward,

Nevertheless, there exist some major problems with regards to multichannel EEG-based emotion recognition that need to deal with, such as the poor generalization of data across subjects and the limitations in designing and extracting handcrafted emotion-related EEG features. Further, in medical data-mining tasks, acquiring enough manually labeled data for training supervised models remains a problem. How to fully utilize the limited data to enhance the model performance is worthy of exploration. Hence, unsupervised and handcrafted featured non-dependent modeling methods are worth in-depth exploration.

In this work, we have utilized the findings of prior related works (Adolphs et al.,

Emotional processes are higher-order cognitive processes that are produced by the collaborative involvement of various latent brain factors, including different brain areas and physical or functional brain networks. The status information of the latent factors contains emotion-related information that contribute to estimating the emotional status. Hence, how to effectively and precisely infer the latent factors is the core issue that we have been concerned with in this work. As the EEG is the external manifestation of the latent brain factors' activities, the recorded EEGs of different scalp locations having internal associations, it has provided us with a way to infer the latent factors from the external multichannel EEGs.

In this work, we have studied and compared three kinds of autoencoder-form neural network models, including the traditional autoencoder (AE), the variational autoencoder (VAE), and the restricted Boltzmann machine (RBM) to determine the latent factors from the multichannel EEG data. Furthermore, for estimating the emotional status, after training, the state sequences of the latent factors were modeled by contextual learning models (e.g., the LSTM unit-based recurrent neural network); at the same time, the emotional status can be estimated based on the contextual information. The entire method framework is illustrated in the flow chart as in

The decoded EEG factors and recurrent neural network-based emotion recognition approach framework.

Latent factor decoding from a brain activity signal is a key tool for studying cognitive task performance and impairment (Calhoun and Adali,

Traditionally, in order to model latent factors, we first need to determine the independent components (ICs) from the multichannel brain signals by solving a blind source separation (BSS)/single matrix factorization (SMF) problem, among which the independent component analysis (ICA) is the most commonly used method (Chen et al., ^{n×t}, where the ^{m×t} from the external multichannel EEG signals, where the

where the channels-by-sources matrix

In the above latent factor decoding studies, the “demixing” matrix

The basic autoencoder model is a feedforward neural network that consists of symmetrical network structures: the “encoder” and “decoder.” To be more specific, consider one dataset

The AE is generally trained through a back propagation (BP) method. As we can see, the AE shares some practical similarities with the SMF models. To some extent, the weight matrices _{1} and _{2} can also been regarded as “demixing” and “mixing” secret keys, respectively. Then, the relationship between the observed EEGs and the latent factors can be determined by them.

A restricted Boltzmann machine (RBM) is a kind of undirected probabilistic graph model with no connections between units of the same layer. It provides the possibility of constructing and training deeper neural networks (Hinton and Salakhutdinov,

in which _{θ} is the normalization term, and

where θ = {

and

where σ(.) is the logistic function and

The parameters θ = {

Very recently, the variational autoencoder (VAE) was introduced as a powerful DL model for some problem scenarios that needed modeling of the data's probability distribution (Kingma and Welling,

It hypothesizes that all the data are generated by one random process that involves an unobservable latent variable _{θ}(_{θ}(_{θ}(_{ϕ}(_{θ}(_{ϕ}(_{θ}(

This is also referred to as the variational lower bound.

The first term 𝔼_{qϕ(z|x)}[log_{θ}(_{ϕ}(

The second term −_{KL}(_{ϕ}(_{θ}(_{ϕ}(_{θ}(

Let J be the dimensionality of

To sum up, the prior variational lower bound can be further transformed into the following form in Formula 13:

where the _{ϕ}(_{θ}(^{(t)} and standard deviation σ^{(t)} are the computation outputs of the encoder network with respect to the input ^{(t)} and the variational parameter ϕ. The VAE is trained through stochastic gradient descent and back-propagation (BP) method. Compared with the traditional matrix factorization-based approach, the VAE is formulated as a density estimation problem. The structure and mechanism of the three autoencoder-like neural network models are illustrated in

The neural network-based multichannel EEG fusion and latent factor decoding method.

According to some studies, the generation of the emotional experience generally lags behind the activity of the brain neural systems (Krumhansl,

Specifically, in this work, we have fed the multichannel EEGs into the “encoder” part of the trained models to obtain the corresponding latent factor sequences, namely the independent components (ICs):

which follows the “many-to-one” mode.

We examined the proposed approach on two publicly accessible datasets, including

The samples of DEAP were divided into positive and negative samples according to the ratings on the Valence and Arousal emotional dimensions. A sample with score over five points was considered to be a positive class, while a sample with a score below five points was considered a negative class. The SEED dataset had pre-defined negative and positive emotional classes for the samples that we did not need to conduct label processing.

In addition to removing EEG artifacts, we conducted z-score method-based normalization for each subject's channel data. For comparison, we built a one-hidden-layer structure for the neural network models, and the number of hidden nodes was set according to the number of latent factors we set in advance (DEAP: 2–16, SEED: 2–31). Both of the two datasets were acquired with high sampling rate. Take the DEAP dataset for example; the number of samples of one subject was over 320,000. Hence, considering the training speed, we set the batch size for unsupervised latent factor learning as 500. The loss functions were selected and set according to the descriptions in section 2.2. We selected the Adam and RMSprop method for AE and VAE training, respectively. According to the experiments, the loss function can converge to the minimum within 20 training epoches. The RBM model-based approach was realized through the Matlab DeeBNet V3.0 deep belief network toolbox, whereas the AE- and VAE-based approaches were realized through the deep learning framework–Keras based on Tensorflow backend. More experimental setting details can be accessed in the source codes located in the following repository:

For comparison, we also set an experiment of an ICA-based decoding approach, and selected FastICA as the implementation method, which is most widely used and accepted in the resolution of EEG source localization and blind source separation problems. This is due to the fact that the ICA-based approach has problems in determining the specific order of the latent components as well as the reconstructed multi-channel EEGs. In this work, we also took this problem into consideration. Specifically, for each original channel EEG, we measured the correlation between it and each reconstructed signal. We supposed that the reconstructed signal with the highest correlation was the counterpart of the specific original EEG, and the highest correlation was adopted here to measure the reconstructed performance of the ICA-based approach. Besides, the reconstruction experiment and performance measurement were conducted 10 times to increase the reliability of the results, and the average performance was reported in this paper. Though we know this is a crude approach, and there may be some mistakes in determining the counterpart for the original EEG, the reported results here were indeed the best estimation and constituted the performed upper bound for ICA.

Besides, the performance of the PCA-based approach was also presented. Nevertheless, the PCA and ICA were totally different in theory and application scenarios. The PCA-based approach was generally used as a dimension reduction method, which is not to mine the underling random processing but try to extract the most important information that can best represent the original data. In this work, we were interested in the EEG latent factors-based approaches. Anyway, as a classic method, the PCA was also worth exploring, and we also added experiments when the PCA approach was adopted.

Although LSTM unit-based RNN has the ability to process long-term sequences, in the case of high-sampling rate EEG signals, signal sequences with hundreds of thousands of data points can introduce significant time overhead in sequence learning. Hence, as shown in

We also set baseline methods based on handcrafted features for comparison with our framework. We choose the support vector machine (SVM) combined with the L1-norm penalty-based feature selection method (SVM-L1). Besides, random forests (RF), K-nearest neighbors (KNN), logistic regression (LR), naive bayes (NB), and the feed-forward deep neural network (DNN) were also examined. As listed in

Three main categories of EEG features that we extracted for baseline methods.

Time-frequency | 1. Peak-Peak Mean. 2. Mean Square Value. |

domain features | 3. Variance. 4. Hjorth Parameter:Activity. |

5. Hjorth Parameter: Mobility. | |

6. Hjorth Parameter: Complexity. | |

7. Maximum Power Spectral Frequency. | |

8. Maximum Power Spectral Density. | |

9. Power Sum. | |

Non-linear dynamical | 1. Approximate Entropy. 2. C0 Complexity. |

system features | 3. Correlation Dimension. |

4. Kolmogorov Entropy. | |

5. Lyapunov Exponent. | |

6. Permutation Entropy. 7. Singular Entropy. | |

8. Shannon Entropy. 9. Spectral Entropy. | |

Asynchronous brain | 1. Fp1-Fp2. 2. AF3-AF4. 3. F3-F4. |

activity features | 4. F7-F8. 5. FC5-FC6. 6. FC1-FC2. |

7. C3-C4. 8. T7-T8. 9. CP5-CP6. | |

10. CP1-CP2. 11. P3-P4. 12. P7-P8. | |

13. PO3-PO4. 14. O1-O2. |

For evaluating the reconstruction performance, we adopted the Pearson correlation coefficient as the metric to measure the difference between the input original channel signal and the output reconstructed signal, as in Formula 15. In other words, high r-value indicated the model has good ability in reconstructing the time series. This metric gave us a general view of the feasibility and effectiveness of the model in modeling the multichannel EEG data.

For evaluating the emotion recognition performance, we chose to leave one subject's data out of the cross-validation method to compare our framework with the baseline methods. Every time, we left one subject's data out as the test set and adopt the other subjects' data as the training set. Considering the problem of unbalanced classes, the model performance was evaluated on the test set based on the F1-score metric, as in Formula 16. This procedure iterates until each subject's data has been tested.

The reconstruction performance under different assumed number of latent factors was of interest. In this work, we examined the reconstruction performance with varying number of latent factors. Specifically, considering the experimental cost, we only examined the number of latent factors from two to half the number of EEG channels. As shown in

The reconstruction performance (mean Pearson correlation coefficient) of different decoding models when assuming a different number of latent factors.

Whether the reconstruction performance was consistent across subjects when assuming different latent factors is worth exploring. As shown in

The reconstruction performance (mean Pearson correlation coefficient) of different subjects when assuming a different number of latent factors (take the DEAP for example).

As shown in

Besides, as shown in

Pearson correlation coefficient between the reconstructed EEG signal and the original EEG signal for each channel.

As mentioned above, the performance of each unsupervised modeling method on EEG reconstruction cannot be used as a criterion for judging whether the model successfully deciphers latent factors that contribute to emotion recognition. It is necessary to apply pattern recognition methods on those mined factors, and conduct a comparison based on the recognition performance. Specifically, the LSTM takes charge of modeling the latent factor sequence decoded by the ICA-, AE-, RBM-, and VAE-based approaches and also inferred the emotional state. Besides, the performance, when applying the LSTM to the principle components mined by the PCA method, has also been reported.

The classification performance is evaluated when the number of the latent factor is set as half of the number of total electrodes. Namely, for the DEAP dataset, the number of latent factors used for sequence modeling and classification was 16, whereas, for the SEED dataset, the number of latent factors was set as 31. We think the emotion recognition performance must closely related to the EEG reconstruction performance. In other words, the latent factors with low reconstruction performance were not an accurate reflection of the latent EEG process and could not lead to an ideal emotion classification result. Hence, the emotion recognition experiments on both datasets were conducted on the latent factors with the high reconstruction performance, namely 16 and 31. Besides, the datasets were recorded with very high sampling rate; take the DEAP dataset for example, the total number of EEG samples of just one subject was over 320,000 the experimental cost for training, evaluating the models, and storing the decoded latent factors are high. Evaluating on more parameter settings is somewhat impractical in our current experimental conditions, e.g., for the SEED dataset, a total of 5 methods × 62 factors × 15 subjects = 4,650 different experimental settings were needed. Furthermore, the purpose of this work was to verify the effectiveness of the EEG latent factor-based emotion recognition method and was not to find the best parameter settings; the reconstruction performance achieved when the number of latent factor was half of the number of electrodes was good enough to test our idea.

We adopted a “leave one subject's data out” cross-validation method. For the DEAP dataset,

Recognition performance on subject data of DEAP dataset (Valence).

s01 | 0.5818 | 0.6296 | 0.6207 | 0.6296 | 0.6552 |

s02 | 0.5882 | 0.6897 | 0.6885 | 0.7018 | 0.7213 |

s03 | 0.7097 | 0.6897 | 0.7241 | 0.7097 | 0.7097 |

s04 | 0.4444 | 0.5455 | 0.5490 | 0.5000 | 0.5818 |

s05 | 0.5965 | 0.7000 | 0.7302 | 0.7500 | 0.7541 |

s06 | 0.7241 | 0.7500 | 0.7419 | 0.8529 | 0.8182 |

s07 | 0.7368 | 0.7619 | 0.7586 | 0.8065 | 0.8125 |

s08 | 0.6071 | 0.6415 | 0.6545 | 0.6780 | 0.7213 |

s09 | 0.5217 | 0.4898 | 0.5926 | 0.6545 | 0.6552 |

s10 | 0.6207 | 0.6182 | 0.6441 | 0.6441 | 0.7059 |

s11 | 0.6909 | 0.6552 | 0.7097 | 0.7419 | 0.7500 |

s12 | 0.6667 | 0.6909 | 0.6552 | 0.6885 | 0.7000 |

s13 | 0.4186 | 0.5714 | 0.6182 | 0.5882 | 0.6182 |

s14 | 0.4444 | 0.5926 | 0.6182 | 0.6429 | 0.6780 |

s15 | 0.5200 | 0.6143 | 0.6667 | 0.6441 | 0.6667 |

s16 | 0.4706 | 0.5217 | 0.5306 | 0.5385 | 0.5660 |

s17 | 0.6786 | 0.6552 | 0.6885 | 0.6897 | 0.7097 |

s18 | 0.6667 | 0.6983 | 0.7407 | 0.7213 | 0.7500 |

s19 | 0.7059 | 0.6697 | 0.7302 | 0.7302 | 0.7419 |

s20 | 0.7018 | 0.7119 | 0.7097 | 0.7302 | 0.7719 |

s21 | 0.6667 | 0.6600 | 0.6441 | 0.6780 | 0.6667 |

s22 | 0.5306 | 0.5333 | 0.5965 | 0.6207 | 0.6207 |

s23 | 0.7000 | 0.7170 | 0.7119 | 0.7241 | 0.8000 |

s24 | 0.5106 | 0.5714 | 0.6207 | 0.5714 | 0.6316 |

s25 | 0.4889 | 0.5532 | 0.6441 | 0.6441 | 0.6441 |

s26 | 0.6415 | 0.7619 | 0.7333 | 0.7692 | 0.7813 |

s27 | 0.8060 | 0.8309 | 0.8125 | 0.8235 | 0.8529 |

s28 | 0.6786 | 0.7070 | 0.7241 | 0.7692 | 0.7619 |

s29 | 0.6667 | 0.6667 | 0.7097 | 0.7143 | 0.7541 |

s30 | 0.7742 | 0.7241 | 0.7813 | 0.8125 | 0.8182 |

s31 | 0.6667 | 0.6897 | 0.7018 | 0.7302 | 0.7213 |

s32 | 0.6441 | 0.5769 | 0.6538 | 0.6667 | 0.6897 |

Mean _{f1} |
0.6303 | 0.6528 | 0.6783 | 0.6927 | 0.7167 |

^{1}number of latent factors: 16

Recognition performance on subject data of DEAP dataset (Arousal).

s01 | 0.6545 | 0.7241 | 0.7419 | 0.7419 | 0.7500 |

s02 | 0.7119 | 0.7018 | 0.7458 | 0.7500 | 0.7619 |

s03 | 0.2791 | 0.3182 | 0.3043 | 0.3111 | 0.3478 |

s04 | 0.4138 | 0.5714 | 0.5106 | 0.5660 | 0.5714 |

s05 | 0.5600 | 0.6441 | 0.6316 | 0.6316 | 0.6441 |

s06 | 0.5200 | 0.6182 | 0.5818 | 0.5818 | 0.6275 |

s07 | 0.7213 | 0.7338 | 0.7541 | 0.7500 | 0.7869 |

s08 | 0.6667 | 0.6667 | 0.7213 | 0.7018 | 0.7333 |

s09 | 0.6885 | 0.7302 | 0.7419 | 0.7213 | 0.7636 |

s10 | 0.5660 | 0.6154 | 0.6667 | 0.6667 | 0.7097 |

s11 | 0.5217 | 0.4898 | 0.5455 | 0.5306 | 0.5556 |

s12 | 0.8000 | 0.8732 | 0.8889 | 0.8889 | 0.9041 |

s13 | 0.5385 | 0.7125 | 0.7419 | 0.9041 | 0.8889 |

s14 | 0.7241 | 0.7385 | 0.7619 | 0.7937 | 0.8060 |

s15 | 0.5882 | 0.6316 | 0.6545 | 0.6552 | 0.6667 |

s16 | 0.6273 | 0.6296 | 0.6545 | 0.6667 | 0.6667 |

s17 | 0.6667 | 0.7302 | 0.7241 | 0.7333 | 0.7869 |

s18 | 0.7541 | 0.7500 | 0.7692 | 0.7500 | 0.7742 |

s19 | 0.7213 | 0.7213 | 0.8060 | 0.7097 | 0.8065 |

s20 | 0.7879 | 0.8060 | 0.8235 | 0.8657 | 0.8732 |

s21 | 0.8529 | 0.8406 | 0.8732 | 0.8696 | 0.9014 |

s22 | 0.7241 | 0.7458 | 0.7500 | 0.7500 | 0.7500 |

s23 | 0.3673 | 0.3404 | 0.3750 | 0.4000 | 0.4000 |

s24 | 0.8358 | 0.8889 | 0.8696 | 0.8732 | 0.9041 |

s25 | 0.7500 | 0.7458 | 0.8182 | 0.8406 | 0.8406 |

s26 | 0.5714 | 0.5769 | 0.5769 | 0.5965 | 0.6071 |

s27 | 0.6780 | 0.7797 | 0.8060 | 0.7879 | 0.8060 |

s28 | 0.6071 | 0.6071 | 0.6154 | 0.6207 | 0.6545 |

s29 | 0.7018 | 0.7000 | 0.7458 | 0.7419 | 0.7586 |

s30 | 0.6182 | 0.5769 | 0.6207 | 0.6207 | 0.6441 |

s31 | 0.5769 | 0.6182 | 0.6316 | 0.6316 | 0.6552 |

s32 | 0.7213 | 0.7419 | 0.7692 | 0.7541 | 0.8148 |

Mean _{f1} |
0.6418 | 0.6741 | 0.6944 | 0.7002 | 0.7243 |

^{1}number of latent factors: 16

Recognition performance on subject data of SEED dataset.

s01 | 0.6753 | 0.7195 | 0.7368 | 0.7677 | 0.8308 |

s02 | 0.6897 | 0.7636 | 0.7386 | 0.7721 | 0.8504 |

s03 | 0.7255 | 0.7259 | 0.7674 | 0.8018 | 0.8741 |

s04 | 0.6434 | 0.6495 | 0.6677 | 0.6915 | 0.7012 |

s05 | 0.6683 | 0.7221 | 0.7323 | 0.7552 | 0.8027 |

s06 | 0.7321 | 0.7333 | 0.7733 | 0.8054 | 0.8904 |

s07 | 0.7143 | 0.7713 | 0.7525 | 0.7953 | 0.8600 |

s08 | 0.7053 | 0.7479 | 0.7488 | 0.7897 | 0.8571 |

s09 | 0.6912 | 0.7080 | 0.7430 | 0.7759 | 0.8538 |

s10 | 0.6677 | 0.7599 | 0.7264 | 0.7452 | 0.7931 |

s11 | 0.7295 | 0.7548 | 0.7696 | 0.8026 | 0.8827 |

s12 | 0.6776 | 0.7095 | 0.7373 | 0.7718 | 0.8320 |

s13 | 0.7168 | 0.7894 | 0.7560 | 0.7962 | 0.8676 |

s14 | 0.6939 | 0.7586 | 0.7442 | 0.7876 | 0.8565 |

s15 | 0.7608 | 0.7699 | 0.7844 | 0.8079 | 0.8908 |

Mean _{f1} |
0.6994 | 0.7389 | 0.7452 | 0.7777 | 0.8429 |

^{1}number of latent factors: 31

It is found from the table that the ICA-LSTM-based method on both datasets exceeds 0.5, indicating that the traditional ICA-based method can still decipher emotion-related information from the multichannel EEG. It also indicates that the latent factor decoding combined with sequence modeling-based approach is suitable for emotion recognition from multichannel EEG. In general, the ICA-based approach did not perform as well as other neural network-based approaches, confirming the conclusion that ICA has limitations in representation ability (Choudrey,

_{1} into the objective function to induce sparsity by shrinking the weights toward zero. It is natural for features with 0 weights to be eliminated from the candidate set. The parameter “C” controls the trade-off between the loss and penalty. Hence, the results of the performance when a different penalty parameter “C” was tested are shown in

Performance comparison between this work and the baseline methods.

_{f1} |
|||
---|---|---|---|

L1-SVM (kernel=“linear”) + handcrafted features | 0.7134 | 0.7154 | 0.8234 |

RF (n_estimators=100) + handcrafted features | 0.5870 | 0.5754 | 0.7680 |

KNN (n_neighbors=7) + handcrafted features | 0.6406 | 0.5890 | 0.6763 |

LR + handcrafted features | 0.5757 | 0.5614 | 0.7649 |

NB + handcrafted features | 0.6903 | 0.6989 | 0.6666 |

DNN (hidden_layer_sizes={100, 100, 100}) + handcrafted features | 0.6183 | 0.6593 | 0.7219 |

ICA+LSTM | 0.6303 | 0.6418 | 0.6994 |

PCA+LSTM | 0.6528 | 0.6741 | 0.7389 |

AE+LSTM | 0.6783 | 0.6944 | 0.7452 |

RBM+LSTM | 0.6927 | 0.7002 | 0.7777 |

VAE+LSTM | 0.7167 | 0.7243 | 0.8429 |

Mean cross-subject recognition performance with different settings when L1-SVM based approach is applied.

Compared to the baseline methods, the VAE-based approach achieved higher performance on both datasets. It should be pointed out that, though the performance shown here was not good enough compared with the L1-SVM method, it avoided the problems of high computational cost when calculating the handcraft features, especially for the Non-linear Dynamical System Features (e.g., the Lyapunov exponent). Besides, the effectiveness of the features highly depends on the parameter settings (e.g., the setting of the number of the embedding dimension when calculating the Lyapunov exponent). When extracting the features from multichannel EEGs, the cost will multiply. This issue hampers the practical usage of the EEG-based emotion recognition. Hence, compared to the traditional handcraft feature-based methods, the proposed neural network-based approach was advantageous in terms of data-processing speed when the trained network was provided in advance. The process of latent factor decoding, sequence modeling, and classification can be conducted and completed at a very fast speed. In addition, the experimental settings and parameters of our approach were not fully tested. On the whole, the approach proposed in this paper is also valuable and has great potential in this field. The AE has shown excellent ability in reconstructing the multichannel EEG; however, according to the recognition performance, its decoded factors were not an accurate reflection of the brain cognitive state compared to the RBM- and VAE-based approaches. Hence, in cognition research-oriented neural signal computing, the generative model-based approaches are more advisable. Finally, despite the excellent performance the PCA obtained in reconstructing the multichannel EEGs, the principle components mined by it do not contribute to promoting the performance in recognizing subjects' emotions.

As shown in

List of related works in recent years and the corresponding performance obtained.

_{acc} |
_{f1} |
_{acc} |
_{f1} |
||
---|---|---|---|---|---|

DEAP | |||||

Ontology-based storage and representation, feature selection and decision tree based recognition method (Chen et al., |
0.6783 | N/A | 0.6896 | N/A | |

Minimum-redundancy-maximum-relevance (MRMR) based feature selection combined with the statistical features, band power features, Hjorth parameters and fractal dimension (Atkinson and Campos, |
0.7314 | N/A | 0.7306 | N/A | |

Integrated classifier based on multi-layer stacking autoencoder combined with time domain features and PSD features (Yin et al., |
0.7617 | 0.7243 | 0.7719 | 0.6901 | |

Multivariate empirical mode decomposition (MEMD) based feature extraction combined with ANN (Mert and Akan, |
0.7287 | N/A | 0.7500 | N/A | |

Generative adversarial network (WGANDA) based transfer learning combined with differential entropy feature (Luo et al., |
0.6799 | N/A | 0.6685 | N/A | |

The VAE based approach proposed in this work | 0.7623 | 0.7167 | 0.7989 | 0.7243 | |

_{acc} |
_{f1} |
||||

SEED | |||||

Dynamical graph CNN (DGCNN) learns from the DE, PSD, DASM, RASM and DCAU features based adjacency matrix representation (Song et al., |
0.7995 | N/A | |||

Extracting differential entropy features to construct 2D sparse graph representation, then combining CNN for classification (Li et al., |
0.8820 | N/A | |||

Transfer learning methods combined with differential entropy features and logistic regression based classification (Lan et al., |
0.7247 | N/A | |||

Generative adversarial network (WGANDA) based transfer learning combined with differential entropy feature (Luo et al., |
0.8707 | N/A | |||

Spatial-temporal recurrent neural network (STRNN) combined with differential entropy feature (Zhang et al., |
0.8950 | N/A | |||

The VAE based approach proposed in this work | 0.8581 | 0.8429 |

This paper explored EEG-based emotion identification methods that were not restricted to handcrafted features. Brain cognition research finds that “there exists cross-user, default intra-brain variables involved in the emotional process.” Hence, the status of the brain hidden variables is closely related with the emotional psychophysiological processes and can be utilized to infer the emotional state. In this work, artificial neural networks are used for unsupervised modeling of the state space of the latent factors from the multichannel EEGs, and LSTM-based supervised sequence modeling is further performed on the decoded latent factor sequences to mine the emotion related information, which is used for inferring the emotional states. It has been verified that the neural network models are more suitable for modeling and decoding brain neural signals than the independent component analysis (ICA) method, which is widely used in brain cognitive research. Although, from the perspective of data reconstruction, the VAE cannot achieve the same performance as that of the traditional AE, we obtained a better recognition performance on the latent factors decoded by the VAE. It indicated that VAE, as a kind of generative model, can truly model the hidden state space of the brain in cognitive processes. The decoded latent factors contain the relevant and effective information for emotional state inference. This approach is also promising in diagnosing depression, Alzheimer's disease, mild cognitive impairment, etc., whose specific brain functional networks may have been altered or could be aberrant compared with the normal healthy control. In future work, we will study the influence of a different sampling step size in latent factor sequence modeling and emotion recognition. Other directions deserving of exploration in future works include source localization and functional network analysis based on the decoded latent factors.

The DEAP dataset analyzed for this study can be found in the following link:

XL proposed the idea, realized the proposed approach, conducted experiments and wrote the manuscript. ZZ provided advice on the research approaches, guided the experiments, checked, polished the manuscript and provided the experimental environment. DS proposed the idea, reviewed the related works, analyzed the results and polished the manuscript. YZ realized the proposed methods, conducted experiments, analyzed the results and provided revision suggestions. JP and LW analyzed the experimental results, checked this work and provided revision suggestions. JH conducted experiments of the baseline methods for comparison and summarized the results. CN and DW acquired, pre-processed the experimental dataset and extracted the handcraft EEG features.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank Prof. Bin Hu and his research group in Lanzhou University for inspiring our research. We thank Dr. Junwei Zhang, Prof. Peng Zhang for taking care of our research work. Finally, we would also like to thank the editors, reviewers, editorial staffs who take part in the publication process of this paper.