
Edited by: Jochem W. Rieger, University of Oldenburg, Germany

Reviewed by: Natasha Sigala, Brighton and Sussex Medical School, UK; Ingo Fründ, York University, Canada

*Correspondence: Jack L. Gallant

†Present Address: Shinji Nishimoto, Center for Information and Neural Networks, National Institute of Information and Communications Technology, Osaka, Japan

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

One crucial test for any quantitative model of the brain is to show that the model can be used to accurately decode information from evoked brain activity. Several recent neuroimaging studies have decoded the structure or semantic content of static visual images from human brain activity. Here we present a decoding algorithm that makes it possible to decode detailed information about the object and action categories present in natural movies from human brain activity signals measured by functional MRI. Decoding is accomplished using a hierarchical logistic regression (HLR) model that is based on labels that were manually assigned from the WordNet semantic taxonomy. This model makes it possible to simultaneously decode information about both specific and general categories, while respecting the relationships between them. Our results show that we can decode the presence of many object and action categories from averaged blood-oxygen level-dependent (BOLD) responses with a high degree of accuracy (area under the ROC curve > 0.9). Furthermore, we used this framework to test whether semantic relationships defined in the WordNet taxonomy are represented the same way in the human brain. This analysis showed that hierarchical relationships between general categories and atypical examples, such as

In the past decade considerable interest has developed in decoding stimuli or mental states from brain activity measured using functional magnetic resonance imaging (fMRI). Early results in this field (Kay et al.,

Brain decoding can be viewed as the problem of finding the stimulus,

The other popular approach to this problem is direct decoding. In this approach, one constructs an explicit model of

Our direct decoding approach, hierarchical logistic regression (HLR), decodes which object and action categories are present in natural movies while capturing hierarchical dependencies among them. Logistic regression is a natural choice for modeling a system with Gaussian inputs (such as BOLD responses) and binary outputs (such as the presence or absence of a specific category). The most basic logistic regression approach would be to build a separate model for each category. However, this approach implicitly assumes that each category is independent of all the others. This assumption is clearly false when the categories are related hierarchically, and it can lead to nonsensical results, such as decoding that a scene contains a
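As a toy illustration (the probabilities and category names below are made up), an independent per-category decoder can violate the hierarchy, while chaining conditional probabilities cannot:

```python
# Independent per-category models can assign a subordinate category a
# higher probability than its hypernym, which is logically impossible.
p_vehicle = 0.3  # hypothetical independent model output for "vehicle"
p_car = 0.6      # hypothetical independent output for "car" -- exceeds its parent

# Decoding the conditional P(car | vehicle) instead, and multiplying down
# the hierarchy, keeps the marginal consistent with the parent category:
p_car_given_vehicle = 0.6
p_car_marginal = p_vehicle * p_car_given_vehicle  # can never exceed p_vehicle
```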

We solved this problem by combining multiple logistic regression models together hierarchically. The HLR model decodes the conditional probability that each category is present, given that its hypernyms (its superordinate or parent categories in the hierarchy) are present. These conditional probability relationships can be represented as a graphical model (Figure

Thus, the joint probability that a scene contains the categories

To estimate the full HLR model, we first estimated a separate logistic model for each conditional probability. Each logistic model predicts the binary presence or absence of a category given a vector of voxel responses across a few previous time points,

To decode whether a category was present using the HLR models, we multiplied the conditional probabilities together. For example, to decode the probability that
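The chain-rule decoding described here can be sketched as follows; the hypernym chain and model weights below are invented stand-ins, not the fitted models:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical hypernym chain (root -> leaf) with random stand-in weights
# for each conditional logistic model; real weights come from fitting.
rng = np.random.default_rng(0)
chain = ["organism", "animal", "carnivore", "canine", "wolf"]
weights = {c: rng.normal(size=6) for c in chain}  # 5 voxel features + bias

def decode_marginal(voxel_pattern, chain, weights):
    """P(leaf category present) = product over the chain of
    P(category | its hypernyms present, voxel responses)."""
    x = np.append(voxel_pattern, 1.0)  # append bias term
    p = 1.0
    for category in chain:
        p *= sigmoid(weights[category] @ x)
    return p

r = rng.normal(size=5)  # one standardized voxel-response pattern
p_leaf = decode_marginal(r, chain, weights)
```

Because each conditional factor lies in (0, 1), the decoded marginal for a category can never exceed that of any of its hypernyms.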

We applied the HLR modeling framework to BOLD fMRI responses recorded from seven subjects (Figure

Functional data were collected from seven human subjects. No subject had a neurological disorder, and all had normal or corrected-to-normal vision. The experimental protocol was approved by the Committee for the Protection of Human Subjects at the University of California, Berkeley. Written informed consent was obtained from all subjects. The data for five of the subjects used here were the same as those used in a previous publication (Huth et al.,

The stimuli for this experiment consisted of 129 min of natural movies drawn from movie trailers and other sources. These stimuli are identical to those used in earlier experiments from our laboratory (Nishimoto et al.,

MRI data were collected on a 3T Siemens TIM Trio scanner at the UC Berkeley Brain Imaging Center, using a 32-channel Siemens volume coil. Functional scans were collected using a gradient-echo EPI sequence with repetition time (TR) = 2.0045 s, echo time (TE) = 31 ms, flip angle = 70 degrees, voxel size = 2.24 × 2.24 × 4.1 mm, matrix size = 100 × 100, and field of view = 224 × 224 mm. The entire cortex was sampled using 30–32 axial slices. A custom-modified bipolar water excitation radiofrequency (RF) pulse was used to avoid signal from fat.

Separate model estimation (fit) and model validation (test) datasets were collected from each subject in an interleaved fashion during three scanning sessions. The stimuli for the model estimation dataset consisted of 120 min of movie trailers. These stimuli are identical to the stimuli used in Nishimoto et al. (

Each run was motion corrected using the FMRIB Linear Image Registration Tool (FLIRT) from FSL 4.2 (Jenkinson and Smith,

For each voxel, low-frequency voxel response drift was identified using a median filter with a 120-s window and this was subtracted from the signal. The mean response of each voxel was then subtracted and the remaining response was scaled to unit variance.
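A minimal sketch of this preprocessing, assuming a TR of 2 s (the actual TR was 2.0045 s) and using a NumPy running median as a stand-in for whatever median-filter implementation was actually used:

```python
import numpy as np

def median_detrend_zscore(ts, tr=2.0, window_s=120.0):
    """Remove low-frequency drift with a running-median filter over a
    120-s window, then center the signal and scale to unit variance."""
    k = int(round(window_s / tr))
    k += (k + 1) % 2  # force an odd window length
    half = k // 2
    padded = np.pad(ts, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    drift = np.median(windows, axis=1)
    detrended = ts - drift
    detrended -= detrended.mean()
    sd = detrended.std()
    return detrended / sd if sd > 0 else detrended

rng = np.random.default_rng(1)
t = np.arange(600)                               # 600 TRs of simulated data
raw = 100 + 0.02 * t + rng.normal(size=t.size)   # slow drift + noise
clean = median_detrend_zscore(raw)
```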

Anatomical images were obtained using a T1 MP-RAGE pulse sequence. These images were then segmented to obtain a 3D representation of the cortical surface using Caret5 software (Van Essen et al.,

The HLR model includes a separate conditional logistic regression model for each category. Each conditional logistic regression model converts a spatiotemporal pattern of voxel activity to the binary presence (1) or absence (0) of one category, for time points where all of that category's hypernyms are present. While the cortex contains tens of thousands of voxels, many voxels are very noisy or contain little information about the stimuli. Thus, to reduce model complexity and noise, only 5000 voxels in each subject were used as input to the HLR model. (Models were tested in one subject using 1000, 5000, and 10000 voxels. The best performance was found with 5000 voxels.) To find the best 5000 voxels for each subject, we first used regularized linear regression to estimate an independent encoding model for each voxel (encoding models predict the response of single voxels as a weighted sum across binary category labels). This modeling procedure was repeated 50 times, each time holding out and predicting responses on a separate segment of the model estimation dataset. Model prediction performance was then averaged across the 50 folds and the best 5000 voxels were selected. The model estimation dataset was used for this procedure; the validation data were reserved for use elsewhere.
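A toy-scale sketch of this voxel-selection procedure (synthetic data; 5 folds standing in for the 50; ridge regression standing in for the regularized linear model):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n_time, n_voxels, n_categories, n_keep = 200, 50, 10, 5  # toy sizes

labels = (rng.random((n_time, n_categories)) < 0.2).astype(float)
true_w = rng.normal(size=(n_categories, n_voxels))
responses = labels @ true_w + rng.normal(size=(n_time, n_voxels))
responses[:, :10] = rng.normal(size=(n_time, 10))  # first 10 voxels: pure noise

# Cross-validated encoding-model performance per voxel: predict each
# voxel's response from the binary category labels, score by correlation
# on the held-out fold, then average scores over folds.
scores = np.zeros(n_voxels)
kf = KFold(n_splits=5)
for train, test in kf.split(responses):
    enc = Ridge(alpha=1.0).fit(labels[train], responses[train])
    pred = enc.predict(labels[test])
    for v in range(n_voxels):
        r = np.corrcoef(pred[:, v], responses[test, v])[0, 1]
        scores[v] += r / kf.get_n_splits()

best = np.argsort(scores)[-n_keep:]  # indices of the best-predicted voxels
```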

For each scene, the spatiotemporal input to the HLR model is a length 15000 vector consisting of the BOLD responses for the 5000 selected voxels at three consecutive time points. Multiple time points were included because BOLD responses are slow, taking 5–15 s to rise and fall after a neural event (Boynton et al.,
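The stacking of multiple time points can be sketched as below. The specific delays are an assumption here (lags of 1–3 TRs after the stimulus), chosen because the BOLD response lags the stimulus:

```python
import numpy as np

def stack_delays(responses, n_delays=3):
    """Concatenate voxel responses at t+1 .. t+n_delays (hypothetical
    lags) into one feature vector per time point, so the decoder sees
    the delayed BOLD response to the stimulus at time t. Trailing time
    points reuse the last sample as padding."""
    n_time, n_voxels = responses.shape
    out = np.zeros((n_time, n_voxels * n_delays))
    for d in range(1, n_delays + 1):
        shifted = np.roll(responses, -d, axis=0)
        shifted[-d:] = responses[-1]
        out[:, (d - 1) * n_voxels : d * n_voxels] = shifted
    return out

resp = np.arange(12, dtype=float).reshape(6, 2)  # 6 TRs, 2 voxels
X = stack_delays(resp)  # shape (6, 6): 2 voxels x 3 delays per time point
```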

To build each conditional logistic regression model we used only the subset of the model estimation data where all the hypernyms of the selected category were present. For example, to build a model for the category

We tested whether this gradient descent with early stopping produced different results from more standard L2-penalized regression, but found very little difference. We implemented L2-regularized logistic regression using scikit-learn (Pedregosa et al., ), with regularization coefficients ranging from 10^{−6} to 10^{4}. For each of three bootstraps, we fit the model on 90% of the data and evaluated the loss on 10% to choose the best regularization coefficient. We then took the median regularization coefficient found over the bootstraps and used it to refit the model on the entire training set. We compared the results of this procedure with those of the early stopping approach and found that, on average, regression with early stopping performed slightly better. Over all categories with AUC > 0.5 for either regression method, early stopping AUCs were higher by 0.09 on average, and 59.0% of categories were better decoded by the early stopping model than by L2 regularization. These differences appear to be due to early stopping performing much better on categories with few positive examples.
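A toy-scale sketch of this regularization sweep using scikit-learn (synthetic data; three random 90/10 splits standing in for the bootstraps; note that scikit-learn's `C` parameter is the inverse of the regularization coefficient):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
y = (X @ rng.normal(size=20) + rng.normal(size=300) > 0).astype(int)

# Sweep regularization coefficients over 10^-6 .. 10^4.
lambdas = np.logspace(-6, 4, 11)
best_lams = []
for b in range(3):  # three 90/10 splits
    Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.1, random_state=b)
    losses = [
        log_loss(
            yva,
            LogisticRegression(C=1.0 / lam, max_iter=2000)
            .fit(Xtr, ytr)
            .predict_proba(Xva),
            labels=[0, 1],
        )
        for lam in lambdas
    ]
    best_lams.append(lambdas[int(np.argmin(losses))])

# Refit on all the data with the median of the selected coefficients.
final_lam = float(np.median(best_lams))
final = LogisticRegression(C=1.0 / final_lam, max_iter=2000).fit(X, y)
```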

To avoid overfitting, the model output was smoothed toward the original prior probability. We assumed a beta-distributed prior on model outputs, with the mean set to the conditional prior probability for each category. We then fit a scaling parameter η governing how strongly each decoded conditional probability p(c_{i} | c_{\i}) is shrunk toward p_{i, 0}, where p_{i, 0} is the prior probability of seeing the
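The exact beta-prior estimator is not reproduced here; the sketch below illustrates only the general idea of shrinking model outputs toward the prior with a scaling parameter η (the convex-combination rule is an assumption, not the paper's formula):

```python
import numpy as np

def shrink_to_prior(p, prior, eta):
    """Hypothetical smoothing rule: pull decoded probabilities toward the
    category's conditional prior. eta = 0 returns the prior exactly; large
    eta trusts the raw model output. (A sketch of prior smoothing, not the
    paper's exact beta-prior estimator.)"""
    return (eta * np.asarray(p) + prior) / (eta + 1.0)

p = np.array([0.99, 0.01, 0.5])   # raw conditional model outputs
prior = 0.2                       # conditional prior for this category
smoothed = shrink_to_prior(p, prior, eta=4.0)
```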

All individual category models were then combined to form an HLR model that describes the full probability distribution over all scene labels.

One potential issue with the logistic regression approach described above is that the manually assigned category labels in the model estimation dataset might be inaccurate or noisy. To account for this possibility we re-estimated logistic regression models for one subject using the method from (Bootkrajang and Kabán,

For each time point in the validation dataset we predicted the probability that each category was present in the stimulus using the HLR. Then an ROC analysis was used to assess model decoding performance for each category. To perform the ROC analysis we gradually increased a detection threshold from zero to one. For each threshold we computed the number of false positive detections (points where the predicted time course is higher than the threshold but the category is not present) and true positive detections (where the predicted time course is higher than the threshold and the category is actually present in the stimulus). Then we plotted the true positive rate (TPR) against the false positive rate (FPR) across all thresholds, producing the ROC curve.

A common statistic used to gauge detection performance is the area under the ROC curve (AUC). An AUC value of 1.0 represents perfect decoding, where the decoded probability for any time point where the category is actually present is higher than the decoded probability for every time point where the category is absent. We determined the chance level of the AUC by shuffling the actual binary labels for each category across time. Blocks of four TRs were shuffled 1000 times to produce new time courses with the same prior probability and a similar autocorrelation structure to the original data (we tested other block sizes but found no difference in the results). The AUC was then computed for each of the 1000 shuffled time courses, and the null distribution of AUCs was fit with a beta distribution centered at 0.5. Finally, we computed the probability of obtaining the actual AUC under this distribution. The actual AUC was declared significant if its probability under this null distribution was below the significance threshold. Significance thresholds were determined by applying the Benjamini-Hochberg procedure (Benjamini and Hochberg,
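An empirical sketch of the AUC and block-shuffle null (synthetic labels and decoder output; 200 shuffles and an empirical p-value here, rather than the 1000 shuffles and fitted beta distribution used in the analysis):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n_time, block = 400, 4
labels = (rng.random(n_time) < 0.1).astype(int)      # sparse category labels
decoded = 0.4 * labels + 0.7 * rng.random(n_time)    # informative decoder output

auc = roc_auc_score(labels, decoded)

def block_shuffle(y, block, rng):
    """Permute whole blocks of consecutive TRs, preserving the prior
    probability and (roughly) the autocorrelation of the label series."""
    blocks = y.reshape(-1, block)
    return blocks[rng.permutation(len(blocks))].ravel()

null_aucs = np.array([
    roc_auc_score(block_shuffle(labels, block, rng), decoded)
    for _ in range(200)
])
p_value = (null_aucs >= auc).mean()  # empirical significance of the real AUC
```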

The ROC analysis tests how well each category is decoded across all time. Yet it is also important to test how well all the categories are decoded within each time point. To test this we calculated the likelihood of the actual category labels at each time point, given the decoded category probabilities. This likelihood was computed as the product of the probabilities of obtaining the actual binary label for each category under the model. For the null model we used the prior probability according to the model estimation dataset, which was constant over time. We then quantified model performance as the log likelihood ratio between the HLR model and the null model. To estimate chance-level performance we shuffled the model output for each category across time 100,000 times, recomputing the log likelihood ratio on each shuffle. The log likelihood ratio was declared significant if its probability under the shuffled distribution was below the significance threshold (
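A minimal sketch of this likelihood comparison for a single time point (synthetic labels and decoder outputs; computed in log space for numerical stability):

```python
import numpy as np

def log_likelihood(labels, probs, eps=1e-12):
    """Log probability of the actual binary labels given per-category
    probabilities (the product over categories, taken in log space)."""
    p = np.clip(probs, eps, 1 - eps)
    return float(np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

rng = np.random.default_rng(5)
n_categories = 20
labels = (rng.random(n_categories) < 0.3).astype(float)  # one time point
prior = np.full(n_categories, 0.3)        # null model: constant prior
decoded = 0.8 * labels + 0.1              # informative decoder output

# Positive llr means the decoder beats the prior at this time point.
llr = log_likelihood(labels, decoded) - log_likelihood(labels, prior)
```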

Figure

The first row of Figure

The second row of Figure

The third row of Figure

The fourth row of Figure

The results in Figure

Many general categories, such as

The HLR approach assumes that cortical responses follow the WordNet taxonomy, but this assumption is likely false in some cases. Therefore we performed an analysis that shows which hypernymy relationships in WordNet were not reflected in brain activity. Under the HLR approach we used WordNet to construct conditional models that decode the presence of a given category including all of its hyponyms. For example, the conditional model for

We used this logic to construct a test for each hypernymy relationship in the subset of WordNet used in this study. For each category, we computed the conditional AUC (cAUC) using only the time points in the validation dataset when all the hypernyms of that category were present. Thus the cAUC shows how well a category can be distinguished from its siblings. We then compared the cAUC to the overall AUC for each category. If the cAUC was significantly higher than the overall AUC, then we concluded that the assumed relationship between this category and its hypernym is not reflected in brain activity.
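The cAUC computation can be sketched as follows (synthetic parent/child labels and decoder output; the actual analysis additionally tested whether the cAUC significantly exceeded the overall AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n_time = 1000
parent = (rng.random(n_time) < 0.4).astype(int)           # hypernym labels
child = parent * (rng.random(n_time) < 0.3).astype(int)   # child implies parent
decoded = 0.3 * child + 0.5 * rng.random(n_time)          # decoder output

# Overall AUC: the category vs. every other time point.
overall_auc = roc_auc_score(child, decoded)

# Conditional AUC: restrict to time points where all hypernyms are
# present, i.e. test whether the category is separable from its siblings.
mask = parent == 1
c_auc = roc_auc_score(child[mask], decoded[mask])
```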

The results of this analysis are shown in Figure

The HLR model recovers information about the presence of individual categories in the stimulus, but in natural movies many different categories appear at each point in time. To test how well the HLR model decodes all of the categories present at each time point, we computed the probability of the actual categories present in the stimuli under the decoded model and under a null model based on the prior, P_{0}. We defined the prior P_{0}(c_{i}) by setting it equal to the proportion of the time that the category c_{i} was present in the movies used for model parameter estimation. Thus, the likelihood of the decoded categories relative to the prior for each time point t is given by the ratio ∏_{i} P(c_{i}(t) | r(t)) / ∏_{i} P_{0}(c_{i}(t)).

Figure

To provide an intuitive and accessible demonstration of the performance of the HLR decoder, we constructed a composite video that shows the stimulus movie on the left, and the categories with the highest decoded probability on the right (see Supplementary Video

Since the HLR decoder seems able to recover many object and action categories from BOLD responses, one might naturally ask which voxels are used to decode each category. However, note that decoding results must be interpreted with caution; asking which voxels contribute to decoding is not equivalent to asking which voxels represent information about a category (Haufe et al.,

To illustrate the difficulty of interpreting decoding model weights, we plotted both decoding and encoding weights for one category,

In this study we showed that it is possible to accurately decode the presence or absence of many object and action categories in natural movies from BOLD signals measured using fMRI. These include general categories, such as

Our decoder used a HLR model based on the graphical structure of WordNet, a semantic taxonomy that was manually constructed by a team of linguists (Miller,

The second important feature of the HLR model is that it uses the relationships between categories to rationally constrain the decoding results. If these constraints were not included, simultaneously decoding hierarchically related categories could easily lead to nonsensical results. For example, a naïve simultaneous decoder might find that the probability of a particular scene containing a

One potential issue with the HLR approach is its implicit assumption that all the hyponyms of a category elicit similar brain responses. This could lead to problems because the category relationships come from WordNet, which is a hand-constructed semantic taxonomy and thus is not guaranteed to reflect brain activity. To address this issue, we tested each of the relationships specified in the subset of WordNet spanned by our stimuli. This was done by examining how easily each category could be distinguished from its siblings under the same hypernym.

We found that two specific types of WordNet relationships were not reflected in cortical representations. The first type consists of relationships that are technically correct but whose specific category shares few features with the general category. For example, the relationship between

One alternative to the HLR approach would be to decode only the “basic level” categories (Rosch et al.,

Another alternative to the HLR model would be to represent categories not as binary variables, but as vectors of features, topic probabilities (Stansbury et al.,

In recent years, the field of brain reading has generated considerable interest from scientists and the public alike. Every improvement in brain measurement technology brings us closer to the goal of a general device for reading out the state of a person's brain. To that end, the HLR model developed here has improved our ability to simultaneously decode many variables while respecting some of the statistical dependencies between them. Yet there are still many issues in brain reading that remain unsolved. We believe that the most important theoretical limitation is that all current methods (the HLR included) assume independence between variables that are actually not independent. One example is the assumption that each category within a scene occurs independently, as discussed above. Another example is the assumption that stimuli are independent from one time point to the next. Relaxing these assumptions should improve the performance of future brain decoders. An ideal future decoder should capture as many of these dependencies among stimulus variables as possible, thus minimizing the amount of information needed to decode the stimuli.

AH and TL designed and carried out the analysis, with input from NB, JG, SN, and AV. AH, NB, AV, and SN collected the data. SN designed the stimulus. AH labeled semantic categories in the stimulus. AH and TL wrote the paper, with contributions from NB, SN, JG, and AV. JG oversaw all stages of research.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to thank James Gao and Tolga Çukur for their help in this project. The work was supported by grants from the National Eye Institute (EY019684) and from the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. AH was also supported by the William Orr Dingwall Neurolinguistics Fellowship.

The Supplementary Material for this article can be found online at: