
Edited by: Jonathan D. Victor, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, United States

Reviewed by: Corey Ziemba, University of Texas at Austin, United States; Roland W. Fleming, University of Giessen, Germany; Dicle Dövencioğlu, Middle East Technical University, Turkey

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Texture information plays a critical role in the rapid perception of scenes, objects, and materials. Here, we propose a novel model in which visual texture perception is essentially determined by the 1st-order (2D-luminance) and 2nd-order (4D-energy) spectra. This model is an extension of the dimensionality of the Filter-Rectify-Filter (FRF) model, and it also corresponds to the frequency representation of the Portilla-Simoncelli (PS) statistics. We show that preserving the two spectra and randomizing the phases of a natural texture image results in a perceptually similar texture, strongly supporting the model. Based on only two single spectral spaces, this model provides a simpler framework to describe and predict texture representations in the primate visual system. The idea of multi-order spectral analysis is consistent with the hierarchical processing principle of the visual cortex, which is approximated by a multi-layer convolutional network.

The primate visual system rapidly analyzes texture information, or image statistics or ensemble, from complex natural images (

Visual texture is defined as the image region consisting of complex repetition of various features (

Following Julesz’s conjecture, studies have proposed a computational model that analyzes spatial distribution of low-level statistics. The most influential one is often referred to as the Filter-Rectify-Filter (FRF) model (

Revisiting the computational architecture of the FRF model, the present study proposes a novel model, or a viewpoint, that natural texture perception is essentially based on 1st- and 2nd-order spectral analyses. We show that the computations of this model are functionally consistent with the computations of PS statistics in two single-frequency spaces. To validate the model, we also introduce a novel texture synthesis based only on scrambling of the 1st- and 2nd-order phase spectra.

A diagram of the Filter-Rectify-Filter (FRF) model of texture vision. The model can be regarded as a two-stage amplitude spectral analysis: The 1st stage is a local spectral analysis of the luminance input, and the 2nd stage is a global spectral analysis of the 1st-stage output.

The conventional FRF model assumes that both the 1st- and 2nd-order processes involve two-dimensional filtering only for space (x,y). However, the energy output of the 1st-order process is four-dimensional, consisting of space (x,y), orientation (ori), and spatial frequency (freq). Corresponding to the dimensionality of the output, the 2nd-order process must be a spectral analysis of four dimensions (x, y, ori, and freq).
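The four-dimensional energy output described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `subband_energy` is hypothetical, the filter counts and bandwidths are taken from the procedure reported later in the article, and treating the bandwidths as full widths at half height, as well as the orientation convention, are assumptions.

```python
import numpy as np

def subband_energy(image, n_ori=8, freqs=(1, 2, 4, 8, 16, 32, 64, 128),
                   bw_oct=1.0, bw_ori_deg=30.0):
    """Return log-Gabor energy of a square image, shape (N, N, n_ori, n_freq)."""
    n = image.shape[0]
    f = np.fft.fftfreq(n) * n                      # frequencies in cycles/image
    fy, fx = np.meshgrid(f, f, indexing="ij")
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                             # avoid log(0); DC is zeroed below
    theta = np.arctan2(fy, fx)
    # Assumption: bandwidths are full widths at half height of the Gaussians.
    fwhm_to_sigma = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    sigma_f = bw_oct * np.log(2.0) * fwhm_to_sigma
    sigma_o = np.deg2rad(bw_ori_deg) * fwhm_to_sigma
    spectrum = np.fft.fft2(image)
    energy = np.empty((n, n, n_ori, len(freqs)))
    for i in range(n_ori):
        ori = i * np.pi / n_ori                    # 0 to 157.5 deg in 22.5 deg steps
        d = np.angle(np.exp(1j * (theta - ori)))   # angular distance wrapped to (-pi, pi]
        angular = np.exp(-d ** 2 / (2.0 * sigma_o ** 2))
        for j, f0 in enumerate(freqs):
            radial = np.exp(-np.log(radius / f0) ** 2 / (2.0 * sigma_f ** 2))
            radial[0, 0] = 0.0                     # remove the DC component
            # A one-sided filter yields a complex response whose real and
            # imaginary parts form a quadrature pair; its modulus is the energy.
            response = np.fft.ifft2(spectrum * radial * angular)
            energy[:, :, i, j] = np.abs(response)
    return energy

rng = np.random.default_rng(0)
energy = subband_energy(rng.random((64, 64)))
print(energy.shape)  # (64, 64, 8, 8): x, y, ori, freq
```

The returned array has the four axes (x, y, ori, freq) over which the 2nd-order spectral analysis would then operate.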

Relationship between subband energy data in the space domain (x, y, ori, and freq) and its amplitude spectrum in the frequency domain (Fx, Fy, Fori, and Ffreq).

From a functional view, this notion is consistent with another powerful texture model, the PS statistics model (

In summary, the FRF model can be extended and considered as a simple Fourier spectral analysis of the luminance data (1st-order, 2D) and the subband energy data (2nd-order, 4D). On this basis, we propose a novel model that states visual texture processing is represented as 1st- and 2nd-order spectral analyses (

A model in which texture perception is based on the 1st- and 2nd-order frequency spectrum. The 1st-order is the spectrum of the luminance image (2D) and the 2nd-order is the spectrum of the subband energies (4D).

Synthesis of a natural texture based on a model is a powerful and ecologically valid way to test the model. One of the most successful cases is the PS synthesis (

The luminance-energy phase randomized image is generated as shown in

Schematic diagram of the luminance-energy phase randomization. For simplicity, only four orientations and four scales are shown.

(1) Using white noise as a seed, generate a lum-PR image which has the luminance amplitude spectrum equal to that of the target. (2) Decompose both the target and the lum-PR image into orientation and spatial-frequency subbands through bandpass filters. (3) Convert each subband into an energy image. (4) Perform a four-dimensional fast Fourier transform (4D-FFT) on the energy data to obtain the amplitude spectrum of the target and the phase spectrum of the lum-PR image. (5) Apply an inverse FFT to the amplitude and phase spectra to obtain new subband energy data. (6) Extract linear subbands from the energy data, and then collapse the subbands to reconstruct the new luminance image.
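Steps (1), (4), and (5) share one operation: combining the amplitude spectrum of one array with the phase spectrum of another. The sketch below illustrates that operation with NumPy; it is an assumption-laden illustration rather than the authors' code. The helper name `swap_phase` is hypothetical, and random arrays stand in for the target image and the energy data.

```python
import numpy as np

def swap_phase(amplitude_source, phase_source):
    """Combine the amplitude spectrum of one real array with the phase of another.

    Works for any number of dimensions because fftn transforms all axes.
    """
    amplitude = np.abs(np.fft.fftn(amplitude_source))
    phase = np.angle(np.fft.fftn(phase_source))
    # Both inputs are real, so the product is Hermitian and the inverse
    # transform is real up to rounding error.
    return np.real(np.fft.ifftn(amplitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)

# Step (1): lum-PR image from a white-noise seed (random stand-in target).
target = rng.random((32, 32))
lum_pr = swap_phase(target, rng.standard_normal(target.shape))

# Steps (4)-(5): the same operation on 4D energy data (x, y, ori, freq).
target_energy = rng.random((16, 16, 4, 4))   # stand-in for target energy
lum_pr_energy = rng.random((16, 16, 4, 4))   # stand-in for lum-PR energy
new_energy = swap_phase(target_energy, lum_pr_energy)
```

Note that the recombined energy is not guaranteed to be non-negative everywhere; how such values are handled before step (6) is not specified here, so any treatment of them would be an additional assumption.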

It is well known that the luminance histogram, or pixel moment statistics, also has an impact on the appearance of a texture (

We applied the lum-energy phase randomization to 300 natural textures.

Luminance-energy phase randomized images of various natural textures.

Comparison of the images of lum-energy PR (4D le-PR) with the images of luminance phase randomization (l-PR), Heeger-Bergen synthesis (HB), Portilla-Simoncelli synthesis (PS), and 2D lum-energy PR (2D le-PR).

To compare the perceptual quality of the lum-energy PR textures with that of the other synthetic textures, we had human observers assess the perceptual similarity of each synthetic texture to the original for 300 natural textures, a much larger sample than those used in previous studies (

We did not control stimulus duration. Had presentation been limited to a short duration, the relative importance of the statistics (and hence the ranking of the synthesis conditions) might have changed because of temporal dynamics in the hierarchy of neural processing.

In the present study, we extended the dimensions of FRF processing and proposed a novel model in which texture perception is based on the 1st-order (2D-luminance) and 2nd-order (4D-energy) amplitude spectra of the image. The model is represented within only two single spectral spaces (plus the pixel histogram), and it provides a simple framework to describe and predict texture representations in various visual tasks, including scene and material perception. In addition, the notion is consistent with the PS statistical model, and it therefore provides a comprehensive understanding of the FRF and PS models in the frequency domain.

The model is biologically plausible as both the FRF and PS models are supported by rich physiological correlates in the early visual cortex, such as simple and complex cells in V1 (

The model analyzes up to the 2nd-order spectrum: the final output is a pooled summary of the 2nd-stage (i.e., global spectrum analysis), and no further analysis is performed. Termination of the process at the 2nd-stage is based on the notion that relatively low-level features are important for preattentive texture perception. However, it is also possible to perform a local spectral analysis without pooling in the 2nd-stage, as in the 1st-stage, and continue the spectral analysis at higher stages. Such an extension may reconcile the findings that point to the significance of higher-order features in texture perception (

One may notice that such a multi-order spectral analysis is remarkably consistent with the hierarchical processing principle of the visual brain (

Other work has arrived at a structure similar to that of our model. One example is the wavelet scattering network used to compute a translation-invariant image representation for classification (

Furthermore, the analogy of the two-stage spectral analysis applies not only to vision but also to audition. One good example is the analysis of the modulation spectrum of natural sounds (

While we introduced the luminance-energy phase randomization (lum-energy PR) only to test the idea of the two-stage spectrum, it may be used as a new technique to synthesize naturalistic textures. The algorithm is simpler than PS synthesis as it is mainly based on the FFT and histogram matching only. On the other hand, the (4D) lum-energy PR requires a relatively large amount of data (total data = [N × N] (histogram matching) + [N/2 × N/2] (1st-order spectrum) + [N/2 × N/2 × 4 × 4] (2nd-order spectrum), for an image of N × N pixels, eight orientations, and eight frequencies) because it was not designed to represent a texture image with a compact code. However, there is room to compress the data by using under-sampling, PCA, ICA, etc. As the data are represented only in two single spaces (i.e., 2D spectrum and 4D spectrum), one could apply PCA/ICA more effectively than previously done for the PS statistics (
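As a worked instance of the data-size expression above, the following sketch transcribes it directly into Python. The function name `le_pr_data_size` is hypothetical, and halving every spectral dimension simply follows the expression as written in the text (eight orientations and eight frequencies giving 4 × 4).

```python
def le_pr_data_size(n, n_ori=8, n_freq=8):
    """Number of stored values for an n x n image, following the text's expression."""
    histogram = n * n                                   # [N x N]
    first_order = (n // 2) * (n // 2)                   # [N/2 x N/2]
    second_order = (n // 2) * (n // 2) * (n_ori // 2) * (n_freq // 2)
    return histogram + first_order + second_order

print(le_pr_data_size(256))  # 344064
```

For the 256 × 256 images used in the experiment, the 2nd-order spectrum accounts for 262,144 of the 344,064 values, which is why compressing that term would matter most.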

The psychophysical results show that there is a significant difference in the synthesis quality of the lum-energy PR texture depending on whether the preserved energy spectrum is obtained by 4D-FFT or 2D-FFT. The improvement in representation is considered one of the advantages of extending the conventional FRF model, which operates only in the spatial dimension, to our model, which also considers orientation and spatial frequency correlations. It is noted, however, that the difference was small when compared with the difference between PS synthesis and 4D le-PR. This suggests that the effect of energy correlation across orientation and frequencies on the quality of the synthesis is not larger than that of energy correlation across space.

Through the development of lum-energy PR images, we also found that the pixel-luminance histogram plays a significant role in addition to the two spectra data. This is consistent with the previous texture models, including PS (

Luminance-energy phase-randomized images were generated according to the following procedure. All computations were implemented in MATLAB. An image with the same luminance amplitude spectrum as the target (lum-PR image) was generated using white noise as a seed. Both the target and the lum-PR image were decomposed into subband images with eight orientations (0–157.5° in 22.5° steps) and eight spatial frequencies (1–128 cycles/image in 1-octave steps) using log-Gabor filters with a spatial-frequency bandwidth of 1 octave and an orientation bandwidth of 30°. Each subband was then converted into an energy image by taking the square root of the sum of squares of the quadrature pair. The amplitude spectrum of the target energy and the phase spectrum of the lum-PR image energy were then obtained by applying a four-dimensional fast Fourier transform (4D-FFT) to the energy data. New subband energy data were obtained by the inverse FFT of the amplitude and phase spectra. Linear subbands were extracted from the energy data using the carrier from the lum-PR image. A new luminance image was then obtained by collapsing the linear subbands. Finally, the luminance histogram of the obtained image was matched to that of the target. Histogram matching was performed in the same way as in the Heeger-Bergen synthesis (
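The final histogram-matching step can be illustrated with a rank-based sketch. This is a simplification assumed for illustration, valid when both images have the same number of pixels, and it is not claimed to reproduce the Heeger-Bergen procedure exactly. The function name `match_histogram` is hypothetical.

```python
import numpy as np

def match_histogram(image, target):
    """Give `image` the pixel-value distribution of `target` (same pixel count).

    The k-th darkest pixel of `image` receives the k-th darkest value of
    `target`, so spatial rank order is kept while the histogram is replaced.
    """
    order = np.argsort(image, axis=None, kind="stable")
    matched = np.empty(image.size, dtype=target.dtype)
    matched[order] = np.sort(target, axis=None)
    return matched.reshape(image.shape)

rng = np.random.default_rng(0)
synthesized = rng.standard_normal((32, 32))   # stand-in for the synthesized image
target = rng.random((32, 32)) ** 2            # stand-in for the target texture
result = match_histogram(synthesized, target)
```

After this step, `result` contains exactly the pixel values of `target`, arranged according to the spatial ordering of `synthesized`.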

Visual stimuli consisted of 300 natural texture images (4.3 × 4.3 deg, 256 × 256 pixels). They were collected from the NYU Laboratory for Computational Vision^{1}, the McGill Calibrated Color Image Database^{2} (

In each trial, the original texture was presented in the center of the background, and the synthetic textures from the five different methods were randomly assigned to the vertices of a regular pentagon centered on the original, each located 6.0° from the center. The observers viewed the display with free gaze and ranked the perceptual similarity of the synthetic images to the original image. Stimuli were shown until the observer responded.

One of the authors and seven naïve paid volunteers participated in the experiment (one female, 21–28 years old, mean = 23.0, SD = 2.35). All of them had normal or corrected-to-normal vision. All experiments were conducted in accordance with the Ethics Committee for Experiments on Humans of the Graduate School of Arts and Sciences, The University of Tokyo. All stimuli were generated by a PC and presented on LCD or OLED monitors with a refresh rate of 60 Hz. Due to the COVID-19 pandemic, each observer used an LCD monitor (BenQ XL2720B, BenQ XL2730Z, BenQ XL2735B, or BenQ XL2430T) or an OLED monitor (SONY PVM-A250 or SONY PVM 2541A) installed in a dark room at their own home. The luminance of all monitors was carefully calibrated and gamma-corrected with a colorimeter (ColorCal II, CRS). The mean background luminance ranged from 26.2 to 48.6 cd/m^{2} (mean = 36.2, SD = 7.48). The viewing distance was adjusted so that the pixel resolution was 1.00 min/pixel. The size of the background on each monitor varied from 31.0° (W) × 18.0° (H) to 42.7° (W) × 24.0° (H).
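The stated resolution of 1.00 min/pixel ties the stimulus size in pixels to its size in degrees, and fixes the viewing distance for a given monitor. The sketch below works through that arithmetic. The function names are hypothetical, and the pixel pitch passed to `viewing_distance_mm` is an illustrative value, not one reported for any of the monitors above.

```python
import math

ARCMIN_PER_PIXEL = 1.00   # pixel resolution stated in the text
IMAGE_SIZE_PX = 256       # image width and height stated in the text

def size_in_degrees(n_pixels, arcmin_per_pixel=ARCMIN_PER_PIXEL):
    """Visual angle subtended by n_pixels, in degrees."""
    return n_pixels * arcmin_per_pixel / 60.0

def viewing_distance_mm(pixel_pitch_mm, arcmin_per_pixel=ARCMIN_PER_PIXEL):
    """Distance at which one pixel of the given pitch subtends the given angle."""
    return pixel_pitch_mm / math.tan(math.radians(arcmin_per_pixel / 60.0))

print(round(size_in_degrees(IMAGE_SIZE_PX), 2))  # 4.27, i.e. about 4.3 deg
distance = viewing_distance_mm(0.3)              # 0.3 mm pitch is hypothetical
```

The first result agrees with the 4.3 × 4.3 deg stimulus size reported for the 256 × 256 pixel images.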

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

The studies involving human participants were reviewed and approved by the Ethics Committee for Experiments on Humans of the Graduate School of Arts and Sciences, The University of Tokyo. The patients/participants provided their written informed consent to participate in this study.

IM conceived the study. KO and IM designed the study and experiment and wrote the manuscript. KO collected and analyzed the data. Both authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.