
Edited by: Oi-Man Kwok, Texas A&M University, United States

Reviewed by: Heungsun Hwang, McGill University, Canada; Eun Sook Kim, University of South Florida, United States

*Correspondence: Terrence D. Jorgensen

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The assumption of equivalence between measurement-model configurations across groups is typically investigated by evaluating overall fit of the same model simultaneously to multiple samples. However, the null hypothesis (H_{0}) of configural invariance is distinct from the H_{0} of overall model fit. Permutation tests of configural invariance yield nominal Type I error rates even when a model does not fit perfectly (Jorgensen et al.).

Many behavioral researchers do not have the luxury of being able to directly observe the phenomena they study. For example, organizational researchers need to measure job satisfaction or morale. Clinicians need to measure various psychological disorders. Social psychologists and sociologists need to measure attitudes and social orientations. Educational researchers need to measure teaching and learning outcomes. Often, researchers rely on indirect measures, such as self-report scales, and psychometric tools, such as reliability estimates and latent trait models [e.g., confirmatory factor analysis (CFA) and item-response theory (IRT) models] facilitate evaluation of the quality of those measurements.

Similarly frequent is the need for researchers to compare groups, in either experimental (e.g., treated vs. control) or observational contexts (e.g., demographic or intact groups). In order to make valid comparisons of scale responses across groups, the scale must function equivalently for those groups. In other words, if measurement parameters are equivalent across groups, observed group means will only differ as a function of differences on the latent trait itself (Meredith).

Latent trait models facilitate the investigation of ME/I, and different levels of ME/I have been defined according to categories of model parameters. In a CFA framework, configural invariance is represented in a model with the same pattern of fixed and free (i.e., near-zero and substantial) factor loadings across groups, although the values of these parameters may differ across groups. When fitting models to multivariate normally distributed data using maximum likelihood estimation, the null hypothesis (H_{0}) of configural invariance is traditionally tested using a likelihood-ratio test statistic (LRT)^{1}, which is asymptotically distributed as a central χ^{2} random variable with degrees of freedom (df) equal to the number of constraints the model places on the data.

Metric equivalence (or “weak” invariance) indicates the additional assumption that the values of factor loadings are equal across groups, and this assumption must hold in order to make valid across-group comparisons of latent variances or correlations. This model is nested within the configural model, so a Δχ^{2} test can be used to test the H_{0} of exact metric equivalence. If a researcher concludes that full (or partial^{2}) metric equivalence holds, more restrictive levels of ME/I (e.g., equality of intercepts) can be tested in turn.

The current paper discusses recent advances only in tests of configural invariance, which is the least restrictive level of invariance. A false H_{0} would imply that model configurations differ across groups, in which case data-generating population processes do not share all the same parameters across groups. A test that rejects the H_{0} of configural invariance would therefore prohibit researchers from testing more restrictive levels of ME/I. Currently, configural invariance is assessed by evaluating the overall fit of the configural model (Putnick and Bornstein). However, the fact that a false H_{0} should lead to a poor fit does not imply the reverse^{3}: a model may fit poorly even when the groups share the same data-generating process (i.e., the H_{0} of configural invariance is true), but the specified model is a poor approximation of the true functional form of that process (i.e., false H_{0} that the model is correctly specified).

Using a newly proposed permutation test of configural invariance (Jorgensen et al.), the H_{0} of configural invariance can be tested with nominal Type I error rates even when the H_{0} of correct specification is false. I extend this line of research by proposing the use of multivariate modification indices (Bentler and Chou) to respecify a poorly fitting model without violating the H_{0} of group equivalence in true model configurations. This study is therefore only concerned with the situation when the H_{0} of configural invariance is true (but the model does not fit well), not when the H_{0} is false. To evaluate the use of multivariate modification indices for the purpose of testing whether the same parameter should be freed simultaneously across groups, I designed a small-scale simulation study as a proof of concept to show that they are capable of preventing Type I error inflation better than traditional 1-df univariate modification indices.

I begin by reviewing in more detail issues with testing model fit vs. configural invariance, using an analysis of the classic Holzinger and Swineford (1939) data to demonstrate a permutation test of the H_{0} of configural invariance. I then describe the small-scale Monte Carlo simulation study comparing Type I error rates using univariate and multivariate modification indices. I conclude with recommendations for future applied and methodological research.

Configural invariance in a multigroup context is equivalence in model configurations across the populations of interest. The analysis models are typically specified as configurally invariant, and the LRT of overall model fit is used to evaluate whether the model adequately approximates the population models. As noted in the Introduction, rejection of the H_{0} of exact model fit could imply numerous conditions, including but not limited to the following: (a) the hypothesized model corresponds well to one or more populations but poorly to at least one other; (b) the model does not correspond to any group's model, for different reasons across groups; (c) all groups' true models are configurally invariant, but the hypothesized model does not correspond to that shared functional form. Thus, when a model's overall fit to multiple groups needs improvement, the decision of how to respecify the model would depend on which condition led to poor overall fit.

Because the LRT is a test of overall exact fit of the model to the data, two potential sources of misspecification are confounded (Cudeck and Henly): discrepancy between the groups' true data-generating processes, and discrepancy between the hypothesized model and those processes. The H_{0} of configural invariance only concerns the former source of approximation discrepancy (which I will refer to as group discrepancy), not the overall discrepancy between the hypothesized model and the true data-generating process.

Good model fit and equivalent model configurations are both important foundational assumptions of ME/I because testing equality of measurement parameters is only valid if the estimated parameters correspond to actual parameters of the true data-generating process. But merely testing the overall fit of a configural model does not provide adequate information about whether model configurations can be assumed equivalent across groups. It is possible (perhaps even probable) that a model provides as good a description of one population as it does for another population (e.g., men and women or respondents from different countries), even if the model fits poorly or only approximately well. Evaluating overall fit therefore tests the wrong H_{0} by confounding group equivalence and overall exact model fit into a single test. The permutation method introduced by Jorgensen et al. disentangles these two H_{0}s, providing a test of group equivalence that does not require the model to fit exactly.

Another common issue with model-fit evaluation is the perception that the LRT nearly always rejects good models because SEM requires large sample sizes for estimation. Although it is true that power is a function of sample size, an analysis model that corresponds perfectly with a true population model would not yield inflated Type I errors (although small-sample bias would; Nevitt and Hancock) because the H_{0} would be true. But because theoretical models are more realistically interpreted as approximations to more complex population models (MacCallum), the H_{0} of exact fit should rarely be expected to be precisely true in practice. In order to help researchers evaluate the degree to which a H_{0} is false, numerous indices of approximate fit have been proposed since the 1970s, analogous to providing standardized measures of effect size that accompany a null-hypothesis significance test in other contexts (e.g., Cohen's d).

Unfortunately, approximate fit indices (AFIs) or their differences (Δ) between competing models rarely have known sampling distributions. Even when they do [e.g., the root mean-squared error of approximation (RMSEA); Steiger and Lind], the fixed cutoffs used to judge them are rules of thumb that do not generalize across modeling conditions.

Putnick and Bornstein reviewed how configural invariance is evaluated in practice: via the overall fit of the configural model, which does not directly test the H_{0} of equivalent group configurations (Jorgensen et al.).

To demonstrate the utility of the recently proposed permutation test and how multivariate modification indices can be used to modify a model under the assumption of configural invariance, I fit a three-factor multigroup CFA model with simple structure to the Holzinger and Swineford (1939) data, using students from the Pasteur and Grant–White schools as the two groups. Estimated parameters are reported in Table 1.

Estimated parameters from CFA with simple structure.

| Factor | Item | Task | Pasteur λ | Pasteur θ | Grant–White λ | Grant–White θ |
|---|---|---|---|---|---|---|
| Visual | x_{1} | Visual perception | 1.047 | 0.298 | 0.777 | 0.715 |
| | x_{2} | Cubes | 0.412 | 1.334 | 0.572 | 0.899 |
| | x_{3} | Lozenges | 0.597 | 0.989 | 0.719 | 0.557 |
| Textual | x_{4} | Paragraph comprehension | 0.946 | 0.425 | 0.971 | 0.315 |
| | x_{5} | Sentence completion | 1.119 | 0.456 | 0.961 | 0.419 |
| | x_{6} | Word meaning | 0.827 | 0.290 | 0.935 | 0.406 |
| Speed | x_{7} | Speeded addition | 0.591 | 0.820 | 0.679 | 0.600 |
| | x_{8} | Speeded counting of dots | 0.665 | 0.510 | 0.833 | 0.401 |
| | x_{9} | Speeded discrimination between straight and curved capital (uppercase) letters | 0.545 | 0.680 | 0.719 | 0.535 |

λ, factor loading; θ, residual variance.

There is evidence that the configural model does not fit the data perfectly: the overall LRT is significant, χ^{2}(48) = 115.85, p < 0.001.

A permutation test of configural invariance can be conducted by comparing the observed χ^{2} to an empirical reference distribution rather than to a central χ^{2} distribution with 48 df. The distribution of χ^{2} under the H_{0} of equivalent model configurations can be estimated by randomly reassigning rows of data to the two schools, fitting the configural model to the permuted data, and saving χ^{2}. Repeating these steps numerous times results in a permutation distribution of χ^{2}, and a p value is calculated as the proportion of that distribution at least as large as the observed χ^{2}. Because the students are assumed equivalent when they are randomly reassigned to schools, the permutation distribution reflects the sampling variance of χ^{2} under the assumption that the schools share the same data-generating model, but without assuming that the data-generating model corresponds perfectly with the fitted model. Due to poor model fit (i.e., the H_{0} of no overall approximation discrepancy is rejected), the permutation distribution is not expected to approximate a central χ^{2} distribution with 48 df; rather, it reflects only the H_{0} of no group discrepancy (Jorgensen et al.).
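The permutation machinery described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the `fit_stat` function below uses a simple sum of squared mean differences as a stand-in statistic, whereas in practice one would refit the multigroup CFA to each permuted data set and return its χ^{2}. The function names are hypothetical.

```python
import numpy as np

def fit_stat(g1, g2):
    # Stand-in for the configural model's chi-square: here, the sum of
    # squared mean differences between the two groups' indicators.
    return float(np.sum((g1.mean(axis=0) - g2.mean(axis=0)) ** 2))

def permutation_p(g1, g2, n_perm=1000, seed=1):
    # p value = proportion of permuted statistics >= the observed one
    rng = np.random.default_rng(seed)
    observed = fit_stat(g1, g2)
    pooled = np.vstack([g1, g2])  # rows are assumed exchangeable under H0
    n1 = g1.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])  # reassign rows to "schools"
        if fit_stat(pooled[idx[:n1]], pooled[idx[n1:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

Because group labels are shuffled while the data themselves are untouched, the reference distribution is valid even when the fitted model only approximates the shared data-generating process.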

A permutation test revealed no evidence against the H_{0} of configural invariance, using either the χ^{2} statistic or AFIs as the test statistic.

My discussion below is in the context of maximum likelihood estimation, but the same concepts can be applied to other discrepancy functions for estimating SEM parameters (Bentler and Chou). The LRT involves a restricted (M_{0}) and an unrestricted (M_{1}) model. The LRT statistic is calculated by comparing the log-likelihood (ℓ) of the data under each model: LRT = −2 × (ℓ_{0} − ℓ_{1}). If the H_{0} is true and distributional assumptions are met, the LRT statistic is asymptotically distributed as a central χ^{2} random variable with df equal to the number of constraints imposed in M_{0} relative to M_{1}.
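A minimal numerical sketch of this formula, outside any SEM software: compare a restricted model (mean fixed at 0) against an unrestricted model (mean freely estimated) for normal data with known unit variance. In this toy case the LRT reduces analytically to n × x̄², which the test below verifies; the setup is illustrative, not from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)  # H0 (mean = 0) is true

# M0: mean fixed at 0; M1: mean freely estimated (sd treated as known = 1)
ll0 = stats.norm.logpdf(x, loc=0.0, scale=1.0).sum()
ll1 = stats.norm.logpdf(x, loc=x.mean(), scale=1.0).sum()

lrt = -2.0 * (ll0 - ll1)            # LRT = -2 * (l0 - l1), always >= 0
p_value = stats.chi2.sf(lrt, df=1)  # df = one freed constraint
```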

The Wald and Lagrange multiplier tests are asymptotically equivalent to the LRT, but the Wald test only requires fitting M_{1}, whereas the Lagrange multiplier test only requires fitting M_{0} (for details, see Buse). A Lagrange multiplier, called a modification index in the SEM literature, approximates how much the discrepancy would decrease (i.e., the expected drop in the χ^{2} of M_{0}) if that constraint were freed in M_{1} (but without needing to fit M_{1}), assuming all other parameter estimates would remain unchanged between M_{0} and M_{1}. Calculation of Lagrange multipliers utilizes information from the gradient (first derivative of the discrepancy function). Specifically, the slope of the likelihood function evaluated at the null-hypothesized value (θ_{0}) of a fixed parameter (typically zero), scaled by its curvature, provides a clue about how far θ_{0} is from the true θ, relative to the estimated sampling variability.
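Continuing the toy example above, the Lagrange multiplier (score) statistic can be computed from the restricted model alone: the gradient of the log-likelihood at the constrained value, squared and scaled by the Fisher information. This simple case (mean of normal data with known variance) is only a sketch of the general principle; here the score statistic coincides exactly with the LRT, illustrating their asymptotic equivalence, and the score divided by the information gives the expected parameter change.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
n = x.size

# Gradient of the log-likelihood at the constrained value mu = 0 (sd known = 1)
score = x.sum()
info = float(n)             # Fisher information for the mean of N(mu, 1)
lm = score ** 2 / info      # Lagrange multiplier (score) statistic, 1 df
p_value = stats.chi2.sf(lm, df=1)

# Expected parameter change: predicted estimate if the constraint were freed
epc = score / info          # equals the sample mean in this simple case
```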

Bentler and Chou also proposed multivariate Lagrange multipliers, which jointly test whether a set of constrained parameters are all zero in the population. In a multigroup model, a multivariate modification index can therefore test whether the same parameter should be freed simultaneously in every group, with df equal to the number of groups^{4}.

Table 2 shows the largest modification indices for each school. For example, the modification indices for the loading of x_{9} on the Visual factor suggest that the x_{9} task required similar visual skills as the other visual indicators. If one considered the standardized expected parameter changes in tandem with modification indices, as advised by Saris et al., the same candidate parameters would stand out.

Largest univariate and multivariate modification indices for fixed (to zero) parameters.

| Group | Parameter | MI | EPC | SEPC |
|---|---|---|---|---|
| Pasteur | Visual → x_{9} | 11.07^{a} | 0.32 | 0.32 |
| | Textual → x_{1} | 10.18^{a} | 0.89 | 0.76 |
| | x_{4} ↔ x_{6} | 11.28^{a} | −0.33 | −0.29 |
| Grant–White | Visual → x_{7} | 11.27^{a} | −0.39 | −0.38 |
| | Visual → x_{9} | 24.54^{a,b} | 0.58 | 0.57 |
| | x_{7} ↔ x_{8} | 24.82^{a,b} | 0.61 | 0.57 |
| Multivariate | Visual → x_{7} | 16.45^{a,b} | | |
| | Visual → x_{9} | 35.61^{a,b} | | |
| | x_{7} ↔ x_{8} | 29.01^{a,b} | | |

MI, modification index; EPC, expected parameter change; SEPC, standardized expected parameter change.
The bottom rows of Table 2 contain the multivariate modification indices, each of which jointly tests whether the same parameter is zero in both schools (e.g., freeing the Visual → x_{9} loading in both groups corresponds to χ^{2} = 35.61 in Table 2).

Next, I present a small-scale simulation study designed to evaluate the use of multivariate modification indices. The simulation was kept concise to focus on its purpose: a “proof of concept” that multivariate modification indices can control Type I errors better than univariate modification indices when the hypothesized model is approximately well specified but needs improvement. I focus on this situation because modification indices are unlikely to lead to the true data-generating model when a hypothesized model deviates substantially from it (MacCallum).

To simulate data in which the H_{0} of configural invariance was true but the H_{0} of exact model fit was false, I specified a two-factor CFA model for four groups, with three indicators for each of two common factors. The factor loadings were λ = 0.6, 0.7, and 0.8 for the first, second, and third indicator of each factor, respectively. The residual variances were specified as 1 – λ^{2} so that indicators were multivariate normal with unit variances. Factor variances were fixed at 1 (also in the analysis model, for identification), and all indicator and factor intercepts were zero. Factor correlations were 0.2, 0.3, 0.4, and 0.5 in Groups 1, 2, 3, and 4, respectively, so that population covariance matrices were not identical, although model configurations were equivalent.

Imperfect overall model fit was specified by setting two residual covariances in the four populations with values of 0.2 between the first and fourth indicators, corresponding to a moderate residual correlation of 0.2/0.64 = 0.31, and 0.15 between the second and fifth indicators, corresponding to a moderate residual correlation of 0.15/0.51 = 0.29. These parameters were specified in all groups, so the population models were configurally invariant. Fixing these two residual covariances to zero in the analysis model resulted in significant misfit.
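The population model just described can be constructed directly as Σ = ΛΦΛ′ + Θ. The sketch below builds each group's population covariance matrix from the stated parameters and draws multivariate normal data; the group sample size of 250 is an assumption for illustration, as the article's per-group n is not restated here.

```python
import numpy as np

# Factor loadings: 6 indicators, 2 factors, simple structure
lam = np.zeros((6, 2))
lam[:3, 0] = [0.6, 0.7, 0.8]   # indicators 1-3 load on factor 1
lam[3:, 1] = [0.6, 0.7, 0.8]   # indicators 4-6 load on factor 2

def population_cov(factor_cor):
    phi = np.array([[1.0, factor_cor], [factor_cor, 1.0]])  # Var = 1
    theta = np.diag(1.0 - np.diag(lam @ lam.T))  # residual var = 1 - lambda^2
    # Residual covariances omitted from the analysis model (same in all groups)
    theta[0, 3] = theta[3, 0] = 0.20   # indicators 1 and 4
    theta[1, 4] = theta[4, 1] = 0.15   # indicators 2 and 5
    return lam @ phi @ lam.T + theta   # Sigma = Lam Phi Lam' + Theta

sigmas = [population_cov(r) for r in (0.2, 0.3, 0.4, 0.5)]  # Groups 1-4

rng = np.random.default_rng(2018)
data = [rng.multivariate_normal(np.zeros(6), s, size=250) for s in sigmas]
```

Note that the residual covariances affect only off-diagonal elements, so each indicator retains unit variance in every population.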

The configural model fixed six cross-loadings and 15 residual covariances to zero, yielding 21 modification indices to consider in each of four groups. The Bonferroni-adjusted α level was therefore 0.05/21 = 0.0024 for the 4-df multivariate modification indices, as well as for each group's 1-df univariate modification indices. In each of 1,000 replications, I recorded criteria for overall fit (χ^{2}, CFI, and RMSEA) and for model respecification (univariate and multivariate modification indices). Within each replication, I also used a permutation test of configural invariance. When the model needed respecification, the parameter with the largest significant 4-df multivariate modification index was freed simultaneously in all groups before refitting the model.
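The decision rule above amounts to comparing each modification index against a Bonferroni-adjusted χ^{2} critical value. A small sketch of those quantities (the variable names are illustrative):

```python
from scipy import stats

alpha = 0.05
n_candidates = 21                    # fixed parameters tested per group
alpha_adj = alpha / n_candidates     # Bonferroni: 0.05 / 21 ~= 0.0024

# Univariate MI: 1-df test of one parameter in one group.
# Multivariate MI: 4-df joint test of that parameter in all four groups.
crit_uni = stats.chi2.isf(alpha_adj, df=1)
crit_multi = stats.chi2.isf(alpha_adj, df=4)
```

A multivariate index must clear a higher critical value, but it also accumulates evidence for the same parameter across all four groups into a single test, which is what protects the configural (equal-configuration) structure during respecification.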

Using overall model fit as the criterion for evaluating configural invariance led to rejecting the model in 99.9% of replications when a significant LRT was the criterion, and Hu and Bentler's recommended AFI cutoffs likewise led to frequent rejections. In contrast, the permutation test rejected the H_{0} of configural invariance in only 4.9% of the 1,000 replications, so the Type I error rate did not deviate substantially from the nominal α = 5%. This demonstration is consistent with previous results investigating the permutation method for testing ME/I in a two-group scenario (Jorgensen et al.).

Multivariate modification indices correctly detected that at least one of the two omitted residual covariances should be freed in 99.6% of the replications, and correctly detected both omitted parameters in 73.9% of replications. This was accomplished while maintaining nominal (4.4%) familywise Type I errors across iterative modifications. By comparison, using the largest 1-df univariate modification indices to guide respecification inflated familywise Type I error rates.

The aim of this paper was to advance two methods for testing configural invariance: how to test the correct H_{0} and how to test constraints in a poor-fitting configural model. A recently developed tool is a permutation test of the H_{0} of equivalent model configurations, which has shown promising control of Type I errors even when a configural model fits poorly (Jorgensen et al.). When a permutation test does not reject this H_{0} but the model still fits poorly, researchers might be motivated to explore ways to modify their model to better reflect the data-generating process. Multivariate Lagrange multipliers (Bentler and Chou) allow such modifications to be tested simultaneously across groups, preserving the configural structure during respecification.

The simulation was not designed to provide comprehensive information across a variety of conditions, but it contributes some evidence that these tools warrant further investigation. Given that fully invariant metric (17.8%) and scalar (42.2%) models are rejected many times more often than configural (5.5%) models (Putnick and Bornstein), tools that help researchers respecify models without violating configural invariance merit particular attention.

This paper focused only on the situation when the H_{0} of configural invariance was true. When the data provide evidence against the assumption of equivalent model configurations^{5}, comparisons of measurement parameters across groups would not be meaningful, so researchers should instead investigate how the groups' model configurations differ.

I conclude by reiterating the importance of substantive theory to guide the process of model respecification (Brown).

TJ is responsible for the data analysis (using openly available data), the design of the simulation study, and the writing of the manuscript.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

^{1}Although the configural model is only tested with a single model's χ^{2} statistic, this statistic is nonetheless equal to −2 times the difference between log-likelihoods of models representing two competing hypotheses: the hypothesized configural model (labeled H0 in the output from software such as Mplus) and the saturated model (labeled H1), which has zero df and therefore χ^{2} = 0. A Δχ^{2} test between the configural and saturated models would therefore be calculated by subtracting zero from the configural model's χ^{2} and df.

^{2}Partial invariance models posit that some, but not all, measurement parameters can be constrained to equality across groups or occasions, which still allows valid comparisons of latent parameters across groups (Byrne et al.).

^{3}This logical fallacy is referred to as affirming the consequent.

^{4}As stated on the Frontiers web page (

^{5}See Jorgensen et al. for guidance in that situation.