Bayesian power equivalence in latent growth curve models

Longitudinal studies are the gold standard for research on time-dependent phenomena in the social sciences. However, they often entail high costs due to multiple measurement occasions and a long overall study duration. It is therefore useful to optimize these design factors while maintaining a high informativeness of the design. Von Oertzen and Brandmaier (2013, Psychology and Aging, 28, 414) applied power equivalence to show that Latent Growth Curve Models (LGCMs) with different design factors can have the same power for likelihood-ratio tests on the latent structure. In this paper, we show that the notion of power equivalence can be extended to Bayesian hypothesis tests of the latent structure. Specifically, we show that the results of a Bayes factor design analysis (BFDA; Schönbrodt & Wagenmakers, 2018, Psychonomic Bulletin and Review, 25, 128) of two power equivalent LGCMs are equivalent. This will be useful for researchers who aim to plan for compelling evidence instead of frequentist power, and it provides a contribution towards more efficient procedures for BFDA.


Introduction
Researchers design experiments to gain knowledge of the world. In a world of limited resources, it is ethical to conduct these experiments efficiently (Halpern et al., 2002). Hunter and Hoff (1967) define research efficiency as 'the amount of useful information obtained per unit cost'. Often, longitudinal studies entail especially high costs. These accrue either due to a long overall study duration, for example when a treatment has to be administered over a long period of time, or due to a large number of measurement occasions, for example when non-reusable testing material is spent at each testing event. It is therefore especially important to plan longitudinal studies carefully so that an optimal balance between study costs and the expected gain in information can be achieved (Brandmaier et al., 2015).
Longitudinal designs can be statistically evaluated with a sub-group of structural equation models (SEMs; for an overview see e.g., Baltes et al., 1988) called Latent Growth Curve Models (LGCMs; see e.g., Duncan & Duncan, 2009). In a simple LGCM, the values of a variable across several measurement occasions (x_j) are modeled as a combination of a latent intercept (I) and a latent slope (S). The intercept has a constant influence on the measurement occasions, while the slope adds time-dependent linear changes (see Figure 1). To add nonlinear changes, a quadratic or higher-order term can be introduced (Duncan & Duncan, 2009). For example, Lindenberger and Ghisletta (2009) investigated cognitive and sensory decline in elderly participants with an LGCM. In this context, the intercept parameter captured the participants' initial abilities and the slope parameter captured the extent of the linear time-dependent decline.

[This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. *Correspondence should be addressed to Angelika M. Stefan, Department of Psychology, Faculty of Behavioral and Social Sciences, University of Amsterdam, Nieuwe Achtergracht 129-B, 1018 WS Amsterdam, The Netherlands (email: a.m.stefan@uva.nl).]
An advantage of LGCMs is that they allow the direct estimation of between-subjects variability in the latent intercept and slope, described as the variance of the intercept (σ²_I) and the variance of the slope (σ²_S) in the model. These random effects represent the individual differences in initial performance and change, respectively (Rogosa & Willett, 1985). In an LGCM where the intercept reflects the initial status of the observed variable, the intercept-slope covariance (σ_IS) reflects the extent to which individual differences in the initial status correlate with subsequent change (Rovine & Molenaar, 1999). Thus, in the example used earlier (Lindenberger & Ghisletta, 2009), the variance of the intercept can be interpreted as the variability of cognitive and sensory abilities of participants at the beginning of the study. The variance of the slope corresponds to differences in the steepness of the cognitive decline between participants. A positive covariance between intercept and slope in the example would show that participants with higher initial abilities suffer from a more rapid decline.
In a frequentist setting, an important aspect of the quality of a design is its statistical power, which is defined as the long-term probability of correctly rejecting the null hypothesis under a given population effect size that differs from zero. The statistical power of a design depends on the size of the effect in the population, on the significance level α of the hypothesis test, on the sample size N, and on the measurement design (Brandmaier et al., 2018; Cohen, 1992). For most traditional hypothesis tests, such as a z-test or a t-test, it is possible to calculate the statistical power analytically (Murphy et al., 2014). However, for most SEMs there is no analytical solution available, so the statistical power of a model has to be estimated via numerical approximations (e.g., Saris & Satorra, 1993) or through simulations (e.g., Hertzog et al., 2008; Muthén & Muthén, 2002). Von Oertzen (2010) introduced the concept of power equivalence, which describes the situation in which two designs have the same statistical power to detect a true effect. Power equivalence can be used to find research designs that are most resource efficient among designs with the same power. For example, von Oertzen and Brandmaier (2013) illustrated how power equivalence facilitates finding a cost-optimal solution among multiple longitudinal designs. In longitudinal designs, power equivalence can be established by balancing the overall duration of the study and the number of measurement occasions. To keep the power constant, more measurement occasions are required if the overall study duration is shortened. By comparing multiple power-equivalent longitudinal designs based on data and cost estimates from the Berlin Aging Study (BASE; Ghisletta et al., 2006), von Oertzen and Brandmaier (2013) showed that the overall study costs could be reduced by 16% compared to the original design while keeping the statistical power with respect to the variance of slopes constant.
Thus, power equivalence can facilitate the planning of future studies in two ways: First, instead of conducting multiple potentially resource-intensive power analyses for different designs, a power analysis has to be computed only once for a theoretically infinite number of power-equivalent designs. Second, knowing that certain designs do not differ in an important aspect of design quality, researchers can focus on minimizing the costs (Hunter & Hoff, 1967).
Conceptually, power equivalence as applied in von Oertzen and Brandmaier (2013) can be described by the following procedure. Any LGCM can be reduced to a power equivalent model with a minimum number of observed variables, from which further power equivalent models can be derived. These power equivalent models balance different design parameters, for example the number of measurement occasions (j = 1, . . ., k) and the time distance between measurement occasions, modeled in the path coefficients from the slope to the observed variables x_j, so that the power to detect an effect (e.g., σ²_S > 0) is equivalent.¹ This is reflected in the effective error variance σ²_eff, which is shared by all power equivalent models. Figure 2 schematically depicts this trade-off: a linear trend is measured with three power equivalent designs which differ in their number of measurement occasions and their overall study duration.
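For equally spaced designs, this trade-off can be made concrete in a few lines of code. The sketch below uses a closed form for the effective error of the slope as we recall it from von Oertzen and Brandmaier (2013) (readers should verify it against the original before relying on it), and finds, by bisection, the study duration at which a design with fewer occasions matches the effective error of a reference design. The function names and the numeric values are our own illustration.

```python
import numpy as np

def effective_error(times, var_e, var_i):
    # Effective error of the latent slope for measurement times t_j, residual
    # variance var_e and intercept variance var_i, following (our reading of)
    # von Oertzen & Brandmaier (2013). With var_i -> infinity this reduces to
    # the familiar var_e / sum((t - mean(t))^2) of a per-person regression slope.
    t = np.asarray(times, dtype=float)
    k = len(t)
    denom = np.sum(t ** 2) - np.sum(t) ** 2 / (k + var_e / var_i)
    return var_e / denom

def match_duration(k_new, var_e, var_i, target, lo=0.1, hi=100.0):
    # Find the study duration T for k_new equally spaced occasions on [0, T]
    # whose effective error equals `target` (bisection; the effective error
    # decreases monotonically as T grows).
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        t = np.linspace(0.0, mid, k_new)
        if effective_error(t, var_e, var_i) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical reference design: 7 occasions over 6 time units.
target = effective_error(np.linspace(0.0, 6.0, 7), var_e=10.0, var_i=10.0)
# Duration a 3-occasion design needs for the same effective error:
T3 = match_duration(3, var_e=10.0, var_i=10.0, target=target)
```

As expected, T3 exceeds 6: with fewer measurement occasions, a longer study is needed to keep the effective error, and hence the power, constant.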
In recent years, the replicability crisis (Pashler & Wagenmakers, 2012) as well as continuing criticism of the frequentist hypothesis testing framework (e.g., Edwards et al., 1963; Wagenmakers, 2007) have led to a growing interest in Bayesian methods for statistical inference.² The single most important quantity in Bayesian hypothesis testing is the Bayes factor (Kass & Raftery, 1995). Mathematically, the Bayes factor (BF₁₀) is defined as the ratio of the marginal likelihood of the data under the alternative model, p(D | H₁), and the marginal likelihood of the data under the null model, p(D | H₀). It provides a continuous quantification of the evidence in favor of one statistical model compared to another. Since most researchers aim to collect compelling evidence in a study, both very large and very small Bayes factors can be regarded as desirable outcomes of a study. For example, a Bayes factor of BF₁₀ = 10 indicates that the prior odds have been updated by a factor of ten in favor of the alternative hypothesis after observing the data, while a Bayes factor of BF₁₀ = 1/10 indicates an update by a factor of ten in favor of the null hypothesis. How large the Bayes factors yielded by an experiment are depends on the tested models (described by likelihoods and prior distributions), on the population effect size, on the amount of collected data, that is, the number of observations in the sample, and on the measurement design (Stefan et al., 2019). Assuming that the models are determined by the research question, only the sample size and measurement design can be directly influenced by the researcher. This shows that researchers who use Bayesian statistics to evaluate their data also need to balance the costs and the information gain of their designs; in other words, design planning is an important topic from a Bayesian viewpoint, too.
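The definition of the Bayes factor as a ratio of marginal likelihoods can be illustrated with a small numerical sketch. The example below is a toy normal-mean test, not the LGCM setting of this paper; the model, prior, and integration grid are our own choices. It computes BF₁₀ for a point null H₀: μ = 0 against H₁: μ ~ N(0, 1), for data assumed i.i.d. normal with known unit variance.

```python
import numpy as np

def log_lik(x, mu):
    # Log likelihood of i.i.d. N(mu, 1) data.
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

def bf10(x, prior_sd=1.0):
    # BF10 = p(D | H1) / p(D | H0), with H0: mu = 0 and H1: mu ~ N(0, prior_sd^2).
    grid = np.linspace(-10.0, 10.0, 2001)    # integration grid for mu
    dmu = grid[1] - grid[0]
    prior = np.exp(-0.5 * (grid / prior_sd) ** 2) / (prior_sd * np.sqrt(2 * np.pi))
    lik = np.array([np.exp(log_lik(x, mu)) for mu in grid])
    marginal_h1 = np.sum(lik * prior) * dmu  # Riemann-sum approximation
    marginal_h0 = np.exp(log_lik(x, 0.0))    # point null: plain likelihood
    return marginal_h1 / marginal_h0
```

For data centred away from zero, BF₁₀ exceeds 1; for data centred at zero, it falls below 1, rewarding the more parsimonious null model.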
How can researchers find an adequate sample size or measurement design so that their study likely yields compelling evidence, but is also designed economically? Schönbrodt and Wagenmakers (2018) proposed a framework called 'Bayes Factor Design Analysis' (BFDA) that enables researchers to find the expected Bayes factors of their design. Their approach is based on Monte Carlo simulations in which data are repeatedly simulated under a population model ('design prior') and a Bayesian hypothesis test is conducted for each of these samples. BFDA is applicable to both sequential Bayesian designs, where the sample size is gradually increased until a prespecified Bayes factor is reached, and fixed-N designs, where the sample size is specified prior to data collection. For the latter, more traditional sampling procedure, a BFDA results in a distribution of Bayes factors that enables researchers to assess the informativeness of their planned design.
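A fixed-N BFDA can be sketched in a few lines. The example below again uses a toy normal-mean test rather than an LGCM; the analytic BF₁₀ exploits the fact that, under H₁: μ ~ N(0, g), the sample mean is marginally normal, so each simulated dataset can be reduced to its sample mean. All names and settings are our own illustration.

```python
import numpy as np

def bf10_mean(xbar, n, g=1.0):
    # Analytic BF10 for H0: mu = 0 vs H1: mu ~ N(0, g), data i.i.d. N(mu, 1).
    # The sample mean is sufficient: xbar ~ N(0, 1/n) under H0 and
    # xbar ~ N(0, g + 1/n) marginally under H1.
    v0, v1 = 1.0 / n, g + 1.0 / n
    return np.sqrt(v0 / v1) * np.exp(0.5 * xbar ** 2 * (1.0 / v0 - 1.0 / v1))

def bfda_fixed_n(n, mu_pop, n_sim=2000, seed=1):
    # Monte Carlo BFDA: simulate n_sim datasets of size n under the design
    # prior (here a point value mu_pop) and return the resulting Bayes factors.
    rng = np.random.default_rng(seed)
    xbars = rng.normal(mu_pop, 1.0 / np.sqrt(n), size=n_sim)
    return bf10_mean(xbars, n)

bfs_h1 = bfda_fixed_n(n=50, mu_pop=0.8)   # design prior: a true effect exists
bfs_h0 = bfda_fixed_n(n=50, mu_pop=0.0)   # design prior: the null is true
```

Summaries such as `np.median(bfs_h1)` or the proportion of Bayes factors above a desired threshold then tell the researcher how informative the planned fixed-N design is expected to be under each design prior.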
In this paper, we show that the notion of power equivalence can be extended to Bayesian hypothesis tests. Specifically, we show that the results of a BFDA for a fixed-N design (Schönbrodt & Wagenmakers, 2018) of two power equivalent models as defined by von Oertzen (2010) are equivalent. Our findings are not only relevant on a conceptual level, as they build a bridge between frequentist and Bayesian methods. They also provide Bayesians with a means of design justification in longitudinal settings and help to save resources in design planning, because computationally expensive BFDAs need to be conducted only once for power equivalent designs. Our paper is structured as follows: First, we formally prove the equivalence of BFDA results for power equivalent models. In a second step, we substantiate our proof with a simulation for power equivalent LGCMs. Then, we provide an application example that illustrates how Bayesian power equivalence can facilitate design planning. We discuss the implications and limitations of our findings at the end of this article.

Formal proof of BFDA equivalence for power equivalent models
In this section, we show formally that two power equivalent models with the same parameter set θ will also produce the same distribution of the Bayes factor when comparing two hypotheses about θ under data generated by a population model. We assume that both hypotheses are given by prior distributions π₁ and π₂ for θ, where as usual one or both can be point hypotheses, that is, degenerate prior distributions with all mass at a specific point.
Power equivalence on multivariate normal models, as defined in von Oertzen (2010), can be expressed as a combination of two basic power equivalent operations. The first is a linear transformation of the observed variables; the second is the omission of observed variables whose probability distribution is constant with respect to θ and which are independent of the other variables. For example, in an LGCM, the linear transformation transforms the measurement model into a minimal model with one observed variable that depends on the latent slope and a number of variables that are independent of the latent slope (and hence of the slope variance parameter). An example of a power equivalent transformation of an LGCM can be seen in Figure 3. The mathematical details of the calculation can be found in the Appendix.
Let (S, m) be the covariance matrix and mean estimated from a sample, and (Σ, μ) those of a model. In the following, we write L_{Σ,μ}(S, m) for the minus two log likelihood, that is,

L_{Σ,μ}(S, m) = −2 log L(S, m | Σ, μ).

We start by showing two simple lemmas.
Lemma 1. For any multivariate normal model with covariance matrix Σ and mean μ, an orthogonal transformation Q of the model space does not change the likelihood function.
Proof. The minus two log likelihood of a multivariate normal model with parameters μ and Σ, for a dataset with mean m and covariance matrix S per participant, is

L_{Σ,μ}(S, m) = c + ln|Σ| + Tr(Σ⁻¹S) + (m − μ)ᵀ Σ⁻¹ (m − μ).

Transforming all four distribution parameters with Q results in

L_{QΣQᵀ,Qμ}(QSQᵀ, Qm) = c + ln|QΣQᵀ| + Tr(QΣ⁻¹Qᵀ QSQᵀ) + (m − μ)ᵀ QᵀQ Σ⁻¹ QᵀQ (m − μ)
= c + ln|QΣQᵀ| + Tr(QΣ⁻¹SQᵀ) + (m − μ)ᵀ Σ⁻¹ (m − μ).

Since neither the determinant nor the trace is changed by an orthogonal transformation, it follows that

−2 log L(QSQᵀ, Qm | QΣQᵀ, Qμ) = c + ln|Σ| + Tr(Σ⁻¹S) + (m − μ)ᵀ Σ⁻¹ (m − μ) = L_{Σ,μ}(S, m). □

Lemma 2. For any multivariate normal model with covariance matrix Σ and mean μ, omitting observed variables whose distributions are constant with respect to some parameter set θ and which are independent of all other variables does not change the likelihood ratio of any two parameter values θ₁ and θ₂.
Proof. For simplicity of notation, we prove that the difference of the minus two log likelihoods is constant. Let

Σ = [ Σ₁(θ)  0 ;  0  Σ₂ ]

be the separation of Σ, and μ = (μ₁(θ), μ₂) the separation of μ, into a first part that depends on θ and a second part that does not. We separate the data distribution accordingly. Note that the covariances between the two blocks in the data distribution are not relevant for the likelihood, that is, we can write

L_{Σ(θ),μ(θ)}(S, m) = c + ln|Σ₁(θ)| + Tr(Σ₁(θ)⁻¹ S₁) + (m₁ − μ₁(θ))ᵀ Σ₁(θ)⁻¹ (m₁ − μ₁(θ))
+ ln|Σ₂| + Tr(Σ₂⁻¹ S₂) + (m₂ − μ₂)ᵀ Σ₂⁻¹ (m₂ − μ₂).

When taking the difference of the minus two log likelihoods for θ₁ and θ₂, the second part of the equation and c cancel, so that the difference solves to

L_{Σ(θ₁),μ(θ₁)}(S, m) − L_{Σ(θ₂),μ(θ₂)}(S, m)
= ln|Σ₁(θ₁)| + Tr(Σ₁(θ₁)⁻¹ S₁) + (m₁ − μ₁(θ₁))ᵀ Σ₁(θ₁)⁻¹ (m₁ − μ₁(θ₁))
− ln|Σ₁(θ₂)| − Tr(Σ₁(θ₂)⁻¹ S₁) − (m₁ − μ₁(θ₂))ᵀ Σ₁(θ₂)⁻¹ (m₁ − μ₁(θ₂))
= L_{Σ₁(θ₁),μ₁(θ₁)}(S₁, m₁) − L_{Σ₁(θ₂),μ₁(θ₂)}(S₁, m₁). □

We conclude that the likelihood ratio remains constant under both basic power equivalent operations, and hence under all combinations of them. Since the Bayes factor is the ratio of two prior-weighted likelihoods, we conclude further that the Bayes factor is unaltered by power equivalent transformations for any data set (S, m) and parameter values θ₁ and θ₂. Thus, in particular, the distribution of the Bayes factor is identical for any priors π₁ and π₂ and any data distribution:

Corollary 3. If (Σ_A(θ), μ_A(θ)) and (Σ_B(θ), μ_B(θ)) are two power equivalent multivariate normal models A and B, then under any distribution of data sets (S, m) and any prior distributions π₁ and π₂ to be compared, the corresponding distribution of the Bayes factor is identical for both models.
Proof. For simplicity, we omit the explicit separation of S and m into S₁, S₂ and m₁, m₂, respectively, because the irrelevant parts are ignored by the likelihood function (see proof of Lemma 2). For any specific outcome (S, m) of the random variable representing the data, let (S*, m*) be the power equivalent transformation of the data as explained at the beginning of this section. The Bayes factor for the first model is given by

BF_A(S, m) = ∫ L(S, m | Σ_A(θ), μ_A(θ)) dπ₁(θ) / ∫ L(S, m | Σ_A(θ), μ_A(θ)) dπ₂(θ).

By Lemmas 1 and 2, a power equivalent transformation changes the likelihood at most by a factor that is constant in θ; this factor cancels in the ratio above, so that BF_A(S, m) = BF_B(S*, m*). Since the Bayes factor is identical for both models for any specific outcome of the data, its distribution under any random distribution of (S, m) is identical for both power equivalent models. □
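Lemma 1 is easy to verify numerically. The sketch below is our own check, not part of the original derivation: it builds a random multivariate normal model and random sample moments, applies a random orthogonal matrix Q to all four quantities, and confirms that the minus two log likelihood (up to the additive constant c) is unchanged.

```python
import numpy as np

def minus2ll(S, m, Sigma, mu):
    # -2 log likelihood of sample moments (S, m) under N(mu, Sigma),
    # dropping the additive constant c:
    # ln|Sigma| + Tr(Sigma^-1 S) + (m - mu)' Sigma^-1 (m - mu)
    d = m - mu
    Si = np.linalg.inv(Sigma)
    return np.log(np.linalg.det(Sigma)) + np.trace(Si @ S) + d @ Si @ d

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)               # model covariance (positive definite)
B = rng.normal(size=(p, p))
S = B @ B.T + p * np.eye(p)                   # sample covariance
mu, m = rng.normal(size=p), rng.normal(size=p)
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))  # random orthogonal matrix

l_orig = minus2ll(S, m, Sigma, mu)
l_rot = minus2ll(Q @ S @ Q.T, Q @ m, Q @ Sigma @ Q.T, Q @ mu)
# l_orig and l_rot agree up to floating-point error
```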

Simulation study
We performed a simulation study to illustrate the equivalence of Bayes factor distributions for power equivalent LGCMs. As in von Oertzen and Brandmaier (2013), we concentrated on a single parameter of interest: σ²_S, the interindividual variance in the latent slope parameter. The focal Bayesian hypothesis test therefore compared the two hypotheses H₀: σ²_S = 0 and H₁: σ²_S ~ π₁, where π₁ is a prior distribution that allows the parameter σ²_S to vary. We operationalized this prior distribution as a gamma distribution with a shape parameter of k = 1 and a rate parameter of β = 0.5. This prior places most weight on parameter values between 0 and 6 and can be considered an example of an informed prior for typical effect sizes in psychology (see e.g., Duncan et al., 2006; Iddekinge et al., 2009; von Oertzen & Brandmaier, 2013). In this special case, all parameters of the model apart from σ²_S are considered to be known and fixed. Thus, the Bayes factor can be calculated through a simple integration procedure.
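The integration step can be sketched as follows. The code below mimics the structure of the test (a point null on a variance component against a Gamma(shape = 1, rate = 0.5) prior, which is an Exponential(0.5) distribution) in a deliberately simplified setting we chose for illustration: a single zero-mean observed variable with variance 1 + σ²_S and all other parameters known. It illustrates the integration procedure, not the exact model used in our simulations.

```python
import numpy as np

def log_lik_var(ss, n, tau):
    # Log likelihood of n zero-mean normal observations with variance 1 + tau,
    # where ss = sum of squared observations (the sufficient statistic).
    v = 1.0 + tau
    return -0.5 * n * np.log(2 * np.pi * v) - 0.5 * ss / v

def log_bf10_var(y, rate=0.5):
    # Log Bayes factor for H0: tau = 0 versus H1: tau ~ Gamma(shape=1, rate),
    # i.e., tau ~ Exponential(rate). The marginal likelihood under H1 is
    # approximated by a Riemann sum over a grid of tau values.
    ss, n = np.sum(y ** 2), len(y)
    grid = np.linspace(1e-6, 40.0, 4000)   # integration grid for tau
    dtau = grid[1] - grid[0]
    prior = rate * np.exp(-rate * grid)
    log_l = log_lik_var(ss, n, grid)
    shift = log_l.max()                    # stabilize the exponentials
    log_m1 = shift + np.log(np.sum(np.exp(log_l - shift) * prior) * dtau)
    log_m0 = log_lik_var(ss, n, 0.0)       # point null: tau = 0
    return log_m1 - log_m0
```

For data generated with a substantial true variance component the log Bayes factor is strongly positive, while for data generated under the null it is negative, favoring H₀.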
For our simulation study, we conducted a total of 36 BFDAs, where each BFDA result is based on 1,000 Bayes factors. All BFDAs were performed using the following Monte Carlo simulation algorithm:
1. find three power equivalent models with the given parameters for σ²_E and σ²_I;
2. simulate 1,000 datasets for each of the models given a certain population parameter (design prior) for σ²_S;
3. compute the Bayes factor for each of the datasets.
We compare the results of a fixed-N BFDA for three power equivalent LGCMs under 12 different population models (design priors). The three power equivalent models have 7, 5, and 3 equally spaced measurement occasions, respectively, and were computed using the equations provided in von Oertzen and Brandmaier (2013; see the Appendix below). In the simulations, we varied the variance of the intercept σ²_I, the residual variance σ²_E, and the true variance of the slope (σ²_S | H₁). All BFDAs were conducted for a sample size of N = 300. Figure 4 shows the distributions of log Bayes factors for the three power equivalent models under all simulated conditions. Overall, the distributions are nearly identical for the power equivalent models, which illustrates the formal proof of BFDA equivalence given in the previous section of this article. Generally, the Bayes factors are very large, which is due to the relatively large dataset and the assumption that several important parameter values of the model are already known. The small remaining differences in the Bayes factor distributions can be explained by random variation in the simulation process.
The simulation code as well as the simulation results are openly accessible at https://osf.io/hkt4p/.

Application example: Effects of a mindfulness training
In this section, an applied example is discussed that illustrates how the notion of power equivalence can be used to facilitate a priori design analyses for longitudinal studies. We build on a study by Kiken et al. (2015), who investigated the psychological effects of a mindfulness training. Mindfulness is a cognitive state of nonjudgmental awareness in which an individual pays attention to the thoughts, emotions, and sensations of the moment. Kiken et al. (2015) measured state mindfulness with the Toronto Mindfulness Scale (Lau et al., 2006) at seven equally spaced measurement occasions during an ongoing mindfulness training that was directed at increasing the participants' general level of mindfulness. Using an LGCM, they concluded that while the training led on average to an increase in mindfulness, there were noticeable differences between individuals regarding the amount of change, that is, there was considerable variability in the slope of state mindfulness.

In this example application, we assume that researchers developed a new training method that is supposed to be equally effective for all participants. As the researchers would like to quantify evidence in favor of the null hypothesis (σ²_S = 0), they decide to use Bayesian hypothesis testing (Wagenmakers et al., 2018). When planning the study, they have two goals: making sure that their envisioned sample size is large enough to obtain strong evidence in favor of the null hypothesis (BF₀₁ ≥ 10) if the null hypothesis is true, and minimizing the overall study costs. For this example, we roughly estimate that the costs for each measurement occasion are $10 per participant (e.g., for participant compensation or data entry), and that the running costs are $500 per week (e.g., for renting lab space and employing assistants to run the study). We further assume that the envisioned sample size of the researchers is N = 50.
Thus, when planning the study, two design questions come up: (1) Is a sample size of N = 50 enough to achieve strong evidence in favor of the null hypothesis when the null hypothesis is true, and (2) which of the power equivalent designs is most cost-efficient?
First, the researchers can conduct a BFDA based on the design and results of the original study, that is, seven equally spaced measurement occasions, a variance of intercepts of σ²_I = 43.6, and an error variance of σ²_E = 21.45. The results for a sample size of N = 50 show that the Bayes factor (BF₁₀) will be smaller than 0.1 in 99.8% of the cases; that is, there is a high chance of obtaining strong evidence in favor of the null hypothesis if the null hypothesis is true (see Figure 5). Convinced by this high degree of informativeness, the researchers can now proceed to find the most cost-efficient design with the same power. Using power equivalence, the researchers can come up with several power equivalent measurement designs. Table 1 shows three power equivalent designs with 3, 7, and 10 measurement occasions, respectively (see the Appendix for details about the computation). All these designs share the same Bayes factor distribution based on the BFDA of the original design. However, they differ in their respective costs. As we can see from the total costs in Table 1, the measurement design with three measurement occasions is the most cost-efficient. Prolonging the overall study duration by 1.2 weeks while reducing the number of measurement occasions to three can therefore lead to a cost reduction of roughly 20%. There is no need to recalculate the BFDA because the researchers already know that all power equivalent designs are equally informative.
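The cost comparison itself is simple arithmetic. The sketch below reproduces a roughly 20% saving under the additional assumption, ours and purely for illustration (the actual durations are given in Table 1), that the original seven-occasion design runs for 7 weeks, so the three-occasion design runs for 8.2 weeks.

```python
def total_cost(n, occasions, weeks, per_occasion=10.0, per_week=500.0):
    # Per-participant, per-occasion costs plus weekly running costs,
    # using the cost estimates from the example ($10 and $500).
    return n * occasions * per_occasion + weeks * per_week

# Hypothetical durations: 7 weeks for the original 7-occasion design,
# 1.2 weeks longer for the power equivalent 3-occasion design.
cost_original = total_cost(n=50, occasions=7, weeks=7.0)
cost_reduced = total_cost(n=50, occasions=3, weeks=8.2)
saving = 1.0 - cost_reduced / cost_original   # roughly 0.20
```

Under these assumed durations, dropping four measurement occasions saves $2,000 in per-occasion costs while the extra 1.2 weeks add only $600 in running costs.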

Discussion
Reducing study costs while keeping the results informative is an important practical objective of experimental design (Hunter & Hoff, 1967). In longitudinal studies, a cost reduction can often be achieved by finding a trade-off between the total duration of the study and the number of measurement occasions. In a frequentist setting, researchers can optimize this trade-off while keeping the design informative by comparing several power equivalent models (von Oertzen, 2010; von Oertzen & Brandmaier, 2013). While these models all have the same statistical power (Cohen, 1992), they exhibit different combinations of overall study length and number of measurement occasions. In this paper, we showed that the notion of power equivalence can be transferred to a Bayesian hypothesis testing framework. Specifically, we showed that power equivalent models yield the same Bayes factor distributions in a Bayes factor design analysis (BFDA; Schönbrodt & Wagenmakers, 2018). Therefore, power equivalent designs are equally informative from both a frequentist and a Bayesian viewpoint. This shows that power equivalent models can also be used in Bayesian design planning to negotiate trade-offs between costs and informativeness in longitudinal studies.
Our findings can be interpreted as an extension of both power equivalence (von Oertzen, 2010; von Oertzen & Brandmaier, 2013) and BFDA (Schönbrodt & Wagenmakers, 2018). From the perspective of power equivalence, we provide a straightforward generalization of the approach and show that it can also be used in the Bayesian design planning process. This highlights the relevance of the approach and raises the question of whether further generalizations are possible. For example, the general notion of power equivalence could be extended to statistical models other than Latent Growth Curve Models (LGCMs). Our results suggest that this would be a relevant contribution to design planning methods from both a frequentist and a Bayesian viewpoint. From the perspective of BFDA, our findings provide a first step towards a simplification of the procedure. Since the approach is based on Monte Carlo simulations, conducting a BFDA can be computationally expensive. Finding models that yield the same BFDA results can substantially facilitate the process of Bayesian design planning because a BFDA needs to be conducted only once for all of these power equivalent models. Our results show that finding such power equivalent models is possible. Future research could be directed at finding further conditions for the equality of BFDA results and at extending our results to sequential Bayesian designs.
By making power equivalence available to a new statistical domain, our study increases its practical applicability to the planning of experimental designs. Additionally, we make it easy for researchers to optimize their study designs based on power equivalence and BFDA by providing the code for all analyses conducted in this paper online (see https://osf.io/hkt4p/). By using well-documented functions, we hope to encourage researchers to reuse our code and adapt it to their own practical applications. However, the practical applicability of power equivalence in experimental design is currently still restricted by two important limitations. First, the mathematical derivation of power equivalence requires that parameters which are not part of the hypothesis (in our example, the variance of the intercept σ²_I and the error variance σ²_E) are fixed. In practice, this is a strong assumption. However, if these parameters are not known, they (or the effective error σ²_eff) can be estimated prior to the computation of power equivalence. A second limitation is that power equivalence currently requires a fixed structure matrix (von Oertzen, 2010), so it is only directly applicable to models like LGCMs, Change Score Models, Dual Change Score Models, Latent Differential Models, and basic models (e.g., ANOVAs). Nevertheless, these describe a considerable part of the SEMs used today.
From a broader perspective, our findings illustrate that despite methodological differences and occasional heated debates between frequentist and Bayesian methods and their respective proponents (see e.g., Wagenmakers et al., 2008), relevant insights can often be gained from describing the world from both perspectives. We hope that by showing how the notion of power equivalence and the BFDA method can be combined, we have made a contribution towards an increased feasibility of Bayesian experimental planning. Eventually, we hope that the existence of straightforward methods for design planning will encourage more researchers to plan their study designs for efficiency and informativeness.