Bayesian evaluation of informative hypotheses for multiple populations
Abstract
The software package Bain can be used for the evaluation of informative hypotheses with respect to the parameters of a wide range of statistical models. For pairs of hypotheses the support in the data is quantified using the approximate adjusted fractional Bayes factor (BF). Currently, the data have to come from one population or have to consist of samples of equal size obtained from multiple populations. If samples of unequal size are obtained from multiple populations, the BF can be shown to be inconsistent. This paper examines how the approach implemented in Bain can be generalized such that multiple‐population data can properly be processed. The resulting multiple‐population approximate adjusted fractional Bayes factor is implemented in the R package Bain.
1 Introduction
This paper is the most recent addition to a sequence of papers in which an alternative to null hypothesis significance testing has been developed. Important landmarks in this development are Klugkist, Laudy, and Hoijtink (2005a) and Kuiper, Klugkist, and Hoijtink (2010) who added order constrained hypotheses to the classical null hypothesis and showed in the context of analysis‐of‐variance models how these can be evaluated using the Bayes factor (Kass & Raftery, 1995); Mulder, Hoijtink, and de Leeuw (2012) who generalized the approach to Bayesian evaluation of informative hypotheses (Hoijtink, 2012), that is, hypotheses specified using equality and inequality (or order) constraints among the parameters of multivariate normal linear models; Gu, Mulder, Dekovic, and Hoijtink (2014) who developed a Bayes factor for the evaluation of inequality‐constrained hypotheses in a rather wide range of statistical models; and Mulder (2014) and Gu, Mulder, and Hoijtink (2018) who generalized the latter Bayes factor into the approximate adjusted fractional Bayes factor (AAFBF, henceforth abbreviated to BF) which can be used to evaluate informative hypotheses for one‐population data for a wide range of statistical models such as normal linear models, logistic regression models, confirmatory factor analysis, and structural equation models.
The interested reader is referred to http://informative-hypotheses.sites.uu.nl/, where all the books, dissertations, papers, and software produced during the course of this development are presented.
The BF is simple to compute and the only inputs needed are estimates of the model parameters, the corresponding covariance matrix, and the sample size. However, as will be discussed in this paper, the BF is inconsistent if samples of unequal size are obtained from multiple populations (similar to O'Hagan's, 1995, fractional Bayes factor, as shown by De Santis & Spezzaferri, 2001). This paper examines how the BF can be generalized into the multiple‐population approximate adjusted fractional Bayes factor (MBF). This Bayes factor is simple to compute too; the only inputs needed are estimates of the model parameters, separate estimates of the corresponding covariance matrix for each population, and the sample size obtained from each population. As will be shown, the MBF is consistent and can therefore be used for testing informative hypotheses with respect to multiple populations.
With the availability of the MBF (and corresponding software) researchers have a viable alternative to null hypothesis significance testing. In a wide range of statistical models, the null hypothesis can be replaced by informative hypotheses, and the p‐values can be replaced by the MBF. The recent and current critical appraisal of null hypothesis significance testing in the literature will not be reiterated here. However, the interested reader is referred to Cohen (1994) who called the null hypothesis the nil hypothesis because he could not come up with research examples in which the null hypothesis might be a realistic representation of the population of interest. This point of view was further elaborated by Royall (1997, pp. 79–81) who claims that the null hypothesis cannot be true, and consequently, that data are not needed in order to be able to reject it. However, the interested reader is also referred to Wainer (1999) who highlights that there are situations where, without dispute, the null hypothesis is relevant. Landmark papers criticizing the use of p‐values and significance levels are Ioannidis (2005) and Wagenmakers (2007), among others. The latter paper also motivates and illustrates the replacement of p‐values by Bayesian hypothesis testing using the Bayesian information criterion (Schwarz, 1978; Raftery, 1995). However, the interested reader is also referred to the American Statistical Association's statement on p‐values (Wasserstein & Lazar, 2016) which gives a to‐the‐point and balanced overview of what can and cannot be done with p‐values, and to Benjamin, Berger, Johannesson, Nosek, Wagenmakers, and Johnson (2018) who propose to redefine statistical significance.
The focus of this paper is on the evaluation of informative hypotheses using the Bayes factor. Note that model selection criteria like the Akaike and Bayesian information criteria (Raftery, 1995; Schwarz, 1978) cannot be used (Mulder, Klugkist, Meeus, van de Schoot, Selfhout, & Hoijtink, 2009; Section 3). The penalty for model complexity in both criteria is a function of the number of parameters in the model at hand. Since the number of parameters in an unconstrained hypothesis (e.g. Hu : θ1, θ2, θ3) is the same as in a constrained hypothesis (e.g. H1 : θ1 > θ2 > θ3), the penalty does not reflect the fact that H1 is more parsimonious than Hu. This problem is solved by Kuiper and Hoijtink (2013) who present the generalized order‐restricted information criterion (GORIC), which is a generalization of the Akaike information criterion with a penalty term that does properly reflect the fact that H1 is more parsimonious than Hu. However, the GORIC can only be applied in the context of the multivariate normal linear model, while, as discussed above, the range of application of the (M)BF is not limited to the multivariate normal linear model.
As is elaborated in Van de Schoot, Hoijtink, Romeijn, and Brugman (2012), the penalty for model complexity used by the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Van der Linde, 2002) is also not suited to quantifying how parsimonious an informative hypothesis is. Using a modification of the loss function used by the DIC, they obtain the prior information criterion (PIC) which in the examples provided can be used to evaluate informative hypotheses. However, as was shown by Mulder (2014), using the Bayes factor results in more desirable selection behaviour when testing constrained hypotheses than using the PIC.
Silvapulle and Sen (2004) show how so‐called Type A testing problems (evaluating a null hypothesis against an informative hypothesis) and Type B testing problems (evaluating an informative hypothesis against an unconstrained hypothesis) can be evaluated using p‐values in a wide range of statistical models. Those in favour of null hypothesis significance testing are well advised to consult this book and the R packages restriktor and ic.infer. The main limitation of this approach is that it cannot be used to directly compare two competing informative hypotheses.
Stern (2005) proposes using the posterior density of Hk for k = 1, …, K to select the best hypothesis. However, as is elaborated in Klugkist, Laudy, and Hoijtink (2005b), this amounts to using fk to select the best hypothesis, that is, the complexity ck is ignored. This will work if each hypothesis has the same complexity. However, if, for example, Hu is compared to H1, irrespective of the data, Hu will always be preferred because it has by definition a larger fit than H1 (cf. equation 7).
This paper starts by introducing the BF. Using a simple two‐group setup, it will be shown and illustrated that it may show inconsistent behaviour if samples of unequal size are obtained from multiple populations. Subsequently, the BF will be generalized into the MBF and, using the same two‐group setup, it will be shown and illustrated that the MBF does not exhibit inconsistent behaviour if samples of unequal size are obtained from multiple populations. Further illustrations of the approach proposed in the context of an analysis‐of‐covariance (ANCOVA) model and a logistic regression analysis will be provided. Illustrations are executed using the R package Bain (https://informative-hypotheses.sites.uu.nl/software/bain/), which runs in R (https://www.r-project.org).
The R codes and data used in this paper can be found at the bottom of the Bain website (click on the title of this paper). The paper is concluded with a short discussion and contains an Appendix with a further discussion of the consistency of the MBF.
2 The approximate adjusted fractional Bayes factor
Informative hypotheses take the form

Hk : Skθ = sk, Rkθ > rk, for k = 1, …, K, (1)

where Sk, sk, Rk, and rk specify, respectively, the equality and inequality constraints among the elements of the parameter vector θ.

(2)

(3)

where Y denotes the data that are modelled (e.g., the dependent variable in a multiple regression) and X the data that are not modelled and considered to be fixed (e.g., the predictor variables in a multiple regression). The idea of fractional Bayes factors is to use a fraction b of the information in the likelihood function to specify the prior distribution. Usually the fraction b is chosen such that it corresponds to the size of a minimal training sample (Berger & Pericchi, 1996, 2004). For the evaluation of informative hypotheses we implemented b = J*/N in the R package Bain, where J* denotes the number of independent constraints in [S1, R1, …, SK, RK] and N the sample size. This choice can be illustrated using a simple example. If H1: θ1 > θ2 > θ3 and H2 : θ1 = θ2 = θ3, the number of independent constraints J* = 2, that is, there are two underlying parameters that are combinations of the target parameters with respect to which hypotheses are formulated: θ1 – θ2 and θ2 – θ3. Our choice is motivated by the fact that in the normal linear model, the minimal training sample needed to obtain a proper posterior distribution is equal to the number of parameters. If, for example, a variable is modelled using a normal distribution with unknown mean μ and variance σ2, the minimum training sample needed to obtain a proper posterior based on the prior h(μ, σ2) = 1/σ2 is 2 (cf. Berger & Pericchi, 2004, Example 1). If a variable is a linear combination of two predictors with normal error, there are four parameters (intercept, two regression coefficients, residual variance) and, consequently, the minimum training sample equals 4.
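As a small illustration of how the number of independent constraints can be determined, the constraint matrices of H1 : θ1 > θ2 > θ3 and H2 : θ1 = θ2 = θ3 can be stacked and their rank computed. The following R sketch (with an arbitrary sample size) only illustrates the counting rule; it is not code taken from Bain.

```r
# Constraint matrices for theta = (theta1, theta2, theta3); every row is one constraint.
R1 <- rbind(c(1, -1,  0),   # H1: theta1 - theta2 > 0
            c(0,  1, -1))   #     theta2 - theta3 > 0
S2 <- rbind(c(1, -1,  0),   # H2: theta1 - theta2 = 0
            c(0,  1, -1))   #     theta2 - theta3 = 0

Jstar <- qr(rbind(R1, S2))$rank  # number of independent constraints: 2
N     <- 100                     # total sample size (arbitrary, for illustration)
b     <- Jstar / N               # fraction of the information used to construct the prior
b                                # 0.02
```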
The unconstrained posterior distribution of θ is approximated by a normal distribution,

θ | Y, X ≈ N(θ̂, Σ̂θ), (4)

where θ̂ denotes the maximum likelihood estimate of θ and Σ̂θ the corresponding covariance matrix. Note that the ‘approximate’ in the name ‘approximate adjusted fractional Bayes factor’ reflects the fact that for its computation a normal approximation of the posterior distribution is used. An implication of the approximation is that the BF can only be used if a normal approximation to the posterior distribution of θ is reasonable. If the sample size is not too small (see below), this is the case with unbounded parameters such as means and regression coefficients as they appear in generalized linear models and structural equation models. It is also the case for the fixed regression coefficients (the random effects would be treated as nuisance parameters) in, for example, two‐level models. In the latter case, the sample size used is the number of level‐two units (and not the number of observations of the dependent variable). This is not necessarily the case with naturally bounded parameters such as variances (naturally bounded to be positive) and probabilities (naturally bounded between 0 and 1), although even there, if the sample size is large, a normal approximation of the posterior distribution may be accurate. The interested reader is referred to Gu et al. (2014) who show that, for the evaluation of inequality‐constrained hypotheses in the context of a multiple regression with two predictors, the difference between the approximate BF implemented in Bain and the corresponding non‐approximate BF implemented in Biems (Mulder et al., 2012) is negligible if the sample size is at least 20. They also show that inequality‐constrained hypotheses with respect to the probabilities in a two‐by‐two contingency table render an approximate BF that is very similar to the non‐approximate BF presented by Klugkist, Laudy, and Hoijtink (2010) if the sample size is at least 40. Although these results give confidence in the performance of the approximate adjusted fractional Bayes factor, further research in the context of different models is needed in order to strengthen these results.
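For many statistical models the two ingredients just mentioned, the estimates and their covariance matrix, can be extracted directly from a fitted model object. The following minimal R sketch uses simulated data and assumed variable names; it only illustrates where the input for the BF comes from.

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 0.5 * dat$x1 + rnorm(50)   # simulated outcome (illustrative)

fit   <- lm(y ~ x1 + x2, data = dat)
theta <- coef(fit)                  # maximum likelihood estimates of the parameters
Sigma <- vcov(fit)                  # corresponding asymptotic covariance matrix
```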
The mean θB of the prior distribution is chosen such that it is located on the boundary of the hypotheses under consideration, that is, as a solution of

SkθB = sk and RkθB = rk, for k = 1, …, K. (5)

The prior distribution itself is

hu(θ | [Y, X]b) = N(θB, Σ̂θ/b), (6)

where [Y, X]b stresses that the prior distribution is based on a fraction b of the information in the data. Note that θB is called the adjusted mean (Mulder, 2014) of the prior distribution, which explains the ‘adjusted’ in the name ‘approximate adjusted fractional Bayes factor’. As was shown by Mulder (2014), if, for example, H1 : θ > 0 is compared with Hu : θ, it holds that the more the data support H1 the smaller the support in the fractional Bayes factor for H1! This phenomenon is addressed if the adjusted fractional Bayes factor is used, that is, if the prior mean is in agreement with equation 5, the more the data are in agreement with H1 the larger the support in the adjusted fractional Bayes factor for H1 (see Mulder, 2014, for further details). Note, furthermore, that hu(·) is a so‐called encompassing prior, that is, the prior distribution of θ under Hk is proportional to hu(θ | [Y, X]b) × I(θ ∈ Hk), where the indicator function I(·) is 1 if the argument is true and 0 otherwise (Klugkist et al., 2005a; Wetzels et al., 2010).
There are situations in which there is no solution to equation 5. For example, if hypotheses are specified using range constraints, for example, H1 : |θ| < 0.2 (i.e. H1 : θ > −0.2, θ < 0.2), there is no solution. Bain addresses this problem in the following manner: in equation 5 (and only in this equation) this (part of a) hypothesis is represented as H1 : θ = 0, that is, θB will be equal to the midpoint of the range specified. The rationale is that H1 essentially implies that θ ≈ 0. Another example is given by the hypotheses H1 : θ = 0 and H2 : θ > 2. Although each of these hypotheses can be evaluated by itself, they cannot be compared using the approximate adjusted fractional Bayes factor because there is no solution to equation 5, that is, the two hypotheses are not compatible because hu(·) is different for each hypothesis (Hoijtink, 2012; section 9.9.2.1). Testing non‐compatible hypotheses can be done using BIEMS (Mulder et al., 2012) by instructing the program to use the same unconstrained prior for each of the hypotheses under consideration.
The approximate adjusted fractional Bayes factor of Hk against Hu is the ratio of the fit and the complexity of Hk, BFku = fk/ck, where

fk = ∫θ∈Hk N(θ; θ̂, Σ̂θ) dθ (7)

and

ck = ∫θ∈Hk N(θ; θB, Σ̂θ/b) dθ, (8)

respectively, that is, the fit is the proportion of the approximate posterior distribution in equation 4 that is in agreement with Hk and the complexity is the corresponding proportion of the prior distribution in equation 6 (for equality constraints the corresponding densities at the constraint values are used). The interested reader is referred to Gu (2016, Chapter 3) for the algorithms with which the fit and complexity are computed. The strength of the BF lies in its simplicity. Its computation is based only on maximum likelihood estimates and the corresponding asymptotic covariance matrix, and the choice of the fraction b, which is completely determined by the sample size N and the number of independent constraints J*.
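For a purely inequality‐constrained hypothesis the fit and complexity can be approximated by straightforward Monte Carlo sampling from the two normal distributions above. The sketch below is only meant to convey the idea and is not the algorithm used by Bain (which is described in Gu, 2016, Chapter 3); the estimates, covariance matrix, and sample size are illustrative assumptions.

```r
library(MASS)                                  # for mvrnorm()
set.seed(123)
theta_hat <- c(0.4, 0.2, 0.0)                  # assumed estimates of theta1, theta2, theta3
Sigma     <- diag(0.01, 3)                     # assumed covariance matrix of the estimates
b         <- 2 / 100                           # J* = 2 independent constraints, N = 100
theta_B   <- c(0, 0, 0)                        # prior mean on the boundary theta1 = theta2 = theta3

post  <- mvrnorm(1e5, mu = theta_hat, Sigma = Sigma)      # approximate posterior draws
prior <- mvrnorm(1e5, mu = theta_B,   Sigma = Sigma / b)  # adjusted fractional prior draws

in_H1 <- function(th) th[, 1] > th[, 2] & th[, 2] > th[, 3]
f1 <- mean(in_H1(post))    # fit: posterior proportion in agreement with H1: theta1 > theta2 > theta3
c1 <- mean(in_H1(prior))   # complexity: prior proportion in agreement with H1 (close to 1/6)
BF1u <- f1 / c1
```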
The approximate adjusted fractional Bayes factor, also in the paragraphs that follow abbreviated as BF, falls in the category of default, automatic, or pseudo Bayes factors because no priors have to be manually specified. Instead, the prior is automatically constructed using a small fraction of the data, while the remaining fraction is used for hypothesis testing, similar to the fractional Bayes factor (O'Hagan, 1995). The BF is coherent in the sense that BFuk = 1/BFku and BFkk’ = BFku/BFk'u (O'Hagan, 1997; sections 3.1 and 3.2). Note that these coherence properties do not necessarily hold for other default Bayes factors (O'Hagan, 1997; Robert, 2007; p. 240).
As further noted by Robert (2007, p. 242), a potential issue of the fractional Bayes factor, and therefore also of the BF in equation 2, is that there is no clear‐cut procedure to choose the fraction b. We believe, however, that the use of a minimal fraction is reasonable as it results in a minimally informative default prior while maximal information in the data is used for hypothesis testing (Berger & Mortera, 1995). Furthermore, it has been shown that this choice results in consistent testing behaviour (Mulder, 2014; O'Hagan, 1995). Nevertheless, further research on the choice of b would strengthen the approach we present in this paper. The interested reader is referred to Gu, Hoijtink, and Mulder (2016), for one evaluation of the choice of b. Another potential issue highlighted by Robert (2007, p. 242) is that default Bayes factors can be computationally intensive. The BF procedure that is proposed here, however, is very easy to compute: only the maximum likelihood estimates, error covariance matrix and sample size are needed (Gu et al., 2018).
Finally, it is important to note that default Bayes factors may behave as ordinary Bayes factors based on so‐called intrinsic priors (Berger & Pericchi, 1996). Currently, however, intrinsic priors have not yet been explored for the BF. Although this too is a topic worthy of further research, from a pragmatic point of view it is more important to know whether the BF is consistent, that is, whether the support for the true hypothesis goes to infinity when the sample size grows to infinity. According to O'Hagan (1997; section 2.1), if hypotheses are nested (in the cases we consider all hypotheses are nested within Hu) and if b → 0 if N → ∞ (which holds for our b), the fractional Bayes factor is consistent. However, De Santis and Spezzaferri (2001) show that the fractional Bayes factor may show inconsistent behaviour if data from multiple populations are sampled. Similarly, as shown in the next section, the BF is also inconsistent if the data are sampled from multiple populations. In line with the solution proposed by De Santis and Spezzaferri (2001) for the fractional Bayes factor, the MBF is an extension of the BF that is consistent when testing hypotheses in the case of multiple populations.
3 Consistency of the approximate adjusted fractional Bayes factor
The consistency of the (M)BF will be discussed in terms of (M)BFku if Hk is specified using only equality constraints. In this case a Bayes factor is called consistent if (M)BFku → ∞ (or 0) when Hk (or Hu) is true and Ng → ∞ at the same rate for all g = 1, …, G. If Hk is specified using only inequality constraints, it will be discussed in terms of (M)BFkc, where Hc is the complement of Hk. In this situation, a Bayes factor is called consistent if BFkc → ∞ (or 0) when Hk (or Hc) is true, as Ng → ∞ at the same rate for all g = 1, …, G. Both scenarios imply that the G populations are treated as one population from which a sample of increasing size (proportionally increasing the sample sizes from each of the G populations) is taken. Note that BFkc = BFku/BFcu, where the numerator and denominator can be computed using equation 2. Note, furthermore, that for hypotheses specified using only equality constraints Hu = Hc.
When BFku ↛ ∞ or BFkc ↛ ∞ for the same limit, the Bayes factor is called inconsistent. Another form of inconsistency that will be considered in this paper is whether (M)BFku → ∞ or 0, and (M)BFkc → ∞ or 0, as Ng → ∞ for some populations but not all G populations. This situation applies if a sample of increasing size is obtained from some of the G populations while the sample size from the other populations remains fixed. De Santis and Spezzaferri (2001) showed for this limit that the fractional Bayes factor (O'Hagan, 1995) is inconsistent. In this section it will be illustrated, in line with De Santis and Spezzaferri (2001), that the same holds for the BF. In the next section the MBF will be introduced, which can be seen as an extension of the BF to multiple populations which avoids this form of inconsistency.
Example 1. Consider the comparison of the means of two independent groups using the model

yi = θ1D1i + θ2D2i + εi, with εi ∼ N(0, ω), for i = 1, …, N, (9)

where D1i equals 1 for i = 1, …, N1 and 0 otherwise, D2i equals 1 for i = N1 + 1, …, N1 + N2 and 0 otherwise (and, consequently, θ1 and θ2 denote the means in group 1 and group 2, respectively, and ω the residual variance), and N1 and N2 denote the sample sizes of groups 1 and 2, respectively, with N = N1 + N2. Connecting this notation to that of the previous section renders Y = y and X = [D1, D2].
Consider testing H1 : θ1 = θ2 against Hc : θ1 ≠ θ2. Note that the marginal likelihood of Hc is equal to the marginal likelihood of the unconstrained hypothesis Hu : θ1, θ2 because θ1 = θ2 has zero probability assuming a bivariate normal prior for θ1, θ2 under Hu. For the exposition that follows we arbitrarily assume that the maximum likelihood estimate of ω equals 1. The approximated unconstrained posterior and prior distribution of θ1 and θ2 from equations 4 and 6 are then given by

(θ1, θ2) | Y, X ∼ N([θ̂1, θ̂2], diag(1/N1, 1/N2)) (10)

and

(θ1, θ2) | [Y, X]b ∼ N([0, 0], diag(1/(bN1), 1/(bN2))), (11)

respectively, with b = 1/N. For H1 : θ1 = θ2 the fit f1 is the density of θ1 − θ2 at zero under the posterior in equation 10, and the complexity c1 is the corresponding density under the prior in equation 11; below, φ(x; μ, σ²) denotes the normal density with mean μ and variance σ² evaluated at x. First, let N1 = a1n and N2 = a2n, so that both sample sizes increase at the same rate as n → ∞. If θ1 = θ2, then

f1 = φ(0; θ̂1 − θ̂2, 1/(a1n) + 1/(a2n)) → ∞ (12)

and

c1 = φ(0; 0, (a1 + a2)/a1 + (a1 + a2)/a2), (13)

which is a constant independent of n. Equations 12 and 13 imply that BF1u → ∞ if n → ∞, which is consistent. If, on the other hand, N1 remains fixed while N2 = n → ∞ and θ1 ≠ θ2, then

f1 → φ(0; θ1 − θ2, 1/N1), (14)

which is a positive constant,

c1 = φ(0; 0, (N1 + n)/N1 + (N1 + n)/n) → 0, (15)

and consequently

BF1u = f1/c1 → ∞ (16)

if n → ∞. This implies that in the limit BF1u → ∞ also if Hu is true, which is inconsistent behaviour.
To get more insight into the (in)consistency, the BF was computed for various numerical examples in Tables 1, 2, and 3. In the case of support for H1 the estimates of θ1 and θ2 were set equal to each other, in the case of support for Hu they were set to unequal values, and in both situations the maximum likelihood estimate of ω was again set to 1.
As can be seen in Table 1, when N1 = N2 and both increase at the same rate, BF1u → ∞ if H1 is true and BF1u → 0 if Hu is true, that is, the Bayes factor shows consistent behaviour. Table 2 shows that BF1u also shows consistent behaviour if both sample sizes increase at the same rate, even if N1 ≠ N2. However, as can be seen in Table 3, if there is support for H1, BF1u increases if N2 increases while N1 remains fixed, but if there is support in the data for Hu, BF1u at first decreases but then starts to increase, which implies that the evidence accumulates in the wrong direction. As N2 keeps increasing, BF1u goes to infinity as shown above. This is a simple illustration of inconsistent behaviour of the BF where multiple populations are considered while the sample size does not increase for all populations. De Santis and Spezzaferri (2001) show that this behaviour can also be observed for the fractional Bayes factor. The problem is caused by the fact that the prior variances of θ1 and θ2 depend on the sample sizes in both groups because b = 1/N (Table 3). As N2 increases, the fraction that is used to construct the default prior for θ1 also goes to zero even though the sample size of group 1 does not increase. This undesirable property can be avoided using population‐specific fractions in line with Iwaki (1997), Berger and Pericchi (1998), De Santis and Spezzaferri (1999, 2001), and Mulder (2014). In the remainder of this paper it will be detailed how this can be done for the BF to obtain the MBF for multiple populations.
Table 1. BF1u and MBF1u in the case of support for H1 (θ1 = θ2) or for Hu (θ1 ≠ θ2), with equal sample sizes for both groups increasing at the same rate

| N1 | N2 | b | prior var θ1 (BF) | prior var θ2 (BF) | BF1u (support H1) | BF1u (support Hu) | b1 | b2 | prior var θ1 (MBF) | prior var θ2 (MBF) | MBF1u (support H1) | MBF1u (support Hu) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | .05 | 2 | 2 | 4.47 | 1.31 | .05 | .05 | 2 | 2 | 4.47 | 1.31 |
| 25 | 25 | .02 | 2 | 2 | 7.07 | 0.33 | .02 | .02 | 2 | 2 | 7.07 | 0.33 |
| 50 | 50 | .01 | 2 | 2 | 10.00 | 0.02 | .01 | .01 | 2 | 2 | 10.00 | 0.02 |
| 100 | 100 | .005 | 2 | 2 | 14.14 | 0.00 | .005 | .005 | 2 | 2 | 14.14 | 0.00 |

Note. N1 and N2 denote the sample sizes in groups 1 and 2, respectively; b denotes the fraction of information in the density of the data, and b1 and b2 denote the fractions of information in the density of the data for groups 1 and 2, respectively; the BF prior variances of θ1 and θ2 are those from equation 11 and the MBF prior variances of θ1 and θ2 are those from equation 25. The numbers in italics are referred to in the text.
Table 2. BF1u and MBF1u in the case of support for H1 (θ1 = θ2) or for Hu (θ1 ≠ θ2), with unequal sample sizes for the two groups increasing at the same rate

| N1 | N2 | b | prior var θ1 (BF) | prior var θ2 (BF) | BF1u (support H1) | BF1u (support Hu) | b1 | b2 | prior var θ1 (MBF) | prior var θ2 (MBF) | MBF1u (support H1) | MBF1u (support Hu) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 50 | .017 | 6 | 1.2 | 7.74 | 1.01 | .05 | .01 | 2 | 2 | 5.77 | 0.75 |
| 25 | 125 | .007 | 6 | 1.2 | 12.25 | 0.07 | .02 | .004 | 2 | 2 | 9.13 | 0.06 |
| 50 | 250 | .003 | 6 | 1.2 | 17.32 | 0.00 | .01 | .002 | 2 | 2 | 12.91 | 0.00 |

Note. N1 and N2 denote the sample sizes in groups 1 and 2, respectively; b denotes the fraction of information in the density of the data, and b1 and b2 denote the fractions of information in the density of the data for groups 1 and 2, respectively; the BF prior variances of θ1 and θ2 are those from equation 11 and the MBF prior variances of θ1 and θ2 are those from equation 25.
Table 3. BF1u and MBF1u in the case of support for H1 (θ1 = θ2) or for Hu (θ1 ≠ θ2), with unequal sample sizes where only the size of group 2 increases

| N1 | N2 | b | prior var θ1 (BF) | prior var θ2 (BF) | BF1u (support H1) | BF1u (support Hu) | b1 | b2 | prior var θ1 (MBF) | prior var θ2 (MBF) | MBF1u (support H1) | MBF1u (support Hu) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 10 | .05 | 2 | 2 | 4.47 | 1.31 | .05 | .05 | 2 | 2 | 4.47 | 1.31 |
| 10 | 25 | .029 | 3.5 | 1.4 | 5.92 | 1.03 | .05 | .02 | 2 | 2 | 5.34 | 0.93 |
| 10 | 50 | .017 | 6.0 | 1.2 | 7.74 | 1.01 | .05 | .01 | 2 | 2 | 5.77 | 0.75 |
| 10 | 100 | .009 | 11 | 1.10 | 10.48 | 1.13 | .05 | .005 | 2 | 2 | 6.03 | 0.65 |
| 10 | 200 | .005 | 21 | 1.05 | 14.94 | 1.41 | .05 | .0025 | 2 | 2 | 6.17 | 0.60 |
| 10 | 1000 | .001 | 101 | 1.01 | 31.78 | 2.81 | .05 | .0005 | 2 | 2 | 6.29 | 0.56 |

Note. N1 and N2 denote the sample sizes in groups 1 and 2, respectively; b denotes the fraction of information in the density of the data, and b1 and b2 denote the fractions of information in the density of the data for groups 1 and 2, respectively; the BF prior variances of θ1 and θ2 are those from equation 11 and the MBF prior variances of θ1 and θ2 are those from equation 25. The numbers in italics are referred to in the text.
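The behaviour displayed in Tables 1–3 can be reproduced with a few lines of R based on equations 10 and 11 (with the estimate of ω equal to 1, as above). The difference d between the estimated group means under support for Hu is an illustrative assumption.

```r
# BF1u for H1: theta1 = theta2 as the ratio of the posterior and prior densities of
# theta1 - theta2 evaluated at zero (equations 10 and 11, with b = 1/N).
bf1u <- function(N1, N2, d) {
  N  <- N1 + N2
  b  <- 1 / N
  f1 <- dnorm(0, mean = d, sd = sqrt(1 / N1 + 1 / N2))              # fit
  c1 <- dnorm(0, mean = 0, sd = sqrt(1 / (b * N1) + 1 / (b * N2)))  # complexity
  f1 / c1
}
N2 <- c(10, 25, 50, 100, 200, 1000)
round(bf1u(N1 = 10, N2 = N2, d = 0.0), 2)  # support for H1: BF1u keeps increasing (cf. Table 3)
round(bf1u(N1 = 10, N2 = N2, d = 0.7), 2)  # support for Hu: BF1u first decreases, then increases
```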
4 The approximate adjusted fractional Bayes factor for multiple populations
Suppose that data are available from G populations, that θg denotes the structural parameters specific to population g, and that η denotes the structural parameters shared by the populations. The density of the data can then be factored over the populations as

f(Y | X, θ1, …, θG, η) = ∏g=1,…,G fg(Yg | Xg, θg, η). (17)

Example 1, continued. The following notation will be used to denote which parts of the data belong to groups 1 and 2. The subscripts 1 and 2 in y1, y2 denote data sampled from populations 1 and 2, respectively. Analogously, the second subscript in D11, D12, and D21, D22, denotes data from populations 1 and 2, respectively. Using this notation, the density of the data for the comparison of two independent means can be factored as:
(18)The covariance matrix of the parameters in equation 17 can be obtained as a function of the observed or expected Fisher information matrix (the interested reader is referred to Efron and Hinkley (1978), for details of the relative (dis)advantages of both types of information). Using the observed Fisher information, this leads to
(19)
, that is, the unconstrained maximum likelihood estimates of the model parameters obtained using the full density of the data from equation 17. If the expected Fisher information is used, the expected value of each entry in the last part of equation 21 has to be taken. The corresponding normal approximation of the posterior distribution of the structural parameters is
(20)Note that
can be constructed using the observed Fisher information matrix for the parameters of each group:
(21)where each second‐order derivative is evaluated using
, that is, the maximum likelihood estimates of the model parameters obtained using the full density of the data displayed in equation 17 –not only using the data for group g. Analogously, equation 23 can be replaced by the corresponding expected Fisher information matrix. Comparing equation 23 for g = 1, …,G to equation 21 shows that the former contains all the elements needed to construct the latter. This is important since the input for Bain consists of the covariance matrices for each group, from which Bain constructs the overall covariance matrix. As will be detailed in the next paragraph, these group‐specific covariance matrices are needed in order to be able to construct the prior distribution based on a fraction bg of the information of the data in each group.
Once the group‐specific covariance matrix for g = 1, …, G has been obtained, it is straightforward to obtain the multiple‐population counterpart of the prior distribution displayed in equation 6, which is based on a covariance matrix using a fraction bg of the information in Yg, Xg for g = 1, …, G (see equation 17). Using the mathematical rule that ∂²log p(v, w)^u/∂v∂w = u ∂²log p(v, w)/∂v∂w, it can be seen that
(22)Reassembling these matrices (cf. equation 21) renders
(23)
denotes a covariance matrix based on fractions b = [b1, …, bG] of the information in the data, rendering the multiple‐population adjusted fractional prior distribution of the structural parameters
(24)Note that […]b in equation 26 denotes a prior distribution based on fractions b of the information in the data. It can be interpreted as a default prior that contains the information of group‐specific data fractions, bg, for the parameters of interest. Note, furthermore, that the subscript B in θB,1, …, θB,G, ηB highlights that the prior means of the structural parameters are in agreement with equation 5, that is, centred on the boundary of the hypotheses specified.
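The following R sketch illustrates the reassembly described above for the simple situation in which each population has one group‐specific parameter and all populations share J joint parameters η. It assumes that each group‐specific covariance matrix is the inverse of that group's contribution to the observed Fisher information, evaluated at the overall estimates; it is a conceptual sketch, not the Bain implementation.

```r
# Sigma_g_list: list of G covariance matrices, each of dimension (1 + J) x (1 + J),
#               ordered as (theta_g, eta); b_g: vector of group-specific fractions.
assemble_covariances <- function(Sigma_g_list, b_g) {
  G <- length(Sigma_g_list)
  J <- nrow(Sigma_g_list[[1]]) - 1              # number of joint (shared) parameters
  P <- G + J                                    # total number of structural parameters
  info_post  <- matrix(0, P, P)                 # Fisher information based on all data
  info_prior <- matrix(0, P, P)                 # Fisher information weighted by the fractions b_g
  for (g in seq_len(G)) {
    I_g <- solve(Sigma_g_list[[g]])             # information contributed by group g
    idx <- c(g, G + seq_len(J))                 # position of (theta_g, eta) in the full vector
    info_post[idx, idx]  <- info_post[idx, idx]  + I_g
    info_prior[idx, idx] <- info_prior[idx, idx] + b_g[g] * I_g
  }
  list(Sigma_post  = solve(info_post),          # posterior covariance matrix (cf. equation 21)
       Sigma_prior = solve(info_prior))         # prior covariance matrix based on b (cf. equation 23)
}
```

For Example 2 below, the list would contain the five 3 × 3 matrices of Table 4 and b_g the fractions 4/(5Ng).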
(25)
(26)from which, using equation 19, it is straightforward to obtain that
(27)The counterpart of equation 26 for the example at hand has θB = [0, 0] and, applying equation 25,
(28)With respect to the computation of equation 23 three situations can be distinguished.
(29)
. Multiple populations arise if two or more of the predictors are used to create groups. Two groups with group‐specific intercepts are created if, for example, x1i = 1 if person i is a member of group 1 and 0 otherwise and x2i = 1 if person i is a member of group 2 and 0 otherwise. Group‐specific regression coefficients can additionally be obtained if, for example, x3i = x*ix1i and x4i = x*ix2i (where x*i denotes a continuous predictor for which group‐specific regression coefficients are required), that is, the predictor x3i gets a regression coefficient β3z in group 1 and β4z in group 2. With Z = 1 the model could be
(30)
(31)for group 2, and
.
(32)Situation 2. Models with only group‐specific parameters. When all of the parameters (including ω) in the density of the data are group specific, the covariance matrix in equation 21 will be block diagonal with one block for each group. Consequently, it is straightforward to use R packages tailored to the statistical model of interest to obtain estimates
for g = 1, …, G, and, for each group, the corresponding covariance matrix
. Note that this does not apply to the example given for situation 1 (equations 32 and 33) because σ2 was not group specific. This would have applied if, in addition to the intercept and regression coefficient, σ2 had been group specific too.
Situation 3. All other situations. In all other situations R packages can be used to obtain the estimates
, but the equations rendering
based on
and Yg, Xg for g = 1, …, G will either have to be programmed in R or obtained through the use of R packages like numDeriv, which provides numerical approximations of second‐order derivatives based on the log density of the data of the statistical model of interest. Later in this paper a logistic regression will be used in Example 3 to illustrate this situation. For users with limited experience of statistical modelling and R, the third situation will be difficult to handle: the likelihood function of the statistical model at hand has to be formulated and numDeriv has to be used to estimate the covariance matrix for each group (and not the overall covariance matrix) using the estimates of the model parameters obtained from all groups jointly. Currently, one annotated example (a logistic regression model) is provided on the Bain website. Users requiring support in the context of other models can send an email to the first author of this paper with the request to add additional examples to the website.
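The following R sketch illustrates situation 3 for a logistic regression with two groups, in the spirit of Example 3 below. The data are simulated and all names are assumptions; the essential steps are writing down the log‐likelihood and letting numDeriv evaluate, for each group separately, the Hessian at the overall maximum likelihood estimates.

```r
library(numDeriv)
set.seed(1)
girl <- rbinom(240, 1, 0.5); boy <- 1 - girl; age <- rnorm(240)     # simulated predictors
y    <- rbinom(240, 1, plogis(0.5 * girl + 0.6 * boy - 0.01 * age)) # simulated outcome

fit   <- glm(y ~ girl + boy + age - 1, family = binomial)   # overall estimates
theta <- coef(fit)                                          # theta1 (girls), theta2 (boys), beta (age)

# log-likelihood for one group as a function of that group's intercept and the age coefficient
loglik_g <- function(par, yy, xx) sum(dbinom(yy, 1, plogis(par[1] + par[2] * xx), log = TRUE))

cov_g <- function(sel, est) {
  info <- -hessian(loglik_g, est, yy = y[sel], xx = age[sel])  # observed Fisher information, group data
  solve(info)                                                  # group-specific covariance matrix
}
Sigma1 <- cov_g(girl == 1, c(theta["girl"], theta["age"]))   # covariance of (theta1, beta) in group 1
Sigma2 <- cov_g(boy  == 1, c(theta["boy"],  theta["age"]))   # covariance of (theta2, beta) in group 2
```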
5 Choosing bg
(33)
(34)

that is, the fraction for population g is chosen as bg = J*/G × 1/Ng. This choice is in keeping with the concept of using a minimal fraction of the information from each population to construct an implicit default prior.
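As a small illustration, the fractions for the ANCOVA of Example 2 (five groups, J* = 4, group sizes as in Table 4) are obtained as follows.

```r
Jstar <- 4
G     <- 5
Ng    <- c(60, 55, 64, 43, 18)      # group sample sizes
bg    <- (Jstar / G) * (1 / Ng)     # group-specific fractions b_g = J*/G x 1/N_g
round(bg, 3)                        # compare with the vector b reported in Table 5
```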
6 Consistency of the multiple population approximate adjusted fractional Bayes factor
De Santis and Spezzaferri (2001) show that their generalized fractional Bayes factor is consistent if N → ∞, that is, if the Ng increase at the same rate (cf. De Santis & Spezzaferri, 2001; Theorem 4.1). It will be illustrated below, via a continuation of Example 1, that the same holds for the MBF, that is, if Ng → ∞, for all g at the same rate, then MBFku → ∞ (or 0) when Hk (or Hu) is true. A more general discussion of the consistency of the MBF is given in the Appendix. The example below also shows that the MBF avoids the inconsistent behaviour shown by the BF when fixing the sample size of one population while letting the sample size of the other population go to infinity.
For the example at hand, the multiple‐population prior distribution of θ1 and θ2 is

(θ1, θ2) ∼ N([0, 0], diag(1/(b1N1), 1/(b2N2))) = N([0, 0], diag(2, 2)), (35)

with b1 = 1/(2N1) and b2 = 1/(2N2) because J* = 1 and G = 2. As can be seen, the prior distribution in equation 35 is independent of N1 and N2. This can be interpreted as the amount of prior information being independent of the sample size, which is a desirable property. Also note that the prior mean does not depend on the information in the data but is chosen to be in agreement with equation 5.
(36)
and n → ∞, then f1 → ∞ and c1 is constant. This implies that MBF1u → ∞. If
and n → ∞, then f1 → 0 and c1 is constant. This implies that MBF1u → 0. Stated otherwise, for n → ∞, MBF1u is consistent. Furthermore, if N2 → ∞ while N1 is fixed, then if
, in the limit (see the last term of equation 36
, which is larger than 1, that is, correctly expresses support for H1. Although for N2 → ∞ MBF1u does not approach ∞, this is reasonable behaviour and the inconsistent behaviour of the BF is avoided. If
the limiting behaviour of MBF1u is shown by the last term of equation 36. If, for example, N1 = 25 and
, MBF1u = 8.8, that is, H1 is supported. This too is reasonable, because both the sample size of group 1 and the effect size are small and therefore the effect is not convincingly different from zero. If both are larger, for example, N1 = 49 and
, MBF1u = 0.03, that is, Hu is supported. As is illustrated, the degree support for or against H1 is based on the sample size and the effect size. This too is reasonable behaviour and again the inconsistent behaviour of the BF is avoided.
As can be seen in the last two columns in the middle and right‐hand panels of Tables 1 and 2, if both sample sizes are proportionally increasing, both BF1u and MBF1u show consistent behaviour in the sense that (M)BF1u → ∞ if θ1 = θ2 and (M)BF1u → 0 if
. Note that, as required by our choice of bg in equation 36, for equal sample sizes in both groups both Bayes factors are equal (see Table 1).
Furthermore, as can be seen in the last two columns in the middle and right‐hand panels of Table 3, if one sample size is fixed and the other is increasing, in contrast to BF1u, MBF1u does not show inconsistent behaviour in the sense that MBF1u is monotonically increasing if
and MBF1u is monotonically decreasing if
. As can be seen, in this situation, when only N2 is increased, MBF1u converges to the upper bound 6.325 (0.546) when θ1 = θ2 (
) based on the limit in equation 1.
As can be seen by comparing the last number on the last row in Table 1 (N = 200, N1 = N2 = 100) with the last number on the penultimate row in Table 3 (N = 210, N1 = 10, N2 = 200), it makes a huge difference in outcome whether or not the sample sizes are balanced. Evidence in favour of the true hypothesis is larger with balanced than with unbalanced sample sizes.
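The limiting behaviour of MBF1u described in this section can be verified with the R sketch below, which mirrors the sketch given for BF1u at the end of Section 3; the only change is that the prior variances are now based on the group‐specific fractions b1 and b2, so that they no longer depend on N1 and N2. The difference d under support for Hu is again an illustrative assumption.

```r
mbf1u <- function(N1, N2, d, Jstar = 1, G = 2) {
  b1 <- (Jstar / G) / N1
  b2 <- (Jstar / G) / N2
  f1 <- dnorm(0, mean = d, sd = sqrt(1 / N1 + 1 / N2))                # fit
  c1 <- dnorm(0, mean = 0, sd = sqrt(1 / (b1 * N1) + 1 / (b2 * N2)))  # complexity (prior variance 2 + 2)
  f1 / c1
}
round(mbf1u(10, c(10, 50, 200, 1000), d = 0.0), 2)  # increases towards the bound 6.325 (cf. Table 3)
round(mbf1u(10, c(10, 50, 200, 1000), d = 0.7), 2)  # decreases towards the bound 0.546 (cf. Table 3)
```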
7 Example 2: Analysis of covariance
(37)
(38)
(39)
(40)where Xg = [Dgg, x1g, x2g], in which the second subscript indicates to which group the data elements belong. Equation 5 is obtained using the expected Fisher information. Since the expected values of the mixed second‐order derivatives with respect to θ1, …, θ5, β1, β2 on the one hand and σ2 on the other hand are zero,
constructed using equation 21 based on the expected Fisher information for only these parameters is identical to the corresponding part in
(cf. equations 28 and 29).
, for g = 1,2, renders the Fisher information matrices for the groups. Using equation 21, these can be assembled into the overall Fisher information matrix which, after inverting and multiplication by −1, renders
. Modifying equation 5 according to equation 22 renders
(41)that is, the elements of the expected Fisher information matrix for each group g. Reassembling these elements using equation 25 renders
, that is, the covariance matrix of the prior distribution.
This example concludes using data from Stevens (1996, appendix A) concerning the effect of the first year of the Sesame Street series on the knowledge of 240 children in the age range 34–69 months. We will use the following variables: y, knowledge of numbers after watching Sesame Street; x1, the knowledge of numbers before watching Sesame Street; x2, a test measuring the mental age of children; and D1, …,D5 dummy variables representing the children's background (1 = disadvantaged inner city, 2 = advantaged suburban, 3 = advantaged rural, 4 = disadvantaged rural, 5 = disadvantaged Spanish‐speaking).
(42)
(43)Hypothesis 1 states that the knowledge of numbers after watching Sesame Street does not depend on background correcting for initial knowledge and mental age. Hypothesis 2 states that the advantaged children have a greater knowledge after watching Sesame Street than the disadvantaged children.
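A hypothetical sketch of how such an analysis could be set up with the R package Bain is given below. The data frame and variable names are assumptions, H2 is formalized here as pairwise inequalities between the adjusted means of the advantaged and disadvantaged groups, and the exact call and hypothesis syntax should be checked against the Bain documentation and vignettes.

```r
library(bain)
# assumed data frame `sesame` with post-test knowledge of numbers (postnumb), pre-test
# knowledge (prenumb), mental age (mentage), and a five-level factor background
sesame$background <- as.factor(sesame$background)
fit <- lm(postnumb ~ background + prenumb + mentage - 1, data = sesame)  # group-specific adjusted means

h1 <- "background1 = background2 = background3 = background4 = background5"
h2 <- paste("background2 > background1 & background2 > background4 & background2 > background5 &",
            "background3 > background1 & background3 > background4 & background3 > background5")
set.seed(100)              # set a seed because the fit and complexity may be approximated by sampling
results <- bain(fit, paste(h1, h2, sep = "; "))
print(results)
```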
Table 4 presents the input the R package Bain needs in order to evaluate H1 and H2, that is, estimates of the adjusted means, regression coefficients, and residual variance, and, for each group, the covariance matrix for the group‐specific adjusted mean and both regression coefficients, computed using
(cf. equation 5), and the sample size. Table 5 first of all presents the posterior covariance matrix of the structural parameters computed from the group‐specific covariance matrices using equation 21. Then the vector b computed using bg = 1/5 × 4/Ng for g = 1, …, G is presented. Note that J* equals 4 because the number of independent constraints in
Table 4. Input for the R package Bain for the evaluation of H1 and H2 (Example 2)

Estimates of the adjusted means, regression coefficients, and residual variance:

| θ̂1 | θ̂2 | θ̂3 | θ̂4 | θ̂5 | β̂1 | β̂2 | σ̂² |
|---|---|---|---|---|---|---|---|
| 29.16 | 34.38 | 28.90 | 27.12 | 30.89 | 0.70 | 0.05 | 84.06 |

Group sample sizes:

| N1 | N2 | N3 | N4 | N5 |
|---|---|---|---|---|
| 60 | 55 | 64 | 43 | 18 |

Covariance matrix of the group‐specific adjusted mean and the two regression coefficients, for group 1 (columns 1–3) and group 2 (columns 4–6):

| 1.62 | −0.05 | 0.05 | 3.07 | −0.01 | −0.10 |
| −0.05 | 0.02 | −0.01 | −0.01 | 0.02 | −0.01 |
| 0.05 | −0.01 | 0.01 | −0.10 | −0.01 | 0.01 |

For group 3 (columns 1–3) and group 4 (columns 4–6):

| 2.32 | 0.07 | 0.09 | 2.21 | 0.04 | 0.04 |
| 0.07 | 0.03 | −0.01 | 0.04 | 0.03 | −0.09 |
| 0.09 | −0.01 | 0.01 | 0.04 | −0.01 | 0.01 |

For group 5:

| 5.47 | 0.20 | −0.20 |
| 0.20 | 0.09 | −0.05 |
| −0.20 | −0.05 | 0.05 |
Table 5. Output from the R package Bain for the evaluation of H1 and H2 (Example 2)

Posterior covariance matrix of the structural parameters (θ1, …, θ5, β1, β2):

| 1.45 | −0.09 | 0.03 | 0.02 | −0.04 | −0.01 | 0.01 |
| −0.09 | 1.96 | −0.24 | −0.12 | 0.14 | 0.01 | −0.03 |
| 0.03 | −0.24 | 1.46 | 0.07 | −0.07 | 0.00 | 0.02 |
| 0.02 | −0.12 | 0.07 | 1.99 | −0.03 | 0.00 | 0.01 |
| −0.04 | 0.14 | −0.07 | −0.03 | 4.72 | 0.01 | −0.01 |
| −0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | −0.00 |
| 0.01 | −0.03 | 0.02 | 0.01 | −0.01 | −0.00 | 0.00 |

b:

| 0.013 | 0.015 | 0.012 | 0.019 | 0.044 |

Prior covariance matrix of the structural parameters:

| 108.14 | −5.52 | 2.02 | 0.89 | −2.66 | −0.85 | 0.67 |
| −5.52 | 129.41 | −13.40 | −6.77 | 7.90 | 0.46 | −1.82 |
| 2.02 | −13.40 | 113.04 | 4.09 | −3.86 | 0.17 | 0.85 |
| 0.89 | −6.77 | 4.09 | 107.20 | −1.89 | 0.14 | 0.41 |
| −2.66 | 7.90 | −3.86 | −1.89 | 108.07 | 0.51 | −0.72 |
| −0.85 | 0.46 | 0.17 | 0.14 | 0.51 | 0.31 | −0.14 |
| 0.67 | −1.82 | 0.85 | 0.41 | −0.72 | −0.14 | 0.17 |

| MBF1u | MBF2u | MBF12 |
|---|---|---|
| 2.94 | 1.34 | 2.21 |

Note. The number in italics is referred to in the text.
(44)

which specifies the equality constraints in H1 and

(45)

which specifies the inequality constraints in H2, is equal to 4, that is, the number of independent rows in the combination of S1 and R2 is equal to 4. Next, the prior covariance matrix of the structural parameters, computed using the group‐specific covariance matrices and b and equations 22 and 23, is displayed. Finally, MBF1u, MBF2u, and MBF12 are presented. As can be seen, the support in the data is 2.21 times greater for H1 than for H2, that is, it is slightly more likely that the gain in knowledge of numbers is equal for advantaged and disadvantaged children than that the gain is greater for the advantaged children. More data would be needed to obtain a more decisive conclusion.
8 Example 3: Logistic regression
Example 2 illustrated how
for g = 1, …, G can be computed if the statistical model at hand is a member of the (multivariate) normal linear model (previously labelled situation 1). In this section it will be illustrated how
for g = 1, …, G can be obtained for models outside the (multivariate) normal linear modelling framework (previously labelled situation 3) based on the observed Fisher information using the R package numDeriv.44
https://cran.r-project.org/web/packages/numDeriv/
Again using the data from Stevens (1996; appendix A), a logistic regression model is specified in which y, whether a child is encouraged to watch Sesame Street (0 = no, 1 = yes), is predicted from gender (D1i equals 1 for a girl and zero otherwise, D2i equals 1 for a boy and zero otherwise), and centred age x:
logit[Pr(yi = 1)] = θ1D1i + θ2D2i + βxi, (46)

where θ1 and θ2 denote the intercepts for girls and boys, respectively, and β the regression coefficient of centred age. The hypothesis of interest is

H1 : θ1 > θ2, β > 0, (47)

that is, girls are more encouraged than boys and older children are more encouraged than younger children.
The top part of Table 6 presents the input the R package Bain needs in order to evaluate H1. Note that
and
are computed using the observed Fisher information matrix rendered by the R package numDeriv using the data for group 1 (D1g,xg) and group 2 (D2g,xg), respectively. In the bottom part of Table 6 the output resulting from Bain is presented. It can be observed that H1 is not supported by the data, with MBF1u = 0.53. Note that, for the example at hand,
computed using the observed Fisher information matrix is virtually identical to
computed using the expected Fisher information matrix with the R package glm. Note, however, that this does not always have to be the case. Researchers preferring the expected Fisher information matrix (but see Efron & Hinkley, 1978) will have to replace the computations with numDeriv by formulae for the expected Fisher information for logistic regression models (see, for example, McCullagh & Nelder, 1989, pp. 115–117).
Table 6. Input for and output from the R package Bain for the evaluation of H1 (Example 3)

Input for the R package Bain

Estimates:

| θ̂1 | θ̂2 | β̂ |
|---|---|---|
| 0.50 | 0.60 | −0.01 |

Group sample sizes:

| N1 | N2 |
|---|---|
| 125 | 115 |

Covariance matrix of (θ1, β) for group 1 (columns 1–2) and of (θ2, β) for group 2 (columns 3–4):

| 0.03 | 0.00 | 0.04 | −0.00 |
| 0.00 | 0.00 | −0.00 | 0.00 |

Output from the R package Bain

Posterior covariance matrix of (θ1, θ2, β):

| 0.03 | −0.00 | 0.00 |
| −0.00 | 0.04 | −0.00 |
| 0.00 | −0.00 | 0.00 |

b:

| 0.008 | 0.009 |

Prior covariance matrix of (θ1, θ2, β):

| 4.28 | −0.02 | 0.03 |
| −0.02 | 4.39 | −0.04 |
| 0.03 | −0.04 | 0.06 |

| MBF1u |
|---|
| 0.53 |

Note. The number in italics is referred to in the text. The results also do not change when the expected Fisher information is used.
9 Discussion
In this paper the approximate adjusted fractional Bayes factor BF, which is suited for the evaluation of informative hypotheses if data are sampled from one population, has been generalized to the multiple population approximate adjusted fractional Bayes factor MBF, which is suited for the evaluation of informative hypotheses if data are sampled from one or multiple populations. Both BF and MBF are implemented in the R package Bain.
The result is a versatile and generally applicable approach for the evaluation of informative hypotheses by means of the Bayes factor in a wide range of statistical models. However, as mentioned earlier in the paper, there are a number of topics that deserve further research. The first topic is which sample sizes are required to obtain an accurate normal approximation of the posterior distribution for a wide range of statistical models. The second topic concerns the choice of b, that is, what are the properties of our proposal and what are potential alternatives (the interested reader is referred to Gu et al. (2016) for one study on this topic). The third topic is further development of Bain such that it is easier for users to deal with what was previously called situation 3, that is, models for which numDeriv or other approaches have to be used to obtain the covariance matrix of the parameters of interest for each of the groups in the data set. The fourth topic is more philosophical in nature. It concerns the question whether there is an intrinsic Bayes factor corresponding to our MBF. The fifth topic concerns a modification of the approach presented in this paper such that it can be applied in variable selection problems (see, for example, O'Hara & Sillanpää, 2009). The spike‐and‐slab prior is known to perform well in variable selection problems with sparse data, for example, regression models with a relatively large number of persons to number of predictors ratio, and in which only a few predictors are expected to have a substantial regression coefficient. Spike‐and‐slab prior‐based variable selection is currently an exploratory approach. In the future we will consider a more confirmatory approach based on an efficient evaluation of sets of informative hypotheses in which not only whether a regression coefficient is substantial is considered, but also its direction and (partial) orderings of the regression coefficients.
Acknowledgments
The first author is supported by the Consortium on Individual Development (CID) which is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO grant no. 024.001.003). The third author is supported by a NWO Vidi Grant (number 452‐17‐006).
Appendix 1
1 Further discussion of the consistency of the MBF
We consider two different cases. First, we consider the case where hypothesis Hk only contains inequality constraints, and no equality constraints. Second, we consider the case where Hk contains (only) equality constraints. We will discuss both N → ∞ and one or more but not all of the Ng → ∞.
(48)If Hk only contains inequality constraints, that is, Hk : Rk θ > rk, MBF reduces to
(49)
If, for example, Hk : θ1 > θ2 > θ3, then ck = 1/6 because there are six possible orderings of three parameters that each have an equal probability.
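The statement about the six orderings can be checked directly by simulation; the sketch below uses an arbitrary exchangeable normal prior for the three parameters.

```r
set.seed(42)
draws <- matrix(rnorm(3e5), ncol = 3)                     # exchangeable prior draws for theta1..theta3
mean(draws[, 1] > draws[, 2] & draws[, 2] > draws[, 3])   # close to 1/6
```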
Again let Ng = ag n, where the ag represent the relative size of the samples from the G populations and let θ* denote the true value of θ. If the data support Hk, that is, θ* ∈ Hk, then if n → ∞ then
, and the posterior distribution in the numerator of equation 12 is increasingly concentrated around
and consequently fk → 1. Analogously, if the data do not support Hk, that is,
, fk → 0. This follows from asymptotic theory; see, for example, Gelman et al. (2013, Chapter 4). The prior distribution in the denominator of equation 12 is independent of the Ng and thus independent of n. As can be seen from the combination of equations 22 and 34, for each group the second‐order derivatives (which can for the vast majority of statistical models be written as the sum of Ng contributions) are weighted with bg = J*/G × 1/Ng, that is, asymptotically each element of equation 22 is independent of Ng. Consequently, asymptotically ck is a constant that is independent of n. This is exemplified by equation 35.
(50)Then if θ* ∈ Hk and n → ∞, then MBFkc → 1/ck × cc/0 → ∞ and if
and n → ∞, then MBFkc → 0/ck × cc/1 → 0, which implies consistency.
Theorem 4.1 from De Santis and Spezzaferri (2001) for the generalized fractional Bayes factor and our exposition in the context of Example 1 for the MBF provide evidence for consistency if Hk : Skθ = sk. Further evidence is obtained by realizing that each equality constraint (e.g. θ = 0) can be written as an about‐equality constraint
for z → 0. If each equality constraint is rewritten in this manner, the exposition given at the beginning of this section applies to Hk : Skθ = sk and also to Hk : Skθ = sk, Rkθ > rk.
If Ng → ∞ for some but not all of the G groups, an analogous line of reasoning can be used to show that MBF shows reasonable behaviour. If the data support Hk, that is, θ* ∈ Hk and some of the group sizes increase, then the posterior distribution in the numerator of equation 12 is increasingly concentrated around the parameters corresponding to the groups with increasing group sizes (some of the
) and η*. Consequently, fk will become larger but will not attain its maximum value 1.0. Analogously, if
, fk will become smaller, but will not attain its minimum value 0.0. Note that ck is a constant irrespective of whether n → ∞ or only some of the group sizes go to infinity. These ingredients can be used to show that the behaviour of the MBF is reasonable. Looking at equation 14, it can be seen that if θ* ∈ Hk, MBF will increase (to a boundary value, not to infinity) if some of the group sizes go to infinity; and if
, MBF will decrease (to a boundary value, not to zero). A proof and illustration in the context of a simple model can be found in Example 1.
References