REGISTERED REPORT STAGE 2
Open Access
Open Data, Preregistered

Facial basis of stereotypes: Judgements of warmth and competence based on cross-group typicality/distinctiveness of faces

S. Adil Saribay

Corresponding Author

Kadir Has University, Istanbul, Turkey

Correspondence

S. Adil Saribay, Kadir Has University, Cibali Mah., Kadir Has Cad., Fatih, Istanbul 34083, Türkiye.

Email: [email protected]

Contribution: Conceptualization, Data curation, Methodology, Project administration, Resources, Supervision, Writing - original draft, Writing - review & editing

Šimon Pokorný

Charles University, Prague, Czech Republic

Contribution: Data curation, Investigation, Methodology, Project administration, Resources, Writing - original draft, Writing - review & editing

Petr Tureček

Charles University, Prague, Czech Republic

Center for Theoretical Study, Charles University, Prague, Czech Republic

Contribution: Formal analysis, Methodology, Validation, Visualization, Writing - original draft, Writing - review & editing

Karel Kleisner

Charles University, Prague, Czech Republic

Contribution: Conceptualization, Data curation, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing - original draft, Writing - review & editing

First published: 30 September 2024

Abstract

Human migration is an increasingly common phenomenon and migrants are at risk of disadvantageous treatment. We reasoned that migrants may receive differential treatment by locals based on the closeness of their facial features to the host average. Residents of Türkiye, the country with the largest number of refugees currently, served as participants. Because many of these refugees are of Arabic origin, we created target facial stimuli varying along the axis connecting Turkish and Arabic morphological prototypes (excluding skin colour) computed using geometric morphometrics and available databases. Participants made judgements of two universal dimensions of social perception–warmth and competence–on these faces. We predicted that participants judging faces manipulated towards the Turkish average would provide higher warmth and competence ratings compared to judging the same faces manipulated towards the Arabic average. Bayesian statistical tools were employed to estimate parameter values in multilevel models with intercorrelated varying effects. The findings did not support the prediction and revealed raters (as well as target faces) to be an important source of variation in social judgements. In the absence of simple cues (e.g. skin colour, group labels), the effect of facial morphology on social judgements may be much more complex than previously assumed.

BACKGROUND

As various threats such as climate chaos destabilize the world, migration is bound to be a central phenomenon in the 21st century. Consequently, host populations will increasingly encounter individuals from different regions of the world. In many encounters between host and migrant individuals, each side's physical appearance and patterns of behaviour may be unfamiliar to the other. Decades of social psychological research clearly show that with lack of familiarity comes a risk of disadvantaged treatment (Zebrowitz et al., 2008). On the other hand, cultural outgroups (e.g., migrants) whose appearance somehow resembles the cultural ingroup (e.g., host population) may escape such costs or even receive beneficial treatment from locals (e.g., Zebrowitz et al., 2007). The central aim of the present research is to demonstrate these differential potentials, focusing on the human face.

Over the last several decades, the concept of stereotype–the mental association between a social group and traits–has taken a central position in the field of social psychology. While providing mental efficiency in the processing of social information (Macrae et al., 1994), stereotypes can have various negative consequences because they are linked to prejudice and discrimination and can be used to justify inequality in society (Uhlmann et al., 2010). Research suggests that variation in judgements of outgroup members can be efficiently captured with two dimensions–sometimes referred to as the “Big Two.” Here, we adopt the framework provided by the Stereotype Content Model (SCM) (Fiske et al., 2002), which offers the labels warmth and competence for these two dimensions, but the same idea has emerged in other similar contemporary models (e.g. communion and agency, Abele & Wojciszke, 2007; trustworthiness and dominance, Todorov et al., 2008) as well as earlier work (e.g., Bakan, 1966). Warmth captures the motivation to “get along” and qualities such as being warm, friendly, sincere, and trustworthy while competence captures the motivation to “get ahead” and qualities such as being competent, persistent, capable, and determined. SCM further argues that perceptions of competence are driven by differences in the societal status of groups whereas warmth perceptions are driven by competition between groups. Thus, if members of a host population perceive a group of immigrants from a certain culture as having low status and restricting their access to valued resources (e.g., jobs), they would stereotype individuals in that group as low on both competence and warmth, respectively. 
According to the related work on Behaviours from Intergroup Affect and Stereotypes (BIAS map; Cuddy et al., 2007), such stereotyping would be associated with emotional reactions of contempt and disgust toward this immigrant outgroup and with the behavioural tendency to cause active harm (e.g., violence). In fact, research has demonstrated that judgements of warmth and competence predict actual behaviour (e.g., resource allocation in economic games; Jenkins et al., 2018). Thus, it is important to understand the factors underlying such judgements of outgroup members more precisely.

Both social (e.g., stereotype-based) and facial evaluations contribute to impressions of individual targets (Xie et al., 2021). In fact, models developed separately in these respective domains have important similarities. In the latter domain, Todorov et al.'s (2008) model suggests that facial evaluations are best captured by two dimensions: valence (trustworthiness) and dominance. These dimensions show considerable similarity to SCM's warmth and competence, respectively. Research suggests that facial dominance is somewhat related to perceived competence (Hensel et al., 2020), though this may be exclusive to male targets (Wang et al., 2018; see also Oliveira et al., 2019). There is an even stronger overlap between warmth and trustworthiness (Sutherland et al., 2016). For our purposes, this overlap is reassuring and helpful but not absolutely required as our aim is to assess the relation between warmth and competence judgements and morphological characteristics of the face (see Imhoff et al., 2013; for a similar approach).

Especially since the beginning of the 2000s, the work on social perception within Social Psychology has increasingly been integrated with work on face perception in other subfields of Psychology. The present effort is aimed at pushing such integration further using the framework of the “Cross-Group Typicality/Distinctiveness Metric” (CTDM; Kleisner et al., 2019). In earlier efforts, several key observations were made: First, facial typicality plays an important role in social perception. For instance, faces more typical of the ingroup are perceived as more trustworthy than outgroup-typical faces in cross-cultural encounters (Bürhan & Alici, 2023; Kunst et al., 2018; Sofer et al., 2017; Zebrowitz et al., 2007). This is sensible in light of other work showing that a feeling of familiarity, which would be triggered to a much greater extent by viewing a face that is more typical of the ingroup versus the outgroup, is linked to liking and attraction (Zebrowitz et al., 2008). On the flip side, members of minority groups whose faces are perceived as more typical of their group might be subject to harsher prejudice and discrimination from majority group members (Maddox, 2004). Dangerously, this could drive harsher punishment, even capital punishment, in actual court cases (Blair et al., 2004; Eberhardt et al., 2006).

Second, most work on these issues is conducted within Social Psychology and relies on impressionistic treatment of faces, rather than on precise measurement of facial characteristics. For instance, how typical a face is of (or how distinct it is from) a given target group is often decided by human raters (Hebl et al., 2012). While this is a sensible approach, it is imprecise, subject to the same human biases under investigation (i.e., stereotypical perception), and could cause replication difficulties since judgements of human faces would change if the raters changed. While work on face perception increasingly takes advantage of new objective metrics, such innovation is still not well integrated with the work on stereotyping. We aim to contribute to the bridging of this gap.

Third, computational biometric methods, such as geometric morphometrics (GM; Rohlf & Marcus, 1993), can be used to improve the social-psychological work on these issues. It is possible to construct a quantitative, objective (i.e., not relying on human judgement), sample-dependent (data-driven) index of how typical or distinct vis-à-vis given populations an individual face is, with major advantages. Importantly, while subjective ratings of typicality are necessarily influenced by category membership because the social groups defining the face-to-be-rated are usually impossible to hide from raters, computational methods are based on morphology of the face alone and are not unnecessarily influenced by human-constructed categories such as race and ethnicity. In addition, subjective ratings may be subject to confounding of the distance of a face from various group prototypes. That is, social-psychological research often relies on the “(stereo)typicality” of faces in terms of a single social category in preparing stimuli or obtaining ratings from participants (e.g., Blair et al., 2002). Assume that an African-American face is labelled as being “low in stereotypicality.” A reasonable assumption is that this means greater distance between this face and the African-American facial prototype, compared to another African-American face labelled as “highly stereotypical.” However, this distance from the ingroup prototype says nothing about the distance between this face and other prototypes, such as those of White, Hispanic, and Asian Americans. This is important because if we want to predict how an African-American face will be treated by a White-American observer, we need to know not just about its typicality/distinctiveness vis-à-vis the African-American prototype but also its typicality/distinctiveness vis-à-vis the White-American prototype. 
Some work has indeed asked participants to make such ratings on a scale with two anchors representing racial groups such as Asian versus Caucasian (Gwinn & Brooks, 2013). This practice is not common, and it is not completely clear how well human raters can perform such complicated ratings and how reliable and valid they are. Computational methods may offer some distinct advantages over subjective ratings in this sense because they can be used to compute the position of a face along an axis connecting two groups. Indeed, this is what the CTDM precisely does and does better than similar computational alternatives as demonstrated in earlier work (Kleisner et al., 2019).

Earlier, evidence that computations based on CTDM behave in line with theoretical expectations was provided (Kleisner et al., 2019). However, more convincing evidence for the utility and validity of CTDM would be possible if it was demonstrated to predict a complex real-world phenomenon, such as stereotypical or prejudicial responses in an intergroup context. Such a demonstration would in turn help to make the social-psychological work on stereotyping and prejudice more precise and objective in terms of its treatment of faces and more integrated with cutting-edge computational methods from other fields of science.

It is noteworthy that the cross-cultural nature of human interactions in the modern globalized world cannot be properly analysed by the concept of typicality based on local-specific approaches such as the widely used concept of intra-populational “averageness” that measures proximity to the population mean. Therefore, CTDM was proposed as a simple measure of distinctiveness/typicality based on the position of an individual along the axis connecting the facial averages of two populations under comparison. Accordingly, the more distant a face is from its ingroup population mean toward the outgroup mean, the more distinct it is (vis-à-vis the ingroup) and the more it resembles the outgroup standards. In other words, using CTDM, it is possible to gauge the extent to which a person belonging to a certain population (e.g., Syrian refugee) shares facial features with individuals from another population (e.g., Turkish host) within that particular context. This information uniquely reveals the person's facial features vis-à-vis the ecology of faces (i.e., the social context encountered in the real world), which can then be associated with various perceptions and judgements about that person. Additionally, CTDM enables the creation of manipulated stimuli that preserve the inherent variations in human facial features within a particular population. Thus, CTDM utilizes a typicality/distinctiveness concept sensitive to inter-cultural dynamics which is applicable to situations where individuals from multiple populations are compared.
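
The idea of locating a face along the axis connecting two group averages can be sketched as an orthogonal projection. The snippet below is only an illustration under stated assumptions: the function name `axis_position`, the 0 to 1 scaling (0 at the ingroup mean, 1 at the outgroup mean), and the toy coordinates are all hypothetical and are not taken from the CTDM implementation of Kleisner et al. (2019).

```python
from typing import Sequence


def axis_position(face: Sequence[float],
                  ingroup_mean: Sequence[float],
                  outgroup_mean: Sequence[float]) -> float:
    """Project a face onto the axis joining two group means.

    Returns 0.0 at the ingroup mean and 1.0 at the outgroup mean
    (this 0-1 scaling is an illustrative assumption).
    """
    # Vector from the ingroup mean to the outgroup mean.
    axis = [o - i for i, o in zip(ingroup_mean, outgroup_mean)]
    # Vector from the ingroup mean to the face.
    offset = [f - i for f, i in zip(face, ingroup_mean)]
    axis_sq = sum(a * a for a in axis)
    if axis_sq == 0:
        raise ValueError("group means coincide, so the axis is undefined")
    return sum(a * d for a, d in zip(axis, offset)) / axis_sq


# Invented toy data: flattened (x, y) coordinates of two landmarks.
host_mean = [0.0, 0.0, 1.0, 0.0]
migrant_mean = [0.0, 0.0, 1.0, 2.0]
toy_face = [0.1, 0.0, 1.0, 0.5]

position = axis_position(toy_face, host_mean, migrant_mean)
```

In this toy case the face lies a quarter of the way from the host mean to the migrant mean. Shape variation orthogonal to the axis (here, the 0.1 offset in the first coordinate) does not affect the score.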

For the purposes of applying the CTDM in a real-world intergroup context, we sampled our participants from Türkiye, which currently hosts around 4 million refugees and has been the country hosting the largest number of refugees for nearly a decade (https://reporting.unhcr.org/turkey#toc-narratives). Most of the refugees in Türkiye are of Syrian origin and ethnic Arabs make up the largest proportion of the population of Syria (https://www.arab-reform.net/publication/the-impossible-partition-of-syria/; https://www.cia.gov/the-world-factbook/countries/syria/). Thus, the pairing of Turkish (host) participants and Arabic faces is perhaps the most suitable and relevant test of the present hypotheses in today's world and also addresses various recent calls to increase the diversity of stimuli in this area of research (Cook & Over, 2021; Foo et al., 2022). However, we go beyond simply using Arabic faces and aim to demonstrate the systematic variance in judgements of warmth and competence based on exactly where a particular face falls on the axis connecting Turkish and Arabic facial averages, using CTDM. More importantly, to demonstrate the causal influence of facial typicality/distinctiveness vis-à-vis these ethnic groups on stereotyping, we manipulate individual faces by moving them on the axis between the Turkish-Arabic averages. That is, holding all other features constant, we examine differences in responses to the same (categorically Arabic) face when it conforms more to the Turkish average facial appearance versus to the Arabic average facial appearance. First, we validated our carefully produced stimuli, based on standardized images, via a pilot study. Then, we tested the causal influence of faces on impressions employing Bayesian multilevel modelling.

We hypothesized that greater conformity to the ingroup (i.e., host population) average face will evoke stronger warmth and competence judgements from members of the host population. Thus, the specific prediction was that Turkish participants judging faces manipulated towards the Turkish average will provide higher warmth and competence ratings compared to judging the same faces manipulated toward the Arabic average. We aimed to test whether subtle but systematic differences in facial characteristics of the same group of immigrants affect how they are perceived by members of the host population.

METHODS

Ethics information

All experimental protocols involving humans were in accordance with national, international, and institutional guidelines or the Declaration of Helsinki. Informed consent was obtained from all participants. Participants received monetary compensation and/or a chance to participate in a lottery for a small monetary prize (in the form of a gift card) in return for their effort. All procedures were approved by the Institutional Review Board of the Faculty of Science of the Charles University (protocol ref. number 2020/04).

Design

To prepare the facial stimuli, we used facial photographs from two databases: the Bogazici face database (Saribay et al., 2018) and the Caucasian and North African French Faces (CaNAFF; Courset et al., 2018). From the CaNAFF database, we selected the 50 individuals that ranked highest on the prototypicality scale (meaning they were the most typical North Africans). North Africans are mostly ethnic Arabs; therefore, these individuals would appear as the most Arabic looking and are suitable for use in the intergroup context (i.e., Turkish hosts–Arabic refugees) under investigation. From the 115 males in the Bogazici database, we selected the 111 individuals that did not wear glasses. Using these faces, we aimed to create faces that would be presented in the experiment and that conformed closely to either Turkish or Arabic facial morphology, as explained below.

Each face was digitized in the TPSDig2 software version 2.30 (Rohlf, 2017) using 72 landmarks (36 landmarks, 36 semi-landmarks). Landmarks are corresponding locations found on each face defined by anatomical structures. Semi-landmarks outline the curves between landmarks (Figure 1; for definitions see Table 1).

FIGURE 1. Landmark layout. Full dots stand for landmarks; circles stand for semi-landmarks. The figure depicts an artificial face. For landmark definitions, see Table 1.
TABLE 1. Landmark and semilandmark definitions.
1 TRICHION: Midpoint of the hairline, that is, on the hairline through the midline of the forehead
2 MENTON: the lowest point of the lower border of the mandible (along the jaw line)
3 LABIALE INFERIUS: the midline point of the lower vermilion line (border of the lower lip)
4, 5 the midpoints between LABIALE INFERIUS (3) and CHEILON (6,7)
6, 7 CHEILON: the outer corner of the mouth where the outer edges of the upper and lower lip meet
8 LABIALE SUPERIUS: the upper midpoint of the upper vermilion line, a point of maximum local curvature between the christae philtri
9, 10 CHRISTA PHILTRI: the point on the crest of the philtrum, the vertical groove in the median portion of the upper lip on the vermilion border
11, 12 the midpoints between LABIALE SUPERIUS (8) and CHRISTA PHILTRI (9,10)
13 SUBNASALE: midpoint of the angle at the columella base where the lower border of the nasal septum and the surface of the upper lip meet
14, 15 COLUMELLA APEX: highest point of the columella crest at the apex of the nostril
16, 17 ALARE: the most lateral point on the ala contour
18, 19 ALAE ORIGIN (alar curvature point): the most posterolateral point of the curvature of the base of the nasal alae
20, 27 ENDOCANTHION: the inner corner of the eye fissure where eyelids meet
21, 28 EXOCANTHION: the outer corner of the eye fissure where eyelids meet
22, 30 PALPEBRALE SUPERIUS: the highest visible point of the iris
23, 29 PALPEBRALE INFERIUS: the lowest visible point of the iris
24, 31 Iris Outer Border: the rightmost point of the right iris (leftmost of the left iris)
25, 32 Iris Inner Border: the leftmost point of the right iris (rightmost of the left iris)
26, 33 the midpoint between ENDOCANTHION (20,27) and PALPEBRALE INFERIUS (23,29)
34, 36 SUPERCILIARE LATERALE: the most lateral point of the eyebrow
35, 37 SUPERCILIARE MEDIALE: the most medial point of the eyebrow
38–40, 44–46 the eyebrow upper curve: three semilandmarks with regular spacing between SUPERCILIARE LATERALE (34,36) and SUPERCILIARE MEDIALE (35,37)
41–43, 47–49 the eyebrow lower curve: three semilandmarks with regular spacing between SUPERCILIARE LATERALE (34,36) and SUPERCILIARE MEDIALE (35,37)
50, 59 ZYGION: the most lateral point of the zygomatic arch
51–58, 60–67 the lower jaw: eight semilandmarks with regular spacing between MENTON (2) and ZYGION (50,59)
68 STOMION: centre of the lip crack, lying on the midline between LABIALE SUPERIUS (8) and LABIALE INFERIUS (3)
69, 70 the midpoints between STOMION (68) and CHEILON (6,7)
71, 72 Pupil: center of the pupil

We ran generalized Procrustes analyses on all faces (111 Turkish and 50 North African). Procrustes-aligned configurations were projected into a tangent space. Semi-landmark positions were optimized based on minimizing the bending energy between corresponding points. Procrustes residuals were then symmetrized so that the left and right sides were reflected, and the original and mirrored configurations averaged. Symmetrization was performed after semi-landmarks were slid along their tangent directions.
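
The core step of Procrustes alignment can be illustrated with a minimal two-dimensional sketch. It superimposes one landmark configuration onto a single reference by removing position, scale, and rotation; generalized Procrustes analysis repeats this against an iteratively updated mean shape. The function names are hypothetical, the toy coordinates are invented, and the sliding of semi-landmarks and the symmetrization described above are deliberately left out. The actual analysis was presumably run in dedicated morphometric software rather than with code like this.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def centre_and_scale(shape: Sequence[Point]) -> List[Point]:
    """Translate the centroid to the origin and scale to unit centroid size."""
    n = len(shape)
    cx = sum(x for x, _ in shape) / n
    cy = sum(y for _, y in shape) / n
    centred = [(x - cx, y - cy) for x, y in shape]
    size = math.sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / size, y / size) for x, y in centred]


def align_to_reference(shape: Sequence[Point],
                       reference: Sequence[Point]) -> List[Point]:
    """Rotate the normalized shape to minimize squared distance to the reference."""
    src = centre_and_scale(shape)
    ref = centre_and_scale(reference)
    # Closed-form optimal rotation angle in 2D (no reflection allowed).
    c = sum(sx * rx + sy * ry for (sx, sy), (rx, ry) in zip(src, ref))
    s = sum(sx * ry - sy * rx for (sx, sy), (rx, ry) in zip(src, ref))
    theta = math.atan2(s, c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in src]


# Invented toy configuration and a rotated, enlarged, shifted copy of it.
reference = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
angle = math.radians(30)
moved = [(3 * (x * math.cos(angle) - y * math.sin(angle)) + 5.0,
          3 * (x * math.sin(angle) + y * math.cos(angle)) - 2.0)
         for x, y in reference]

aligned = align_to_reference(moved, reference)
target = centre_and_scale(reference)
```

Because `moved` differs from `reference` only in position, size, and orientation, `aligned` coincides with the normalized reference. With real faces, the residual differences after alignment are the shape information of interest.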

We calculated the mean facial shape for both Turkish and North African faces. Subsequently, we selected the 50 Turkish faces that were closest to the Turkish average facial configuration (calculated based on all 111 faces). To manipulate the Turkish faces, we first calculated the distance between the Turkish and North African mean facial configurations. Then, we added or subtracted ½ of this distance to or from each of the 50 selected Turkish faces. Using TpsSuper software, we unwarped all 50 facial images to target configurations corresponding to the Turkish mean ± ½ of the distance between the Turkish and North African mean faces. This resulted in 50 faces manipulated to conform to Turkish typical facial shape and 50 faces manipulated to have more North African (Arabic) facial morphology. The two versions of each facial photograph differed solely in their facial shape. No other features such as skin colour were manipulated.
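
The shape manipulation described above can be sketched as shifting a landmark configuration by a fraction of the difference between the two mean configurations. This is a simplified illustration only: the function name `shift_along_axis`, the sign convention (positive fractions move toward the North African mean), and the toy coordinates are assumptions, and the image warping performed in TpsSuper is not reproduced.

```python
from typing import List, Sequence


def shift_along_axis(face: Sequence[float],
                     turkish_mean: Sequence[float],
                     arabic_mean: Sequence[float],
                     fraction: float) -> List[float]:
    """Move a flattened landmark configuration by `fraction` of the
    Turkish-to-Arabic mean difference (sign convention is an assumption)."""
    return [f + fraction * (a - t)
            for f, t, a in zip(face, turkish_mean, arabic_mean)]


# Invented toy data: flattened (x, y) coordinates of two landmarks.
turkish_mean = [0.0, 0.0, 1.0, 0.0]
arabic_mean = [0.2, 0.0, 1.0, 0.4]
face = [0.05, 0.01, 0.98, 0.02]

arabized = shift_along_axis(face, turkish_mean, arabic_mean, +0.5)
turkified = shift_along_axis(face, turkish_mean, arabic_mean, -0.5)
```

Under this convention, the two versions of a face differ by exactly one full mean difference, while the face's idiosyncratic deviation from the mean is carried along unchanged in both versions.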

All stimuli were afterwards edited in Adobe Photoshop CS6. They were cropped to a 4:5 ratio, so that the face was centred and the eyes were at the same absolute height in each facial image. Possible artefacts resulting from the shape manipulation process were fixed using the Liquify tool. These edits were done for the purpose of a natural look and did not affect either the overall shape or the shape of particular facial traits. Visible parts of clothing were turned to greyscale with the black and white filter. A pilot study was conducted to provide independent evidence that this process of the manipulation of faces worked as intended (see Supporting information section 1). Although the pilot study provided evidence that the manipulation worked as intended in general, there was considerable face- and rater-based variance in the perception of these faces as relatively more Arabic-looking versus more Turkish-looking. Thus, we urge the reader to take these nuances and the overall weak effect of Arabization on perception into consideration in interpreting the results of the main study.

A rating session was prepared with the Qualtrics online software. Participants were informed that the study was about how people perceive faces. No information was given about the ethnicity of the faces or the manipulation. Participants provided informed consent and reported their sex, age, height, and weight. In the first rating block, participants were presented with either the Arabized or the Turkified version of each photograph in a randomized order, one stimulus at a time. They were asked to rate the warmness (on a slider scale from 0–cold to 100–warm) of the presented faces. In a second block, the stimuli were presented again, and participants were asked to rate the competence (on a slider scale from 0–very incompetent to 100–very competent) of the presented faces. These ratings were paired, so that each participant rated the warmness and competence of the same manipulated version (either Arabized or Turkified) of each photograph, but the order of the stimuli was randomized separately in each of the blocks. Participants could not return to previous stimuli and change their response. No time limit was set, and each rating screen (face and slider scale) remained on display until the participant provided a response. Participants who did not complete the ratings were excluded from the analysis.
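
The logic of the rating session can be sketched as follows. The actual randomization was handled in Qualtrics; the function name `build_session` and the independent random choice of version per face are assumptions made for illustration (the study may instead have used balanced lists).

```python
import random
from typing import List, Optional, Tuple

Trial = Tuple[int, str, str]  # (face id, version shown, rated trait)


def build_session(n_faces: int = 50,
                  seed: Optional[int] = None) -> Tuple[List[Trial], List[Trial]]:
    """Create the warmth block and the competence block for one participant."""
    rng = random.Random(seed)
    # One version per face, kept identical across both blocks (paired ratings).
    versions = {face_id: rng.choice(["Arabized", "Turkified"])
                for face_id in range(1, n_faces + 1)}
    # Presentation order is shuffled separately in each block.
    warmth_order = list(versions)
    rng.shuffle(warmth_order)
    competence_order = list(versions)
    rng.shuffle(competence_order)
    warmth_block = [(f, versions[f], "warmth") for f in warmth_order]
    competence_block = [(f, versions[f], "competence") for f in competence_order]
    return warmth_block, competence_block


warmth_block, competence_block = build_session(seed=1)
```

Each face appears once per block, always in the same version for a given participant, so every warmth rating has a matching competence rating of the identical stimulus.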

Sampling

We sampled participants from the pool of an internet research company in Türkiye, which includes individuals from a wide range of demographics. While our data-analytic approach was Bayesian, our proposal could still benefit from a power analysis, as explained here. The power analysis had two stages. At the first stage, we determined the minimal sample needed to conclusively detect a small effect (Cohen's d = 0.1) within a frequentist framework, since there are no universally accepted guidelines for sample size determination in Bayesian analysis. We used the approach suggested by Westfall et al. (2014). For the counterbalanced design with a variance parcellation dominated by varying intercepts (residual VPC = 0.3, participant intercept VPC = 0.3, stimulus intercept VPC = 0.3, and participant slope VPC = 0.1), the suggested number of participants to achieve 80% power was $n=335$. So $n=335$ was set as the minimal number of participants to be collected to avoid the Type II error of missing a small but meaningful effect.

Since we were working in a Bayesian framework with two possibly intercorrelated response variables, we did not limit ourselves to the analytical suggestion of a frequentist single-response-variable tool. After collecting 100 responses (43 men and 57 women), we proceeded to stage 2 of the power analysis, which was based on the full model. We analysed the data and obtained the posterior distribution of all model parameters (see the set of equations in Supporting information section 2), including a posterior distribution of variance parcellation parameters describing unstandardized data ($\sigma_{r_{a_w}}, \sigma_{r_{b_w}}, \sigma_{r_{a_c}}, \sigma_{r_{b_c}}, \sigma_{t_{a_w}}, \sigma_{t_{b_w}}, \sigma_{t_{a_c}}, \sigma_{t_{b_c}}$). Based on the initial sample of $n=100$, we bootstrapped larger samples and determined the effects of interest, as specified in the preregistration available at https://osf.io/hgwtd/.

There were no conclusively positive or negative effects large enough to be detectable under $n=500$ with power 0.8 if they have the same magnitude as the posterior mean estimated on $n=100$. (By conclusively positive or negative, we mean an estimate whose posterior distribution's 89% Compatibility Interval (CI) does not overlap 0.) All slopes $\left(b_w[G], b_c[G]\right)$ and differences between them had posterior means close to zero. Therefore, no less than $n=235$ further responses had to be collected to comply with the results of the stage 1 power analysis (to achieve total sample size of 335). The original 100 responses were analysed together with the responses collected after the stage 2 power analysis to obtain the final results. Both stages of power analysis and the final analysis were preregistered and thoroughly described at https://osf.io/ncdtb/.
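
The criterion of a "conclusive" effect can be made concrete with a short sketch that checks whether an 89% interval of posterior draws excludes zero. Two points are assumptions here: the interval is computed as an equal-tailed percentile interval (a highest-density interval would be an alternative), and the posterior draws are simulated stand-ins rather than output of the fitted models.

```python
import random
from typing import Sequence, Tuple


def percentile_interval(samples: Sequence[float],
                        mass: float = 0.89) -> Tuple[float, float]:
    """Equal-tailed interval containing `mass` of the draws."""
    ordered = sorted(samples)

    def quantile(q: float) -> float:
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    tail = (1.0 - mass) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def is_conclusive(samples: Sequence[float], mass: float = 0.89) -> bool:
    """True when the interval lies entirely above or entirely below zero."""
    lower, upper = percentile_interval(samples, mass)
    return lower > 0 or upper < 0


# Simulated stand-ins for posterior draws of a slope (values invented).
rng = random.Random(1)
near_zero_slope = [rng.gauss(0.2, 1.0) for _ in range(4000)]
clearly_positive_slope = [rng.gauss(3.0, 1.0) for _ in range(4000)]

near_zero_conclusive = is_conclusive(near_zero_slope)
positive_conclusive = is_conclusive(clearly_positive_slope)
```

A slope whose posterior is centred near zero relative to its spread yields an interval that straddles zero and is therefore not counted as conclusive, which corresponds to the situation reported for the stage 2 power analysis.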

This paper was submitted as a registered report so the whole procedure was reviewed and approved before the data collection. The research design is summarized in Table 2.

TABLE 2. Design table.
Question: For Turkish raters, are impressions of categorically foreign (Arabic) faces that are morphologically closer to the Turkish (ingroup) average more positive than impressions of faces that are closer to the Arabic (outgroup) average?

Hypothesis: H1: Warmth and competence ratings will be conclusively higher in Turkified (more ingroup-like) faces than in Arabized (more outgroup-like) faces.

H0: The 89% CI of the mean difference in warmth and competence between Turkified and Arabized faces will overlap 0.

Sampling plan (e.g. power analysis): A two-stage power analysis was conducted. The first, frequentist step guards against missing small effects. The second Bayesian step requires data from the first 100 participants based on which expected effect sizes will be computed. If no meaningful effects are discovered or if all effects are strong enough to come out as conclusively different from 0, 235 further responses are collected as suggested by the frequentist approach. If there are small but meaningful effects, the total sample size is calibrated with a set of computer simulations (we are willing to collect up to 400 further responses) to allow indicating effects of given magnitude as conclusive.

Analysis plan: The pair of warmth and competence ratings in each model will be treated as a multivariate normal response variable. Mean warmth and mean competence ratings will be modelled as a function of the experimental target manipulation, rater's gender and the interaction between the manipulation and rater's gender.

To account for repetitions in the data, each target and rater in the model is characterized by two varying intercepts (one for each characteristic) and up to two varying slopes.

Three models are fitted on the data and compared (using Widely Applicable Information Criterion, WAIC): (1) the null model that works with fixed and varying intercepts only (2) the model that includes also fixed and varying slopes, (3) the model that allows different fixed intercept and slope for each gender. See extended analysis plan for details.

Interpretation given to different outcomes: If there are no meaningful effects, we conclude that the manipulation is not strong enough to elicit a detectable reaction. If Arabized faces are rated as more competent, we conclude that the status of a refugee brings positive connotations in perceivers' minds stemming from factors such as the resilience of refugees in the face of environmental harshness. If Arabized faces are rated as more warm, we conclude that the positive content of the Arabic stereotype (e.g., hospitable people) is too strong to be overridden by its negative aspects (e.g., freeriding refugee) or that common ingroups (e.g., Muslims) facilitate higher warmth perceptions.

Analysis

Model description

We manipulated 50 facial photographs of average Turks (faces that were closest to the Turkish average) to be either 50% more or 50% less Arabic (the Arabic average was based on 50 facial photographs). We administered the manipulated photographs, either the “Arabized” or the “Turkified” version of each face, to every rater to find out how the “Arabization” influences subjective ratings of warmth and competence. The ratings were recorded with continuous sliders over the implicit scale from 0 to 100 with a step size of 0.01 to capture as much nuance as possible. Each scale, anchored by its extremes (“cold-warm” or “incompetent-competent”), was displayed below each photograph. Each rater was presented with a set of 50 warmth rating screens (showing either the Arabized or the Turkified face; Figure 2) in random order, followed by a set of 50 competence rating screens.

FIGURE 2. Examples of warmth and competence rating screens in the Qualtrics online questionnaire. The figure depicts artificial faces.

We employed Bayesian statistical tools to estimate parameter values in multilevel models with intercorrelated varying effects. The pair of warmth and competence ratings in each model was treated as a multivariate normal response variable. Mean warmth and mean competence ratings were modelled as a function of the experimental target manipulation, rater's gender, and the interaction between the manipulation and rater's gender. The manipulation entered the analysis as a contrast variable $\mathcal{A}$ with two values, $\mathcal{A}=0.5$ for the “Arabized” and $\mathcal{A}=-0.5$ for the “Turkified” target. Rater's gender ($G$) was treated as an index variable with two levels: $G=1$ for female and $G=2$ for male raters. Using this index variable, we drew not only two independent intercepts ($a_i[1]$ and $a_i[2]$) but also independent slopes ($b_i[1]$ and $b_i[2]$) to get the interaction between manipulation and gender.
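
A consequence of the ±0.5 contrast coding is worth spelling out: the slope for each gender equals the expected difference between the Arabized and Turkified versions, and the intercept equals their midpoint. The sketch below shows only the fixed part of the warmth predictor; the varying rater and target effects are omitted, the parameter values are invented, and the function name is hypothetical. The full model is given in Supporting information section 2.

```python
from typing import Dict, Tuple

ARABIZED, TURKIFIED = 0.5, -0.5   # contrast values of the manipulation
FEMALE, MALE = 1, 2               # levels of the gender index variable


def expected_rating(intercepts: Dict[int, float],
                    slopes: Dict[int, float],
                    gender: int,
                    contrast: float) -> float:
    """Fixed part of the linear predictor: a[G] + b[G] * A."""
    return intercepts[gender] + slopes[gender] * contrast


# Invented parameter values, for illustration only.
a_w = {FEMALE: 52.0, MALE: 48.0}
b_w = {FEMALE: -1.0, MALE: 2.0}

effects: Dict[int, Tuple[float, float]] = {}
for g in (FEMALE, MALE):
    arab = expected_rating(a_w, b_w, g, ARABIZED)
    turk = expected_rating(a_w, b_w, g, TURKIFIED)
    # (difference between versions, midpoint of the two versions)
    effects[g] = (arab - turk, (arab + turk) / 2)
```

Under this coding, the prediction of higher ratings for Turkified faces corresponds to negative slopes, and a difference between the two gender-specific slopes expresses the interaction between manipulation and gender.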

To account for repetitions in the data, each target and rater in the model was characterized by two varying intercepts (one for each characteristic) and two varying slopes. These effects correspond to variance in average ratings (i.e. some raters might give systematically higher or lower ratings than average; some faces may systematically receive higher/lower ratings than average) and variance in slope (i.e. some raters might systematically rate Arabized faces as warm, others as cold; some faces, when Arabized, may systematically receive higher competence ratings, while others may be systematically rated as less competent after this manipulation). Varying intercepts and slopes per rater and per target were drawn together as possibly intercorrelated sets of four effects (e.g., targets that receive high warmth ratings may not benefit much from any manipulation; raters that give very high average warmth ratings may tend to give high competence ratings as well).
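The structure described in the two paragraphs above can be condensed into a schematic linear predictor. This is only a sketch: the authoritative specification is the one in the Supporting information, and the symbols for the varying effects ($\alpha$, $\beta$), the rater and target indices ($r$, $t$), and the covariance matrices ($\Sigma$, $S$) are notation assumed here for illustration.

```latex
\begin{aligned}
(W, C) &\sim \mathrm{MVNormal}\big((\mu_w, \mu_c),\ \Sigma\big) \\
\mu_i &= a_i[G] + \alpha_i^{\text{rater}}[r] + \alpha_i^{\text{target}}[t]
        + \big(b_i[G] + \beta_i^{\text{rater}}[r] + \beta_i^{\text{target}}[t]\big)\,\mathcal{A},
        \qquad i \in \{w, c\} \\
(\alpha_w, \alpha_c, \beta_w, \beta_c)^{\text{rater}}[r] &\sim \mathrm{MVNormal}\big(\mathbf{0},\ S^{\text{rater}}\big) \\
(\alpha_w, \alpha_c, \beta_w, \beta_c)^{\text{target}}[t] &\sim \mathrm{MVNormal}\big(\mathbf{0},\ S^{\text{target}}\big)
\end{aligned}
```

Each rater and each target thus contributes four possibly correlated effects: one intercept and one slope offset for warmth, and one of each for competence.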

Three models were fitted to the data and compared using the widely applicable information criterion (WAIC): (1) the null model, which works with fixed and varying intercepts only, (2) the model that also includes fixed and varying slopes, and (3) the model that allows a different fixed intercept and slope for each gender. The syntax of the full model as a set of equations is available as Supporting information section 2. Its simplification to the other models is straightforward. Parameter estimates from all models that are not included in the main text are reported in the Supporting information S1. The main text discusses fixed parameter estimates from the best model, as promised in the registration of this report.

We assigned weakly regularized unbiased priors to all model parameters. Since we worked with unstandardized ratings, we used a normally distributed prior with $mean=50$ and $sd=25$ for the fixed intercepts. Thus, the scale extremes lie on the ±2SD boundaries of the prior distribution. Fixed slopes were assigned a normally distributed prior with $mean=0$ and $sd=25$ to allow for a similarly permissive span of possible effects, centred on a mean that is agnostic about the direction of the manipulation effect. Strictly positive parameters, i.e. standard deviations of the ratings and standard deviations of the varying effects, were characterized by exponential prior distributions with rate $\lambda=0.04$. This rate implies a prior mean of $1/\lambda=25$, which is equal to the standard deviation of the other priors. The prior distribution of correlation matrices was specified using the LKJ vine and onion method (Lewandowski et al., 2009). The parameter $\eta=2$ of the distribution indicates a permissive unbiased prior that is sceptical of extreme coefficients like $-1$ and $1$ (see the comparison of prior and posterior of the full model in Figure S6).
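The arithmetic behind these prior choices can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and are not taken from the authors' code.

```python
import math
from statistics import NormalDist

# Prior for the fixed intercepts as described in the text: Normal(50, 25).
intercept_prior = NormalDist(mu=50, sigma=25)

# Scale extremes (0 and 100) expressed in prior standard deviations from the mean.
z_low = (0 - intercept_prior.mean) / intercept_prior.stdev
z_high = (100 - intercept_prior.mean) / intercept_prior.stdev

# Share of prior mass that falls inside the 0-100 rating scale.
mass_on_scale = intercept_prior.cdf(100) - intercept_prior.cdf(0)

# Exponential prior for standard deviations: the mean equals 1 / rate.
rate = 0.04
exp_mean = 1 / rate
# The median of an exponential distribution is ln(2) / rate.
exp_median = math.log(2) / rate

print(z_low, z_high)
print(round(mass_on_scale, 3))
print(exp_mean, round(exp_median, 2))
```

Roughly 95% of the intercept prior mass lies within the rating scale, which is what placing the scale extremes at ±2SD amounts to.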

All models were coded as decomposed and decentred to increase sampling performance. Stan's Hamiltonian Monte Carlo (HMC) infrastructure (Stan Development Team, 2021) was used with the assistance of McElreath's (2020) rethinking package. The models were encoded in Stan syntax so that the log-likelihood of each observed rating pair could be evaluated properly. These log-likelihoods were used in the calculation of WAIC. The supplementary code from the Stage 1 preregistration contains the model definition in the simplified ulam syntax for readers' convenience. Those models cannot, however, be used in the model comparison. Constructing the likelihood function for the concatenated pairs of warmth and competence ratings was crucial for extracting reliable regression parameter estimates. A set of two separate regression models would lead to biased predictions. Suppose, for example, that the manipulation directly influenced warmth ratings (e.g., Arabized faces are rated as warmer) but not competence ratings, and that faces perceived as “cold” are more likely to be perceived as “competent” regardless of this effect. Then an independent regression model of competence ratings would yield an illusory relationship between manipulation and competence (Arabized faces would seem to be rated as less competent). A multivariate regression model such as the one outlined above allows the extraction of correct values of both slopes and the residual correlation between the ratings.
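The hypothetical scenario in this paragraph can be illustrated with a toy simulation. The sketch below is not the authors' multivariate model: it uses ordinary least squares, with warmth added as a covariate as a simple stand-in for modelling the two ratings jointly, and all effect sizes and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Contrast-coded manipulation: +0.5 "Arabized", -0.5 "Turkified".
A = rng.choice([-0.5, 0.5], size=n)

# Hypothetical data-generating values (not estimates from the study):
# the manipulation shifts warmth by 4 points and has no direct effect on competence.
warmth = 50 + 4.0 * A + rng.normal(0, 10, size=n)
# "Cold" faces are perceived as more competent regardless of the manipulation.
competence = 50 - 0.5 * (warmth - 50) + rng.normal(0, 10, size=n)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Competence modelled on its own: the slope on A absorbs the path through warmth.
slope_alone = ols(competence, A)[1]
# Competence modelled with warmth held constant: the slope on A is close to zero.
slope_given_warmth = ols(competence, A, warmth)[1]

print(round(slope_alone, 2), round(slope_given_warmth, 2))
```

Under these assumptions the competence-only regression yields a clearly negative slope even though the manipulation never acts on competence directly, whereas the slope shrinks towards zero once warmth is held constant.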

Contrasts between equivalent estimates were drawn from joint posterior samples. For example, to get the posterior distribution of the difference between warmth regression slopes ($\delta_w$), one subtracts the estimate for men from the estimate for women in each posterior sample: $\delta_w = b_w[1] - b_w[2]$. Similarly, differences between equivalent parameter estimates for warmth and competence were calculated and summarized.
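A per-sample contrast of this kind can be sketched as follows. The posterior draws below are random placeholders rather than output from the fitted models, and the central percentile interval is an assumption made here; the text does not state which type of 89% interval was computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 4000

# Placeholder posterior draws of the warmth slope for each gender index
# (column 0: G = 1, women; column 1: G = 2, men). Values are arbitrary.
b_w = np.column_stack([
    rng.normal(0.0, 0.5, n_samples),
    rng.normal(1.0, 0.5, n_samples),
])

# Subtracting within each draw preserves any posterior correlation between
# the two slopes (the placeholder draws above happen to be independent).
delta_w = b_w[:, 0] - b_w[:, 1]


def summarize(draws, mass=0.89):
    """Posterior mean and central percentile interval covering `mass`."""
    tail = (1 - mass) / 2 * 100
    lower, upper = np.percentile(draws, [tail, 100 - tail])
    return draws.mean(), lower, upper


mean, lower, upper = summarize(delta_w)
print(f"delta_w = {mean:.2f} [89% interval: {lower:.2f}, {upper:.2f}]")
```

The same pattern applies to contrasts between warmth and competence parameters: subtract the corresponding columns of the joint samples, then summarize the resulting vector.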

Models' predictions are illustrated with plots that combine raw data and mean predictions with compatibility bounds. The predictions result from linking simulated data (average raters per gender) with sampled posterior distributions. We used bespoke functions to draw the figures. The analysis was conducted using R version 3.6.2 at stage 1 and R version 4.3.2 at stage 2 and the final analysis.

RESULTS

As specified in the preregistration, and to avoid false negatives, we aimed to collect at least 335 complete responses if no promising effects of the manipulation were present after stage 1, which was the case. In total, 438 responses (including 100 from stage 1) were collected, of which 340 were complete and used in the analysis. Of those, 189 came from women ($M_{\text{age}}$ = 33.81, SD = 8.36, Range = 18–62) and 151 from men ($M_{\text{age}}$ = 35.54, SD = 11.05, Range = 18–70).

The expected out-of-sample fit of the models was compared using WAIC and the resulting weights
$$ w_i=\frac{e^{-0.5\Delta_i}}{\sum_j e^{-0.5\Delta_j}} $$
where $\Delta_i$ represents the difference between the $i^{\text{th}}$ model's WAIC and the lowest WAIC in the set. The results of the comparison are shown in Table 3. The model with the interaction between rater's gender and the manipulation and the model with the manipulation effect only promised virtually identical out-of-sample fit. The model without the effect of gender was even a shade better (Figure 3; for numerical results see Table S5). The null model was much worse than both models that take the photograph manipulation into account.
TABLE 3. Models ordered by their expected out-of-sample fit; lower WAIC is better.
Model   WAIC      SE       dWAIC   dSE     pWAIC    Weight
m1      288,119   415.59    0.00   NA      912.75   0.55
mG      288,120   415.58    0.39    3.58   915.43   0.45
m0      288,216   415.22   96.89   25.35   746.53   0.00
Abbreviations: dSE, standard error of the difference; dWAIC, difference from the best model; m0, null model; m1, manipulation only; mG, interaction between manipulation and rater's gender; pWAIC, penalty term (average variance in lppd); SE, standard error.
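As a quick check, the Weight column of Table 3 can be recomputed from the dWAIC column with the weight formula given above. The sketch uses only the Python standard library; the dictionary keys follow the model abbreviations in the table.

```python
import math

# WAIC differences from the best model (dWAIC column of Table 3).
d_waic = {"m1": 0.00, "mG": 0.39, "m0": 96.89}

# Unnormalized relative likelihood of each model: exp(-0.5 * delta).
rel = {name: math.exp(-0.5 * d) for name, d in d_waic.items()}
total = sum(rel.values())

# Normalize so that the weights sum to one.
weights = {name: r / total for name, r in rel.items()}

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

Rounded to two decimals, this gives 0.55 for m1, 0.45 for mG and 0.00 for m0, in agreement with the table.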
FIGURE 3. Posterior distribution of the linear model's parameter values (the best model according to Table 3). The dependent variable is on the left, predictor on the right side of the tilde (~), which can be read as “depending on.” The abbreviation cor() stands for the correlation of the variables in parentheses, and sd() stands for the standard deviation of the variable in parentheses. Square brackets […] denote the level of varying parameter values for which the given hyperparameter is reported. The white dot indicates the mean of sampled parameter values on a given posterior margin. The black bar represents the 89% compatibility interval (CI). The density plots are based on the kernel density estimation of posterior sample margins and are truncated at the lowest and highest posterior sample values.

Despite this, the average effect of the manipulation was very weak, if present at all. Men rated Arabized faces higher than Turkified faces by 1.10 points [89% CI: 0.19, 1.98] along the warmth dimension (Figure 4, Figure S6). This and similar effects should be interpreted with caution because the by-gender model was slightly worse than the model in which the effect of Arabization is held constant across genders, where the expected difference is just 0.70 points [89% CI: −0.03, 1.42] (Figure 3). The average effect on competence ratings was even smaller.

FIGURE 4. Density plots showing raw data (women's ratings in orange, men's in blue) and lines indicating counterfactual predictions of the expected change in the rating of an average woman or man when the other rating in the pair is held constant (based on the second-best model). Solid lines represent mean predictions; semi-transparent corridors span 89% compatibility intervals based on 4000 samples from the posterior distribution.

The reason why these models prevailed over the null (Figures S13–S16, Tables S7 and S8) was their ability to capture the systematic differences in Arabization slope per rater and target. Some raters seemed to rate Turkified faces higher in warmth (sd = 2.65 [89% CI: 1.88, 3.35]) and competence (sd = 1.91 [89% CI: 0.53, 2.89]), while others gave higher scores to Arabized targets. These varying effects, characterized by standard deviations, easily override the mean effects, which do not lean strongly to either side at the population level. (The sd and correlation parameters are almost identical in both good models. In the text, we report the estimates from the best model, i.e., the model without the effect of gender.) Raters differ a lot in their use of the scale. The variation in intercepts per rater is by far the largest parameter value. It almost matches the standard deviation of residuals. Targets also tend to differ systematically in the mean ratings of warmth (slightly more) and competence (slightly less) they elicit, but the effect of raters dominates the prediction. The slope parameter sds by target are, however, not much lower than their analogues by rater. Arabization tends to suit some faces, while Turkification tends to suit others.

The covariation of varying effects is also of interest. Higher rated warmth means higher rated competence at the level of rater (r = 0.70 [89% CI: 0.65, 0.74]), at the level of target (r = 0.80 [89% CI: 0.69, 0.88]) as well as at the level of otherwise unexplained residual (r = 0.18 [89% CI: 0.17, 0.20]). Higher warmth ratings under Arabization meant higher competence ratings under Arabization and vice versa on both rater (r = 0.42 [89% CI: 0.01, 0.75]) and target (r = 0.59 [89% CI: 0.26, 0.84]) levels. Two other meaningful correlations were found only on the target level. Faces that received overall higher competence ratings could benefit more from facial Arabization compared to Turkification (r = 0.38 [89% CI: 0.03, 0.67]). A similar correlation could be observed between warmth intercept and competence slope (r = 0.45 [89% CI: 0.13, 0.72]). A substantial part of this association is likely due to the strong relationship between the intercepts, but it should not be dismissed as solely an artefact of multivariate normal distribution because the symmetrical case (correlation between competence intercept and warmth slope) is not conclusively different from 0, and its most likely value is negative (r = −0.11 [89% CI: −0.37, 0.16]). Moreover, we see no similar pattern on the level of rater.

DISCUSSION

The present research attempted to take advantage of computational methods used in research on faces–namely, geometric morphometrics–and a recently developed model of the cross-group typicality/distinctiveness of human faces in a real-world intergroup context to shed light on stereotypes of warmth and competence. Facial morphology is only one cue used to categorize individuals in real-world contexts and is therefore confounded with many other cues. We aimed to understand more precisely than previous research whether and to what extent facial morphology–and not simpler cues to ethnicity such as skin colour–contributes to differential perception of warmth and competence by manipulating individual faces towards ingroup (Turkish) and outgroup (Arabic) prototypes on the axis connecting these prototypes in the face space. In line with past findings and models on stereotypes, we reasoned that for Turkish perceivers, the version of the same face that was moved closer to their ingroup prototype would elicit higher warmth and competence ratings compared to the version moved closer to the outgroup prototype.

We fitted Bayesian multilevel models to test this hypothesis and to comprehensively represent the sources of variation in ratings of warmth and competence in our sample. The results do not support the relatively simple idea that faces more strongly resembling the ingroup (vs. outgroup) elicit greater competence and warmth ratings. Perhaps the most important message from the data is that they defy any such simplification. This is because a major source of variance in ratings turned out to be the raters. In addition, particular faces contributed importantly beyond the effect of the manipulation (i.e., whether they were Arabized or Turkified). In a way, this suggests how powerful group labels and direct cues to membership are in guiding social perception, which has been a major theme in social psychological theories (e.g., Tajfel & Turner, 1979). In other words, the effects of facial features on social perception may become much more complex when the perceiver is not guided by knowledge that an individual belongs to a certain group, as was the case for raters in our experiment. In fact, the world may turn out to be so complex without such guidance that even infants rely not just on features but on concepts and categories in navigating the social world (Liberman et al., 2017). When the assimilative influence of group labels on the perception of faces is weakened or removed in experiments like ours, it may become difficult or impossible to find any consistent differences between how ingroup and outgroup faces are perceived. This is reminiscent of the idea that individuation–focusing on the particular qualities of an individual–can counter the effect of categorization and stereotyping (e.g., Brewer, 1988).

While our effort had several important unique strengths, there were also limitations and challenges that should be acknowledged. First, despite using cutting-edge computational methods with the highest quality facial images available to us, a pilot study showed that our manipulation did not produce effects on group categorization as strong as expected (see Supporting information section 1). It is not the case that it failed to work in a technical sense, but there was a non-negligible number of observations that did not fit expectations (i.e. an Arabized face being judged as more Turkish- (vs. Arabic-) looking and vice versa). This could also be explained by the association between trustworthy-looking facial appearance and ethnic typicality. Previous research has provided evidence that people with facial morphology closer to ingroup standards are perceived as more trustworthy and competent (Sofer et al., 2015, 2017). This strong association, for both perceived and measured ingroup membership (facial typicality), was also corroborated by recent work (Kleisner et al., 2024; Tracy et al., 2020). Arab faces (in contrast to Turkish faces) may, on average, possess certain features that tend to be statistically associated with trustworthiness, such as relatively fuller lips and an elongated facial shape. This “trustworthy-ingroup effect” could potentially bias ingroup-outgroup categorization.

This in fact converges with the main experiment's findings: when skin colour is taken out of the equation and there are no other cues to category membership, perceptions become less aligned with the ingroup-outgroup dichotomy. An interesting future research question arises: If, as our findings suggest, facial morphology is not sufficient to create categorical social perception, what is? What featural cues are essential or sufficient to create perceptual and judgemental effects that resemble the real-life differences these participants supposedly show in judging ingroup and outgroup individuals?

Our finding regarding the insufficiency of facial shape to indicate ethnic differences is similar to some experimental evidence on the role of colour contrast in sex recognition. Russell (2009) showed that the same facial shape could be interpreted as either male or female based solely on changes in the colour contrast of certain facial features, such as between the lips and their surroundings. The relationship between shape and colour seems complex. For example, Burton et al. (1993) demonstrated that sex classification based solely on shape is possible but far less accurate than when texture information is also available. Moreover, Hill et al. (1995) found that perceivers relied more on colour when exposed to a frontal view and more on shape when viewing a profile. Additionally, Yip and Sinha (2002) observed that when shape information is degraded, colour cues become useful in identity recognition. It is, therefore, possible that shape alone, without colour, does not provide enough information for ethnic recognition. Even small changes in colour and contrast may enhance the ability to recognize subtle morphological differences among ethnic groups.

Second, while not a limitation per se, the chosen categories may have contributed to the emerging complexity. Due to more than five centuries of shared history (under the Ottoman Empire), there has been close exchange between Turkish and Arabic cultures. These groups, in terms of the faces of individuals, may be too close to each other for a simpler story to emerge from this kind of experiment. Taking skin colour out of the picture probably made this challenge even more extreme. Thus, our original hypothesis may very well receive support in other contexts. Once again, future research might bring data on the minimal average degree of distance in face space between two categories required for such perceptual effects to emerge. In a similar vein, the potential effects of our participants' background and social networks (e.g. whether they had Arabic ancestry or Arabic individuals in their family and surroundings) would have been worth examining.

Although this study did not yield categorical and unambiguous conclusions regarding the influence of minor changes in human faces on their stereotyping, we believe that it presents a valuable methodological guide combining the up-to-date approaches of Bayesian multilevel modelling and geometric morphometrics that may inspire other inter-population investigations. In particular, the value of our effort stems from the combination of several unique aspects. First and most importantly, in previous work, it is often not clear how the various types of facial stimuli actually differ objectively from each other, whether they differ in unintended ways, and what it means for a face to be low or high in prototypicality. In contrast, our faces differ only along the CTDM axis, which allows the shape variation of facial stimuli to be controlled in a mathematically precise manner within the natural range observed in the target populations (Kleisner et al., 2019). Second, an important unique feature of the faces (and the corresponding social categories) used in our work is that they go beyond the pairs of categories widely studied in the literature, such as White- versus African-Americans, that are relatively easily categorized on the basis of a single prominent dimension (skin colour). Instead, the pair we examined (Turkish versus Arabic) is representative of a perhaps even more common type of intergroup encounter in the real world, one in which the categories are much less clear. Thus, our research is relevant to a recent call to “focus on better understanding the mechanism underlying featural biases” (Maddox et al., 2022; p. 11) as opposed to category-driven biases.

Despite the failure to support the central hypothesis, we were still able to observe a few notable effects. First, perceived warmth and competence indeed go together. “Cold” faces are not, on average, perceived as more competent, despite the popular cultural image of the “ice-cool detached rational professional.” This is true on the level of target (which is the most important from an evolutionary perspective), rater, as well as otherwise unexplained residual. Second, as far as average ratings are concerned, raters explain much more variance than targets. Some faces do come across as systematically warmer and more competent than others, but it is the individual rater's willingness to rate others high on these desirable characteristics that allows the resulting ratings to be predicted precisely. Finally, part of this variance between raters is due to sex. Women are less prone to rate male faces high on warmth and competence. Perhaps the signs of sex-biased evolutionary error-management are at play here (Haselton, 2007). Women who did not perceive random strangers as warm and competent were probably more cautious in the subsequent interactions and entered friendly and reproductive relationships only with the highest-quality mates.

The goal to understand how specific morphological and other types of features contribute to stereotypes remains important and though the current findings do not allow for definitive conclusions on this, we remain optimistic that our effort can help to shape the future of empirical efforts pursuing this goal. If the processes examined here are as complex as our findings suggest, then continued research on them will be even more critical.

AUTHOR CONTRIBUTIONS

S. Adil Saribay: Conceptualization; data curation; methodology; project administration; resources; supervision; writing – original draft; writing – review and editing. Šimon Pokorný: Data curation; investigation; methodology; project administration; resources; writing – original draft; writing – review and editing. Petr Tureček: Formal analysis; methodology; validation; visualization; writing – original draft; writing – review and editing. Karel Kleisner: Conceptualization; data curation; funding acquisition; methodology; project administration; resources; supervision; validation; visualization; writing – original draft; writing – review and editing.

ACKNOWLEDGEMENTS

The research was supported by the Czech Science Foundation project reg. no 21-10527S. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

    CONFLICT OF INTEREST STATEMENT

    The authors declare no competing interests.

    CODE AVAILABILITY

    The code of the final analysis as well as the code used in the two rounds of preregistration are publicly available at: https://osf.io/ncdtb/

    OPEN RESEARCH BADGES


    This article has earned Open Data and Preregistered Research Designs badges. Data and the preregistered design and analysis plan are available at https://osf.io/ncdtb/ and https://osf.io/hgwtd/.

    DATA AVAILABILITY STATEMENT

    All data are publicly available at https://osf.io/ncdtb/. The raw data used in the final analysis specifically can be found at https://osf.io/4nh2u.