Replication of "The Face of Success" by Rule & Ambady (2008, Psychological Science)
------------------------------------------------------------------------
Casey Eggleston
cm5hv@virginia.edu
Minha Lee
mhl5b@virginia.edu
Thomas Talhelm
tat8dc@virginia.edu
**Introduction**
Rule and Ambady (2008) showed that naïve college undergraduates’ assessments of features of Fortune 500 CEOs’ faces correlated with company profits. Specifically, composite ratings of both power and leadership correlated with company profits, even after controlling for CEO age, CEO attractiveness, and the affect the CEOs were displaying in the pictures. Composite warmth did not significantly predict revenue or profits (although it correlated at r = -.12 and -.14, suggesting that the effect might become significant with a larger sample). We chose the correlation between leadership ratings and profits as our a priori criterion of successful replication.
**Methods**
*Power Analysis*
To calculate power, we focused on the two main results, which were a correlation of r = .36 between company profits and composite ratings of power, as well as a correlation of r = .30 between leadership and company profits. The original study had two separate samples of n = 50, one to rate traits and one to rate leadership. To be conservative, we calculated power based on the smaller effect size. For 90% power, that meant we needed a sample of 112 for each group.
Was the original study underpowered? To evaluate this, we can use the larger correlation (r = .36) and the more common (and less stringent) 80% power standard. This gives a necessary sample size of 58, which is close to the original sample size of 50. Note that our sample is more than double the original, so our study has considerably higher power.
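As an illustration, these sample sizes can be approximated with the standard Fisher z formula. The sketch below is a minimal Python version (function and variable names are ours, and a two-tailed α of .05 is assumed); it lands within a participant or two of the figures reported above, with any small discrepancy due to the approximation versus exact power software.

```python
from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r, power=0.90, alpha=0.05):
    """Approximate N needed to detect a correlation of r (two-tailed test),
    using the Fisher z transformation: n = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical z for the two-tailed test
    z_beta = norm.ppf(power)            # z corresponding to the desired power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

print(n_for_correlation(0.30, power=0.90))  # r = .30 at 90% power (reported as 112 above)
print(n_for_correlation(0.36, power=0.80))  # r = .36 at 80% power (reported as 58 above)
```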
In addition, the original article used a separate sample to rate attractiveness (n = 34, 53% female). We aimed to collect a sample size similar to the original article’s, and we aimed for 50% male and 50% female raters because attractiveness ratings could very well depend on gender. However, because the psychology participant pool at the University of Virginia has more females than males, we were prepared to open additional timeslots restricted to male participants so that we could reach an even male-female ratio.
*Planned Sample*
To obtain that sample, we recruited participants through the University of Virginia participant pool, since the original article appeared to use a participant pool. Rule and Ambady’s sample was 65% female. Because the authors did not mention gender effects and because we had no reason to suspect gender effects for the leadership and power ratings, we did not restrict our sample along gender lines.
*Materials*
After several months, the authors sent us the original MediaLab and DirectRT materials. They also sent an Excel sheet with the companies’ mean revenue and profit for fiscal years 2004 through 2006 from Forbes.com. To control for the potential effect of CEO age, the list also included each CEO’s age.
Traits. The trait group of participants rated the CEOs’ faces on five traits: likeability, dominance, competence, maturity, and trustworthiness. The trait scales ran from 1 to 7, and only the endpoints (1 and 7) were labeled. The likeability rating anchors were 1 (Not Likeable) and 7 (Very Likeable). The dominance rating anchors were 1 (Submissive) and 7 (Dominant). The competence rating anchors were 1 (Not Competent) and 7 (Very Competent). The trustworthiness rating anchors were 1 (Not Trustworthy) and 7 (Very Trustworthy). The maturity rating anchors were 1 (Babyish) and 7 (Mature). The participants in the leadership condition rated the CEOs’ faces on a single trait, successful leadership. The question read “How good would this person be at leading a company?” The rating anchors were 1 (Not Successful) and 7 (Very Successful).
Instructions. Participants in the trait condition read five pages of instructions. Participants in the leadership condition and the attractiveness condition read only one page of instructions. See the online supplementary materials for the full wordings. Most importantly, participants were asked to imagine that they were choosing a CEO to hire:
Imagine you work for a big company and you have a bunch of potential candidates to be the company’s next CEO (Chief Executive Officer—the person who runs the company). The faces you’ll be seeing in this experiment are your candidates. Your task is to rate these faces as to how successful [trait varies] you think they each would be as a CEO.
The original authors also encouraged participants to work quickly:
The experiment isn’t being timed, so you don’t need to rush through it. Definitely don’t spend too long thinking about any of the pictures or ratings, though. Just go with whatever your first instinct, gut-feeling is. So don’t pound the keys frantically, but DO go through them rapidly.
Attractiveness. A separate group of participants (planned sample n = 34, 50% female) was asked to rate the faces on attractiveness. The attractiveness rating anchors were 1 (Not Attractive) and 7 (Very Attractive). Points between the anchors were unlabeled.
Recognition. At the end, participants were asked whether they recognized any of the CEOs, and their data were removed if they recognized any of the faces. The question read “Did you recognize any of the faces you saw today?” The answers were “yes” and “no.”
Affect. Finally, we had four research assistants code each photo for affect on a 7-point scale. This scale was not included in the original materials, and the original article did not describe what “affect” meant. The wording seemed to imply a single item, but it is not clear what the item wording was. Was it from no affect to high affect? Was it from negative to neutral to positive? After discussion, we decided that 7-point scales usually need two poles to yield seven meaningful scale points. Therefore, we used a bipolar scale from negative to positive:
1 = very negative,
2 = negative,
3 = slightly negative,
4 = neutral,
5 = slightly positive,
6 = positive,
7 = very positive.
For affect, we calculated interrater reliability with the Spearman-Brown r, which is what the original article used. We know of no agreed-upon standard for interrater reliability, so after discussion, we decided to set an a priori criterion of r = .8. If agreement met or exceeded .8, we would accept the ratings. If agreement failed to meet that criterion, we would redo the training and ratings until we met the criterion. Agreement in the original article was r = .903.
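For concreteness, the Spearman-Brown stepped-up reliability for the four coders can be computed from the mean pairwise inter-rater correlation. The sketch below is a minimal Python version (function and variable names are ours, not from the original materials):

```python
import numpy as np
from itertools import combinations

def spearman_brown(ratings):
    """Spearman-Brown stepped-up reliability for k raters.

    `ratings` is an (n_photos, k_raters) array of affect codes. The mean
    pairwise Pearson correlation is stepped up to the reliability of the
    k-rater composite: k * r_bar / (1 + (k - 1) * r_bar)."""
    k = ratings.shape[1]
    pairwise = [np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
                for i, j in combinations(range(k), 2)]
    r_bar = np.mean(pairwise)
    return (k * r_bar) / (1 + (k - 1) * r_bar)

# Example: four raters coding each photo; accept the codes only if reliability >= .80
# reliability = spearman_brown(affect_codes)
```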
*Procedure*
As in the original study, participants took the study in the lab, using MediaLab. As in the original study, participants rated all faces on one trait and then moved on to the next trait. Within each trait, the faces were randomly ordered. The order of the blocks of traits was randomized.
*Analysis Plan*
The only exclusion criterion for participant data was recognition of the CEOs: we planned to exclude participants who reported recognizing any of the faces. The original article also removed data for one CEO “because his score was more than 3 standard deviations below the mean.” It is not clear which rating this refers to, so we scanned the results for ±3 SD outliers on all ratings and excluded any CEOs meeting this criterion.
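A minimal sketch of this outlier screen, assuming the mean ratings are stored in a pandas DataFrame with one row per CEO (column names are illustrative, not from the original script):

```python
import pandas as pd

def flag_outlier_ceos(ratings: pd.DataFrame) -> pd.Index:
    """Return the index of CEOs whose mean rating on any trait falls
    more than 3 SDs from that trait's mean."""
    z = (ratings - ratings.mean()) / ratings.std(ddof=1)  # column-wise z-scores
    return ratings.index[(z.abs() > 3).any(axis=1)]

# ratings: one row per CEO, columns such as the five traits and leadership
# excluded = flag_outlier_ceos(ratings)
```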
As in the original article, we calculated simple correlations between ratings and earnings, as well as partial correlations controlling for CEO age, affect, and attractiveness. We averaged trait and leadership ratings across participants for each CEO.
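As an illustration of the planned partial correlations, one can residualize both variables on the covariates and correlate the residuals. The sketch below is a minimal Python version (the helper and the column names are our own assumptions, not the original analysis script):

```python
import numpy as np
import pandas as pd

def partial_corr(df: pd.DataFrame, x: str, y: str, covars: list[str]) -> float:
    """Pearson correlation between x and y after regressing out the covariates
    (ordinary least squares with an intercept) from both variables."""
    Z = np.column_stack([np.ones(len(df))] + [df[c].to_numpy(float) for c in covars])

    def residualize(col):
        v = df[col].to_numpy(float)
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    return np.corrcoef(residualize(x), residualize(y))[0, 1]

# Hypothetical column names, one row per CEO:
# partial_corr(ceo_data, "leadership", "profits", ["age", "affect", "attractiveness"])
```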
To verify the composite indices, we conducted a principal components analysis with varimax rotation. The original article found (1) a power component made up of competence, dominance, and facial maturity that explained 47% of the variance and (2) a warmth component made up of likeability and trustworthiness that explained 39% of the variance. Their criteria for determining the number of factors were unclear, so we conducted an exploratory factor analysis to verify that the items indeed loaded on the expected factors. We also conducted several confirmatory factor analyses to compare the two-factor model with one-factor and three-factor models.
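One way to run the planned principal-components extraction with varimax rotation is with the factor_analyzer package; the sketch below is an assumption on our part (the original analysis software is not specified), with illustrative trait column names:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

TRAITS = ["competence", "dominance", "maturity", "likeability", "trustworthiness"]

def pca_varimax(trait_means: pd.DataFrame, n_factors: int = 2):
    """Principal-components extraction with varimax rotation on the mean trait
    ratings (one row per CEO). Returns the rotated loadings and the proportion
    of variance explained by each component."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="principal")
    fa.fit(trait_means[TRAITS])
    loadings = pd.DataFrame(fa.loadings_, index=TRAITS,
                            columns=[f"component_{i + 1}" for i in range(n_factors)])
    return loadings, fa.get_factor_variance()[1]

# loadings, var_explained = pca_varimax(trait_means)  # trait_means: CEO-level mean ratings
```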
*Differences from Original Study*
Although we obtained the original materials from one of the original authors, the author was unable to locate the rating scale used to code affect (see the “Affect” section above).