27  Reproducing the original findings

Reproducing the analysis in the original study is a valuable step when planning a replication.

  1. It allows you to verify that the original analysis is correct. Minor errors are common, and occasionally the original finding does not hold up under scrutiny.
  2. Reproduction deepens your understanding of the methodology by showing which variables were recorded and how they were used.
  3. The reproduction often yields code that you can reuse, ensuring that your analysis will match the methods in the original paper.

Reproduction is only possible when the original data are available, either in a public repository or directly from the authors.

27.1 Reproducing the algorithm aversion paper

Here I reproduce the analysis of Study 3b in Dietvorst et al. (2015).

The data are available from ResearchBox and the Journal of Experimental Psychology: General website. The download includes the Stata code used in the paper. I translated that code into R (with minor fixes) to run the analysis below.

27.1.1 Load the data

I start by downloading and unzipping the data file and loading the CSV for Study 3b into the R environment.

Code
# Create data directory
if (!dir.exists("data")) {
  dir.create("data")
}

# Check if data exists, if not download and unzip
if (!file.exists("data/ResearchBox 379/Data/Study 3b Data.csv")) {
  # Download if zip doesn't exist
  if (!file.exists("data/ResearchBox_379.zip")) {
    download.file("https://researchbox.org/379&PEER_REVIEW_passcode=MOQTEQ", 
                  destfile = "data/ResearchBox_379.zip")
  }
  # Unzip
  unzip("data/ResearchBox_379.zip", exdir = "data")
}

# If the automatic download fails, place ResearchBox_379.zip in the 'data'
# folder manually and re-run the unzip() call above

# Load Study 3b data
data <- read.csv("data/ResearchBox 379/Data/Study 3b Data.csv", header = TRUE,
                 sep = ",", na.strings = ".")

27.1.2 Clean variable names and labels

I then tidy the variable names and recode the values to make them more readable (e.g. replacing the numeric codes for each condition with the words they represent: 1 = control, and so on), converting the recoded variables to factors.

Code
# Lower-case the first letter of each variable name (lowerCamelCase)
colnames(data) <- paste(tolower(substring(colnames(data), 1, 1)),
                        substring(colnames(data), 2), sep = "")
names(data)[1] <- "ID"

# Label mappings
condition <- c("1" = "control", "2" = "human", "3" = "model", "4" = "model&human")
modelBonus <- c("0" = "choseHuman", "1" = "choseModel")
binary <- c("0" = "no", "1" = "yes")
# Ordered by numeric code because factor() assigns labels to the sorted levels
betterStage2Bonus <- c("0" = "human", "1" = "model", "2" = "equal")
confidence <- c("1" = "none", "2" = "little", "3" = "some",
                "4" = "fairAmount", "5" = "aLot")

# Apply labels
data$condition <- factor(data$condition, labels = condition)
data$modelBonus <- factor(data$modelBonus, labels = modelBonus)
data$humanAsGoodStage1 <- factor(data$humanAsGoodStage1, labels = binary)
data$humanBetterStage1 <- factor(data$humanBetterStage1, labels = binary)
data$humanBetterStage2 <- factor(data$humanBetterStage2, labels = binary)
data$betterStage2Bonus <- factor(data$betterStage2Bonus, labels = betterStage2Bonus)
data$model <- factor(data$model, labels = binary)
data$human <- factor(data$human, labels = binary)
data$modelConfidence <- factor(data$modelConfidence, labels = confidence)
data$humanConfidence <- factor(data$humanConfidence, labels = confidence)
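
Because factor() matches labels to the sorted numeric codes, a quick spot-check that the recoding behaved as intended is worthwhile. This check is my own addition, not part of the original Stata code, and the output is omitted.

Code
# Spot-check the levels of the recoded factors (output omitted)
lapply(data[c("condition", "modelBonus", "betterStage2Bonus")], levels)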

27.1.3 Check sample sizes

I then run a quick check on the number of observations. The paper reports that 217 participants failed the attention checks, 187 did not complete enough questions to reach the dependent variable, and the final sample was 1036. Together this gives 217 + 187 + 1036 = 1440 participants in the raw data.

Code
# Total participants
length(data$ID)
[1] 1440
Code
# Participants who chose between the model and human forecasts
table(data$modelBonus)

choseHuman choseModel 
       517        519 

As expected, I found 1440 participants and 517+519=1036 dependent variable observations.
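
As a further check, the 1440 − 1036 = 404 participants without a dependent-variable observation should correspond to the 217 attention-check failures plus the 187 incomplete responses. Assuming those participants simply have a missing value for modelBonus (an assumption about how the file codes exclusions), this can be confirmed directly.

Code
# Participants with no dependent-variable observation: expect 217 + 187 = 404
sum(is.na(data$modelBonus))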

27.1.4 Choices by condition

I then broke down the dependent variable findings in more detail to see what each participant chose (human or model) in each condition.

Code
# Choice counts
tbl <- table(data$modelBonus, data$condition)
tbl <- rbind(tbl, total = colSums(tbl))
tbl
           control human model model&human
choseHuman     122   104   142         149
choseModel     145   157   111         106
total          267   261   253         255
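
For a quick read of the pattern, the same counts can be expressed as the proportion choosing the model within each condition. This is a convenience view I have added, not part of the original analysis.

Code
# Proportion choosing the model within each condition
round(prop.table(table(data$modelBonus, data$condition), margin = 2), 2)

From the counts above, roughly 54% and 60% of participants chose the model in the control and human conditions, compared with about 44% and 42% in the model and model&human conditions.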

27.1.5 Chi-squared tests

Two chi-squared tests reproduce the results shown in Figure 3 of the paper. The first tests whether there was a difference between the two conditions without the model and the two conditions with the model. This is the test of the core hypothesis.

Code
# Compare conditions without vs. with the model
chisq.test(table(data$modelBonus, data$model), correct = FALSE)

    Pearson's Chi-squared test

data:  table(data$modelBonus, data$model)
X-squared = 21.715, df = 1, p-value = 3.163e-06

The second tests whether there was a difference between the two conditions without the human forecasting experience and the two conditions with it. This test checks whether the experience of seeing their own forecasting errors changed participants’ choices.

Code
# Compare conditions without vs. with the human experience
chisq.test(table(data$modelBonus, data$human), correct = FALSE)

    Pearson's Chi-squared test

data:  table(data$modelBonus, data$human)
X-squared = 0.31302, df = 1, p-value = 0.5758
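
To put magnitudes on these two tests (the code above reports only the test statistics), the proportion choosing the model can be split by each factor. This summary is my own addition, not part of the original code.

Code
# Proportion of choices, split by whether the condition included the model
round(prop.table(table(data$model, data$modelBonus), margin = 1), 2)

# ... and by whether it included the human forecasting experience
round(prop.table(table(data$human, data$modelBonus), margin = 1), 2)

Consistent with Figure 3, the share choosing the model is around 15 percentage points lower in the conditions that included the model than in those that did not, while the human forecasting experience makes little difference.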

27.1.6 Plot the results

I then plotted the data in a bar chart to illustrate the results.

Code
library(ggplot2)
library(dplyr)

library(tidyr)

conditions <- as.data.frame.matrix(tbl)
# Percent (not proportion) choosing the model in each condition
conditions <- rbind(conditions, percent = 100 * conditions[2, ] / conditions[3, ])
conditions <- tibble::rownames_to_column(conditions, "variable")
conditions_long <- pivot_longer(conditions, cols = -variable,
                                names_to = "treatment", values_to = "value",
                                cols_vary = "slowest")
plot_data <- filter(conditions_long, variable == "percent")

ggplot(plot_data, aes(x = treatment, y = value)) +
  geom_col() +
  labs(title = "Percent choosing model in each condition",
       x = "Condition", y = "Percent choosing model") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_hline(yintercept = 0, linewidth = 0.25)

This reproduces Figure 3 from the original paper.
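
As an aside, the same percentages can be computed more directly from the raw data with dplyr, avoiding the row-binding and pivoting above. A sketch, under the assumption that excluded participants simply have a missing modelBonus value:

Code
# Percent choosing the model in each condition, computed directly
data |>
  filter(!is.na(modelBonus)) |>
  group_by(condition) |>
  summarise(percent = 100 * mean(modelBonus == "choseModel"))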

27.1.7 Study 3b t-tests

Finally, I reproduce the Study 3b results shown in Table 3 of the paper. The table includes several elements:

  • A paired t-test of whether the bonus would have been higher had the model been chosen rather than the human
  • A paired t-test of whether the average absolute error was higher for the model or the human in the Stage 1 unincentivised forecasts
  • A paired t-test of whether the average absolute error was higher for the model or the human in the Stage 2 incentivised forecasts

The code generates the numbers required to fill the table.

Code
# Calculate human and model mean rewards 
m <- round(mean(data$bonusFromModel, na.rm=TRUE), 2)
h <- round(mean(data$bonusFromHuman, na.rm=TRUE), 2)

# Bonus if chose model vs. human
bonusModel <- t.test(data$bonusFromModel, data$bonusFromHuman, paired=TRUE)
p <- signif(bonusModel$p.value, 2)
t <- round(bonusModel$statistic, 2)
d <- round(bonusModel$estimate, 2)

# Participants' AAE compared to model's AAE for stage 1 forecasts
stage1AAE <- t.test(data$humanAAE1, data$modelAAE1, paired=TRUE)

p1 <- signif(stage1AAE$p.value, 2)
t1 <- round(stage1AAE$statistic, 2)
d1 <- round(stage1AAE$estimate, 2)

AAEm <- round(mean(data[data$condition=='model&human', 'modelAAE1'], na.rm=TRUE), 2)
AAEh <- round(mean(data[data$condition=='model&human', 'humanAAE1'], na.rm=TRUE), 2)

# Participants' AAE compared to model's AAE for stage 2 forecasts
stage2AAE <- t.test(data$humanAAE2, data$modelAAE2, paired=TRUE)

p2 <- signif(stage2AAE$p.value, 2)
t2 <- round(stage2AAE$statistic, 2)
d2 <- round(stage2AAE$estimate, 2)

AAEm2 <- round(mean(data$modelAAE2, na.rm=TRUE), 2)
AAEh2 <- round(mean(data$humanAAE2, na.rm=TRUE), 2)

Table 3 results

                Model    Human    Difference    t score    p value
Bonus           $0.49    $0.30    $0.20         14.4       5.8e-43
Stage 1 error    4.39     8.32     3.93         17.3       6.2e-45
Stage 2 error    4.32     8.34     4.03         15.44      1.5e-48

The numbers match those provided in the paper.
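
For convenience, the table above can also be assembled programmatically from the objects computed in the previous code block. A minimal sketch, assuming the knitr package is available:

Code
# Assemble the computed values into a results table
library(knitr)

table3 <- data.frame(
  Measure      = c("Bonus", "Stage 1 error", "Stage 2 error"),
  Model        = c(m, AAEm, AAEm2),
  Human        = c(h, AAEh, AAEh2),
  Difference   = c(d, d1, d2),
  `t score`    = c(t, t1, t2),
  `p value`    = c(p, p1, p2),
  check.names  = FALSE
)
kable(table3)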

27.1.8 Statcheck

The above analysis relates to Study 3b only. While I won’t reproduce the other results from the paper, one quick robustness check can be performed using statcheck. To use the web version, you upload a PDF, HTML or DOCX of the paper; statcheck then extracts the statistical tests reported in the paper and checks that each reported p value is consistent with its test statistic and degrees of freedom. (There is also a statcheck R package, so you can run the same checks locally, as sketched below.)
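
A minimal sketch of the local route, assuming the statcheck package is installed. The result string below is an illustrative example rather than a value from the paper, and the file path is hypothetical.

Code
# install.packages("statcheck")  # if not already installed
library(statcheck)

# Check a single APA-formatted result supplied as text (illustrative example)
statcheck("t(28) = 2.20, p = .03")

# Check every result extracted from a saved HTML copy of a paper (hypothetical path)
# checkHTML("data/dietvorst-2015.html")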

I uploaded the Dietvorst et al. (2015) PDF, but the analysis did not work. I then downloaded an HTML version of the paper via the UTS library website, and uploading that version produced a statcheck report of the extracted tests.

The results appear consistent with the reported tests.

Statcheck does not work on all papers, and it only checks that the reported test statistics, degrees of freedom and p values are internally consistent, but it is a quick way to look for red flags.