31 Replication report resources
Below is a set of resources to support the development of your replication report. Many come from Gilad Feldman’s teaching resources.
31.1 Templates and guides
- Replication report template: Download the Word template. It can be treated as a fill-in-the-blanks document or a guide. The template is written in the past tense—write as though data collection has occurred, leaving placeholders for results that will be inserted later. Feldman’s long-form template and supplementary material template provide additional detail.
- Guide for conducting a preregistered replication: Google Doc covering the full process (geared toward statistically advanced students).
- Reviewer template: Google Doc that doubles as a self-review checklist.
- Registered report checklist: Download the checklist from Kiyonaga and Scimeca (2019).
- Replication recipe: Brandt et al. (2014) provide a structured set of questions; download the recipe.
- Example registered replication reports: Feldman’s students have published multiple registered replication reports. See the list at mgto.org/core-team for links to preprints and publications.
31.2 Applying the registered report checklist
The notes below apply the Kiyonaga and Scimeca (2019) checklist to the Dietvorst et al. (2015) algorithm aversion study.
31.2.1 Hypotheses, predictions and interpretations
Delineate testable hypotheses: Is the scientific premise justified and sound?
The hypothesis is that people decrease their tendency to use algorithmic predictions (rather than their own) after seeing the algorithm err.
The idea that people lose trust in an algorithm faster than they lose faith in their own judgement was initially based on a thought experiment. Creating an experimental foundation for this hypothesis is the purpose of the paper.
Make concrete predictions: Include directional descriptions of possible outcomes (e.g. ‘condition A greater than condition B’)
There is a clear directional prediction to the hypothesis. Fewer participants will use the algorithmic prediction after seeing the algorithm err.
Across the four conditions, we would expect the following ranking in the proportion of participants using the algorithmic prediction:
human ≥ control > model-and-human ≥ model
The key relation for the hypothesis is the middle relation. The use of the model should decline when the model has been seen to err, either alone or in combination with the opportunity to see human error.
The two other relations come from the possible effect of seeing the human err. If seeing the human err increases participants' use of the algorithm, the human condition could show greater algorithmic use than the control condition, and the model-and-human condition greater algorithmic use than the model condition. However, the hypothesis does require that any effect of seeing the human err is smaller than the effect of seeing the algorithm err.
Describe how predictions will be tested: What specific measures? What statistical tests?
To test whether people use the model's prediction or their own, we will use a chi-squared test comparing the combined control and human conditions with the combined model and model-and-human conditions. We will not apply a continuity correction. This is the same test used in the original paper.
Because seeing the human err may itself have an effect, we will also use a chi-squared test comparing the combined control and model conditions with the combined human and model-and-human conditions. This was also done in the original paper.
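A minimal sketch of these tests is below, assuming the analysis is run in Python with scipy; the counts are placeholders rather than real data. Each 2×2 table has rows for participants who chose the model's forecast versus their own, and columns for the two combined groups in that comparison.

```python
# Minimal sketch of the planned chi-squared tests (placeholder counts only).
from scipy.stats import chi2_contingency

# Primary test: combined (control + human) vs (model + model-and-human),
# i.e. no experience of the model erring vs experience of the model erring.
primary_table = [
    [260, 190],  # chose the model's forecast (placeholder counts)
    [240, 310],  # chose their own forecast (placeholder counts)
]
chi2, p, dof, expected = chi2_contingency(primary_table, correction=False)
print(f"Primary test: chi2({dof}) = {chi2:.2f}, p = {p:.4f}")

# Secondary test: combined (control + model) vs (human + model-and-human),
# i.e. no experience of the human erring vs experience of the human erring.
secondary_table = [
    [230, 220],
    [270, 280],
]
chi2, p, dof, expected = chi2_contingency(secondary_table, correction=False)
print(f"Secondary test: chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```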
Link hypotheses and tests to interpretations: How would you interpret possible outcomes? Which results would support which theories?
A significant reduction in the use of the model when the participant has experience with the model (has seen the model err) would provide support for the hypothesis.
The main potential for ambiguity is a difference between the model and model-and-human conditions, or more generally between the conditions with human practice and those without. This would suggest an effect of seeing the human err, which may, in turn, suggest that error in general, rather than algorithmic error specifically, drives the result. As in the original paper, this is addressed by testing for a difference between the combined control and model conditions and the combined human and model-and-human conditions. If a difference is found, a new interpretation would be required.
For example, if the human condition differs from the control condition, this suggests that seeing human error plays a role. I would predict the direction of this effect to be an increase in trust in the algorithm. This may then influence the model-and-human condition, or our interpretation of it: seeing humans err may moderate the effect of the algorithm erring.
Ask whether the design can generate diagnostic results which clearly inform the predictions: Will possible data patterns clearly support proposed interpretations? Will results make a valuable contribution to the field regardless of the outcome?
The data pattern will be clear, subject to the potential mix of results noted above.
The result will be a useful contribution. Although Study 3b has already been replicated (Jung and Seiter, 2021), that is the only published replication of a highly cited paper that has generated a substantial amount of subsequent research.
31.2.2 Power analyses and sample size
Identify critical tests: What specific statistics will be used to test predictions? (e.g. t-test, correlation, ANOVA interaction, etc.)
As noted above, I will use a chi-squared test comparing the combined control and human conditions with the combined model and model-and-human conditions. I will not apply a continuity correction. This is the same test used in the original paper.
I will also use a chi-squared test comparing the combined control and model conditions with the combined human and model-and-human conditions, to test for an effect of seeing the human err.
Estimate expected effect size: Scour the literature for studies using similar methods, designs, and statistical tests to find the likely range of effect sizes for your tests. Consider the lower end of this range and account for publication bias to attain a conservative effect size estimate. Assess the smallest effect size that would be theoretically meaningful for what you are studying. Pilot data are encouraged to show feasibility of experiment and plausibility of effect size, but are insufficient on their own.
The effect size in Study 3b of the original paper is a difference of approximately 0.14 between groups in the proportion choosing the model. This is a typical effect size in that paper: similar effect sizes were found in Study 3a and in the replication of Study 3b by Jung and Seiter (2021), although it is smaller than the effect size in Study 1.
Determine required sample size with power analysis: For your power analysis input, use effect size estimated from statistical test that is comparable to proposed test. Conduct a priori power analysis for your proposed test (e.g., via statistical toolbox or software package) to calculate required sample size.
I propose to use a sample size of 250 per condition. This will provide 99% power if the original effect size holds. A larger sample would enable the detection of a smaller effect size, but such an effect may be of limited theoretical interest. If I were confident that the original effect size was representative of the true effect size, I could reduce the sample to 125 per condition (249 in each of the two combined groups). However, I am reluctant to replicate with a smaller sample size and believe maintaining a buffer is appropriate.
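A hedged sketch of the power calculation is below. The base rates of 0.50 and 0.36 (a 0.14 difference) are illustrative assumptions rather than figures from the original paper, and the resulting power shifts with the assumed rates; the calculation uses Cohen's h for a two-proportion comparison via statsmodels.

```python
# Sketch of the power calculation for a 0.14 difference in proportions.
# The base rates of 0.50 and 0.36 are assumptions for illustration only.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_no_model_error = 0.50  # assumed proportion choosing the model (control + human)
p_model_error = 0.36     # assumed proportion after seeing the model err

h = proportion_effectsize(p_no_model_error, p_model_error)  # Cohen's h
analysis = NormalIndPower()

# 250 per condition gives 500 per combined group; 125 per condition gives 250.
for n_per_group in (500, 250):
    power = analysis.power(effect_size=h, nobs1=n_per_group, alpha=0.05,
                           ratio=1.0, alternative="two-sided")
    print(f"n = {n_per_group} per combined group: power = {power:.3f}")
```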
31.2.3 Reproducible methods and exclusion criteria plan
Explicitly define dependent variables: Explain any calculation that will be applied to raw data before the data are submitted to statistical tests. Define basic terms (e.g., does ‘average’ describe mean or median?) as well as transformations (e.g., z-scores).
The dependent variable is the proportion of participants who use the algorithmic prediction in each condition, which is simply the number who use the algorithmic prediction divided by the total number of participants. There will be no other data transformations.
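As a minimal illustration, assuming pandas and the hypothetical column names 'condition' and 'chose_model', the dependent variable is simply a group-wise mean of a 0/1 indicator.

```python
# Sketch of the dependent variable calculation with hypothetical column names.
import pandas as pd

df = pd.DataFrame({
    "condition": ["control", "human", "model", "model_and_human", "control"],
    "chose_model": [1, 0, 0, 1, 1],  # 1 = used the algorithmic prediction
})

# Proportion using the algorithmic prediction in each condition:
# the number who chose the model divided by the total in that condition.
proportion_by_condition = df.groupby("condition")["chose_model"].mean()
print(proportion_by_condition)
```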
Exhaustively describe inclusion criteria that will be applied before collecting any critical data: What conditions must participants meet to be enrolled? What criteria will preclude a participant from enrolling? Include standard criteria (e.g. safety) along with specific conditions that must be met to test hypotheses (e.g., baseline screening performance).
The criteria for participant enrolment are that participants are located in Australia and are over 18 years old. There will be no other participant screening beyond an attention check.
Exhaustively describe exclusion criteria that will be applied after collecting any critical data: Specify every plausible circumstance that would justify removal of data or participants after data has been collected. Include technical issues and quality assurance steps, outlier removal, and any specific data conditions that preclude (or must exist prior to) testing your hypothesis.
I will exclude the following participants (a filtering sketch follows the list):
- Participants who fail the attention check
- Participants who enter the same number for all predictions or 0 and 50 for all predictions
- Participants who do not complete all ten predictions (they will not be excluded if they do not complete the subsequent questions)
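The sketch below assumes pandas and hypothetical column names ('passed_attention_check' and 'pred_1' to 'pred_10' for the ten predictions); the '0 and 50' rule is interpreted here as entering only 0s and 50s across the ten predictions.

```python
# Sketch of the planned exclusions, using hypothetical column names.
import pandas as pd

PRED_COLS = [f"pred_{i}" for i in range(1, 11)]  # the ten predictions

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    preds = df[PRED_COLS]

    failed_attention = ~df["passed_attention_check"]  # boolean column
    incomplete = preds.isna().any(axis=1)             # did not complete all ten predictions
    constant = preds.nunique(axis=1) == 1             # same number for every prediction
    only_0_and_50 = preds.isin([0, 50]).all(axis=1)   # entered only 0s and 50s

    # Later questions are not checked, so participants who stop after the
    # ten predictions are retained.
    keep = ~(failed_attention | incomplete | constant | only_0_and_50)
    return df[keep]
```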
Define outcome-neutral controls: How can one be certain the manipulation was successful? Will these controls convince a reader that neither positive nor negative results are spurious?
Each participant will only see one condition. There is little scope for the manipulation to fail.
Ask whether the methods are truly transparent: If someone were to repeat the study and found a different result, are there any degrees of freedom that might explain the difference?
The main degree of freedom is the particular task, data and algorithm used. This is seen in the original paper, where different tasks were presented to participants across the studies, and there is likely to be some variation in results across those tasks. However, we are using Qualtrics wording nearly identical to the original task, so a study using these same materials is, at a minimum, a direct test of that setting.
The other degree of freedom is the study population. I propose using an Australian panel, which is likely more representative of the population than the original sample of MTurk workers.
31.2.4 Is the proposed protocol doable?
Check for unnecessary methodological constraints: Is every criterion well-motivated? Are you confident all methods are feasible? Is there a contingency plan if some constraints are unmet? Can a simpler design address the same question?
The study design is simple, the sample size is modest, and the task is not time-consuming for participants. The study is straightforward both to run and to analyse.
Assess required sample size and effect size estimate: If required sample size is impractical, or no comparable effects exist in the literature, consider an alternative sample size estimation approach (e.g., step-wise peeking procedure, Bayesian stopping rule, hard cap on sample size). Seek out resources, staff, or collaborations that can help.
The required sample size is achievable.