The aim of this experiment was to address the CR performance floor effect observed in Experiment 2a (see https://osf.io/z47r3/), which clouded the interpretation of our findings (i.e., no FR/CR variability difference, or merely a restricted CR range?).
Here we document various attempts to get CR performance off floor.
**Pilot 1: Increasing CR pair study time**
Our first attempt involved increasing the per-pair CR study time to 10s (from 5s; FR per-word study time remained at 5s). We collected *N* = 30 pilot participants from Prolific, and after excluding 10 participants (3 reporting major tech difficulty, 3 reporting understanding less than 75% of the words, 1 reporting a major distraction, 1 reporting cheating, and 10 who didn't get at least 1 correct on both lists; categories overlap), these were the performance stats:
@[osf](4fz3a)
Doesn't look good, and mean performance (low *N* notwithstanding) isn't higher than in the previous study with 5s CR study time. BUT, when breaking this down by task order (i.e., CR first vs. FR first), the stats look a little different:
@[osf](k9stq)
People do a bit better on the surprise FR test after CR study when they do CR 2nd (note the discrepant *n*'s between conditions, due to a higher performance-based exclusion rate among those who did CR first). So perhaps one way to address the floor issue is to have *all* participants do CR second. In all but one experiment, task order didn't affect the variability difference (and in the one case where it did, the direction of the variability difference was the same across conditions).
One potential wrinkle: The difference in CR performance for FR -> CR vs. CR -> FR was not particularly large in our previous main sample:
@[osf](xqs3v)
But perhaps the combination of doing CR 2nd and increased CR study time would work?
**Pilot 2a: Using a different, more restricted "Objects" wordset**
We reasoned that using a more restricted or semantically-related wordset might get performance off floor. The object words from Popp & Serra (2016) have some appealing properties (similar FR and CR performance, no floor effects, persistent variability difference in other experiments).
We ran *N* = 31 participants on Prolific using this wordset (and setting CR study time back to 5s, same as FR). After excluding 9 participants (2 reporting major tech difficulty, 1 reporting cheating, 7 who didn't get at least 1 correct on both lists), we ended up with *n* = 22 participants, with these performance stats:
@[osf](hwnd8)
Seems promising--more responses near the middle of the distribution than in the previous pilot.
Split by test order:
@[osf](8v75n)
Also promising--especially for FR -> CR: similar performance, and fewer near-floor responses on CR than on FR. As with the previous pilot, perhaps we could justify running all subjects FR -> CR with the object wordset?
**Pilot 2b: Using a different, more restricted "Animals" wordset**
Given that the Object word results were not particularly compelling, we also tried the Animal words from P&S (2016). Although Animal CR performance was lower than Object CR performance in P&S (2016) and in many of our subsequent animacy studies, Animal words have shown an FR advantage. Thus, it is possible that Animals might lead to better performance in the current hybrid "CR study -> FR test" design.
Another post-exclusion *N* = 20 on Prolific (of 30 total; 1 excluded for technical issues, 1 for understanding less than 75% of the words, and 10 who didn't get at least 1 right on both lists):
@[osf](89cuh)
Still floor issues (if anything, performance seems to be a bit lower than with Objects). Split by task order:
@[osf](5q4yh)
Not too promising. It is a bit interesting, though, that surprise FR Animal performance was lower than surprise FR Object performance--this might suggest that the study phase (i.e., CR, where there is an Object advantage) matters more than the test phase (i.e., FR, where there is an Animal advantage) for performance.
But overall, seems unlikely that tweaking the wordset or increasing CR study time will solve our problems.
**Pilot 3: Using higher-performance DRM word pairs from another experiment**
In an additional experiment (using related probe-lure DRM word pairs--see https://osf.io/pfhu9/), CR performance was higher than that observed in our other experiments. We reasoned that this wordset might work for the "Surprise FR" design.
So, we pilot tested these words with a small group of Prolific participants (*N* = 15). Participants received a single CR study list containing the 20 pairs we used in the prior experiment (along with 4 primacy/recency buffers).
Applying the exclusion criteria from our previous experiments yielded a high exclusion rate: 7 participants failed to get at least 3 correct on the surprise FR test, 1 understood less than 75% of the words, and 1 reported completing a similar previous study on another platform.
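For concreteness, here is a minimal sketch of how exclusion criteria like these might be applied to a per-participant summary table. The data frame and column names are hypothetical illustrations, not our actual analysis pipeline:

```python
import pandas as pd

# Hypothetical per-participant summary (illustrative values only).
pilot = pd.DataFrame({
    "participant":          ["p01", "p02", "p03", "p04"],
    "surprise_fr_correct":  [2, 7, 5, 4],        # correct on the surprise FR test
    "pct_words_understood": [100, 70, 95, 100],  # self-reported comprehension
    "did_similar_study":    [False, False, True, False],
})

# The three criteria described above, combined with OR so that a participant
# can be flagged by more than one criterion (hence overlapping counts).
excluded = (
    (pilot["surprise_fr_correct"] < 3)
    | (pilot["pct_words_understood"] < 75)
    | (pilot["did_similar_study"])
)

kept = pilot[~excluded]
print(f"Excluded {excluded.sum()} of {len(pilot)}; analyzing n = {len(kept)}.")
```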
As in our previous attempts, performance was at floor, both after making exclusions:
@[osf](mcea2)
...and when including all participants:
@[osf](jac5p)
**Pilot 4: Simplifying instructions + color-coding words**
The previous pilots suggest that it may not be the wordset but the task itself that contributes to poor CR performance. E.g., it is possible (a) that studying for CR is unlikely to result in good performance on an FR test, or (b) that participants were in some way misunderstanding the test. To address possibility (b), I tweaked the previous pilot in the following ways (a rough sketch of the resulting study-list structure follows the list):
- Reduced the # of studied words back to 15 (from 20), while keeping the 4 primacy/recency buffers.
- Color-coded the cues (black) and targets (red)
- With the above, greatly simplified the test instructions ("We told you we would present the black words and ask you for the corresponding red word, but we just want you to recall as many red words as you can")
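To make the trial structure concrete, here is a rough sketch of how the color-coded study list could be assembled. The word pairs, the 2-primacy/2-recency split of the buffers, and all names below are placeholders, not the actual DRM materials or experiment code:

```python
import random

# Placeholder pairs; the pilot studied 15 items plus the 4 buffers described above.
critical_pairs = [("mountain", "valley"), ("doctor", "nurse"), ("bread", "butter")]
buffer_pairs = [("chair", "table"), ("ocean", "wave"), ("shoe", "sock"), ("clock", "watch")]

def make_study_trials(critical, buffers):
    """Build study trials with black cues and red targets; buffers bookend the list
    (assumed split: first 2 as primacy buffers, last 2 as recency buffers)."""
    shuffled = random.sample(critical, len(critical))
    ordered = buffers[:2] + shuffled + buffers[2:]
    return [
        {"cue": cue, "cue_color": "black", "target": target, "target_color": "red"}
        for cue, target in ordered
    ]

trials = make_study_trials(critical_pairs, buffer_pairs)
```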
From a pilot sample of *N* = 20 Prolific participants, I excluded 5 (3 who reported understanding less than 50% of the words, 1 reporting a major distraction, 2 who got 0 correct on the test). Performance statistics were as follows:
@[osf](dteh7)
Performance is higher than in the previous DRM pilot, but still lower than desired--mean performance was similar to that observed in the main experiment.
For this pilot, I did include a qualitative question about the memory test experience (i.e., "Did you find it easy/hard, and if so, why?"). The question included example prompts about the instructions ("Were the instructions clear?") and about mixing up cues/targets ("Did you have trouble remembering if a word was black or red?"). Looking at the responses:
- 20% mentioned having trouble remembering whether a word was a cue or target
- 15% mentioned it was difficult because there were too many pairs
- 25% mentioned that they focused more on the cues than the targets (+ forming associations)
At the very least, these results seem to suggest that we can't expect factors normally associated with better CR performance (e.g., cue-target relatedness) to improve "surprise FR" performance after CR study.
**Pilot 5: SONA sample**
To date, we had not tested the "Surprise FR" procedure in an undergraduate sample. So, we tested a pilot sample of *N* = 24 students using the DRM lists/pairs and CR/FR between-subjects design from Pilot 3/Experiment 5. After excluding 2 participants who reported completing a prior study, and 6 participants who didn't get at least 3/20 correct on one of the lists (notably, all in the CR condition), we were left with a pilot sample of *N* = 17 (one overlap in exclusion criteria):
@[osf](5rxnz)
Aside from the hefty # of exclusions for CR participants, performance on the Surprise FR after CR study was characteristically low. So, it's unlikely that a sample change will solve our floor problems.
**Potential next steps**
- **Change the cued recall task?** Based on the results of Pilot 4, maybe framing the CR task as "memorize both words so that you could recall one if the other was presented"?
- **Consider repeated presentations at study?** Evidence for distributed-practice effects in particular (e.g., tiger - dolphin, eagle - dog, swan - ladybug, tiger - dolphin, eagle - dog, ...; see the sketch below) suggests we'd get an improvement. Of course, there is a concern that this will push FR performance to ceiling. Could be worth a pilot, though?
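For illustration, a minimal sketch of what a distributed-practice study order could look like; the pairs, the number of repetitions, and the function below are hypothetical, not a planned implementation:

```python
import random

def distributed_order(pairs, repetitions=2, shuffle_each_cycle=False):
    """Cycle through the full pair list `repetitions` times so that repeats of a
    given pair are separated by presentations of every other pair (spaced, not massed)."""
    order = []
    for _ in range(repetitions):
        cycle = list(pairs)
        if shuffle_each_cycle:
            random.shuffle(cycle)
        order.extend(cycle)
    return order

pairs = [("tiger", "dolphin"), ("eagle", "dog"), ("swan", "ladybug")]
print(distributed_order(pairs, repetitions=2))
# -> tiger-dolphin, eagle-dog, swan-ladybug, tiger-dolphin, eagle-dog, swan-ladybug
```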