This OSF archive contains all datasets and analysis scripts for a project analyzing early second-language vocabulary learning in a large dataset, by Elise Hopman, [Bill Thompson][2], [Joe Austerweil][3], and [Gary Lupyan][4]. It also contains the PDF of the six-page CogSci paper and the slides for the talk presented at CogSci 2018.
The regression analysis itself is in the file 'script7_duolingo_regression_analysis.R' and was run on the (trimmed) dataset 'dataset6_duolingo_analyzed_corpus.csv'.
The folder 'scripts and data' contains all Python and R scripts, as well as the .txt and .csv data files, that we used to get from the original Duolingo learning traces dataset released by [Settles & Meeder (2016)][5] to the corpus with psycholinguistic predictors that we analyzed. Most of our coding work consisted of combining the Duolingo data with other corpora, so scripts 1-6 deal with creating the dataset. We have made all scripts and intermediate datasets available in case they are of interest to anyone; the **most useful dataset for other researchers interested in investigating the Duolingo dataset from a psycholinguistic point of view** is the file on this OSF storage named 'dataset5_duolingo_full_corpus.csv'.
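
For readers who want a quick look at 'dataset5_duolingo_full_corpus.csv' before opening the scripts, the following is a minimal sketch of how one might list its columns and count its rows in Python. It is not one of the project's scripts: the helper name `summarize_csv` is hypothetical, and the sketch assumes the file is a UTF-8, comma-separated file with a header row and has been downloaded into the working directory. No column names are assumed.

```python
import csv
from pathlib import Path


def summarize_csv(path, n_preview=3):
    """Return (column names, row count, first n_preview rows) for a CSV file.

    Hypothetical helper for illustration; assumes UTF-8 and a header row.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        preview = []
        row_count = 0
        for row in reader:
            # Keep only the first few rows in memory; count the rest.
            if row_count < n_preview:
                preview.append(row)
            row_count += 1
    return columns, row_count, preview


if __name__ == "__main__":
    # Assumed local path: the file downloaded from this OSF storage.
    corpus = Path("dataset5_duolingo_full_corpus.csv")
    if corpus.exists():
        columns, row_count, preview = summarize_csv(corpus)
        print("columns:", columns)
        print("rows:", row_count)
        for row in preview:
            print(row)
    else:
        print(f"{corpus} not found; download it from the OSF storage first.")
```

Reading the file row by row keeps memory use low, which may matter if the corpus is large; the same helper works on the intermediate datasets as long as they share the CSV layout assumed here.
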
If you have any questions about any of these data or scripts, please feel free to contact us at hopman@wisc.edu.
[1]: https://github.com/duolingo/halflife-regression
[2]: https://billdthompson.github.io
[3]: https://alab.psych.wisc.edu/people/
[4]: http://sapir.psych.wisc.edu
[5]: https://github.com/duolingo/halflife-regression