This OSF project is associated with the following study, which is available on [PsyArXiv](https://psyarxiv.com/4vtja/):

- Sönning, Lukas. 2023. *Down-sampling from hierarchically structured corpus data.* PsyArXiv. https://psyarxiv.com/4vtja/

The study was presented at *Corpus Linguistics 2023* in Lancaster, UK. The **presentation slides** can be found [here](https://osf.io/g6uey).

This is the **abstract**:

- *Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: time, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.*

The study uses **data** from Jenset and McGillivray's (2017) monograph *Quantitative historical linguistics: A corpus framework*. For the purposes of the present methodological study, this dataset was reduced to a subset of 11,645 tokens, and the coding of some variables was revised, completed, or modified. The resulting dataset is available in the TROLLing archive:

- Sönning, Lukas. 2023. *Data from Jenset & McGillivray (2017), adapted for "Down-sampling from hierarchically structured corpus data"*. DataverseNO, V1. https://doi.org/10.18710/5KCE4U

For the documentation of the analyses in the paper, we tried to follow the **TIER protocol 4.0** (https://www.projecttier.org/tier-protocol/). The file **00ReadMe.pdf** gives instructions for reproducing the analyses. Note that all **R scripts** (see folder "scripts") are commented in detail.

**Images** created for this study can be found in the folder "output/figures". They are published under a Creative Commons Attribution 4.0 licence (**CC BY 4.0**), which permits reuse and adaptation provided that the source is credited (see http://creativecommons.org/licenses/by/4.0).
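To make the two selection schemes named in the abstract more concrete, here is a minimal sketch in Python. It is not taken from the project's R scripts, and several details are assumptions made for illustration only: the function names, the toy token records, the choice of weight (one over the product of the author-specific and verb-specific token counts), and the use of Efraimidis-Spirakis keys for weighted sampling without replacement. With that technique, selection probabilities follow the weights only approximately once the sample is a sizeable share of the data.

```python
import heapq
import math
import random
from collections import Counter


def simple_downsample(tokens, k, seed=0):
    """Draw k tokens; every token has the same selection probability."""
    rng = random.Random(seed)
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, seed=0):
    """Draw k tokens without replacement, favouring sparse authors and verbs.

    ASSUMPTION: weight = 1 / (author token count * verb token count).
    The paper may define the weight differently.
    """
    rng = random.Random(seed)
    per_author = Counter(t["author"] for t in tokens)
    per_verb = Counter(t["verb"] for t in tokens)
    keyed = []
    for i, t in enumerate(tokens):
        weight = 1.0 / (per_author[t["author"]] * per_verb[t["verb"]])
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        # Efraimidis-Spirakis key on the log scale: log(u) / weight
        keyed.append((math.log(u) / weight, i))
    chosen = heapq.nlargest(k, keyed)  # the k largest keys form the sample
    return [tokens[i] for _, i in chosen]


def count_author(sample, author):
    """Count the tokens in a sample that come from one author."""
    return sum(t["author"] == author for t in sample)


# Hypothetical toy data: author A contributes 900 tokens, author B only 100.
tokens = (
    [{"id": i, "author": "A", "verb": "have"} for i in range(900)]
    + [{"id": 900 + i, "author": "B", "verb": "have"} for i in range(100)]
)

simple = simple_downsample(tokens, 100, seed=1)
structured = structured_downsample(tokens, 100, seed=1)
print("tokens from B (simple):    ", count_author(simple, "B"))
print("tokens from B (structured):", count_author(structured, "B"))
```

In this toy setting the simple scheme mirrors the imbalance between the two authors, while the structured scheme draws a much larger share of its tokens from the sparsely represented author B.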