The manuscript is available as a **preprint** on *PsyArXiv* (https://psyarxiv.com/h9mvs/):

- Sönning, Lukas. (in review). Evaluation of text-level measures of lexical dispersion: Robustness and consistency. *PsyArXiv preprint*.

Here is the **abstract**:

- *The traditional approach to measuring lexical dispersion is to form corpus parts of equal size and then compare the occurrence rate of an item across these units. In recent methodological work, this strategy has met with criticism due to its ignorance to corpus structure. Dispersion, it is argued, should be measured across linguistically meaningful units such as the individual text files constituting the corpus. Though desirable on linguistic grounds, a shift to texts as the unit of analysis raises new methodological issues. While the ability of dispersion measures to handle unevenly-sized corpus units has received attention in the literature, the question of how existing metrics perform in these novel settings has only been partly addressed. This paper aims to shed light on relevant statistical properties of a wide range of text-level dispersion measures. Specifically, we consider the robustness of different indicators, i.e. whether they are (overly) sensitive to data situations that can arise when texts differ (considerably) in length. We use hypothetical data scenarios to identify weak spots in existing measures, and then propose modifications to DP- and DA-related indexes to implement useful statistical properties and effect more resistant estimators. Along with the other measures, these are then evaluated against actual corpus data drawn from the BNC. We observe that adapted DP- and DA-variants perform at least as well as their original versions. Our permutation-based simulation study also demonstrates that Carroll’s D2 shows the same weakness as Juilland’s D, i.e. a noticeable sensitivity to the number of units that enter the analysis.*

**Data** used in the study will be published on TROLLing. Since the paper is in review, only an anonymized version of the data set can currently be accessed (use second link):

- Sönning, Lukas. 2022. Biber et al.'s (2016) set of 150 BNC items for the analysis of dispersion measures: Dataset for "Evaluation of text-level measures of lexical dispersion", https://doi.org/10.18710/MNVB36, DataverseNO, DRAFT VERSION. [An anonymized version of the dataset is available at https://dataverse.no/privateurl.xhtml?token=a25d30a0-6067-4989-837a-19468c9fa661.]

**Images** created for this study can be found in the folder "figures". They are published under a Creative Commons Attribution 4.0 licence (**CC BY 4.0**), which means that the licence terms for their use are quite generous (see http://creativecommons.org/licenses/by/4.0).
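
For readers unfamiliar with the measures named in the abstract, the following minimal sketch shows how a text-level dispersion score can be computed when texts differ in length. It implements Gries's original DP ("deviation of proportions") in its standard published form, not the adapted DP- and DA-variants proposed in the manuscript, which are not described on this page. The function name `gries_dp` and the toy counts are illustrative assumptions and are not taken from the project's code or data.

```python
from typing import Sequence


def gries_dp(counts: Sequence[int], text_lengths: Sequence[int]) -> float:
    """Gries's DP for one item across texts of possibly unequal length.

    counts[i]       -- occurrences of the item in text i
    text_lengths[i] -- number of word tokens in text i

    Returns a value between 0 (the item is spread exactly in proportion
    to text length) and values close to 1 (the item is concentrated in a
    small portion of the corpus).
    """
    if len(counts) != len(text_lengths):
        raise ValueError("counts and text_lengths must have the same length")

    total_count = sum(counts)
    total_length = sum(text_lengths)
    if total_count == 0 or total_length == 0:
        raise ValueError("item must occur at least once in a non-empty corpus")

    # Compare the share of the item's occurrences found in each text with
    # the share of the corpus that text represents, then halve the sum of
    # the absolute differences.
    return 0.5 * sum(
        abs(c / total_count - n / total_length)
        for c, n in zip(counts, text_lengths)
    )


# Hypothetical toy data: three texts of equal length, item only in the first.
concentrated = gries_dp([10, 0, 0], [1000, 1000, 1000])

# Hypothetical toy data: two texts of unequal length, item spread in
# exact proportion to text length.
proportional = gries_dp([2, 8], [200, 800])
```

Because each text's expected share is its share of the corpus, the measure accepts texts of different lengths without first cutting the corpus into equal-sized parts.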