Title: Understanding corpus text prototypicality: A multifaceted problem

Authors: Anthony, L., Smith, N., Hoffmann, S., & Rayson, P.

Keywords: prototypicality; representativeness; sampling; corpus building; DDL; close reading; stylistics

Abstract: Prototypicality is a complex, multifaceted concept relating to the centrality and typicality of examples in a category. While prominent in cognitive psychology and linguistics, it is often overlooked in corpus studies. Corpora are ideally built to be representative of a target domain or language variety. To achieve this goal, corpus builders need to identify an accurate sampling frame and collect relevant texts that capture the diversity of language in and across the sampling categories. In practice, however, corpora are built within the limitations of text availability, time, and human resources, leading to questions about the suitability/prototypicality of individual texts in a corpus and their effect on the representativeness of the corpus as a whole.

Prototypicality also comes into play at the analysis stage. Most corpus analysis approaches, including concordance and keyword analysis, use the corpus as a whole as the unit of analysis. To validate findings, a necessary but often omitted step is the close reading of individual texts. Here, a significant challenge is identifying which texts to read. A researcher may decide to choose texts at random, but it is an open question whether such texts are representative/prototypical of the corpus. Prototypicality also comes into play when corpora are used for pedagogic purposes, such as Data-Driven Learning (DDL). In these situations, there is often an implicit conflation of two facets of prototypicality, namely frequency of use and closeness to an ideal, particularly in the case of expert writing.

In this paper, we first outline the multifaceted character of corpus text prototypicality. Next, we describe experiments that attempt to rank the prototypicality of individual corpus texts at different linguistic levels as a guide to choosing texts for close reading or excluding texts from a corpus at the data collection stage. Results using a modified version of the ProtAnt tool (Anthony and Baker, 2015) show that prototypicality rankings can be dramatically affected by the linguistic level of analysis applied. Standard keywords effectively rank the prototypicality of texts in terms of topic, but the results can be enhanced using key semantic tags. On the other hand, key part-of-speech (POS) tags allow for a more nuanced view of text prototypicality centered on stylistics. The results also reveal the limitations of current corpus software tools and offer suggestions for how new tools might be developed to increase our understanding of prototypicality at the textual level.

APA Citation: Anthony, L., Smith, N., Hoffmann, S., & Rayson, P. (2023, May 18). Understanding corpus text prototypicality: A multifaceted problem [Conference presentation]. ICAME 44, NWU Vanderbijlpark, South Africa. https://osf.io/zc7gk.