Main content
A sentence embedding approach to concordance searching and sorting
Date created: | Last Updated:
: DOI | ARK
Creating DOI. Please wait...
Category: Project
Description: Concordancing has long been a cornerstone of corpus linguistics research, providing scholars with a powerful method to explore lexical and grammatical patterns in target corpora. It is also one of the most common approaches introduced to learners in a data-driven learning (DDL) classroom. Despite the strengths of the approach, it also suffers from two major limitations. Firstly, concordance searching requires the use of single or multi-word queries that are often fixed in nature and can quickly increase in complexity depending on the aim. For example, to account for possible variations in usage, these queries usually require the use of alternative options or the inclusion of in-word or between-word wildcards. If the researcher, teacher, or learner hopes to capture subtle variations in usage in the corpus (e.g., spelling differences between UK and US speakers, idiomatic expression with synonym variations, semantically equivalent words or phrases), these differences have to be recognized from the outset and accounted for in the query. A second limitation of concordancing relates to the sorting of results. Typically, results are sorted alphabetically on the center (node) word, or words to the left or right of the node word. This ordering leads researchers, teachers, and learners to have to scan through all results to find relevant, salient patterns of usage. Recently, we have seen innovations such as KWIC patterns (Anthony, 2018, 2022) that calculate the frequency of occurrence of concordance result patterns and order the results accordingly. However, even here, if the query generates many thousands of hits for a particular pattern, there is still a need to sort these results in some meaningful way before they can be interpreted. Over the past year, much attention has begun to focus on the potential impact of Artificial Intelligence (AI) on corpus research. In this paper, I introduce an innovative approach to concordance querying and sorting that integrates traditional concordance methods with transformer-based sentence (or sentence fragment) embeddings. Using sentence embeddings, I show how concordance search queries can be greatly simplified and also allow for more nuanced and context-aware analysis of linguistic phenomena than previously possible. In a case study using the BE06 (Baker, 2009) and AmE06 (Potts and Baker, 2012) corpora, I first demonstrate how traditional concordance queries can be interpreted in a “fuzzy” way, allowing subtle differences in language usage to be captured without the need for careful crafting of the query itself. Next, I show how an embedding model can be used to cluster the results of a traditional concordance analysis based on semantic similarity, leading to novel groupings and orderings of results. I then show how an embedding model can be used to match expressions in one language variety with those in another, leading to truly novel concordance analyses. The paper finishes with a discussion of future directions in AI and the potential impact on concordance tool development.