This repository contains distributional semantic models (also called vector space models or semantic word embeddings) for verbs and nouns based on the COCA and COHA corpora (https://www.english-corpora.org/), specifically the original release of the offline version of each corpus (440 and 380 million words respectively). Some of these models were used in previous studies of syntactic productivity, notably Perek (2014, 2016, 2018), Hilpert & Perek (2015), and Perek & Hilpert (2017); the references are provided below.

**Update August 2022**: New versions of the models trained with word2vec (Mikolov et al. 2013) have been added to the repository. These models are a marked improvement over the original 'bag-of-words' models. Separate models for nouns, verbs, and adjectives are provided, trained on either the COCA or the COHA.

----------

**'Bag-of-words' models**

The 'bag-of-words' models capture word meanings (and the similarity between them) as a function of lexical collocates. They were created with the DISSECT toolkit in Python (Dinu et al. 2013), using a variety of parameters (detailed below), from co-occurrence data extracted from the corpora with bespoke software. Note, however, that the models do not provide collocation information per se: each vector consists of values that reflect the target word's degree of co-occurrence with other words, but there is no information as to what these other words are. This is in compliance with the COCA and COHA purchase agreement, which prohibits the distribution of frequency or collocation information.

There are separate 'bag-of-words' models for verbs and nouns, based on the COCA or the COHA, placed in four separate components of this repository. The models contain semantic vectors for all lemmatised common nouns (NN* tag) and lexical verbs (VV* tag) with a lemma frequency of at least 1,000 in the corresponding corpus. Each component contains models for various combinations of parameters, as reflected in the file names and explained below:

- w2, w5 = window size, i.e. how many words to the left and right of the target word the collocates are extracted from (2 words vs. 5 words).
- all10K, nvjr10K = selected collocates, i.e. which words are counted as collocates of the target words (the 10,000 most frequent words in the corpus vs. the 10,000 most frequent nouns, verbs, adjectives, and adverbs; in both cases, the collocates are lemmatised and PoS-tagged).
- plmi, ppmi, raw = weighting scheme, i.e. how the degree of co-occurrence between the target word and each collocate is measured (Positive Local Mutual Information vs. Positive Pointwise Mutual Information vs. raw frequency counts, i.e. no weighting scheme).
- svd300 = dimensionality reduction, i.e. whether Singular Value Decomposition is used to reduce the 10,000 collocate dimensions to 300 dimensions.

The more "basic" models (i.e. the 'raw' models and, more generally, all the non-dimensionality-reduced models) can be used to create new models with a different combination of parameters and/or different algorithms.
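As a rough illustration of what the weighting and dimensionality-reduction parameters do, the sketch below shows how a raw co-occurrence matrix could be re-weighted with PPMI and reduced to 300 dimensions with SVD, using plain numpy. The original models were built with DISSECT, so this is not the code behind the repository; the function names are illustrative, and the commented-out usage assumes a matrix of raw counts taken from one of the 'raw' models (file format described further down).

```python
# Illustrative only: re-weighting a raw co-occurrence matrix with PPMI and
# reducing it to 300 dimensions with SVD. This is a sketch of the idea behind
# the 'ppmi' and 'svd300' settings, not the DISSECT code used for the models.
import numpy as np

def ppmi(counts):
    """Positive Pointwise Mutual Information weighting of a count matrix
    (rows = target words, columns = collocates)."""
    total = counts.sum()
    p_target = counts.sum(axis=1, keepdims=True) / total     # P(target)
    p_collocate = counts.sum(axis=0, keepdims=True) / total  # P(collocate)
    p_joint = counts / total                                  # P(target, collocate)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_target * p_collocate))
    pmi[~np.isfinite(pmi)] = 0.0       # zero out log(0) and 0/0 cells
    return np.maximum(pmi, 0.0)        # keep only positive associations

def svd_reduce(matrix, k=300):
    """Truncated Singular Value Decomposition down to k dimensions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# raw_counts: a (targets x 10,000 collocates) matrix of raw frequencies.
# reduced = svd_reduce(ppmi(raw_counts), k=300)
```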
----------

**Word2vec models**

The word2vec models can be found in a separate component. They were trained on the one-sentence-per-line version of the COCA or COHA with the gensim library in Python (Rehurek & Sojka 2010). Each model contains semantic vectors for all lemmatised common nouns (NN* tag), lexical verbs (VV* tag), and adjectives (JJ* tag) with a frequency of at least 100 in the corresponding corpus, as indicated in the file name.

As with the 'bag-of-words' models, models were created with various combinations of parameters, as explained below:

- w2, w5 = window size, i.e. how many words to the left and right of the target word the contexts are extracted from (2 words vs. 5 words). Unlike in the 'bag-of-words' models, all words in the contexts are included.
- skip, cbow = the type of word2vec algorithm used to train the model: SkipGram (skip) vs. Continuous Bag of Words (cbow). In the former, the neural network is trained to predict a context given a word; in the latter, to predict a word given a context (cf. Mikolov et al. 2013).
- 300d, 1000d = the dimensionality of the vectors. 1000-dimension vectors can encode finer-grained semantic information and therefore tend to perform better, but the 300-dimension vectors are more compact, with relatively little loss in quality.

----------

The models (both bag-of-words and word2vec) were evaluated and compared by measuring how well they correlate with human judgments of semantic similarity. Pearson and Spearman correlations were calculated between the cosine similarities produced by each model and the pairwise similarity judgments in the SimLex-999 and SimVerb-3500 datasets (Hill et al. 2014, Gerz et al. 2016). Separate correlations are reported for SimLex-333, a subset of SimLex-999 consisting of the 333 most strongly associated word pairs, which are often the hardest for computational models to capture. The results of these comparisons are reported in CSV files. Some models appear to perform far better than others, but note that model performance may vary with the task, so the best model for a task other than semantic similarity might be a different one.

The models are stored as '.dm' text files ('.txt' for the word2vec models), with the vector capturing the meaning of each target verb or noun stored as a row of floats separated by tabs. The names of the nouns or verbs are stored in the first column of a '.dm' file, and they can also be found in a separate '.rows' file. An example R script showing how to load and manipulate the models is also provided.
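For readers working in Python rather than R, here is a minimal sketch of loading a model and comparing two words by cosine similarity. It assumes the tab-separated layout described above, with the word in the first column and the vector components in the remaining columns; the file name and the two example verbs are hypothetical.

```python
# Minimal sketch: load a tab-separated model file and compare two words.
# Assumes the layout described above (word in the first column); the file
# name below is only an example.
import numpy as np

def load_model(path):
    """Return a dict mapping each word to its vector (numpy array)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            vectors[fields[0]] = np.array([float(x) for x in fields[1:]])
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# model = load_model("verbs.coca.w5.skip.300d.txt")   # hypothetical file name
# print(cosine(model["walk"], model["stroll"]))
```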
----------

**References**

Dinu, Georgiana, Nghia The Pham & Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013 (51st Annual Meeting of the Association for Computational Linguistics), 31–36. East Stroudsburg, PA: ACL.

Gerz, Daniela, Ivan Vulic, Felix Hill, Roi Reichart & Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, November 1-5, 2016, 2173–2182.

Hill, Felix, Roi Reichart & Anna Korhonen. 2014. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv:1408.3456. http://arxiv.org/abs/1408.3456v1 (14 April 2021).

Hilpert, Martin & Florent Perek. 2015. Meaning change in a petri dish: Constructions, semantic vector spaces, and motion charts. Linguistics Vanguard 1(1). 339–350.

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. https://arxiv.org/abs/1301.3781.

Perek, Florent. 2014. Vector spaces for historical linguistics: Using distributional semantics to study syntactic productivity in diachrony. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA, June 23-25, 2014, 309–314. East Stroudsburg, PA: ACL.

Perek, Florent. 2016. Using distributional semantics to study syntactic productivity in diachrony: A case study. Linguistics 54(1). 149–188.

Perek, Florent. 2018. Recent change in the productivity and schematicity of the way-construction: A distributional semantic analysis. Corpus Linguistics and Linguistic Theory 14(1). 65–97.

Perek, Florent & Martin Hilpert. 2017. A distributional semantic approach to the periodization of change in the productivity of constructions. International Journal of Corpus Linguistics 22(4). 490–520.

Rehurek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, May 22, 2010, 45–50. ELRA.