# README

The files in this repository accompany the paper: *From language development to language evolution: A unified view of human lexical creativity*

Due to OSF storage and third-party restrictions, not all intermediate files are shared.

----

### Main files of interest

* `multiModalChildColex/data/local/fits/noGP/` contains the three best regression models we report on. The colexification model is prefixed by `m-`, the overextension model by `o-`, and the semantic change one by `d-`. They are generated by:
    * `multiModalChildColex/src/logreg/get-fits-colex.R`
    * `multiModalChildColex/src/logreg/get-fits-overext2.R`
    * `multiModalChildColex/src/logreg/get-fits-datsemshift.R`

See below and `multiModalChildColex/src/data-manipulation.sh` for how to generate these and intermediate files, as well as for further information on the resources we build on.

----

### Requirements

All code was executed using `R 4.1.3` and `python 3.10.4`. A conda environment, named *renv* and specified in full in `environment.yml`, fulfills all the requirements to run the code. To import and activate it, run:

```
conda env create --file environment.yml  # import environment from YML
conda activate renv
```

Importing the environment can create conflicts that need to be resolved due to differences across operating systems (version numbers sometimes differ across them). In an attempt to mitigate these issues, we share two additional environments that you may want to try if `environment.yml` raises conflicts: `environment-nbuild.yml` was created using the `--no-builds` flag, and `environment-fhistory.yml` using `--from-history`.

----

### Commented directory

* `multiModalChildColex/data/` contains the input data that this study builds on. We distribute data from other sources where their licenses allow for it. More generally, all resources are openly available. See below on how to obtain what we cannot directly redistribute
* `multiModalChildColex/src/` contains processing and analysis code. See/run `data-manipulation.sh` to run the entire pipeline in order (**WARNING**: some computations are expensive; they were run on a cluster with 500GB of RAM on multiple cores. We do not recommend running all steps in one go; see the sketch after this list.)
    * `data-wrangling/` contains general-purpose scripts that manipulate datasets and add other resources to them
    * `affectiveness/`, `associativity/`, `taxonomy/`, and `vision/` contain scripts to process the respective knowledge structure
    * `logreg/` contains scripts that pertain to the logistic regressions, robustness checks, and diagnostics
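The stages of the pipeline can also be run one at a time instead of via `data-manipulation.sh`. Below is a minimal sketch for regenerating the three reported fits; it assumes the *renv* environment is active, the working directory is `multiModalChildColex/`, and the scripts take no command-line arguments (check `src/data-manipulation.sh` for the canonical invocations and the intermediate steps these scripts depend on):

```
# Regenerate the three reported fits one at a time
# (sketch: invocation details are assumptions; see data-manipulation.sh)
Rscript src/logreg/get-fits-colex.R        # colexification model  -> data/local/fits/noGP/m-...
Rscript src/logreg/get-fits-overext2.R     # overextension model   -> data/local/fits/noGP/o-...
Rscript src/logreg/get-fits-datsemshift.R  # semantic change model -> data/local/fits/noGP/d-...
```

----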
### Data sources

* The [CLICS3](https://clics.clld.org/) database is available at: [https://github.com/clics/clics3](https://github.com/clics/clics3). The file `data/clics/df_all_raw.csv` is the output of `src/data-wrangling/get-all-raw.R`: a CSV file with the data from CLICS3 (`data/local/clics3/clics.sqlite`) enriched with [Concepticon](https://concepticon.clld.org/)'s [MRC](http://websites.psychology.uwa.edu.au/school/MRCDataBase/uwa_mrc.htm) data (`data/local/concepticon-data/concepticondata/concept_set_meta/mrc.tsv`)
* The overextension data from Ferreira-Pinto & Xu (2021) is available at [https://github.com/r4ferrei/computational-theory-overextension/tree/master/dataset](https://github.com/r4ferrei/computational-theory-overextension/tree/master/dataset). The script `data-manipulation.sh` provides a convenience curl command to download the data automatically
* The DatSemShift data is available at: [https://datsemshift.ru/](https://datsemshift.ru/)
* The source associativity data from Small World of Words is available at: [https://smallworldofwords.org](https://smallworldofwords.org). More precisely, go to [https://smallworldofwords.org/en/project/research](https://smallworldofwords.org/en/project/research) to download the English data. The code to obtain the transformation of the data reported on is a minimally adapted version of the scripts in [https://github.com/SimonDeDeyne/SWOWEN-2018](https://github.com/SimonDeDeyne/SWOWEN-2018). Finally, the folder `SWOWEN-2018/` redistributes the minimal SWOW data necessary to reproduce the computations reported on in the analyses
* WordNet data is automatically retrieved through Python's [NLTK module](https://www.nltk.org/) but can also be queried manually at: [https://wordnet.princeton.edu/](https://wordnet.princeton.edu/) (a one-off download snippet closes this README)
* The affectiveness norms from Mohammad (2018) are available at: [https://saifmohammad.com/WebPages/nrc-vad.html](https://saifmohammad.com/WebPages/nrc-vad.html). The script `data-manipulation.sh` provides a convenience curl command to download the data automatically
* Due to copyright and accessibility restrictions, the affectiveness norms of Warriner et al. (2013) need to be retrieved manually at: [https://doi.org/10.3758%2Fs13428-012-0314-x](https://doi.org/10.3758%2Fs13428-012-0314-x)
* The post-processed similarities between images from Visual Genome, from both the supervised and the self-supervised model, are available in `data/vision`. The full processing pipeline for the supervised model is documented in: [https://osf.io/q72ne/](https://osf.io/q72ne/). The self-supervised model is deployed in analogous fashion: use `scripts/2-select_VG_objects.py` from [https://osf.io/q72ne/](https://osf.io/q72ne/) to extract objects from Visual Genome and then `src/vision/object_prototypes_release.py` to process them

----

### Contact

Get in touch with Thomas Brochhagen (thomas.brochhagen@upf.edu) if you have any questions, comments, or corrections.
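----

As a worked example of the automatic WordNet retrieval mentioned under *Data sources*, the corpus can be fetched once up front (a minimal sketch; it assumes the *renv* environment is active and that `wordnet` is the NLTK corpus identifier the scripts rely on):

```
# One-off download of the WordNet corpus via NLTK
# (sketch: run before any script that queries WordNet)
python -c "import nltk; nltk.download('wordnet')"
```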