# README

The files in this repository accompany the paper: *When do languages use the same word for different meanings? The Goldilocks Principle in colexification*

Due to OSF storage restrictions, not all intermediate files are shared.

----

### Main files of interest

* `goldilocks-semantics/data/local/final-dfs-recoded/` contains the core data our analysis is based on, using Dutch and English as meta-languages
* `goldilocks-semantics/src/analysis1/fits/` contains the three main regression models we report on. They are generated by `goldilocks-semantics/src/analysis1/get-fits.R`

See below and `goldilocks-semantics/src/run-all.sh` for how to generate these and the intermediate files, as well as for further information on the resources we build on.

----

### Requirements

All code was executed using `R 4.1.2` and `python 3.8.13`. A conda environment, named *goldilocks* and specified in full in `environment.yaml`, fulfills all the requirements to run the code. To import and activate it, run:

```
conda env create --file environment.yaml  # import environment from YAML
conda activate goldilocks
```

Importing the environment can create conflicts that need to be resolved due to differences across operating systems (version numbers sometimes differ across them). To mitigate these issues, we share two additional environment files that you may want to try if `environment.yaml` raises conflicts: `environment-nbuild.yaml` was created using the `--no-builds` flag, and `environment-fhistory.yaml` using `--from-history`.

----

### Commented directory

* `goldilocks-semantics/data/` contains the input data that this study builds on. We distribute data from other sources where their licenses allow for it. More generally, all resources used are openly available. See below on how to obtain what we cannot redistribute (SUBTLEX frequency data)
* `goldilocks-semantics/src/` contains processing and analysis code.
See/run `run-all.sh` to run the entire pipeline in order (**WARNING**: some computations are expensive; they were run on a cluster with 500GB of RAM on multiple cores. We do not recommend running all steps in one go.)
  * `data-wrangling/` contains general-purpose scripts that manipulate CLICS data and add other resources to it
  * `fasttext-pipeline/` contains scripts that match Glottocodes to fastText embeddings and then extract, for each Glottocode with an appropriate word-embedding model, the CLICS3 concepts that can be expressed in that language
  * `cosine-pipeline/` contains scripts that compute cosine similarities of meanings that can be expressed in a given language
  * `asso-pipeline/` contains scripts that compute associativity measures from Small World of Words data. They are slightly adapted versions of code written by Simon De Deyne. The original scripts are available at [github.com/SimonDeDeyne/SWOWEN-2018](https://github.com/SimonDeDeyne/SWOWEN-2018)
  * `analysis1/` contains scripts pertaining to the regression models reported on. The subfolder `fits/` contains the main models we discuss. The other models are not included due to storage limitations, but reach out if you want them
  * `analysis2/` contains scripts pertaining to the WordNet analysis reported on

----

### Data sources

* The source [CLICS3](https://clics.clld.org/) database is available at: [https://github.com/clics/clics3](https://github.com/clics/clics3).
The file `data/clics/df_all_raw.csv` is the output of `src/data-wrangling/get-all-raw.R`: a CSV file with the data from CLICS3 (`data/local/clics3/clics.sqlite`) enriched with [Concepticon](https://concepticon.clld.org/)'s [MRC](http://websites.psychology.uwa.edu.au/school/MRCDataBase/uwa_mrc.htm) data (`data/local/concepticon-data/concepticondata/concept_set_meta/mrc.tsv`)
* The phylogenetic distance information from Jäger 2018 is available at: [https://osf.io/cufv7/](https://osf.io/cufv7/); the relevant data is read in from `data/distance/pmiWorld.csv`
* fastText embeddings are automatically downloaded and queried when running `src/cosine-pipeline/get_multiling_cosines.py`, but they can also be downloaded manually at: [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html)
* The source associativity data from Small World of Words is available at: [https://smallworldofwords.org](https://smallworldofwords.org). More precisely, go to [https://smallworldofwords.org/en/project/research](https://smallworldofwords.org/en/project/research) to download the Dutch and English data. The code to obtain the three transformations of the data reported on is a minimally adapted version of the scripts in [https://github.com/SimonDeDeyne/SWOWEN-2018](https://github.com/SimonDeDeyne/SWOWEN-2018). Finally, the folders `SWOWEN-2018/` and `SWOWNL-2012/` redistribute the minimal data from SWOW necessary to reproduce the computations reported on in the analyses
* WordNet data is automatically retrieved through Python's [NLTK module](https://www.nltk.org/) but can also be manually queried at: [https://wordnet.princeton.edu/](https://wordnet.princeton.edu/)
* The frequency information for English from SUBTLEX-US is available at: [https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus](https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus).
Step-by-step commands for an automated download using `curl` are provided within `src/run-all.sh`
* The frequency information for Dutch from SUBTLEX-NL is available at: [http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl](http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl). Step-by-step commands for an automated download using `curl` are provided within `src/run-all.sh`

----

### Contact

Get in touch with Thomas Brochhagen (thomas.brochhagen@upf.edu) if you have any questions or comments.
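----

### Illustration: cosine similarity

The per-language similarities computed by the cosine pipeline follow the standard cosine-similarity definition over fastText embedding vectors. A minimal self-contained sketch, with short toy lists standing in for the real fastText vectors (see `src/cosine-pipeline/` for the actual scripts):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" standing in for fastText vectors
u = [1.0, 0.0, 1.0]
w = [0.0, 1.0, 0.0]

print(cosine(u, u))  # identical vectors: 1.0
print(cosine(u, w))  # orthogonal vectors: 0.0
```

In the pipeline proper, this computation is applied pairwise to the word vectors of the CLICS3 concepts expressible in each language (`src/cosine-pipeline/get_multiling_cosines.py`).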