Data

------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Note that all csvs are gzipped.

Associations_computed.csv

Used for Sect. "3.1.1 Comparing entropy/length distributions across LMs and corpora"

This dataset contains the following columns/variables:
- corpus1 and corpus2
- type1 and type2
- var1 and var2
- LM1 and LM2
- rho_none and N_none
- rho_phylo and N_phylo
- rho_geo and N_geo
- rho_both and N_both

Each row gives the pairwise correlations (N_xxx indicates the number of cases) for the combinations of corpus1 and corpus2 as well as LM1 and LM2, including both unfiltered (rho_none) and filtered (rho_phylo, rho_geo, and rho_both) correlations.

Below, we describe the details for the first group of variables (LM1, var1, etc.), which apply analogously to the second group (LM2, var2, etc.).
- type1: symbol level (word_level, char_level, bpe_level)
- var1: h_trained, length, K_ext
- LM1 can take the following values:
  - One of the seven LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig)
  - best (i.e., the "best" h_trained or K_ext per document)
  - PPMtrunc (only relevant for section A.3.1 Testing for a potential systematic length bias)
  - LENGTH
  - UNIGRAM (H_cr for the Crúbadán Info)

Please note the following rules:
- If LM1 = 'LENGTH', then var1 must be 'length'.
- If LM1 = 'best', then either LM2 is 'best' or var2 is 'length', and vice versa.
- If type1 = 'bpe_level', then LM1 can't be 'LENGTH' and var1 can't be 'length'.
- If LM1 = 'PPMtrunc', then type1 can't be 'bpe_level'.
- If LM1 = 'UNIGRAM', then type1 can only be 'word_level' and var1 can only be 'h_trained'.
- If var1 = 'K_ext', then var2 also has to be 'K_ext'.

For var1 = 'length' or 'h_trained', calculations are performed across symbols, corpora, and all LMs, except 'best' and 'PPMtrunc'.
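The rules above can be checked mechanically. The following is a minimal sketch, not part of the original materials: the function name violated_rules is hypothetical, and it assumes that each row is available as a dict keyed by the column names listed above (LM1, LM2, var1, var2, type1, type2) and that every rule applies symmetrically to both groups of variables.

```python
def violated_rules(row):
    """Return a list of README rules that a row (a dict) does not satisfy.

    Hypothetical helper: assumes the keys LM1/LM2, var1/var2, type1/type2
    and checks each rule in both directions (group 1 vs. 2 and 2 vs. 1).
    """
    problems = []
    for a, b in (("1", "2"), ("2", "1")):
        lm, var, typ = row["LM" + a], row["var" + a], row["type" + a]
        lm_other, var_other = row["LM" + b], row["var" + b]

        if lm == "LENGTH" and var != "length":
            problems.append(f"LM{a} = 'LENGTH' requires var{a} = 'length'")
        if lm == "best" and not (lm_other == "best" or var_other == "length"):
            problems.append(f"LM{a} = 'best' requires LM{b} = 'best' or var{b} = 'length'")
        if typ == "bpe_level" and (lm == "LENGTH" or var == "length"):
            problems.append(f"type{a} = 'bpe_level' excludes LENGTH/length")
        if lm == "PPMtrunc" and typ == "bpe_level":
            problems.append(f"LM{a} = 'PPMtrunc' excludes type{a} = 'bpe_level'")
        if lm == "UNIGRAM" and (typ != "word_level" or var != "h_trained"):
            problems.append(f"LM{a} = 'UNIGRAM' requires word_level and h_trained")
        if var == "K_ext" and var_other != "K_ext":
            problems.append(f"var{a} = 'K_ext' requires var{b} = 'K_ext'")
    return problems


# Illustrative rows (made-up values, not taken from the dataset).
ok_row = {"LM1": "PPM2", "var1": "h_trained", "type1": "word_level",
          "LM2": "LSTM", "var2": "h_trained", "type2": "char_level"}
bad_row = dict(ok_row, LM1="LENGTH")  # LENGTH paired with var1 = 'h_trained'
```

Since the csvs are gzipped, rows of this shape could be produced with csv.DictReader over a file opened via gzip.open(path, "rt"); the exact file name on disk is not stated here and would need to be checked.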
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Full_multilevel_effects_model_results.csv

Used for Sect. "3.1.2 Multi-model multilevel inference"

Contains estimates for all 3,520,000 MLEM models, which form the basis for the FMA analyses, Sect.
- ID = unique model identifier
- N = Number of cases
- LL = Log-likelihood (maximised)
- k_r = Number of random effects
- k_f = Number of fixed effects
- type: word_level, char_level, bpe_level
- outcome: h, K, or L (h_trained, K_ext, or length)
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- out_type: standardized (outcome standardized per corpus; results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- converged (0/1): whether the model converged within the 20 gradient-based iterations (1) or not (0).
- feffects: fixed effects considered for the respective model
- reffects: random effects/intercepts considered
- rslopes: random slopes considered (only if the corresponding random effect is also considered for the respective model).

For each fixed effect, there are 2 designated columns, beta_*feffect* and se_*feffect* (e.g., beta_log_pop and se_log_pop), that list the relevant estimates for the fixed effect in question. If *feffect* is not included in the corresponding model, the cells are empty.
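The columns N, LL, k_f, and k_r are sufficient to recompute the four information criteria that this README defines for this file (AIC, BIC, CAIC, BICn). A minimal sketch follows; the function name information_criteria is hypothetical, the natural logarithm is assumed for log(N), and the numeric inputs are made up for illustration.

```python
import math


def information_criteria(LL, N, k_f, k_r):
    """Compute AIC, BIC, CAIC and BICn from one model row.

    LL  : maximised log-likelihood
    N   : number of cases
    k_f : number of fixed effects
    k_r : number of random effects
    Formulas follow the README text; log is assumed to be the natural log.
    """
    k = k_f + k_r
    return {
        "AIC": -2 * LL + 2 * k,
        "BIC": -2 * LL + math.log(N) * k,
        "CAIC": -2 * LL + k * (math.log(N) + 1),
        "BICn": -2 * LL + math.log(N - k_f) * k_r,
    }


# Illustrative values only (not taken from the dataset).
ic = information_criteria(LL=-100.0, N=50, k_f=3, k_r=2)
```

Within one set of candidate models, the model with the smallest value of the chosen criterion is the preferred one under that criterion.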
Considered fixed effects: log_pop EGIDSg1 EGIDSg2 EGIDSg3 EGIDSg4 log_N_countries parallel
Considered random effects: script corpus macroarea macrofamily subfamily ISO
Considered random slopes for log_pop: script corpus macroarea macrofamily subfamily

AIC is computed as -2*LL + 2*(k_f + k_r)
BIC is computed as -2*LL + log(N)*(k_f + k_r)
CAIC is computed as -2*LL + (k_f + k_r)*(log(N) + 1)
BICn is computed as -2*LL + log(N-k_f)*k_r
- BICn is based on Vonesh and Chinchilli (1997), Gurka (2006), see https://www.stata.com/manuals/restatic.pdf
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Prepared_FMA_results.csv

Pre-prepared FMA estimates for 48,600 FMA specifications. Fixed effects are always considered, while random effects and slopes are varied.
- out_type: standardized (outcome standardized per corpus; results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- outcome: h, K, or L (h_trained, K_ext, or length)
- type: word, bpe, or char (word_level, bpe_level, char_level)
- N_models: number of candidate models R
- Mtype_reffects: random effects/intercepts considered
- Mtype_rslopes: random slopes considered (only if the corresponding random effect is also considered for the respective model).
- Mtype_IC: AIC, BIC, CAIC, BICn
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- sigma_weight_*feffect*: relative variable importance for the respective fixed effect (eq. 16 in the paper)
- sigma_weight_Re_*reffect*: relative variable importance for the respective random effect (if the respective reffect is not considered in Mtype_reffects, the corresponding cell is empty)
- sigma_weight_Rs_*rslope*: relative variable importance for the respective random slope (if the respective rslope is not considered in Mtype_rslopes, the corresponding cell is empty)
- beta_fma_*feffect* and se_fma_*feffect*: FMA estimates for each fixed effect plus standard errors (eqs. 14 and 15 in the paper)
- z_*feffect*: z-values for the respective fixed effect (see Sect. 2.6.2 in the paper)
- c_lb_*feffect* and c_ub_*feffect*: lower and upper bounds of the 95% confidence interval for the respective fixed effect (see Sect. 2.6.2 in the paper)
- p_*feffect*: p-values for the respective fixed effect (see Sect. 2.6.2 in the paper)

Considered fixed effects: log_pop EGIDSg1 (original value 0) EGIDSg2 (original value 1) EGIDSg3 (original value 2) EGIDSg4 (original value 3) log_N_countries parallel
Considered random effects: script corpus macroarea macrofamily subfamily ISO
Considered random slopes for log_pop: script corpus macroarea macrofamily subfamily
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
FactorsPCA_precomputed.csv

Pre-prepared PCA data for the PC factor analysis, used to produce the following figures in the paper:
(i) Fig. 5 or Appendix Fig. 1, by excluding LM = 'FULL'
(ii) Fig. 6, by restricting to LM = 'FULL' and control_version = 'both'

- level: symbol level (word_level, char_level, bpe_level)
- corpusID: corpus identifier
- outcome: h, K, or L (h_trained, K_ext, or length)
- LM: one of the seven LMs plus 'best', 'PPMtrunc' and 'FULL'
- LM_PCA_version: 'Not applicable' or one of the seven LMs (for LM = 'FULL')
- control_version: none, phylo, geo, or both
- parallel: parallel corpus, 1 if parallel, 0 if not
- Factor_N = Nth factor
- Factor_Percentage = percentage of variance explained by the respective factor
- Factor_N and Factor_Percentage are given per combination of LM and control_version. Info is given for the first 50 factors.
- Remaining columns contain the factor loadings for the respective factor, scaled to the interval -1 to 1
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
entropy_rates.csv

- corpus: corpus name
- corpus_type: parallel, comparable, word_list (for H_cr)
- docname: document name
- LM: one of the seven LMs plus 'best', 'PPMtrunc' (see A.4.1 Testing for a potential systematic length bias) and 'UNIGRAM' (for H_cr)
- type: symbol level (word_level, char_level, bpe_level)
- length: length of the test data (in words for word_level; in characters for char_level and bpe_level)
- r_c: compressed length of the full document / length in symbols of the full document
- r_c50: compressed length of the first 50% of the document / length in symbols of the first 50% of the document
- K_ext: compressed length of the full document - compressed length of the first 50% of the document
- h_trained: K_ext / length
- script: writing system
- log_docs: number of crawled documents (logged) from which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- log_length: text length (logged) from which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- trunc_yes: binary indicator of whether the corresponding word list in the Crúbadán data is truncated (1) or not (0) (only applicable for H_cr)
- log_pop: speaker population size (logged)
- parallel: parallel corpus, 1 if parallel, 0 if not
- N_corpora: number of corpora with at least one available document per language
- ISO 639-3: ISO 639-3 language code (see Sect. 2.2)
- macroarea: macroarea (see Sect. 2.2)
- macrofamily: macrofamily (see Sect. 2.2)
- subfamily: subfamily (see Sect. 2.2)
- log_N_countries: (logged) number of countries in which each language is spoken (see Sect. 2.2)
- error: binary indicator to filter out the 6 documents for which NNCP produced an error (1) or not (0) (see Appendix A.1)
- longitude: longitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- latitude: latitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- EGIDS_i: Expanded Graded Intergenerational Disruption Scale value (see Sect. 2.2); mapping of numeric values to original EGIDS values: 1 == orig '0', 2 == orig '1', 3 == orig '10', 4 == orig '2', 5 == orig '3', 6 == orig '4', 7 == orig '5', 8 == orig '6a', 9 == orig '6b', 10 == orig '7', 11 == orig '8a', 12 == orig '8b', 13 == orig '9'
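
The relations between r_c, r_c50, K_ext, and h_trained in entropy_rates.csv, and the EGIDS_i recoding, can be written out as a short sketch. The function name derived_rates and its parameter names are hypothetical, the unit of the compressed lengths is not specified here, and the input numbers are made up for illustration. The length of the test data is passed separately because the description above defines h_trained via 'length' rather than via the full-document length.

```python
# Numeric EGIDS_i codes mapped to the original EGIDS labels (from the list above).
EGIDS_ORIGINAL = {
    1: "0", 2: "1", 3: "10", 4: "2", 5: "3", 6: "4", 7: "5",
    8: "6a", 9: "6b", 10: "7", 11: "8a", 12: "8b", 13: "9",
}


def derived_rates(compressed_full, compressed_first_half,
                  n_full, n_first_half, test_length):
    """Derive r_c, r_c50, K_ext and h_trained as described in the column list.

    compressed_full       : compressed length of the full document
    compressed_first_half : compressed length of the first 50% of the document
    n_full, n_first_half  : lengths in symbols of the full document / first 50%
    test_length           : the 'length' column (length of the test data)
    """
    k_ext = compressed_full - compressed_first_half
    return {
        "r_c": compressed_full / n_full,
        "r_c50": compressed_first_half / n_first_half,
        "K_ext": k_ext,
        "h_trained": k_ext / test_length,
    }


# Illustrative values only (not taken from the dataset).
rates = derived_rates(1000, 600, 400, 200, 200)
label = EGIDS_ORIGINAL[8]  # original EGIDS label for numeric code 8
```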
