------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Note that all csvs are gzipped.
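Because the files are gzip-compressed, they can be read without unpacking them first. The following is a minimal sketch using only the Python standard library. The helper name read_gzipped_csv, the comma delimiter and the UTF-8 encoding are assumptions, not something this README specifies.

```python
import csv
import gzip


def read_gzipped_csv(path, delimiter=","):
    """Return the rows of a gzipped CSV file as a list of dicts (hypothetical helper)."""
    # "rt" opens the compressed stream in text mode; newline="" is what the csv module expects.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

All values are returned as strings, so numeric columns (e.g. rho_none or N_none) need to be converted with float() or int() before use.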
Associations_computed.csv
Used for Sect. "3.1.1 Comparing entropy/length distributions across LMs and corpora"
This dataset contains the following columns/variables:
- corpus1 and corpus2
- type1 and type2
- var1 and var2
- LM1 and LM2
- rho_none and N_none
- rho_phylo and N_phylo
- rho_geo and N_geo
- rho_both and N_both
Each row reports the pairwise correlations (N_xxx gives the number of cases) for one combination of corpus1 and corpus2
as well as LM1 and LM2, including both the unfiltered (rho_none) and the filtered (rho_phylo, rho_geo, and rho_both) correlations.
Below, we describe the details for the first group of variables (LM1, var1, etc.),
which apply analogously to the second group (LM2, var2, etc.).
- type1: symbol level (word_level, char_level, bpe_level)
- var1: h_trained, length, K_ext
- LM1 can take the following values:
- One of the seven LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig)
- best (i.e., the "best" h_trained or K_ext per document)
- PPMtrunc (only relevant for section A.3.1 Testing for a potential systematic length bias)
- LENGTH
- UNIGRAM (H_cr for the Crúbadán data)
Please note the following rules:
- If LM1 = 'LENGTH', then var1 must be 'length'.
- If LM1 = 'best', then either LM2 is 'best' or var2 is 'length', and vice versa.
- If type1 = 'bpe_level', then LM1 can't be 'LENGTH' and var1 can't be 'length'.
- If LM1 = 'PPMtrunc', then type1 can't be 'bpe_level'
- If LM1 = 'UNIGRAM', then type1 can only be 'word_level' and var1 can only be 'h_trained'.
- If var1 = 'K_ext', then var2 also has to be 'K_ext'.
For var1 = 'length' or 'h_trained', calculations are performed across symbols, corpora, and all LMs, except 'best' and 'PPMtrunc'.
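The rules above can be sketched as a small consistency check, e.g. to verify a subset of rows after filtering. The function name rule_violations is hypothetical, and the check only covers the rules as written for the first group; for the "vice versa" cases, call it again with the two groups swapped.

```python
def rule_violations(type1, var1, lm1, lm2, var2):
    """Return the rules listed above that one row's combination breaks (hypothetical helper)."""
    problems = []
    if lm1 == "LENGTH" and var1 != "length":
        problems.append("LM1 = 'LENGTH' requires var1 = 'length'")
    if lm1 == "best" and not (lm2 == "best" or var2 == "length"):
        problems.append("LM1 = 'best' requires LM2 = 'best' or var2 = 'length'")
    if type1 == "bpe_level" and (lm1 == "LENGTH" or var1 == "length"):
        problems.append("type1 = 'bpe_level' excludes LM1 = 'LENGTH' and var1 = 'length'")
    if lm1 == "PPMtrunc" and type1 == "bpe_level":
        problems.append("LM1 = 'PPMtrunc' excludes type1 = 'bpe_level'")
    if lm1 == "UNIGRAM" and (type1 != "word_level" or var1 != "h_trained"):
        problems.append("LM1 = 'UNIGRAM' requires type1 = 'word_level' and var1 = 'h_trained'")
    if var1 == "K_ext" and var2 != "K_ext":
        problems.append("var1 = 'K_ext' requires var2 = 'K_ext'")
    return problems
```

An empty list means that the combination is consistent with the rules; otherwise each entry names one violated rule.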
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Full_multilevel_effects_model_results.csv
Used for Sect. "3.1.2 Multi-model multilevel inference"
Contains estimates for all 3,520,000 MLEM models, which form the basis for the FMA analyses, Sect.
- ID = unique model identifier
- N = Number of cases
- LL = Log-likelihood (maximised)
- k_r = Number of random effects
- k_f = Number of fixed effects
- type: word_level, char_level, bpe_level
- outcome: h, K, or L (h_trained, K_ext, or length)
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- out_type: standardized (outcome standardized per corpus, results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- converged (0/1): Whether the model converged within the 20 gradient-based iterations (1) or not (0).
- feffects: considered fixed effects for the respective model
- reffects: considered random effects/intercepts
- rslopes: considered random slopes (only if the corresponding considered random effect is also considered for the respective model).
For each fixed effect, there are 2 designated columns: beta_*feffect* and se_*feffect* (e.g., beta_log_pop and se_log_pop) that list the relevant estimates for the fixed effect in question.
If *feffect* is not included in the corresponding model, cells are empty.
Considered fixed effects: log_pop EGIDSg1 EGIDSg2 EGIDSg3 EGIDSg4 log_N_countries parallel
Considered random effects: script corpus macroarea macrofamily subfamily ISO
Considered random slopes for log_pop: script corpus macroarea macrofamily subfamily
AIC is computed as -2*LL + 2*(k_f + k_r)
BIC is computed as -2*LL + log(N)*(k_f + k_r)
CAIC is computed as -2*LL + (k_f + k_r)*(log(N) + 1)
BICn is computed as -2*LL + log(N - k_f)*k_r
- BICn is based on Vonesh and Chinchilli (1997), Gurka (2006), see https://www.stata.com/manuals/restatic.pdf
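The four information criteria can be recomputed from the columns LL, N, k_f and k_r. The sketch below simply transcribes the formulas above; the function name information_criteria is hypothetical, and log is taken to be the natural logarithm, which is an assumption.

```python
import math


def information_criteria(ll, n, k_f, k_r):
    """Compute AIC, BIC, CAIC and BICn from the formulas stated in this README."""
    k = k_f + k_r  # total number of fixed and random effects
    return {
        "AIC": -2 * ll + 2 * k,
        "BIC": -2 * ll + math.log(n) * k,
        "CAIC": -2 * ll + k * (math.log(n) + 1),
        "BICn": -2 * ll + math.log(n - k_f) * k_r,
    }
```

For example, information_criteria(-100.0, 50, 3, 2)["AIC"] evaluates to 210.0, since -2*(-100) + 2*(3 + 2) = 210.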
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
Prepared_FMA_results.csv
Pre-prepared FMA estimates for 48,600 FMA specifications
Fixed effects are always considered, while random effects and slopes are varied.
- out_type: standardized (outcome standardized per corpus, results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- outcome: h, K, or L: h_trained, K_ext or length
- type: word, char, or bpe (word_level, char_level, or bpe_level)
- N_models: number of candidate models R
- Mtype_reffects: considered random effects/intercepts
- Mtype_rslopes: considered random slopes (only if the corresponding considered random effect is also considered for the respective model).
- Mtype_IC: AIC, BIC, CAIC, BICn
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- sigma_weight_*feffect*: relative variable importance for the respective fixed effect (eq. 16 in the paper)
- sigma_weight_Re_*reffect*: relative variable importance for the respective random effect (if the respective reffect is not considered in Mtype_reffects, the corresponding cell is empty)
- sigma_weight_Rs_*rslope*: relative variable importance for the respective random slope (if the respective rslope is not considered in Mtype_rslopes, the corresponding cell is empty)
- beta_fma_*feffect* and se_fma_*feffect*: FMA estimates for each fixed effect plus standard errors (eqs. 14 and 15 in the paper)
- z_*feffect*: z-values for the respective fixed effect (see Sect. 2.6.2 in the paper)
- c_lb_*feffect* and c_ub_*feffect*: lower and upper bounds of the 95% confidence interval for the respective fixed effect (see Sect. 2.6.2 in the paper)
- p_*feffect*: p-values for the respective fixed effect (see Sect. 2.6.2 in the paper)
Considered fixed effects: log_pop EGIDSg1 (original value 0) EGIDSg2 (original value 1) EGIDSg3 (original value 2) EGIDSg4 (original value 3) log_N_countries parallel
Considered random effects: script corpus macroarea macrofamily subfamily ISO
Considered random slopes for log_pop: script corpus macroarea macrofamily subfamily
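To illustrate how the z_*, c_lb_*, c_ub_* and p_* columns relate to beta_fma_* and se_fma_*, here is a sketch that assumes a standard normal (Wald-type) approximation with a two-sided test. Whether this matches the procedure of Sect. 2.6.2 in every detail is an assumption; the paper remains the authoritative description. The function name wald_summary is hypothetical.

```python
from statistics import NormalDist


def wald_summary(beta, se, level=0.95):
    """z-value, confidence bounds and two-sided p-value under a normal approximation (assumed)."""
    standard_normal = NormalDist()
    z = beta / se
    # Critical value for a two-sided interval, approximately 1.96 for level = 0.95.
    crit = standard_normal.inv_cdf(0.5 + level / 2)
    p = 2 * (1 - standard_normal.cdf(abs(z)))
    return {"z": z, "c_lb": beta - crit * se, "c_ub": beta + crit * se, "p": p}
```

Comparing the output with the stored columns for a few rows is a quick way to check whether the assumption holds for a given specification.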
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
FactorsPCA_precomputed.csv
Pre-prepared PCA data for PC factor analysis
(to produce Paper figures:
(i) Fig. 5 or Appendix Fig. 1 by excluding LM = 'FULL'
(ii) Fig. 6 by restricting to LM = 'FULL' and control_version = 'both'
)
- level: symbol level (word_level, char_level, bpe_level)
- corpusID: corpus identifier
- outcome: h, K, or L: h_trained, K_ext or length
- LM: one of the seven LMs plus 'best', 'PPMtrunc' and 'FULL'
- LM_PCA_version: 'Not applicable' or one of the seven LMs (for LM = 'FULL')
- control_version: none, phylo, geo, or both
- parallel: parallel corpus, 1 if parallel, 0 if not
- Factor_N = Nth factor
- Factor_Percentage = percentage of variance explained by the respective factor
- Factor_N and Factor_Percentage are per combination of LM and control_version. Info is given for the first 50 factors
- Remaining columns contain the factor loadings for the respective factor scaled to the interval -1 to 1
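The two row selections mentioned at the top of this section can be sketched as follows. The rows are assumed to be dicts keyed by column name (e.g. as produced by csv.DictReader); the function names are hypothetical.

```python
def rows_for_fig5(rows):
    """Rows for Fig. 5 or Appendix Fig. 1: everything except LM = 'FULL'."""
    return [row for row in rows if row["LM"] != "FULL"]


def rows_for_fig6(rows):
    """Rows for Fig. 6: LM = 'FULL' combined with control_version = 'both'."""
    return [row for row in rows
            if row["LM"] == "FULL" and row["control_version"] == "both"]
```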
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
entropy_rates.csv
- corpus: corpus name
- corpus_type: parallel, comparable, word_list (used for H_cr)
- docname: document name
- LM: one of the seven LMs plus 'best', 'PPMtrunc' (see A.4.1 Testing for a potential systematic length bias) and 'UNIGRAM' (used for H_cr)
- type: symbol level (word_level, char_level, bpe_level)
- length: length of the test data, in words at the word level and in characters at the character and BPE levels
- r_c: compressed length of the full document / length in symbols of the full document
- r_c50: compressed length of the first 50% of the document / length in symbols of the first 50% of the document
- K_ext: compressed length of the full document - compressed length of the first 50% of the document
- h_trained: K_ext / length
- script: writing system
- log_docs: Number of crawled documents (logged) from which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- log_length: Text length (logged) from which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- trunc_yes: Binary indicator of whether the corresponding word list in the Crúbadán data is truncated (1) or not (0) (only applicable for H_cr)
- log_pop: speaker population size (logged)
- parallel: parallel corpus, 1 if parallel, 0 if not
- N_corpora: Number of corpora with at least one available document per language
- ISO 639-3: ISO 639-3 language code (see Sect. 2.2)
- macroarea: macroarea (see Sect. 2.2)
- macrofamily: macrofamily (see Sect. 2.2)
- subfamily: subfamily (see Sect. 2.2)
- log_N_countries: (logged) number of countries in which each language is spoken (see Sect. 2.2)
- error: binary indicator to filter out the 6 documents for which NNCP produced an error (1) or not (0) (see Appendix A.1)
- longitude: longitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- latitude: latitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- EGIDS_i: Expanded Graded Intergenerational Disruption Scale value (see Sect. 2.2); mapping of numeric values to original EGIDS values: 1 == orig '0', 2 == orig '1', 3 == orig '10', 4 == orig '2', 5 == orig '3', 6 == orig '4', 7 == orig '5', 8 == orig '6a', 9 == orig '6b', 10 == orig '7', 11 == orig '8a', 12 == orig '8b', 13 == orig '9'
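The EGIDS_i mapping above can be written as a lookup table. Note that the numeric codes follow the string sort order of the original labels (which is why '10' receives code 3), so EGIDS_i should be decoded before it is treated as an ordered scale. The names EGIDS_I_TO_ORIGINAL and original_egids are hypothetical.

```python
# Numeric EGIDS_i code -> original EGIDS label, as listed in this README.
EGIDS_I_TO_ORIGINAL = {
    1: "0", 2: "1", 3: "10", 4: "2", 5: "3", 6: "4", 7: "5",
    8: "6a", 9: "6b", 10: "7", 11: "8a", 12: "8b", 13: "9",
}


def original_egids(egids_i):
    """Translate an EGIDS_i code (int or integer-like string) to the original label."""
    return EGIDS_I_TO_ORIGINAL[int(egids_i)]
```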