------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------

Note that all CSV files are gzipped.

Associations_computed.csv

Used for Sect. "3.1.1 Comparing entropy/length distributions across LMs and corpora".

This dataset contains the following columns/variables:
- corpus1 and corpus2
- type1 and type2
- var1 and var2
- LM1 and LM2
- rho_none and N_none
- rho_phylo and N_phylo
- rho_geo and N_geo
- rho_both and N_both

Each row gives the pairwise correlations (N_xxx indicates the number of cases) for the combinations of corpus1 and corpus2 as well as LM1 and LM2, including both unfiltered (rho_none) and filtered (rho_phylo, rho_geo, and rho_both) correlations.

Below, we describe the details for the first group of variables (LM1, var1, etc.), which apply analogously to the second group (LM2, var2, etc.).
- type1: symbol level (word_level, char_level, bpe_level)
- var1: h_trained, length, K_ext
- LM1 can take the following values:
  - One of the seven LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig)
  - best (i.e., the "best" h_trained or K_ext per document)
  - PPMtrunc (only relevant for section A.3.1 Testing for a potential systematic length bias)
  - LENGTH
  - UNIGRAM (H_cr for the Crúbadán Info)

Please note the following rules:
- If LM1 = 'LENGTH', then var1 must be 'length'.
- If LM1 = 'best', then either LM2 is 'best' or var2 is 'length', and vice versa.
- If type1 = 'bpe_level', then LM1 can't be 'LENGTH' and var1 can't be 'length'.
- If LM1 = 'PPMtrunc', then type1 can't be 'bpe_level'.
- If LM1 = 'UNIGRAM', then type1 can only be 'word_level' and var1 can only be 'h_trained'.
- If var1 = 'K_ext', then var2 also has to be 'K_ext'.

For var1 = 'length' or 'h_trained', calculations are performed across symbols, corpora, and all LMs, except 'best' and 'PPMtrunc'.
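The rules listed above for Associations_computed.csv can be turned into a small consistency check. The following is only a sketch in Python, not part of the original materials: the function names `check_side` and `check_row` are hypothetical, each row is assumed to be available as a dict keyed by the column names above, and the "analogously" and "vice versa" wording is read as making every rule symmetric between the two groups.

```python
def check_side(lm, typ, var):
    """Return the per-group rules violated by one (LM, type, var) triple."""
    problems = []
    if lm == "LENGTH" and var != "length":
        problems.append("LM = 'LENGTH' requires var = 'length'")
    if typ == "bpe_level" and (lm == "LENGTH" or var == "length"):
        problems.append("bpe_level excludes LM = 'LENGTH' and var = 'length'")
    if lm == "PPMtrunc" and typ == "bpe_level":
        problems.append("LM = 'PPMtrunc' excludes bpe_level")
    if lm == "UNIGRAM" and (typ != "word_level" or var != "h_trained"):
        problems.append("LM = 'UNIGRAM' requires word_level and h_trained")
    return problems


def check_row(row):
    """Return all rule violations for one row (an empty list means consistent)."""
    problems = []
    for i in ("1", "2"):
        side = check_side(row["LM" + i], row["type" + i], row["var" + i])
        problems += [f"group {i}: {p}" for p in side]
    # Assumption: K_ext is only ever paired with K_ext, in both directions.
    if (row["var1"] == "K_ext") != (row["var2"] == "K_ext"):
        problems.append("K_ext must be paired with K_ext")
    # 'best' must be paired with 'best' or with var = 'length' on the other side.
    for a, b in (("1", "2"), ("2", "1")):
        if row["LM" + a] == "best" and not (
            row["LM" + b] == "best" or row["var" + b] == "length"
        ):
            problems.append(f"LM{a} = 'best' needs LM{b} = 'best' or var{b} = 'length'")
    return problems
```

Rows could be fed in from `csv.DictReader` over a file opened with `gzip.open(..., "rt")`; a non-empty result would indicate either a misread rule or an unexpected row.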
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------

Full_multilevel_effects_model_results.csv

Used for Sect. "3.1.2 Multi-model multilevel inference".

Contains estimates for all 3,520,000 MLEM models, which form the basis for the FMA analyses, Sect.
- ID = unique model identifier
- N = Number of cases
- LL = Log-likelihood (maximised)
- k_r = Number of random effects
- k_f = Number of fixed effects
- type: word_level, char_level, bpe_level
- outcome: h, K, or L (h_trained, K_ext, or length)
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- out_type: standardized (outcome standardized per corpus; results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- converged (0/1): whether the model converged within the 20 gradient-based iterations (1) or not (0).
- feffects: considered fixed effects for the respective model
- reffects: considered random effects/intercepts
- rslopes: considered random slopes (only if the corresponding random effect is also considered for the respective model).

For each fixed effect, there are two designated columns, beta_*feffect* and se_*feffect* (e.g., beta_log_pop and se_log_pop), that list the relevant estimates for the fixed effect in question. If *feffect* is not included in the corresponding model, the cells are empty.
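The columns LL, N, k_f, and k_r are sufficient to recompute the four information criteria (AIC, BIC, CAIC, BICn) as they are defined for this file. A minimal Python sketch follows; the function name `information_criteria` is hypothetical, and the natural logarithm is assumed for log(N).

```python
import math


def information_criteria(LL, N, k_f, k_r):
    """Recompute AIC, BIC, CAIC and BICn from one model row.

    LL  : maximised log-likelihood
    N   : number of cases
    k_f : number of fixed effects
    k_r : number of random effects
    """
    k = k_f + k_r  # total number of parameters counted by AIC, BIC and CAIC
    return {
        "AIC": -2 * LL + 2 * k,
        "BIC": -2 * LL + math.log(N) * k,
        "CAIC": -2 * LL + k * (math.log(N) + 1),
        # BICn as stated for this file: only k_r is penalised, with log(N - k_f).
        "BICn": -2 * LL + math.log(N - k_f) * k_r,
    }
```

For example, `information_criteria(LL=-100.0, N=50, k_f=3, k_r=2)["AIC"]` evaluates to 210.0, since -2 * (-100) + 2 * 5 = 210.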
Considered fixed effects:
- log_pop
- EGIDSg1
- EGIDSg2
- EGIDSg3
- EGIDSg4
- log_N_countries
- parallel

Considered random effects:
- script
- corpus
- macroarea
- macrofamily
- subfamily
- ISO

Considered random slopes for log_pop:
- script
- corpus
- macroarea
- macrofamily
- subfamily

Information criteria:
- AIC is computed as -2*LL + 2*(k_f + k_r)
- BIC is computed as -2*LL + log(N)*(k_f + k_r)
- CAIC is computed as -2*LL + (k_f + k_r)*(log(N) + 1)
- BICn is computed as -2*LL + log(N-k_f)*k_r
- BICn is based on Vonesh and Chinchilli (1997) and Gurka (2006); see https://www.stata.com/manuals/restatic.pdf

------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------

Prepared_FMA_results.csv

Pre-prepared FMA estimates for 48,600 FMA specifications. Fixed effects are always considered, while random effects and slopes are varied.
- out_type: standardized (outcome standardized per corpus; results in the paper are based on this) vs. log_transformed (outcome log-transformed)
- outcome: h, K, or L (h_trained, K_ext, or length)
- type: word, char, or bpe (word_level, char_level, or bpe_level)
- N_models: number of candidate models R
- Mtype_reffects: considered random effects/intercepts
- Mtype_rslopes: considered random slopes (only if the corresponding random effect is also considered for the respective model).
- Mtype_IC: AIC, BIC, CAIC, BICn
- LM: one of the seven LMs, or 'best'; importantly, the outcome 'L' (length) only exists for 'best', since text length does not depend on the LM.
- sigma_weight_*feffect*: relative variable importance for the respective fixed effect (eq. 16 in the paper)
- sigma_weight_Re_*reffect*: relative variable importance for the respective random effect (if the respective reffect is not considered in Mtype_reffects, the corresponding cell is empty)
- sigma_weight_Rs_*rslope*: relative variable importance for the respective random slope (if the respective rslope is not considered in Mtype_rslopes, the corresponding cell is empty)
- beta_fma_*feffect* and se_fma_*feffect*: FMA estimates for each fixed effect plus standard errors (eqs. 14 and 15 in the paper)
- z_*feffect*: z-values for the respective fixed effect (see Sect. 2.6.2 in the paper)
- c_lb_*feffect* and c_ub_*feffect*: lower and upper bounds of the 95% confidence interval for the respective fixed effect (see Sect. 2.6.2 in the paper)
- p_*feffect*: p-values for the respective fixed effect (see Sect. 2.6.2 in the paper)

Considered fixed effects:
- log_pop
- EGIDSg1 (original value 0)
- EGIDSg2 (original value 1)
- EGIDSg3 (original value 2)
- EGIDSg4 (original value 3)
- log_N_countries
- parallel

Considered random effects:
- script
- corpus
- macroarea
- macrofamily
- subfamily
- ISO

Considered random slopes for log_pop:
- script
- corpus
- macroarea
- macrofamily
- subfamily

------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------

FactorsPCA_precomputed.csv

Pre-prepared PCA data for the PC factor analysis. It can be used to produce the following figures in the paper: (i) Fig. 5 or Appendix Fig. 1, by excluding LM = 'FULL'; (ii) Fig. 6, by restricting to LM = 'FULL' and control_version = 'both'.
- level: symbol level (word_level, char_level, bpe_level)
- corpusID: corpus identifier
- outcome: h, K, or L (h_trained, K_ext, or length)
- LM: one of the seven LMs plus 'best', 'PPMtrunc' and 'FULL'
- LM_PCA_version: 'Not applicable' or one of the seven LMs (for LM = 'FULL')
- control_version: none, phylo, geo, or both
- parallel: parallel corpus, 1 if parallel, 0 if not
- Factor_N = Nth factor
- Factor_Percentage = percentage of variance explained by the respective factor
- Factor_N and Factor_Percentage are given per combination of LM and control_version. Info is given for the first 50 factors.
- Remaining columns contain the factor loadings for the respective factor, scaled to the interval -1 to 1

------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------

entropy_rates.csv

- corpus: corpus name
- corpus_type: parallel, comparable, word_list (used for H_cr)
- docname: document name
- LM: one of the seven LMs plus 'best', 'PPMtrunc' (see A.4.1 Testing for a potential systematic length bias) and 'UNIGRAM' (used for H_cr)
- type: symbol level (word_level, char_level, bpe_level)
- length: length of the test data, in words for word_level and in characters for char_level and bpe_level
- r_c: compressed length of the full document / length in symbols of the full document
- r_c50: compressed length of the first 50% of the document / length in symbols of the first 50% of the document
- K_ext: compressed length of the full document - compressed length of the first 50% of the document
- h_trained: K_ext / length
- script: writing system
- log_docs: number of crawled documents (logged) from which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- log_length: text length (logged) based on which the corresponding word list in the Crúbadán data was generated (only applicable for H_cr)
- trunc_yes: binary indicator of whether the corresponding word list in the Crúbadán data is truncated (1) or not (0) (only applicable for H_cr)
- log_pop: speaker population size (logged)
- parallel: parallel corpus, 1 if parallel, 0 if not
- N_corpora: number of corpora with at least one available document per language
- ISO 639-3: ISO 639-3 language code (see Sect. 2.2)
- macroarea: macroarea (see Sect. 2.2)
- macrofamily: macrofamily (see Sect. 2.2)
- subfamily: subfamily (see Sect. 2.2)
- log_N_countries: (logged) number of countries in which each language is spoken (see Sect. 2.2)
- error: binary indicator to filter out the 6 documents for which NNCP produced an error (1) or not (0) (see Appendix A.1)
- longitude: longitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- latitude: latitude of the centroid of the country in which the language is spoken (see Sect. 2.2)
- EGIDS_i: Expanded Graded Intergenerational Disruption Scale value (see Sect. 2.2); mapping of numeric values to original EGIDS values: 1 == orig '0', 2 == orig '1', 3 == orig '10', 4 == orig '2', 5 == orig '3', 6 == orig '4', 7 == orig '5', 8 == orig '6a', 9 == orig '6b', 10 == orig '7', 11 == orig '8a', 12 == orig '8b', 13 == orig '9'
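The derived columns of entropy_rates.csv (r_c, r_c50, K_ext, h_trained) and the EGIDS_i coding can be illustrated with a short Python sketch. The function name `derive_rates` and its argument names are hypothetical, and the unit of the compressed lengths is not stated in this description, so it is left open here.

```python
# Numeric EGIDS_i codes mapped back to the original EGIDS labels,
# exactly as listed in the column description.
EGIDS_ORIG = {
    1: "0", 2: "1", 3: "10", 4: "2", 5: "3", 6: "4", 7: "5",
    8: "6a", 9: "6b", 10: "7", 11: "8a", 12: "8b", 13: "9",
}


def derive_rates(compressed_full, compressed_half,
                 n_symbols_full, n_symbols_half, test_length):
    """Compute r_c, r_c50, K_ext and h_trained from their definitions.

    compressed_full / compressed_half: compressed length of the full document
        and of its first 50% (hypothetical argument names).
    n_symbols_full / n_symbols_half: length in symbols of the same two spans.
    test_length: the 'length' column (length of the test data).
    """
    k_ext = compressed_full - compressed_half
    return {
        "r_c": compressed_full / n_symbols_full,
        "r_c50": compressed_half / n_symbols_half,
        "K_ext": k_ext,
        "h_trained": k_ext / test_length,
    }
```

Note that the numeric codes follow the alphabetical order of the original labels, so code 3 corresponds to the original value '10' rather than '2'; `EGIDS_ORIG[3]` returns `'10'`.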