**Contents of the datasets**
========================

*For general description see [main Wiki pages][1]*

Unique types
========================

it2020_uni
----------

**Word form information**

*word* - bare word form in lowercase, unique in this file (may be used as the primary key in a database)

**Frequency information:**

*fq_c_sum* - the overall frequency (including all uppercase/lowercase variants and POS tags)

**Spelling**

*fq_c_lc ; fq_c_uc ; fq_c_lc_rate* - absolute frequencies of the form with lowercase and uppercase spelling, and the rate of the lowercase spelling variant. The uppercase/lowercase spelling was determined using the first letter of the word form only, in order to identify potential proper nouns (i.e. *Roma* and *ROMA* are counted as uppercase, while *roma* and *rOMA* are counted as lowercase).

**POS tags (by Google n-grams)**

*fq_c_tag_noun;fq_c_tag_noun_rate;fq_c_tag_verb;fq_c_tag_verb_rate;fq_c_tag_notag;fq_c_tag_notag_rate;fq_c_tag_adj;fq_c_tag_adj_rate;fq_c_tag_adv;fq_c_tag_adv_rate;fq_c_tag_x;fq_c_tag_x_rate* - absolute frequencies (and rates) of the form with the original Google n-gram POS tags. Note that, according to our analysis, only about 50% of the original data was POS tagged.

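The first-letter spelling rule described under **Spelling** can be illustrated with a minimal Python sketch. The function names and the example counts are invented for illustration, and the sketch assumes that *fq_c_lc_rate* is computed as *fq_c_lc* divided by *fq_c_sum* (by analogy with *fq_hyphen_rate* in it2020_bi); the wiki does not state the formula explicitly.

```python
from collections import Counter


def first_letter_case(form: str) -> str:
    """Classify a word form as 'uc' or 'lc' using its first letter only."""
    return "uc" if form[:1].isupper() else "lc"


def case_counts(variants: dict[str, int]) -> dict[str, float]:
    """Aggregate frequencies of spelling variants into lowercase/uppercase totals.

    `variants` maps each raw spelling to its frequency (hypothetical input;
    the published files already contain the aggregated values).
    """
    counts = Counter()
    for form, fq in variants.items():
        counts[first_letter_case(form)] += fq
    total = counts["lc"] + counts["uc"]
    return {
        "fq_c_lc": counts["lc"],
        "fq_c_uc": counts["uc"],
        "fq_c_sum": total,
        # Assumption: rate = lowercase frequency / overall frequency.
        "fq_c_lc_rate": counts["lc"] / total if total else 0.0,
    }


# Invented counts: Roma and ROMA count as uppercase, roma and rOMA as lowercase.
example = case_counts({"Roma": 6, "ROMA": 2, "roma": 1, "rOMA": 1})
```

With these invented counts, the two uppercase variants contribute 8 occurrences and the two lowercase variants 2, so the lowercase rate under the stated assumption is 0.2.
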
**POS tags (according to MorphIt)**

*abl;adj;adv;int;noun;npr;pon;ss;verb* - boolean flags indicating whether the form is listed with the given tag in the MorphIt dictionary

**Rounded frequency counts and percentiles**

*fq_c_lc_rate_rounded;fq_c_lc_rate_rounded_10;fq_c_tag_notag_rate_rounded;fq_c_tag_notag_rate_percentile;fq_c_tag_x_rate_rounded*

it2020_unidet
-------------

*wordlc* - word form in lowercase, unique in this file (may be used as the primary key in a database)

*fq* - the overall frequency (usually lower than *fq_c_sum* in it2020_uni, because only frequencies in contexts where the form was preceded by a determiner or a preposition were taken into account)

it2020_bi
---------

**Word form information**

*w1_wordlc;w2_wordlc* - word forms in lowercase; the combination of these two fields is unique in this file (may be used as a combined primary key in a database)

**Frequency and spelling**

*fq* - the overall frequency of the w1 & w2 combination (including both the hyphenated and the loose spelling variant)

*fq_hyphen;fq_hyphen_rate* - frequency of the hyphenated variant and its rate (=fq_hyphen/fq)

**POS**

*w1_noun;w2_noun* - boolean flags indicating whether the form could be a noun according to the MorphIt dictionary

**Statistics**

*avg_mi_1800_1900;avg_mi_1950_2020* - average MI-score in the given time span

Diachronic data
========================

**it2020_uni_dia** - word; year; fq; volumes (*word* matches *word* from it2020_uni)

**it2020_uni_dia_grouped** - the same as *it2020_uni_dia*, but in a different data format; also includes the *fq_rel* value (relative frequency in i.p.100m, i.e. items per 100 million)

**it2020_uni_dia10** - word; decade; fq; vol; fq_rel (*decade* = the first 3 digits of the decade, i.e. 147 = 1470-1479; *fq_rel* in i.p.100m)

**it2020_uni_dia_grouped** - word TAB year(1);fq;vol;fqrel TAB [...] year(n);fq;vol;fqrel EOL

Other data
========================

**it_determiners.csv** - Contains the list of determiners and/or prepositions that were used for filtering the words in the UNIDET datasets, either from bigrams or unigrams. That is, each word present in a UNIDET dataset was preceded by a determiner/preposition from this list in the original data, so the overwhelming majority of words in UNIDET should be nouns.

**it_synsemantics_stoplist.csv** - Contains word forms that were filtered out from the original bigram data when assembling the BI datasets.

Custom data
========================

The folder contains specific data extracted on demand for different researchers.

**it2020_valentina_pro_1900_1909** - List of words beginning with "pro" with their frequencies for the decade 1900-1909

[1]: https://osf.io/46qcd/wiki/home/
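
As an illustration of the diachronic formats described under "Diachronic data", the following Python sketch parses one line of the grouped layout (word TAB year;fq;vol;fqrel TAB ... EOL) and decodes the 3-digit *decade* value of it2020_uni_dia10. The names `YearRecord`, `parse_grouped_line` and `decade_span` are hypothetical, the sample line is invented, and the sketch assumes that year, fq and vol are integers and that fqrel is a decimal number written with a dot.

```python
from typing import NamedTuple


class YearRecord(NamedTuple):
    year: int
    fq: int
    vol: int
    fq_rel: float  # relative frequency in i.p.100m


def parse_grouped_line(line: str) -> tuple[str, list[YearRecord]]:
    """Split one grouped line into the word and its per-year records."""
    word, *groups = line.rstrip("\r\n").split("\t")
    records = []
    for group in groups:
        if not group:  # tolerate a trailing TAB
            continue
        year, fq, vol, fq_rel = group.split(";")
        # Assumption: numeric fields use plain integer/decimal notation.
        records.append(YearRecord(int(year), int(fq), int(vol), float(fq_rel)))
    return word, records


def decade_span(decade: int) -> tuple[int, int]:
    """Expand a 3-digit decade code to its first and last year."""
    start = decade * 10
    return start, start + 9


# Invented sample line, not taken from the dataset.
word, records = parse_grouped_line("casa\t1900;12;5;3.5\t1901;8;4;2.0\n")
span = decade_span(147)
```

Here `decade_span(147)` yields the years 1470 and 1479, matching the example given for *decade* above.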