Italian Google n-grams

**Contents of the datasets**
========================

*For general description see [main Wiki pages][1]*

Unique types
========================

it2020_uni
----------

**Word form information**

*word* - bare word form in lowercase, unique in this file (may be used as the primary key in a database)

**Frequency information:**

*fq_c_sum* - the overall frequency (including all uppercase/lowercase variants and POS tags)

**Spelling**

*fq_c_lc ; fq_c_uc ; fq_c_lc_rate* - absolute frequencies of the form in lowercase and uppercase spelling, and the rate of the lowercase spelling variant. The uppercase/lowercase spelling was determined from the first letter of the word form only, in order to identify potential proper nouns (e.g. *Roma* and *ROMA* are counted as uppercase, while *roma* and *rOMA* are counted as lowercase).

**POS tags (by Google n-grams)**

*fq_c_tag_noun;fq_c_tag_noun_rate;fq_c_tag_verb;fq_c_tag_verb_rate;fq_c_tag_notag;fq_c_tag_notag_rate;fq_c_tag_adj;fq_c_tag_adj_rate;fq_c_tag_adv;fq_c_tag_adv_rate;fq_c_tag_x;fq_c_tag_x_rate* - absolute frequencies (and rates) of the form with the original Google n-gram POS tags. Note that, according to our analysis, only about 50% of the original data was POS tagged.
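The first-letter rule behind the spelling columns can be illustrated with a minimal Python sketch. It is not the script used to build the dataset: the function names and the sample counts are hypothetical, and the denominator of *fq_c_lc_rate* (lowercase plus uppercase frequency) is an assumption, since the description only calls it "the rate of lowercase spelling variant".

```python
from typing import Dict


def is_uppercase_variant(form: str) -> bool:
    """Return True if the first character of the form is uppercase."""
    # Only the first character is inspected, so "Roma" and "ROMA" are
    # uppercase variants, while "roma" and "rOMA" are lowercase variants.
    return form[:1].isupper()


def spelling_columns(variant_counts: Dict[str, int]) -> Dict[str, float]:
    """Aggregate per-variant counts into fq_c_lc, fq_c_uc and fq_c_lc_rate.

    Assumption: fq_c_lc_rate = fq_c_lc / (fq_c_lc + fq_c_uc).
    """
    fq_c_uc = sum(n for form, n in variant_counts.items()
                  if is_uppercase_variant(form))
    fq_c_lc = sum(n for form, n in variant_counts.items()
                  if not is_uppercase_variant(form))
    total = fq_c_lc + fq_c_uc
    rate = fq_c_lc / total if total else 0.0
    return {"fq_c_lc": fq_c_lc, "fq_c_uc": fq_c_uc, "fq_c_lc_rate": rate}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(spelling_columns({"Roma": 60, "ROMA": 20, "roma": 15, "rOMA": 5}))
```

Forms that start with a digit or punctuation mark fall into the lowercase group in this sketch; how the dataset treats them is not stated in the description.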
**POS tags (according to MorphIt)**

*abl;adj;adv;int;noun;npr;pon;ss;verb* - boolean flags indicating whether the form is listed with the given tag in the MorphIt dictionary

**Rounded frequency counts and percentiles**

*fq_c_lc_rate_rounded;fq_c_lc_rate_rounded_10;fq_c_tag_notag_rate_rounded;fq_c_tag_notag_rate_percentile;fq_c_tag_x_rate_rounded*

it2020_unidet
-------------

*wordlc* - word form in lowercase, unique in this file (may be used as the primary key in a database)

*fq* - the overall frequency (usually lower than *fq_c_sum* in it2020_uni, because only contexts in which the form was preceded by a determiner or a preposition were counted)

it2020_bi
---------

**Word form information**

*w1_wordlc;w2_wordlc* - word forms in lowercase; the combination of these two fields is unique in this file (may be used as a combined primary key in a database)

**Frequency and spelling**

*fq* - the overall frequency of the w1 & w2 combination (including both the hyphenated and the loose spelling variant)

*fq_hyphen;fq_hyphen_rate* - frequency of the hyphenated variant and its rate (=fq_hyphen/fq)

**POS**

*w1_noun;w2_noun* - boolean flags indicating whether the form could be a noun according to the MorphIt dictionary

**Statistics**

*avg_mi_1800_1900;avg_mi_1950_2020* - average MI-score in the given time span

Diachronic data
========================

**it2020_uni_dia** - word; year; fq; volumes (*word* matches *word* from it2020_uni)

**it2020_uni_dia_grouped** - the same data as *it2020_uni_dia* in a different format; also includes the *fq_rel* value (relative frequency in i.p.100m, i.e. items per 100 million)

**it2020_uni_dia10** - word; decade; fq; vol; fq_rel (*decade* = the first 3 digits of the decade, e.g. 147 = 1470-1479; *fq_rel* in i.p.100m)

**it2020_uni_dia_grouped** - line format: word TAB year(1);fq;vol;fqrel TAB [...] year(n);fq;vol;fqrel EOL

Other data
========================

**it_determiners.csv** - Contains the list of determiners and/or prepositions used to filter the words in the UNIDET datasets, from either bigrams or unigrams. That is, each word in a UNIDET dataset was preceded by a determiner or preposition from this list in the original data, so the overwhelming majority of words in UNIDET should be nouns.

**it_synsemantics_stoplist.csv** - Contains word forms that were filtered out of the original bigram data when assembling the BI datasets.

Custom data
========================

The folder contains specific data extracted on demand for individual researchers.

**it2020_valentina_pro_1900_1909** - List of words beginning with "pro", with their frequencies for the decade 1900-1909

[1]: https://osf.io/46qcd/wiki/home/
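The grouped diachronic line format (word TAB year;fq;vol;fqrel TAB ... EOL) and the three-digit decade key described under Diachronic data can be read along the following lines. This is a sketch under stated assumptions, not a loader shipped with the dataset: the function names are hypothetical, *fq* and *vol* are assumed to be integers, *fqrel* is assumed to be a decimal number, and the sample line in the usage part is invented.

```python
from typing import Dict, List, Tuple, Union

Record = Dict[str, Union[int, float]]


def parse_grouped_line(line: str) -> Tuple[str, List[Record]]:
    """Split one grouped line into the word and its per-year records."""
    # The first TAB-separated field is the word; every further field is
    # one "year;fq;vol;fqrel" group.
    word, *groups = line.rstrip("\r\n").split("\t")
    records: List[Record] = []
    for group in groups:
        if not group:
            continue  # tolerate a trailing TAB before the end of line
        year, fq, vol, fq_rel = group.split(";")
        records.append({
            "year": int(year),
            "fq": int(fq),            # assumption: integer count
            "vol": int(vol),          # assumption: integer count
            "fq_rel": float(fq_rel),  # assumption: decimal, in i.p.100m
        })
    return word, records


def decade_key_to_range(decade: int) -> Tuple[int, int]:
    """Expand a decade key such as 147 into its first and last year."""
    start = decade * 10
    return start, start + 9


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "roma\t1900;120;35;410.5\t1901;98;30;388.0\n"
    print(parse_grouped_line(sample))
    print(decade_key_to_range(147))
```

Keeping each year as a small dictionary makes it easy to load the records into a table later; a real loader would also need to decide how to handle malformed groups, which this sketch simply lets fail with a `ValueError`.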
