**Project description**
===================

The project aims to build, from the raw data of the 3rd version of the Italian and French Google N-grams (GNs), a set of open-access electronic dictionaries of word forms that facilitate the use of GNs for diachronic research in lexical morphology.

Source data was gathered from [Google n-grams (2020)][1].

**Structure of the data**
=======

Data filtering: UNI, UNIDET and BI datasets
-----------------------------------------------

Data extracted from the original unigrams, bigrams and trigrams were organized into three types of datasets: UNI, UNIDET and BI. All datasets contain only alphabetical strings from the original data.

**UNI** - the UNI dataset was extracted from unigrams. It contains either full alphabetical unigrams (such as It. "esempio") or parts of them (such as It. "esempio" from the original unigram "sull'esempio", which represents a fusion of PREP/ART+NOUN). In the file with unique types, each bare word in its lowercase form is represented just once.

**UNIDET** - the UNIDET dataset was extracted mainly from bigrams (but also from unigrams) in order to allow disambiguation of nouns from verbs and/or adjectives. Each word form in UNIDET (originally located in the 2nd position in source bigrams or placed after an apostrophe in unigrams) was preceded by a determiner (or a preposition), originally located in the 1st position in bigrams (or before the apostrophe in unigrams). The list of determiners used for filtering is provided separately for each language.

**BI** - the BI dataset is intended for the analysis of compounds written either as two separate words (such as Fr. "mot clé") or with hyphenated components (such as Fr. "mot-clé"). These were extracted from the original bigrams and from trigrams with the hyphen in the middle position, respectively. A stoplist of the most frequent synsemantic (function) words was applied in the filtering process.

The data has not been lemmatized.
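As an illustration of the UNI filtering described above, here is a minimal Python sketch of how a raw unigram could be reduced to a bare lowercase word form and how frequencies could be summed per form. It is not the project's actual pipeline: the function names `bare_form` and `unique_types`, the rule of keeping only the part after the last straight apostrophe, and the sample counts are all assumptions made for this example.

```python
import re
from collections import Counter

# POS suffixes that Google Books Ngrams appends to tagged tokens (e.g. "esempio_NOUN").
POS_SUFFIX = re.compile(r"_(NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X)$")


def bare_form(token):
    """Return the lowercase bare word form of a raw unigram, or None if not alphabetical.

    Hypothetical helper: the real filtering rules may differ.
    """
    word = POS_SUFFIX.sub("", token)
    # Assumption: keep only the part after the last apostrophe ("sull'esempio" -> "esempio").
    word = word.rsplit("'", 1)[-1]
    return word.lower() if word.isalpha() else None


def unique_types(raw_counts):
    """Sum the frequencies of all raw representations that share one bare form."""
    totals = Counter()
    for token, freq in raw_counts:
        form = bare_form(token)
        if form is not None:
            totals[form] += freq
    return totals


# Invented counts, for illustration only.
raw = [
    ("esempio", 100),
    ("Esempio_NOUN", 20),
    ("sull'esempio", 5),
    ("dall'esempio_ADP", 3),
    ("1984", 7),  # dropped: not an alphabetical string
]
totals = unique_types(raw)
print(totals["esempio"])
```

In this sketch all four spellings of "esempio" collapse into a single entry, while the numeric token is discarded.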
**Unique types and diachronic data**
------------------------------------

For each filtering procedure (UNI, UNIDET, BI), two types of datasets are available: lists of unique types and lists with diachronic data.

**Unique types** - in these lists, each word form (UNI, UNIDET) or pair of word forms (BI), converted to lowercase, is represented just once. Each entry gives the overall absolute frequency, i.e. the sum of the frequencies of all underlying representations (e.g. the Italian word form "esempio" from it2020_uni had 255 different underlying representations in the original unigram data, including forms such as *esempio, esempio_NOUN, Esempio, Esempio_NOUN, sull'esempio, all'esempio, dall'esempio, nell'esempio, sull'esempio_ADV, dell'esempio, coll'esempio, nell'esempio_ADV, all'esempio_ADV, dall'esempio_ADP, bell'esempio, dall'esempio_ADV*, etc.). In addition, the **UNI** dataset of unique types contains frequency information about the original spelling (uppercase, lowercase) and the frequency and rate of the different POS tags attributed by Google.

**Diachronic datasets** - for each word form (UNI, UNIDET) or pair of word forms (BI), both the absolute and the relative frequencies are provided for a given time span, typically for specific years or decades.

**Diachronic tendencies** - for selected lists of unique types, diachronic tendencies based on Mann-Kendall statistics will be provided.

References
----------

Further information about data filtering and the content of these datasets will be available in:

Jan Radimský, « Towards a diachronic analysis of Romance morphology through Google n-grams », *Corpus* [Online], 23 | 2022. URL: http://journals.openedition.org/corpus/6640

[1]: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html