<p><strong>vkvec.0.1.0.vec</strong></p>
<p>These are word representations for the Russian language, trained with <a href="https://github.com/facebookresearch/fastText" rel="nofollow">fastText</a> on a corpus from <a href="http://vk.com" rel="nofollow">VK</a> (a Russian social networking site). These embeddings may be preferable to embeddings trained on Wikipedia for the analysis of Russian social media texts, as they may better represent slang, common typos, etc. The data was collected for all users who indicated in their profiles that they study at or graduated from one of the schools in Saint Petersburg. Users who have no friends from the same school were removed. All public posts of these users up to 01.01.2017 were included in the corpus. Tokens were defined as sequences of Cyrillic letters. fastText was run with default parameters. Note that it is possible to collect a much larger data set from VK, but we do not have the computational resources for such a task. We would welcome such efforts from others. Even so, the size of our corpus is comparable to that of Wikipedia: the number of tokens is 1,458,030,644 and the vocabulary size is 1,323,807. This model outperformed a Wikipedia-trained model on some prediction tasks using VK data.</p>
<p><strong>posts_XXXX.json</strong></p>
<p>Posts of users who were born in the year XXXX. The data was collected as described above. Format: each key is a user ID and each value is a list of pairs, where the first element is the post ID and the second is the time when the post was made (Unix timestamp). The content of the posts can be downloaded via the <a href="https://vk.com/dev/wall.getById" rel="nofollow">VK API</a>.</p>
<p><strong>children_words.csv</strong></p>
<p>A list of words that might be used to mention children. The list was created from the 3 * 1000 words closest to 'сын' (son), 'дочь' (daughter) and 'ребенок' (child) according to the model, by removing obviously irrelevant words. Note that typos, spelling mistakes, and non-dictionary words are intentionally preserved.
However, we find that their frequency is too low to have any observable effect on the results.</p>
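<p>To make the posts_XXXX.json layout concrete, here is a minimal Python sketch of how such a file might be read. The sample data, the function names <code>load_posts</code> and <code>wall_ids</code>, and the choice to convert timestamps to UTC are illustrative assumptions, not part of the data set. The sketch also assumes that each pair is stored as a two-element JSON array and that post identifiers are passed to wall.getById as <code>owner_id_post_id</code> strings.</p>

```python
import json
from datetime import datetime, timezone

# Hypothetical sample in the described layout:
# {user_id: [[post_id, unix_timestamp], ...]}
SAMPLE = '{"1001": [[15, 1451606400], [27, 1483142400]], "1002": [[3, 1420070400]]}'


def load_posts(text):
    """Parse the JSON text into {user_id: [(post_id, utc_datetime), ...]}."""
    raw = json.loads(text)
    return {
        int(user_id): [
            (int(post_id), datetime.fromtimestamp(ts, tz=timezone.utc))
            for post_id, ts in pairs
        ]
        for user_id, pairs in raw.items()
    }


def wall_ids(posts):
    """Build "<owner_id>_<post_id>" strings (assumed input form for wall.getById)."""
    return [f"{uid}_{pid}" for uid, pairs in posts.items() for pid, _ in pairs]


posts = load_posts(SAMPLE)
ids = wall_ids(posts)
print(ids)                            # ['1001_15', '1001_27', '1002_3']
print(posts[1001][0][1].isoformat())  # 2016-01-01T00:00:00+00:00
```

<p>For a real file, the sample string would be replaced by the contents of one of the posts_XXXX.json files, and the resulting identifiers could then be sent to the VK API in batches.</p>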
