Each embeddings set is contained in two files: a `_terms.txt` file with one term per line, and a `.dat.bz2` file of [bzip2][1]-compressed, space-delimited numeric vectors, where each line corresponds to the same line in the terms file. [set_info.csv][2] associates set names with file ids. Indices for each set are also mapped to a single term list in [term_map.csv][3] (`57 MB`): row names are terms, column names are set names, and values are row numbers in the given set (starting at 1, with 0 indicating the term is not present).

The spaces are trimmed to a set of 476,154 lowercase terms, and their values are standardized: trimmed space / max(trimmed space) * 100, rounded to 6 digits. The only characters appearing in any terms file are `'-./abcdefghijklmnopqrstuvwxyz`, and in any data file are `- .0123456789`.

The source of each set of embeddings is linked on its wiki page. Those sources provide the original spaces used here, and sometimes offer variants with different corpora, models, and/or parameters. Beyond those, some other repositories have a wider range of embeddings sets, such as the [Nordic Language Processing Laboratory][4] and [gensim][5] repositories.

The [lingmatch][6] R package offers tools to download and use these files. The [load_embeddings.r][7] and [load_embeddings.py][8] files also offer standalone functions to partially or fully load spaces into R or Python environments.
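
The file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the code in `load_embeddings.py`: the function names `load_space`, `standardize`, and `demo`, and the toy file names, are hypothetical and chosen for illustration. It assumes line *i* of the terms file pairs with line *i* of the decompressed data file, and that the maximum in the standardization is taken over the whole space, which is one reading of the formula given here.

```python
import bz2
import os
import tempfile


def load_space(terms_path, data_path, wanted=None):
    """Return {term: vector} for a space, optionally limited to `wanted` terms.

    Assumes line i of the terms file pairs with line i of the data file.
    """
    with open(terms_path, encoding="utf-8") as f:
        terms = [line.rstrip("\n") for line in f]
    wanted_set = None if wanted is None else set(wanted)
    vectors = {}
    # Stream the compressed file line by line so a partial load
    # never holds the full space in memory.
    with bz2.open(data_path, "rt", encoding="ascii") as f:
        for term, line in zip(terms, f):
            if wanted_set is None or term in wanted_set:
                vectors[term] = [float(x) for x in line.split()]
    return vectors


def standardize(vectors):
    """Rescale as space / max(space) * 100, rounded to 6 digits.

    Assumption: the maximum is taken over every value in the space.
    """
    peak = max(v for row in vectors.values() for v in row)
    return {
        term: [round(v / peak * 100, 6) for v in row]
        for term, row in vectors.items()
    }


def demo():
    """Write a toy three-term space to a temp directory and partially load it."""
    with tempfile.TemporaryDirectory() as tmp:
        terms_path = os.path.join(tmp, "toy_terms.txt")
        data_path = os.path.join(tmp, "toy.dat.bz2")
        with open(terms_path, "w", encoding="utf-8") as f:
            f.write("apple\nbanana\ncherry\n")
        with bz2.open(data_path, "wt", encoding="ascii") as f:
            f.write("1 2 3\n-4 5.5 6\n7 8 100\n")
        return load_space(terms_path, data_path, wanted=["apple", "cherry"])
```

Calling `load_space` with paths to a downloaded `_terms.txt` and `.dat.bz2` pair would work the same way as `demo()`. The files here are already standardized, so `standardize` is only relevant when preparing a space from one of the original sources.
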
## Available Spaces

- **[100k][9]**: [100k_terms.txt][10] (`99,188` terms), [100k.dat.bz2][11] (`100 MB`)
- **[100k_cbow][12]**: [100k_cbow_terms.txt][13] (`99,186` terms), [100k_cbow.dat.bz2][14] (`111 MB`)
- **[100k_lsa][15]**: [100k_lsa_terms.txt][16] (`99,188` terms), [100k_lsa.dat.bz2][17] (`85 MB`)
- **[blogs][18]**: [blogs_terms.txt][19] (`27,277` terms), [blogs.dat.bz2][20] (`25 MB`)
- **[CoNLL17_skipgram][21]**: [CoNLL17_skipgram_terms.txt][22] (`459,818` terms), [CoNLL17_skipgram.dat.bz2][23] (`168 MB`)
- **[dcp_cbow][24]**: [dcp_cbow_terms.txt][25] (`215,142` terms), [dcp_cbow.dat.bz2][26] (`309 MB`)
- **[dcp_svd][27]**: [dcp_svd_terms.txt][28] (`215,142` terms), [dcp_svd.dat.bz2][29] (`387 MB`)
- **[eigenwords][30]**: [eigenwords_terms.txt][31] (`159,908` terms), [eigenwords.dat.bz2][32] (`125 MB`)
- **[eigenwords_tscca][33]**: [eigenwords_tscca_terms.txt][34] (`55,879` terms), [eigenwords_tscca.dat.bz2][35] (`43 MB`)
- **[facebook_crawl][36]**: [facebook_crawl_terms.txt][37] (`81,653` terms), [facebook_crawl.dat.bz2][38] (`62 MB`)
- **[facebook_wiki][39]**: [facebook_wiki_terms.txt][40] (`97,605` terms), [facebook_wiki.dat.bz2][41] (`69 MB`)
- **[glove_crawl][42]**: [glove_crawl_terms.txt][43] (`467,538` terms), [glove_crawl.dat.bz2][44] (`482 MB`)
- **[glove_twitter][45]**: [glove_twitter_terms.txt][46] (`203,944` terms), [glove_twitter.dat.bz2][47] (`126 MB`)
- **[glove_wiki][48]**: [glove_wiki_terms.txt][49] (`269,610` terms), [glove_wiki.dat.bz2][50] (`294 MB`)
- **[google][51]**: [google_terms.txt][52] (`345,655` terms), [google.dat.bz2][53] (`183 MB`)
- **[hpca][54]**: [hpca_terms.txt][55] (`162,137` terms), [hpca.dat.bz2][56] (`115 MB`)
- **[lexvec_nnegpmi][57]**: [lexvec_nnegpmi_terms.txt][58] (`238,284` terms), [lexvec_nnegpmi.dat.bz2][59] (`268 MB`)
- **[lexvec_wiki][60]**: [lexvec_wiki_terms.txt][61] (`277,140` terms), [lexvec_wiki.dat.bz2][62] (`311 MB`)
- **[nasari][63]**: [nasari_terms.txt][64] (`151,776` terms), [nasari.dat.bz2][65] (`169 MB`)
- **[paragram_sl999][66]**: [paragram_sl999_terms.txt][67] (`456,295` terms), [paragram_sl999.dat.bz2][68] (`502 MB`)
- **[paragram_ws353][69]**: [paragram_ws353_terms.txt][70] (`456,295` terms), [paragram_ws353.dat.bz2][71] (`501 MB`)
- **[senna][72]**: [senna_terms.txt][73] (`116,539` terms), [senna.dat.bz2][74] (`21 MB`)
- **[sensembed][75]**: [sensembed_terms.txt][76] (`409,078` terms), [sensembed.dat.bz2][77] (`575 MB`)
- **[snaut][78]**: [snaut_terms.txt][79] (`373,596` terms), [snaut.dat.bz2][80] (`396 MB`)
- **[tasa][81]**: [tasa_terms.txt][82] (`79,371` terms), [tasa.dat.bz2][83] (`70 MB`)
- **[turian_hlbl][84]**: [turian_hlbl_terms.txt][85] (`122,772` terms), [turian_hlbl.dat.bz2][86] (`41 MB`)
- **[ukwac_cbow][87]**: [ukwac_cbow_terms.txt][88] (`171,187` terms), [ukwac_cbow.dat.bz2][89] (`254 MB`)

[1]: https://www.sourceware.org/bzip2
[2]: https://osf.io/9yzca
[3]: https://osf.io/yauzm/?action=download
[4]: http://vectors.nlpl.eu/repository
[5]: https://github.com/RaRe-Technologies/gensim-data
[6]: https://github.com/miserman/lingmatch
[7]: https://osf.io/358cw
[8]: https://osf.io/7ra5f
[9]: https://osf.io/489he/wiki/100k
[10]: https://osf.io/download/s95uj
[11]: https://osf.io/download/c8759
[12]: https://osf.io/489he/wiki/100k_cbow
[13]: https://osf.io/download/5v76j
[14]: https://osf.io/download/mnjks
[15]: https://osf.io/489he/wiki/100k_lsa
[16]: https://osf.io/download/5ngjk
[17]: https://osf.io/download/djrhy
[18]: https://osf.io/489he/wiki/blogs
[19]: https://osf.io/download/6w2t5
[20]: https://osf.io/download/qj5ez
[21]: https://osf.io/489he/wiki/CoNLL17_skipgram
[22]: https://osf.io/download/r3762
[23]: https://osf.io/download/n5kmv
[24]: https://osf.io/489he/wiki/dcp_cbow
[25]: https://osf.io/download/hju8m
[26]: https://osf.io/download/y4ezj
[27]: https://osf.io/489he/wiki/dcp_svd
[28]: https://osf.io/download/abewt
[29]: https://osf.io/download/aydne
[30]: https://osf.io/489he/wiki/eigenwords
[31]: https://osf.io/download/d5kw9
[32]: https://osf.io/download/v7mwh
[33]: https://osf.io/489he/wiki/eigenwords_tscca
[34]: https://osf.io/download/zb28h
[35]: https://osf.io/download/7nvzy
[36]: https://osf.io/489he/wiki/facebook_crawl
[37]: https://osf.io/download/fvkny
[38]: https://osf.io/download/f2vbk
[39]: https://osf.io/489he/wiki/facebook_wiki
[40]: https://osf.io/download/2pf64
[41]: https://osf.io/download/auy9n
[42]: https://osf.io/489he/wiki/glove_crawl
[43]: https://osf.io/download/kx29t
[44]: https://osf.io/download/dzcqt
[45]: https://osf.io/489he/wiki/glove_twitter
[46]: https://osf.io/download/u8enr
[47]: https://osf.io/download/g3e78
[48]: https://osf.io/489he/wiki/glove_wiki
[49]: https://osf.io/download/g6yms
[50]: https://osf.io/download/38hfz
[51]: https://osf.io/489he/wiki/google
[52]: https://osf.io/download/ntzfq
[53]: https://osf.io/download/7q8y4
[54]: https://osf.io/489he/wiki/hpca
[55]: https://osf.io/download/3ctxj
[56]: https://osf.io/download/x45af
[57]: https://osf.io/489he/wiki/lexvec_nnegpmi
[58]: https://osf.io/download/mf5aq
[59]: https://osf.io/download/kqy5b
[60]: https://osf.io/489he/wiki/lexvec_wiki
[61]: https://osf.io/download/gvmb9
[62]: https://osf.io/download/ruhdw
[63]: https://osf.io/489he/wiki/nasari
[64]: https://osf.io/download/mepkc
[65]: https://osf.io/download/27wzx
[66]: https://osf.io/489he/wiki/paragram_sl999
[67]: https://osf.io/download/9tvn2
[68]: https://osf.io/download/yhq7j
[69]: https://osf.io/489he/wiki/paragram_ws353
[70]: https://osf.io/download/eub4n
[71]: https://osf.io/download/fmhvu
[72]: https://osf.io/489he/wiki/senna
[73]: https://osf.io/download/vqs8z
[74]: https://osf.io/download/bz2a5
[75]: https://osf.io/489he/wiki/sensembed
[76]: https://osf.io/download/muzey
[77]: https://osf.io/download/5w4ns
[78]: https://osf.io/489he/wiki/snaut
[79]: https://osf.io/download/x7a4j
[80]: https://osf.io/download/2yaht
[81]: https://osf.io/489he/wiki/tasa
[82]: https://osf.io/download/jqdh2
[83]: https://osf.io/download/v974s
[84]: https://osf.io/489he/wiki/turian_hlbl
[85]: https://osf.io/download/93veh
[86]: https://osf.io/download/f6rek
[87]: https://osf.io/489he/wiki/ukwac_cbow
[88]: https://osf.io/download/7b2ua
[89]: https://osf.io/download/g7hn8