Each embeddings set is contained in two files: a _terms.txt file with one term per line, and a .dat.bz2 file whose lines correspond to those terms and hold bzip2-compressed, space-delimited numeric vectors. set_info.csv associates set names with file IDs.
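The pairing between the two files can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the code in load_embeddings.py; the function name `read_space` and the choice to return a dictionary are assumptions made for this example.

```python
import bz2


def read_space(terms_path, data_path):
    """Pair each term with the vector on the matching line of the data file."""
    # One term per line; blank lines (e.g. a trailing newline) are skipped.
    with open(terms_path, encoding="utf-8") as f:
        terms = [line.rstrip("\n") for line in f if line.strip()]

    # bz2.open in text mode decompresses the file while it is being read.
    vectors = []
    with bz2.open(data_path, "rt", encoding="ascii") as f:
        for line in f:
            if line.strip():
                vectors.append([float(v) for v in line.split()])

    if len(terms) != len(vectors):
        raise ValueError("terms and data files have different line counts")
    return dict(zip(terms, vectors))
```

For example, `read_space("blogs_terms.txt", "blogs.dat.bz2")` would load the smallest listed space in full, assuming both files have already been downloaded to the working directory.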
Indices for each set are also mapped to a single term list in term_map.csv (57 MB): row names are terms, column names are set names, and values are row numbers in the given set (starting at 1, with 0 indicating the term is not present).
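A lookup against term_map.csv might look like the sketch below. It assumes the file has a header row whose first cell labels the term column (possibly empty) followed by the set names, and that each later row starts with the term; that layout is an assumption, so check the header of the actual file. The function name `rows_for_set` is hypothetical.

```python
import csv
import io


def rows_for_set(term_map_file, set_name, wanted_terms):
    """Return {term: 0-based row index} for wanted terms present in a set."""
    wanted = set(wanted_terms)
    reader = csv.reader(term_map_file)
    header = next(reader)
    col = header.index(set_name)  # assumes the header has a cell for the term column
    found = {}
    for row in reader:
        term = row[0]
        if term in wanted:
            row_number = int(row[col])
            if row_number > 0:  # 0 means the term is not in this set
                found[term] = row_number - 1  # stored values start at 1
    return found


# Tiny in-memory stand-in for term_map.csv (made-up values).
demo = io.StringIO(",100k,blogs\napple,1,2\nzebra,2,0\n")
print(rows_for_set(demo, "blogs", ["apple", "zebra"]))  # {'apple': 1}
```

Converting to 0-based indices here is a convenience for Python; the values stored in the file remain 1-based.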
The spaces are trimmed to a set of 476,154 lowercase terms, and their values are standardized: trimmed space / max(trimmed space) * 100, rounded to 6 digits. The only characters appearing in any terms file are the apostrophe, hyphen, period, slash, and the letters a to z ('-./abcdefghijklmnopqrstuvwxyz), and in any data file are the hyphen, space, period, and the digits 0 to 9 (- .0123456789).
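The standardization and the character constraints can be illustrated with a short Python sketch. It assumes that max(trimmed space) means the single largest value anywhere in the space (not a per-row or absolute maximum), which the description does not spell out; the names `standardize`, `TERM_CHARS`, and `DATA_CHARS` are made up for this example.

```python
import re

# Character sets stated above, written as regular expressions.
TERM_CHARS = re.compile(r"[a-z'./-]+")
DATA_CHARS = re.compile(r"[0-9 .-]+")


def standardize(space):
    """Scale values to space / max(space) * 100, rounded to 6 digits."""
    # Assumption: the maximum is taken over every value in the space.
    peak = max(max(vector) for vector in space)
    return [[round(v / peak * 100, 6) for v in vector] for vector in space]


scaled = standardize([[1.0, -2.0], [4.0, 0.5]])
print(scaled)  # [[25.0, -50.0], [100.0, 12.5]]

# Lines from the files can be checked against the stated character sets.
print(bool(TERM_CHARS.fullmatch("don't")))        # True
print(bool(DATA_CHARS.fullmatch("25 -50 12.5")))  # True
```

Under this reading the largest value in every space is 100, and a data line never needs scientific notation, which is consistent with the listed data characters.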
The source of each embeddings set is linked from its wiki page. Those sources provide the original spaces used here, and sometimes have variants with different corpora, models, and/or parameters. Beyond those, other repositories offer a wider range of embeddings sets, such as the Nordic Language Processing Laboratory and gensim repositories.
The lingmatch R package offers tools to download and use these files. The load_embeddings.r and load_embeddings.py files also offer standalone functions to partially or fully load spaces into R or Python environments.
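To give a sense of what partially loading a space involves, here is a minimal Python sketch that streams a .dat.bz2 file and keeps only selected rows. It is not the implementation in load_embeddings.py or lingmatch; the function name `read_rows` and its 0-based indexing are assumptions for illustration. Row numbers could come from the matching _terms.txt file or from term_map.csv (after subtracting 1).

```python
import bz2


def read_rows(data_path, row_indices):
    """Stream a .dat.bz2 file and keep only the requested 0-based rows."""
    wanted = set(row_indices)
    if not wanted:
        return {}
    last = max(wanted)
    rows = {}
    with bz2.open(data_path, "rt", encoding="ascii") as f:
        for i, line in enumerate(f):
            if i in wanted:
                rows[i] = [float(v) for v in line.split()]
            if i >= last:
                break  # nothing requested lies beyond this line
    return rows
```

Because bzip2 streams cannot be read from an arbitrary offset, the file is still decompressed from the start up to the last requested row; the saving is in memory and parsing, not in decompression.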
Available Spaces:
- 100k: 100k_terms.txt (99,188 terms), 100k.dat.bz2 (100 MB)
- 100k_cbow: 100k_cbow_terms.txt (99,186 terms), 100k_cbow.dat.bz2 (111 MB)
- 100k_lsa: 100k_lsa_terms.txt (99,188 terms), 100k_lsa.dat.bz2 (85 MB)
- blogs: blogs_terms.txt (27,277 terms), blogs.dat.bz2 (25 MB)
- CoNLL17_skipgram: CoNLL17_skipgram_terms.txt (459,818 terms), CoNLL17_skipgram.dat.bz2 (168 MB)
- dcp_cbow: dcp_cbow_terms.txt (215,142 terms), dcp_cbow.dat.bz2 (309 MB)
- dcp_svd: dcp_svd_terms.txt (215,142 terms), dcp_svd.dat.bz2 (387 MB)
- eigenwords: eigenwords_terms.txt (159,908 terms), eigenwords.dat.bz2 (125 MB)
- eigenwords_tscca: eigenwords_tscca_terms.txt (55,879 terms), eigenwords_tscca.dat.bz2 (43 MB)
- facebook_crawl: facebook_crawl_terms.txt (81,653 terms), facebook_crawl.dat.bz2 (62 MB)
- facebook_wiki: facebook_wiki_terms.txt (97,605 terms), facebook_wiki.dat.bz2 (69 MB)
- glove_crawl: glove_crawl_terms.txt (467,538 terms), glove_crawl.dat.bz2 (482 MB)
- glove_twitter: glove_twitter_terms.txt (203,944 terms), glove_twitter.dat.bz2 (126 MB)
- glove_wiki: glove_wiki_terms.txt (269,610 terms), glove_wiki.dat.bz2 (294 MB)
- google: google_terms.txt (345,655 terms), google.dat.bz2 (183 MB)
- hpca: hpca_terms.txt (162,137 terms), hpca.dat.bz2 (115 MB)
- lexvec_nnegpmi: lexvec_nnegpmi_terms.txt (238,284 terms), lexvec_nnegpmi.dat.bz2 (268 MB)
- lexvec_wiki: lexvec_wiki_terms.txt (277,140 terms), lexvec_wiki.dat.bz2 (311 MB)
- nasari: nasari_terms.txt (151,776 terms), nasari.dat.bz2 (169 MB)
- paragram_sl999: paragram_sl999_terms.txt (456,295 terms), paragram_sl999.dat.bz2 (502 MB)
- paragram_ws353: paragram_ws353_terms.txt (456,295 terms), paragram_ws353.dat.bz2 (501 MB)
- senna: senna_terms.txt (116,539 terms), senna.dat.bz2 (21 MB)
- sensembed: sensembed_terms.txt (409,078 terms), sensembed.dat.bz2 (575 MB)
- snaut: snaut_terms.txt (373,596 terms), snaut.dat.bz2 (396 MB)
- tasa: tasa_terms.txt (79,371 terms), tasa.dat.bz2 (70 MB)
- turian_hlbl: turian_hlbl_terms.txt (122,772 terms), turian_hlbl.dat.bz2 (41 MB)
- ukwac_cbow: ukwac_cbow_terms.txt (171,187 terms), ukwac_cbow.dat.bz2 (254 MB)