Each embeddings set is contained in two files: a _terms.txt file with one term per line, and a .dat.bz2 file whose lines correspond to those terms. Each .dat.bz2 file is a bzip2-compressed text file of space-delimited numeric vectors, one vector per line. set_info.csv associates set names with file ids.
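The two-file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the project's own loader: the function names (read_terms, read_vectors, load_subset) are hypothetical, and it assumes the files are UTF-8 text and that line N of the terms file pairs with line N of the data file.

```python
import bz2


def read_terms(terms_path):
    """Return the terms of a _terms.txt file, one per line, in file order."""
    with open(terms_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_vectors(dat_path, wanted_rows=None):
    """Yield (row_number, vector) pairs from a .dat.bz2 file.

    Row numbers start at 1. If wanted_rows is given, only those rows
    are parsed; the file is still decompressed sequentially.
    """
    with bz2.open(dat_path, "rt", encoding="utf-8") as f:
        for row, line in enumerate(f, start=1):
            if wanted_rows is None or row in wanted_rows:
                # Values are space-delimited numbers.
                yield row, [float(value) for value in line.split()]


def load_subset(terms_path, dat_path, wanted_terms):
    """Return {term: vector} for the wanted terms that appear in the set."""
    wanted = set(wanted_terms)
    # Map 1-based row numbers to terms (assumed line-by-line correspondence).
    rows = {
        row: term
        for row, term in enumerate(read_terms(terms_path), start=1)
        if term in wanted
    }
    return {rows[row]: vec for row, vec in read_vectors(dat_path, rows.keys())}
```

Reading line by line keeps memory use low when only a few terms are needed, at the cost of decompressing the whole file once.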

Indices for each set are also mapped to a single term list in term_map.csv (57 MB) -- row names are terms, column names are set names, and values are row numbers in the given set (starting at 1, with 0 indicating the term is not present).
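As a rough illustration of how term_map.csv could be used to locate terms across sets, the sketch below scans the file once and collects the non-zero row numbers for a few terms. The function name find_rows is hypothetical, and the layout is an assumption: the first column is taken to hold the terms (the row names) and the header row is taken to have a leading cell for that column followed by the set names.

```python
import csv


def find_rows(term_map_path, terms):
    """Return {term: {set_name: row_number}} for the requested terms.

    Sets in which a term is absent (value 0) are left out.
    Assumes the first column holds the terms and the header's
    remaining cells are set names.
    """
    wanted = set(terms)
    found = {}
    with open(term_map_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        set_names = header[1:]
        for record in reader:
            term = record[0]
            if term in wanted:
                found[term] = {
                    name: int(value)
                    for name, value in zip(set_names, record[1:])
                    if int(value) != 0  # 0 means the term is not in that set
                }
    return found
```

The returned row numbers start at 1, so they can be compared directly with line numbers in the corresponding _terms.txt and .dat.bz2 files.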

The spaces are trimmed to a set of 476,154 lowercase terms, and their values are standardized: trimmed space / max(trimmed space) * 100, rounded to 6 digits. The only characters appearing in any terms file are '-./abcdefghijklmnopqrstuvwxyz (apostrophe, hyphen, period, slash, and lowercase letters), and the only characters appearing in any data file are - .0123456789 (hyphen, space, period, and digits).
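The standardization formula above can be worked through on a toy matrix. This is a sketch under one stated assumption: max(trimmed space) is read as the single largest value in the whole matrix, which the text does not spell out. The function name standardize is hypothetical.

```python
def standardize(space):
    """Scale a matrix (list of rows) so its largest value becomes 100.

    Assumes the maximum is taken over every value in the space,
    not per row or per column.
    """
    largest = max(max(row) for row in space)
    return [[round(value / largest * 100, 6) for value in row] for row in space]


toy_space = [[1.0, -2.0], [4.0, 0.5]]
print(standardize(toy_space))  # [[25.0, -50.0], [100.0, 12.5]]
```

Under this reading, no standardized value exceeds 100, while negative values are unbounded below, which is consistent with the hyphen appearing in the data files' character set.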

The source of each set of embeddings is linked from its wiki page. Those sources provide the original spaces used here, and sometimes offer variants built with different corpora, models, and/or parameters. Beyond those sources, other repositories offer a wider range of embeddings sets, such as the Nordic Language Processing Laboratory and gensim repositories.

The lingmatch R package offers tools to download and use these files. The load_embeddings.r and load_embeddings.py files also offer standalone functions to partially or fully load spaces into R or Python environments.

Available Spaces:
